Change Redshift Table Distribution: A Guide with Examples

Selecting an appropriate distribution style is a crucial factor in enhancing the efficiency of the Redshift Database. The distribution style you choose can have a direct impact on your query’s performance. Using an incorrect distribution style for a table may adversely affect your Redshift cluster due to increased workload. This article will demonstrate how to change the distribution style of a Redshift table, along with an illustrative example.

Page Content

Introduction
Redshift Table Distribution Styles
Why to Change Redshift Table Distribution style?
How to Change Redshift Table Distribution Style
Changing Redshift Table Distribution Style: An Example
- Change Distribution of Large Table in Redshift
Redshift Table Distribution Performance
Conclusion

Introduction

Amazon Redshift is a popular cloud-based data warehousing solution that enables businesses to store and analyze large amounts of data. Redshift table distribution is a crucial aspect of optimizing query performance in Redshift. In this post, we will explore Redshift table distribution styles, how to change them, and their impact on query performance.

Redshift Table Distribution Styles

There are three types of distribution style available in Amazon Redshift:

EVEN distribution: The even distribution style distributes rows across all nodes evenly.
KEY distribution: The key distribution style distributes rows based on a defined distribution key, which ensures that rows with the same distribution key value are stored on the same node
ALL distribution: while the all distribution style replicates data across all nodes.

You can choose any of the style based on your data, size and performance considerations. You should follow Redshift table distribution best practices while selecting distribution style.

Why to Change Redshift Table Distribution style?

The EVEN distribution style is the most frequently used when you are uncertain about which distribution style to select. If you have generated a table using an EVEN distribution or a column with many duplicated records, it is essential to change the distribution style promptly; otherwise, it can lead to reduced Redshift performance. To modify the distribution style of a Redshift table, a redistribution process is required.

How to Change Redshift Table Distribution Style

To modify the distribution style of a table in Redshift, you can create a new table with the preferred distribution style and copy the data from the original table to the new one. However, it is important to adhere to best practices to ensure a smooth transition, such as minimizing downtime, testing changes in a non-production environment, and guaranteeing data consistency.

Here’s a step-by-step guide on how to change the distribution style of a table in Redshift:

Create a new table with the desired distribution style using the CREATE TABLE AS statement
Copy data from the old table to the new table using the INSERT INTO statement
Rename the old table to a different name using the ALTER TABLE statement
Rename the new table to the original table name using the ALTER TABLE statement

Changing Redshift Table Distribution Style: An Example

As previously stated, modifying the distribution of a Redshift table cannot be done by using the “alter table column” command. Instead, you must redistribute the table data by employing the “CREATE TABLE AS” command with the desired distribution style.

To illustrate, let’s examine the following example of using the CTAS command to redistribute table data in Redshift.

create table event_new_dist
distkey (eventid)
sortkey (eventid, dateid)
as
select eventid, venueid, dateid, eventname
from event;

It’s important to note that this approach is suitable for tables with relatively small amounts of data. If you are dealing with a very large table, please refer to the section below.

Redshift Distribution Styles

Change Distribution of Large Table in Redshift

Redistributing a large table can be challenging task due to space constraints. If you are dealing with a large table, you can use the following steps to redistribute it:

Break up Your Table Data

In the first step, create smaller tables by filtering the data, using date ranges if necessary. For instance, let’s take a look at the following example, which involves creating a temporary table and inserting a few records.

create table event_temp
distkey (eventid)
as
select eventid, venueid, dateid, eventname
from event
limit 0;

insert into event_temp
select eventid, venueid, dateid, eventname
from event
where event_time between '2019-01-01' 
and '2019-06-30';

Next, delete the data from the base table using the same date range, since the same data has been transferred to the temporary table with the new distribution style.

delete from event
where event_time between '2019-01-01' 
and '2019-06-30';

Repeat this process until the events table is empty.

Drop Very Large Table

Since all the data from the base table has been moved to the temporary table, you can drop the base table. For instance, you can drop the “events” table by executing the appropriate command.

drop table events;

Rename Temp Table to Original Table

Lastly, rename the temporary table to the original table name to complete the redistribution process.

As an example, let’s consider the following scenario:

alter table event_temp rename to events;

Now, you have events table with different distribution style.

Redshift Table Distribution Performance

Table distribution has a significant impact on query performance in Redshift. A poorly chosen distribution style can lead to uneven data distribution and query performance issues. To optimize table distribution for performance, it’s essential to analyze query performance using the Amazon Redshift EXPLAIN plan, choose the right distribution key, and consider using sort keys and column encoding.

Conclusion

Redshift table distribution is a crucial aspect of optimizing query performance in Redshift. By choosing the right distribution style, changing the distribution style, and optimizing table distribution for performance, businesses can achieve faster query performance, reduce query time, and improve overall data warehousing performance.

Hope this helps 🙂