Redshift Overview

Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools.

Read more about Amazon Redshift.

Etlworks Integrator fully supports Amazon Redshift.

Configuring Redshift

Configuring the Firewall

Typically, TCP port 5439 is used to access Amazon Redshift. If Amazon Redshift and Etlworks Integrator are running on different networks, it is important to enable inbound traffic on port 5439.

Learn how to open an Amazon Redshift port for inbound traffic.

Configuring the IAM role

If you are going to load data into Redshift from an Amazon S3 bucket, you will need to grant Redshift access to S3 via an IAM role.

For access to Amazon S3 with the ability to use COPY and UNLOAD, attach either the AmazonS3ReadOnlyAccess or the AmazonS3FullAccess managed policy to the role.
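Note that COPY only reads from S3, so AmazonS3ReadOnlyAccess is sufficient for it; UNLOAD writes result files to S3 and therefore needs write access (AmazonS3FullAccess or an equivalent custom policy). As an illustration, a minimal UNLOAD using the same bucket and role as the COPY example below might look like this (by default, the output is written as one or more pipe-delimited files with the given prefix):

unload ('select * from product')
to 's3://etlworks-redshift-test/product_export_'
iam_role 'arn:aws:iam::530240367559:role/redshifts3';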

Working with Redshift

Connecting to Redshift

To work with Redshift, you will first need to create a connection to Amazon Redshift.

Loading data into Redshift

The recommended way of loading data into Amazon Redshift is to use the COPY command. Read more about the COPY command options.

Below are step-by-step instructions for configuring a flow which uses the COPY command to load data into Amazon Redshift.

Step 1. Create an Amazon Redshift connection.

Step 2. Create a flow which reads data from a source and creates CSV files in an Amazon S3 bucket. Use a suitable flow type where the destination is a file. Most likely, the source will be a relational database. Use CSV as the destination format and Amazon S3 as the destination connection.

[Screenshot: database to S3 flow]

Important: for large datasets, it is recommended that you split the source dataset into multiple CSV files; Redshift loads multiple files in parallel, which speeds up the COPY considerably. Use the Partition By transformation and specify the maximum number of lines allowed in a single file in the Partition By field.

Step 3. Create an SQL flow which executes the COPY command using the connection created in step 1. Below is an example of the SQL that could be used:

copy product
from 's3://etlworks-redshift-test/product.csv'
iam_role 'arn:aws:iam::530240367559:role/redshifts3'
csv quote as '"' ignoreheader as 1;

Line-by-line explanation:

copy product - copies data into the product table.

from 's3://etlworks-redshift-test/product.csv' - from the file product.csv in the S3 bucket etlworks-redshift-test.

iam_role 'arn:aws:iam::530240367559:role/redshifts3' - using the configured IAM role.

csv quote as '"' ignoreheader as 1 - in the CSV format, treating " as the quote character and skipping the first (header) line.
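If the COPY command fails, Redshift records the reason in the stl_load_errors system table. A quick diagnostic query (a sketch; adjust the limit as needed):

select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;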

Note: you can specify just a prefix for the file name, for example, from 's3://etlworks-redshift-test/product', and Redshift will copy all the files whose names start with product.
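This pairs well with the Partition By transformation from step 2: if the flow produced product_1.csv, product_2.csv, and so on (hypothetical names), a single COPY with the prefix loads all of them:

copy product
from 's3://etlworks-redshift-test/product'
iam_role 'arn:aws:iam::530240367559:role/redshifts3'
csv quote as '"' ignoreheader as 1;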

[Screenshot: Redshift COPY flow]

Step 4. Combine the flows created in step 2 and step 3 into a nested flow.

MERGE not available when loading data into Redshift

Amazon Redshift does not support MERGE (UPSERT).

The recommended technique is to load data into a staging table using the COPY command, then apply set-based DELETE and INSERT SQL statements. For example:

truncate product_staging;

copy product_staging
from 's3://etlworks-redshift-test/product.csv'
iam_role 'arn:aws:iam::530240367559:role/redshifts3'
csv quote as '"' ignoreheader as 1;

delete from product
where sku in (select sku from product_staging);

insert into product
select * from product_staging;
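Because the DELETE and INSERT above run as separate statements, a concurrent query could briefly see the target table with the old rows removed but the new rows not yet inserted. A simple way to avoid this (a sketch based on the same example) is to wrap both statements in a single transaction:

begin;

delete from product
where sku in (select sku from product_staging);

insert into product
select * from product_staging;

end;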

Important: Redshift does not use traditional indexes; make sure that the table you are loading into has appropriate sort and distribution keys (for example, a sort key on the column used in the DELETE condition).

Change Replication

When loading data into a data warehouse such as Amazon Redshift, it is usually necessary to load only the changes in the source (typically, an OLTP database) since the last update.

After the initial full load is complete, set up a flow with change replication, which only extracts modified data from the source.
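A common way to implement this is a high-water-mark condition in the source query. The sketch below assumes a hypothetical orders table with an updated_at column; the actual parameter mechanism depends on how the flow is configured:

-- extract only the rows modified since the last successful run;
-- :last_extract_timestamp is a placeholder supplied by the flow
select *
from orders
where updated_at > :last_extract_timestamp;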

Amazon Redshift is a relational database

Amazon Redshift is a columnar relational database loosely based on PostgreSQL. Therefore, you can use the same techniques you would normally use to work with relational databases in Etlworks Integrator. It is, however, important to understand that inserting data into Redshift row by row can be painfully slow, so it is recommended that you use set operations, such as INSERT INTO ... SELECT, or the COPY command to load data into Amazon Redshift.
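To make the difference concrete, here is a sketch (the product table and its columns are assumptions). The first statement pays per-statement overhead for every row; the second moves all rows in one set-based operation:

-- slow: one statement per row
insert into product (sku, name) values ('sku-1', 'Widget');

-- fast: a single set-based statement
insert into product (sku, name)
select sku, name
from product_staging;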

The following data types are available in Amazon Redshift.

Important: Redshift accepts the TEXT data type, which in PostgreSQL is an unlimited-length type (comparable to a CLOB). However, Redshift automatically converts TEXT columns to VARCHAR(256), so longer values will not fit. There are other similar restrictions.
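For example (a sketch with hypothetical table names), declare the VARCHAR width explicitly instead of relying on TEXT when you need to store longer values:

-- declared as text, but stored as varchar(256)
create table notes_short (body text);

-- an explicit width keeps longer values (up to 65535 bytes)
create table notes_long (body varchar(65535));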

Exploring data in Redshift

Use Integrator Explorer to browse, view, and query data in Amazon Redshift.
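For example, a quick way to see which tables exist and how large they are is to query the svv_table_info system view (a sketch; visibility of system views depends on your privileges):

-- size is reported in 1 MB blocks
select "schema", "table", tbl_rows, size
from svv_table_info
order by size desc;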