Amazon Redshift is a fast, fully-managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools.
Read more about Amazon Redshift.
Etlworks Integrator fully supports Amazon Redshift.
Configuring the Firewall
Typically, TCP port
5439 is used to access Amazon Redshift. If Amazon Redshift and Etlworks Integrator are running on different networks, it is important to enable inbound traffic on port 5439.
Learn how to open an Amazon Redshift port for inbound traffic.
Configuring the IAM role
If you are going to be loading data into Redshift from an Amazon S3 bucket, you will need to grant access to S3 for the Redshift user.
For access to Amazon S3, choose the
AmazonS3ReadOnlyAccess role, which is sufficient for the COPY command (it only reads from S3). To use
UNLOAD, which writes files to S3, choose a role with write permissions, such as AmazonS3FullAccess.
Working with Redshift
Connecting to Redshift
To work with Redshift, you will need to create a connection to Amazon Redshift.
Loading data into Redshift
The recommended way of loading data into Amazon Redshift is to use the
COPY command. Read more about the COPY command options.
Below are step-by-step instructions for configuring a flow which uses the COPY command to load data into Amazon Redshift.
Step 1. Create an Amazon Redshift connection.
Step 2. Create a flow which reads data from a source and creates CSV files in an Amazon S3 bucket. Use a suitable flow type where the destination is a file. Most likely, the source is going to be a relational database. Use the CSV format as the destination format, and Amazon S3 as the destination connection.
Important: for large datasets, it is recommended that you split the source dataset into multiple CSV files. Use a Partition By transformation and specify the maximum number of lines allowed in a single file in the Partition By field.
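The partitioning step can be illustrated outside Integrator. The sketch below (plain Python; the part-file names are hypothetical, Integrator chooses its own naming scheme) splits a row set into multiple CSV part files, each capped at a maximum number of lines, similar to what the Partition By transformation produces:

```python
import csv
import io

def partition_rows(rows, header, max_lines):
    """Split rows into CSV 'part files' with at most max_lines data rows each.

    Returns a list of (name, csv_text) pairs.
    """
    parts = []
    for i in range(0, len(rows), max_lines):
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)                 # each part keeps the header row
        writer.writerows(rows[i:i + max_lines])
        parts.append(("product_part%d.csv" % (len(parts) + 1), buf.getvalue()))
    return parts

rows = [["SKU-%03d" % n, n * 10] for n in range(250)]
parts = partition_rows(rows, ["sku", "price"], max_lines=100)
print([name for name, _ in parts])  # 3 part files: 100 + 100 + 50 rows
```

Each part file would then be uploaded to the S3 bucket so Redshift can load them in parallel.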
Step 3. Create an SQL flow which executes the
COPY command using the connection created in step 1. Below is an example of the SQL that could be used:
copy product
from 's3://etlworks-redshift-test/product.csv'
iam_role 'arn:aws:iam::530240367559:role/redshifts3'
csv quote as '"' IGNOREHEADER as 1;
copy product - means to copy data to the product table.
from 's3://etlworks-redshift-test/product.csv' - from the file product.csv in the S3 bucket etlworks-redshift-test.
iam_role 'arn:aws:iam::530240367559:role/redshifts3' - using the IAM role configured earlier.
csv quote as '"' IGNOREHEADER as 1 - in the CSV format, skipping the header row and using " as the quote character.
Note: you can specify just the prefix for the file name, for example,
from 's3://etlworks-redshift-test/product', and Redshift will copy all the files whose names start with that prefix.
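As a sketch, the COPY statement above can be assembled from the clauses just discussed. The helper below is illustrative only (it is not part of Integrator) and simply formats the table name, S3 location, and IAM role into the statement:

```python
def build_copy_sql(table, s3_url, iam_role_arn):
    """Assemble a Redshift COPY statement for a CSV file in S3.

    The quote character and IGNOREHEADER clause match the example above.
    """
    return (
        "copy {table} from '{s3_url}' "
        "iam_role '{arn}' "
        "csv quote as '\"' IGNOREHEADER as 1;"
    ).format(table=table, s3_url=s3_url, arn=iam_role_arn)

sql = build_copy_sql(
    "product",
    "s3://etlworks-redshift-test/product.csv",
    "arn:aws:iam::530240367559:role/redshifts3",
)
print(sql)
```

Passing an S3 prefix instead of a full file name produces a statement that loads every matching file, as described in the note above.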
Step 4. Combine the flows created in step 2 and step 3 into a nested flow.
MERGE not available when loading data into Redshift
Amazon Redshift does not support the MERGE command.
The recommended technique is to load data into a staging table using the
COPY command, and then apply the changes to the target table with a set of
DELETE and INSERT SQL commands. For example:
TRUNCATE product_staging;
copy product_staging from 's3://etlworks-redshift-test/product.csv' iam_role 'arn:aws:iam::530240367559:role/redshifts3' csv quote as '"' IGNOREHEADER as 1;
delete from product where sku in (select sku from product_staging);
insert into product select * from product_staging;
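The same staging pattern can be generated programmatically. The sketch below (plain Python, a hypothetical helper) emits the four statements in order, parameterized by the target table, staging table, and key column:

```python
def staging_merge_sql(target, staging, key, s3_url, iam_role_arn):
    """Return the statement sequence for a COPY-into-staging 'merge':
    truncate the staging table, load it, delete matching keys from the
    target, then insert everything from staging."""
    return [
        "TRUNCATE {s};".format(s=staging),
        "copy {s} from '{u}' iam_role '{a}' "
        "csv quote as '\"' IGNOREHEADER as 1;".format(s=staging, u=s3_url, a=iam_role_arn),
        "delete from {t} where {k} in (select {k} from {s});".format(t=target, k=key, s=staging),
        "insert into {t} select * from {s};".format(t=target, s=staging),
    ]

stmts = staging_merge_sql(
    "product", "product_staging", "sku",
    "s3://etlworks-redshift-test/product.csv",
    "arn:aws:iam::530240367559:role/redshifts3",
)
for s in stmts:
    print(s)
```

Running the statements in this order keeps the operation idempotent: re-running the flow reloads the staging table and re-applies the same delete/insert pair.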
Important: make sure that the target table is defined with appropriate sort and distribution keys; Amazon Redshift does not support traditional indexes.
When loading data into a data warehouse, such as Amazon Redshift, it is usually only required to load the changes in the source (typically, an OLTP database) since the last update.
After the initial full load is complete, set up a flow with change replication, which only extracts modified data from the source.
Amazon Redshift is a relational database
Amazon Redshift is a columnar relational database loosely based on PostgreSQL, so you can use the same techniques you would normally use to work with relational databases in Etlworks Integrator. It is, however, important to understand that inserting data into Redshift row by row can be painfully slow. So, it is recommended that you use set-based operations, such as
insert into ... select from, or the
COPY command, to load data into Amazon Redshift.
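The difference between row-by-row and set-based loading can be sketched as follows. The helper below (illustrative only; repr() quoting is fine for a sketch but not for production SQL escaping) collapses many single-row INSERTs into one multi-row statement, which Redshift executes far more efficiently:

```python
def multi_row_insert(table, columns, rows):
    """Build one multi-row INSERT instead of one statement per row.

    Note: repr() is used for value quoting as a sketch only; real code
    should use parameterized statements or the COPY command instead.
    """
    values = ", ".join(
        "(" + ", ".join(repr(v) for v in row) + ")" for row in rows
    )
    return "insert into {t} ({c}) values {v};".format(
        t=table, c=", ".join(columns), v=values)

sql = multi_row_insert("product", ["sku", "price"],
                       [("SKU-001", 10), ("SKU-002", 20)])
print(sql)
```

Even multi-row INSERTs have limits; for anything beyond modest volumes, stage the data in S3 and use COPY as described above.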
The following data types are available in Amazon Redshift.
Important: Redshift accepts the
TEXT data type, which in PostgreSQL stores strings of virtually unlimited length (similar to a CLOB). Redshift, however, automatically converts it to
VARCHAR(256). There are other similar restrictions.