Data Replication Methods

Data replication takes data from your source databases — Oracle, MySQL, Microsoft SQL Server, PostgreSQL, MongoDB, etc. — and copies it into your destination data warehouse. After you have identified the data you want to bring in, you need to determine the best way to replicate the data so it meets your business needs.

Choosing the right method

The method you choose impacts the end state of your data. Fortunately, there are data replication methods built to integrate with today’s data warehouses and suit a variety of use cases. At Etlworks, we believe in providing users with as much flexibility as possible. 

The table below contains a high-level look at each of Etlworks's Replication Methods and how they compare to each other in terms of pros and cons.

Method Pros: Cons:

High Watermark

Uses a designated field, typically a TIMESTAMP, to track changes in the source.
  • fast
  • no extra moving parts
  • works for all data sources, including all databases, files, and APIs

Change Data Capture (CDC)

Uses database redo [transaction] log to track changes in the source.
  • fast
  • no polling from database tables, uses database redo log instead
  • supports deletes
  • supports [almost] real-time replication
  • currently supports only Postgres, MySQL, SQL Server, and Oracle
  • some older versions of the databases above do not support CDC
  • requires extra setup in the source database

Database Triggers

Uses table(s) updated by the database triggers to track changes in the source.
  • works for any source database which has triggers
  • no extra requirements for the specific version of the database or extra field in each table
  • requires adding triggers to all database tables
  • triggers can negatively impact performance

Real-time CDC with Kafka

Polls CDC events from the Kafka topic(s) to track changes in the source.
  • fast
  • no polling from database tables
  • supports deletes
  • supports real-time replication
  • complicated setup (requires Kafka, Zookeeper, Kafka Connect, and Debezium)
  • currently supports only Postgres, MySQL, SQL Server, Oracle, and MongoDB
  • some older versions of the databases above do not support CDC
  • requires extra setup in the source database

Full refresh

Always polls the entire dataset from the source.
  • the simplest to setup
  • can be quite fast for the relatively small datasets (<100K records)
  • works for all data sources
  • not recommended for large datasets

Ready to start using Etlworks Integrator?

Request a demo or start a 14-day free trial.

 

SIGN UP FREE Request Demo