Working with files

In Integrator, it is possible to read and write data in numerous data exchange formats, such as CSV, JSON, XML, and Excel. The actual files may reside in file storage, cloud storage, key-value storage, a NoSQL database, or on an email server.

Connecting to data storage

Connecting to file storage

To connect to file storage, create a Server Storage, FTP, FTPS, SFTP or WebDAV connection.

Connecting to cloud storage

To connect to cloud storage, create an Amazon S3, Google Cloud Storage, Microsoft Azure Storage or other cloud storage connection.

Connecting to a Redis database

To connect to a Redis database, create a Redis connection.

Connecting to a MongoDB database

To connect to a MongoDB database, create a MongoDB connection.

Connecting to an email server

To connect to an email server, create either an inbound or an outbound email connection.

Testing a connection

Click the Test Connection button on the connection screen to test the connection. Test Connection is only available for cloud and file storage connections. In addition to actually connecting to the storage, Integrator attempts to read file names using the configured folder and file name. Wildcard file names are supported.
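For example, assuming a connection configured with the folder /usr/local/data and the wildcard file name *.csv, Test Connection will both connect to the storage and attempt to list the CSV files in that folder.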

Test file connection

Formats

When working with files, you must describe the data exchange format. Integrator supports the most commonly used formats, such as CSV, JSON, XML, and Excel.

Browsing files and viewing data

Use the Explorer add-on to test your file connection, browse files, and view data.

Step 1. Create a file connection.

Step 2. Choose the format.

Step 3. Open Explorer, select the connection created in Step 1 and link it to the format chosen in Step 2. Read more about how to link a format to a connection in Explorer.

Step 4. Explore the metadata (files and fields), view data in a grid, query data and discover dependencies using SQL.

A file is the source

Integrator can extract data from files, transform it, and load it into any destination, for example a database, files, or a third-party system via its API.

Read how to configure a transformation when the source is a file.

Source file name

In a FROM-TO transformation, Etlworks Integrator calculates the source (FROM) file name based on two factors: whether wildcard file names are allowed (the default), and whether Override Wildcard File Name set for Connection is enabled (disabled by default).

Both can be set when configuring a connection:

Override wildcard file name

Here are some examples in which wildcard file names are allowed:

Case 1. Override Wildcard File Name set for Connection is disabled.

Case 2. Override Wildcard File Name set for Connection is enabled.

Working with wildcard file names

Integrator can process files which match a wildcard file name, for example *.csv.

Read how to work with wildcard file names.

Processing all files in a folder

In Integrator, you can process all files in a folder using one of the techniques below:

Source file name as a variable

It is fairly common to store the source file name that the flow needs to access in a variable, for example in a database. Since Etlworks Integrator can work with wildcard file names, such as *.csv, it is not always possible to hardcode the source file name; storing it in a variable overcomes this ambiguity.

Accessing source file name using global variables

Each time Etlworks Integrator reads a source file, it stores its name as a global variable. The global variable is only available to that particular flow and to all the flows nested within it.

Let's assume that the source name is *.CSV TO PIPE.JSON 1.

Tip: source name = transformation name. You can always check the source name by clicking the View Flow XML button for that particular flow:

view flow XML

and then checking the actual source name, which is the value of the name tag nested under the source tag:

source name

Let's also assume that the actual source file name is /user/local/temp/pipe.csv.

So, for the source name *.CSV TO PIPE.JSON 1 and for the actual source file name /user/local/temp/pipe.csv, three global variables will be created:

SOURCE NAME.FULL.FILE.NAME.TO.READ: *.CSV TO PIPE.JSON 1.FULL.FILE.NAME.TO.READ = /user/local/temp/pipe.csv

SOURCE NAME.FILE.NAME.TO.READ: *.CSV TO PIPE.JSON 1.FILE.NAME.TO.READ = pipe.csv

SOURCE NAME.BASE.FILE.NAME.TO.READ: *.CSV TO PIPE.JSON 1.BASE.FILE.NAME.TO.READ = pipe

You can then reference any of these {variables} in the connection parameters (such as the URL), or in the FROM/TO fields for the transformation.
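For example, assuming the source name above, the TO field of a subsequent transformation could be set to:

TO = {*.CSV TO PIPE.JSON 1.BASE.FILE.NAME.TO.READ}.json

which would resolve to pipe.json.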

variable file name to read

You can also access any of the global variables above using the following JavaScript:

importPackage(com.toolsverse.config);

// replace 'variable name above' with one of the variable names listed above,
// for example '*.CSV TO PIPE.JSON 1.FILE.NAME.TO.READ'
var sourceFileName = SystemConfig.instance().getProperties().get('variable name above');

Accessing the source file name using JavaScript

The simplest way to access the source file name is to call dataSet.getFileNameToRead(), assuming the dataSet object is available in that particular JavaScript code. For example:

importPackage(com.toolsverse.config);
// FilenameUtils comes from the org.apache.commons.io package
importPackage(org.apache.commons.io);

// store the short file name under a key of your choice
SystemConfig.instance().getProperties().put('my unique key for the source file name', 
   FilenameUtils.getName(dataSet.getFileNameToRead()));

and then access it somewhere else:

importPackage(com.toolsverse.config);

var sourceFileName = SystemConfig.instance().getProperties().get('my unique key for the source file name');

Extracting data from complex nested documents

Read more about how to extract data from a nested data source object using SQL and, specifically, writing queries for nested data objects.

Also, read how to extract data from a nested data source object using built-in transformations. Note: there are multiple options, so please read all the paragraphs under this link.

Working with large source files

If a source file is large, especially if it is an XML file, it makes sense to split it into multiple smaller files, and then process them one-by-one in a loop.

Note: while splitting the files, Etlworks Integrator consumes only a minimal amount of RAM and CPU cycles, which allows it to work with very large files without blocking other tasks.

Why would you want to split large files?

  1. If the source file is so large it cannot be parsed due to memory limitations.
  2. If there is a source-to-destination transformation, for example "read from XML file, create JSON and hit HTTP endpoint", and the payload created as a part of the transformation is too large for the destination.
  3. If the source file contains information you don't really need, except for a few repeating segments.

Splitting files

Step 1. Start creating the flow that splits files by opening the Flow Builder window, clicking the + button, and typing in split. Select a flow based on the file type (XML, JSON, CSV, etc.).

Split flows

Step 2. Continue by selecting a source connection (FROM) and a file name (or a wildcard file name).

Step 3. Continue by selecting a destination connection (TO).

Split source

Step 4. Continue by specifying parameters. Click the MAPPING button and select the Parameters tab.

Note: the parameters are different for each file type.

Splitting XML files

Unlike other file types, parsing and processing XML files requires a lot of memory. Splitting the original file into multiple smaller files that can easily fit into memory is typically a good idea, and for particularly large files a requirement.

Important: the XML files must have repeating segments in order to be "splittable".

This is an example of XML with repeating segments, in this case called <Entries>:

<?xml version="1.0" ?>
<Transaction>
     <feed>1</feed>
     <source>http</source> 
     <!-- repeating segments -->  
     <Entries>
          <File>N71738</File>
          <HBLNo>3470312436</HBLNo>
          <BOENumber>5064594</BOENumber>
     </Entries>
     <Entries>
          <File>N71738</File>
          <HBLNo>3470312436</HBLNo>
          <BOENumber>11111</BOENumber>
     </Entries>
</Transaction>

An example of XML without repeating segments:

<?xml version="1.0" ?>
<Transaction>
     <feed>1</feed>
     <source>http</source> 
     <File>N71738</File>
     <Data>
          <File>N71738</File>
     </Data>
     <HBLNo>3470312436</HBLNo>
     <BOENumber>5064594</BOENumber>
</Transaction>

To split XML files, select the Split XML files flow type. Follow Steps 1 - 4 above.

The following parameters are available. The annotated example below shows the difference between the common segment and the repeating segments:

<?xml version="1.0" ?>
<Transaction>
     <!-- common segment --> 
     <feed>1</feed>
     <source>http</source> 
     <!-- repeating segments -->  
     <Entries>
          <File>N71738</File>
          <HBLNo>3470312436</HBLNo>
          <BOENumber>5064594</BOENumber>
     </Entries>
     <Entries>
          <File>N71738</File>
          <HBLNo>3470312436</HBLNo>
          <BOENumber>11111</BOENumber>
     </Entries>
</Transaction>

Splitting JSON files

To split JSON files, select the Split JSON files flow type. Follow the same steps as when creating a flow to split XML files.

Splitting CSV files

Splitting a CSV file is as simple as breaking it into chunks using the end-of-line character as a separator. The flow also automatically detects the row with column names (the header) and replicates it in each chunk.
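For illustration, here is a minimal JavaScript sketch of the technique, an assumption for clarity rather than Integrator's actual implementation:

// split CSV lines into chunks, replicating the header row in each chunk;
// splitCsvLines(['id,name', '1,a', '2,b', '3,c'], 2) returns
// ['id,name\n1,a\n2,b', 'id,name\n3,c']
function splitCsvLines(lines, chunkSize) {
    var header = lines[0];
    var chunks = [];
    for (var i = 1; i < lines.length; i += chunkSize) {
        chunks.push([header].concat(lines.slice(i, i + chunkSize)).join('\n'));
    }
    return chunks;
}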

To split CSV files, select the Split CSV files flow type. Follow Steps 1 - 4.

The following parameters are available:

A file is the destination

Integrator can extract data from files, API endpoints, social websites, databases, etc., transform it, and create files.

Read how to configure a transformation when the destination is a file.

Destination file name

The destination file name is calculated by concatenating the directory specified when configuring the destination connection, and the file name which was entered into the TO field for the transformation.

Folder (specified in the connection) + TO (specified in the transformation) = the destination file name. For example: Inprogress + test.csv = Inprogress/test.csv.

The destination file name is the same as the source file name

When processing files by using a wildcard file name, it is possible to set the destination file name to be the same as the source file name.

Just enter * into the TO field for the transformation.

The destination file name is equal to the source file name

Calculate the destination file name

In a source-to-destination transformation, if any part of the TO field contains macros of this type: replace(regex, replacement), the system will perform a regular expression replacement of the enclosed part of the string, and will concatenate it with the left and right parts of the string.

Example 1

FROM = *_processing.xml

TO = replace(_processing, _done)

If these are the source files in the source folder: order1_processing.xml, order2_processing.xml, and order3_processing.xml,

Then the following destination files will be created: order1_done.xml, order2_done.xml, and order3_done.xml.

Example 2

FROM = *.xml

TO = replace(' ', _)

If these are the source files in the source folder: order1 processing.xml, order2 processing.xml, and order3 processing.xml,

Then the following destination files will be created: order1_processing.xml, order2_processing.xml, and order3_processing.xml.
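To make the macro semantics concrete, here is a minimal JavaScript sketch, assuming replace(regex, replacement) behaves like a standard regular expression replacement applied to the source file name:

// Example 1: replace(_processing, _done)
var name1 = 'order1_processing.xml'.replace(/_processing/, '_done'); // order1_done.xml

// Example 2: replace(' ', _), a space replaced with an underscore
var name2 = 'order1 processing.xml'.replace(/ /, '_'); // order1_processing.xml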

Create files with unique names

Read how to always create files with unique names.

Destination file name as a variable

In many cases, when you are building a pipeline of source-to-destination transformations, a file created in the previous step will be used as a source in the next step. When you are using wildcard file names or automatically-generated unique file names, it is impossible to hardcode the name of the file for the next transformation.

Each time Etlworks Integrator creates a destination file, it stores its name as a global variable. The global variable is only available to that particular flow, and all the flows nested within it.

Let's assume that the destination name is *.CSV TO PIPE.JSON 1.

Tip: the destination name = the transformation name. You can always check the destination name by clicking the View Flow XML button for that particular flow:

view flow XML

and then checking the actual destination name, which is the value of the name tag nested under the destination tag:

the destination name

Let's also assume that the actual destination file name (which is automatically generated by appending the UUID to the source file name) is /user/local/temp/pipe_4467ea6995904730a10a6c7ddbe25470.json.

So, for the destination name *.CSV TO PIPE.JSON 1 and for the actual destination file name /user/local/temp/pipe_4467ea6995904730a10a6c7ddbe25470.json, three global variables will be created:

DESTINATION NAME.FULL.FILE.NAME.TO.WRITE: *.CSV TO PIPE.JSON 1.FULL.FILE.NAME.TO.WRITE = /user/local/temp/pipe_4467ea6995904730a10a6c7ddbe25470.json

DESTINATION NAME.FILE.NAME.TO.WRITE: *.CSV TO PIPE.JSON 1.FILE.NAME.TO.WRITE = pipe_4467ea6995904730a10a6c7ddbe25470.json

DESTINATION NAME.BASE.FILE.NAME.TO.WRITE: *.CSV TO PIPE.JSON 1.BASE.FILE.NAME.TO.WRITE = pipe_4467ea6995904730a10a6c7ddbe25470

You can then reference any of these {variables} in the connection parameters (such as the URL), or in the FROM/TO fields for the transformation.
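For example, the FROM field of the next transformation in the pipeline could be set to:

FROM = {*.CSV TO PIPE.JSON 1.FILE.NAME.TO.WRITE}

so that it reads the file created in this step.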

variable file name to write

You can also access any of the global variables above using the following JavaScript:

importPackage(com.toolsverse.config);

// replace 'variable name above' with one of the variable names listed above,
// for example '*.CSV TO PIPE.JSON 1.FILE.NAME.TO.WRITE'
var destinationFileName = SystemConfig.instance().getProperties().get('variable name above');

Creating complex nested documents

Read how to create a complex nested document where the destination is a file.

If the document is in XML format, read how to use an XSL transformation to create complex XML code.

File management

The destination file name in file operations

When copying, moving, renaming files, or applying an XSL transformation, Etlworks Integrator uses the following rules to calculate the new (destination) file name.


Empty TO field

If nothing is entered in the TO field, the new file name will be the same as the source file name.


Regular expression replacement

If any part of the TO field contains macros of this type: replace(regex, replacement), the system will perform a regular expression replacement of the enclosed part of the string, and will concatenate it with the left and right parts of the string.

Example 1

FROM = *_processing.xml

TO = replace(_processing, _done)

If these are the source files in the source folder: order1_processing.xml, order2_processing.xml, and order3_processing.xml,

then the following destination files will be created: order1_done.xml, order2_done.xml, and order3_done.xml.

Example 2

FROM = *_processing.xml

TO = done/replace(_processing, _done)

If these are the source files in the source folder: order1_processing.xml, order2_processing.xml, and order3_processing.xml,

then the following destination files will be created: done/order1_done.xml, done/order2_done.xml, and done/order3_done.xml.

Example 3

FROM = *.xml

TO = replace(' ', _)

If these are the source files in the source folder: order1 processing.xml, order2 processing.xml, and order3 processing.xml,

then the following destination files will be created: order1_processing.xml, order2_processing.xml, and order3_processing.xml.


A non-wildcard file name is entered into the TO field

If a non-wildcard file name is entered into the TO field, for example order.csv, the new file name will be the TO value, with a _UUID suffix appended when needed to keep the names unique (see UUID suffix below).

For example:

FROM = *.csv

TO = order.csv

If these are the source files in the source folder: or1.csv, or2.csv, and or3.csv,

then the following destination files will be created: order.csv, order_uuid.csv, and order_another_uuid.csv.


A wildcard file name is entered into the TO field

If a wildcard file name is entered into the TO field, for example *_processing.csv, the new file name will depend on the position of the wildcard character * in the TO.

Case 1. The file name entered into the TO starts with a wildcard character:

FROM = or*.csv

TO = *_processing.csv

If these are the source files in the source folder: or1.csv, or2.csv, and or3.csv,

then the following destination files will be created: or1_processing.csv, or2_processing.csv, and or3_processing.csv.

Case 2. The file name entered into the TO ends with a wildcard character:

FROM = or*.csv

TO = order*.csv

If these are the source files in the source folder: or1.csv, or2.csv, and or3.csv,

then the following destination files will be created: order1.csv, order2.csv, and order3.csv.
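Here is a minimal JavaScript sketch of these two cases, an illustration of the rules above rather than Integrator's actual code:

// sourceBase is the base name of the source file (or1); matchedFragment is the
// part of the source name matched by the FROM wildcard (1)
function applyWildcardTo(sourceBase, matchedFragment, toPattern) {
    // Case 1: the TO pattern starts with '*': substitute the source base name
    if (toPattern.charAt(0) === '*') {
        return toPattern.replace('*', sourceBase);
    }
    // Case 2: the '*' is elsewhere: substitute the fragment matched by FROM
    return toPattern.replace('*', matchedFragment);
}

// applyWildcardTo('or1', '1', '*_processing.csv') returns 'or1_processing.csv'
// applyWildcardTo('or1', '1', 'order*.csv') returns 'order1.csv'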


UUID suffix

If the system has already returned a given file name, it will append _UUID to the end of the next file name to keep it unique, for example:

order_758ced62472211e8842f0ed5f89f718b.csv
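For illustration, a UUID-based suffix could be produced with the following JavaScript; this is only a sketch, since Integrator appends the suffix automatically:

// generate a UUID without dashes and append it to the base file name
var uuid = java.util.UUID.randomUUID().toString().replace('-', '');
var uniqueName = 'order_' + uuid + '.csv';
// for example, order_758ced62472211e8842f0ed5f89f718b.csv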

Copy files

Use this flow type to copy files between file storage locations. You can also save an unmodified response from an HTTP call to a file, or send the contents of a source file as a payload to an HTTP endpoint.

Step 1. Start creating the flow in the Flow Builder by clicking the + button and typing in copy files.

Step 2. Select the source connection. It can be any one of the following connection types:

Step 3. In the FROM field, enter the file name, or a wildcard file name, for the file(s) to copy.

Step 4. Select the destination connection. It can be any one of the following connection types:

Step 5. Optionally, enter a new file name or a new wildcard file name into the TO field. Read how the system calculates a destination (TO) file name.

Step 6. Click the MAPPING button, select the Parameters tab and modify the following parameters, if necessary:

Copy only new files

If you want to copy files that haven't been copied yet, enable the option Do not copy files which have already been copied under MAPPING->Parameters.

copy new files only

Move files

Using this flow type, you can move files between file storage locations. When a file is moved from the source to a destination, it is first copied to the destination, then deleted from the source.

Step 1. Start creating a flow in the Flow Builder by clicking the + button and typing in move files.

Step 2. Select the source connection. It can be any one of the following connection types:

Step 3. In the FROM field, enter a file name, or a wildcard file name of the file(s) to be moved.

Step 4. Select a destination connection. It can be any one of the following connection types:

Step 5. Optionally, enter a new file name or a new wildcard file name in the TO field. Read how the system calculates a destination (TO) file name.

Step 6. Click the MAPPING button, select the Parameters tab and modify the following parameters, if necessary:

Move only new files

If you want to move files that haven't been moved yet, enable the option Do not move files which have already been moved under MAPPING->Parameters.

move new files only

Rename files

Using this flow type, you can rename files.

Step 1. Start creating a flow in the Flow Builder by clicking the + button and typing in rename files.

Step 2. Select the source connection. It can be any one of the following connection types:

Step 3. In the FROM field, enter a file name, or a wildcard file name of the file(s) to rename.

Step 4. Select a destination connection. It can be any one of the following connection types:

Step 5. Enter a new file name into the TO field. Read how the system calculates a new file name.

Step 6. Click the MAPPING button, select the Parameters tab and modify the following parameters, if necessary:

Delete files

Using this flow type, you can delete files.

Step 1. Start creating a flow in the Flow Builder by clicking the + button and typing in delete files.

Step 2. Select the source connection. It can be any one of the following connection types:

Step 3. In the FROM field, enter a file name, or a wildcard file name of the file(s) to delete.

Step 4. Click the MAPPING button, select the Parameters tab and modify the following parameters, if necessary:

Create Folder

Using this flow type, you can create a folder in any supported file storage.

Step 1. Start creating a flow in the Flow Builder by clicking the + button and typing in create folder.

Step 2. Select the source connection. It can be one of the following connection types:

Step 3. In the FROM field, enter the name of the folder to create. Use folder1/folder2/foldern to create multiple nested folders.

Important: the folder will be created under the base Directory specified in the connection.

base folder

Step 4. Click the MAPPING button, select the Parameters tab and modify the following parameters, if necessary:

Check number of files in a folder

Using this flow type, you can compare the number of files in a folder with a given constant. If the number of files is not equal to what was expected, Integrator will generate an exception. The most common use case is to generate an exception if the actual number of matching files is not 0.
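For example, assuming the goal is to fail the flow when error files (a hypothetical *.err wildcard) are present: enter *.err into the FROM field and set the expected number of files to 0 under MAPPING->Parameters; the flow will then generate an exception whenever at least one matching file exists.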

Step 1. Start creating a flow in the Flow Builder by clicking the + button and typing in check number of files.

Step 2. Select the source connection. It can be one of the following connection types:

Step 3. In the FROM field, enter the wildcard file name to look for.

Step 4. Click the MAPPING button, select the Parameters tab and modify the following parameters, if necessary:

Zip files

Using this flow type, you can create an archived file in the zip or gzip format.

Step 1. Start creating a flow in the Flow Builder by clicking the + button and typing in zip files.

Step 2. Select the source connection. It should have the following connection type:

Step 3. In the FROM field, enter a file name or a wildcard file name of the file(s) to zip.

Step 4. Select the destination connection. It should be the following connection type:

Step 5. Enter the name of the archived file into the TO field.

Important: by default, Integrator creates archived files in the zip format. Enter a file name with the extension gzip to create an archive in the gzip format.
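For example, entering archive.gzip into the TO field creates a gzip archive, while a name such as archive.zip creates a regular zip archive.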

Step 6. Click the MAPPING button, select the Parameters tab and modify the following parameters, if necessary:

Unzip files

Using this flow type, you can unzip an archived file which is in the zip format.

Step 1. Start creating a flow in the Flow Builder by clicking the + button and typing in unzip files.

Step 2. Select the source connection. It should have the following connection type:

Step 3. In the FROM field, enter the archived file name. Wildcard file names are supported.

Step 4. Select the destination connection. It should have the following connection type:

Step 5. Click the MAPPING button, select the Parameters tab and modify the following parameters, if necessary:

All file management operations

Read how to use a common file management flow to chain multiple file operations together.

Other use cases

Inserting images from files into a database table

Read how to insert image files into a database table.

Transform XML files using XSLT

Step 1. Create connections for the source and destination documents. They can be the same connection. Only Server Storage connections are allowed for this flow type.

Step 2. Start creating the flow by typing xsl into the Select Flow Type dialog box:

XSLT flow

Step 3. Continue by adding one or more transformations: selecting the source and destination connections created in Step 1, entering a file name or a wildcard file name for the source XMLs into the FROM field, and (optionally) entering the destination file name into the TO field. If the TO field is empty, the destination file(s) will be created with the same name as the source file(s). Read how the system will calculate the destination file name.

Note: to transform multiple files in separate threads, set the value of Maximum Simultaneous Operations to greater than 1.

Important: this flow collects the files matching the FROM field from the source connection and creates files in the destination connection. If the source and destination connections point to the same folder and the TO field is empty, the original source file(s) will be overwritten. It is highly recommended that you use a different folder for the destination connection. You can use the Create folder(s) flow type to create folders at runtime.

batch XSLT flow

Merge CSV files

Read how to merge CSV files in Integrator.

Merge XML files

Read how to merge XML files in Integrator.

Loading files into Snowflake

Use the flow type "Load files into Snowflake".

load files into snowflake

Read how to configure a Snowflake flow.

Loading files into Amazon MWS Feeds API

Read how to create and load files into the Amazon MWS Feeds API.

Expose file as an API endpoint

In Integrator, it is possible to build a REST API that can be used to access a controlled portion of the data from the other side of the firewall. For example, you can create a flow which reads data from files in your corporate network and exposes it as an Integrator API endpoint.

Read more about building a REST API for accessing files.

Tips and tricks

Tips and tricks when working with files

Read about tips and tricks when working with files.

Tips and tricks when working with CSV files

Read about tips and tricks when working with CSV files.

Tips and tricks when working with Excel files

Read about tips and tricks when working with Excel files.

Tips and tricks when working with XML files

Read about tips and tricks when working with XML files.

Tips and tricks when working with Amazon S3

Amazon S3 tips and tricks.

Tips and tricks when working with Google Cloud Storage

Google Cloud Storage tips and tricks.

Common tips and tricks when working with data

Read about common tips and tricks when working with data.

Performance tips

Read about performance tips when working with files.

Also, read about common performance tips.