Working with complex nested datasets

The input and output of different web services and APIs can range from simple and flat to highly nested and complex.

An example of a nested data object:

[
  {
    "firstName":"Duke",
    "lastName":"Java",
    "test":[
      {
        "who":"duke",
        "why":"because"
      }
    ],
    "age":18,
    "streetAddress":"100 Internet Dr",
    "city":"JavaTown",
    "state":"JA",
    "postalCode":"12345",
    "phoneNumbers":[
      {
        "type":"Mobile",
        "phoneNumber":"111-111-1111"
      },
      {
        "type":"Home",
        "phoneNumber":"222-222-2222"
      }
    ]
  }
]

In relational databases and flat files (such as CSV and Excel), the data is represented as datasets, with columns and rows:

EMPNO   FIRSTNME    MIDINIT LASTNAME    WORKDEPT    PHONENO
10      CHRISTINE   I       HAAS        A00         3978
20      MICHAEL     L       THOMPSON    B01         3476
30      SALLY       A       KWAN        C01         4738
50      JOHN        B       GEYER       E01         6789
60      IRVING      F       STERN       D11         6423
70      EVA         D       PULASKI     D21         7831

What if the PHONENO column is an array of phone numbers, as shown below?

"PHONENO":[
  {
    "type":"Mobile",
    "phoneNumber":"111-111-1111"
  },
  {
    "type":"Home",
    "phoneNumber":"222-222-2222"
  }
]

As you can see below, this column can be easily represented as a dataset:

type    phoneNumber 
Mobile  111-111-1111
Home    222-222-2222

Now, if we say that the PHONENO column has an ARRAY data type and each PHONENO value in a row is a dataset, we have found a way to encode nested data objects.

This is exactly how nested data objects are represented in Integrator.

There is no limit to "nestedness" (or number of levels) in Integrator, so each column anywhere in the nested object can itself be a nested object.

In Integrator, the inner datasets (PHONENO in the example above) are also called dimensions. The outer dataset (the root nested object) is also called a driving dataset.
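
For illustration, the employee row above could be held in memory roughly like this (a simplified JavaScript sketch, not Integrator's internal representation; the phone values are borrowed from the JSON example):

// A simplified sketch (not Integrator's internal representation):
// the outer record belongs to the driving dataset, and the PHONENO
// column holds an inner dataset (a dimension) with its own rows.
// The values below are illustrative only.
const employee = {
  EMPNO: 10,
  FIRSTNME: "CHRISTINE",
  LASTNAME: "HAAS",
  WORKDEPT: "A00",
  // PHONENO has an ARRAY data type; its value is itself a dataset
  PHONENO: [
    { type: "Mobile", phoneNumber: "111-111-1111" },
    { type: "Home", phoneNumber: "222-222-2222" }
  ]
};

console.log(employee.PHONENO.length); // 2 rows in the inner dataset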

Integrator can read and write nested data objects encoded in any of the supported data exchange formats, such as JSON, XML, and HL7.

Important: by default, Integrator attempts to use streaming, which is another way of saying "please keep as little data in memory as possible." Generally speaking, streaming is faster and has a very low memory footprint. Unfortunately, streaming is not always possible, especially when dealing with complex nested data objects, which are typically returned by web services. In this case, it is always a good idea to disable streaming.

  1. Create a transformation.
  2. Select the Parameters tab.
  3. Uncheck Stream Data.

Do not stream

Let's go through the different usage examples:

The Source (FROM) is a nested data object

As already mentioned, the output from most APIs and web services is nested. For example, when an ORDER is returned from an API, it contains information about the order itself, as well as all the items included in the order.

When Integrator reads a nested data object, it parses it and represents the fields in a tree-like structure:

Nested data object

As you can see in the figure above, the dimensions (inner datasets) are test and phoneNumbers.

The initial mapping was created by clicking the Create Mapping button.
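
To get a feel for this tree-like structure, here is a minimal JavaScript sketch (not Integrator code) that lists the field paths of the users object; array-valued fields such as test and phoneNumbers show up as the dimensions:

// A minimal sketch (not Integrator code): walk a nested object and
// print every field as a path, treating arrays of objects as
// dimensions (inner datasets). Assumes the JSON shown at the top of
// this article is saved as users.json.
const users = JSON.parse(require("fs").readFileSync("users.json", "utf8"));

function listPaths(value, prefix) {
  if (Array.isArray(value)) {
    console.log(prefix + " [dimension]");
    if (value.length > 0) listPaths(value[0], prefix);
  } else if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      listPaths(child, prefix ? prefix + "." + key : key);
    }
  } else {
    console.log(prefix);
  }
}

listPaths(users[0], "");
// firstName, lastName, test [dimension], test.who, test.why, age, ...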

Using SQL with a nested source data object

Read more about how to extract data from a nested source data object using SQL.

The Source (FROM) is a nested XML object

When the Source (FROM) is a nested XML object, you can use XSLT to transform it into any desired format.

Read more about using XSLT to transform XML.

The Source (FROM) is a nested data object and the Destination (TO) is a flat file

This is the most common use case. You take nested output from a web service or API and transform it into a flat data structure, then store it in a relational database or flat file.

In Integrator, the "nested to flat" mapping can be accomplished entirely using drag-and-drop, with no coding involved.

Step 1. As usual, create a transformation by defining the source (FROM) and the destination (TO) where the source is a nested data object (for example users.json) and the destination is a flat file (for example users.csv).

Step 2. Define a mapping by clicking the Create Mapping button. Integrator reads the nested source data object and creates the mapping automatically. Important: if the source object is not available when you create a flow, you can "design" it yourself by using a combination of these buttons.

Step 3. Map fields in the source (FROM) to the fields in the destination (TO). It is recommended that you exclude the parent field but keep the child fields. In the example above, if the parent fields test and phoneNumbers are excluded, the following flat dataset would be created:

firstName   lastName    who    why    age   streetAddress   city   state   postalCode   phoneNumber_type   phoneNumber

You might ask, what will happen if there are multiple phone numbers for a single user? If we merely map fields as in the example above, Integrator will only find the first phone number, and will skip all the others.

firstName   lastName    who    why      age  ....  phoneNumber_type   phoneNumber
Duke        Java        duke   because  18   ....  Mobile             111-111-1111
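
The sketch below illustrates this behavior (it is not Integrator code): a plain field-by-field mapping can only reach the first row of each dimension.

// Illustration only (not Integrator code): a plain field-by-field
// mapping reaches into each dimension exactly once, so it can only
// see the first row of test and the first row of phoneNumbers.
function mapFirstOnly(user) {
  return {
    firstName: user.firstName,
    lastName: user.lastName,
    who: user.test[0].who,
    why: user.test[0].why,
    age: user.age,
    phoneNumber_type: user.phoneNumbers[0].type,  // only the first phone
    phoneNumber: user.phoneNumbers[0].phoneNumber // "222-222-2222" is lost
  };
}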

So, what can be done if you want to keep all the data from a nested source object?

The Source (FROM) is a nested data object and the Destination (TO) is a flat file that contains all the fields from the driving dataset and all the fields from the dimension (inner dataset)

Integrator supports a technique called "denormalization", also known as extracting the dimension.

When the source (FROM) dataset is a nested object, the following transformation parameters are available:

Continuing with the "users" example above, to extract all the phone numbers together with the user data, we would do the following:

Step 1. Skip the mapping altogether.

Step 2. In the Parameters tab enter:

As a result, we will create the following flat dataset:

firstName   lastName    age  ....  phoneNumber_type   phoneNumber
Duke        Java        18   ....  Mobile             111-111-1111
Duke        Java        18   ....  Home               222-222-2222
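
Conceptually, extracting the dimension repeats the driving-dataset fields for every row of the dimension, as in this minimal sketch (not Integrator code; the column names follow the flattened example above):

// A minimal sketch of "extracting the dimension" (not Integrator
// code): the scalar fields of the driving dataset are repeated once
// per row of the extracted dimension.
function extractDimension(user, dimensionName) {
  // keep only the scalar (non-array) fields of the driving dataset
  const drivingFields = Object.fromEntries(
    Object.entries(user).filter(([, value]) => !Array.isArray(value))
  );
  return user[dimensionName].map(row => ({
    ...drivingFields,
    phoneNumber_type: row.type,
    phoneNumber: row.phoneNumber
  }));
}

// extractDimension(users[0], "phoneNumbers") produces two flat
// records: one for the Mobile number and one for the Home number.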

This technique works perfectly if you have just one dimension (one inner dataset). But what can be done if there are multiple dimensions?

The Source (FROM) is a nested data object and the Destination (TO) results in multiple flat files

If you want to keep all the data from the nested source object, including the data in all dimensions, you need to "normalize" the nested data object across multiple linked flat data objects, as relational databases do.

If you were designing a database for the users example above, you would typically create the following tables:

Users, test, and phoneNumbers, with test and phoneNumbers linked back to Users.
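
To make the idea concrete, here is a rough sketch of what such a normalization produces (it is not what Integrator actually generates; the surrogate key column is purely illustrative):

// A rough sketch of the normalization (not what Integrator actually
// emits; the surrogate key name and layout are illustrative): each
// dimension becomes its own flat dataset linked back to Users.
function normalize(users) {
  const usersTable = [];
  const testTable = [];
  const phoneNumbersTable = [];
  users.forEach((user, index) => {
    const userKey = index + 1; // hypothetical surrogate key
    const { test, phoneNumbers, ...scalarFields } = user;
    usersTable.push({ userKey, ...scalarFields });
    test.forEach(row => testTable.push({ userKey, ...row }));
    phoneNumbers.forEach(row => phoneNumbersTable.push({ userKey, ...row }));
  });
  return { usersTable, testTable, phoneNumbersTable };
}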

This is exactly what Integrator can do for you, automatically.

Step 1. Start by adding a new flow and typing nested into the "Select Flow Type" box.

Step 2. Choose either "Extract nested dataset and create staging files" or "Extract nested dataset and create staging tables", depending on what your destination is.

Step 3. Do a per-field mapping, if needed.

Step 4.1. If the destination is a database, you can specify the following parameters:

Step 4.2. If the destination is a file, you can specify the following parameters:

Integrator uses the following rules when it creates and populates staging tables and files:

Important: after the original source object has been normalized using multiple dimension tables, you can use the technique below to assemble the nested destination object.

The Destination (TO) is a nested data object

All the techniques above work if the destination (TO) is a flat object or multiple flat objects, which covers 99% of all cases when we are getting data from APIs or web services.

What if we need to push data to APIs or web services? Or just create a nested XML or JSON document?

If your destination object (TO) has exactly the same structure as your source object (FROM), you are done. If there is no mapping and you are simply converting from one format to another (for example, from XML to JSON), just specify your source and destination connections and formats and let Integrator do the magic.

If there is a mapping, or if the source structure is not equal to the destination structure, you will need to use a transformation called After extract.
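
To give a feel for the flat-to-nested direction, here is a generic JavaScript sketch (it does not use the Integrator After extract API) that assembles a nested object from a flat driving record and its dimension rows:

// A generic sketch only (this is not the Integrator "After extract"
// API): assemble a nested destination object from a flat driving
// record and its related dimension rows.
function assembleNested(userRow, phoneRows) {
  return {
    firstName: userRow.firstName,
    lastName: userRow.lastName,
    age: userRow.age,
    phoneNumbers: phoneRows.map(row => ({
      type: row.type,
      phoneNumber: row.phoneNumber
    }))
  };
}

const nested = assembleNested(
  { firstName: "Duke", lastName: "Java", age: 18 },
  [
    { type: "Mobile", phoneNumber: "111-111-1111" },
    { type: "Home", phoneNumber: "222-222-2222" }
  ]
);
console.log(JSON.stringify(nested, null, 2));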

Read how to create a complex nested data object using JavaScript.