In this article, I am going to demonstrate how to load JSON data into a relational schema. I’ll be using random user data, as generated by the public Random User Generator API. The generated user data is loaded into a Postgres database to create a test dataset of 1000 users.

I will be using the Tweakstreet data processing tool (preview build 27) for this task. The preview builds are free. The GA version will remain free for personal and evaluation use.

The final data flow performing the load is available for download. If you’d like to run it yourself, unzip the download, and open the loading.dfl file with Tweakstreet. Make sure to drop the Postgres JDBC driver into your $HOME/.tweakstreet/drivers folder as described in the configuration documentation. Change the database connection definition to match your test database. You can access database connections by selecting the menu option: Flow → Properties…

Input format and output schema

The API generates user information in the following JSON format:

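The sample output is not reproduced here, but based on the API’s documented response shape, a single user record looks roughly like this (values are illustrative, and only the fields relevant to this article are shown):

```json
{
  "gender": "female",
  "name": { "title": "Miss", "first": "Jane", "last": "Doe" },
  "email": "jane.doe@example.com",
  "dob": { "date": "1984-07-20T09:44:18.674Z", "age": 39 },
  "location": {
    "city": "Springfield",
    "state": "Illinois",
    "country": "United States"
  }
}
```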

I’d like to extract a subset of the fields into a small relational schema:

the target relational schema: a person table referencing a location table

Tweaking the request

Since the API is public, and requires only a single GET request, you can simply use your browser to see the API results. The API documentation gives us some options to tweak the request:

The basic URL is:

Since I’d like the results to be stable, I fix the API version in the request URL, which leads to:

Depending on your browser you may see unformatted JSON output. Let’s make sure it’s human-readable by adding a corresponding parameter:

Next, I’d like to optimize response time by requesting only the fields we need:

,name,email,dob,location

We want consistent outcomes during development. If something goes wrong for a particular record, we’d like to repeat the request and get the same results back so we can debug the issue. This is done using a seed parameter; the same seed value generates the same results:

,name,email,dob,location&seed=myseed

Finally, we need more than one result. We can start with 10 results for the development cycle, then turn it up to 1000 once we’re satisfied with the import process:

,name,email,dob,location&seed=myseed&results=10

Now our JSON response looks like this:

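Again the sample is elided; a sketch of the overall response shape, with the results array trimmed to a single illustrative entry, would be:

```json
{
  "results": [
    {
      "gender": "male",
      "name": { "title": "Mr", "first": "John", "last": "Smith" },
      "email": "john.smith@example.com",
      "dob": { "date": "1975-03-02T11:22:33.444Z", "age": 48 },
      "location": { "city": "Leeds", "state": "West Yorkshire", "country": "United Kingdom" }
    }
  ],
  "info": { "seed": "myseed", "results": 10, "page": 1 }
}
```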

Import strategy

The import process consists of the following major sections:

  • make an HTTP request to retrieve the data
  • generate one row per user in the API response
  • extract the relevant person and location fields from each user record
  • ensure the location record exists in the database, reusing location ids where possible
  • write the person record
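The overall strategy can be sketched in plain Python (the function names and field mapping below are illustrative, not Tweakstreet’s API; ensure_location and insert_person stand in for the database steps described later):

```python
import json

def import_users(response_text, ensure_location, insert_person):
    """Sketch of the import strategy: parse, itemize, extract, load."""
    users = json.loads(response_text)["results"]
    for user in users:
        loc = user["location"]
        # ensure_location returns the id of an existing or newly created row
        location_id = ensure_location(loc["country"], loc["state"], loc["city"])
        insert_person({
            "first_name": user["name"]["first"],
            "last_name": user["name"]["last"],
            "email": user["email"],
            "dob": user["dob"]["date"],
            "gender": user["gender"][0],  # one-character marker: "m" or "f"
            "location_id": location_id,
        })
```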

The data flow

The HTTP Request

The data flow starts with the HTTP Request step retrieving the server response. Attaching a dummy step makes it possible to track the data passing between steps. In our case it will be a single row with the response string from the HTTP request:

run and track rows passing between steps

Itemizing users

We want to do two things with the response:

  1. Parse it as JSON, to make the data navigable
  2. Generate a row for every item in the response["results"] array

The Get Items step iterates over a list of items, generating a row for each. The response is parsed using the built-in JSON parsing function in the entry variables section of the step.

get items step configuration
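In plain Python, the two operations correspond to something like this (a sketch, not Tweakstreet’s internals):

```python
import json

def itemize_users(response_text):
    """Parse the HTTP response and generate one row per entry in results."""
    data = json.loads(response_text)   # make the data navigable
    for user in data["results"]:       # one row per user record
        yield user
```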

Extracting fields

The next step is to extract the desired field values for convenient preview and assignment to database fields. The Calculator step is suitable for this: it performs arbitrary calculations for each passing row. In our case, it uses formula expressions to pull the desired field values to the top level:

extracting fields from JSON data

The built-in date parsing function is used with a standard format string to convert the date of birth from a string into an actual datetime value. The one-character gender marker is taken from the first character of the gender string, so in our data it will always be m or f.
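The same extraction can be sketched in Python, assuming the ISO-style timestamp the API returns for dob.date (the function name and flat output layout are illustrative):

```python
from datetime import datetime

def extract_fields(user):
    """Pull the relevant values to the top level of a flat dict."""
    return {
        "email": user["email"],
        # convert the dob string into an actual datetime value
        "dob": datetime.strptime(user["dob"]["date"], "%Y-%m-%dT%H:%M:%S.%fZ"),
        # one-character gender marker: first character of the gender string
        "gender": user["gender"][0],
    }
```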

When running the flow, the data tracking shows the extracted fields:

extracted fields

Populating the location table

We want to ensure that the combination of the country, state, and city fields exists in our location table, and retrieve that location’s id to place into the person record. The Junk Dimension step performs exactly this:

the junk dimension step ensures there is a combination of given values in the database table
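The get-or-create behavior behind the step can be illustrated with a plain lookup-then-insert, here sketched against an in-memory SQLite table (the real step generates equivalent statements for the configured connection; table and column names are assumptions):

```python
import sqlite3

def ensure_location(conn, country, state, city):
    """Return the id of the matching location row, inserting it if absent."""
    row = conn.execute(
        "SELECT id FROM location WHERE country = ? AND state = ? AND city = ?",
        (country, state, city),
    ).fetchone()
    if row is not None:
        return row[0]        # reuse the existing id
    cur = conn.execute(
        "INSERT INTO location (country, state, city) VALUES (?, ?, ?)",
        (country, state, city),
    )
    return cur.lastrowid     # id of the newly created row
```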

Writing person records

The SQL Insert step inserts a record for each passing row. We’ll just configure the fields we need:

inserting person data
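The insert itself boils down to a parameterized statement per row, sketched here against SQLite for illustration (column names are my assumptions, not the step’s generated SQL, and only a subset of fields is shown):

```python
import sqlite3

def insert_person(conn, person):
    """Insert one person row using a parameterized statement."""
    conn.execute(
        "INSERT INTO person (first_name, last_name, email, gender, location_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (person["first_name"], person["last_name"], person["email"],
         person["gender"], person["location_id"]),
    )
```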

Creating the tables

The location and person tables must be created before the data flow can populate them. The Junk Dimension and SQL Insert steps can each generate the SQL for table and index creation, but as a design principle they do not execute it autonomously: table and index creation are always explicit in Tweakstreet. We can copy the statements the steps generate into a SQL Script step and put it in front of the entire flow.

For Postgres 11 the following DDL does the job:

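The actual statements are generated by the steps and not reproduced here; a plausible hand-written equivalent, assuming the fields used above (names and types are my guesses, not the generated output), would be:

```sql
CREATE TABLE location (
  id      bigserial PRIMARY KEY,
  country text NOT NULL,
  state   text NOT NULL,
  city    text NOT NULL
);
CREATE UNIQUE INDEX location_fields_idx ON location (country, state, city);

CREATE TABLE person (
  id          bigserial PRIMARY KEY,
  first_name  text,
  last_name   text,
  email       text,
  gender      char(1),
  dob         timestamp,
  location_id bigint REFERENCES location (id)
);
```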

Running end-to-end

We’re now ready to run a complete cycle, which will re-create and re-populate our tables. It is time to change the results parameter on the request URL to load 1000 records, and optionally remove the seed parameter to get different results on each load:

the entire process loading 1000 records

The loaded data

After the flow completes, the data is loaded as expected:

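The sample listing is elided; a quick join is one way to verify the loaded data (a sketch, with table and column names that are my assumptions):

```sql
SELECT p.first_name, p.last_name, p.email, l.city, l.country
FROM person p
JOIN location l ON l.id = p.location_id
LIMIT 10;
```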
Published On: December 10th, 2023 / Categories: JSON data, Postgres, Relational Schema /