When building analytics on top of a relational database like Postgres, MySQL, SQL Server, or Oracle, star schemas can be useful at various points in the data model. In this article I'm going to build a simple analytics schema, starting from a CSV file and transforming the data into a simple star schema with two dimension tables.
Our sample data contains worldwide, dollar-equivalent prices of supermarket goods. A single source CSV file contains price observations, each recording the price of an item as seen in a certain region.
The source file contains one observation per line, combining location attributes and item attributes with the observed price.
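A few illustrative rows might look like the following. The column names and values here are my invention for illustration, not the original file layout:

```
region,country,city,item_category,item_name,price
Europe,Germany,Berlin,Breakfast,Waffles,2.35
Europe,France,Lyon,Breakfast,Waffles,2.61
Asia,Japan,Osaka,Snacks,Rice Crackers,1.80
```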
All data is artificially generated. The sample file is available for download: input_full.csv.zip, a 1,000,000-record input file, ca. 3 MB zipped.
The target data model
When loading this data into a relational database for analytics, we want to keep all information, but avoid duplication. The aim is to keep storage footprint low, and query performance high.
I want to treat the price field as a fact measure that I can average over locations or products. I split the item and location attributes into dimension tables.
One benefit of this design is that the location fields and the item fields each have a relatively small number of combinations compared with the total number of observations. This keeps the dimension tables small and efficient to join to the fact table when needed.
Dimension tables that consist of a dynamic combination of fields with no historization mechanism are sometimes called junk dimensions. The term comes from the fact that they often collect the left-over fields that remain after all other dimension tables have been modeled.
In a real-world scenario, the date, and possibly the time, of the observation would be another crucial piece of information. It would be modeled as a key into the date dimension table. Additionally, a real-world scenario would include housekeeping fields on all tables, such as load_id, load_date, or last_updated columns. I am going to omit these aspects in this article to avoid distraction.
Querying the data model
Once the model and data are in place, we want to query it. A typical analytical query retrieves a ranking based on measures, grouped by attributes: for example, the top 10 countries with the cheapest average price for waffles.
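Assuming a fact table fact_prices with an item_id key into dim_item and a location_id key into dim_location (all table and column names are my assumptions, not the article's actual schema), the query could be sketched as:

```sql
-- top 10 countries by cheapest average waffle price
SELECT l.country, AVG(f.price) AS avg_price
FROM fact_prices f
JOIN dim_item i     ON i.id = f.item_id
JOIN dim_location l ON l.id = f.location_id
WHERE i.item_name = 'Waffles'
GROUP BY l.country
ORDER BY avg_price ASC
LIMIT 10;
```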
Once all data is loaded, the query returns the corresponding list of countries and their average waffle prices.
Most analytics, OLAP, and reporting tools understand the concept of star schemas (fact tables containing measures, connected to any number of dimension tables through foreign keys) and let you quickly formulate analytical queries like this through their graphical user interfaces.
Creating the tables
For Postgres 11, the data model consists of the two dimension tables, the fact table, and lookup indexes on the dimension tables.
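The DDL might be sketched along the following lines. Table and column names are assumptions based on the descriptions above, not the article's exact statements:

```sql
-- junk dimension for item attributes
CREATE TABLE dim_item (
  id            bigserial PRIMARY KEY,
  item_category text NOT NULL,
  item_name     text NOT NULL
);
CREATE UNIQUE INDEX idx_dim_item_lookup ON dim_item (item_category, item_name);

-- junk dimension for location attributes
CREATE TABLE dim_location (
  id      bigserial PRIMARY KEY,
  region  text NOT NULL,
  country text NOT NULL,
  city    text NOT NULL
);
CREATE UNIQUE INDEX idx_dim_location_lookup ON dim_location (region, country, city);

-- fact table holding the price measure and dimension keys
CREATE TABLE fact_prices (
  item_id     bigint  NOT NULL REFERENCES dim_item (id),
  location_id bigint  NOT NULL REFERENCES dim_location (id),
  price       numeric NOT NULL
);
```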
Tweakstreet can generate the necessary SQL when configuring the steps that work with database tables.
The loading process
There are two basic approaches to loading fact and dimension tables:
inline: populate dimension tables as part of the fact loading process
tiered: populate all dimensions first, populate all fact tables second
The tiered approach has many benefits. It offers better concurrency: since the dimension tables are not directly related to each other, they can be loaded in parallel. It also offers good recovery options when technical errors occur: if the dimension loading stage completes successfully, you know that you don't need to repeat it when recovering from errors during fact loading.
The downside of the tiered approach is that you need to process the input data multiple times: once to fill the dimensions, and again to load the facts.
For the sake of simplicity, I am going to use the inline loading approach and load the dimensions and facts in a single data flow. In any non-trivial scenario with shared dimension tables, I would recommend using the tiered approach instead.
Loading a dimension table
I am going to use the Junk Dimension step of Tweakstreet to load the dimension tables. The step is configured with a mapping of input fields to database columns, and takes care of loading the dimension table efficiently.
The step needs to know the database connection to work with, the table to use, and the fields to use. It returns the primary key of the dimension record corresponding to the configured write fields.
If we had any housekeeping fields like load_id, load_date, etc., they would go into the meta fields section, which contains fields that are not part of the lookup but are inserted whenever a corresponding record is not yet in the table.
How the step works
For every input record the step performs the following operations:
look up the incoming combination of field values in the cache, and return the cached primary key on a hit
on a cache miss, look up the combination in the dimension table, cache the result, and return its primary key
if no matching record exists, insert a new dimension record, including any meta fields, and return the newly generated key
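On a cache miss, the lookup-or-insert against the database follows a general pattern that can be sketched in SQL. This is my illustration of the idea, not Tweakstreet's actual statements, using an assumed dim_item table with item_category and item_name columns:

```sql
-- try to find an existing dimension record for the incoming combination
SELECT id
FROM dim_item
WHERE item_category = 'Breakfast' AND item_name = 'Waffles';

-- if no row came back, insert the combination and return its new key
INSERT INTO dim_item (item_category, item_name)
VALUES ('Breakfast', 'Waffles')
RETURNING id;
```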
Following this process, the step returns the primary key for a corresponding dimension record for every input record.
The cache plays an important role in the efficient execution of the step. You can configure how many data rows are kept in the cache, and whether it is pre-loaded with data from the database table at the beginning of the process. Every cache hit reduces the number of round-trips to the database. Ideally, you can make the cache big enough to fit all records in memory.
For cases where the entire dimension is too large to fit into memory, the step gives you control over which records to initialize the cache with: you can supply a WHERE clause to the cache population query. If you know something about the incoming data, for example that certain fields are likely constant given its source, you can use that knowledge to initialize the cache effectively.
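If, say, all records in a given source file were known to cover a single country (a hypothetical example, using my assumed dim_location columns), the cache population query could be restricted accordingly:

```sql
-- pre-load only the dimension records relevant to this input file
SELECT id, region, country, city
FROM dim_location
WHERE country = 'Germany';
```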
The loading flow
The flow begins by re-creating the fact table. Usually you would append to a fact table, but this one is a snapshot table that we want to re-populate on every run. Next, the flow reads the source records using the CSV input step. It then loads the dim_item and dim_location tables using the Junk Dimension step, retrieving the keys of the corresponding dimension records. As a final step, the SQL Insert step inserts the fact records into the database.
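With both dimension keys retrieved for a record, the per-record fact insert amounts to a statement like the following, again using my assumed table and column names and made-up key values:

```sql
-- item_id and location_id come from the two Junk Dimension lookups
INSERT INTO fact_prices (item_id, location_id, price)
VALUES (42, 17, 2.35);
```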
The flow executes all steps in parallel. The read file step reads records from the input file and passes them on to a queue connected to the next step. This way, only a small set of records is in memory at any given time, and the whole process can load millions of records efficiently.
When constructing the data flow, you can execute it at any time, and follow the evolution of the data in your stream. The following screencast shows the entire process of interactively constructing the loading process from scratch, with an example of how to execute it from the command line.