When developing data pipelines, we assemble a series of processing steps, lookups, computations, and so on, in order to reshape source data into a target form.
The development cycle looks something like this:
1. add a transformation step or calculation
2. run against incoming data
3. verify the intended effect
4. move on to the next transformation task
If we’re coding pipelines, the simplest way of verifying the effect of our transformations is to put logging statements at the appropriate places, and run the pipeline.
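For instance, a hand-coded pipeline might interleave logging with its steps like this. This is a minimal toy sketch, not tied to any particular framework, just to illustrate the verify-by-logging workflow:

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(message)s")
log = logging.getLogger("pipeline")

def parse(rows):
    # step 1: split each CSV line into fields
    for row in rows:
        record = row.strip().split(",")
        log.debug("after parse: %r", record)  # logging to verify this step
        yield record

def keep_valid(records):
    # step 2: keep only records with the expected number of fields
    for record in records:
        if len(record) == 3:
            log.debug("after filter: %r", record)  # logging to verify this step
            yield record

source = ["a,b,c", "broken", "x,y,z"]
result = list(keep_valid(parse(source)))
```

Every change to any step means re-running the whole thing and reading back through the log output, which is exactly the feedback loop discussed below.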
If we’re using a graphical tool, it likely allows for test runs. We can see the output of individual steps, and can confirm that our design works as expected.
Both of these approaches have a shortcoming: they run the entire pipeline. That carries the cognitive overhead of kicking off a full run and navigating back to the step we’re presently working on. The pipeline may also have to do some heavy lifting before any data even arrives at that step, for example when we’re working after a sort or aggregation step that needs to see all data before forwarding anything. This means we’re working with delayed feedback, and every single change we make takes longer and longer to verify.
To some extent we can work around these limitations by breaking up the data pipelines into smaller pieces. But we’re then burdened with preparing adequate test data for each individual sub-pipeline. We can’t tap into the source data directly, as this would only fit our very first sub-pipeline, and the output of that is what feeds the others.
Tweakstreet gives you immediate feedback
We designed Tweakstreet to allow for a much shorter feedback cycle when working on transformations. There is no need to re-run the whole pipeline all the time. In fact, you can get immediate feedback on most calculations.
An example task
Let’s look at an example that is fairly common in batch data processing. The pipeline is supposed to read a set of CSV data files stored in a file system layout where each file path encodes metadata about its contents.
Given one data file path, the requirements are:
- extract the date
- extract the customer number
- extract the transaction type
Our pipeline would hand off the path, along with the extracted metadata, to a sub-pipeline responsible for loading the file contents.
You could start by discovering the files to be loaded using the Get Files step, and passing the paths to a subsequent calculator step for metadata extraction.
For each input row, the calculator can now perform the metadata extraction. The current path is available in scope as in.path.
The inline-calculation approach
There are several good ways to extract and validate the data. I’ll go with splitting the path on /, and then further extracting information from each part.
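As a rough sketch of that split-and-extract approach, here is the same logic in plain Python. The directory pattern used below (data/&lt;yyyy-mm-dd&gt;/customer-&lt;number&gt;/&lt;type&gt;.csv) is a made-up stand-in, since the actual layout isn’t reproduced here; a real pipeline would adapt the parsing to its own layout:

```python
from datetime import date

def extract_meta(path):
    # Hypothetical layout: data/<yyyy-mm-dd>/customer-<number>/<type>.csv
    parts = path.split("/")
    if len(parts) != 4:
        raise ValueError(f"unexpected path shape: {path!r}")
    _root, date_part, customer_part, file_part = parts
    return {
        # date.fromisoformat raises ValueError on a malformed date,
        # so this doubles as validation
        "date": date.fromisoformat(date_part),
        "customer": int(customer_part.removeprefix("customer-")),
        "transaction_type": file_part.removesuffix(".csv"),
    }

meta = extract_meta("data/2021-03-14/customer-42/refund.csv")
```

Each part is validated as it is extracted: a bad date or a non-numeric customer number fails loudly instead of flowing downstream.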
Tweakstreet’s ability to evaluate expressions inside step configuration allows you to develop computation expressions inline, getting results immediately. Once you’re satisfied with your computations, you switch from test data to input data and go back to run-and-view to assess the big picture.
The following clip shows the entire development cycle for the required data extraction, verifying each step along the way.