Whether you’re wrangling data for a one-off analysis, ingesting data into a data lake, or building a data warehouse of some sort, all of these activities move data, potentially transforming it in the process. You’re creating a data pipeline of one kind or another.
There are plenty of opinions and little agreement among data engineers as to whether data pipelines should be built in code, in an ETL tool, or mostly within the target system itself.
Dilemma – code my ETL entirely in SSIS or use SSIS to orchestrate T-SQL procs?
— Jesse Seymour (@JesseBizInt) September 13, 2017
Is it acceptable, or even recommended, to hand code the extract, transform, load (ETL) logic for a DW, or is tool-based ETL preferred?
— Dave Welensky (@Ixia_Dave) September 15, 2014
The tooling discussion seems to revolve mostly around developer convenience, performance considerations, and tool support for streaming vs. batch operations. Some developers favor a low-level, high-performance approach in order to do it right, whereas others go down the path of least resistance. That path is mostly defined by what tooling people are familiar with and what happens to be readily available.
The performance-driven approach, sometimes paired with a bit of elitist developer pride, caters first to the constraints of time, ego, and infrastructure. The least-resistance approach caters first to the preferences of individuals, and sometimes to plain laziness. Neither approach caters primarily to the needs of the organization.
Credibility first
Whatever tooling is employed for technical and infrastructure reasons, a primary concern is that the effects of the pipeline can be objectively and independently verified. Any filters, cleanups, omissions, and corrections performed must be documented, justified, and quantified. If they are not, any analysis based on that data is not repeatable, at least not from the source data.
How much credence can you give to an analysis where the source data might have been “tortured” enough to confess to whatever the analyst needed? How do you know Alice is working with valid data when Bob gives her a “cleaned up” version of the data to base her analysis on? If Bob is working with an unstructured, home-grown Perl script that only God and Bob understand, who can challenge Bob on the assumptions he makes when transforming the data, or on the correctness of his implementation?
Anything done to the source data should be accessible and comprehensible to interested parties.
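One lightweight way to make such rules explicit, whatever the tooling, is to pin them down in a test suite. Here is a minimal sketch, assuming a hypothetical clean_amount() cleanup step and pytest as the test runner:

```python
# minimal sketch: clean_amount() and its rules are invented for
# illustration, they are not taken from a real pipeline
import pytest


def clean_amount(raw: str) -> float:
    """Normalize a raw amount like ' 1.234,50 EUR ' to a float."""
    value = raw.replace("EUR", "").strip()
    value = value.replace(".", "").replace(",", ".")  # 1.234,50 -> 1234.50
    return float(value)


def test_strips_currency_marker_and_whitespace():
    assert clean_amount(" 1.234,50 EUR ") == 1234.50


def test_rejects_values_it_cannot_interpret():
    with pytest.raises(ValueError):
        clean_amount("n/a")
```

Tests like these double as documentation of the cleanup rules, and anyone can re-run them to challenge Bob’s assumptions.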
A test suite of this kind helps, regardless of the approach. With that in mind, let’s look at the technology approaches commonly suggested and employed.
Programming
Programming a data pipeline makes it easy to control every detail of every step of a data transformation. The upside is that you control everything; the downside is that you have to do everything. The biggest challenges are efficient I/O and CPU utilization through adequate multi-threading, and we all know parallel prog ming ram sy is ea.
The best-case scenario is using a framework like Spark that takes a lot of architecture and scalability concerns off your plate, while also providing data processing primitives and functions.
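As a rough illustration of what those primitives buy you, here is a small PySpark sketch; the file name, column names, and aggregation are made up for the example:

```python
# hypothetical example: read raw orders, filter, aggregate, write out;
# Spark takes care of partitioning, parallelism, and spilling to disk
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-pipeline").getOrCreate()

orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

daily_revenue = (
    orders
    .filter(F.col("status") == "completed")   # drop incomplete orders
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))    # revenue per day
)

daily_revenue.write.mode("overwrite").parquet("out/daily_revenue")
```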
In general, I recommend not resorting to programming unless there is no other option.
Coding to a spec in particular may yield unfavorable results, as the spec is rarely verified against all of the data. A tight feedback loop that lets you interactively verify each step of the data pipeline during development helps. Hopefully we get one as part of our data processing framework, because printf may not be enough.
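In a Spark shell or notebook, for example, that feedback loop can be as simple as inspecting the intermediate results of the hypothetical pipeline sketched above while building it up:

```python
# reusing the invented names from the sketch above
orders.printSchema()    # did the schema come out as expected?
orders.show(5)          # eyeball a few raw rows

completed = orders.filter(F.col("status") == "completed")
print(completed.count(), "of", orders.count(), "rows kept by the filter")

daily_revenue.show(10)  # inspect the aggregated result
```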
When we are programming, we are more concerned with the details of the code than with the details of the data.
Explaining what is going on to non-developers is difficult, and the risk of unintended effects (bugs) is relatively high.
Visual ETL tools
ETL tools typically solve the I/O and multi-threading efficiency problems to some degree, and allow you to work with your data directly. As a trade-off, you lose the ability to influence the overall engine design, and you might be forced to work around less-than-optimal support for file types, data types, or data sources. Jumping through hoops the ETL tool imposes is not uncommon, and performance will rarely match what a hand-coded solution can achieve.
The big upside is that inspecting data rows as they pass through transformation steps usually comes free.
People with little technical background have a chance to follow a transformation graph.
The effects of each transformation step can be demonstrated interactively and directly.
Transforming data in the database
If you’re well versed in SQL, but don’t otherwise have a programming background, and learning a visual ETL tool is not something you want to invest in, there’s always the option of first loading raw source data into staging tables, deferring transformations to a set of SQL operations. You derive your target tables through a series of SELECTs, creating temporary tables along the way as you need them.
This approach is reasonable as long as SQL is able to express the transformations you need to perform, and the database technology supports your queries and temporary tables well. The downside is that non-trivial operations tend to be pretty hard to express in SQL.
Nevertheless, if temporary tables are created at reasonable intervals, an interested party can trace the transformation pipeline by following records from the source, across temp tables, and into their final form with a few simple SELECTs, roughly along the lines of the sketch below.
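To make that concrete, here is a small, self-contained sketch of the pattern, using SQLite from Python so it runs anywhere; the table and column names are invented, and in practice the SQL would run directly against your database:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# staging table: raw source data loaded as-is
con.execute("""
    CREATE TABLE staging_orders (
        order_id INTEGER, order_date TEXT, status TEXT, amount REAL
    )
""")
con.executemany(
    "INSERT INTO staging_orders VALUES (?, ?, ?, ?)",
    [(1, "2024-01-01", "completed", 10.0),
     (2, "2024-01-01", "cancelled",  5.0),
     (3, "2024-01-02", "completed",  7.5)],
)

# intermediate step as a temp table, so anyone can inspect it
con.execute("""
    CREATE TEMP TABLE tmp_completed AS
    SELECT order_id, order_date, amount
    FROM staging_orders
    WHERE status = 'completed'
""")

# derive the target table with a final SELECT
con.execute("""
    CREATE TABLE daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM tmp_completed
    GROUP BY order_date
""")

for row in con.execute("SELECT * FROM daily_revenue ORDER BY order_date"):
    print(row)  # ('2024-01-01', 10.0), ('2024-01-02', 7.5)
```

Because every intermediate step is a table, each step can be queried and checked independently.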
So… which approach is right?
Any non-trivial transformation has parts that are easy to solve using one particular approach, and a headache to get right using another. As such, a good choice depends on the circumstances.
My default recommendation is to use an ETL tool that runs natively on your infrastructure, supports all your data sources, allows for embedded scripting or programming, and has good support for issuing custom query statements against your data sources and interpreting their results. That gives you options when the default approach of the tool is not cutting it for your particular problem. At the same time, it preserves the ability to demonstrate how data is transformed, and thus makes the pipeline more accessible to others than a pure-code solution would be.
My favorite ETL tool is, obviously, Tweakstreet, which we’ve created over here at Twineworks. It runs on Windows, Linux, and macOS, and while not yet feature-complete, it already supports everything you need to create data pipelines for analytics.
When ETL tools are not a practical choice, I tend to recommend a programming environment and language that the majority of the data team is familiar with. Sometimes that’s SQL, and sometimes that’s a general-purpose programming language with a data processing framework on top.
The guiding principles are transparency, maintainability, and the avoidance of personal code ownership.