ETL solutions tend to be opaque to the operators running them. Any non-trivial ETL solution has multiple things it can potentially do: reset its housekeeping tables, run the test suite, import master data, execute a load phase, display its environment configuration, and so on.
ETL tools save their artifacts as plain files or as entries in a repository of some sort, or they create release packages. They also provide a way to run such artifacts, optionally passing in some parameters.
From a functional perspective, this is all an ETL tool needs to do. It gives developers a way of running their flows. What else is needed when deploying an ETL project?
Suppose the ETL project can import some CSV data into its master data storage. The operations manual will often instruct the operator to run something like this:
It’s hard to see what’s even going on there. Any self-respecting ops team running that project for a while will likely come up with a set of custom wrapper scripts for common tasks.
But that’s exactly what the project should offer in the first place. Something closer to this:
Ideally, invocation of the ETL project is very close to conventions already idiomatic in the production environment. Operators should not need ‘training’ in order to invoke the project’s basic functions.
An ETL project’s invocation interface should be expressed in terms of its intended use, not the artifacts of the underlying technology.
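As a rough illustration of that principle, the Python sketch below maps intent-named commands onto the generic runner invocations they hide. Everything in it is hypothetical: the `etl-runner` executable, its flags, the artifact paths, and the command names are placeholders, not the interface of any real ETL tool.

```python
import shlex

# Hypothetical mapping from operator-facing commands to the runner
# invocation each one hides. "etl-runner", its flags, and the artifact
# paths are placeholders, not a real tool's interface.
RUNNER_COMMANDS = {
    "import-master-data": [
        "etl-runner", "--artifact", "flows/master/import_csv.flow",
        "--param", "input_file={input_file}",
    ],
    "reset-housekeeping": [
        "etl-runner", "--artifact", "flows/admin/reset_housekeeping.flow",
    ],
}


def build_invocation(command: str, **params: str) -> list[str]:
    """Translate an intent-named command into the runner's argv list."""
    if command not in RUNNER_COMMANDS:
        raise ValueError(f"unknown command: {command}")
    return [part.format(**params) for part in RUNNER_COMMANDS[command]]


if __name__ == "__main__":
    argv = build_invocation("import-master-data", input_file="customers.csv")
    # Print the command instead of running it, since the runner is made up.
    print(shlex.join(argv))
```

The operator only ever types the intent-named command. The artifact path and parameter syntax live in one place and can change without anyone relearning the interface.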
A short digression
What are the steps to start your car? Press a button? Turn a key? These days it’s fascinating what the steps used to be.
Starting a vintage car is not intuitive. The steps the driver had to perform do not correspond to the driver’s intention, which is simply to start the engine.
Similarly, an ops team cannot be very effective if it is exposed to overwhelming implementation detail and then handed instructions on how that complexity maps to its intended use cases.
Benefits of explicit entry points
Alignment of invocation interface with problem space
Having a set of entry points that are framed in terms of the customer’s problem space removes unnecessary mental gymnastics when running the project. As a result, the customer and the ops team have an immediate, intuitive understanding of the invocation interface.
It helps with demos and gathering feedback, too. Business stakeholders are able to follow a technical demo much better when the presenter never conceptually departs from their problem domain.
Users operate most naturally within the mental model of their use cases, not implementation architecture.
Smaller surface area
When there are explicit entry points to the ETL solution, they are the only entry points to the solution. Implementation details, such as the way flows are structured and composed internally, can change without affecting the outside surface of the solution. This makes refactors and architecture adjustments much less disruptive. The entry points form a contract between the solution and its users. That contract changes rarely, and when it does, the change is a conscious decision. It won’t break incidentally because of immaterial implementation changes.
Embedded documentation
Conventionally, command line utilities offer --help switches that print some level of usage documentation. Entry point scripts for ETL projects can embrace that convention as well. Documentation that is kept with the actual code is much more likely to be correct and up to date than external documents. The presence of embedded documentation can be automatically tested, at least to some extent. It can certainly be a required checkbox in code reviews. Embedded documentation beats external documentation in every way relevant to actually running the software, and the entry point scripts are a great place for it.
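A minimal sketch of this idea in Python, using the standard library's `argparse` module: each entry point carries its own help text, and a small check reports any command that lacks one. The command names and descriptions are illustrative assumptions, not taken from a real project.

```python
import argparse

# Hypothetical entry points; names and descriptions are illustrative.
COMMANDS = {
    "import-master-data": "Import master data from a CSV file.",
    "run-load": "Execute the load phase.",
    "show-config": "Display the environment configuration.",
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="etl", description="Entry points for the ETL project."
    )
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, summary in COMMANDS.items():
        # `help` appears in the top-level listing, `description` in
        # the subcommand's own --help output.
        subparsers.add_parser(name, help=summary, description=summary)
    return parser


def undocumented_commands() -> list[str]:
    """Return commands whose help text is missing or blank."""
    return [name for name, summary in COMMANDS.items() if not summary.strip()]


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Stand-in for dispatching to the real ETL invocation.
    print(f"would run: {args.command}")
```

A test suite could assert that `undocumented_commands()` returns an empty list, which covers the "at least to some extent" part: it proves help text exists, not that it is accurate.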
Controlling exit codes
Exposing entry points as scripts also makes it possible to provide semantically consistent exit codes. This might not be a big deal when the ETL solution is running standalone. But when it is invoked via schedulers, or composed into a bigger solution via a shell script, it is vital that it generates reliable, semantically rich exit codes. A generic runner will likely be able to indicate success or failure, but a custom script can return exit codes that encode additional information. If an orchestrating script knows the reason for failure, say, a DB being down vs. a file not being present, it might handle these cases differently.
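One way this could look in Python is sketched below. The specific codes, the `DatabaseUnavailable` exception, and the `load_step` stand-in are assumptions made up for the example. A real project would define its own contract and document it.

```python
import sys
from enum import IntEnum


class ExitCode(IntEnum):
    """Hypothetical exit code contract for the entry point scripts."""
    OK = 0
    GENERAL_FAILURE = 1
    # 2 is left free because argparse uses it for usage errors.
    INPUT_FILE_MISSING = 3
    DATABASE_UNAVAILABLE = 4


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) load step when the DB is unreachable."""


def run_with_exit_code(step) -> ExitCode:
    """Run a callable and translate known failures into exit codes."""
    try:
        step()
    except FileNotFoundError:
        return ExitCode.INPUT_FILE_MISSING
    except DatabaseUnavailable:
        return ExitCode.DATABASE_UNAVAILABLE
    except Exception:
        return ExitCode.GENERAL_FAILURE
    return ExitCode.OK


def load_step() -> None:
    # Stand-in for the real ETL invocation.
    raise DatabaseUnavailable("connection refused")


if __name__ == "__main__":
    sys.exit(int(run_with_exit_code(load_step)))
```

An orchestrating script can then branch on the code, for example retrying later when the database is unavailable but alerting immediately when an input file is missing.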
Environment configuration
Entry point scripts have the opportunity to set up environment configuration before invoking any ETL at all. That makes them a great place to set, and potentially validate, that configuration.
It’s a good idea to have an entry point script that does nothing but show, and potentially validate, the environment configuration details it would use for invocation of actual ETL. It helps a lot when you are debugging an issue and want to be really sure that it’s not due to misconfiguration.
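A sketch of such a script in Python follows. The setting names (`ETL_DB_URL`, `ETL_INPUT_DIR`, `ETL_DB_PASSWORD`) are hypothetical, and reading configuration from environment variables is an assumption. The same shape applies to config files or any other source.

```python
import os

# Hypothetical settings this ETL project expects; names are illustrative.
REQUIRED_SETTINGS = ("ETL_DB_URL", "ETL_INPUT_DIR")
SECRET_SETTINGS = ("ETL_DB_PASSWORD",)


def collect_config(environ) -> dict[str, str]:
    """Read every known setting, using an empty string when unset."""
    names = REQUIRED_SETTINGS + SECRET_SETTINGS
    return {name: environ.get(name, "") for name in names}


def validate_config(config) -> list[str]:
    """Return the names of required settings that are unset or empty."""
    return [name for name in REQUIRED_SETTINGS if not config.get(name)]


def render_config(config) -> str:
    """Format the settings for display, masking secret values."""
    lines = []
    for name, value in sorted(config.items()):
        if name in SECRET_SETTINGS and value:
            value = "****"
        lines.append(f"{name}={value or '<unset>'}")
    return "\n".join(lines)


if __name__ == "__main__":
    config = collect_config(os.environ)
    print(render_config(config))
    missing = validate_config(config)
    if missing:
        # A string argument is printed to stderr and the exit status is 1.
        raise SystemExit(f"missing settings: {', '.join(missing)}")
```

Masking secrets keeps the output safe to paste into a ticket, which matters because this script is most useful during incident debugging.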
ETL projects are software too. Like any other software, they should provide an interface that is aligned with their intended use case, not the incidental happenstance of the implementation. Explicit entry points help users understand how to use the project, and they offer additional benefits like an explicit and small public surface area, embedded documentation, control over exit codes, and the opportunity to manage environment configuration.