Data Preparation

Raw data sets need transformation, enrichment, and cleanup before they can be used effectively.

Data sources often provide data as-is rather than in the form you need. The shape, format, and encoding of the information are whatever was practical for the system that generated the data. As a result, data extracts are often cryptic, like this:

C|Acme Inc
N|982
I|MLC|27|90.34
I|BBL|4|9.99
N|989
I|QTA|2|10.00
I|EAD|3|4.35
...

The information content is far from self-evident. The data uses internal mnemonics, identifiers which need to be looked up, and a non-uniform structure that implies various kinds of records and record relationships are present.

Tweakstreet allows you to automate the data preparation process that takes the data as it comes from the source, and transforms it into a usable asset. The process typically consists of several conceptual phases.

Transformation

The structure of the data set needs to be transformed to fit its intended use. Staying with our example from above, we learn that the data given encodes the following logical structure:

Company: Acme Inc
  Invoice: 982
    Line Item: [sku: MLC, count: 27, price: 90.34]
    Line Item: [sku: BBL, count:  4, price:  9.99]
  Invoice: 989
    Line Item: [sku: QTA, count:  2, price: 10.00]
    Line Item: [sku: EAD, count:  3, price:  4.35]
...

Tweakstreet enables you to form and shape such data structures from raw source records. You would then store them in a manner suitable for your use case, such as a SQL database, JSON, XML, CSV files, Excel files, or online spreadsheets.
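
To make the restructuring step concrete, here is a minimal sketch in plain Python. It is not Tweakstreet itself and not how Tweakstreet implements the step. It assumes the record types are C for company, N for invoice number, and I for line item, with item fields in the order sku, count, price, as the logical structure above suggests. The function name parse_extract and the dictionary keys are hypothetical.

```python
from decimal import Decimal

# Sample extract copied from the example above.
RAW = """\
C|Acme Inc
N|982
I|MLC|27|90.34
I|BBL|4|9.99
N|989
I|QTA|2|10.00
I|EAD|3|4.35
"""


def parse_extract(text):
    """Group flat C/N/I records into company -> invoices -> line items."""
    companies = []
    company = None   # company the current records belong to
    invoice = None   # invoice the current line items belong to
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        kind, *fields = line.split("|")
        if kind == "C":
            company = {"company": fields[0], "invoices": []}
            companies.append(company)
            invoice = None
        elif kind == "N":
            if company is None:
                raise ValueError(f"line {line_no}: invoice before any company")
            invoice = {"invoice": int(fields[0]), "items": []}
            company["invoices"].append(invoice)
        elif kind == "I":
            if invoice is None:
                raise ValueError(f"line {line_no}: line item before any invoice")
            sku, count, price = fields  # assumed field order
            invoice["items"].append(
                {"sku": sku, "count": int(count), "price": Decimal(price)}
            )
        else:
            raise ValueError(f"line {line_no}: unknown record type {kind!r}")
    return companies


if __name__ == "__main__":
    for c in parse_extract(RAW):
        print(c)
```

Prices are parsed as Decimal rather than float so that monetary values such as 9.99 keep their exact decimal representation.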

Enrichment

Data sets often contain internal mnemonics or IDs that need to be resolved or looked up in a reference system. Who, after all, knows that SKU BBL refers to a bucket of blue paint from the Paints product category? That information has to go into the data set for it to be useful.

Company: Acme Inc
  Invoice: 982
    Line Item: [item: Men's Leather Shoes,  category: Shoes, count: 27, price: 90.34]
    Line Item: [item: Bucket of Blue Paint, category: Paints, count:  4, price:  9.99]
  Invoice: 989
    Line Item: [item: Terracotta Vase, category: Earthenware, count:  2, price: 10.00]
    Line Item: [item: Chocolate Drink, category: Food, count:  3, price:  4.35]
...

Tweakstreet allows you to look up reference data from any data source such as databases, reference files or online APIs.
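
As a rough illustration of such a lookup, the following plain-Python sketch resolves each sku against a small in-memory table. The table stands in for whatever reference system holds the product data, which is an assumption; its contents are taken from the enriched example above. The names PRODUCTS, enrich_item, and unresolved_sku are hypothetical and not part of Tweakstreet.

```python
# Stand-in for a reference system (database, file, or API).
PRODUCTS = {
    "MLC": {"item": "Men's Leather Shoes", "category": "Shoes"},
    "BBL": {"item": "Bucket of Blue Paint", "category": "Paints"},
    "QTA": {"item": "Terracotta Vase", "category": "Earthenware"},
    "EAD": {"item": "Chocolate Drink", "category": "Food"},
}


def enrich_item(line_item, products=PRODUCTS):
    """Replace the sku of a line item with its item name and category."""
    # Everything except the sku is carried over unchanged.
    rest = {k: v for k, v in line_item.items() if k != "sku"}
    ref = products.get(line_item["sku"])
    if ref is None:
        # Unknown sku: keep the record, but flag it instead of guessing.
        return {
            "item": None,
            "category": None,
            "unresolved_sku": line_item["sku"],
            **rest,
        }
    return {**ref, **rest}


if __name__ == "__main__":
    print(enrich_item({"sku": "BBL", "count": 4, "price": "9.99"}))
```

Flagging unknown skus rather than dropping the record is one possible design choice; it keeps the problem visible for the cleanup phase.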

Cleanup

Most data sets need cleanup before being processed further. Invalid or incomplete records need to be identified, and then corrected or filtered out.

Tweakstreet makes it easy to:

  • Identify data exchange format problems
  • Guard against unexpected format changes
  • Validate data against plausibility rules
  • Fix or redirect problematic records
  • Collect bad records for further inspection and discussion with data suppliers

Figure: A data flow redirecting invalid records and guarding against systematic errors
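
The idea of validating records and redirecting the bad ones can be sketched in plain Python as follows. This is an illustration only, not Tweakstreet's mechanism. The plausibility rules (non-empty sku, positive integer count, finite non-negative price) are assumptions chosen for the line item format used earlier, and the names check_line_item and split_records are hypothetical.

```python
from decimal import Decimal, InvalidOperation


def check_line_item(line):
    """Return a list of problems found in one 'I|sku|count|price' line."""
    parts = line.split("|")
    if len(parts) != 4 or parts[0] != "I":
        # Format problem: wrong record type or wrong number of fields.
        return ["unexpected format"]
    _, sku, count, price = parts
    problems = []
    if not sku:
        problems.append("missing sku")
    if not (count.isascii() and count.isdigit()) or int(count) == 0:
        problems.append("count must be a positive integer")
    try:
        value = Decimal(price)
        if not value.is_finite() or value < 0:
            problems.append("price must be a non-negative number")
    except InvalidOperation:
        problems.append("price is not a number")
    return problems


def split_records(lines):
    """Separate valid lines from invalid ones, keeping the reasons."""
    good, bad = [], []
    for line in lines:
        problems = check_line_item(line)
        if problems:
            bad.append((line, problems))  # collected for later inspection
        else:
            good.append(line)
    return good, bad


if __name__ == "__main__":
    sample = ["I|MLC|27|90.34", "I|BBL|four|9.99", "I|QTA|2", "I|EAD|3|-4.35"]
    good, bad = split_records(sample)
    print("good:", good)
    for line, problems in bad:
        print("bad:", line, problems)
```

Keeping the rejected lines together with their reasons makes it easier to discuss them with the data supplier.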

Data is only useful when prepared

Whether you're training an ML model, preparing a custom report, or loading data warehouse tables, you'll always need to take raw data as you find it and make it usable.

With Tweakstreet you can interactively design and automate that process in a visual way, turning cryptic data sources into queryable information and therefore into insights.

Watch it in action!

Share your pain points, ask questions, and challenge us with data problems - we'll address them in a demo tailored for you.

    © 2020 Twineworks GmbH. All rights reserved.