Perhaps as much as 25% of data teams’ time is now spent on data ingestion: gathering data from thousands of sources for transformation and transfer. That’s a big chunk of the week, and if we want to win back some of that time, there’s one big, obvious issue we can address: incoming data quality.
If you can tackle data quality from the moment data enters your system, at ingestion, you avoid the cascade of errors that follows from ingesting data with wrong schemas or unexpected null counts. Yet the big limitation we face is technological. Pipeline tools like Airflow and Spark don’t offer a built-in way to check data quality as it enters your system. So what are you supposed to do?
In this guide, we share a data ingestion strategy and framework designed to help you wrestle more of your time back, and keep out bad data for good.
What is data ingestion?
For data engineers, data ingestion is both the act and process of importing data from a source (vendor, product, warehouse, file, etc.) into a staging environment. From there, the data is either transformed or transferred to its destination. If you’re thinking in terms of a data ingestion pipeline, ingestion is the first stage. Or perhaps “stage zero,” because it’s often overlooked and rarely measured with the same rigor as other stages—though it should be.
The more poor-quality data you allow to seep into your warehouse or lake, the faster it pollutes everything else. It’s a case of, “garbage in, garbage suddenly everywhere,” as errors replicate across other systems.
Thus, you can either front-load your data quality work or back-load the data-cleansing misery.
Generally, there are three modes of data ingestion:
- Batch ingestion: you gather data in a staging layer and then transfer it to the destination in batches on a schedule (daily, weekly, monthly, and so on).
- Streaming ingestion: you pass data along to its destination as it arrives in your system. (Or that’s the theory, at least. With data streaming, “real-time” is relative because a processing engine like Spark is often simply micro-batching the data: preparing and sending it in smaller, more frequent, discretized groups. But like a movie projector that displays 24 frames per second, it appears real-time to humans.)
- Hybrid ingestion: you use a mixture of batch and streaming, where you designate certain data types for streaming and others for batch.
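To make the micro-batching idea concrete, here is a minimal sketch in plain Python. It is not how Spark implements it: the function name `micro_batches` and the count-based grouping are assumptions for illustration, and real engines usually cut batches on a time interval rather than a fixed record count.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield small, successive batches from a (possibly unbounded) stream of records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # Pull at most batch_size records without consuming the rest of the stream.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


# Hypothetical events standing in for records arriving from a source.
events = [{"id": i} for i in range(7)]
batches = list(micro_batches(events, batch_size=3))
batch_sizes = [len(batch) for batch in batches]
```

With a small enough batch size (or interval), downstream consumers see data that looks continuous even though it still arrives in discrete groups.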
What type of data ingestion should you use? By default, start with batch transfers. They are typically less complex, take less setup planning, and have fewer components to manage.
Sometimes, your SLA or business use case will call for a streaming or hybrid approach. Use a streaming approach when you’re dealing with high-velocity data that has to stay very up to date. In most cases though, you’ll never be able to do away with batch ingestion entirely, so a hybrid approach will make the most sense. Your time-sensitive data gets streamed and everything else gets batched, which relieves pressure on both architectures.
What is data ingestion vs ETL?
This is a common question with a simple answer: Data ingestion and ETL are different parts of the same workflow. You first ingest data from, say, a bunch of data vendors and files, and then when it’s ready, you extract it, transform it, and load it (ETL) using a data pipeline that moves it to another destination.
Data ingestion is a much broader term than ETL. Ingestion refers to the general process of ingesting data from hundreds or thousands of sources and preparing it for transfer. ETL is a very specific action, or job, that you can run.
Though, if you want to split hairs, ingestion today involves a fair amount of extracting, transforming, and loading. It’s very rare to ingest and transfer data without some transformation, unless you’re just replicating a database, saving raw system logs, or are, for some reason, indifferent to quality.
Of course, as data infrastructure costs have fallen, ELT (loading data into your warehouse or lake before transforming it) has become more popular than ETL. Data teams have to worry less about blowing out their analytics tool budget, so they can now afford to load everything and sort it out later. But again, even when you’re just moving the data, you should always, always be checking its quality and cleaning it.
Which brings us to our data ingestion framework.
A data ingestion framework
How you ingest your data depends on the type of data and its purpose. For example, is it high complexity, high velocity, or low complexity, low velocity? Are you depositing it into a staging layer to transform it before moving it, or are you ingesting it from a bunch of data vendors, trusting that it’s correct, and streaming it directly into your systems? (In which case, I admire your devil-may-care attitude.)
Here are questions you can ask to inform your data ingestion strategy:
- What quality parameters must we meet?
- What’s the complexity?
- What’s the velocity?
- What are the consumer needs?
- Are there standards regimes to follow? (e.g. SOC2, PCI)
- Are there regulatory or compliance needs? (e.g. HIPAA, GDPR, CCPA)
- Can this be automated?
The higher your need for data quality—here, or at any layer or location through which the data will pass—the greater your need for data ingestion observability. That is, the greater visibility you need into the quality of the data being ingested.
As we covered in the introduction, errors tend to cascade, and “garbage in” can quickly become “garbage everywhere.” Small efforts to clean up quality here will have a cumulative effect and save entire days or weeks of work.
As Fithrah Fauzan, Data Engineering Team Lead at Shipper, puts it, “It only takes one typo to change the schema and break the file. This used to happen for us on a weekly basis. Without Databand [data ingestion observability], we often didn’t know that we had a problem until two or three days later—where after we’d have to backfill the data.”
What’s more, when you can observe your ingested data, you can set rules to automatically clean it, and ensure it stays clean. So, if a data vendor changes their API, you catch the schema change. And if there are human errors in your database that have caused some exceedingly wonky values, you notice—and pause—and address them.
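As a rough illustration of such rules, the sketch below checks a batch of records for schema drift, excessive nulls, and wrong value types before it moves on. Everything named here is an assumption for the example: the `orders` columns, the 10% null threshold, and the `check_batch` helper are hypothetical and not part of any particular observability product.

```python
from typing import Any, Dict, List

# Hypothetical expected schema for an "orders" feed: column name -> Python type.
EXPECTED_SCHEMA: Dict[str, type] = {"order_id": int, "amount": float, "country": str}
# Hypothetical tolerance: flag a column if more than 10% of its values are null.
MAX_NULL_RATIO = 0.10


def check_batch(rows: List[Dict[str, Any]],
                expected_schema: Dict[str, type] = EXPECTED_SCHEMA,
                max_null_ratio: float = MAX_NULL_RATIO) -> List[str]:
    """Return human-readable issues; an empty list means the batch passed."""
    if not rows:
        return ["batch is empty"]

    issues: List[str] = []
    expected_columns = set(expected_schema)

    # Schema drift: columns that disappeared or appeared unexpectedly.
    for index, row in enumerate(rows):
        missing = sorted(expected_columns - set(row))
        unexpected = sorted(set(row) - expected_columns)
        if missing or unexpected:
            issues.append(
                f"row {index}: schema drift, missing={missing}, unexpected={unexpected}"
            )

    # Per-column null ratio and type checks.
    for column, expected_type in expected_schema.items():
        values = [row.get(column) for row in rows]
        null_ratio = sum(value is None for value in values) / len(rows)
        if null_ratio > max_null_ratio:
            issues.append(
                f"column {column!r}: null ratio {null_ratio:.0%} exceeds {max_null_ratio:.0%}"
            )
        wrong_type = sum(
            value is not None and not isinstance(value, expected_type) for value in values
        )
        if wrong_type:
            issues.append(
                f"column {column!r}: {wrong_type} value(s) are not {expected_type.__name__}"
            )

    return issues


good_batch = [
    {"order_id": 1, "amount": 9.5, "country": "ID"},
    {"order_id": 2, "amount": 12.0, "country": "SG"},
]
# A typo in a column name ("contry") and a null amount, as a vendor change might cause.
drifted_batch = [
    {"order_id": 1, "amount": 9.5, "contry": "ID"},
    {"order_id": 2, "amount": None, "country": "SG"},
]

good_issues = check_batch(good_batch)
drifted_issues = check_batch(drifted_batch)
should_pause = bool(drifted_issues)  # pause the pipeline instead of loading bad data
```

Returning a list of issues rather than raising on the first problem is a deliberate choice here: it lets you see everything wrong with a batch at once and decide whether to pause, quarantine, or alert.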
When you can observe your data ingestion process, you can more reliably:
- Aggregate the data—gather it all in one place
- Merge—combine like datasets
- Divide—divide unlike datasets
- Summarize—produce metadata to describe the dataset
- Validate the data—verify that the data is high quality (is as expected)
- (Maybe) Standardize—align schemas
- Cleanse—remove incorrect data
- Deduplicate—eliminate unintentional duplicates
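A few of the steps above (summarize, cleanse, deduplicate) can be sketched in plain Python as follows. The helper names, the `order_id` key, and the "amounts must not be negative" rule are hypothetical choices for illustration; in practice these steps usually run inside your pipeline or warehouse tooling.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def summarize(rows: List[Row]) -> Dict[str, Any]:
    """Produce simple metadata: the row count and the null count per column."""
    columns = sorted({column for row in rows for column in row})
    return {
        "row_count": len(rows),
        "null_counts": {
            column: sum(row.get(column) is None for row in rows) for column in columns
        },
    }


def cleanse(rows: List[Row], is_valid: Callable[[Row], bool]) -> List[Row]:
    """Drop rows that fail a validity rule."""
    return [row for row in rows if is_valid(row)]


def deduplicate(rows: List[Row], key: str) -> List[Row]:
    """Keep the first row seen for each value of `key`."""
    seen = set()
    unique: List[Row] = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


raw = [
    {"order_id": 1, "amount": 9.5},
    {"order_id": 1, "amount": 9.5},   # unintentional duplicate
    {"order_id": 2, "amount": -4.0},  # incorrect value under the hypothetical rule
    {"order_id": 3, "amount": None},
]

summary = summarize(raw)
# Hypothetical rule: a missing amount is tolerated, a negative amount is not.
cleaned = cleanse(raw, lambda row: row["amount"] is None or row["amount"] >= 0)
unique = deduplicate(cleaned, key="order_id")
```

Summarizing before cleansing keeps a record of what actually arrived, which is the number you want to monitor over time.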
From there, you can start to build the framework that works for you. Highly complex data demands highly complex observability—if your monitoring tool isn’t highly customizable, and can’t automatically detect anomalous values, automation doesn’t do you much good. You’ll always be fixing the system, tweaking rules, and refining things by hand.
Highly complex data also makes you a good candidate for a hybrid ingestion approach, where you can designate ingested data into either a batch or streamed workflow. If you have any amount of high-velocity data—that’s supposed to be very up-to-date—you’re a candidate for a streaming ingestion architecture, or of course, hybrid.
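The rules of thumb in this guide (start with batch, stream high-velocity data, go hybrid for highly complex data) can be written down as a small decision helper. This is only a sketch of the heuristic as described here: `SourceProfile`, `suggest_mode`, and the two boolean flags are hypothetical, and a real decision would also weigh SLAs, cost, and consumer needs.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    BATCH = "batch"
    STREAMING = "streaming"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class SourceProfile:
    """Hypothetical description of one data source."""
    name: str
    high_velocity: bool    # data is supposed to be very up to date
    high_complexity: bool  # many data types, formats, or quality rules


def suggest_mode(profile: SourceProfile) -> Mode:
    """Suggest an ingestion mode using the heuristics from this guide."""
    if profile.high_complexity:
        return Mode.HYBRID     # split the data between batch and streaming workflows
    if profile.high_velocity:
        return Mode.STREAMING  # freshness matters more than simplicity
    return Mode.BATCH          # the default: fewer components to manage


# Hypothetical sources for illustration.
clickstream = SourceProfile("clickstream", high_velocity=True, high_complexity=False)
vendor_files = SourceProfile("vendor_files", high_velocity=False, high_complexity=False)
everything = SourceProfile("all_sources", high_velocity=True, high_complexity=True)

suggestions = {
    profile.name: suggest_mode(profile)
    for profile in (clickstream, vendor_files, everything)
}
```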
And from there, you’ll want to understand what your specific user needs are.
Here are a few data ingestion tools you may find useful for this:
- Ingestion tool: Apache Storm, Apache Flume, Apache Gobblin, Apache NiFi, Logstash
- Storage: Snowflake, Azure, AWS, Google Cloud Platform
- Streaming platform: Apache Kafka
- Observability platform: Databand
Your data ingestion strategy deserves more attention
Data ingestion now occupies roughly one-quarter of our time, and it gets a lot less “visibility” than we believe it deserves. Data issues compound. The more effort you invest into cleaning up your data as it enters your data ingestion pipelines, the more you save yourself the multiplying cleanup job later.
And the best way to begin? Observability. You can’t manage what you can’t measure, and a good observability platform closes the gap for data ingestion pipeline tools like Spark and Airflow, which don’t assess data quality upon ingestion out of the box.
If you observe, you can act. And if you can act, you can keep the bad data out for good.