We’ve seen a lot of great blog posts recently about the “modern data stack”, each bringing an interesting new viewpoint on the tools of choice for growing data teams. We wanted to take our own stab at this, sharing our perspective on the latest and greatest systems being used to run a modern data org.
What makes this difficult? The market is going through rapid transformation. There are more tools than ever for building data pipelines and companies are hiring in every data-related role. Every team operates differently, with unique priorities that lead to a differing choice of tools and design patterns. With so much variety, it’s a challenge to talk about tools before discussing people and roles.
(According to the 2020 U.S. Emerging Jobs Report, data roles, specifically data scientist and data engineer roles, are growing steadily, at about 35% average annual growth for both roles.)
In this post, we’ll focus on the tools used by teams with more demanding requirements, including larger, faster-moving data. We’ll also focus mostly on the first mile of the stack – the pipelines that data engineers own, from data ingestion to the warehouse (or lakehouse), as opposed to the downstream analytics or data science tools, which are worthy of their own exploration.
To walk through the tools, we’ll start with an example use case. Say we’re an investment firm that uses market data to predict stock prices. Our team works with data feeds from various exchanges, collecting a constant flow of trade information, and transforms raw trade data into price predictions. What are we using to get there?
Getting the data in
First, data is streamed in or batched from sources. In our example, those sources are stock exchanges like NASDAQ and NYSE and brokerages like Robinhood. Tools like Kafka and Kinesis run the streaming pipes.
With real-time feeds and batch pulls from the sources, the data lands in a raw storage layer and data lake – S3, GCS, Blob Storage, or Delta Lake. Since the data from source providers comes in many different formats, we optimize here for storage space and flexibility of storage structure (JSON, CSV, Parquet). For machine learning teams in particular, flexibility and space are priorities because there will be large amounts of data and varying file types (like images).
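One way to keep a raw layer flexible is to partition object keys by source, format, and arrival date, so that files of any type can land without agreeing on a schema first. The sketch below is only illustrative: the `raw/` prefix, the key layout, and the `raw_object_key` helper are our own assumptions, not a layout that any particular storage service requires.

```python
from datetime import datetime, timezone


def raw_object_key(source: str, file_name: str, arrived_at: datetime) -> str:
    """Build a partitioned object key for a file landing in the raw layer."""
    # Use the file extension as the format partition (json, csv, parquet, ...).
    fmt = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else "unknown"
    # Partition by UTC arrival date so daily batch jobs can pick up one folder.
    day = arrived_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"raw/source={source}/format={fmt}/date={day}/{file_name}"


# Hypothetical file from a hypothetical "nyse" feed.
key = raw_object_key(
    "nyse",
    "trades_0001.JSON",
    datetime(2021, 1, 28, 14, 30, tzinfo=timezone.utc),
)
print(key)
```

Keeping the original file name at the end of the key makes it easy to trace a landed object back to what the provider sent.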
Apache Airflow is used to orchestrate batch processes from here: first to pull data in batches from APIs and, once the data has landed, to begin moving and transforming it. Just as S3 provides the flexibility to store any file format, Airflow makes it easy to run any data task, whether distributed Spark jobs, Python scripts, SQL queries, or even commands that trigger processes on other orchestrators or ETL tools.
Airflow comes up as the tool of choice because of its interoperability, product maturity, and the open-source community behind it – though there are a number of exciting new orchestrators in the market picking up steam.
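To make the orchestration idea concrete, here is a minimal sketch of dependency-ordered task execution, which is the core of what an orchestrator does. This is not Airflow's API: it uses Python's standard-library `graphlib` (Python 3.9+) as a stand-in, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each maps to the set of tasks it depends on.
dependencies = {
    "pull_exchange_api": set(),
    "land_raw_files": {"pull_exchange_api"},
    "unify_trades": {"land_raw_files"},
    "load_warehouse": {"unify_trades"},
    "train_model": {"unify_trades"},
}


def run_pipeline(deps, tasks):
    """Run every task only after all of its upstream tasks have finished."""
    executed = []
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()
        executed.append(name)
    return executed


# Each task is a plain callable here; in a real orchestrator it could be a
# Spark job, a Python script, or a SQL query.
tasks = {name: (lambda n=name: print(f"running {n}")) for name in dependencies}
order = run_pipeline(dependencies, tasks)
```

A real orchestrator adds scheduling, retries, and logging on top of this ordering, which is much of why teams reach for a mature tool rather than rolling their own.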
Unify and structure
Next, the pipeline unifies and adds structure to the various data sources. In our example, that means bringing data from various exchanges into a single table or consistent file format so that all GameStop trades and related data are readable together.
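As a rough sketch of what unifying can look like, the snippet below maps two differently shaped trade records onto one schema. The two feeds, their field names, and the prices are invented for illustration; real exchange feeds will differ.

```python
from datetime import datetime, timezone

# Hypothetical raw records: each feed uses its own field names and units.
feed_a = {"sym": "GME", "px": "193.60", "qty": 100, "ts": "2021-01-28T14:30:00+00:00"}
feed_b = {"ticker": "GME", "price_cents": 19375, "size": 50, "epoch_ms": 1611844260000}


def from_feed_a(rec: dict) -> dict:
    """Feed A sends prices as strings and timestamps as ISO 8601 text."""
    return {
        "symbol": rec["sym"],
        "price": float(rec["px"]),
        "quantity": int(rec["qty"]),
        "traded_at": datetime.fromisoformat(rec["ts"]),
        "source": "feed_a",
    }


def from_feed_b(rec: dict) -> dict:
    """Feed B sends prices in cents and timestamps as epoch milliseconds."""
    return {
        "symbol": rec["ticker"],
        "price": rec["price_cents"] / 100,
        "quantity": int(rec["size"]),
        "traded_at": datetime.fromtimestamp(rec["epoch_ms"] / 1000, tz=timezone.utc),
        "source": "feed_b",
    }


# One consistent shape, ordered by trade time, ready to write as a single table.
unified = sorted([from_feed_a(feed_a), from_feed_b(feed_b)], key=lambda t: t["traded_at"])
```

Keeping a `source` column on every unified row is a small design choice that pays off later, when someone needs to trace a suspicious trade back to its feed.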
Airflow will kick off Python and Spark tasks for this processing. Tasks will be containerized and run on cloud tools like EMR, Dataproc, and Databricks, so that changes within each task are isolated and easy to make, and new libraries can be added quickly.
After structuring, the data will land in our unified storage layer, where it is actionable by more data consumers. This second storage layer might remain in S3 or be moved to a warehouse like Snowflake. This is also a common data access point for an organization building a data mesh, where the data platform team will deliver data to a shared location for access by multiple data teams across the business.
Next, we have a functional layer for aggregation and analysis. This is where different teams can create steps to calculate KPIs or train models on data specific to their use case.
In our example case, if one of those teams is predicting the movement of GameStop, they might analyze the correlation between stock price increases and Elon Musk tweets (does every related positive tweet yield a 20% lift within 12 hours?), or define KPIs related to price movements, like a moving price average and hourly trading volumes.
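The two KPIs mentioned here, a moving price average and hourly trading volume, can be sketched in a few lines of plain Python. The trades below are made-up numbers, and the function names are ours; at scale, a team would more likely express the same logic in Spark or SQL.

```python
from collections import defaultdict, deque
from datetime import datetime

# Hypothetical trades as (timestamp, price, quantity).
trades = [
    (datetime(2021, 1, 28, 14, 5), 10.0, 100),
    (datetime(2021, 1, 28, 14, 40), 12.0, 50),
    (datetime(2021, 1, 28, 15, 10), 14.0, 200),
    (datetime(2021, 1, 28, 15, 55), 16.0, 25),
]


def moving_average(prices, window):
    """Simple moving average over the last `window` prices.

    Early values average over fewer points until the window fills up.
    """
    recent = deque(maxlen=window)
    averages = []
    for price in prices:
        recent.append(price)
        averages.append(sum(recent) / len(recent))
    return averages


def hourly_volume(trades):
    """Total traded quantity per clock hour."""
    totals = defaultdict(int)
    for traded_at, _price, quantity in trades:
        hour = traded_at.replace(minute=0, second=0, microsecond=0)
        totals[hour] += quantity
    return dict(totals)


sma = moving_average([price for _, price, _ in trades], window=2)
volumes = hourly_volume(trades)
```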
The tools used at this layer depend first on whether the bulk of our data product is in machine learning or analytics. If it is ML-intensive, we may be using the same tools as in our first functional layer, with Python and Spark running most of the logic, Dockerized or in Jupyter notebooks (shoutout to our favorite new data science notebook here, Deepnote). If the team is analytics-oriented, we’ll move from the lake into the data warehouse and from Python towards SQL, using platforms like Snowflake, Redshift, and BigQuery, with dbt models running queries.
Side note: we’re closely watching the new capabilities of Databricks’ Delta Lake and Snowflake’s Snowpark, which are blurring the lines between the lake and the warehouse.
Once it’s in the warehouse, the data is ready for querying. The heavy aggregation is behind us, and analysts and scientists can put the data to use. While Looker and Tableau still dominate the modern analytics market, newer open source tools like Preset and Metabase are gaining ground.
Choosing ELT vs ETL and ensuring nothing breaks along the way
In our example, data is being moved, transformed, and aggregated at different stages. Throughout the stages, a data team needs to decide the optimal sequence of data movement and processing. Here are some factors to consider when deciding how to split the load:
- Level of interoperability – how easy is it to run processes across multiple locations? If you are mostly invested in dbt, which focuses aggregations in the data warehouse, your team will move more data there before processing. If you use Airflow, which can run nearly any kind of job on nearly any kind of system, you can spread the transformations and logic across more platforms, using more specialized tools for each step.
- Team composition – if your team is stronger in analytics engineering, it makes sense to focus on the warehouse (ELT). If your team comes from a big data mindset, organizing around Spark, Databricks, and S3 might make more sense.
- Team priorities – if you are optimizing for performance, it could work better to run transformations in Spark before the data gets to the warehouse. If you are optimizing for ease of use, loading all data into a warehouse like Snowflake will make more sense.
All these moving pieces create plenty of blind spots in your architecture.
This makes building observability throughout the pipeline imperative. Data engineers need checkpoints across all phases of the pipeline to capture the state of their data as it changes over time. With those checkpoints, a data engineer can pinpoint the root cause of an issue by tracing the lineage of the data back to a particular stage of the pipeline at a particular point in time.
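To illustrate the checkpoint idea, here is a minimal sketch that records a small snapshot of the data at each stage and then walks the snapshots to find where rows went missing. The stage names, the `snapshot` and `first_row_drop` helpers, and the 10% threshold are all assumptions made for this example, not how any particular observability product works.

```python
def snapshot(stage: str, rows: list) -> dict:
    """Capture a lightweight checkpoint of the data at one pipeline stage."""
    columns = sorted({key for row in rows for key in row})
    null_counts = {
        col: sum(1 for row in rows if row.get(col) is None) for col in columns
    }
    return {
        "stage": stage,
        "row_count": len(rows),
        "columns": columns,
        "null_counts": null_counts,
    }


def first_row_drop(checkpoints: list, max_drop_ratio: float = 0.1):
    """Return the first stage whose row count fell by more than max_drop_ratio."""
    for prev, curr in zip(checkpoints, checkpoints[1:]):
        before, after = prev["row_count"], curr["row_count"]
        if before > 0 and (before - after) / before > max_drop_ratio:
            return curr["stage"]
    return None


# Hypothetical data at three stages; the last step silently loses half the rows.
raw = [
    {"sym": "GME", "px": 10.0},
    {"sym": "GME", "px": None},
    {"sym": "AMC", "px": 8.0},
    {"sym": "AMC", "px": 9.0},
]
unified = list(raw)
warehouse = unified[:2]

checkpoints = [
    snapshot("raw", raw),
    snapshot("unified", unified),
    snapshot("warehouse", warehouse),
]
culprit = first_row_drop(checkpoints)
print(culprit)
```

Storing snapshots like these for every run is what lets you compare today against yesterday, not just one stage against the next.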
Let’s use a typical scenario: a failed delivery. Our data consumers, the scientists, never got GameStop trading info from the prior day and are unable to refresh their models. A failure like this can lead to meaningful losses.
First, we need a solution that alerts us to the problem, such as a data delay, missing data, or data quality drift. Beyond that, we’ll need to identify the cause: is the pipeline still running? What’s different about today? Is the environment different? Is the code different? Is a new join slowing things down? Is it the original input data?
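The first step, detecting that something is wrong, can be sketched with two simple checks: one for delayed data and one for an unusual row count. The thresholds, the sample history, and the function names below are assumptions for illustration; a production system would tune these and cover many more signals.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def is_delayed(last_arrival: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the newest data is older than the allowed age."""
    return now - last_arrival > max_age


def is_volume_anomaly(history: list, today: int, z_threshold: float = 3.0) -> bool:
    """True when today's row count sits far from the historical mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # No historical variation: any change at all is worth flagging.
        return today != mu
    return abs(today - mu) / sigma > z_threshold


# Hypothetical run: the last delivery arrived 36 hours ago and was tiny.
now = datetime(2021, 1, 29, 9, 0)
row_history = [1000, 1020, 980, 1010, 990]

alerts = []
if is_delayed(datetime(2021, 1, 27, 21, 0), now, max_age=timedelta(hours=24)):
    alerts.append("delay")
if is_volume_anomaly(row_history, today=120):
    alerts.append("volume")
print(alerts)
```

Checks like these only say that something is off; answering the follow-up questions about code, environment, and input data still requires the run-level metadata captured along the pipeline.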
Your observability system should enable you to move through these questions quickly. Ideally, it also provides leading indicators of when problems will arise, so you can avoid the consequences of pipeline issues altogether rather than only getting troubleshooting help once they occur.
In our next post, we’ll discuss the remaining blind spots for the modern data stack, and how Databand.ai is helping organizations bring visibility. Stay tuned!
Databand.ai is a unified data observability platform built for data engineers. Databand.ai centralizes your pipeline metadata so you can get end-to-end observability into your data pipelines, identify the root cause of health issues quickly, and fix the problem fast. To learn more about Databand and how our platform helps data engineers with their data pipelines, request a demo!