What’s the Difference? Data Engineer vs Data Scientist vs Analytics Engineer?

Databand
2022-05-26 13:25:17

The modern data team is, well, complicated.

Even if you're on the data team, keeping track of all the different roles and their nuances gets confusing – let alone if you're a non-technical executive who's supporting or working with the team.

One of the biggest areas of confusion? Understanding the differences between the data engineer vs data scientist vs analytics engineer roles.

The three are closely intertwined. And as Josh Laurito, Director of Data at Squarespace and editor of NYC Data, tells us, there really is no single definition of each of these roles or of the lines between them. You can listen to our full discussion with Josh Laurito below.

But still, there are some standard differences everywhere you go. And that’s exactly what we’ll look at today.

What is a data engineer?

A data engineer develops and maintains data architecture and pipelines. Essentially, they build the programs that generate data and aim to do so in a way that ensures the output is meaningful for operations and analysis.

Some of their key responsibilities include:

  • Managing pipeline orchestration
  • Building and maintaining a data platform
  • Leading any custom data integration efforts
  • Optimizing data warehouse performance
  • Developing processes for data modeling and data generation
  • Standardizing data management practices

Important skills for data engineers include:

  • Expertise in SQL
  • Ability to work with structured and unstructured data
  • Deep knowledge in programming and algorithms
  • Experience with engineering and testing tools
  • Strong creative thinking and problem-solving abilities

What about an analytics engineer?

An analytics engineer brings together data sources in a way that makes it possible to drive consolidated insights. Importantly, they build systems that model data in a clean, clear, repeatable way, so that everyone can use those systems to answer questions on an ongoing basis. As one analytics engineer at dbt Labs puts it, a key part of analytics engineering is that "it allows you to solve hard problems once, then gain benefits from that solution infinitely."

Some of their key responsibilities include:

  • Understanding business requirements and defining successful analytics outcomes
  • Cleaning, transforming, testing, and deploying data to be ready for analysis
  • Introducing definitions and documentation for key data and data processes
  • Bringing software engineering techniques like continuous integration to analytics code
  • Training others to use the end data for analysis
  • Consulting with data scientists and analysts on areas to improve scripts and queries

Important skills for analytics engineers include:

  • Expertise in SQL
  • Deep understanding of software engineering best practices
  • Experience with data warehouse and data visualization tools
  • Strong capabilities around maintaining multi-functional relationships
  • Background in data analysis or data engineering

So then what’s a data scientist?

A data scientist studies large data sets using advanced statistical analysis and machine learning algorithms. In doing so, they identify patterns in data to drive critical business insights, and then typically use those patterns to develop machine learning solutions for more efficient and accurate insights at scale. Critically, they combine this statistics experience with software engineering experience.

Some of their key responsibilities include:

  • Transforming and cleaning large data sets into a usable format
  • Applying techniques like clustering, neural networks, and decision trees to gain insights from data 
  • Analyzing data to identify patterns and spot trends that can impact the business
  • Developing machine learning algorithms to evaluate data
  • Creating data models to forecast outcomes

Important skills for a data scientist include:

  • Expertise in SAS, R, and Python
  • Deep expertise in machine learning, data conditioning, and advanced mathematics
  • Experience using big data tools
  • Understanding of API development and operations
  • Background in data optimization and data mining
  • Strong creative thinking and decision-making abilities

How does it all fit together?

Even seeing the descriptions of data engineer vs data scientist vs analytics engineer side-by-side can cause confusion, as there are certainly overlaps in skills and areas of focus across each of these roles. So how does it all fit together?

A data engineer builds programs that generate data, and while they aim for that data to be meaningful, it will still need to be combined with other sources. An analytics engineer brings together those data sources to build systems that allow users to access consolidated insights in an easy-to-access, repeatable way. Finally, a data scientist develops tools to analyze all of that data at scale and identify patterns and trends faster and better than any human could.

Critically, there needs to be a strong relationship between these roles. But too often, it ends up as dysfunctional. Jeff Magnuson, Vice President, Data Platform at Stitch Fix, wrote about this topic several years ago in an article titled Engineers Shouldn’t Write ETL. The crux of his article was that teams shouldn’t have separate “thinkers” and “doers”. Rather, high-functioning data teams need end-to-end ownership of the work they produce, meaning that there shouldn’t be a “throw it over the fence” mentality between these roles. 

The result is a high demand for data scientists who have an engineering background and understand things like how to build repeatable processes and the importance of uptime and SLAs. In turn, this approach has an impact on the role of data engineers, who can then work side-by-side with data scientists in an entirely different way. And of course, that cascades to analytics engineers as well.

Understanding the difference between data engineer vs data scientist vs analytics engineer once and for all – for now

The truth remains that many organizations define each of these roles differently. It’s difficult to draw a firm line between where one ends and where one begins because they all have similar tasks to some extent. As Josh Laurito concludes: “Everyone writes SQL. Everyone cares about the quality. Everyone evaluates different tables and writes data somewhere, and everyone complains about time zones. Everyone does a lot of the same stuff. So really the way we [at Squarespace] divide things is where people are in relation to our primary analytical data stores.”

At Squarespace, this means data engineers are responsible for all the work done to create and maintain those stores; analytics engineers are embedded in the functional teams to support decision making, put together narratives around the data, and use that to drive action and decisions; and data scientists sit in the middle, setting up the incentive structures and metrics to make decisions and guide people.

Of course, it will be slightly different for every organization. And as blurry as the lines are now, each of these roles will only continue to evolve and further shift the dynamics between them. But hopefully, this overview helps answer the question of what's the difference between data engineer vs data scientist vs analytics engineer – for now.

The ideal DataOps org structure

Databand
2021-08-27 15:13:37

The systems an organization designs tend to mirror its internal communication structure. That's what Melvin Conway taught us, and it applies to data engineering. If you don't have a clearly defined data operations or "DataOps" team, your company's data outputs will be just as messy as its inputs.

For this reason, you probably need a data operations team, and you need one organized correctly.

[Image: Conway's law org structure]

So first let’s back up—what is data operations?

Data operations is the process of assembling the infrastructure to generate and process data, as well as maintain it. It’s also the name of the team that does (or should do) this work—data operations, or DataOps. What does DataOps do? Well, if your company maintains data pipelines, launching one team under this moniker to manage those pipelines can bring an element of organization and control that’s otherwise lacking. 

DataOps isn’t just for companies that sell their data, either. Recent history has proven you need a data operations team no matter the provenance or use of that data. Internal customer or external customer, it’s all the same. You need one team to build (or let’s be real, inherit and then rebuild) the pipelines. They should be the same people (or, for many organizations, person) who implement observability and tracking tools and monitor the data quality across its four attributes. 

And of course, the people who built the pipeline should be the same people who get the dreaded PagerDuty alert when a dashboard is down—not because it's punitive, but because it's educational. When they have skin in the game, people build differently. It's a good incentive, and it allows for better problem solving and speedier resolution.

Last but not least, that data operations team needs a mission—one that transcends simply “moving the data” from point A to point B. And that is why the “operations” part of their title is so important.

Data operations vs data management—what’s the difference?

Data operations is building resilient processes to move data for its intended purpose. All data should move for a reason. Often, that reason is revenue. If your data operations team can’t trace a clear line from that end objective, like the sales teams having better forecasts and making more money, to their pipeline management activities, you have a problem. 

Without operations, problems will emerge as you scale:

  • Data duplication
  • Troubled collaboration
  • Waiting for data 
  • Band-aids that will scar
  • Discovery issues
  • Disconnected tools
  • Logging inconsistencies
  • Lack of process
  • Lack of ownership & SLAs

If there's a disconnect, you're simply practicing plain old data management. Data management is the rote maintenance aspect of data operations, which, while crucial, is not strategic. When you're in maintenance mode, you're hunting down the reason for a missing column or pipeline failure and patching it up, but you don't have time to plan and improve.

Your work becomes true “operations” when you transform trouble tickets into repeatable fixes. Like, for example, you find a transformation error coming from a partner, and you email them to get it fixed before it hits your pipeline. Or you implement an “alerts” banner on your executives’ dashboard that tells them when something is wrong so they know to wait for the refresh. Data operations, just like developer operations, aims to put repeatable, testable, explainable, intuitive systems in place that ultimately reduce effort for all.

That’s data operations vs data management. And so the question then becomes, how should that data operations team be structured?

Organizing principles for a high-performing data operations team structure

So let's return to where we began—talking about how your system outputs reflect your organizational structure. If your data operations team is an "operations" team in name only, and mostly just maintains, you'll probably receive a forever-ballooning backlog of requests. You'll rarely have time to come up for air to make long-term maintenance changes, like switching out a system or adjusting a process. You're stuck in Jira or ServiceNow response hell.

If, on the other hand, you’ve founded (or relaunched) your data operations team with strong principles and structure, you produce data that reflects your high-quality internal structure. Good data operations team structures produce good data.

Principle 1: Organize in full-stack functional work groups

Gather a data engineer, a data scientist, and an analyst into a group or “pod” and have them address things together they might have addressed separately. Invariably, these three perspectives lead to better decisions, less fence-tossing, and more foresight. For instance, rather than the data scientist writing a notebook that doesn’t make sense and passing it to the engineer only to create a back-and-forth loop, they and the analyst can talk through what they need and the engineer can explain how it should be done.

Lots of data operations teams already work this way. "Teams should aim to be staffed as 'full-stack,' so the necessary data engineering talent is available to take a long view of the data's whole life cycle," say Krishna Puttaswamy and Suresh Srinivas at Uber. And at the travel site Agoda, the engineering team uses pods for the same reason.

Principle 2: Publish an org chart for your data operations team structure

Do this even if you're just one person. Each role is a "hat" that somebody must wear. To have a high-functioning data operations team, it helps to know which hat is where, and who's the data owner for what. You also need to reduce each individual's span of control to a manageable level. Maybe drawing it out like this helps you make the case for hiring.

What is data operations team management? A layer of coordination on top of your pod structure that plays the role of servant leader. They project manage, coach, and unblock. Ideally, they are the most knowledgeable people on the team.

We’ve come up with our own ideal structure, pictured, though it’s a work in progress. What’s important to note is there’s one single person leading with a vision for the data (the VP). Below them are multiple leaders guiding various data disciplines towards that vision (the Directors), and below them, interdisciplinary teams who ensure data org and data features work together. (Credit to our Data Solution Architect, Michael Harper, for these ideas.)

[Image: data operations org structure chart]

Principle 3: Publish a guiding document with a DataOps North Star metric

Picking a North Star metric helps everyone involved understand what they're supposed to optimize for. Without such an agreement, you get disputes. Maybe your internal data "customers" complain that the data is slow. But the reason it's slow is that you're optimizing for quality first, because you know that's their unstated desire.

Common DataOps North Stars: Data quality, automation (repeatable processes), and process decentralization (aka end-user self-sufficiency).

Once you have a North Star, you can also decide on sub-metrics or sub-principles that point to that North Star, which is almost always a lagging indicator. 

Principle 4: Build in some cross-functional toe-stepping

Organize the team so different groups within it must frequently interact and ask other groups for things. These interactions can prove priceless. “Where the data scientists and engineers learn about how each other work, these teams are moving faster and producing more,” says Amir Arad, Senior Engineering Manager at Agoda. 

Amir says one of the hidden values of a little cross-functional redundancy is that you get people asking questions nobody on the team had thought to ask.

“The engineering knowledge gap is actually kinda cool. It can lead to them asking us to simplify,” says Amir. “They might say, ‘But why can’t we do that?’ And sometimes, we go back and realize we don’t need that code or don’t need that server. Sometimes non-experts bring new things to the table.”

Principle 5: Build for self-service

Just as with DevOps, the best data operations teams are invisible, and constantly working to make themselves redundant. Rather than play the hero who likes to swoop in to save everybody, but ultimately makes the system fragile, play the servant leader. Aim to, as Lao Tzu put it, lead people to the solution in a way that gets them thinking, “We did it ourselves.” 

Treat your data operations team like a product team. Study your customer. Keep a backlog of fixes. Aim to make the tool useful enough that the data is actually used. 

Principle 6: Build in full data observability from day one

There is no such thing as “too early” for data monitoring and observability. The analogy that’s often used to excuse putting off monitoring is, “We’re building the plane while in flight.” Think about that visual. Doesn’t that tell you everything you need to know about your long-term survival? A much better analogy is plain old architecture. The longer you wait to assemble a foundation, the more costly it is to put in, and the more problems the lack of one creates.

Read: Data observability: Everything you need to know

Principle 7: Secure executive buy-in for long-term thinking

The decisions you make now with your data infrastructure will, as General Maximus put it, "echo in eternity." Today's growth hack is tomorrow's gargantuan, chaotic internal data system nightmare. You need to secure executive support to make inconvenient but correct decisions, like telling everyone they need to pause their requests because you need a quarter to fix things.

Principle 8: Use the “CASE” method (with attribution)

CASE stands for “copy and steal everything,” a tongue-in-cheek way of saying, don’t build everything from scratch. There are so many useful microservices and open-source offerings today. Stand on the shoulders of giants and focus on building the 40% of your pipeline that actually needs to be custom, and doing it well.

If you do nothing else today, do this

Go have a look at the tickets in your backlog. How often are you reacting to rather than preempting problems? How many of the problems you’ve addressed had a clearly identifiable root cause? How many were you able to fix permanently? The more you preempt, the more you resemble a true data operations team. And, the more helpful you’ll find a data observability tool. Full visibility can help you make the transition from simply maintaining to actively improving. 

Teams that actively improve their structure actively improve their data. Internal harmony leads to external harmony, in a connection that’d make Melvin Conway proud.

An 11-point checklist for setting and hitting data SLAs (with an SLA template)

Databand
2021-08-11 11:22:51

We’d venture to say that no team is too small to come up with and commit to a data service level agreement, or data SLA. What is a data SLA? It’s a public promise to deliver a quantifiable level of service. Just like your infrastructure as a service (IaaS) providers commit to 99.99% uptime, it’s you committing to provide data of a certain quality, within certain parameters. 

It’s important that the commitment is public. (Within the company, at least.) Publicity creates better accountability, helps you get all teams aligned around what’s most important, and allows you to build a structure that supports the quality. 

In this guide, we explore how to establish your own data SLA.

Data SLAs reduce disagreement and create clarity

Formalized, written data SLAs make your informal commitments concrete and mutually agreeable. Every data relationship involves informal commitments, whether you state them or not, and very often, two parties can agree to something without realizing they’re talking about different things. 

For example, “Within a reasonable time frame” has very different meanings to each department, or even to each individual. For some, it means a week. For others, it’s a quarter. For salespeople, it’s before their next client meeting.

Informal commitments tend to only be as strong as each person’s memory. It’s not uncommon for a data engineering team to informally commit to delivering data within a few weeks, and for the downstream internal “consumers” to simply say, “Thanks.” But then, a week later, those consumers demand to know where the data is, given they’re about to walk into an executive meeting. It’s in those moments you realize they had unvoiced expectations which would have been useful to document.

And if the agreements are merely verbal, they can twist and transform when something goes wrong. If an executive demands something of one of your data consumers, their emergency becomes your emergency. They need it now. Or if a prospect demands to see a sample data set, suddenly salespeople will believe you should be responding to requests same-day.

Formal data SLAs can help with all that. They help you explain to others how you work to achieve your ultimate purpose: data trust. You want everyone in the organization to trust you, and by extension, the data.

You can borrow this data service level agreement template

So what exactly is a data SLA? It's a simple written document, usually 250-500 words, posted in a shared space like a company wiki or Google Doc. It should include six elements:

  • Purpose: Why does this data SLA exist? What issues do you expect it to solve, and how do you hope it is used?
  • Promise: What are you promising to other teams? 
  • Measurement: How will you measure the data SLA, who will measure it, and what’s the SLA time frame?
  • Ramifications: What happens when you miss your data SLA? Who is responsible and what sort of remediations are available, if any?
  • Requirements: What do you expect in return? How are your promises conditional? 
  • Signatures: Who is committing to the data SLA?

When writing your data SLA, convey it in as few words as possible without changing the meaning. This requires lots of editing, but we recommend writing it all in one messy pass and returning to edit later. The reason is, if you stare at the page too long, you may develop what writers call “blank page anxiety” and keep putting it off. Punch out a poor-quality draft now—do not wait.

Here is a data service level agreement example:

Company Data Engineering SLA

The purpose of this document is to establish a public promise from our team to others to maintain high data quality within precise parameters. Our hope is it will create understanding, help us all work together, and keep our teams mutually accountable.

Our promise: We’ll deliver sales data with a data quality score of at least 95% by 5:00 am ET every day so the team can answer questions like “What were sales yesterday?” We’ll acknowledge all requests within one business day and sort them by simple and complex tickets. We’ll resolve simple requests within three business days and complex requests within two weeks.

We’ll measure data quality by comparing data delivery KPIs like Run Start Time and Run Complete Time, Record Count and ratio of Null to Record Count, and distribution and drift scores with the predefined standards for data freshness, data completeness, and data fidelity. 

If we miss a data SLA, our team will post a public apology within three business days, taking responsibility, explaining why it happened, and describing the precise measures we're putting in place to fix it.

In order to fulfill this promise, we need your help. Our team needs timely direction, input, and clear feedback on how the data is being used, as well as at least four weeks’ notice of any complex requested changes.

Please direct all questions, comments, and concerns to [email protected]. (But you can direct all praise and flattery to [email protected] 😉.)

With resolve,

– Your Data Engineering Team

11 strategies for hitting your data SLA

With your SLA in place (or perhaps while you’re editing it), start thinking about all the things you need to put into place before you can hold yourself to it. 

For example:

1. Define what “good data” means

Try to wring as much ambiguity out of this phrase as possible. Define it in concrete and unmistakable terms. As we see it, there are four characteristics you can use to define high-quality data. Once defined, secure other teams’ agreement on that definition. 

Ask yourself:

  • What is the outcome of good data for the business?
  • What unique characteristics define good data? 
  • What characteristics define bad data?

2. Track whether the data is available

For tracking, you’ll need an observability tool so you actually know if parts of your pipeline are down. Without one, it’s pretty tough to measure whether you’re missing an SLA, much less diagnose the root cause. It’ll also help you understand errors so you can fix things far faster. 

You can treat your data SLA like a North Star metric—one focal point to guide everyone. But within it, there’s of course a lot of concealed complexity, and you’ll need to track a basket of KPIs to help you know what’s happening upstream and downstream. 

Here are a few specific recommendations (a minimal code sketch follows the list):

  1. Set automatic tests to monitor data quality on its four dimensions
    • Test data pre-production
    • Test at each stage: completeness, anomalies
  2. Measure how well you discover, respond to, and address issues
    • Time to discovery
    • Time to resolution
    • Incidents per asset
  3. Document the proximate and root causes of every issue
    • Data partner missed a delivery
    • Time out
    • Job stuck in a queue
    • Unexpected transformation
    • Permission issue
    • Runtime error
    • Schedule changes
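
To make that concrete, here is a minimal Python sketch of the kind of automatic completeness and null-ratio checks described above. The column names, thresholds, and record counts are all hypothetical; adapt them to your own SLA definitions and tooling.

```python
from datetime import datetime, timezone

# Hypothetical thresholds -- tune these to match your own SLA definitions.
MAX_NULL_RATIO = 0.02          # no more than 2% nulls per column
MIN_EXPECTED_RECORDS = 10_000  # completeness floor for a daily load


def check_batch(records: list[dict]) -> list[str]:
    """Return human-readable data quality violations for one delivered batch."""
    violations = []

    # Completeness: did roughly the expected volume of data arrive?
    if len(records) < MIN_EXPECTED_RECORDS:
        violations.append(
            f"record count {len(records)} is below the expected minimum {MIN_EXPECTED_RECORDS}"
        )

    # Fidelity: null ratio per column.
    if records:
        for col in records[0].keys():
            nulls = sum(1 for row in records if row.get(col) is None)
            ratio = nulls / len(records)
            if ratio > MAX_NULL_RATIO:
                violations.append(
                    f"column '{col}' null ratio {ratio:.1%} exceeds {MAX_NULL_RATIO:.0%}"
                )

    return violations


if __name__ == "__main__":
    batch = [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": None}]
    issues = check_batch(batch)
    if issues:
        # In practice you would record time-to-discovery here and trigger an alert.
        print(datetime.now(timezone.utc).isoformat(), "- SLA check failed:", issues)
```

A check like this can run as the last task of a pipeline, with any violations logged so they feed your time-to-discovery and incidents-per-asset metrics.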

3. Identify the infrastructure you’ll need to add

Be cautious about what you commit to. You can’t be everywhere and prepare for everything, and an SLA of 99.999% uptime means you can only have five minutes of downtime each year. To deliver on that, you’d probably need more headcount, more visibility, more redundancies, and people working around the clock.
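As a quick sanity check on what those uptime figures imply, here's the arithmetic as a plain Python sketch (assuming a 365-day year):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for uptime in (0.99, 0.999, 0.9999, 0.99999):
    downtime_minutes = MINUTES_PER_YEAR * (1 - uptime)
    print(f"{uptime:.3%} uptime allows about {downtime_minutes:,.1f} minutes of downtime per year")
```

Running it shows that 99.999% leaves roughly 5 minutes per year, while 99.9% already allows nearly 9 hours.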

4. Implement issue tracking and reporting

You’ll probably need a ticketing tool like Jira or ServiceNow. This allows data users to create tickets, your team to track them, and you to understand the nature of those tickets so you can come up with long-term fixes and identify trouble areas. 

5. Define data owners

You may not want to specify it in your public data SLA document, but define data source and pipeline owners. They’re the ones ultimately responsible if something goes wrong. Also specify what happens if they go on vacation or leave the company.

6. Set up alerts

Set up alerts to post in your team messaging app, such as Slack, or an incident management system like PagerDuty. The more incident detail you can pack into that alert, the faster you can diagnose. These alerts will tell you early who else you'll need to bring in, or where to begin your analysis. (Databand can send these alerts and append useful insights and context.)
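
If you're wiring this up yourself rather than relying on a tool, a minimal version can be a script that posts to a Slack incoming webhook. The webhook URL, message fields, and pipeline names below are placeholders, not part of any specific product's API:

```python
import json
import urllib.request

# Placeholder webhook URL -- store the real one outside source control.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"


def send_pipeline_alert(pipeline: str, run_id: str, error: str) -> None:
    """Post a pipeline failure alert to a Slack channel via an incoming webhook."""
    message = {
        "text": (
            f":rotating_light: Pipeline `{pipeline}` failed (run {run_id}).\n"
            f"Error: {error}\n"
            "Owner, please acknowledge in a thread."
        )
    }
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)


# Example: call this from your orchestrator's on-failure callback.
# send_pipeline_alert("daily_sales_load", "2021-08-11T05:00", "Spark executor out of memory")
```

The same function can be called from an orchestrator's failure callback, with the payload extended to include run IDs, error snippets, or links to logs so the alert carries enough context to start diagnosing.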

7. Publish a team incident response plan

Let’s say a data consumer tells you a table is broken on their dashboard. How do you confirm and respond? Write it out so when an incident occurs, you don’t run into the bystander problem, where everyone assumes someone else will handle it, and then nobody acts. 

Depending on the size of your team, and how you’re distributed around the world, you may want to take this very seriously, and appoint what emergency responders call an incident commander. That person becomes the CEO of the incident and directs all others. (This ensures a coordinated response and helps you avoid multiple people tackling the same issue.)

8. Communicate issues with in-app alerts

If you're able, create alert panels on people's dashboards so you can communicate the status of the system. If something goes wrong, you can write, "We're having an outage—here's our estimated time to resolution." This will head off repeated alerts from all your data consumers, and free you to actually respond.

If you can’t create alert panels, at the very least, designate a key person on each team who you can tell, who’ll then tell all the others. 

9. Monitor and update

Monitor how your data consumers are using the data (and whether they’re using the data.) Conduct occasional surveys, formal or informal, to gauge their trust in that data, and invite suggestions. For consumers who are interested, communicate what’s on your roadmap.

10. Conduct periodic maintenance

Set periodic maintenance windows where your team reviews why things broke and brainstorms fixes. Ask why those issues were possible, conduct a no-fault post-mortem, document your findings, assign the fixes, and monitor how they worked.

11. Publish your data SLA

With all that figured out, you’re ready to edit and revise your data SLA. Publish it publicly in your company wiki or somewhere shared, secure everybody’s commitment, and hold yourself to it.

Hitting your data SLAs

Data SLAs help you keep yourself and your team honest. While they’re phrased as a public promise to others, they’re really a bilateral agreement—you agree to provide data within specific parameters, but in return, you need people’s participation and their understanding. 

Lots can go wrong in data engineering and lots of it has to do with miscommunication. Documenting your SLA goes a long way toward clearing it all up, so you can achieve your ultimate goal: instilling greater data trust within your organization.

Avoid data SLA misses with Databand Dashboard

Databand
2021-05-11 14:08:16

Business leaders want higher data quality and on-time data delivery. While your organization might not yet have an explicit data SLA, at some level data engineers will be responsible for making sure good data is delivered on time. At Databand, we want to help data teams meet the data SLAs they set for themselves, and create trust in their data products. We consider four main areas as critical to a data SLA:

  1. Uptime — Is expected data being delivered on time?
  2. Completeness — Did all the expected data arrive, and in the right form?
  3. Fidelity — Is accurate, “true” data being delivered?
  4. Remediation — How quickly are any of the above data SLA issues detected and resolved?

Databand.ai can help data-driven organizations improve in all of these areas. In this first article of the series, we'll explore how data pipeline failures affect your uptime and data SLAs.

Data pipeline health isn’t a binary question of job success or failure

Organizations can know they have data health problems, without knowing how those problems actually map to events in their pipelines or attributes of the data itself. This puts organizations in a reactive position in relation to their data SLAs.

This is an observability problem, and it stems from the inability to see the context of pipeline performance due to a fractured and incomplete view of data delivery. If you only look at success/failure counts to understand pipeline health, you may miss critical problems that affect your data SLAs (like uptime), such as a task running late and causing a missed data delivery, and how those problems might cascade into broader issues.

At Databand, we believe data observability goes deeper than monitoring by adding more context to system metrics, providing a deeper view of system operations, and indicating whether engineers need to step in and apply a fix. 

Observability for production data pipelines is hard, and it’s only getting harder. As companies become more data-focused, the data infrastructure they use becomes more sophisticated. This increased complexity has caused pipeline failures to become more common and more expensive.

Data observability within organizations is fractured for a variety of reasons. Pipelines interact with multiple systems and environments. Each system has its own monitoring in place. On top of that, different data teams in your organization might have ownership over parts of your stack.

Databand Dashboard: a unified solution for guaranteeing data SLAs

We developed Databand Dashboard to help data engineers gain full observability on their data and monitor data quality across its entire journey. It’s easier than ever to find leading indicators and root causes of pipeline failures that can prevent on-time delivery. Whether your data flows are passing through Spark, Snowflake, Airflow, Kubernetes, or other tools, you can do it all in one place. 

  • Alerts on performance and efficiency bottlenecks before they affect data delivery
  • Unified view of your pipeline health, including logs, errors, and data quality metrics
  • Seamless connection to your data stack
  • Customizable metrics and dashboards
  • Fast root cause analysis to resolve issues when they are found

A single entry point for all pipeline-related issues

A pipeline can fail in multiple ways, and the goal of Databand's dashboard is to help engineers quickly categorize and prioritize issues so that you meet your data delivery SLAs. Here are some examples of issues that might appear:

  1. Bad data causing pipelines to fail
    Common example: a wrong schema or wrong value leads to a failure to read the data and, as a result, a total task failure
  2. Failure related to pipeline logic
    Common example: a new version of the pipeline has a bug in the task source code, causing failures in the production run
  3. Resource-related issues
    Common example: failure to provision an Apache Spark cluster, or lack of available memory
  4. Orchestrator system failure, or issues related to cluster health
    Common example: the scheduler fails to schedule a job

Triage failures and highlight what matters most to your data delivery

It's difficult to prioritize which data pipeline failures to focus on, especially when many things are happening across your entire system at once.

The Dashboard can plot all pipelines and runs together over a pre-configured time with statuses and metrics — allowing you to visualize the urgency and dependencies of each issue and tackle them accordingly. After detection, you can dive into specific runs within your pipelines. You can observe statuses, errors, user metrics, and logs. You’ll be able to see exactly what is causing that failure, whether it’s an application code error, data expectation problem, or slow performance. This way, your DataOps team can begin working on remediation as quickly as possible.

Let’s explore an example:

Jumping into our Dashboard, Databand tells us that there's been a spike of failed jobs starting around 3:00 am the previous night.

In our aggregate view, we see it’s not one pipeline failing but rather multiple pipelines failing, and a visualization of a spike in failed jobs at specific points in time makes this clear. This is a sign of a system failure, and we need to analyze the errors happening at this point in time to get to a root cause.

This is a big deal because, with all these failures, we know we’ll have critical data delivery misses. Luckily, Databand can show us the impact of these failures on those missed data deliveries (which tables or files will not be created).

Now you know you have an issue! How can you fix it? Can you remediate quickly?

To get to the root cause of the problem, you filter the dashboard to a relevant time frame and check for the most common errors using an Error widget.

The most common error across your pipelines is a Spark “out of memory” error. 

This tells you that the root cause of the system failure is an under-provisioned Spark cluster.

Rather than spending hours manually reading logs — and possibly days trying to find a root cause — Databand helped you group multiple co-occurring errors, diagnose them, and identify a root cause. Most importantly, you had the context for a possible solution in just a few minutes.

Databand saved you precious time so that you could expedite the remediation process without breaching your data SLA.

Debug problems and get to resolution fast

When a pipeline fails, engineers need to identify resolutions fast if they want to prevent late data delivery and an SLA breach. The Dashboard gives a data engineer the proper context to the problem and focuses their debugging efforts.

As soon as a run failure happens, Databand sends you an alert, bringing you to the proper dashboard, where you can see how the failure will impact deliveries, what errors the failure relates to, and whether the error correlates with issues occurring across pipelines, indicating a contained or system-wide problem.

Using “Runs by Start Time”, our dashboard will enable us to understand if errors are specific to runs, or spread across pipelines (like a system-wide network issue). For errors detected across any runs, we can open the logs to understand the source, whether it’s our application code, underlying execution system (like Apache Spark), or an issue related to the data.

By tracing our failures to the specific cause, we can quickly resolve problems and get the pipeline back on track, so that we can recover quickly from a missed data delivery or avoid it altogether.

Make sure changes fix your problem, without creating new ones

As with any software process, when we make a change to our code (in this case pipeline code) we need to make sure the change is tested before we push it to our production system.

Unlike software processes, it’s a big pain to test data pipeline changes. There are simply too many factors to take into account – the code changes, the different stages of the pipeline, and the data flow to name a few.

When testing changes, one problem teams often face is the difficulty of comparing results between test and production environments.

With a consolidated view of all pipelines across any environment, Databand makes this easy, offering a better way to perform quality control on pipelines right before they're pushed to production, so you can decrease the risk of yet another failure and an even worse SLA miss.

[Image: data pipeline observability, comparing runs]

By selecting across multiple source systems, Databand enables you to compare metrics that are critical to your data delivery such as run durations, data quality measures, and possible errors.

Detect failure trends and prevent future ones

Understanding the what and why behind pipeline failure is important. However, our ultimate goal is to catch problems before they happen so that engineering teams can focus on making their infrastructure more efficient — rather than be stuck in that state of costly damage control.

The Databand Dashboard helps you understand what should be considered an anomalous duration. An example is an abnormally long run, or long-running tasks that keep the rest of the pipeline waiting in a queue for resources. While pipeline stats show the average run time for the last run, the Dashboard's charts can show the duration of a currently running job.

Situations like these can normally only be caught with tedious, manual monitoring. Databand automatically tracks these metrics and will send you an alert so you can fix the issue before your delivery runs late. You can set an alert on the duration of a specific task or run; when the duration exceeds the alert threshold, an alert will be sent to Slack, email, or an incident management system like PagerDuty.
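
Databand handles this tracking and alerting for you; purely to illustrate the underlying idea, here is a generic sketch (not Databand's API) of flagging a run whose duration drifts well beyond its recent history:

```python
from statistics import mean, stdev


def duration_is_anomalous(history_minutes: list[float], current_minutes: float,
                          sigma: float = 3.0) -> bool:
    """Flag a run whose duration exceeds the historical mean by `sigma` standard deviations."""
    if len(history_minutes) < 5:
        return False  # not enough history to judge
    mu = mean(history_minutes)
    sd = max(stdev(history_minutes), 1.0)  # floor the band so stable histories don't over-alert
    return current_minutes > mu + sigma * sd


# Example: recent runs took ~30 minutes; the current run is already at 95 minutes.
if duration_is_anomalous([28, 31, 30, 29, 33, 30], 95):
    print("Run duration looks anomalous: alert before the delivery runs late.")
```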

Get a bird’s eye view of your data infrastructure

The Databand Dashboard is a powerful tool that will help DataOps teams guarantee their data SLA. With Databand Dashboard you can:

  • Fix pipeline issues proactively and ensure on-time data delivery
  • Unify the monitoring of all your pipelines across their entire journey from dev to production
  • Determine the root cause of pipeline issues fast
  • Compare runs from staging and production environments with ease
  • Ensure the health of your computation clusters

We’ve just scratched the surface of what Dashboard is capable of.

In a future post, we will talk about how the Dashboard can be used to do retros, and how favorited metrics can be used to track the status of important data assets.

Databand.ai is a unified data observability platform built for data engineers. Databand.ai centralizes your pipeline metadata so you can get end-to-end observability into your data pipelines, identify the root cause of health issues quickly, and fix the problem fast. To learn more about Databand and how our platform helps data engineers with their data pipelines, request a demo or sign up for a free trial!