Why Is Spark Data Engineering Still So Challenging?

Data issues cost companies significant time and money. Every year, companies lose $12.9M on average to data and pipeline issues. Indeed, dbt recently called out that “poor data quality is the most prevalent problem today” for analytics engineers.

It's easy to see why. Over the past decade, data pipelines have become some of the most mission-critical software in almost every organization's technology stack. But while application engineers have a robust toolset called APM (Application Performance Monitoring) to monitor their applications and infrastructure, pinpoint issues, and quickly resolve downtime, the observability space for data engineers is still nascent. The lack of adequate tooling, coverage, and adoption leads to significant business impact: data downtime and revenue loss, low data engineering velocity, and high infrastructure costs.

The pain and consequences are amplified tenfold for those working with Spark at high scale, and even more so in on-prem environments. These teams typically manage hundreds or thousands of complex production pipelines, with heavy workloads and a big, evolving code base; they involve tens or hundreds of data producers and consumers, with a constantly changing landscape; their infrastructure and query engines are much harder to manage and optimize; and the use cases are typically even more time-sensitive, feeding directly into predictive ML models, operational systems and products, or AI applications. For these teams, ensuring pipeline reliability is absolutely mission-critical, yet increasingly challenging.

The Banes of Advanced Data Engineering

So, why do data engineers still struggle to maintain data and pipeline reliability?

Lacking Visibility into Pipelines and Data Operations

In software engineering, deep visibility into applications and infrastructure via robust APM tools across the production environment is crucial to maintaining application reliability. 

Similarly, to ensure data and pipeline reliability, data engineers must have granular and ubiquitous monitoring of their data and pipelines, specifically:

  • Data quality, including data volumes, freshness, consistency, value validity and distribution, and schema – across input, interim, and output datasets.
  • Pipeline health, from a single query to the pipeline level, including run durations and SLAs; run statuses, failures, re-runs, and hanging or not-started runs; structure changes; and even code and environment changes.
  • Infra performance, including CPU and memory utilization, query performance, orchestration bottlenecks, and cluster issues.
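To make the first category concrete, here is a minimal sketch of such checks in plain Python (not Spark code); the field names, thresholds, and batch shape are hypothetical, and a real implementation would run these against actual datasets:

```python
from datetime import datetime, timedelta, timezone

def check_batch(rows, min_volume, max_staleness):
    """Run basic data-quality checks on a batch of rows.

    Returns a dict of check name -> passed (bool)."""
    now = datetime.now(timezone.utc)
    newest = max((r["event_time"] for r in rows), default=None)
    return {
        "volume": len(rows) >= min_volume,                      # enough rows arrived
        "freshness": newest is not None
                     and (now - newest) <= max_staleness,       # data is recent enough
        "validity": all(r["value"] is not None for r in rows),  # no missing values
    }

# A toy batch: three fresh rows, all with values present.
batch = [{"event_time": datetime.now(timezone.utc), "value": i} for i in range(3)]
results = check_batch(batch, min_volume=2, max_staleness=timedelta(hours=1))
```

The same shape of check would apply at each stage (input, interim, output), which is what makes ubiquitous coverage so labor-intensive to hand-build.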

These gaps are especially acute in Spark environments, where analytical operations tend to be heavier and run longer; pipelines tend to be more complex, with multiple sub-tasks; and runs tend to be more error-prone and tightly coupled with the underlying infrastructure's performance.

But data engineers rarely have this today. Many teams have built sporadic, ad hoc data quality checks on key fact tables. Some teams have built dashboards tracking specific data quality metrics (like freshness) for some tables, or use 3rd-party data quality tools; others rely solely on the orchestrator (e.g., Airflow) to track pipeline runs. And some don’t have any data or pipeline monitoring in place.

This siloed, ad hoc monitoring (or lack thereof) is partial, spotty, and fragmented, leaving significant blindspots and low-coverage areas. Moreover, it leaves data engineers in the dark about how data behaves and flows through the data platform holistically, how pipelines execute and perform, and how pipeline executions and data behavior tie together.

With the increasing complexity of data platforms and the ever-increasing scale of data, pipelines, and data producers & consumers, continuous and deep visibility becomes increasingly difficult (or almost impossible) to achieve.

It Takes a Huge Manual Effort to Test Data

The first step for most data engineering teams is hand-writing data quality checks. While this works at a smaller scale, it is the part of the process that breaks down for enterprise teams.

Firstly, manual testing means more work. Data developers don’t want to spend their time writing mundane tests; they want to build new pipelines, new business logic, and new features. They already maintain tens of pipelines and tables in production. The last thing they want is to write and maintain countless tests for various data scenarios and edge cases, or to update tests as requirements change. While data testing tools and open-source frameworks let data engineers run tests easily, they don’t truly alleviate the pain and effort of writing the tests.

Secondly, it means partial coverage. With tens or hundreds of data assets (schemas, tables, columns) and multiple data quality dimensions to check, you can never write all possible tests. Moreover, you don’t know how data behavior will evolve with constantly changing inputs and data flows. So, writing tests becomes a reactive exercise, protecting against past failures rather than future ones.

Thirdly, manual testing means static rules and thresholds. When data evolves and behavior changes, these rules & thresholds often become obsolete, requiring an update. This requires development and redeployment, involving more manual effort, delay, and risk. 

Lastly, if data engineers are to maintain data reliability, pipeline health, and infra performance, testing data quality alone is not enough. Yet in practice, data engineers today rarely test pipeline health or infra performance.

Altogether, the result is disastrous. You launch a data quality improvement initiative with enthusiasm, and data engineers spend time and effort developing tests, typically for several key tables. Slowly, engineers write fewer and fewer tests as they focus on delivery. After some time, they update old tests and thresholds based on current data behavior, or just deprecate the tests altogether. So you invest heavy effort in writing tests that soon become irrelevant, and are left with low, spotty coverage.

Manual data testing simply doesn’t scale.

Issues Are Always Caught Too Late

The next logical step is to employ automated data monitoring to detect data quality issues at scale. Most data monitoring tools today connect outside-in to each data warehouse, poll the data output periodically, and check for anomalies.

While an automated solution is more scalable, when the monitoring is not ubiquitous and is done at-rest, issues slip through the cracks, are not caught in real-time, and end up propagating downstream, affecting the business. These issues are typically caught by other data engineers, BI/data analysts, business executives, and, in the worst case, external customers. This can have a significant impact on trust and, likely, revenue.

Then, data engineers have to go into a reactive problem-solving mode. Instead of proactively preventing issues, they find themselves constantly firefighting and dealing with problems as they arise. This approach is akin to fighting fires without proper fire alarms. The delays in detecting and fixing data issues slow down engineering velocity and consume valuable time that could be better spent building business value and improving the data infrastructure.

It Takes Too Long to Root Cause

When data issues are finally detected, data engineers are first tasked with finding the root cause. 

But the focus on monitoring data (output) leaves data engineers blind.

Modern data pipelines are highly complex, with many hops and inter-dependencies. While understanding the output data is crucial, it provides only a partial picture. To truly understand how data is generated and why it is changing (while maintaining pipeline health), data engineers need the full context – a deep understanding of the behavior of, and the relationships between, data, pipelines, and infrastructure.

But with today’s tools and capabilities, data engineers struggle to perform the root cause analysis (RCA) quickly and efficiently because they:

  • Look at symptoms, not underlying issues: they monitor the data output (e.g., at-rest in the warehouse), but not the leading indicators such as input data, pipeline run SLAs, and underlying infra resource consumption.
  • Lack execution context: they lack the understanding of which specific pipeline and run generated the faulty data or resulted in an issue. This includes the pipeline context (pipeline, code, environment, schema, external & internal lineage) and run context (SLA, duration, errors, input data), how they compare with previous runs, and whether any changes were detected.
  • Are unable to see data+job lineage: they may be able to see how data is connected across datasets, but cannot see how data flows, the dependencies between pipelines, the internals of each pipeline (its sub-tasks and transformations), and the mapping of each dataset to the specific transformation and run that generated it.
  • Don’t couple infra performance with pipeline runs: they don’t track how the infra is performing, which core issues are degrading performance (such as skew or shuffle), and how it all affects pipeline health and data output quality.
  • Don’t have actionable alerts: they lack run-time alerts linked to the relevant contextual information described above, enabling easy pinpointing of issues.
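As a sketch of what “actionable” could mean, an alert that bundles the execution context described above might carry a payload along these lines (every field name and value here is hypothetical, purely for illustration):

```python
# A sketch of the execution context a pinpointed alert could carry
# (all field names and values here are hypothetical).
alert = {
    "issue": "volume_drop",
    "dataset": "daily_orders",
    "pipeline": {                     # pipeline context
        "name": "orders_aggregation",
        "code_version": "abc123",
        "schema_changed": False,
    },
    "run": {                          # run context
        "run_id": "2024-06-01T02:00",
        "duration_sec": 5400,
        "sla_sec": 3600,              # this run breached its SLA
        "status": "succeeded",
    },
    "upstream": ["raw_orders", "raw_customers"],  # lineage hint for the RCA
}

# With the context inline, triage checks become one-liners instead of queries.
sla_breached = alert["run"]["duration_sec"] > alert["run"]["sla_sec"]
```

With a payload like this, the on-call engineer can see at a glance that the faulty dataset came from a run that breached its SLA, without manually joining orchestrator logs against warehouse metrics.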

Without this integrated approach, data engineers are left guessing in the dark, not knowing what affected the run and why an issue has occurred in the first place. They are left to manually sift through disparate data points in a lengthy back-and-forth.

So what happens? Slack escalations and email wars. Without the ability to see what is happening within the pipeline, it becomes a matter of asking, “Who changed something upstream?” Without contextual information and connected alerting, it’s hard to understand the source of the issue and the needed fix, as well as the downstream impact and the mitigation actions to take.

There Are Always Wasted Runs & Resources

Application developers and platform engineers would never deploy without first understanding their deployment's storage and computing resources. However, with data engineering, a lack of visibility leads to suboptimal orchestration and resource utilization. This results in inefficient scheduling, wasted runs and retries, and over-provisioned or underutilized compute resources.

For instance, consider orchestration inefficiencies. Without proper tooling, managing task dependencies and scheduling becomes a complex and error-prone process. This often leads to high infra congestion, redundant runs, or idle resources waiting for dependent tasks to complete or data to arrive. By monitoring and understanding how pipeline durations and SLAs behave over time and interdependently, you can streamline these dependencies, ensuring tasks run in the most efficient order and reducing unnecessary pipeline re-runs.

Another example is hanging tasks, which continue to run in the background until someone finally notices that they never finished. Without real-time detection of SLA breaches and the ability to kill the rogue run, significant resources might be wasted.
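A minimal sketch of detecting such rogue runs, in plain Python: the SLA here is derived from recent run history (the 1.5x-median factor and the run data are hypothetical, chosen only to illustrate the idea):

```python
import statistics

def sla_from_history(durations_sec, factor=1.5):
    """Derive a soft SLA as a multiple of the median historical duration."""
    return factor * statistics.median(durations_sec)

def find_rogue_runs(active_runs, history_sec):
    """Flag in-flight runs whose elapsed time already exceeds the derived SLA."""
    sla = sla_from_history(history_sec)
    return [run_id for run_id, elapsed in active_runs.items() if elapsed > sla]

history = [600, 620, 580, 640, 610]       # previous run durations, in seconds
active = {"run-41": 630, "run-42": 7200}  # run-42 has been hanging for 2 hours
rogue = find_rogue_runs(active, history)  # candidates to alert on, or to kill
```

The point is that the threshold comes from observed behavior rather than a hard-coded timeout, so it stays meaningful as pipelines evolve.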

Common Spark performance issues stemming from its distributed compute framework, such as skew, shuffle, and inefficient garbage collection, typically result in memory spills, longer processing durations, underutilized executors, cluster issues, and even outright pipeline failures. These not only waste precious resources and may breach SLAs, but may also lead to costly retries and degrade data reliability altogether.
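To see why skew is so costly, here is a plain-Python toy model (not Spark API code) of a classic mitigation, key salting: one hot key that would otherwise pin a single partition is split into sub-keys and spread across the cluster. The partitioner and record counts are contrived for illustration:

```python
from collections import Counter

# 10,000 records where one "hot" key dominates -- a classic skew scenario.
records = ["hot_key"] * 9000 + [f"key_{i}" for i in range(1000)]
NUM_PARTITIONS = 8

def partition_of(key):
    # Deterministic toy partitioner (a stand-in for Spark's hash partitioning).
    return sum(key.encode()) % NUM_PARTITIONS

# Without salting, every "hot_key" record lands on the same partition.
plain = Counter(partition_of(k) for k in records)

# With salting, a rotating suffix splits the hot key into NUM_PARTITIONS
# sub-keys; the aggregation is then done per-salt and merged afterwards.
salted = Counter(partition_of(f"{key}#{i % NUM_PARTITIONS}")
                 for i, key in enumerate(records))

max_plain = max(plain.values())    # one partition holds at least 9,000 records
max_salted = max(salted.values())  # the load is spread far more evenly
```

In real Spark jobs the same imbalance shows up as one straggler task spilling to disk while the other executors sit idle, which is exactly the pattern pipeline-level monitoring should surface.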

Lastly, even if you have visibility into your total costs from your cloud data platform or into your overall on-prem server utilization, identifying opportunities for cost savings and optimizing performance is impossible without visibility into the logical layer (e.g., pipeline) and tracking it over time. This granular insight allows you to pinpoint and address inefficiencies proactively, leading to more efficient resource utilization and cost savings.

Data Pipeline Observability for Advanced Data Engineering

To protect against the effects of data & pipeline issues, increase data engineering velocity, and reduce infrastructure costs, data engineers need more robust infrastructure to do their job. What does a best-in-class solution for Spark-heavy data engineers look like?

Deep monitoring

A comprehensive monitoring solution should provide out-of-the-box granular metrics monitoring, covering data quality, pipeline runs, and infrastructure performance. By consolidating these metrics into a unified view, data engineers can gain a holistic understanding of their pipelines' health and performance. 

Gaining visibility into the entire data operation enables data engineers to identify potential issues, bottlenecks, and optimization opportunities that are entirely invisible in their current setup.

AI-powered coverage

Automated coverage is crucial for ensuring data quality and reliability. A best-in-class solution should offer AI-generated tests that provide full coverage, reducing the manual effort required to write and maintain tests. This eliminates the countless hours data engineers spend writing tests that still miss essential cases. Instead, they can focus on developing deep business logic and domain-specific checks.

Dynamic anomaly detection can then identify and flag unusual patterns or deviations in data, moving away from static rules that don’t evolve with time and require constant maintenance.
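A minimal sketch of the idea, in plain Python: instead of a fixed threshold, the bound is recomputed from a rolling window of recent values, so it drifts with the data (the window size, k-sigma rule, and volumes are hypothetical; production systems use far more sophisticated models):

```python
import statistics

def is_anomalous(history, value, k=3.0):
    """Flag a metric value deviating more than k standard deviations from
    its recent history -- a minimal stand-in for learned dynamic thresholds."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(value - mu) > k * max(sigma, 1e-9)

# Daily row counts drift upward over time; the threshold drifts with them.
volumes = [100, 104, 98, 103, 101, 107, 110, 112, 115, 118]

normal_day = is_anomalous(volumes[-7:], 121)  # in line with the recent trend
bad_day = is_anomalous(volumes[-7:], 40)      # a sudden drop is flagged
```

A static rule written against the early values (say, "volume must be near 100") would start false-alarming within weeks; the rolling bound needs no redeployment to keep up.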

In-motion protection

Proactive protection extends automated coverage by providing run-time testing in line with pipeline runs. This means that input data and outputs are validated and checked for anomalies as they flow through the pipeline rather than waiting until the end of the process to identify issues. This is a big difference from checking data at-rest that detects issues too late, allows for issues to propagate downstream, cannot check input data, and requires a separate process to run validations. 
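As a sketch of what in-line validation can look like, here is a plain-Python toy (the column names, thresholds, and transformation are hypothetical): the check runs as a gate inside the transformation step itself, so a bad batch fails fast instead of being written and discovered later at rest:

```python
class DataQualityError(Exception):
    """Raised when a batch fails validation mid-pipeline."""

def validate_in_motion(batch, required_cols, min_rows):
    """Check a batch as it enters a transformation step, before any
    output is written, instead of auditing tables at rest afterwards."""
    if len(batch) < min_rows:
        raise DataQualityError(f"volume too low: {len(batch)} rows")
    for row in batch:
        missing = required_cols - row.keys()
        if missing:
            raise DataQualityError(f"missing columns: {missing}")
    return batch  # safe to hand to the next transformation

def transform(batch):
    # The validation gate runs in line with the pipeline itself.
    batch = validate_in_motion(batch, {"id", "amount"}, min_rows=1)
    return [{"id": r["id"], "amount_usd": r["amount"] * 1.1} for r in batch]

out = transform([{"id": 1, "amount": 100.0}])
```

Because the gate raises before anything is written, a failed check can also be the trigger for the automatic run preemption described next: the run stops at the faulty step instead of polluting downstream tables.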

Furthermore, automatic run preemption can stop problematic pipeline runs before they cause downstream damage, saving valuable time and resources.

Contextualized RCA

Contextualized root cause analysis (RCA) is essential for quickly identifying and resolving issues when they occur. A best-in-class solution should provide rich execution context through unified telemetry, including information about pipeline run history, column-level lineage, code changes, environment settings, and schema modifications. 

Deep column-level lineage that is automatically discovered and built (without user configuration or input), along with mapping between datasets and the specific jobs & transformations that generated or consumed them, allows data engineers to trace the data flow through the pipeline and quickly identify the source of issues across the entire data platform.
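Conceptually, such a lineage graph maps each dataset to the job that produced it and that job's inputs, which makes tracing an issue upstream a simple graph walk. A toy sketch in plain Python (the dataset and job names are hypothetical; real lineage is discovered automatically and is column-level):

```python
# A toy lineage graph: each dataset maps to the job that produced it
# and that job's input datasets (all names are hypothetical).
lineage = {
    "daily_orders": {"job": "orders_agg",   "inputs": ["clean_orders"]},
    "clean_orders": {"job": "orders_clean", "inputs": ["raw_orders"]},
    "raw_orders":   {"job": "ingest",       "inputs": []},
}

def trace_upstream(dataset):
    """Walk the lineage back to the sources, yielding (dataset, producing job)."""
    path = []
    stack = [dataset]
    while stack:
        ds = stack.pop()
        node = lineage[ds]
        path.append((ds, node["job"]))
        stack.extend(node["inputs"])
    return path

# Starting from the faulty output, recover every upstream dataset and job.
path = trace_upstream("daily_orders")
```

Starting from the broken table, the walk immediately yields the candidate jobs and inputs to inspect, replacing the Slack-and-email archaeology described earlier.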

Pinpointed alerts surface all this information readily for data engineers to act on, significantly reducing the time spent on manual investigation and troubleshooting.

Performance optimization

Data engineers need to be able to optimize pipelines’ performance, increase resource utilization, and reduce infra costs, all while ensuring that pipelines are reliable and that SLAs are met.

A best-in-class solution can pinpoint concrete opportunities for optimization, such as:

  • Optimizing resource and memory utilization at the pipeline level
  • Optimizing pipeline orchestration bottlenecks to meet SLAs
  • Identifying redundant and obsolete pipeline runs
  • And even optimizing the observability infrastructure itself (e.g., saving on inefficient resources running as a separate process)

Maximizing Value through Pipeline Observability

By implementing a best-in-class solution for data pipeline observability, Spark-heavy data engineering teams can achieve several positive outcomes and strategic benefits:

  1. Protect against impact of data downtime: increase test coverage and minimize time to detect issues, ensuring data reliability and preventing downstream impact.
  2. Increase data developers' velocity: reduce time to resolve issues and eliminate manual test writing efforts, allowing data engineers to focus on delivering business value.
  3. Reduce data infrastructure costs: optimize resource utilization and minimize inefficient runs and re-runs.
  4. Establish engineering standards: ensure consistency and accountability across the organization, promoting a culture of excellence and quality.
  5. Regain trust: rebuild trust with stakeholders and consumers by consistently delivering reliable, high-quality data on time.
  6. Accelerate code deployments: confidently deploy code and perform platform upgrades, ensuring performance parity and data quality.

By embedding observability into production pipelines and seamlessly empowering data engineers with the right tools and practices, data organizations can unlock the full potential of their data engineering teams, build the strong foundation required for AI & ML innovation, and maintain a competitive edge in the data-driven landscape.

definity is a data pipeline observability solution designed for Spark-heavy data engineering teams. Contact us for a demo of definity to see how we can help make this happen for your organization.