Agentic Data Engineering Is Coming – And Today’s Stack Isn’t Ready

We're already underwater. What happens tomorrow?

The demands on modern data platforms are growing faster than most teams’ ability to keep them efficient, reliable, and under control. Controlling costs and maintaining SLAs in the face of continuous change – volume bursts, code changes, version upgrades, schema evolution, infra bottlenecks – is only getting harder.

Speak with almost any data platform or engineering leader, and you’ll hear a similar story:

“We are constantly firefighting. Yesterday we found a job that accidentally ran all weekend, costing $10K. Last week, a pipeline broke silently and delivered incomplete data to a product team for three days. And we still don’t know why two of our highest-cost jobs got slower this month – nothing changed in the code.”

These aren’t edge cases – they are the norm.

Their teams are underwater: battling inefficient jobs that drive infrastructure costs through the roof; firefighting reliability issues from degraded pipelines, failures, and poor data quality; and spending endless hours in troubleshooting cycles because the right information is scattered across logs, tools, and people’s memories.

That’s the situation today. Now imagine it with agents generating hundreds of jobs per week.

Agents building pipelines is the future – and it’s coming fast

The future of data engineering is agentic AI.

Pipelines won’t just be written by engineers – they’ll be generated and maintained by a combination of AI agents, automated policies, data engineers, and business users. Teams will be empowered to build faster, at higher scale, and without requiring deep expertise – eliminating bottlenecks, democratizing data even further, and shortening time-to-market. This will unlock 100x value from new business use-cases, optimal operations and cost savings, and engineering productivity gains.

We keep seeing previews of what’s coming – recent Databricks announcements like Agent Bricks, LakehouseIQ, and declarative systems like LakeFlow; Microsoft reporting that 20–30% of its internal code is already AI-generated; GitHub analysis showing that over 30% of new Python functions from U.S. developers (as of late 2024) were AI-written; and data engineering teams across the board already using AI co-pilots to generate ETL logic, at an accelerating pace.

But what seems exciting to leadership and end-users is a data engineer’s nightmare:

  • Pipeline sprawl – thousands of new jobs created faster than humans can review
  • Loss of domain knowledge – no engineer to explain why a job was built a certain way
  • Higher variability in quality – some pipelines will be fine, others will be inefficient or unreliable from day one, and still others will degrade over time

So for agentic AI to deliver on its promise – automating pipeline creation and maintenance – rather than create a new set of problems, agents will have to be actionable and reliable.

To achieve that, agentic AI relies on context. Without it, agents will:

  • Generate pipelines without understanding platform constraints, runtime performance, underlying queries, or historic data behavior
  • Not be able to reliably validate their own outputs
  • Become “confidently wrong” – producing suboptimal pipelines faster than ever before
  • Not be actionable – unable to fix and optimize pipelines at scale

Why current data stack tools aren’t built for it

Where can agentic AI get this context from?

The common data stack tools we rely on today are valuable – but they are either static or limited to post-facto analysis. None of them connects the dots across pipelines, data, code, infra, and execution in real time, especially under dynamic, high-velocity workloads.

Catalogs – static understanding

Data catalogs have become the backbone of data governance and discovery in modern data platforms – providing much-needed order in the chaos.

However, they are descriptive and static:

  • They show what exists – not how it’s behaving
  • They operate on the system as defined, not the one that’s actually running – and the definition is often a reflection of the truth, not the source of it
  • They cannot predict or catch performance drift, cost spikes, or runtime failures

Observability Tools – post-facto analysis of data quality

Modern data observability is essentially centered on data monitoring: it is good at automatically detecting data quality issues and schema changes.

But these tools look outside-in at data outputs (tables and query results) and operate after the fact. This means they are reactive and limited:

  • They lack visibility into, and runtime understanding of, the system that created the data – jobs, code, infrastructure, and the connections between them
  • They are unable to help with job health, fixes, and optimizations
  • They can flag data quality symptoms, but can’t reliably point to the cause

APMs – blind to pipelines

Application Performance Monitoring tools are invaluable for keeping servers, services, and API calls healthy.

But they don’t understand analytical pipelines. They:

  • Monitor overall CPU, memory, and network performance – but have no idea which jobs are running, or how
  • Don’t correlate infra usage with pipeline logic and cannot detect inefficiencies – so are unable to help optimize jobs
  • Don’t have visibility into pipeline logic, data inputs/outputs, or query patterns – so they cannot help with data or job observability

Platform-native tools – rudimentary capabilities

Native platform observability is still in its infancy. While Databricks leads in this area, most native tools still fall short:

  • Capabilities remain basic, manual, and narrowly scoped
  • They are spread across multiple disparate tools within the platform
  • They rarely address performance or cost optimization in a meaningful way
  • They are platform-specific – and don’t work across platforms or on-prem/hybrid environments

The tooling gaps today are already pushing data engineering teams to their limits – struggling against runaway costs, unreliable data, and an unmanageable operational load. And when AI agents begin producing hundreds of jobs a week, those gaps will compound many times over.

The missing piece: an intelligence layer for agentic data engineering

The future of data platforms and agentic data engineering requires something fundamentally different: a system-aware intelligence layer that is embedded in the runtime and enables agents (and engineers) to act.

It must be:

  • Full‑stack – connecting data, pipelines, execution, infrastructure, and cost
  • Contextual – correlating data behavior, job execution, code, and environment conditions
  • Runtime aware – tracking what’s actually happening, not just what’s defined in a spec
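
To make these three properties concrete, here is a minimal sketch of the kind of correlated context record such a layer might expose to an agent. It is written in Python purely for illustration; the type name, fields, and units (JobRunContext, cost_usd, and so on) are assumptions for the sketch, not any vendor’s actual schema.

# Hypothetical sketch only: field names and structure are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobRunContext:
    # Identity: which pipeline, which run, and which code version produced it
    pipeline_id: str
    run_id: str
    code_version: str                     # e.g. the git SHA of the job definition

    # Data behavior: what the run read and wrote
    input_tables: List[str] = field(default_factory=list)
    output_tables: List[str] = field(default_factory=list)
    rows_written: int = 0

    # Execution and infrastructure: what actually happened at runtime
    duration_minutes: float = 0.0
    peak_memory_gb: float = 0.0
    shuffle_read_gb: float = 0.0

    # Cost: what the run charged against the budget
    cost_usd: float = 0.0

The point is not the specific fields but the correlation: data, code, execution, infrastructure, and cost live in one record that an agent can reason over.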

With such a layer, agents can catch inefficiencies before they burn through the budget. They can automate runtime remediations, validate reliability at scale, and instantly correlate symptoms to causes – even for agent-generated jobs.
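
As one illustration of “catching inefficiencies before they burn through the budget,” an agent or policy could compare each run against its own history and flag drift before it compounds. This sketch assumes the hypothetical JobRunContext above, and the 1.5x thresholds are arbitrary placeholders.

# Illustrative guardrail, assuming the hypothetical JobRunContext sketched earlier.
from statistics import mean
from typing import List


def detect_drift(history: List[JobRunContext], current: JobRunContext,
                 cost_factor: float = 1.5, duration_factor: float = 1.5) -> List[str]:
    """Return human-readable findings when a run deviates from its own baseline."""
    if not history:
        return ["no baseline yet - flag for manual review"]

    baseline_cost = mean(r.cost_usd for r in history)
    baseline_duration = mean(r.duration_minutes for r in history)

    findings = []
    if current.cost_usd > cost_factor * baseline_cost:
        findings.append(f"cost drift: ${current.cost_usd:.0f} vs ~${baseline_cost:.0f} baseline")
    if current.duration_minutes > duration_factor * baseline_duration:
        findings.append(f"runtime drift: {current.duration_minutes:.0f} min vs ~{baseline_duration:.0f} min baseline")
    return findings

The same pattern extends to validation and root-cause correlation: because the context record ties symptoms (cost, runtime) to the code version and inputs that produced them, an agent can point at a likely cause instead of just raising an alarm.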

This layer becomes the foundation for both operational excellence and agentic intelligence.

What it fundamentally builds is trust and actionability – knowing what is happening, when it is happening, what the impact is, and how to take the right corrective action.

Final thought

AI will generate your pipelines. The question is whether you’ll have the visibility and context to keep them efficient, reliable, and trustworthy – or whether you’ll just be generating problems faster than your team can handle.