How a Fortune 500 FinServ Accelerated GCP Dataproc Platform Upgrade by 6 Months
Learn how a Fortune 500 Financial Services leader accelerated its data platform upgrade by 6 months with deep data and performance observability and seamless workload validation.
80%
workload validation effort reduction
50%
faster platform upgrade
$3M
resource savings
Industry: FinTech / Digital Payments
Size: 30,000 employees
HQ: California, United States
Use Case: Platform Upgrade, Platform Modernization, Workload Validation
Data Platform: GCP Dataproc
Overview
The company operates one of the largest data platforms in the world, running thousands of Spark jobs across hundreds of teams. Over the years, this environment had grown into a massive footprint on GCP Dataproc, complemented by significant operations in BigQuery.
While data application (data engineering) teams were focused on delivering their product roadmaps and supporting business value, the data platform team faced a harsh reality – a significant portion of the platform was still running on older (legacy) Dataproc and Spark versions (e.g., Dataproc 1.x and Spark 2.x), for which Google had announced end of support.
This created a major challenge at the enterprise platform level – an elevated risk posture due to potential security vulnerabilities, meaningful platform performance limitations, and growing operational risk.
The Challenge
While the enterprise platform team needed to upgrade the platform holistically, data application teams were effectively stuck on legacy versions – there was no reliable or scalable way to validate upgrades while ensuring both data output correctness and performance parity.
First, at a foundational level, the platform lacked the monitoring and observability infrastructure needed to properly understand workload behavior. Teams had no ability to deeply profile data, pipeline execution, or performance characteristics across jobs, making it difficult to establish a baseline or evaluate the impact of any change.
Second, there was no infrastructure for safely testing workloads before production. Running the same pipeline pre- and post-upgrade on real production data in a staging environment required manual code changes – copying tables, redirecting outputs, and reconnecting inputs – a process that was tedious, error-prone, and unscalable, and that introduced risk to production systems.
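For illustration, the kind of hand-edited redirection involved might look roughly like the PySpark sketch below – the bucket paths, table names, and transformation logic are hypothetical, not the company's actual pipeline code – and every such edit had to be made, reviewed, and later reverted by hand, pipeline by pipeline.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("settlement_job_staging_run").getOrCreate()

# Reconnect inputs: read from a copied snapshot of the production table
# instead of the live source (hypothetical staging paths).
txns = spark.read.parquet("gs://staging-bucket/snapshots/transactions/")

# ...same transformation logic as the production job...
daily_totals = (
    txns.groupBy("merchant_id", "txn_date")
        .sum("amount_usd")
        .withColumnRenamed("sum(amount_usd)", "total_usd")
)

# Redirect outputs: write to a staging location so the production table is untouched.
daily_totals.write.mode("overwrite").parquet("gs://staging-bucket/outputs/daily_totals/")
```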
Third, even when teams managed to run two code versions on the same input data, there was no structured way to analyze execution differences or compare outputs across runs at a granular level. There was no visibility into how data or behavior changed at each step of the pipeline, and no way to trace issues back to a specific transformation or line of code.
So when differences did occur, debugging was slow and highly manual. Teams had no clear way to isolate whether issues were caused by engine-level changes, data discrepancies, or logic differences. Root-cause analysis often required deep investigation across multiple systems, significantly increasing time to resolution.
As a result, every upgrade became a high-risk, high-effort undertaking:
- Migrations required heavy manual validation and coordination across teams
- Engineering teams were pulled away from roadmap work to support upgrade efforts
- Timelines extended significantly due to a lack of confidence and repeatability
Without a scalable validation approach, the platform team could not confidently drive the modernization forward. What began as a necessary upgrade evolved into a strategic, enterprise-wide platform modernization effort, initially projected to take over 12 months.
Why definity
definity was introduced as the observability and validation layer for the company’s Spark and Dataproc modernization initiative, enabling teams to deeply monitor workloads and to test and compare them across platform versions in a structured and repeatable way.
At its core, definity provided a foundation of deep, out-of-the-box observability across the entire data platform. Teams gained the ability to monitor and profile behavior across data, pipelines, execution, lineage, performance, and cost – at every step of the workflow, without requiring code changes or custom instrumentation.
With definity, teams were able to take existing production Spark jobs and automatically replay them using real production inputs, while redirecting outputs to a controlled staging environment. This made it possible to run legacy and upgraded versions side-by-side without impacting production systems.
definity provided deep visibility into both executions, including:
- Granular comparison of data outputs, at every interim step
- Detailed tracking of execution behavior and performance characteristics
- Detection of schema changes, data mismatches, and logical differences
Instead of relying on manual validation, teams received automated comparative analysis between versions, including clear compatibility reporting and precise identification of deltas. When differences were detected, definity provided context to help engineers pinpoint the exact stage – down to the transformation level – and resolve issues quickly.
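For a sense of what this automation replaces, a hand-rolled version-to-version output check in plain PySpark might look roughly like the sketch below (hypothetical staging paths; this is an illustration, not definity's API):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upgrade_output_diff").getOrCreate()

# Outputs of the same pipeline run on the legacy and upgraded clusters
# (hypothetical staging paths).
legacy = spark.read.parquet("gs://staging-bucket/outputs/daily_totals_legacy/")
upgraded = spark.read.parquet("gs://staging-bucket/outputs/daily_totals_upgraded/")

# Schema changes: columns or types that differ between the two outputs.
schema_diff = set(legacy.dtypes) ^ set(upgraded.dtypes)

# Row-count parity.
count_delta = upgraded.count() - legacy.count()

# Data mismatches: rows present in one output but not the other
# (exceptAll requires matching schemas, so only check when they agree).
only_in_legacy = only_in_upgraded = None
if not schema_diff:
    only_in_legacy = legacy.exceptAll(upgraded).count()
    only_in_upgraded = upgraded.exceptAll(legacy).count()

print(f"schema diff: {schema_diff}")
print(f"row count delta: {count_delta}")
print(f"rows only in legacy: {only_in_legacy}, only in upgraded: {only_in_upgraded}")
```

Even a check like this covers only one final table of one pipeline; repeating it for every interim step across thousands of workloads, and then tracing a mismatch back to the responsible transformation, is the part that never scaled manually.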
definity transformed what was previously a long, manual, fragmented, and high-risk process into a standardized, scalable, and data-driven workflow for validating platform upgrades and code changes.
“Previously, we had to manually compare output tables to ensure correctness and then manually test performance to ensure parity. It used to take days & weeks. When you own tens of pipelines, this doesn’t scale. With definity, all we have to do is instrument the pipeline, and that's it. This is a huge step.”
Shay, Data Engineering Tech Lead
The Impact
By standardizing upgrade validation on definity, the company was able to fundamentally change how platform migrations were executed.
Platform teams could enable safe, large-scale side-by-side validation across thousands of workloads, removing the need for ad hoc testing and reducing dependency on individual teams. Data engineering groups gained confidence to upgrade without risking silent data regressions, which had previously been a major blocker.
Debugging and validation cycles were significantly reduced, allowing teams to identify and resolve issues quickly without prolonged investigation. This also simplified cross-team coordination, as validation became a shared, repeatable process rather than a fragmented effort.
Key business results included:
- 80% reduction in engineering effort required for workload validation and debugging
- Overall 50% acceleration of the Spark and Dataproc modernization program – delivered 6 months faster
- Estimated $3.1M in infrastructure and engineering resource savings
- Successful de-risking of a critical enterprise modernization initiative
By enabling reliable, repeatable validation at scale, definity allowed the company to migrate off unsupported infrastructure significantly faster – without compromising trust in data correctness or pipeline performance.
“definity helped us complete our platform upgrade 50% faster. Workload validation could not have been easier.”
Dan, Data Engineering Manager
Looking forward
With a standardized validation framework in place, the company is now evolving its data platform to support agentic upgrades and validation of any pipeline code change – enabling faster and safer data delivery at scale.
Built on definity’s seamless validation foundation, deep runtime context (MCP), and intelligent code recommendations, this new model allows teams to automatically validate changes, compare outcomes, and deploy with confidence.
In parallel, the platform is extending this approach to continuous performance and cost optimization, using the same automated recommendations with built-in validation before rollout.
“The future of our platform is AI Agents that seamlessly upgrade pipelines, validate code changes, and optimize code. definity sits at its core with its runtime MCP, pipeline control, and auto-validation.”
Prasanna, Sr Manager, Data Platform
Ready to shift to proactive observability?
Easily optimize jobs, prevent incidents in real-time, and troubleshoot issues.
No code changes. Secure in your environment.
