How a Top-3 TV Provider Cut AWS EMR Cost by 58% and Ensured Pipeline SLA Reliability
Explore how a top-3 TV provider reduced workload run-times to meet external SLAs, significantly optimized cloud cost, and improved platform reliability.
58%
EMR cost reduction
74%
overall reduction in job run-times
12.5h, daily
run-time reduction on the heaviest job
Industry: Media and entertainment
Size: 15,000 employees
HQ: California, United States
Use Case: Cost Optimization, Pipeline performance
Data Platform: AWS EMR, Spark Streaming
Overview
At the heart of the company’s data platform are extensive Spark data pipelines running on AWS EMR, processing massive volumes of data daily. The platform is used to generate analytics that drive personalized content recommendations for viewers and real-time pricing and performance analytics for advertisers.
As data analytics demand grows, ensuring pipeline reliability, cost efficiency, and consistent performance becomes essential to maintaining business continuity, meeting internal service-level expectations, and honoring commitments to external customers.
To strengthen platform observability and optimization efforts, the company adopted definity, enabling full-platform visibility and rapid identification of performance and resource inefficiencies.
The Challenge
As workloads scaled, the data platform faced increasing operational strain. Because these data pipelines directly power customer-facing experiences and revenue-generating services, meeting strict SLAs was business-critical.
However, SLA misses and pipeline failures were largely handled reactively, with issues often detected late in the process. As a result, engineering teams were required to maintain 24/7 on-call coverage to respond to performance degradations, pipeline delays, and SLA breaches.
- Business-critical, long-running Spark jobs struggled to meet SLAs, delaying customer-facing analytics and time-sensitive business operations
- High resource consumption led to rising cloud infrastructure costs, making it difficult to balance performance demands with budget constraints
Existing APM and cloud monitoring tools were in place, but they were not sufficient to support platform-wide optimization:
- Limited visibility — monitoring was not unified or automated across the platform, making overall workload behavior difficult to understand and leaving teams without a clear picture of end-to-end performance
- Lack of deep waste analysis — existing tools did not provide insight into inefficiencies at task or resource level, preventing teams from identifying high-ROI optimization opportunities and slowing improvement efforts
Without comprehensive, contextual insight into job execution, health, data and resource usage, performance tuning relied heavily on manual investigation and reactive troubleshooting.
“This technology is clearly missing in the AWS stack.”
Senior Director, Data Platform
Why definity
definity was introduced to deliver comprehensive observability and actionable optimization intelligence across the entire data platform. Key capabilities included:
- Zero-effort, full-platform instrumentation enabling rapid adoption without disrupting existing workflows or requiring intrusive changes
- Automatic, contextualized visibility across Spark and EMR workloads, providing unified insight into job behavior, resource utilization, and execution patterns
- Waste heatmaps that highlighted inefficiencies and revealed high-impact optimization opportunities, helping teams quickly focus on areas with the greatest potential return
This combination allowed teams to move from reactive troubleshooting to proactive, data-driven optimization. definity uncovered major inefficiencies, such as heavy S3 I/O that caused long runtimes and low utilization, limiting overall cluster efficiency. definity’s task-level analysis then delivered concrete, actionable recommendations that enabled quick workload tuning and performance improvements.
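The case study does not disclose the specific tunings applied. As an illustrative sketch only, typical Spark-on-EMR settings for reducing heavy S3 I/O look like the following (these are standard Spark and Hadoop S3A properties; the values are generic examples, not the provider's actual configuration):

```properties
# spark-defaults.conf — example values, not the provider's actual settings

# Read fewer, larger input splits to cut per-object S3 request overhead
spark.sql.files.maxPartitionBytes        268435456

# Allow more concurrent S3 connections for parallel reads/writes
spark.hadoop.fs.s3a.connection.maximum   100

# Use an S3A committer to avoid slow rename-based output commits on S3
spark.hadoop.fs.s3a.committer.name       directory
spark.sql.sources.commitProtocolClass    org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

Consolidating many small output files before writing (for example via `df.coalesce(n)`) is another common lever for S3-heavy jobs, since each object operation carries fixed request latency.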
The Impact
With actionable insights and targeted improvements, the company achieved measurable results:
- Reduced the runtime of the heaviest pipelines by 74%, significantly lowering the risk of SLA misses and stabilizing delivery of time-sensitive business data
- For example, a critical customer-facing pipeline saw its end-to-end processing time and SLA window reduced from 17 hours to 4.5 hours.
- Eliminated 58% of total platform resource consumption through smarter resource allocation and workload optimization
- Deep code optimization guidance enabled further object-store (S3) performance improvements, strengthening long-term platform efficiency
These outcomes significantly improved platform reliability while delivering substantial cost savings and operational efficiency.
“definity helped us achieve significant cost savings right off the bat, without any code changes – saving real dollars for the company. Having a single pane of glass for monitoring and actionable insights for all our production jobs is simply awesome.”
Principal, Platform Engineering
Looking forward
By gaining unified, automated visibility and actionable optimization guidance, the company transformed its data platform operations – improving performance, lowering costs, and ensuring the scalability required for future growth.
Ready to shift to proactive observability?
Easily optimize jobs, prevent incidents in real-time, and troubleshoot issues.
No code changes. Secure in your environment.
