How a Top-3 TV Provider Cut AWS EMR Cost by 58% and Ensured Pipeline SLA Reliability
Explore how a top-3 TV provider reduced workload run-times to meet external SLAs, significantly optimized cloud cost, and improved platform reliability.
58%
EMR cost reduction
74%
overall reduction in job run-times
12.5h, daily
run-time reduction on the heaviest job
Industry: Media and entertainment
Size: 15,000 employees
HQ: California, United States
Use Case: Cost Optimization, Pipeline performance
Data Platform: AWS EMR, Spark Streaming
Overview
At the heart of the company’s data platform are extensive Spark data pipelines running on AWS EMR, processing massive volumes of data daily. The platform is used to generate analytics that drive personalized content recommendations for viewers and real-time pricing and performance analytics for advertisers.
As data analytics demand grows, ensuring pipeline reliability, cost efficiency, and consistent performance becomes essential to maintaining business continuity, meeting internal service-level expectations, and honoring commitments to external customers.
To strengthen platform observability and optimization efforts, the company adopted definity, enabling full-platform visibility and rapid identification of performance and resource inefficiencies.
The Challenge
As workloads scaled, the data platform faced increasing operational strain. Because these data pipelines directly power customer-facing experiences and revenue-generating services, meeting strict SLAs was business-critical.
However, SLA misses and pipeline failures were largely handled reactively, with issues often detected late in the process. As a result, engineering teams were required to maintain 24/7 on-call coverage to respond to performance degradations, pipeline delays, and SLA breaches.
- Business-critical, long-running Spark jobs struggled to meet SLAs, delaying customer-facing analytics and time-sensitive business operations
- High resource consumption led to rising cloud infrastructure costs, making it difficult to balance performance demands with budget constraints
Existing APM and cloud monitoring tools were in place, but they were not sufficient to support platform-wide optimization:
- Limited visibility — monitoring was not unified or automated across the platform, making overall workload behavior difficult to understand and leaving teams without a clear picture of end-to-end performance
- Lack of deep waste analysis — existing tools did not provide insight into inefficiencies at task or resource level, preventing teams from identifying high-ROI optimization opportunities and slowing improvement efforts
Without comprehensive, contextual insight into job execution, health, data and resource usage, performance tuning relied heavily on manual investigation and reactive troubleshooting.
“This technology is clearly missing in the AWS stack.”
Senior Director, Data Platform
Why definity
definity was introduced to deliver comprehensive observability and actionable optimization intelligence across the entire data platform. Key capabilities included:
- Zero-effort, full-platform instrumentation enabling rapid adoption without disrupting existing workflows or requiring intrusive changes
- Automatic, contextualized visibility across Spark and EMR workloads, providing unified insight into job behavior, resource utilization, and execution patterns
- Waste heatmaps that highlighted inefficiencies and revealed high-impact optimization opportunities, helping teams quickly focus on areas with the greatest potential return
This combination allowed teams to move from reactive troubleshooting to proactive, data-driven optimization. definity uncovered major inefficiencies, such as heavy S3 I/O that caused long runtimes and low utilization, limiting overall cluster efficiency. definity’s task-level analysis then delivered concrete, actionable recommendations that enabled quick workload tuning and performance improvements.
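The case study does not disclose the specific tunings applied. As an illustrative sketch only, typical Spark-on-EMR settings for reducing heavy S3 I/O look like the following (these are standard Spark and Hadoop S3A properties; the values are generic examples, not the provider's actual configuration):

```properties
# spark-defaults.conf — example values, not the provider's actual settings

# Read fewer, larger input splits to cut per-object S3 request overhead
spark.sql.files.maxPartitionBytes        268435456

# Allow more concurrent S3 connections for parallel reads/writes
spark.hadoop.fs.s3a.connection.maximum   100

# Use an S3A committer to avoid slow rename-based output commits on S3
spark.hadoop.fs.s3a.committer.name       directory
spark.sql.sources.commitProtocolClass    org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

Consolidating many small output files before writing (for example via `df.coalesce(n)`) is another common lever for S3-heavy jobs, since each object operation carries fixed request latency.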
The Impact
With actionable insights and targeted improvements, the company achieved measurable results:
- Reduced the runtime of the heaviest pipelines by 74%, significantly lowering the risk of SLA misses and stabilizing delivery of time-sensitive business data
- For example, a critical customer-facing pipeline saw its end-to-end processing time and SLA window reduced from 17 hours to 4.5 hours.
- Eliminated 58% of total platform resource consumption through smarter resource allocation and workload optimization
- Deep code optimization guidance enabled further object-store (S3) performance improvements, strengthening long-term platform efficiency
These outcomes significantly improved platform reliability while delivering substantial cost savings and operational efficiency.
“definity helped us achieve significant cost savings right off the bat, without any code changes – saving real dollars for the company. Having a single pane of glass for monitoring and actionable insights for all our production jobs is simply awesome.”
Principal, Platform Engineering
Looking forward
By gaining unified, automated visibility and actionable optimization guidance, the company transformed its data platform operations – improving performance, lowering costs, and ensuring the scalability required for future growth.
Ready to shift to proactive observability?
Easily optimize jobs, prevent incidents in real-time, and troubleshoot issues.
No code changes. Secure in your environment.
