The Netflix Streaming Disaster

Possible Causes of Streaming Failure of Boxing Match

Dudhraj Sandeep
Dec 7, 2024

On November 15, 2024, people worldwide watched the highly anticipated boxing match between Mike Tyson and Jake Paul, which was streamed on Netflix.

The match was horrendous, and so was Netflix’s streaming. It was so bad that an audience member streamed the match to X (Twitter).

Although it was a huge letdown, this is surprisingly not the first time Netflix has failed to deliver a live stream.

More than a year earlier, on 16th April 2023, Netflix failed to stream the “Love Is Blind” reunion, a reality show special, despite serving a smaller audience.

So, why does Netflix struggle so much with live streaming?

In this article, we will try to understand, from a data architecture perspective, the reasons behind Netflix’s mishap.

The Architecture

While the definite cause of the failure is best known to Netflix, we can speculate about the various weak points behind the service disruption on 15th November.

The answer to this failure is hidden in the architecture of Netflix and the complexity of real-time data.

High Level Arch Diagram for Netflix Live Stream

Netflix is a data giant, with a range of services (internal and external) built upon a microservices architecture.

A microservices architecture involves designing an application as a group of small, independent services that work together.

Three main Netflix services take part in real-time data movement: Cosmos (the creator), PlayAPI (the controller), and Open Connect CDN (the distributor).

Unfortunately, all three took a beating during the boxing match.

Netflix Services

The Issues

As 60M+ users decided to stream that day, Netflix’s services could not scale up. The catalyst, then, was the huge traffic hitting Netflix’s systems.
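To get a feel for that scale, here is a back-of-envelope estimate in Python. It treats the 60M+ figure as concurrent streams, and the per-stream bitrates are my own assumptions for illustration, not numbers published by Netflix.

```python
# Back-of-envelope estimate of total streaming bandwidth.
# The bitrates below are illustrative assumptions, not Netflix figures.

CONCURRENT_STREAMS = 60_000_000  # the "60M+" figure, treated as concurrent streams


def aggregate_tbps(streams: int, mbps_per_stream: float) -> float:
    """Return total bandwidth in terabits per second."""
    return streams * mbps_per_stream / 1_000_000  # 1 Tbps = 1,000,000 Mbps


ASSUMED_BITRATES_MBPS = {"SD": 1.5, "HD": 5.0, "4K": 15.0}

for quality, mbps in ASSUMED_BITRATES_MBPS.items():
    total = aggregate_tbps(CONCURRENT_STREAMS, mbps)
    print(f"{quality} at {mbps} Mbps per stream: {total:,.0f} Tbps in total")
```

Even at the assumed HD bitrate, the total comes to 300 Tbps, which shows why delivery has to be spread across thousands of edge devices rather than a handful of central servers.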

We’ll dissect each service separately and discuss the potential issues with it.

The Cosmos

Cosmos microservice architecture

Cosmos serves various purposes within Netflix’s video computing, one of them being video encoding.

Video encoding is the process of converting raw, uncompressed video data into a compressed digital format that can be stored, transmitted, and played back efficiently.
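To see why encoding matters, the sketch below compares the raw and compressed bitrates of a 1080p video. It assumes 8-bit 4:2:0 video (12 bits per pixel on average) at 30 frames per second and a hypothetical 5 Mbps encoded stream; these are illustrative choices, not Netflix’s actual encoding settings.

```python
# Raw vs. compressed bitrate for 1080p video.
# Assumes 8-bit 4:2:0 video (12 bits per pixel on average) and a
# hypothetical 5 Mbps encoded stream; both are illustrative choices.


def raw_bitrate_bps(width: int, height: int, fps: int, bits_per_pixel: int = 12) -> int:
    """Bits per second needed to send uncompressed frames."""
    return width * height * bits_per_pixel * fps


raw = raw_bitrate_bps(1920, 1080, 30)
encoded = 5_000_000  # assumed encoded bitrate in bits per second

print(f"raw:     {raw / 1e6:.1f} Mbps")
print(f"encoded: {encoded / 1e6:.1f} Mbps")
print(f"compression ratio: about {raw / encoded:.0f}x")
```

Under these assumptions the raw stream needs roughly 746 Mbps, about 149 times more than the encoded one, so no consumer connection could carry it unencoded.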

To optimize costs, video encoding is done on a fleet of 1,500+ r3.4xlarge EC2 instances during their idle time (afternoons and evenings), making good use of Netflix’s internal EC2 spot instance pool.

This schedule works for most pre-recorded content, such as movies and stand-up comedy specials.

However, the match was a live stream, unlike pre-recorded content. Live video has to be encoded as it arrives and cannot wait for idle capacity, so depending on the internal EC2 spot instance pool might have led to contention for resources.

Cosmos may thus have imposed additional resource demands on Netflix’s already constrained infrastructure, limiting how far it could scale.
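The difference between batch and live encoding can be sketched with a simplified throughput model. The function name and all numbers here are hypothetical; the point is only that a live encoder that is even slightly too slow falls further behind with every segment, whereas a batch job can simply finish later.

```python
# Simplified throughput model: why live encoding cannot wait for idle capacity.
# All numbers are hypothetical.


def live_lag_seconds(segment_s: float, encode_s: float, workers: int, segments: int) -> float:
    """Estimate how far the encoder falls behind the live feed.

    segment_s: duration of each video segment produced by the live feed
    encode_s:  time one worker needs to encode one segment
    workers:   encoders running in parallel
    segments:  number of segments produced so far
    """
    feed_time = segments * segment_s             # time the live feed took
    encode_time = segments * encode_s / workers  # time the pool needs for the same work
    return max(0.0, encode_time - feed_time)


# One worker that needs 3 s per 2 s segment falls further behind every segment.
print(live_lag_seconds(segment_s=2.0, encode_s=3.0, workers=1, segments=100))
# Two such workers keep up with the feed.
print(live_lag_seconds(segment_s=2.0, encode_s=3.0, workers=2, segments=100))
```

In this model, extra capacity has to be available the moment the event starts, which is a poor fit for borrowed idle instances.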

Open Connect CDN

As Netflix grew, more and more people started streaming it, and ISPs around the world felt the bandwidth load of delivering the content.

To remediate the issue, Netflix created its own content delivery mechanism, Open Connect CDN.

Open Connect acts as a typical CDN: it resides at edge locations and serves cached content to local requests using OCAs. An OCA, or Open Connect Appliance, is a device with 350 TB of disk storage.

More than 8,000 of them are deployed with different ISPs.

Open Connect appliances sit either at an ISP site or at an internet exchange point, caching newly popular content nightly and serving it to clients in coordination with PlayAPI.

Open Connect Appliance (OCA) architecture

The issue with OCAs is that they are optimized to serve pre-recorded, cached content. Live streams, however, need to be delivered in real time, with nothing cached ahead of time.
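The mismatch can be illustrated with a toy edge cache. The `EdgeCache` class and the key names are hypothetical and are not Open Connect’s real design; the sketch only shows that a nightly pre-fill helps pre-recorded titles, while every new live segment starts as a miss that must be fetched upstream.

```python
# Toy edge cache: nightly pre-fill helps pre-recorded titles, not live segments.
# EdgeCache and the key names are hypothetical, not Open Connect's real design.


class EdgeCache:
    def __init__(self) -> None:
        self.store: set[str] = set()
        self.upstream_fetches = 0

    def nightly_fill(self, popular_titles: list[str]) -> None:
        """Copy popular pre-recorded content during off-peak hours."""
        self.store.update(popular_titles)

    def get(self, key: str) -> str:
        """Serve from local storage if present; otherwise fetch upstream first."""
        if key in self.store:
            return "hit"
        self.upstream_fetches += 1  # extra load on the upstream servers
        self.store.add(key)
        return "miss"


cache = EdgeCache()
cache.nightly_fill(["movie-a", "standup-b"])

print(cache.get("movie-a"))  # pre-recorded and pre-filled
# Live segments did not exist last night, so each new one starts as a miss.
for n in range(3):
    print(cache.get(f"live-fight/segment-{n}"))
print("upstream fetches:", cache.upstream_fetches)
```

With a new segment every few seconds and thousands of appliances, those first-request misses add up to a steady stream of upstream traffic that pre-recorded content never generates.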

This would not have been an issue at a smaller scale; at this scale of traffic, however, the infrastructure was just not sufficient.

As the number of viewers increased, OCAs sent more and more requests to Netflix’s streaming servers to route users to a more optimal OCA nearer to them.

This smart routing technique significantly increased the load on the streaming servers. Performance degraded as they processed millions of requests, and the servers ultimately failed under the high traffic volume.
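A routing decision of this kind can be sketched as follows. The `Oca` class, the `steer` function, and the distance and capacity fields are all hypothetical; Netflix’s real steering logic is not public in this form. The sketch picks the nearest appliance that still has room and falls back to farther ones as the near ones fill up.

```python
# Minimal sketch of "route the viewer to the nearest appliance with capacity".
# Oca, steer, and all field names are hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Oca:
    name: str
    distance_km: float
    capacity: int
    active: int = 0


def steer(ocas: list[Oca]) -> Optional[Oca]:
    """Pick the nearest appliance that still has room, or None if all are full."""
    available = [o for o in ocas if o.active < o.capacity]
    if not available:
        return None
    nearest = min(available, key=lambda o: o.distance_km)
    nearest.active += 1
    return nearest


fleet = [Oca("isp-local", 5.0, capacity=2), Oca("regional-ixp", 120.0, capacity=1)]
for viewer in range(4):
    chosen = steer(fleet)
    print(viewer, chosen.name if chosen else "no capacity left")
```

Every call to `steer` is work for the central service, so when millions of viewers are re-routed at once, the routing layer itself can become the bottleneck.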

PlayAPI: Load-Shedding Protocol

Netflix implements a prioritized load-shedding mechanism that stops responding to non-critical requests in order to keep serving critical ones.

Netflix runs on a microservices model, with separate services for tasks such as streaming video, adding new users, and storing watch-time metrics.

Naturally, some services, like video streaming, have a higher priority than others, like log writing. During high traffic, lower-priority requests get shed in order to keep the user experience consistent.
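Prioritized load shedding can be sketched in a few lines. The priority names, the example services attached to them, and the utilization thresholds are assumptions made for illustration, not Netflix’s actual configuration.

```python
# Minimal sketch of priority-based load shedding.
# Priority names and thresholds are illustrative assumptions.
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0     # e.g. starting or continuing video playback
    DEGRADED = 1     # e.g. features whose loss degrades the experience
    BEST_EFFORT = 2  # e.g. watch-time metrics
    BULK = 3         # e.g. log writes


# Utilization (0.0 to 1.0) at which each class starts being rejected.
SHED_THRESHOLDS = {
    Priority.BULK: 0.60,
    Priority.BEST_EFFORT: 0.75,
    Priority.DEGRADED: 0.90,
}


def should_shed(priority: Priority, utilization: float) -> bool:
    """Return True if a request of this priority should be rejected."""
    threshold = SHED_THRESHOLDS.get(priority)
    if threshold is None:  # CRITICAL has no threshold, so this rule never sheds it
        return False
    return utilization >= threshold


for load in (0.50, 0.70, 0.95):
    shed = [p.name for p in Priority if should_shed(p, load)]
    print(f"utilization {load:.0%}: shedding {shed or 'nothing'}")
```

The scheme works well as long as the critical class is a small share of the traffic. When nearly every request is a critical playback request, as in a live event, there is little left to shed.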

Writes to the Open Connect CDN are more critical than reads, and writes cannot be shed.

However, this prioritization falls flat for live streaming, where writing to the CDN and reading from it should have the same priority.

As traffic on the streaming server increases, the server begins to shed incoming requests in an attempt to prioritize those already in the queue.

This likely resulted in a feedback loop: rejected requests were retried and added to the congestion, leaving a high number of users unable to access the live stream.
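One plausible way such a feedback loop forms is through retries. The simulation below is a deliberately simple sketch with made-up numbers: a server with fixed capacity rejects the excess requests, and a share of the rejected clients tries again on the next tick.

```python
# Toy simulation of a retry-driven feedback loop. All numbers are made up.


def simulate(capacity: int, new_per_tick: int, retry_ratio: float, ticks: int) -> list[float]:
    """Return the offered load (new requests plus retries) at each tick."""
    offered_history = []
    retries = 0.0
    for _ in range(ticks):
        offered = new_per_tick + retries
        served = min(offered, capacity)
        rejected = offered - served
        retries = rejected * retry_ratio  # rejected clients that try again next tick
        offered_history.append(offered)
    return offered_history


# Below capacity: the offered load stays flat.
print(simulate(capacity=100, new_per_tick=90, retry_ratio=1.0, ticks=5))
# Slightly above capacity: retries pile up and the offered load keeps growing.
print(simulate(capacity=100, new_per_tick=120, retry_ratio=1.0, ticks=5))
```

In the second run, new demand exceeds capacity by only 20 requests per tick, yet the offered load climbs from 120 to 200 within five ticks because every rejected request comes back.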

Closing notes

In conclusion, the surge in traffic, combined with scaling challenges in key systems, led to widespread issues.

I believe it was still a good learning experience for Netflix as it moves into the live streaming landscape.

The entire debacle reminds me of a quote.

In every system, there’s always a problem to fix and an opportunity to improve.

Vince Lombardi

Gratitude

If you read this article to the end, I want to say thank you.

If you found the article well-researched and informative, I hope you’ll stick around for my future blog posts, because I think we share great taste in the future of tech.

See you next time. Ciao.
