DevOps as the road to profitability

Netflix released a great earnings report earlier this week. According to the Wall Street Journal’s page-one article on October 18th,

“Netflix Inc. blew through its forecast for the subscriber additions in the September quarter…sending its shares soaring 20% in after-hours trading. … The better-than-expected performance came mainly in international markets, where the company has completed a massive, near global expansion this year.”

For anyone who reads the DevOps literature, this success doesn’t come as a surprise. Rapid testing and provisioning are the name of the game at Netflix. Puppet’s 2016 State of DevOps Report notes that Netflix is among the top DevOps performers:

High [DevOps] performers deploy on demand, with Etsy deploying 80 times per day, and large companies like Amazon or Netflix deploying thousands of times per day.

So how tightly are profitability and DevOps excellence intertwined? One way to approach the question is to look at how minimizing downtime is key to a company’s continued profitability.

Wall Street results and DevOps profitability

One cannot deny that Reed Hastings’s leadership and his vision for original content like Orange Is the New Black and House of Cards have contributed greatly to the company’s profitability. However, you cannot divorce great content from the operations that produce it. That would be like NASA trying to get to the moon without great engineers behind it. Anyone want to think about how far the rockets would go before they came crashing down to earth?

So, when considering what has made Netflix so successful, it is hard to ignore how closely that success tracks the company’s best-in-class status in the DevOps world. And as the company looks to expand its global footprint to reach more customers in South America and Asia, its ability to scale and seamlessly deploy content will be key to that expansion.

The secret to Netflix’s success is a mixture of storing data on AWS servers and the Netflix Simian Army. By using AWS servers, Netflix can “quickly deploy thousands of servers and terabytes of storage within minutes”. The Simian Army is a practiced series of deliberately induced outages and failures. By using it, Netflix has prepared its developers to handle situations where unexpected failures (or monkeys) creep into code, delivery or deployment. This Army mindset allows Netflix to keep streaming even when some of the Amazon servers it relies on are down.

DevOps at Netflix – automating failure

In a blog post from earlier this year, Netflix developers wrote:

“It wasn’t long ago that 16-minutes from commit to deployment was a dream, but as other parts of the system have gotten faster, this now feels like an impediment to rapid innovation.”

At 16 minutes from commit to deployment, Netflix is experiencing 90 deploys per day. By comparison, medium IT performers, according to Puppet’s report, deploy only between once per week and once per month. Now, Netflix is deploying over 100 times per day and constantly testing its software.
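One way to read the 90 figure is as simple arithmetic: a day has 1,440 minutes, and 1,440 / 16 = 90. The sketch below works through that calculation and the comparison with medium performers. It assumes deploys run back to back through a single pipeline, which is an illustrative assumption and not something the Netflix post states. The function name and the 30-day month are likewise hypothetical choices made for this example.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes in a day


def max_sequential_deploys_per_day(commit_to_deploy_minutes: float) -> float:
    """Upper bound on daily deploys if one pipeline runs back to back (assumption)."""
    return MINUTES_PER_DAY / commit_to_deploy_minutes


# 16 minutes from commit to deployment, as quoted above.
fast_pipeline = max_sequential_deploys_per_day(16)

# Medium performers per the Puppet report: once per week to once per month.
# A 30-day month is assumed here for simplicity.
weekly_rate = 1 / 7    # deploys per day
monthly_rate = 1 / 30  # deploys per day

print(f"16-minute pipeline: {fast_pipeline:.0f} deploys per day")
print(f"vs. weekly deployer: {fast_pipeline / weekly_rate:.0f}x more frequent")
print(f"vs. monthly deployer: {fast_pipeline / monthly_rate:.0f}x more frequent")
```

Under these assumptions, the gap between a 16-minute pipeline and a weekly or monthly release cycle is two to three orders of magnitude.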

Behind this success is the notion of automating failure. As C. Aaron Cois of Carnegie Mellon wrote:

“Since there are so many components that have to work together to provide reliable video streams to customers across a wide range of devices, Netflix engineers concluded that the only way to be comfortable handling failure is to constantly practice failing.”

To practice failure, Netflix engineers set out to automate it. Rather than waiting for failure to happen, Netflix injects failure into the system so that engineers can find ways to make code work in an environment of unreliability and unexpected outages – kind of like the real world.

Alerting is everybody’s business

By constantly injecting failure into the system, Netflix makes sure its engineers know that they can and will be held accountable for any failures their code experiences. This injection of failure, known as the Netflix Simian Army, ranges in levels of chaos from the Chaos Monkey – random disabling of production instances without any customer impact – to the Chaos Gorilla – an outage of an entire Amazon availability zone.
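To make the two levels of chaos concrete, here is a minimal sketch of the idea using a purely in-memory fleet. It is not Netflix’s tooling and calls no AWS API. The names `Instance`, `build_fleet`, `chaos_monkey`, `chaos_gorilla` and `service_available`, as well as the zone labels, are hypothetical and exist only for this illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    zone: str
    healthy: bool = True


def build_fleet(zones, per_zone):
    """Create a simulated fleet with the same number of instances in each zone."""
    return [Instance(f"{zone}-{i}", zone) for zone in zones for i in range(per_zone)]


def chaos_monkey(fleet, rng):
    """Disable one randomly chosen healthy instance and return it (None if none left)."""
    candidates = [inst for inst in fleet if inst.healthy]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def chaos_gorilla(fleet, zone):
    """Disable every healthy instance in one zone; return how many were taken down."""
    hit = [inst for inst in fleet if inst.zone == zone and inst.healthy]
    for inst in hit:
        inst.healthy = False
    return len(hit)


def service_available(fleet):
    """In this toy model the service stays up while any instance is healthy."""
    return any(inst.healthy for inst in fleet)


fleet = build_fleet(["zone-a", "zone-b", "zone-c"], per_zone=3)
rng = random.Random(42)  # seeded so a drill can be replayed

victim = chaos_monkey(fleet, rng)
lost = chaos_gorilla(fleet, "zone-b")

print(f"Chaos Monkey disabled {victim.instance_id}")
print(f"Chaos Gorilla took down {lost} instance(s) in zone-b")
print(f"Service still available: {service_available(fleet)}")
```

The point of the sketch is the design pressure it creates: a service that only survives when every instance is healthy fails the first drill, so redundancy across zones has to be built in from the start.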

By building high levels of performance into the code, Netflix has incentivized developers to “build fault-tolerant systems to make their day-to-day job as developers less frustrating.” That is, make sure your code is resilient, or you are going to get lots of alerts about it not working.

Critical alerting best practices

Not every company has hundreds of DevOps engineers and millions of dollars to simulate this level of robustness. However, war games of this type are still worth implementing, even on a smaller scale. Simulated failures give engineers the chance to experience failure in a practice environment, when the stakes are low.

Group leaders could integrate the simulated failures with OnPage messaging so that the engineer is messaged on their smartphone and learns how to react and proceed in the case of a critical event. Think of it as a fire drill for system failures. Wouldn’t that make the process of reacting to a real failure much less painful?
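A fire drill of this kind could be sketched roughly as follows. This is an illustration under stated assumptions, not an OnPage integration: `fake_send_page` is a hypothetical stand-in for whatever paging service a team actually uses, and `Page`, `DrillLog`, `run_fire_drill`, `acknowledge` and `unacknowledged` are names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    recipient: str
    message: str
    is_drill: bool
    acknowledged: bool = False


@dataclass
class DrillLog:
    pages: List[Page] = field(default_factory=list)


def fake_send_page(page: Page) -> None:
    """Hypothetical stand-in for a real paging service; it only prints."""
    print(f"Paging {page.recipient}: {page.message}")


def run_fire_drill(on_call: str, scenario: str,
                   send_page: Callable[[Page], None], log: DrillLog) -> Page:
    """Send a clearly labeled drill page to the on-call engineer and record it."""
    page = Page(recipient=on_call, message=f"[DRILL] {scenario}", is_drill=True)
    send_page(page)
    log.pages.append(page)
    return page


def acknowledge(page: Page) -> None:
    """Mark a page as acknowledged by the engineer."""
    page.acknowledged = True


def unacknowledged(log: DrillLog) -> List[Page]:
    """Pages nobody has responded to yet; these are the follow-ups after a drill."""
    return [p for p in log.pages if not p.acknowledged]


log = DrillLog()
page = run_fire_drill("on-call engineer", "simulated database outage",
                      fake_send_page, log)
print(f"Awaiting response: {len(unacknowledged(log))}")
acknowledge(page)
print(f"Awaiting response: {len(unacknowledged(log))}")
```

Prefixing the message with a drill label keeps the exercise from being mistaken for a real incident, while the log of unacknowledged pages shows where the response process needs practice.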

Shawn Lazarus
