Netflix/Amazon Outage on December 24th

Note: I know I haven’t been consistent with blogging over the last year or so; it’s been crazy busy for me professionally and personally. Lots of exciting things are going on, plus I relocated to the San Francisco Bay Area for a fantastic job opportunity. I also launched a small consulting business last year to help out a few friends and my former employer after I left. Overall, though, I’m hoping to blog at least weekly on some topic, whether personal, tech related, or just of general interest.

Netflix Outage on Christmas Eve

I’m shocked at how much press attention the Netflix outage on Christmas Eve has gathered. It’s not like this is the first time that either Amazon or Netflix has had an outage, nor, sadly, will it be the last. I think a large part of the scorn is that it hit Netflix at an unfortunate time, when a lot of their users actually wanted to use the service for Christmas specials, holiday traditions, etc.

Most interesting in all of this is the blame that Amazon is taking for the outage. While they did cause the initial issue, it is Netflix’s job to resolve it and prevent future outages. (Amazon blog post on the outage: http://aws.amazon.com/message/680587/) The cloud is fantastic: it has lowered the barrier to entry for a ton of startups and provided Amazon scale to companies that want to focus on the software, not the infrastructure. This power, though, comes with greater complexity and business challenges that must be addressed. (Spider-Man’s “With great power comes great responsibility” comes to mind.)

When you rely on equipment, services, and providers that you ultimately have little control over, you must plan for failure. You must assume that any component you rely on can disappear at any moment or perform at a suboptimal level, and that there is nothing you can do to prevent it. At my last employer we hit several of the “Amazon oops” moments. First, we had our application deployed in only one region/Availability Zone. This is the same thing as running all of your servers and systems in one datacenter, supported by a single provider/telco/utility/etc. It’s a huge Single Point of Failure (SPOF).

Next we moved to a single region with multiple Availability Zones. While this was a nice improvement, it still bit us when an Amazon technician made a human error that killed EBS and storage in the East region. Suddenly we realized that while Amazon advertises that each Availability Zone is independent of the others at the hardware level, the control tier and shared back-end services can be, and are, shared across multiple AZs. (In reality, judging from some of the recent RCAs, some of the control tier is global in nature.)

It of course took code changes and infrastructure enhancements for us to tackle multiple AZs in a single region. Some of the things you may need to do in your application are:

  1. A read/write master/slave replication strategy (NoSQL and relational databases all have varying ways to accomplish this, depending on your software)
  2. Load balancing of incoming user traffic between the AZs (if active/active)
  3. Application awareness of databases and database states. Some of this is handled by the database drivers (e.g., Mongo), although implementations are a mixed bag and must be heavily tested
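
To make the load-balancing and state-awareness items a bit more concrete, here is a minimal Python sketch of a balancer that only routes to healthy backends and can take an entire AZ out of rotation. Everything in it (the `Backend` and `ZoneAwareBalancer` names, the zone labels, the in-memory health flag) is hypothetical and purely illustrative; a real deployment would rely on a load balancer or driver with actual health checks rather than on code like this.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """One application or database node (hypothetical model)."""
    name: str
    zone: str
    healthy: bool = True


class ZoneAwareBalancer:
    """Round-robin over whichever backends are currently healthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._counter = 0

    def mark_zone_down(self, zone):
        # Simulates monitoring deciding that a whole AZ is unusable.
        for backend in self.backends:
            if backend.zone == zone:
                backend.healthy = False

    def pick(self):
        healthy = [b for b in self.backends if b.healthy]
        if not healthy:
            raise RuntimeError("no healthy backends in any zone")
        backend = healthy[self._counter % len(healthy)]
        self._counter += 1
        return backend
```

With two backends in each of two zones, calling `mark_zone_down` on one zone leaves `pick()` cycling through the two survivors in the other zone. The hard part in practice is the piece this sketch assumes away: deciding reliably that a zone is actually down.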

The above adds cost in complexity, testing, load testing, network design, and the thought that goes into how the system is developed and deployed. Once you realize that this still isn’t good enough, you start talking about multiple regions and/or resilience across clouds (Amazon and Rackspace, etc.). This adds new complexities that you must now factor in:

  1. Global load balancing, or an enhanced round-robin DNS service, now becomes a must
  2. Latency between sites increases, and synchronous commits rapidly become damaging to performance
  3. Application complexity increases 3x with 2 regions/clouds; expect a 4-5x increase if you add more
  4. Issues must be actively detected and diagnosed by monitoring, and the affected nodes/systems isolated, as the number of users impacted could be small or large, or worse, impossible to detect
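
The latency point is easy to see with some back-of-the-envelope arithmetic. The sketch below is only that: it assumes a single writer issuing commits one after another, where each commit has to wait for a remote replica to acknowledge before the next one starts. The function name and the millisecond figures are assumptions for illustration, not measurements from any real system.

```python
def max_serial_commits_per_second(local_commit_ms, replica_rtt_ms):
    """Upper bound on back-to-back synchronous commits for one writer.

    Each commit costs the local commit time plus one round trip to the
    replica that must acknowledge it.
    """
    return 1000.0 / (local_commit_ms + replica_rtt_ms)


# Assumed numbers: 2 ms local commit, 1 ms round trip within a region,
# 80 ms round trip between distant regions.
same_region = max_serial_commits_per_second(2.0, 1.0)
cross_region = max_serial_commits_per_second(2.0, 80.0)
print(f"same region:  {same_region:.1f} commits/s")
print(f"cross region: {cross_region:.1f} commits/s")
```

Under those assumed numbers the per-writer ceiling drops from roughly 333 commits per second to roughly 12, which is why cross-region designs tend to lean on asynchronous replication and accept some replication lag.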

I give lots of kudos to Netflix for the Chaos Monkeys; not a lot of people have the stomach to have a “rogue” agent in their systems breaking stuff on purpose and testing their resiliency. But as more and more companies move to the cloud, the practice must become more common, at least in lab environments. (http://techblog.netflix.com/2011/07/netflix-simian-army.html)
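
For a feel of what a lab-only version of this could look like, here is a toy Python sketch. To be clear, this is not Netflix’s Chaos Monkey or its API; `LabFleet` and `chaos_round` are hypothetical names, and the “fleet” is just an in-memory dictionary. The idea is simply to kill one randomly chosen instance and then check that the service still has capacity.

```python
import random


class LabFleet:
    """In-memory stand-in for instances running in several zones."""

    def __init__(self, instances_by_zone):
        self.running = {
            zone: set(names) for zone, names in instances_by_zone.items()
        }

    def terminate(self, zone, name):
        self.running[zone].discard(name)

    def is_serving(self):
        # The service counts as up while any zone has an instance left.
        return any(self.running.values())


def chaos_round(fleet, rng):
    """Terminate one randomly chosen running instance and return it."""
    candidates = [
        (zone, name)
        for zone, names in sorted(fleet.running.items())
        for name in sorted(names)
    ]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    fleet.terminate(*victim)
    return victim
```

Passing in a seeded `random.Random` makes a failing round reproducible, which helps when a lab run uncovers something worth debugging.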

Global scale, once the domain of tech giants (Yahoo, Google, Microsoft, Amazon), is now available to the masses. Of course, finding tech talent who have dealt with scale at this level is difficult, and/or they’re pretty happy at the companies they work for. The DevOps community is a huge help in this area, with folks sharing their infrastructure, war stories, solutions for scale, and of course a relentless pursuit of metrics and automation that allows the complexity of this scale to become manageable.

Check back for future posts on DevOps culture, hiring, global scale, etc.! Plus, you guys can keep me honest on posting at least weekly.

Interested in the Netflix/Amazon outages? Check out these blogs:

Amazon Post Mortem

Adrian Cockcroft’s (Netflix) analysis of issue

3 thoughts on “Netflix/Amazon Outage on December 24th”

  1. Netflix had another outage on NY eve too (they tried to claim only impacted DVD site but I saw others say otherwise). Eager to find out the cause of that one. I was down in Orange county last week visiting with family, I talked with my sister briefly about Netflix (she is less up to speed on the latest tech). I bought her a Bluray player that among other things has a big fat netflix button on it. She said she hated netflix and liked blockbuster more. She tried the Netflix streaming trial and said she hated the lack of selection (same reason I stopped netflix streaming), and said the option to stream and dvd was not worth it, she’d rather get stuff from local blockbuster still. She also doesn’t like redbox at least the ones in her area are always out of stock on what she wants to see. Blockbuster express is still tiny by comparison closest one is about 20 miles away. For me I usually just buy the stuff I want, I bought a shitton of bluray and dvds from bestbuy.com over the holidays (mainly black friday), they had some amazing deals. Ordered a few more last night a couple of the resident evil movies and first two seasons of breaking bad which a friend suggested I would like (never seen it myself). $14/ea for seasons of BB, and $10/ea for Resident evil – for bluray to me at least those are good prices. Normally I buy most of my media from buy.com – but at least this year and I think last too – they did not compare to bestbuy.

  2. Yeah the selection of Netflix streaming does need some work. I haven’t found anyone who dislikes redbox though, but I can understand if they never have what you want. I believe though you can reserve movies and receive a txt when they are returned so you can pick them up. Plus, the way those redbox machines are popping up all over the place I assume that demand will eventually be met. I remember being just as frustrated at Blockbuster and Hollywood video when they were out of the 200 copies of X popular movie I wanted to watch. I actually like Amazon Streaming as a nice alternative to Netflix, and Hulu+ isn’t a bad option either if you’re looking for recent TV.

  3. yeah redbox has been ok for me (same for blockbuster express) though so-far have only used them while travelling. There’s a redbox outside a 7-11 near the hotel I stay at when I’m in Bellevue, so it’s nice. Used blockbuster express a couple times while in Cannon beach, OR. I haven’t been to a regular blockbuster since probably the early 90s, same goes for other video rental stuff. With so much tivo usage over the years I really lose track of what is coming out and stuff, I see previews now and then that look interesting but by the time they come out I’ve usually forgotten about them! The movie Ted is one example, though I happened to stumble upon the bluray of it a few weeks ago and have it at home now (haven’t watched it yet). You see my email about drinks on SAT ???
