Book Review: The Phoenix Project

Just finished reading Gene Kim, Kevin Behr, and George Spafford’s new book, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win. Gene Kim is well known for The Visible Ops Handbook: Implementing ITIL in 4 Practical and Auditable Steps, which came out during the early days of the ITIL movement and was my first real deep dive into change management. The new book is written in a style similar to The Five Dysfunctions of a Team, which is a “leadership fable”.

The Phoenix Project follows a fictional auto parts supply company called Parts Unlimited. It opens with Bill, the main character, being promoted by the CEO, with little guidance, to VP of IT Operations after the CIO and CTO are fired, and with a critical payroll issue that must be fixed ASAP. Bill’s prior experience at Parts Unlimited is running IT Operations for the “Midrange Server Group”, but he has had no oversight of the distributed or helpdesk operations that are now under his purview.

Several IT caricatures are represented in the book: your typical process guru/ITIL person, your overconfident/arrogant IT manager, the obstructive IT security team, etc. While character development isn’t a strong suit of this book, I was easily able to see links to people that I’ve worked with in the past.

The first half of the book consists of a ton of outages and problems caused by this completely mismanaged IT organization. Familiar statements like “IT is in the way”, “IT screws up all the time”, etc. are all represented. They finally start getting on the right track when a new board advisor starts coaching Bill on identifying the types of work in the IT shop and relating them to factory floor operations. This drives the team into implementing change management, Kanban methodologies for workflow, and eventually continuous deployment, with even a few mentions of being “allspawed”, which I didn’t realize had become a verb. But kudos to John Allspaw over at Etsy, you’ve crossed over to the other side.

Overall the book is quite good, especially for teams that haven’t embraced ITIL, Lean Manufacturing, and DevOps in their culture and business processes. I really liked the setup of the book; the first two chapters, which I read after Velocity last year when Gene published them as an early preview, are excellent. But I felt a lot of unnecessary time was spent showing how piss poor the operations were and not enough on the solutions. I had hoped for more detail on Kanban processes, ITIL, and DevOps practices, but instead they were relegated to a single chapter, and the complexities of setting up this infrastructure were glossed over.

There is one character in the book, Brent, that I actually found to be quite annoying. He is the “know it all” IT guy caricature, and of course all changes and major sev 1 issues always require Brent to be involved to get the issue fixed. At numerous points the management team puts in processes to elevate him as their “most critical resource” and to limit the work coming into his area of expertise. This is good, and you should follow similar processes to make sure the constrained resources in your group have clear work alignment and goals. He may not have been malicious in retaining tribal knowledge that only he had, but several times I would have leaned towards firing him. Ultimately I never felt that he was a team player; he was protecting his base of knowledge because he liked being the hero. Maybe I’m alone in this, and with the limited character development I shouldn’t get hung up on it, but it’s probably the one piece of the book that I felt was counter to the culture of the DevOps movement.

Overall, I liked the book, and I’ll be sending it to some friends as gifts in the future, as there is a lot of good stuff in there. But I don’t know if it will become a favorite of mine the way The Visible Ops Handbook is. I’m excited to see the next collaboration from Gene Kim and his team at IT Revolution, the DevOps Cookbook. I think what I was hoping to get out of The Phoenix Project will end up in the cookbook, and then this may be the perfect pairing! Rating: B+

Gene Kim and his coauthors are all excellent people to follow on Twitter and on their respective blogs.

IT Revolution Website http://www.itrevolution.com

Gene Kim’s site: http://www.realgenekim.me/  or twitter @realgenekim

Kevin Behr’s site: http://www.kevinbehr.com/

George Spafford twitter: @gspaff

 

 

Netflix/Amazon Outage on December 24th

Note: I know I haven’t been consistent with the blogging this last year or so; it’s been crazy busy for me professionally and personally. Lots of exciting things going on, plus I relocated to the San Francisco Bay Area for a fantastic job opportunity. I also launched a small consulting business last year to help out a few friends and my former employer after I left. Overall though, I’m hoping to blog at least weekly on some topic, whether personal, tech related, or just of general interest.

Netflix Outage on Christmas Eve

I’m shocked at how much press attention the Netflix outage on Christmas Eve has garnered. It’s not like this is the first time that either Amazon or Netflix has had an outage, nor, sadly, will it be the last. I think a large part of the scorn is that it hit Netflix at an unfortunate time, when a lot of their users actually wanted to use the service for Christmas specials, holiday traditions, etc.

Most interesting in all of this is the blame that Amazon is taking for the outage. While they did cause the initial issue, it is Netflix’s job to resolve it and prevent future outages. (Amazon blog post on the outage: http://aws.amazon.com/message/680587/) The cloud is fantastic: it has lowered the barrier to entry for a ton of startups and provided Amazon scale to companies who want to focus on the software, not the infrastructure. This power, though, comes with greater complexity and business challenges that must be addressed. (Spider-Man’s “With great power comes great responsibility” comes to mind.)

When you rely on equipment, services, and providers that you ultimately have little control over, you must plan for failure. You must assume that any component you rely on could be gone at any moment, or could perform at a suboptimal level, and that there is nothing you can do to prevent it. At my last employer we hit several of the “Amazon oops” moments. First, we had our application deployed in only one region and Availability Zone (AZ). This is the same thing as running all of your servers and systems in one datacenter, supported by a single provider/telco/utility/etc. It’s a huge Single Point of Failure (SPOF).
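As a minimal sketch of what “assume it will be gone or slow” can look like in code, the helper below wraps a call to a dependency with a timeout and a fallback. The names `call_with_fallback`, `slow_dependency`, `broken_dependency`, and `cached_answer` are hypothetical and only for illustration; a real service would likely add retries, circuit breaking, and metrics on top of this.

```python
import concurrent.futures
import time


def call_with_fallback(primary, fallback, timeout_s):
    """Run primary(); if it raises or takes longer than timeout_s, use fallback()."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both a failed call and a call that timed out.
        return fallback()
    finally:
        # Do not block the caller while a hung call finishes in the background.
        pool.shutdown(wait=False)


# Hypothetical stand-ins for a remote dependency and a degraded local answer.
def slow_dependency():
    time.sleep(0.3)  # simulates a provider performing at a suboptimal level
    return "live"


def broken_dependency():
    raise ConnectionError("zone unavailable")  # simulates a provider that is gone


def cached_answer():
    return "cached"


if __name__ == "__main__":
    print(call_with_fallback(slow_dependency, cached_answer, timeout_s=0.05))
    print(call_with_fallback(broken_dependency, cached_answer, timeout_s=1.0))
```

The point of the design is that the caller always gets an answer within a bounded time, even if it is a degraded one.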

Next we moved to a single region with multiple Availability Zones. While this is a nice improvement, it still bit us when an Amazon technician made a human error that killed EBS and storage in the East region. Suddenly we realized that while Amazon advertises each Availability Zone as independent of the others at the hardware level, the control tier and shared back-end services can be, and are, shared across multiple AZs. (In reality, judging from some of the recent RCAs, some of the control tier is global in nature.)

It of course took code changes and infrastructure enhancements for us to tackle multiple AZs in a single region. Some of the things you may need to do in your application are:

  1. Read/write master/slave replication strategy (NoSQL and relational databases all have varying ways to accomplish this, depending on your software)
  2. Traffic load balancing between each AZ for incoming user traffic (if active/active)
  3. Application awareness of databases and database state. Some of this is handled by the database drivers (e.g., Mongo), although implementations are a mixed bag and must be heavily tested
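To make the first and third items a little more concrete, here is a rough sketch of application-side routing: writes go to the healthy primary, and reads prefer a healthy replica in the caller’s own AZ. Everything here (`DbNode`, `pick_node`, the node names and AZ labels) is hypothetical; real drivers expose this through their own settings, and their behavior should be tested rather than assumed.

```python
from dataclasses import dataclass


@dataclass
class DbNode:
    name: str
    az: str
    role: str  # "primary" or "replica"
    healthy: bool = True


def pick_node(nodes, operation, local_az):
    """Choose a node for a "read" or "write" based on role, health, and AZ."""
    healthy = [n for n in nodes if n.healthy]
    if operation == "write":
        primaries = [n for n in healthy if n.role == "primary"]
        if not primaries:
            raise RuntimeError("no healthy primary; writes must wait for failover")
        return primaries[0]
    # Reads: prefer a replica in the caller's AZ, then any replica, then any healthy node.
    replicas = [n for n in healthy if n.role == "replica"]
    local = [n for n in replicas if n.az == local_az]
    candidates = local or replicas or healthy
    if not candidates:
        raise RuntimeError("no healthy database nodes")
    return candidates[0]


if __name__ == "__main__":
    cluster = [
        DbNode("db-a", "az-a", "primary"),
        DbNode("db-b", "az-b", "replica"),
        DbNode("db-c", "az-c", "replica"),
    ]
    print(pick_node(cluster, "read", "az-b").name)   # local replica
    cluster[1].healthy = False                       # simulate losing az-b storage
    print(pick_node(cluster, "read", "az-b").name)   # falls back to another AZ
    print(pick_node(cluster, "write", "az-b").name)  # writes still go to the primary
```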

All of the above adds cost in complexity, testing, load testing, network design, and the thought that goes into how the system is developed and deployed. Once you realize that this still isn’t good enough, you start talking about multiple regions and/or multiple cloud providers (Amazon and Rackspace, etc.). This adds new complexities that you now must factor in:

  1. Global load balancing, or an enhanced round robin DNS service, now becomes a must
  2. Increased latency between sites; synchronous commits rapidly become damaging to performance
  3. Application complexity increases 3x with 2 regions/clouds, but expect a 4-5x increase in complexity if you add more
  4. Active monitoring becomes essential: issues must be detected and diagnosed, and the affected nodes/systems isolated, since the number of users impacted could be small or large, or worse, the problem could be impossible to detect
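A bit of back-of-the-envelope arithmetic shows why synchronous commits hurt so quickly once sites are far apart. If a single writer has to wait one network round trip for each commit to be acknowledged by the remote site, its ceiling is roughly 1000 / RTT(ms) commits per second. The round-trip times below are illustrative assumptions, not measurements of any provider, and `max_serial_commits_per_second` is a hypothetical helper.

```python
def max_serial_commits_per_second(round_trip_ms):
    """Upper bound for one writer that waits a full round trip per commit."""
    return 1000.0 / round_trip_ms


if __name__ == "__main__":
    # Assumed round-trip times, for illustration only.
    scenarios = [("same AZ", 1.0), ("between AZs", 2.0), ("between regions", 80.0)]
    for label, rtt_ms in scenarios:
        rate = max_serial_commits_per_second(rtt_ms)
        print(f"{label}: {rtt_ms} ms round trip, at most {rate:.1f} commits/sec per writer")
```

Parallel writers and batching raise the total, but every individual transaction still pays the round trip, which is why asynchronous replication is usually what people end up considering across regions.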

I give lots of kudos to Netflix for Chaos Monkey; not a lot of people have the stomach to have a “rogue” agent in their systems breaking stuff on purpose and testing their resiliency. But as more and more companies move to the cloud, the practice must become more common, at least in lab environments. (http://techblog.netflix.com/2011/07/netflix-simian-army.html)
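For a lab environment, the idea can be sketched in a few lines: pick a running instance at random, kill it, and check whether enough capacity survives. This is a toy illustration of the concept, not Netflix’s tool; `unleash_monkey`, the instance names, and the in-memory dictionary standing in for a real fleet are all hypothetical.

```python
import random


def unleash_monkey(instances, min_healthy, rng=None):
    """Terminate one random running instance and report whether capacity held.

    instances maps an instance name to True (running) or False (terminated).
    Returns (victim_name, capacity_ok); victim_name is None if nothing was running.
    """
    rng = rng or random.Random()
    running = [name for name, up in instances.items() if up]
    if not running:
        return None, False
    victim = rng.choice(running)
    instances[victim] = False  # stand-in for a real terminate call in the lab
    survivors = sum(1 for up in instances.values() if up)
    return victim, survivors >= min_healthy


if __name__ == "__main__":
    lab = {"web-1": True, "web-2": True, "web-3": True}
    victim, ok = unleash_monkey(lab, min_healthy=2)
    print(f"terminated {victim}; capacity still ok: {ok}")
```

In a real experiment the check after the kill would be the service’s own health checks and user-facing metrics rather than a simple instance count.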

Global scale, once the domain of the tech giants (Yahoo, Google, Microsoft, Amazon), is now available to the masses. Of course, finding tech talent who have dealt with scale at this level is difficult, and/or they’re pretty happy at the companies they work for. The DevOps community is a huge help in this area, with folks sharing their infrastructure, war stories, solutions for scale, and of course a relentless pursuit of metrics and automation that allows the complexity of this scale to become manageable.

Check back for future posts on devops culture, hiring, global scale, etc!!  Plus you guys can keep me honest on posting at least weekly.

Interested in the Netflix/Amazon outages? Check out these blogs:

Amazon Post Mortem

Adrian Cockcroft’s (Netflix) analysis of issue