Sadly Posterous is shutting down on April 30th as they were acquihired by Twitter sometime last year.  It is really unfortunate when products you use and love get killed off by this sort of acquisition.

I will work on getting some more content up in the next few weeks as I get reacquainted with WordPress.

Just finished reading Gene Kim, Kevin Behr, and George Spafford’s new book, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win. Gene Kim is well known for The Visible Ops Handbook: Implementing ITIL in 4 Practical and Auditable Steps, which came out during the early days of the ITIL movement and was my first real deep dive into change management. The new book is written as a “leadership fable”, in a style similar to The Five Dysfunctions of a Team.

The Phoenix Project follows a fictional auto parts supply store called Parts Unlimited. It opens with Bill, the main character, being promoted by the CEO, with little guidance, to VP of IT Operations after the CIO and CTO are fired, and with a critical payroll issue that must be fixed ASAP. Bill’s prior experience at Parts Unlimited is running IT Operations for the “Midrange Server Group”; he has had no oversight of the distributed or helpdesk operations that are now under his purview.

Several IT caricatures are represented in the book: your typical process guru/ITIL person, your overconfident/arrogant IT manager, the obstructive IT security team, etc. While character development isn’t a strong suit in this book, I was easily able to see links to people that I’ve worked with in the past.

The first half of the book consists of a ton of outages and problems caused by this completely mismanaged IT organization. Familiar statements like “IT is in the way”, “IT screws up all the time”, etc. are all represented. They finally start getting on the right track when a new board advisor starts coaching Bill on identifying the types of work in the IT shop and relating them to factory floor operations. This drives the team into implementing change management, Kanban methodologies for workflow, and eventually continuous deployment, with even a few mentions of being “allspawed”, which I didn’t realize had become a verb. But kudos to John Allspaw over at Etsy, you’ve crossed over to the other side.

Overall the book is quite good, especially for teams that haven’t embraced ITIL, Lean Manufacturing, and DevOps in their culture and business processes. I really liked the setup of the book; the first two chapters, which I read after Velocity last year when Gene published them as an early preview, are excellent. I felt a lot of unnecessary time was spent showing how piss poor the operations were and not enough time on the solutions. I would have hoped for more detail on Kanban processes, ITIL, and DevOps practices, but instead they were relegated to a single chapter, and the complexities of setting up this infrastructure were a bit glossed over.

There is one character in the book, Brent, that I actually found to be quite annoying. He is the “know it all” IT guy caricature, and of course all changes and major sev 1 issues always require Brent to be involved to get the issue fixed. At numerous points the management team puts in processes to elevate him as their “most critical resource” and to limit the work coming into his area of expertise. This is good, and you should follow similar processes to make sure the constrained resources in your group have clear work alignment and goals. He may not have been malicious in his retention of tribal knowledge that only he knew, but several times I would have leaned towards firing him. Ultimately I never felt that he was a team player; he was protecting his base of knowledge because he liked being the hero. Maybe I’m alone in this, and with the limited character development I shouldn’t get hung up on it, but it’s probably the one piece of the book that I felt was counter-culture to the DevOps movement.

Overall, I liked the book, and I’ll be sending it to some friends as gifts in the future, as there is a lot of good stuff in there. But I don’t know if it will become a favorite of mine the way The Visible Ops Handbook is. I’m excited to see the next collaboration from Gene Kim and his team at IT Revolution, The DevOps Cookbook. I think what I was hoping to get out of The Phoenix Project will end up in the cookbook, and then this may be the perfect pairing! Rating: B+

Gene Kim and his coauthors are all excellent people to follow on Twitter and on their respective blogs.

IT Revolution Website http://www.itrevolution.com

Gene Kim’s site: http://www.realgenekim.me/  or twitter @realgenekim

Kevin Behr’s site: http://www.kevinbehr.com/

George Spafford twitter: @gspaff

 

 

Note: I know I haven’t been consistent with the blogging this last year or so; it’s been crazy busy for me professionally and personally. Lots of exciting things are going on, plus I relocated to the San Francisco Bay Area for a fantastic job opportunity. I also launched a small consulting business last year to help out a few friends and my former employer after I left. Overall, though, I’m hoping to blog at least weekly on some topic, whether personal, tech related, or just of general interest.

Netflix Outage on Christmas Eve

I’m shocked at how much press attention the Netflix outage on Christmas Eve has garnered. It’s not like this is the first time that either Amazon or Netflix has had an outage, nor, sadly, will it be the last. I think a large part of the scorn is that it hit Netflix at an unfortunate time, when a lot of their users actually wanted to use the service for Christmas specials, holiday traditions, etc.

Most interesting in all of this is the blame that Amazon is taking for the outage. While they did cause the initial issue, it is Netflix’s job to resolve it and to prevent future outages. (Amazon blog post on the outage: http://aws.amazon.com/message/680587/) The cloud is fantastic: it has lowered the barrier to entry for a ton of startups and provided Amazon scale to companies who want to focus on the software, not the infrastructure. This power, though, comes with greater complexity and business challenges that must be addressed. (Spider-Man’s “With great power comes great responsibility” comes to mind.)

When you rely on equipment, services, and providers that you ultimately have little control over, you must plan for failure. You must assume that any component you rely on may be gone at any moment, or may perform at a suboptimal level, and that there is nothing you can do about it when that happens. At my last employer we hit several of the “Amazon oops” moments. First, we had our application deployed in only one region/Availability Zone. This is the same thing as running all of your servers and systems in one datacenter, supported by a single provider/telco/utility/etc. It’s a huge Single Point of Failure (SPOF).

Next we moved to a single region with multiple Availability Zones. While this is a nice improvement, it still bit us when an Amazon technician made a human error that killed EBS and storage in the East Region. Suddenly we realized that while Amazon advertises that each Availability Zone is independent of the others at the hardware level, the control tier and shared services back end can be, and are, shared across multiple AZs. (In reality, some of the control tier is global in nature, judging from some of the recent RCAs.)

It of course took code changes and infrastructure enhancements for us to tackle multiple AZs in a single region. Some of the things you may need to do in your application are:

  1. A read/write master/slave replication strategy (NoSQL and relational databases all have varying ways to accomplish this, depending on your software)
  2. Traffic load balancing between each AZ for incoming user traffic (if active/active)
  3. Application awareness of databases and database states. Some of this is handled by the drivers for the database (e.g., Mongo), although implementations are a mixed bag and must be heavily tested
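
To make item 3 a little more concrete, here is a minimal sketch of what "application awareness of database states" can look like. It is illustrative only: the `DbNode` type, the node names, and the AZ labels are made up for the example, and in practice a database driver or proxy may handle some of this for you.

```python
from dataclasses import dataclass


@dataclass
class DbNode:
    name: str
    az: str            # Availability Zone label (hypothetical values below)
    role: str          # "primary" or "replica"
    healthy: bool = True


def pick_write_node(nodes):
    """Writes only ever go to a healthy primary."""
    for node in nodes:
        if node.role == "primary" and node.healthy:
            return node
    raise RuntimeError("no healthy primary; promote a replica before accepting writes")


def pick_read_node(nodes, local_az):
    """Prefer a healthy replica in our own AZ, then any replica, then the primary."""
    healthy = [node for node in nodes if node.healthy]
    if not healthy:
        raise RuntimeError("no healthy database nodes")
    # Tuples sort False before True: replicas first, then same-AZ nodes first.
    healthy.sort(key=lambda node: (node.role != "replica", node.az != local_az))
    return healthy[0]


nodes = [
    DbNode("db-a", "us-east-1a", "primary"),
    DbNode("db-b", "us-east-1b", "replica"),
    DbNode("db-c", "us-east-1c", "replica"),
]
print(pick_read_node(nodes, "us-east-1c").name)
```

The point is less the routing rule itself than the fact that the application has to know which nodes exist, what state they are in, and what to do when the preferred one is gone.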

All of the above carries an additional cost in complexity, testing, load testing, network design, and thought about how the system is developed and deployed. Once you realize that this still isn’t good enough, you start talking about multiple regions and/or multiple clouds (Amazon, Rackspace, etc.). This adds new complexities that you now must factor in:

  1. Global load balancing now becomes a must, or enhanced round robin DNS services
  2. Increased latency between sites; synchronous commits rapidly become damaging to performance
  3. Application complexity increases 3x with 2 regions/clouds, but expect a 4-5x increase in complexity if you add more
  4. Issues must be actively detected and diagnosed by monitoring, and the affected nodes/systems isolated, as the number of users impacted could be small or large, or worse, impossible to determine

I give lots of kudos to Netflix for the Chaos Monkey. Not a lot of people have the stomach to have a “rogue” agent in their systems breaking stuff on purpose and testing their resiliency. But as more and more companies move to the cloud, the practice must become more common, at least in lab environments. (http://techblog.netflix.com/2011/07/netflix-simian-army.html)
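
For a lab environment, the core idea fits in a few lines. The sketch below is a toy simulation, not Netflix's tool: `unleash_monkey`, `service_is_up`, and the instance names are invented for illustration, and a real version would call your cloud provider's API to terminate instances instead of flipping a value in a dict.

```python
import random


def unleash_monkey(instances, rng, protected=()):
    """Terminate one randomly chosen running instance that is not protected.

    `instances` maps instance name -> state ("running" or "terminated").
    Returns the name of the victim, or None if nothing was eligible.
    """
    candidates = [
        name for name, state in instances.items()
        if state == "running" and name not in protected
    ]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    instances[victim] = "terminated"
    return victim


def service_is_up(instances, min_running=1):
    """The service counts as up while enough instances are still running."""
    running = sum(1 for state in instances.values() if state == "running")
    return running >= min_running


pool = {"web-1": "running", "web-2": "running", "web-3": "running"}
victim = unleash_monkey(pool, random.Random())
print(victim, "was terminated; service up:", service_is_up(pool, min_running=2))
```

If losing one random instance takes the service below its minimum, you have found a resiliency problem in the lab instead of on Christmas Eve.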

Global scale, once the domain of tech giants (Yahoo, Google, Microsoft, Amazon), is now available to the masses. Of course, finding tech talent who have dealt with scale at this level is difficult, and/or they’re pretty happy at the companies they work for. The DevOps community is a huge help in this area, with folks sharing their infrastructure, war stories, solutions for scale, and of course a relentless pursuit of the metrics and automation that allow the complexity of this scale to become manageable.

Check back for future posts on DevOps culture, hiring, global scale, etc.! Plus you guys can keep me honest on posting at least weekly.

Interested in the Netflix/Amazon outages? Check out these blogs:

Amazon Post Mortem

Adrian Cockcroft’s (Netflix) analysis of issue

 

 

Check out my latest blog post over at brodleygroup.com.

http://www.brodleygroup.com/1/post/2012/08/amazon-launches-provisioned-iops.html

Great write up on password reset mechanisms. I learned quite a few things… Anyone who is designing these types of systems should take a few minutes to read this.

http://www.troyhunt.com/2012/05/everything-you-ever-wanted-to-know.html?HN2

For a tech gadget junkie like me this is the perfect service. I was always jealous of the Bag, Borrow or Steal system for purses and other high end luxury items for women. Glad someone is finally bringing this to tech gadgets.

If interested follow the link below:

Here's my link: http://www.ybuy.com/?ref=aZp8v

Thanks.

In case you're interested, YBUY is a membership club that lets you try the greatest products in the world, risk-free! Here's how YBUY works:

1. Get products on the 1st of the month. YBUY ships you a product so that it arrives by the first of every month.

2. Try products for 30 days. After the trial period, you can return the product at no cost, or you can buy it – your call.

3. Skip a month at any time. We won't charge you and you'll still be a member of this exclusive club.

If you want to show some love, you can "like" YBUY on Facebook – http://www.facebook.com/pages/ybuy/232115610153403

So glad that someone published a simple table of all Amazon instance types and configurations. It makes it much simpler to see all of the data in one place, and I'm surprised Amazon hadn't already done this.

Check it out, and you can commit changes on GitHub. I guess I can throw away the Google spreadsheet I've been using.

Day 3 at Daptiv. Things are going well, starting to figure things out.

As most Ops guys know, code deployment can be a challenge, especially when the code release is dropping hundreds or thousands of changes to the product into production in one large code push. While I’m a huge fan of DevOps and continuous deployment, realistically a lot of organizations are still deploying code the old fashioned way. Nate over at TechOpsGuys recently ranted about organizations that his company relies on pushing production code on a Friday and breaking his company’s website (http://www.techopsguys.com/2012/02/03/dont-push-code-on-a-friday-damnit/). This rant brings up a few things I’ve been thinking about for a long time and felt needed to be expressed more clearly in a blog post:

1. Friday code deployments and large code pushes are going the way of the dinosaur; more frequent, smaller code releases with feature flags, continuous integration testing, etc. are becoming the norm.

2. If your software product relies on services via APIs in order to function or stay available, you need to start thinking differently.

Code Deployment in Devops

With the push for DevOps in SaaS and B2C businesses, you are seeing more and more companies going to continuous integration and continuous deployment. In fact, some software companies won’t even show a new developer where the bathroom is until they push code into production. (http://www.scottporad.com/2010/11/01/cheezburger-network-doesnt-show-its-new-employees-the-bathroom-until-theyve-checked-in-code/)

Having been involved in code deployment for over 5 years in a SaaS environment, I much prefer smaller code deployments that don’t require major downtime of a site or service. Rolling in features bit by bit and turning them on when they’re fully deployed, or turning them on for a select set of beta users, makes it so much easier to test features, functions, etc. with real production load and learn how these systems operate in production. Plus, if you find an issue once you’ve deployed the code, you can roll back to the previous code by redeploying from source, or fix it and push the code out.
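
Here is a rough sketch of the kind of feature flag check that makes this possible. The flag name, the config layout (`enabled`, `beta_users`, `percent`), and the helper functions are all assumptions made up for the example; real flag systems differ in the details but tend to follow the same shape.

```python
import hashlib


def bucket(flag, user_id):
    """Map a (flag, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag, user_id, flags):
    """Decide whether `user_id` should see the feature behind `flag`."""
    config = flags.get(flag)
    if config is None or not config.get("enabled", False):
        return False                      # kill switch: the code is deployed but dark
    if user_id in config.get("beta_users", ()):
        return True                       # beta users always get the feature
    return bucket(flag, user_id) < config.get("percent", 0)


# Hypothetical flag config: on for two beta users plus 10% of everyone else.
FLAGS = {
    "new-checkout": {"enabled": True, "beta_users": {"alice", "bob"}, "percent": 10},
}

if is_enabled("new-checkout", "alice", FLAGS):
    print("render the new checkout flow")
else:
    print("render the old checkout flow")
```

Because the bucket is derived from a hash rather than a random number, a given user gets the same answer on every request, and raising `percent` only ever adds users to the rollout.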

Relying on Third Party Services

Once you understand that code from a lot of startups is being deployed all the time (Etsy pushed 10,000 code changes in 2011: http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment/), you need to start thinking differently about how you integrate with these services. If your main home page pulls in Twitter status updates from your CEO, you need to make sure the following happens:

1. When the API is functioning, display the data (yeah, I know, duh)

2. When the API is not functioning, make sure that:

A. Your website still loads and doesn’t throw a nasty 500 error

B. Your website doesn’t have an obnoxious hole with a 404 or 500 error in the sidebar or header of the site.

This means your developers must be planning for the inevitable situation that the API is going to break, the way you call the API will change, or plagues of locusts have infested their datacenters.  Your developers need to focus on thinking about all states of the service and the behavior they want their application to exhibit when the failure occurs. 
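
One way to sketch this behavior is a small wrapper that serves the last good response when the API call fails and falls back to a harmless default when there is nothing recent to show. The `CachedFallback` class and the `fetch_ceo_tweet` function below are hypothetical names for the example; the real fetch function would make the HTTP call, and it should set its own timeout so a slow API cannot hang the page.

```python
import time


class CachedFallback:
    """Wrap a flaky call so that failures degrade gracefully."""

    def __init__(self, fetch, default, max_age_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch
        self._default = default
        self._max_age = max_age_seconds
        self._clock = clock
        self._value = None
        self._stored_at = 0.0
        self._has_value = False

    def get(self):
        try:
            value = self._fetch()
        except Exception:
            # API is down or misbehaving: reuse the last good value if it is fresh enough.
            if self._has_value and self._clock() - self._stored_at <= self._max_age:
                return self._value
            return self._default          # nothing usable: render an empty widget, not a 500
        self._value = value
        self._stored_at = self._clock()
        self._has_value = True
        return value


def fetch_ceo_tweet():
    # Stand-in for the real API call; here it always fails to show the fallback path.
    raise ConnectionError("API unavailable")


sidebar = CachedFallback(fetch_ceo_tweet, default="")
print(repr(sidebar.get()))
```

The page template can then simply skip the widget when it gets the default back, which covers both cases in the list above: no 500 for the whole page and no ugly hole in the sidebar.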

 

Great article on True SaaS (multitenant, single code line, hosted software) and why enterprises should care about it. It's a common practice for software vendors with highly successful products to just assume that the product can "be SaaS" if they simply repackage the solution as hosted. While this may look like "SaaS", and your customers may even like the appearance of a higher level of data security, it has huge trade-offs in terms of an enterprise's need to manage the solution. Customers of these types of solutions are tied to higher per-user costs, they have to do their own upgrade coordination and scheduling, normally with high up-front PS costs, and they suffer less than stellar quality and performance as they scale the solution. If you're in the market for a SaaS solution you should be asking about the delivery model; if you're not, you're doing yourself a disservice.

http://www.enterpriseirregulars.com/44507/what%E2%80%99s-true-saas-and-why-the-hell-should-customers-care/