One of the things that I found really interesting at Velocity 2010 was how prevalent continuous deployment has become. I know I’ve mentioned the Facebook operations talk previously, but it’s worth mentioning again as a good example of this. In it, Tom Cook – a Facebook engineer (sorry, couldn’t find a link for Tom) – talks about deploying code at least daily, with feature releases once a week. This flies in the face of the “deploy every 2-3 months” model that I’m familiar with. It also requires significantly more developer involvement, with the developer doing the actual deploy and sticking around to support it rather than throwing it over the wall to ops to put in place once the QA cycle is complete.
So, how is this accomplished? Well, without getting into the technical details of the tools they use (watch the video! really!), it essentially demonstrates a completely different culture than a “quarterly installs” sort of model. Obviously, this sort of thing can’t work in a “get every level of management to authorize the install in triplicate” shop. It requires a DevOps-y sort of environment where there is tight integration between the folks who know the code and the folks who understand the systems it’s running on. It requires what I heard referred to at the conference as a strong “immune system” – basically, a set of tools (change management, anyone?) and a communication structure that affords a high degree of confidence that a particular install (a) is unlikely to break anything, and (b) can be rolled back quickly with minimal impact if it goes haywire.
I was a bit skeptical of this sort of thing at first, but John Allspaw said something in his Ops Meta-Metrics session that really resonated with me. He said (paraphrasing): “As an ‘ops guy’, I prefer smaller changes more often to big changes less often. Taken to its extreme, consider this: what if the change is only 5 lines of code? Does that feel safer? …because it should.” A light turned on inside my head when I heard that. It’s not about deploying fast “because we can”; it’s about deploying fast because it’s the safest thing to do.
Another interesting aspect of this is the sorts of deployment models that can be used to mitigate the impact if a 5-line code change does happen to break something. One of the most prevalent: not deploying code changes to all servers at once. Why not deploy the change on a handful of servers – or on every server, but with the feature/change/bug-fix only “turned on” for a handful of users? In essence, why not use a relatively small portion of your userbase as unwitting beta testers for your change? Paul Hammond gave some interesting examples of how to handle this sort of deployment inside the code itself in his Always Ship Trunk session.
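To make the “turned on for a handful of users” idea a bit more concrete, here is a minimal sketch of a percentage rollout in Python. This is my own illustration, not something taken from Paul’s talk: the function names, the `new_profile_page` feature, and the hashing scheme are all assumptions. The idea is simply that the new code ships to every server, but a stable hash of the user ID decides who actually sees it.

```python
import hashlib


def in_rollout(feature: str, user_id: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout slice for a feature.

    Hypothetical helper: hashing feature + user ID gives each feature its own
    stable slice of users, so a given user always gets the same answer.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # bucket in the range 0..9999
    return bucket < percent * 100  # e.g. 5.0 percent -> buckets 0..499


def render_profile(user_id: str) -> str:
    """Both code paths are deployed; the flag picks one per user."""
    if in_rollout("new_profile_page", user_id, percent=5.0):
        return "new profile page"
    return "old profile page"
```

Rolling back then means setting the percentage to 0 rather than redeploying, which is a big part of why this approach limits the damage when a change misbehaves.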
Whether Instrumentation & Metrics was the focus of the talk or just a portion of what was covered by the speaker, two major rules of thumb seemed to present themselves:
“Instrument everything.” Collect as much data about as many things as you can. If it reports, collect it and store it. If it doesn’t report, make it report. And store it. Make sure you keep around as much historical data as you feasibly can. “But Cliff,” I hear you asking, “won’t you just end up with a whole pile of data that doesn’t really mean anything and just takes up disk space?” Well, read on, because that brings me to the second rule of thumb:
“Data ain’t information!” (Direct quote from a talk on modeling and metrics). So…what does it mean? Well, a couple of things. One of the speakers who gave the presentation linked above would have you believe that data + a model is information. Modeling is critical in that it may allow you to extrapolate information from data points that would be otherwise meaningless. Note that I said may; as this presenter noted, “Data is from the devil, models are from God,” in a nod to the fact that real-world data rarely adheres to the nice, uniform curve generated by the model.
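As a small illustration of “data + a model is information”, here is a sketch that fits a straight line to daily disk-usage samples and extrapolates when the disk would fill up. The disk-usage scenario, the function names, and the choice of a plain least-squares line are my assumptions, not something presented in the talk, and real data will rarely be this well behaved.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_full(days, used_gb, capacity_gb):
    """Extrapolate the fitted line to the day usage reaches capacity.

    Returns the number of days after the last sample, or None if usage
    is flat or shrinking (the model predicts the disk never fills).
    """
    slope, intercept = fit_line(days, used_gb)
    if slope <= 0:
        return None
    full_day = (capacity_gb - intercept) / slope
    return full_day - days[-1]


# Made-up samples: usage grows 2 GB per day on a 20 GB disk.
remaining = days_until_full([0, 1, 2, 3], [10, 12, 14, 16], capacity_gb=20)
print(f"Disk full in about {remaining:.1f} days")
```

The raw samples on their own are just a pile of numbers; the (very crude) model is what turns them into “you have about two days left”.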
The other piece to this is an emphasis on the importance of visualization – i.e., understanding key metrics and how to display them such that interesting/important trends are elucidated. Some examples of this were given in the Ops Meta-Metrics talk, in which John Allspaw demonstrated that code installs and service downtime don’t always have a 1-to-1 correlation…but you will never know that if you don’t track both of those metrics and understand how important it is to compare them over time.
As a side note on metric monitoring, one of the really cool tools people were talking about at the conference was cucumber-nagios, a monitoring tool that allows you to specify configs in natural language. Slick!
Change Management was another huge theme at the conference. I actually heard more than one speaker say something to the effect of “If you don’t have change management in place in some form, you should leave the conference right now, put it in place, and then come back.” Developers (almost) always have some manner of change management in place – source control, peer review, approval processes, release schedules, etc. – in order to…errm…manage changes to the codebase on which they are working. …but what about the systems – the machines on which that code is running? …and why is that important? Why isn’t it okay to have a sysadmin fire up emacs, slam in a config change, and send out an email saying “all’s well”?
Well…let me give a non-tech example here. Let’s say you took your car to the mechanic. He takes a look-see at it, sort of twiddles about a bit, and hits you with a bill for a few hundred bucks. (I realize that this is almost exactly how most folks’ visits to the mechanic go, but bear with me here). Now suppose he can’t tell you exactly what he did or document it in any way, but it’s running so “we’re good, right?” Oh…and suppose he tells you that your car may or may not start the next time you turn the key; “Just bring ‘er on back and we’ll have another look!” How comfortable would you be with the arrangement?
Okay, so the car analogy doesn’t carry over so well (and is rather unfair to sysadmins, I might add)…but I can tell you that I’ve seen this sort of thing happen in the datacenter many a time. As a Linux admin, I was terrified of rebooting machines, largely due to the inverse relationship between the uptime of a system and the likelihood of it actually coming back up correctly after a reboot. Having a change management tool like Chef (very cool; demoed at the conference), cfengine, puppet – pick your poison – backed by version control (of course) is a means of raising the level of confidence about changes being made to systems.
Oh, and how about auditing? Suppose your method for determining what’s going on with your systems is to walk around to all the sysadmins who might have touched Machine 7 of 3,956 and ask politely “Have you changed anything on this sum’bitch in the past decade?” Repeatability? “Can you build Machine 3,957 and make it look just like Machine 7?” CYA during a postmortem witch hunt? “Prove to me that you didn’t have a hand in bringing down our production database this afternoon.” Those are just a few of the reasons for implementing change management. A good change management system goes a step beyond the typical “get all of management to sign off on the change before putting it in” approach.
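To show what the auditing question looks like when a machine’s intended state is written down, here is a toy drift check in Python. It is only a sketch under my own assumptions: the setting names and values are invented, and tools like Chef, cfengine, and puppet do far more than this. The point is just that a version-controlled description of the machine lets you answer “has anything changed?” without walking the halls.

```python
MISSING = "<missing>"  # placeholder for a setting present on only one side


def find_drift(desired, actual):
    """Compare the recorded state of a machine with what is actually on it.

    Returns {setting: (desired_value, actual_value)} for every difference.
    """
    drift = {}
    for key in sorted(set(desired) | set(actual)):
        want = desired.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical settings: `desired` would live in version control,
# `actual` would be collected from the machine itself.
desired = {"ntp_server": "ntp1.example.com", "max_clients": 256, "swap": "on"}
actual = {"ntp_server": "ntp1.example.com", "max_clients": 512, "debug": "on"}

for setting, (want, have) in find_drift(desired, actual).items():
    print(f"{setting}: expected {want!r}, found {have!r}")
```

An empty result means the machine still matches what was recorded; anything else is a change that nobody wrote down.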
DevOps – summarised reasonably concisely here – can be briefly summed up as “tighter integration between devs and ops”. For those of you not in a technical field, pay special attention to the “siloisation” section; I’d wager that anyone who works for a large company in any field has seen this sort of “us vs. them” mentality between departments/divisions. The idea behind DevOps is to foster more of a “we’re all on the same team” sort of mindset.
I’d go so far as to say that DevOps was the theme at the conference. The entire three days were essentially a pep rally designed to promote making things “fast by default” not only by using whiz-bang technologies and tweaking your code, but also by changing the culture within and among IT organizations. (Note: DevOps Day – which I was unable to attend – took place on the Friday after Velocity 2010 in Mountain View and was mentioned several times by the presenters.)
Okay, so I’d meant to do a day-by-day breakdown of Velocity 2010, but [insert lame excuse here], so…I didn’t. However, now that I’ve had a week or so to “let it simmer”, I’d like to sum up a few of the major themes and undercurrents from the conference. Note that the conference was roughly divided into three flavors of discussion – “Ops”, “Web Performance”, and “Culture”. Of course, being an “ops guy” I focused on the ops-related sessions and tried to fit in as much of the “culture” as I could. (A Day in the Life of Facebook Operations was one of the best talks given, imo, and really touches on a lot of the themes I’m about to talk about below. If you watch no other video from Velocity 2010, watch this one.)
I was going to sum all of it up in one post, but I decided to break it out by theme rather than taking the “wall of text” approach. Hopefully, I’ll get all of it posted in the next couple of days here.
- Change Management
- Instrumentation & Metrics
- Continuous Deployment
- K-V Stores, memcached, etc.