Tuesday, August 11, 2009

Shut Down the Datacenter

From July 7, 2008

Or at least power down significant pieces of it during periods of low demand. This message always draws funny looks from IT types when I suggest a seemingly simple answer to the problem of extreme costs for datacenter resources. I push on:

Billy – If utilization is around 20 – 30%, aren't there periods of time when you could just shut down about 50% of the systems? Or at least 25%?

IT – We can't just shut the systems down. . .

Billy – Why not? You aren't using them.

IT – You don't understand.

Billy – What am I missing?

IT – Well, it just doesn't work that way.

Billy – How does it work?

IT – It takes a long time to lay the application down atop a production server.

Billy – Why?

IT – Setup is complicated. Laying down the application and bringing it online can take anywhere from several days to, more typically, 2 to 4 weeks.

Billy – So part of the application definition is described by the physical system it runs on?

IT – Yes, that's right. If I shut down the physical system, I lose part of the definition and configuration of the application.

And therein lies the culprit. The “last mile” of application release engineering and deployment is a black art. Applications become tightly coupled to the physical hosts upon which they are deployed, and the physical hosts cannot be powered down without losing the definition of a stable application. Bringing the application back up is expensive due to the high costs of expert administration resources, and it is fraught with peril because the process is not repeatable. Enterprises are spending billions of dollars on datacenter operating costs because the risk of bringing applications back on-line is not worth the savings of taking them off-line.

Of course I blame most of this mess on the faulty architecture of the One Size Fits All General Purpose Operating System (OSFAGPOS). OSFAGPOS is typically deployed in unison with the physical hosts because OSFAGPOS provides the drivers that enable the applications to access the hardware resources. To get an application to run correctly on OSFAGPOS, the system administrators then need to “fiddle with it” to adjust it to the needs of any given application. This “fiddling” is where things run amok. It's hard to document “fiddling,” and it is therefore difficult to repeat “fiddling.” The “fiddle” period can last for up to 30 days, depending on the complexity of the “fiddling” required.

So how do we get away from all of this “fiddling” around, and deploy an architecture that allows the datacenter to scale up and down based on actual demand? Start with a bare metal hypervisor as the layer that provides access to the hardware. Then extend release engineering discipline to include the OS by releasing applications as virtual machines with Just Enough OS (JeOS or “juice”) in lieu of OSFAGPOS, complete with all of the “metadata” required to access the appropriate resources (memory, CPU, data, network, authentication services, etc.). By decoupling the definition of the application from the physical hosts, a world of flexibility becomes possible for datacenter resources. Starting up applications becomes fast, cheap, and reliable. As an added bonus, embracing cloud capacity such as that provided by Amazon's EC2 becomes a reality. Instead of standing up application capacity in-house, certain peak demand workloads can be deployed “on-demand” with a variable cost model (in the case of Amazon it starts at about $0.10 per CPU per hour).
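To put rough numbers on the idea, here is a minimal back-of-the-envelope sketch in Python. It is only an illustration under stated assumptions: the function names, the 100-host fleet, and the 50% target utilization are all hypothetical, and the only figures taken from this post are the 20 to 30% utilization range and the roughly ten cents per CPU per hour for on-demand capacity.

```python
def hosts_needed(total_hosts: int, utilization_pct: int, target_pct: int) -> int:
    """Hosts to keep powered on so they run at or below target_pct utilization.

    Assumes (hypothetically) that load can be consolidated freely across
    identical hosts, which is only true once applications are decoupled
    from the physical machines.
    """
    if not 0 <= utilization_pct <= 100 or not 0 < target_pct <= 100:
        raise ValueError("percentages must be in range")
    demand = total_hosts * utilization_pct   # total load, in host-percent
    needed = -(-demand // target_pct)        # integer ceiling division
    return min(total_hosts, needed)


def burst_cost_cents(cpus: int, hours: int, cents_per_cpu_hour: int = 10) -> int:
    """Variable cost of renting on-demand capacity, in cents.

    The default rate is the approximate figure quoted in the post,
    not a current price.
    """
    return cpus * hours * cents_per_cpu_hour


if __name__ == "__main__":
    fleet = 100  # hypothetical fleet size
    for util in (20, 25, 30):
        keep = hosts_needed(fleet, util, target_pct=50)
        print(f"{util}% utilization: keep {keep} hosts on, power down {fleet - keep}")
    print(f"20 CPUs for 8 hours on demand: {burst_cost_cents(20, 8)} cents")
```

Under these assumptions, a fleet running at 25% utilization could power down half of its hosts and still leave the remainder only half busy, which is the same ballpark as the 50% figure in the conversation above.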

With oil trading at around $140 per barrel, the cost of allowing datacenter resources to “idle” during slow demand periods is becoming a real burden. “Fiddling around” with applications to get them deployed on OSFAGPOS is no longer just good clean fun for system administrators. It is serious money.
