Brad Feld wrote a couple of great posts over the last week about one of the challenges of cloud computing: coordinating infrastructure-level services in complex cloud environments. Namely, most cloud computing (hosting) providers have APIs for primitive functions like creating, starting, and stopping instances, but lack coordination or error-handling functionality at a granular (operating system) level.
These posts struck a chord with me, thinking back to my days as a UNIX systems administrator working for a Fortune 100 financial services company. We ran a three-tier architecture with a large number of web servers, application servers, and a pair of beefy database boxes at the bottom of the stack. Whenever we needed to reset the environment, we had to go through a very specific process to get things online again.
Here’s what a reset looked like:
- Restart database servers, make sure they are working
- Restart application servers, make sure they are talking to the database
- Restart webservers, make sure plugins are talking to the app servers
- Flush load balancers and firewalls
This whole process took about 15 minutes, was very manual, and thus was quite error prone. For example, if you started the app servers before the database servers, oops: do not pass go, do not collect $200.
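To make the ordering problem concrete, here is a minimal sketch of that reset written as a dependency graph instead of a checklist. It is not what we actually ran back then: the tier names, the `reset_environment` function, and the `restart` and `is_healthy` callbacks are all hypothetical stand-ins, and a real version would wrap whatever restart and health-check commands your environment provides.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tier names. Each tier maps to the tiers it depends on.
DEPENDENCIES = {
    "database": set(),
    "app": {"database"},   # app servers need the database up first
    "web": {"app"},        # web server plugins need the app servers
    "edge": {"web"},       # load balancers and firewalls are flushed last
}


def reset_environment(restart, is_healthy, dependencies=DEPENDENCIES):
    """Restart tiers in dependency order, stopping at the first unhealthy tier.

    `restart` and `is_healthy` are caller-supplied callables (assumptions for
    this sketch) that take a tier name.
    """
    completed = []
    # static_order() yields every tier after all of its dependencies.
    for tier in TopologicalSorter(dependencies).static_order():
        restart(tier)
        if not is_healthy(tier):
            raise RuntimeError(
                f"{tier} failed its health check; tiers already up: {completed}"
            )
        completed.append(tier)
    return completed


if __name__ == "__main__":
    reset_environment(
        restart=lambda tier: print("restarting", tier),
        is_healthy=lambda tier: True,
    )
```

The point of the sketch is that the order is derived from the declared dependencies, so nobody can start the app servers before the database by accident, and a failed check halts the reset instead of letting it continue.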
Although that example was from a traditional hosting environment in 2000, I imagine there will be similar problems for medium- to high-complexity cloud-hosted applications.
Vendors like VMware have tried to address this kind of issue by providing a workflow engine that can be used to automate sophisticated processes within the infrastructure, and I know RightScale has the RightScripts framework as well, but there is still a major gap.
These systems presuppose that a developer is going to know and care that there are lower-level dependencies beneath the APIs they use to interact with the cloud environment.
The beauty of the cloud is that the developers shouldn’t need to know or care about which server starts before which other server, or that sshd needs to start on box A before box B can run a batch job.
But as Brad implies in his post, I think we’ve got a long way to go…