“You really want to instrument everything you have,” Adams told an audience of 700 operations professionals. “The best thing you can do is have more information about your system. We’ve built a process around using these metrics to make decisions. We use science. The way we find the weakest point in our infrastructure is by collecting metrics and making graphs out of them.”
via Using Metrics to Vanquish the Fail Whale « Data Center Knowledge.
This makes high-volume datacenter ops sound fairly straightforward: as long as you have volumes of data and a well-thought-out process, you can make informed decisions. I certainly practiced this methodology when I was involved in datacenter operations, but I am not so sure it stays practical as computing environments become harder to debug through multiple layers of abstraction and virtualization.