The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs?"
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process?"

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?

15. Do roll-outs to many machines have a "canary process?"

Suppose you have to roll out a change to 500 machines. Maybe it is a new kernel. Maybe it is just a small bug-fix.

Do you roll it out to all 500 at once? No. You roll it out to a small number of machines and test to see if there are problems. No problems? Roll out to more machines. Then more and more until you are done.
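As a rough illustration of "more and more until you are done," here is a minimal sketch that splits a fleet into waves of increasing size. The function name `waves`, the starting size of 1, and the growth factor of 4 are assumptions chosen for the example; the article does not prescribe any particular sizes.

```python
from typing import Iterator, List, Sequence


def waves(machines: Sequence[str], first: int = 1, growth: int = 4) -> Iterator[List[str]]:
    """Yield successive groups of machines, each `growth` times larger than the last.

    `first` and `growth` are illustrative defaults, not recommended values.
    """
    if first < 1 or growth < 1:
        raise ValueError("first and growth must both be at least 1")
    start, size = 0, first
    while start < len(machines):
        # The final wave may be smaller than `size` if the fleet runs out.
        yield list(machines[start:start + size])
        start += size
        size *= growth


# Hypothetical fleet of 500 hosts.
fleet = [f"host{i:03d}" for i in range(500)]
wave_sizes = [len(w) for w in waves(fleet)]
print(wave_sizes)  # [1, 4, 16, 64, 256, 159]
```

Between waves you would run whatever health checks make sense for the change before moving on to the next group.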

These early machines are called "canaries".

The classic example of animals serving as sentinels is the canary in the coal mine. Well into the 20th century, coal miners in the United Kingdom and the United States brought canaries into coal mines as an early-warning signal for toxic gases including methane and carbon monoxide. The birds, being more sensitive, would become sick before the miners, who would then have a chance to escape or put on protective respirators. Source: Wikipedia

Here are some canary techniques:

  • One, Some, Many:
    Do one machine (maybe your own desktop), then some machines (maybe your co-workers' desktops), then many machines (larger and larger groups until done). Any single failure means you stop the upgrade, roll back the change, and don't continue until the problem is fixed.
  • Cluster Canary:
    Upgrade 1 machine, then 1% of all machines, then 1 machine per second until all machines are done. (Typical at Google and sites with large clusters.)
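The two techniques above can be sketched as follows. This is a minimal sketch, not a real tool: `one_some_many`, `cluster_canary_stages`, and the `upgrade` and `rollback` callables are hypothetical names invented for illustration, and the sketch assumes that "roll back the change" means undoing it on every machine touched so far.

```python
from typing import Callable, List, Sequence


def one_some_many(
    groups: Sequence[Sequence[str]],
    upgrade: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Upgrade group by group; on any failure, roll back every touched host and stop."""
    touched: List[str] = []
    for group in groups:
        for host in group:
            touched.append(host)
            if not upgrade(host):
                # Undo in reverse order, including the host that just failed.
                for done in reversed(touched):
                    rollback(done)
                return False
    return True


def cluster_canary_stages(machines: Sequence[str]) -> List[List[str]]:
    """Split a fleet into: 1 machine, then 1% of the fleet, then the rest one at a time."""
    if not machines:
        return []
    one_percent = max(1, len(machines) // 100)
    first = [machines[0]]
    second = list(machines[1:1 + one_percent])
    rest = [[m] for m in machines[1 + one_percent:]]
    # Drop empty stages, which can occur for very small fleets.
    return [stage for stage in [first, second] + rest if stage]


# One, Some, Many: pretend the upgrade fails on "coworker-2".
rolled_back: List[str] = []
ok = one_some_many(
    groups=[["my-desktop"], ["coworker-1", "coworker-2"], ["server-1", "server-2"]],
    upgrade=lambda host: host != "coworker-2",
    rollback=rolled_back.append,
)
print(ok, rolled_back)  # False ['coworker-2', 'coworker-1', 'my-desktop']

# Cluster Canary on a hypothetical fleet of 500 hosts.
stages = cluster_canary_stages([f"host{i:03d}" for i in range(500)])
print(len(stages[0]), len(stages[1]), len(stages))  # 1 5 496
```

For the Cluster Canary, the caller would pause about one second between the single-machine stages to get the "1 machine per second" pace; the sketch leaves the timing out so it stays easy to test.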

This procedure can be done manually, but if you use a configuration management system, the ability to do canaries should be "baked in" to the system.
