The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs"?
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process"?

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?
  

20. Is OS installation automated?

Automated OS installations are faster, more consistent, and let the users do one more task so you don't have to.

If OS installation is automated, then all machines start out the same. Fighting entropy is difficult enough. If each machine is hand-crafted, it is impossible.
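
One way to picture "all machines start out the same" is to render every host's install answers from a single shared profile, so that only the hostname varies. The following is a minimal sketch under stated assumptions: the setting names, the NTP server name, and the answer-file layout are all hypothetical and do not match the syntax of any particular installer.

```python
from string import Template

# Hypothetical shared profile: every machine receives exactly these settings.
BASE_PROFILE = {
    "timezone": "UTC",
    "ntp_server": "ntp.example.com",
    "packages": "openssh-server sudo",
}

# Hypothetical answer-file layout, not the format of any real installer.
ANSWER_TEMPLATE = Template(
    "hostname=$hostname\n"
    "timezone=$timezone\n"
    "ntp_server=$ntp_server\n"
    "packages=$packages\n"
)


def render_answers(hostname):
    """Return the install answers for one host; only the hostname varies."""
    return ANSWER_TEMPLATE.substitute(hostname=hostname, **BASE_PROFILE)


def shared_settings(answers):
    """Drop the per-host line so two answer files can be compared."""
    return [
        line for line in answers.splitlines()
        if not line.startswith("hostname=")
    ]


if __name__ == "__main__":
    first = render_answers("web01")
    second = render_answers("web02")
    print(first)
    # Everything except the hostname is identical by construction.
    print(shared_settings(first) == shared_settings(second))
```

Because both files come from the same profile, any later difference between two machines has to be something that happened after installation, which narrows down where to look when debugging.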

If you install the OS manually, you are wasting your time twice: Once when doing the installation and again every time you debug an issue that would have been prevented by having consistently configured machines.

If two people install OSs manually, half are wrong but you don't know which half. Both may claim they use the same procedure, but I assure you they do not. Put each in a different room and have them write down their procedure. Now show each sysadmin the other person's list. There will be a fistfight.
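
The "write down your procedure" experiment can be sketched in a few lines: treat each written procedure as a list of steps and report only the steps that differ. This is an illustration, not a real tool; the two procedures below are invented, and the comparison uses Python's standard `difflib` module.

```python
import difflib

# Hypothetical procedures, as two sysadmins might write them down.
ALICE = [
    "partition disk",
    "install base OS",
    "set timezone to UTC",
    "install sshd",
    "apply patches",
]
BOB = [
    "partition disk",
    "install base OS",
    "install sshd",
    "disable root login",
    "apply patches",
]


def procedure_diff(first, second):
    """Return the steps that appear in only one of the two procedures.

    Steps only in `first` are prefixed with "- ", steps only in `second`
    with "+ ". Steps common to both are omitted.
    """
    return [
        line for line in difflib.ndiff(first, second)
        if line.startswith(("- ", "+ "))
    ]


if __name__ == "__main__":
    for line in procedure_diff(ALICE, BOB):
        print(line)
```

Even in this tiny example, each person performs a step the other skips, so machines built by the two of them differ from day one.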

Users see inconsistency as incompetence. If new machines always arrive with a setting that isn't to their liking, they know how to change that setting and are happy. If half the time that setting is one way and half the time it is another way, they lose confidence in the system administrators. What bozos are installing this stuff?

If you can re-install the OS automatically, so can the users. Now you have one less thing to do. Automation that saves you time is great. Automation that lets other people do a task is even better.

Not being able to easily wipe and reload a machine is a security issue. A machine should be wiped and reloaded when a "hand me down" computer moves from one user to another. If this process isn't "friction free" there is temptation to "save time" by not doing it.

 