The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs?"
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process?"

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do Desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?
  

19. Is there a database of all machines?

Every site should know what machines it has. The database should store at least some basic attributes: OS, RAM, disk size, IP address, owner/funder, who to notify about maintenance, and so on.

Having a database of all machines enables automation across all your machines. Being able to run a command on precisely the machines with a certain configuration is key to many common procedures.

This data should be automatically collected, though a very small site can make due with a spreadsheet or wiki page.

Having an inventory like this lets you make decisions based on data and helps you prevent problems.

I know a small university in New Jersey that could have prevented a major failure if they had better inventory: They tried to upgrade all of its PCs to the latest version of Microsoft Office. The executive council was excited that there would finally be a day when version incompatibilities didn't make every interaction an exercise in frustration. Plus look at all these new features! Oh, how enthusiasm turned into resentment as the project collapsed. It turned out that one third of the machines on campus didn't have enough RAM or disk space. Random segments of the university couldn't get any work done due to botched upgrades. The executive council was not only upset but is now very risk adverse. It will be a long time before any new upgrades will happen. All of this would have been prevented if a good asset management system was in place. A simple query would have produced a list of machines that needed upgrades. Budgeting and work estimates could have been provided as part of the upgrade program.

 
Community Spotlight
LISA15