The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs?"
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process?"

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do Desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?
  

9. Does your team write "design docs?"

Good sysadmin teams "think before they do." On a larger team it is important to communicate what you are about to do, or what you have done.

A design doc is a standardized format for proposing new things or describing current things. It should be short, 1-2 pages, but can be very long when the need arises.

Create a template and use it all over the place. The section headings might include: Overview, Goals, Non-Goals, Background, Proposed Solution, Alternatives Considered, Security, Disaster Recovery, Cost.

This format can be used to write a 20-page plan for how to restructure your network when you want to get buy-in from many people. It can be a 5-page document of how a prototype was built so everyone can see your results and give feedback before you build the real thing. It can be used for a half-page memo that explains the names you plan on using for a new directory tree on the file server (in which case, you probably don't need most of the headings). Use it to document a system after it was built for use as a reference by others on your team. Heck, use it to describe the team cookout you are planning.

The point is that your team has a mechanism for thinking before doing, a way to communicate plans beyond talking in hallways and chat-room, and a system that leaves behind artifacts that others can use to understand how or why something was done.

This format works when seeking feedback whether you want serious critiques or just "warn me if this will conflict with something you are about to do."

Your design doc format might have more headings or fewer, some may be optional, others may be required. Be flexible. A small project should only require a few headings. A huge project should require more of the headings. Be flexible. Have a "short form" and "long form". Having a standard template that people can use as a starting point avoids the "blank page syndrome".

For More Information

See below links for more information on this topic:

 
Community Spotlight
LISA15