The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs?"
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process?"

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do Desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?
  

13. Do you have a pager rotation schedule?

Do you have a pager rotation schedule or are you a sucker that is simply on-call forever?

An on-call rotation schedule documents who is "carrying the pager" (or responsible for alerts and emergencies) at which times.

You might literally "pass the pager", handing it to the next person periodically, or everyone might have their own pager and your monitoring system consults a schedule to determine who to page. It is best to have a generic email address that goes to the current person so that customers don't need to know the schedule.

A rotation schedule can be simple or complex. 1 week out of n (for a team of n people) makes sense if there are few alerts. For more complex situations splitting the day into three 8-hour shifts makes sense. "Follow the sun" support usually schedules those 8-hour shifts such that a global team always has a shift during daylight hours. You might take a week of 8-hour shifts each n weeks if your team has 3n people. The variations are endless.

This schedule serves many people: You, your customers, management and HR.

It serves you well because it lets you plan for a life outside of work. I put the highest priority on having a good work-life balance. If you don't have a good work-life balance, and you don't have a rotation schedule, physician heal thy self.

The rotation improves service to customers because it takes the "chaotic panic of trying to find a sysadmin" and makes it easy and predictable.

It serves management because it gives them confidence that the next emergency won't happen while everyone is "away".

It serves HR since, of course, your company gives compensation time or pay as required by law. If your schedule is in machine-readable format, a simple script can read it to generate reports for the payroll department.

If you think you don't have a schedule then it is "24x7x365" and you are a sucker. (But that doesn't mean you can answer "yes" for this question.)

 
Community Spotlight
LISA15