The Operations Report Card

A. Public Facing Practices

1. Are user requests tracked via a ticket system?
2. Are "the 3 empowering policies" defined and published?
3. Does the team record monthly metrics?

B. Modern Team Practices

4. Do you have a "policy and procedure" wiki?
5. Do you have a password safe?
6. Is your team's code kept in a source code control system?
7. Does your team use a bug-tracking system for their own code?
8. In your bugs/tickets, does stability have a higher priority than new features?
9. Does your team write "design docs?"
10. Do you have a "post-mortem" process?

C. Operational Practices

11. Does each service have an OpsDoc?
12. Does each service have appropriate monitoring?
13. Do you have a pager rotation schedule?
14. Do you have separate development, QA, and production systems?
15. Do roll-outs to many machines have a "canary process?"

D. Automation Practices

16. Do you use configuration management tools like cfengine/puppet/chef?
17. Do automated administration tasks run under role accounts?
18. Do automated processes that generate e-mail only do so when they have something to say?

E. Fleet Management Processes

19. Is there a database of all machines?
20. Is OS installation automated?
21. Can you automatically patch software across your entire fleet?
22. Do you have a PC refresh policy?

F. Disaster Preparation Practices

23. Can your servers keep operating even if 1 disk dies?
24. Is the network core N+1?
25. Are your backups automated?
26. Are your disaster recovery plans tested periodically?
27. Do machines in your data center have remote power / console access?

G. Security Practices

28. Do Desktops, laptops, and servers run self-updating, silent, anti-malware software?
29. Do you have a written security policy?
30. Do you submit to periodic security audits?
31. Can a user's account be disabled on all systems in 1 hour?
32. Can you change all privileged (root) passwords in 1 hour?
  

16. Do you use configuration management tools like cfengine/puppet/chef?

Config Management (CM) software is a tool that coordinates the configuration of machines. It might control the OS, the software, the service provided, or all of the above.

Before CM sysadmins manually made changes to machines. If you had to change 100 machines, you had 100 manual tasks to do. Smarter sysadmins would automate such a change.

Even smarter sysadmins realized that general tools for such automation would be even better. They invented automation frameworks with names like track, cfengine, bcfg2, Puppet, Chef and others.

The hallmark of CM systems is that you describe what you want and the software figures out how to do it. What you want is specified in declarative statements like "hostA is a web server" and "web servers have the following packages and other attributes". The software turns that into commands that need to be executed. Another important attribute is that the declarations are generic ("install the commands in foo.sh as a cron job") but the CM system does the right thing for that computer's operating system (Selecting "/etc/crontab" vs. "/var/spool/cron").

With CM, instead of manual changes, you change a configuration file and let the CM system do the work.

Local changes on a server are not ok. Any time you create a file like /etc/crontab.bak or /etc/hosts.[today's date] it is a red flag that you are doing it wrong.

Configuration management is the ultimate automation. You go from being the The Sorcerer's Apprentice to the puppet master. That's no Mickey Mouse idea.

 
Community Spotlight
LISA15