Systems Administration Practices

The following are practices for SystemsAdministration. The descriptions on this page are terse summaries -- individual wiki pages can or should be developed to expand on the practices. Meanwhile, a SystemsAdministrationPracticesDiscussion? page may be useful. -- KenBitskoMacLeod


Code written by or for SysAdmins, or for production systems, is still code. Most coding practices should apply (aka SystemProgramming).

Production systems should have the bare minimum of differences from the vendor's base install. Those differences should be summarizable on one sheet of paper, the manifest. A manifest should list the base install, the add-on packages installed, the files used from source control for configuration, and any remaining hand-made changes. Per-system manual changes should be kept to a strict minimum. Use automated installs wherever possible.
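As an illustration only, a manifest can be held as a small data structure and printed as plain text. The sketch below is one possible shape, not a prescribed format; the class name, field names, host name, and package names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Manifest:
    """One-page summary of how a system differs from the vendor's base install."""
    hostname: str
    base_install: str
    packages: list[str] = field(default_factory=list)        # add-on packages
    config_files: list[str] = field(default_factory=list)    # paths kept in source control
    manual_changes: list[str] = field(default_factory=list)  # should stay short

    def render(self) -> str:
        """Return the manifest as plain text, one item per line."""
        lines = [f"Manifest for {self.hostname}", f"Base install: {self.base_install}"]
        sections = (
            ("Add-on packages", self.packages),
            ("Configuration from source control", self.config_files),
            ("Manual changes", self.manual_changes),
        )
        for title, items in sections:
            lines.append(f"{title}:")
            # An empty section is still printed so that "none" is explicit.
            lines.extend(f"  - {item}" for item in (items or ["(none)"]))
        return "\n".join(lines)


# Hypothetical example system.
example = Manifest(
    hostname="web01.example.com",
    base_install="ExampleOS 1.0 minimal",
    packages=["httpd", "site-content"],
    config_files=["etc/httpd/httpd.conf"],
)
print(example.render())
```

If the rendered text no longer fits on one sheet, that is a hint that changes should be factored into packages or source control.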

Migrate manual changes into source control, and migrate common configuration and implementation into releasable packages. Automate manual procedures. Goals: keep the system manifest simple and keep "work effort" off production systems.

Test systems should be rebuilt regularly (every iteration or every few iterations) using production manifests. This "tests" the production manifest and keeps the test systems in sync with the production systems. Use test systems to aid in refactoring production manifests. Use automated installs to make rebuilding test systems faster and easier.
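One way to "test" a manifest after a rebuild is to compare what the manifest lists against what is actually installed. The sketch below assumes both lists contain only add-on package names (base-install packages already filtered out); the function name and package names are hypothetical, and how the installed list is gathered depends on your packaging tools.

```python
def manifest_drift(manifest_packages, installed_packages):
    """Compare the manifest's add-on packages with those found on a system.

    Returns a dict with packages the manifest lists but the system lacks
    ("missing") and packages on the system the manifest does not list
    ("unexpected"). Both lists are sorted for stable output.
    """
    expected = set(manifest_packages)
    installed = set(installed_packages)
    return {
        "missing": sorted(expected - installed),
        "unexpected": sorted(installed - expected),
    }


# Hypothetical example: the rebuilt test system lacks one package and has
# one hand-installed package that never made it into the manifest.
drift = manifest_drift(
    manifest_packages=["httpd", "site-content", "monitor-agent"],
    installed_packages=["httpd", "site-content", "debug-tools"],
)
print(drift)
```

An empty result on a freshly rebuilt test system suggests the manifest is complete; anything under "unexpected" on a production system is a candidate for migration into the manifest.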

Apply security and vendor updates regularly. Give updates a chance to "settle" before applying them. Use test systems to "age" updates before applying them to production systems. Keep up with major product and OS updates; they're easier when they're gradual. Use test systems to do parallel development, updating a test system with a new OS/product and identifying any issues with existing systems.
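The "aging" rule can be written down as a simple check. This is a minimal sketch: the 14-day settle period and the idea of counting open issues are assumptions chosen for illustration, not values from this page.

```python
from datetime import date, timedelta


def ready_for_production(applied_to_test: date, today: date,
                         settle_days: int = 14, open_issues: int = 0) -> bool:
    """Return True once an update has aged on a test system without problems.

    settle_days is an assumed policy value; pick whatever your site uses.
    """
    aged_enough = today - applied_to_test >= timedelta(days=settle_days)
    return aged_enough and open_issues == 0


# Hypothetical dates: the update went onto the test system on May 1.
print(ready_for_production(date(2005, 5, 1), today=date(2005, 5, 30)))
```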

Develop and factor common operations and maintenance procedures as checklists or outlines. For less-often used procedures, note the last time each was used. Double-check unfamiliar or older procedures before they're performed. The default procedure (typically used only in emergencies) is to create a brief plan and then log every step you take.
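Recording the last-used date makes it easy to flag procedures that deserve a second look before they are run. The sketch below is one possible representation; the class, the 180-day threshold, and the sample procedure are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Procedure:
    """A checklist plus the date it was last performed (None if never)."""
    name: str
    steps: list[str]
    last_used: Optional[date] = None


def needs_review(proc: Procedure, today: date, max_age_days: int = 180) -> bool:
    """Return True if the procedure has never been used or has gone stale.

    max_age_days is an assumed threshold, not a recommendation from the page.
    """
    if proc.last_used is None:
        return True
    return (today - proc.last_used).days > max_age_days


# Hypothetical procedure last performed more than a year ago.
restore = Procedure(
    name="Restore database from backup",
    steps=["Stop application", "Restore dump", "Run operational tests", "Start application"],
    last_used=date(2004, 1, 15),
)
print(needs_review(restore, today=date(2005, 5, 30)))
```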

At any given time, a production system should work indefinitely without intervention or support. Access to production systems should always be for a documented procedure (operations or maintenance). Enable and extend remote status tools to provide production status without requiring access. Use test systems and browse source control for checking how things are installed or working. Avoid using significant privileges (Unix's root) unless absolutely necessary. Factor out and wrap privileged operations whenever possible (Unix's sudo).

Create (or require) operational tests that verify that a system, product, or package is working as expected. Both on-line (non-destructive, accessibility) and off-line (thorough) testing should be used.
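A minimal sketch of the on-line (non-destructive) kind of test: each check is a named function that returns True or False, and a small runner collects the results. The function names and the example check names are hypothetical; the TCP check only confirms that something accepts connections, not that the service behind it is correct.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Non-destructive accessibility check: can a TCP connection be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(checks):
    """Run each named check and return {name: passed}.

    A check that raises an exception is recorded as failed rather than
    aborting the remaining checks.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


# Hypothetical checks that need no network, to show the runner's behavior.
results = run_checks({
    "always passes": lambda: True,
    "always fails": lambda: False,
    "raises": lambda: 1 / 0,
})
print(results)
```

A real check set would pass something like `lambda: port_is_open(host, port)` for each service; the thorough off-line tests belong on test systems, where they can afford to be destructive.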

Database schemas and updates are delivered with packages. In-place database updates are repeatably testable and verifiable on test systems, prior to release on production systems. Tuning information is maintained across updates (either through feedback to source code or externally merged).
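To make in-place updates repeatable, one common approach is to number the schema changes and record the current version in the database itself. The sketch below uses SQLite's `user_version` pragma from the Python standard library purely as an illustration; the table and column names are hypothetical, and other databases need their own version-tracking mechanism.

```python
import sqlite3

# Ordered schema changes shipped with the package. Entry N brings the
# database from version N to version N + 1.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE users ADD COLUMN email TEXT",
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply any migrations not yet applied and return the resulting version.

    Running this twice is harmless: the second run finds nothing to do.
    """
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, statement in enumerate(MIGRATIONS[current:], start=current + 1):
        conn.execute(statement)
        # PRAGMA does not accept bound parameters; version is a trusted int.
        conn.execute(f"PRAGMA user_version = {version}")
    conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]


# Rehearse the update on a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
print(migrate(conn))
print(migrate(conn))
```

Rehearsing the same migration list against a copy of production data on a test system is what makes the production run predictable.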

The operations group is a customer of development. Administrators are co-developers for many tasks. Every iteration release should be installed on a test system. Refactor installation processes and keep them as simple as possible. Minimal production support is a coding standard.

Require vendors and off-site developers to deliver using your system's packaging tools. Package any products downloaded or developed locally. Merge common, infrequently changed packages into your auto-install or vendor media.

Prioritize operations and maintenance over development, and refactoring over new development. Spread non-weekly tasks so that several occur every week. Keep a record of time spent on tasks for future scheduling. Group related operations together, and complete them as a whole (rather than tracking them as individual tickets). Rotate "on-call" duty as a daily or weekly task, so that others' tasks can be completed uninterrupted. (In other words, avoid mixing on-call duty with other tasks.) "Tasks" are the primary internal unit of staff load, with respect to team progress.
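Spreading non-weekly tasks can be as simple as dealing them out round-robin across the weeks of a cycle. This is a sketch under the assumption of a four-week cycle and equally sized tasks; the function name and task names are hypothetical, and recorded task times could be used to balance the weeks more carefully.

```python
def spread_tasks(tasks, weeks=4):
    """Assign tasks to weeks 1..weeks in round-robin order.

    Returns {week_number: [tasks]} so that every week gets a similar count.
    """
    schedule = {week: [] for week in range(1, weeks + 1)}
    for index, task in enumerate(tasks):
        schedule[index % weeks + 1].append(task)
    return schedule


# Hypothetical monthly tasks spread over a four-week cycle.
monthly = [
    "rebuild test systems",
    "review manifests",
    "rotate backup media",
    "audit accounts",
    "apply aged updates",
    "review stale procedures",
]
for week, assigned in spread_tasks(monthly).items():
    print(week, assigned)
```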

Keep your schedules and policies in an open place (web page or bulletin board). Let customers (internal users, stakeholders, and developers) know where you're spending your time. Let them know which tasks are fast-tracked (common operations), which require prioritization and queuing (development), and what your queue is. Let customers prioritize queued tasks among themselves.

A test system built according to the manifest of a production system needs only the production machine's data to replace that system. Perform such a replacement at least once a year for every production system.

Develop sub-systems (packages, databases, resources, configuration) so that they can be moved from one system to another easily. Migrating to a new version of a system is more easily done one sub-system at a time.


Last edited May 30, 2005.