I'm a sysadmin; it's my job to make sure my company's servers are doing what they're supposed to be doing when they are supposed to be doing it, and it's my job to solve any problems that interfere with that.
This job has given me what I call my Big Red Button Dream(tm): I dream of a separate entrance to my own office, with all of the monitors, servers, workstations, and whatever I need to do my job. No one sees me enter, no one sees me leave, no one knows if I'm working or sleeping. But when anything anywhere on the network breaks, a Big Red Button on the wall starts flashing to indicate a problem. In order to solve this problem, whether it's a service outage or a new server to build and deploy, I must reach over and smack that Big Red Button. This solves the problem, and I can go back to doing whatever it is that I am or am not doing. No one knows that all I do is push a button to fix every problem out there; all they know is that with me on the job, the systems never get in the way of the work they are supposed to do. Of course, the next step in the dream is to delegate the actual smacking of the button to someone else, but that requires there be someone else in my sysadmin cell, and it all kind of breaks down then.
This qualifies as a dream and not a goal because it is clearly unattainable, but maintaining this dream as a guiding ideal does a lot to keep me on what I see as the right track in my job as a sysadmin. In this two-part series, I hope to talk about how I use planning and automation in my quest to achieve this ideal, and specifically, why I begin both at the earliest possible point: when a server is built. Because not everyone truly understands the fundamental importance of planning and automation, the first part of this series will go through an explanation of the benefits of dedicating your life as a sysadmin to planning and automating everything you do, and the second part will focus on planning and automating the server build process.
Although I see automation as being contingent on planning, it is only when automation is attempted without planning that it becomes obvious how important the planning is. Therefore, I will discuss automation first, and that discussion will hopefully enlighten us as to the importance of planning.
From my perspective, automation provides five main benefits:
Reducing the amount of time a given task requires.
Automating a task means that less time is required each time that task is performed, which leaves more time to devote to other tasks, such as automation.
Reducing the opportunity for error in a given task.
Most tasks have to be done in certain ways, and leaving it to humans to perform those tasks leaves the chance that those humans will perform the task incompletely or incorrectly, or will break something essentially unrelated. When a task is automated, a preferred way can be found and the task can then be performed that way every time, essentially eliminating the chance for error in that specific task, as long as the automation was thoroughly planned and tested.
Reducing turnaround time for a given task.
While leaving more time for other work, automation also means that most work gets done faster. This is important in many situations, particularly while firefighting (solving service outages), performing work on production systems during short maintenance windows, or satisfying short project timelines. It is often worth spending more total time automating a task before it is needed because of the reduced time it takes to actually perform a task -- if it takes you twenty hours of scripting to successfully automate a four-hour task, but as a result you are able to fit it entirely within your server's two-hour maintenance window, then it was well worth the effort. This is, again, usually not possible without thorough planning and testing, which is often a significant portion of the automation time.
Enhancing and perpetuating configuration consistency across multiple systems.
In addition to humans potentially introducing error when they work, they also introduce something possibly more nefarious: individuality. Because they often cause outages of some kind, errors are usually caught and fixed, but when multiple people perform the same task in different but equally correct ways, there is no outage to catch. The problem with this situation is that once multiple people have started to do the same thing in different ways, system consistency is sacrificed. Once a network lacks overall consistency, it is far more difficult to come in afterward and automate. This situation also often ends up in a catch-22: there is not enough consistency to allow automation, but the lack of automation causes consistency to deteriorate. In addition to making networks harder to automate, a lack of consistency also makes networks significantly more difficult to administer in general, because all the exceptions have to be kept in mind when any work is performed.
Providing a limited kind of process documentation.
Last but not least, an automated task is a documented task. It might not be well-documented (although hopefully the code is well-commented, at the least), but even if the person who did the automation leaves the company, you can still go behind and read the scripts. This is far superior to information leaving with an employee, and also provides a starting point for other employees to begin learning the process involved.
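The consistency benefit described above lends itself to a quick check. What follows is only a minimal sketch, not a real tool: it assumes you have already collected the value of one setting from each host into a dictionary (the host names and NTP server values are made up for illustration), and it reports which hosts deviate from the most common value.

```python
from collections import Counter


def find_outliers(settings):
    """Return (majority_value, outliers) for one configuration item.

    `settings` maps a host name to the value of that item on the host.
    `outliers` maps each host whose value differs from the most common
    value to the value it actually has.
    """
    if not settings:
        return None, {}
    counts = Counter(settings.values())
    majority_value, _ = counts.most_common(1)[0]
    outliers = {
        host: value
        for host, value in settings.items()
        if value != majority_value
    }
    return majority_value, outliers


# Hypothetical inventory: every name and value here is an assumption.
ntp_servers = {
    "web1": "ntp1.example.com",
    "web2": "ntp1.example.com",
    "db1": "ntp2.example.com",
    "mail1": "ntp1.example.com",
}

majority, outliers = find_outliers(ntp_servers)
print("most common value:", majority)
for host, value in sorted(outliers.items()):
    print(f"{host} differs: {value}")
```

A host that differs is not necessarily wrong; it may be one of the "different but equally correct" cases. The point of a report like this is to make those exceptions visible before you try to automate around them.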
This is quite a lot, so you shouldn't need much more convincing. But in addition to these benefits, which I consider to be fundamental to automation and the main reasons for concentrating on it, automation allows you to package up a complex, senior-level task and delegate it to someone lower on the food chain. This provides the lower-level employee the opportunity to fully understand the task by reading and using the script, and it leaves the senior-level employee time for more important tasks, such as automation. Another great thing about automation is that it builds on itself; the more you automate the small tasks, the more you can build tools which automate the automation. This is obviously how my ideal of the Big Red Button happens: the button is the top of a very large pyramid of automation tools, set up so it can diagnose and solve any problem anywhere on the network.
As with all things, there are some caveats. Automation rewards in proportion to the complexity, repetition, or time consumption of a task, which means that sometimes automation ends up taking more time than it saves. Also, most of us are unfortunately hired into companies which already have computers, which often means that we walk into a situation with little or no consistency to start with; when this happens, we usually have to spend a significant amount of time bringing consistency to the network just to get to the point where we can start automating. This sometimes puts the benefits of automation so far away that they seem not worth the effort. Lastly, all automation requires significant testing, because by the time you notice there's a problem with your automation tool, it's usually too late to cancel it, and that is too large a risk to take on a production system.
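The caveat that automation sometimes takes more time than it saves can be checked with simple arithmetic before you start. The sketch below reuses the earlier example of twenty hours of scripting for a four-hour task with a two-hour maintenance window; the 1.5-hour automated run time is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def break_even_runs(dev_hours, manual_hours, automated_hours):
    """Smallest number of runs at which automating costs no more total
    time than doing the task by hand, or None if it never pays back."""
    saved_per_run = manual_hours - automated_hours
    if saved_per_run <= 0:
        return None
    return math.ceil(dev_hours / saved_per_run)


def fits_window(run_hours, window_hours):
    """True if a single run fits inside the maintenance window."""
    return run_hours <= window_hours


dev_hours = 20.0        # scripting and testing time (from the example)
manual_hours = 4.0      # time to do the task by hand (from the example)
automated_hours = 1.5   # assumed run time once automated
window_hours = 2.0      # maintenance window (from the example)

print("manual run fits window:", fits_window(manual_hours, window_hours))
print("automated run fits window:", fits_window(automated_hours, window_hours))
print("runs to break even:",
      break_even_runs(dev_hours, manual_hours, automated_hours))
```

The two questions are deliberately separate. A task that will never reach its break-even count can still be worth automating if automation is the only way to fit it inside the window.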
In the end, though, hopefully you'll see that, in the big picture, automation almost always profits you more than it costs. If you start by only automating the tasks that you get the biggest return on and then work your way down, you will soon find that there are only a few mundane tasks left to automate. Automating those last few tasks does take more time than it specifically saves, but now with that final automation, you have basically automated all of the low-level tasks on your network, and suddenly the whole is greater than the sum of its parts: instead of having to think in concrete terms about each task on your network, your tools provide an abstraction layer between yourself and the work you must do. This abstraction layer provides you a means of changing the way you think about your work -- instead of the network defining how you work, your tools do. Hopefully you've developed your tools to work the way you want them to, but if you haven't, you can reorganize how those tools work without actually impacting the underlying work they do -- this is the real benefit of this abstraction layer that the tools provide.
So you're convinced that automation is the way to go, and you're ready to get cracking. You are either starting with a clean slate, all your hardware ready to power on and install, or you have a network and you want to automate all of that time-wasting work that's been keeping you busy. Now I want to convince you not to start just yet.
Automation can and should save you significant amounts of time, but you can end up wasting all that time, and more, if you don't plan your automation. A plan can consist of as little as whom to inform and when to put it into production, but it can also be a project plan spanning months or years and requiring a complete rebuild of your network. Most plans are going to require some modification as you progress, but you will always have a better-designed, more complete solution if you make a plan before you start working.
At its most basic, a plan serves as a blueprint for your work, something you can use to remind yourself of what you are doing and in what order. In the same way, it allows other people, including your manager, and sometimes your users, to understand what you are doing and why. Because of this exchange of information, having a plan gives other people more confidence in your ability to do the job well, and you are thus much more likely to be given the resources you need to do it right and a better reward when it is all done.
As with any programming task, it is always best to work through an automation plan as completely as possible before beginning a full-scale implementation; there's no good excuse for doing 90% of the necessary work and then finding out that your solution isn't compatible with your network, or that it's just not possible. The process of planning your automation should provide you with an understanding of the complexity (or lack thereof) of the task at hand and the confidence that your methods will be sufficient. This again makes you more capable of convincing others to provide you with the necessary resources to do the job right the first time.
With a plan, you, your users, and your managers will all be surprised far less by the results of your work. Nothing goes perfectly, but at least when something does go wrong people will see it as a reasonable deviation from a reasonable plan, rather than the inevitable failure of someone with stars in his/her eyes.
As you get in the habit of planning the work you do, you'll find that your plans provide documentation of the network as you want it to be and your automation tools provide documentation of your network as it is. Between the two of them, it should be relatively easy for anyone to understand both the current state and the future direction of your network. This makes the loss of an employee less damaging, but it also makes everyone involved, from the low-level help-desk employee running shell scripts s/he doesn't understand to the manager who can't spell Unix to the users who have to change the way they work, feel like they are part of the ongoing work, because they can see and understand it as it progresses. It's all about people understanding and agreeing with your work, supporting you in doing it, and appreciating the amount of effort you had to put in to get it done; this is what facilitates your work as you do it, enhances your company's understanding of your importance in the network, and just generally gives you the satisfaction of a job well done.
Once you're in the habit of planning your work, and you have progressed in implementing some initial automation plans, the plans become easier, because you have a clearer idea of the network and thus a more realistic idea of the work involved. This, combined with the already-completed automation, in turn makes the continuing automation easier, thus getting the plan accomplished faster. Just like automation, planning seems to feed on itself, but planning also provides your way of communicating to the rest of the world what you are doing, when, and why, which is often important if other people happen to use the servers you maintain.
Now you're ready to go build a plan to redo your entire network so you can automate it down to the last detail, aren't you? Well, once again, I want to convince you not to start yet. It seems that even in system administration, patience is a virtue.
Part one of this series was for those who weren't aware of how much of quality system administration depends on automation and planning; part two will focus on where the network starts: the server build process. Nowhere are automation and planning more important than in the server build process, because it is the build process that determines how you will maintain your systems afterwards. An hour spent automating an install can often save you that same amount of time every month for the life of the server.
Luke A. Kanies is an independent consultant and researcher specializing in Unix automation and configuration management.
Copyright © 2009 O'Reilly Media, Inc.