The first phase of implementing Visible Ops is Stabilize The Patient and Modify First Response. This first order of business aims to reduce the amount of unplanned work significantly, to 25% or less, so that our organization can focus on more than just firefighting. When the environment is stabilized, efforts can instead be spent on proactive work that stops fires before they start.
The chapter, like all chapters in Visible Ops, begins with an Issues and Indicators section, describing the issues with both a formal statement and a narrative example. The issues are familiar to anyone in IT, from how self-inflicted problems create unplanned work to the inconvenient but predictable timing of system failures during times of urgency. The narrative example provided helps to relate the formal statement to the experiences we all share.
Stabilize the Patient
Visible Ops proceeds to describe the stabilization goal and examine the cause of instability. The basic goal, decrease the amount of unplanned work to free up time for proactive processes, is great but does not instruct us on how to achieve the goal. Understanding the causes of unplanned work gets us closer. Primarily, this is done by admitting that we have a problem: if most of the issues are self-inflicted, then we are causing most of the unplanned work. By self-inflicted, we mean change. Examining change schedules and business processes, we can identify the systems where change typically generates most unplanned work and which systems tend not to introduce unplanned work. We will call the systems generating the most unplanned work our “fragile artifacts”.
Once we have identified some fragile artifacts, we can take some steps to stabilize them. The first is to reduce access to these devices. If only people who are formally cleared may make changes, the number of changes and resulting unplanned work will be reduced. Documenting the new policy, establishing change windows, and notifying the stakeholders ensures that business needs can still be met. Because stakeholders’ previous expectations may have been that changes would be made immediately, or at least that day, by anyone on the team, it is vital to ensure they understand the new expectation of change times. Visible Ops also strongly suggests reinforcing the policy emphatically during this phase. This ensures that everyone understands the policy is in place to help and protect the team, lets the team know that cowboys will have to justify themselves, and reminds everyone that continued policy violations will not be taken lightly. Obviously, make sure we have management backing before stating such a policy!
The next step is to Electrify The Fence. This section includes a cautionary tale that many of us know: Change is out of control, so management puts in a change control system; six months later, there are few to no changes in the change system, yet the fragile artifacts continue to have the same rate of outages and unplanned work. Instead of flowing through the change system, changes are now circumventing it entirely. The problem is not that tracked changes are performed before authorization is granted or outside of maintenance windows; the changes simply never hit the system to begin with. This can have significant side effects that we want to minimize. Those attempting to resolve issues are not able to review recent changes and must guess at what happened, when, and by whom, increasing the Mean Time To Repair (MTTR). If changes need to be backed out, further archeology is required to ensure all changes are reverted, assuming they can be backed out at all (think of database modifications that could break other changes performed later). Two people or teams could decide to make changes at the same time, unaware of each other’s plans, and disrupt both changes.
Implementing detective controls is a way to take the fence around the change process and electrify it by finding out when people and processes violate it. An example is Tripwire, which automatically detects and reports changes on a system. When a change is performed outside of the established process, the detective control notifies the system owners automatically. This can often be tied into a change system so notifications are only sent for changes not associated with an existing change order or performed outside of maintenance windows. This is also where Configuration Management (CM) tools like Puppet, which force a system to converge on a desired state, come into play. Unauthorized changes can be detected and reverted by the system, sometimes before we even receive the alert of the change. To make a change permanent, the change control policy must be followed so that the change is integrated into the CM system.
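To make the idea of a detective control concrete, here is a minimal sketch of baseline comparison using file hashes. It is not how Tripwire or Puppet is implemented; the function names snapshot and detect_changes are my own, and a real tool would also track permissions, ownership, and much more.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each file path to the SHA-256 digest of its current contents."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def detect_changes(baseline, current):
    """Return paths that differ from the baseline, were added, or were removed."""
    changed_or_added = [p for p, digest in current.items() if baseline.get(p) != digest]
    removed = [p for p in baseline if p not in current]
    return sorted(changed_or_added + removed)
```

We would take a snapshot after each authorized change, then periodically compare a fresh snapshot against it. Anything detect_changes reports that has no matching change ticket is a candidate cowboy change.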
This section finishes by reminding us that high performing IT organizations accept only one number of unauthorized changes: Zero. Kevin Behr describes how his organization established a tradition of anyone performing a cowboy change having to upload a photo to a specific web page as a form of accountability and contrition. At one of my previous companies, we had a pound puppy that would reside on the most recent cowboy’s monitor (it was a while ago!) until someone else owned it. These kinds of traditions, done right, can energize the team to enjoy the new change control policies.
Modify First Response
The next section encourages us to continue our introspection and review our first response processes. Most organizations initially view change management as a bureaucratic nightmare that just consumes everyone’s resources and energy. However, change management can be integrated with our problem response processes to reduce MTTR. Visible Ops suggests MTTR as one of the most vital metrics measuring our organization’s performance. In this way, good integration can provide great value to our organization, rather than being red tape that drags everyone down.
The most significant modification to our first response is to ask a simple question first: What approved or detected changes were made on the system in the past X hours? (The book suggests 72 hours; I think this varies based on your industry and change rate.) As mentioned earlier in the book, 80% of outages are caused by change and 80% of MTTR is spent trying to find that change. Examine the changes and attempt to eliminate each as a causal factor as soon as possible, and document this effort and the reviewed changes in the ticket. Studies show that by following this process, issues can be successfully diagnosed without logging into devices over 50% of the time. We repeat this process throughout the infrastructure in widening circles of related systems – a single VM, the local LAN, the edge connection, the WAN, etc.
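As a rough illustration of that first question, here is a sketch that pulls recent changes for a set of systems out of a change log. The Change record and its fields are hypothetical, not something Visible Ops or any particular ticketing system defines; the 72-hour default comes from the book's suggestion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    system: str
    summary: str
    made_at: datetime
    authorized: bool  # False for changes found only by a detective control


def recent_changes(changes, systems, now, window_hours=72):
    """Return changes to the given systems within the window, newest first."""
    cutoff = now - timedelta(hours=window_hours)
    hits = [c for c in changes if c.system in systems and cutoff <= c.made_at <= now]
    return sorted(hits, key=lambda c: c.made_at, reverse=True)
```

Widening the circle is then a matter of calling recent_changes again with a larger set of systems: first the failing VM, then its neighbors on the LAN, and so on.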
Above all, use analytical and iterative steps to identify the options. Remove the temptation to reboot first. Rebooting does not identify or resolve root causes; it only hides them, and it does nothing to further the integration of change management and problem resolution.
Create The Change Team
The rest of the chapter is devoted to the effort of creating a change team – what it is, who is involved, when they meet, and some do’s and don’ts.
The first step is the creation of a Change Advisory Board (CAB). This board (or multiple boards, in larger organizations) is made up of stakeholders of our critical systems. Stakeholders in this sense means those who can best make decisions on changes based on their understanding of the business goals and technical and operational risks – we want people who can make fact-based decisions, not ego- or product-based decisions (avoid the HiPPO method). Once the board is assembled, it is important to remember that urgent changes must also be handled by the CAB, despite the protests of everyone who thinks their urgent changes are simple and safe. These are the changes most likely to require extra consideration and approval. Establish a process to call a CAB emergency committee (CAB/EC) meeting to review such changes promptly when time is a factor.
The next step is to implement a ticketing system (hopefully, your company already has one!). Every change must have an associated ticket, allowing us to track individual changes and also report on metrics in aggregate. Change tickets need to capture many pieces of information in addition to the change itself to help change managers determine whether they should approve a given change. Specifically, the change, the risk and criticality, and the back-out plan are all required. For high-risk changes, test results should also be included.
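A ticket's required fields can be sketched as a small record with a completeness check. The field names and the "low"/"medium"/"high" risk scale below are my own assumptions for illustration; a real ticketing system will have its own schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeTicket:
    description: str
    risk: str  # assumed scale: "low", "medium", "high"
    criticality: str
    backout_plan: str
    test_results: Optional[str] = None  # required only for high-risk changes


def missing_fields(ticket):
    """List the required fields that are empty on this ticket."""
    required = ["description", "risk", "criticality", "backout_plan"]
    if ticket.risk == "high":
        required.append("test_results")
    return [name for name in required if not getattr(ticket, name)]
```

A change manager (or the ticketing system itself) could refuse to put a ticket on the CAB agenda until missing_fields returns an empty list.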
With a CAB and a ticketing system in place, it’s time to start weekly change management meetings and daily change briefings. The weekly meeting is where changes are proposed and authorized. Daily briefings let the team announce authorized changes to the stakeholders. With practice, 15-minute calls are practical without becoming a rubber stamp factory. Focus on making the process easy so that team members are encouraged to use it, not circumvent it.
The suggested interval for meetings and approval is the first place that Visible Ops shows its age. Many organizations now push changes tens or hundreds of times per day. Waiting for any interval may be impossible as it would cause work to pile up immediately. Visible Ops does suggest tracking the Change Success Rate for all changes and classifying each change based on the type. The change success rate can be calculated for each type and those with extremely high rates – 99% or higher – may be made exempt from the CAB, though still tracked in the system. Code reviews are a decent substitute as an authorization gateway as many version control systems can be configured to prohibit merging of code that is not reviewed by someone other than the submitter or by specific reviewers. Those change types with a lower success rate must still be brought to the CAB. The change success rates should be reviewed regularly to ensure rates do not slip and teams can be adjusted as needed. You can also increase or decrease the frequency of CAB meetings, for those changes that do make it there.
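The per-type Change Success Rate is simple arithmetic: successful changes divided by total changes of that type. Here is a minimal sketch of how we might compute it and pick out CAB-exempt types. The 99% threshold comes from the text above; the min_samples guard is my own assumption, added so a type with only a handful of changes does not earn an exemption on a lucky streak.

```python
from collections import Counter


def tally(outcomes):
    """outcomes: iterable of (change_type, succeeded) -> {type: (successes, total)}."""
    totals, successes = Counter(), Counter()
    for change_type, succeeded in outcomes:
        totals[change_type] += 1
        successes[change_type] += int(succeeded)
    return {t: (successes[t], totals[t]) for t in totals}


def cab_exempt_types(outcomes, threshold=0.99, min_samples=100):
    """Change types whose success rate meets the threshold over enough samples."""
    return sorted(
        t
        for t, (ok, n) in tally(outcomes).items()
        if n >= min_samples and ok / n >= threshold
    )
```

Re-running this over a rolling window of recent changes is one way to notice when a previously exempt type slips below the threshold and should go back to the CAB.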
A suggested agenda is provided for the weekly CAB meetings, along with a list of Who/What/When/How/What If questions to help focus the team, such as Who will be affected? What assets are involved? When will the benefits be realized? How is success determined? and What will the worst case service outage be? Next are some Do’s and Don’ts that the ITPI studies collected, including obvious items like, “Do track the outcomes of all changes (change withdrawn, aborted, completed successfully),” and the less obvious, “Don’t send mixed messages such as ‘just do it’ or ‘we did it that way last time’.”
The changes we see will progress across a spectrum that Visible Ops describes. The spectrum has 7 gradients, from Oblivious to Change to Managing Change. Visible Ops then describes some top reasons that such initiatives fail, including common refrains such as, “We won’t be able to do anything anymore!” and helpful responses to the claims.
Summary
Each chapter ends with a section titled What You Have Built And What You Will Likely Hear. We have stopped inflicting wounds on ourselves, significantly lowered the amount of unplanned work, and integrated change processes with troubleshooting. Putting an electric fence around change control ensures that change is monitored and successful, driving down MTTR and increasing the throughput of the organization. What You Will Likely Hear is provided through a series of quotes from managers, consultants, and executives who have participated in Visible Ops implementations. The last page also includes some tips on preparing for audits. As mentioned in the review, Visible Ops prepares us to have a healthy relationship with auditing teams and does so through these prescriptive steps.