The final phase of Visible Ops is Enable Continual Improvement. To really succeed with our efforts, we need to make sure that the resources we have are allocated optimally toward our business goals. With most of our fires put out and significant effort invested in avoiding future fires, we have the time available to do this right. To determine where our resources should be allocated, we need to look at metrics.
Metrics And How To Use Them
It is very easy to collect too much information and end up with metrics that mean little to nothing. Consideration is given here to the types of metrics that hold actual value. MTTR and MTBF are vital to almost every IT organization, but are a result of hundreds of other factors. Monitoring MTTR and MTBF is valuable, but we can’t change those values by working on them directly. They are indicators of issues elsewhere, which we need to identify and focus our efforts on. Visible Ops breaks down these metrics into three categories that reflect our ability to:
- Release: generate and provision infrastructure
- Controls: make change decisions that ensure infrastructure is available, predictable and secure
- Resolution: diagnose and resolve issues
The next few pages introduce specific metrics to collect, including Shelf life of builds, Percent of systems that match known good builds, Number of actual changes made per week, Changes submitted vs changes reviewed, and of course MTTR and MTBF. With these metrics, we can close the feedback loop and implement fixes at the beginning of the lifecycle, where we know the defect cost is low. Every issue fixed in pre-production prevents unplanned work and outages in production.
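To make a few of these metrics concrete, here is a minimal Python sketch of how MTTR, MTBF, and the percent of systems matching known good builds might be computed. It is my own illustration, not something from the handbook: the `Outage` record, the function names, and the idea of comparing build identifiers against a set of known good ones are all assumptions, and the functions assume at least one outage and at least one deployed system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """Hypothetical outage record; field names are illustrative."""
    start: datetime  # when service was lost
    end: datetime    # when service was restored


def mttr(outages):
    """Mean time to repair: the average outage duration."""
    downtime = sum((o.end - o.start for o in outages), timedelta())
    return downtime / len(outages)


def mtbf(outages, window_start, window_end):
    """Mean time between failures: uptime in the window per failure."""
    downtime = sum((o.end - o.start for o in outages), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(outages)


def percent_matching(deployed, known_good):
    """Percent of systems whose build identifier is in the known good set.

    `deployed` maps a system name to its build identifier.
    """
    matches = sum(1 for build in deployed.values() if build in known_good)
    return 100.0 * matches / len(deployed)
```

Fed with outage records from a ticketing system and build identifiers from an inventory, numbers like these can be trended week over week. The trend is the prompt to go looking for causes; the number itself is not the goal.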
As we work to improve our metrics, it is vital to focus on the quality behind the numbers rather than the raw numbers themselves. The numbers are symptoms, not causes. Focusing on increasing the Change Success Rate without looking at the quality of changes will lead to the team circumventing the change policy, which raises the Change Success Rate on paper while making things worse where it counts: MTTR goes up and MTBF goes down. Instead, analyze the quality of the changes and fix the cause at the root, such as bad change guides, poor code reviews, or undocumented and fragile artifacts. Do not make metrics into a numbers game.
There are other improvement points that aren’t tied to metrics. Examples are provided in the same three areas, Release, Controls, and Resolution. Almost all of the recommendations – e.g. Change management meetings must have a specified agenda – have already been explored. Consider this a checklist of items and processes that should be constantly reviewed for improvement. Agendas tend to grow over time and team members start to get off-topic, which may lead to lower participation or increased meeting times. If you spend 5 minutes checking this quarterly, you can keep this in check before it blows up.
One recommendation stands out that was only touched on earlier: “Track repeat offenders who circumvent change management policies. Determine the best course of correction action[…]” It is definitely vital to track this and ensure that bad behavior is not rewarded. I will add that it is important to respect people. This is a tenet of the Toyota Production System, yet another precursor to Visible Ops. Even if someone is actively offending (maliciously, apathetically, out of carelessness, etc.), be respectful. If the offender actually enjoys what they do, training and mentoring can help get them on track. If the person does not show an interest, help them find something that does interest them, preferably at your company, even if it’s in another organization. Even if someone is consciously and purposefully ignoring and violating the policies, do what you can to help that person find a good home. Perhaps they need to go elsewhere, but you can help them land on their feet. Letting someone go should be a last resort, or the consequence for unambiguously malicious behavior only, such as installing backdoors or destroying infrastructure!
I think it’s vital to emphasize this. We have, every single one of us, been in a situation where we did not like what we did or we were so far behind our coworkers that our performance suffered. Usually it’s related to some project and we get past it shortly, but sometimes it’s a deeper issue. In such situations, it becomes easy for people to do things wrong, to be less careful and make mistakes, or to simply not care anymore. Trying harder at something you emphatically do not like only hides the problem; it does not fix it. The solution is often to get these people some training or help them find another role. For instance, I have worked alongside someone in operations who was struggling a bit at the job. Thankfully, a development position opened up, they were able to transfer into it, and they turned out to be one of the best developers I have ever worked with. If someone on your team is struggling, help them out. If you’re not in a position to help directly, advocate for your co-worker.
A Caution About Automation
The chapter finishes up with a cautionary tale of automation gone bad. Attempts to automate processes that aren’t fully understood, whether they are new or more “magical” older processes, can be very dangerous. Consistency must be there before you worry about automation. Or as Alan Renouf says, “you will retrieve mass produced crap!”
With the completion of Phase Four, our feedback loop is in place. We can now iterate through the four phases continually, catching what was missed on the last iteration or improving it. We also have data that we rely on, rather than hunches and feelings, to drive our business decisions and outcomes – management by fact. Visible Ops finishes with a short summary of all four phases, some encouragement from the authors to pursue the Visible Ops methodology, and some Deming quotes.
I hope you have enjoyed both the Visible Ops Handbook itself and this mini-series about it. Please let me know what you think in the comments and on Twitter. Enjoy!