In the second phase of Visible Ops implementation, our goal is to Catch & Release and Find Fragile Artifacts. This phase focuses on creating and maintaining an accurate inventory of assets and highlighting those that generate the most unplanned work. The various configurations in use are also collected in order to start reducing the unique configuration counts, the focus of Phase Three. This is the shortest chapter in the book at 6 pages, though it may take significant time to complete the work efforts.
The Issues and Indicators section lays out the issues being tackled, including moving from “individual knowledge” to “tribal knowledge” (a shared knowledge base the entire organization can access, rather than knowledge kept in people’s heads) and preventing the “special snowflake” syndrome, where every node in the network, even those in clusters or farms, is similar but still unique.
Catch & Release
The Catch & Release project is analogous to catching wild animals to tag them or otherwise attach a tracking device before releasing them into the wild again. Just as naturalists have attributes and metrics they track, Visible Ops suggests what we should track: what is running on a node, what depends on those services, what services the node itself depends on, who the stakeholder is, who is authorized to make changes, how fragile it is (how much unplanned work it generates), and so on. A list of over 30 questions is provided.
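As a rough illustration of what one "tagged" node might look like, here is a minimal sketch of an inventory record. The class name, the field names, and the `unanswered` helper are all hypothetical choices made for this example; they cover only a handful of the attributes mentioned above and are not the book's question list or its Appendix E schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConfigurationItem:
    """One inventoried node. All field names are illustrative assumptions."""
    name: str
    services: List[str] = field(default_factory=list)       # what runs on the node
    depends_on: List[str] = field(default_factory=list)     # services this node needs
    depended_on_by: List[str] = field(default_factory=list) # who relies on this node
    stakeholder: str = ""                                    # who owns it
    authorized_changers: List[str] = field(default_factory=list)
    unplanned_work_hours: float = 0.0                        # rough fragility signal

    def unanswered(self) -> List[str]:
        """Return the key questions that still have no answer recorded."""
        gaps = []
        if not self.services:
            gaps.append("services")
        if not self.stakeholder:
            gaps.append("stakeholder")
        if not self.authorized_changers:
            gaps.append("authorized_changers")
        return gaps
```

A record with gaps tells us which questions still need to go to the people who hold the answers in their heads, for example `ConfigurationItem(name="web01", services=["httpd"]).unanswered()`.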
Gathering answers to these questions is tedious, and it is tempting to let juniors and neophytes collect the data. This is a bad idea, as much of the data exists only as individual knowledge in the heads of the senior and longest-tenured staff. If this documentation existed, we wouldn’t be performing this step! All collected information needs to be placed in a Configuration Management Database (CMDB), our one source of truth for the inventory. Appendix E provides a sample database schema. There are plenty of commercial products that can act as our CMDB.
Many of us will find we end up with multiple CMDBs due to legacy product decisions or permissions around who can access data. Ensure that each piece of data has only one source of truth. For example, we might have one database that the entire company uses to track IP address assignments and a separate database for asset inventory. If possible, have the asset inventory database poll the IP database, or build a web frontend that pulls data from both databases for easy presentation.
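The "one source of truth per piece of data" idea can be sketched as a read-only view that joins the two stores at presentation time instead of copying IP addresses into the asset inventory. The dictionaries below stand in for the two databases, and the function name and keys are hypothetical.

```python
def combined_view(assets, ip_assignments):
    """Join asset records with IP assignments by hostname.

    `assets` maps hostname -> dict of asset attributes (the asset inventory).
    `ip_assignments` maps hostname -> IP address (the IP database).
    Neither input is modified; the IP is looked up, never stored back.
    """
    view = []
    for hostname, asset in sorted(assets.items()):
        row = dict(asset)                           # copy so the source stays untouched
        row["hostname"] = hostname
        row["ip"] = ip_assignments.get(hostname)    # None if no assignment exists
        view.append(row)
    return view
```

Because the IP address lives only in the IP database, a reassignment there shows up in the view on the next read with nothing to reconcile.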
Find Fragile Artifacts
We did label a few systems as fragile in Phase One mostly based on gut feelings. Here we take a more in-depth look at the artifacts that have been inventoried. We determine fragile artifacts by their low Change Success Rate and high Mean Time To Repair (MTTR) – or, as the book puts it, we’ll recognize them by “knowing that if someone even looks at one wrong, it will crash and cause a massive upside of unplanned work.” We likely know where many of these fragile artifacts are already, and we can use the collected information to complete the list and sort it by change success rate and MTTR.
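To make the sorting step concrete, here is a minimal sketch that ranks systems with the lowest change success rate first and breaks ties by the highest MTTR. The record layout and the decision to treat a system with no recorded changes as fully successful are assumptions made for this example, not guidance from the book.

```python
from dataclasses import dataclass


@dataclass
class ChangeHistory:
    """Per-system change statistics. Field names are illustrative."""
    name: str
    changes_attempted: int
    changes_succeeded: int
    mttr_hours: float

    @property
    def change_success_rate(self) -> float:
        # Assumption: a system with no recorded changes counts as 100% successful.
        if self.changes_attempted == 0:
            return 1.0
        return self.changes_succeeded / self.changes_attempted


def rank_fragile(items):
    """Most fragile first: lowest success rate, then highest MTTR."""
    return sorted(items, key=lambda i: (i.change_success_rate, -i.mttr_hours))
```

The top of the resulting list gives the candidates for a “Do Not Touch!” label.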
With the list in hand, we mark the devices in question as fragile. Visible Ops suggests being bold. Attach a sign that says “Do Not Touch!” to the device physically. Update the login banner to say “Do Not Touch!” Notify the CAB not to permit normal changes and to weigh emergency changes against the cost (below). The only change suggested at this point is to implement our detective controls to ensure the guidelines are followed.
We can also put a price tag on changes, at least in rough terms, by taking the aggregate time of unplanned work generated on a system and dividing it by the number of changes in that time. This value can be provided to the CAB, who can then intelligently determine if a change on that system is worth the potential unplanned work. The book refers us to a story of a $239M weather satellite being tipped over when someone bumped into it because someone had taken the security bolts out of the stand but neglected to put a warning sign on it.
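The price-tag arithmetic described above is simple enough to sketch directly. The function names are hypothetical, and the optional hourly rate used to turn hours into money is an assumption added for this example; the text only describes dividing unplanned work time by the number of changes.

```python
def unplanned_hours_per_change(unplanned_hours: float, change_count: int) -> float:
    """Average unplanned work generated by one change on a system."""
    if change_count <= 0:
        raise ValueError("need at least one change in the period to compute a rate")
    return unplanned_hours / change_count


def price_tag(unplanned_hours: float, change_count: int, hourly_rate: float) -> float:
    """Rough cost of one change, given an assumed loaded hourly rate."""
    return unplanned_hours_per_change(unplanned_hours, change_count) * hourly_rate
```

For example, a system that generated 120 hours of unplanned work across 8 changes costs 15 hours per change on average, which is the figure the CAB would weigh against the benefit of the next proposed change.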
The idea behind labeling fragile artifacts is to positively reinforce such learning. Our team is documenting what does and does not work, replacing gut feelings with informed decision making and avoiding historical decisions that generated unplanned work for the team.
Prevent Further Configuration Mutation
The last section of the chapter deals with mutation of the systems once this phase has begun. It is advisable to have a change freeze while documenting existing builds. This means not just a freeze on changes to systems already in production, but also a freeze on introducing new systems, a subtle issue that is easy to miss. We accomplish this by ensuring that the Catch & Release and Find Fragile Artifacts steps, which can take a while, have definitive start and end dates. The CAB/EC meetings can capture any true emergency changes that must occur during the freeze. It is also vital to ensure that detective controls are in place before the freeze ends, as these tools ensure unauthorized changes are not performed and that future authorized change information can be captured as well. Unauthorized changes should, as always, be backed out.
This section also touches on another Configuration Management topic: configuration drift. Even with a firm and observed change policy in place, it is possible for system drift to occur. In addition to installing detective controls, use this opportunity to install and configure any CM tools that help prevent or revert such configuration drift. We’ll have more on using CM during Phase Three.
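One simple form of detective control is to fingerprint the files we care about during the freeze and compare later snapshots against that baseline. The sketch below uses SHA-256 hashes from Python's standard library. The function names and the report layout are hypothetical, and a real deployment would more likely rely on a dedicated integrity-monitoring or CM tool than on a script like this.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Map each file path to the SHA-256 hex digest of its contents."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def detect_drift(baseline, current):
    """Compare two fingerprint maps and report what differs.

    Both arguments map path -> digest, as produced by fingerprint().
    """
    return {
        "changed": sorted(p for p in baseline if p in current and baseline[p] != current[p]),
        "missing": sorted(p for p in baseline if p not in current),
        "added": sorted(p for p in current if p not in baseline),
    }
```

Any non-empty entry in the report that does not match an authorized change is a candidate for backing out.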
Summary
With Phase Two completed, we have a good inventory of our systems and have identified the fragile artifacts that cause unplanned work when we make changes. This information forms a service catalog, documenting the services in use and the infrastructure they run on as Configuration Items (CIs) stored in our CMDB. Every CI is associated with one or more services and with zero or more other CIs. Detective controls and configuration management tools ensure this information remains relevant. The completed, prioritized list of fragile artifacts will then feed into Phase Three.