Phase three of Visible Ops is Create a Repeatable Build Library. This phase’s focus is to define build mechanisms, create system images, and establish documentation that together describe how to build our desired infrastructure from “bare metal” (I’ll continue to use “bare metal” throughout for consistency, but “bare VM” may be more appropriate in today’s virtualized IT). This allows us to treat our infrastructure like fuses. When a fuse pops, it is discarded instead of repaired and a new fuse is inserted in its place; likewise, when a system fails, it is removed from service and a new system is provisioned in its place. All high-performing IT organizations, not just the unicorns of IT, use this technique. This chapter focuses on how to achieve that goal.
Issues and Indicators
Coming off of Phase Two, where we merely identified what we have, it is not surprising that many of the issues are similar. Configuration drift and Special Snowflake Syndrome rear their ugly heads, as do development making fixes in production and Ops crashing systems by applying patches. One issue is phrased a bit awkwardly to me: “The production team supports unusual applications and infrastructure.” This refers to the proliferation of umpteen types of database servers, every web server out there, 20 different versions of Java – situations many had under control until IT was decentralized or strangled by SHadow IT (SHIT). Thanks to Visible Ops, we can get a handle on all our SHIT.
Create A Release Management Team
The first goal is to move staff from a reactive, fire-fighting role to a new, proactive, release management role. This team’s focus is software and integration releases for production. This work is now early in the pipeline (like the integration of change control into troubleshooting) where the cost of defects is low. The team should consist of the most senior IT Ops staff, which should have the benefit that most of these people were part of Phase Two’s Catch and Release.
The Release Management Team (RMT) is ideally only responsible for the mechanisms to deploy into production and engineering the builds, not actually building and deploying to production. This guidance may not hold up for every organization, especially those embracing DevOps methodologies. We will discuss this in more detail later. The team should first focus on decomposing and engineering new builds for the most fragile artifacts. If we can reprovision a system from bare metal rapidly, we have confidence that we can perform changes on the system and can truly treat systems as replaceable fuses.
Visible Ops lists a number of benefits to this approach, two of which I found more important than the others. When we rebuild infrastructure in situ, the chances of introducing configuration variance are high; provisioning new infrastructure dramatically lowers this chance. The automated process can also be timed, giving us a very reliable number to use as a troubleshooting and change guideline: if it takes 15 minutes to build a new system, any troubleshooting or backout procedure that takes more than 14 minutes is likely wasted (counterpoint: chronic issues should be troubleshot ASAP, as 15 minutes * number of outages can add up quickly and erode users’ and colleagues’ confidence in the system).
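The timing guideline and its counterpoint can be put into a few lines of Python. This is only a sketch of the arithmetic: the function names are my own, and the 15-minute figure is the example number from above, not a measurement.

```python
REBUILD_MINUTES = 15  # example build time from the text, not a measured value


def should_rebuild(estimated_repair_minutes: float,
                   rebuild_minutes: float = REBUILD_MINUTES) -> bool:
    """Prefer a rebuild once the repair is expected to take at least as long."""
    return estimated_repair_minutes >= rebuild_minutes


def cumulative_outage_minutes(outages: int,
                              rebuild_minutes: float = REBUILD_MINUTES) -> float:
    """Total downtime if every outage of a chronic issue is 'fixed' by rebuilding."""
    return outages * rebuild_minutes
```

For instance, should_rebuild(20) is True and should_rebuild(5) is False, while cumulative_outage_minutes(8) comes to 120 minutes, which is the point of the counterpoint: a chronic issue that is only ever rebuilt away still costs real time.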
Create A Repeatable Build Process
The next few pages discuss the repeatable build process and introduce us to some of the more important terminology and processes of Visible Ops. The build process results in Golden Builds: an image, ruleset, set of documents, or some combination of the three that brings bare metal to a desired state that has been tested and approved before being introduced to production. These Golden Builds are kept up to date over time with OS and application patches and as services are added to the systems. Golden Builds are stored in the Definitive Software Library (DSL), a conceptual “vault” where our software assets reside. The DSL is where OS ISOs, Infrastructure-As-Code version control systems, license keys, and patches all live.
It’s worth emphasizing that the DSL is internal to our company. By removing the reliance on upstream repos as much as is reasonable (not every company can or wants to host half of GitHub), new systems can be provisioned when the internet connection is down or over-utilized – something that may not happen often, but is vital when we need to rebuild the system that provides the internet connection. We can also be confident that we will not discover, in the middle of an outage, that the download link for an older OS we support has disappeared. By ensuring all content and dependencies are in the DSL, we can avoid those rare catch-22 situations.
The RMT will obviously be focusing on the fragile artifacts, but can further focus and prioritize by reviewing the common components used across multiple systems and searching for the lowest common denominators. This term is often used pejoratively. In our case, it’s a positive, as the goal is to find areas where the team’s work has larger benefits. A good example is web servers. Perhaps we use Apache everywhere, but with different configs. A resulting base config that satisfies all systems but is extensible by each system ensures that the effort pays off handsomely.
For each of these components, the goal is to have a “push button” or equivalent deployment capability. Systems can be deployed from bare metal by layering the components together properly. With critical systems or load-balanced/farm environments where multiple nodes perform the same function, these builds can be deployed automatically by a provisioning system with minimal or no human intervention. An approval process is created to accept items into the DSL that ensures documentation exists and testing was performed accurately, as well as capturing information on the subject matter expert (SME) responsible for maintaining the build.
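The approval process boils down to a checklist, which can be sketched in Python. Everything here is hypothetical (the BuildSubmission fields and the acceptance_problems function are names I made up for illustration); Visible Ops only says that documentation, testing, and an SME must be captured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BuildSubmission:
    """A build proposed for the DSL (field names are illustrative assumptions)."""
    name: str
    documentation_path: str  # where the build documentation lives
    tests_passed: bool       # outcome of pre-production testing
    sme_contact: str         # subject matter expert who maintains the build


def acceptance_problems(submission: BuildSubmission) -> List[str]:
    """Return the reasons a submission cannot yet be accepted; empty means accept."""
    problems = []
    if not submission.documentation_path.strip():
        problems.append("missing documentation")
    if not submission.tests_passed:
        problems.append("testing not completed successfully")
    if not submission.sme_contact.strip():
        problems.append("no SME recorded")
    return problems
```

Returning a list of problems rather than a bare yes/no gives the submitter something actionable, which matters when the approval step is meant to be a gate and not a wall.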
Finally, the completed builds are stored in the DSL. Visible Ops suggests a clean room for media and a segregated network. More modern methodologies have emerged since Visible Ops was written that make this recommendation in particular feel outdated. Infrastructure-as-Code, for instance, relies on the “builds”, or at least some portions of them, residing in version control that is accessible by developers, operations, and the systems themselves. The constant change makes it difficult and almost meaningless to capture a build at a given point-in-time. We can still ensure that our media and version control systems are backed up and meet our business continuity/disaster recovery guidelines, just as any other critical business systems do. Visible Ops does require that all software in the DSL, internal or external, remain under version control, so the DSL concept itself remains vital.
When beginning Phase Three, the DSL is empty, the RMT has not produced a Golden Build, and there are still fragile artifacts in the environment. Visible Ops suggests an “amnesty program” that allows the existing, non-repeatable builds to be placed in the DSL with a stringent expiration date, such as one year. By adding the fragile artifacts to version control, the organization can at least ensure those artifacts have some recovery capabilities if the worst were to happen before the new builds were completed. These amnesty builds can be replaced wholesale or component by component until the expiry or the new build is complete. This could be as simple as replacing multiple versions of Red Hat Enterprise Linux with the latest RHEL, but with the same poorly documented tarball of an app on top, in order to reduce the number of supported operating systems.
The DSL will need to be reviewed periodically to prune outdated and no-longer-used builds, as well as to ensure only authorized components are present. This is another challenge that Infrastructure-as-Code presents, as the builds may be more “tangled” than in a traditional DSL, but with some effort unused portions of the codebase can be removed or refactored.
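A periodic review like this, including the amnesty expiration dates, is easy to automate if the DSL keeps a little metadata per build. The sketch below assumes a hypothetical inventory (the DslEntry fields and the one-year idle threshold are my own choices, not something Visible Ops prescribes).

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class DslEntry:
    """One build in the DSL inventory (hypothetical fields for illustration)."""
    name: str
    last_deployed: date
    amnesty_expires: Optional[date] = None  # None means a proper Golden Build


def review_dsl(entries: List[DslEntry], today: date,
               max_idle_days: int = 365) -> List[Tuple[str, str]]:
    """Flag amnesty builds past their expiry and builds not deployed recently."""
    findings = []
    for entry in entries:
        if entry.amnesty_expires is not None and entry.amnesty_expires <= today:
            findings.append((entry.name, "amnesty expired"))
        elif (today - entry.last_deployed).days > max_idle_days:
            findings.append((entry.name, "unused"))
    return findings
```

The function only reports findings; whether an expired amnesty build is removed, extended, or escalated is a decision for people, not the script.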
Though the role of Configuration Management (CM) is not specifically discussed in Visible Ops (it was written in 2004, before CM was a “thing”, at least as we know it now), this chapter certainly describes a system providing features that we now associate with configuration management. Particularly when defining common components used in various systems, we can see how CM can meet our needs. Rather than focus on building an Apache config by hand, we can leverage our CM tool(s) to deploy an Apache module/cookbook that does it for us. This is a strong benefit of the Infrastructure-as-Code pattern described above.
In particular, Puppet’s role/profile and Hiera pattern allows the RMT to craft a lowest-common-denominator Apache profile, then apply role, environment, element, or other tiers of hierarchical data to configure Apache. Operations can easily maintain this as the options can be tweaked in a human-readable file that Puppet uses to enforce state on the node. Any unauthorized changes are easily detected and the node reconverges on the desired state at the specified frequency. We may still benefit from some detective tools (on the node or that pull change notifications from Puppet’s logs). I am familiar with Puppet but am certain that other CM tools have similar capabilities.
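To make the layering idea concrete without tying it to any one tool, here is a small Python sketch. It is not how Puppet or Hiera is implemented; the function names, setting names, and values are all invented for illustration. A common base is overridden by more specific tiers, and the result is compared with what a node actually reports.

```python
from typing import Any, Dict, List, Tuple


def resolve_settings(layers: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge layers ordered from least to most specific; later layers win."""
    merged: Dict[str, Any] = {}
    for layer in layers:
        merged.update(layer)
    return merged


def find_drift(desired: Dict[str, Any],
               actual: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each drifted setting to a (desired, actual) pair."""
    return {key: (want, actual.get(key))
            for key, want in desired.items()
            if actual.get(key) != want}


# Hypothetical tiers: a lowest-common-denominator base, then role, then environment.
common = {"listen_port": 80, "keepalive": True, "max_workers": 150}
web_role = {"max_workers": 400}
production = {"listen_port": 443}

desired = resolve_settings([common, web_role, production])
```

Here desired keeps keepalive from the base, takes max_workers from the role, and takes listen_port from the environment. Passing a node's reported settings to find_drift then shows exactly which values were changed out of band.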
Visible Ops meets DevOps
The next few sub-sections also seem a bit dated, but can still be used as guidelines. First, Create An Acceptance Process Contract discusses the relationship between the Release Management Team and Production (Ops) Team, including giving Ops control of whether or not they accept an RMT build. Moving From Production Acceptance to Deployment focuses on Ops using the build process to provision infrastructure without the assistance of development or RMT. The wording seems more divisive than that of DevOps, which encourages collaboration between Development (and all Pre-Production teams) and Operations. A closer look at these sections shows that this is mostly a matter of phrasing and evolution of methodologies over the past decade.
Visible Ops does encourage RMT and Ops to work together to define these processes and determine where handoffs occur. It also requires that RMT have checked all components into the DSL and have tested them rigorously (with an optional QA team/environment if warranted). Operations does the actual build and releases it into production if everything is successful. This sounds very DevOps-ish to me. Let’s examine the largest apparent difference in these two methodologies.
An RMT is in pre-production and does not participate in firefighting with production assets. This allows the RMT to focus on longer release cycles for the builds, which is not possible if they are pulled away from their project work for unplanned work. This can be a huge positive. However, we also want senior staff on the RMT, which requires us to have a much larger staff or have an Ops team that consists of only junior staff. If we can get away with a dedicated RMT, we may see some benefits over the muddle-huddle that DevOps encourages. If we introduce a rotation that swaps an RMT and Ops team member every month or so, to give each team the insights of the other team, we can bring back some of the advantages of the muddle-huddle.
Another section further on recommends the separation of duties between developers, release management, and operations, using an example from the airline industry of “those who build the airplane are not allowed to fly it.” This is another conflict between Visible Ops and DevOps that has no correct answer. A common solution is to set up environments and enforce this segregation in different degrees in different environments. In Development, developers may be unrestricted and operations are read-only; in Production, developers are read-only and Operations are unrestricted; with a gradient of rights in the Testing, QA, and Pre-Prod environments. There are advantages and disadvantages to separation that will vary for each organization. I personally believe that separation should be preserved, especially in Production. Effective detective controls will minimize the risk of unauthorized change regardless of the policy chosen. Whatever your decision, ensure that the policy is well-described and enforced consistently.
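The per-environment policy can be written down as data so that it is both well-described and consistently enforced. This is a minimal sketch: the ACCESS table and can_change function are hypothetical names, and only the Development and Production rows come from the description above. The intermediate environments are deliberately left out because their gradient will differ for each organization, and anything not listed is denied by default.

```python
# Access levels per environment and team; only these two rows are taken
# from the text, everything else is an assumption of this sketch.
ACCESS = {
    "development": {"developers": "unrestricted", "operations": "read-only"},
    "production": {"developers": "read-only", "operations": "unrestricted"},
}


def can_change(team: str, environment: str) -> bool:
    """Return True only when the team is explicitly unrestricted there."""
    level = ACCESS.get(environment, {}).get(team, "read-only")
    return level == "unrestricted"
```

With this table, can_change("developers", "production") is False and can_change("operations", "production") is True. An unlisted environment such as "qa" returns False for everyone until a row is added for it, which keeps the default on the safe side.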
Which combination of methodologies will best benefit the organization is something you have to experiment with. I encourage everyone to keep an open mind about both Visible Ops and DevOps while experimenting. Visible Ops provides more prescriptive methods and DevOps can be a little more nebulous, but I am confident there are aspects of both that can be combined to fit your situation best.
Phase Three also discusses where patches belong in the software lifecycle: in the release management process. The ITPI observes that high-performing IT organizations patch less frequently than other IT organizations without jeopardizing their security posture, because they effectively implement compensating controls and reduce risk. Patching in production, therefore, remains risky business! It’s easy to introduce immediate and latent errors that accumulate over time, causing failures at a later date and possibly compromising security.
Patches can be tested in pre-production to detect these errors. Golden Builds without the errors can be added to the DSL and existing systems can be replaced like fuses with new builds that are properly patched. Of course, in the past two years we’ve seen some really high profile vulnerabilities, like Heartbleed and Shellshock, that may absolutely need to be applied immediately. Visible Ops provides a series of questions to evaluate this need accurately, such as whether the threat can be mitigated without applying the patch or update, and if it must be done now, can we get the stakeholders and management to sign off on the risk? If the answer to any question is in the negative, it may be judicious to hold off until that answer can be in the affirmative.
When Phase Three is completed, we have some sort of DSL containing Golden Builds from our Release Management Team. This has moved many of our defects earlier in the QA pipeline, where the cost is lower and the fix is easier, and ensured less variance exists in production. Software updates are part of release management and can be easily deployed to production with our new fusebox mentality that treats production systems as replaceable rather than repairable components.