I’ll be presenting at #PuppetConf 2015 in October and I need your help!

In my 2015 Goals, I originally had a goal of submitting a talk to VMworld. As the year went on, it became clear that a Puppet-oriented talk was more fitting and I submitted an abstract to the PuppetConf Call For Papers. I'm very proud to announce that my abstract was accepted and that I'll be presenting at PuppetConf 2015 this October in Portland, OR! My talk is tentatively titled Puppetizing your Organization: Taking Puppet from a Proof of Concept to the Configuration Management Tool of Choice (PyO:TPfaPoCttCMToC will probably be shortened!) and aims to help you move from buy-off on your proof of concept toward buy-in from your entire organization.

PuppetConf's call for papers works like many other conferences': you submit an abstract and, if accepted, you have a few months to flesh it out into a full presentation. One of the concepts in my abstract is to share with and learn from others, and this talk is no different. I need your help to make sure I include multiple perspectives and lessons, not just my own. Please take a look at the abstract and let me know both what you'd like to see me cover and any tips you have to share with others. You can leave your comments here or on Twitter. I'll make sure to acknowledge you during my presentation, unless you let me know otherwise.

I’m eagerly looking forward to meeting everyone in the Puppet community this fall! Make your reservation now using this 35% off link and be sure to attend my presentation!

Visible Ops Phase Four: Enable Continual Improvement

The final phase of Visible Ops is Enable Continual Improvement. To really succeed with our efforts, we need to make sure that the resources we have are allocated optimally toward our business goals. With most of our fires put out and significant efforts into avoiding future fires, we have the time available to do this right. To determine where our resources should be allocated, we need to look at metrics.

Continue reading

Visible Ops Phase Three: Create A Repeatable Build Library

Phase three of Visible Ops is Create a Repeatable Build Library. This phase's focus is to define build mechanisms, create system images, and establish documentation that together describe how to build our desired infrastructure from "bare metal" (I'll continue to use "bare metal" throughout for consistency, but "bare VM" may be more appropriate in today's virtualized IT). This allows us to treat our infrastructure like fuses. When a fuse pops, it is discarded instead of repaired and a new fuse is inserted in its place; likewise, when a system fails, it is removed from service and a new system is provisioned in its place. All high-performing IT organizations, not just the unicorns of IT, use this technique. This chapter focuses on how to achieve that goal.

Continue reading

Visible Ops Phase Two: Catch And Release and Find Fragile Artifacts

In the second phase of Visible Ops implementation, our goal is to Catch & Release and Find Fragile Artifacts. This phase focuses on creating and maintaining an accurate inventory of assets and highlighting those that generate the most unplanned work. The various configurations in use are also collected in order to start reducing the unique configuration counts, the focus of Phase Three. This is the shortest chapter in the book at 6 pages, though it may take significant time to complete the work efforts.

The Issues and Indicators section lays out the issues being tackled, including moving from "individual knowledge" to "tribal knowledge" (a shared knowledgebase the entire organization can access, rather than knowledge locked in people's heads) and preventing the "special snowflake" syndrome, where every node in the network, even those in clusters or farms, is similar but still unique.

Continue reading

Visible Ops Phase One: Stabilize The Patient and Modify First Response

The first phase of implementing Visible Ops is Stabilize The Patient and Modify First Response. This first order of business intends to reduce our amount of unplanned work significantly, to 25% or less, to allow our organization to focus on more than just firefighting. When the environment is stabilized, efforts can instead be spent on proactive work that stops fires before they start.

The chapter, like all chapters in Visible Ops, begins with an Issues and Indicators section, describing the issues with both a formal statement and a narrative example. The issues are familiar to anyone in IT, from how self-inflicted problems create unplanned work to the inconvenient, but predictable, timing of system failures during times of urgency. The narrative example provided helps to relate the formal statement to the experiences we all share.

Continue reading

Book Review: The Visible Ops Handbook

A few months ago, I purchased The Visible Ops Handbook: Implementing ITIL in 4 Practical and Auditable Steps, by Kevin Behr, Gene Kim, and George Spafford and published by the IT Process Institute (ITPI). Those names may sound familiar, as these are the same authors of The Phoenix Project. Where The Phoenix Project is a novel that teaches us general concepts through storytelling, The Visible Ops Handbook is a set of instructions to help us implement the changes that are now so vital to the DevOps movement – it’s all about executing! The Visible Ops methodology it teaches is one of many precursors to DevOps, and it’s a wonderful foundation that I feel should be explored by all.

Visible Ops has a nice structure. There are five sections describing the four steps of Visible Ops (70 pages) followed by a few appendices (30 pages). The content is a result of studies done by the ITPI (contemporary equivalents of today's State of DevOps reports) and the implementation of Visible Ops, before it went to print, by IP Services. In those 100 pages, the lessons learned from the studies are codified using ITIL terminology. The writing is very accessible, but also very dense, and is worth referencing repeatedly as you work through the included steps. The target audience is decidedly technical, but everyone in an IT organization can benefit from the material.

Continue reading

Hiera-fy your Hiera setup

In 2014, we set up our puppet environment and we've spent the first half of 2015 improving the configuration. In that time, we installed hiera, were introduced to it through the role/profile pattern, focused on separating the data from the code and moving it into hiera, and most recently worked on an improved controlrepo that modified the hiera layout. We have been using hiera the whole time, and there's still a lot we can do to improve how we use it.

Manage Hiera with Puppet

Our initial hiera.yaml was simple and static. With our improved controlrepo layout, the new hiera.yaml file is more dynamic. A problem still remains: we are configuring hiera manually! You may have a hiera.yaml in your controlrepo or even a bootstrap.pp file for your initial puppet master. We have also been managing the hiera package manually in profile::hiera. This addresses the problem in the short term but adds to our administrative overhead – anytime we update the hiera config, we need to do so in these files as well as on the master itself.

Continue reading

Opinionated Test Driven Development

There are many technical reasons to choose (or not choose!) to use Test Driven Development. In my opinion, one of the most important benefits isn't technical at all – it's that it forces you to have an opinion! When you decide what you are going to test, you are forming an opinion on what you want your code to do. It might be processing an input, transforming an input, or providing output based on current state. You have an opinion on the end state but leave the implementation to be determined.
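As a minimal sketch of this idea (the slugify function and its expected behavior are invented here purely for illustration), the test comes first and states the opinion; the implementation follows and merely has to satisfy it:

```python
# The test states the opinion: slugify should lowercase, trim, and
# join words with hyphens. How it does so is left to the implementation.
def test_slugify():
    assert slugify("  Hello World ") == "hello-world"
    assert slugify("Already-Slugged") == "already-slugged"

# One implementation that satisfies the opinion expressed above.
def slugify(title):
    return "-".join(title.lower().split())

test_slugify()
```

The test is the outline; the implementation is the draft that has to match it.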

I liken this to having an outline for writing. If you start writing without an outline, the end result never seems to match what you thought you'd end up with. If you follow an outline, however sparse, the result matches what you thought you'd end up with – if not, you didn't follow the outline. Sure, it requires some revisions here and there to make it all tidy, but at least your first draft is legible.

When considering whether you should employ Test Driven Development, keep this in mind as a non-technical benefit!

Distributed systems and high availability. Also, 787s.

We often talk about distributed systems at scale. A system with thousands of nodes in dozens of cities on three or more continents is clearly distributed. Sometimes we even discuss some of the difficult problems, like is there such a thing as ‘now’? What often gets lost is how smaller scale systems are distributed as well, especially systems that we describe as Highly Available. Even though these systems are often very close in proximity – sometimes as two line cards in the same chassis – we still must apply distributed systems principles to them.

This week was a good reminder of that. Perhaps you’ve heard of an issue with Boeing’s 787 and all four generator control units (GCUs) encountering an integer overflow after 248 days of continual power. All four. At once. Well, maybe… Will all four GCUs really shut off at the same time? They will, on one condition: all four GCUs were powered on at the same time. It is clearly stated as a condition in the FAA directive:

If the four main GCUs (associated with the engine mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase.

There’s a very simple solution here: don’t power all four GCUs up at the same time. Let’s pretend that GCUs take 10 minutes to go from off to available on average. The GCU power up sequence should introduce a delay of 10 minutes * X, where X is some factor representing the desired safety margin for the longest possible time to availability, say 1.5. We’d have a timeline like this:

  • 0:00 – GCU 1 powered on
  • 0:15 – GCU 2 powered on
  • 0:30 – GCU 3 powered on
  • 0:45 – GCU 4 powered on
  • 1:00 – All GCUs available

By following this schedule, if the 787 is not power cycled before 248d 14h rolls around, the plane won't drop out of the sky: each GCU reaches its 248-day mark at a different time, so at most one enters failsafe at once and three GCUs remain available at all times.
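The schedule above can be sketched in a few lines; the 10-minute startup time and 1.5 safety factor are the illustrative numbers from this post, not actual Boeing figures:

```python
STARTUP_MINUTES = 10  # assumed average time for a GCU to become available
SAFETY_FACTOR = 1.5   # margin over the longest expected startup time
NUM_GCUS = 4

def power_on_schedule(n=NUM_GCUS, startup=STARTUP_MINUTES, factor=SAFETY_FACTOR):
    """Return (minute, event) pairs, staggering each power-on by startup * factor."""
    delay = startup * factor  # 15 minutes between power-ons
    events = [(i * delay, f"GCU {i + 1} powered on") for i in range(n)]
    events.append((n * delay, "All GCUs available"))
    return events

for minute, event in power_on_schedule():
    print(f"{int(minute) // 60}:{int(minute) % 60:02d} - {event}")
```

Running it prints the same timeline as the list above, ending with all GCUs available at 1:00.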

I’m not going to say that these kinds of bugs are excusable. They simply aren’t – we’ve known 32-bit counters roll over relatively quickly for decades and have had the software and hardware capable of handling 64-bit or larger counters for quite some time. However, we have dealt with distributed systems that run in high availability modes for decades. Whatever your highly available, distributed systems are – servers, firewalls, routers, databases, etc. – you should use delayed startup times to avoid this well known set of problems. This leaves you more time to focus on the lesser known problems.
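For reference, the widely reported explanation of the 248-day figure – an assumption on my part, not something stated in the FAA directive – is a signed 32-bit counter ticking in hundredths of a second:

```python
# Assumed mechanism: a signed 32-bit counter incrementing 100 times per
# second overflows after 2**31 ticks; converting ticks to days gives the
# familiar figure of roughly 248 and a half days.
ticks = 2**31           # maximum positive value of a signed 32-bit integer
seconds = ticks / 100   # counter assumed to tick at 100 Hz
days = seconds / 86400  # 86,400 seconds per day
print(f"{days:.2f} days")  # -> 248.55 days
```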

Configuring an R10k webhook on your Puppet Master

Now that we have a unified controlrepo, we need to set up an r10k webhook. I have chosen to implement the webhook from zack/r10k. There are other webhooks out there – I’m a huge fan of Reaktor – but I chose this because I’m already using this module and because it is recommended by Puppet Labs. It’s an approved module, to boot!

Update: The zack/r10k module has migrated to puppet/r10k, which should be used instead. I’ve commented out sections that are incompatible with the most recent versions of the module, but as this article is now 2 years old, there may be other changes in surrounding modules you will become aware of, too.

Module Setup

The first step is to make sure the module is installed along with its dependencies. There are no conditional dependencies in a Puppet module's metadata.json, so you can skip puppetlabs/pe_gem and gentoo/portage if you'd like. On the other hand, there are no ill side effects from having the modules present unless you were to use them for some reason. This is an opportunity to bump the version on some pinned modules as well, such as stdlib, as long as you do not increment the major version. If the major version increases, there's a significant chance your code will have some breakage; it's best to do that in a separate branch.

I encountered a bug with zack/r10k v2.7.3 (#162). This bug is fixed in v2.7.4. Be sure to upgrade!

Continue reading