In 2014, we set up our Puppet environment, and we’ve spent the first half of 2015 improving the configuration. In that time, we installed hiera and were introduced to it through the role/profile pattern, focused on separating the data from the code and moving it into hiera, and most recently adopted an improved controlrepo that modified the hiera layout. We have been using hiera the whole time, and there’s still a lot we can do to improve how we use it.
Manage Hiera with Puppet
Our initial hiera.yaml was simple and static. With our improved controlrepo layout, the new hiera.yaml file is more dynamic. A problem still remains: we are configuring hiera manually! You may have a hiera.yaml in your controlrepo or even a bootstrap.pp file for your initial puppet master. We have also been managing the hiera package manually in profile::hiera. This addresses the problem in the short term but adds to our administrative overhead – anytime we update the hiera config, we need to do so in these files as well as on the master itself.
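A sketch of where this is headed: let Puppet manage both the hiera package and the config file, so updating the controlrepo updates the master too. The paths and file source here are assumptions for illustration; adjust them to your master’s layout.

```puppet
# Hypothetical profile: manage hiera itself with Puppet instead of by hand.
# The file path and source location are assumptions; adjust to your layout.
class profile::hiera {
  package { 'hiera':
    ensure => installed,
  }

  file { '/etc/puppetlabs/puppet/hiera.yaml':
    ensure => file,
    source => 'puppet:///modules/profile/hiera.yaml',
    owner  => 'root',
    group  => 'root',
    mode   => '0644',
  }
}
```

With this in place, a change to hiera.yaml is just another commit to the controlrepo rather than a manual edit on the master.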
There are many technical reasons to choose (or not choose!) to use Test Driven Development. In my opinion, one of the most important benefits isn’t technical at all – it’s that it forces you to have an opinion! When you decide what you are going to test, you are forming an opinion on what you want your code to do. It might be processing an input, transforming an input, or providing output based on current state. You have an opinion on the end state but leave the implementation to be determined.
I liken this to having an outline for writing. If you start writing without an outline, the end result never seems to match what you thought you’d end up with. If you follow an outline, however sparse, the result matches what you thought you’d end up with – if not, you didn’t follow the outline. Sure, it requires some revisions here and there to make it all tidy, but at least your first draft is legible.
When considering whether you should employ Test Driven Development, keep this in mind as a non-technical benefit!
We often talk about distributed systems at scale. A system with thousands of nodes in dozens of cities on three or more continents is clearly distributed. Sometimes we even discuss some of the difficult problems, like is there such a thing as ‘now’? What often gets lost is how smaller scale systems are distributed as well, especially systems that we describe as Highly Available. Even though these systems are often very close in proximity – sometimes as two line cards in the same chassis – we still must apply distributed systems principles to them.
This week was a good reminder of that. Perhaps you’ve heard of an issue with Boeing’s 787 and all four generator control units (GCUs) encountering an integer overflow after 248 days of continual power. All four. At once. Well, maybe… Will all four GCUs really shut off at the same time? They will, on one condition: all four GCUs were powered on at the same time. It is clearly stated as a condition in the FAA directive:
If the four main GCUs (associated with the engine mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase.
There’s a very simple solution here: don’t power all four GCUs up at the same time. Let’s pretend that GCUs take 10 minutes to go from off to available on average. The GCU power up sequence should introduce a delay of 10 minutes * X, where X is some factor representing the desired safety margin for the longest possible time to availability, say 1.5. We’d have a timeline like this:
- 0:00 – GCU 1 powered on
- 0:15 – GCU 2 powered on
- 0:30 – GCU 3 powered on
- 0:45 – GCU 4 powered on
- 1:00 – All GCUs available
By following this schedule, if the 787 is not power cycled before 248d 14h rolls around, the plane won’t drop out of the sky: each GCU reaches its 248-day limit at a different time, so three GCUs remain available while the fourth fails and restarts.
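The schedule above is a trivial calculation. A quick sketch, using the illustrative numbers from the text (10-minute average startup, 1.5x safety factor):

```shell
# Sketch of the staggered power-on schedule. The 10-minute startup time and
# 1.5x safety factor are the illustrative numbers from the text, not real specs.
startup_min=10
factor_pct=150                                 # 1.5 as a percentage, to stay in integer math
spacing=$(( startup_min * factor_pct / 100 ))  # 15 minutes between power-ons
for unit in 1 2 3 4; do
  offset=$(( (unit - 1) * spacing ))
  printf 'GCU %d powered on at t+%d min\n' "$unit" "$offset"
done
```

The same arithmetic generalizes to any highly available pair or cluster: pick a spacing larger than the worst-case startup time and stagger the members.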
I’m not going to say that these kinds of bugs are excusable. They simply aren’t – we’ve known for decades that 32-bit counters roll over relatively quickly, and we have had software and hardware capable of handling 64-bit or larger counters for quite some time. However, we have also dealt with distributed systems that run in high availability modes for decades. Whatever your highly available, distributed systems are – servers, firewalls, routers, databases, etc. – you should use delayed startup times to avoid this well-known set of problems. This leaves you more time to focus on the lesser-known problems.
Now that we have a unified controlrepo, we need to set up an r10k webhook. I have chosen to implement the webhook from zack/r10k. There are other webhooks out there – I’m a huge fan of Reaktor – but I chose this because I’m already using this module and because it is recommended by Puppet Labs. It’s an approved module, to boot!
The first step is to make sure the module is installed along with its dependencies. There are no conditional dependencies in a Puppet module’s metadata.json, so you can skip puppetlabs/pe_gem and gentoo/portage if you’d like. On the other hand, there are no ill side effects from having the modules present unless you were to use them for some reason. This is an opportunity to up the version on some pinned modules as well, such as stdlib, as long as you do not increment the major version. If the major version increases, there’s a significant chance your code will have some breakage; it’s best to do that in a separate branch.
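In the Puppetfile, that pinning might look like this (the versions shown are illustrative; pin to what you have actually tested):

```ruby
# Puppetfile excerpt -- versions are illustrative examples
mod 'zack/r10k', '2.7.4'           # the module containing the webhook
mod 'puppetlabs/stdlib', '4.6.0'   # safe to bump within the same major version
```

Bumping stdlib within the 4.x series is low risk; a jump to a new major version deserves its own branch and its own round of testing.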
I encountered a bug with zack/r10k v2.7.3 (#162). This bug is fixed in v2.7.4. Be sure to upgrade!
Quick note: I am deprecating my individual repos – role, profile, hiera, etc. – that I have used throughout the Puppet series. I will be doing representative work within the Puppetinabox repositories, mostly the controlrepo. I’m not sure when I’ll shut down the repos entirely, not until after I update old links, of course. Some of the older history will eventually be lost, but it’s mostly primitive versions of the code you shouldn’t want to copy. If you actually want the code, check out the repos now, while you still can:
I’d like to tell a tale of a git-astrophe that I caused in the hope that others can learn from my mistakes. Git is awesome but also very feature-ful, which can lead to learning about some of those features at the worst times. In this episode, I abused my knowledge of git rebase, learned how the -f flag to git push works, and narrowly avoided learning about git reflog/fsck in any great detail.
Often times, you will need to rebase your feature branch against master (or production, in this case, it was a puppet controlrepo) before submitting a pull request for someone else to review. This isn’t just a chance to rewrite your commit history to be tidy, but to re-apply the changes in your branch against an updated main branch.
For instance, you created branch A from production on Monday morning, at the same time as your coworker created a branch B. Your coworker finished up her work on the branch quickly and submitted a PR that was merged on Monday afternoon. It took you until Tuesday morning to have your PR ready. At this time, it is generally advisable to rebase against the updated production to ensure your branch behaves as desired after applying B’s changes. Atlassian has a great tutorial on rebasing, if you are not familiar with the concept.
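The A/B scenario is easy to replay in a throwaway repo. This sketch uses invented file names and commit messages; the branch names match the story above:

```shell
# Hypothetical replay of the A/B scenario in a throwaway repo.
# File names and commit messages are invented for the demo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b production                # start on a 'production' branch
git config user.email demo@example.com
git config user.name demo
echo base > site.pp && git add . && git commit -qm 'initial commit'
git branch A                                 # Monday morning: A branches off production
git checkout -q -b B
echo b > b.txt && git add . && git commit -qm "B's change"
git checkout -q production
git merge -q B                               # Monday afternoon: B is merged
git checkout -q A
echo a > a.txt && git add . && git commit -qm "A's change"
git rebase -q production                     # Tuesday morning: replay A on the updated branch
git log --oneline                            # A's commit now sits on top of B's
```

After the rebase, A contains both changes and its commit is rewritten on top of production’s new tip – which is exactly why pushing a rebased branch to a shared remote requires force and care.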
When watching others troubleshoot, I have noticed one very important step that is frequently overlooked: reproduction of the problem and validation of the solution.
Once you believe you have remediated an issue, you should attempt to immediately recreate the problem (use your common sense – if the issue affects online sales on Black Friday, it’s probably best to make a note and schedule the testing for later!). This is often as simple as undoing the fix or re-implementing the broken config. If the problem does not return, you didn’t actually fix the issue! Something else must have resolved it in the meantime.
You may be asking yourself, “If the problem is fixed, why do I care if it was my efforts that fixed it or not?” There are three main reasons why you should care:
- Ensure the problem does not reoccur without warning. If your fix isn’t a fix and you cannot induce the problem to occur immediately, you can at least document what steps were taken and that they did not resolve the issue. When it does occur again, no one will be surprised.
- Your “fix” may have side effects. Revert the configuration change along with any compensating controls put in place, such as a set of permit rules above a deny rule that didn’t exist in the firewall before.
- You may start a cargo cult! This is very likely if the fix isn’t a setting but an action – clearing cache, restarting a process, or even rebooting. These hoops and the need to jump through them may become part of the diagnosis and remediation process. Had the “fix” been invalidated when it was first applied, everyone would realize that these efforts only waste time and have no benefit.
Customer satisfaction will increase when customers see with certainty that a fix works and that the problem won’t spontaneously reoccur in the future. Explain that you want to take some time now to recreate the issue and validate the solution, and almost all customers will be understanding and appreciate the effort.