In 2014, we set up our puppet environment and we’ve spent the first half of 2015 improving the configuration. In that time, we installed hiera, were introduced to it through the role/profile pattern, focused on separating the data from the code and moving it into hiera, and most recently worked on an improved controlrepo that modified the hiera layout. We have been using hiera the whole time, and there is still a lot we can do to improve how we use it.
Manage Hiera with Puppet
Our initial hiera.yaml was simple and static. With our improved controlrepo layout, the new hiera.yaml file is more dynamic. A problem still remains: we are configuring hiera manually! You may have a hiera.yaml in your controlrepo or even a bootstrap.pp file for your initial puppet master. We have also been managing the hiera package manually in profile::hiera. This addresses the problem in the short term but adds to our administrative overhead – anytime we update the hiera config, we need to do so in these files as well as on the master itself.
There are many technical reasons to choose (or not choose!) to use Test Driven Development. In my opinion, one of the most important benefits isn’t technical at all – it’s that it forces you to have an opinion! When you decide what you are going to test, you are forming an opinion on what you want your code to do. It might be processing an input, transforming an input, or providing output based on current state. You have an opinion on the end state but leave the implementation to be determined.
I liken this to having an outline for writing. If you start writing without an outline, the end result never seems to match what you thought you’d end up with. If you follow an outline, however sparse, the result matches what you thought you’d end up with – if not, you didn’t follow the outline. Sure, it requires some revisions here and there to make it all tidy, but at least your first draft is legible.
When considering whether you should employ Test Driven Development, keep this in mind as a non-technical benefit!
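To make the "opinion first" idea concrete, here is a minimal sketch in Python. The function `normalize_hostname` and the behaviour it is tested for are hypothetical, invented purely for illustration. The two tests are the opinion: they say what the end state should be. The function body is just one implementation that happens to satisfy them and could be replaced without touching the tests.

```python
import unittest


def normalize_hostname(raw):
    """One possible implementation (hypothetical); the tests came first."""
    # Trim whitespace, drop a trailing dot, and lowercase the result.
    return raw.strip().rstrip(".").lower()


class NormalizeHostnameTest(unittest.TestCase):
    """The opinion: what we want the code to do, not how it does it."""

    def test_lowercases_and_trims_whitespace(self):
        self.assertEqual(
            normalize_hostname("  Puppet.Example.COM "), "puppet.example.com"
        )

    def test_drops_trailing_dot(self):
        self.assertEqual(
            normalize_hostname("puppet.example.com."), "puppet.example.com"
        )
```

Saved to a file, this can be run with `python -m unittest <file>`. Writing the test class before the function body is the outline-before-draft step described above.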
We often talk about distributed systems at scale. A system with thousands of nodes in dozens of cities on three or more continents is clearly distributed. Sometimes we even discuss some of the difficult problems, like is there such a thing as ‘now’? What often gets lost is how smaller scale systems are distributed as well, especially systems that we describe as Highly Available. Even though these systems are often very close in proximity – sometimes as two line cards in the same chassis – we still must apply distributed systems principles to them.
This week was a good reminder of that. Perhaps you’ve heard of an issue with Boeing’s 787 and all four generator control units (GCUs) encountering an integer overflow after 248 days of continual power. All four. At once. Well, maybe… Will all four GCUs really shut off at the same time? They will, on one condition: all four GCUs were powered on at the same time. It is clearly stated as a condition in the FAA directive:
If the four main GCUs (associated with the engine mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase.
There’s a very simple solution here: don’t power all four GCUs up at the same time. Let’s pretend that GCUs take 10 minutes to go from off to available on average. The GCU power up sequence should introduce a delay of 10 minutes * X, where X is some factor representing the desired safety margin for the longest possible time to availability, say 1.5. We’d have a timeline like this:
- 0:00 – GCU 1 powered on
- 0:15 – GCU 2 powered on
- 0:30 – GCU 3 powered on
- 0:45 – GCU 4 powered on
- 1:00 – All GCUs available
By following this schedule, if the 787 is not power cycled before 248d 14h rolls around, the plane won’t drop out of the sky. The GCUs reach the overflow one at a time, 15 minutes apart, so at least three GCUs will be available at all times.
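The schedule above can be sketched in a few lines of Python. This is only an illustration under the same pretend numbers (10 minute average startup, safety factor of 1.5). The function names and the `RECOVERY` value are my own assumptions: in particular, I assume a unit that hits the uptime limit is back within one stagger interval, which nothing in the FAA directive promises.

```python
from datetime import timedelta


def staggered_schedule(units, avg_startup, safety_factor):
    """Return (power-on offsets, offset at which every unit is available)."""
    delay = avg_startup * safety_factor           # gap between power-ons
    power_on = [delay * i for i in range(units)]  # 0, 1x, 2x, ... the delay
    all_available = power_on[-1] + delay          # last unit finishes starting
    return power_on, all_available


def max_units_down(power_on, uptime_limit, recovery):
    """Largest number of units that are in failsafe at the same moment."""
    # Each unit is assumed down from (power-on + limit) for `recovery`.
    outages = [(t + uptime_limit, t + uptime_limit + recovery) for t in power_on]
    # The maximum overlap of intervals always occurs at some interval start.
    return max(sum(1 for s, e in outages if s <= start < e) for start, _ in outages)


UPTIME_LIMIT = timedelta(days=248, hours=14)  # figure used in the text
RECOVERY = timedelta(minutes=15)              # assumption, not from the directive

power_on, all_available = staggered_schedule(4, timedelta(minutes=10), 1.5)
for number, offset in enumerate(power_on, start=1):
    print(f"{offset} - GCU {number} powered on")
print(f"{all_available} - All GCUs available")

print("staggered:", max_units_down(power_on, UPTIME_LIMIT, RECOVERY))
print("simultaneous:", max_units_down([timedelta(0)] * 4, UPTIME_LIMIT, RECOVERY))
```

Under these assumptions the staggered schedule never has more than one unit down at once, while powering all four on together takes all four down together.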
I’m not going to say that these kinds of bugs are excusable. They simply aren’t – we’ve known 32-bit counters roll over relatively quickly for decades and have had the software and hardware capable of handling 64-bit or larger counters for quite some time. However, we have dealt with distributed systems that run in high availability modes for decades. Whatever your highly available, distributed systems are – servers, firewalls, routers, databases, etc. – you should use delayed startup times to avoid this well known set of problems. This leaves you more time to focus on the lesser known problems.
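For a sense of scale, here is a rough back-of-the-envelope calculation in Python. It assumes a signed 32-bit counter that ticks 100 times per second. That is a common explanation for the 248 day figure, but it is my assumption and not something the directive states. The 64-bit line shows why a wider counter makes the problem go away in practice.

```python
SECONDS_PER_DAY = 86_400
TICKS_PER_SECOND = 100  # assumption: one tick every 10 ms


def seconds_until_rollover(bits, ticks_per_second, signed=True):
    """Seconds before a counter of the given width can no longer increment."""
    usable_bits = bits - 1 if signed else bits  # a signed counter loses one bit
    return 2 ** usable_bits / ticks_per_second


days_32 = seconds_until_rollover(32, TICKS_PER_SECOND) / SECONDS_PER_DAY
years_64 = seconds_until_rollover(64, TICKS_PER_SECOND) / SECONDS_PER_DAY / 365.25

print(f"32-bit signed counter: {days_32:.2f} days")   # about 248.55 days
print(f"64-bit signed counter: {years_64:.2e} years")
```

The 32-bit result lands just short of 248 days 14 hours, which matches the figure above. The 64-bit counter would take billions of years to roll over at the same tick rate.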
Now that we have a unified controlrepo, we need to set up an r10k webhook. I have chosen to implement the webhook from zack/r10k. There are other webhooks out there – I’m a huge fan of Reaktor – but I chose this because I’m already using this module and because it is recommended by Puppet Labs. It’s an approved module, to boot!
Update: The zack/r10k module has migrated to puppet/r10k, which should be used instead. I’ve commented out sections that are incompatible with the most recent versions of the module, but as this article is now 2 years old, there may be other changes in surrounding modules you will become aware of, too.
The first step is to make sure the module is installed along with its dependencies. There are no conditional dependencies in a Puppet module’s metadata.json, so you can skip puppetlabs/pe_gem and gentoo/portage if you’d like. On the other hand, there are no ill side effects from having the modules present unless you were to use them for some reason. This is an opportunity to bump the version on some pinned modules as well, such as stdlib, as long as you do not increment the major version. If the major version increases, there’s a significant chance your code will break, so it’s best to do that in a separate branch.
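The "same major version only" rule is easy to check mechanically. Here is a minimal Python sketch, assuming plain `MAJOR.MINOR.PATCH` version strings. The helper names and the version numbers in the usage lines are illustrative only, not taken from any real Puppetfile.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' (optionally prefixed with 'v') into ints."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def is_same_major_upgrade(current, candidate):
    """True if candidate is newer than current without a major version bump."""
    cur, cand = parse_version(current), parse_version(candidate)
    return cand[0] == cur[0] and cand > cur


# Illustrative version numbers only.
print(is_same_major_upgrade("4.5.1", "4.6.0"))  # minor bump: fine on this branch
print(is_same_major_upgrade("4.6.0", "5.0.0"))  # major bump: use a separate branch
```

The first call returns `True` and the second `False`. This sketch does not handle pre-release tags or version ranges, so treat it as a quick sanity check rather than a full semantic version parser.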
I encountered a bug with zack/r10k v2.7.3 (#162). This bug is fixed in v2.7.4. Be sure to upgrade!
Quick note: I am deprecating my individual repos – role, profile, hiera etc – that I have used throughout the Puppet series. I will be doing representative work within the Puppetinabox repositories, mostly the controlrepo. I’m not sure when I’ll shut down the repos entirely, but it won’t be until after I update old links, of course. Some of the older history will eventually be lost, but it’s mostly primitive versions of the code you shouldn’t want to copy. If you actually want the code, check out the repos now, while you still can:
I’d like to tell a tale of a git-astrophe that I caused in the hope that others can learn from my mistakes. Git is awesome but also very feature-ful, which can lead to learning about some of those features at the worst times. In this episode, I abused my knowledge of git rebase, learned how the -f flag to git push works, and narrowly avoided learning about git reflog/fsck in any great detail.
Oftentimes, you will need to rebase your feature branch against master (or production, since in this case it was a puppet controlrepo) before submitting a pull request for someone else to review. This isn’t just a chance to rewrite your commit history to be tidy, but also a chance to re-apply the changes in your branch against an updated main branch.
For instance, you created branch A from production on Monday morning, at the same time as your coworker created branch B. Your coworker finished up her work on the branch quickly and submitted a PR that was merged on Monday afternoon. It took you until Tuesday morning to have your PR ready. At this time, it is generally advisable to rebase against the updated production to ensure your branch behaves as desired after applying B’s changes. Atlassian has a great tutorial on rebasing, if you are not familiar with the concept.