Distributed systems and high availability. Also, 787s.

We often talk about distributed systems at scale. A system with thousands of nodes in dozens of cities on three or more continents is clearly distributed. Sometimes we even discuss some of the difficult problems, like is there such a thing as ‘now’? What often gets lost is how smaller scale systems are distributed as well, especially systems that we describe as Highly Available. Even though these systems are often very close in proximity – sometimes as two line cards in the same chassis – we still must apply distributed systems principles to them.

This week was a good reminder of that. Perhaps you’ve heard of an issue with Boeing’s 787 and all four generator control units (GCUs) encountering an integer overflow after 248 days of continual power. All four. At once. Well, maybe… Will all four GCUs really shut off at the same time? They will, on one condition: all four GCUs were powered on at the same time. It is clearly stated as a condition in the FAA directive:

If the four main GCUs (associated with the engine mounted generators) were powered up at the same time, after 248 days of continuous power, all four GCUs will go into failsafe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase.
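The 248-day figure is consistent with a signed 32-bit counter that ticks 100 times per second. That tick rate is my assumption; the directive does not state it. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def time_to_overflow(bits=32, ticks_per_second=100):
    """Return (days, hours, minutes) until a signed counter overflows.

    Assumption (not from the FAA directive): the counter is a signed
    integer of `bits` bits, incremented `ticks_per_second` times a second.
    """
    ticks = 2 ** (bits - 1)             # first value that no longer fits
    seconds = ticks // ticks_per_second
    days, remainder = divmod(seconds, 86_400)
    hours, remainder = divmod(remainder, 3_600)
    minutes = remainder // 60
    return days, hours, minutes


days, hours, minutes = time_to_overflow()
print(f"Overflow after {days}d {hours}h {minutes}m of continuous power")
```

Under those assumptions the counter overflows after roughly 248.5 days, which lines up with the number in the directive.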

There’s a very simple solution here: don’t power all four GCUs up at the same time. Let’s pretend that GCUs take 10 minutes to go from off to available on average. The GCU power up sequence should introduce a delay of 10 minutes * X between units, where X is some factor representing the desired safety margin for the longest possible time to availability, say 1.5. That works out to 15 minutes between power-ons, and we’d have a timeline like this:

  • 0:00 – GCU 1 powered on
  • 0:15 – GCU 2 powered on
  • 0:30 – GCU 3 powered on
  • 0:45 – GCU 4 powered on
  • 1:00 – All GCUs available

By following this schedule, if the 787 is not power cycled before 248d 14h rolls around, the plane won’t drop out of the sky. Each GCU hits its overflow 15 minutes after the previous one, so at least three GCUs will be available at all times.
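The same schedule can be generated for any number of units. This is only a sketch: the function name is made up, and the 10-minute startup time and 1.5 safety factor are the pretend values used above, not real GCU figures.

```python
from datetime import timedelta


def staggered_start_offsets(unit_count, time_to_available, safety_factor=1.5):
    """Return the power-on offset for each unit, spaced by a padded delay."""
    delay = time_to_available * safety_factor  # 10 min * 1.5 = 15 min
    return [delay * index for index in range(unit_count)]


# Pretend values from the example above, not real GCU timings.
offsets = staggered_start_offsets(4, timedelta(minutes=10))
for number, offset in enumerate(offsets, start=1):
    print(f"{offset} - GCU {number} powered on")
```

The same spacing applies to anything that shares an uptime-dependent failure mode: a pair of firewalls, a database cluster, or line cards in one chassis.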

I’m not going to say that these kinds of bugs are excusable. They simply aren’t – we’ve known 32-bit counters roll over relatively quickly for decades and have had the software and hardware capable of handling 64-bit or larger counters for quite some time. However, we have dealt with distributed systems that run in high availability modes for decades. Whatever your highly available, distributed systems are – servers, firewalls, routers, databases, etc. – you should use delayed startup times to avoid this well known set of problems. This leaves you more time to focus on the lesser known problems.


Configuring an R10k webhook on your Puppet Master

Now that we have a unified controlrepo, we need to set up an r10k webhook. I have chosen to implement the webhook from zack/r10k. There are other webhooks out there – I’m a huge fan of Reaktor – but I chose this because I’m already using this module and because it is recommended by Puppet Labs. It’s an approved module, to boot!

Update: The zack/r10k module has migrated to puppet/r10k, which should be used instead. I’ve commented out sections that are incompatible with the most recent versions of the module, but as this article is now 2 years old, there may be other changes in surrounding modules you will become aware of, too.

Module Setup

The first step is to make sure the module is installed along with its dependencies. There are no conditional dependencies in a Puppet module’s metadata.json, so you can skip puppetlabs/pe_gem and gentoo/portage if you’d like. On the other hand, there are no ill side effects from having the modules present unless you were to use them for some reason. This is an opportunity to up the version on some pinned modules as well, such as stdlib, as long as you do not increment the major version. If the major version increases, there’s a significant chance your code will have some breakage, so it’s best to do that in a separate branch.
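If you want a quick sanity check before bumping a pin, the "same major version" rule can be sketched like this. The helper name and the version numbers are purely illustrative, not recommended pins, and the check assumes plain MAJOR.MINOR.PATCH version strings.

```python
def same_major_version(current, candidate):
    """Return True when both version strings share a major number.

    Assumes MAJOR.MINOR.PATCH strings, optionally prefixed with "v".
    """
    current_major = int(current.lstrip("v").split(".")[0])
    candidate_major = int(candidate.lstrip("v").split(".")[0])
    return current_major == candidate_major


# Illustrative versions only.
print(same_major_version("4.5.1", "4.6.0"))  # minor bump: safe to do here
print(same_major_version("4.6.0", "5.0.0"))  # major bump: use a separate branch
```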

I encountered a bug with zack/r10k v2.7.3 (#162). This bug is fixed in v2.7.4. Be sure to upgrade!
