Using Puppet Enterprise 2018’s new backup/restore features

I was pretty excited when I read the new features in Puppet Enterprise 2018.1. There are a lot of cool new features and fixes, but the backup/restore feature stood out for me. Even with just 5 VMs at home, I don’t want to rock the boat when rebuilding my master by losing my CA or agent certs, much less with a lot more managed nodes at work, and all the little bootstrap requirements have changed since I started using PE in 2014. Figuring out how to get everything running myself would be possible, but it would take a while and be out of date in a few months anyway. Then there is everything in PuppetDB that I do not want to lose, like collected facts/resources and run reports.

Not coincidentally, I still had a single CentOS 6 VM around because it was my all-in-one puppet master, and migrating to CentOS 7 was not something I looked forward to due to the anticipated work it would require. With the release of this feature, I decided to get off my butt and do the migration. It still took over a month to make it happen, between other work, and I want to share my experience in the hope it saves someone else a bit of pain.

Create your upgrade outline

I want to summarize the plan at a really high level, then dive in a bit deeper. Keep in mind that I have a single all-in-one master using r10k and my plan does not address multi-master or split deployments. Both of those deployment models have significantly different upgrade paths, please be careful if you try and map this outline onto those models without adjusting. For the all-in-one master, it’s pretty simple:

  • Backup old master
  • Deploy a new master VM running EL7
  • Complete any bootstrapping that isn’t part of the backup
  • Install the same version of PE
  • Restore the old master’s backup onto the new master
  • Run puppet
  • Point agents at the new master

I will cover the backup/restore steps at the end, so the first step to cover is deploying a new master. This part sounds simple, but if Puppet is currently part of your provisioning process and you only have one master, you’ve got a catch 22 situation – new deployments must talk to puppet to complete without errors, and if you deploy a new puppet master using the same process, it will either fail to communicate with itself since PE is not installed, or it will talk to a PE installation that does not reflect your production environment. We need to make sure that we have the ability to provision without puppet, or be prepared for some manual efforts in the deploy. With a single master, manual efforts aren’t that burdensome, but can still reduce accuracy, which is why I prefer a modified automated provisioning workflow.

A lot of bootstrapping – specifically hiera and r10k/code manager – should be handled by the restore. There were just a few things I needed to do:

  • Run ssh-keygen/install an existing key and attach that key to the git system. You can avoid this by managing the ssh private/public keys via file resources, but you will not be able to pull new code until puppet processes that resource.
  • SSH to your git server and accept the key. You can avoid this with the sshkey resource, with the same restriction.
  • Check your VMs default iptables/selinux posture. I suggest managing security policy via puppet, which should prevent remote agents from connecting before the first puppet run, but it’s also possible to prevent the master from communicating with itself with the wrong default policy.
  • Check the hostname matches your expectations. All of /etc/hosts, /etc/hostname, /etc/sysconfig/network should list the short and FQDN properly, and hostname; hostname -f should return the same values.¬†/etc/resolv.conf may also need the search domain. Fix any issues before installing PE, as certs are generated during install, and having the wrong hostname result can cause cascading faults best addressed by starting over.

The restore should get the rest from the PE side of things. If your provisioning automation performs other work that you had to skip, make sure you address it now, too.

Installing PE is probably the one manual step you cannot avoid. You can go to https://support.puppet.com and find links to current and past PE versions. Make sure you get the EL7 edition and not the EL6 edition. I did not check with Support, but I assume that you must restore on the same version you backed up, I would not risk even a patch release difference.

Skipping the restore brings us to running the agent, a simple puppet agent -t on the master, or waiting 30 minutes for the run to complete on its own.

The final step may not apply to your situation. In addition to refreshing the OS of the master, I switched to a new hostname. If you’re dropping your new master on top of the existing one’s hostname/IP, you can skip this step. I forked a new branch from production called mastermigration. The only change in this branch is to set the server value in /etc/puppetlabs/puppet/puppet.conf. There are a number of ways to do this, I went with a few ini_setting resources and a flag manage_puppet_conf in my profile::base::linux. The value should only be in one of the sections main or agent, so I ensured it is in main and absent elsewhere:

  if $manage_puppet_conf {
    # These settings are very useful during migration but are not needed most of the time
    ini_setting { 'puppet.conf main server':
      ensure => present,
      path => '/etc/puppetlabs/puppet/puppet.conf',
      section => 'main',
      setting => 'server',
      value => 'puppet.example.com',
    }
    ini_setting { 'puppet.conf agent server':
      ensure => absent,
      path => '/etc/puppetlabs/puppet/puppet.conf',
      section => 'agent',
      setting => 'server',
    }
  }

During the migration, I can just set profile::base::linux::manage_puppet_conf: true in hiera for the appropriate hosts, or globally, and they’ll point themselves at the new master. Later, I can set it to false if I don’t want to continue managing it (while there is no reason you cannot leave the flag enabled, by leaving it as false normally you can ensure that changing the server name here does not take effect unless purposefully flip the flag; you could also parameterize the server name).

Now let’s examine the new feature that makes it go.

Backups and Restores

Puppet’s documentation on the backup/restore feature provides lots of detail. It will capture the CA and certs, all your currently deployed code, your PuppetDB contents including facts, and almost all of your PE config. About the only thing missing are some gems, which you should hopefully be managing and installing with puppet anyway.

Using the new feature is pretty simple, puppet-backup createor puppet-backup restore <filename> will suffice for this effort. There are a few options for more fine-grained control, such as backup/restore of individual scopes with --scope=<scopes>[,<additionalscopes>...], e.g. --scope=certs.

 

The backup will only backup the current PE edition’s files, so if you still have /etc/puppet on your old master from PE 3 days, that will not be part of the backup. However, files in directories it does back up, like /etc/puppetlabs/puppet/puppet.conf.rpmsave, will persist. This will help reduce cruft, but not eliminate it. You will still need to police on-disk content. In particular, if you accidentally placed a large file in /etc/puppetlabs, say the PE install tarball, that will end up in your backup and can inflate the size a bit. If you feel the backup is exceptionally large, you may want to search for large files in that path.

The restore docs also specify two commands to run after a restore when Code Manager is used. If you use CM, make sure not to forget this step:

puppet access login
puppet code deploy --all --wait 

The backup and restore process are mostly time-dependent on the size of your puppetdb. With ~120 agents and 14 days of reports, it took less than 10 minutes for either process and generated a ~1G tarball. Larger environments may expect the master to be offline for a bit longer, if they want to retain their full history.

Lab it up

The backup/restore process is great, but it’s new, and some of us have very ancient systems laying around. I highly recommend testing this in the lab. My test looked like this:

  • Clone the production master to a VM on another hostname/IP
  • Run puppet-backup create
  • Fully uninstall PE (sudo /opt/puppetlabs/bin/puppet-enterprise-uninstaller -p -d -y)
  • Remove any remaining directories with puppet in them, excepting the PE 2018 install files, to ensure all cruft is gone
  • Disable and uninstall any r10k webhook or puppet-related services that aren’t provided by PE itself.
  • Reboot
  • Bootstrap (from above)
  • Install PE (sudo /opt/puppetlabs/bin/puppet-enterprise-installer) only providing an admin password for the console
  • Run puppet-backup restore <backup file>
  • Run puppet agent -t
  • Make sure at least one agent can check in with¬†puppet agent -t --server=<lab hostname> (clone an agent too if need be)
  • Reboot
  • Make sure the master and agent can still check in, Console works, etc.
  • If possible, test any systems that use puppet to make sure they work with the new master
  • Identify any missing components/errors and repeat the process until none are observed

I mentioned that I used PE3. My master had been upgraded all the way from version 3.7 to 2018.1.2. I’m glad I tested this, because there were some unexpected database settings that the restore choked on. I had to engage Puppet Support who provided the necessary commands to update the database so I could get a useful backup. This also allowed me to identify all of my bootstrap items and of course, gain familiarity and confidence with the process.

This became really important for me because, during my production migration, I ran into a bug in my provisioning system where the symptom presented itself through Puppet. Because I was very practiced with the backup/restore process, I was able to quickly determine PE was NOT the problem and correctly identify the faulty system. Though it took about 6 hours to do my “very quick” migration, only about an hour of that was actually spent on the Puppet components.

I also found a few pieces of managed files on the master where the code presumed the directory structure would already be there, which it turns out was not the case. I must have manually created some directories 4 years ago. I think the most common issues you would find at this point are dependencies and ordering, but there may be others. Either fix the code now or, if it would negatively affect the production server, prep a branch for merging just prior to the migration, with the plan to revert if you rollback.

I strongly encourage running through the process a few times and build the most complete checklist you can before moving on to production.

Putting it together

With everything I learned in the lab, my final outline looked like this:

  • Backup old master, export to another location
  • Deploy a new master VM running EL7 using an alternative workflow
  • Run ssh-keygen/install an existing key and attach that key to the git system
  • SSH to the git server and accept the key
  • Verify your VMs default iptables/selinux posture; disable during bootstrap if required
  • Validate the hostname is correct
  • Install PE
  • Restore the backup
  • [Optional] Merge any code required for the new server; run r10k/CM to ensure it’s in place on the new master
  • Run puppet
  • Point agents at the new master

Yours may look slightly different. Please, spend the time in the lab to practice and identify any missing steps, it’s well worth it.

Summary

Refreshing any system of significant age is always possible, and often fraught with manual processes that are prone to error. Puppet Enterprise 2018.1 delivered a new backup/restore process that automates much of this process. We have put together a rough outline, refined it in the lab, and then used it to perform the migration in production with high confidence, accounting for any components the backup did not include. I really appreciate this new feature and I look forward to refinements in the future. I hope that soon enough migrations should be as simple an effective as in-place upgrades.