vCenter 7 upgrade fails with “Exception occurred in install precheck phase”

I finally decided to upgrade my home lab to vSphere 7. First thing is always to upgrade vCenter. I had some issues with my 6.7 VCSA, specifically, that I lost the root password and kinda broke it when trying to recover. No big deal, I don’t use dvSwitches at home and I only have 9 VMs, so no great loss to set up a new vCenter.

After powering off the old VCSA, I deployed the latest VCSA 7.0u1 image. Everything went well. The next day, I found out the 7.0u1a patch had been released. Since I hadn’t done more than add the vSphere hosts to vCenter, I jumped into the VAMI to perform an automatic upgrade. 7.0u1a was right there waiting for me. I selected it and started the upgrade.

About 30 seconds after starting the upgrade, I received this error:

Installation failed: Exception occurred in install precheck phase

I googled for “Exception occurred in install precheck phase” and didn’t see any results for this at all. Everything that looked close was way off. Hitting Resume just started the upgrade again, asking me to cancel or proceed. The cancel button did nothing, and the proceed button led right back to this error.

I tried rebooting the VCSA, figuring maybe it just needed a fresh boot. Logging into the VAMI – after the reboot – showed the same error. In addition, it grays out the left-hand menu. If you change the URL to remove the navigation (https://vcsa/ui) you do get back to the menu, but any attempt to perform another upgrade leads to the same cycle.

After deleting this VCSA and deploying another 7.0u1 instance from scratch, I encountered the same error! Again, google failed me, even a few days later. Thankfully, someone on the VMware{code} slack pointed me to an article on Paul Braren’s blog.

Though the error was somewhat different, the resolution worked for me! Paul’s post goes through the detail of how he ran into the problem and leads to an ultimate resolution of removing the file /etc/applmgmt/appliance/software_update_state.conf. Once I removed this file, I was able to navigate to and perform an upgrade in the VAMI.
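If you would rather set the file aside than delete it outright, here is a minimal sketch of that step. Only the file path comes from the post; the set_aside helper, the timestamped backup name, and the assumption that moving the file has the same effect as removing it are mine, so treat it as an illustration and not a tested VMware procedure. It would need to run as root in a shell on the appliance.

```python
import shutil
import time
from pathlib import Path
from typing import Optional

# Path taken from the post; everything else in this sketch is illustrative.
STATE_FILE = Path("/etc/applmgmt/appliance/software_update_state.conf")


def set_aside(path: Path) -> Optional[Path]:
    """Move `path` to a timestamped backup beside it.

    Returns the backup path, or None if the file does not exist.
    """
    if not path.exists():
        return None
    stamp = time.strftime("%Y%m%d%H%M%S")
    backup = path.with_name(f"{path.name}.{stamp}.bak")
    shutil.move(str(path), str(backup))
    return backup


if __name__ == "__main__":
    result = set_aside(STATE_FILE)
    if result is None:
        print(f"{STATE_FILE} not found, nothing to do")
    else:
        print(f"moved {STATE_FILE} to {result}")
```

Keeping a copy means you can compare the stale state file against a fresh one later if the error ever comes back.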

I could, of course, have simply downloaded a new image of the VCSA and deployed it, but in production we don’t often have this luxury. We have to fix what’s out there instead of starting from scratch. I hope that by writing this post, which builds on Paul’s excellent work, this error message pulls up at least one result for you. Take that, denvercoder9!

Using PowerShell 7 in VS Code

If you haven’t heard, PowerShell 7 has been released! Even if you haven’t gotten emails or RSS alerts, it’s hard to miss if you use VS Code as the PowerShell plugin will remind you on startup:

Installation on Windows is as simple as selecting Yes and following the prompts. You’ll have to close VS Code – and for some reason, Slack, at least on Windows – and shortly you’ll see PowerShell 7 installed. I assume it’s similar on other OSes, but the specifics may differ.

When you relaunch VS Code, however, you’ll still be using whatever PowerShell you had prior. For me, it was 5.1. This is because the default integrated shell on Windows is the base PowerShell from your OS, and PowerShell 7 is a separate install.

Select Edit in settings.json and add this text (using the correct path if you installed to a non-default location) and save the file:

 "terminal.integrated.shell.windows": "C:\\Program Files\\PowerShell\\7\\pwsh.exe",

Because this is a completely different installation, it doesn’t inherit our existing profile, so before restarting, run notepad $profile and copy the contents. Now restart VS Code and you should have a PowerShell 7 prompt, albeit an uncustomized one.

Type notepad $profile, paste your old profile, save and exit, and restart VS Code. Now you have PowerShell 7 as your integrated console with the same customizations as before. Of course, with a new version there might be more you want to customize – and now you’re ready to do so!

If you run VS Code on macOS or Linux, the process above should be a good guideline, but will likely need some tweaks. If you’ve done this, please leave a comment with any specifics you can share, thanks!

Making VS Code’s Powershell Integrated Console useful

I recently started using VS Code pretty heavily and I’ve had a fun time configuring it – especially getting synth wave glow working! One thing that continued to bother me was the Powershell Integrated Console (PIC) – which is different than a normal terminal running Powershell. Not only was it a different powershell session, but it behaved slightly differently. Let’s take a look at the differences between the two, then look at how to improve the PIC.

Types of Terminals

You can run a number of different terminals inside VS Code. If you don’t see your terminal, you can hit Ctrl-` to bring it up. What kind of terminal comes up is up to you. Pull up settings and search for terminal.integrated.shell.windows. You can check the official docs for more information, including common shells. You’ll note that on that page, it’s called the Integrated Terminal, but don’t confuse that for the Powershell Integrated Console. I’ve chosen Powershell with "C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe", but whatever you chose, you’ll see it come up as the 1st console:

This is the same terminal no matter what kind of file you are editing. Many of us are editing Powershell, so we also have the Powershell extension installed. This is what gives us intellisense, code snippets, PS Script Analyzer output inline, etc. It also lets us run and debug code, and here’s where the Integrated Terminal and Integrated Console differences start to come into play.

When you highlight a line of code in a Powershell file and hit F8, it runs in your current terminal. In my case, it’s the powershell terminal (#1), as we see here.

When you hit F5, the Powershell extension runs the entire file. In addition to force-saving the file, you’ll notice one big difference – it runs in a different terminal, the Powershell Integrated Console (#2):

That looks a bit different! It’s got a different prompt, it’s got no colors, it’s just not the same. So why does it exist?

When you enable the Powershell extension, it has to run a powershell instance in order to provide the intellisense and syntax highlighting and such. You can hide that terminal on start (in settings.json, add "powershell.integratedConsole.showOnStartup": false), but it HAS to run in order to get those benefits. If you close it (Trashcan icon), you’ll get a notice from the Powershell extension that it crashed and you should restart it, which you need to if you want to leverage the extension (click on Session Exited if you accidentally dismissed the error). So, no, you cannot get rid of the PIC terminal.

It’s still running Powershell, though, right? Except for being a different session, it shouldn’t be a big deal, right? Well, not if you want to do fancy stuff like colors or Ctrl-r to search history or anything involving PSReadline. For reasons that are well beyond the scope of this article, PSReadline support is not available in the Integrated Console from the Powershell extension (issue 535).

Improving the PIC Terminal

Thankfully, there IS a solution! While the PSReadline support is not yet available in the main extension, it IS available in the Powershell Preview extension! This is a preview and you may encounter bugs, but it’s pretty stable for me so far. Just install it through the Extension Manager (Ctrl-Shift-X), disable the Powershell extension, and reload your session. If you modified your theme, you’ll be asked if you want to keep it or revert to the ISE-like theme. You’ll now have access to PSReadline and other preview features. Ctrl-r brings up search history and colors are available, hooray!

If at any time you find the preview extension causes you problems, just go back to the Extension Manager, disable Powershell Preview, Enable Powershell, and reload. Voila!

One other modification I made is to enable emacs mode with PSReadline. This gives you key bindings that mirror emacs, like Ctrl-w to remove a single word, Ctrl-u to delete from the cursor to the beginning of the line, etc. – all things I’m very comfortable with after decades of working in *nix. You can add Set-PSReadLineOption -EditMode Emacs to your profile to enable this – and remember that the Integrated Terminal and the PIC each have their own profile.

You still have to make sure you’re in the right terminal for the right session (variables set in #1 aren’t set in #2, etc.), but hopefully with the Preview extension, you find the Powershell Integrated Console more useful and your overall VS Code experience is improved. Enjoy!

Updating Puppet classification with hiera to use the modern lookup command

One of the most important parts of using any configuration management software, including Puppet, is making sure that your nodes receive the correct classification. You can write all the code you want to describe the desired system state, but if you don’t attach it to a node, it doesn’t provide any value. In previous articles, I have described using hiera_include() for that classification. However, hiera functions have been deprecated since at least version 4.10.

The replacement for hiera functions is called Lookup. Automatic Parameter Lookup now directly uses lookup instead of the older hiera functions. There is an actual lookup() function. There is also a plugin for the puppet command line utility, puppet lookup. Here’s what we have now:

node default {
  hiera_include('classes')
}

What can we replace this with? If we simply swap out hiera_include with lookup, we don’t actually include the results. We can add .include at the end, which is an object-oriented function that can be called on any string or array of strings:

node default {
  lookup('classes').include
}

This works, but leaves some ugly edges. First, if there are no results for classes, it gives a static error, rather than one you can control:

C:\Windows\system32>puppet apply --noop -e 'lookup("classes").include'
Error: Function lookup() did not find a value for the name 'classes'

Second, it could return a null string, which also gives an error:

C:\Windows\system32>puppet apply --noop -e '"".include'
Error: Evaluation Error: Error while evaluating a Method call, Cannot use empty string as a class name (line: 1, column: 3) on node agent.example.com

A third issue is one of cleanliness, not an actual error: the result could have multiple instances of the same class. Inserting .unique between the lookup and the include would address that.

Fourth, we could have almost any type of data in the classes key, not just strings. If it returned a hash, some other type related error would be seen.

Finally, while rare, we could potentially want to iterate on the result, maybe for logging purposes, and as it stands now, the lookup would have to be performed again for that. We can store the result and operate on it.

With these concerns in mind, a more comprehensive result was obtained with the help of Nate McCurdy:

node default {
  # Find the first instance of `classes` in hiera data and include its unique values. Does not merge results.
  $classes = lookup('classes', Variant[String,Array[String]])
  case $classes {
    String[1]: {
      include $classes
    }
    Array[String[1],1]: {
      $classes.unique.include
    }
    default: {
      fail('This node did not receive any classification')
    }
  }
}

The lookup() call does two things differently now. First, we specify a type definition. classes MUST be either a String or an Array of Strings. Any other result will trigger a type mismatch error (while I definitely encourage the “one node, one role” promise of the role/profile pattern, returning multiple classes can be useful during development – just be sure not to allow such tests to propagate to production). Second, the result is stored in $classes.

Next, we have a case statement for the result. The first case is a String with at least one character, to protect against a null string. In that case, the single class is included. The second case matches an Array of Strings with at least 1 element, and each element has at least one character, to protect against a null array or an array with null strings.

The final case matches any empty strings or arrays and throws a customized error about the lack of classification. Because this fails, instead of silently completing without applying anything, no catalog is compiled and the agent’s local state cache is not updated, both events your monitoring system can report on.

As is, this provides a flexible classification system using modern puppet language constructs. It can also be further customized for each case, if desired. I would recommend that anyone still using hiera_include() review this and implement something similar to protect against the eventual removal of the deprecated functions.

Thanks to Nate McCurdy for his assistance with the language constructs!

Planning Your Distributed Log Insight Deployments

As I mentioned recently, I’ve changed jobs and it’s giving me more time for my blog. One of my first challenges at the new job is to look at how to deploy a Log Insight cluster, with the wrinkle that there are multiple datacenters and availability zones that need to leverage Log Insight. Previously, I have only worked with single-node instances of Log Insight, so I had a lot to learn.

The design includes three datacenters, A, B, and C, and the last has two availability zones, giving us A, B, C-1 and C-2. Each site has between 10 and 50 ESXi hosts, so not small but not gigantic, either. Every center should also forward data to a per-datacenter instance of a separate system for compliance.

Log Insight Product Docs

I found a few articles to help me out with the design aspect. The first are the official VMware docs vRealize Log Insight Configuration Limits, Sizing the vRealize Log Insight Virtual Appliance and Planning Your vRealize Log Insight Deployment. Each cluster consists of 3-12 members – one master and 2-11 workers – and each node can have up to 4TB of storage. They can talk to 1 each of a vROps Manager and Active Directory domain, 15 vCenter servers, and 10 forwarders. When using a cluster, nodes must be Medium or Large size and all nodes must be the same size. We should always use the Integrated Load Balancer (ILB) and direct targets to its virtual IP (VIP), even in non-cluster mode. This allows addition of nodes in the future without having to adjust the destination address of other devices.

There are a few caveats, as well. LI’s ILB does not support geoclusters yet, so all cluster members must be on the same Layer 2 network; devices in different L2 networks must be in separate clusters. If you’re using NSX, exclude the LI nodes from Distributed Firewall Protection, otherwise the ILB traffic may be blocked by spoofed traffic rules.

Initial Design

This gives us a lot of information to start designing. We need at least 3 medium or large Log Insight nodes in each datacenter, plus a VIP address assigned to the ILB. Since we have thin provisioned storage, we assign an additional 4TB disk to each node – it’s slightly above the space that LI will use, but hey, thin provisioned and no math required! If you don’t have thin provisioned storage, you can check the usable storage in the UI after deployment (IIRC it’s .6G for LI 8, but might be different if you’re upgrading a node from a previous version) and add whatever the delta is.

There is also one paragraph under the Planning doc’s Clusters with Forwarders section that I almost missed on the first readthrough:

The design is extended through the addition of multiple forwarder clusters at remote sites or clusters. Each forwarder cluster is configured to forward all its log messages to the main cluster and users connect to the main cluster, taking advantage of CFAPI for compression and resilience on the forwarding path. Forwarder clusters configured as top-of-rack can be configured with a larger local retention.

What this means is that in addition to clusters for A, B, C-1 and C-2, which will receive logs from the hosts in their respective datacenters, we also need a main cluster for our “Single Pain of Glass” view. This means 5 clusters of at least 3 nodes each, with clusters A, B, C-1, and C-2 forwarding to both the compliance system and the Main cluster.

Less obvious from the reading, but implied, is that the retention times of the SPOG cluster will likely be shorter than those of the datacenter clusters. Log Insight simply rotates out the oldest logs (by default to the bitbucket, but you can set up a long term archive location) when the disks get full, so your log ingestion rate determines your retention timeline in each cluster. However, you can’t combine 4 sets of logs into 1 without taking up more space, which means the retention time with the same disk space will be lower. We could increase the size of the SPOG cluster to 12 nodes, but it’s still possible that one noisy datacenter drowns out logs from the other 3 datacenters. Regardless of whether you increase the SPOG cluster size or not, it’s best to assume that your longest retention timeframes will be on the local LI cluster. As a result, when you log into the main cluster and you need to go back just a tiny bit further in time than is retained, you may have to log in to the respective datacenter LI instance to view the older records.
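To make the retention argument concrete, here is a rough back-of-the-envelope sketch. The ingestion rates are invented for illustration, and it assumes a cluster’s usable capacity is simply 3 nodes x 4TB with no overhead, which overstates what Log Insight will really give you. The only point is the ratio between the datacenter clusters and the main cluster.

```python
def retention_days(usable_tb: float, ingest_gb_per_day: float) -> float:
    """Days until the oldest logs rotate out, assuming a steady ingestion rate."""
    if ingest_gb_per_day <= 0:
        raise ValueError("ingestion rate must be positive")
    return usable_tb * 1000 / ingest_gb_per_day


# Hypothetical daily ingestion per datacenter cluster, in GB (made-up numbers).
ingest = {"A": 100, "B": 60, "C-1": 40, "C-2": 200}

# Assumption: 3 nodes x 4TB each, ignoring any filesystem or product overhead.
cluster_tb = 3 * 4

for name, rate in ingest.items():
    print(f"{name}: {retention_days(cluster_tb, rate):.0f} days")

# The main cluster receives everything the four forwarding clusters ingest.
main_rate = sum(ingest.values())
print(f"Main: {retention_days(cluster_tb, main_rate):.0f} days")
```

With these made-up numbers, the quietest datacenter cluster keeps about 300 days of logs while the main cluster keeps about 30, and the noisy C-2 site accounts for half of what the main cluster has to store.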

Additional Guides

Before finalizing the design, I looked beyond the product docs and found two more great references. VMware has an Architecting a VMware vRealize Log Insight Solution for VMware Cloud Providers whitepaper from January 2018, which makes it a bit older than Log Insight 8.0 but is still very relevant (reminder: LI jumped from 4.8 to 8.0 to match product numbers with other vRealize Suite products, not because of any major changes to LI). There’s a ton of valuable information in this paper, including some tuning advice for ESXi that I’ll inevitably come back to later, but right now we’re focused on the cluster design aspects.

While it’s not a factor for me, section 3.4.2 (pages 25-27) covers how to use a non-LI system as an intermediate syslog forwarder. Section 4.5 (page 33) has a table showing estimated log retention sizes for a single ESXi node or an 8 node ESXi cluster, which may be helpful in understanding the retention pattern of 3x4TBs in a cluster and whether additional nodes are required just for storage. Section 6.3 (pages 36-38) includes a number of tables of the required firewall ports and directions to build a proper firewall policy. Page 39 reminds us that the Log Insight nodes should probably run on a management cluster that has more nodes than the LI cluster (at least N+1), has HA enabled, and is using local (non-auto deploy) boot. Section 7.3 (page 42) notes that while an LI cluster can only communicate with one vROps Manager, one vROps Manager can communicate with multiple LI clusters – though the Launch in Context only works with a 1:1 mapping. Lots of good but random stuff there to inform the overall plan.

Section 10, starting on page 52, examines three scenarios and includes details on the resulting design. Design scenario C (page 54) most closely approximates my scenario. The London, Paris, and Frankfurt datacenters approximate my A, B, C-1, and C-2 datacenters, and the GNOC cluster approximates my Main cluster. The sizing is a bit more conservative (1 node at datacenters, 3 in the GNOC) but otherwise pretty close to what I came up with. Score 1 for me!

There’s one more document I reviewed, Log Insight Best Practices: Server by Steve Flanders. I worked my way backwards to this link, which is pretty much a checklist of everything I already identified, and a lot more, but all in one document! Really wish I had found this one first. Some of the things I didn’t catch already include:

  • Only list the local Active Directory domain servers; listing distant servers could result in up to 20 minute wait times for logins (or IME, just broken logins).
  • Use a service account for AD binding; this decreases the chances of an expired password impairing your ability to log in.
  • If you use data archiving, LI does NOT clean up the archive location. It’s your job to make sure it continues to have free space and deletes data older than your long term retention requirements.
  • The 2TB limit mentioned is from 2015 and is currently 4TB.
  • Steve recommends using the DNS name instead of the IP as the log destination. There are pros and cons to each, I recommend investigating this and making an informed decision.
  • Replace the self-signed SSL certificate with a custom SSL certificate. Don’t forget to import the updated cert into other systems that need it, esp vROps.
  • We hopefully all know the importance of NTP, and Steve has written an article on proper NTP configuration. There’s nothing specific to LI here, but it’s especially vital that our log analysis systems are all properly synchronized. There’s nothing like events that actually happen hours apart appearing correlated because NTP isn’t working!

Full Design

With the addition of the whitepaper and Steve’s checklist, we have a better design. Let’s lay it out in two pieces: the installed components and the checklist of changes.

Components

  • Datacenter A
    • 3 node LI management cluster with a VIP, each node is a Medium install with an extra 4TB disk
    • 3 node LI forwarding cluster with a VIP, each node is a Medium install with an extra 4TB disk
  • Datacenter B
    • 3 node LI forwarding cluster with a VIP, each node is a Medium install with an extra 4TB disk
  • Datacenter C-1
    • 3 node LI forwarding cluster with a VIP, each node is a Medium install with an extra 4TB disk
  • Datacenter C-2
    • 3 node LI forwarding cluster with a VIP, each node is a Medium install with an extra 4TB disk

Checklist

Follow this for every cluster.

  • Deploy a single node as a new deployment to create the cluster
  • Configure NTP
  • Configure Active Directory authentication
    • Use the in-datacenter DCs only, accept certificates when using SSL
    • Use a service account for the binding
  • Replace the self-signed certificate with a custom certificate
  • Deploy 2 additional nodes as members of an existing cluster
  • (Forwarding clusters) Configure forwarding with a target of the management cluster’s VIP and the compliance system, by FQDN
  • (Forwarding clusters) Configure vCenter/vROps integration with the local datacenter’s vCenter(s) and single vROps instance.

Once the clusters are stood up and configured, they are ready to ingest data. You may choose an IP or FQDN. Though DNS is not hosted on the ESXi hosts, I still prefer IP-based destination as it is more likely to work during a datacenter issue; your preference may be different. How you apply it is also up to you. For ESXi hosts, you can use the vCenter integration in each forwarding cluster, Host Profiles, PowerCLI, or other vSphere APIs. LI can also ingest data from non-ESXi hosts, if you want to point them at LI as well.

Summary

We’ve reviewed a number of documents, official and unofficial, on Log Insight designs. We’ve taken the general instructions and guidelines and built our own design and implementation checklists. The design complies with best practices and can easily be expanded – just add additional nodes to each cluster as the need arises.

If I’ve missed anything, or you have questions about the design, drop a note in the comments or ping me on Twitter. Thanks!

Change is in the air

As the leaves start to change color and fall, you can’t help but notice the changes in the air. Our family is no exception.

First, I am so excited and happy for my wife, Michelle, who accepted the Stark Professor Endowed Chair here at IUSM. If you’re not familiar with academia positions, this is a really big deal! In addition, it’s at the same institute she’s already at, so no moving this time. Super proud of you, Michelle! She’ll be starting her new position on November 1st.

Second, I am changing jobs. I just celebrated my last day at AT&T, where I spent almost 15 years. It was a great ride, but sometimes you need a change. I’m gonna miss y’all! Next week, I start my new job at athenahealth. Like my previous one, this is full time remote and I will go to the Boston office a few times a year. I’m super excited about this change, it’s my first completely new job since 2004 and it’s in a new industry. This change should also give me more time to work on Puppet and the blog, which is great since I have some really great ideas following Puppetize PDX. However, I’m taking the week off between jobs to relax and avoid doing work-like things – and we’re redoing the floor in the guest bathroom to help me avoid temptation.

There’s also some other stuff in the works that we don’t want to jinx, but you’ll hear more soon. Tune in on Twitter if you want to be the first to get the deets.

Fall is always an exciting time, and I hope yours is as exciting as ours!

Adding, extending, and removing Linux disks and partitions in 2019

Managing disks and partitions in Linux has changed quite a bit over time. Unfortunately, as Jonathan Frappier points out, a lot of advice is either wrong, dated, or makes some poor assumptions along the way:

I know I almost always go back to the search results drawing board when I have to manage a disk, since it’s pretty infrequent, and I have the same problem of sifting the documentation that will work from that which won’t. I hope this article can be used as a source of modern, tested documentation for some common use cases.

Tools and File systems

First, a quick mention of the tools and file systems available.

fdisk

A venerable tool that continues to work fully as long as your disks are under 2 T in size. Once you’re over 2 T, I’d treat it as a read-only tool and not use it to make edits. You can use fdisk -l to display information on all disks it sees or fdisk <devicepath> to interact with a specific disk:

[rnelson0@build03 ~]$ sudo fdisk -l

Disk /dev/sda: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000b06ec

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048     1026047      512000   83  Linux
/dev/sda2         1026048   209715199   104344576   8e  Linux LVM

Disk /dev/mapper/VolGroup00-lv_root: 12.6 GB, 12582912000 bytes, 24576000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/mapper/VolGroup00-lv_swap: 4294 MB, 4294967296 bytes, 8388608 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/mapper/VolGroup00-lv_home: 83.9 GB, 83886080000 bytes, 163840000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/loop0: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/loop1: 2147 MB, 2147483648 bytes, 4194304 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/mapper/docker-253:0-263061-pool: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 65536 bytes / 65536 bytes

[rnelson0@build03 ~]$ sudo fdisk /dev/sda
Welcome to fdisk (util-linux 2.23.2).

Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): m
Command action
   a   toggle a bootable flag
   b   edit bsd disklabel
   c   toggle the dos compatibility flag
   d   delete a partition
   g   create a new empty GPT partition table
   G   create an IRIX (SGI) partition table
   l   list known partition types
   m   print this menu
   n   add a new partition
   o   create a new empty DOS partition table
   p   print the partition table
   q   quit without saving changes
   s   create a new empty Sun disklabel
   t   change a partition's system id
   u   change display/entry units
   v   verify the partition table
   w   write table to disk and exit
   x   extra functionality (experts only)

Command (m for help): p

Disk /dev/sda: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000b06ec

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048     1026047      512000   83  Linux
/dev/sda2         1026048   209715199   104344576   8e  Linux LVM

Command (m for help): q
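As a quick sanity check on where the 2 T figure comes from: as far as I know, it is tied to the dos disk label shown in the output above, because DOS (MBR) partition entries store sector values in 32 bits. The sketch below is only that arithmetic, using the 512 byte sector size fdisk reports. The function names are mine, and the check is a simplification that treats the whole disk as needing to fit under the limit.

```python
SECTOR_BYTES = 512        # logical sector size reported by fdisk above
MBR_MAX_SECTORS = 2**32   # DOS (MBR) partition entries hold 32 bit sector values


def mbr_limit_bytes(sector_bytes: int = SECTOR_BYTES) -> int:
    """Largest size addressable with a DOS (MBR) label at this sector size."""
    return MBR_MAX_SECTORS * sector_bytes


def needs_gpt(disk_bytes: int, sector_bytes: int = SECTOR_BYTES) -> bool:
    """True if the disk is larger than a DOS (MBR) label can address."""
    return disk_bytes > mbr_limit_bytes(sector_bytes)


limit = mbr_limit_bytes()
print(limit, "bytes =", limit / 2**40, "TiB")  # 2199023255552 bytes = 2.0 TiB

print(needs_gpt(107374182400))  # /dev/sda from the output above -> False
print(needs_gpt(4 * 10**12))    # a hypothetical 4 TB disk -> True
```

So the 107.4 GB /dev/sda above is comfortably inside the limit, while anything past roughly 2.2 TB (2 TiB) is where a GPT label, and a tool you trust with it, becomes necessary.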

parted

A more modern tool that supports over 2 T disks. I would prefer this, though there’s nothing wrong with fdisk other than the size limit. Similarly, you can use -l to show all data or a device name to interact. One nice thing is it defaults to human-readable sizes instead of a jumble of long numbers without commas.

[rnelson0@build03 ~]$ sudo parted -l
Model: VMware Virtual disk (scsi)
Disk /dev/sda: 107GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End    Size   Type     File system  Flags
 1      1049kB  525MB  524MB  primary  ext4         boot
 2      525MB   107GB  107GB  primary               lvm


Model: Linux device-mapper (thin-pool) (dm)
Disk /dev/mapper/docker-253:0-263061-pool: 107GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End    Size   File system  Flags
 1      0.00B  107GB  107GB  xfs


Model: Linux device-mapper (linear) (dm)
Disk /dev/mapper/VolGroup00-lv_home: 83.9GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
 1      0.00B  83.9GB  83.9GB  ext4


Model: Linux device-mapper (linear) (dm)
Disk /dev/mapper/VolGroup00-lv_swap: 4295MB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system     Flags
 1      0.00B  4295MB  4295MB  linux-swap(v1)


Model: Linux device-mapper (linear) (dm)
Disk /dev/mapper/VolGroup00-lv_root: 12.6GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
 1      0.00B  12.6GB  12.6GB  ext4


[rnelson0@build03 ~]$ sudo parted /dev/sda
GNU Parted 3.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) help
  align-check TYPE N                        check partition N for TYPE(min|opt) alignment
  help [COMMAND]                           print general help, or help on COMMAND
  mklabel,mktable LABEL-TYPE               create a new disklabel (partition table)
  mkpart PART-TYPE [FS-TYPE] START END     make a partition
  name NUMBER NAME                         name partition NUMBER as NAME
  print [devices|free|list,all|NUMBER]     display the partition table, available devices, free space, all found partitions, or a particular partition
  quit                                     exit program
  rescue START END                         rescue a lost partition near START and END

  resizepart NUMBER END                    resize partition NUMBER
  rm NUMBER                                delete partition NUMBER
  select DEVICE                            choose the device to edit
  disk_set FLAG STATE                      change the FLAG on selected device
  disk_toggle [FLAG]                       toggle the state of FLAG on selected device
  set NUMBER FLAG STATE                    change the FLAG on partition NUMBER
  toggle [NUMBER [FLAG]]                   toggle the state of FLAG on partition NUMBER
  unit UNIT                                set the default unit to UNIT
  version                                  display the version number and copyright information of GNU Parted
(parted) print
Model: VMware Virtual disk (scsi)
Disk /dev/sda: 107GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:

Number  Start   End    Size   Type     File system  Flags
 1      1049kB  525MB  524MB  primary  ext4         boot
 2      525MB   107GB  107GB  primary               lvm

(parted) quit

ext4

The “extended” family of filesystems (currently ext4, but possibly ext3 or ext2 if you work on some really old systems) has been used by Linux for a long time. It’s still the default in a number of distros, especially for smaller partitions. It’s a very competent journaled filesystem and the maximums have been massively extended over earlier versions (for example, 1 EiB max volume and 16 TiB max file size), but it’s still considered yesterday’s technology. The biggest fault I find with ext4 is that the inode ratio – inodes are what track the metadata for files – is set at filesystem creation time and cannot be changed. When a volume is extended, new inodes are only added at that same fixed ratio, so a filesystem full of small files can wind up horribly undersized in inodes while still having free space available. New files cannot be created when the inodes are exhausted. Most are familiar with measuring disk usage by space capacity, but you can also view your inode capacity:

[rnelson0@build03 ~]$ df -h
Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-lv_root   12G  6.7G  4.3G  62% /
devtmpfs                        908M     0  908M   0% /dev
tmpfs                           920M     0  920M   0% /dev/shm
tmpfs                           920M   98M  822M  11% /run
tmpfs                           920M     0  920M   0% /sys/fs/cgroup
/dev/sda1                       477M  185M  263M  42% /boot
/dev/mapper/VolGroup00-lv_home   77G  5.8G   68G   8% /home
tmpfs                           184M     0  184M   0% /run/user/1000
[rnelson0@build03 ~]$ df -i
Filesystem                      Inodes  IUsed   IFree IUse% Mounted on
/dev/mapper/VolGroup00-lv_root  768544 202858  565686   27% /
devtmpfs                        232260    360  231900    1% /dev
tmpfs                           235373      1  235372    1% /dev/shm
tmpfs                           235373    850  234523    1% /run
tmpfs                           235373     16  235357    1% /sys/fs/cgroup
/dev/sda1                       128016    354  127662    1% /boot
/dev/mapper/VolGroup00-lv_home 5120000 634404 4485596   13% /home
tmpfs                           235373      1  235372    1% /run/user/1000

This limitation leads me to recommend other filesystems, especially when you expect it will grow in the future.
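If you want to watch inode usage the same way you watch free space, a small filter over the df output can flag filesystems that are getting close. This is only a sketch: the function name inode_alert and the 80% default threshold are assumptions for illustration, and it reads df -Pi style text on stdin so it can be tried against saved output first.

```shell
#!/usr/bin/env bash
# Print "<mount point> <IUse%>" for every filesystem whose inode usage
# meets or exceeds the threshold (hypothetical default: 80).
# Input is `df -Pi` output on stdin; rows without a numeric IUse% are skipped.
# Mount points containing spaces are not handled by this simple version.
inode_alert() {
  local threshold="${1:-80}"
  awk -v t="$threshold" 'NR > 1 && $5 ~ /^[0-9]+%$/ {
    pct = $5
    sub(/%/, "", pct)
    if (pct + 0 >= t) print $6, $5
  }'
}

# Typical use:
#   df -Pi | inode_alert 80
```

On the build03 output above nothing would be reported at 80%, since the busiest filesystem is only at 27% inode usage.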

btrfs

Designed at Oracle for Linux, btrfs uses a copy-on-write B-tree data structure and provides advanced pooling, snapshots, and checksum features. Initially developed in 2008, it took until 2013 to be marked as stable and even longer for Linux distros to mark it as supported – even now, only Oracle Linux 7, SuSE 15, and Synology DSM v6 do so. Plenty of people do love it in spite of its support status and it is a viable solution; I’m just not as familiar with it personally.

xfs

Originally designed by SGI, xfs was ported to Linux in 2001. This, too, took a long while to become supported by distros, but it is now supported by almost every distro, many of which use it as the default file system. It supports dynamic inode allocation, volumes AND files of up to 8 EiB minus 1 byte, online resizing, and many other features. This is the default in Red Hat Enterprise Linux and my preferred FS, which I’ll be using later – but it is roughly equivalent to btrfs in general, if you prefer that.

LVM

Not quite a file system or a tool, the Logical Volume Manager is a device mapper that provides an extra layer of abstraction between the raw disks and the file systems the OS mounts. All of ext4, btrfs, and xfs can be used separately or with the LVM. Among other things, it lets you easily create device maps that span partitions and disks, resize them, and export/import them (for instance, to migrate disks to another system and retain the FS). I prefer using the LVM for everything; if you never modify the partitions it causes no harm, but if you ever need to and have not used LVM, it’s far more painful.

Most LVM commands begin with pv, vg, or lv and you can use tab-completion to discover the utilities we do not cover today. I regularly use pvs/vgs/lvs to get short summaries and pvdisplay/vgdisplay/lvdisplay for detailed status. All others are very situational.

Adding a new volume

With some understanding of the tools and filesystems at our fingertips, let’s take a look at a common scenario: adding a new volume. I’m only focused on the OS steps here, so we will assume that you have added a new physical disk, attached a VMDK or EC2 volume, etc. as your implementation requires. We will also assume that you want an xfs filesystem at /mnt/newdisk, the system had a single disk /dev/sda that was automatically partitioned by the installer, and the new disk receives the moniker /dev/sdb. Not all OSes and systems will provide the same name; for instance, Jonathan’s AWS Linux system’s second disk was called /dev/nvme1n1. In such a case, just swap in the correct device name.

Most modern distros will automatically detect new disks while running. If yours does not, or fails to detect the disk, you can force a rescan with the following command (change the host number as appropriate):

echo "- - -" > /sys/class/scsi_host/host0/scan
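If you are not sure which host number the new disk hangs off, one option is simply to rescan every SCSI host. The following is a minimal sketch; the function name rescan_scsi_hosts is made up for illustration, and the optional directory argument exists only so the loop can be tried against a scratch directory instead of the real /sys tree.

```shell
#!/usr/bin/env bash
# Write the "- - -" wildcard (all channels, targets, and LUNs) to the scan
# file of every SCSI host found under the given directory.
# Default directory is the real sysfs location; pass another path to test.
rescan_scsi_hosts() {
  local root="${1:-/sys/class/scsi_host}"
  local scan
  for scan in "$root"/host*/scan; do
    # Skip hosts we cannot write to (or an unmatched glob).
    [ -w "$scan" ] || continue
    echo "- - -" > "$scan"
    echo "rescanned ${scan%/scan}"
  done
}

# Typical use (as root):
#   rescan_scsi_hosts
```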

Make sure the new device shows up before continuing:

[rnelson0@build03 ~]$ ls /dev/sd?
/dev/sda
[rnelson0@build03 ~]$ ls /dev/sd?
/dev/sda /dev/sdb

You can use fdisk or parted to get information on the device. There will be no label, since we haven’t touched it, but the reported size should be accurate (be sure to account for the difference between, say, gigabytes and gibibytes). I was surprised to learn that when I specify 16 GB in my vSphere system, it apparently means gibibytes, where the proper label is actually GiB. By comparison, parted’s GB does mean gigabytes.

[rnelson0@build03 ~]$ sudo parted /dev/sdb
GNU Parted 3.1
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Error: /dev/sdb: unrecognised disk label
Model: VMware Virtual disk (scsi)
Disk /dev/sdb: 17.2GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags:
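The 17.2GB that parted reports for a disk created as 16 "GB" in vSphere is just the GiB-to-GB conversion: 16 × 1024³ bytes divided by 1000³. A quick way to sanity-check a reported size, with gib_to_gb being a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Convert a size in GiB (2^30 bytes) to decimal GB (10^9 bytes),
# rounded to one decimal place the way parted displays it.
gib_to_gb() {
  awk -v gib="$1" 'BEGIN { printf "%.1f\n", gib * 1024^3 / 1000^3 }'
}

gib_to_gb 16   # prints 17.2
gib_to_gb 32   # prints 34.4
```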

If we plan to have multiple partitions on a single disk, this is where we would create them using mkpart. However, you do not need to create a partition and can use the whole disk. I generally prefer using the whole disk; it’s fairly cheap to add a new disk and I believe it is cognitively easier to manage than extending disks and partitions. For instance, if you add a 100 GB disk and make a single partition and later extend the disk to 200 GB, you can just extend the partition easily. If you create two 50 GB partitions and extend the disk later, you can only extend the 2nd partition and would have to create a new partition and join it to an existing LVM mapper to (effectively) increase the space of the 1st. Alternatively, you can just always add entire disks to LVM groups, assuming that any extension is via an additional disk, which is what we will do here.

Though we are not creating partitions on the disk, we do want to leverage the LVM to create a mapping that we can manipulate semi-independently of the disks. Each LVM mapping has 3 layers – the physical volume (PV), the volume group (VG), and the logical volume (LV). The physical volume refers to the disk itself (even if it’s a virtualized system, we pretend it’s a physical disk). The logical volume is where we create the end mapping that receives the file system. The volume group sits in the middle, sort of akin to the partitions we skipped, and presents the physical volumes’ space to the logical volumes. In our case, we simply map the entire disk to a physical volume and a volume group, then assign 100% of the free space of the volume group to the new logical volume.

Once that is done, we format the result with our filesystem of choice, in this case xfs by using mkfs.xfs, and if it doesn’t exist, create the mount point. I normally put some of the information in variables, since the LVM commands are almost always the same. You’ll also notice I switched to root to save a few keystrokes over using sudo constantly, as all the LVM commands require elevated permissions.

#COMMANDS
DEVICE=/dev/sdb
NAME=newdisk
MOUNT=/mnt/newdisk

pvcreate ${DEVICE}
vgcreate vg_xfs_${NAME} ${DEVICE}
lvcreate -n lv_${NAME} -l 100%FREE vg_xfs_${NAME}
mkfs.xfs /dev/vg_xfs_${NAME}/lv_${NAME}
if [[ ! -d "${MOUNT}" ]]; then mkdir ${MOUNT}; fi

#OUTPUT
[root@build03 ~]# DEVICE=/dev/sdb
[root@build03 ~]# NAME=newdisk
[root@build03 ~]# MOUNT=/mnt/newdisk
[root@build03 ~]#
[root@build03 ~]# pvcreate ${DEVICE}
  Physical volume "/dev/sdb" successfully created.
[root@build03 ~]# vgcreate vg_xfs_${NAME} ${DEVICE}
  Volume group "vg_xfs_newdisk" successfully created
[root@build03 ~]# lvcreate -n lv_${NAME} -l 100%FREE vg_xfs_${NAME}
  Logical volume "lv_newdisk" created.
[root@build03 ~]# mkfs.xfs /dev/vg_xfs_${NAME}/lv_${NAME}
meta-data=/dev/vg_xfs_newdisk/lv_newdisk isize=512    agcount=4, agsize=1048320 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=4193280, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

[root@build03 ~]# parted -l
...
Model: Linux device-mapper (linear) (dm)
Disk /dev/mapper/vg_xfs_newdisk-lv_newdisk: 17.2GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
 1      0.00B  17.2GB  17.2GB  xfs

Finally, we have to mount this new partition. How your system manages the fstab can vary wildly, but generally just appending a line to /etc/fstab works (some distros automagically manage it). Mount the partition and now we can start using it!

[root@build03 ~]# echo "/dev/vg_xfs_${NAME}/lv_${NAME} ${MOUNT}          xfs    defaults        1 2" >> /etc/fstab
[root@build03 ~]# mount /mnt/newdisk
[root@build03 ~]# df -h /mnt/newdisk
Filesystem                             Size  Used Avail Use% Mounted on
/dev/mapper/vg_xfs_newdisk-lv_newdisk   16G   33M   16G   1% /mnt/newdisk
[root@build03 ~]# df -i /mnt/newdisk
Filesystem                             Inodes IUsed   IFree IUse% Mounted on
/dev/mapper/vg_xfs_newdisk-lv_newdisk 8386560     3 8386557    1% /mnt/newdisk
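Appending to /etc/fstab with echo works, but running it twice leaves a duplicate line. If you script this step, a small guard helps. The sketch below uses a made-up function name, add_fstab_entry, keeps the same "defaults 1 2" fields as the line above, and takes the fstab path as an optional argument so it can be tried on a copy first.

```shell
#!/usr/bin/env bash
# Append an fstab line for a mount point unless a non-comment line
# already lists that mount point as its second field.
add_fstab_entry() {
  local device="$1" mount="$2" fstype="$3" fstab="${4:-/etc/fstab}"
  if awk -v m="$mount" '$1 !~ /^#/ && $2 == m { found = 1 } END { exit !found }' "$fstab"; then
    echo "entry for ${mount} already present in ${fstab}" >&2
    return 0
  fi
  printf '%s %s %s defaults 1 2\n' "$device" "$mount" "$fstype" >> "$fstab"
}

# Typical use (as root):
#   add_fstab_entry /dev/vg_xfs_newdisk/lv_newdisk /mnt/newdisk xfs
```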

Extending an LVM device with an existing disk

The next most common case is to extend the disk a device uses. As I mentioned before, I prefer to just use a new disk as they’re frequently cheap, but some systems do make them more expensive. In my lab, I increased the size of my /dev/sdb to 32 GiB. Because the disk has already been detected, I have to force the OS to rescan and see the new space using echo 1 > /sys/class/block/<DEVICE>/device/rescan:

[root@build03 ~]# parted -l
...
Error: /dev/sdb: unrecognised disk label
Model: VMware Virtual disk (scsi)
Disk /dev/sdb: 17.2GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags:

Model: Linux device-mapper (linear) (dm)
Disk /dev/mapper/vg_xfs_newdisk-lv_newdisk: 17.2GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
 1      0.00B  17.2GB  17.2GB  xfs
...
[root@build03 ~]# echo 1 > /sys/class/block/sdb/device/rescan
[root@build03 ~]# parted -l
...
Error: /dev/sdb: unrecognised disk label
Model: VMware Virtual disk (scsi)
Disk /dev/sdb: 34.4GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags:

Model: Linux device-mapper (linear) (dm)
Disk /dev/mapper/vg_xfs_newdisk-lv_newdisk: 17.2GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End     Size    File system  Flags
 1      0.00B  17.2GB  17.2GB  xfs

With the physical size increase visible, we need to expand the LVM device. First, pvresize so the physical volume sees the change, then vgscan so the volume group picks it up, and finally lvextend to have the logical volume allocate the volume group’s free (unassigned) space. The + in +100%FREE means to ADD the free space to the current size; 100%FREE without the plus is treated as a size to set rather than an amount to add – I don’t make the rules, I just get run over by them like everyone else! Extend the FS with xfs_growfs and then look at the space and inodes. We can use the same variables as before to make it a little easier:

#COMMANDS
DEVICE=/dev/sdb
NAME=newdisk
MOUNT=/mnt/newdisk

pvresize ${DEVICE}
vgscan
lvextend -l +100%FREE /dev/vg_xfs_${NAME}/lv_${NAME}
xfs_growfs ${MOUNT}

#OUTPUT
[root@build03 ~]# DEVICE=/dev/sdb
[root@build03 ~]# NAME=newdisk
[root@build03 ~]# MOUNT=/mnt/newdisk
[root@build03 ~]#
[root@build03 ~]# pvresize ${DEVICE}
  Physical volume "/dev/sdb" changed
  1 physical volume(s) resized / 0 physical volume(s) not resized
[root@build03 ~]# vgscan
  Reading volume groups from cache.
  Found volume group "vg_xfs_newdisk" using metadata type lvm2
  Found volume group "VolGroup00" using metadata type lvm2
[root@build03 ~]# lvextend -l +100%FREE /dev/vg_xfs_${NAME}/lv_${NAME}
  Size of logical volume vg_xfs_newdisk/lv_newdisk changed from 16.00 GiB (4096 extents) to <32.00 GiB (8191 extents).
  Logical volume vg_xfs_newdisk/lv_newdisk successfully resized.
[root@build03 ~]# xfs_growfs ${MOUNT}
meta-data=/dev/mapper/vg_xfs_newdisk-lv_newdisk isize=512    agcount=4, agsize=1048320 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=4193280, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 4193280 to 8387584

[root@build03 ~]# df -h /mnt/newdisk
Filesystem                             Size  Used Avail Use% Mounted on
/dev/mapper/vg_xfs_newdisk-lv_newdisk   32G   33M   32G   1% /mnt/newdisk
[root@build03 ~]# df -i /mnt/newdisk
Filesystem                              Inodes IUsed    IFree IUse% Mounted on
/dev/mapper/vg_xfs_newdisk-lv_newdisk 16775168     3 16775165    1% /mnt/newdisk

You can see that the inode count is roughly twice what it was before, so you are in no danger of running out any time soon, even with a ton of tiny files!

Extending an LVM device with a new disk

We can also extend an LVM device using a new disk. After attaching the new disk, you need to pvcreate the new physical volume, vgextend the volume group onto the new disk, and then lvextend the logical volume and xfs_growfs the filesystem. In this example, the new disk is /dev/sdc and is 16 GiB:

#COMMANDS
DEVICE=/dev/sdc
NAME=newdisk
MOUNT=/mnt/newdisk

pvcreate ${DEVICE}
vgextend vg_xfs_${NAME} ${DEVICE}
lvextend -l +100%FREE /dev/vg_xfs_${NAME}/lv_${NAME}
xfs_growfs ${MOUNT}

#OUTPUT
[root@build03 ~]# DEVICE=/dev/sdc
[root@build03 ~]# NAME=newdisk
[root@build03 ~]# MOUNT=/mnt/newdisk
[root@build03 ~]#
[root@build03 ~]# pvcreate ${DEVICE}
  Physical volume "/dev/sdc" successfully created.
[root@build03 ~]# vgextend vg_xfs_${NAME} ${DEVICE}
  Volume group "vg_xfs_newdisk" successfully extended
[root@build03 ~]# vgs
  VG             #PV #LV #SN Attr   VSize   VFree
  VolGroup00       1   3   0 wz--n- <99.51g   5.66g
  vg_xfs_newdisk   2   1   0 wz--n-  47.99g <16.00g
[root@build03 ~]# lvextend -n lv_${NAME} -l +100%FREE /dev/vg_xfs_${NAME}/lv_${NAME}
  Please specify a logical volume path.
  Run `lvextend --help' for more information.
[root@build03 ~]# lvextend -l +100%FREE /dev/vg_xfs_${NAME}/lv_${NAME}
  Size of logical volume vg_xfs_newdisk/lv_newdisk changed from <32.00 GiB (8191 extents) to 47.99 GiB (12286 extents).
  Logical volume vg_xfs_newdisk/lv_newdisk successfully resized.
[root@build03 ~]# xfs_growfs ${MOUNT}
meta-data=/dev/mapper/vg_xfs_newdisk-lv_newdisk isize=512    agcount=9, agsize=1048320 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=8387584, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 8387584 to 12580864
[root@build03 ~]# df -h /mnt/newdisk
Filesystem                             Size  Used Avail Use% Mounted on
/dev/mapper/vg_xfs_newdisk-lv_newdisk   48G   33M   48G   1% /mnt/newdisk
[root@build03 ~]# df -i /mnt/newdisk
Filesystem                              Inodes IUsed    IFree IUse% Mounted on
/dev/mapper/vg_xfs_newdisk-lv_newdisk 25161728     3 25161725    1% /mnt/newdisk

Once again, the space and the inodes have increased!

Removing an LVM device

Finally, there are times when you will remove a mount point entirely. You can just “yank” the disks and you’ll probably be okay, but rather than cross our fingers and hope, we can manually remove the configuration to ensure no auto-detection goes wrong. The process includes unmounting the partition, removing the logical volume, volume group, and physical volume mappings, then yanking the disks. We will undo our previous examples, where /dev/sdb and /dev/sdc provide vg_xfs_newdisk and lv_newdisk. Just start at the end and work our way back:

[root@build03 ~]# umount /mnt/newdisk
[root@build03 ~]# lvremove lv_${NAME}
  Volume group "lv_newdisk" not found
  Cannot process volume group lv_newdisk
[root@build03 ~]# lvremove /dev/vg_xfs_newdisk/lv_newdisk
Do you really want to remove active logical volume vg_xfs_newdisk/lv_newdisk? [y/n]: y
  Logical volume "lv_newdisk" successfully removed
[root@build03 ~]# vgremove vg_xfs_newdisk
  Volume group "vg_xfs_newdisk" successfully removed
[root@build03 ~]# pvremove /dev/sdc
  Labels on physical volume "/dev/sdc" successfully wiped.
[root@build03 ~]# pvremove /dev/sdb
  Labels on physical volume "/dev/sdb" successfully wiped.
[root@build03 ~]#
[root@build03 ~]# lvs
  LV      VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_home VolGroup00 -wi-ao----  78.12g
  lv_root VolGroup00 -wi-ao---- <11.72g
  lv_swap VolGroup00 -wi-ao----   4.00g
[root@build03 ~]# vgs
  VG         #PV #LV #SN Attr   VSize   VFree
  VolGroup00   1   3   0 wz--n- <99.51g 5.66g
[root@build03 ~]# pvs
  PV         VG         Fmt  Attr PSize   PFree
  /dev/sda2  VolGroup00 lvm2 a--  <99.51g 5.66g

Do not forget to modify /etc/fstab to remove the mapping, or the next boot may complain about the missing disk or even fail outright. Make sure to follow your distro guide for this, as sometimes systems like anaconda can undo your edits silently. While this can all be done live, I do recommend a reboot at the end, so that if something went wrong that affects the boot cycle, it is found immediately.
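Where hand-editing /etc/fstab is appropriate for your distro, the removal can be scripted with a backup kept alongside. This is a sketch under that assumption; remove_fstab_entry is a hypothetical name, and the optional second argument lets you rehearse on a copy of the file.

```shell
#!/usr/bin/env bash
# Remove every non-comment fstab line whose second field is the given
# mount point, keeping the previous version as <fstab>.bak.
remove_fstab_entry() {
  local mount="$1" fstab="${2:-/etc/fstab}"
  cp -p "$fstab" "${fstab}.bak" || return 1
  # Keep comments and any line that mounts somewhere else.
  awk -v m="$mount" '$1 ~ /^#/ || $2 != m' "${fstab}.bak" > "$fstab"
}

# Typical use (as root):
#   remove_fstab_entry /mnt/newdisk
```

If the result looks wrong, copying the .bak file back restores the original.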

Summary

I covered common tasks – adding, extending, and removing partitions on the same system – but there’s a lot more you can do. As mentioned, parted replaces fdisk, but we glossed over that in favor of using entire disks. At least you now know that any documentation using fdisk is pretty aged. I also briefly mentioned vgexport/vgimport. If you plan to detach disks in LVM – even a single disk – and re-attach them to another system, you want to export the volume group on the old system and import it on the new system. This will help ensure that any mismatch in device naming by the OS – say sdc and sde on the old system and sdc and sdd on the new system – does not result in any data loss.
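For reference, the export/import sequence can be sketched as below. It was not part of the walkthrough above, so treat it as an outline: the run wrapper and both function names are made up, and by default nothing is executed, the commands are only printed so the plan can be reviewed before setting DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Dry-run wrapper: prints each command unless DRY_RUN=0 is set explicitly.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# On the OLD system: unmount, deactivate, and export the volume group.
vg_export_steps() {
  local vg="$1" mount="$2"
  run umount "$mount"
  run vgchange -an "$vg"
  run vgexport "$vg"
}

# On the NEW system, after attaching the disks: rescan, import, activate, mount.
vg_import_steps() {
  local vg="$1" lv="$2" mount="$3"
  run pvscan
  run vgimport "$vg"
  run vgchange -ay "$vg"
  run mount "/dev/${vg}/${lv}" "$mount"
}

# Example (prints the plan only):
#   vg_export_steps vg_xfs_newdisk /mnt/newdisk
#   vg_import_steps vg_xfs_newdisk lv_newdisk /mnt/newdisk
```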

I hope this article is a good reference for fundamental filesystem tasks for novice and expert Linux administrators alike. Please let me know on Twitter or in the comments of anything else you would like to see and I’ll keep this updated. Thanks!

Convert a controlrepo to using the Puppet Development Kit (PDK)

I previously wrote about converting an individual puppet module’s repo to use the Puppet Development Kit. We can also convert controlrepos to use the PDK. I am starting with a “traditional” controlrepo, described here, as well as centralized tests, described here. To follow this article directly, you need to:

  • Have all hiera data and role/profile/custom modules in the /dist directory
  • Have all tests, for all modules, in the /spec directory

If your controlrepo looks different, this article can be used to migrate to the PDK, but you will have to modify some of the sections a bit.

This will be a very lengthy blog post (over 4,000 words!) and covers a very time-consuming process. It took me about 2 full days to work through this on my own controlrepo. Hopefully, this article helps you shave significant time off the effort, but don’t expect a very quick change.

Managing the PDK itself

First, let’s make sure we have a profile to install and manage the PDK. As we use the role/profile pattern, we create a class profile::pdk with the parameter version, which we can specify in hiera as profile::pdk::version: ‘1.7.1’ (current version as of this writing). This profile can then be added to an appropriate role class, like role::build for a build server, or applied directly to your laptop. I use only Enterprise Linux 7, but we could certainly flesh this out to support multiple OSes:

# dist/profile/manifests/pdk.pp
class profile::pdk (
  String $version = 'present',
) {
  package {'puppet6-release':
    ensure => present,
    source => "https://yum.puppet.com/puppet6/puppet6-release-el-7.noarch.rpm",
  }
  package {'pdk':
    ensure => $version,
    require => Package['puppet6-release'],
  }
}

# spec/classes/profile/pdk_spec.rb
require 'spec_helper'
describe 'profile::pdk' do
  on_supported_os.each do |os, facts|
    next unless facts[:kernel] == 'Linux'
    context "on #{os}" do
      let (:facts) {
        facts.merge({
          :clientcert => 'build',
        })
      }

      it { is_expected.to compile.with_all_deps }

      it { is_expected.to contain_package('puppet6-release') }
      it { is_expected.to contain_package('pdk') }
    end
  end
end

Once this change is pushed upstream and the build server (or other target node) checks in, the PDK is available:

$ pdk --version
1.7.1

Now we are almost ready to go. Of course, we need to start with good, working tests! If any tests are currently failing, we need to get them to a passing state before continuing, like this:

Finished in 3 minutes 12.2 seconds (files took 1 minute 17.89 seconds to load)
782 examples, 0 failures

With everything in a known good state, we can then be sure that any failures are related to the PDK changes, and only the PDK changes.

Setting up the PDK Template

The PDK comes with a set of pre-packaged templates. It is recommended to stick with a set of templates designed for the current PDK version for stability. However, the templates are online and may be updated without an accompanying PDK release. We may choose to stick with the on-disk templates, we may point to the online templates from Puppet, or we may create our own! Those working with the on-disk templates can skip down to working with .sync.yml.

To switch to another template, we use pdk convert --template-url. If this is our own template, we should make sure the latest commit is compliant with the PDK version we are using. If we point to Puppet’s templates, we essentially shift to the development track. Make sure you understand this before changing the templates. We can get back to using the on-disk template with the url file:///opt/puppetlabs/pdk/share/cache/pdk-templates.git, though, so this isn’t a decision we have to live with forever. Here’s the command to switch to the official Puppet templates:

$ pdk convert --template-url=https://github.com/puppetlabs/pdk-templates

------------Files to be added-----------
.travis.yml
.gitlab-ci.yml
.yardopts
appveyor.yml
.rubocop.yml
.pdkignore

----------Files to be modified----------
metadata.json
Gemfile
Rakefile
spec/spec_helper.rb
spec/default_facts.yml
.gitignore
.rspec

----------------------------------------

You can find a report of differences in convert_report.txt.

pdk (INFO): Module conversion is a potentially destructive action. Ensure that you have committed your module to a version control system or have a backup, and review the changes above before continuing.
Do you want to continue and make these changes to your module? Yes

------------Convert completed-----------

6 files added, 7 files modified.

Now, everyone’s setup is probably a little different and thus we cannot predict the entirety of the changes each of us must make, but there are some minimal changes everyone must make. The file .sync.yml can be created to allow each of us to override the template defaults without having to write our own templates. The layout of the YAML starts with the filename the changes will modify, followed by the appropriate config section and then the value(s) for that section. We can find the appropriate config section by looking at the template repo’s erb templates. For instance, I do not use AppVeyor, GitLab, or Travis with this controlrepo, so to have git ignore them, I made the following changes to the .gitignore’s required list:

$ cat .sync.yml
---
.gitignore:
  required:
    - 'appveyor.yml'
    - '.gitlab-ci.yml'
    - '.travis.yml'

When changes are made to the sync file, they must be applied with the pdk update command. We can see that originally, these unused files were to be committed, but now they are properly ignored:

$ git status
# On branch pdk
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#       modified:   .gitignore
#       modified:   .rspec
#       modified:   Gemfile
#       modified:   Rakefile
#       modified:   metadata.json
#       modified:   spec/default_facts.yml
#       modified:   spec/spec_helper.rb
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       .gitlab-ci.yml
#       .pdkignore
#       .rubocop.yml
#       .sync.yml
#       .travis.yml
#       .yardopts
#       appveyor.yml
no changes added to commit (use "git add" and/or "git commit -a")

$ cat .sync.yml
---
.gitignore:
  required:
    - 'appveyor.yml'
    - '.gitlab-ci.yml'
    - '.travis.yml'

$ pdk update
pdk (INFO): Updating mss-controlrepo using the template at https://github.com/puppetlabs/pdk-templates, from 1.7.1 to 1.7.1

----------Files to be modified----------
.gitignore

----------------------------------------

You can find a report of differences in update_report.txt.

Do you want to continue and make these changes to your module? Yes

------------Update completed------------

1 files modified.

$ git status
# On branch pdk
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#       modified:   .gitignore
#       modified:   .rspec
#       modified:   Gemfile
#       modified:   Rakefile
#       modified:   metadata.json
#       modified:   spec/default_facts.yml
#       modified:   spec/spec_helper.rb
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       .pdkignore
#       .rubocop.yml
#       .sync.yml
#       .yardopts
no changes added to commit (use "git add" and/or "git commit -a")

Anytime we pdk update, we will still receive new versions of the ignored files, but they won’t be committed to the repo and a git clean or a clean checkout will remove them.

After initial publication, I was made aware that you can completely delete or unmanage a file using delete: true or unmanaged: true, as described here, rather than using .gitignore.
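As a rough illustration of that alternative, the snippet below writes a .sync.yml that deletes two of the CI files and leaves the third unmanaged. The key names delete and unmanaged are taken from the pdk-templates documentation as I understand it, so verify them against the template version you use; the function name write_sync_yml and the choice of which file gets which setting are arbitrary. Note that it overwrites the target file, so merge by hand if you already have other overrides.

```shell
#!/usr/bin/env bash
# Write an example .sync.yml that removes or unmanages the unused CI files.
# The target path is an optional argument so this can be tried in a scratch dir.
write_sync_yml() {
  local target="${1:-.sync.yml}"
  cat > "$target" <<'EOF'
---
appveyor.yml:
  delete: true
.gitlab-ci.yml:
  delete: true
.travis.yml:
  unmanaged: true
EOF
}

# Typical use, from the root of the controlrepo:
#   write_sync_yml && pdk update
```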

We may need to implement other overrides, except that we do not know what they would be yet, so let’s commit our changes so far. Then we can start working on validation or unit tests. It doesn’t really matter which we choose to work on first, though my preference is validation first as it does not depend on the version of Puppet we are testing.

PDK Validate

The PDK validation check, pdk validate, will check the syntax and style of metadata.json and any task json files, syntax and style of all puppet files, and ruby code style. This is roughly equivalent to our old bundle exec rake syntax task. Since the bundle setup is a wee bit old and the PDK is kept up to date, we shouldn’t be surprised if what was passing before now has failures. Here’s a sample of the errors I encountered on my first run – there were hundreds of them:

$ pdk validate
pdk (INFO): Running all available validators...
pdk (INFO): Using Ruby 2.5.1
pdk (INFO): Using Puppet 6.0.2
[✔] Checking metadata syntax (metadata.json tasks/*.json).
[✔] Checking module metadata style (metadata.json).
[✔] Checking Puppet manifest syntax (**/**.pp).
[✔] Checking Puppet manifest style (**/*.pp).
[✖] Checking Ruby code style (**/**.rb).
info: task-metadata-lint: ./: Target does not contain any files to validate (tasks/*.json).
warning: puppet-lint: dist/eyaml/manifests/init.pp:43:12: indentation of => is not properly aligned (expected in column 14, but found it in column 12)
warning: puppet-lint: dist/eyaml/manifests/init.pp:51:11: indentation of => is not properly aligned (expected in column 12, but found it in column 11)
warning: puppet-lint: dist/msswiki/manifests/init.pp:56:12: indentation of => is not properly aligned (expected in column 13, but found it in column 12)
warning: puppet-lint: dist/msswiki/manifests/init.pp:57:10: indentation of => is not properly aligned (expected in column 13, but found it in column 10)
warning: puppet-lint: dist/msswiki/manifests/init.pp:58:11: indentation of => is not properly aligned (expected in column 13, but found it in column 11)
warning: puppet-lint: dist/msswiki/manifests/init.pp:59:10: indentation of => is not properly aligned (expected in column 13, but found it in column 10)
warning: puppet-lint: dist/msswiki/manifests/init.pp:60:12: indentation of => is not properly aligned (expected in column 13, but found it in column 12)
warning: puppet-lint: dist/msswiki/manifests/rsync.pp:37:140: line has more than 140 characters
warning: puppet-lint: dist/msswiki/manifests/rsync.pp:43:140: line has more than 140 characters
warning: puppet-lint: dist/profile/manifests/access_request.pp:21:3: optional parameter listed before required parameter
warning: puppet-lint: dist/profile/manifests/access_request.pp:22:3: optional parameter listed before required parameter

We can control puppet-lint Rake settings in .sync.yml – but it only works for rake tasks. pdk validate will ignore it because puppet-lint isn’t invoked via rake. The same settings need to be put in .puppet-lint.rc in the proper format. That file is not populated via pdk, so just create it by hand. I don’t care about the arrow alignment or 140 characters checks, so I’ve added the appropriate lines to both files and re-run pdk update. We all have different preferences, just make sure they are reflected in both locations:

$ cat .sync.yml
---
.gitignore:
  required:
    - 'appveyor.yml'
    - '.gitlab-ci.yml'
    - '.travis.yml'
Rakefile:
  default_disabled_lint_checks:
    - '140chars'
    - 'arrow_alignment'
$ cat .puppet-lint.rc
--no-arrow_alignment-check
--no-140chars-check
$ grep disable Rakefile
PuppetLint.configuration.send('disable_relative')
PuppetLint.configuration.send('disable_140chars')
PuppetLint.configuration.send('disable_arrow_alignment')
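Since the two files have to agree, here is a minimal Ruby sketch that derives the .puppet-lint.rc flags from the disabled checks in .sync.yml. It assumes the Rakefile / default_disabled_lint_checks layout shown above and the --no-<check>-check flag format; the helper name puppet_lint_rc_lines is my own and not part of the PDK.

```ruby
require 'yaml'

# Turn the disabled checks listed in .sync.yml into .puppet-lint.rc flags.
# Assumes the Rakefile/default_disabled_lint_checks layout shown above.
def puppet_lint_rc_lines(sync_yaml)
  config = YAML.safe_load(sync_yaml) || {}
  checks = config.dig('Rakefile', 'default_disabled_lint_checks') || []
  checks.map { |check| "--no-#{check}-check" }
end

sync_yaml = <<~SYNC
  ---
  Rakefile:
    default_disabled_lint_checks:
      - '140chars'
      - 'arrow_alignment'
SYNC

# Prints one flag per line, ready to paste into .puppet-lint.rc
puts puppet_lint_rc_lines(sync_yaml)
```

In a real controlrepo you would pass File.read('.sync.yml') instead of the inline sample and compare the result with what is already in .puppet-lint.rc.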

Now we can use pdk validate and see a lot fewer violations. We can try to automatically correct the remaining violations with pdk validate -a, which will also try to auto-fix other syntax violations, or pdk bundle exec rake lint_fix, which restricts fixes to just puppet-lint. Not all violations can be auto-corrected, so some may still need fixed manually. I also found I had a .rubocop.yml in a custom module’s directory causing rubocop failures, because apparently rubocop parses EVERY config file it finds no matter where it’s located, and had to remove it to prevent errors. It may take you numerous tries to get through this. I recommend fixing a few things and committing before moving on to the next set of violations, so that you can find your way back if you make mistakes. Here’s a command that can help you edit all the files that can’t be autofixed by puppet-lint or rubocop (assuming you’ve already completed an autofix attempt):

vi $(pdk validate | egrep "(puppet-lint|rubocop)" | awk '{print $3}' | awk -F: '{print $1}' | sort | uniq | xargs)
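If the shell pipeline is hard to read, the same extraction can be sketched in Ruby. This assumes the output lines look like the puppet-lint warnings shown earlier (severity, tool, then path:line:column) and that rubocop lines follow the same shape; files_to_edit is a name I made up for illustration.

```ruby
# Extract the unique file paths from `pdk validate` output lines.
# Mirrors the egrep/awk/sort/uniq pipeline above: keep puppet-lint and
# rubocop lines, take the third whitespace-separated field, drop :line:col.
def files_to_edit(validate_output)
  validate_output.each_line
                 .select { |line| line =~ /puppet-lint|rubocop/ }
                 .map { |line| line.split[2].to_s.split(':').first }
                 .compact
                 .uniq
                 .sort
end

sample = <<~OUTPUT
  warning: puppet-lint: dist/msswiki/manifests/rsync.pp:37:140: line has more than 140 characters
  warning: puppet-lint: dist/eyaml/manifests/init.pp:43:12: indentation of => is not properly aligned
  warning: puppet-lint: dist/msswiki/manifests/init.pp:56:12: indentation of => is not properly aligned
  warning: puppet-lint: dist/msswiki/manifests/rsync.pp:43:140: line has more than 140 characters
OUTPUT

puts files_to_edit(sample)
```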

Alternatively, you can disable rubocop entirely if you want by adding the following to your .sync.yml. If you are only writing spec tests, this is probably fine, but if you are writing facts, types, and providers, I do not suggest it.

.rubocop.yml:
  selected_profile: off

We have quite a few methods to fix all the possible errors that come our way. Once we have fixed everything, we can move on to the Unit Tests. We will re-run validation after the unit tests, to ensure any changes we make for unit tests do not introduce new violations.

Unit Tests

Previously, we used bundle exec rake spec to run unit tests. The PDK way is pdk test unit. It performs pretty much the same, but it does collect all the output before displaying it, so if you have lots of fixtures and tests, you won’t see any output for a good long while and then bam, you get it all at once. The results will probably be just a tad overwhelming at first:

$ pdk test unit
pdk (INFO): Using Ruby 2.5.1
pdk (INFO): Using Puppet 6.0.2
[✔] Preparing to run the unit tests.
[✖] Running unit tests.
  Evaluated 782 tests in 110.76110479 seconds: 700 failures, 0 pending.
failed: rspec: ./spec/classes/profile/base__junos_spec.rb:11: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'cron' (file: /home/rnelson0/puppet/controlrepo/spec/fixtures/modules/profile/manifests/base/junos.pp, line: 15, column: 3) on node build
  profile::base::junos with defaults for all parameters should contain Cron[puppetrun]
  Failure/Error:
    context 'with defaults for all parameters' do
      it { is_expected.to create_class('profile::base::junos') }
      it { is_expected.to create_cron('puppetrun') }
    end
  end

failed: rspec: ./spec/classes/profile/base__linux_spec.rb:12: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'sshkey' (file: /home/rnelson0/puppet/controlrepo/spec/fixtures/modules/ssh/manifests/hostkeys.pp, line: 13, column: 5) on node build
  profile::base::linux on redhat-6-x86_64 disable openshift selinux policy should contain Selmodule[openshift-origin] with ensure => "absent"
  Failure/Error:
        if (facts[:os]['family'] == 'RedHat') && (facts[:os]['release']['major'] == '6')
          context 'disable openshift selinux policy' do
            it { is_expected.to contain_selmodule('openshift-origin').with_ensure('absent') }
            it { is_expected.to contain_selmodule('openshift').with_ensure('absent') }
          end
failed: rspec: ./spec/classes/profile/base__linux_spec.rb:162: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'cron' (file: /home/rnelson0/puppet/controlrepo/spec/fixtures/modules/os_patching/manifests/init.pp, line: 113, column: 3) on node build
  profile::base::linux on redhat-6-x86_64 when managing OS patching should contain Class[os_patching]
  Failure/Error:
          end

          it { is_expected.to contain_class('os_patching') }
          if (facts[:os]['family'] == 'RedHat') && (facts[:os]['release']['major'] == '7')
            it { is_expected.to contain_package('yum-utils') }

failed: rspec: ./spec/classes/profile/base__linux_spec.rb:18: error during compilation: Evaluation Error: Unknown variable: '::sshdsakey'. (file: /home/rnelson0/puppet/controlrepo/spec/fixtures/modules/ssh/manifests/hostkeys.pp, line: 12, column: 6) on node build
  profile::base::linux on redhat-7-x86_64 with defaults for all parameters should compile into a catalogue without dependency cycles
  Failure/Error:

        context 'with defaults for all parameters' do
          it { is_expected.to compile.with_all_deps }

          it { is_expected.to create_class('profile::base::linux') }

Whoa. Not cool. From 728 working tests to 700 failures is quite the explosion! And they blew up on missing resource types that are built into Puppet. What happened? Puppet 6, that’s what! However…

Fix Puppet 5 Tests First

When I ran pdk convert, it updated my metadata.json to specify it supported Puppet versions 4.7.0 through 6.x because I was missing any existing requirements section. The PDK defaults to using the latest Puppet version your metadata supports. Whoops! It’s okay, we can test against Puppet 5, too. I recommend that we get our existing tests working with the version of Puppet we wrote them for, just to get back to a known good state. We don’t want to be troubleshooting too many changes at once.

There are two ways to specify the version to use. There’s the CLI envvar PDK_PUPPET_VERSION that accepts a simple number like 5 or 6, which is preferred for automated systems like CI/CD, rather than humans. You can also use --puppet-version or --pe-version to set an exact version. I’m an old curmudgeon, so I’m using the non-preferred envvar setting today, but Puppet recommends using the actual program arguments! Regardless of how you specify the version, the PDK changes not just the Puppet version, but which version of Ruby it uses:

$ PDK_PUPPET_VERSION='5' pdk test unit
pdk (INFO): Using Ruby 2.4.4
pdk (INFO): Using Puppet 5.5.6
[✔] Preparing to run the unit tests.
[✖] Running unit tests.
  Evaluated 782 tests in 171.447295254 seconds: 519 failures, 0 pending.
failed: rspec: ./spec/classes/msswiki/init_spec.rb:24: error during compilation: Evaluation Error: Error while evaluating a Function Call, You must provide a hash of packages for wiki implementation. (file: /home/rnelson0/puppet/controlrepo/spec/fixtures/modules/msswiki/manifests/init.pp, line: 109, column: 5) on node build
  msswiki on redhat-6-x86_64  when using default params should compile into a catalogue without dependency cycles
  Failure/Error:
          end

          it { is_expected.to compile.with_all_deps }

          it { is_expected.to create_class('msswiki') }

failed: rspec: ./spec/classes/profile/apache_spec.rb:31: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Empty string title at 0. Title strings must have a length greater than zero. (file: /home/rnelson0/puppet/controlrepo/spec/fixtures/modules/concat/manifests/setup.pp, line: 59, column: 10) (file: /home/rnelson0/puppet/controlrepo/spec/fixtures/modules/apache/manifests/init.pp, line: 244) on node build
  profile::apache on redhat-7-x86_64 with additional listening ports should contain Firewall[100 Inbound apache listening ports] with dport => [80, 443, 8088]
  Failure/Error:
          end

          it {
            is_expected.to contain_firewall('100 Inbound apache listening ports').with(dport: [80, 443, 8088])
          }

failed: rspec: ./spec/classes/profile/base__linux_spec.rb:12: Evaluation Error: Unknown variable: '::sshecdsakey'. (file: /home/rnelson0/puppet/controlrepo/spec/fixtures/modules/ssh/manifests/hostkeys.pp, line: 36, column: 6) on node build
  profile::base::linux on redhat-6-x86_64 disable openshift selinux policy should contain Selmodule[openshift-origin] with ensure => "absent"
  Failure/Error:
        if (facts[:os]['family'] == 'RedHat') && (facts[:os]['release']['major'] == '6')
          context 'disable openshift selinux policy' do
            it { is_expected.to contain_selmodule('openshift-origin').with_ensure('absent') }
            it { is_expected.to contain_selmodule('openshift').with_ensure('absent') }
          end

Some of us may be lucky to make it through without errors here, but I assume most of us encounter at least a few failures, like I did – “only” 519 compared to 700 before. Don’t worry, we can fix this! To help us focus a bit, we can run tests on individual spec files using pdk bundle exec rspec <filename> (remembering to specify PDK_PUPPET_VERSION or to export the variable). Everyone has different problems here, but there are some common failures, such as missing custom facts:

  69) profile::base::linux on redhat-7-x86_64 when managing OS patching should contain Package[yum-utils]
      Failure/Error: include ::ssh::hostkeys

      Puppet::PreformattedError:
        Evaluation Error: Unknown variable: '::sshdsakey'. (file: /home/rnelson0/puppet/controlrepo/spec/fixtures/modules/ssh/manifests/hostkeys.pp, line: 12, column: 6) on node build

Nate McCurdy commented that instead of calling rspec directly, you can pass a comma-separated list of files with --tests, e.g. pdk test unit --tests path/to/spec/file.rb,path/to/spec/file2.rb

I defined my custom facts in spec/spec_helper.rb. That has definitely changed. Here’s part of the diff from running pdk convert:

$ git diff origin/production spec/spec_helper.rb
diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb
index de3e7e6..5e721b7 100644
--- a/spec/spec_helper.rb
+++ b/spec/spec_helper.rb
@@ -1,45 +1,44 @@
 require 'puppetlabs_spec_helper/module_spec_helper'
 require 'rspec-puppet-facts'

-add_custom_fact :concat_basedir, '/dne'
-add_custom_fact :is_pe, true
-add_custom_fact :root_home, '/root'
-add_custom_fact :pe_server_version, '2016.4.0'
-add_custom_fact :selinux, true
-add_custom_fact :selinux_config_mode, 'enforcing'
-add_custom_fact :sshdsakey, ''
-add_custom_fact :sshecdsakey, ''
-add_custom_fact :sshed25519key, ''
-add_custom_fact :pe_version, ''
-add_custom_fact :sudoversion, '1.8.6p3'
-add_custom_fact :selinux_agent_vardir, '/var/lib/puppet'
-
+include RspecPuppetFacts
+
+default_facts = {
+  puppetversion: Puppet.version,
+  facterversion: Facter.version,
+}

+default_facts_path = File.expand_path(File.join(File.dirname(__FILE__), 'default_facts.yml'))
+default_module_facts_path = File.expand_path(File.join(File.dirname(__FILE__), 'default_module_facts.yml'))

Instead of modifying spec/spec_helper.rb, facts should go in spec/default_facts.yml and spec/default_module_facts.yml. As the former is modified by pdk update, it is easier to maintain the latter. Review the diff of spec/spec_helper.rb and spec/default_facts.yml (if we have the latter) for our previous custom facts and their values. When a test is failing for a missing fact, we can add it to spec/default_module_facts.yml in the format factname: "factvalue".
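As a rough sketch of that migration, the old add_custom_fact values can be dumped as YAML in the factname: "factvalue" shape. The fact names and values below are copied from the diff above (trimmed to a handful); the constant and helper names are mine, and transform_keys needs Ruby 2.5 or newer.

```ruby
require 'yaml'

# Facts previously set with add_custom_fact in spec/spec_helper.rb
# (values copied from the diff above; trim the list to what your tests need).
OLD_CUSTOM_FACTS = {
  concat_basedir: '/dne',
  root_home: '/root',
  selinux: true,
  sshdsakey: '',
  sshecdsakey: '',
  sshed25519key: '',
  sudoversion: '1.8.6p3',
}.freeze

# Render the facts with plain string keys so the YAML reads factname: value.
def module_facts_yaml(facts)
  facts.transform_keys(&:to_s).to_yaml
end

puts module_facts_yaml(OLD_CUSTOM_FACTS)
```

The printed YAML can be pasted into spec/default_module_facts.yml, or written there with File.write once you are happy with it.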

  1) profile::base::linux on redhat-6-x86_64 disable openshift selinux policy should contain Selmodule[openshift-origin] with ensure => "absent"
     Failure/Error: include concat::setup

     Puppet::PreformattedError:
       Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Empty string title at 0. Title strings must have a length greater than zero. (file: /home/rnelson0/puppet/controlrepo/spec/fixtures/modules/concat/manifests/setup.pp, line: 59, column: 10) (file: /home/rnelson0/puppet/controlrepo/spec/fixtures/modules/ssh/manifests/server/config.pp, line: 12) on node  build

This is related to an older version of puppetlabs/concat (v1.2.5). The latest is v5.1.0. After updating my Puppetfile and .fixtures.yml with the new version, I ran pdk bundle exec rake spec_prep to update the test fixtures, and this is resolved.

  2) profile::base::linux on redhat-6-x86_64 domain_join is true should contain Class[profile::domain_join]
     Failure/Error: include domain_join

     Puppet::PreformattedError:
       Evaluation Error: Error while evaluating a Function Call, Class[Domain_join]:
         expects a value for parameter 'domain_fqdn'
         expects a value for parameter 'domain_shortname'
         expects a value for parameter 'ad_dns'
         expects a value for parameter 'register_account'
         expects a value for parameter 'register_password' (file: /home/rnelson0/puppet/controlrepo/spec/fixtures/modules/profile/manifests/domain_join.pp, line: 8, column: 3) on node build

In this case, the mandatory parameters for an included class were not provided. The let (:params) block of our rspec contexts only allows us to set parameters for the current class. We have been pulling these parameters from hiera instead. This data should be coming from spec/fixtures/hieradata/default.yml, however, hiera lookup settings were also removed in spec/spec_helper.rb, breaking the existing hiera configuration:

 RSpec.configure do |c|
-  c.hiera_config = File.expand_path(File.join(__FILE__, '../fixtures/hiera.yaml'))
-  default_facts = {
-    puppetversion: Puppet.version,
-    facterversion: Facter.version
-  }

There is no replacement setting provided. The PDK was designed with hiera use in mind, so add this line (replacing the hiera.yaml path if yours is stored elsewhere) to .sync.yml and run pdk update. Your hiera config should start working again:

spec/spec_helper.rb:
  hiera_config: 'spec/fixtures/hiera.yaml'
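Once hiera is wired up again, a quick way to see which mandatory parameters are still missing from the fixture data is to look for the class::param keys that automatic parameter lookup would use. This is only a sketch: the sample data is hypothetical, the parameter list comes from the domain_join error above, and missing_hiera_keys is my own helper name.

```ruby
require 'yaml'

# Return the "class::param" keys that are absent from a hieradata hash.
def missing_hiera_keys(hieradata, class_name, params)
  params.map { |param| "#{class_name}::#{param}" }
        .reject { |key| hieradata.key?(key) }
end

# Hypothetical fixture content; the real data lives in
# spec/fixtures/hieradata/default.yml.
hieradata = YAML.safe_load(<<~DATA)
  domain_join::domain_fqdn: 'example.com'
  domain_join::domain_shortname: 'EXAMPLE'
DATA

required = %w[domain_fqdn domain_shortname ad_dns register_account register_password]
puts missing_hiera_keys(hieradata, 'domain_join', required)
```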

These are just some common issues with a PDK conversion, but we may have others to resolve. We just need to keep iterating until we get through everything.

Things are looking good! But we are only done with the Puppet 5 rodeo. Before we move on to Puppet 6, now is a good time to make sure syntax validation works, and if you need to make changes to syntax, you then run the Puppet 5 tests that you know should work. Get that all settled before moving on.

Puppet 6 Unit Tests

There’s one big change to be aware of before we even run unit tests against Puppet 6. To make updating core types easier, without requiring a brand new release of Puppet, a number of types were moved into dedicated modules. This means that for Puppet 6 testing, we need to update our Puppetfile and .fixtures.yml (though the puppet agent all-in-one package packages these modules, we do not want our tests relying on an installed puppet agent). When we update these files, we need to make sure we ONLY deploy these core modules on Puppet 6, not Puppet 5 (both for the master and testing), or we will encounter issues with Puppet 5. The Puppetfile is actually ruby code, so we can check the version before loading the modules (see note below), and .fixtures.yml accepts a puppet_version parameter to modules. We can click on each module name here to get the link for the replacement module. We do not have to add all of the modules, just the ones we use, but including the ones we are likely to use or have other modules depend on can reduce friction. The changes will look like this:

# Puppetfile
require 'puppet'
# as of puppet 6 they have moved several core types into separate modules
if Puppet.version =~ /^6\.\d+\.\d+/
  mod 'puppetlabs-augeas_core', '1.0.3'
  mod 'puppetlabs-cron_core', '1.0.0'
  mod 'puppetlabs-host_core', '1.0.1'
  mod 'puppetlabs-mount_core', '1.0.2'
  mod 'puppetlabs-sshkeys_core', '1.0.1'
  mod 'puppetlabs-yumrepo_core', '1.0.1'
end

# .fixtures.yml
fixtures:
  forge_modules:
    augeas_core:
      repo: "puppetlabs/augeas_core"
      ref: "1.0.3"
      puppet_version: ">= 6.0.0"
    cron_core:
      repo: "puppetlabs/cron_core"
      ref: "1.0.0"
      puppet_version: ">= 6.0.0"
    host_core:
      repo: "puppetlabs/host_core"
      ref: "1.0.1"
      puppet_version: ">= 6.0.0"
    mount_core:
      repo: "puppetlabs/mount_core"
      ref: "1.0.2"
      puppet_version: ">= 6.0.0"
    scheduled_task:
      repo: "puppetlabs/scheduled_task"
      ref: "1.0.0"
      puppet_version: ">= 6.0.0"
    selinux_core:
      repo: "puppetlabs/selinux_core"
      ref: "1.0.1"
      puppet_version: ">= 6.0.0"
    sshkeys_core:
      repo: "puppetlabs/sshkeys_core"
      ref: "1.0.1"
      puppet_version: ">= 6.0.0"
    yumrepo_core:
      repo: "puppetlabs/yumrepo_core"
      ref: "1.0.1"
      puppet_version: ">= 6.0.0"
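To get a feel for what a ">= 6.0.0" constraint includes, here is a small Ruby sketch using RubyGems' version classes, which ship with Ruby. It only illustrates the comparison; I am not claiming this is how the fixtures tooling evaluates puppet_version internally, and puppet_version_matches? is a made-up helper.

```ruby
require 'rubygems'

# True when a Puppet version satisfies a .fixtures.yml-style requirement.
# Illustration only; uses RubyGems semantics for the comparison.
def puppet_version_matches?(version, requirement = '>= 6.0.0')
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(version))
end

%w[5.5.6 6.0.2].each do |version|
  puts "#{version}: #{puppet_version_matches?(version)}"
end
```

The same comparison could stand in for the regex in the Puppetfile if you prefer ranges over pattern matching, though keep the note below in mind before adding the core modules there at all.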

Note: Having performed an actual upgrade to Puppet 6 now, I do NOT recommend adding the modules to the Puppetfile after all, unless you are specifying a newer version of the modules than is provided with the version of puppet-agent you are using, or you are not using the AIO versions, and ONLY if you have no Puppet 5 agents. Puppet 5 agents connecting to a Puppet 6 master will pluginsync these modules and throw errors instead of applying a catalog. If you do have multiple compile masters, you could conceivably keep a few running Puppet 5 and only have Puppet 5 agents connect to it, but that seems like a really specific and potentially problematic scenario, so in general, I repeat, I do NOT recommend adding the core modules to the Puppetfile. They must be placed in the .fixtures.yml file for testing, though.

Now give pdk test unit a try and see how it behaves. All the missing types will be back, so any errors we see now should be related to actual failures, or some other edge case I did not experience.

Note: I experienced the error Error: Evaluation Error: Error while evaluating a Function Call, undefined local variable or method `created' for Puppet::Pops::Loader::RubyLegacyFunctionInstantiator:Class when running my Puppet 6 tests immediately after working on the Puppet 5 tests. Nothing I found online could resolve this and when I returned to it later, it worked fine. I could not replicate the error, so I am unsure of what caused it. If you run into that error, I suggest starting a new session and running git clean -ffdx to remove unmanaged files, so that you start with a clean environment.

Updating CI/CD Integrations

Once both pdk validate and pdk test unit complete without error, we need to update the automated checks our CI/CD system uses. We all use different systems, but thankfully the PDK has many of us covered. For those who use Travis CI, Appveyor, or Gitlab, there are pre-populated .travis.yml, appveyor.yml, and .gitlab-ci.yml files, respectively. For those of us who use Jenkins, we have two options: 1) Copy one of these CI settings and integrate them into our build process (traditional or pipeline) or 2) apply profile::pdk to the Jenkins node and use the PDK for tests. Let’s look at the first option, basing it off the Travis CI config:

before_install:
  - bundle -v
  - rm -f Gemfile.lock
  - gem update --system
  - gem --version
  - bundle -v
script:
  - 'bundle exec rake $CHECK'
bundler_args: --without system_tests
rvm:
  - 2.5.0
env:
  global:
    - BEAKER_PUPPET_COLLECTION=puppet6 PUPPET_GEM_VERSION="~> 6.0"
matrix:
  fast_finish: true
  include:
    -
      env: CHECK="syntax lint metadata_lint check:symlinks check:git_ignore check:dot_underscore check:test_file rubocop"
    -
      env: CHECK=parallel_spec
    -
      env: PUPPET_GEM_VERSION="~> 5.0" CHECK=parallel_spec
      rvm: 2.4.4
    -
      env: PUPPET_GEM_VERSION="~> 4.0" CHECK=parallel_spec
      rvm: 2.1.9

We have to merge this into something useful for Jenkins. I am unfamiliar with Pipelines myself (I know, it’s the future, but I have $reasons!), but I have built a Jenkins server with RVM installed and configured it with a freestyle job. Here’s the current job:

#!/bin/bash
[[ -s /usr/local/rvm/scripts/rvm ]] && source /usr/local/rvm/scripts/rvm
# Use the correct ruby
rvm use 2.1.9

git clean -ffdx
bundle install --path vendor --without system_tests
bundle exec rake test

The new before_install section cleans up the local directory, equivalent to git clean -ffdx, but it also spits out some version information and runs gem update. These are optional, and the latter is only helpful if you have your gems cached elsewhere (the git clean will wipe the updated gems otherwise, wasting time). The bundler_args are already part of the bundle install command. The rvm version varies by puppet version, that will need tweaked. The test command is now bundle exec rake $CHECK, with a variety of checks added in the matrix section. test used to do everything in the first 2 matrix sections; parallel_spec just runs multiple tests at once instead of in serial which can be faster. The 3rd and 4th matrix sections are for older puppet versions. We can put this together into multiple jobs, into a single job that tests multiple versions of Puppet, or into a single job testing just one Puppet version. Here’s what a Jenkins job would look like that tests Puppet 6 and 5:

#!/bin/bash
[[ -s /usr/local/rvm/scripts/rvm ]] && source /usr/local/rvm/scripts/rvm

# Puppet 6
export BEAKER_PUPPET_COLLECTION=puppet6
export PUPPET_GEM_VERSION="~> 6.0"
rvm use 2.5.0
bundle -v
git clean -ffdx
# Comment the next line if you do not have gems cached outside the job workspace
gem update --system
gem --version
bundle -v

bundle install --path vendor --without system_tests
bundle exec rake syntax lint metadata_lint check:symlinks check:git_ignore check:dot_underscore check:test_file rubocop
bundle exec rake parallel_spec

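# Puppet 5
rvm use 2.4.4
export PUPPET_GEM_VERSION="~> 5.0"
# Re-resolve and install the gems for this Ruby and Puppet version
rm -f Gemfile.lock
bundle install --path vendor --without system_tests
bundle exec rake parallel_spec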
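If you end up with several Puppet versions in one job, the per-version steps can be generated instead of copied. The sketch below is hypothetical tooling of my own, not something the PDK provides: the Ruby versions come from the Travis CI matrix above, and each version gets a fresh bundle install so the gems match the selected Ruby and Puppet.

```ruby
# Ruby versions paired with each Puppet major version in the Travis CI matrix above.
RUBY_FOR_PUPPET = { 6 => '2.5.0', 5 => '2.4.4', 4 => '2.1.9' }.freeze

# Build the shell steps that run the unit tests for one Puppet major version.
# Raises KeyError for a Puppet version that has no Ruby mapping.
def spec_steps(puppet_major)
  ruby = RUBY_FOR_PUPPET.fetch(puppet_major)
  [
    "rvm use #{ruby}",
    "export PUPPET_GEM_VERSION=\"~> #{puppet_major}.0\"",
    'rm -f Gemfile.lock',
    'bundle install --path vendor --without system_tests',
    'bundle exec rake parallel_spec',
  ]
end

puts spec_steps(5)
```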

This creates parity with the pre-defined CI tests for other services.

The other option is adding profile::pdk to the Jenkins server, probably through the role, and using the PDK to run tests. That Jenkins freestyle job looks a lot simpler:

#!/bin/bash
PDK=/usr/local/bin/pdk
echo -n "PDK Version: "
$PDK --version

# Puppet 6
git clean -ffdx
$PDK validate
$PDK test unit

# Puppet 5
git clean -ffdx
PDK_PUPPET_VERSION=5 $PDK test unit

This is much simpler, and it should not need updated until removing Puppet 5 or adding Puppet 7 when it is released, whereas the Ruby version selected via RVM in the bundler-based job may need tweaked throughout the Puppet 6 lifecycle as Ruby versions change. However, the tests aren’t exactly the same. Currently, pdk validate does not run the rake target check:git_ignore, and possibly other check: tasks. In my opinion, as the PDK improves, the benefit of only having to update the PDK package version and not the git-based automation outweighs the single missing check and the maintenance of RVM on a Jenkins server. And for those of us using Travis CI/Appveyor/Gitlab-CI, it definitely makes sense to stick with the provided test setup as it requires almost no maintenance.

I used this earlier without explaining it, but the PDK also provides the ability to run pdk bundle, similar to the native bundle but using the vendored ruby and gems provided by the PDK. We can run individual tests like pdk bundle exec rake check:git_ignore, or install the PDK and modify the “bundle” Jenkins job to recreate the bundler setup using the PDK and not have to worry about RVM at all. I’ll leave that last as an exercise for the user, though.

We must be sure to review our entire Puppet pull request process and see what other integrations need updated, and of course we must update documentation for our colleagues. Remember, documentation is part of “done”, so we cannot say we are done until we update it.

Finally, with all updates in place, submit a Pull Request for your controlrepo changes. This Pull Request must go through the new process, not just to verify that it passes, but to identify any configuration steps you missed or did not document properly.

Summary

Today, we looked at converting our controlrepo to use the Puppet Development Kit for testing instead of bundler-based testing. It required lots of changes to our controlrepos, many of which the PDK handled for us via autocorrect; others involved manual updates. We reviewed a variety of changes required for CI/CD integrations such as Travis CI or Jenkins. We reviewed the specifics of our setup that others don’t share so we had a working setup top to bottom, and we updated our documentation so all of our colleagues can make use of the new setup. Finally, we opened a PR with our changes to validate the new configuration.

By using the PDK, we have leveraged the hard work of many Puppet employees and Puppet users alike who provide a combination of rigorously vetted sets of working dependencies and beneficial practices, and we can continue to improve those benefits by simply updating our version of the PDK and templates in the future. This is a drastic reduction in the mental load we all have to carry to keep up with Puppet best practices, an especially significant burden on those responsible for Puppet in their CI/CD systems. I would like to thank everyone involved with the PDK, including David Schmitt, Lindsey Smith, Tim Sharpe, Bryan Jen, Jean Bond, and the many other contributors inside and outside Puppet who made the PDK possible.

Linux OS Patching with Puppet Tasks

One of the biggest gaps in most IT security policies is a very basic feature, patching. Specific numbers vary, but most surveys show a majority of hacks are due to unpatched vulnerabilities. Sadly, in 2018, automatic patching on servers is still out of the grasp of many, especially those running older OSes.

While there are a number of solutions out there from OS vendors (WSUS for Microsoft, Satellite for RHEL, etc.), I manage a number of OSes and the one commonality is that they are all managed by Puppet. A single solution with central reporting of success and failure sounds like a plan. I took a look at Puppet solutions and found a module called os_patching by Tony Green. I really like this module and what it has to offer, even though it doesn’t address all my concerns at this time. It shows a lot of promise and I suspect I will be working with Tony on some features I’d like to see in the future.

Currently, os_patching only supports Red Hat/Debian-based Linux distributions. Support is planned for Windows, and I know someone is looking at contributing to provide SuSE support. The module will collect information on patching that can be used for reporting, and patching is performed through a Task, either at the CLI or using the PE console’s Task pane.

Setup

Configuring your system to use the module is pretty easy. Add the module to your Puppetfile / .fixtures.yml, add a feature flag to your profile, and include os_patching behind the feature flag. Implement your tests and you’re good to go. Your only real decision is whether you default the feature flag to enabled or disabled. In my home network, I will enable it, but a production environment may want to disable it by default and enable it as an override through hiera. Because the fact collects data from the node, it will add a few seconds to each agent’s runtime, so be sure to include that in your calculation.

Adding the module is pretty simple. Here are the Puppetfile / .fixtures.yml diffs:

# Puppetfile
mod 'albatrossflavour/os_patching', '0.3.5'

# .fixtures.yml
fixtures:
  forge_modules:
    os_patching:
      repo: "albatrossflavour/os_patching"
      ref: "0.3.5"

Next, we need an update to our tests. I will be adding this to my profile::base, so I modify that spec file. Add a test for the default feature flag setting, and one for the non-default setting. Flip the to and not_to if you default the feature flag to disabled. If you run the tests now, you’ll get a failure, which is expected since there is no supporting code in the class yet (there is more to the test; I have only included the framework plus the new tests):

require 'spec_helper'
describe 'profile::base', :type => :class do
  on_supported_os.each do |os, facts|
    # One context per OS so each set of facts applies to its own examples
    context "on #{os}" do
      let(:facts) { facts }

      context 'with defaults for all parameters' do
        it { is_expected.to contain_class('os_patching') }
      end

      context 'with manage_os_patching disabled' do
        let(:params) do
          {
            manage_os_patching: false,
          }
        end

        # Disabled feature flags
        it { is_expected.not_to contain_class('os_patching') }
      end
    end
  end
end

Finally, add the feature flag and feature to profile::base (the additions are the manage_os_patching parameter and the matching if block):

class profile::base (
  Hash    $sudo_confs = {},
  Boolean $manage_puppet_agent = true,
  Boolean $manage_firewall = true,
  Boolean $manage_syslog = true,
  Boolean $manage_os_patching = true,
) {
  if $manage_firewall {
    include profile::linuxfw
  }

  if $manage_puppet_agent {
    include puppet_agent
  }
  if $manage_syslog {
    include rsyslog::client
  }
  if $manage_os_patching {
    include os_patching
  }
  ...
}

Your tests will pass now. That’s all it takes! For any nodes where it is enabled, you will see a new fact and some scripts pushed down on the next run:

[rnelson0@build03 controlrepo:production]$ sudo puppet agent -t
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Notice: /File[/opt/puppetlabs/puppet/cache/lib/facter/os_patching.rb]/ensure: defined content as '{md5}af52580c4d1fb188061e0c51593cf80f'
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for build03.nelson.va
Info: Applying configuration version '1535052836'
Notice: /Stage[main]/Os_patching/File[/etc/os_patching]/ensure: created
Info: /Stage[main]/Os_patching/File[/etc/os_patching]: Scheduling refresh of Exec[/usr/local/bin/os_patching_fact_generation.sh]
Notice: /Stage[main]/Os_patching/File[/usr/local/bin/os_patching_fact_generation.sh]/ensure: defined content as '{md5}af4ff2dd24111a4ff532504c806c0dde'
Info: /Stage[main]/Os_patching/File[/usr/local/bin/os_patching_fact_generation.sh]: Scheduling refresh of Exec[/usr/local/bin/os_patching_fact_generation.sh]
Notice: /Stage[main]/Os_patching/Exec[/usr/local/bin/os_patching_fact_generation.sh]: Triggered 'refresh' from 2 events
Notice: /Stage[main]/Os_patching/Cron[Cache patching data]/ensure: created
Notice: /Stage[main]/Os_patching/Cron[Cache patching data at reboot]/ensure: created
Notice: Applied catalog in 54.18 seconds

You can now examine a new fact, os_patching, which shows tons of information including the pending package updates, the number of packages, which ones are security patches, whether the node is blocked (explained in a bit), and whether a reboot is required:

[rnelson0@build03 controlrepo:production]$ sudo facter -p os_patching
{
  package_updates => [
    "acl.x86_64",
    "audit.x86_64",
    "audit-libs.x86_64",
    "audit-libs-python.x86_64",
    "augeas-devel.x86_64",
    "augeas-libs.x86_64",
    ...
  ],
  package_update_count => 300,
  security_package_updates => [
    "epel-release.noarch",
    "kexec-tools.x86_64",
    "libmspack.x86_64"
  ],
  security_package_update_count => 3,
  blocked => false,
  blocked_reasons => [],
  blackouts => {},
  pinned_packages => [],
  last_run => {},
  patch_window => "",
  reboots => {
    reboot_required => "unknown"
  }
}
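For reporting, the interesting bits of that fact boil down to a few keys. Here is a minimal sketch that summarizes them, assuming the fact has already been loaded into a Ruby hash with string keys named as in the output above; patch_summary is a hypothetical helper of mine, not part of os_patching.

```ruby
# Summarize an os_patching fact hash (string keys, named as in the
# facter output above). Missing counts are treated as zero.
def patch_summary(fact)
  total    = fact.fetch('package_update_count', 0)
  security = fact.fetch('security_package_update_count', 0)
  state    = fact['blocked'] ? 'blocked' : 'not blocked'
  "#{total} updates pending (#{security} security), #{state}"
end

fact = {
  'package_update_count' => 300,
  'security_package_update_count' => 3,
  'blocked' => false,
}
puts patch_summary(fact)
```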

Additional Configuration

There are a number of other settings you can configure if you’d like.

  • patch_window: a string descriptor used to “tag” a group of machines, e.g. Week3 or Group2
  • blackout_windows: a hash of datetime start/end dates during which updates are blocked
  • security_only: boolean, when enabled only the security_package_updates packages and dependencies are updated
  • reboot_override: boolean, overrides the task’s reboot flag (default: false)
  • dpkg_options/yum_options: a string of additional flags/options to dpkg or yum, respectively

You can set these in hiera. For instance, my global config has some blackout windows for the next few years:

os_patching::blackout_windows:
  'End of year 2018 change freeze':
    'start': '2018-12-15T00:00:00+1000'
    'end':   '2019-01-05T23:59:59+1000'
  'End of year 2019 change freeze':
    'start': '2019-12-15T00:00:00+1000'
    'end':   '2020-01-05T23:59:59+1000'
  'End of year 2020 change freeze':
    'start': '2020-12-15T00:00:00+1000'
    'end':   '2021-01-05T23:59:59+1000'
  'End of year 2021 change freeze':
    'start': '2021-12-15T00:00:00+1000'
    'end':   '2022-01-05T23:59:59+1000'
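
To get a feel for what a blackout window means in practice, here is a minimal Python sketch that checks whether a point in time falls inside any configured window. It is not the module’s own implementation; the function name blocking_windows and the BLACKOUT_WINDOWS dict are made up for illustration, and the dict simply copies two entries from the hiera data above.

```python
from datetime import datetime

# Same timestamp layout as the hiera example, e.g. 2018-12-15T00:00:00+1000
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S%z"

# Two of the windows from the hiera data, copied into a plain dict (illustrative)
BLACKOUT_WINDOWS = {
    "End of year 2018 change freeze": {
        "start": "2018-12-15T00:00:00+1000",
        "end": "2019-01-05T23:59:59+1000",
    },
    "End of year 2019 change freeze": {
        "start": "2019-12-15T00:00:00+1000",
        "end": "2020-01-05T23:59:59+1000",
    },
}


def blocking_windows(windows, when):
    """Return the names of every window that contains the aware datetime `when`."""
    reasons = []
    for name, window in windows.items():
        start = datetime.strptime(window["start"], TIMESTAMP_FORMAT)
        end = datetime.strptime(window["end"], TIMESTAMP_FORMAT)
        if start <= when <= end:
            reasons.append(name)
    return reasons


if __name__ == "__main__":
    christmas = datetime.strptime("2018-12-25T09:00:00+1000", TIMESTAMP_FORMAT)
    midyear = datetime.strptime("2019-06-01T09:00:00+1000", TIMESTAMP_FORMAT)
    print(blocking_windows(BLACKOUT_WINDOWS, christmas))  # ['End of year 2018 change freeze']
    print(blocking_windows(BLACKOUT_WINDOWS, midyear))    # []
```

A non-empty result corresponds to the idea behind the fact’s blocked and blocked_reasons values: patching is held while at least one window is active.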

Patching Tasks

Once the module is installed and all of your agents have picked up the new config, they will start reporting their patch status. You can query nodes with outstanding patches using PQL. A search like inventory[certname] {facts.os_patching.package_update_count > 0 and facts.clientcert !~ 'puppet'} finds all your agents that have outstanding patches (except the puppet master itself: kernel patches require reboots, and puppet will have a hard time talking to itself across a reboot). You can also narrow the query to a patch window by appending and facts.os_patching.patch_window = "Week3" or similar. You can then provide that query to the command line task:

puppet task run os_patching::patch_server --query="inventory[certname] {facts.os_patching.package_update_count > 0 and facts.clientcert !~ 'puppet'}"
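
If you build these queries often, a small helper keeps the clauses consistent. The Python sketch below only assembles the query string and the argument list for the command shown above; the function names build_patch_query and build_task_command are hypothetical, and nothing here talks to PuppetDB or runs the task.

```python
import shlex


def build_patch_query(patch_window=None, exclude_pattern="puppet"):
    """Assemble the PQL query from the post, optionally limited to one patch window."""
    clauses = [
        "facts.os_patching.package_update_count > 0",
        f"facts.clientcert !~ '{exclude_pattern}'",
    ]
    if patch_window is not None:
        clauses.append(f'facts.os_patching.patch_window = "{patch_window}"')
    return "inventory[certname] {" + " and ".join(clauses) + "}"


def build_task_command(query):
    """Return the argv list for the patch task; hand it to subprocess.run to execute it."""
    return ["puppet", "task", "run", "os_patching::patch_server", f"--query={query}"]


if __name__ == "__main__":
    query = build_patch_query(patch_window="Week3")
    # shlex.join (Python 3.8+) prints a copy-pasteable, correctly quoted command line
    print(shlex.join(build_task_command(query)))
```

Passing the command as a list avoids fighting with shell quoting, which gets awkward once the query contains both single and double quotes.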

Or use the Console’s Task view to run the task against the PQL selection.

Add any other parameters you want in the dialog/CLI args, like setting reboot to true, then run the task. An individual job will be created for each node, all run in parallel. If you are selecting too many nodes for simultaneous runs, use additional filters, like the aforementioned patch_window or other facts (EL6 vs EL7, Debian vs Red Hat), to narrow the node selection [I blew up my home lab, which couldn’t handle the CPU/IO load, when I ran it against all systems the first time, whoops!]. When the job is complete, you will get your status back for each node as a hash of status elements and the corresponding values, including return (success or failure), reboot, packages_updated, etc. You can extract the logs from the Console or pipe CLI logs directly to jq (json query) to analyze as necessary.
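
If jq is not your thing, the same kind of summary is easy in Python. This is only a sketch: the exact JSON layout of the task output is not shown in this post, so the input shape below (a hash of certname to status hash, with values like "success") is an assumption, and SAMPLE and summarize are invented for illustration. Only the key names return, reboot and packages_updated come from the description above.

```python
import json

# Hypothetical results; the real layout and values may differ from this assumed shape
SAMPLE = """
{
  "web01.example.com": {"return": "success", "reboot": true,
                        "packages_updated": ["acl.x86_64", "audit.x86_64"]},
  "db01.example.com":  {"return": "failure", "reboot": false,
                        "packages_updated": []}
}
"""


def summarize(results):
    """Split nodes into succeeded/failed and count the packages updated overall."""
    succeeded, failed = [], []
    for certname, status in sorted(results.items()):
        ok = str(status.get("return", "")).lower() == "success"
        (succeeded if ok else failed).append(certname)
    updated = sum(len(s.get("packages_updated", [])) for s in results.values())
    return {"succeeded": succeeded, "failed": failed, "packages_updated": updated}


if __name__ == "__main__":
    print(json.dumps(summarize(json.loads(SAMPLE)), indent=2))
```

Adjust the key lookups once you have seen what your own job output actually looks like.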

Summary

Patching for many of us requires additional automation and reporting. The relatively new puppet module os_patching provides helpful auditing and compliance information alongside orchestration tasks for patching. Applying a little Puppet Query Language allows you to update the appropriate agents on your schedule, or to pull the compliance information for any reporting needs, always in the same format regardless of the (supported) OS. Currently, this is restricted to Red Hat/Debian-based Linux distributions, but there are plans to expand support to other OSes soon. Many thanks to Tony Green for his efforts in creating this module!

Using Puppet Enterprise 2018’s new backup/restore features

I was pretty excited when I read the new features in Puppet Enterprise 2018.1. There are a lot of cool new features and fixes, but the backup/restore feature stood out for me. Even with just 5 VMs at home, I don’t want to rock the boat by losing my CA or agent certs when rebuilding my master, much less with many more managed nodes at work. On top of that, all the little bootstrap requirements have changed since I started using PE in 2014. Figuring out how to get everything running myself would be possible, but it would take a while and be out of date in a few months anyway. Then there is everything in PuppetDB that I do not want to lose, like collected facts/resources and run reports.

Not coincidentally, I still had a single CentOS 6 VM around because it was my all-in-one puppet master, and migrating to CentOS 7 was not something I looked forward to due to the anticipated work it would require. With the release of this feature, I decided to get off my butt and do the migration. It still took over a month to make it happen, between other work, and I want to share my experience in the hope it saves someone else a bit of pain.

Create your upgrade outline

I want to summarize the plan at a really high level, then dive in a bit deeper. Keep in mind that I have a single all-in-one master using r10k, and my plan does not address multi-master or split deployments. Both of those deployment models have significantly different upgrade paths, so please be careful if you try to map this outline onto those models without adjusting. For the all-in-one master, it’s pretty simple:

  • Backup old master
  • Deploy a new master VM running EL7
  • Complete any bootstrapping that isn’t part of the backup
  • Install the same version of PE
  • Restore the old master’s backup onto the new master
  • Run puppet
  • Point agents at the new master

I will cover the backup/restore steps at the end, so the first step to cover is deploying a new master. This part sounds simple, but if Puppet is currently part of your provisioning process and you only have one master, you’ve got a catch-22: new deployments must talk to puppet to complete without errors, and if you deploy a new puppet master using the same process, it will either fail to communicate with itself since PE is not installed, or it will talk to a PE installation that does not reflect your production environment. We need to make sure that we have the ability to provision without puppet, or be prepared for some manual effort in the deploy. With a single master, manual effort isn’t that burdensome, but it can still reduce accuracy, which is why I prefer a modified automated provisioning workflow.

A lot of bootstrapping – specifically hiera and r10k/code manager – should be handled by the restore. There were just a few things I needed to do:

  • Run ssh-keygen/install an existing key and attach that key to the git system. You can avoid this by managing the ssh private/public keys via file resources, but you will not be able to pull new code until puppet processes that resource.
  • SSH to your git server and accept the key. You can avoid this with the sshkey resource, with the same restriction.
  • Check your VM’s default iptables/selinux posture. I suggest managing security policy via puppet, which should prevent remote agents from connecting before the first puppet run, but it’s also possible to prevent the master from communicating with itself with the wrong default policy.
  • Check the hostname matches your expectations. All of /etc/hosts, /etc/hostname, /etc/sysconfig/network should list the short name and FQDN properly, and hostname; hostname -f should return the same values. /etc/resolv.conf may also need the search domain. Fix any issues before installing PE, as certs are generated during install, and having the wrong hostname can cause cascading faults best addressed by starting over.

The restore should get the rest from the PE side of things. If your provisioning automation performs other work that you had to skip, make sure you address it now, too.
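
The hostname check is easy to forget, so it may be worth scripting. Here is a minimal Python sketch, assuming that socket.gethostname() and socket.getfqdn() roughly correspond to hostname and hostname -f on your system. The function names are hypothetical, and this only compares the two names; it does not inspect /etc/hosts or the other files mentioned above.

```python
import socket


def hostname_problems(short, fqdn):
    """Return a list of human-readable problems; an empty list means the names agree."""
    problems = []
    if "." not in fqdn:
        problems.append(f"FQDN {fqdn!r} has no domain part")
    if fqdn.split(".")[0] != short.split(".")[0]:
        problems.append(f"short name {short!r} does not match FQDN {fqdn!r}")
    return problems


def local_hostname_problems():
    """Run the same comparison against whatever this machine reports."""
    return hostname_problems(socket.gethostname(), socket.getfqdn())


if __name__ == "__main__":
    # Literal values keep the demo deterministic; call local_hostname_problems()
    # on the new master itself before starting the PE installer.
    print(hostname_problems("puppet", "puppet.example.com"))  # []
    print(hostname_problems("puppet", "puppet"))
```

If the list is not empty, fix the name resolution first; the certs are generated at install time, so it is much cheaper to catch this before installing PE.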

Installing PE is probably the one manual step you cannot avoid. You can go to https://support.puppet.com and find links to current and past PE versions. Make sure you get the EL7 edition and not the EL6 edition. I did not check with Support, but I assume that you must restore on the same version you backed up. I would not risk even a patch release difference.

Skipping the restore for now brings us to running the agent: a simple puppet agent -t on the master, or waiting 30 minutes for the run to complete on its own.

The final step may not apply to your situation. In addition to refreshing the OS of the master, I switched to a new hostname. If you’re dropping your new master on top of the existing one’s hostname/IP, you can skip this step. I forked a new branch from production called mastermigration. The only change in this branch is to set the server value in /etc/puppetlabs/puppet/puppet.conf. There are a number of ways to do this; I went with a few ini_setting resources and a flag, manage_puppet_conf, in my profile::base::linux. The value should only be in one of the sections main or agent, so I ensured it is in main and absent elsewhere:

  if $manage_puppet_conf {
    # These settings are very useful during migration but are not needed most of the time
    ini_setting { 'puppet.conf main server':
      ensure  => present,
      path    => '/etc/puppetlabs/puppet/puppet.conf',
      section => 'main',
      setting => 'server',
      value   => 'puppet.example.com',
    }
    ini_setting { 'puppet.conf agent server':
      ensure  => absent,
      path    => '/etc/puppetlabs/puppet/puppet.conf',
      section => 'agent',
      setting => 'server',
    }
  }
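
To make the net effect on puppet.conf concrete, here is a rough Python sketch that applies the same two changes to an INI string with the standard configparser module. It illustrates the end state only, not how ini_setting works internally; the sample file contents and the point_at_master function are invented for this example.

```python
import configparser
import io

# Invented "before" state: the server is set in [agent], not in [main]
BEFORE = """\
[main]
certname = agent01.example.com

[agent]
server = oldmaster.example.com
"""


def point_at_master(conf_text, server):
    """Set server in [main] and make sure it is absent from [agent]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)
    if not parser.has_section("main"):
        parser.add_section("main")
    parser.set("main", "server", server)
    if parser.has_section("agent"):
        parser.remove_option("agent", "server")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(point_at_master(BEFORE, "puppet.example.com"))
```

After the change there is exactly one server value, in main, which is the state the two resources converge on.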

During the migration, I can just set profile::base::linux::manage_puppet_conf: true in hiera for the appropriate hosts, or globally, and they’ll point themselves at the new master. Later, I can set it to false if I don’t want to continue managing it (while there is no reason you cannot leave the flag enabled, by leaving it as false normally you can ensure that changing the server name here does not take effect unless you purposefully flip the flag; you could also parameterize the server name).

Now let’s examine the new feature that makes it go.

Backups and Restores

Puppet’s documentation on the backup/restore feature provides lots of detail. It will capture the CA and certs, all your currently deployed code, your PuppetDB contents including facts, and almost all of your PE config. About the only thing missing is some gems, which you should hopefully be managing and installing with puppet anyway.

Using the new feature is pretty simple: puppet-backup create or puppet-backup restore <filename> will suffice for this effort. There are a few options for more fine-grained control, such as backup/restore of individual scopes with --scope=<scopes>[,<additionalscopes>...], e.g. --scope=certs.

The backup will only back up the current PE edition’s files, so if you still have /etc/puppet on your old master from PE 3 days, that will not be part of the backup. However, files in directories it does back up, like /etc/puppetlabs/puppet/puppet.conf.rpmsave, will persist. This will help reduce cruft, but not eliminate it. You will still need to police on-disk content. In particular, if you accidentally placed a large file in /etc/puppetlabs, say the PE install tarball, that will end up in your backup and can inflate the size a bit. If you feel the backup is exceptionally large, you may want to search for large files in that path.
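
One way to hunt for those oversized files is a short Python sketch like the one below. The large_files name and the 100 MiB threshold are arbitrary choices for illustration; tune the threshold to whatever counts as “large” in your environment.

```python
import os


def large_files(root, min_bytes):
    """Return (size, path) pairs under root at or above min_bytes, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size >= min_bytes:
                found.append((size, path))
    return sorted(found, reverse=True)


if __name__ == "__main__":
    # Prints nothing if the directory does not exist or holds no large files
    for size, path in large_files("/etc/puppetlabs", 100 * 1024 * 1024):
        print(f"{size / 1024 / 1024:8.1f} MiB  {path}")
```

Run it as a user that can read the whole tree (typically root), otherwise unreadable directories are silently skipped.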

The restore docs also specify two commands to run after a restore when Code Manager is used. If you use CM, make sure not to forget this step:

puppet access login
puppet code deploy --all --wait 

How long the backup and restore take depends mostly on the size of your PuppetDB. With ~120 agents and 14 days of reports, it took less than 10 minutes for either process and generated a ~1G tarball. Larger environments should expect the master to be offline for a bit longer if they want to retain their full history.

Lab it up

The backup/restore process is great, but it’s new, and some of us have very ancient systems lying around. I highly recommend testing this in the lab. My test looked like this:

  • Clone the production master to a VM on another hostname/IP
  • Run puppet-backup create
  • Fully uninstall PE (sudo /opt/puppetlabs/bin/puppet-enterprise-uninstaller -p -d -y)
  • Remove any remaining directories with puppet in them, excepting the PE 2018 install files, to ensure all cruft is gone
  • Disable and uninstall any r10k webhook or puppet-related services that aren’t provided by PE itself.
  • Reboot
  • Bootstrap (from above)
  • Install PE (sudo /opt/puppetlabs/bin/puppet-enterprise-installer) only providing an admin password for the console
  • Run puppet-backup restore <backup file>
  • Run puppet agent -t
  • Make sure at least one agent can check in with puppet agent -t --server=<lab hostname> (clone an agent too if need be)
  • Reboot
  • Make sure the master and agent can still check in, Console works, etc.
  • If possible, test any systems that use puppet to make sure they work with the new master
  • Identify any missing components/errors and repeat the process until none are observed

I mentioned that I used PE3. My master had been upgraded all the way from version 3.7 to 2018.1.2. I’m glad I tested this, because there were some unexpected database settings that the restore choked on. I had to engage Puppet Support who provided the necessary commands to update the database so I could get a useful backup. This also allowed me to identify all of my bootstrap items and of course, gain familiarity and confidence with the process.

This became really important for me because, during my production migration, I ran into a bug in my provisioning system where the symptom presented itself through Puppet. Because I was very practiced with the backup/restore process, I was able to quickly determine PE was NOT the problem and correctly identify the faulty system. Though it took about 6 hours to do my “very quick” migration, only about an hour of that was actually spent on the Puppet components.

I also found a few managed files on the master where the code presumed the directory structure would already be there, which it turns out was not the case. I must have manually created some directories 4 years ago. I think the most common issues you would find at this point are dependencies and ordering, but there may be others. Either fix the code now or, if it would negatively affect the production server, prep a branch for merging just prior to the migration, with the plan to revert if you roll back.

I strongly encourage running through the process a few times and building the most complete checklist you can before moving on to production.

Putting it together

With everything I learned in the lab, my final outline looked like this:

  • Backup old master, export to another location
  • Deploy a new master VM running EL7 using an alternative workflow
  • Run ssh-keygen/install an existing key and attach that key to the git system
  • SSH to the git server and accept the key
  • Verify your VM’s default iptables/selinux posture; disable during bootstrap if required
  • Validate the hostname is correct
  • Install PE
  • Restore the backup
  • [Optional] Merge any code required for the new server; run r10k/CM to ensure it’s in place on the new master
  • Run puppet
  • Point agents at the new master

Yours may look slightly different. Please spend the time in the lab to practice and identify any missing steps. It’s well worth it.

Summary

Refreshing any system of significant age is always possible, and often fraught with manual processes that are prone to error. Puppet Enterprise 2018.1 delivered a new backup/restore process that automates much of the work. We have put together a rough outline, refined it in the lab, and then used it to perform the migration in production with high confidence, accounting for any components the backup did not include. I really appreciate this new feature and I look forward to refinements in the future. I hope that soon enough migrations will be as simple and effective as in-place upgrades.