vCenter 7 upgrade fails with “Exception occurred in install precheck phase”

NOTE: Since this article was written, KB83145 has been created which helps resolve the issue caused by this file and a few possible others.

I finally decided to upgrade my home lab to vSphere 7. First thing is always to upgrade vCenter. I had some issues with my 6.7 VCSA, specifically, that I lost the root password and kinda broke it when trying to recover. No big deal, I don’t use dvSwitches at home and I only have 9 VMs, so no great loss to set up a new vCenter.

After powering off the old VCSA, I deployed the lastest VCSA 7.0u1 image. Everything went well. The next day, I found out the 7.0u1a patch had been released. Since I hadn’t done more than add the vSphere hosts to vCenter, I jumped into the VAMI to perform an automatic upgrade. 7.0u1a was right there waiting for me. I selected it and started the upgrade.

About 30 seconds after starting the upgrade, I received this error:

Installation failed: Exception occurred in install precheck phrase

I googled for “Exception occurred in install precheck phrase” and didn’t see any result for this at all. Everything that looked close was way off. Hitting Resume just started the upgrade again, asking me to cancel or proceed. The cancel button did nothing, and the proceed button led right back to this error.

I tried rebooting the VCSA, figuring maybe it just needed a fresh boot. Logging into the VAMI – after the reboot – showed the same error. In addition, it grays out the left-hand menu. If you change the URL to remove the navigation (https://vcsa/ui) you do get back to the menu, but any attempt to perform another upgrade leads to the same cycle.

After deleting this VCSA and deploying another 7.0u1 instance from scratch, I encountered the same error! Again, google failed me, even a few days later. Thankfully, someone on the VMware{code} slack pointed me to an article on Paul Braren’s blog.

Though the error was somewhat different, the resolution worked for me! Paul’s post goes through the detail of how he ran into the problem and leads to an ultimate resolution of removing the file /etc/applmgmt/appliance/software_update_state.conf. Once I removed this file, I was able to navigate to and perform and upgrade in the VAMI.

I could, of course, have simply downloaded a new image of the VCSA and deploy it, but in production we don’t often have this luxury. We have to fix what’s out there instead of starting from scratch. I hope that by writing this post, which builds on Paul’s excellent work, this error message pulls up at least one result for you. Take that, denvercoder9!

Planning Your Distributed Log Insight Deployments

As I mentioned recently, I’ve changed jobs and it’s giving me more time for my blog. One of my first challenges at the new job is to look at how to deploy a Log Insight cluster, with the wrinkle that there are multiple datacenters and availability zones that need to leverage Log Insight. Previously, I have only worked with single-node instances of Log Insight, so I had a lot to learn.

The design includes three datacenters, A, B, and C, and the last has two availability zones, giving us A, B, C-1 and C-2. Each site has between 10 and 50 ESXi hosts, so not small but not gigantic, either. Every center should also forward data to a per-datacenter instance of a separate system for compliance.

Log Insight Product Docs

I found a few articles to help me out with the design aspect. The first are the official VMware docs vRealize Log Insight Configuration Limits, Sizing the vRealize Log Insight Virtual Appliance and Planning Your vRealize Log Insight Deployment. Each cluster consists of 3-12 members – one master and 2-11 workers – and each node can have up to 4TB of storage. They can talk to 1 each of a vROps Manager and Active Directory domain, 15 vCenter servers, and 10 forwarders. When using a cluster, nodes must be Medium or Large size and all nodes must be the same size. We should always use the Integrated Load Balancer (ILB) and direct targets to its virtual IP (VIP), even in non-cluster mode. This allows addition of nodes in the future without having to adjust the destination address of other devices.

There are a few caveats, as well. LI’s ILB does not support geoclusters yet, so all cluster members must be on the same Layer 2 network; devices in different L2 networks must be in separate clusters. If you’re using NSX, exclude the LI nodes from Distributed Firewall Protection, otherwise the ILB traffic may be blocked by spoofed traffic rules.

Initial Design

This gives us a lot of information to start designing. We need at least 3 medium or large Log Insight nodes in each datacenter, plus a VIP address assigned to the ILB. Since we have thin provisioned storage, we assign an additional 4TB disk to each node – it’s slightly above the space that LI will use, but hey, thin provisioned and no math required! If you don’t have thin provisioned storage, you can check the usable storage in the UI after deployment (IIRC it’s .6G for LI 8, but might be different if you’re upgrading a node from a previous version) and add whatever the delta is.

There is also one paragraph under the Planning doc’s Clusters with Forwarders section that I almost missed on the first readthrough:

The design is extended through the addition of multiple forwarder clusters at remote sites or clusters. Each forwarder cluster is configured to forward all its log messages to the main cluster and users connect to the main cluster, taking advantage of CFAPI for compression and resilience on the forwarding path. Forwarder clusters configured as top-of-rack can be configured with a larger local retention.

What this means is that in addition to clusters for A, B, C-1 and C-2, which will receive logs from the hosts in their respective datacenters, we also need a main cluster for our “Single Pain of Glass” view. This means 5 clusters of at least 3 nodes each, with clusters A, B, C-1, and C-2 forwarding to both the compliance system and the Main cluster.

Less obvious from the reading, but implied, is that the retention times of the SPOG cluster will likely be shorter than those of the datacenter clusters. Log Insight simply rotates out the oldest logs (by default to the bitbucket, but you can set up a long term archive location) when the disks get full, so your log ingestion rate determines your retention timeline in each cluster. However, you can’t combine 4 sets of logs into 1 without taking up more space, which means the retention time with the same disk space will be lower. We could increase the size of the SPOG cluster to 12 node, but it’s still possible that one noisy datacenter drowns out logs from the other 3 datacenters. Regardless of whether you increase the SPOG cluster size or not, it’s best to assume that your longest retention timeframes will be on the local LI cluster. As a result, when you log into the main cluster and you need to go back just a tiny bit further in time than is retained, you may have to log in to the respective datacenter LI instance to view the older records.

Additional Guides

Before finalizing the design, I looked beyond the product docs and found two more great references. VMware has an Architecting a VMware vRealize Log Insight Solution for VMware Cloud Providers whitepaper from January 2018, which makes it a bit older than Log Insight 8.0 but is still very relevant (reminder: LI jumped from 4.8 to 8.0 to match product numbers with other vRealize Suite products, not because of any major changes to LI). There’s a ton of valuable information in this paper, including some tuning advice for ESXi that I’ll inevitably come back to later, but right now we’re focused on the cluster design aspects.

While it’s not a factor for me, section 3.4.2 (pages 25-27) covers how to use a non-LI system as an intermediate syslog forwarder. Section 4.5 (page 33) has a table showing estimated log retention sizes for a single ESXi node or an 8 node ESXi cluster, which may be helpful in understanding the retention pattern of 3x4TBs in a cluster and whether additional nodes are required just for storage. Section 6.3 (pages 36-38) includes a number of tables of the required firewall ports and directions to build a proper firewall policy. Page 39 reminds us that the Log Insight nodes should probably run on a management cluster that has more nodes than the LI cluster (at least N+1), has HA enabled, and is using local (non-auto deploy) boot. Section 7.3 (page 42) notes that while an LI cluster can only communicate with one vROps Manager, one vROps Manager can communicate with multiple LI clusters – though the Launch in Context only works with a 1:1 mapping. Lots of good but random stuff there to inform the overall plan.

Section 10, starting on page 52, examines three scenarios and includes details on the resulting design. Design scenario C (page 54) most closely approximates my scenario. The London, Paris, and Frankfurt datacenters approximate my A, B, C-1, and C-2 datacenters, and the GNOC cluster approximates my Main cluster. The sizing is a bit more conservative (1 node at datacenters, 3 in the GNOC) but otherwise pretty close to what I came up with. Score 1 for me!

There’s one more document I reviewed, Log Insight Best Practices: Server by Steve Flanders. I worked my way backwards to this link, which is pretty much a checklist of everything I already identified, and a lot more, but all in one document! Really wish I had found this one first. Some of the things I didn’t catch already include:

  • Only list the local Active Directory domain servers; listing distant servers could result in up to 20 minute wait times for logins (or IME, just broken logins).
  • Use a service account for AD binding, decreases the chances of an expired password impairing your ability to log in.
  • If you use data archiving, LI does NOT clean up the archive location. It’s your job to make sure it continues to have free space and deletes data older than your long term retention requirements.
  • The 2TB limit mentioned is from 2015 and is currently 4TB.
  • Steve recommends using the DNS name instead of the IP as the log destination. There are pros and cons to each, I recommend investigating this and making an informed decision.
  • Replace the self-signed SSL certificate with a custom SSL certificate. Don’t forget to import the updated cert into other systems that need it, esp vROps.
  • We hopefully all know the importance of NTP, and Steve has written an article on proper NTP configuration. There’s nothing specific to LI here, but it’s especially vital that our log analysis systems are all properly synchronized. There’s nothing like events that actually happen hours apart appearing correlated because NTP isn’t working!

Full Design

With the addition of the whitepaper and Steve’s checklist, we have a better design. Let’s lay it out in two pieces: the installed components and the checklist of changes.

Components

  • Datacenter A
    • 3 node LI management cluster with a VIP, each node is a Medium install  with an extra 4TB disk
    • 3 node LI forwarding cluster with a VIP, each node is a Medium install with an extra 4TB disk
  • Datacenter B:
    • 3 node LI forwarding cluster with a VIP, each node is a Medium install with an extra 4TB disk
  • Datacenter C-1
    • 3 node LI forwarding cluster with a VIP, each node is a Medium install with an extra 4TB disk
  • Datacenter C-2
    • 3 node LI forwarding cluster with a VIP, each node is a Medium install with an extra 4TB disk

Checklist

Follow this for every cluster.

  • Deploy a single node as a new deployment to create the cluster
  • Configure NTP
  • Configure Active Directory authentication
    • Use the in-datacenter DCs only, accept certificates when using SSL
    • Use a service account for the binding
  • Replace the self-signed certificate with a custom certificate
  • Deploy 2 additional nodes as members of an existing cluster
  • (Forwarding clusters) Configure forwarding with a target of the management cluster’s VIP and the compliance system, by FQDN
  • (Forwarding clusters) Configure vCenter/vROps integration with the local datacenter’s vCenter(s) and single vROps instance.

Once the clusters are stood up and configured, they are ready to ingest data. You may choose an IP or FQDN. Though DNS is not hosted on the ESXi hosts, I still prefer IP-based destination as it is more likely to work during a datacenter issue; your preference may be different. How you apply it is also up to you. For ESXi hosts, you can use the vCenter integration in each forwarding cluster, Host Profiles, PowerCLI, or other vSphere APIs. LI can also ingest data from non-ESXi hosts, if you want to point them at LI as well.

Summary

We’ve reviewed a number of documents, official and unofficial, on Log Insight designs. We’ve taken the general instructions and guidelines and built our own design and implementation checklists. The design complies with best practices and can easily be expanded – just add additional nodes to each cluster as the need arises.

If I’ve missed anything, or you have questions about the design, drop a note in the comments or ping me on Twitter. Thanks!

Disabling account lockout on your VCSA 6.5

I recently locked myself out of my vCenter Server Appliance when I was attempting to perform an upgrade through VAMI. The VAMI just says “invalid password”, but logging in on the console displayed a message indicating I had failed authentication 12 times. I had only tried four times! Regardless of whether it was me or someone else, now that I knew I had the right password, I was locked out. I waited 5 minutes but still couldn’t get in, so it looked like it was time to do a password reset. However, I wanted to explore something I had done with vRealize Orchestrator recently: disable the account lockout.

KB2147144 documents the process for booting into a privileged shell without a password. Unlike in 6.0, you hit ‘e’ instead of ‘space’ at the GRUB prompt, but otherwise it’s the same. You do have about half a second to hit ‘e’, so pay attention or you’ll find yourself rebooting a few times! For those who are not locked out already, you can just ssh into the VCSA and make this change without a reboot

Once you’re in, search for the word tally in the pam setup with grep tally /etc/pam.d/*. You will find these two lines in /etc/pam.d/system-auth.

auth require pam_tally2.so file=/var/log/tallylog deny=3 onerr=fail even_deny_root unlock_time=86400 root_unlock_time=300
auth require pam_tally1.so file=/var/log/tallylog deny=3 onerr=fail even_deny_root unlock_time=86400 root_unlock_time=300

Comment those two lines out (prepend with a #) and save the file:

# cat /etc/pam.d/system-auth
# Begin /etc/pam.d/system-auth

auth required pam_unix.so

# End /etc/pam.d/system-auth
#auth required pam_tally2.so file=/var/log/tallylog deny=3 onerr=fail even_deny_root unlock_time=86400 root_unlock_time=300
#auth required pam_tally1.so file=/var/log/tallylog deny=3 onerr=fail even_deny_root unlock_time=86400 root_unlock_time=300

If you know your password and are just dealing with lockouts, you can type reboot -f now. Otherwise, type passwd and enter the new password twice and then reboot. You can now enter your password wrong a million times – or someone else can – and you will not lose the ability to log in without waiting an extraordinary amount of time or requiring a reboot.

I upgraded from VCSA 6.5U1b to 6.5U1c and this persisted. I assume that when going to vNext (6.6 or 7.0) this change will be reverted, but I am not sure how it will behave when VCSA 6.5U2 is released, this may need to be re-done, so add disabling the lockout to your upgrade checklists alongside disabling the root account expiration.

Getting Started with Veeam SureBackup Jobs

Many wise people have pointed out that a backup doesn’t count until you can restore it. It’s vitally important that we test our backups by restoring them, and doing so manually is often problematic when the original system is still online. If you use Veeam Backup & Replication, it includes functionality called SureBackup to automatically test restores in a private, isolated network so that there’s no conflict with the production systems. You can read more about the functionality in the B&R Manual, starting with this section. I will be providing high level descriptions here as the manual already provides great detail, please take the time to read that as my article isn’t a substitute for the official docs!

The manual is pretty good, but I ran into a few things that were either confusing or missing, things I had to scramble to figure out on my own. That’s not fun and I don’t think others want to waste their time on it. I hope this article helps illuminate some of the gaps for others who wish to explore SureBackup. Let’s start by taking a look at how SureBackup works and the components it uses.

SureBackup, Application Groups, Virtual Labs, and other terminology

The basic process of SureBackup is as one might expect:

  1. Register and power on a VM based off the backup files
  2. Run tests against the VM
  3. Optionally perform a CRC check on the files
  4. Add the status of the VM to the report
  5. Power off and unregister the VPN
  6. Repeat 1-5 for the remaining VMs

Under the hood, of course it’s a little more complicated and introduces some new terminology:

  • Application Groups: A collection of related VMs. For example, an Active Directory Domain Controller, a domain-joined DNS server, and a domain-joined webserver. Or the trio of VMs a 3-tiered application. Only create an application group for VMs that need tested in a particular order or need extra tests. Each VM can have a defined role to run application-level tests and are powered on one at a time in the order specified.
  • Linked Job: A restore test can, after any Application Group VMs pass, run against all the VMs in a Backup or Replication job. These tests are basic power on and heartbeat tests, no application-level tests. These VMs are powered on in groups, by default 3 at a time.
  • Virtual Lab: Each job is run against or inside of a virtual lab. This is where the network isolation occurs. The Lab is attached to a single VMHost, not a cluster, and a standalone vSwitch with no uplinks is created on that VMHost to provide the isolation. A datastore is chosen for the temporary files used during the test. The production and isolated networks are bridged by at least one VM called a…
  • Proxy ApplianceNot to be confused with a Backup Proxy! This linux-based VM bridges the production and isolated networks using iptables and NAT masquerading to allow access to the restored VMs. It is managed entirely by Veeam, including creation, settings, powering on and off, etc.
  • SureBackup Job: A new job type in addition to Backup and Replication jobs. This option is not visible until a Virtual Lab exists.

Now that we know the various components, let’s expand the high level steps from before:

  1. A SureBackup Job starts and brings up the Virtual Lab and its Proxy Appliance[s].
  2. Pick the first VM from an Application Group or the first 3 VMs from a Linked Job. Register and power on a VM and run heartbeat and/or application tests against it. Tests are initiated from the Backup Server through the Proxy Appliance’s NAT and to the test VM.
  3. Optionally perform a CRC check on the files.
  4. Add the status of the VM to the report
  5. Power off and unregister the VPN
  6. [New] If the VM is a member of an application group and has failed, abort the run
  7. Repeat steps 1-6 for the remaining VMs, moving from Application Group VMs to Linked Job VMs.
  8. [New] Clean up all the temporary restore VMs and power off the Virtual Lab

We can optionally allow the VMs to persist after the SureBackup Job completes. In that case, the job actually remains running until we select it in the Console and choose to Stop Session, at which time step 7 completes. If we turn off the VMs manually, it doesn’t hurt anything, though; Veeam still handles the cleanup

Create an Application Group

An application group is defined when we want to test a number of related VMs, such as a 3-tier app or an Active Directory/Exchange setup. We do not create application groups for unrelated VMs, like 5 web servers from 5 different customers. The reason is that each VM is powered on (and left on!) in sequence, and if one fails the whole group fails. Make sure there’s a strong relationship between the VMs in an application group.

Creating an Application Group is a pretty simple process with the wizard. In the Console, go to Backup Infrastructure > SureBackup > Application Groups, right click and choose Add App Group…

Give it a name and description and click Next. On the Virtual Machines page, click Add VM and select one or more related VMs. I’ve chosen an instance of vRealize Operations Manager (vROps). Notice that the Role is not set. Select it and click Edit… Adding a role will enable an application-level check. Select the Web Server option. In the Startup Options tab, we need to make a change. vROps takes a long time to start, more than most web servers. I suggest increasing the Application initialization timeout to 300-400 seconds (5:00-6:40) so it has enough time to complete loading. Switch over to the Test Scripts tab and there is a small problem – the Web Server role uses port 80! If we highlight it and edit it, we cannot change the argument, we can only choose a different application or provide our own test script.

There are two ways to fix this. First, we can create a new role, which means we only have to describe the tests once and can re-use it across anything that fits the role. On the Backup server, browse to %ProgramFiles%\Veeam\Backup and Replication\Backup\SbRoles and we find one XML file for each role. Copy WebServer.xml to HTTPSServer.xml or similar and edit that file. There are three things to change: the Id and Name at the top and the Arguments about 2/3rds of the way down. I’m not aware of any specific rules for the Id generation, just that it should be unique. I changed the last F to an E, that’s all. The Name is what shows up in the Veeam dialog boxes. Here’s what mine looks like with the edits in bold:

<?xml version="1.0" encoding="utf-8" ?>
<SbRoleOptions>
  <Role>
    <SbRole>
      <Id>4CDC7CC4-A906-4de2-979B-E5F74C44832E</Id>
      <Name>HTTPS Web Server</Name>
    </SbRole>
  </Role>
  <Options>
    <SbVerificationOptions>
      <ActualMemoryPercent>100</ActualMemoryPercent>
      <MaxBootTimeoutSec>300</MaxBootTimeoutSec>
      <AppInitDelaySec>120</AppInitDelaySec>
      <TestScripts>
        <TestScripts>
          <TestScript>
            <Name>Web Server</Name>
            <Type>Predefined</Type>
            <TestScriptFilePath>Veeam.Backup.ConnectionTester.exe</TestScriptFilePath>
            <Arguments>%vm_ip% 443</Arguments>
          </TestScript>
        </TestScripts>
      </TestScripts>
      <HeartbeatEnabled>True</HeartbeatEnabled>
      <PingEnabled>True</PingEnabled>
    </SbVerificationOptions>
  </Options>
</SbRoleOptions>  

If we OK the Verification Options window and click Edit… again, we will see the new role HTTPS Web Server is available and the Test Scripts tab shows the port 443 in the arguments. More information on role definitions can be found in the manual.

The second way to configure the test scripts is on the Edit page by selection Use the following test script. Put something in the Name field. The Path is the TestScriptFilePath observed in the XML files plus the full path, giving us C:\Program Files\Veeam\Backup and Replication\Backup\Veeam.Backup.ConnectionTester.exe (assuming %ProgramFiles% is C:\Program Files). The arguments match the same field in the XML file, %vm_ip% 443 – or whatever port the one-off requires. We can also add our own binaries for testing, just make sure they’re documented as part of the Veeam B&R Backup server build.

Our single-VM example application group looks like this now:

There are tons of other things to customize in the application group – such as only allocating 50% of the compute and memory the VM is assigned to preserve resources during the test – but this is sufficient for our tests. Create however many application groups you want to now, you can always come back later and edit them or create more.

Create a Virtual Lab

The application group was the easy part. The Virtual Lab is next, and will create a vSwitch and Proxy Appliance VM on the host/datastore of our choosing. Before beginning, we need to decide which host and datastore to use, and grab an IP for the Proxy Appliance on the same network as the Backup server (it can be set up in a different network, but that’s a more complex setup I won’t be visiting in this article). Once we have that, we go to Backup Infrastructure > SureBackup > Virtual Labs, right click, and choose Add Virtual Lab… If a Virtual Lab has been created previously and disconnected somehow, we can also choose Connect Virtual Lab… to reconnect it. Let’s review creation of a new lab.

Give the lab a name and description. On the next page, we are asked to select a host. Again, we can NOT choose a cluster, we must choose a single host. Once we choose the host it will suggest a Folder and Resource Pool that the restore VMs will be placed in. We can edit with the Configure…button or just click Next. The next page in the wizard asks for a Datastore that the host can see, click Choose… and select one. I believe I saw a suggestion that the free space should be about 10% of the size of the VMs being restored, but I am not sure where I saw that and cannot find a more solid recommendation now.

The next page is where the Proxy Appliance is created. Set the name with the first Configure… and the network settings with the second Configure… In network settings, wemust choose the same production network as the B&R Backup server for our simple setup (more advanced options are discussed in the manual as Advanced multi-host (manual configuration) but there are no guides for it, sorry). If that network supports DHCP, just click OK, otherwise we will need to provide our IPv4 (no IPv6 availability) address settings and DNS servers. We can also optionally allow the proxy appliance to be the VMs internet proxy, but we will skip that for now (directions in steps 4 and 5 here).

Note: The proxy appliance by default receives the same name as the lab. If you use vCheck, there is a plugin that alerts on VMs whose file location on the datastore doesn’t match the VM name, and spaces in VM names are changed to underscores on the filesystem. If you use this plugin, I suggest avoiding spaces in the VM’s name or adjusting your plugin settings to skip the virtual lab VMs to prevent false positives.

On the Networking tab, choose Advanced single-host (manual configuration). You can read up on the networking modes. Our use case calls for tests of VMs in multiple networks, so we must choose the Advanced/manual option. If the restored VMs are all in a single network, then the Basic/automatic mode will work. Click Next to start setting up the Isolated Networks.

The next tab is where we will add the various networks that restored VMs will exist in. We will add some now and we will need to return here in the future when more networks are added. There are unfortunately no cmdlets or functions in the Veeam PowerShell kit to do this… yet. There will be one Isolated Network already.

Update: When I read the documentation, I assumed you needed an isolated network for every production network that a VM in the job uses (i.e. if your VMs were on VLANs 100-110, you needed 11 isolated networks and vNICs), which is not quite true. If no isolated network/vNIC exists that matches the production network for a VM, only Heartbeat and Verification tests are attempted. If an isolated network/vNIC does exist, then Ping and Script tests are attempted as well.

If we select that network and click Edit, we can see how it is associated with a Production network, an Isolated network, and a VLAN ID. This first Isolated Network defaults to the same network of the Proxy Appliance itself. It might be difficult to read through the scrubbed image, but the Isolated Network name is the Production Network name prepended with the lab name.

This network will be added to a private vSwitch on the selected host, which will have no uplinks. We should be perfectly fine leaving the VLAN ID alone, but if you are worried, just assign it a unique number not used elsewhere, maybe add 500 or 1000 to it. Click OK or Cancel and back on the Isolated Networks page of the wizard, click Add… We will need to Choose… a production network. In the dialog box, be sure to expand the host our appliance is in – if it’s a dvSwitch it SHOULD be the same everywhere, but there’s no point in chancing some identifier of a different host conflicting. In figure 7, I’ve chosen the vSphere Management network as that’s where vROPs resides.

Do not just change the VLAN ID and click OK! Take a look at the Isolated Network. I know it’s difficult to see with my scrubbed image, but it’s the same name as the previous isolated network. Click OK and the VLAN ID of both isolated networks are the same. An edit to either will update the ID for both. This isn’t what we want. The isolated network name needs changed. We can make it match the default format of <virtual lab name> <Production network name> or just <Production network name> or just enter free text like Bob. It doesn’t matter what it is called as long as it’s unique. Now, I cannot explain why the wizard doesn’t automatically change the name of the isolated network, but it doesn’t, so we have to do that ourselves. Big tip of the hat to Jason Ross who described the issue and fix in the Veeam forums. Once the Isolated Network is renamed, and click OK and the mapping will look something like this:

Next up is the Network Settings page. Here we want to create a vNIC for each Isolated Network we’ve created. Veeam uses NAT masquerading to let the Backup server communicate with the VMs on that segment, which requires selecting some address ranges that aren’t used elsewhere in the network, or at least that the Backup server doesn’t need to communicate with. Though we chose a manual network mode, a route to the masquerade IPs will automatically be created on the Backup server during restore jobs, so we do not have to manage that (this is why we did not put the proxy appliance in a different network than the Backup server). Edit the existing vNIC and assign it the IP/mask that the default gateway (router) in that network would have. We can also change the masquerade network and disable DHCP if we don’t want to use it on that interface. I would leave it enabled unless one of the VMs being restored is a DHCP server, otherwise it makes it real easy to ensure VMs get IP addresses. Here’s what that would look like for a network X.Y.10.64/27:

Repeat this for every Isolated Network you need, using the Choose isolated network to connect this vNIC to pulldown to select the correct isolated network. If we need VMs to talk to each other, check the Route network traffic between vNICs. If we don’t need it, it probably won’t hurt, though. Here’s what this might look like when complete.

We are going to skip Static Mapping, as the general NAT Masquerade works for this use case. Review the configuration on the Ready to Apply portion of the wizard and hit Apply. When we do, the resource pool, folder, vSwitch, port groups and virtual machine will be created and configured on the specified host. We can now find the proxy appliance VM (or the other resources managed) and add any notes, tags, etc that we would normally apply to those resources (I use tag-based backups so would want to put a NoBackup tag on the proxy appliance).

If you need more assistance on creating a virtual lab, I recommend this Whiteboard Fridays video.

Create a SureBackup Job and test

Finally, we need to create a backup job. We are almost there, I promise! Go into Backup & Replication > Jobs > SureBackup (this is only available if one or more virtual labs exists) and right click to create a new job with Surebackup…

Give the job a name and description and click Next. On the next page we must select a virtual lab. In this case, there is only one. Click Next. On the next page we may optionally select an application group. The next page in the Wizard is for Linked Jobs. Let’s take a moment to explore the three combinations available here:

  • Application Group only: The VMs inside the application group are powered on, one at a time and in serial order, then tested. Any VM test failure aborts the run immediately
  • Linked Jobs only: The VMs in the linked jobs are started up in batches (default: 3 at a time) until all VMs are tested. Any VM test failure does not abort the run.
  • Application Group and Linked Jobs: This a combination of the two above. The Application Group is processed as a unit and then, if it completes successfully, the Linked Job VMs are tested.

Since we created an application group, we will select it. We cannot edit the application group settings from here, only view them to ensure we select the correct group. We may choose to check the Keep the application group running after the job completes box. If so, the job will remain at 99% with all application group VMs and the Proxy Appliance VM powered on until someone right clicks on the job and chooses Stop Session. As described below, this is good for checking out any of the VMs in greater depth after the job completes. It would obviously not be something to leave enabled on a scheduled job. It is important to note that the VMs will only be kept running if the job completes successfully; if it fails, I observed the VMs being shut down immediately. So, it’s not great for troubleshooting. Click Next to proceed.

We can now link one or more Backup jobs to the SureBackup job by clicking the Add… button and selecting a job. We can only specify ONE role for all VMs in the linked job. If left blank, only a ping and heartbeat test will be used. At the bottom, we can specify how many VMs are processed at once. I did not play with the Advanced button but I believe we can use it to set roles by individual VM name, tags, folders, etc. Be aware that each VM will attempt to connect to an isolated network on the virtual lab’s vSwitch. If the backup jobs are by network, the lab can get by with a single isolated network, but if the job contains VMs from multiple networks, each one needs to exist beforehand or the job will fail. Click Next when ready to proceed.

The Settings page is where you specify to send SNMP or email notifications and determine if CRC checks are performed on the backup files. I only received emails in my testing for failed jobs; there appears to be no exposed setting for whether or not to send emails on successful job runs. CRC checks do take a while but I would recommend to avoid bit rot unless there is some sort of detection in the storage array or you’re a gambler.

Clicking Next takes us to the Schedule tab. If we check Run the job automatically we can have it run on a daily or monthly schedule, or have it run after a job – perhaps the Linked Job or a job that the Application Group VMs are backed up in. If some VMs come from a different job, leave If some linked backup jobs… checked and adjust the timer as needed.

Here’s what a successful job run looks like, with a little scrubbing, anyway:

Highlights and Observations

OK, that was a LOT we went through, very chewy. I have tried to highlight the most important items that I did not find in the B&R manual, including some I already covered above. I am also new to SureBackup myself and hope that if you see any incorrect information or workarounds, you will let me know in the comments or on twitter, specifically the affinity issue with the Proxy Appliance.

  • You need at least a Virtual Lab to create a SureBackup Job. Application Groups are optional, but are a quick way to get started.
  • Application Group VMs are processed in serial in the order specified. A single failure aborts the entire group.
  • If there is no existing role for a VM, you can create your own with an XML file. Existing roles are at %ProgramFiles%\Veeam\Backup and Replication\Backup\SbRoles.
  • Virtual Labs are tied to a single VMHost/Datastore and cannot be attached to clusters.
    • The Proxy Appliance VM is normally powered off so is mostly exempt from DRS. However, it can be moved during an HA event. Veeam does not appear to create an affinity rule to keep it in place. It also doesn’t quite notice when starting up the Virtual Lab that the VM isn’t on the same host as the vSwitch and jobs will continue to fail until you vMotion it back. Hopefully this is something Veeam is addressing; in the meantime I created a DRS rule on my own.
  • Spaces in the proxy appliance name are converted to spaces in the folder name on the datastore; at least one vCheck plugin will alert on this discrepancy between name and folder.
  • Place the Virtual Lab’s Proxy Appliance in the same network as the B&R Backup Server (not the Proxies or the Console, if the Console is separate from the Backup) and masquerade routes are added automatically; if you place it elsewhere, you must manage the routing from the Backup to the Proxy Appliance yourself.
  • Isolated networks are attached to a vSwitch with no uplinks. You should be able to use the same VLANs as you use in production, although someone could add an uplink to it. Adding 500 or 1000 to the VLAN number to put it in a range you don’t use may help prevent accidents.
  • The New Virtual Lab wizard’s Isolated Networks Add dialog does not automatically change the Isolated Network name; you must change it manually.
  • Tests vary depending on the network alignment:
    • If there is an isolated network/vNIC that matches the VMs production network, all tests (Heartbeat, Ping, Script, Verification) are attempted
    • If there is NO isolated network/vNIC matching the VM’s production network, only Heartbeat and Verification tests are attempted.
  • Windows Firewall policies default to block ICMP on “Private” networks, which is how the new Isolated network will be identified. Adjust your policy or Ping tests will fail. The policy is File and Printer Sharing (Echo Request – ICMPv4-In) for the Private profile, double click on it and enable it, or use PowerShell:

Enable-NetFirewallRule -DisplayName "File and Printer Sharing (Echo Request - ICMPv4-In)"
  • After you create the virtual lab, don’t forget to update the lab resources created to add Notes, Tags, and other standard meta-data you use internally.
  • A SureBackup Job can use an Application Group, one or more Linked Jobs, or an Application group AND one or more Linked Jobs.
    • When both are used, Linked Jobs are not processed until the Application Group tests are successful.
  • Keep the application group running after the job completes is missing the word successfully. If the application group tests fail, I observed the group shutting down immediately.
    • You will need to right click on the job and choose Stop Session when you are ready to shut down and delete the VMs.
  • Email notifications only happen on failures; I see no exposed setting to send notifications on success.
  • You cannot delete a lab or application group if a SureBackup job references it. Delete or edit the SureBackup job to remove the reference and try again.
  • You can power on the proxy appliance outside of SureBackup and deploy your own VMs attached to the vSwitch and make sure they get DHCP and are reachable with masquerading.
  • The default user/password for the proxy appliance is root/<proxyname>_r. Any spaces or underscores in the name are converted to hyphens. The default proxy name of Virtual Lab 1 results in the combination root / Virtual-Lab-1_r
  • You can examine the NAT masquerade or static NAT rules on the appliance with the commands iptables -L -n -v && iptables -t nat -L -n -v

Summary

With a lot of reading and a little bit of work, we have created an Application Group, a Virtual Lab with a few networks, and a SureBackup job that can test restores in a private environment. Most of us will have bit more work to do to create additional networks and maybe additional labs, but you should be able to start testing at least a few backups immediately. We can go to sleep a little better tonight knowing that our backups AND restores work! Even if they don’t work for some reason, at least we will find out now, not when we need them most!

I would love to hear any other tips and tricks for using SureBackup. It appears very powerful, but requires a good bit of manual effort. Has anyone made strides in automating it, officially or unofficial? Let me know in the comments or on twitter. Thanks!

Prevent vRealize Orchestrator lockouts

If you have played around with vRealize Orchestrator (and vCenter Orchestrator before it) for long enough, you have undoubtedly locked yourself out at least once, either at the console or in VAMI or both. KB 2069041 details the process to reset the password and it’s simple enough, for the most part. You still have to deal with a lockout period in both the console and VAMI, and since the only user that likely exists is root, it appears to me to be just a way to DoS yourself when you most desperately need access to your vRO. The lockout can be disabled, though.

While looking for the KB to reset the password, I found this article (if anyone knows who fdo is, please let me know, their profile page is blank) which describes how to disable the lockout at the console/ssh. Just edit /etc/pam.d/common-auth and comment out the line containing pam_tally2.so and you can get back in, whether you have changed root’s password or not. However, you cannot get into the VAMI still. Let’s see what else uses pam_tally2.so in the PAM configuration directory:

vro01:/var/log # grep tally /etc/pam.d/*
/etc/pam.d/common-account:account required pam_tally2.so
/etc/pam.d/common-account-vmware.local:account required pam_tally2.so
/etc/pam.d/common-auth:#auth required pam_tally2.so deny=3 onerr=fail even_deny_root unlock_time=86400 root_unlock_time=300
/etc/pam.d/common-auth-vmware.local:#auth required pam_tally2.so deny=3 onerr=fail even_deny_root unlock_time=86400 root_unlock_time=300
/etc/pam.d/vami-sfcb:auth required /lib64/security/pam_tally2.so deny=4 even_deny_root unlock_time=1200 root_unlock_time=1200
/etc/pam.d/vami-sfcb:account required /lib64/security/pam_tally2.so

Winner! There’s 3 different files (two are symlinks) containing that pattern and one has the word vami in it, bingo! Just get in and put a # in front of the auth line (the bolded one) to comment it out and suddenly you’ll be able to log in to the VAMI again. I do not know if this persists across updates, so you may want to revisit this after your next upgrade to be sure – I’ll come back and add a note whenever I do my next update.

You can now no longer DoS yourself, or be DoSed by accidental or malicious coworkers. However, keep in mind that this may violate your corporate standards for security, and that’s a political problem, not a technical one – perhaps in that situation, you can adjust the timers instead of disabling it entirely. I think it’s safe to say that this is perfect for everyone’s lab, though!

PowerCLI, vCheck, and vCenter SSL/TLS secure channel errors

I have been struggling with a number of errors and warnings between PowerCLI and my vCenter servers. The warnings about my self-signed certificates are no big deal, but the errors of course are. The biggest error I have is a well-known issue documented in this vCheck issue on GitHub:

The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel.

This happens intermittently, but frequently with the Get-HardDisk cmdlet which is used in most of the Snapshot related plugins. When it does happen, the vCheck plugin fails to return any meaningful data and normally errors pretty fast – run times for the full set of checks in my environment drops from ~120 minutes to ~8 minutes.

The issue goes back over 3 years and while there were a number of attempts to fix the issue, there was no single fix that worked for everyone, every time. Some would hide the issue till you hit a certain threshold and others would just make it far less likely to occur, but not eliminate it. I eventually opened an issue with VMware support and we found what I think is the solution.

Untrusted Certificates and CAs

I am using the provided certificates for my vCenters. These certificates have an expiry term of 10 years and are signed by a CA also provided by vCenter during the initial install. This is typically known as a self-signed certificate, but more specifically means the cert is not signed by a CA trusted by the client (if it was signed by Verisign but you removed the Verisign CAs from your Trusted CA store, it would be reported as a cert signed by an Untrusted CA and/or a self-signed certificate, depending on the application interfacing with it. I have decided to continue to use these certs as the process for attaching new certificates to a vCenter installation is hairy, to say the least.

This means that when I run Connect-VIServer against my vCenter, I receive the following note about the untrusted CA:

Be sure to use the FQDN to access your vCenter server, or this warning will be swallowed in favor of a “name mismatch” warning.

Generally speaking, most of us don’t care about this error because we are confident that we are connecting to our vCenter server and we tend to ignore this as a cause of problems. I certainly did. I don’t know the specifics surrounding it, but PowerCLI sometimes decides it doesn’t like the Untrusted CA and thus generates the error about Could not establish trust relationship. Sometimes, it’s cool and establishes it just fine. I believe it has something to do with resource exhaustion in tracking the connection, as one of the workarounds suggested on GitHub appeared to work for some by increasing the resources available to a PowerShell session. Regardless of the specifics, connecting to a Trusted CA does not have this issue. So our resolution is to use certificates signed by a Trusted CA!

As suggested above, you can attach new certificates directly from a Trusted CA to your vCenter, but it’s a tricky process. The other alternative is to trust the CA from your vCenter, which we’ll do here. Alternatively, if you want to attach new certs from an already-Trusted CA, check out KB2111219 and any number of blog posts that address this process and skip ahead to the Summary section.

Download and Install the Certificate Bundle

The first step to trusting the vCenter’s included CA is to download the certificate bundle. You can do this by visiting your vCenter on port 443, e.g. https://vcenter.example.com, and clicking on Download trusted root CA certificates:

You will receive a zip file that contains the certs in various formats. Since I’m on Windows, I burrow down to the certs\win directory where there are two CRT files and one CRL. Extract this in a folder somewhere. You only need the CRT that is paired with a CRL; the other CRT is for the ssoserver and that is not something PowerCLI cares about.

In vCenter 6.0, the cert bundle had no directories and just two files ending in .0 and .r0 (now found in the lin and mac directories) that correspond to .crt and .crl respectively, so just extract and rename the files if you that’s the case.

Now, we need to access the certificate store. This varies per OS and version. In Windows 7, you can find the store inside the Internet Options control panel on the Content tab by clicking the Certificates button. Click over to the Trusted Root Certification Authorities tab.

Click the Import button and browse to the CRT you stored earlier. When you import it, you’ll see the name CA – if you see ssoserver, you chose the wrong CRT file, try again with the other. You can now click on the imported CA called CA and click View to see the name. This is important when you have more than one vCenter, as they all import with the name CA, because that’s not confusing! You can see here this is the CA from my vCenter server called vcsa.nelson.va:

You want to repeat this process on any and all nodes that will use PowerCLI to connect to the vCenter in question, not just the server you run vCheck from.

Summary

With either your new certs or the new trust with the existing CA, you shouldn’t see the warning upon accessing your vCenter with Connection-VIServer. Close your PowerShell/PowerCLI sessions run that inside a brand new session and if you did things correctly, you will not see any yellow warning text:

When you run vCheck now, you should no longer see those random SSL/TLS errors! If you disabled some checks, like Phantom Snapshots, because they failed more often than they ran, this is a good time to review if you want to re-enable them. I hope this helps.

I will warn that this solution has only been tested for about a month, but I saw error rates drop from 70% to 0%. I could NOT get the errors to occur with the CA in place, but they would come back the moment I removed the CA. If you see the error return, please let me know in the comments or on twitter and I’ll be glad to share the ticket number reference for engaging support!

Many thanks to Isaac at VMware for this solution, and especially his insistence that I should import the CA even though I swore that couldn’t be the problem 🙂

Upgrade VCSA 6.0u3 to VCSA 6.5u1

Today, I upgraded a vCenter appliance on 6.0u3 to 6.5u1. I had been waiting for this forever as we wanted to get to 6.5, but had erroneously missed a line in the 6.0u3 release notes that said it could not be upgraded to 6.5! Happily, 6.5 Update 1 remedied that, so away we go!

You cannot use VAMI to do major/minor upgrades, only point releases (Update X) and patches, so you must download the new ISO and use the installer. You can find the ISO here and some great instructions on the installer in Mike Tabor’s Upgrade vCenter Server Appliance 6.0 to 6.5 article. The installer itself is pretty foolproof and Mike’s article addresses most ambiguities, so I just want to detail a few things I ran into that may help others.

  • Download the ISO before the change window begins, not after. That can be a problem, or so I’ve heard 😀
  • Turn off DRS during the upgrade. It’s mentioned in step 15 and in a warning in the installer itself, but I think it’s better to disable it before you get to that step, just in case DRS kicks in between when you start and that step.
  • The process involves a temporary IP for the new VCSA VM, so the old and new can be online simultaneously to transfer data. Add the temporary IP to any firewall rules involving the existing VCSA! If you do not do this, you may run into an error when stage 1 ends and the installer cannot reach the VAMI interface on the temporary VCSA. If you forget, you can proceed with Stage 2 at the URL specified, though you do have to enter a lot of auth information again:

  • If you have an external VUM, you need to either start the Migration Assistant on it or disable the extension com.vmware.vcIntegrity or the installer will not get started. I chose to disable the extension as the end goal was to use the new internal VUM service.
  • The password policy has changed, so you may not be able to keep the same root password for the new appliance.
  • For Stage 2, Mike very optimistically says “after a few minutes the vCenter Server Appliance upgrade should complete.” With just 2GB of data to migrate, it still took close to 45 minutes, and some individual steps seemed hung for close to 10 minutes at a time. Don’t worry if it takes a while, as long as you’re seeing progress overall.

After performing the upgrade, you’ll surely have other tasks, such as updating extensions like vRO and vROps, so don’t delete any snapshots right away in case something goes awry.

Using PowerCLI from the PowerShell Gallery

As you’ve surely seen, I love me some PowerCLI. So I was really happy when I saw that PowerCLI is now available on the PowerShell Gallery! What this means is that it is no longer a package you install on a server, it’s a set of modules you load from the gallery. When there’s a new version available, you just go get it. Because it’s now a bunch of files, not only do you not need to go to vmware.com to find the download link, you can also install it without requiring administrative access! That’s pretty awesome when you’re a tenant on a system, and it’s pretty awesome for the owners of the system, too (no needing to punt all your PowerCLI users so the files aren’t locked during an upgrade). I fill both roles from time to time, so I’m really happy about this improvement! Read more about the change in this VMware PowerCLI Blog article by Kyle Ruddy.

The article will guide you through the setup just fine, so I won’t dwell on that part very much, but if you’ve followed my PowerShell Profile article, there’s one small change to make: uninstall the old version of PowerCLI, then edit your posh profiles with notepad $profile and remove whatever version of the profile you used. Leave anything else you have added and close it out. Remember to do this once in PowerShell and once in PowerShell ISE if you use both.

Before proceeding, make sure TLS 1.2 is enabled. Even through Powershell 5.x, TLS 1.2 is not enabled by default. You may solve this 3 ways, depending on your access rights and need to preserve < TLS 1.2:

  • Run the following commands in an administrator powershell prompt, this adds TLS 1.2:
Set-ItemProperty -Path 'HKLM:\SOFTWARE\Wow6432Node\Microsoft\.NetFramework\v4.0.30319' -Name 'SchUseStrongCrypto' -Value '1' -Type DWord
Set-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\.NetFramework\v4.0.30319' -Name 'SchUseStrongCrypto' -Value '1' -Type DWord
  • Run the following command at the beginning of your session to ONLY allow TLS 1.2:
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
  • Add the above command to your profile(s) to have it automatically run per session:
'[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12' | Out-File -FilePath $profile -Append

Now, install the modules as the blog article recommends.

Find-Module -Name VMware.PowerCLI -Repository PSGallery
Install-Module -Name VMware.PowerCLI -Scope CurrentUser -AllowClobber -Repository PSGallery

That’s it, you’re done! The modules will automatically be loaded as needed. You should be able to start typing Connect-VIServer and see autocomplete working by tabbing it out in regular PowerShell, in the typeahead dialog in ISE, or however your PowerShell UI displays it. If you hit enter, the containing sub-modules are loaded immediately on-demand. You can import the entire suite of modules with Import-Module VMware.PowerCLI,in your profile if you’d like, but it adds about 10 seconds to PowerShell startup on my laptop for minimal gain vs on-demand loading. However, it does give you the look of the old PowerCLI desktop shortcut, if you so desire it.

When upgrading to PowerCLI v10+, you may need to add -SkipPublisherCheck due to differences in the issuer information that PowerShell caches.

If, for some reason, the module is not found by PowerShell after installation, check out the value of $env:PSModulePath. It should include %USERPROFILE%\Documents\WindowsPowerShell\Modules, e.g. C:\Users\rnelson0\Documents\WindowsPowerShell\Modules, which is where Install-Module installs the files to. If it does not, you’ll need to modify it. Mine was funky because I apparently edited the environment variable portion of my windows install, even though I don’t remember it.

To keep up with PowerCLI from the Gallery, just run Update-Module -Name VMware.PowerCLI once in a while. Easy peasy. Enjoy!

Making the Puppet vRealize Automation plugin work with vRealize Orchestrator

I’m pretty excited about this post! I’ve been building up Puppet for vSphere Admins for a few years now but the final integration aspects between Puppet and vSphere/vCenter were always a little clunky and difficult to maintain without specific dedication to those integration components. Thanks to Puppet and VMware, that’s changed now.

Puppet announced version 2.0 of their Puppet Plugin for vRealize Automation this week. There’s a nice video attached, but there’s one problem – it’s centered on vRealize Automation (vRA) and I am working with vRealize Orchestrator (vRO)! vRO is included with all licenses of vCenter, whereas vRA is a separate product that costs extra, and even though vRA requires a vRO engine to perform a lot of its work, it abstracts a lot of the configuration and implementation details away that vRO-only users need to care about. This means that much of the vRA documentation and guides you find, for the Puppet plugin or otherwise, are always missing some of the important details needed to implement the same functionality – and sometimes won’t work at all if it relies on vRA functionality not available to us.

Don’t worry, though, the Puppet plugin DOES work with vRO! We’ll look at a few workflows to install, run, and remove puppet from nodes and then discuss how we can use them within larger customized workflows. You must already have an installed vRealize Orchestrator 7.x instance configured to talk to your vCenter instance. I’m using vRO 7.0.0 with vCenter 6.0. If you’re using a newer version, some of the dialogs I show may look a little different. If you’re still on vRO 6.x, the configuration will look a LOT different (you may have to do some research to find the equivalent functionality) but the workflow examples should be accurate.

Puppet provides a User Guide for use with a reference implementation. I’ll be mostly repeating Part 2 when installing and configuring, but reality tends to diverge from reference so we’ll explore some non-reference details as well.

Continue reading

Getting started with vCheck

If you use vSphere and particularly vCenter, you’re probably at least familiar in passing with PowerCLI, a package of snap-ins and modules for PowerShell. This is my preferred language for interacting with the vSphere/vCenter APIs, since it has (IMO) the best documentation of the available languages and API SDKs. If not, I recommend downloading it and playing with it, it can really help you automate many of your repetitive tasks with less Flash and less right clicking.

One of the most popular tools built with PowerCLI is vCheck. It’s a framework for running a number of checks against your vSphere infrastructure and determining what operational issues are present – something every Ops team needs. It won’t replace a monitoring system such as vROps or even Nagios, but it augments such systems very well. For example, it can report on VMs that have ISOs attached, or where snapshots have been present for more than 7 days – issues that probably aren’t worth paging anyone out for, but need to be dealt with eventually. Many of us have built some homegrown solutions for this, maybe even with PowerCLI, but it is difficult to beat a tool designed to meet the needs of a large percentage of vSphere users, is actively developed by VMware employees, and is a framework that you can extend with instance-specific needs. You can always run your tools and vCheck together, too.

Let’s take a look at vCheck and how to get started with it today. We’ll download it, configure it, schedule it as a daily task, review how to enable and disable checks, and store your configuration in version control. This provides a solid base that you can tweak until it fits your specific needs just right.

Continue reading