Auto Deploy Deep Dive, Part 4: Troubleshooting

Part 4 of the Auto Deploy Deep Dive Series details some of the issues I encountered along the way and how to troubleshoot them.

Troubleshooting

PXE

In a VLAN’ed environment (most production environments and some home labs), the native VLAN and the destination VLAN for your host may be separate. You can of course boot the host on a port with no VLANs, then move/reconfigure the port to have the correct VLANs, but I would suggest entering the PXE manager at boot and setting the proper VLAN to eliminate possible failure points. If not, you may run into this situation…

Switch Configuration

Ensure that your host will be able to gain a valid IP address when the network management moves from the PXE kernel to the ESXi networking. For example, I have a TP-Link switch with the following two port configs:

interface gigabitEthernet 1/0/1
  switchport mode trunk
  switchport trunk allowed vlan 2,5,10
  description "ESXI02 eth0"
interface gigabitEthernet 1/0/4
  switchport access vlan 2

The host was plugged into Gi 1/0/1 to start. The PXE kernel can grab DHCP on either port, as it falls back to the native VLAN (ID 1), which the Trend-Net magically carries out the uplink port (the Trend-Net is new to my lab as well, and it appears to treat VLANs slightly differently than I expected based on Cisco and Juniper experience). However, once ESXi was installed, vmnic0 would see three VLANs and did not choose the right one (VLAN 2) to request DHCP. The host failed to gain a network address on the trunk interface, halting rule processing after the first Auto Deploy rule. I had to move the cable to Gi 1/0/4. Be aware of differences between the PXE and ESXi networking stacks and configuration that could cause issues if you’re connected to a port in trunk mode.

I did not have the switch configured so that I could sniff the ESXi traffic to see what was happening, and now that it’s in production I’m unlikely to tear it down to see what caused this. I suspect it has to do with my unfamiliarity with Trend-Net (if anyone knows the answer, please let me know). However, if you experience any issues with DHCP or networking, perhaps this will help point you in the right direction.

TFTP

If the PXE loader isn’t finding a TFTP image, your first step is to use a TFTP client and download a file manually. I have a TFTP client installed on my Linux host, so I used that to download the file ‘tramp’ from the auto deploy package we deployed in the first article. Here was the output:

[user@server ~]$ tftp
(to) 10.0.0.242
tftp> get tramp
tftp> quit
[user@server ~]$ ls -l tramp
-rw-rw-r--. 1 user user 102 Dec 31 18:35 tramp

If it still isn’t working, check that the TFTP service is running and that a firewall rule allows it. If that’s not it, it’s time to review the logs on the TFTP server and/or get out Wireshark and tcpdump!
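If no TFTP client is handy, the same manual check can be approximated in a few lines of Python. This is only a rough sketch, not part of the Auto Deploy tooling: the function names are my own, and the server address and file name in the usage note are the ones from the session above. It sends a single read request as defined in RFC 1350 and reports what the first reply looks like.

```python
import socket
import struct

RRQ, DATA, ERROR = 1, 3, 5  # TFTP opcodes from RFC 1350


def build_rrq(filename, mode="octet"):
    """Build a read request: opcode 1, filename, NUL, transfer mode, NUL."""
    return (struct.pack("!H", RRQ)
            + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def parse_reply(packet):
    """Summarise the first packet the server sends back."""
    if len(packet) < 4:
        return "short packet"
    opcode, number = struct.unpack("!HH", packet[:4])
    if opcode == DATA:
        return "DATA block %d, %d bytes" % (number, len(packet) - 4)
    if opcode == ERROR:
        message = packet[4:].split(b"\0", 1)[0].decode("ascii", "replace")
        return "ERROR %d: %s" % (number, message)
    return "unexpected opcode %d" % opcode


def probe(server, filename, timeout=3.0):
    """Send one read request to UDP port 69 and describe the first reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (server, 69))
        try:
            packet, _ = sock.recvfrom(2048)
        except socket.timeout:
            return "no reply (service down or firewall?)"
    return parse_reply(packet)
```

Calling probe("10.0.0.242", "tramp") should report a DATA block if the server is healthy. The sketch never acknowledges the block, so the server will retransmit a few times and then give up; that is fine for a quick probe but it is not a real download.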

DHCP Options

There are two common issues with DHCP: you’re not getting an address, or you verified that TFTP is working but the PXE loader isn’t getting an image.

For the first, make sure the server can get a DHCP address. Check that the DHCP server is started and that some other client can get an address – your laptop or tablet, for example – when not using PXE. On the new host, check the PXE settings and switch port config, as described above under Switch Configuration, if it has a problem getting an address.

If the host can get an address but not the image, the next step is to ensure the DHCP options are working properly. Since my DHCP server is Linux, I’ll use tcpdump to see the output on the command line. If you’re using Windows, Wireshark will show the options in the packet contents. Here’s what a working PXE boot looks like:

[root@server ~]# tcpdump -nni eth0 -vvv -s 0 port bootps or port bootpc
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:04:02.132204 IP (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 576)
    0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from 74:86:7a:e5:27:ce, length 548, xid 0x7be527ce, secs 4, Flags [Broadcast] (0x8000)
          Client-Ethernet-Address 74:86:7a:e5:27:ce
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: Discover
            Parameter-Request Option 55, length 24:
              Subnet-Mask, Time-Zone, Default-Gateway, IEN-Name-Server
              Domain-Name-Server, RL, Hostname, BS
              Domain-Name, SS, RP, EP
              Vendor-Option, Server-ID, Vendor-Class, BF
              Option 128, Option 129, Option 130, Option 131
              Option 132, Option 133, Option 134, Option 135
            MSZ Option 57, length 2: 1260
            GUID Option 97, length 17: 0.68.69.76.76.49.0.16.50.128.67.199.192.79.66.90.49
            ARCH Option 93, length 2: 0
            NDI Option 94, length 3: 1.2.1
            Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
            END Option 255, length 0
            PAD Option 0, length 0, occurs 212
16:04:02.132666 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 348)
    10.0.0.251.67 > 255.255.255.255.68: [udp sum ok] BOOTP/DHCP, Reply, length 320, xid 0x7be527ce, secs 4, Flags [Broadcast] (0x8000)
          Your-IP 10.0.0.245
          Server-IP 10.0.0.242
          Client-Ethernet-Address 74:86:7a:e5:27:ce
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: Offer
            Server-ID Option 54, length 4: 10.0.0.251
            Lease-Time Option 51, length 4: 21600
            Subnet-Mask Option 1, length 4: 255.255.255.0
            Time-Zone Option 2, length 4: -18000
            Default-Gateway Option 3, length 4: 10.0.0.1
            Domain-Name-Server Option 6, length 4: 10.0.0.251
            Domain-Name Option 15, length 9: "nelson.va"
            BF Option 67, length 27: "undionly.kpxe.vmw-hardwired"
            END Option 255, length 0
16:04:06.196541 IP (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 576)
    0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from 74:86:7a:e5:27:ce, length 548, xid 0x7be527ce, secs 4, Flags [Broadcast] (0x8000)
          Client-Ethernet-Address 74:86:7a:e5:27:ce
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: Request
            Requested-IP Option 50, length 4: 10.0.0.245
            Parameter-Request Option 55, length 24:
              Subnet-Mask, Time-Zone, Default-Gateway, IEN-Name-Server
              Domain-Name-Server, RL, Hostname, BS
              Domain-Name, SS, RP, EP
              Vendor-Option, Server-ID, Vendor-Class, BF
              Option 128, Option 129, Option 130, Option 131
              Option 132, Option 133, Option 134, Option 135
            MSZ Option 57, length 2: 1260
            Server-ID Option 54, length 4: 10.0.0.251
            GUID Option 97, length 17: 0.68.69.76.76.49.0.16.50.128.67.199.192.79.66.90.49
            ARCH Option 93, length 2: 0
            NDI Option 94, length 3: 1.2.1
            Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
            END Option 255, length 0
            PAD Option 0, length 0, occurs 200
16:04:06.196774 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 348)
    10.0.0.251.67 > 255.255.255.255.68: [udp sum ok] BOOTP/DHCP, Reply, length 320, xid 0x7be527ce, secs 4, Flags [Broadcast] (0x8000)
          Your-IP 10.0.0.245
          Server-IP 10.0.0.242
          Client-Ethernet-Address 74:86:7a:e5:27:ce
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: ACK
            Server-ID Option 54, length 4: 10.0.0.251
            Lease-Time Option 51, length 4: 21600
            Subnet-Mask Option 1, length 4: 255.255.255.0
            Time-Zone Option 2, length 4: -18000
            Default-Gateway Option 3, length 4: 10.0.0.1
            Domain-Name-Server Option 6, length 4: 10.0.0.251
            Domain-Name Option 15, length 9: "nelson.va"
            BF Option 67, length 27: "undionly.kpxe.vmw-hardwired"
            END Option 255, length 0
16:04:13.710800 IP (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 418)
    0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from 74:86:7a:e5:27:ce, length 390, xid 0x7ae527ce, secs 4, Flags [none] (0x0000)
          Client-Ethernet-Address 74:86:7a:e5:27:ce
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: Discover
            MSZ Option 57, length 2: 1472
            ARCH Option 93, length 2: 0
            NDI Option 94, length 3: 1.2.1
            Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
            CLASS Option 77, length 4: "iPXE"
            Parameter-Request Option 55, length 13:
              Subnet-Mask, Default-Gateway, Domain-Name-Server, LOG
              Hostname, Domain-Name, RP, Vendor-Option
              Vendor-Class, TFTP, BF, Option 175
              Option 203
            T175 Option 175, length 48: 2969895188,3826671384,16851713,19005697,419496225,16846849,34799873,335610129,16902915,16777239,16848129,17957121
            Client-ID Option 61, length 7: ether 74:86:7a:e5:27:ce
            GUID Option 97, length 17: 0.68.69.76.76.49.0.16.50.128.67.199.192.79.66.90.49
            END Option 255, length 0
16:04:13.710946 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 342)
    10.0.0.251.67 > 10.0.0.245.68: [udp sum ok] BOOTP/DHCP, Reply, length 314, xid 0x7ae527ce, secs 4, Flags [none] (0x0000)
          Your-IP 10.0.0.245
          Server-IP 10.0.0.242
          Client-Ethernet-Address 74:86:7a:e5:27:ce
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: Offer
            Server-ID Option 54, length 4: 10.0.0.251
            Lease-Time Option 51, length 4: 21600
            Subnet-Mask Option 1, length 4: 255.255.255.0
            Default-Gateway Option 3, length 4: 10.0.0.1
            Domain-Name-Server Option 6, length 4: 10.0.0.251
            Domain-Name Option 15, length 9: "nelson.va"
            BF Option 67, length 27: "undionly.kpxe.vmw-hardwired"
            END Option 255, length 0
16:04:14.654766 IP (tos 0x0, ttl 64, id 2, offset 0, flags [none], proto UDP (17), length 418)
    0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from 74:86:7a:e5:27:ce, length 390, xid 0x7ae527ce, secs 10, Flags [none] (0x0000)
          Client-Ethernet-Address 74:86:7a:e5:27:ce
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: Discover
            MSZ Option 57, length 2: 1472
            ARCH Option 93, length 2: 0
            NDI Option 94, length 3: 1.2.1
            Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
            CLASS Option 77, length 4: "iPXE"
            Parameter-Request Option 55, length 13:
              Subnet-Mask, Default-Gateway, Domain-Name-Server, LOG
              Hostname, Domain-Name, RP, Vendor-Option
              Vendor-Class, TFTP, BF, Option 175
              Option 203
            T175 Option 175, length 48: 2969895188,3826671384,16851713,19005697,419496225,16846849,34799873,335610129,16902915,16777239,16848129,17957121
            Client-ID Option 61, length 7: ether 74:86:7a:e5:27:ce
            GUID Option 97, length 17: 0.68.69.76.76.49.0.16.50.128.67.199.192.79.66.90.49
            END Option 255, length 0
16:04:14.655044 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 342)
    10.0.0.251.67 > 10.0.0.245.68: [udp sum ok] BOOTP/DHCP, Reply, length 314, xid 0x7ae527ce, secs 10, Flags [none] (0x0000)
          Your-IP 10.0.0.245
          Server-IP 10.0.0.242
          Client-Ethernet-Address 74:86:7a:e5:27:ce
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: Offer
            Server-ID Option 54, length 4: 10.0.0.251
            Lease-Time Option 51, length 4: 21600
            Subnet-Mask Option 1, length 4: 255.255.255.0
            Default-Gateway Option 3, length 4: 10.0.0.1
            Domain-Name-Server Option 6, length 4: 10.0.0.251
            Domain-Name Option 15, length 9: "nelson.va"
            BF Option 67, length 27: "undionly.kpxe.vmw-hardwired"
            END Option 255, length 0
16:04:16.631904 IP (tos 0x0, ttl 64, id 3, offset 0, flags [none], proto UDP (17), length 430)
    0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from 74:86:7a:e5:27:ce, length 402, xid 0x7ae527ce, secs 14, Flags [none] (0x0000)
          Client-Ethernet-Address 74:86:7a:e5:27:ce
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: Request
            MSZ Option 57, length 2: 1472
            ARCH Option 93, length 2: 0
            NDI Option 94, length 3: 1.2.1
            Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
            CLASS Option 77, length 4: "iPXE"
            Parameter-Request Option 55, length 13:
              Subnet-Mask, Default-Gateway, Domain-Name-Server, LOG
              Hostname, Domain-Name, RP, Vendor-Option
              Vendor-Class, TFTP, BF, Option 175
              Option 203
            T175 Option 175, length 48: 2969895188,3826671384,16851713,19005697,419496225,16846849,34799873,335610129,16902915,16777239,16848129,17957121
            Client-ID Option 61, length 7: ether 74:86:7a:e5:27:ce
            GUID Option 97, length 17: 0.68.69.76.76.49.0.16.50.128.67.199.192.79.66.90.49
            Server-ID Option 54, length 4: 10.0.0.251
            Requested-IP Option 50, length 4: 10.0.0.245
            END Option 255, length 0
16:04:16.632153 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 342)
    10.0.0.251.67 > 10.0.0.245.68: [udp sum ok] BOOTP/DHCP, Reply, length 314, xid 0x7ae527ce, secs 14, Flags [none] (0x0000)
          Your-IP 10.0.0.245
          Server-IP 10.0.0.242
          Client-Ethernet-Address 74:86:7a:e5:27:ce
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: ACK
            Server-ID Option 54, length 4: 10.0.0.251
            Lease-Time Option 51, length 4: 21600
            Subnet-Mask Option 1, length 4: 255.255.255.0
            Default-Gateway Option 3, length 4: 10.0.0.1
            Domain-Name-Server Option 6, length 4: 10.0.0.251
            Domain-Name Option 15, length 9: "nelson.va"
            BF Option 67, length 27: "undionly.kpxe.vmw-hardwired"
            END Option 255, length 0

In the DHCP responses, you’ll see that the ‘Server-IP’ value is that of ‘next-server’ and that ‘BF Option 67’ is ‘bootfile-name’. When I was troubleshooting the ‘next-server’ vs. ‘tftp-server-name’ DHCP directive issue, this tcpdump command came in very handy.
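A capture this long is tedious to read by eye, so here is a small, hedged sketch that pulls out just the two values that matter from saved tcpdump text. It assumes the output has the same ‘-vvv’ layout as the capture above; the function names are illustrative only, and other tcpdump versions may format these lines differently.

```python
import re

# Patterns for the three lines of interest in 'tcpdump -vvv' DHCP output.
SERVER_IP = re.compile(r"^\s*Server-IP\s+(\S+)")
MESSAGE = re.compile(r"DHCP-Message Option 53, length 1: (\w+)")
BOOTFILE = re.compile(r'BF Option 67, length \d+: "([^"]*)"')


def summarise_replies(capture):
    """Return one dict per server reply: message type, next-server, bootfile."""
    replies = []
    current = None
    for line in capture.splitlines():
        if "BOOTP/DHCP," in line:
            # A new packet starts here; only server replies are tracked.
            if "BOOTP/DHCP, Reply" in line:
                current = {"type": None, "next_server": None, "bootfile": None}
                replies.append(current)
            else:
                current = None
            continue
        if current is None:
            continue
        for key, pattern in (("next_server", SERVER_IP),
                             ("type", MESSAGE),
                             ("bootfile", BOOTFILE)):
            match = pattern.search(line)
            if match:
                current[key] = match.group(1)
    return replies


def missing_pxe_fields(reply):
    """List the PXE-relevant fields that a reply did not carry."""
    return [key for key in ("next_server", "bootfile") if not reply[key]]
```

Redirect the tcpdump output to a file, read it in, and pass the text to summarise_replies. For the working capture above, every Offer and ACK should come back with next_server 10.0.0.242 and bootfile undionly.kpxe.vmw-hardwired; any reply for which missing_pxe_fields returns something is where I would start looking.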

Host Profiles

In this tutorial, a host profile was applied to the new host. It was also attached to a cluster. You may find that the incorrect host profile is applied upon completion. Check to see where your existing host profiles are applied. If a profile is applied at the cluster level, applying a profile to a host will fail silently, with no error. In most cases, you will want all hosts in a cluster to have the same profile applied for compliance. However, in some cases – like a home lab – you may have a cluster with great variance among its members. Detach the profile from the cluster and attach the correct profiles to the correct members individually. Reboot your new host, PXE boot again, and it should now receive the correct host profile.

You may run into other issues where the host profile does not do what you want it to do. Host Profiles are extremely complex and finicky. The sample profile we used has everything unchecked except for stateful caching. This is going to work everywhere because there is nothing specific in the profile. In a production environment, you will want to use a profile with many more settings enabled, especially to apply proper networking and storage settings. You may encounter issues with the profiles if device-specific entries from the reference host are unavailable or have different values on the target host (e.g. PCI slot numbers between my T110 II and T320 are vastly different, which generated plenty of errors). I even got stuck in a nasty PXE boot/Load ESXi/reboot/PXE boot/Load ESXi loop at one point, for no discernible reason. Start disabling other profile items, starting with the hardware- or physical network-dependent settings, and try again. Keep narrowing it down; it may take a large number of iterations to get things right.

Don’t forget to then check the host for compliance. Even if it autodeploys properly, it may still be out of compliance! Repeat until perfect, or at least close enough – I’ve extracted the profile from a host and applied it to itself, after which it was determined to be non-compliant to itself! It may make more sense to use vCO or PowerCLI to automate some of the final details in 30 seconds rather than spend another day of long PXE-boots getting rid of that last compliance error.

There are also a few bugs in Host Profiles and Auto Deploy to be aware of. Recently, two issues floated across Planet v12n: vMotion not being enabled on vmkernel ports, and the default gateway being lost after a reboot (KB 2032817). They’re not the only ones, so always check the KB before you drive yourself insane.

Just keep in mind that Host Profiles and Auto Deploy are imperfect, like all other software. Make it work for you and don’t be afraid to enhance it with other automation tools. Remember xkcd’s Automation and Is it worth the time? strips and allocate your time properly. The only right answer is the one that’s actually saving you time.

Bonus Points

For bonus points, I added a TFTP server to the Linux VM for testing. I decided not to stick with it because in the long run, I’d move Auto Deploy and TFTP to the vCenter VM once the kinks were worked out. In production, I’d probably also use a Windows DHCP service, which entirely obviates the need for the Linux VM. However, if you are using the VCSA and want to stay as Windows-free as possible (Auto Deploy IS available with VCSA!), perhaps this will help.

Using CentOS 6.4, providing the TFTP service is fairly simple. Install the service itself with ‘yum install tftp-server’, which has a pre-requisite of xinetd as well. Once installed, you’ll need to enable the tftp service (or more properly, un-disable it – the syntax is really horrible if you ask me) in the file /etc/xinetd.d/tftp, and then start the xinetd service. To un-disable it, change the ‘disable’ line below from a yes to a no:

[root@server ~]# cat /etc/xinetd.d/tftp
# default: off
# description: The tftp server serves files using the trivial file transfer 
#       protocol.  The tftp protocol is often used to boot diskless 
#       workstations, download configuration files to network-aware printers, 
#       and to start the installation process for some operating systems.
service tftp
{
        socket_type             = dgram
        protocol                = udp
        wait                    = yes
        user                    = root
        server                  = /usr/sbin/in.tftpd
        server_args             = -s /var/lib/tftpboot
        disable                 = yes
        per_source              = 11
        cps                     = 100 2
        flags                   = IPv4
}
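If you would rather script that one-line change than open an editor, something like the following would do it. It is a minimal sketch with a function name of my own choosing, and it assumes the file uses the stock ‘disable = yes’ layout shown above.

```python
import re

# Matches a whole 'disable = yes' line, keeping its indentation and padding.
DISABLE_LINE = re.compile(r"^([ \t]*disable[ \t]*=[ \t]*)yes[ \t]*$", re.MULTILINE)


def enable_service(config_text):
    """Return the xinetd service definition with 'disable = yes' set to 'no'."""
    return DISABLE_LINE.sub(r"\g<1>no", config_text)
```

Read /etc/xinetd.d/tftp, pass its contents through enable_service, and write the result back as root. Running it a second time changes nothing, since there is no longer a ‘yes’ to replace.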

Start the service with ‘service xinetd start’. You can check to make sure tftp is running with ‘chkconfig’ – for some reason, chkconfig expands the status of xinetd services but ‘service xinetd status’ does not. Regardless, make sure it is enabled and you’re set:

[root@server ~]# chkconfig
<snip>
xinetd          0:off   1:off   2:off   3:on    4:on    5:on    6:off

xinetd based services:
        chargen-dgram:  off
        chargen-stream: off
        daytime-dgram:  off
        daytime-stream: off
        discard-dgram:  off
        discard-stream: off
        echo-dgram:     off
        echo-stream:    off
        tcpmux-server:  off
        tftp:           on
        time-dgram:     off
        time-stream:    off

Expand the deploy-tftp.zip file contents into /var/lib/tftpboot (url: https://servername:6502/vmw/rbd/deploy-tftp.zip). You can change that location if you prefer by editing the ‘server_args’ line of /etc/xinetd.d/tftp with ‘ -s /some/other/path’.

Your last step is to change the ‘next-server’ option in dhcpd.conf to your Linux VM’s IP address. If you only have one DHCP server and the DHCP and TFTP server are the same, you can remove the statement entirely as it is only needed when they are not the same. If you have multiple DHCP servers or the services are split, then you have to be explicit. I prefer to be explicit whenever possible, since self-documentation is usually helpful.

Wrap Up

By following along this far, not only are you able to create an Auto Deploy infrastructure and deploy new VMhosts in a matter of minutes, but you also have some good troubleshooting fundamentals to deal with any hiccups, which are bound to happen if you scale this example up in a production environment. If you have any other tips and tricks for troubleshooting, or you think I missed something important, please use the comments for feedback. Check here if you’ve missed any other parts of this series and tune in to my live vBrownBag session on the US broadcast on 2014-04-16! Thanks!
