I’ve written before about the importance of hypothesis-driven troubleshooting. The hypothesis is, of course, very important. So is the testing methodology. Let’s talk about positive controls. A positive control is a test of something that you expect to behave a certain way, and it does. Positive controls help prove that your assumptions about the system (your world-view) are correct. When a positive control fails, it’s either because of user error or because we have a poor understanding of the system and need to redefine our positive controls. Validating the positive controls ensures that we spend our time testing valid assumptions.
In the context of IT, positive controls are often equivalent to our “baseline” measurements – but we also test our positive controls “at runtime” to ensure the historical measures are still accurate in the present. Today, we’ll use ping tests (ICMP) for positive controls, because it’s a simple model that everyone understands.
A user contacts you and says they cannot access a site you support. They can’t ping it and they can’t traceroute it. Your intuition leads you to form a hypothesis that their remote office’s router is experiencing an issue. Sounds probable, anyway. You need to test the hypothesis, and the first inclination is to tell the customer to ping their router. You might get something like this:
    C:\>ping 10.0.0.1
    Pinging 10.0.0.1 with 32 bytes of data:
    Reply from 10.0.0.201: Destination host unreachable.
    Reply from 10.0.0.201: Destination host unreachable.
    Reply from 10.0.0.201: Destination host unreachable.
    Reply from 10.0.0.201: Destination host unreachable.
    Ping statistics for 10.0.0.1:
        Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Ah ha, problem solved! You have the user reboot the router… and their problem persists. Crap.
Establishing Positive Controls
There could be any number of reasons for the router not to respond to ICMP. Let’s establish some positive controls to test our understanding of the present situation. First, let’s have the user ping themselves. This should absolutely work if they are on the network:
    C:\>ping 10.0.0.201
    Pinging 10.0.0.201 with 32 bytes of data:
    Reply from 10.0.0.201: bytes=32 time<1ms TTL=128
    Reply from 10.0.0.201: bytes=32 time<1ms TTL=128
    Reply from 10.0.0.201: bytes=32 time<1ms TTL=128
    Reply from 10.0.0.201: bytes=32 time<1ms TTL=128
Well, that looks much better, at least we’re getting responses from ourselves! If there were no response or an error, we’d know that the assumption of “the user is on the network” is invalid. Let’s assume there’s another node on the network that we know is responsive, say a printer, and use that as a positive control to ensure that the network itself is working.
    C:\>ping 10.0.0.3
    Pinging 10.0.0.3 with 32 bytes of data:
    Reply from 10.0.0.3: bytes=32 time<1ms TTL=64
    Reply from 10.0.0.3: bytes=32 time<1ms TTL=64
    Reply from 10.0.0.3: bytes=32 time<1ms TTL=64
    Reply from 10.0.0.3: bytes=32 time<1ms TTL=64
Again, we’re looking good. A failure here might indicate a local network issue – user is in the wrong VLAN or the switches are messed up.
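One thing worth noticing in the ping outputs in this article is that the failed router ping still reports "Received = 4, Lost = 0 (0% loss)", because the "Destination host unreachable" lines count as replies. If you ever script these controls, the summary line alone can mislead you. Here is a minimal Python sketch that only counts genuine echo replies. It assumes English-language Windows ping output in the format shown in this article; the function names `echo_replies` and `control_passed` are made up for illustration, and other operating systems or locales would need a different pattern.

```python
import re

# A genuine echo reply in this output format carries bytes=, time and TTL= fields.
# "Destination host unreachable" lines do not, so they are ignored.
_ECHO_REPLY = re.compile(r"Reply from (\S+): bytes=\d+ time[<=]\d+ms TTL=\d+")


def echo_replies(ping_output: str) -> list[str]:
    """Return the source address of every genuine echo reply in the output."""
    return _ECHO_REPLY.findall(ping_output)


def control_passed(ping_output: str, target: str) -> bool:
    """A ping control passes only if the target itself answered at least once."""
    return target in echo_replies(ping_output)


# Sample outputs copied from the examples in this article.
ROUTER_OUTPUT = """\
Pinging 10.0.0.1 with 32 bytes of data:
Reply from 10.0.0.201: Destination host unreachable.
Reply from 10.0.0.201: Destination host unreachable.
Reply from 10.0.0.201: Destination host unreachable.
Reply from 10.0.0.201: Destination host unreachable.
Ping statistics for 10.0.0.1:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
"""

SELF_OUTPUT = """\
Pinging 10.0.0.201 with 32 bytes of data:
Reply from 10.0.0.201: bytes=32 time<1ms TTL=128
Reply from 10.0.0.201: bytes=32 time<1ms TTL=128
Reply from 10.0.0.201: bytes=32 time<1ms TTL=128
Reply from 10.0.0.201: bytes=32 time<1ms TTL=128
"""

print(control_passed(SELF_OUTPUT, "10.0.0.201"))  # True
print(control_passed(ROUTER_OUTPUT, "10.0.0.1"))  # False
```

Checking that the reply came from the target, and not just that some reply arrived, is what separates the self-ping control from the failed router ping here.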
Real World Application
If we had established these positive controls before we started, we would have known for sure that the LAN was not the issue. Had the positive controls failed instead, the problem would have been on the LAN, and rebooting the router would have accomplished nothing while possibly affecting other users unnecessarily. If that router had some important changes that weren’t saved, the reboot may have made things worse!
Hopefully you can see how to create positive controls for more complex systems. If an SQL query is failing, try a simple “select * from <table>;”. If an application’s errors are not making it to the central syslog server, test if a syslog message from the application server makes it to the central syslog server. Write down your assumptions and you can almost always make them into positive controls (or negative controls, but that’s another article).
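To make the SQL case concrete, here is a rough Python sketch of that kind of positive control. It uses an in-memory SQLite database purely as a stand-in for whatever database you actually run; the table names and the helper `sql_positive_control` are hypothetical, and the `LIMIT 1` is my own addition so the control doesn't drag back a whole table.

```python
import sqlite3


def sql_positive_control(conn: sqlite3.Connection, table: str) -> bool:
    """Return True if a trivial SELECT against `table` succeeds."""
    # Table names cannot be bound as query parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    try:
        conn.execute(f"SELECT * FROM {quoted} LIMIT 1").fetchall()
        return True
    except sqlite3.Error:
        # Connection problems, missing tables and similar errors all mean
        # the assumption "I can read this table at all" is invalid.
        return False


# Stand-in database with one hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")

print(sql_positive_control(conn, "orders"))    # True
print(sql_positive_control(conn, "invoices"))  # False
```

If the control passes, the connection, credentials and table are fine, and the problem is more likely in the failing query itself. If it fails, there's no point debugging the complex query yet.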
Define your positive controls alongside the hypothesis. Test them first so that you can verify the world-view is solid before testing the hypothesis itself. If your positive controls fail, your world-view is incorrect and it’s back to the drawing board!
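If it helps to see that ordering written down, here is one possible sketch in Python. Everything in it is illustrative: `Control`, `first_failed_control` and `troubleshoot` are names I invented, and the lambdas stand in for real checks such as the ping tests above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Control:
    name: str                  # the assumption being checked
    check: Callable[[], bool]  # returns True when the assumption holds


def first_failed_control(controls: list[Control]) -> Optional[str]:
    """Run controls in order and return the name of the first one that fails."""
    for control in controls:
        if not control.check():
            return control.name
    return None


def troubleshoot(controls: list[Control], hypothesis_test: Callable[[], bool]) -> str:
    """Test the hypothesis only after every positive control has passed."""
    failed = first_failed_control(controls)
    if failed is not None:
        return f"world-view invalid: {failed}"
    return "hypothesis supported" if hypothesis_test() else "hypothesis rejected"


# Placeholder checks; real ones would run the actual tests.
controls = [
    Control("user is on the network", lambda: True),
    Control("the LAN is working", lambda: True),
]
print(troubleshoot(controls, hypothesis_test=lambda: True))
```

The point of the structure is that the hypothesis test never runs when a control fails, so you can't mistake a broken assumption for evidence about the hypothesis.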