Hypothesis Driven Troubleshooting

John Price wrote a wonderful article about troubleshooting the other day that got me thinking about this skill. Troubleshooting is an incredibly vital skill in IT and one that many people view as an innate skill, to the point that a common adage is, “You can’t teach troubleshooting, you have it or you don’t.” I believe that, like nearly every other skill, it is a learned skill, and those without the skills should not be treated as hopeless. It may come easier to some people, but anyone can be taught the fundamentals of troubleshooting if they care to learn.

Troubleshooting is, at a bare minimum, the search for the source of a problem. Good, effective troubleshooting is a logical and systematic search for that source. That difference is driven by a scientific hypothesis, or a proposed explanation for a problem that can be tested. The hypothesis might be, “The reason the internet is unavailable for users is that their internet connection is down.” This can be tested and determined to be the cause, or discarded as a failed hypothesis. The troubleshooter can determine another scientific hypothesis, “The reason the internet is unavailable for users is because the firewall is not passing traffic,” which can then be tested. By creating and following a series of hypothesis until a valid hypothesis is found, the troubleshooter can identify a problem that can be fixed. This is the essence of the scientific method, which isn’t just for scientists anymore. Troubleshooting without a hypothesis may lead to the source of a problem, but only through random luck.

The scientific method, how to craft a hypothesis, and how to test a hypothesis are all methods and skills that must be learned. We are not born with this knowledge, it must be taught. Some of us learn this in school as part of our formal education. Some of us learn in less formal methods. In John’s article, his father taught him how to define and test a hypothesis via the Socratic method, asking John to ask and answer questions and teaching him how to narrow the possible sources down to a single source. While most of us learn these skills at a relatively young age, usually before age 20, the skills and knowledge are teachable to anyone of any age. All it requires is a good teacher and a student willing to listen.

If someone you know does not have good troubleshooting skills and their job – or a job they want to obtain – requires it, they can be taught. If this person is your colleague or friend, do not give up on them! Become a teacher to them or find them a mentor. Perhaps they’ll teach you something along the way, and you’ll have the satisfaction of knowing that you’ve contributed to the next generation of IT leaders.

Questioning Assumptions with Intelligence

“Question everything!” You’ve heard this a million times. You probably try to do it, sometimes, too. The underlying tenants of The Goal, the Theory of Constraints, Lean, and other methodologies relies on questioning assumptions. It’s important, but what exactly does it mean, and what do you do afterward?

First off, it’s not a license to literally ask questions about every business decision at every opportunity. Many questions can be answered in your own head before you open your mouth, so there’s no need to bother others with those questions. For the rest, go back to the theory of constraints and ask yourself if it’s a bottleneck first. If not, the answer might not matter. Above all, always be courteous and understanding of the situation before speaking. If you do literally question everything, you will be treated like an a-hole of the first degree and your message will be lost. There’s a time and place for everything. Continuing on…

In the right context, “Hey, wait a minute, why exactly are we doing that?” is a good question. Sometimes there is a good answer,  but other times the answer is simply, “because.” That’s not a good answer. For example, someone who lives in SoCal suggested I salt my car’s tires in the winter. Though I have lived in the north, I had never heard of doing that. I asked where they learned to do that. Many years ago, the person went to college in Pittsburgh and saw buckets of salt near parking areas. They saw someone else pour salt around their car’s tires, so they assumed that is what it was intended for. Turns out it was for the sidewalks.

You might find what that person did humorous, but before you snicker, look around your business – are you sure you’re not doing something simply because your colleague or predecessor did it? A long time ago, I found out I had been swapping backup tapes every morning on a system that had been decommissioned but not powered off. Whoops! This is cargo cult behavior, and we all participate in it at some point in our lives. Businesses do it A LOT. The important thing is that we come to understand what we are doing and correct the behavior.

When you do find some broken assumption, you must be smart in how you address it. Again, make sure it’s a constraint. A little salt around the tires won’t really hurt anything, but putting salt in the gas tank certainly would. Focus the efforts on the constraints. Figure out what is wrong with the assumption and how to make it right. When you find these broken assumptions, there’s no need to blame or ridicule someone. You fixed a problem, everyone should be happy! Once you make some correction, take a look at the other assumptions in your system and see if they were affected. Decisions in the fundamental parts of the system tend to have cascading effects further down the line.

This is an iterative process. If you question an assumption this year and there’s a good reason for it, you will eventually want to revisit it, maybe next year or in 5 years. Change is perpetual and you should embrace it, not flee from it.

Fortigate user permissions peculiarities

While working with a customer on their Fortigate firewalls, I was introduced to a peculiarity of how FortiOS interprets user’s diag commands. I suspect this affects multiple versions, but I don’t have the ability to test this.

  • FortiOS: 4.2.x
  • User: wild-card (TACACS)
  • Profile: super_admin_readonly

TACACS users whose permissions elevate them to the super_admin profile are unaffected. They can run diag commands unrestricted as they have full access.

TACACS users whose permissions remain at super_admin_readonly were finding that they could not run diag commands that accessed an interface, such as diag sniff packet any “icmp”. Upon further investigation, the issue was related to the IP the user connected to and the interface (“any” in the example) used in the command. As a readonly user, the any interface is off-limits. The interfaces configured for the VDOM that the user connected to are available to the readonly users.

In other words, if a firewall had two VDOMs, Common and DMZ, and the user connected to any interface connected to the Common interface, only those interfaces would be useable. For instance, diag sniff packet common-outside “icmp” would work, as well as common-inside. Interfaces connected to other VDOMs are off-limits, so diag sniff packet dmz-outside “icmp would fail. By providing the end user a list of the IP addresses and interface names, and the VDOM they belonged to, the user was able to perform all required diagnostic commands.

I hope this is fixed in more recent versions, but at least there’s a workaround that makes some logical sense.

Thinking TechX

A word we hear too much of these days is ‘disrupt’. When it’s not overused, it means that you’re trying to change the way you do things in some dramatic fashion. Instead of doing things by hand, you use some tool to automate some or all of it. Or you switch from Linux everywhere to Window everywhere, or vice versa. Whatever the change is, the point is that you’re changing how you do things.

Something that frequently appears to be forgotten during disruption is to change how you think about doing things. When you were doing things on Windows, you probably did a lot of mouse clicking and typing. Now you’ve moved to Linux. Was the change really about the OS? Probably not. The change was about not having to click the mouse and type. So stop it! Start “thinking Linux”, or whatever technology you’re using.

This has two advantages. First, it becomes really disruptive, because it was the thought process holding you back the whole time. If you only change the technology, you’ve just hidden the problem for a while. That buys you a bit of runway but no real solution. Applying an entirely new thought process will help you get out of the rut of “the way we’ve always done it.”

Second, if you are using idiomatic patterns of the chosen technology – such as using camelCase in Powershell but snake_case in Ruby – you’re going to find it much easier to attract and retain coworkers who already think that way. If your Ruby code looks like PowerShell, most Ruby devs will just run away. Even if your team has low turnover, it will make everyone on the team better able to receive new team members and allow the team to better contribute back to the community, especially via open source projects.

Take the time to approach your problems in a new manner from top to bottom and you’ll reap the benefits.

The Goal: Throughput and Efficiency

One of the most important concepts of The Goal is to increase throughput. Throughput is the rate at which the system generates money through sales. That is, when your company takes raw materials, processes them into a finished good, and sells it, the measured rate of that activity is your throughput. Severe emphasis on sales. Throughput is not the same as efficiency. Today, we will look at throughput vs. efficiency and how these concepts apply to IT.

Though we are focusing on throughput, we must state the descriptions of the two other measurements. Inventory is all the money that the system has invested in purchasing things which it intends to sell. Operational expense is all the money the system spends in order to turn inventory into throughput. I list the three definitions together because the definitions are precise and interconnected. Changing even a single word in one requires the other two be adjusted as well.

Another important concept in throughput is that it measures the entire system, not a locality. Whether you work in your garage or in a giant auto plant, you can not measure throughput locally, it must be measured over the entire system. This conflicts with most companies’ measurements of local efficiency. Employers naturally want to keep all their employees busy and employees like to see their coworkers pull their own weight. Why should Jane get to twiddle her thumbs at the Fob machine when Jill is busy pushing pallets of Fob parts around the floor? Is it fair to George to watch Jeff read the newspaper while he has to investigate hundreds of parts for quality control? And shouldn’t Jane and Jeff be worried that they might be reprimanded or fired for not being efficient, or draw the ire of their coworkers?

Continue reading

Introduction to rspec-puppet

Editor’s note: Please check out the much newer article Configuring Travis CI on a Puppet Module Repo for the new “best practices” around setting up rspec-puppet. You are encouraged to use the newer setup, though everything on this page will still work!

Over the course of the Puppet series, one thing I’ve ignored is testing. As vSphere admins, many of us are comfortable with programming but probably not as well versed in some practices as full-time developers. Today we’ll look at an introduction to some test-driven-development with puppet.

Test Driven Development

What is this Test Driven Development, or TDD, that everyone speaks so highly of? In essence, you write tests that fail before you write any code, then you write code to satisfy the tests. Each test typically looks at a specific unit of functionality of a program, such as whether a file is created or has contents, and are called “unit tests.” By testing a specific function, when you have a failure, you can typically narrow down the problem domain to a few lines of code. When all unit tests generate successes, your code works (in theory!). In addition, when you modify the code in the future, these unit tests help ensure that you haven’t broken something that was previously working, also known as a “regression.”

TDD depends, of course, on writing tests that both provide coverage of all your code and that map to the requirements of the program. If you forget to provide a test that covers a vital portion of your code, all your tests can be successful but leave you with a broken program. If you have not been practicing TDD on an existing program, you can still add tests. However, you will not have 100% test coverage (the percent of code that is covered by unit tests) initially, or possibly ever, as all of the existing code was written prior to the unit tests. To keep things simple today, we’ll start writing some new code.

Continue reading

Sometimes We Break Things

Today’s a no-deploy Friday for me, like it is for many. However, also like many others, here I am deploying things. Small, minor things, but it would ruin my weekend if they broke anyway. Sometimes the worst does happen and we break things. Don’t worry, we’re professionals!

So, what happens if you do break something? First, don’t panic. Everyone’s broken something before, and that includes everyone above you in the food chain. The second step is to notify those above you according to your internal processes. In most cases, that means stopping what you are doing and giving your boss a paragraph summary of the issue, what it affects, and what you’re doing about it, then getting back to work. Third, don’t panic! I know I already said that, but since you’ve now gone and told your boss, they may have induced some panic – let it pass. The only way you’ll recover is if you don’t panic. Breath.

Fourth, fix it! Use your mind to decide what was supposed to happen, what you did, and where things went wrong. Identify the steps required to either back things out or repair the situation so you can proceed. Document the steps and follow them. If you have a maintenance window you are operating under, put some time estimates down and set an alarm for when you need to make the go/no-go call. Though the situation is urgent, taking a few moments now to prepare will make you more efficient as you proceed. Give your management chain short updates throughout the event until it is cleared, and don’t let rising panic get to you.

Continue reading