When watching others troubleshoot, I have noticed one very important step that is frequently overlooked: reproduction of the problem and validation of the solution.
Once you believe you have remediated an issue, you should immediately attempt to recreate the problem. (Use your common sense: if the issue affects online sales on Black Friday, it’s probably best to make a note and schedule the testing for later!) Recreating it is often as simple as undoing the fix or re-implementing the broken config. If the problem does not return, you didn’t actually fix the issue! Something else must have happened in the meantime to resolve it. If the problem does return, reapply the fix and confirm that it goes away again.
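The revert-and-recheck loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `validate_fix`, `FakeService`, and the verdict strings are hypothetical names invented for this example, and the three callables stand in for whatever manual or scripted steps apply, revert, and test the change in your environment.

```python
from typing import Callable


def validate_fix(
    apply_fix: Callable[[], None],
    revert_fix: Callable[[], None],
    problem_present: Callable[[], bool],
) -> str:
    """Check whether a change is really what resolved a problem.

    Assumes the fix is already applied when this is called.
    All names and verdict strings here are hypothetical.
    """
    if problem_present():
        return "not fixed"      # the problem never went away
    revert_fix()                # undo the change to try to recreate the issue
    if not problem_present():
        # The problem did not return, so something else resolved it.
        # The change is left reverted to avoid keeping side effects around.
        return "unconfirmed"
    apply_fix()                 # the problem came back, so reapply the change
    if problem_present():
        return "inconsistent"   # reapplying did not help; keep investigating
    return "validated"


class FakeService:
    """Toy stand-in for a real system, used only to exercise the sketch."""

    def __init__(self, fix_matters: bool) -> None:
        self.fix_applied = True          # we start right after "fixing" it
        self.fix_matters = fix_matters   # does the change affect the issue?

    def apply_fix(self) -> None:
        self.fix_applied = True

    def revert_fix(self) -> None:
        self.fix_applied = False

    def problem_present(self) -> bool:
        return self.fix_matters and not self.fix_applied


real = FakeService(fix_matters=True)
coincidence = FakeService(fix_matters=False)
print(validate_fix(real.apply_fix, real.revert_fix, real.problem_present))
print(validate_fix(coincidence.apply_fix, coincidence.revert_fix,
                   coincidence.problem_present))
```

In this toy run the first service reports `validated` and the second reports `unconfirmed`. The second case is the one worth documenting: the change made no difference, so it is left reverted and the real cause is still unknown.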
You may be asking yourself, “If the problem is fixed, why do I care if it was my efforts that fixed it or not?” There are three main reasons why you should care:
- Ensure the problem does not recur without warning. If your fix isn’t a fix and you cannot induce the problem on demand, you can at least document which steps were taken and that they did not resolve the issue. When the problem does occur again, no one will be surprised.
- Your “fix” may have side effects. If it turns out not to be the real fix, revert the configuration change along with any compensating controls put in place for it, such as a new set of permit rules added above a deny rule in the firewall.
- You may start a cargo cult! This is very likely if the fix isn’t a setting but an action, such as clearing a cache, restarting a process, or even rebooting. These hoops, and the need to jump through them, may become part of the standard diagnosis and remediation process. If the supposed solution were tested and shown not to work, everyone would realize that these efforts only waste time and have no benefit.
Customer satisfaction will increase when customers see with certainty that a fix works and that the problem won’t spontaneously recur in the future. Explain that you want to take some time now to recreate the issue and validate the solution, and almost all customers will understand and appreciate the effort.