Troubleshooting: Recreation and Validation

When watching others troubleshoot, I have noticed one very important step that is frequently overlooked: reproduction of the problem and validation of the solution.

Once you believe you have remediated an issue, you should attempt to immediately recreate the problem (use your common sense – if the issue affects online sales on Black Friday, it’s probably best to make a note and schedule the testing for later!). This is often as simple as undoing the fix or re-implementing the broken config. If the problem does not return, you didn’t actually fix the issue! Something else must have happened in the meantime to fix the issue.
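The apply, revert, and re-apply cycle described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: `validate_fix`, the three callables, and the `mtu` setting are hypothetical names invented for this example, and the "system" is just a dictionary standing in for a real config.

```python
from typing import Callable


def validate_fix(
    problem_present: Callable[[], bool],
    apply_fix: Callable[[], None],
    revert_fix: Callable[[], None],
) -> bool:
    """Return True only if the problem follows the fix in both directions."""
    apply_fix()
    if problem_present():
        return False  # the fix never cleared the problem in the first place
    revert_fix()
    came_back = problem_present()  # undoing a real fix should recreate the issue
    apply_fix()  # always leave the system in the fixed state
    return came_back and not problem_present()


# Hypothetical stand-in for a real system: a single broken setting.
state = {"mtu": 9000}


def problem_present() -> bool:
    return state["mtu"] != 1500


def apply_fix() -> None:
    state["mtu"] = 1500


def revert_fix() -> None:
    state["mtu"] = 9000


print(validate_fix(problem_present, apply_fix, revert_fix))  # prints True
```

If `validate_fix` returns False, the change was not what resolved the issue. That is the cue to document the steps taken and keep looking.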

You may be asking yourself, “If the problem is fixed, why do I care if it was my efforts that fixed it or not?” There are three main reasons why you should care:

  • Ensure the problem does not reoccur without warning. If your fix isn’t a fix and you cannot induce the problem to occur immediately, you can at least document what steps were taken and that they did not resolve the issue. When it does occur again, no one will be surprised.
  • Your “fix” may have side effects. Revert the configuration change along with any compensating controls put in place, such as a set of permit rules above a deny rule that didn’t exist in the firewall before.
  • You may start a cargo cult! This is very likely if the fix isn’t a setting but an action – clearing cache, restarting a process, or even rebooting. These hoops and the need to jump through them may become part of the diagnosis and remediation process. If the solution had been invalidated, everyone would have realized that these efforts only waste time and have no benefit.

Customer satisfaction will increase when customers see with certainty that a fix works and that the problem won’t spontaneously reoccur in the future. Explain that you want to take some time now to recreate the issue and validate the solution, and almost all customers will be understanding and appreciate the effort.

Home Lab 2015 Project

I strongly believe that everyone needs a home lab in order to practice Continual Improvement of the self. I recently completed an upgrade of my own home lab, for those interested. This year’s upgrade was inspired partly by need after moving to a new house that lacked ethernet wiring and partly by Chris Wahl’s colorful network.

The Existing Lab

For the past few years, my focus has truly been on virtualizing everything. The core of my lab is two Dell hosts running vSphere. The smaller is a 2012 PowerEdge T110 II with a 4-core processor, 32 GB RAM (32 GB max), a single onboard NIC, and some local storage. The larger is a 2013 PowerEdge T320 with a 6-core processor, 32 GB RAM (96 GB max), dual onboard NICs, and some local storage. They are both single socket, but could take extra NICs or storage. The T320 could also have an iDRAC if I didn’t mind running downstairs once in a blue moon. They are currently running vSphere 5.5 and I will upgrade them in the next month or so.


On Karōjisatsu And Avoiding Burnout

Recently, John Willis (@botchagalupe) wrote an excellent article about Karōjisatsu, suicide brought on by mental stress, often work-related. It’s a very sad, emotional tale that is relevant in many industries, but one that speaks particularly to high-pressure, high-stress STEM jobs, including IT. If you have not read this article, please take a few moments to go read it now.

The core idea of nearly overwhelming burnout is probably one that you recognize. John’s article spoke very eloquently on the need to reach out if you feel overwhelmed, that you’re not alone, that there are many people who are willing to help you, and that suicide is not an option. I would like to add that if I can ever be of any assistance to anyone reading this, don’t hesitate to reach out. If you ever feel truly overwhelmed, reach out to the National Suicide Hotline at (800) 273-8255 as well. You do matter!

John describes some causes of Karōshi, including, “Stress accumulated due to frustration at not being able to achieve the goals set by the company.” There is always pressure to do more with less and in IT, we tend to feel this pressure very heavily. Systems and their associated problems always seem to come and rarely to go, giving even stable, growth-restricted companies an increasing IT burden. Every day, there is an increasing amount of systems knowledge – often of the tribal and oral history varieties – for each of us to remember and maintain. When things go wrong – and they always do – we have to drop what we are doing to put out the fires, delaying our schedule and often without the ability to adjust the delivery dates on the schedule. We often feel that we must work harder and longer to make up for these delays and maintain the schedule in order to hit the company’s goals. The mental and physical stress of something going wrong combined with the mental and physical stress of working harder and longer accumulates in a vicious cycle that must be broken before it leads to karōshi.

I know this feeling. I have found myself looking at the clock near midnight, telling myself that I’ll put the computer down in 10 minutes and go to bed, only to blink and the clock reads 3AM. I have gotten up early on a Sunday to fix something broken that I could not get to on Friday. I’ve even found myself getting up early to “fix” something that’s not broken! The pressure of needing to resolve an issue, ship a product, or address a customer’s question keeps my brain running at night when it should be resting and recuperating so that I can do good work the next day. Sometimes it’s not even a company goal that keeps me working on an issue, just my stubborn pride. Whatever the cause, I know the feeling of overwhelming pressure that affects all of us from time to time.

Burnout of any sort, whether it puts you on the edge of suicide or the edge of your career, is dangerous. We must all develop coping strategies to deal with these feelings. I have been fortunate to have some wonderful mentors in my career. I credit my first two bosses for giving me two great coping strategies to deal with this pressure, and I would like to share those strategies with you.

The first coping strategy is courtesy of Bob at Centerline, my first “real world” job. We were the IT Operations staff at an engineering firm. His advice was simple: “Sometimes, you let it burn.” It’s very easy to hear users scream and think that the world really is ending. What the users are saying is important, but we must evaluate what we hear carefully and prioritize accordingly. Are we reacting because a single person is struggling with an issue or because the company is negatively affected by a problem more than it is positively affected by whatever you are currently doing? If you’re off shift when the issue occurs, must it really be taken care of immediately by you, or can it wait or be handled by someone else? Most of us have been taught repeatedly that the answer is always, “Fix it now!” but is that truly the case?

When issues have a low severity or affect a low number of users, particularly if you’re treating symptoms and not causes, let them “burn”. While things are burning, put your effort toward fixing the underlying causes in order to prevent future fires. You will often find that your environment is not as flammable as everyone thought and that a little fire and smoke won’t destroy the company. It’s still hot, and it still hurts, but it is a different kind of hurt. This is an especially great way to deal with chronic issues. Rather than dropping everything for, say, a single user who complains about a broken report that they need RIGHT NOW, fix the underlying bug in the reporting system. If you can pick just one “burn day” a month and spend that time on underlying causes, you will find yourself in a much better position in a few months. If you can do it more frequently, or cherry-pick some chronic issues to let burn, you may see results in just a few weeks.

Regardless of the frequency with which you have burn days, you’ll notice one thing very quickly: your stress levels will go down. When you do encounter a chronic issue that you cannot let burn, you know that someday soon you will be able to make that issue go away forever. Your time will be freed up to work on improvements and innovation rather than just outages, lowering the pressure put upon you and enabling you to meet the company’s goals.

The second coping strategy was taught to me by Scott from RBA Systems. This is a consulting firm where we provided both development and operations to our customers. I was a 21-year-old kid who had just dropped out of college and was out to prove myself in IT. In my first few weeks, Scott often had to tell me, “pace yourself.” I wish I could say I thought nothing of it, but as the young smartass I was, I thought it was something a jaded old guy would say. I’m tough and there’s no way I’ll let him slow me down! Instead, in just a few months, the blistering pace I had coming out of the gate had to falter and Scott, who also had to manage a few other people at the same time, started lapping me.

There’s simply no way you can keep up a lightning pace forever. Going 110% seems great until your body and mind start to fall apart due to the constant pressure they are under. Even going 100% cannot be maintained. You might find yourself flagging at the end of the day or your typing rate going to shit or constantly typing the wrong commands in the wrong windows. This is especially dangerous with ‘reboot’, ‘write erase’, or ‘rm’ style commands! None of these actions help you, your company, or your customers. Find out what your 100% looks like, then pull back a bit from it until you find a pace you can maintain that balances speed, efficiency, and accuracy. Keep adjusting that pace over time as your skills improve and your work/life demands shift to maintain the balance. You may be making adjustments every day, and that’s okay – no-one’s perfect.

I credit my ability to successfully maintain a high level of performance and avoid burnout in IT over the past fifteen years to the valuable lessons from my early mentors: burn days and pacing myself. I hope these tools can help others with this ongoing struggle.

Do these things because you have pride in your work, because you want to be able to continue contributing to IT for decades, because they’re the right things to do. Do it because you matter. Do it because you love life.

Rob Nelson’s 2015 Goals

On January 19th, I graded my efforts in 2014 and promised to document my technical goals for this year.

Learn Ruby

I spend a lot of time with Puppet and R10k and while I’m a decent enough programmer to pick up on what’s going on in general, I really need to grok Ruby at an advanced level. I bought a copy of Eloquent Ruby for the theoretical and I’ve eyeballed a few projects to work on for the practical.

Blog more about Security

One of my primary functions at work is related to security, but I rarely blog about it. That needs to change.

Home Network

1) After moving, I have my servers in a temporary space. Complete the finished space and migrate all the hardware there. Parts are on order and I’m going to add to the breaker box soon!

2) Stand up a complete environment. I’m missing vCO, an IPAM solution, and monitoring. Yes, it’s just my home network, but if I can’t monitor a handful of devices, I can’t monitor hundreds or thousands.

Expand Puppetinabox

Two weeks ago I released my first solo software project, puppetinabox. I’ve laid out some enhancements for it and I’m sure there are deficiencies I’m not aware of. I need to practice some good software development patterns, both for my own benefit and so that others can use the project.

Propose a VMworld Talk

Last year, I proposed an Auto Deploy talk. It was not accepted. I won’t make it a goal to have an accepted talk – I have no control over it, after all – but I’d like to put a proposal in again. Topic undecided.

Propose a PuppetConf Talk

As the year progressed, I became more interested in submitting a talk to PuppetConf than VMworld. This appears to have been the right choice, as my talk proposal was accepted. Mission accomplished!

VCAP-DCA

Last year, I obtained my VCP5-DCV. This year, I need to at least get my VCAP-DCA. That means a new study guide, some time in the lab, and a plan to prep, study, take, and pass the exam.

These are goals that are important to me. I’m documenting the goals so I can hold myself accountable come next January and see how I progressed.

2014 Recap: How did I do?

Inspired by Scott Lowe’s goals for 2015, I’ve decided to be more rigorous about my own work-related goals. In this post, I’ll give a recap of my unofficial plans for 2014. I never wrote these down or otherwise formalized them, so they are a bit rough.

  • Start a blog

This one I achieved pretty early in the year. I started collecting material during the holiday and started publishing content in February. Grade: Pass

  • Post a Monday morning blog article every week

I had mixed success here. My initial aim really was to have one article a week and I made it for about 10 months, with the help of Jason (@hawkbox), who contributed some Hyper-V content during August. Things started to fall apart in November, though. I participated in the 30in30 challenge for November and didn’t quite make the mark, and it got me away from the weekly post. In December, I didn’t post weekly. So I failed, only hitting this about 70% of the time. Grade: C

However, I consider the results a success. I learned a few things. First, I learned how difficult it is to get content ready every week without fail. I would have done much worse if Jason had not helped me in August, as I was reaching burnout. Learning new things can be fun, but the pressure to do it continually does suck some of the enjoyment out. It also has a high cost on your free time: you need to spend a lot of it on the blog and it can intrude upon time with family and friends (you’ll note that blog content has been almost nil since the holiday season began, for this reason). Second, a weekly article isn’t needed. I managed to have 100 blog articles in 11 months, which is more than one per week. Some were really short, some were very lengthy technical articles, and some were opinion pieces. People found the articles and I get a note from readers once in a while. Thank you all for reading; it really makes it worthwhile.

Even though I missed my original mark, I’ll regrade it as a B as I met the real, unwritten goal – create helpful content and post it on my blog frequently.

  • Learn Puppet

On this goal, I did very well. I wrote some 35+ articles, created three forge modules, and it spawned a project ‘Puppetinabox’. It also had some unrecognized synergy with using and learning Git, which I now feel adept with. I still have a lot of things to learn about Puppet, but my efforts clearly paid off. Grade: A

  • Present publicly

A goal I added midway through the year was to present something publicly. It didn’t matter what, but it had to be in front of real live human beings. I had done an Auto Deploy Deep Dive on the vBrownBag podcast earlier in the year and loved it. A podcast can make you nervous, but no-one can see you and you can’t see the attendees. I wanted to really push myself as I know I’m not good at public speaking. I ended up giving two public presentations. The first was a Virtual Design Master followup at the annual Indianapolis VMUG conference in July and the second was about DevOps for SysAdmins in November at a regular VMUG meeting. I had a lot of fun with both and I think I got past much of my stage fright. I also presented with Byron Schaller both times, who really helped with getting the presentations together. Of course, both presentations were to small groups. I had applied to give my Auto Deploy presentation at VMworld and presenting to (hopefully!) a few hundred people would have been a much different experience. There’s always next year! Grade: B+

For someone without a solid set of goals for the year, I think I did fairly well! I’m going to try and improve on this for 2015 by documenting the goals publicly and posting them here. Stay tuned.

Hypothesis Driven Writing

I just tackled hypothesis-driven troubleshooting, which brings me to an important subject for blog writers and #vDM30in30 in particular: hypothesis-driven writing. As writers, we constantly seek to improve our abilities. One of the most important skills, in my opinion, is to use a hypothesis as the foundation of your writing. Writing around a solid hypothesis results in an interesting, focused result that engages readers and leaves them with a clear impression of what the writer wanted to say. A lack of hypothesis results in an aimless article that leaves the reader confused and wondering what the writer was trying to convey.

As a reader, most of us find this hypothesis to be true without requiring great analysis. If an article starts out talking about the importance of OpenStack and devolves into comparing Disney films, we all feel the lack of a solid hypothesis. On the other hand, if Disney films are involved in the hypothesis, perhaps as analogies to the components of or community around OpenStack, the reader may feel rewarded and be very receptive to the writer’s goals (I challenge someone to write such an article, it would be quite the feat!). When the writer follows the hypothesis, everyone enjoys the benefits.

If we agree that good writing relies on a solid hypothesis and the writer’s adherence to the hypothesis, how do we, as writers, craft an effective hypothesis? Look at the definition of hypothesis. There are many types and the type chosen will be based on the writing goal. A research paper would require a working hypothesis, a hypothesis that is provisionally accepted to further research. It is constructed as a statement of expectations, such as, “We expect X to increase proportionally to the decrease of Y,” which would then be tested to determine its validity. A formal logic statement, of the form, “If X, then Y,” is based on hypothesis X, and can be the foundation of a logical proof or experiment. In an opinion piece, like a blog, a hypothesis may be crafted as a general plot, such as, “Creating and adhering to a hypothesis is the key to good writing,” which is then examined in detail.

Now that you have a hypothesis, you need to state it. The first paragraph of your writing is where you state the hypothesis. There are many ways to state your hypothesis. I follow a few guidelines.

  • Describe the general hypothesis.
  • State your specific hypothesis. Avoid terms like, “I think,” when possible.
  • Repeat your specific hypothesis.

Throughout the rest of your writing, every paragraph needs to relate to the hypothesis, through direct support of the statement or through indirect support, such as data or analysis that relates to the hypothesis. Your readers will be able to follow the thread of your writing and, hopefully, see exactly what you were trying to present them.

In your final summary (typically the final paragraph, except in larger articles), restate the hypothesis and the supporting evidence. If you did a good job explaining yourself, this will reinforce the ideas in your readers’ minds.

As a writer, your goal is to create a valuable article. The foundation of that article is a hypothesis. It’s important to adhere to this hypothesis in order to reward your readers with a solid article. Whether you’re participating in #vDM30in30 or writing on a less frequent basis, take the time to practice hypothesis-driven writing and both you and your readers will appreciate the results.

Hypothesis Driven Troubleshooting

John Price wrote a wonderful article about troubleshooting the other day that got me thinking about this skill. Troubleshooting is an incredibly vital skill in IT and one that many people view as an innate skill, to the point that a common adage is, “You can’t teach troubleshooting, you have it or you don’t.” I believe that, like nearly every other skill, it is a learned skill, and those without the skills should not be treated as hopeless. It may come easier to some people, but anyone can be taught the fundamentals of troubleshooting if they care to learn.

Troubleshooting is, at a bare minimum, the search for the source of a problem. Good, effective troubleshooting is a logical and systematic search for that source. That difference is driven by a scientific hypothesis, or a proposed explanation for a problem that can be tested. The hypothesis might be, “The reason the internet is unavailable for users is that their internet connection is down.” This can be tested and determined to be the cause, or discarded as a failed hypothesis. The troubleshooter can determine another scientific hypothesis, “The reason the internet is unavailable for users is that the firewall is not passing traffic,” which can then be tested. By creating and following a series of hypotheses until a valid hypothesis is found, the troubleshooter can identify a problem that can be fixed. This is the essence of the scientific method, which isn’t just for scientists anymore. Troubleshooting without a hypothesis may lead to the source of a problem, but only through random luck.
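The loop of stating a hypothesis, testing it, and discarding it if it fails can be written down as a short Python sketch. Everything here is an assumption made for illustration: the `troubleshoot` function, the `observations` dictionary, and its keys are hypothetical, and in real life each test would be an actual check (a ping, a log query, a packet capture) rather than a dictionary lookup.

```python
from typing import Callable, Iterable, Optional, Tuple

# A hypothesis is a testable statement paired with the test that checks it.
Hypothesis = Tuple[str, Callable[[], bool]]


def troubleshoot(hypotheses: Iterable[Hypothesis]) -> Optional[str]:
    """Test each hypothesis in order; return the first one that holds."""
    for statement, test in hypotheses:
        if test():
            return statement  # valid hypothesis: a problem that can be fixed
        # otherwise discard it as a failed hypothesis and move to the next
    return None  # every hypothesis failed; time to form new ones


# Hypothetical observations standing in for real checks.
observations = {"isp_link_up": True, "firewall_passing_traffic": False}

hypotheses = [
    ("The internet connection is down",
     lambda: not observations["isp_link_up"]),
    ("The firewall is not passing traffic",
     lambda: not observations["firewall_passing_traffic"]),
]

print(troubleshoot(hypotheses))  # prints: The firewall is not passing traffic
```

The ordering of the list is itself a design choice. Cheap or likely hypotheses usually go first so that the search stays systematic rather than random.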

The scientific method, how to craft a hypothesis, and how to test a hypothesis are all methods and skills that must be learned. We are not born with this knowledge; it must be taught. Some of us learn this in school as part of our formal education. Some of us learn through less formal methods. In John’s article, his father taught him how to define and test a hypothesis via the Socratic method, asking John to ask and answer questions and teaching him how to narrow the possible sources down to a single source. While most of us learn these skills at a relatively young age, usually before age 20, the skills and knowledge are teachable to anyone of any age. All it requires is a good teacher and a student willing to listen.

If someone you know does not have good troubleshooting skills and their job – or a job they want to obtain – requires it, they can be taught. If this person is your colleague or friend, do not give up on them! Become a teacher to them or find them a mentor. Perhaps they’ll teach you something along the way, and you’ll have the satisfaction of knowing that you’ve contributed to the next generation of IT leaders.

Thinking TechX

A word we hear too much of these days is ‘disrupt’. When it’s not overused, it means that you’re trying to change the way you do things in some dramatic fashion. Instead of doing things by hand, you use some tool to automate some or all of it. Or you switch from Linux everywhere to Windows everywhere, or vice versa. Whatever the change is, the point is that you’re changing how you do things.

Something that frequently appears to be forgotten during disruption is to change how you think about doing things. When you were doing things on Windows, you probably did a lot of mouse clicking and typing. Now you’ve moved to Linux. Was the change really about the OS? Probably not. The change was about not having to click the mouse and type. So stop it! Start “thinking Linux”, or whatever technology you’re using.

This has two advantages. First, it becomes really disruptive, because it was the thought process holding you back the whole time. If you only change the technology, you’ve just hidden the problem for a while. That buys you a bit of runway but no real solution. Applying an entirely new thought process will help you get out of the rut of “the way we’ve always done it.”

Second, if you are using idiomatic patterns of the chosen technology – such as using camelCase in PowerShell but snake_case in Ruby – you’re going to find it much easier to attract and retain coworkers who already think that way. If your Ruby code looks like PowerShell, most Ruby devs will just run away. Even if your team has low turnover, it will make everyone on the team better able to receive new team members and allow the team to better contribute back to the community, especially via open source projects.

Take the time to approach your problems in a new manner from top to bottom and you’ll reap the benefits.

The Goal

If you’ve been paying attention in the IT world at all in the last few years, you’ve heard of this thing called DevOps. You’ve probably also heard of The Phoenix Project, an excellent DevOps novel by Gene Kim and others. Phoenix builds upon the foundation of an earlier novel, The Goal: A Process of Ongoing Improvement, by Eli Goldratt in 1984. The Goal is a revolutionary novel that changed the manufacturing world but hasn’t quite had the same effect on the rest of the world. It’s important to understand history and those that came before us. I decided to really dive in and explore our history.

If you’re curious about what I thought of this book, I’ll save you some time – buy a copy right now and start reading. By teaching in the Socratic Method, it makes high-level concepts easily relatable by giving us real life examples of how those concepts work. Specifically, I read the 30th anniversary edition that included Standing on the Shoulders of Giants, some extra material that I think really matters. If you already have an older edition, it’s worth the $16 for this extra piece.

So what does a book about manufacturing have to do with DevOps? Nothing – and everything. Phoenix continually shows how lessons learned from the manufacturing world can help us in IT. On the other hand, IT is very different and blindly applying these lessons could actually be harmful. Thankfully, The Goal focuses on two primary components that guide us in applying our new knowledge. The novel is also the foundation of the Theory of Constraints. Let’s take a look at the two components first.

The Goal

The first component of The Goal is… The Goal. Yep, it is that simple. So, what is the goal? It’s universal – the goal of any company is to make money. It doesn’t matter what industry you’re in, that’s just common sense, right? Take a look at your current job and see if you agree. The Goal attacks this assumption and challenges us to view things differently. Eli, through the character of Jonah, defines the goal in the context of a manufacturing plant.

  • Increase throughput, defined as turning raw materials into cash
  • Decrease inventory, defined as all raw and processed materials that are not sold
  • Decrease operational expenses, defined as all costs of running the plant that aren’t inventory costs

These three fundamentals describe the goal in simple to understand terms that can be easily measured. Throughput is taking what you consume and selling it – whether it’s metal into faucets and fixtures that are sold or words and ideas into a blog post that is published. If the consumables lie around far too long, like faucets in a warehouse or a blog post that’s perpetually in draft status, your inventory costs go up instead of down. Operational expenses vary, but your personnel and other operational costs need to trend downward. The book investigates these concepts in far more detail. These three concepts turn all of our assumptions on their heads.

A Process of Ongoing Improvements

The second part of the story is about the process. Once you’ve come around to a new way of thinking, you don’t just suddenly fix everything. You have to implement change in how you’re doing things to meet the goal. Once you do, you get closer to the goal. You start to decrease the inventory that’s waiting around and throughput goes up. Those initial changes have visible immediate effects, but they may also have hidden long-term effects. Inventory may be decreased enough to deal with the current backlog, but once the backlog is out of the way, do you maintain your throughput? If the inventory is too high or low, new issues may arise. Hence, ongoing improvements.

This is addressed by measuring your throughput, inventory, and operational expenses. However, you need to measure with the goal in mind. Adhering to the previous metrics won’t suffice, as they aren’t aligned with the goal. By getting a better approximation of what is happening in a manufacturing plant, more information is available to drive the ongoing improvements.

Theory of Constraints

Together, these two sections combine to give us the Theory of Constraints. The theory stipulates that there are constraints in your plant and that your effort is best focused on the constraints. Finding these bottlenecks and addressing their limitations (exploiting the constraint) will go the furthest toward increasing your throughput and decreasing inventory and operational expenses. Everything else is subordinated to elevating these constraints. Then, you repeat the process – find a constraint, exploit it, subordinate everything else, elevate it. One additional key is to prevent inertia from becoming a constraint. Don’t do things because that’s how you do them – continue to challenge your assumptions and make whatever changes are required to increase throughput, decrease inventory, and decrease operational expenses.
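A toy model can make the "find a constraint, elevate it, repeat" loop concrete. The Python sketch below is only an illustration under heavy assumptions: the stage names and daily capacities are hypothetical, the plant is modeled as a simple serial line, and the exploit and subordinate steps are left out entirely. It is not taken from the book.

```python
from typing import Dict, List


def find_constraint(capacity: Dict[str, int]) -> str:
    """The constraint is the stage with the lowest daily capacity."""
    return min(capacity, key=lambda stage: capacity[stage])


def throughput(capacity: Dict[str, int]) -> int:
    """A serial line can only ship as fast as its slowest stage."""
    return min(capacity.values())


def improve(capacity: Dict[str, int], boost: int, rounds: int) -> List[str]:
    """Repeatedly elevate whichever stage is currently the constraint."""
    elevated = []
    for _ in range(rounds):
        stage = find_constraint(capacity)
        capacity[stage] += boost  # elevate the constraint
        elevated.append(stage)    # the constraint may now be somewhere else
    return elevated


# Hypothetical stages and units-per-day capacities.
capacity = {"cutting": 100, "heat_treat": 40, "assembly": 60}

print(throughput(capacity))                    # prints 40
print(improve(capacity, boost=30, rounds=3))   # prints ['heat_treat', 'assembly', 'heat_treat']
print(throughput(capacity))                    # prints 90
```

Notice that the constraint moves after each round, which is why the process has to be repeated. Notice also that adding capacity to `cutting` would never have changed the throughput at all.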

Standing on the Shoulders of Giants

This bonus story in the 30th anniversary edition describes Toyota’s Lean production system (you may know it as Just-In-Time Production) and its strengths and weaknesses. It’s very short, but most importantly it documents the three stability requirements to implement Lean fully, and how unstable businesses can leverage certain parts of Lean to still benefit. A great example is Hitachi Tool Engineering. HTE attempted for years to implement Lean without success, because they did not enjoy the stability required to implement it. Finally, after attempting to leverage only the appropriate processes of Lean, HTE grew their profit ratio before taxes from 7.2% in 2002 to 21.9% in 2007. What company wouldn’t want to do that? Knowing when to not do something is just as important as knowing when to do something.

The Future

Obviously, The Goal struck a chord with me. I think it presents a fabulous theory on how to treat business, whether you’re in manufacturing or not. You’ll see some more posts from me in the future related to The Goal and how we can use the Theory of Constraints and the throughput/inventory/operational expenses in many ways, not just directly in our IT industry. I’ll present some of my own theories and attempt to prove them out by implementing them myself. I hope you take the time to read this book and that you will join me on this journey.

On Mentoring: “Perfection is an illusion, its pursuit is a pathology”

A while back, my wife, Michelle Block, and I were talking about getting stuff done – actually done, not just part of the way done – when she said something that I think is really profound:

“Perfection is an illusion, its pursuit a pathology.”

I really love this statement. It’s very simple, yet full of meaning. I asked Michelle where this statement came from and she gave me a very good story to tell.

Dr. Michelle Block is an Associate Professor of Anatomy & Cell Biology at Indiana University and an expert in her own field. Michelle takes very seriously the need to foster future generations of scientists and is very proud to be able to mentor some of these future scientists. One of the most inspiring experiences in her own development was reading Rosalyn Yalow’s Nobel Prize Speech, and she hopes to be able to provide similar encouragement to her successors. With that in mind, Michelle had been speaking with a colleague about the best way to explain to the upcoming generation of scientists what is expected of them, what it takes to be a good scientist. Her colleague asked her, “What’s the difference between excellence and perfection?”
