Nerd Guru: Troubleshooting Techniques: Step 4 - What isn’t broken versus what is?

At this point, you have eliminated the easy solutions. Setup instructions have been verified and the item in question is in a known state. Because the problem appears to occur in multiple, similar instances, environmental issues have been eliminated as a root cause. Nothing has changed recently that is affecting the results either. These first three steps will catch most problems, but there will be times when you need to dig deeper, and your knowledge of the product in question, your access to information, and your creativity all become bigger factors in finding a solution.

Before you take that step, though, it is a good idea to pause and report status to those who might be interested. This may seem a bit bizarre since you do not have a solution yet and may not even appear to be close to one. Realize, though, that by performing the first three steps you have demonstrated what is not the problem. That in itself is progress, and it can signal to people in management or other positions of power that this is likely not a simple problem that will be solved easily. Communicate to them the steps that you have completed thus far, what preliminary conclusions you can draw from those steps, and what you are about to try next. For a lot of people, just knowing that work is being done on the problem counts as progress. Explaining that you have narrowed the scope of the problem by eliminating the simple solutions helps them understand the potential complexity of the situation. Figure 1 shows a good example of such a communication.

Figure 1: A sample troubleshooting status email message

Tracing the logical flow of the device is the next step. Whatever kind of processing is involved, your item likely has one or more forms of input and a similar set of outputs. When stimulated a certain way, the item is expected to react accordingly but does not. With your knowledge of the system, give it a known input and trace its logical flow, measuring intermediate results along the way. The idea, at first, is a process of elimination. There are more things that aren’t related to the problem than there are things that are related to it. Isolate those things that are working from those things that are broken.

For example, suppose your device is a toaster. The first step in its operation is to plug it in. When you do that, can you verify that the unit is receiving electricity, perhaps through some indicator light? If so, then see if a piece of bread will fit correctly in the slots. Assuming that works, will the lever depress and drop the bread in the slots properly? Does that cause the coils to heat up? Does adjusting the darkness knob alter the amount of time that the bread stays in the slots and the coils stay hot as you might expect? When the timer is up, does the bread pop up out of the slots and do the coils cool down? These are the basic steps of toaster operation and walking through them, verifying the expected results along the way, gives a more granular look at the problem. You can discover what is working and what is not, allowing you to focus your efforts on a lower level of investigation.
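If your device happens to be something you can poke at from a script, the same walk-through can be sketched in a few lines of Python. This is only an illustration: the `toaster` dictionary is a made-up stand-in for real measurements, and the stage names and the `first_failing_stage` helper are hypothetical, not part of any real tool.

```python
from typing import Callable, List, Optional, Tuple

# A check is a stage name plus a function that returns True
# when that stage behaves as expected.
Check = Tuple[str, Callable[[], bool]]


def first_failing_stage(checks: List[Check]) -> Tuple[List[str], Optional[str]]:
    """Run the checks in order and stop at the first one that fails.

    Returns the names of the stages that passed and the name of the
    first stage that failed (or None if every stage passed).
    """
    passed: List[str] = []
    for name, check in checks:
        if not check():
            return passed, name
        passed.append(name)
    return passed, None


# Assumption: simulated observations standing in for a real toaster.
toaster = {
    "power_light_on": True,
    "bread_fits": True,
    "lever_latches": True,
    "coils_heat": False,  # the pretend fault
    "knob_changes_time": True,
    "bread_pops_up": True,
}

checks: List[Check] = [
    ("receives power", lambda: toaster["power_light_on"]),
    ("bread fits in slots", lambda: toaster["bread_fits"]),
    ("lever latches", lambda: toaster["lever_latches"]),
    ("coils heat up", lambda: toaster["coils_heat"]),
    ("knob changes toast time", lambda: toaster["knob_changes_time"]),
    ("bread pops up", lambda: toaster["bread_pops_up"]),
]

working, broken = first_failing_stage(checks)
print("Working:", working)
print("First failure:", broken)
```

The sketch stops at the first failure on purpose. Later stages depend on earlier ones, so their results are not trustworthy until the earlier fault is understood.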

Depending upon your specific situation, a similar approach is to start by taking your device apart completely. Slowly put it back together in functional layers, adding pieces and verifying desired results as you go. As you approach the complexity of the entire finished product, you will gain confidence in the working order of the underlying functionality and reduce the set of potential problems to the more elaborate use cases.

Returning to the toaster example, imagine the components detached from one another and strewn on a workbench. First, you connect the coils and the power supply together and plug in the cord. This setup lets you test the basic functionality of a toaster: getting the coils hot enough to partially char bread products. Now add the adjustment knob that dictates how hot the coils get and for how long. With a functional baseline of the coils heating at all already established, this adds another level of complexity to the system. Then add the lever mechanism that lowers and pops up the bread, which is essentially the final piece that forms the finished product. Similar to the step-by-step approach, this strategy attempts to isolate pieces of functionality but does so in a slightly different way. The goal is the same, though: find what is working and what is not.
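The rebuild-in-layers idea can be sketched the same way. The snippet below is a rough illustration under stated assumptions: the layer names come from the toaster example, while `assemble_incrementally` and `simulated_test` are hypothetical helpers, with the test function pretending that the adjustment knob is the faulty part.

```python
from typing import Callable, List, Optional


def assemble_incrementally(
    layers: List[str],
    works: Callable[[List[str]], bool],
) -> Optional[str]:
    """Add one layer at a time and retest after each addition.

    Returns the first layer whose addition makes the assembly fail,
    or None if the fully assembled device works.
    """
    assembled: List[str] = []
    for layer in layers:
        assembled.append(layer)
        if not works(assembled):
            return layer
    return None


LAYERS = ["coils and power supply", "adjustment knob", "lever mechanism"]


def simulated_test(assembled: List[str]) -> bool:
    # Assumption: in this pretend unit the adjustment knob is faulty,
    # so any assembly that includes it fails the test.
    return "adjustment knob" not in assembled


culprit = assemble_incrementally(LAYERS, simulated_test)
print("First layer that breaks the build:", culprit)
```

In real life the `works` function is you at the workbench with a slice of bread, but the bookkeeping is the same: every layer that passes becomes part of the trusted baseline for the next one.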

With the scope now narrowed as much as possible, this is the point at which each troubleshooting situation is unique and any formulaic approaches no longer help much. Again, this is where your knowledge of the system and your creativity become assets in trying to figure out what is going wrong and why. Ask yourself some of the following questions:

Is there a pattern to the incorrect results given different inputs?
Is there any relationship between the sub-steps involved in the flows that are working correctly and the ones that are not working correctly? If so, how is the processing between the two situations different that causes one to succeed and the other to fail?
In the search for a potential workaround, can one of the successful flows be altered to approximate the desired output of the broken flow?
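For the first two questions, it can help to write down what you observed and let a small script look for the pattern. The following is a minimal sketch, not a general-purpose tool: the observation records are invented toaster runs, and `distinguishing_factors` is a hypothetical helper that reports the settings present in every failing run but in no successful one.

```python
from typing import Any, Dict, List, Set, Tuple


def distinguishing_factors(observations: List[Dict[str, Any]]) -> Set[Tuple[str, Any]]:
    """Return (attribute, value) pairs seen in every failing run
    and in none of the passing runs."""
    failing = [set(o["attrs"].items()) for o in observations if not o["ok"]]
    passing = [set(o["attrs"].items()) for o in observations if o["ok"]]
    if not failing:
        return set()
    common_to_failures = set.intersection(*failing)
    seen_in_passes = set().union(*passing)
    return common_to_failures - seen_in_passes


# Assumption: made-up test runs, for illustration only.
runs = [
    {"attrs": {"bread": "white", "darkness": 2}, "ok": True},
    {"attrs": {"bread": "bagel", "darkness": 2}, "ok": True},
    {"attrs": {"bread": "white", "darkness": 5}, "ok": False},
    {"attrs": {"bread": "bagel", "darkness": 5}, "ok": False},
]

suspects = distinguishing_factors(runs)
print("Only seen in failing runs:", suspects)
```

Here the bread type shows up in both good and bad runs, so it drops out, and the darkness setting is left as the thing worth a closer look. A result like this is a lead to investigate, not proof of the cause.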

Once you reach this stage of troubleshooting an incident, it is difficult to say where the investigation will lead or how long it will take. You may reach a point of diminishing returns, though, where you have already invested so much time in finding a solution that it makes more sense to simply use a different device or start from scratch another way. Only you can determine this for certain, and knowing when to make that call comes with experience.