This year, at several conferences, I gave a talk on troubleshooting tips and techniques. One slide in particular drew a lot of interest from the attendees, so I decided to elaborate on it in a blog post. In the talk, I described what to look for when an undesired behavior appears in the system under test and its root cause is not obvious. I also encouraged testers to take part in finding the root cause of such a bug, and offered a few tips and techniques they can use. The slide in question presented a matrix with 5 things to check when the bug you are dealing with reproduces on one environment but not on another:
the build number of the software you are working on. This is the software that exhibits the bug whose root cause you are trying to find. I will refer to it as the 'software' throughout this article
the configuration of the software you are working on. This can be any flag or property that needs to be set and is not part of the code. It is probably stored in a separate file, and it can easily be changed to apply the desired properties to the software without changing the code. For example, such configuration can include: whether the software runs in debug or normal mode; a timeout value up to which the software waits for a response from another service; or whether a feature in the software is enabled or not
the build number of the software's dependency/dependencies. This can be an internal dependency, such as a service developed by another team in the company, or an external service or library that your software uses
the configuration of the software's dependency. Similarly to your own software, dependencies can have their own flags or properties that can be set to several values
the hardware, or hardware settings, of the environment the software runs on
These 5 items are rather easy to check, so they are the first things to look at when the undesired behavior occurs only on one machine.
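As a minimal sketch of such a check, assume each environment can be described as a flat mapping from the 5 items to a value. The keys and values below are hypothetical and only serve as an illustration; how you actually collect them depends on your own setup. A small Python helper can then list which items differ:

```python
from typing import Dict, Optional, Tuple

# Hypothetical snapshots of the 5 items on each environment.
# All keys and values are illustrative assumptions, not real data.
failing_env = {
    "software_build": "1.4.2-b118",
    "software_config": "debug=false;timeout=30",
    "dependency_build": "2.0.1",
    "dependency_config": "cache=on",
    "hardware": "4 cores, 8 GB RAM",
}
working_env = {
    "software_build": "1.4.1-b112",
    "software_config": "debug=false;timeout=30",
    "dependency_build": "2.0.1",
    "dependency_config": "cache=on",
    "hardware": "4 cores, 8 GB RAM",
}


def diff_environments(
    bad: Dict[str, str], good: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {item: (value_on_bad, value_on_good)} for every item that differs."""
    items = sorted(set(bad) | set(good))
    return {
        item: (bad.get(item), good.get(item))
        for item in items
        if bad.get(item) != good.get(item)
    }


differences = diff_environments(failing_env, working_env)
for item, (bad_value, good_value) in differences.items():
    print(f"{item}: failing={bad_value!r} working={good_value!r}")
```

With the sample data above, only `software_build` is reported, which is the situation where the two environments differ in the software's build number and nothing else.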
How would you go about it? Check whether these 5 items are identical on the environment where the bug is present and on an environment where it is not. What are the possible scenarios involving these 5 items?
1. Everything is the same on both environments, except for the software's build number. If the bug appears in the newer build, the root cause should be easy to find: it lies in the commits made after the last commit included in the good build, up to the latest commit included in the current version of the software.
2. Everything is the same on both environments, except for the software's configuration. In this case, changing the configuration to match the one on the environment where the bug does not reproduce should solve the problem.
3. Everything is the same on both environments, except for the version of one of the software's dependencies. If the version of the dependency causing the mayhem is the newer one, you could revert to the older version. This is temporary, of course: if you really need the new version of the dependency, reverting won't be a satisfactory solution.
If the dependency is built by a different team in your company, reach out to them and let them know about the problem. They might genuinely not be aware of it, or even of the way your software consumes their code. Talking to the owning team makes a fix possible, since they cannot fix a problem they do not know they have. The fix might not be a priority for them, so solving the problem might take longer. However, if it is a priority for you, point this out to them. This makes it more likely that their team will treat it as a priority too.
If the dependency is an external one (like an external service or library), it will probably take more time to have the problem fixed. In this case too, contact whoever owns the dependency, to make them aware of the problem and, if it is important to you, to ask them to give the fix a higher priority.
4. Everything is the same on both environments, except for the configuration of a dependency. In this case, again, you will need to contact the dependency owners and ask them to change the configuration. This should happen much faster than having a new version of the dependency released, since it does not involve a full release with heavy regression testing and all that.
5. Everything is the same on both environments, except for some hardware or hardware settings. In this case it is pretty clear: change the hardware, or its settings, to match the environment where the bug does not reproduce, and you're all set.
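For scenario 1, when the list of commits between the good build and the bad build is long, you can search it by bisection instead of reading it from top to bottom. The sketch below is only an illustration under a few assumptions: `reproduces_bug` is a hypothetical check that you would implement yourself (for example by building and testing a given commit), the commit ids are made up, and the bug is assumed to stay present in every commit after the one that introduced it. Git's `git bisect` command automates the same idea.

```python
from typing import Callable, Sequence


def first_bad_commit(
    commits: Sequence[str], reproduces_bug: Callable[[str], bool]
) -> str:
    """Find the first commit that shows the bug, using binary search.

    `commits` must be ordered oldest to newest, contain only commits made
    after the last good build, and end with a commit known to show the bug.
    """
    if not commits:
        raise ValueError("commits must not be empty")
    low, high = 0, len(commits) - 1
    while low < high:
        mid = (low + high) // 2
        if reproduces_bug(commits[mid]):
            high = mid  # the bug is already there, look at older commits
        else:
            low = mid + 1  # still good here, look at newer commits
    return commits[low]


# Hypothetical history: the bug is assumed to have been introduced in "c3".
history = ["a1", "b2", "c3", "d4", "e5"]


def simulated_check(commit: str) -> bool:
    """Stand-in for a real build-and-test run of the given commit."""
    return history.index(commit) >= history.index("c3")


culprit = first_bad_commit(history, simulated_check)
print(culprit)
```

Each check halves the remaining range, so even a few hundred commits need fewer than ten build-and-test runs.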
Of course, in each of the above scenarios, only one of the 5 items differs between the two environments. When 2 or 3 items differ from one environment to the other, try to 'normalize' the environment where the bug occurs: set the items, one at a time, to the values they have on the properly working environment. This can help narrow things down. And of course, these 5 items are not the entire story, but they are the things you can start with, and the easiest ones to investigate. It might be that all 5 items are identical on both environments. In that case you need to dig deeper and look at other aspects, such as the network, the DB, the firewall, and their configurations.
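The 'normalize' step can be sketched as a simple loop. This is an illustration, not a definitive procedure: the environments are modeled as plain dictionaries, and `bug_reproduces` is a hypothetical callback standing in for whatever you do to apply a setting and re-run the failing scenario.

```python
from typing import Callable, Dict, List, Optional


def normalize_until_fixed(
    bad_env: Dict[str, str],
    good_env: Dict[str, str],
    bug_reproduces: Callable[[Dict[str, str]], bool],
) -> Optional[List[str]]:
    """Align the failing environment with the working one, item by item.

    Returns the items changed up to the point where the bug stopped
    reproducing, or None if the bug is still there after aligning all items.
    """
    current = dict(bad_env)  # work on a copy, leave the input untouched
    changed: List[str] = []
    for item in sorted(good_env):
        if current.get(item) == good_env[item]:
            continue  # already identical, nothing to normalize
        current[item] = good_env[item]
        changed.append(item)
        if not bug_reproduces(current):
            return changed
    return None


# Hypothetical environments that differ in two of the items.
bad = {"software_build": "b118", "software_config": "timeout=5", "dependency_build": "2.0.1"}
good = {"software_build": "b112", "software_config": "timeout=30", "dependency_build": "2.0.1"}


def simulated_bug(env: Dict[str, str]) -> bool:
    """Stand-in for re-running the failing scenario on the given environment."""
    return env["software_config"] == "timeout=5"


result = normalize_until_fixed(bad, good, simulated_bug)
print(result)
```

The last item in the returned list is the prime suspect, although it may only cause the bug in combination with an item changed earlier. A `None` result corresponds to the case where all the items are identical and you need to dig deeper.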