Most of us are familiar with randomly failing automated tests. These are tests that, with no change to the feature they exercise, either fail intermittently at the same step or fail at a different step from one run to the next. Handling the results of such tests can be tricky, and some teams choose to simply retry a test when it fails. But is that the best option? Here are my thoughts.
First of all, we need to ask ourselves why these tests are failing randomly. Here are a few possible reasons:
The test environment is unreliable. Too often, a test environment does not have enough hardware resources to work properly under the load our automation generates, or it is configured incorrectly.
To get a green test report at the end of a run, teams often put a retry mechanism in place. It re-runs each failing test either once or a chosen number of times. However, this can hide the fact that a test failed for a genuine reason: a bug in the system. A test that fails on the first run but passes on a later one can still point to a real bug:
In some (rare) cases, the first test that runs after the servers start can uncover a bug that only affects the first request, or the first few requests, to that server.
In other cases, the test fails because of a real bug that only occurs when certain environment conditions are met. Those conditions happened to be met during the first run, so the bug showed itself. But because the re-run passed, the failure is written off as a test glitch and never investigated. With the retry mechanism in place, this bug will stay unfixed for quite a while, since people are happy that the second run of the test was successful.
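To make the first case concrete, here is a minimal sketch of how a retry can turn a real failure into a green result. Everything in it is hypothetical and invented for illustration: `FakeServer` simulates a server whose first request after startup fails, `check_homepage` stands in for an automated test, and `run_with_retry` is a bare-bones retry loop, not the API of any particular test framework.

```python
class FakeServer:
    """Hypothetical server with a bug: the first request after startup fails."""

    def __init__(self):
        self.requests_served = 0

    def handle(self):
        self.requests_served += 1
        # Simulated defect: only the very first request is broken.
        return 500 if self.requests_served == 1 else 200


def check_homepage(server):
    """Stand-in for a test: fails unless the server answers 200."""
    status = server.handle()
    if status != 200:
        raise AssertionError("expected HTTP 200")


def run_with_retry(test, server, max_attempts=2):
    """Run the test until it passes or the attempts are used up.

    Returns (passed, failures), where failures holds one message
    for every attempt that failed.
    """
    failures = []
    for attempt in range(1, max_attempts + 1):
        try:
            test(server)
            return True, failures
        except AssertionError as exc:
            failures.append(f"attempt {attempt}: {exc}")
    return False, failures


passed, failures = run_with_retry(check_homepage, FakeServer())
print(passed)    # True: the report shows green
print(failures)  # ['attempt 1: expected HTTP 200']
```

The report only sees `passed`, so the first-request bug disappears from view. The `failures` list is the part a plain retry usually throws away, and it is exactly the information someone would need in order to investigate.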
Apart from the fact that bugs are not investigated, the retry mechanism brings another annoyance: a failing test now takes at least twice as long to run. We run it, it fails, and then we re-run it at least once.
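The cost is easy to estimate. The sketch below is only a rough model: the function names are made up for this post, and the expected-time calculation assumes that each attempt fails independently with the same probability, which real flaky tests often do not.

```python
def worst_case_seconds(duration, max_retries):
    """Time spent if the test fails on every attempt."""
    return duration * (1 + max_retries)


def expected_seconds(duration, max_retries, failure_rate):
    """Average time spent, assuming independent failures (an assumption).

    Attempt k + 1 only happens if the previous k attempts all failed,
    which has probability failure_rate ** k.
    """
    expected_runs = sum(failure_rate ** k for k in range(max_retries + 1))
    return duration * expected_runs


# A 60-second test with up to 2 retries:
print(worst_case_seconds(60, 2))     # 180
print(expected_seconds(60, 2, 0.5))  # 105.0
```

Multiply that by every flaky test in the suite and the feedback loop gets noticeably slower.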
So, what would be a better option than using retries?
If the environment is slow or unreliable, fixing the environment is the best solution. It saves us from piling all kinds of workarounds into our tests, and it leads to cleaner code, without all the try/catch blocks, retries, and whatnot.
If the tests are to blame, the tests should be fixed. We should create the best, most reliable version of a test, not just write some code to check the test off the ‘to do’ list. If there is no change in the environment or in the software we are testing, every run of the test should give the same result. And it should run in a decent amount of time, not twice that time or more, depending on how many retries we allow. We want fast feedback, which means fast test results.
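One simple way to check the "same result on every run" property is to run a test many times against an unchanged system and compare the outcomes. The following is a minimal sketch under that idea; `run_repeatedly`, `is_consistent`, and both example tests are hypothetical names invented here, and the flaky test uses a seeded random number generator purely to imitate an unstable dependency.

```python
import random


def run_repeatedly(test, runs=20):
    """Run the test `runs` times and return the outcomes (True = passed)."""
    outcomes = []
    for _ in range(runs):
        try:
            test()
            outcomes.append(True)
        except AssertionError:
            outcomes.append(False)
    return outcomes


def is_consistent(outcomes):
    """True when every run produced the same result."""
    return len(set(outcomes)) <= 1


def stable_test():
    # Depends only on its own input, so it behaves the same every time.
    if sorted([3, 1, 2]) != [1, 2, 3]:
        raise AssertionError("unexpected sort order")


_rng = random.Random(0)  # seeded stand-in for an unreliable dependency


def flaky_test():
    # Passes or fails depending on a value the test does not control.
    if _rng.random() >= 0.5:
        raise AssertionError("simulated intermittent failure")


print(is_consistent(run_repeatedly(stable_test)))  # True
print(is_consistent(run_repeatedly(flaky_test)))   # False
```

A test that comes back inconsistent here needs fixing, or points at an environment that needs fixing, before a retry is even considered.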
If we don’t address these random failures, at some point the tests may be considered irrelevant and no longer run. Or the failures may be ignored completely: once we know a test tends to fail randomly, we treat every failure as random, even when it fails for a good reason. We won’t even look at the reason for the failure. And that is how bugs are overlooked.