Monday, June 18, 2012

Can your tests stand the ‘test of time’?

Today I want to digress a bit from Model-Based Testing and talk about a general issue in software testing, applicable to all types of testing. My focus will be on the use of time-dependent methods in test code. These include library functions like ‘DateTime.Now’, reading the BIOS clock, retrieving the current time from a time server, etc. The use of time-dependent methods can lead to unpredictable behavior and is a common source of fragility in tests.
But first, let’s look at a definition of time. From Wikipedia:
Time is the indefinite continued progress of existence and events that occur in apparently irreversible succession from the past through the present to the future.
There are two vital pieces of information in this definition: continued progress and irreversible. The definition says nothing about the size or quantity of time. In fact, our daily representation of time in hours, minutes, seconds, milliseconds, etc. is completely arbitrary. Instead I like to think of time in a mathematical sense, as a strictly increasing sequence t: t_i < t_(i+1). The absolute value of t_i is irrelevant.
This is how I believe time should be viewed in test code; in fact, I will postulate a ‘test of time’ at the end of this article.
I will assume we all agree that fragile tests are costly to maintain and have no place in a regression suite.
Common misuses of time #1: Validation of time-dependent data
A common misuse of time is the validation of messages, dialogs, errors and other strings containing the current time, as generated by the system under test. I recently saw a test case validating a text containing the last updated time of a graph. The test looked something like:
            String expectedMessage = String.Format("Last updated: {0}", DateTime.Now.TimeOfDay.ToString(@"hh\:mm"));
            Assert.AreEqual(expectedMessage, graph.StatusMessage, "Incorrect status message displayed");

The graph would show a status like ’Last updated: 13:02’. The test would then validate the correctness of this message using the current time. The problem, of course, is that if this test runs right around 13:02:59, there is a chance the time rolls over to 13:03 between the Update call and the computation of the expected message, causing the test to fail. This is a fragile test. Computing the expected message before updating does not fix the problem either; the roll-over can just as well happen the other way around. For this test to actually be ‘correct’, it should validate that the displayed time falls within the window of possibilities, like:
            DateTime windowStart = DateTime.Now;
            graph.Update();
            DateTime windowEnd = DateTime.Now;
            DateTime displayedTime = ExtractDateTime(graph.StatusMessage);
            Assert.IsTrue(windowStart <= displayedTime && displayedTime <= windowEnd, "Incorrect status message displayed");

Although this test is stable with respect to the natural progression of time, it is still susceptible to outside factors affecting the system clock. The clock may be adjusted just before the windowEnd variable is set, for example by synchronization with an online time server or by a daylight saving time transition. These events are much less likely to happen, but if you run hundreds of thousands of tests every day, chances are you will eventually run into strange scenarios where time behaves unpredictably.
The correct approach is of course to mock the timer in the system under test, so that you control it from your test. Then it is simply a matter of freezing time:
    public class MockTimer : ITimer
    {
        private DateTime currentTime = DateTime.Now;

        public DateTime GetTimeOfDay()
        {
            return currentTime;
        }
    }
With the mock approach the test is directly in control of time, and you rid yourself of external dependencies. The downside is that the system under test must be architected in a fashion where you can supply your mock timer to it.
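To make the injection pattern concrete, here is a minimal, language-agnostic sketch in Python. The names FrozenClock and Graph, and the injected now() method, are illustrative assumptions, not taken from the original code:

```python
from datetime import datetime

class FrozenClock:
    """Test double for the system clock: always returns the same instant."""
    def __init__(self, frozen_at: datetime):
        self._now = frozen_at

    def now(self) -> datetime:
        return self._now

class Graph:
    """System under test: receives its clock as a constructor dependency."""
    def __init__(self, clock):
        self._clock = clock
        self.status_message = ""

    def update(self):
        # Uses the injected clock instead of reading the real system time.
        self.status_message = "Last updated: {:%H:%M}".format(self._clock.now())

# In the test, time is frozen, so the expected message is fully deterministic.
clock = FrozenClock(datetime(2012, 6, 18, 13, 2))
graph = Graph(clock)
graph.update()
assert graph.status_message == "Last updated: 13:02"
```

Because the clock is a constructor argument, production code can pass a real clock while tests pass a frozen one; no test ever races against the minute boundary.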
Perception of time
Before I go on to the next example, I’d like to define the perception of time of a computer program.
By perception of time, I mean the perceived amount of time that one unit of code takes to execute (a unit could be a single instruction, a function call or a class/component performing a task). Perception of time for a computer program is far from constant. It is affected by a plethora of external factors. The most common is CPU usage by other processes: a highly stressed CPU will execute your program much slower, effectively causing time dilation (the relativity metaphor actually works rather well here, the faster the CPU is running, the more time slows down).
On the microscopic time-scale, the perception of time is impacted by OS scheduling: an instruction could finish in 1 millisecond, or the program may have to wait 100 milliseconds for its OS time slot to begin.
On the macroscopic level the program could be using a database which happens to be in a locked state, causing it to run for 10 minutes before the lock is released and the program is able to finish.
So in fact, the perception of time of a computer program changes quite a lot during execution. It is jittery from time slotting, and in periods it slows or speeds up drastically due to other programs consuming resources.
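This jitter is easy to observe. The following small Python experiment times the same trivial unit of work repeatedly; the workload sum(range(100)) is just a stand-in for any ‘unit of code’:

```python
import time

# Time an identical, trivial unit of work many times. The measured
# durations vary from run to run due to scheduling, caches, and other
# processes competing for the CPU.
samples = []
for _ in range(1000):
    start = time.perf_counter()
    sum(range(100))          # the 'unit of code' being measured
    samples.append(time.perf_counter() - start)

# Ratio between the slowest and fastest observation of the same work.
# It is typically well above 1: identical work, different durations.
spread = max(samples) / min(samples) if min(samples) > 0 else float("inf")
```

The exact spread depends on the machine and its load, which is precisely the point: wall-clock measurements of identical work are not reproducible.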
Common misuses of time #2: ‘Performance’ regression test
A colleague of mine recently wrote a regression test to prevent a performance issue from being reintroduced. The problem, simplified, was that a function working on a set should run in constant time, because the correct implementation only needed to look at the last element of the set to compute the result. The premise of the test was that running the function on a single-element set and measuring the execution time should roughly match a repeat execution on a five-element set. If the execution times differed by more than 50% the test would fail:
            DateTime beginSingleElement = DateTime.Now;
            ExecuteTestOnSingleElement();
            TimeSpan durationSingleElement = DateTime.Now - beginSingleElement;

            DateTime beginMultipleElements = DateTime.Now;
            ExecuteTestOnMultipleElements();
            TimeSpan durationMultipleElements = DateTime.Now - beginMultipleElements;

            TimeSpan delta = durationSingleElement - durationMultipleElements;
            Assert.IsTrue(Math.Abs(delta.Ticks) < 0.5 * durationMultipleElements.Ticks, "Performance problem detected.");

Okay, fair enough: the test at least makes a comparison between two runs instead of measuring absolute time. It is also forgiving, allowing for a 50% difference between the measurements.
However, with our understanding of the perception of time, it is clear that this test is susceptible to changes in it. If the computer happens to start installing Windows Updates during the ExecuteTestOnMultipleElements call, the measurement could be skewed significantly, well outside the 50% safety margin.
Again, the test is fragile, leading to superfluous analysis work upon every failure. In our case we were able to refactor the test to be based on code coverage measurements, which are deterministic and within our control. But the general advice is that performance tests should not be included in regression suites whose purpose is to determine the quality of a particular build. Instead they should run in a dedicated performance suite, which measures the product in a highly controlled environment, and where the goal is to gauge performance, not to provide a yes/no verdict on quality.
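Our actual refactoring used code coverage measurements, but the same principle (assert the amount of work instead of the elapsed time) can be sketched in a few lines of Python. CountingList and last_element_result are invented names for illustration, not the real code:

```python
class CountingList(list):
    """Wraps a list and counts element accesses, so the test can measure
    work done instead of wall-clock time."""
    def __init__(self, items):
        super().__init__(items)
        self.accesses = 0

    def __getitem__(self, index):
        self.accesses += 1
        return super().__getitem__(index)

def last_element_result(items):
    # Hypothetical function under test: a correct, constant-time
    # implementation only needs to inspect the last element.
    return items[-1]

single = CountingList([42])
multiple = CountingList([1, 2, 3, 4, 5])
last_element_result(single)
last_element_result(multiple)

# Deterministic check: the same number of element accesses regardless of
# input size, i.e. constant work. No timer is involved anywhere.
assert single.accesses == multiple.accesses == 1
```

A regression in which the function starts iterating the whole set would change the access count immediately and deterministically, with no 50% fudge factor needed.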
The ‘test of time’
Mock the underlying timer functions in your test framework in the following way:
1. Initialize a counter to 0.
2. On every call to your timer function, increase the counter by a random positive integer amount and return the updated value of the counter.
Repeatedly run your tests (as well as the system under test) using this mock implementation of a timer. If any of the tests fail, this is an indication that they are not agnostic to the perception of time, and in turn that they are susceptible to external factors that can influence the system clock.
This test will surface any undesired test dependencies on the system time, so you are able to fix them up front. The random pattern makes it much more likely that you will detect problems like #1 and #2.
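The two steps can be sketched as follows in Python. TestOfTimeClock is an illustrative name, and the seed and increment range are arbitrary choices:

```python
import random

class TestOfTimeClock:
    """Mock timer for the 'test of time': starts a counter at 0 and jumps
    it forward by a random positive amount on every call. Absolute values
    are meaningless; only the strictly increasing order is guaranteed."""
    def __init__(self, seed=None):
        self._counter = 0
        self._rng = random.Random(seed)

    def now(self):
        self._counter += self._rng.randint(1, 1_000_000)
        return self._counter

clock = TestOfTimeClock(seed=42)
readings = [clock.now() for _ in range(100)]

# Time only ever moves forward, by unpredictable amounts: any test that
# assumes a particular pace or absolute value of time will break here.
assert all(a < b for a, b in zip(readings, readings[1:]))
```

Substituting this clock for the real one shakes out tests that secretly depend on how fast, or at what value, the system clock happens to be running.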
If your tests can’t stand this ‘test of time’, then my claim is that, given enough time, they will cause you problems. Depending on your setup it may be more or less critical to have stable regression tests, but for the most part I find that any fragile test will either be ignored during analysis because it is known to be fragile (making it worthless, because if it happened to find a bug, that would be ignored too), or it will keep appearing on your radar over and over again, disrupting your daily rhythm and sucking up effort that could be spent more productively. At first this might not seem like a big problem, but as your scope scales to maybe 10,000 or 100,000 tests, what seemed a small problem suddenly magnifies into a gigantic time hole that impedes you from moving forward in your daily work.
