Wednesday, November 21, 2012

Simulating Stochastic Models

Recently I’ve been tasked with working on an infrastructure system which comprises several computationally intensive tasks that can be scheduled on a series of host machines. Some tasks have dependencies on other tasks, and cannot be scheduled before its dependencies have finished. In a simplified setup we assume that any task can run on any machine, provided that machine is not busy with running another task.

The following picture illustrates the system


Host A runs task 1 and 3, while host B runs task 2 and 4. Task 4 has a dependency on task 1 and task 3 has one on task 2. When all tasks are completed the job is considered completed. It is evident from the picture that a task (task 3) may be delayed in its scheduling even though its dependencies are satisfied due to resource constraints.

Runtime and randomness
Each task takes a finite amount of time to complete, illustrated by the size of the task in the picture. Furthermore, the runtime of the tasks are highly stochastic as they are influenced by CPU load, network latency, IO latency, even build-in randomness in the tasks, etc. This causes the overall system to behave stochastically. At first hand one would think that we could just look at average runtimes of individual tasks to get an average job runtime, however this is not going to give an accurate response of the system.

Now, traditionally I would model this problem in Spec Explorer, but in this case the system is already represented by a model, namely the dependency tree of the job


 
This allows me to write a simulation tool instead.

Simulation
The idea about the tool is to simulation how the system would stochastically schedule tasks in a single job. This is called a sample. Once this is working, we can set the tool to repeatedly simulate the same system, producing thousands of samples. We can then measure the average job runtime of all the produced samples. This is called Monte Carlo Simulation and is an exceptionally powerful technique. It turns out that the measurement error on the sample mean, is reduced by the inverse square root of the number of samples – regardless of what metric is measured, and regardless of what system is measured! Let’s take a step back and think about the ramifications of that. That means, no matter how complicated we make the stochastic system, the measurements are going to become more and more accurate, the more samples we take from the system. So, say we want to model re-running failing tasks for instance. No problems, if a task fails (this is also stochastically determined) we let the system reschedule it. We sample the system a couple of thousand times and our metrics are updated with the new mean job execution time!

Furthermore, the model was able to predict and generate a heat-map of the bottlenecks of the system, which I’d like to share with you for inspiration:



The redder the task the more often it appeared on the critical-path of the system (the path determines the overall execution time). By reducing the time of the critical path, you reduce the overall execution time. The problem is that in a stochastic system the critical path changes between executions. Therefore, this heat-map is extremely valuable information because it allows us to direct our effort to the tasks that most often appear on the critical path.

What does this have to do with model based testing?
Well, think about it.

Is there an underlying model? – Yes! However, it’s not a finite state model supposed to simulate the system under test, but a stochastic model that is the system under test.

But will it find any bugs? – Arguably the model is not designed to find bugs in the system (because the model and system is the same, which makes it impossible to do so…). Yet, the simulations will help detect bottlenecks, resource starvation, mean critical path, optimum number of resources, optimum re-run strategy, etc. These are not bugs as we see them traditionally. However, they are all system characteristics which we are trying to improve. So they could just as well be interpreted as bugs!

In some extreme cases you could also prevent bugs by specifying acceptable intervals for your measurements – for example, we could state that no job may take more than four hours to complete, then if a simulated sample happens to exceed this limit, we may conclude that the system is defective.

Conclusion
What I wanted to demonstrate with this post is that Model-Based Testing is not just about Finite-State models and Spec Explorer. The fundamentals of modeling applies to many problems. In my example I was lucky enough to be working with a system that could be described through a model. When you are faced with a new problem, try thinking out-side of the box, in many cases modeling turns out to be a valuable tool in your toolbox.

No comments:

Post a Comment