This is a simple event-based simulator for stateful micro-service based applications developed according to the RTilience model (see H. Gustafsson, F. A. Svensson, R. Mini, L. Abeni, R. Andreoli, T. Cucinotta ``RTilience: Fault-Tolerant Time-Critical Kubernetes'' in IEEE Transactions on Services Computing, 2025). Each application is modelled as a Directed Acyclic Graph (DAG) of micro-services, and each node of the DAG is composed of multiple load-balanced containers, to improve fault-tolerance and scalability.
From the mathematical point of view, an RTilience application is a network of nodes composed of multiple FIFO queues; hence, this can be seen as a simulator for a multi-queue system, where every queue can be subject to failures or can require some state to process packets. See the source file for a description of the simulator's code organization.
The main programs are in the src directory (and the most commonly used one
is dag_queues). To compile them, do
cd src
make
then, the simulator can be started with (for example)
./dag_queues test.dot
where test.dot is a DOT file describing the DAG, for example
digraph test {
deadline="200";
"T0" [c="0.0,0.0,0.5,0.5",M="4"];
"T1" [c="0.0,0.8,0.2",M="2",t="timeout",tout="5",
pfail="0.01",precover="0.2",state_get="1",state_set="2"];
"T0" -> "T1";
}
which describes a simple "Source -> Queue" graph.
The source (T0) has the request inter-arrival probability distribution
described in the file (probability 0.5 of inter-arrival 2 time units and
probability 0.5 of inter-arrival 3 time units). There are 4 such sources
specified.
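The way the c attribute encodes a distribution can be sketched in a few lines of Python. This is only an illustration, not code from the simulator: it assumes that entry i of the comma-separated list (counting from 0) is the probability of the value i, which is what the T0 example suggests. The helper names parse_distribution, mean and sample are hypothetical.

```python
import random

def parse_distribution(c):
    """Return {time_value: probability} for a comma-separated "c" string.

    Assumption: entry i of the list is the probability of the value i
    (counting from 0), as suggested by the README example.
    """
    probs = [float(p) for p in c.split(",")]
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Drop zero-probability entries to keep the result compact.
    return {t: p for t, p in enumerate(probs) if p > 0.0}

def mean(dist):
    """Expected value of the discrete distribution."""
    return sum(t * p for t, p in dist.items())

def sample(dist, rng):
    """Draw one value according to the distribution."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

t0 = parse_distribution("0.0,0.0,0.5,0.5")
print(t0)        # {2: 0.5, 3: 0.5}
print(mean(t0))  # 2.5
rng = random.Random(1)
print([sample(t0, rng) for _ in range(5)])  # five values, each 2 or 3
```

With this reading, each T0 source generates on average one request every 2.5 time units.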
The queue has service time 1 with probability 0.8 and service time 2 with
probability 0.2 and there are 2 such queues. The queue (T1) has a timeout
of 5 time units, a failure probability of 0.01 and a recovery probability of
0.2. These probabilities will override the ones specified as command line
arguments.
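To get a feeling for what pfail and precover imply, the following Python sketch treats a queue as a two-state (healthy/failed) Markov chain. This is an assumption made for illustration: it supposes that each probability is applied once per time unit, and the simulator's actual failure model may be different. The function names are hypothetical.

```python
def availability(pfail, precover):
    """Long-run fraction of time spent healthy in a two-state Markov chain.

    Assumption: a healthy queue fails with probability pfail per time unit,
    and a failed queue recovers with probability precover per time unit.
    """
    return precover / (pfail + precover)

def mean_time_to_failure(pfail):
    """Expected number of time units until a healthy queue fails (geometric)."""
    return 1.0 / pfail

def mean_time_to_recover(precover):
    """Expected number of time units until a failed queue recovers (geometric)."""
    return 1.0 / precover

# Values from the T1 node of the example DAG.
print(round(availability(0.01, 0.2), 4))  # 0.9524
print(mean_time_to_failure(0.01))         # 100.0
print(mean_time_to_recover(0.2))          # 5.0
```

Under these assumptions, each T1 queue would be healthy about 95% of the time.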
The queue (T1) has a get/pull state latency of 1 when it is missing the
latest state, and a set/push state latency of 2 when it finishes
execution.
These latencies override the ones specified as command line arguments.
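When running many experiments it can be convenient to generate the DOT file from a script. The following Python sketch rebuilds the example above and varies the tout attribute. It is not part of the simulator: node_line and make_dot are hypothetical helpers, and only attributes that appear in the example are used.

```python
def node_line(name, **attrs):
    """Format one DOT node statement with quoted attribute values."""
    body = ",".join(f'{key}="{value}"' for key, value in attrs.items())
    return f'"{name}" [{body}];'

def make_dot(graph_name, deadline, nodes, edges):
    """Return the text of a DOT file in the same layout as the example."""
    lines = [f"digraph {graph_name} {{", f'deadline="{deadline}";']
    lines += [node_line(name, **attrs) for name, attrs in nodes.items()]
    lines += [f'"{src}" -> "{dst}";' for src, dst in edges]
    lines.append("}")
    return "\n".join(lines) + "\n"

def example_dag(tout):
    """The "Source -> Queue" example, with a configurable timeout."""
    nodes = {
        "T0": {"c": "0.0,0.0,0.5,0.5", "M": 4},
        "T1": {"c": "0.0,0.8,0.2", "M": 2, "t": "timeout", "tout": tout,
               "pfail": 0.01, "precover": 0.2,
               "state_get": 1, "state_set": 2},
    }
    return make_dot("test", 200, nodes, [("T0", "T1")])

# One DOT text per timeout value; write them to files before running
# the simulator on each (the file names are only an example).
dags = {f"test_tout{tout}.dot": example_dag(tout) for tout in (3, 5, 7)}
print(dags["test_tout5.dot"])
```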
The t attribute, i.e. which type of reliability is used, supports these values:
- critical: send to two queues simultaneously, no timeout needed
- timeout: sends a second replica packet after tout time units
- pdeadline: sends a second replica packet meeting a pdl partial deadline, requires also deadline for the full DAG
- c_percent: use a timeout based on tout (0-100) percent of the CDF of the service time distribution
- dynamic: autonomously set the timeout based on the full DAG deadline and resource usage
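As an illustration of the c_percent idea, the Python sketch below derives a timeout from a percentile of the service time CDF. It is a guess at the principle, not the simulator's code: it assumes the timeout is the smallest service time whose cumulative probability reaches tout percent, and it models the slotted option as one extra time unit. The function name percentile_timeout and its slotted parameter are hypothetical.

```python
from itertools import accumulate

def percentile_timeout(c, percent, slotted=False):
    """Smallest time t such that CDF(t) >= percent / 100.

    Assumptions: entry i of "c" is the probability of service time i, and
    slotted mode simply adds 1 time unit to the result.
    """
    probs = [float(p) for p in c.split(",")]
    target = percent / 100.0
    timeout = len(probs) - 1  # fall back to the largest service time
    for t, cumulative in enumerate(accumulate(probs)):
        # Small tolerance so floating point sums do not miss the target.
        if cumulative >= target - 1e-12:
            timeout = t
            break
    return timeout + 1 if slotted else timeout

service = "0.0,0.8,0.2"  # the T1 service time distribution
print(percentile_timeout(service, 80))                # 1
print(percentile_timeout(service, 90))                # 2
print(percentile_timeout(service, 80, slotted=True))  # 2
```

With the T1 distribution, 80% of the requests complete within 1 time unit, so a percentage of 80 gives a timeout of 1, while any higher percentage gives 2.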
This is a simplified version of the dag_queues help text (see the man page
in the docs directory for more information):
Usage: ./dag_queues [-l <max_t>] [-q <qlen>] [-s <rnd_seed>] [-n <notify>] [-f <pfail>] [-h <phealthy>] [-m] [-x] [-S <set-delay>] [-G <get-delay>] [-B {rr | sticky}] [-v {all,state,dag,flow}] <dag.dot-file>
-l <max_t> simulation time
-q <qlen> queue length
-n <notify> nbr of progress reports
-f <pfail> fail probability
-h <phealthy> healing probability
-m enable multiserver
-x enable slotted
-S <set-delay> enable stateful, set delay
-G <get-delay> enable stateful, get delay
-B {rr | sticky} select load balancing, RR is default
-v {all,state,dag,balance} verbose, comma separated string of modules
...
Notice that by default dag_queues uses the load balanced queues,
while -m (multiserver) is to have multiple servers for 1 queue
(something like a G/G/m queue).
The -x option enables slotted mode, which is used in combination with
percentage timeouts (c_percent) to add 1 extra time unit of wait time.
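For scripted experiments, the command line can be assembled programmatically. The Python sketch below only uses flags listed in the help text above; the build_command helper, its parameter names, and the example values (simulation time, seed) are hypothetical, and the accepted value formats should be checked against the man page.

```python
def build_command(dot_file, binary="./dag_queues", max_t=None, seed=None,
                  pfail=None, phealthy=None, balance=None, multiserver=False):
    """Build the argument list for one dag_queues run.

    Only options given a value are added, so the simulator's own defaults
    apply to everything else.
    """
    cmd = [binary]
    options = (("-l", max_t), ("-s", seed), ("-f", pfail),
               ("-h", phealthy), ("-B", balance))
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    if multiserver:
        cmd.append("-m")
    cmd.append(dot_file)  # the DOT file is the last argument
    return cmd

cmd = build_command("test.dot", max_t=100000, seed=42, balance="sticky")
print(" ".join(cmd))  # ./dag_queues -l 100000 -s 42 -B sticky test.dot

# To actually run it from the src directory (requires the compiled binary):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.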
A simple make invocation will compile a non-optimized version of the
simulator, which can be quite slow but is easy to debug.
To compile the optimized (difficult to debug, but fast) version of the
simulator, use
CPPFLAGS="-DNDEBUG -DUNSAFE_CAST" CXXFLAGS="-O3" make
To compile with ASAN, instead, use
CFLAGS=-fsanitize=address CXXFLAGS=-fsanitize=address LDFLAGS=-fsanitize=address make
Notice that ASAN can detect some harmless memory leaks at the end of the execution. To suppress the corresponding message, use
export ASAN_OPTIONS=detect_leaks=0
Finally, it is possible to compile with clang++ by using
CC=clang CXX=clang++ make
If you use this simulator in your research and/or you write a paper based on it, you can cite the following paper:
L. Abeni, H. Gustafsson, F. A. Svensson, T. Cucinotta. ``State Management in Real-Time Cloud and NFV Services'' in Proceedings of the 11th IEEE Conference on Network Functions Virtualization and Software-Defined Networking (IEEE NFV-SDN 2025), November 10-12, 2025, Athens, Greece.
This simulator is released under the MIT license (see LICENSE.txt) and everyone is welcome to contribute fixes, improvements, and/or new features under such a license. Notice that this public git repository is used for releases and the development is done in a private internal repository, which is periodically synchronized with this one. If you plan to contribute patches, contact the maintainers to base your patches on the most up-to-date code.