EricssonResearch/RTilienceSim

Description of the simple RTilience C++ simulator

This is a simple event-based simulator for stateful micro-service-based applications developed according to the RTilience model (see H. Gustafsson, F. A. Svensson, R. Mini, L. Abeni, R. Andreoli, T. Cucinotta, "RTilience: Fault-Tolerant Time-Critical Kubernetes", in IEEE Transactions on Services Computing, 2025). Each application is modelled as a Directed Acyclic Graph (DAG) of micro-services, and each node of the DAG is composed of multiple load-balanced containers to improve fault tolerance and scalability.

From a mathematical point of view, an RTilience application is a network of nodes, each composed of multiple FIFO queues; hence, this can be seen as a simulator for multi-queue systems in which every queue can be subject to failures or can require some state to process packets. See the source file for a description of the simulator's code organization.

Compilation and usage

The main programs are in the src directory (the most commonly used one is dag_queues). To compile them, run

	cd src 
	make

The simulator can then be started with, for example,

	./dag_queues test.dot

where test.dot is a DOT file describing the DAG, for example

digraph test {
    deadline="200";
    "T0" [c="0.0,0.0,0.5,0.5",M="4"];
    "T1" [c="0.0,0.8,0.2",M="2",t="timeout",tout="5",
          pfail="0.01",precover="0.2",state_get="1",state_set="2"];
    "T0" -> "T1";
}

which describes a simple "Source -> Queue" graph. The c attribute lists one probability per value, starting from 0 time units. The source (T0) therefore has a request inter-arrival time of 2 time units with probability 0.5 and of 3 time units with probability 0.5, and M="4" specifies 4 such sources. The queue (T1) has a service time of 1 with probability 0.8 and of 2 with probability 0.2, and there are 2 such queues. T1 has a timeout of 5 time units, a failure probability of 0.01, and a recovery probability of 0.2; these probabilities override the ones specified as command-line arguments. T1 also has a get/pull state latency of 1, paid when it is missing the latest state, and a set/push state latency of 2, paid when it finishes execution; these latencies likewise override the ones specified as command-line arguments. The t attribute selects the type of reliability mechanism and supports these values:

  • critical: sends to two queues simultaneously; no timeout is needed

  • timeout: sends a second replica packet after tout time units

  • pdeadline: sends a second replica packet to meet a partial deadline (pdl); also requires deadline to be set for the full DAG

  • c_percent: uses a timeout based on tout (0-100) percent of the CDF of the service time distribution

  • dynamic: autonomously sets the timeout based on the full DAG deadline and resource usage
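The distributions in the DOT example, and the idea behind a percentage-based timeout, can be explored with a short C++ sketch. It is not taken from the simulator's sources. It assumes that entry k of a c attribute is the probability of the value k (which matches the description of T0 and T1), and it reads c_percent as "the smallest time whose cumulative probability reaches tout percent", which is only one plausible interpretation. The names parse_pmf, pmf_mean, and percentile_timeout are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parse a comma-separated list of probabilities such as "0.0,0.8,0.2".
// Assumption: entry k is the probability of the value k (in time units).
std::vector<double> parse_pmf(const std::string& text) {
    std::vector<double> pmf;
    std::stringstream in(text);
    std::string item;
    while (std::getline(in, item, ',')) {
        pmf.push_back(std::stod(item));
    }
    return pmf;
}

// Expected value of the distribution: sum over k of k * P(k).
double pmf_mean(const std::vector<double>& pmf) {
    double mean = 0.0;
    for (std::size_t k = 0; k < pmf.size(); ++k) {
        mean += static_cast<double>(k) * pmf[k];
    }
    return mean;
}

// Smallest time t whose cumulative probability reaches percent/100.
// Hypothetical reading of c_percent; the simulator may round differently.
std::size_t percentile_timeout(const std::vector<double>& pmf, double percent) {
    assert(percent >= 0.0 && percent <= 100.0);
    const double target = percent / 100.0;
    double cdf = 0.0;
    for (std::size_t t = 0; t < pmf.size(); ++t) {
        cdf += pmf[t];
        // Small tolerance so that, e.g., 0.8 is treated as reaching 80 percent.
        if (cdf + 1e-12 >= target) return t;
    }
    return pmf.empty() ? 0 : pmf.size() - 1;
}
```

Under these assumptions, the T1 distribution "0.0,0.8,0.2" has a mean service time of about 1.2 time units, percentile_timeout returns 1 for 80 percent and 2 for 90 percent, and the T0 distribution "0.0,0.0,0.5,0.5" has a mean inter-arrival time of 2.5 time units.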

This is a simplified version of the dag_queues help text (see the man page in the docs directory for more information):

Usage: ./dag_queues [-l <max_t>] [-q <qlen>] [-s <rnd_seed>] [-n <notify>] [-f <pfail>] [-h <phealthy>] [-m] [-x] [-S <set-delay>] [-G <get-delay>] [-B {rr | sticky}] [-v {all,state,dag,flow}] <dag.dot-file>
-l <max_t>         simulation time
-q <qlen>          queue length
-n <notify>        nbr of progress reports
-f <pfail>         fail probability
-h <phealthy>      healing probability
-m                 enable multiserver
-x                 enable slotted
-S <set-delay>     enable stateful, set delay
-G <get-delay>     enable stateful, get delay
-B {rr | sticky}   select load balancing, RR is default
-v {all,state,     verbose, comma-separated string of modules
    dag,balance}
...

Note that by default dag_queues uses load-balanced queues, whereas -m (multiserver) makes multiple servers share a single queue (similar to a G/G/m queue). The -x option enables slotted mode, which is used in combination with percentage timeouts and adds 1 extra time unit of wait time.
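The difference between load-balanced queues and a multiserver queue can be illustrated with a small deterministic C++ sketch. This is not the simulator's code: the Job struct and the load_balanced and multiserver functions are hypothetical, failures, timeouts, and state are ignored, and round-robin dispatch is assumed for the load-balanced case. Jobs are expected to be sorted by arrival time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Job {
    double arrival;  // arrival time, in time units
    double service;  // service time, in time units
};

// m separate FIFO queues, each with its own server; jobs are dispatched
// round-robin regardless of load. Returns the completion time of each job.
std::vector<double> load_balanced(const std::vector<Job>& jobs, std::size_t m) {
    assert(m > 0);
    std::vector<double> free_at(m, 0.0);  // time at which each server is idle
    std::vector<double> done;
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        double& server = free_at[i % m];
        server = std::max(server, jobs[i].arrival) + jobs[i].service;
        done.push_back(server);
    }
    return done;
}

// One shared FIFO queue served by m servers (G/G/m-like): each job goes
// to whichever server becomes idle first.
std::vector<double> multiserver(const std::vector<Job>& jobs, std::size_t m) {
    assert(m > 0);
    std::vector<double> free_at(m, 0.0);
    std::vector<double> done;
    for (const Job& job : jobs) {
        auto server = std::min_element(free_at.begin(), free_at.end());
        *server = std::max(*server, job.arrival) + job.service;
        done.push_back(*server);
    }
    return done;
}
```

For example, with four jobs arriving at time 0 with service times 3, 1, 1, 1 and m = 2, the round-robin version completes them at times 3, 1, 4, 2, while the shared-queue version completes them at times 3, 1, 2, 3: the long first job blocks only one server instead of delaying every job dispatched to its queue.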

Advanced build options

A simple make invocation will compile a non-optimized version of the simulator, which can be quite slow but is easy to debug. To compile the optimized (difficult to debug, but fast) version of the simulator, use

	CPPFLAGS="-DNDEBUG -DUNSAFE_CAST" CXXFLAGS="-O3" make

To compile with AddressSanitizer (ASAN) instead, use

	CFLAGS=-fsanitize=address CXXFLAGS=-fsanitize=address LDFLAGS=-fsanitize=address make

Note that ASAN may report some harmless memory leaks at the end of the execution. To suppress the corresponding message, use

	export ASAN_OPTIONS=detect_leaks=0

Finally, it is possible to compile with clang++ by using

	CC=clang CXX=clang++ make

Citing the simulator

If you use this simulator in your research and/or you write a paper based on it, you can cite the following paper:

L. Abeni, H. Gustafsson, F. A. Svensson, T. Cucinotta, "State Management in Real-Time Cloud and NFV Services", in Proceedings of the 11th IEEE Conference on Network Functions Virtualization and Software-Defined Networking (IEEE NFV-SDN 2025), November 10-12, 2025, Athens, Greece.

Contributing

This simulator is released under the MIT license (see LICENSE.txt), and everyone is welcome to contribute fixes, improvements, and/or new features under that license. Note that this public git repository is used for releases, while development is done in a private internal repository that is periodically synchronized with this one. If you plan to contribute patches, contact the maintainers so that you can base your patches on the most up-to-date code.

About

Fault-tolerant real-time distributed systems simulator using services directed acyclic graphs
