
ProfRCA System

Implementation of "ProfRCA: LLM-Enabled Fine-grained Root Cause Analysis with Continuous Profiling Data" (SANER 2026).

This repository contains the implementation of ProfRCA. It consists of three main components:

  1. Graph Semi-supervised Training (run_graphcl_semi.py) - Trains graph neural networks using contrastive learning
  2. Fault Description Generation (run_generate_description.py) - Generates semantic descriptions for fault patterns using an LLM
  3. Root Cause Analysis (run_rca.py) - Performs automated root cause analysis using trained models

Prerequisites

  • Python 3.11+
  • CUDA-capable GPU (recommended)
  • Ollama with the qwen3:30b-a3b model (for RCA and description generation)

Install Dependencies

pip install -r requirements.txt

Download Required Models

  1. Download the CodeBERT model from https://huggingface.co/microsoft/codebert-base into models/codebert-base/ (a download sketch follows below)

  2. Install and set up Ollama (required for description generation and RCA):

# Install Ollama (follow instructions at https://ollama.ai)
ollama pull qwen3:30b-a3b
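
One way to fetch the CodeBERT model from step 1, assuming the huggingface_hub package is available (a convenience sketch, not the only option):

from huggingface_hub import snapshot_download

# Download microsoft/codebert-base into the directory the scripts expect.
snapshot_download(repo_id="microsoft/codebert-base", local_dir="models/codebert-base")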

Project Structure

├── run_graphcl_semi.py      # Main training script
├── run_generate_description.py # Fault description generation script
├── run_rca.py               # Root cause analysis script
├── graphcl_model.py         # Graph contrastive learning model
├── graphbuilder.py          # Graph construction utilities
├── augmentor.py             # Data augmentation functions
├── profile_dataset.py       # Dataset handling
├── faiss_retriever.py       # Vector similarity search
├── faults.py                # Fault type definitions
├── function_manager.py      # Function name management
├── utils.py                 # Utility functions
├── evaluate_embedding.py    # Embedding evaluation
├── pprof.py                 # Profile data processing
├── profile_agent/           # LLM prompts
├── google/                  # Google protobuf files
├── resources/               # Resource files for function management
├── models/                  # Pre-trained models (CodeBERT)
│   └── codebert-base/       # CodeBERT model files
├── data_normal/             # Normal profiling data
└── data_fault/              # Fault profiling data

Data Requirements

The system expects the following directory layout:

1. Normal Profiling Data

data_normal/5m/{service_name}/
├── *.pb           # Profile data files in protobuf format
└── *.gpickle      # Pre-processed graph files (can be generated when generate=True)
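
The *.pb files are serialized pprof profiles. A minimal reading sketch, assuming the protobuf bindings in the google/ directory expose the standard profile.proto message (the module path and file name below are assumptions; pprof.py contains the actual handling):

import gzip

from google.pprof import profile_pb2  # assumed path to the generated bindings

with open("data_normal/5m/frontend/example.pb", "rb") as f:  # illustrative file
    raw = f.read()
if raw[:2] == b"\x1f\x8b":  # pprof profiles are often gzip-compressed
    raw = gzip.decompress(raw)
profile = profile_pb2.Profile()
profile.ParseFromString(raw)
print(f"{len(profile.sample)} samples, {len(profile.function)} functions")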

2. Fault Profiling Data

data_fault/5m/
├── strace_epoll_wait_delay/{service_name}/
├── strace_futex_delay/{service_name}/
├── strace_read_delay/{service_name}/
└── strace_write_delay/{service_name}/

Services

  • adservice
  • checkoutservice
  • emailservice
  • frontend
  • recommendationservice

Usage

Step 1: Training Graph Model

The training script performs two-stage learning: unsupervised pretraining followed by semi-supervised training.
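
For orientation, graph contrastive learning typically optimizes an NT-Xent loss over two augmented views of each graph. A minimal sketch, with illustrative names and temperature (see graphcl_model.py and augmentor.py for the actual implementation):

import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    # z1, z2: graph embeddings of two augmented views, shape (batch, dim).
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    # Matching views sit on the diagonal; every other pair is a negative.
    return F.cross_entropy(logits, labels)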

Prerequisites:

  • CodeBERT model downloaded in models/codebert-base/
  • Normal and fault profiling data in appropriate directories

python run_graphcl_semi.py

Output:

  • results/{num_faults}faults_{timestamp}/
    • {service}/model_semi_{service}.pt - Trained models
    • {service}/training_log_{service}.txt - Training logs
    • {service}/evaluation_results_{service}.txt - Evaluation results
    • all_services_summary.json - Complete results summary
    • model_parameters.json - Model configuration
    • all_services_tsne.png - t-SNE visualization

Step 2: Generate Fault Descriptions

After training, generate descriptions for the fault graphs using an LLM to improve RCA quality.

Prerequisites:

  • Ollama running with the qwen3:30b-a3b model

python run_generate_description.py

What it does:

  1. Loads fault graph data from data_fault/
  2. Uses an LLM to generate semantic descriptions for each fault pattern (sketched below)
  3. Saves enhanced data to data_fault_description/

Note: This step is crucial for high-quality root cause analysis as it provides semantic context for fault patterns.
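
A minimal sketch of the kind of call this step issues, assuming the ollama Python client and a locally running Ollama server (the prompt is illustrative; the real prompts live in profile_agent/):

import ollama  # assumes the ollama Python client is installed

response = ollama.generate(
    model="qwen3:30b-a3b",
    prompt="Summarize the anomalous hot path in this call graph: ...",
)
print(response["response"])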

Step 3: Running Root Cause Analysis

The RCA script uses the trained models to analyze fault patterns and generate root cause explanations (a sketch of the similarity retrieval follows the output description).

Prerequisites:

  1. Trained models from Step 1
  2. Generated fault descriptions from Step 2
  3. Ollama running with the qwen3:30b-a3b model

python run_rca.py

Configuration (edit the script):

  • result_dir: directory containing the trained GNN models from Step 1 (see the snippet below)
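
For example (the directory name is illustrative; use your own Step 1 output):

# In run_rca.py: point at a training run produced by Step 1.
result_dir = "results/4faults_20250101_120000"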

Output:

  • results_rca_with_function/
    • 5m/{service}/{filename}.json - Individual RCA results
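
The similarity lookup behind faiss_retriever.py can be sketched as nearest-neighbor search over fault embeddings. An illustration with made-up dimensions and random data:

import faiss
import numpy as np

d = 128                                            # embedding size (illustrative)
known = np.random.rand(500, d).astype("float32")   # stand-in for stored fault embeddings
index = faiss.IndexFlatL2(d)                       # exact L2 nearest-neighbor index
index.add(known)

query = np.random.rand(1, d).astype("float32")     # embedding of a new fault graph
distances, ids = index.search(query, 5)            # 5 most similar fault patterns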

Data Requirements Summary

To run the complete system, you need to provide:

Required Data Files

  1. Profile Data:

    • Normal operation profiles in data_normal/5m/{service}/ (*.pb files)
    • Fault injection profiles in data_fault/5m/{fault_type}/{service}/ (*.pb files)
  2. Common Graphs:

    • Baseline graphs in data_common/{service}_common.gpickle (see the loading sketch after this list)
  3. Pre-trained Models:

    • CodeBERT model in models/codebert-base/ (download from Hugging Face)
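
To sanity-check a common graph, note that .gpickle files are conventionally pickled networkx graph objects, so plain pickle loads them independently of the networkx version (the path follows the pattern above):

import pickle

with open("data_common/frontend_common.gpickle", "rb") as f:
    graph = pickle.load(f)
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")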

Workflow Summary

  1. Setup: Install dependencies and download CodeBERT model
  2. Train: Run run_graphcl_semi.py to train GraphCL models
  3. Describe: Run run_generate_description.py to generate fault descriptions
  4. Analyze: Run run_rca.py to perform root cause analysis
