This repository contains the implementation of ProfRCA. It consists of three main components:
- Graph Semi-supervised Training (run_graphcl_semi.py) - Trains graph neural networks using contrastive learning
- Fault Description Generation (run_generate_description.py) - Generates semantic descriptions for fault patterns using an LLM
- Root Cause Analysis (run_rca.py) - Performs automated root cause analysis using trained models
- Python 3.11+
- CUDA-capable GPU (recommended)
- Ollama with qwen3:30b-a3b model (for RCA and description generation)
pip install -r requirements.txt
- Download CodeBERT Model to ./models: https://huggingface.co/microsoft/codebert-base
- Install and Setup Ollama (Required for description generation and RCA):
# Install Ollama (follow instructions at https://ollama.ai)
ollama pull qwen3:30b-a3b
├── run_graphcl_semi.py # Main training script
├── run_generate_description.py # Fault description generation script
├── run_rca.py # Root cause analysis script
├── graphcl_model.py # Graph contrastive learning model
├── graphbuilder.py # Graph construction utilities
├── augmentor.py # Data augmentation functions
├── profile_dataset.py # Dataset handling
├── faiss_retriever.py # Vector similarity search
├── faults.py # Fault type definitions
├── function_manager.py # Function name management
├── utils.py # Utility functions
├── evaluate_embedding.py # Embedding evaluation
├── pprof.py # Profile data processing
├── profile_agent/ # LLM prompts
├── google/ # Google protobuf files
├── resources/ # Resource files for function management
├── models/ # Pre-trained models (CodeBERT)
│ └── codebert-base/ # CodeBERT model files
├── data_normal/ # Normal profiling data
└── data_fault/ # Fault profiling data
The system requires the following data structure to be provided:
data_normal/5m/{service_name}/
├── *.pb # Profile data files in protobuf format
└── *.gpickle # Pre-processed graph files (can be generated when generate=True)
data_fault/5m/
├── strace_epoll_wait_delay/{service_name}/
├── strace_futex_delay/{service_name}/
├── strace_read_delay/{service_name}/
└── strace_write_delay/{service_name}/
- adservice
- checkoutservice
- emailservice
- frontend
- recommendationservice
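The layout above can be checked before training starts. The following is a minimal sketch, assuming that every listed service has a directory under data_normal/5m/ and under each fault type in data_fault/5m/; the helper missing_data_dirs is hypothetical and not part of the repository.

```python
from pathlib import Path

# Service and fault-type names are copied from the listing above.
# The helper itself is a hypothetical illustration, not repository code.
SERVICES = [
    "adservice",
    "checkoutservice",
    "emailservice",
    "frontend",
    "recommendationservice",
]
FAULT_TYPES = [
    "strace_epoll_wait_delay",
    "strace_futex_delay",
    "strace_read_delay",
    "strace_write_delay",
]


def missing_data_dirs(root=".", window="5m"):
    """Return the expected data directories that do not exist under root."""
    root = Path(root)
    expected = [root / "data_normal" / window / s for s in SERVICES]
    expected += [
        root / "data_fault" / window / fault / s
        for fault in FAULT_TYPES
        for s in SERVICES
    ]
    return [p for p in expected if not p.is_dir()]


if __name__ == "__main__":
    for path in missing_data_dirs():
        print(f"missing: {path}")
```

Run from the repository root, the script prints one line per missing directory and nothing when the layout is complete.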
The training script performs two-stage learning: unsupervised pretraining followed by semi-supervised training.
Prerequisites:
- CodeBERT model downloaded in models/codebert-base/
- Normal and fault profiling data in appropriate directories
python run_graphcl_semi.py
Output:
- results/{num_faults}faults_{timestamp}/
  - {service}/model_semi_{service}.pt - Trained models
  - {service}/training_log_{service}.txt - Training logs
  - {service}/evaluation_results_{service}.txt - Evaluation results
  - all_services_summary.json - Complete results summary
  - model_parameters.json - Model configuration
  - all_services_tsne.png - t-SNE visualization
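Because each training run writes to its own results directory, it can be useful to list the checkpoints available for a service. The sketch below assumes only the path pattern given above; the exact timestamp format is not specified here, so matches are returned in plain path order. The function find_trained_models is a hypothetical helper.

```python
from pathlib import Path


def find_trained_models(service, results_root="results"):
    """List model_semi_{service}.pt files across all run directories.

    Assumes run directories are named {num_faults}faults_{timestamp},
    as described above; no particular timestamp format is assumed.
    """
    pattern = f"*faults_*/{service}/model_semi_{service}.pt"
    return sorted(Path(results_root).glob(pattern))


if __name__ == "__main__":
    for checkpoint in find_trained_models("frontend"):
        print(checkpoint)
```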
After training, generate descriptions for the fault graphs using an LLM to improve RCA quality.
Prerequisites:
- Ollama running with qwen3:30b-a3b model
python run_generate_description.py
What it does:
- Loads fault graph data from data_fault/
- Uses an LLM to generate semantic descriptions for each fault pattern
- Saves enhanced data to data_fault_description/
Note: This step is crucial for high-quality root cause analysis as it provides semantic context for fault patterns.
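Since this step depends on a running Ollama instance, it may help to confirm that the server is reachable and the model has been pulled before starting. This is a minimal sketch, assuming Ollama listens on its default local address (http://localhost:11434) and using its /api/tags endpoint, which lists locally available models; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address
MODEL = "qwen3:30b-a3b"


def model_available(tags_response, model=MODEL):
    """Check whether `model` appears in a parsed /api/tags response."""
    names = {m.get("name") for m in tags_response.get("models", [])}
    return model in names


def check_ollama(url=OLLAMA_URL, model=MODEL, timeout=5):
    """Return True if the Ollama server responds and lists the model."""
    try:
        with urllib.request.urlopen(f"{url}/api/tags", timeout=timeout) as resp:
            return model_available(json.load(resp), model)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts;
        # ValueError covers a response body that is not valid JSON.
        return False


if __name__ == "__main__":
    print("ready" if check_ollama() else f"Ollama or {MODEL} not available")
```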
The RCA script uses trained models to analyze fault patterns and generate root cause explanations.
Prerequisites:
- Trained models from Step 1
- Generated fault descriptions from Step 2
- Ollama running with qwen3:30b-a3b model
python run_rca.py
Configuration (edit the script):
- result_dir: Directory containing trained GNN models
Output:
- results_rca_with_function/5m/{service}/{filename}.json - Individual RCA results
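The per-file results can be gathered for inspection once the run finishes. The sketch below assumes only the output path pattern given above and that each file is valid JSON; the structure of the JSON itself is not described here, so the code loads it without interpreting any fields. The function load_rca_results is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def load_rca_results(root="results_rca_with_function/5m"):
    """Yield (service, filename, parsed JSON) for each result file under root."""
    for path in sorted(Path(root).glob("*/*.json")):
        with path.open(encoding="utf-8") as fh:
            yield path.parent.name, path.name, json.load(fh)


if __name__ == "__main__":
    # Count how many result files exist per service.
    counts = Counter(service for service, _, _ in load_rca_results())
    for service, n in sorted(counts.items()):
        print(f"{service}: {n} result file(s)")
```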
To run the complete system, you need to provide:
- Profile Data:
  - Normal operation profiles in data_normal/5m/{service}/ (*.pb files)
  - Fault injection profiles in data_fault/5m/{fault_type}/{service}/ (*.pb files)
- Common Graphs:
  - Baseline graphs in data_common/{service}_common.gpickle
- Pre-trained Models:
  - CodeBERT model in models/codebert-base/ (download from Hugging Face)
- Setup: Install dependencies and download CodeBERT model
- Train: Run run_graphcl_semi.py to train GraphCL models
- Describe: Run run_generate_description.py to generate fault descriptions
- Analyze: Run run_rca.py to perform root cause analysis
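The three script steps can also be chained from a small driver. This is a minimal sketch, assuming the scripts are run from the repository root without arguments, as in the commands above; run_pipeline and its runner parameter are hypothetical and exist only to make the order of steps explicit and testable.

```python
import subprocess
import sys

# Script names come from the workflow above.
STEPS = ["run_graphcl_semi.py", "run_generate_description.py", "run_rca.py"]


def run_pipeline(steps=STEPS, runner=subprocess.run):
    """Run each script in order, stopping at the first failure."""
    for script in steps:
        print(f"==> {script}")
        # With subprocess.run, check=True raises CalledProcessError
        # when a script exits with a non-zero status.
        runner([sys.executable, script], check=True)
```

Calling run_pipeline() runs training, description generation, and RCA in sequence. Because run_rca.py takes result_dir from the script itself, that value has to point at the trained models before the last step runs, so in practice it may be simpler to run the first step on its own and then call run_pipeline(steps=STEPS[1:]).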