A Python-based proxy feature extraction and classification system for analyzing network traffic patterns. The project extracts features from proxy traffic datasets and classifies different types of network attacks using various machine learning models.
- Feature Extraction: Processes network traffic data and applies various attacks/defenses to extract features
- Classification: Trains and evaluates ML models (XGBoost, CNN, Transformer) on extracted features
ProxyFeatureExtraction/
├── configs/                        # Configuration files
├── scripts/                        # Main execution scripts
├── src/
│   ├── classification/             # Classification pipeline
│   │   └── models/                 # ML model implementations
│   └── feature_extraction/         # Feature extraction pipeline
│       └── extractors/             # Feature extractor modules
├── data/                           # Extracted feature datasets
├── final_data/                     # Final processed datasets
├── results/                        # Experiment results
│   ├── classification/             # Binary classification results
│   └── gateway_classification/     # Gateway experiment results
├── tests/                          # Unit tests
│   ├── test_feature_extraction/    # Feature extraction tests
│   └── test_classification/        # Classification tests
└── notebooks/                      # Jupyter notebooks for analysis
- Python 3.7+
- Required Python packages (see below)
- Clone the repository:
    git clone <repo-url>
    cd ProxyFeatureExtraction
- Install dependencies:
  - If you have a requirements.txt, run: pip install -r requirements.txt
  - Otherwise, manually install required packages as needed.
Edit the configuration files in the configs/ directory to specify experiment parameters:
- extraction_config.yaml: Controls feature extraction experiments, attack parameters, and data paths
- classification_config.yaml: Specifies model type, training parameters, and data preprocessing options
- gateway_classification_config.yaml: Gateway experiment settings (multi-class + binary classification)
- background_distributions.json: Background traffic distributions for decorrelation attacks
# Run feature extraction with parallel processing
PYTHONPATH=src python scripts/run_extraction.py
# Parallel extraction (alternative script)
PYTHONPATH=src python scripts/extract_all_features_parallel.py
# Sequential processing (for debugging)
PYTHONPATH=src python scripts/extract_all_features.py
# Run classification
PYTHONPATH=src python scripts/run_classification.py --config configs/classification_config.yaml
# Run gateway classification experiments (multi-class + binary)
PYTHONPATH=src python scripts/run_gateway_classification.py --config configs/gateway_classification_config.yaml
The gateway classification includes:
- Multi-class Classification: Gateway vs Relay vs Background (3 classes)
  - Labels: 0=background, 1=relay, 2=gateway
- Binary Classification: Gateway vs (Relay + Background) (2 classes)
  - Labels: 0=background+relay, 1=gateway
Data Sources:
- Background and Relay data: final_data/br/ folder
- Gateway data: final_data/none/ folder
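The two label schemes above are related by a simple collapse: every non-gateway class becomes 0. The following is a minimal sketch of that mapping, not the project's own code; the names `MULTICLASS_LABELS` and `to_binary_label` are hypothetical and chosen only for illustration.

```python
# Label conventions taken from the gateway classification description above.
# MULTICLASS_LABELS and to_binary_label are hypothetical names for this sketch.
MULTICLASS_LABELS = {0: "background", 1: "relay", 2: "gateway"}


def to_binary_label(multiclass_label: int) -> int:
    """Collapse a 3-class label into the binary scheme: 1=gateway, 0=background+relay."""
    if multiclass_label not in MULTICLASS_LABELS:
        raise ValueError(f"unknown label: {multiclass_label}")
    return 1 if multiclass_label == 2 else 0


labels = [0, 1, 2, 2, 1]
binary = [to_binary_label(y) for y in labels]
print(binary)  # [0, 0, 1, 1, 0]
```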
Unit tests are located in the tests/ directory. To run all tests:
# Run all tests
PYTHONPATH=src pytest tests/
# Run specific test module
PYTHONPATH=src pytest tests/test_feature_extraction/test_corr_extractor.py
Supported models:
- XGBoost: Primary model for tabular feature classification
- MultiClassXGBoost: Specialized XGBoost for 3-class classification (gateway vs relay vs background)
- CNN: For sequence-based feature analysis
- Transformer: For attention-based feature learning
- CorrTransformer: Correlation-based transformer with CorrTransform
- DeepCoFFEA: Deep learning implementation for network traffic analysis
Configure model selection and hyperparameters in configs/classification_config.yaml or configs/gateway_classification_config.yaml.
Modular extractors in src/feature_extraction/extractors/ for different feature types:
- Correlation features (corr_extractor.py): Timing correlations and statistical relationships
- Timing analysis (ta_extractor.py): Inter-packet delays and timing patterns
- Statistical features (slt_extractor.py): Basic statistical measurements
- Research features (thesis_extractor.py, hayes_usenix2019_features.py): Academic implementations
Attack simulation is handled by the DataProcessor class, which supports:
- Bias removal
- Decorrelation attacks
- Padding and reshaping
- Jitter injection
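To make two of these transformations concrete, here is a rough sketch of jitter injection and padding applied to a list of packet timestamps. It is not the DataProcessor implementation: the function names, the uniform jitter distribution, and the zero pad value are all assumptions made for illustration.

```python
import random


def add_jitter(timestamps, max_jitter, seed=None):
    """Delay each timestamp by a uniform random amount in [0, max_jitter].

    Assumption for this sketch: the output is clamped so that packet
    order is preserved (timestamps stay non-decreasing).
    """
    rng = random.Random(seed)
    jittered = []
    previous = float("-inf")
    for t in timestamps:
        shifted = max(t + rng.uniform(0.0, max_jitter), previous)
        jittered.append(shifted)
        previous = shifted
    return jittered


def pad_or_truncate(values, length, pad_value=0.0):
    """Return exactly `length` items: truncate long inputs, right-pad short ones."""
    clipped = list(values[:length])
    return clipped + [pad_value] * (length - len(clipped))


noisy = add_jitter([0.0, 0.1, 0.2], max_jitter=0.05, seed=1)
fixed = pad_or_truncate(noisy, length=5)
print(len(fixed))  # 5
```

Fixing the sequence length this way is what lets variable-length flows be stacked into a single array for the sequence models.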
YAML configs control:
- Experiment parameters
- Data paths and sources
- Model settings and hyperparameters
- Attack simulation parameters
The project expects structured data directories:
- Raw data in train/test/val splits organized by attack type
- Output features saved in CSV format with batch processing
- Results stored in the results/ directory with model artifacts and evaluation metrics
To add a new feature extractor:
- Create a new extractor in src/feature_extraction/extractors/ by inheriting from BaseExtractor
- Implement the process_df() method
- Update configuration files to include the new features
To add a new model:
- Implement the new model in src/classification/models/
- Follow the existing model interfaces
- Update the classification configuration to support the new model type
- PYTHONPATH: Always set PYTHONPATH=src when running scripts to ensure proper module imports
- Multiprocessing: Feature extraction uses ProcessPoolExecutor with the 'spawn' start method for parallel processing
- Batch Processing: Data is processed in configurable batch sizes to manage memory usage
- Attack Simulation: The DataProcessor class applies various network attacks for robustness testing
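The multiprocessing and batching notes can be sketched together as follows. This is a minimal illustration using only the standard library, not the project's extraction script: `make_batches`, `process_batch`, and `extract_parallel` are hypothetical names, and the per-batch work is a placeholder.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def make_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_batch(batch):
    """Placeholder per-batch work; defined at module level so spawned workers can import it."""
    return [len(str(item)) for item in batch]


def extract_parallel(items, batch_size=2, max_workers=2):
    """Process batches in worker processes created with the 'spawn' start method."""
    context = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(max_workers=max_workers, mp_context=context) as pool:
        results = pool.map(process_batch, make_batches(items, batch_size))
        # Flatten per-batch results while the pool is still open.
        return [value for batch_result in results for value in batch_result]
```

Because 'spawn' starts each worker by importing the main module afresh, a script calling `extract_parallel` should do so under an `if __name__ == "__main__":` guard.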