Mira: COVID-19 Detection using Model Ensemble on Chest X-ray Images

A machine learning system for detecting COVID-19 in chest X-ray images using ensemble neural networks with uncertainty-based weighting. This implementation follows the methodology from the Scientific Reports paper "Generalizable disease detection using model ensemble on chest X-ray images" (2024).

Overview

Mira implements an ensemble approach that combines three pre-trained CNN architectures:

  • ResNet50: 50-layer residual network for robust feature extraction
  • DenseNet121: 121-layer densely connected network for detailed pattern recognition
  • Inception-ResNet-v2: Hybrid architecture with inception modules and residual connections

The system weights each model's prediction by its uncertainty, measured as entropy: confident predictions contribute more to the ensemble output and uncertain ones contribute less. The goal is more reliable COVID-19 detection across different chest X-ray datasets.

Features

  • Multi-Model Ensemble: Three complementary CNN architectures with transfer learning
  • Uncertainty Quantification: Entropy-based weighting for robust predictions
  • Evaluation: COVID-19 detection metrics, statistical testing, and visualization
  • Statistical Analysis: McNemar's test, bootstrap testing, confidence intervals
  • Plots: ROC curves, confusion matrices, performance comparisons
  • Modular Architecture: Clean separation of data processing, models, and evaluation
  • Config Management: YAML-based parameter management

Architecture

src/
├── data/           # Data preprocessing and loading pipelines
├── models/         # Individual CNN model implementations  
├── ensemble/       # Entropy-based ensemble methods
├── evaluation/     # Evaluation framework
└── utils/          # Configuration and utility functions

data/raw/          # Original datasets (COVIDx CXR-2)
saved_models/      # Trained model checkpoints and weights
configs/           # YAML configuration files
tests/             # Test suite (90%+ coverage)
scripts/           # Training and evaluation scripts

Quick Start

Environment Setup

# Clone the repository
git clone git@github.com:Sunsvea/Mira_Covid.git
cd Mira_Covid

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Data Preparation

The system uses the COVIDx CXR-2 dataset from Kaggle. The data preparation script downloads, extracts, and organizes the dataset automatically.

Dataset: COVIDx CXR-2

Option 1: Automated Setup (Recommended)

# Full automated setup (requires Kaggle API setup)
python scripts/prepare_covidx_data.py --download

# Or extract from the manually downloaded zip
python scripts/prepare_covidx_data.py --extract-only ~/Downloads/covidx-cxr2.zip

Option 2: Manual Setup

# 1. Download the dataset manually from Kaggle:
#    https://www.kaggle.com/datasets/andyczhao/covidx-cxr2

# 2. Extract using our script
python scripts/prepare_covidx_data.py --extract-only path/to/covidx-cxr2.zip

# 3. Validate the prepared dataset
python scripts/prepare_covidx_data.py --validate-only

Kaggle API Setup (for automated download)

# Install Kaggle CLI
pip install kaggle

# Setup API credentials (choose one):
# Method 1: Download kaggle.json from your Kaggle account settings
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# Method 2: Set environment variables
export KAGGLE_USERNAME=your_username
export KAGGLE_KEY=your_api_key

Data Structure After Preparation

The script organizes data into the expected structure:

data/raw/
├── train/
│   └── [~55,000 chest X-ray images]
├── test/
│   └── [~10,000 chest X-ray images]  
├── train.txt          # Training labels (patient_id filename label source)
├── test.txt           # Test labels
└── dataset_summary.json  # Dataset statistics and preparation info
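
As a rough illustration of the label format noted above (patient_id filename label source), the sketch below parses label rows and counts classes. It assumes whitespace-separated fields and labels spelled `positive`/`negative`; the function name, the `LabelRow` type, and the sample rows are hypothetical and not part of the repository.

```python
from collections import Counter
from typing import NamedTuple


class LabelRow(NamedTuple):
    patient_id: str
    filename: str
    label: str
    source: str


def parse_label_lines(lines):
    """Parse 'patient_id filename label source' rows, skipping blank lines."""
    rows = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if len(parts) != 4:
            raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
        rows.append(LabelRow(*parts))
    return rows


# Hypothetical sample rows; the real files are data/raw/train.txt and test.txt.
sample = [
    "p001 img_0001.png positive source_a",
    "p002 img_0002.png negative source_b",
    "p002 img_0003.png negative source_b",
]
rows = parse_label_lines(sample)
class_counts = Counter(row.label for row in rows)
print(class_counts)
```

To run it on a real label file, pass an open file handle (for example `open("data/raw/train.txt")`) in place of `sample`.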

Dataset Information:

  • Size: ~65,000 chest X-ray images (2.5GB download)
  • Classes: COVID-19 positive/negative
  • Sources: Multiple medical institutions for generalizability
  • Format: PNG images with corresponding label files
  • Preparation Time: ~5-10 minutes (depending on download speed)

Training

Quick Start Training Workflow

# 1. Analyze your dataset (optional but recommended)
python scripts/train_models.py analyze-data

# 2. Train all individual models (ResNet50, DenseNet121, Inception-ResNet-v2)
python scripts/train_models.py train-individual

# 3. Create ensemble from trained models and evaluate performance
python scripts/train_models.py train-ensemble --checkpoint-dir saved_models

Advanced Training Options

# Train specific models only
python scripts/train_models.py train-individual --models resnet50 densenet121

# Custom save directory for models
python scripts/train_models.py train-individual --save-dir my_models

# Force CPU training (sometimes faster than MPS on M3)
python scripts/train_models.py train-individual --device cpu

# Compare different ensemble strategies
python scripts/train_models.py compare-strategies --checkpoint-dir saved_models

# Get help for any command
python scripts/train_models.py --help
python scripts/train_models.py train-individual --help

Expected Training Time:

  • Apple M3 MacBook Air (24GB): 2-3 hours or more per model (optimized for MPS)
  • NVIDIA RTX 5090 (Vast.ai): 20-45 minutes or more per model (see configs/rtx5090.yaml)
  • NVIDIA RTX 4090: 45-90 minutes or more per model
  • CPU-only: 5-7 hours or more per model

Hardware-Specific Configurations:

  • Default config optimized for M3 MacBook (batch_size=32, no multiprocessing)
  • configs/rtx5090.yaml for high-end GPU training on rented hardware from Vast.ai

Evaluation

# Evaluate individual trained models
python scripts/evaluate_models.py individual

# Evaluate ensemble and compare with individual models  
python scripts/evaluate_models.py ensemble

# External validation (test generalization)
python scripts/evaluate_models.py external --external-data path/to/external/data

# Run tests to verify implementation
pytest tests/ -v --cov=src
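
When comparing the ensemble against an individual model on the same test set, McNemar's test (listed under Features) checks whether the two disagree in a systematic way. The sketch below is a minimal standard-library version with continuity correction. It is an illustration only, not the repository's evaluation code, and the function name and example predictions are hypothetical.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test with continuity correction on paired binary predictions.

    Returns (statistic, p_value). b counts cases only model A classified
    correctly; c counts cases only model B classified correctly.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree on correctness
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical predictions: the ensemble is right on all 12 cases,
# the single model is wrong on 8 of them.
y_true = [1] * 12
ensemble_pred = [1] * 12
single_pred = [1] * 4 + [0] * 8
statistic, p_value = mcnemar_test(y_true, ensemble_pred, single_pred)
print(f"statistic={statistic:.3f}, p={p_value:.4f}")
```

The chi-square approximation is unreliable when the number of discordant pairs (b + c) is small; an exact binomial test is usually preferred in that case.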

Methodology

Transfer Learning Approach

  • Pre-trained ImageNet weights for feature extraction
  • Frozen convolutional layers with custom classifier heads
  • Global average pooling → FC layers (128→64→16) → Binary output
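
To give a sense of how small the trainable part is when the backbone is frozen, the sketch below counts the parameters of a 128→64→16 head for each backbone. It assumes plain fully connected layers with biases, a single output unit for the binary prediction, and the standard pooled feature widths of these architectures (2048, 1024, 1536); the actual heads in this repository may differ, for example by adding normalization layers.

```python
def head_parameter_count(in_features, hidden=(128, 64, 16), out_features=1):
    """Count weights and biases of a stack of fully connected layers."""
    sizes = [in_features, *hidden, out_features]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# Feature widths after global average pooling for the standard backbones.
backbone_features = {
    "resnet50": 2048,
    "densenet121": 1024,
    "inception_resnet_v2": 1536,
}
head_sizes = {
    name: head_parameter_count(width) for name, width in backbone_features.items()
}
for name, count in head_sizes.items():
    print(f"{name}: {count:,} trainable head parameters")
```

Under these assumptions each head has a few hundred thousand parameters, compared with tens of millions in the frozen backbones.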

Uncertainty-Based Ensemble

Weight(i) = exp(-Entropy(Model_i)) / Σ(exp(-Entropy(Model_k)))
Final_Output = Σ(Weight(i) × Prediction(i))
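
The weighting above can be sketched for a single image as follows. This is a minimal illustration that assumes each model outputs one probability P(COVID-19 positive) and that Entropy(Model_i) is the binary Shannon entropy of that probability; the function names and example probabilities are hypothetical.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a Bernoulli prediction with P(positive) = p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_weighted_ensemble(probs):
    """Combine per-model probabilities with a softmax over negative entropy."""
    scores = [math.exp(-binary_entropy(p)) for p in probs]
    total = sum(scores)
    weights = [s / total for s in scores]
    combined = sum(w * p for w, p in zip(weights, probs))
    return combined, weights


# Hypothetical outputs of the three models for one image.
combined, weights = entropy_weighted_ensemble([0.95, 0.55, 0.90])
print(f"weights={[round(w, 3) for w in weights]}, combined={combined:.3f}")
```

In this example the model that outputs 0.55 is close to maximum uncertainty, so it receives the smallest weight and the combined probability moves toward the two confident models.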

Data Pipeline

  • Image standardization: 256×256×3 pixels
  • Normalization: [0,1] pixel values
  • Augmentation: rotation, flip, zoom, brightness/contrast
  • Balanced train/validation/test splits with source separation

Performance

Expected performance, based on the reference paper's methodology:

  • Internal Validation: 95-97% accuracy
  • External Validation: 75-85% accuracy (cross-institutional)
  • Ensemble Improvement: 5-15% boost over individual models
  • Key Strength: Superior generalization across different data sources

Testing

The project emphasizes comprehensive testing:

# Run all tests
pytest tests/

# Run with coverage
pytest --cov=src tests/

# Run specific test categories
pytest tests/test_models.py
pytest tests/test_ensemble.py
pytest tests/test_data.py

Test Coverage Requirements:

  • Minimum 90% coverage for all modules
  • 100% coverage for critical paths (training, ensemble weighting)
  • Mocked external dependencies for fast, reliable testing

Configuration

Key parameters in configs/default.yaml:

data:
  image_size: [224, 224]  # Standard ImageNet size
  batch_size: 32          # Optimized for M3 MacBook performance
  num_workers: 0          # No multiprocessing for MPS compatibility
  
models:
  resnet50:
    classifier_layers: [128, 64, 16]
    dropout_rate: 0.3
    freeze_backbone: true   # Faster training with transfer learning
    
training:
  epochs: 10              # Full training epochs
  learning_rate: 0.002    # Higher LR for faster convergence
  early_stopping:
    enabled: true
    patience: 3           # Early stopping for development
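
The early_stopping settings above can be read as follows: training ends once the monitored metric has failed to improve for `patience` consecutive epochs. The sketch below illustrates that rule, assuming the monitored metric is validation loss; the `EarlyStopping` class, the `min_delta` parameter, and the loss values are hypothetical and not taken from the repository.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses over the 10 configured epochs.
val_losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.39, 0.38, 0.37, 0.36]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(f"stopped at epoch {stopped_at}, best validation loss {stopper.best_loss}")
```

With a patience of 3, a short plateau is enough to end training before later improvements are seen, which is one reason a small patience suits quick development runs better than final training.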

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Write tests
  4. Ensure 90%+ test coverage
  5. Run linting and formatting
  6. Commit changes
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

Citation

If you use this work in your research, please cite:

@article{abad2024generalizable,
  title={Generalizable disease detection using model ensemble on chest X-ray images},
  author={Abad, Zelalem Dagne and others},
  journal={Scientific Reports},
  year={2024}
}

Contact

Dean Coulstock

Acknowledgments

  • Original research paper authors for the methodological foundation
  • Medical imaging community for dataset contributions
  • Open-source ML community for the underlying frameworks
