A machine learning system for detecting COVID-19 in chest X-ray images using ensemble neural networks with uncertainty-based weighting. This implementation follows the methodology from the Scientific Reports paper "Generalizable disease detection using model ensemble on chest X-ray images" (2024).
Implements the robust ensemble approach combining three pre-trained CNN architectures:
- ResNet50: 50-layer residual network for robust feature extraction
- DenseNet121: 121-layer densely connected network for detailed pattern recognition
- Inception-ResNet-v2: Hybrid architecture with inception modules and residual connections
The system uses a novel uncertainty-based ensemble weighting scheme: each model's contribution is scaled by its prediction confidence, reducing the influence of uncertain predictions and improving COVID-19 detection across different chest X-ray datasets.
- Multi-Model Ensemble: Three complementary CNN architectures with transfer learning
- Uncertainty Quantification: Entropy-based weighting for robust predictions
- Evaluation: COVID-19 detection metrics, statistical testing, and visualization
- Statistical Analysis: McNemar's test, bootstrap testing, confidence intervals (see the sketch after this list)
- Plots: ROC curves, confusion matrices, performance comparisons
- Modular Architecture: Clean separation of data processing, models, and evaluation
- Config Management: YAML-based parameter management
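A minimal sketch of the paired comparison behind the McNemar's test bullet, using statsmodels on two models' per-image correctness (the arrays and counts here are illustrative, not the project's API):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Per-image correctness of two models on the same test set (illustrative values).
model_a_correct = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=bool)
model_b_correct = np.array([1, 0, 0, 1, 1, 1, 1, 1], dtype=bool)

# 2x2 contingency table: rows index model A correct/incorrect, columns model B.
table = np.array([
    [np.sum(model_a_correct & model_b_correct), np.sum(model_a_correct & ~model_b_correct)],
    [np.sum(~model_a_correct & model_b_correct), np.sum(~model_a_correct & ~model_b_correct)],
])

result = mcnemar(table, exact=True)  # exact binomial version for small samples
print(f"McNemar statistic={result.statistic}, p-value={result.pvalue:.4f}")
```

Only the off-diagonal disagreement cells drive the test, which is why it suits comparing two classifiers evaluated on the same images.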
src/
├── data/ # Data preprocessing and loading pipelines
├── models/ # Individual CNN model implementations
├── ensemble/ # Entropy-based ensemble methods
├── evaluation/ # Evaluation framework
└── utils/ # Configuration and utility functions
data/raw/ # Original datasets (COVIDx CXR-2)
saved_models/ # Trained model checkpoints and weights
configs/ # YAML configuration files
tests/ # Test suite (90%+ coverage)
scripts/ # Training and evaluation scripts
# Clone the repository
git clone git@github.com:Sunsvea/Mira_Covid.git
cd Mira_Covid
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
The system uses the COVIDx CXR-2 dataset from Kaggle. A data preparation script is provided to download, extract, and organize the dataset automatically.
Dataset: COVIDx CXR-2
# Full automated setup (requires Kaggle API setup)
python scripts/prepare_covidx_data.py --download
# Or extract from the manually downloaded zip
python scripts/prepare_covidx_data.py --extract-only ~/Downloads/covidx-cxr2.zip
# 1. Download from Kaggle manually:
https://www.kaggle.com/datasets/andyczhao/covidx-cxr2
# 2. Extract using our script
python scripts/prepare_covidx_data.py --extract-only path/to/covidx-cxr2.zip
# 3. Validate the prepared dataset
python scripts/prepare_covidx_data.py --validate-only
# Install Kaggle CLI
pip install kaggle
# Setup API credentials (choose one):
# Method 1: Download kaggle.json from your Kaggle account settings
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
# Method 2: Set environment variables
export KAGGLE_USERNAME=your_username
export KAGGLE_KEY=your_api_key
The script organizes data into the expected structure:
data/raw/
├── train/
│ └── [~55,000 chest X-ray images]
├── test/
│ └── [~10,000 chest X-ray images]
├── train.txt # Training labels (patient_id filename label source)
├── test.txt # Test labels
└── dataset_summary.json # Dataset statistics and preparation info
Dataset Information:
- Size: ~65,000 chest X-ray images (2.5GB download)
- Classes: COVID-19 positive/negative
- Sources: Multiple medical institutions for generalizability
- Format: PNG images with corresponding label files (see the loader sketch after this list)
- Preparation Time: ~5-10 minutes (depending on download speed)
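The train.txt and test.txt files are space-separated with one image per line (patient_id filename label source). A minimal loading sketch, assuming pandas and 'positive'/'negative' class labels:

```python
import pandas as pd

def load_labels(txt_path: str) -> pd.DataFrame:
    """Parse a COVIDx-style label file into a DataFrame."""
    df = pd.read_csv(
        txt_path,
        sep=" ",
        header=None,
        names=["patient_id", "filename", "label", "source"],
    )
    # Binary target: 1 for COVID-19 positive, 0 otherwise (label names assumed).
    df["target"] = (df["label"] == "positive").astype(int)
    return df

train_df = load_labels("data/raw/train.txt")
print(train_df["target"].value_counts())
```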
# 1. Analyze your dataset (optional but recommended)
python scripts/train_models.py analyze-data
# 2. Train all individual models (ResNet50, DenseNet121, Inception-ResNet-v2)
python scripts/train_models.py train-individual
# 3. Create ensemble from trained models and evaluate performance
python scripts/train_models.py train-ensemble --checkpoint-dir saved_models
# Train specific models only
python scripts/train_models.py train-individual --models resnet50 densenet121
# Custom save directory for models
python scripts/train_models.py train-individual --save-dir my_models
# Force CPU training (sometimes faster than MPS on M3)
python scripts/train_models.py train-individual --device cpu
# Compare different ensemble strategies
python scripts/train_models.py compare-strategies --checkpoint-dir saved_models
# Get help for any command
python scripts/train_models.py --help
python scripts/train_models.py train-individual --help
Expected Training Time:
- Apple M3 MacBook Air (24GB): 2-3+ hours per model (optimized for MPS)
- NVIDIA RTX 5090 (Vast.ai): 20-45+ minutes per model (see configs/rtx5090.yaml)
- NVIDIA RTX 4090: 45-90+ minutes per model
- CPU-only: 5-7+ hours per model
Hardware-Specific Configurations:
- Default config optimized for M3 MacBook (batch_size=32, no multiprocessing)
- configs/rtx5090.yaml for high-end GPU training on rented hardware from Vast.ai
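The --device flag above picks the compute backend; a sketch of the selection logic it implies, using PyTorch's standard availability checks (the function name is illustrative):

```python
import torch

def resolve_device(requested: str = "auto") -> torch.device:
    """Pick a device: explicit request wins, otherwise prefer CUDA, then MPS, then CPU."""
    if requested != "auto":
        return torch.device(requested)  # e.g. "cpu" to avoid slow MPS ops on M-series
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

print(resolve_device())       # auto-detected backend
print(resolve_device("cpu"))  # forced CPU, as with --device cpu
```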
# Evaluate individual trained models
python scripts/evaluate_models.py individual
# Evaluate ensemble and compare with individual models
python scripts/evaluate_models.py ensemble
# External validation (test generalization)
python scripts/evaluate_models.py external --external-data path/to/external/data
# Run tests to verify implementation
pytest tests/ -v --cov=src
Model Architecture:
- Pre-trained ImageNet weights for feature extraction
- Frozen convolutional layers with custom classifier heads
- Global average pooling → FC layers (128→64→16) → Binary output
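A sketch of this head on a frozen ResNet50 backbone with torchvision (the dropout placement and single-logit output are assumptions; only the 128→64→16 sizes come from the description above):

```python
import torch.nn as nn
from torchvision import models

def build_resnet50_classifier(dropout_rate: float = 0.3) -> nn.Module:
    """Frozen ImageNet backbone with a small fully connected head for binary output."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    for param in backbone.parameters():
        param.requires_grad = False  # freeze convolutional layers

    # ResNet50 already ends in global average pooling; replace its fc layer
    # with the 128 -> 64 -> 16 -> 1 classifier head.
    backbone.fc = nn.Sequential(
        nn.Linear(backbone.fc.in_features, 128), nn.ReLU(), nn.Dropout(dropout_rate),
        nn.Linear(128, 64), nn.ReLU(), nn.Dropout(dropout_rate),
        nn.Linear(64, 16), nn.ReLU(),
        nn.Linear(16, 1),  # single logit for COVID-19 positive vs. negative
    )
    return backbone
```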
Uncertainty-Based Ensemble Weighting:
Weight(i) = exp(-Entropy(Model_i)) / Σ(exp(-Entropy(Model_k)))
Final_Output = Σ(Weight(i) × Prediction(i))
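A NumPy sketch of these two equations: a binary entropy per model, the negated entropies normalized into weights, and the weighted average as the final output. Whether weights are computed per sample or per model is an implementation detail; this sketch weights per sample.

```python
import numpy as np

def entropy_weighted_ensemble(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Combine per-model COVID-19 probabilities, shape (n_models, n_samples)."""
    p = np.clip(probs, eps, 1 - eps)
    # Binary prediction entropy for each model and sample.
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    # Weight(i) = exp(-Entropy_i) / sum_k exp(-Entropy_k), computed per sample.
    weights = np.exp(-entropy)
    weights /= weights.sum(axis=0, keepdims=True)
    # Final_Output = sum_i Weight(i) * Prediction(i)
    return (weights * p).sum(axis=0)

# Three models' probabilities for two images: confident models get more weight.
probs = np.array([[0.95, 0.10],
                  [0.60, 0.55],
                  [0.90, 0.20]])
print(entropy_weighted_ensemble(probs))
```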
- Image standardization: 256×256×3 pixels
- Normalization: [0,1] pixel values
- Augmentation: rotation, flip, zoom, brightness/contrast
- Balanced train/validation/test splits with source separation
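A sketch of this preprocessing pipeline with torchvision transforms (the augmentation magnitudes are illustrative, not taken from the configs):

```python
from torchvision import transforms

IMAGE_SIZE = 256  # 256x256x3 standardization described above (configs/default.yaml uses 224x224)

train_transforms = transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.RandomRotation(degrees=10),                       # rotation
    transforms.RandomHorizontalFlip(p=0.5),                      # flip
    transforms.RandomResizedCrop(IMAGE_SIZE, scale=(0.9, 1.0)),  # zoom
    transforms.ColorJitter(brightness=0.2, contrast=0.2),        # brightness/contrast
    transforms.ToTensor(),                                       # scales pixels to [0, 1]
])

eval_transforms = transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.ToTensor(),
])
```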
Expected results, based on the reference paper's methodology:
- Internal Validation: 95-97% accuracy
- External Validation: 75-85% accuracy (cross-institutional)
- Ensemble Improvement: 5-15% boost over individual models
- Key Strength: Superior generalization across different data sources
The project emphasizes comprehensive testing:
# Run all tests
pytest tests/
# Run with coverage
pytest --cov=src tests/
# Run specific test categories
pytest tests/test_models.py
pytest tests/test_ensemble.py
pytest tests/test_data.py
Test Coverage Requirements:
- Minimum 90% coverage for all modules
- 100% coverage for critical paths (training, ensemble weighting)
- Mocked external dependencies for fast, reliable testing
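A sketch of the mocking pattern this implies, stubbing model predictions with unittest.mock so tests run without checkpoints or a GPU (the names and the stand-in averaging are illustrative):

```python
from unittest.mock import MagicMock
import numpy as np

def test_mocked_models_are_combined_without_loading_weights():
    """Unit test sketch: stub model outputs so no checkpoints or GPUs are needed."""
    models = [MagicMock(name=f"model_{i}") for i in range(3)]
    for model, prob in zip(models, [0.95, 0.60, 0.90]):
        model.predict_proba.return_value = np.array([prob])

    probs = np.stack([m.predict_proba() for m in models])

    # Simple averaging stands in for the ensemble logic under test.
    combined = probs.mean(axis=0)

    assert probs.shape == (3, 1)
    assert 0.0 <= combined[0] <= 1.0
    for model in models:
        model.predict_proba.assert_called_once()
```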
Key parameters in configs/default.yaml:
data:
  image_size: [224, 224]    # Standard ImageNet size
  batch_size: 32            # Optimized for M3 MacBook performance
  num_workers: 0            # No multiprocessing for MPS compatibility
models:
  resnet50:
    classifier_layers: [128, 64, 16]
    dropout_rate: 0.3
    freeze_backbone: true   # Faster training with transfer learning
training:
  epochs: 10                # Full training epochs
  learning_rate: 0.002      # Higher LR for faster convergence
  early_stopping:
    enabled: true
    patience: 3             # Early stopping for development
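A minimal sketch of reading these parameters with PyYAML (the project's own loader lives in src/utils; the helper here is illustrative):

```python
import yaml

def load_config(path: str = "configs/default.yaml") -> dict:
    """Read a YAML configuration file into a plain dictionary."""
    with open(path, "r") as handle:
        return yaml.safe_load(handle)

config = load_config()
print(config["data"]["batch_size"])         # 32 in the default M3-oriented config
print(config["training"]["learning_rate"])  # 0.002
```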
Contributing:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Write tests
- Ensure 90%+ test coverage
- Run linting and formatting
- Commit changes
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
If you use this work in your research, please cite:
@article{abad2024generalizable,
title={Generalizable disease detection using model ensemble on chest X-ray images},
author={Abad, Zelalem Dagne and others},
journal={Scientific Reports},
year={2024}
}
Dean Coulstock
- 📍 Dublin, Ireland
- 💼 DeanJCoulstock@gmail.com
- 📧 Chickenstock02@gmail.com (personal)
- Original research paper authors for the methodological foundation
- Medical imaging community for dataset contributions
- Open-source ML community for the underlying frameworks