Machine Learning pipeline for the identification and classification of microplastics using ATR-FTIR spectral data.
This repository contains a research-oriented machine learning pipeline for the analysis, classification, and interpretation of microplastics data, combining environmental science with modern data-driven methods.
The project is designed to be modular, reproducible, and extensible, supporting both exploratory analysis and interpretable ML models.
Microplastics (<5 mm plastic particles) are an emerging global environmental threat due to their persistence, widespread distribution, and potential impacts on ecosystems and human health.
This project aims to:
- Apply machine learning models to microplastics datasets
- Improve classification accuracy and interpretability
- Support environmental decision-making through explainable AI
- Provide a research-grade codebase suitable for extension and publication
- Detect and classify microplastics based on measured features
- Explore environmental and chemical patterns in the data
- Train and evaluate ML models with standard classification metrics (accuracy, precision, recall, F1-score)
- Interpret predictions using SHAP and other explainability tools
- Assess annotation reliability via inter-annotator agreement (Cohen's Kappa)
microplastics-ml-research/
│
├── data/ # Raw and processed datasets
├── models/ # Trained models and serialized artifacts
├── notebooks/ # Jupyter notebooks (EDA, experiments)
├── reports/ # Figures, tables, and generated reports
├── src/ # Core source code (features, utils, models)
│
├── config.py # Global configuration (paths, parameters)
├── main.py # Main execution entry point
│
├── run_training.py # Model training pipeline
├── run_exploration.py # Exploratory data analysis
├── run_shap.py # SHAP explainability analysis
├── run_interpretability.py # Model interpretation tools
├── run_kappa_evaluation.py # Cohen's Kappa agreement analysis
├── run_chemical_interpretation.py # Chemical/environmental analysis
│
├── requirements.txt # Python dependencies
└── README.md # Project documentation
This project uses ATR-FTIR spectroscopy data from the Rusanda Lake peloid study (Serbia):
Prosenc, F. (2024). Identification of microplastics isolated from Rusanda peloid (Serbia) with ATR-FTIR [Data set]. Zenodo.
📍 DOI: 10.5281/zenodo.13850936
The dataset contains:
- Unprocessed FTIR spectra from isolated microplastic candidates
- Spectra processed with OpenSpecy 1.0
- Library matches for polymer identification
- Microplastics categorized by morphology: fibers, films, and fragments
ATR-FTIR spectra_Rusanda sample/
├── OpenSpecy_Rusanda sample library matches/
│ ├── Fibres top matches.csv
│ ├── Films top matches.csv
│ └── Fragments top matches.csv
└── Processed FTIR spectra with OpenSpecy_Rusanda sample/
    ├── BLATO TOPLICE_fiber_1_processed.csv
    ├── BLATO TOPLICE_fiber_2_processed.csv
    └── ...
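The processed filenames listed above appear to encode the sampling site, the particle morphology, and a running index. The following is a minimal sketch of how those labels could be recovered from a filename. The pattern `<site>_<morphology>_<index>_processed.csv` is an assumption inferred from the two example names shown here, and `parse_spectrum_filename` is a hypothetical helper, not a function from this repository.

```python
import re
from pathlib import PurePath

# Assumed pattern: "<site>_<morphology>_<index>_processed.csv",
# inferred only from the example filenames in the dataset listing.
FILENAME_RE = re.compile(
    r"^(?P<site>.+)_(?P<morphology>[A-Za-z]+)_(?P<index>\d+)_processed\.csv$"
)


def parse_spectrum_filename(path):
    """Return (site, morphology, index), or None if the name does not match."""
    match = FILENAME_RE.match(PurePath(path).name)
    if match is None:
        return None
    return match["site"], match["morphology"].lower(), int(match["index"])


print(parse_spectrum_filename("BLATO TOPLICE_fiber_1_processed.csv"))
# ('BLATO TOPLICE', 'fiber', 1)
```

Files that do not follow the assumed pattern return `None`, so they can be logged and inspected rather than silently mislabeled.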
- Spectral signatures (wavenumber vs. absorbance)
- Particle morphology (fiber, fragment, film)
- Polymer type (via library matching: PE, PP, PET, PS, etc.)
- Match confidence scores from OpenSpecy
- Chemical functional groups from FTIR peaks
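Spectral signatures can only be used as tabular features once every spectrum shares the same wavenumber axis. The sketch below shows one common way to do that: linear interpolation onto a fixed grid followed by min-max scaling. The grid limits, the 10 cm^-1 step, and the function name `resample_spectrum` are illustrative assumptions, and the spectrum is synthetic rather than taken from the dataset.

```python
import numpy as np


def resample_spectrum(wavenumber, absorbance, grid):
    """Interpolate one spectrum onto a shared grid and scale it to [0, 1]."""
    wavenumber = np.asarray(wavenumber, dtype=float)
    absorbance = np.asarray(absorbance, dtype=float)
    order = np.argsort(wavenumber)  # np.interp requires increasing x values
    resampled = np.interp(grid, wavenumber[order], absorbance[order])
    span = resampled.max() - resampled.min()
    if span == 0:
        return np.zeros_like(resampled)  # flat spectrum: nothing to scale
    return (resampled - resampled.min()) / span


# Assumed common grid (illustrative limits and step, in cm^-1).
grid = np.arange(650, 4001, 10)

# Synthetic spectrum stored in descending wavenumber order,
# with a single Gaussian band centred at 2900 cm^-1.
wn = np.linspace(4000, 600, 1700)
ab = np.exp(-((wn - 2900) / 30) ** 2) + 0.05

features = resample_spectrum(wn, ab, grid)
print(features.shape, grid[features.argmax()])
```

Stacking the resampled vectors row by row gives a feature matrix with one column per grid point, which is the shape the classical models in this project expect.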
The dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) and Community Data License Agreement Permissive 1.0 (CDLA-Permissive-1.0) licenses, which allow research and educational use with proper attribution.
The project supports classical and interpretable ML approaches, such as:
- Random Forest
- Gradient Boosting
- Logistic Regression
- Tree-based classifiers
- Robust performance on tabular scientific data
- Compatibility with explainability methods
- Lower risk of overfitting on limited datasets
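As a rough illustration of how the listed model families could be compared on a small dataset, the sketch below runs stratified 5-fold cross-validation with scikit-learn. The data is synthetic (`make_classification`), and the hyperparameters and the macro-F1 scoring choice are assumptions for the example, not the settings used in `run_training.py`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small spectral feature table with three classes.
X, y = make_classification(
    n_samples=120, n_features=20, n_informative=6, n_classes=3, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    # Logistic regression is scale-sensitive, so it gets a scaler in front.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# Stratified folds keep class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the spread across folds alongside the mean helps show whether a difference between models is larger than the fold-to-fold variation, which matters on limited datasets.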
- Accuracy
- Precision / Recall / F1-score
- Confusion Matrix
- Cohen's Kappa (for label agreement and reliability)
- Feature importance and SHAP values
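The classification metrics above are all available in scikit-learn. The sketch below computes them on a small set of made-up polymer labels; the labels are illustrative only and do not come from the dataset. Cohen's Kappa is defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    cohen_kappa_score,
    confusion_matrix,
)

# Made-up labels for illustration; not results from this project.
labels = ["PE", "PP", "PET", "PS"]
y_true = ["PE", "PE", "PP", "PP", "PET", "PET", "PS", "PS"]
y_pred = ["PE", "PP", "PP", "PP", "PET", "PET", "PS", "PE"]

acc = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
kappa = cohen_kappa_score(y_true, y_pred)

print(f"accuracy = {acc:.3f}")  # 6 of 8 correct -> 0.750
print(f"kappa    = {kappa:.3f}")  # (0.75 - 0.25) / (1 - 0.25) -> 0.667
print(cm)
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

In this example the chance agreement p_e is 0.25, so an accuracy of 0.75 corresponds to a Kappa of about 0.67. The same `cohen_kappa_score` call works for comparing two annotators: pass one annotator's labels as the first argument and the other's as the second.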
To ensure scientific transparency, the project includes:
- SHAP (SHapley Additive exPlanations)
- Feature importance analysis
- Chemical and environmental interpretation of predictions
This allows linking model behavior to physical and chemical meaning, which is critical for environmental research.
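To make the idea behind SHAP concrete, the sketch below computes exact Shapley values for one sample by enumerating every feature subset. It deliberately does not use the `shap` package and is not the code in `run_shap.py`; it is a brute-force illustration of the quantity SHAP estimates, feasible only for a handful of features. The function name `exact_shapley`, the synthetic data, and the choice of background rows are all assumptions for the example.

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def exact_shapley(predict, x, background):
    """Exact Shapley values for one sample via full subset enumeration.

    Features outside a subset are filled from the background rows.
    Cost grows as 2**n_features, so this is only for tiny inputs.
    """
    n = x.shape[0]

    def value(subset):
        data = background.copy()
        cols = list(subset)
        if cols:
            data[:, cols] = x[cols]  # fix the chosen features to the sample
        return predict(data).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


# Synthetic data: only the first two features drive the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)


def predict(data):
    return model.predict_proba(data)[:, 1]  # probability of class 1


background = X[:50]
x = X[100]
phi = exact_shapley(predict, x, background)

baseline = predict(background).mean()
print("shapley values:", np.round(phi, 3))
print("sum + baseline:", phi.sum() + baseline, "prediction:", predict(x[None, :])[0])
```

The final line checks the additivity property: the attributions plus the average background prediction add up to the model's prediction for the sample. For spectral features, each attribution would belong to a wavenumber bin, which is what allows a prediction to be related back to specific absorption bands.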
git clone https://github.com/JeffersonConza/microplastics-ml-research.git
cd microplastics-ml-research
pip install -r requirements.txt

Exploratory Data Analysis
python run_exploration.py

Train models
python run_training.py

Evaluate inter-annotator agreement
python run_kappa_evaluation.py

Model interpretability (SHAP)
python run_shap.py

Chemical interpretation
python run_chemical_interpretation.py

- Fixed random seeds (where applicable)
- Config-driven parameters via config.py
- Modular scripts for independent execution
- Clear separation between data, models, and analysis
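The reproducibility points above could be implemented along the following lines. This is a minimal sketch, not the contents of the repository's `config.py`: the `Config` field names, the default values, and the `set_seeds` helper are hypothetical, and only the directory names are taken from the project layout.

```python
import random
from dataclasses import dataclass
from pathlib import Path

import numpy as np


@dataclass(frozen=True)
class Config:
    """Hypothetical central configuration; field names are illustrative."""

    data_dir: Path = Path("data")
    models_dir: Path = Path("models")
    reports_dir: Path = Path("reports")
    random_seed: int = 42
    test_size: float = 0.2


def set_seeds(seed):
    """Seed Python's and NumPy's global RNGs and return a dedicated Generator."""
    random.seed(seed)
    np.random.seed(seed)
    return np.random.default_rng(seed)


CONFIG = Config()

# Two runs started from the same seed draw identical numbers.
first = set_seeds(CONFIG.random_seed).random(3)
second = set_seeds(CONFIG.random_seed).random(3)
print(np.array_equal(first, second))  # True
```

Making the dataclass frozen prevents scripts from mutating shared settings at runtime, so every `run_*.py` script reads the same values.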
This project is suitable for:
- Environmental data science research
- Microplastics monitoring studies
- Explainable AI applications in ecology
- Academic theses and peer-reviewed publications
- Deep learning models for spectral data
- Automated microplastics image classification
- Integration with GIS and spatial analysis
- Longitudinal environmental trend analysis
- Risk assessment and policy-oriented outputs
Jefferson Conza
Mathematics Student | Machine Learning & Data Science
GitHub: @JeffersonConza
This project is intended for research and educational purposes.
Please cite appropriately if used in academic work.
- Environmental microplastics research community
- Open-source Python scientific ecosystem
- Contributors to interpretable machine learning tools