This project analyzes immune repertoire data obtained from the blood samples of healthy donors and ovarian cancer patients. The data comprise T-cell receptor alpha (TRA) and beta (TRB) sequences obtained through Repertoire Sequencing (Rep-Seq). The analysis covers data preprocessing, feature filtering, and machine learning for classification.
This repository accompanies the publication:
Zuckerbrot-Schuldenfrei, M., et al.
"Ovarian cancer is detectable from peripheral blood using machine learning over T-cell receptor repertoires"
Briefings in Bioinformatics, 2024.
https://doi.org/10.1093/bib/bbae075
- Blood Collection:
- Blood samples were collected from healthy donors and ovarian cancer patients.
- Repertoire Sequencing (Rep-Seq):
- Repertoire Sequencing was performed on the collected blood samples to obtain TRA and TRB sequences.
- Data Processing with MiXCR:
- The raw sequencing data were processed with the MiXCR pipeline.
- Concatenation of TRA and TRB Files:
- The processed TRA and TRB files were concatenated for further analysis.
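
The concatenation step can be pictured with a minimal Python sketch. It is not the repository's code: it assumes each chain is exported as a tab-separated clonotype table, and the column names (`cloneCount`, `aaSeqCDR3`), the added `chain` column, and the toy sequences are illustrative assumptions only.

```python
import csv
import io


def read_clonotypes(handle, chain):
    """Read a tab-separated clonotype table and tag each row with its chain."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["chain"] = chain  # hypothetical extra column marking TRA or TRB
        rows.append(row)
    return rows


def concatenate_chains(tra_handle, trb_handle):
    """Stack the TRA rows on top of the TRB rows to form a single table."""
    return read_clonotypes(tra_handle, "TRA") + read_clonotypes(trb_handle, "TRB")


# Tiny made-up tables standing in for one sample's processed TRA and TRB files.
tra_text = "cloneCount\taaSeqCDR3\n10\tCAVRDSNYQLIW\n3\tCAASGGSYIPTF\n"
trb_text = "cloneCount\taaSeqCDR3\n7\tCASSLGQAYEQYF\n"

combined = concatenate_chains(io.StringIO(tra_text), io.StringIO(trb_text))
for row in combined:
    print(row["chain"], row["cloneCount"], row["aaSeqCDR3"])
```

Keeping an explicit chain label is one possible design choice; it lets later steps still distinguish alpha from beta clonotypes after the tables are merged.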
- immunarch_analysis.Rmd:
- This R Markdown document performs subsampling on the concatenated data and conducts basic analyses using the `immunarch` package.
- feature_filtering.Rmd:
- This R Markdown document implements two different methods for feature filtering on the concatenated data.
- ML_atom_SFM_600f.ipynb and ML_atom_SFM_16f.ipynb:
- These Jupyter Notebooks contain Python scripts for machine learning.
- The data resulting from `feature_filtering.Rmd` is processed through these notebooks.
- Machine learning algorithms are applied for classification using the `atom` package.
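
The subsampling mentioned for `immunarch_analysis.Rmd` is done in R, so the following Python sketch only illustrates the general idea of downsampling a repertoire to a fixed number of reads. It is not the `immunarch` implementation, and the function name `downsample`, the target depth, and the toy clonotype counts are assumptions made for the example.

```python
import random
from collections import Counter


def downsample(clone_counts, n_reads, seed=0):
    """Draw n_reads reads without replacement and return per-clonotype counts."""
    total = sum(clone_counts.values())
    if n_reads > total:
        raise ValueError("cannot draw more reads than the repertoire contains")
    # Expand every clonotype into one entry per read, then sample reads uniformly.
    reads = [clone for clone, count in clone_counts.items() for _ in range(count)]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return Counter(rng.sample(reads, n_reads))


# Made-up repertoire: clonotype (CDR3 amino acid sequence) -> read count.
repertoire = {"CASSLGQAYEQYF": 50, "CAVRDSNYQLIW": 30, "CAASGGSYIPTF": 20}
sub = downsample(repertoire, n_reads=40)
print(sum(sub.values()), "reads kept across", len(sub), "clonotypes")
```

Sampling reads rather than clonotypes means abundant clonotypes are more likely to survive, which is the usual intent when repertoires of different sequencing depth are brought to a common size.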
- The analysis can be reproduced by following the steps outlined in each analysis script.
- Ensure that the required dependencies (such as R packages and Python libraries) are installed.
- The project was developed and tested with the following package versions:
- Python: 3.8.13
- NumPy: 1.21.2
- Pandas: 1.4.1
- Scikit-learn: 1.0.2
- atom-ml: 4.12.0
- To reproduce the Python environment, run `conda env create -f environment.yml`.
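
After creating the environment, a quick check of the installed versions against the ones listed above can catch mismatches early. The sketch below is not part of the repository; it uses only the standard library, and the helper name `check_versions` is hypothetical. The distribution names are assumed to match those used by the package index (`numpy`, `pandas`, `scikit-learn`, `atom-ml`).

```python
import sys
from importlib import metadata

# Versions the project reports having been developed and tested with.
PINNED = {
    "numpy": "1.21.2",
    "pandas": "1.4.1",
    "scikit-learn": "1.0.2",
    "atom-ml": "4.12.0",
}


def check_versions(pinned):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


print("Python", ".".join(map(str, sys.version_info[:3])), "(tested with 3.8.13)")
for name, (expected, installed) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty report means every listed package matches its tested version; a different version does not necessarily break the notebooks, but it is a reasonable first thing to rule out if results differ.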
- The Rep-Seq data were deposited in the NCBI BioProject database under accession number PRJNA1152888.
- Data resulting from the second feature filtering method in `feature_filtering.Rmd` are available in the `data_16_ab.csv` file. All analyses on these data can be reproduced with `ML_atom_SFM_16f.ipynb`.