GitHub - rossheat/population_predictor: Predict population groups using SNPs from the 1000 Genomes Project

Population Predictor is an attention-based deep learning model that predicts human super-population groups from genetic variation data. Using SNP (Single Nucleotide Polymorphism) data from the 1000 Genomes Project, it achieves 98.01% accuracy in classifying individuals into five major population groups: European (EUR), East Asian (EAS), African (AFR), South Asian (SAS), and American (AMR).

For computational efficiency, this implementation uses data from only chromosomes 21 and 22, selecting the 100,000 SNPs with highest variance across individuals from each chromosome.

Results

Test Accuracy by Population

Population	Accuracy
EUR	>99.99%
EAS	>99.99%
AFR	96.97%
SAS	>99.99%
AMR	91.18%

Training Visualisation

While training, a real-time chart displaying loss and accuracy metrics will appear and update after each epoch. This visualisation will be automatically saved to the 'models' directory at the end of training.

Model Architecture

The model combines dense neural networks with multi-head attention, featuring:

Input transformation through dense layers
Multi-head self-attention mechanism
Layer normalisation
Dropout for regularisation
Classification head for 5 super-populations

Data Pipeline

Dataset Composition

Using data from chromosomes 21 and 22:

Top 100,000 SNPs selected from each chromosome based on variance across individuals
200,000 total features per individual
2,504 total individuals

Population Distribution

Population	Train (80%)	Validation (10%)	Test (10%)
EUR	402 (20.1%)	50 (20.0%)	51 (20.3%)
EAS	403 (20.1%)	50 (20.0%)	51 (20.3%)
AFR	529 (26.4%)	66 (26.4%)	66 (26.3%)
SAS	391 (19.5%)	49 (19.6%)	49 (19.5%)
AMR	278 (13.9%)	35 (14.0%)	34 (13.5%)
Total	2003	250	251

Feature Selection

For each chromosome, the code:

Converts genotype data to binary format (presence/absence of alternative alleles)
Calculates variance of each SNP across all individuals
Selects the top 100,000 SNPs with highest variance
Combines selected SNPs from both chromosomes

My intuition was to focus on the model on the most variable genetic markers which we hypothesised to be the more informative. Further research is required to determine if there is a most optimal strategy.

Installation

Clone the repository:

git clone https://github.com/rossheat/population_predictor.git
cd population_predictor

Create conda environment:

conda env create -f env.yml
conda activate population_predictor

Usage

1. Download data:

python scripts/download_data.py --chromosomes 21 22

2. Preprocess data:

python scripts/preprocess_data.py --chromosomes 21 22 --max-snps 100000

3. Train model:

python scripts/train_test_model.py --chromosomes 21 22 --epochs 10

The best performing model (based on validation accuracy), will be automatically saved to the 'models' directory at the end of training.

The model will automatically use GPU if available through PyTorch's CUDA detection.

4. Make predictions for individual samples:

python scripts/predict_one.py --model models/model_chr21-22.pt --data data/preprocessed/chromosome_21-22.npz --sample-idx 0

Example output:

Predicted super-population: EUR

Probabilities:
EUR: 0.9990
AMR: 0.0008
SAS: 0.0001
AFR: 0.0000
EAS: 0.0000

Citation

If you use this code or the 1000 Genomes Project data, please cite:

@article{1000genomes2015,
  title={A global reference for human genetic variation},
  author={{The 1000 Genomes Project Consortium}},
  journal={Nature},
  volume={526},
  number={7571},
  pages={68--74},
  year={2015},
  publisher={Nature Publishing Group}
}

License

This project is licensed under the GNU General Public License v3.0

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
assets		assets
data		data
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
env.yml		env.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Results

Test Accuracy by Population

Training Visualisation

Model Architecture

Data Pipeline

Dataset Composition

Population Distribution

Feature Selection

Installation

Usage

Citation

License

About

Uh oh!

Releases

Packages

Languages

License

rossheat/population_predictor

Folders and files

Latest commit

History

Repository files navigation

Results

Test Accuracy by Population

Training Visualisation

Model Architecture

Data Pipeline

Dataset Composition

Population Distribution

Feature Selection

Installation

Usage

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages