HARP — Haplotype Allele Read Profiler.
A Python tool for counting haplotype-specific read support for SNVs from a haplotagged BAM file and a phased VCF file.
- Computes allele counts per haplotype (h1 and h2) for SNVs.
- Supports parallel processing of BAM files for faster execution.
- Handles indexed BAM and bgzipped VCF files.
- Produces a simple TSV output suitable for downstream analysis.
Using Poetry:
git clone https://github.com/yourusername/harp.git
cd harp
poetry install
Run HARP via CLI:
poetry run harp run --bam PATH_TO_BAM --vcf PATH_TO_VCF --out OUTPUT_TSV [--threads N]
Example:
poetry run harp run \
--bam specification/test_data/giab_2023.05.hg002.haplotagged.chr16_28000000_29000000.processed.30x.bam \
--vcf specification/test_data/giab_2023.05.hg002.wf_snp.chr16_28000000_29000000.vcf.gz \
--out result.tsv \
--threads 4
CLI options:
- --bam FILE: Input BAM file (indexed) [required]
- --vcf FILE: Input phased VCF file [required]
- --out FILE: Output TSV file [required]
- --threads INTEGER: Number of parallel threads to use (default: 4)
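For scripted runs, the options above can be assembled into an argument list and passed to `subprocess`. This is a minimal sketch: `build_harp_command` and `run_harp` are hypothetical helper names, not part of HARP, and the sketch assumes HARP was installed with Poetry as described above.

```python
import subprocess


def build_harp_command(bam, vcf, out, threads=4):
    """Assemble the documented `poetry run harp run` invocation as an argv list.

    Hypothetical helper for illustration; only the flags listed in this README
    are used.
    """
    return [
        "poetry", "run", "harp", "run",
        "--bam", str(bam),
        "--vcf", str(vcf),
        "--out", str(out),
        "--threads", str(threads),
    ]


def run_harp(bam, vcf, out, threads=4):
    """Run HARP and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_harp_command(bam, vcf, out, threads), check=True)


# Build (but do not execute) a command for inspection.
cmd = build_harp_command("sample.bam", "sample.vcf.gz", "result.tsv", threads=8)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.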
The output is a tab-separated file with the following columns:
- chrom: Chromosome name
- pos: 0-based genomic position of the variant
- h1_REF: Number of reads supporting the reference allele on haplotype 1
- h1_ALT: Number of reads supporting the alternate allele on haplotype 1
- h2_REF: Number of reads supporting the reference allele on haplotype 2
- h2_ALT: Number of reads supporting the alternate allele on haplotype 2
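To illustrate what the four count columns represent, here is a rough sketch of the tally at a single SNV. It is not HARP's actual implementation: `tally_site` is a hypothetical function, and the input is assumed to be already reduced to (HP tag value, read base) pairs, which a real tool would extract from the BAM pileup.

```python
from collections import Counter


def tally_site(observations, ref, alt):
    """Count REF/ALT support per haplotype at one SNV.

    observations: iterable of (hp, base) pairs, where hp is the read's HP tag
    value (1 or 2, or None when the read is untagged) and base is the read
    base overlapping the variant position.
    """
    counts = Counter()
    for hp, base in observations:
        if hp not in (1, 2):
            continue  # untagged reads cannot be assigned to a haplotype
        if base == ref:
            counts[f"h{hp}_REF"] += 1
        elif base == alt:
            counts[f"h{hp}_ALT"] += 1
        # bases matching neither allele are ignored in this sketch
    return {key: counts[key] for key in ("h1_REF", "h1_ALT", "h2_REF", "h2_ALT")}


# Seven reads over a hypothetical A>G site: three on h1, three on h2, one untagged.
obs = [(1, "A"), (1, "A"), (1, "G"), (2, "G"), (2, "G"), (None, "A"), (2, "T")]
site_counts = tally_site(obs, ref="A", alt="G")
print(site_counts)
```

How HARP itself treats untagged reads or bases that match neither allele is not stated here; skipping them is an assumption of this sketch.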
Example output:
chrom pos h1_REF h1_ALT h2_REF h2_ALT
chr16 28001380 11 4 3 12
chr16 28002343 11 5 3 12
chr16 28003017 12 4 3 13
chr16 28003087 12 4 3 11
chr16 28004800 13 4 3 9
...
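For downstream analysis, the TSV can be read with the standard library alone. The sketch below computes the ALT allele fraction per haplotype for each site. It assumes the file starts with the header row shown in the example output; `allele_fractions` is a hypothetical helper name.

```python
import csv
import io


def allele_fractions(tsv_text):
    """Parse HARP-style TSV text and return per-site ALT fractions per haplotype."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for rec in reader:
        row = {"chrom": rec["chrom"], "pos": int(rec["pos"])}
        for hap in ("h1", "h2"):
            ref = int(rec[f"{hap}_REF"])
            alt = int(rec[f"{hap}_ALT"])
            total = ref + alt
            # None marks sites with no informative reads on this haplotype
            row[f"{hap}_alt_frac"] = alt / total if total else None
        rows.append(row)
    return rows


# First two rows of the example output above, tab-separated.
example = (
    "chrom\tpos\th1_REF\th1_ALT\th2_REF\th2_ALT\n"
    "chr16\t28001380\t11\t4\t3\t12\n"
    "chr16\t28002343\t11\t5\t3\t12\n"
)
fractions = allele_fractions(example)
for row in fractions:
    print(row)
```

To read a real result file, pass `open("result.tsv").read()` instead of the inline example text.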
- HARP assumes that the BAM file is haplotagged (HP tag present).
- The BAM file must be coordinate-sorted and indexed (.bai present). Unsorted BAMs will lead to incorrect counts or skipped reads.
- Only bi-allelic SNVs are counted.
- Chunk size and threading can be tuned for performance.
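As an illustration of the bi-allelic SNV restriction, the sketch below filters uncompressed VCF text lines by their REF and ALT columns. It is one plausible way to express the rule, not HARP's own code; `is_biallelic_snv` and `biallelic_snvs` are hypothetical names.

```python
def is_biallelic_snv(ref, alt):
    """Return True when REF and ALT are two different single bases.

    Multi-allelic records (ALT like "G,T"), indels, and symbolic alleles
    fail the single-base check and are rejected.
    """
    bases = {"A", "C", "G", "T"}
    ref, alt = ref.upper(), alt.upper()
    return ref in bases and alt in bases and ref != alt


def biallelic_snvs(vcf_lines):
    """Yield (chrom, pos, ref, alt) for bi-allelic SNV records in VCF text lines."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        # VCF fixed columns: CHROM, POS, ID, REF, ALT, ...
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        if is_biallelic_snv(ref, alt):
            yield chrom, pos, ref, alt


# Made-up records for illustration only: one SNV, one multi-allelic site, one deletion.
lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr16\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr16\t200\t.\tC\tG,T\t50\tPASS\t.\n",
    "chr16\t300\t.\tAT\tA\t50\tPASS\t.\n",
]
kept = list(biallelic_snvs(lines))
print(kept)
```

Note that POS in a VCF record is 1-based, so a tool reporting 0-based positions would need to subtract one.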