GATK 4 Mutect2 Somático
- GATK4 - Mutect2
- Gene JAK2
- Referência chr9
- Sobre as versões do Genoma Humano: https://gatk.broadinstitute.org/hc/en-us/articles/360035890711-GRCh37-hg19-b37-humanG1Kv37-Human-Reference-Discrepancies#grch37
- Amostras:
- WP043 (tumor)
- WP044 (normal)
- https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-
- WP190 (tumor) e WP191 (normal)
- WP017 (tumor) e WP018 (normal)
Nota 1: Utilizar o af-gnomad chr13 e chr19.
Nota 2: Será preciso baixar o chr13 e chr19 da UCSC.
Download chr19
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr19.fa.gzDownload chr13
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr13.fa.gzConcatenar os arquivos .fa.gz
Dica 1: zcat lê arquivos compactados .gz e zip
zcat chr13.fa.gz chr19.fa.gz | sed -e "s/chr//g" > hg19.faGerar o index do arquivo hg19.fa
samtools faidx hg19.fa \ ^__^
\ (oo)\_______
(__)\ )\/\
||----w |
|| ||
Os arquivos BAM (tumor e normal) já foram gerados e estão prontos para a chamada de variates (ver Anexo 1). Então, agora vamos fazer download da referência chr9 e gerar o index com o programa samtools.
- Download
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr9.fa.gz- Alterar nome do header: DE: >chr9 para >9
Essa alteração é necessária pois no BAM a referência não tinha
>chrera apenas>9.
zcat chr9.fa.gz | sed -e "s/chr//g" > chr9.fa- Verificar se alteração foi feita com o comando
head
head chr9.faO comando
headlê as 10 primeiras linha de um arquivo texto
Samtools is a suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories:
- samtools install (Mac)
brew install samtools - samtools install (Ubuntu)
sudo apt-get install samtools - samtools install (Docker)
docker pull biocontainers/samtools- samtools faidx
samtools faidx chr9.faVersion: 4.2.2.0
Genome Analysis Toolkit - Variant Discovery in High-Throughput Sequencing Data. https://gatk.broadinstitute.org/
GATK4 install Docker
docker pull broadinstitute/gatk:4.2.2.0GATK4 install Mac e Linux
- Download
wget -c https://github.com/broadinstitute/gatk/releases/download/4.2.2.0/gatk-4.2.2.0.zip- Descompactar
unzip gatk-4.2.2.0.zip - Testando gatk
./gatk-4.2.2.0/gatk./gatk-4.2.2.0/gatk CreateSequenceDictionary -R chr9.fa -O chr9.dict./gatk-4.2.2.0/gatk ScatterIntervalsByNs -R chr9.fa -O chr9.interval_list -OT ACGTCall somatic SNVs and indels via local assembly of haplotypes
O comando:
samtools view -H normal_JAK2.bamvocê consegue pegar o campo SM: que contém o ID da amostra normal.
samtools view -H normal_JAK2.bam | grep RG | cut -f6 | sed -e "s/SM://g"./gatk-4.2.2.0/gatk Mutect2 \
-R chr9.fa \
-I tumor_JAK2.bam \
-I normal_JAK2.bam \
-normal WP044 \
--germline-resource af-only-gnomad-chr9.vcf.gz \
-O somatic.vcf.gz \
-L chr9.interval_listTabulates pileup metrics for inferring contamination
- GetPileupSummaries Tumor
./gatk-4.2.2.0/gatk GetPileupSummaries \
-I tumor_JAK2.bam \
-V af-only-gnomad-chr9.vcf.gz \
-L chr9.interval_list \
-O tumor_JAK2.table- GetPileupSummaries Normal
./gatk-4.2.2.0/gatk GetPileupSummaries \
-I normal_JAK2.bam \
-V af-only-gnomad-chr9.vcf.gz \
-L chr9.interval_list \
-O normal_JAK2.tableCalculate the fraction of reads coming from cross-sample contamination
./gatk-4.2.2.0/gatk CalculateContamination \
-I tumor_JAK2.table \
-matched normal_JAK2.table \
-O contamination.tableFilter somatic SNVs and indels called by Mutect2
./gatk-4.2.2.0/gatk FilterMutectCalls \
-R chr9.fa \
-V somatic.vcf.gz \
--contamination-table contamination.table \
-O filtered.vcf.gzdocker pull ensemblorg/ensembl-vepCriar o diretorio vep_output
mkdir -p vep_output
Modificar a permissao do diretorio vep_output
chmod 777 vep_output
- Aplicar apenas no Google Colab
mkdir -p vep_output
chmod 777 vep_outputbash vep-docker.shGATK Best Practices - Exome PoN
- vcf
wget -c https://storage.googleapis.com/gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf- vcf.idx
wget -c https://storage.googleapis.com/gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf.idx- Mutect2
./gatk-4.2.2.0/gatk Mutect2 \
-R hg19.fa \
-I tumor_wp190.bam \
--germline-resource af-only-gnomad-chr13-chr19.vcf.gz \
--panel-of-normals Mutect2-exome-panel.vcf \
-L hg19.interval_list \
-O WP190.somatic.pon.vcf.gz
- CalculateContamination somente com o table do tumor (ex.: wp190)
./gatk-4.2.2.0/gatk CalculateContamination \
-I tumor_wp190.table \
-O WP190.contamination.pon.table- FilterMutectCalls
./gatk-4.2.2.0/gatk FilterMutectCalls \
-R hg19.fa \
-V WP190.somatic.pon.vcf.gz \
--contamination-table WP190.contamination.pon.table \
-O WP190.filtered.pon.vcf.gzsamtools view -h -b /Volumes/Seagate\ Expansion\ Drive/data-lpfap10/projects/proadi/exome/bam/WP043/WP043.bam 9:5021937-5126899 | samtools bam2fq -1 tumor_R1.fq -2 tumor_R2.fq - samtools view -h -b /Volumes/Seagate\ Expansion\ Drive/data-lpfap10/projects/proadi/exome/bam/WP044/WP044.bam 9:5021937-5126899 | samtools bam2fq -1 normal_R1.fq -2 normal_R2.fq - Aqui estão os comandos que foram utilizados para gerar os BAMs intermediários e como gerar arquivos FASTQs de regiões específicas do seu arquivo BAM completo.
Essas etapas não precisam ser executadas nesse pipeline
- BAM para BAM
samtools view -h -b /Volumes/Seagate\ Expansion\ Drive/data-lpfap10/projects/proadi/exome/bam/WP043/WP043.bam 9:5021937-5126899 > tumor_JAK2.bamsamtools view -h -b /Volumes/Seagate\ Expansion\ Drive/data-lpfap10/projects/proadi/exome/bam/WP044/WP044.bam 9:5021937-5126899 > normal_JAK2.bam- Gerar index do BAM (.BAI)
samtools index tumor_JAK2.bam samtools index normal_JAK2.bam - Header do VCF
zgrep -w "\#" af-only-gnomad.raw.sites.chr.vcf.gz > header- Apenas Região do Gene JAK2
zgrep -w "^chr9" af-only-gnomad.raw.sites.chr.vcf.gz | awk '$2>=5021937 && $2<=5126899' > JAK2.region- Concatenar header + vcf
cat header JAK2.region > af-only-gnomad-chr9.vcf- Compactar
bgzip af-only-gnomad-chr9.vcf- Index do VCF
tabix -p vcf af-only-gnomad-chr9.vcf.gz Rodar amostras extras pareadas:
- Com par:
sh run.chr13-chr19.sh tumor_JAK2.bam normal_JAK2.bam WP043 WP044
sh run.chr13-chr19.sh tumor_wp017.bam tumor_wp018.bam WP017 WP018
sh run.chr13-chr19.sh tumor_wp190.bam tumor_wp191.bam WP190 WP191- Com PoN:
sh run.chr13-chr19.pon.sh tumor_JAK2.bam normal_JAK2.bam WP043 WP044
sh run.chr13-chr19.pon.sh tumor_wp017.bam tumor_wp018.bam WP017 WP018
sh run.chr13-chr19.pon.sh tumor_wp190.bam tumor_wp191.bam WP190 WP191
Código Fonte (run.chr13-chr19.pon.sh e run.chr13-chr19.sh)
# variaveis fixas
gnomad="af-only-gnomad-chr13-chr19.vcf.gz"
interval="hg19.interval_list"
genome="hg19.fa"
# bam e ids
tumor=$1
normal=$2
id_tumor=$3
id_normal=$4
# rodando mutect2
./gatk-4.2.2.0/gatk Mutect2 \
-R $genome \
-I $tumor \
-I $normal \
-normal $id_normal \
--germline-resource $gnomad \
-O $id_tumor.somatic.vcf.gz \
-L $interval
# getPileup tumor
./gatk-4.2.2.0/gatk GetPileupSummaries \
-I $tumor \
-V $gnomad \
-L $interval \
-O $id_tumor.table
# getPileup normal
./gatk-4.2.2.0/gatk GetPileupSummaries \
-I $normal \
-V $gnomad \
-L $interval \
-O $id_normal.table
# CalculateContamination
./gatk-4.2.2.0/gatk CalculateContamination \
-I $id_tumor.table \
-matched $id_normal.table \
-O $id_tumor.contamination.table
# FilterMutectCalls
./gatk-4.2.2.0/gatk FilterMutectCalls \
-R $genome \
-V $id_tumor.somatic.vcf.gz \
--contamination-table $id_tumor.contamination.table \
-O $id_tumor.filtered.vcf.gz
Rodar amostras extras com Panel of Normal (broad institute).
# variaveis fixas
gnomad="af-only-gnomad-chr13-chr19.vcf.gz"
interval="hg19.interval_list"
genome="hg19.fa"
pon="Mutect2-exome-panel.vcf"
# bam e ids
tumor=$1
id_tumor=$3
# rodando mutect2
./gatk-4.2.2.0/gatk Mutect2 \
-R $genome \
-I $tumor \
--germline-resource $gnomad \
--panel-of-normals $pon \
-O $id_tumor.somatic.pon.vcf.gz \
-L $interval
# getPileup tumor
./gatk-4.2.2.0/gatk GetPileupSummaries \
-I $tumor \
-V $gnomad \
-L $interval \
-O $id_tumor.pon.table
# CalculateContamination
./gatk-4.2.2.0/gatk CalculateContamination \
-I $id_tumor.pon.table \
-O $id_tumor.contamination.pon.table
# FilterMutectCalls
./gatk-4.2.2.0/gatk FilterMutectCalls \
-R $genome \
-V $id_tumor.somatic.pon.vcf.gz \
--contamination-table $id_tumor.contamination.pon.table \
-O $id_tumor.filtered.pon.vcf.gz