GATK SNP calling

1. gatk calling

1.1 HaplotypeCaller

1
2
3
4
5
6
7
8
java -Xmx3g -jar GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -R ref.fa -T HaplotypeCaller \
-nct 3 \
--emitRefConfidence GVCF \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
-I in.bam \
-L Chr1 \
-o out.g.vcf.gz

1.2 GenotypeGVCFs

NB. Chr1.gvcf.list: gvcf list, one individual per line

1
2
3
4
5
6
7
8
9
java -Xmx6g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T GenotypeGVCFs \
-R ref.fa \
-nt 4 \
--variant Chr1.gvcf.list \
--dbsnp dbsnp.vcf \
-stand_call_conf 30.0 \
-stand_emit_conf 10.0 \
-L Chr1 \
-o raw.Chr1.vcf

1
2
3
4
5
6
# snp SelectVariants
java -Xmx2g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T SelectVariants \
-R ref.fa \
--variant raw.Chr1.vcf \
-selectType SNP \
-o raw.Chr1.snp.vcf
1
2
3
4
5
6
# indel SelectVariants
java -Xmx2g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T SelectVariants \
-R ref.fa \
--variant raw.Chr1.vcf \
-selectType INDEL \
-o raw.Chr1.indel.vcf

2. samtools calling

1
2
3
4
5
6
# samtools
samtools mpileup -b bam.list -uf ref.fa -r Chr1 -DS -C50 -m 2 -F 0.002 -P ILLUMINA \
| bcftools view -p 0.99 -bcvg - | bcftools view \
| vcfutils.pl varFilter -1 1e-5 -4 1e-7 -d 2 -Q 10 -a 2 -w 3 | bgzip -c > Chr1.flt.vcf.gz
tabix -f Chr1.flt.vcf.gz

3. confidence snp dataset with overlap of gatk and samtools;SNPQ>30

Chr1.overlap.snp.vcf.gz


4. final SNP

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# VariantRecalibrator & ApplyRecalibration
java -Xmx9g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T VariantRecalibrator \
-l INFO \
-R ref.fa \
-input raw.Chr1.snp.vcf.gz \
-resource:concordantSet,VCF,known=true,training=true,truth=true,prior=10.0 Chr1.overlap.snp.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbSNP/dbsnp.vcf \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an DP \
-an InbreedingCoeff \
-recalFile gatk.SNP.recal \
-tranchesFile gatk.SNP.tranches \
-rscriptFile gatk.SNP.plot.R \
--TStranche 90.0 --TStranche 93.0 --TStranche 95.0 \
--TStranche 97.0 --TStranche 99.0 --TStranche 100.0 \
-mode SNP \
-L Chr1
java -Xmx6g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T ApplyRecalibration \
-l INFO \
-R ref.fa \
-input raw.Chr1.snp.vcf.gz \
--ts_filter_level 99.0 \
-recalFile gatk.SNP.recal \
-tranchesFile gatk.SNP.tranches \
-mode SNP \
-L Chr1 \
-o gatk.snp.Chr1.VQSR.vcf
grep -E 'PASS|#' gatk.snp.Chr1.VQSR.vcf >final.gatk.snp.Chr1.VQSR.vcf
perl tabix.pl gatk.snp.Chr1.VQSR.vcf
perl tabix.pl final.gatk.snp.Chr1.VQSR.vcf

5. use Beagle to inprove genotype calls using genotype likelihoods from SAMtools/GATK

用Beagle,利用GL信息做genotype

1
2
3
java -Xmx3g -jar beagle/4.1/beagle.27Jul16.86a.jar \
gl=input.vcf.gz \
out=output.genotype

6. phase with beagle

1
2
3
java -Xmx3g -jar beagle/4.1/beagle.27Jul16.86a.jar \
gt=output.genotype.vcf.gz \
out=output.phased.vcf.gz

7. phase with shapeit (human)

1
2
3
shapeit --thread 4 --effective-size 14269 --input-vcf chr1.genotype.vcf.gz \
-M 1kg/phasing/genetic_map_b37/genetic_map_chr1_combined_b37.txt \
--output-max chr1.shapeit.haps.gz chr1.shapeit.haps.sample --output-log chr1.shapeit.log