VCF manipulation with GATK

HaplotypeCaller, 每个Sample生成GVCF文件, 相当于一个SNP calling的中间文件

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
export JAVA_HOME=java/jre1.8.0_45;
jre1.8.0_45/bin/java -Djava.io.tmpdir=tmpdir \
-jar gatk3.6/GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R ref.fa \
-I sampleA.bam \
-nct 15 \
-ERC GVCF \ # gvcf模式
-L interver.list \ # chr10:10000001-15000000
[--pcr_indel_model NONE ] \ # 如果bam去除了PCR duplications可以加上这个
-o sampleA.chr10_10000001_15000000.g.vcf.gz \
-variant_index_type LINEAR \
-variant_index_parameter 128000 \
2>&1 | tee 01.vcfByWindow/chr10_10000001_15000000/14-5.gvcf.gz.log # log输出


GenotypeGVCFs, 多个sample做SNP calling

1
2
3
4
5
6
7
8
9
10
export JAVA_HOME=java/jre1.8.0_45;
jre1.8.0_45/bin/java -Djava.io.tmpdir=tmpdir \
-jar gatk3.6/GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R ref.fa \
-V gvcf.list \ # 多个体的gvcf列表,一行一个
-stand_call_conf 30.0 \
-stand_emit_conf 10.0 \
-o variants.vcf.gz;


CatVariants,把分区域多线程call的variants合并起来,注意使用java1.7+gatk3.3/3.4

1
2
3
4
5
6
7
8
9
10
export JAVA_HOME=java/jre1.7.0_55;
jre1.7.0_55/bin/java -cp GATK-3.4/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants \
-R ref.fa \
-V 01.vcfByWindow/chr1_1_5000000/variants.vcf.gz \ # 或者-V vcf.list
-V 01.vcfByWindow/chr1_5000001_10000000/variants.vcf.gz \
-V 01.vcfByWindow/chr1_10000001_15000000/variants.vcf.gz \
-V 01.vcfByWindow/chr1_15000001_20000000/variants.vcf.gz \
-out chr1.01.vcf.gz \ # 是-out,不是-o
[-assumeSorted] \ # 如果--variant按顺序排的,则加上;乱的就不加


SelectVariants,取overlap位点,保留set2里和set1 overlap的位点

1
2
3
4
5
6
7
8
9
export JAVA_HOME=java/jre1.8.0_45;
jre1.8.0_45/bin/java -Djava.io.tmpdir=tmpdir \
-jar gatk3.6/GenomeAnalysisTK.jar \
-T SelectVariants \
-R ref.fa \
-V set2.vcf.gz \
-o set2.overlap.vcf.gz \
--concordance set1.vcf.gz


SelectVariants,去除/保留个体

1
2
3
4
5
6
7
8
9
10
11
12
export JAVA_HOME=java/jre1.8.0_45;
jre1.8.0_45/bin/java -Djava.io.tmpdir=tmpdir \
-jar gatk3.6/GenomeAnalysisTK.jar \
-T SelectVariants \
-R ref.fa \
-V set1.vcf.gz \
-o set1.remove.vcf.gz \
# 下面两个二选一
-xl_sf remove.list \ # 去除个体
-sf keep.list \ # 保留个体

CombineVariants -genotypeMergeOptions UNIQUIFY,合并多个vcf,不管个体ID,相同ID的个体都当成不同的并重新命名; 两个set位点可以不完全相同,没信息的genotype会自动标记成./.

1
2
3
4
5
6
7
8
9
10
export JAVA_HOME=java/jre1.8.0_45;
jre1.8.0_45/bin/java -Djava.io.tmpdir=tmpdir \
-jar gatk3.6/GenomeAnalysisTK.jar \
-T CombineVariants \
-R ref.fa \
-V:v1 a.vcf.gz \
-V:v2 b.vcf.gz \
-o ab.UNIQUIFY.vcf.gz \
-genotypeMergeOptions UNIQUIFY


CombineVariants -genotypeMergeOptions PRIORITIZE -priority v1,v2,可能有相同个体的,按照优先级选择具体的genotype

1
2
3
4
5
6
7
8
9
10
export JAVA_HOME=java/jre1.8.0_45;
jre1.8.0_45/bin/java -Djava.io.tmpdir=tmpdir \
-jar gatk3.6/GenomeAnalysisTK.jar \
-T CombineVariants \
-R ref.fa \
-V:v1 a.vcf.gz \
-V:v2 b.vcf.gz \
-o ab.PRIORITIZE.vcf.gz \
-genotypeMergeOptions PRIORITIZE -priority v1,v2


类似:vcftools