vcftools

习惯:Always使用bgzip生成vcf.gz文件

并用tabix生成index,使得vcftools和GATK中直接使用vcf.gz文件。

bgzip和tabix包含在samtools/htslib内。

1
2
samtools/htslib/bgzip -c file.vcf > file.vcf.gz
samtools/htslib/tabix -p vcf file.vcf.gz


小模块
vcftools/vcftools-build/bin/中的小程序,实现简单的特定功能。列出几个最常用的。其他更多

使用前需要导入perl_module lib

1
export PERL5LIB=/home/wanglizhong/software/vcftools/vcftools-build/lib/perl5/site_perl/5.8.8:$PERL5LIB;

单个vcf

重新计算INFO里的AN和AC

fill-an-ac
重新计算AN和AC 并加入INFO中。

1
zcat in.vcf.gz|fill-an-ac > out.vcf


vcf2fa

vcf转成一致性序列(consensus sequence)。

1
2
3
4
5
6
7
cat ref.fa|vcf-consensus file.vcf.gz > out.fa
# 其中某一个个体
cat ref.fa|vcf-consensus -s sampleA_ID file.vcf.gz > out.sampleA_ID.fa
# 只有某一个haplotype(1或者2)
cat ref.fa|vcf-consensus -H 1 file.vcf.gz > out.hap1.fa


取出部分个体

vcf-subset -c

1
vcf-subset -c NA0001,NA0002 file.vcf.gz | bgzip -c > out.vcf.gz


vcf里的信息自定义输出

1
2
vcf-query file.vcf.gz 1:10327-10330
vcf-query file.vcf -f '%CHROM:%POS %REF %ALT [ %DP]\n'
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
Usage: vcf-query [OPTIONS] file.vcf.gz
Options:
-c, --columns <file|list> List of comma-separated column names or one column name per line in a file.
-f, --format <string> The default is '%CHROM:%POS\t%REF[\t%SAMPLE=%GT]\n'
-l, --list-columns List columns.
-r, --region chr:from-to Retrieve the region. (Runs tabix.)
--use-old-method Use old version of API, which is slow but more robust.
-h, -?, --help This help message.
Expressions:
%CHROM The CHROM column (similarly also other columns)
%GT Translated genotype (e.g. C/A)
%GTR Raw genotype (e.g. 0/1)
%INFO/TAG Any tag in the INFO column
%LINE Prints the whole line
%SAMPLE Sample name
[] The brackets loop over all samples
%*<A><B> All format fields printed as KEY<A>VALUE<B>
Examples:
vcf-query file.vcf.gz 1:1000-2000 -c NA001,NA002,NA003
vcf-query file.vcf.gz -r 1:1000-2000 -f '%CHROM:%POS\t%REF\t%ALT[\t%SAMPLE:%*=,]\n'
vcf-query file.vcf.gz -f '[%GT\t]%LINE\n'
vcf-query file.vcf.gz -f '[%GT\ ]%LINE\n'
vcf-query file.vcf.gz -f '%CHROM\_%POS\t%INFO/DP\t%FILTER\n'


将个体顺序重排

Reorder columns

file.vcf.gz按照template.vcf.gz顺序重排

1
vcf-shuffle-cols -t template.vcf.gz file.vcf.gz > out.vcf

vcf postions Reorder

1
2
vcf-sort file.vcf.gz > out.vcf
zcat file.vcf.gz| vcf-sort > out.vcf


转换成tab格式

1
zcat file.vcf.gz | vcf-to-tab > out.tab

输出如下:

1
2
3
4
5
6
7
8
9
10
#CHROM POS REF 05537
chr2 131 T T/T
chr2 437 G G/G
chr2 453 G G/G
chr2 469 G G/G
chr2 526 G G/G
chr2 618 G G/G
chr2 745 A A/A
chr2 756 T T/T
chr2 760 T T/T


计算tstv

1
cat file.vcf | vcf-tstv


vcf格式校验

1
vcf-validator file.vcf.gz


多个vcf

比较多个vcf里SNP位置的overlap

可以用结果画维恩图Venn-Diagram。

1
2
3
4
5
6
7
8
# run
vcf-compare test.vcf.gz test2.vcf.gz test3.vcf.gz|grep ^VN | cut -f 2-
# results
VN 94 test.vcf.gz (9.5%) test2.vcf.gz (22.3%) test3.vcf.gz (48.5%)
VN 100 test.vcf.gz (10.1%) test3.vcf.gz (51.5%)
VN 327 test.vcf.gz (32.9%) test2.vcf.gz (77.7%)
VN 473 test.vcf.gz (47.6%)


合并多个染色体chr/scaffold

1.个体(header)需完全相同。

1
vcf-concat chr1.vcf.gz chr2.vcf.gz | bgzip -c > all.vcf.gz

2.合并有不同个体的多个vcf

  • 最好有相同Postions,缺失的默认会被填上miss(.)

    1
    vcf-merge A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz
  • 把mis的用自定义genotype(0|0,0/0,1/1等等)填上

    1
    vcf-merge -R '0|0' A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz
  • bcftools

    1
    samtools/bcftools/bcftools merge --merge all --no-version --threads 10 file1.vcf.gz file2.vcf.gz |bgzip -c > merge.vcf.gz

3.一行命令合并vcf,巧用括号

1
2
3
# Merge (that is, concatenate) two VCF files into one, keeping the header
# from first one only.
(zcat A.vcf.gz | grep ^#; zcat A.vcf.gz | grep -v ^#; zcat B.vcf.gz | grep -v ^#; )| bgzip -c > out.vcf.gz


找vcf overlap/互补位置

1
2
3
4
5
# 输出两个vcf里共有的
vcf-isec -n +2 A.vcf.gz B.vcf.gz | bgzip -c > out.vcf.gz
# 输出第一个vcf里有,其他vcf(B和C及后面更多)里没有的
vcf-isec -c A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz