CNVnator Based Analysis

CNVnator

(Other software: Genome STRiP)

Call CNV using CNVnator

Running example

1
2
3
4
5
6
7
export ROOTSYS=/path/to/software/root;
export LD_LIBRARY_PATH=/path/to/software/root/lib:$LD_LIBRARY_PATH;
cnvnator -root AGS10A.Chr1.50.root -unique -tree AGS10A.sorted.bam -chrom Chr1 >AGS10A.Chr1.50.root.l1 2>AGS10A.Chr1.50.root.e1;
cnvnator -root AGS10A.Chr1.50.root -d split_chr_dir -his 50 -chrom Chr1 >AGS10A.Chr1.50.root.l2 2>AGS10A.Chr1.50.root.e2
cnvnator -root AGS10A.Chr1.50.root -stat 50 -chrom Chr1 >AGS10A.Chr1.50.root.l3 2>AGS10A.Chr1.50.root.e3;
cnvnator -root AGS10A.Chr1.50.root -partition 50 -chrom Chr1 >AGS10A.Chr1.50.root.l4 2>AGS10A.Chr1.50.root.e4
cnvnator -root AGS10A.Chr1.50.root -call 50 -chrom Chr1 > AGS10A.Chr1.50.cnv 2>AGS10A.Chr1.50.root.e5

Result format:

1
2
3
4
5
6
7
8
9
CNV_type coordinates CNV_size normalized_RD e-val1 e-val2 e-val3 e-val4 q0
- normalized_RD -- normalized to 1.
- e-val1 -- is calculated using t-test statistics.
- e-val2 -- is from the probability of RD values within the region to be in
the tails of a gaussian distribution describing frequencies of RD values in bins.
- e-val3 -- same as e-val1 but for the middle of CNV
- e-val4 -- same as e-val2 but for the middle of CNV
- q0 -- fraction of reads mapped with q0 quality



Note

  • First thins you should do is try different bin_size settings like 100, 200, 500, 1000. And make sure that for the bin size you are using the ratio of average to its sd is around 4-5. In the running log of -stat or -eval, something like this:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    # cnvnator -root scaffold100.root -chrom scaffold100 -stat 1000
    Making statistics for scaffold100 ...
    Average RD per bin (1-22) is 46.1805 +- 11.2085 (before GC correction)
    Average RD per bin (X,Y) is 0 +- 0 (before GC correction)
    Correcting counts by GC-content for 'scaffold100' ...
    Making statistics for scaffold100 after GC correction ...
    Average RD per bin (1-22) is 45.3823 +- 10.8865 (after GC correction)
    Average RD per bin (X,Y) is 0 +- 0 (after GC correction)
    # OR
    # cnvnator -root scaffold100.root -chrom scaffold100 -eval 1000
    Average RD per bin (1-22) is 45.3823 +- 10.8865 4.16869
    Warning in <Fit>: Fit data is empty
    Warning in <Fit>: Fit data is empty
    Average RD per bin (X,Y) is 0 +- 0 nan
  • RAW cnv call的结果是normalized到1的深度, genotype的深度是normalized到2的深度。

The results of calling is RD signal normalized to 1, while genotyping returns estimated number of copies, which is 2 for all germline chromosomes except for X and Y in males.)

  • By default CNVnator estimates CN (-genotype, last two number in each line) using bin size that you provided and bins of 1000 bp. If histogram for 1k bp bins don’t exist then the software prints the warning and print -1 (the last field).
  • Larger events are more reliable. Also, regions with less repeats and segments duplication are more reliable (but this is taken care of by q0 filtering).
  • First e-values is more conservative. You can ignore e-val3 and e-val4. By default CNVantor reports calls that are with <0.05 either for e-val1 or e-val2.




Few things may help

  • CNVnator_v0.2.7 and root-5.34.36 is a safe chioce

  • the information about genome is taken from BAM header. Option -genome is only relevant when using with SAM files

  • looks like your file got corrupted. Not sure about the reason. If the issue persist, the walk around is to save histograms into a different .root file. e.g. cnvnator -root tree.root -outroot his.root -his 200 -d /ref/sep/

  • by default CNVantor reports calls that are with <0.05 either for e-val1 or e-val2.

  • dealing with scaffolds genome, you may want to remove few scaffold with abnormal RD and sd (and the abnormal ratio of average to its sd ).

    1
    2
    3
    4
    5
    6
    7
    8
    # for example, scaffold with too many repeat sequences
    Making statistics for scaffold1152 ...
    Average RD per bin (1-22) is 3.35436e-07 +- 96.0994 (before GC correction)
    Average RD per bin (X,Y) is 0 +- 0 (before GC correction)
    Correcting counts by GC-content for 'scaffold1152' ...
    Making statistics for scaffold1152 after GC correction ...
    Average RD per bin (1-22) is 3.29977e-07 +- 9.1335e-08 (after GC correction)
    Average RD per bin (X,Y) is 0 +- 0 (after GC correction)



过滤参考

参考CNV过滤标准:

  1. q0 < 0.5, ie. calls with >50% of q0 (zero mapping quality) reads within the CNV regions were removed (q0 filter)

  2. CNV Calls > 1kb (不必要)

  3. P-value <0.01 (e-val1 )

  4. 不在基因组gap区域: it does not overlap a gap in the reference genome for a deletion, and it is not within 0.5 Mb from a gap in the reference genome for a duplication; (nature11632-12-千人-s1)/OR/ All CNV calls overlapping with gaps in the reference genome were excluded from consideration. (这个对于scaffolds的基因组太严了,尝试去掉50%以上)

  5. 重复序列区域: CNV calls that overlapped (more than 50% of their length) with masked regions were excluded from CNV analyses.

  6. 不好处理:a deletion must have at least two paired-end reads supporting the predictions (overlapping by 50% reciprocally) or read depth in the middle part (1 kb away from breakpoints) of the called region should be statistically different from average read depth. Statistical testing was done the same way as when calling CNV regions. (nature11632-12-千人-s1)


千人SV过滤和merge方法(doi:10.1038/nature15394)

For each sample we subsequently selected confident CNVnator calls as follows:
1) calls having paired-end support; (PE reads only in bam)
2) calls with p-values less than 1e-5 (to accounts for multiple hypothesis testing, i.e., calling in ~2,500 samples), and with q0 < 0.5;
3) deletion calls with p-values less than 10^-5 and rd*(1 + q0) < 0.75 — whereby rd is the read-depth normalized to genome average, and q0 is fraction of reads mapped with 0 (zero) mapping quality.

We merged CNV calls for individuals within each population. For CNVnator site merging we initially clustered confident overlapping calls and averaged coordinates of each bound, pursuing the merging initially by population and then across the entire sample set.


CNV处理方式

  1. 不同个体间,大于50% reciprocal overlap的CNV为一个CNV (0-GR-2011-CNVnator, doi:10.1101/gr.114876.110: consider two calls concordant if they have >50% reciprocal overlap. )

  2. For each population, overlapping calls were merged by selecting the largest call of all overlapping ones.(to CNVR)
    For comparisons of CNVs among multiple individuals, we merged all overlapping calls across individuals into unique CNVRs.


实际过滤情况

实际使用过滤标准:

  1. filter q0 >= 0.5

  2. e-val1 >= 0.01

  3. deletion and duplication; dose not within a gap

实际使用个体间merge方法

  • 不同个体间,大于50% reciprocal overlap的CNV为一个CNV,平均两端的坐标(averaged coordinates of each bound)



描述

相关定义

  • Please make sure that for the bin size you are using the ratio of average to its sd is around 4-5. (第一步挑选一个合适的bin_size
  • most of false positives come from repetitive regions. So, the most confident calls are with q0 < 0.5 (see README on how to generate it) (去除可能的重复序列区域结果)

CNV、genetic CNV、CNV gene定义

(all from doi/10.1101/gr.187187.114.)

  • “CNVs” are all duplications or deletions ≥1 kb;
  • “genic CNVs” are calls that contain at least ++one whole transcription unit++, based on the RefSeq- Gene database;
  • and “CNV genes” are transcription units that are completely contained within genic CNVs.
  • In addition, we use the term “CNV regions” (CNVRs) for genomic regions that include all partially or fully overlapping CNVs in any one of the analyzed animals. The borders of a CNVR are defined by the coordinates of the merged CNV calls across all individuals

描述数量

Three to five thousand CNV calls were produced for each individual, ranging in size from 200 to 1,590,400 bp (see Figure S3).

Vst分析 (千人SV文章, doi:10.1038/nature15394)

计算方法

image

copy number based Vst

  • 用copy number(CN)方差来反应群体之间的分化
  • 取0.2作为阈值,鉴定高分化区域
  • 去掉X染色体的

image



个性化分析

基本信息统计

基于个体的

  • 平均乘数、比对reads数、比对率、 All CNVs数量、Average CNV length、Genetic CNVs数量、CNV基因数量 (table)

  • 统计presence and absence patterns of CNV alss between individuals(individuals within populations are more similar to each other than individuals between populations) (FigS5 家鼠 doi/10.1101/gr.187187.114.)

image

  • 不同群体之间,两两个体overlap数(1bp,50%长度分别展示)

image

基于群体的

  • CNV 长度分布,分群体展示(FigS3 家鼠 doi/10.1101/gr.187187.114.)

image

  • 不同群体CNV calls/genetic CNVs/CNV genes数目,barplot (Fig1 家鼠 doi/10.1101/gr.187187.114.)
1
2
boxplot(CNV_number~sample_id,data=a)
boxplot(a$V1,b$V1,outline = F,names=c("A","B"),main="title")

image

  • 不同群体share情况

image

share

image

private

image

  • 和基因组不同区域的overlap情况

image

Extended Data Figure 7 | Enrichment of functional elements intersecting SVs. a, Shadow figure of Fig. 2a. Overlap enrichment analysis of deletions (with resolved breakpoints) versus genomic elements, using partial overlap statistic, deletions categorized into VAF bins. b, Similar to a. The only difference is that engulf overlap statistic is used instead of partial overlap statistic. Engulf overlap statistic is the count of genomic elements (for example, CDS) that are fully imbedded in at least one SV interval (for example, deletions). *no element intersected observed within data set. c, Similar to a and b, with the enrichment/depletion analysis pursued for common SNPs as well as more rare single nucleotide polymorphisms/variants (SNVs). Common SNV alleles show the highest levels of depletion for investigated genomic elements. d, Overlap enrichment analysis of various SV types versus genomic elements, using partial overlap statistic.

  • cnv长度和拷贝数的关系?

  • CNV频率分布(CNV Allele Frequency Distributions)对选择的影响(doi:10.1371/journal.pgen.1004830)

To test for the influence of selection on the allele frequency spectrum of CNVs, we used a Poisson Random Field approach implemented in prfreq, which evaluates the expected allele frequency distributions given different demographic and selection models. We compared the estimates of the scaled selection coefficient c of CNVs (both deletions and duplications) with intergenic SNPs, by fitting the ‘‘single point mass’’ model over a range of c (between 220 and 10). Estimates were taken modeling a stationary population, under the assumption that demographics affect intergenic SNPs and CNVs in a similar fashion.

  • cnv 注释,不同区域的数量、比例,用比例可以不然不同群体间/不同类型(Deletion/Duplication)数量差异在图上太明显

image