lizhong's notes

Patterns that boost productivity



Shell Scripting Cheatsheets

Posted on 6666-06-06   |   Categories: Cheatsheets

A Bioinformatician’s UNIX Toolbox

awk, sed

# Common one-liners
# Count how many times each value appears in a column
awk '{l[$1]++}END{for (x in l) print x,l[x]}'
# Strip the trailing \n at the end of a file
awk 'BEGIN{ORS=""}{print $0}' test.txt > test2.txt
# RS is the record separator (end of record); the default is \n
# RS sets the input record separator, ORS the output record separator
# FS sets the input field separator, OFS the output field separator
awk 'BEGIN{OFS="\t"}{print $1,$2}' test1
# tar.gz
.tar.gz and .tgz
Extract:  tar zxf Something.tar.gz
Compress: tar zcf Something.tar.gz DirName
Split compression into tar.gz volumes: tar zcf - DirName/* | split -d -b 100m - Something.tar.gz.
# produces: Something.tar.gz.00 Something.tar.gz.01 Something.tar.gz.02 ...
Extract tar.gz volumes:
cat Something.tar.gz.* | tar zx
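The separator cheats above can be verified in two lines; the input strings here are throwaway examples, not files from the post:

```shell
# FS splits input fields on ":", OFS joins the printed fields with a tab.
printf 'a:b:c\n1:2:3\n' | awk 'BEGIN{FS=":"; OFS="\t"}{print $1,$3}'
# Count occurrences of each value in column 1, as in the one-liner above.
printf 'x\ny\nx\n' | awk '{l[$1]++}END{for (k in l) print k,l[k]}' | sort
```

`for (k in l)` visits keys in arbitrary order, hence the trailing `sort`.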
# Extract selected columns
awk '{print $2,$4,$5}' input.txt > output.txt
cat *log | grep "variants loaded" | awk '{print $1}' | less
# How many columns does the file have?
cat in.txt | awk '{print NF}'
# -F sets the field separator, here to a colon
awk -F: '{print $1}' in.txt
awk -F" " '{print $1}' in.txt
# Regex matching and substitution
awk '/^test[0-9]+/' input.txt
awk '$1 ~ /^test[0-9]+/' input.txt
awk '{if($5==66) print $0}' input.txt
sed 's/good/bad/' input.txt # by default only the first match on each line is replaced
sed 's/good/bad/2' input.txt # replace only the second match
sed 's/good/bad/g' input.txt # replace every match
# A three-statement program, statements separated by ";"
awk '{BASELINE=42; if($1>BASELINE) $5=$5+100; print $0}' data.f
# Arithmetic across columns: append the sum of columns 2 and 3 to each row, or overwrite column 1 with it:
awk '{print $0,$2+$3}' input.txt
awk '{$1=$2+$3;print}' input.txt
# Extract certain lines of a file, e.g. lines 5-10
sed -n '10p' data.txt # print line 10
sed -n '5,10p' data.txt # print lines 5-10
sed -n '1~5p' data.txt # GNU sed: print line 1, then every 5th line after it (1, 6, 11, ...)
awk 'NR>=5&&NR<=10' data.txt
# When empty fields in a text/excel export are hard to split, fill them with "-"
less input.txt | sed 's/\t\t/\t-\t/g;s/\t$/\t-/g' > output.txt
# Print a tab-separated file header
echo "chr pos A1 A2 OR p-value" | awk -v OFS="\t" '{print $1,$2,$3,$4,$5,$6}'
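The sed and awk row-extraction idioms above select the same lines; a quick check on generated input (throwaway filenames):

```shell
# Both commands should pick lines 5 through 10 of a 12-line file.
seq 12 > data.txt
sed -n '5,10p' data.txt > by_sed.txt
awk 'NR>=5&&NR<=10' data.txt > by_awk.txt
cmp -s by_sed.txt by_awk.txt && echo identical
```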


The ls argument-length limit: "/bin/ls: Argument list too long"

find /DirName/ -name "*.txt" | less -N
for i in DirName/*; do echo "ls $i" >> list.txt; done


rm files listed by ls/grep

ls * | while read dd; do rm -rf "$dd"; done


tr

# replace every newline with a space
awk '{print $1}' in.txt | tr "\n" " " | less


Merge files side by side (column-wise)

paste a.txt b.txt > ab.txt
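A tiny demo of what paste does (throwaway files): it joins the files line by line into tab-separated columns, whereas cat would stack them vertically.

```shell
printf 'a\nb\n' > a.txt
printf '1\n2\n' > b.txt
paste a.txt b.txt   # a<TAB>1 on line 1, b<TAB>2 on line 2
cat a.txt b.txt     # a, b, 1, 2 on four lines
```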


wc

# Show file statistics; the output is, in order: line count, word count, byte count, filename
wc filename
# Line count of a file
wc -l filename
# Byte count of a file
wc -c filename
# Character count of a file
wc -m filename
# Length of the longest line in a file
wc -L filename
# Note: the newline ending each line counts as a character, and so do spaces
# Under UTF-8 encoding a Chinese character takes 3 bytes
# With -m, a Chinese character counts as one character
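The byte-versus-line distinction is easy to check on a known string; reading from stdin keeps the filename out of wc's output:

```shell
# "hi\nthere\n" has 2 lines and 9 bytes (2+1 plus 5+1, counting each \n).
printf 'hi\nthere\n' > f.txt
wc -l < f.txt   # 2
wc -c < f.txt   # 9
```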


find

find . -name '*.pl'
find /path/dir -regextype "posix-egrep" -regex ".*\.(pl|sh|py|r|pdf)$"
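Quoting the pattern stops the shell from expanding it before find sees it, and -regex matches the whole path (GNU find; -regextype is a GNU extension). A throwaway demo directory:

```shell
mkdir -p finddemo && touch finddemo/a.pl finddemo/b.sh finddemo/c.txt
find finddemo -regextype posix-egrep -regex '.*\.(pl|sh)$' | sort
# finddemo/a.pl
# finddemo/b.sh
```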


fasta/fastq/vcf

  • Extract a subsequence from a fasta (replacing Perl's substr)

    samtools faidx hg19.fasta chr1:1-100
  • Index a vcf file (required by GATK)

    samtools/htslib/bgzip -c file.vcf > file.vcf.gz
    samtools/htslib/tabix -p vcf file.vcf.gz


Example header for qsub (SGE) command lines

#$ -S /bin/sh
#$ -e z.1.sh.SuperMem-1.pbs.$JOB_ID.e
#$ -o z.1.sh.SuperMem-1.pbs.$JOB_ID.o
#$ -l vf=1G
#$ -m n
#$ -cwd
#$ -P st_supermem
#$ -q supermem.q@supermem-0-2
# pin the job to the supermem-0-2 node
# start time
date1=`date "+%Y-%m-%d %H:%M:%S"`; date1_sys=`date -d "$date1" +%s`;echo "start running ========= at $date1"
perl test.pl
# end time and running time; output to z.1.sh.SuperMem-1.pbs.$JOB_ID.o
date2=`date "+%Y-%m-%d %H:%M:%S"`; date2_sys=`date -d "$date2" +%s`; interval=`expr $date2_sys - $date1_sys`; hour=`expr $interval / 3600`;left_second=`expr $interval % 3600`; min=`expr $left_second / 60`; second=`expr $interval % 60`; echo "done running ========= at $date2 in $hour hour $min min $second s"
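The h/m/s breakdown in the epilogue above, redone on a fixed duration (arithmetic expansion instead of expr) so the numbers are easy to check: 3725 s is 1 h, 2 min, 5 s.

```shell
interval=3725
hour=$((interval / 3600)); min=$((interval % 3600 / 60)); sec=$((interval % 60))
echo "$hour hour $min min $sec s"
# 1 hour 2 min 5 s
```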


kill

# kill all of a user's processes at once; use with care
pgrep -u wanglizhong |xargs kill
kill -9 $(ps -ef|grep wanglizhong|awk '{print $2}')
pkill -u wanglizhong
skill -KILL -u username


Compress / Decompress

.tar
Unpack: tar xvf Something.tar
Pack:   tar cvf Something.tar DirName
(Note: tar archives; it does not compress!)
———————————————
.gz
Extract (1): gunzip Something.gz
Extract (2): gzip -d Something.gz
Compress:    gzip Something
.tar.gz and .tgz
Extract:  tar zxf Something.tar.gz
Compress: tar zcf Something.tar.gz DirName
Split compression into tar.gz volumes: tar zcf - DirName/* | split -d -b 100m - Something.tar.gz.
# produces: Something.tar.gz.00 Something.tar.gz.01 Something.tar.gz.02 ...
Extract tar.gz volumes:
cat Something.tar.gz.* | tar zx
———————————————
.bz2
Extract (1): bzip2 -d Something.bz2
Extract (2): bunzip2 Something.bz2
Compress:    bzip2 -z Something
.tar.bz2
Extract:  tar jxvf Something.tar.bz2
Compress: tar jcvf Something.tar.bz2 DirName
Split compression into tar.bz2 volumes: tar jcf - DirName/* | split -d -b 100m - Something.tar.bz2.
# produces: Something.tar.bz2.00 Something.tar.bz2.01 Something.tar.bz2.02 ...
Extract tar.bz2 volumes: cat Something.tar.bz2.* | tar jx
———————————————
.bz
Extract (1): bzip2 -d Something.bz
Extract (2): bunzip2 Something.bz
Compress: unknown
.tar.bz
Extract:  tar jxvf Something.tar.bz
Compress: unknown
———————————————
.Z
Extract:  uncompress Something.Z
Compress: compress Something
.tar.Z
Extract:  tar Zxvf Something.tar.Z
Compress: tar Zcvf Something.tar.Z DirName
———————————————
.zip
Extract:  unzip Something.zip
Compress: zip Something.zip DirName/*
# unzip may have trouble with files larger than 2G
Split into volumes: split Something.zip -b 100m -d Something2.zip
Reassemble and extract: cat Something2.zip* > Something3.zip; unzip Something3.zip
———————————————
.rar
Extract:  rar x Something.rar
Compress: rar a Something.rar DirName
———————————————
.lha
Extract:  lha -e Something.lha
Compress: lha -a Something.lha Something
———————————————
.rpm
Unpack: rpm2cpio Something.rpm | cpio -div
———————————————
.deb
Unpack: ar p Something.deb data.tar.gz | tar zxf -
———————————————
.tar .tgz .tar.gz .tar.Z .tar.bz .tar.bz2 .zip .cpio .rpm .deb .slp .arj .rar .ace .lha .lzh .lzx .lzs .arc .sda .sfx .lnx .zoo .cab .kar .cpt .pit .sit .sea
Extract:  sEx x Something.*
Compress: sEx a Something.* Something
.tar.xz (xz http://tukaani.org/xz/)
xz -d Something.tar.xz
# gzip compression level
-num sets the compression speed: -1 or --fast is the fastest method (lowest compression ratio), -9 or --best the slowest (highest compression ratio). The default is 6.
.7z
# download and install: https://sourceforge.net/projects/p7zip/files/p7zip/
tar xjvf p7zip_9.20.1_x86_linux_bin.tar.bz2
cd p7zip_9.20.1
sh install.sh
.7z
Extract:
7za x file.7z -r -o./
(no space after -o; the directory follows it directly)
Compress:
7za a -t7z -r out.7z /dir/*
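A round trip of the split-volume tar.gz recipe above on a tiny throwaway directory; with data this small, split emits a single volume (arch.tar.gz.00), but the cat-and-extract step is identical for many volumes.

```shell
mkdir -p src && echo hello > src/f.txt
tar zcf - src | split -d -b 1m - arch.tar.gz.
mkdir -p out && cat arch.tar.gz.* | tar zx -C out
cat out/src/f.txt
# hello
```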

python install packages

# set PYTHONPATH
export PYTHONPATH=/data/users/wlz/lib/python2.7/site-packages:/data/users/wlz:$PYTHONPATH
# pip
pip install --user pysnptools
pip install --user matplotlib==1.4.3 # pin the version
# easy_install
easy_install -d /data/users/wlz/lib/python2.7/site-packages pysnptools
# download source code
python setup.py install --prefix=/data/users/wlz

Search Cheatsheets

Posted on 2020-05-15

Information retrieval

Law: 律商 (LexisNexis)
Finance: 慧博 (Hibor)
Company financial reports: 巨潮 (CNINFO)
Company information lookups: National Enterprise Credit Information Publicity System (国家企业信用信息公示系统)
National Bureau of Statistics (国家统计局)

Chi-square test of independence with chisq.test

Posted on 2018-07-31
# method 1
> x=c(454,1833,2120)
> y=c(52,216,240)
> df=as.table(rbind(x,y))
# df=as.table(rbind(c(454,1833,2120),c(52,216,240)))
> df
    A    B    C
A 454 1833 2120
B  52  216  240
> chisq.test(df)

	Pearson's Chi-squared test

data:  df
X-squared = 0.1661, df = 2, p-value = 0.9203

# method 2
> df2=matrix(c(454,1833,2120,52,216,240),nr=3)
> df2
     [,1] [,2]
[1,]  454   52
[2,] 1833  216
[3,] 2120  240
> chisq.test(df2)

	Pearson's Chi-squared test

data:  df2
X-squared = 0.1661, df = 2, p-value = 0.9203
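The same statistic can be recomputed outside R; here is a sketch in awk (staying in this blog's shell toolbox), feeding in the observed table and applying X² = Σ(o−e)²/e with expected counts e = row·col/total:

```shell
printf '454 1833 2120\n52 216 240\n' | awk '
  { for (j=1; j<=NF; j++) { o[NR,j]=$j; row[NR]+=$j; col[j]+=$j; total+=$j }
    nc=NF; nr=NR }
  END { for (i=1; i<=nr; i++) for (j=1; j<=nc; j++) {
          e = row[i]*col[j]/total; x2 += (o[i,j]-e)^2/e }
        printf "X-squared = %.4f\n", x2 }'
# X-squared = 0.1661
```

This matches the chisq.test output above (which uses no continuity correction for tables larger than 2x2).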

Biomedical information search

Posted on 2018-06-14

Information retrieval

CN:
丁香医生 (DXY)
MSD Manual (默沙东诊疗手册), consumer version
MSD Manual (默沙东诊疗手册), professional version
腾讯医典 (Tencent Medpedia)
WHO (World Health Organization)
Chinese Dietary Guidelines (中国居民膳食指南)
A+医学百科 (A+ Medical Encyclopedia)

EN:
Mayo Clinic
healthline
NIH dietary supplement fact sheets
WebMD
US CDC
US Dietary Guidelines 2015-2020
NIH Genetics Home Reference, for genetics lookups

Gene:
NCBI Mesh+PubMed
GeneCards
WikiGene
GeneReviews

install emacs

Posted on 2018-06-12
wget http://mirrors.ustc.edu.cn/gnu/emacs/emacs-26.1.tar.gz
tar zxvf emacs-26.1.tar.gz
cd emacs-26.1
./configure --prefix=/home/wlz/software/emacs/emacs-26.1 --with-x-toolkit=no --with-xpm=no --with-gif=no --with-tiff=no --with-gnutls=no
make
make install

各色DNA vs WEGENE: an unboxing comparison

Posted on 2018-03-05

image

Personal genomics keeps making headlines, from research heavyweight George Church's new company Nebula Genomics planning to buy users' genome data with cryptocurrency, to "genetic fortune-telling" landing on MIT Technology Review's list of ten breakthrough technologies.

I don't know whether you have tried a personal-genomics product. I tried one back in 2016: the personal genome service of WEGENE (微基因), the "Chinese 23andMe". WEGENE was among the earliest companies in China (2015) to build a personal-genomics business, and it is still alive and doing reasonably well today.

Before and after WEGENE, other companies offering personal-genomics products have come and gone; they have either fallen silent or felt like crude imitations, and none made me want to try them.

A while ago, scrolling through Weibo, I came across 各色DNA (各色科技) through the "how much weight do you gain over the New Year" study shared by @河森堡 and @女王C-cup: a personal-genomics company dedicated to interpreting the links between users' behavior and their genomic data. 各色DNA was incubated inside 大象公会 and is endorsed by @黄章晋ster; its founder Guo Tingting was a doctoral student in cognitive neuroscience at the Chinese Academy of Sciences. The name "各色" is a metaphor for DNA: DNA is a person's base color, and together with everything that happens in your life and the environment you grow up in, it makes you who you are now.

On top of that, I had recently seen several 23andMe genome-wide association studies (GWAS) of psychiatric and behavioral traits based on users' self-reported data. It felt like 各色 might be on the right "B2China" track, so I bought a kit and tried it.

Along the way, here is a simple unboxing comparison with the WEGENE kit I used earlier.


First, when each product was bought:

  • WEGENE: bought at the end of January 2016, so the packaging is fairly early.
  • 各色DNA: bought at the end of February 2018.




Appearance

image
(left: WEGENE, right: 各色DNA; same below)

Front-cover slogans:

  • WEGENE: decode the secrets of your genes, meet the future you.
  • 各色: a magical journey of cracking your DNA code is about to begin.




image

The core content is the saliva DNA collection tube, plus some printed instructions.

  • The WEGENE box is square and fairly small (the SF Express waybill gives a sense of scale). Small inner boxes fill it completely, which probably also cushions the contents, and the whole thing feels thoughtfully designed, perhaps owing to the industrial-design "genes" of its founders. From the bio of co-founder Zheng Qiang:

Zheng Qiang, founder of the WeCenter open-source Q&A platform and of the Billwang industrial-design community, is a serial entrepreneur with rich experience and resources in running online communities. He was CEO of Shenzhen Weike Interactive Co. (Weike Interactive is an IT supplier to BGI Tech, BGI Research, BGI Medicine and others, and its WeCenter software also serves several medical institutions and health-consulting platforms in China).

  • The 各色 box is large, roughly A4-sized and 6 cm thick, with no obvious internal structure; it is relatively plain and feels a bit empty.

Design-wise, WEGENE has a clearly recognizable style, consistent with its website and WeChat feed, and already had a fairly complete, polished package back in 2016. 各色 has had less time to polish its product; in internet-product fashion it ships a usable version first rather than agonizing, so it feels a bit rougher and more piecemeal, though functionally it is fine.

image
The printed material inside the 各色 box; I forgot to photograph WEGENE's at the time.




Shipping

If DNA sampling could one day break its physical constraints and skip the courier entirely, that would be a real experience.

Both WEGENE and 各色DNA include round-trip SF Express shipping, which should count as the baseline for personal-genomics products by now. For the return leg WEGENE adds a pre-filled SF Express waybill, so there is no address to write in, which is a thoughtful touch.

image




Core component: the saliva DNA collection tube

image

  • WEGENE separates the saliva collector and the DNA stabilization buffer into two tubes. The collector is fairly large; I remember spitting for quite a while that morning to fill it. WEGENE uses the saliva DNA collection kit from Canada's DNA Genotek, which reportedly costs more.

  • 各色 combines the saliva collector and the buffer tube into one. I got cocky, assumed I was an old hand, and twisted it open without reading the instructions; the buffer ran straight into the tube, which could then no longer be sealed properly, and my sampling plan was derailed. I had planned to sample the next morning, as with WEGENE, when a night's accumulation leaves more exfoliated cells in the mouth, hence more DNA and a higher success rate. Worried that the buffer would evaporate and degrade, I decided not to wait: I rinsed my mouth and sampled an hour later. The two-in-one design is clever and saves material, but it is not foolproof. The collector is also small, and with the sampler itself taking up space, one mouthful filled it; to collect more DNA I pulled the sampler tube out and spat in a bit more, but the cap protrudes inward, shrinking the usable volume, and a little splashed out when I closed it.




Overall impressions

WEGENE's packaging is more finished, easier to use, and more strongly designed; 各色's is rougher, partly because the product has had less time to mature, but neither gets in the way of actual use.

Both companies target individual consumers, but their positioning differs slightly:

  • WEGENE is "committed to letting Chinese people benefit from knowing and understanding their own genome data", inviting users to explore their own secrets and "meet the future you". It feels more "23andMe", more research-minded; judging from its website its style is also more measured, and it is rigorous about DNA sample quality, ancestry analysis, and human-genetics analyses.

  • 各色 defines itself as a consumer product focused on behavior-gene association analysis. Its 大象公会 background gives it a base of users to migrate and real skill at consumer communication. I am curious what this group standing at the crossroads of cognitive neuroscience and new media will come up with.

image




What they share is that each, in its own way, helps users understand themselves better:

  • Understand your genetic information: know where you stand and what may lie ahead.

  • Know where you come from, connect with people of other regions, eras, and temperaments, and look for that "something bigger than yourself".

  • Make research findings more accessible and easier to understand, while collecting large samples and big data that in turn push research forward, so everyone benefits sooner.

In user numbers, WEGENE has the first-mover advantage: as of the end of January 2018, "WEGENE has closed its series B round, has passed 100,000 users, and expects to exceed 400,000 by the end of 2018". 各色科技 has strong momentum too: in October 2017 the media reported that "having collected behavioral data from more than 100,000 people, 各色 has built the largest gene-behavior database in China, a number that keeps growing healthily every day". Both, in a neat coincidence, are tied to "100,000+", and both are growing fast.

The 各色 sample is still in my hands; I plan to mail it tomorrow. Once it comes back and the report is out, I can go on to compare the two in detail.

For more, follow along and leave a comment with your thoughts.


Also, WEGENE is currently running a refer-a-friend discount-and-rebate promotion.

The standard test gives a friend 50 yuan off and rebates me 39; the whole-genome product gives 1000 off with a 100 rebate.
If you happen to need one, scan the QR code or follow the link to have a look.
image

PS: this is not an ad, but I do hope to start getting ad offers.

I paid for both products myself. If this accidentally works as advertising, please tag @各色人类研究中心 and @微基因WeGene for me and ask them to shower me with tips hahahaha (ˊ● ω ●ˋ)

image
image
image

Calculate IBD using beagle and germline

Posted on 2018-02-26

beagle

export JAVAHOME=/ifshk4/BC_PUB/biosoft/PIPE_RD/Package/java/jre1.8.0_45;
/ifshk4/BC_PUB/biosoft/PIPE_RD/Package/java/jre1.8.0_45/bin/java -jar /ifshk5/PC_HUMAN_EU/USER/wanglizhong/software/beagle/4.1/beagle.27Jul16.86a.jar ibd=true impute=false window=50000 overlap=20000 ibdlod=3 ibdtrim=1500 gt=$vcf out=$out/$chr

germline

my $vcftools="/home/wanglizhong/software/vcftools/vcftools-build/bin/vcftools";
my $germline="/home/wanglizhong/software/germline/germline-1-5-1/germline";
my @vcf=<01.VCF/*vcf.gz>;
open(O,"> $0.sh");
foreach my $vcf(@vcf){
    print O "$vcftools --gzvcf $vcf --plink --out $vcf; ";
    print O "$germline -min_m 1 -bits 128 -err_hom 0 -err_het 4 < $vcf.run;\n";
    open(O2,"> $vcf.run");
    print O2 "1\n$vcf.map\n$vcf.ped\n$vcf.out\n";
    close O2;
}
close O;

PCAdmix (out of date; upgrade to RFmix)

Posted on 2018-02-11
# vcf2beagle
export JAVAHOME=/ifshk4/BC_PUB/biosoft/PIPE_RD/Package/java/jre1.8.0_45; /home/wanglizhong/software/vcftools/vcftools-build/bin/vcftools --gzvcf 00.vcf/1.vcf.gz --keep Ind.ChinaTau.list --recode -c | /home/wanglizhong/software/java/jre1.8.0_45/bin/java -jar /home/wanglizhong/software/beagle/Utilities/vcf2beagle.jar -9 01.vcf2beagle.pl.out/ChinaTau/1;gunzip 01.vcf2beagle.pl.out/ChinaTau/1.bgl.gz;
# PCAdmix
/home/wanglizhong/bin/PCAdmix3_linux -anc /ifshk4/BC_COM_P5/F13HTSNWKF0106/CATwiwR/project/11.PCAdmix/01.vcf2beagle.pl.out/LQC/1.bgl /ifshk4/BC_COM_P5/F13HTSNWKF0106/CATwiwR/project/11.PCAdmix/01.vcf2beagle.pl.out/indicus/1.bgl /ifshk4/BC_COM_P5/F13HTSNWKF0106/CATwiwR/project/11.PCAdmix/01.vcf2beagle.pl.out/taurus/1.bgl \
-adm /ifshk4/BC_COM_P5/F13HTSNWKF0106/CATwiwR/project/11.PCAdmix/01.vcf2beagle.pl.out/ChinaTau/1.bgl \
-map /ifshk4/BC_COM_P5/F13HTSNWKF0106/CATwiwR/project/11.PCAdmix/00.vcf/1.vcf.gz.map.map \
-rho /ifshk4/BC_COM_P5/F13HTSNWKF0106/CATwiwR/project/11.PCAdmix/00.vcf/1.vcf.gz.map.map.genetic_map.txt \
-lab LQC indicus taurus ChinaTau \
-w 100 -bed 1 -o chr1

Merging PDF files from the shell

Posted on 2017-12-12

1 cpdf

# Merge PDFs:
cpdf input1.pdf input2.pdf -o output.pdf
# Cut out pages 1-3 plus pages 12 through the end:
cpdf input.pdf 1-3,12-end -o output.pdf
# Split a PDF into single-page PDFs named page001.pdf, page002.pdf, ...:
cpdf -split in.pdf -o page%%%.pdf

2 pdfjam

# install
sudo apt-get update
sudo apt-get install pdfjam
# combine pdf files
pdfjoin a.pdf b.pdf

A pre-submission checklist for the References section

Posted on 2017-11-09

Because citation files are downloaded from different sources, using them directly after importing into EndNote can leave inconsistent formatting; it is worth checking item by item before submission.

These checks are tedious, but the details show editors and reviewers your professionalism; many small errors leave an impression of sloppiness.

Some cases to look for:

1. Extra spaces or punctuation. First enable "View: show non-printing characters" in Word and tick spaces, tabs, etc.
image

2. Journal/author names with inconsistent capitalization: initial caps vs all caps vs partial caps
image
image
image

3. Full journal names mixed with abbreviations
image

4. Author names: full names mixed with initials; inconsistent initials for multi-character given names, e.g. Lizhong abbreviated to one initial (L.) vs two (L. Z.)
image

5. Name order: family-name-first mixed with given-name-first
image
image

6. Italics for special terms, such as species names
image

7. Check the citation style required by the target journal, download the corresponding style file into the endnote/Styles folder, and it becomes selectable.

Plot Quantile-Quantile plot with qqline using ggplot2

Posted on 2017-09-21

image

myresiduals <- rnorm(100)
pdf("01.f4.pdf",width=3,height=3)
ggplot(data=as.data.frame(qqnorm( myresiduals , plot=F)), mapping=aes(x=x, y=y)) + geom_point() + geom_smooth(method="lm", se=FALSE)+labs(title="D(Agarli,Urial;Mouflon1,Mouflon2)",x="Expected Z-scores",y="Observed Z-scores")+ theme_classic()
dev.off()

A quantile-quantile plot can be used to examine D-statistics like this:
image
(DOI: 10.1126/science.aan3842 , Fig. 2)

VCF manipulation with GATK

Posted on 2017-09-07   |   Categories: Cheatsheets

HaplotypeCaller: generate a GVCF per sample, effectively an intermediate file for SNP calling

export JAVA_HOME=java/jre1.8.0_45;
jre1.8.0_45/bin/java -Djava.io.tmpdir=tmpdir \
-jar gatk3.6/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R ref.fa \
-I sampleA.bam \
-nct 15 \
-ERC GVCF \ # gvcf mode
-L interval.list \ # chr10:10000001-15000000
[--pcr_indel_model NONE ] \ # add this if PCR duplicates were removed from the bam
-o sampleA.chr10_10000001_15000000.g.vcf.gz \
-variant_index_type LINEAR \
-variant_index_parameter 128000 \
2>&1 | tee 01.vcfByWindow/chr10_10000001_15000000/14-5.gvcf.gz.log # log output
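The trailing `2>&1 | tee file.log` idiom above sends both stdout and stderr to the screen and to the log file; a toy stand-in for the java command:

```shell
# One message to stdout, one to stderr; both must end up in run.log.
{ echo to-stdout; echo to-stderr >&2; } 2>&1 | tee run.log >/dev/null
sort run.log
# to-stderr
# to-stdout
```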


GenotypeGVCFs: joint SNP calling across multiple samples

export JAVA_HOME=java/jre1.8.0_45;
jre1.8.0_45/bin/java -Djava.io.tmpdir=tmpdir \
-jar gatk3.6/GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R ref.fa \
-V gvcf.list \ # list of the per-sample gvcfs, one per line
-stand_call_conf 30.0 \
-stand_emit_conf 10.0 \
-o variants.vcf.gz;


CatVariants: concatenate variants called region-by-region in parallel; note that it requires java 1.7 with gatk 3.3/3.4

export JAVA_HOME=java/jre1.7.0_55;
jre1.7.0_55/bin/java -cp GATK-3.4/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants \
-R ref.fa \
-V 01.vcfByWindow/chr1_1_5000000/variants.vcf.gz \ # or -V vcf.list
-V 01.vcfByWindow/chr1_5000001_10000000/variants.vcf.gz \
-V 01.vcfByWindow/chr1_10000001_15000000/variants.vcf.gz \
-V 01.vcfByWindow/chr1_15000001_20000000/variants.vcf.gz \
-out chr1.01.vcf.gz \ # note: -out, not -o
[-assumeSorted] \ # add if the --variant inputs are in order; omit if they are not


SelectVariants: take overlapping sites, keeping the sites in set2 that overlap set1

export JAVA_HOME=java/jre1.8.0_45;
jre1.8.0_45/bin/java -Djava.io.tmpdir=tmpdir \
-jar gatk3.6/GenomeAnalysisTK.jar \
-T SelectVariants \
-R ref.fa \
-V set2.vcf.gz \
-o set2.overlap.vcf.gz \
--concordance set1.vcf.gz


SelectVariants: remove/keep individuals

export JAVA_HOME=java/jre1.8.0_45;
jre1.8.0_45/bin/java -Djava.io.tmpdir=tmpdir \
-jar gatk3.6/GenomeAnalysisTK.jar \
-T SelectVariants \
-R ref.fa \
-V set1.vcf.gz \
-o set1.remove.vcf.gz \
# use exactly one of the two options below
-xl_sf remove.list \ # remove these individuals
-sf keep.list \ # keep these individuals

CombineVariants -genotypeMergeOptions UNIQUIFY: merge several vcfs regardless of sample IDs; samples with identical IDs are treated as distinct and renamed. The two sets need not share exactly the same sites; missing genotypes are automatically marked ./.

export JAVA_HOME=java/jre1.8.0_45;
jre1.8.0_45/bin/java -Djava.io.tmpdir=tmpdir \
-jar gatk3.6/GenomeAnalysisTK.jar \
-T CombineVariants \
-R ref.fa \
-V:v1 a.vcf.gz \
-V:v2 b.vcf.gz \
-o ab.UNIQUIFY.vcf.gz \
-genotypeMergeOptions UNIQUIFY


CombineVariants -genotypeMergeOptions PRIORITIZE -priority v1,v2: when the sets may share individuals, pick each genotype according to the priority order

export JAVA_HOME=java/jre1.8.0_45;
jre1.8.0_45/bin/java -Djava.io.tmpdir=tmpdir \
-jar gatk3.6/GenomeAnalysisTK.jar \
-T CombineVariants \
-R ref.fa \
-V:v1 a.vcf.gz \
-V:v2 b.vcf.gz \
-o ab.PRIORITIZE.vcf.gz \
-genotypeMergeOptions PRIORITIZE -priority v1,v2


Similar tool: vcftools

Map with BWA and do Indel realignment with GATK

Posted on 2017-08-18   |   Categories: Cheatsheets, pipeline

01.Mapping_BWAMEM-GATKRealgn.pl

get script

git clone https://github.com/wlz0726/SequenceMapping.git

step1: Mapping with BWA MEM and sorting with samtools, connected by a pipe

NB: samtools should be v1.* or later

bwa mem -R '@RG\tID:SampleID\tSM:SampleID' ref.fa SampleID.1.fq.gz SampleID.2.fq.gz | samtools sort -O bam -o SampleID.sort.bam
samtools index SampleID.sort.bam

step2: remove PCR duplicates with picard

java -jar picard.jar MarkDuplicates \
INPUT=SampleID.sort.bam \
OUTPUT=SampleID.rmdup.bam \
METRICS_FILE=SampleID.dup.txt \
REMOVE_DUPLICATES=true \
VALIDATION_STRINGENCY=SILENT \
CREATE_INDEX=true

step3: do local realignment of reads to enhance the alignments in the vicinity of indel polymorphisms

NB: GATK v3.6
NB: Note that indel realignment is no longer necessary for variant discovery if you plan to use a variant caller that performs a haplotype assembly step, such as HaplotypeCaller or MuTect2. However it is still required when using legacy callers such as UnifiedGenotyper or the original MuTect.

# use the RealignerTargetCreator to identify regions where realignment was needed
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
-R ref.fa \
-o SampleID.intervals \
-I SampleID.rmdup.bam
# use IndelRealigner to realign the regions found in the RealignerTargetCreator step
java -jar GenomeAnalysisTK.jar -T IndelRealigner \
-R ref.fa \
-targetIntervals SampleID.intervals \
-I SampleID.rmdup.bam \
-o SampleID.realn.bam

Before mapping: Filter and Trim your Paired-end FastQ file

After mapping: do SNP calling with GATK

Filter and Trim your Paired-end FastQ file

Posted on 2017-08-18   |   Categories: Cheatsheets

FastQtrimmer.pl

Filter and Trim your Paired-end FastQ file

get script

git clone https://github.com/wlz0726/SequenceMapping.git

When you get Illumina sequencing data from a company like BGI/Novogene, the first thing you (probably) need to do is check sequence quality and filter out low-quality reads.

Here I assume you have no adaptor problem and that the sequencing company has already removed the adaptors before handing the data over to you.

FastQtrimmer.pl trims and filters paired-end reads in gz format; the following reads are removed:

1) >=10% unidentified nucleotides (N)
2) with a phred quality <=7 for >65% of the read length
3) reads are trimmed if they had three consecutive bp with a phred quality of <=13,
and discarded if they were shorter than 45 bp

Example:
You have a pair of FastQ files named SampleID.1.fq.gz and SampleID.2.fq.gz in the phred33 quality system, with a read length of 100 bp,

perl FastQtrimmer.pl SampleID.1.fq.gz SampleID.2.fq.gz SampleID.Filter 33 100

the results will be:
SampleID.Filter.1.TrimTwoSide.fq.gz
SampleID.Filter.2.TrimTwoSide.fq.gz


fastq_Phred64_to_33_gz.pl

If your reads use the Phred64 quality system and you need to convert them to Phred33, run:

perl fastq_Phred64_to_33_gz.pl input.1.fq.gz input.2.fq.gz OutputPrefix
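Phred64 and Phred33 differ only by a fixed ASCII offset of 31, so a quality string can be shifted with tr; this is a sketch valid for the Q0-40 range ('@' through 'h'), not a replacement for the script above:

```shell
# 'h' (ASCII 104, Q40 in Phred64) becomes 'I' (ASCII 73, Q40 in Phred33).
echo 'hhhceb' | tr '@-h' '!-I'
# IIIDFC
```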


After read Quality control: Map with BWA and do Indel realignment with GATK

Linux setup

Posted on 2017-07-26

1. Install tmux

2. Install Oh-my-zsh

3. Sync tool scripts

git clone https://github.com/wlz0726/tools.git

4. Install emacs

Data management of plink format

Posted on 2017-06-28   |   Categories: Cheatsheets

plink 1.9 manual

Extract a subset of sites

plink-1.9/plink --bfile input --extract postions.in --make-bed --out 03.array.v2.overlap;

Merge two files

# first convert to vcf, then bgzip and index (bcftools merge needs indexed vcf.gz)
plink-1.9/plink --bfile 04.array --recode vcf --out 05.array
plink-1.9/plink --bfile 04.seq --recode vcf --out 05.seq
bgzip 05.array.vcf; tabix -p vcf 05.array.vcf.gz
bgzip 05.seq.vcf; tabix -p vcf 05.seq.vcf.gz
# merge
samtools/bcftools/bcftools merge --merge all --no-version --threads 10 05.array.vcf.gz 05.seq.vcf.gz |bgzip -c > 06.merge.vcf.gz
tabix -p vcf 06.merge.vcf.gz

Convert 1234 allele coding to ACGT

# allele1234 to alleleACGT
plink-1.9/plink --file data --recode --alleleACGT multichar --out data.plink
# the reverse direction uses --allele1234

Set the chromosome count (species presets)

--chr-set 90
--cow
--dog
--horse
--mouse
--rice
--sheep

--cow = --chr-set 29 no-xy
--dog = --chr-set 38
--horse = --chr-set 31 no-xy no-mt
--mouse = --chr-set 19 no-xy no-mt
--rice = --chr-set -12
--sheep = --chr-set 26 no-xy no-mt
--autosome-num [value] = --chr-set [value] no-y no-xy no-mt

More about sequence alignments

Posted on 2017-06-23

Sequence-alignment slides (PPT)

Link: http://pan.baidu.com/s/1i5Ia9wD   Password: jsio


Part 1: Further reading

Using Blast
https://www.ncbi.nlm.nih.gov/books/NBK1734/
Needleman-Wunsch algorithm demo
http://experiments.mostafa.io/public/needleman-wunsch/
Dynamic Programming
http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
The FastQ format
https://en.wikipedia.org/wiki/FASTQ_format
The SAM/BAM format specification
http://samtools.github.io/hts-specs/SAMv1.pdf
SAM flags explained
https://broadinstitute.github.io/picard/explain-flags.html

Part 2: Common alignment software, briefly

  • Multiple sequence alignment
    http://www.ebi.ac.uk/Tools/msa/
    ClustalW, the most widely used classic
    MUSCLE, fast
    T-coffee, more accurate than ClustalW
    Prank, more accurate than ClustalW
    MAFFT, beats ClustalW on both speed and accuracy

Speed: MUSCLE > MAFFT ~ Prank > ClustalW > T-Coffee
Accuracy: MAFFT > Prank > MUSCLE > T-Coffee > ClustalW

  • Short-read alignment
    BWA
    http://bio-bwa.sourceforge.net/bwa.shtml
    Bowtie2
    http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
    SOAP2
    http://soap.genomics.org.cn/soapaligner.html
    Samtools
    http://samtools.sourceforge.net/

BWA is the most widely used read aligner and handles reads from a few dozen bp up to 1 Mb;
Bowtie2 is common for RNA-seq, ChIP-seq, etc., and is faster than BWA;
SOAP2's output format differs from BWA and Bowtie2 (it is not SAM/BAM), so take care when using it;
Samtools is the most common toolkit for working with alignment files (SAM/BAM).

Part 3: Alignment commands (BWA):

  1. Index the reference:

    $ bwa index -a bwtsw ref.fa
  2. Align with BWA-MEM, save as SAM

    $ bwa mem ref.fa read.1.fq.gz read.2.fq.gz > align.sam
  3. Align with BWA-MEM, save as BAM (compressed SAM)

    $ bwa mem ref.fa lane1.read.1.fq.gz lane1.read.2.fq.gz | samtools view -bS -o align1.bam -
    $ bwa mem ref.fa lane2.read.1.fq.gz lane2.read.2.fq.gz | samtools view -bS -o align2.bam -
  4. Merge several alignment files (e.g. one sample sequenced on two lanes, giving two read pairs)

    $ samtools merge merge.bam align1.bam align2.bam
  5. Sort the alignments

    $ samtools sort align.bam align.sort
  6. Remove PCR duplicates

    $ samtools rmdup output.sort.bam output.dedup.bam
  7. Realign around indels with the GATK toolkit (Realignment)

    $ java -jar GenomeAnalysisTK.jar -R ref.fa -T RealignerTargetCreator -o output.realn.intervals -I output.dedup.bam
    $ java -jar GenomeAnalysisTK.jar -R ref.fa -T IndelRealigner -targetIntervals output.realn.intervals -o output.realn.bam -I output.dedup.bam

Part 4: Working with alignment files (SAM/BAM)

  1. View a SAM file

    $ less align.sam
  2. View a BAM file

    $ samtools view align.bam |less # alignments only (no header)
    $ samtools view -h align.bam |less # header plus alignments
    $ samtools view -H align.bam |less # header only
  3. Check alignment depth

    $ samtools depth align.bam | less
    $ samtools depth -r chr1:1000-2000 align.bam | less # restrict to chr1:1000-2000
  4. Overview of the main samtools subcommands
    image


Try it:

  • Run a search with the following sequence
    MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDVNLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPWFPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADAYNYRHGQPIPRVEYMEEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGFRLRPVALLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFAQFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSEKPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQRIEVLDNTQQLKILADSINSEIGILCSALQKIK

From the results, find out:
the description of this gene in human;
the bit score, E-value, Identities, and number of Gaps of the first hit;
how the second hit does on the same metrics;
download the top two hits, pick any alignment tool, and run a multiple sequence alignment of them together with the query: where do they differ, and how does that compare with the blast alignment?

  • Use the NCBI databases to obtain BRCA2 amino-acid sequences from at least 10 species, run a multiple alignment with Clustal (http://www.ebi.ac.uk/Tools/msa/clustalo/) to build a phylogenetic tree, and analyze their evolutionary relationships

remove unmapped reads of a BAM file

Posted on 2017-06-23
samtools view -bF 4 input.bam -o output.filterUnmapped.bam

Random subsample from a BAM file

Posted on 2017-06-21   |   Categories: Cheatsheets

picard DownsampleSam

java -jar picard.jar DownsampleSam P=0.25 I=QX70A.sorted.bam O=QX70A.Per0.25.bam CREATE_INDEX=true

picard

samtools -s

samtools view -h -b -s 0.25 QX70A.sorted.bam -o QX70A.Per0.25.bam

GATK -dfrac

java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fa -I input.bam -o output.bam -dfrac 0.25
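All three tools above keep each read independently with the given probability; the same idea on plain lines, using awk's rand() with a fixed seed for repeatability (the exact count depends on the awk implementation's random generator):

```shell
seq 1000 | awk 'BEGIN{srand(42)} rand() < 0.25' | wc -l
# roughly 250 of the 1000 lines survive
```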

tmux

Posted on 2017-06-19

First install libevent 2.x

Errors while building tmux usually mean libevent is missing; install libevent first

wget https://github.com/libevent/libevent/releases/download/release-2.1.8-stable/libevent-2.1.8-stable.tar.gz
tar zxvf libevent-2.1.8-stable.tar.gz
cd libevent-2.1.8-stable
./configure --prefix=$HOME
make
make install

Install tmux

Build tmux; if tmux cannot find the library at run time, symlink libevent-* into your lib directory.

git clone https://github.com/tmux/tmux.git
cd tmux
sh autogen.sh
DIR="$HOME"
./configure CFLAGS="-I$DIR/include" LDFLAGS="-L$DIR/lib"
make

tmux configuration

cd ~/
em .tmux.conf # em is the author's alias for emacs
## contents of .tmux.conf:
set -g history-limit 10000
# Automatically set window title
#set-window-option -g automatic-renam
#set-window-option -g xterm-keys on
set-option -g set-titles on
# Shift arrow to switch windows S shift M alt C ctrl
unbind-key -n S-Left
unbind-key -n S-Right
#bind -n C-Left previous-window
#bind -n C-Right next-window
bind -n F2 new-window
bind -n F3 previous-window
bind -n F4 next-window
bind -n F7 copy-mode
# kill window (prefix Ctrl+q)
#bind ^q killw
# display
#set -g status-utf8 on
set -g status-keys vi
set -g status-interval 1
set -g status-attr bright
set -g status-fg white
set -g status-bg black
set -g status-left-length 20
set -g status-left '#[fg=green][#[fg=red]#S#[fg=green]]#[default]'
set -g status-justify centre
set -g status-right '#[fg=green][ %m/%d %H:%M:%S ]#[default]'
setw -g window-status-current-format '#[fg=yellow](#I.#P#F#W)#[default]'
setw -g window-status-format '#I#F#W'

Set up zsh and Oh-my-zsh without root

Posted on 2017-06-19

Without root, on a system that has no zsh, you cannot switch to zsh with chsh

zsh

Install zsh

git clone https://github.com/zsh-users/zsh.git
cd zsh
./Util/preconfig
./configure --prefix=/ifswh1/BC_COM_P2/F13FTSNCKF1344/SHEtbyR/software/zsh/zsh
make
make install

Add one line to .bash_profile

exec /ifswh1/BC_COM_P2/F13FTSNCKF1344/SHEtbyR/software/zsh/zsh/bin/zsh

oh-my-zsh

Install oh-my-zsh

git clone git://github.com/robbyrussell/oh-my-zsh.git ~/.oh-my-zsh
# back up the original .zshrc
cp ~/.zshrc ~/.zshrc.orig
# copy the .zshrc template
cp ~/.oh-my-zsh/templates/zshrc.zsh-template ~/.zshrc

Adjust the .zshrc configuration (optional)

# Path to your oh-my-zsh installation.
export ZSH=$HOME/.oh-my-zsh
# oh-my-zsh theme
ZSH_THEME="michelebologna" #
# ZSH_THEME="gentoo" # alternative
DISABLE_AUTO_UPDATE="true"
plugins=(git)
export LC_ALL=C
source $ZSH/oh-my-zsh.sh

More oh-my-zsh themes

Other .zshrc configuration rules are almost identical to .bashrc

#shortcut
alias em='/usr/bin/emacs'
alias ll='ls -l'
alias la='ls -a'
alias les='less -S -x4'
alias ks='ls'
alias ch='chmod +x'
alias ee='exit'
alias tt='top -u wanglizhong'
alias sourcerc='source ~/.zshrc'
alias head1='head -1'
alias githelp='cat /home/wanglizhong/helpfile/git'
# tmux
alias tls='tmux ls'
alias tnew='tmux new -s'
alias tatt='tmux att -t'
alias tkill='tmux kill-session -t'
# help
alias find-help='cat /home/wanglizhong/tools/find'
alias awk-help='cat /home/wanglizhong/tools/awk'
# qsub
alias qs='qstat|sort -t" " -k1 -n|les -N'
alias qd='qdel'
alias qj='qstat -j'
# PATH
export R_LIBS=/ifshk7/BC_PS/wanglizhong/R:$R_LIBS # !!!!!
export PERL5LIB=/ifshk7/BC_PS/wanglizhong/perl5:/home/wanglizhong/software/vcftools/vcftools-build/lib/perl5/site_perl/5.8.8:$PERL5LIB
export LD_LIBRARY_PATH=/ifshk4/BC_PUB/biosoft/pipe/bc_paap/01.Usr/lib:/ifshk7/BC_PS/wanglizhong/lib64:/ifshk7/BC_PS/wanglizhong/lib:$LD_LIBRARY_PATH
export PATH=/ifshk7/BC_PS/wanglizhong/bin:$PATH

Do Phasing with SHAPEIT

Posted on 2017-06-06   |   Categories: Cheatsheets
export JAVA_HOME=/home/wanglizhong/software/java/jre1.8.0_45;
# beagle was used to improve genotype calls using genotype likelihoods
java -jar beagle/4.1/beagle.27Jul16.86a.jar gl=1.vcf.gz out=1.genotype;
# phase with shapeit
shapeit --thread 4 --effective-size 14269 --input-vcf 1.genotype.vcf.gz -M genetic_map_b37/genetic_map_chr1_combined_b37.txt --output-max 1.shapeit.haps 1.shapeit.sample --output-log 1.shapeit.log
# Convert the SHAPEIT haplotypes to vcf formats
shapeit -convert --input-haps 1.shapeit --output-vcf 1.shapeit.phased.vcf --output-log 1.shapeit.convert.log


more about shapeit

LD-prune with plink

Posted on 2017-05-27
plink-1.07/plink --noweb --file input --indep-pairwise 50 5 0.5 --out tmp1
plink-1.07/plink --noweb --file input --extract tmp1.prune.in --recode --out output

Linkage disequilibrium based SNP pruning

--indep-pairwise 50 5 0.5
--indep-pairwise 50 5 0.2

a) consider a window of 50 SNPs
b) calculate LD between each pair of SNPs in the window
c) remove one SNP of a pair if their LD is greater than 0.5
d) shift the window 5 SNPs forward and repeat the procedure

Ts/Tv

Posted on 2017-05-27

The stricter the calls, the bigger ts/tv value one should get.

vcftools/vcftools-build/bin/vcftools --gzvcf input.vcf.gz --TsTv-summary --out input.gz.TsTv
zcat input.vcf.gz| vcftools/vcftools-build/bin/vcf-tstv > input.vcf.gz.tstv
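Transitions are A<->G and C<->T; everything else is a transversion. A toy count over ref/alt pairs shows the bookkeeping behind the vcftools summaries above (the input pairs are made up for the demo):

```shell
printf 'A G\nC T\nA C\nG T\nA G\n' | awk '
  ($1$2 ~ /^(AG|GA|CT|TC)$/){ts++; next} {tv++}
  END{printf "ts=%d tv=%d ts/tv=%.2f\n", ts, tv, ts/tv}'
# ts=3 tv=2 ts/tv=1.50
```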

Publications

Posted on 2017-05-02

Publications (*equal contribution in authorship)

  • Mei C*, Wang H*, Liao Q*, Wang L*, et al. Genetic architecture and selection of Chinese cattle revealed by whole genome resequencing. Molecular Biology and Evolution, 2018, 35(3), 688-699. (2018 IF 14.797)[link]
  • Wang K*, Wang L*, Lenstra JA*, Jian J*, et al. The genome sequence of the wisent (Bison bonasus). Gigascience, 2017. 6 (4), gix016. (2017 IF 7.267)[link]
  • Liang C*, Wang L*, Wu X, et al. Genome-wide association study identifies loci for the Polled phenotype in yak. PloS one, 2016, 11(7): e0158642. (2016 IF 2.806)[link]
  • Qiu Q*, Wang L*, Wang K*, Yang Y*, et al. Yak whole-genome resequencing reveals domestication signatures and prehistoric population expansions. Nature Communications, 2015, 6:10283. (2015 IF 11.329)[link]
  • Yang Y*, Wang L*, Han J, et al. Comparative transcriptomic analysis revealed adaptation mechanism of Phrynocephalus erythrurus, the highest altitude Lizard living in the Qinghai-Tibet Plateau. BMC Evol Biol, 2015, 15:101. (2015 IF 3.406)[link]
  • Wang L, Zhou H, Han J, et al. Genome-scale transcriptome analysis of the alpine “glasshouse” plant Rheum nobile (Polygonaceae) with special translucent bracts. PloS one, 2014, 9(10): e110712. (2014 IF 3.234)[link]


  • Zhang X, Wang K, Wang L, et al. Genome-wide patterns of copy number variation in the Chinese yak genome. BMC Genomics, 2016, 17(1):1-12. (2016 IF 3.729)[link]
  • Wang K, Yang Y, Wang L, et al. Different gene expressions between cattle and yak provide insights into high-altitude adaptation. Anim Genet, 2016, 47(1):28-35. (2016 IF 1.815)[link]
  • Wang K, Hu Q, Ma H, Wang L, et al. Genome-wide variation within and between wild and domestic yak. Mol Ecol Resour, 2014, 14(4):794-801. (2013 IF 3.712)[link]
  • Qiu Q, Zhang G, Ma T, Qian W, Wang J, Ye Z, …, Wang L, et al. The yak genome and adaptation to life at high altitude. Nature Genetics, 2012, 44(8):946-949. (2012 IF=35.209)[link]


up to 2018.12

bwa-aln

Posted on 2017-04-28
# aln pipeline
mkdir tmp
# step1
bwa aln -m 100000 -t 4 -i 15 -q 10 -f sampleA_1.sai reference.fa sampleA_1.fq.gz
bwa aln -m 100000 -t 4 -i 15 -q 10 -f sampleA_2.sai reference.fa sampleA_2.fq.gz
# step2
# map; mark unmapped reads with mapQ=0
bwa sampe -a 600 -r "@RG\tID:sampleA\tPL:illumina\tPU:sampleA\tLB:sampleA\tSM:sampleA" reference.fa sampleA_1.sai sampleA_2.sai sampleA_1.fq.gz sampleA_2.fq.gz|awk '{if (!/^@/ && and($2,4) == 4) {$5=0; $6="*";} gsub(/ /,"\t",$0); print}' | samtools view -bS -T reference.fa - > sampleA.bam
# FixMateInformation and sort
java -Xmx3g -Djava.io.tmpdir=tmp -XX:MaxPermSize=512m -XX:-UseGCOverheadLimit -jar picard.jar FixMateInformation I=sampleA.bam O=sampleA.sort.bam TMP_DIR=tmp SO=coordinate VALIDATION_STRINGENCY=SILENT
# index
samtools1.2 index sampleA.sort.bam
# calmd, recalculate MD/NM tags and '=' bases (input and output files must differ)
samtools calmd -b sampleA.sort.bam reference.fa > sampleA.calmd.bam
# remove PCR duplicates
java -Xmx2G -Djava.io.tmpdir=tmp -XX:MaxPermSize=512m -XX:-UseGCOverheadLimit -jar picard.jar MarkDuplicates MAX_FILE_HANDLES=1000 REMOVE_DUPLICATES=true I=sampleA.calmd.bam O=sampleA.rmdup.bam METRICS_FILE=sampleA.txt TMP_DIR=tmp VALIDATION_STRINGENCY=SILENT
# merge
java -Xmx2G -Djava.io.tmpdir=tmp -XX:MaxPermSize=512m -XX:-UseGCOverheadLimit -jar picard.jar MergeSamFiles MAX_FILE_HANDLES=1000 I=sampleA01.rmdup.bam I=sampleA02.rmdup.bam I=sampleA03.rmdup.bam O=sampleA.all.bam TMP_DIR=tmp SO=coordinate AS=true VALIDATION_STRINGENCY=SILENT
# index
samtools index sampleA.all.bam
# realign
java -Xmx4g -Djava.io.tmpdir=tmp -XX:MaxPermSize=512m -XX:-UseGCOverheadLimit -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I sampleA.all.bam -o sampleA.realn.intervals -B:snps,VCF dbsnp_132.hg19.vcf -B:indels,VCF 1000G_indels_for_realignment.hg19.vcf -l INFO
java -Xmx4g -Djava.io.tmpdir=tmp -XX:MaxPermSize=512m -XX:-UseGCOverheadLimit -jar GenomeAnalysisTK.jar -T IndelRealigner -R reference.fa -I sampleA.all.bam -o sampleA.realn.bam -targetIntervals sampleA.realn.intervals -maxInMemory 300000 -l INFO

CNVnator Based Analysis

Posted on 2017-04-27

CNVnator

(Other software: Genome STRiP)

Call CNV using CNVnator

Running example

export ROOTSYS=/path/to/software/root;
export LD_LIBRARY_PATH=/path/to/software/root/lib:$LD_LIBRARY_PATH;
cnvnator -root AGS10A.Chr1.50.root -unique -tree AGS10A.sorted.bam -chrom Chr1 >AGS10A.Chr1.50.root.l1 2>AGS10A.Chr1.50.root.e1;
cnvnator -root AGS10A.Chr1.50.root -d split_chr_dir -his 50 -chrom Chr1 >AGS10A.Chr1.50.root.l2 2>AGS10A.Chr1.50.root.e2
cnvnator -root AGS10A.Chr1.50.root -stat 50 -chrom Chr1 >AGS10A.Chr1.50.root.l3 2>AGS10A.Chr1.50.root.e3;
cnvnator -root AGS10A.Chr1.50.root -partition 50 -chrom Chr1 >AGS10A.Chr1.50.root.l4 2>AGS10A.Chr1.50.root.e4
cnvnator -root AGS10A.Chr1.50.root -call 50 -chrom Chr1 > AGS10A.Chr1.50.cnv 2>AGS10A.Chr1.50.root.e5

Result format:

CNV_type coordinates CNV_size normalized_RD e-val1 e-val2 e-val3 e-val4 q0
- normalized_RD -- normalized to 1.
- e-val1 -- is calculated using t-test statistics.
- e-val2 -- is from the probability of RD values within the region to be in
the tails of a gaussian distribution describing frequencies of RD values in bins.
- e-val3 -- same as e-val1 but for the middle of CNV
- e-val4 -- same as e-val2 but for the middle of CNV
- q0 -- fraction of reads mapped with q0 quality



Note

  • The first thing you should do is try different bin_size settings such as 100, 200, 500, and 1000, and make sure that for the bin size you use, the ratio of the average RD to its sd is around 4-5. In the running log of -stat or -eval it looks like this:

    # cnvnator -root scaffold100.root -chrom scaffold100 -stat 1000
    Making statistics for scaffold100 ...
    Average RD per bin (1-22) is 46.1805 +- 11.2085 (before GC correction)
    Average RD per bin (X,Y) is 0 +- 0 (before GC correction)
    Correcting counts by GC-content for 'scaffold100' ...
    Making statistics for scaffold100 after GC correction ...
    Average RD per bin (1-22) is 45.3823 +- 10.8865 (after GC correction)
    Average RD per bin (X,Y) is 0 +- 0 (after GC correction)
    # OR
    # cnvnator -root scaffold100.root -chrom scaffold100 -eval 1000
    Average RD per bin (1-22) is 45.3823 +- 10.8865 4.16869
    Warning in <Fit>: Fit data is empty
    Warning in <Fit>: Fit data is empty
    Average RD per bin (X,Y) is 0 +- 0 nan
  • Raw CNV calls report read depth normalized to 1; genotyping reports depth normalized to 2.

The result of calling is the RD signal normalized to 1, while genotyping returns the estimated number of copies, which is 2 for all germline chromosomes except X and Y in males.

  • By default CNVnator estimates CN (-genotype, the last two numbers in each line) using the bin size you provided and 1000-bp bins. If the histogram for 1-kb bins does not exist, the software prints a warning and reports -1 (the last field).
  • Larger events are more reliable. Also, regions with fewer repeats and segmental duplications are more reliable (but this is taken care of by q0 filtering).
  • The first e-value is more conservative. You can ignore e-val3 and e-val4. By default CNVnator reports calls with <0.05 for either e-val1 or e-val2.




Few things may help

  • CNVnator_v0.2.7 and root-5.34.36 are a safe choice

  • the information about the genome is taken from the BAM header; the -genome option is only relevant when using SAM files

  • if your .root file looks corrupted and the issue persists, a workaround is to save histograms into a different .root file, e.g. cnvnator -root tree.root -outroot his.root -his 200 -d /ref/sep/

  • by default CNVnator reports calls with <0.05 for either e-val1 or e-val2.

  • when dealing with a scaffold-level genome, you may want to remove the few scaffolds with abnormal RD and sd (and an abnormal ratio of average to sd).

    # for example, scaffold with too many repeat sequences
    Making statistics for scaffold1152 ...
    Average RD per bin (1-22) is 3.35436e-07 +- 96.0994 (before GC correction)
    Average RD per bin (X,Y) is 0 +- 0 (before GC correction)
    Correcting counts by GC-content for 'scaffold1152' ...
    Making statistics for scaffold1152 after GC correction ...
    Average RD per bin (1-22) is 3.29977e-07 +- 9.1335e-08 (after GC correction)
    Average RD per bin (X,Y) is 0 +- 0 (after GC correction)



Filtering references

Reference CNV filtering criteria:

  1. q0 < 0.5, i.e. calls with >50% of q0 (zero mapping quality) reads within the CNV region were removed (q0 filter)

  2. CNV calls > 1 kb (not strictly necessary)

  3. P-value <0.01 (e-val1 )

  4. Not in a genome gap region: it does not overlap a gap in the reference genome for a deletion, and it is not within 0.5 Mb of a gap in the reference genome for a duplication (nature11632, 1000 Genomes supplement); OR: all CNV calls overlapping gaps in the reference genome were excluded. (The latter is too strict for scaffold-level genomes; try excluding only calls with >50% gap overlap.)

  5. Repeat regions: CNV calls that overlapped (by more than 50% of their length) with masked regions were excluded from CNV analyses.

  6. Hard to apply: a deletion must have at least two paired-end reads supporting the prediction (overlapping by 50% reciprocally), or the read depth in the middle part (1 kb away from breakpoints) of the called region should differ statistically from the average read depth. Statistical testing was done the same way as when calling CNV regions. (nature11632, 1000 Genomes supplement)
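The q0 and e-val1 thresholds above map directly onto columns of the CNVnator -call output (e-val1 is column 5, q0 is column 9). A minimal awk sketch on made-up rows (the gap and repeat filters still need a separate BED intersection):

```shell
# keep calls with e-val1 < 0.01 and q0 < 0.5; demo rows in the
# CNV_type coordinates size normRD e1 e2 e3 e4 q0 format
printf '%s\n' \
  'deletion chr1:1000-6000 5000 0.4 1e-5 1e-4 1e-3 1e-3 0.1' \
  'duplication chr1:9000-12000 3000 1.8 0.2 0.3 0.5 0.5 0.1' \
  'deletion chr1:20000-30000 10000 0.3 1e-6 1e-5 1e-4 1e-4 0.9' |
awk '$5 < 0.01 && $9 < 0.5'
```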


1000 Genomes SV filtering and merging method (doi:10.1038/nature15394)

For each sample we subsequently selected confident CNVnator calls as follows:
1) calls having paired-end support; (PE reads only in bam)
2) calls with p-values less than 1e-5 (to account for multiple hypothesis testing, i.e., calling in ~2,500 samples), and with q0 < 0.5;
3) deletion calls with p-values less than 10^-5 and rd*(1 + q0) < 0.75 — whereby rd is the read-depth normalized to genome average, and q0 is fraction of reads mapped with 0 (zero) mapping quality.

We merged CNV calls for individuals within each population. For CNVnator site merging we initially clustered confident overlapping calls and averaged coordinates of each bound, pursuing the merging initially by population and then across the entire sample set.


CNV merging approaches

  1. Across individuals, CNVs with >50% reciprocal overlap are treated as one CNV (GR 2011 CNVnator paper, doi:10.1101/gr.114876.110: consider two calls concordant if they have >50% reciprocal overlap.)

  2. For each population, overlapping calls were merged by selecting the largest of all overlapping calls (into CNVRs).
    For comparisons of CNVs among multiple individuals, we merged all overlapping calls across individuals into unique CNVRs.


Filters actually used

Filtering criteria actually applied:

  1. filter out q0 >= 0.5

  2. filter out e-val1 >= 0.01

  3. deletions and duplications; not within a gap

Merging method actually applied across individuals

  • Across individuals, CNVs with >50% reciprocal overlap are treated as one CNV, with the coordinates of each bound averaged
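The merging rule can be sketched as a small awk program: two calls on the same chromosome overlap reciprocally at ≥50% if the shared span covers at least half of each call, and the merged call averages both bounds. Coordinates below are made-up demo values:

```shell
# 50% reciprocal overlap check plus bound averaging for two intervals
awk 'BEGIN{
  s1=100; e1=200; s2=150; e2=240              # demo coordinates
  ov = (e1<e2?e1:e2) - (s1>s2?s1:s2) + 1      # length of the shared span
  if (ov >= 0.5*(e1-s1+1) && ov >= 0.5*(e2-s2+1))
    printf "merge -> %d-%d\n", (s1+s2)/2, (e1+e2)/2
  else
    print "keep separate"
}'
```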



Description

Related definitions

  • Please make sure that for the bin size you are using, the ratio of average to its sd is around 4-5. (step one: pick a suitable bin_size)
  • most false positives come from repetitive regions, so the most confident calls are those with q0 < 0.5 (see the README on how to generate it) (removes calls likely driven by repetitive regions)

Definitions of CNV, genic CNV, and CNV gene

(all from doi/10.1101/gr.187187.114.)

  • “CNVs” are all duplications or deletions ≥1 kb;
  • “genic CNVs” are calls that contain at least one whole transcription unit, based on the RefSeqGene database;
  • and “CNV genes” are transcription units that are completely contained within genic CNVs.
  • In addition, we use the term “CNV regions” (CNVRs) for genomic regions that include all partially or fully overlapping CNVs in any one of the analyzed animals. The borders of a CNVR are defined by the coordinates of the merged CNV calls across all individuals

Describing the numbers

Three to five thousand CNV calls were produced for each individual, ranging in size from 200 to 1,590,400 bp (see Figure S3).

Vst analysis (1000 Genomes SV paper, doi:10.1038/nature15394)

Calculation method

image

copy number based Vst

  • Use the variance of copy number (CN) to reflect differentiation between populations
  • Use 0.2 as the threshold to identify highly differentiated regions
  • Exclude the X chromosome
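Since the formula image above may not render: Vst for one CNVR is usually written as (V_T − V_S)/V_T, where V_T is the copy-number variance across all individuals and V_S is the average within-population variance weighted by population size. A minimal sketch, assuming two populations with made-up copy numbers:

```shell
# population label + copy number per line; variances are population
# variances (divide by n), which is what the weighted form reduces to
printf '%s\n' 'A 2' 'A 2' 'A 2' 'A 4' 'B 4' 'B 4' 'B 4' 'B 2' |
awk '{n[$1]++; s[$1]+=$2; ss[$1]+=$2*$2; N++; S+=$2; SS+=$2*$2}
 END{
   VT = SS/N - (S/N)^2                              # total variance
   for (p in n) VS += n[p]*(ss[p]/n[p] - (s[p]/n[p])^2)
   VS /= N                                          # size-weighted average
   printf "Vst=%.3f\n", (VT-VS)/VT
 }'
```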

image



Custom analyses

Basic statistics

Per individual

  • Mean depth, number of mapped reads, mapping rate, number of all CNVs, average CNV length, number of genic CNVs, number of CNV genes (table)

  • Presence/absence patterns of CNV calls between individuals (individuals within populations are more similar to each other than individuals between populations) (Fig. S5, house mouse, doi/10.1101/gr.187187.114)

image

  • Between populations: pairwise per-individual overlap counts (shown separately for 1-bp and 50%-length overlap)

image

Per population

  • CNV length distribution, shown by population (Fig. S3, house mouse, doi/10.1101/gr.187187.114)

image

  • Numbers of CNV calls / genic CNVs / CNV genes per population, barplot (Fig. 1, house mouse, doi/10.1101/gr.187187.114)
boxplot(CNV_number~sample_id,data=a)
boxplot(a$V1,b$V1,outline = F,names=c("A","B"),main="title")

image

  • Sharing between populations

image

share

image

private

image

  • Overlap with different genomic regions

image

Extended Data Figure 7 | Enrichment of functional elements intersecting SVs. a, Shadow figure of Fig. 2a. Overlap enrichment analysis of deletions (with resolved breakpoints) versus genomic elements, using partial overlap statistic, deletions categorized into VAF bins. b, Similar to a. The only difference is that engulf overlap statistic is used instead of partial overlap statistic. Engulf overlap statistic is the count of genomic elements (for example, CDS) that are fully imbedded in at least one SV interval (for example, deletions). *no element intersected observed within data set. c, Similar to a and b, with the enrichment/depletion analysis pursued for common SNPs as well as more rare single nucleotide polymorphisms/variants (SNVs). Common SNV alleles show the highest levels of depletion for investigated genomic elements. d, Overlap enrichment analysis of various SV types versus genomic elements, using partial overlap statistic.

  • Relationship between CNV length and copy number?

  • Effect of selection on CNV allele frequency distributions (doi:10.1371/journal.pgen.1004830)

To test for the influence of selection on the allele frequency spectrum of CNVs, we used a Poisson Random Field approach implemented in prfreq, which evaluates the expected allele frequency distributions given different demographic and selection models. We compared the estimates of the scaled selection coefficient c of CNVs (both deletions and duplications) with intergenic SNPs, by fitting the ‘‘single point mass’’ model over a range of c (between 220 and 10). Estimates were taken modeling a stationary population, under the assumption that demographics affect intergenic SNPs and CNVs in a similar fashion.

  • CNV annotation: numbers and proportions in different genomic regions; plotting proportions keeps count differences between populations or between types (deletion/duplication) from dominating the figure

image



Insert blank line using markdown

Posted on 2017-04-18

Two working ways to insert a blank line in markdown:
<br />
<br></br>

admixtools FAQs

Posted on 2017-04-18   |   Categorized in Cheetsheets

Common format-related questions: more


Column 6 format issues in ped/fam files

Question: convertf decides to “ignore” all my samples. Why?

Answer: A likely reason is that you are using a “fam” or “ped” file with a funny value (0, 9 or -9) in column 6. Try setting column 6 to 1.
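The suggested fix as a one-liner, shown on made-up inline fam records (redirect to a new file in real use):

```shell
# fam columns: FID IID father mother sex phenotype;
# convertf ignores samples whose phenotype is 0, 9, or -9, so set it to 1
printf '%s\n' 'fam1 ind1 0 0 1 -9' 'fam1 ind2 0 0 2 0' |
awk '{$6=1; print}'
```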


Removing long-range LD first is recommended

Question: Should regions of long-range LD in the genome be removed prior to PCA?

Answer: Yes, to avoid principal components that are artifacts of long-range LD it is ideal to remove such regions. See Table 1 of Price et al. 2008 AJHG. However, EIGENSTRAT can subsequently be run to compute disease association statistics using the full set of SNPs.


Only a limited data set size is supported

Question: Can I run EIGENSOFT on very large data sets?

Answer: Yes. We currently support GWAS data sets up to 8 billion genotypes. For data sets between 2 billion and 8 billion genotypes, some care is required. See documentation for details.


ID names that are too long

Question: When I run I get an error message about “idnames too long”. What should I do?

Answer: The software supports sample ID names up to a max of 39 characters. Longer sample ID names must be shortened. In addition, if your data is in PED format, the default is to concatenate the family ID and sample ID names so that their total length must meet this limit; however, you can set “familynames: NO” so that only the sample ID name will be used and must meet the 39 character limit.


Using only a subset of populations

Question: How do I compute principal components using only a subset of populations and project other populations onto those principal components?

Answer: Use -w flag to smartpca.perl (see EIGENSTRAT/README), or use poplistname parameter to smartpca (see POPGEN/README).

Getting free images with copyright

Posted on 2017-04-17

PublicDomainPictures

Unsplash

Pixabay

LibreStock

pexels

VisualHunt — can filter by color

ssyer

gratisography

streetwill — niche, street, and landscape photos

fancycrave — the stories behind the photos

shutterstock

stokpic

finda

gettyimages

vcftools

Posted on 2017-04-12   |   Categorized in Cheetsheets

Habit: always use bgzip to create vcf.gz files

and build the index with tabix, so the vcf.gz file can be used directly in vcftools and GATK.

bgzip and tabix are included in samtools/htslib.

samtools/htslib/bgzip -c file.vcf > file.vcf.gz
samtools/htslib/tabix -p vcf file.vcf.gz


Small utilities
The small programs in vcftools/vcftools-build/bin/ implement simple, specific functions. A few of the most common are listed here; there are more.

Export the perl module lib before use:

export PERL5LIB=/home/wanglizhong/software/vcftools/vcftools-build/lib/perl5/site_perl/5.8.8:$PERL5LIB;

Single VCF

Recompute AN and AC in INFO

fill-an-ac
Recomputes AN and AC and writes them into INFO.

zcat in.vcf.gz|fill-an-ac > out.vcf
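What fill-an-ac computes can be sketched in awk for simple GT-only records: AN is the number of non-missing alleles, AC the count of ALT alleles. A bi-allelic, GT-only sketch on one made-up record (real VCFs need the INFO field rewritten too):

```shell
# count AN/AC from the GT fields (columns 10+); '.' alleles are missing
printf '%s\n' 'chr1 100 . A G . . . GT 0/1 1/1 ./.' |
awk '{an=0; ac=0
      for(i=10;i<=NF;i++){split($i,g,"/")
        for(j=1;j<=2;j++){if(g[j]!="."){an++; if(g[j]=="1") ac++}}}
      print $1, $2, "AN="an";AC="ac}'
```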


vcf2fa

Convert a VCF into a consensus sequence.

cat ref.fa|vcf-consensus file.vcf.gz > out.fa
# for one particular sample
cat ref.fa|vcf-consensus -s sampleA_ID file.vcf.gz > out.sampleA_ID.fa
# only one haplotype (1 or 2)
cat ref.fa|vcf-consensus -H 1 file.vcf.gz > out.hap1.fa


Extract a subset of samples

vcf-subset -c

vcf-subset -c NA0001,NA0002 file.vcf.gz | bgzip -c > out.vcf.gz


Custom output of VCF information

vcf-query file.vcf.gz 1:10327-10330
vcf-query file.vcf -f '%CHROM:%POS %REF %ALT [ %DP]\n'
Usage: vcf-query [OPTIONS] file.vcf.gz
Options:
-c, --columns <file|list> List of comma-separated column names or one column name per line in a file.
-f, --format <string> The default is '%CHROM:%POS\t%REF[\t%SAMPLE=%GT]\n'
-l, --list-columns List columns.
-r, --region chr:from-to Retrieve the region. (Runs tabix.)
--use-old-method Use old version of API, which is slow but more robust.
-h, -?, --help This help message.
Expressions:
%CHROM The CHROM column (similarly also other columns)
%GT Translated genotype (e.g. C/A)
%GTR Raw genotype (e.g. 0/1)
%INFO/TAG Any tag in the INFO column
%LINE Prints the whole line
%SAMPLE Sample name
[] The brackets loop over all samples
%*<A><B> All format fields printed as KEY<A>VALUE<B>
Examples:
vcf-query file.vcf.gz 1:1000-2000 -c NA001,NA002,NA003
vcf-query file.vcf.gz -r 1:1000-2000 -f '%CHROM:%POS\t%REF\t%ALT[\t%SAMPLE:%*=,]\n'
vcf-query file.vcf.gz -f '[%GT\t]%LINE\n'
vcf-query file.vcf.gz -f '[%GT\ ]%LINE\n'
vcf-query file.vcf.gz -f '%CHROM\_%POS\t%INFO/DP\t%FILTER\n'


Reorder samples

Reorder columns

Reorder the columns of file.vcf.gz to match template.vcf.gz

vcf-shuffle-cols -t template.vcf.gz file.vcf.gz > out.vcf

Reorder VCF positions

vcf-sort file.vcf.gz > out.vcf
zcat file.vcf.gz| vcf-sort > out.vcf


Convert to tab format

zcat file.vcf.gz | vcf-to-tab > out.tab

Output:

#CHROM POS REF 05537
chr2 131 T T/T
chr2 437 G G/G
chr2 453 G G/G
chr2 469 G G/G
chr2 526 G G/G
chr2 618 G G/G
chr2 745 A A/A
chr2 756 T T/T
chr2 760 T T/T


Compute ts/tv

cat file.vcf | vcf-tstv


Validate VCF format

vcf-validator file.vcf.gz


Multiple VCFs

Compare the overlap of SNP positions across VCFs

The results can be used to draw a Venn diagram.

# run
vcf-compare test.vcf.gz test2.vcf.gz test3.vcf.gz|grep ^VN | cut -f 2-
# results
VN 94 test.vcf.gz (9.5%) test2.vcf.gz (22.3%) test3.vcf.gz (48.5%)
VN 100 test.vcf.gz (10.1%) test3.vcf.gz (51.5%)
VN 327 test.vcf.gz (32.9%) test2.vcf.gz (77.7%)
VN 473 test.vcf.gz (47.6%)


Concatenate multiple chromosomes/scaffolds

1. The samples (headers) must be identical.

vcf-concat chr1.vcf.gz chr2.vcf.gz | bgzip -c > all.vcf.gz

2. Merge multiple VCFs containing different samples

  • Ideally with the same positions; missing sites are filled with missing (.) by default

    vcf-merge A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz
  • Fill missing sites with a custom genotype (0|0, 0/0, 1/1, etc.)

    vcf-merge -R '0|0' A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz
  • bcftools

    samtools/bcftools/bcftools merge --merge all --no-version --threads 10 file1.vcf.gz file2.vcf.gz |bgzip -c > merge.vcf.gz

3. Concatenate VCFs in one line, using a subshell

# Merge (that is, concatenate) two VCF files into one, keeping the header
# from first one only.
(zcat A.vcf.gz | grep ^#; zcat A.vcf.gz | grep -v ^#; zcat B.vcf.gz | grep -v ^#; )| bgzip -c > out.vcf.gz


Find overlapping/complementary positions between VCFs

# output positions shared by both VCFs
vcf-isec -n +2 A.vcf.gz B.vcf.gz | bgzip -c > out.vcf.gz
# output positions present in the first VCF but absent from the others (B, C, and any more)
vcf-isec -c A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz

bam2fq

Posted on 2017-04-06   |   Categorized in Cheetsheets

samtools bam2fq

samtools.1.3.1 bam2fq input.bam |gzip -c > output.fq.gz


picard SamToFastq

java -jar picard-tools-2.5.0/picard.jar SamToFastq I=input.bam F=out.1.fq.gz F2=out.2.fq.gz


Extract unmapped reads

# extract unmapped reads and map them to another reference (e.g. the mitochondrion)
samtools.1.3.1 bam2fq -f 4 input.bam | bwa mem -t 30 -M -R '@RG\tID:Sample01\tLB:Sample01\tSM:Sample01\tPL:Illumina\tPU:Illumina' mt.fa - | samtools.1.3.1 sort -O bam -T /dir/tmp -o mapMt.sort.bam


Extract unique reads

# unique reads (a single read mapping at one best position)
# -q INT Skip alignments with MAPQ smaller than INT [0]
samtools view -bq 1 file.bam > unique.bam

GATK SNP calling

Posted on 2017-03-23

1. gatk calling

1.1 HaplotypeCaller

java -Xmx3g -jar GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -R ref.fa -T HaplotypeCaller \
-nct 3 \
--emitRefConfidence GVCF \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
-I in.bam \
-L Chr1 \
-o out.g.vcf.gz

1.2 GenotypeGVCFs

NB. Chr1.gvcf.list: gvcf list, one individual per line

java -Xmx6g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T GenotypeGVCFs \
-R ref.fa \
-nt 4 \
--variant Chr1.gvcf.list \
--dbsnp dbsnp.vcf \
-stand_call_conf 30.0 \
-stand_emit_conf 10.0 \
-L Chr1 \
-o raw.Chr1.vcf

# snp SelectVariants
java -Xmx2g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T SelectVariants \
-R ref.fa \
--variant raw.Chr1.vcf \
-selectType SNP \
-o raw.Chr1.snp.vcf
# indel SelectVariants
java -Xmx2g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T SelectVariants \
-R ref.fa \
--variant raw.Chr1.vcf \
-selectType INDEL \
-o raw.Chr1.indel.vcf

2. samtools calling

# samtools
samtools mpileup -b bam.list -uf ref.fa -r Chr1 -DS -C50 -m 2 -F 0.002 -P ILLUMINA \
| bcftools view -p 0.99 -bcvg - | bcftools view \
| vcfutils.pl varFilter -1 1e-5 -4 1e-7 -d 2 -Q 10 -a 2 -w 3 | bgzip -c > Chr1.flt.vcf.gz
tabix -f Chr1.flt.vcf.gz

3. Confident SNP dataset from the overlap of the GATK and samtools calls; SNP QUAL > 30

Chr1.overlap.snp.vcf.gz
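One way to build this set is to intersect the two call sets with vcf-isec (see the vcftools post) and then filter on the QUAL column; the file names in the comment are hypothetical, and the QUAL filter is shown on inline demo rows:

```shell
# e.g. after: vcf-isec -n +2 gatk.snp.Chr1.vcf.gz samtools.Chr1.flt.vcf.gz
# keep header lines and sites with QUAL (column 6) > 30
printf '%s\n' \
  '##fileformat=VCFv4.1' \
  'chr1 100 . A G 55.2 PASS .' \
  'chr1 200 . C T 12.0 PASS .' |
awk '/^#/ || $6 > 30'
```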


4. final SNP

# VariantRecalibrator & ApplyRecalibration
java -Xmx9g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T VariantRecalibrator \
-l INFO \
-R ref.fa \
-input raw.Chr1.snp.vcf.gz \
-resource:concordantSet,VCF,known=true,training=true,truth=true,prior=10.0 Chr1.overlap.snp.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbSNP/dbsnp.vcf \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an DP \
-an InbreedingCoeff \
-recalFile gatk.SNP.recal \
-tranchesFile gatk.SNP.tranches \
-rscriptFile gatk.SNP.plot.R \
--TStranche 90.0 --TStranche 93.0 --TStranche 95.0 \
--TStranche 97.0 --TStranche 99.0 --TStranche 100.0 \
-mode SNP \
-L Chr1
java -Xmx6g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T ApplyRecalibration \
-l INFO \
-R ref.fa \
-input raw.Chr1.snp.vcf.gz \
--ts_filter_level 99.0 \
-recalFile gatk.SNP.recal \
-tranchesFile gatk.SNP.tranches \
-mode SNP \
-L Chr1 \
-o gatk.snp.Chr1.VQSR.vcf
grep -E 'PASS|#' gatk.snp.Chr1.VQSR.vcf >final.gatk.snp.Chr1.VQSR.vcf
perl tabix.pl gatk.snp.Chr1.VQSR.vcf
perl tabix.pl final.gatk.snp.Chr1.VQSR.vcf

5. use Beagle to improve genotype calls using genotype likelihoods from SAMtools/GATK

Use Beagle with the GL (genotype likelihood) information to call genotypes

java -Xmx3g -jar beagle/4.1/beagle.27Jul16.86a.jar \
gl=input.vcf.gz \
out=output.genotype

6. phase with beagle

java -Xmx3g -jar beagle/4.1/beagle.27Jul16.86a.jar \
gt=output.genotype.vcf.gz \
out=output.phased.vcf.gz

7. phase with shapeit (human)

shapeit --thread 4 --effective-size 14269 --input-vcf chr1.genotype.vcf.gz \
-M 1kg/phasing/genetic_map_b37/genetic_map_chr1_combined_b37.txt \
--output-max chr1.shapeit.haps.gz chr1.shapeit.haps.sample --output-log chr1.shapeit.log

prepare reference genome index

Posted on 2017-03-21
# gatk
export JAVA_HOME=/home/wanglizhong/software/java/jre1.8.0_45; /ifshk4/BC_PUB/biosoft/PIPE_RD/Package/java/jre1.8.0_45/bin/java -Xmx5g -jar /home/wanglizhong/software/gatk/picard/picard.1.129.jar CreateSequenceDictionary R=UMD3.1.fasta O=UMD3.1.v3.dict
# samtools
samtools faidx UMD3.1.fasta
# bwa
bwa index -a bwtsw UMD3.1.fasta

PPT

Posted on 2017-02-12   |   Categorized in Cheetsheets

Fonts

Chinese fonts:
微软雅黑, 方正兰亭黑简体, 华文黑体, 冬青黑体
雅痞-简

娃娃体, 方正呐喊, 小麦体, 翩翩体 (playful)

English fonts:
For English, generally use a sans-serif font, no smaller than 28 pt
Verdana (headings)
Franklin Gothic Medium (headings)
Comic Sans MS (headings)
Arial (body, headings)
Times New Roman (body)
Palatino (body)

Emphasis

Use bold and italic for emphasis; do not use underline or other fonts

Font size

For posters there is a simple font-size rule:

To be easily readable, the smallest font size, in points, should be at least 4 times the viewing distance in feet.

For example, for viewing at 6 feet (4 x 6 = 24), the smallest font size you should use is 24 point. (This formula assumes good eyesight, so use larger sizes if possible.)
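The rule of thumb as a quick calculation (viewing distance in feet → minimum point size):

```shell
# minimum point size = 4 x viewing distance in feet
for d in 3 6 12; do echo "${d} ft -> $((4*d)) pt"; done
```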

build blog with hexo on github

Posted on 2017-01-03

Reference

# install Homebrew
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
echo 'export PATH="/usr/local/bin:$PATH"' >> ~/.bash_profile
brew doctor
# install Xcode from the App Store
# install Node.js
brew install node
# install git
brew install git
# install hexo
npm install -g hexo-cli

Create a new repository on GitHub named ‘yourname.github.io’, then:

cd ~/Public
git clone https://github.com/yourname/yourname.github.io.git
hexo init yourname.github.io
cd yourname.github.io
npm install hexo-deployer-git --save

deployment

hexo generate && hexo server # hgs
hexo generate && hexo deploy # hgd

new article

hexo new "new article name"

edit your article in the source/_posts/ directory

Write markdown files with atom

Use ⇧⌃m to open the preview

markdown syntax

Using HTML character entities for special symbols in markdown

For example, the | symbol (useful inside markdown tables) can be written as:

&verbar;
&vert;
&VerticalLine;
&#x0007C;
&#124;

Character entity reference manual

ASCII

Symbols

Color codes

Color names

Google Material design

R Cheetsheets

Posted on 2017-01-01   |   Categorized in Cheetsheets

Basics

Basic elements and functions

Basic elements

Function Purpose
points(x,y) add points
lines(x,y) add lines
text(x,y,labels,…) add the text given by labels at (x,y)
mtext(text,side=3,line=0,…) add the text given by text in the margin
segments(x0,y0,x1,y1) draw segments from the points (x0,y0) to the points (x1,y1)
arrows(x0,y0,x1,y1,…) as above, with arrowheads
abline(a,b) draw a line with intercept a and slope b
abline(h=y) horizontal line(s) at y; y may be a vector, e.g. abline(h=c(1,2,3))
abline(v=x) vertical line(s) at x; x may be a vector
abline(lm.obj) draw the regression line given by lm.obj
rect(x1,y1,x2,y2) draw a rectangle with lower-left corner (x1,y1) and upper-right corner (x2,y2)
polygon(x,y) draw a polygon connecting the points given by x and y
legend(x,y,legend) add a legend at (x,y) with content given by legend
title() add a title, optionally with a subtitle
axis(side,vect) draw an axis
box() add a box around the current plot
rug(x) mark the positions of the x data with short lines on the x axis
locator(n,type="n",…) get the coordinates of mouse clicks in the plot

Common plotting functions

Function Purpose
plot(x) plot the values of x against their index
plot(x,y) bivariate plot of x and y
pie(x) pie chart
boxplot(x) box plot (box-and-whisker)
hist(x) histogram of the frequencies of x
barplot(x) bar plot of the values of x
pairs(x) if x is a matrix or data frame, bivariate plots of all pairs of columns
matplot(x,y) bivariate plot where the first column of x maps to the first column of y, and so on
qqnorm(x) normal quantile-quantile plot
image(x,y,z) plot of the three variables x, y, z
heatmap(x) heat map
coplot(x~y|z) bivariate plot of x and y for each value (or interval) of z

Common plots, examples

ggplot2 doc

Bar charts, pie charts, rose plots, legends, coordinate systems, themes, colors, etc.

R/ggplot2 quick reference

Bar charts, grouped bars, clustered bars, percent stacked bars, pie charts in polar coordinates, line charts

Maps

Understanding ggplot concepts

Drawing QQ plots with ggplot2

Overlaying multiple layers in ggplot2

Installing packages

source("http://www.bioconductor.org/biocLite.R")
# multiple packages can be installed at once
biocLite(c("gridExtra","ggplot2"))

Aesthetics

alpha, color, fill, linetype, shape, size

shape

image

linetype:

0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodas

Data manipulation

Matrix operations
Merging

# merge two matrices
# combine columns (side by side)
cbind(pop,admix)
# stack rows
rbind(a,b)
# merge by a key column
ID<-c(1,2,3,4)
name<-c("Jim","Tony","Lisa","Tom")
score<-c(89,22,78,78)
student1<-data.frame(ID,name)
student2<-data.frame(ID,score)
total_student<-merge(student1,student2,by="ID")
total_student
# stack rows
ID<-c(1,2,3)
name<-c("Jame","Kevin","Sunny")
student1<-data.frame(ID,name)
ID<-c(4,5,6)
name<-c("Sun","Frame","Eric")
student2<-data.frame(ID,name)
total<-rbind(student1,student2)
total

Extract some rows/columns

# take rows/columns 1, 2, 3
admixname=admixsort[c(1,2,3),] # rows
admixpatial=admixsort[,c(1,2,3)] # columns
# drop rows/columns 1, 2, 3
admixpatial=admixsort[-c(1,2,3),] # rows
admixpatial=admixsort[,-c(1,2,3)] # columns

Subsetting, sorting, and creating new columns

# keep all values greater than 0
a=rnorm(10)
a=a[a>0]
# sorting, for a matrix pop
# sort by column 3:
pop2=pop[order(pop[,3]),] # ascending
pop2=pop[order(-pop[,3]),] # descending
# sort by column 3, then column 4, then column 5:
pop2=pop[order(pop[,3],pop[,4],pop[,5]),]
# ascending by column 3, ties broken descending by column 4:
pop2=pop[order(pop[,3],-pop[,4]),]
# construct a new column
a$V3=paste(a$V1,a$V2,sep="-")

ggplot: skipping NA data

ggplot(data=subset(a, !is.na(alcohol_AUDIT)), aes(x=alcohol_AUDIT))+geom_histogram()
# na.rm=T
mean(a$pressure_height,na.rm=T)

stat, pos, binwidth, scale

stat_identity
stat_bin
stat='bin' # default
stat='identity'
pos='stack' # stack the values of different groups at each x (default)
pos='fill' # normalize each x to 1
pos='dodge' # place groups side by side
binwidth=30
# scale
scale_x_log10(...)
scale_y_log10(...)
scale_x_reverse(...)
scale_y_reverse(...)
scale_x_sqrt(...)
scale_y_sqrt(...)
scale_x_datetime()
scale_x_date()
scale_x_discrete()

Labels

labs, guides

+ labs(title="Title", x='Weight', y='Y:mpg')
+ guides(color=guide_legend(title="Yan Se", title.pos = 'bottom'),size=guide_legend(title='Size',reverse=TRUE))
# guides can also be set inside the corresponding scale/theme functions
+ scale_color_brewer(palette='Set1', guide=guide_legend(title="Yan Se", title.pos = 'bottom'))
# split the legend into several columns
+guides(color=guide_legend(ncol=2))
+guides(shape=guide_legend(ncol=2))
+guides(fill=guide_legend(ncol=2))

Drawing the axis and vertical lines separately (for admixture plots)

position=c(0,tapply(1:nrow(admixname),admixname[,3],max))
# axis
axis(1,at=position,labels=F)
# a set of vertical lines
abline(v=position,lwd=0.2)

Splitting the plotting area into m rows and n columns

Use facets to split a plot into small multiples

+ facet_wrap(~cut)
+ facet_wrap(~cut, ncol=1)
+ facet_wrap(~cut+clarity, ncol=4)
+ facet_wrap(~clarity+cut, ncol=4, scales="free")
# facet_grid and facet_wrap can achieve the same effect
+ facet_wrap(~gear, nrow=1)
+ facet_grid(.~gear)
+ facet_wrap(~gear, ncol=1)
+ facet_grid(gear~.)
+ facet_grid(clarity~cut)

facet_wrap() arguments

facet_wrap(facets, nrow = NULL, ncol = NULL, scales = "fixed",
shrink = TRUE, as.table = TRUE, drop = TRUE)
**Arguments**
facets
usually "~variable" or "~variable1+variable2"
nrow
total number of rows
ncol
total number of columns
scales
controls how axis scales relate across panels: "fixed" (the default) uses the same scale for all panels; "free" fits each panel's scale to its own data; "free_x" and "free_y" free only the x or y scale.
as.table
controls the position of the panel with the largest value: TRUE (the default) puts it at the bottom right; FALSE puts it at the top right.
drop
if TRUE (the default), panels without data are dropped.

The mfrow method

par(mfrow=c(m,n))

# split into 1 row, 2 columns
par(mfrow=c(1,2))
# overlay directly on the existing plot
par(new=TRUE)

The gridExtra method, which works with ggplot

library(grid)
library(gridExtra)
b=ggplot(a,aes(pos,value))+geom_point(col=1)
c=ggplot(a,aes(pos,value))+geom_point(col=2)
grid.arrange(b,c,ncol=2,nrow=2) # 4 panels in a 2x2 layout

Common plot examples

Scatter plot (point)

Application (PCA)

input file:“input.ggplot2”:

name PC1 PC2 species
sample1 1.10 2.039 D
sample2 0.83 1.928 D
sample3 0.678 0.837 W
sample4 1.828 1.282 W
...

geom_point()

a=read.table("input.ggplot2",header=T);
mycolor=c("#BF812D","#35978F")
ggplot(data,aes(V2,V3,color=factor(V6)))+geom_point(size=8,alpha=0.8)+xlab("PC1(25.0%)")+ylab("PC2(12.0%)")+theme_bw()+theme(legend.title=element_blank(),panel.grid =element_blank(), panel.grid.major=element_blank())+scale_color_manual(values=mycolor)
# OR
p=ggplot(a,aes(PC1,PC2,color=species))+geom_point(alpha=0.9,size=4)
p+theme_bw()+scale_color_manual(values=c(\"#2166ac\",\"#b2182b\"))
# label the points with region names/numbers
ggplot(a,aes(C1,C2,color=info))+geom_point(shape=20,size=0.000000001)+geom_text(aes(label=info), size=2)

A special case: set both color and shape while showing only one legend

Use the same name and labels in scale_colour_manual and scale_shape_manual
link

# data
state1 <- c(rep(c(rep("N", 7), rep("Y", 7)), 2))
year <- rep(c(2003:2009), 4)
group1 <- c(rep("C", 14), rep("E", 14))
group2 <- paste(state1, group1, sep = "")
beta <- c(0.16,0.15,0.08,0.08,0.18,0.48,0.14,0.19,0.00,0.00,0.04,0.08,0.27,0.03,0.11,0.12,0.09,0.09,0.10,0.19,0.16,0.00,0.11,0.07,0.08,0.09,0.19,0.10)
lcl <- c(0.13,0.12,0.05,0.05,0.12,0.35,0.06,0.13,0.00,0.00,0.01,0.04,0.20,0.00,0.09,0.09,0.06,0.06,0.07,0.15,0.11,0.00,0.07,0.03,0.05,0.06,0.15,0.06)
ucl <- c(0.20,0.20,0.13,0.14,0.27,0.61,0.28,0.27,0.00,1.00,0.16,0.16,0.36,0.82,0.14,0.15,0.13,0.13,0.15,0.23,0.21,0.00,0.15,0.14,0.12,0.12,0.23,0.16)
data <- data.frame(state1,year,group1,group2,beta,lcl,ucl)
# plot
library(ggplot2)
pd <- position_dodge(.65)
ggplot(data = data,aes(x= year, y = beta, colour = group2, shape = group2)) +
geom_point(position = pd, size = 4) +
geom_errorbar(aes(ymin = lcl, ymax = ucl), colour = "black", width = 0.5, position = pd) +
scale_colour_manual(name = "Treatment & State",
labels = c("Control, Non-F", "Control, Flwr", "Exclosure, Non-F", "Exclosure, Flwr"),
values = c("blue", "red", "blue", "red")) +
scale_shape_manual(name = "Treatment & State",
labels = c("Control, Non-F", "Control, Flwr", "Exclosure, Non-F", "Exclosure, Flwr"),
values = c(19, 19, 17, 17))

image

Manually specifying colors and shapes, an example:

ggplot(a,aes(PC1,PC2,shape=Regions,color=Regions))+geom_point(size=2)+scale_color_manual(name="test",labels=c("AWD","EMZ","NQA","RDA","MODA","BCS","GCN","BMN","BSI","AME","GME","DOP","TEX","BGE","GUR","GAR","SUM","AFS","AW","CC","IROA","KRS","KR","NDZ","SKZ","AWT","ALT","BAS","BAY","DUL","HET","KAZ","QIB","TUB","GLT","HUL","HUS","JZS","LLT","LOP","LTH","LZM","SNT","SSF","STH","TAN","THF","TON","UJI","WDS","YFT","CHA","DQS","GBF","HZS","MBF","OLA","TCS","TIB","TIBQ","TIBS","WNS","LPB","NLB","SPG","ZTS","IROO","IROV","MPS","OCAN","ODAL"),values=c("#E41A1C","#E41A1C","#E41A1C","#E41A1C","#E41A1C","#377EB8","#377EB8","#377EB8","#377EB8","#984EA3","#984EA3","#984EA3","#984EA3","#FF7F00","#FF7F00","#FF7F00","#FF7F00","#A65628","#A65628","#A65628","#A65628","#A65628","#A65628","#A65628","#A65628","#A65628","#b2df8a","#b2df8a","#b2df8a","#b2df8a","#b2df8a","#b2df8a","#b2df8a","#b2df8a","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#4DAF4A","#e7298a","#e7298a","#e7298a","#e7298a","#e7298a","#e7298a","#e7298a","#e7298a","#e7298a","#e7298a","#e7298a","#4eb3d3","#4eb3d3","#4eb3d3","#4eb3d3","#F781BF","#000000","#000000","#000000","#000000"))+scale_shape_manual(name="test",labels=c("AWD","EMZ","NQA","RDA","MODA","BCS","GCN","BMN","BSI","AME","GME","DOP","TEX","BGE","GUR","GAR","SUM","AFS","AW","CC","IROA","KRS","KR","NDZ","SKZ","AWT","ALT","BAS","BAY","DUL","HET","KAZ","QIB","TUB","GLT","HUL","HUS","JZS","LLT","LOP","LTH","LZM","SNT","SSF","STH","TAN","THF","TON","UJI","WDS","YFT","CHA","DQS","GBF","HZS","MBF","OLA","TCS","TIB","TIBQ","TIBS","WNS","LPB","NLB","SPG","ZTS","IROO","IROV","MPS","OCAN","ODAL"),values=c(1,2,3,4,5,1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,1,2,3,4,5,6,7,8,9,10,11,1,2,3,4,1,2,3,4,5))+theme_bw()+theme(plot.background=element_blank(),panel.background=element_blank(),legend.background=element_blank(),legend.position='right',panel.grid =element_blank(),legend.title = element_blank(),axis.title =element_text(size=9,color="black"),axis.text=element_text(size=8,color="black"))+xlab("Principal component 1")+ylab("Principal component 2")+guides(shape=guide_legend(ncol=5),color=guide_legend(ncol=5))


plot & points

# Use different colors to distinguish different populations
d1=read.table("D1.txt",header=T);
d2=read.table("D2.txt",header=T);
w=read.table("W.txt",header=T);
d1x=d1$PC1
d1y=d1$PC2
d2x=d2$PC1
d2y=d2$PC2
wx=w$PC1
wy=w$PC2
pdf(file="PC12.pdf",width=2,height=2)
plot(d1x,d1y,pch=4,col="#89c876",cex=2,cex.lab=1.5,lwd=2,xlab="PC1 (3.24%)",ylab="PC2 (1.36%)",xlim=c(-0.3,0.1),ylim=c(-0.6,0.2))
points(d2x,d2y,pch=4,col="#2166ac",cex=2,lwd=2)
points(wx,wy,pch=4,col="#b2182b",cex=2,lwd=2)
dev.off()

barplot (bar chart)

barplot

# admixture
# k2.txt
0.000000001 0.999999999
0.029249134 0.970750866
0.000000001 0.999999999
8.65071E-08 0.999999913
0.000000001 0.999999999
1.49306E-08 0.999999985
0.043220336 0.956779664
0.04214767 0.95785233
admix=t(as.matrix(read.table("k2.txt",header=F)))
barplot(admix,col=c("#2166ac","#b2182b"),space=0,border=NULL ,xlab="Individuals",ylab="K=2",cex.lab=2,yaxt='n')

Example: barplot for ADMIXTURE results:

# 02.poplist2
...
# small_pop_id sample_id small_pop_order big_pop_id big_pop_order
TIB SRR501845 60 Dom_EA_TIB 9
TIB SRR501863 60 Dom_EA_TIB 9
BGE SRR501854 27 Dom_SA 6
BGE SRR501873 27 Dom_SA 6
GAR SRR501902 26 Dom_SA 6
...
par(mfrow=c(10,1), mar=c(0,3,1.5,0.5), las = 0, mgp=c(0.5, 0.5, 0))
color=c("#E41A1C","#377EB8","#4DAF4A","#984EA3","#FF7F00","#A65628","#F781BF")
pop<-read.table("02.poplist2",as.is=T)
admix<-read.table("input.2.Q",as.is=T)
admixmerge=t(as.matrix(cbind(pop,admix)))
admixsort=admixmerge[,order(admixmerge[3,],admixmerge[6,],admixmerge[7,])]
admixpatial=admixsort[-c(1,2,3,4,5),]
admixname=t(admixsort[c(1,2,3,4,5),])
barplot(admixpatial,col=color[1:2],space=0,border=NA,xlab="",ylab="K=2",yaxt='n')
# small population axis
#position=c(0,tapply(1:nrow(admixname),admixname[,3],max))
#axis(1,at=position,labels=F)
# small population line
position=c(0,tapply(1:nrow(admixname),admixname[,3],max))
abline(v=position,lwd=0.2)
# big population line
position2=c(0,tapply(1:nrow(admixname),admixname[,5],max))
abline(v=position2,lwd=1)
# name
barplot(admixpatial,col=c("white"),space=0,border=NA,xlab="",ylab=" ",xaxt='n', yaxt='n')
text(tapply(1:nrow(admixname),admixname[,3],mean),0.8,unique(admixname[,1]),xpd=T,srt=90,cex=0.45)
barplot(admixpatial,col=c("white"),space=0,border=NA,xlab="",ylab=" ",xaxt='n', yaxt='n')
text(tapply(1:nrow(admixname),admixname[,5],mean),0.8,unique(admixname[,4]),xpd=T,srt=0,cex=0.45)


geom_bar

Bar chart from given data
http://blog.csdn.net/bone_ace/article/details/47267981
http://shujuren.org/article/44.html

# When the x axis uses text labels, add stat='identity'
ggplot(a,mapping=aes(x=V1,y=V2,fill=V3))+geom_bar(stat='identity')+scale_x_discrete(limits=c("1/302","2/302","3/302","4/302","5/302","6/302","7/302","8/302","9/302","10/302","11/302","12/302","13/302","14/302","15/302","16/302","17/302","18/302","19/302","20/302","21/302","22/302","23/302","24/302","25/302","26/302","27/302","28/302","29/302","30/302","0.1-0.2","0.2-1"))+theme_bw()+theme(panel.border=element_blank(),panel.grid =element_blank(),legend.title=element_blank(),axis.text.x=element_text(angle =90))+scale_fill_manual(values=c("#1F78B4","#FF7F00"))+labs(x="Allele frequency",y="SNP number (M)")

geom_histogram()

pdf(file="B1.final.cnv.V3.pdf")
library("ggplot2")
a <- read.table("B1.final.cnv",head=F)
# logx
ggplot(a,aes(x=log(a$V3)/log(10)))+geom_histogram()+xlab("CNV length (log10)")+ylab("Count")+xlim(3,7)
# ggplot(a,aes(x=a$V3))+geom_histogram()+xlab("CNV length (log10)")+ylab("Count")
dev.off()
# Make the histogram y axis density instead of the default count
ggplot(qcctau,aes(x=V1,y=..density..,fill=factor(V2)))+ geom_histogram(position='dodge')
# position options (e.g. for histograms)
dodge: side by side
fill: normalized stacking
identity: overlapping
jitter: jittered to avoid overlap
stack: stacked

Violin plot: geom_violin()

geom_violin()
# Arguments: adjust controls smoothing; scale="count" makes each violin's area proportional to the group's number of observations

boxplot (box plot)

boxplot

par(mfrow=c(1,2))
boxplot(bb1$V2,bc1$V2,cc1$V2,outline = F,names=c("B-B","B-C","C-C"),main="1 bp overlap")
boxplot(bb5$V2,bc5$V2,cc5$V2,outline = F,names=c("B-B","B-C","C-C"),main="50% reciprocal overlap")

geom_boxplot

# Order boxes by V3 (ascending)
ggplot(a,aes(x=reorder(V1,V3),y=V3))+geom_boxplot()
# Manually set the order
ggplot(b,aes(V1,V4/V5))+geom_boxplot()+xlab("Individuals")+ylab("HET/HOM")+theme( legend.position = "none")+scale_x_discrete(limits=c("BRM","GIR","LQC","NEL","LXC","NYC","QCC","YBC","YNC","FLV","HOL","JBC","JER","LIM","RAN"))
# Swap the x and y axes for a horizontal boxplot
+coord_flip()
# Other arguments: width, outlier.size, outlier.shape

Points + error bars

# e.g. ABBA-BABA test results
# range: Mean-3*SE ~ Mean+3*SE
+geom_pointrange(aes(ymin=V4-3*V5,ymax=V4+3*V5))

pheatmap (heatmap)

Input format: input.txt

B1 B2 B3 B4 B5 B6 C1 C2 C3 C4 C5
B1 1 0.624833110814419 0.616958399636281 0.648058868307536 0.63029315960912 0.628120893561104 0.615183929037308 0.632938076416337 0.637414484548242 0.6375 0.616299232103047
B2 0.624833110814419 1 0.593837535014006 0.65446941975954 0.676940319417204 0.659620596205962 0.653577192038731 0.665760869565217 0.612027158098933 0.626946847960445 0.615502294747578
B3 0.616958399636281 0.593837535014006 1 0.574553571428571 0.538953350698556 0.558471454880295 0.563129002744739 0.557452699584679 0.606948514022604 0.575867205788466 0.572989510489511
B4 0.648058868307536 0.65446941975954 0.574553571428571 1 0.645949535192563 0.629439011837365 0.625958099131323 0.636738906088751 0.62962962962963 0.649846734260788 0.627489072365226
B5 0.63029315960912 0.676940319417204 0.538953350698556 0.645949535192563 1 0.675117112152108 0.667213563029806 0.676982591876209 0.611863155303963 0.635542168674699 0.632996632996633
B6 0.628120893561104 0.659620596205962 0.558471454880295 0.629439011837365 0.675117112152108 1 0.68078348332451 0.652941176470588 0.60755258126195 0.627040194884287 0.650426921145153
C1 0.615183929037308 0.653577192038731 0.563129002744739 0.625958099131323 0.667213563029806 0.68078348332451 1 0.64968152866242 0.611585944919278 0.631986450520203 0.650872817955112
C2 0.632938076416337 0.665760869565217 0.557452699584679 0.636738906088751 0.676982591876209 0.652941176470588 0.64968152866242 1 0.625778629611883 0.652991452991453 0.64551863041289
C3 0.637414484548242 0.612027158098933 0.606948514022604 0.62962962962963 0.611863155303963 0.60755258126195 0.611585944919278 0.625778629611883 1 0.612910332672395 0.581521739130435
C4 0.6375 0.626946847960445 0.575867205788466 0.649846734260788 0.635542168674699 0.627040194884287 0.631986450520203 0.652991452991453 0.612910332672395 1 0.682038275305511
C5 0.616299232103047 0.615502294747578 0.572989510489511 0.627489072365226 0.632996632996633 0.650426921145153 0.650872817955112 0.64551863041289 0.581521739130435 0.682038275305511 1

Example

library(pheatmap)
a=read.table("input.txt")
a=as.matrix(a)
# Default clustering
pdf("out.cluster.pdf",width=10,height=10,onefile=FALSE)
pheatmap(a,cellwidth = 18, cellheight = 18, fontsize=12, fontsize_row=9,main="title")
dev.off()
# Cluster, but hide the dendrograms
treeheight_row = 0, treeheight_col = 0
# Cluster, and split the heatmap into a chosen number of blocks
cutree_rows=3, cutree_cols=3
# No clustering
cluster_rows=0, cluster_cols=0
# Show the numbers in each cell
display_numbers=T
# Other style settings
fontsize_number=14 # font size of the numbers in cells
fontsize=14 # label font size
border_color="White" # border color
# Group annotations
annotation_col = data.frame(Type = factor(c(rep(c("item"), 10),"sum")))
rownames(annotation_col) = c(paste("item", 1:10, sep = ""),"PSS")
annotation_row = data.frame(Type = factor(c(rep(c("item"), 10),"sum")))
rownames(annotation_row) = c(paste("item", 1:10, sep = ""),"PSS")
pheatmap(a, display_numbers=T, fontsize_number=14,border_color="White", cutree_rows=3, cutree_cols=3,annotation_col = annotation_col, annotation_row = annotation_row)

More examples

# Simplest call, straight to a plot
pheatmap(data)
# Change the ordering method
pheatmap(data, scale = "row", clustering_distance_row = "correlation")
# Custom colors
pheatmap(data, color = colorRampPalette(c("navy", "white", "firebrick3"))(50))
# Turn off the legend
pheatmap(data, legend = FALSE)
# Set cell size
pheatmap(data, cellwidth = 6, cellheight = 5)

ggplot: adjusting plot details

Saving to a file

pdf(file="plot.allpop.chr.pl.pdf",width=10,height=10)
ggplot(...) # the actual plot
dev.off()

Adding labels automatically

library(ggrepel)
ggplot(mtcars,aes(wt, mpg,color='red'))+geom_point() +geom_text_repel(aes(wt,mpg, label=rownames(mtcars)))


Adding annotations

Text annotations

# Multiple annotations can be added
annotate("text", x=3, y=48, label="Group 1") +annotate("text", x=4.5, y=66, label="Group 2")

Math expression annotations

+ annotate("text", x=2, y=0.3, parse=TRUE,label="frac(1, sqrt(2 * pi)) * e ^ {-x^2 / 2}")
# See ?plotmath for more examples of math expressions

Line annotations

# linetype="dashed", size=1
+ geom_hline(yintercept=60) # horizontal line
+ geom_vline(xintercept=14) # vertical line
+ geom_abline(intercept=37.4, slope=1.75) # sloped line, with intercept and slope
+geom_vline() +geom_hline()
# linetype: line type (2 = dashed)
# vertical line at a given x coordinate
+geom_vline(xintercept=1000,color="black", linetype=2)
# horizontal line at a given y coordinate
+ geom_hline(yintercept=10,color="black", linetype=2)
library(plyr)
hw_means <- ddply(heightweight, "sex", summarise, heightIn=mean(heightIn))
p + geom_hline(aes(yintercept=heightIn, colour=sex), data=hw_means,linetype="dashed", size=1)


Adding segment marks

p <- ggplot(subset(climate, Source=="Berkeley"), aes(x=Year, y=Anomaly10y)) +geom_line()
p+ annotate("segment", x=1950, xend=1980, y=-.25, yend=-.25)


Shaded rectangles

p <- ggplot(subset(climate, Source=="Berkeley"), aes(x=Year, y=Anomaly10y)) +geom_line()
p + annotate("rect", xmin=1950, xmax=1980, ymin=-1, ymax=1, alpha=.1,fill="blue")


Adding error bars

# Use geom_line and geom_point together to draw both lines and points
# geom_errorbar: set ymin and ymax
ce <- subset(cabbage_exp, Cultivar == "c39")
ggplot(ce, aes(x=Date, y=Weight)) +
geom_line(aes(group=1)) +
geom_point(size=4) +
geom_errorbar(aes(ymin=Weight-se, ymax=Weight+se), width=.2)


Annotating each facet (multi-panel plots)

p <- ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() + facet_grid(. ~ drv)
# Build the annotation data frame
f_labels <- data.frame(drv = c("4", "f", "r"), label = c("4wd", "Front", "Rear"))
p + geom_text(x=6, y=40, aes(label=label), data=f_labels)


Axis labels and title

p+xlab("x label")+ylab("y label")+ggtitle("title")
p+labs(x="Xname",y="Yname",title="title")
# Axis label formats: percent, dollar, comma
p <- ggplot(df, aes(x, y)) + geom_point()
p + scale_y_continuous(labels = scales::percent)
p + scale_y_continuous(labels = scales::dollar)
p + scale_x_continuous(labels = scales::comma)

Flipping the x and y axes

p+coord_flip()

Custom ordering

p+scale_x_discrete(limits=c("BRM","GIR","LQC","NEL","LXC","NYC","QCC","YBC","YNC","FLV","HOL","JBC","JER","LIM","RAN"))

Custom colors

# set 26 colors
mycolor=c("#A6CEE3", "#1F78B4", "#B2DF8A", "#33A02C", "#FB9A99", "#E31A1C", "#FDBF6F", "#FF7F00", "#CAB2D6", "#6A3D9A", "#FFFF99", "#B15928", "#A6CEE3", "#1F78B4", "#B2DF8A", "#33A02C", "#FB9A99", "#E31A1C", "#FDBF6F", "#FF7F00", "#CAB2D6", "#6A3D9A", "#FFFF99", "#B15928", "#A6CEE3", "#1F78B4")
p+scale_color_manual(values=mycolor) # the number of colors must match the number of categories: 26 colors for 26 groups

Color combinations

# 4 colors
mycolor4=c("#E31A1C","#1F78B4","#6A3D9A","#FF7F00") # red, blue, purple, orange


# 8 colors
mycolor8=c("#E41A1C","#377EB8","#4DAF4A","#984EA3","#FF7F00","#A65628","#F781BF","#000000") # red, blue, green, purple, orange, brown, pink, black


# 10 colors
color=c("#E41A1C","#377EB8","#4DAF4A","#984EA3","#FF7F00","#A65628","#F781BF","#B2DF8A","#E7298A","#4EB3D3")
image(1:10,1,as.matrix(1:10),col=color,xlab=" ",ylab="",xaxt="n",yaxt="n",bty="n")


# 12 colors
mycolor12=c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A","#FFFF99","#B15928")


Nice colors: ColorBrewer

# ColorBrewer
p + geom_boxplot(aes(fill=factor(cyl)))+ scale_fill_brewer(palette='Paired')
# Sequential palettes (single continuous variable, emphasizing one end):
Blues BuGn BuPu GnBu Greens Greys Oranges OrRd PuBu PuBuGn PuRd Purples RdPu Reds YlGn YlGnBu YlOrBr YlOrRd
# Diverging palettes (single continuous variable, emphasizing both ends):
BrBG PiYG PRGn PuOr RdBu RdGy RdYlBu RdYlGn Spectral
# Qualitative palettes (discrete variables):
Accent Dark2 Paired Pastel1 Pastel2 Set1 Set2 Set3
# Continuous variables need scale_*_distiller for Brewer palettes; discrete variables use scale_*_brewer
library(RColorBrewer)
# color names : https://github.com/lh3/gnuplot-colorbrewer
# Show all palettes
display.brewer.all()
# Palette information
brewer.pal.info
# Test colors
mycolor =brewer.pal(n=8, 'Paired')
image(1:8,1,as.matrix(1:8),col=mycolor,xlab=" ",ylab="",xaxt="n",yaxt="n",bty="n")
# Pick a subset
subcolor=mycolor[1]
subcolor=c(mycolor[4],mycolor[1:3],mycolor[7:5])
# ggplot manual colors
p+scale_color_manual(values=mycolor)
p+scale_fill_manual(values=mycolor)
p+scale_color_manual(values=c("BRM"="#E31A1C","GIR"="#E31A1C","NEL"="#E31A1C"))
p+scale_color_manual(values=c("#E31A1C","#1F78B4","#6A3D9A","#FF7F00")) # four colors: red, blue, purple, orange
# Use a palette directly
p+scale_color_brewer(palette='Paired')


theme settings; x and y axis tweaks

# Common themes
# Remove the extra white background and frame lines; black axis text; legend on the right
+theme_bw()
+theme(plot.background=element_blank(),panel.background=element_blank(),legend.background=element_blank(),panel.grid =element_blank(),axis.title =element_text(color="black"),axis.text=element_text(color="black"),legend.position='right')
# White theme template; put theme_bw() first
p+theme_bw()+theme()
p+ theme(
# Remove grid lines (all; or major / minor separately)
panel.grid =element_blank(),
panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
# Remove the outer border
panel.border=element_blank(),
# Add an outer border
panel.border = element_rect(fill=NA, color="gray20", size=0.5, linetype="solid"),
# Then add x and y axis lines, with no ticks or labels
axis.line = element_line(colour = "black",size = 0.2),
# Remove the legend title
legend.title = element_blank(),
# Remove the legend
legend.position = "none",
# X axis title: size, font, color, bold, position, angle <http://docs.ggplot2.org/current/element_text.html>; custom fonts must be loaded separately via family=""; other element_text() calls follow the same rules
## face values: plain, bold, italic, bold.italic
axis.title.x =element_text(size=12,color="black",face="bold",vjust = 0.5, hjust = 0.5, angle = 45),
# X axis tick labels
axis.text = element_blank(), # remove all tick labels
axis.text.x=element_blank(), # remove x tick labels
axis.text.y=element_blank(), # remove y tick labels
# Y axis
axis.title.y=element_text(size=8),
axis.text.y = element_text(size=8),
# Axis tick marks, hidden
axis.ticks = element_blank(), # remove all ticks
axis.ticks.x=element_blank(), # remove x ticks
axis.ticks.y=element_blank(), # remove y ticks
)

Built-in themes

  • theme_classic: x and y axes, no grid lines # (a good choice)
  • theme_bw: high-contrast theme
  • theme_gray: the default
  • theme_light
  • theme_linedraw: grid lines become solid black lines
  • theme_minimal: no background content
  • More in library(ggthemes)

theme keyword table

Parameter  What it sets  Inherits from
line  all line properties
rect  all rectangle properties
text  all text properties
title  all title properties
axis.title  axis titles  text
axis.title.x  x axis title  axis.title
axis.title.y  y axis title  axis.title
axis.text  axis tick labels  text
axis.text.x  (properties and inheritance follow the same pattern as above)
axis.text.y
axis.ticks  axis tick marks  line
axis.ticks.x
axis.ticks.y
axis.ticks.length  tick mark length
axis.ticks.margin  spacing between tick marks and tick labels
axis.line  axis lines  line
axis.line.x
axis.line.y
legend.background  legend background  rect
legend.margin  legend margin
legend.key  legend keys
legend.key.size  legend key size
legend.key.height  legend key height
legend.key.width  legend key width
legend.text  legend text labels
legend.text.align  legend text alignment (0 = left, 1 = right)
legend.title  legend title  text
legend.title.align  legend title alignment
legend.position  legend position: left, right, bottom, top, or a two-number vector
legend.direction  legend layout: "horizontal" or "vertical"
legend.justification  anchor point: "center" or a two-number vector
legend.box  arrangement of multiple legends: "horizontal" or "vertical"
legend.box.just  justification of multiple legends
panel.background  plotting area background  rect
panel.border  plotting area border  rect
panel.margin  margin between facet panels
panel.grid  plotting area grid lines  line
panel.grid.major  major grid lines
panel.grid.minor  minor grid lines
panel.grid.major.x
panel.grid.major.y
panel.grid.minor.x
panel.grid.minor.y
plot.background  background of the entire plot
plot.title  plot title
plot.margin  plot margins: top, right, bottom, left
strip.background  facet label background  rect
strip.text  facet label text  text
strip.text.x
strip.text.y

Loading fonts

require("sysfonts")
require("showtext")
# List available fonts; any font installed on the system can be used
font.files()
# Change the font via the family argument
ggplot(data.frame(x = rnorm(100))) +
geom_histogram(aes(x), fill = 'purple', alpha = 0.6) +
labs(x = 'X value', y = 'Count') +
theme(text = element_text(family = 'Arial Black'))

x and y axis data ranges

# Method 1
p+ scale_y_continuous(limits=c(0,1.8))
# scale_x_continuous(limits=c(0,10))
# Method 2
p+xlim(0,10) # not always applicable
p + xlim(min(dt$A, 0)*1.2, max(dt$A)*1.2) ## usually scale the limits by a multiple; note how the minimum is defined

Axis tick intervals

p + scale_x_continuous(breaks=seq(0, 10, 5)) # x axis ticks every 5 units

Statistical tests

wilcox.test()

Wilcoxon rank sum test
Mann–Whitney U test / Mann–Whitney–Wilcoxon (MWW)

In R, wilcox.test() performs both the Wilcoxon signed rank test and the Mann-Whitney U (rank sum) test.

  • With a single sample, the difference of two samples, or two samples with paired=TRUE, it is the Wilcoxon signed rank test.
  • With two samples and paired=FALSE (independent samples), it is the Mann-Whitney U test.

If only x is given, or if both x and y are given and paired is TRUE, a Wilcoxon signed rank test of the null that the distribution of x (in the one sample case) or of x - y (in the paired two sample case) is symmetric about mu is performed.

Otherwise, if both x and y are given and paired is FALSE, a Wilcoxon rank sum test (equivalent to the Mann-Whitney test: see the Note) is carried out. In this case, the null hypothesis is that the distributions of x and y differ by a location shift of mu and the alternative is that they differ by some other location shift (and the one-sided alternative “greater” is that x is shifted to the right of y).
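The same split exists in SciPy, which uses two separate functions for the two cases; a minimal sketch using the paired serum data from Example 8-1 below (only scipy.stats is assumed):

```python
from scipy.stats import mannwhitneyu, wilcoxon

old = [60, 142, 195, 80, 242, 220, 190, 25, 198, 38, 236, 95]
new = [76, 152, 243, 82, 240, 220, 205, 38, 243, 44, 190, 100]

# paired=TRUE case: Wilcoxon signed rank test on the pairwise differences
stat_sr, p_sr = wilcoxon(old, new)

# paired=FALSE case: Mann-Whitney U / Wilcoxon rank sum test
stat_u, p_u = mannwhitneyu(old, new, alternative="two-sided")

print(p_sr, p_u)
```

The p-values differ slightly from R's because the two implementations handle ties, zeros, and the continuity correction differently.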

Paired Wilcoxon signed rank test

Example 8-1

Twelve serum samples were each tested for alanine aminotransferase by the original method (old, 20-minute assay) and a new method (new, 10-minute assay). Do the two methods give different results?

# Source code:
old <- c(60,142,195,80,242,220,190,25,198,38,236,95)
new <- c(76,152,243,82,240,220,205,38,243,44,190,100)
wilcox.test(old,new,paired = TRUE)
# Output:
> wilcox.test(old,new,paired = TRUE) # paired Wilcoxon test
Wilcoxon signed rank test with continuity correction
data: old and new
V = 11.5, p-value = 0.06175
alternative hypothesis: true location shift is not equal to 0
Warning messages:
1: In wilcox.test.default(old, new, paired = TRUE) :
cannot compute exact p-value with ties
2: In wilcox.test.default(old, new, paired = TRUE) :
cannot compute exact p-value with zeroes
# Interpretation: p-value = 0.06175, so the two methods cannot be considered to give different results.

Independent-sample Wilcoxon test (Mann-Whitney U test)

Example 8-2

In a certain area the median urinary fluoride of healthy people is known to be 45.30 μmol/L. Twelve workers were randomly sampled from a local factory and their urinary fluoride measured. Is the workers' urinary fluoride higher than that of healthy locals?

# Source code:
data82 <- c(44.21,45.30,46.39,49.47,51.05,53.16,53.26,54.37,57.16,67.37,71.05,87.37)
wilcox.test(data82-45.30)
# Output:
> wilcox.test(data82,mu=45.30)
Wilcoxon signed rank test with continuity correction
data: data82
V = 65, p-value = 0.005099
alternative hypothesis: true location is not equal to 45.3
Warning message:
In wilcox.test.default(data82, mu = 45.3) :
cannot compute exact p-value with zeroes
# Interpretation: p < 0.01, so the factory workers' urinary fluoride can be considered higher than that of healthy locals.
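The same one-sample test can be cross-checked with SciPy's wilcoxon, again subtracting the hypothesized median first (a sketch; the exact p-value differs slightly from R's because of how zeros, ties, and the continuity correction are handled):

```python
from scipy.stats import wilcoxon

# Urinary fluoride values from Example 8-2; hypothesized median 45.30
data82 = [44.21, 45.30, 46.39, 49.47, 51.05, 53.16,
          53.26, 54.37, 57.16, 67.37, 71.05, 87.37]

# Test whether the distribution is centered at 45.30
stat, p = wilcoxon([x - 45.30 for x in data82])
print(stat, p)
```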

Example 8-3

The right hilar transverse diameter (RD, cm) was measured on X-ray films for 10 lung cancer patients and 12 stage-0 silicosis workers. Are the lung cancer patients' RD values higher than the silicosis workers'?

# Source code:
lc <- c(2.78,3.23,4.2,4.87,5.12,6.21,7.18,8.05,8.56,9.6)
si <- c(3.23,3.5,4.04,4.15,4.28,4.34,4.47,4.64,4.75,4.82,4.95,5.1)
wilcox.test(lc,si)
# Output:
> wilcox.test(lc,si,alternative="greater",exact=F)
Wilcoxon rank sum test with continuity correction
data: lc and si
W = 86.5, p-value = 0.04318
alternative hypothesis: true location shift is greater than 0
# In R, wilcox.test() performs both the Wilcoxon signed rank test and the Mann-Whitney U test: a single sample, the difference of two samples, or two samples with paired=TRUE gives the signed rank test, while paired=FALSE (independent samples) gives the Mann-Whitney U test, which is what this problem uses. The W=86.5 here differs from the textbook's T=141.5, but they are equivalent: W1=141.5-10*(10+1)/2=86.5; W2=111.5-12*(12+1)/2=33.5
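The W-to-T conversion in that note is simple arithmetic: R reports U (which it calls W), while the textbook reports the rank sum T for a group of size n; they are related by U = T − n(n+1)/2. Checked directly:

```python
def rank_sum_to_u(t, n):
    """Convert a rank-sum statistic T for a group of size n to the Mann-Whitney U."""
    return t - n * (n + 1) / 2

print(rank_sum_to_u(141.5, 10))  # 86.5, the W reported by wilcox.test
print(rank_sum_to_u(111.5, 12))  # 33.5
```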

Example 8-4

Carboxyhemoglobin HbCO (%) levels are known for 39 smoking workers and 40 non-smoking workers. Is the smokers' HbCO (%) higher than the non-smokers'?

Answer: the textbook uses the Wilcoxon test here, but a Ridit analysis is more appropriate; both methods are shown below.

  • wilcox.test, or the Kruskal test

wilcox.test (or Kruskal test)

smoke <- c(1,8,16,10,4)
no.smoke <-c(2,23,11,4,0)
rank.c <- c(1:5)
group1 <- rep(rank.c,smoke)
group2 <- rep(rank.c,no.smoke)
data84 <- c(group1,group2)
group.f <-factor(c(rep(1,length(group1)),rep(2,length(group2))))
wilcox.test(data84~group.f)
kruskal.test(data84~group.f)

# Output:
# wilcox.test
wilcox.test(data84~group.f)
Wilcoxon rank sum test with continuity correction
data: data84 by group.f
W = 1137, p-value = 0.0002181
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(x = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, :
cannot compute exact p-value with ties
# kruskal.test
> kruskal.test(data84~group.f)
Kruskal-Wallis rank sum test
data: data84 by group.f
Kruskal-Wallis chi-squared = 13.707, df = 1, p-value = 0.0002137
  • A Ridit analysis is more appropriate for this problem
data84 <- matrix(c(1,8,16,10,4,2,23,11,4,0),nrow=5)
dimnames(data84) <- list(cone=c("very low","low","median","a bit high","high"),
worker=c("smoker","no-smoker"))
library(Ridit)
ridit(data84,2)
chisq.test(data84)
# Output:
> library(Ridit)
> ridit(data84,2)
Ridit Analysis:
Group Label Mean Ridit
----- ----- ----------
1 smoker 0.6159
2 no-smoker 0.387
Reference: Total of all groups
chi-squared = 13.707, df = 1, p-value = 0.0002137
> chisq.test(data84)
Pearson's Chi-squared test
data: data84
X-squared = 15.0785, df = 4, p-value = 0.004541

Kruskal-Wallis test

The Kruskal-Wallis test extends the Wilcoxon approach (really the Mann-Whitney test) to more than two samples. When only two samples are compared, the Kruskal-Wallis test is equivalent to the Mann-Whitney test.
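This two-sample equivalence can be verified numerically with SciPy (a recent SciPy is assumed for the method= keyword; the data are arbitrary): the Kruskal-Wallis H equals the square of the Mann-Whitney z, so the asymptotic two-sided p-values coincide once the continuity correction is switched off.

```python
from scipy.stats import kruskal, mannwhitneyu

x = [1, 3, 5, 7, 9]
y = [2, 4, 6, 8, 10]

# Kruskal-Wallis on two samples
h, p_kw = kruskal(x, y)

# Asymptotic two-sided Mann-Whitney, no continuity correction
u, p_mw = mannwhitneyu(x, y, alternative="two-sided",
                       method="asymptotic", use_continuity=False)

print(p_kw, p_mw)  # the two p-values agree
```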

Example 8-5

Three drugs were tested for killing snails, with 200 live snails per batch. Deaths per batch were counted and mortality (%) computed. Do the three drugs differ in snail-killing effectiveness?

# Source code:
drug <- rep(c("drugA","drugB","drugC"),each=5)
data <- c(32.5,35.5,40.5,46,49,16,20.5,22.5,29,36,6.5,9.0,12.5,18,24)
data85 <- data.frame(drug,data)
kruskal.test(data85$data~data85$drug)
# Output:
> kruskal.test(data85$data~data85$drug)
Kruskal-Wallis rank sum test
data: data85$data by data85$drug
Kruskal-Wallis chi-squared = 9.74, df = 2, p-value = 0.007673
# Note: the Kruskal-Wallis statistic is the H value.

Example 8-6

Mice were inoculated with three strains of typhoid bacillus (9D, 11C and DSC1) and their survival in days compared. Do survival times differ among the three strains?

# Source code:
mice <- as.factor(c(rep("9D",10),rep("11C",9),rep("DSC1",11)))
data <- c(2,2,2,3,4,4,4,5,7,7,5,5,6,6,6,7,8,10,12,3,5,6,6,6,7,7,9,10,11,11)
data86 <- data.frame(mice,data)
kruskal.test(data86$data~data86$mice)
# Output:
> kruskal.test(data86$data~data86$mice)
Kruskal-Wallis rank sum test
data: data86$data by data86$mice
Kruskal-Wallis chi-squared = 9.9405, df = 2, p-value = 0.006941

Example 8-7

Eosinophil counts in the sputum of patients with four diseases are shown in Table 8-11. Do the four diseases differ in sputum eosinophils? Note: this example is similar to 9.11 in Medical Statistics and SAS Applications (Shanghai Jiao Tong University).


# Source code:
x1 <- c(0,2,9,6)
x2 <- c(3,5,5,2)
x3 <- c(5,7,3,2)
x4 <- c(3,5,3,0)
freq <- function(x){
count <-c()
for(i in 1:4){
count1 <-c(rep(i,x[i]))
count <- append(count,count1)
}
return(count)}
data87 <-c(freq(x1),freq(x2),freq(x3),freq(x4))
group <- c(rep(1,sum(x1)),rep(2,sum(x2)),rep(3,sum(x3)),rep(4,sum(x4)))
kruskal.test(data87~group)
# Output:
> kruskal.test(data87~group)
Kruskal-Wallis rank sum test
data: data87 by group
Kruskal-Wallis chi-squared = 15.5058, df = 3, p-value = 0.001432
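The freq() helper above just expands category counts into repeated 1-based category codes; the same idea in Python (the helper name is kept for parallelism):

```python
def freq(counts):
    """Expand counts per category into a list of repeated category codes (1-based)."""
    out = []
    for code, n in enumerate(counts, start=1):
        out.extend([code] * n)
    return out

print(freq([0, 2, 9, 6]))  # no 1s, two 2s, nine 3s, six 4s
```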

Example 8-8

Pairwise comparisons among the three samples of Example 8-6, using the Nemenyi test.

# Source code:
# Required packages:
# install.packages("pgirmess")
# install.packages("coin")
# install.packages("multcomp")
library(pgirmess)
library(coin)
library(multcomp) # load the required packages
mice <- as.factor(c(rep("9D",10),rep("11C",9),rep("DSC1",11)))
data <- c(2,2,2,3,4,4,4,5,7,7,5,5,6,6,6,7,8,10,12,3,5,6,6,6,7,7,9,10,11,11)
data88 <- data.frame(mice,data)
kruskal.test(data~mice,data=data88)
kruskalmc(data~mice, data=data88, probs=0.05) # pairwise comparisons with kruskalmc; this method does not report exact p-values
# Build the test below to compute exact p-values
mult <- oneway_test(data ~ mice, data = data88,
ytrafo = function(data) trafo(data, numeric_trafo = rank),
xtrafo = function(data) trafo(data, factor_trafo = function(x)
model.matrix(~x - 1) %*% t(contrMat(table(x), "Tukey"))),
teststat = "max", distribution = approximate(B = 90000))
pvalue(mult, method = "single-step") # compute exact p-values; this code is adapted from a DXY forum post [1]
# Output:
> kruskal.test(data~mice,data=data88)
Kruskal-Wallis rank sum test
data: data by mice
Kruskal-Wallis chi-squared = 9.9405, df = 2, p-value = 0.006941
> kruskalmc(data~mice, data=data88, probs=0.05) # pairwise comparisons; no exact p-values
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
11C-9D 10.3777778 9.683378 TRUE
11C-DSC1 0.4949495 9.472590 FALSE
9D-DSC1 10.8727273 9.208410 TRUE
> pvalue(mult, method = "single-step") # compute exact p-values
9D - 11C 0.01870000
DSC1 - 11C 0.95110000
DSC1 - 9D 0.01088889

Example 8-9

Eight subjects each received sound stimuli at four different frequencies under identical experimental conditions, and their response rates (%) were recorded. Do the four frequencies differ in response rate?

# Source code:
freA <- c(8.40,11.60,9.40,9.80,8.30,8.60,8.90,7.80)
freB <- c(9.60,12.70,9.10,8.70,8.00,9.80,9.00,8.20)
freC <- c(9.80,11.80,10.40,9.90,8.60,9.60,10.60,8.50)
freD <- c(11.70,12.00,9.80,12.00,8.60,10.60,11.40,10.80)
matrix89 <- matrix(c(freA,freB,freC,freD),nrow=8,dimnames=list(no=1:8,c("A","B","C","D")))
friedman.test(matrix89)
# Output:
> friedman.test(matrix89)
Friedman rank sum test
data: matrix89
Friedman chi-squared = 15.1519, df = 3, p-value = 0.001691
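For a cross-check, SciPy's friedmanchisquare on the same four frequency groups gives the same significant result (a sketch; the exact statistic may differ marginally from R's depending on the tie correction):

```python
from scipy.stats import friedmanchisquare

freA = [8.40, 11.60, 9.40, 9.80, 8.30, 8.60, 8.90, 7.80]
freB = [9.60, 12.70, 9.10, 8.70, 8.00, 9.80, 9.00, 8.20]
freC = [9.80, 11.80, 10.40, 9.90, 8.60, 9.60, 10.60, 8.50]
freD = [11.70, 12.00, 9.80, 12.00, 8.60, 10.60, 11.40, 10.80]

# Each subject is a block; each frequency is a treatment
stat, p = friedmanchisquare(freA, freB, freC, freD)
print(stat, p)
```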

Computing PCA by hand

prcomp

Input format:

# in.txt
B1 B2 B3 B4 B5 B6 C1 C2 C3 C4 C5
2 2 1 1 2 2 1 1 1 1 1
1 1 1 1 1 0 0 0 1 0 0
3 3 3 3 2 3 2 2 2 2 2
......
......

y=read.table("in.txt",header=T)
y=as.matrix(y)
# Transpose the matrix
y=t(y)
p=prcomp(y,scale=T)
p2=summary(p)
# Show PC1-PC4
p2[5]$x[,1:4]
# Proportion of variance explained by each PC
p2[6]$importance[,1:4]
# Save to files
write.table(p2[5]$x[,1:4],file="pca.out",sep=" ",quote=F,row.names=F,col.names=F)
write.table(p2[6]$importance[,1:6],file="pca.out.log",sep=" ",quote=F,row.names=F,col.names=F)
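prcomp(y, scale=T) amounts to an SVD of the column-centered, column-scaled matrix; a minimal numpy sketch of the same computation (random data standing in for in.txt):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(11, 50))  # 11 samples x 50 markers, like t(in.txt)

# Center and scale each column (what scale=T does in prcomp)
z = (y - y.mean(axis=0)) / y.std(axis=0, ddof=1)

u, s, vt = np.linalg.svd(z, full_matrices=False)
pcs = u * s                      # principal component scores, prcomp's $x
prop_var = s**2 / np.sum(s**2)   # proportion of variance per PC

print(pcs[:, :4])   # PC1-PC4, analogous to p2[5]$x[,1:4]
print(prop_var[:4])
```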

Awesome-Bioinfomatics-List

Posted on 2016-12-12   |   Category: Cheatsheets

Ctrl+F and search for the keywords

Population genomics

Statistical Population Genomics(book)

linux

Windows Subsystem for Linux

Coding language

Perl
Python
R
Cookbook for R

Plot

ggplot2
Circos

Color
Colorbrewer2

data handling

trim fastq reads trimadap
SeqKit
FastQ processing fastp

Mapping

BWA
Maq (out-of-date)
stampy (out-of-date)
Bowtie (out-of-date)
Bowtie2
minimap
SAM/BAM flag explain
lastz

  • axt format
  • maf format

SNP/Indel calling

samtools
gatk
angsd
soapsnp
soapindel
pindel
dindel

Variant call format
about vcf format
vcftools

CNV calling

CNVnator
Genome STRiP
XHMM
ExomeCNV
CONTRA
ExomeCopy
ExomeDepth
CoNIFER
cn.mops - Mixture of Poissons for CNV detection in NGS data paper
cnvHiTSeq
CNVkit
PennCNV

Genotype phasing/ estimation of haplotypes

beagle
phase
fastphase
SHAPEIT
MACH

Genotype imputation

impute2
Minimac

Phylogenetic tree

fasttree
RAxML
Phylip
PAUP* PAUP*
Phylogen Programs

  • draw a nice tree
    here
    here

  • compare two trees dendextend

Analysis tools

plink
ngstools

LD

haploview

IBD

beagle
germline

Population Structure analysis

frappe
structure
admixture
ngsAdmix
ohana

PCA

eigensoft

Demographic history

dadi
psmc
msmc
smc++
treemix
fastsimcoal2
ChromoPainter/fineStructure/GLOBETROTTER
G-PhoCS
SLim
BEAST

Admixture related:
ADMIXTOOLS here
ALDER

Simulator:
msPrime
msms

Selective sweeps

sweep
ihs and xpehh
nSL
SweepFinder
SweeD
xp-clr
selscan
Composite of Multiple Signals (CMS), combines the signals from five different tests (ΔiHH, iHS, XP-EHH, FST and ΔDAF) to create a single test statistic CMS2.0 SabetiLab

Transcriptome

Trinity
velvet
Tophat
cufflinks
HISAT2
WGCNA (a R package for weighted correlation network analysis)
Network plot: Cytoscape
Iso-seq analysis of alternative transcription initiation (ATI), alternative splicing (AS), alternative cleavage and polyadenylation (APA), natural antisense transcripts (NAT), and circular RNAs (circRNAs): PRAPI

Database

DDBJ
ENA
NCBI
ensembl
ensembl plant
KEGG
reactome
GO
Plant transcription factor database (PlantTFDB)
Human Phenotype Ontology
GSA, Genome sequence archive of Beijing Institutes of Genomics(BIG)
Rice: 3k rice
Human: David Reich lab

GO/KEGG Enrichment
clusterProfiler
kaas
Gene Ontology
PANTHER
GORILLA
David

Mouse Cell Atlas

GWAS

FAST-LMM
GAPIT
EMMAX
NHGRI-EBI GWAS Catalog
GWAS Central
Odds Ratio (OR) and Risk Ratio (RR)

Metagenomics(MGWAS)

MaAsLin

RAD-seq

Stacks

SNP annotation

SNPeff
NGS-SNP
ANNOVAR

Gene annotation

EVidenceModeler(EVM)
DAVID
Gene2Function
Gene Structure Display Server, GSDS2

Plastome annotation (chloroplast genome)

plann

bed annotation

Genomic search (bed file search, ranking significance): GIGGLE, Nature Methods

Protein functional effects

PolyPhen-2
Protein-Protein Interaction Networks (STRING); gene structure visualization

Tools

Primer3 here
TMHMM (domain prediction)
Wego
Species & general Chinese name
tree of life
Flowchart maker: here
Venn diagram: VennPainter, Venny2.1

draw a plot like this weblogo

Understanding genome variation (BEDTOOLS, GEMINI, LUMPY, VCFANNO, PEDAGREE, and GQT): QuinlanLab

Cancer related

The Cancer Genome Atlas (TCGA) pan-cancer analysis here
International Cancer Genome Consortium (ICGC)
National Cancer Institute (NCI)
National Comprehensive Cancer Network (NCCN), including NCCN guidelines for clinical application
ctDNA analysis(MutScan, GeneFuse, cfDNAPattern) HaploX

Disease

CDC Public Health Genomics Knowledge Base
Disease variant scoring: M-CAP, M-CAP online, REVEL
ClinVar
dbNSFP
HGMD, Human Gene Mutation Database
SwissVar
MalaCards, The human disease database

evolution

UC berkeley, Understanding Evolution
Tiktaalik roseae; Your Inner Fish

Mouse Phenotype

mouse gene knockout and Phenotype

Sites for looking up gene function:

GeneCards
wikigenes
GeneReviews

Understanding genetics

Ask a Geneticist: a Q&A page on The Tech Museum of Innovation site, where volunteers with relevant backgrounds from Stanford write accessible answers that help explain genetic principles, including how inheritance works and how to interpret disease risk in genetic testing.

Knockout mouse project

UCDavis KOMP

Contact


libgd and GD problem (installing Circos)

Posted on 2016-11-15

A clear explanation

1. Install libgd-2.1.0

wget https://github.com/libgd/libgd/releases/download/gd-2.1.0/libgd-2.1.0.tar.gz
tar zxvf libgd-2.1.0.tar.gz
cd libgd-2.1.0
./configure
...
** Configuration summary for libgd 2.1.0:
Support for Zlib: yes
Support for PNG library: yes
Support for JPEG library: yes
Support for VPX library: no
Support for TIFF library: yes
Support for Freetype 2.x library: yes
Support for Fontconfig library: yes
Support for Xpm library: no
Support for pthreads: yes
...

If Freetype and Fontconfig are missing, you would have to download, configure, make and install them like this:

curl -O http://www.ijg.org/files/jpegsrc.v8d.tar.gz
tar -xzvf jpegsrc.v8d.tar.gz
cd jpeg-8d
./configure --prefix=/path/to/jpeg-8d
make
make install (or "sudo make install")
curl -O ftp://ftp.simplesystems.org/pub/libpng/png/src/libpng16/libpng-1.6.2.tar.gz
tar -xzvf libpng-1.6.2.tar.gz
cd libpng-1.6.2
./configure --prefix=/path/to/libpng-1.6.2
make
make install (or "sudo make install")

Finish installing libgd

./configure --prefix=/path/to/libgd-2.1.0
make
make install(or "sudo make install")

2. Install GD manually (Perl module)

curl -O http://www.cpan.org/authors/id/L/LD/LDS/GD-2.49.tar.gz (if curl fails, copy and paste the URL into your browser)
tar -xzvf GD-2.49.tar.gz
cd GD-2.49
GD-2.49$ perl Makefile.PL
GD-2.49$ make
GD-2.49$ sudo make install

Fix pheatmap bug

Posted on 2016-11-07   |   Category: Work

When pheatmap output is saved as a PDF, a blank page is produced; adding onefile=FALSE fixes this:

pdf("test.pdf", onefile=FALSE)

Perl cheatsheets

Posted on 2016-11-01   |   Category: Cheatsheets

Any script longer than one screen should start with use strict;

Data elements

Variables

When you need to declare a complex multi-line string, use heredoc syntax:

my $blurb =<<'END_BLURB';
He looked up. "Change is the constant on which they all
can agree. We instead, born out of time, remain perfect
and perfectly self-aware. We only suffer change as we
pursue it. It is against our nature. We rebel against
that change. Shall we consider them greater for it?"
END_BLURB

Undef

In Perl, undef represents an unassigned, indeterminate, unknown value.
A scalar that has been declared but not defined holds undef:

my $name = undef; # unnecessary assignment
my $rank; # also contains undef

Numbers

Perl supports integers and floating-point numbers in several notations: binary, octal, decimal and hexadecimal:

my $integer = 42; # integer
my $float = 0.007; # float
my $sci_float = 1.02e14; # scientific notation, float
my $binary = 0b101010; # binary, 0b prefix
my $octal = 052; # octal, leading-0 prefix
my $hex = 0x20; # hexadecimal, 0x prefix

Underscores can be added for readability (note: not commas, which have a special meaning in Perl):

my $billion = 1000000000;
my $billion = 1_000_000_000;
my $billion = 10_0_00_00_0_0_0;

Lists

A list is an ordered collection of scalars; an array is a variable that stores a list.

The range operator (..) is a convenient way to create a list:

my @chars = 'a' .. 'z';
my @count = 13 .. 27;

The qw() operator produces a list of strings by splitting on whitespace:

my @stooges = qw( Larry Curly Moe Shemp Joey Kenny );

Array operations

Push and pop

my $fred = pop(@array); # pop removes from the right end; shift removes from the left
push(@array, $a);       # push adds at the right end; unshift adds at the left

Index of the last element: $trees[$#trees]

reverse returns a reversed list: @fred = reverse @fred; # does not modify its argument in place

Sorting

my @result = sort by_number @some_numbers;
my @result = sort { $a <=> $b } @some_numbers;        # <=> compares numbers for sorting
my @string = sort { $a cmp $b } @any_strings;         # cmp compares strings for sorting
my @string = sort { "\L$a" cmp "\L$b" } @any_strings; # \L makes the sort case-insensitive

sub by_score_and_name {
    $score{$b} <=> $score{$a}   # sort keys by value, descending
        or
    $a cmp $b                   # break ties alphabetically
}

Flow control

Conditionals and loops

Statement modifiers (postfix form); parentheses improve readability:

print "yes\n" if ($a > 1);
print "$_" for 1 .. 10;

Ternary operator

The ternary operator (? :) evaluates one of two expressions depending on a condition.

When the condition is true, expression 1 is evaluated; when false, expression 2:
CONDITION ? EXPRESSION1 : EXPRESSION2

my $time_suffix = after_noon($time)
? 'afternoon'
: 'morning';
push @{ rand() > 0.5 ? \@red_team : \@blue_team }, Player->new;

Loop control

next - jump straight into the next iteration.

last - exit the loop immediately.

redo - restart the current iteration from the top of the loop body (without re-evaluating the loop condition).

Nested loops can be hard to follow; labels make the structure clear:

LINE:
while (<$fh>) {
    chomp;
    PREFIX:
    for my $prefix (@prefixes) {
        next LINE unless $prefix;
        say "$prefix: $_";
        # next PREFIX is implicit here
    }
}

Hashes

Accessing a hash

Keys

for my $addressee (keys %addresses){
say "Found an address for $addressee!";
}

Values

for my $address (values %addresses){
say "Someone lives at $address";
}

Key/value pairs
The each function; the only place it is really used is in a while loop:

while (my ($addressee, $address) = each %addresses){
say "$addressee lives at $address";
}

Using a hash to remove duplicates

my %uniq;
undef @uniq{ @items };
my @uniques = keys %uniq;
# @items holds the data to deduplicate; @uniques holds the result.
# Applying undef to a hash slice sets all the sliced values to undef.

The exists function: check whether a hash contains a given key

if (exists $books{"dino"}) {
print "Hey, there's a library card for dino!\n";
}

Inverting and unwinding a hash

%inverse_hash = reverse %any_hash; # invert a hash, swapping keys and values
@any_array = %some_hash;           # unwind a hash into a list of key/value pairs

Deleting a key from a hash

delete $h{$a};

Operators

Math operators

+ addition
- subtraction
* multiplication
/ division
** exponentiation
% modulus (remainder): 10 % 3 is 1
+=, -=, *=, /=, **=, %= compound assignment forms
-- prefix decrement
== equal
!= not equal
> greater than
< less than
>= greater than or equal
<= less than or equal
<=> the numeric comparison operator used in sorting

Increment and decrement

Increment (++) and decrement (--).
Putting ++ before a variable increments it first and then yields the new value.
$m = 5;
my $n = ++$m; # $m is incremented to 6, and that value is assigned to $n
$m = 5;
my $d = $m++; # $d gets $m's previous value, 5; then $m is incremented to 6

Escape characters in double-quoted strings

\n   # newline
\r   # carriage return (cursor to start of line)
\t   # horizontal tab
\f   # form feed
\b   # backspace
\a   # system bell
\e   # escape (the ASCII ESC character)
\007 # any octal ASCII value (here 007, the system bell)
\x7f # any hexadecimal ASCII value (here 7f, the delete control code)
\cC  # a control character (here, the code for Ctrl-C)
\\   # backslash
\"   # double quote
\l   # lowercase the next character
\L   # lowercase all characters up to \E
\u   # uppercase the next character
\U   # uppercase all characters up to \E
\Q   # backslash-quote all non-word characters up to \E
\E   # end \L, \U, or \Q

String operators

String operators impose string context on their operands. For example:

the regex binding operators (=~, !~)
the string concatenation operator (.)
the string equality operator (eq)
not equal: ne
less than: lt
greater than: gt
less than or equal: le
greater than or equal: ge
and cmp, the string comparison operator used in sorting

Logical operators

Logical operators impose boolean context on their operands: &&, ||, and, or, //, the ternary operator (?:), logical not (!), not (lower precedence than !), and xor.
Note the difference between // and ||: // is true whenever its left operand is defined, even if that value is 0 or the empty string.

Regex precedence:
parentheses > quantifiers > anchors and sequence > alternation (|) > characters, character classes, backreferences

Bitwise operators

Bitwise operators impose numeric context. They are not very common:
left shift (<<), right shift (>>), bitwise and (&), bitwise or (|), bitwise xor (^), and the compound forms (<<=, >>=, &=, |=, ^=).

Special operators

The increment operator (++) behaves specially: when the operand is a number it adds one; when it is a string, it increments the string.

The repetition operator (x) also has complex behavior: in list context it repeats the list; in scalar context it repeats and concatenates into a single string.

The range operator (..), in list context, generates the list from one operand to the other, numeric or string.
In boolean context it behaves very differently: it starts out false, flips to true when the left operand is true, and stays true until the right operand is true, after which it flips back to false.

# list context
my @cards = ( 2 .. 10, 'J', 'Q', 'K', 'A' );
# boolean context
foreach my $x ( 1..10 ){ print $x if $x == 3 .. $x == 4 }
# prints 3 4
# this behavior makes it easy to extract the body of an email:
while (/Hello, $user/ .. /Sincerely,/)
{
    say "> $_";
}

Regular expressions

The vertical bar | means alternation: either the left side or the right side may match.

Regular expressions interpolate variables like double-quoted strings.

Optional modifiers: they can be chained, and their order does not affect the result:

/i # case-insensitive matching
/s # let the dot (.) match any character, including newline
/x # allow whitespace (and comments) in the pattern
/g # global: after one match, keep matching/replacing
/m # multi-line matching

Anchors

^  # start of the string
$  # end of the string
\b # word boundary: /\bfred\b/ matches fred as a whole word

Named captures
Label a capture with (?<name>PATTERN) and read it back with $+{name}: print "$+{name}"

Quantifiers

{}           range of repetition counts
/a{5,15}/    the letter a repeated 5 to 15 times
/(fred){3,}/ fred repeated 3 or more times
/(fred){8}/  repeated exactly 8 times
* means {0,}
+ means {1,}
? means {0,1}
# Non-greedy quantifiers:
# append ? to the greedy quantifier
* becomes *?
+ becomes +?
? becomes ??

Functions

Function arguments

sub show_pets_by_type{
    my ($type, %pets) = @_; # remember to peel off the scalar first
    while (my ($name, $species) = each %pets){
        next unless $species eq $type;
        say "$name is a $species";
    }
}
# List assignment is greedy, so %pets above swallows everything left in @_.
# If the $type parameter came last instead, the hash would consume all the
# values first, leaving one element unpaired, and you would get an error.

Validating arguments

Sometimes validation is easy, for example checking the number of arguments:

use Carp;    # croak comes from the Carp module

sub add_numbers{
    croak 'Expected two numbers, received: ' . @_
        unless @_ == 2;
    ...
}

Other cases, such as validating argument types, are trickier, because Perl converts between types freely. If you need this, look at the Params::Validate and MooseX::Method::Signatures modules.

Input and output

Invocation arguments: @ARGV is the list of invocation arguments; it is already filled in when the program starts running.

Perl reserves six special filehandles:

STDIN STDOUT STDERR DATA ARGV ARGVOUT

Use say for output; unlike print, it automatically appends a newline to each line it prints.

Formatted output with printf

printf "Hello, %s; your password expires in %d days!\n", $user, $days_to_die;
# %s stands for the string $user; %d for the number $days_to_die
%d decimal integer; digits after the decimal point are dropped
%s string; a field width can be given: printf "%10s\n", "wilma"; (positive widths right-justify, negative left-justify)
%f floating point (rounded); decimal places can be given: printf "%12.3f\n", 42 + 2/3; (width 12, 3 decimal places)
%% prints a literal percent sign
# sprintf returns the formatted string instead of printing it
# Format number with up to 8 leading zeroes
my $result = sprintf("%08d", $number);
# Round number to 3 digits after decimal point
my $rounded = sprintf("%.3f", $number);

Perl tricks

Shuffle the lines of a file with Perl:

perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < myfile

Shuffle an array:

use List::Util;
@array = List::Util::shuffle @array;

The Art of Charm

Posted on 2016-11-01   |   Categorized in Social

Learn to be a social ninja!

  1. Meet somebody new and be genuinely interested in people
  2. Assume you're already familiar with somebody
  3. Smile often
  4. Give compliments freely
  5. Be able to laugh at yourself; self-deprecation helps
  6. Learn to listen
  7. Treat people with kindness and respect

Ways to turn off the screen on a Mac

Posted on 2016-11-01   |   Categorized in Cheetsheets

Shortcuts

  • Ctrl+Shift+Power: turn off the screen
  • Cmd+Opt+Power: sleep
  • Cmd+Ctrl+Power: restart
  • Cmd+Ctrl+Opt+Power: shut down

Applications -> Utilities -> Keychain Access
In the Keychain Access menu, open Preferences -> General and check "Show keychain status in menu bar". A small lock then appears in the menu bar; its first item locks the screen.

Inserting images into Markdown

Posted on 2016-10-28

Use Qiniu Cloud as an image host for Markdown:

http://cnfeat.com/blog/2015/11/30/cli-qiniu/#section-8

qshell command-line sync tool

(Updated 2016-11-08) qrsync has been officially deprecated…
The qshell tool now does the job:

http://developer.qiniu.com/code/v6/tool/qshell.html

Automatically upload images and get their external URLs

  • Save images to the upload directory /Users/wanglizhong/software/7niu/PicUpload/

  • Write a simple script to get the image URL
    Contents of get_url.pl:

    #!/usr/bin/perl -w
    `/usr/local/bin/qshell qupload /Users/wanglizhong/software/7niu/conf.json 2>/Users/wanglizhong/software/7niu/.tmp`;
    my $pic = `ls /Users/wanglizhong/software/7niu/PicUpload/*`;
    chomp $pic;
    $pic =~ /\/([^\/]+)$/;
    print "![image](http://ofr9vioug.bkt.clouddn.com/$1)";
    `rm $pic`;

Run perl get_url.pl in a terminal; you can go further and bind it to a hotkey with Keyboard Maestro.

  • Insert the image URL
    ![image](http://ofr9vioug.bkt.clouddn.com/QQ20161028-0@2x-chimpanzee.png)


Resizing images inserted into Markdown

![image](http://ofr9vioug.bkt.clouddn.com/PairwiseCluster.png)
Append the suffix "?imageView/2/w/600", where 600 is a custom width:
![image](http://ofr9vioug.bkt.clouddn.com/PairwiseCluster.png?imageView/2/w/600)

Deal with Emails

Posted on 2016-10-12   |   Categorized in Work

Rules for handling email:

  1. Reply quickly; a response can be very short. Coming back to it later means rereading the email, which wastes more time
  2. Every word matters; keep it clean and concise
  3. Clear the inbox often: handle each message in one pass if possible; batch the ones you can't answer immediately and empty the inbox the same day
  4. Handle the most recent mail first; an earlier question may already have been resolved
  5. Forward useful information to the people concerned
  6. When replying, drop recipients who are no longer relevant
  7. Don't vent anger over email
  8. For mail that needs follow-up, cc yourself; when you need to check progress, resend the original with something like "How is this going?"
  9. Forward to yourself with relevant labels and likely search terms (descriptive keywords) to help your future self find it

Bioinfomatics Data Skills Cheatsheets

Posted on 2016-10-08   |   Categorized in Cheetsheets

Code should be readable, broken down into small contained components (modular), and reusable (so you’re not rewriting code to do the same tasks over and over again).

Testing Code Strategy:

  • How many times is this code called by other code?
  • If this code were wrong, how detrimental to the final results would it be?
  • How noticeable would an error be if one occurred?

It’s important to never assume a dataset is high quality. Rather, data’s quality should be proved through exploratory data analysis (known as EDA). EDA is not complex or time consuming, and will make your research much more robust to lurking surprises in large datasets.

Make Figures and Statistics the Results of Scripts

It’s important to always use relative paths (e.g., ../data/stats/qual.txt) rather than absolute paths (e.g., /home/vinceb/projects/zmays-snps/data/stats/qual.txt).

Document/Readme in project’s main directories

  • methods and workflows (command-line)
  • origin of all data
  • when you downloaded the data / the data version / how you downloaded it
  • software version

Leverage directories to help stay organized.

Shell Expansion Tips

$ echo dog-{gone,bowl,bark}

$ ls
dog-gone dog-bowl dog-bark

$ mkdir -p zmays-snps/{data/seqs,scripts,analysis}

$ touch seqs/zmays{A,B,C}_R{1,2}.fastq

$ ls seqs/
zmaysA_R1.fastq zmaysB_R1.fastq zmaysC_R1.fastq zmaysA_R2.fastq zmaysB_R2.fastq zmaysC_R2.fastq
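The same expansions can be exercised safely in a throwaway directory (this sketch assumes a brace-expanding shell such as bash or zsh; mktemp provides the temporary path):

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp"/zmays-snps/{data/seqs,scripts,analysis}
touch "$tmp"/zmays-snps/data/seqs/zmays{A,B,C}_R{1,2}.fastq
ls "$tmp"/zmays-snps/data/seqs | wc -l   # 6 files: 3 samples x 2 read pairs
rm -r "$tmp"
```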

shell wildcards

Wildcard What it matches

  • *: Zero or more characters (but ignores hidden files starting with a period).

  • ?: One character (also ignores hidden files).

  • [A-Z]: Any character in the supplied alphanumeric range (in this case, any character between A and Z); this works for any alphanumeric character range (e.g., [0-9] matches any character between 0 and 9).

  • It's best to be as restrictive as possible with wildcards:
    instead of zmaysB*, use zmaysB*.fastq or zmaysB_R?.fastq (the ? only matches a single character).

$ ls zmays[AB]_R1.fastq

zmaysA_R1.fastq zmaysB_R1.fastq

$ ls zmays[A-B]_R1.fastq

zmaysA_R1.fastq zmaysB_R1.fastq

Leading Zeros and Sorting

e.g., file-0021.txt rather than file-21.txt

$ ls -l

-rw-r--r-- 1 vinceb staff 0 Feb 21 21:23 genes-001.txt

-rw-r--r-- 1 vinceb staff 0 Feb 21 21:23 genes-002.txt

[...]

-rw-r--r-- 1 vinceb staff 0 Feb 21 21:23 genes-013.txt

-rw-r--r-- 1 vinceb staff 0 Feb 21 21:23 genes-014.txt

Use Markdown for project notebooks.

Using pipelines

tee

$ program1 input.txt | tee intermediate-file.txt | program2 > results.txt

Here, program1’s standard output is both written to intermediate-file.txt and piped directly into program2’s standard input.
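A sketch of the same pattern with stand-in programs (seq plays program1, awk plays program2; the intermediate file name is temporary):

```shell
tmp=$(mktemp)
seq 4 | tee "$tmp" | awk '{ s += $1 } END { print s }'   # downstream program sees all lines
wc -l < "$tmp"    # the intermediate file also kept every line
rm "$tmp"
```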

Tmux

new session with a name

$ tmux new-session -s maize-snps

Key sequence / command Action
Control-a d Detach
Control-a c Create new window
tmux ls list all sessions
tmux new -s new create a session named "new"
tmux att -t new attach the session named "new"
tmux att -d -t new attach the session named "new", detaching any other client first

Change the default keys with .tmux.conf:

set -g history-limit 10000
# Automatically set window title
#set-window-option -g automatic-rename
#set-window-option -g xterm-keys on
set-option -g set-titles on
# Shift arrow to switch windows S shift M alt C ctrl
unbind-key -n S-Left
unbind-key -n S-Right
#bind -n C-Left previous-window
#bind -n C-Right next-window
bind -n F2 new-window
bind -n F3 previous-window
bind -n F4 next-window
bind -n F7 copy-mode
# kill window (prefix Ctrl+q)
#bind ^q killw
# display
#set -g status-utf8 on
set -g status-keys vi
set -g status-interval 1
set -g status-attr bright
set -g status-fg white
set -g status-bg black
set -g status-left-length 20
set -g status-left '#[fg=green][#[fg=red]#S#[fg=green]]#[default]'
set -g status-justify centre
set -g status-right '#[fg=green][ %m/%d %H:%M:%S ]#[default]'
setw -g window-status-current-format '#[fg=yellow](#I.#P#F#W)#[default]'
setw -g window-status-format '#I#F#W'
Lizhong Wang

Self-improvement and self-cultivation are a lifelong practice.

46 posts
4 categories
50 tags
© 2020 Lizhong Wang
Powered by Hexo
Theme - NexT.Muse