remove unmapped reads of a BAM file

Posted on 2017-06-23 in bioinformatics

Source: https://github.com/wlz0726/wlz0726.github.io

Removing unmapped reads from a BAM file

Using samtools

Keep mapped reads

samtools view -bF 4 input.bam > output.mapped.bam

Keep unmapped reads

samtools view -bf 4 input.bam > output.unmapped.bam

FLAG notes
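The numbers given to -f and -F are bitmasks over the SAM FLAG field: 4 is the "read unmapped" bit and 8 is the "mate unmapped" bit, so 12 covers both. The sketch below decodes a decimal FLAG into its set bits using plain shell arithmetic. The function name `decode_flag` and the lowercase labels are illustrative choices, not samtools output; the bit values follow the SAM specification. Where samtools is installed, `samtools flags 12` gives a similar answer.

```shell
# Print the names of the SAM FLAG bits that are set in a decimal FLAG value.
# The function body runs in a subshell so its variables do not leak.
decode_flag() (
  flag="$1"
  names=""
  for entry in \
    1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
    16:reverse 32:mate_reverse 64:read1 128:read2 \
    256:secondary 512:qc_fail 1024:duplicate 2048:supplementary
  do
    bit="${entry%%:*}"     # numeric value of the bit
    name="${entry#*:}"     # illustrative label for the bit
    if [ $(( flag & bit )) -ne 0 ]; then
      names="${names:+$names,}$name"
    fi
  done
  echo "${names:-none}"
)

decode_flag 4     # unmapped
decode_flag 12    # unmapped,mate_unmapped
decode_flag 99    # paired,proper_pair,mate_reverse,read1
```

Reading the mask this way makes it easier to check that a filter such as -F 4 excludes exactly the reads intended.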

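Paired-end data

# Keep properly paired reads (FLAG bit 2 set)
samtools view -bf 2 input.bam > output.paired.bam

# Keep reads with at least one mapped end: -G 12 drops only records
# where both the read (4) and its mate (8) are unmapped
samtools view -b -G 12 input.bam > output.at_least_one.bam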

Sorting and indexing

# Sort by coordinate
samtools sort -o output.mapped.sorted.bam output.mapped.bam

# Index
samtools index output.mapped.sorted.bam

Statistics

# Alignment statistics
samtools flagstat input.bam

This document was automatically archived from a GitHub blog.

Random subsample from a BAM file

Posted on 2017-06-21 in tools

Random subsampling of a BAM file

Using samtools

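# Sample 10%
samtools view -s 0.1 -b input.bam > subsample.bam

# Sample 50%
samtools view -s 0.5 -b input.bam > subsample.bam

# Set the seed (reproducible): seed 42, fraction 0.1
samtools view -s 42.1 -b input.bam > subsample.bam

Parameter format

-s seed.fraction

seed: random seed, the integer part (0 when omitted, as in -s 0.1)
fraction: fraction of reads to keep, the decimal part (0-1)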
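Because the seed and the fraction share one number, it is easy to mistype the argument. The following is a minimal sketch of a helper that assembles it from two separate values; `subsample_arg` is a hypothetical name, and the input checks are assumptions about what counts as a sensible seed and fraction.

```shell
# Build the value for `samtools view -s` from an integer seed and a fraction.
# The function body runs in a subshell, so `exit` only ends the function.
subsample_arg() (
  seed="$1"
  fraction="$2"
  case "$seed" in
    ''|*[!0-9]*) echo "seed must be a non-negative integer" >&2; exit 1 ;;
  esac
  case "$fraction" in
    0.*[!0-9]*|0.) echo "fraction must look like 0.25" >&2; exit 1 ;;
    0.*) ;;
    *) echo "fraction must look like 0.25" >&2; exit 1 ;;
  esac
  # Drop the leading 0 of the fraction and append the rest to the seed.
  echo "${seed}${fraction#0}"
)

subsample_arg 42 0.1    # prints 42.1
subsample_arg 7 0.25    # prints 7.25
```

It could then be used as `samtools view -s "$(subsample_arg 42 0.1)" -b input.bam > subsample.bam`.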

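Using seqtk

# Sample 1 million reads and realign them (single-end data only:
# seqtk samples records independently, so an interleaved paired stream would lose its pairing)
samtools fastq input.bam | \
  seqtk sample -s100 - 1000000 | \
  bwa mem reference.fasta - | \
  samtools sort -o subsample.bam -

Notes

Paired-end data: keep mates together. samtools view -s decides per read name, so both mates are kept or dropped together.
Seed: the same seed on the same input gives the same result.
Fraction: choose it according to the depth you need.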

tmux

Posted on 2017-06-19 in tools

tmux, a terminal multiplexer

Installation

Install libevent 2.x first (tmux depends on it):

# macOS
brew install libevent

# Ubuntu/Debian
sudo apt-get install libevent-dev

Install tmux:

# macOS
brew install tmux

# Ubuntu/Debian
sudo apt-get install tmux

# Build from source
wget https://github.com/tmux/tmux/releases/download/2.5/tmux-2.5.tar.gz
tar -xzf tmux-2.5.tar.gz
cd tmux-2.5
./configure && make && sudo make install

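Basic commands

# Create a new session
tmux new -s session_name

# List sessions
tmux ls

# Attach to a session
tmux attach -t session_name

# Detach from the session (a key binding, pressed inside tmux)
Ctrl+b d

# Kill a session
tmux kill-session -t session_name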
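The create and attach commands above are often combined into a single "attach if it exists, otherwise create" step. A minimal sketch follows; `tmux_session` is a hypothetical helper name, while `has-session`, `new-session -d -s` and `attach-session -t` are standard tmux subcommands.

```shell
# Attach to a named tmux session, creating it first if it does not exist.
# The function body runs in a subshell so its variables do not leak.
tmux_session() (
  name="${1:?usage: tmux_session NAME}"
  # has-session exits non-zero when no session matches the name.
  if ! tmux has-session -t "$name" 2>/dev/null; then
    # -d creates the session detached, so the attach below works in both cases.
    tmux new-session -d -s "$name"
  fi
  tmux attach-session -t "$name"
)

# Usage (not run here, since it would take over the terminal):
#   tmux_session work
```

Defining this in ~/.bashrc or ~/.zshrc on a remote server means one command brings back the same session after every reconnect.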

Window management

Pane management

Configuration file ~/.tmux.conf

# Scrollback history limit (lines)
set -g history-limit 10000

# Enable the mouse
set -g mouse on

# Status bar
set -g status-bg black
set -g status-fg white
set -g status-left '[#S] '
set -g status-right ' %Y-%m-%d %H:%M '

# Window status
setw -g window-status-current-format '#I:#W'

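Common scenarios

1. Remote work

Start a tmux session on the remote server; the jobs inside it keep running even if the connection drops.

2. Multitasking

One window for writing code, one for running tests, one for watching logs.

3. Pair programming

Share a tmux session so that several people can work in the same terminal at once.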

Setting up zsh and Oh-my-zsh without root

Posted on 2017-06-19 in tools

Installing zsh without root

1. Download the source

wget https://sourceforge.net/projects/zsh/files/zsh/5.8/zsh-5.8.tar.xz
tar xf zsh-5.8.tar.xz
cd zsh-5.8

2. Build and install into your home directory

./configure --prefix=$HOME/.local
make
make install

3. Add it to PATH

echo 'export PATH=$HOME/.local/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
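Appending the export line more than once, or sourcing the file repeatedly, stacks duplicate entries in PATH. One way to avoid that is a small guard function placed in ~/.bashrc instead of the plain export. This is a sketch, and `path_prepend` is a hypothetical name.

```shell
# Prepend a directory to PATH only if it is not already listed.
path_prepend() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already present: leave PATH unchanged
    *) PATH="$1:$PATH" ;;     # otherwise put it at the front
  esac
  export PATH
}

path_prepend "$HOME/.local/bin"
```

Wrapping PATH in colons before matching means the first and last entries are matched the same way as the ones in the middle.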

Installing Oh-my-zsh

# Set the Oh-my-zsh directory
export ZSH=$HOME/.oh-my-zsh

# Clone Oh-my-zsh
git clone https://github.com/ohmyzsh/ohmyzsh.git $ZSH

# Copy the template configuration
cp $ZSH/templates/zshrc.zsh-template $HOME/.zshrc

Editing ~/.zshrc

# Theme
ZSH_THEME="robbyrussell"

# Plugins
plugins=(git zsh-autosuggestions zsh-syntax-highlighting)

# Oh-my-zsh directory
export ZSH=$HOME/.oh-my-zsh

# Load Oh-my-zsh
source $ZSH/oh-my-zsh.sh

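Installing plugins

zsh-autosuggestions

git clone https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions

zsh-syntax-highlighting

git clone https://github.com/zsh-users/zsh-syntax-highlighting.git ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-syntax-highlighting

Setting the default shell

If you have sudo access:

# List the shells that chsh accepts
cat /etc/shells

# Register the new zsh, since chsh only accepts shells listed in /etc/shells
echo "$HOME/.local/bin/zsh" | sudo tee -a /etc/shells

# Set the default shell
chsh -s $HOME/.local/bin/zsh

Without sudo access, start it by hand each time:

$HOME/.local/bin/zsh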

Do Phasing with SHAPEIT

Posted on 2017-06-06 in bioinformatics

Haplotype phasing with SHAPEIT

Introduction

SHAPEIT phases genotype data: it turns unphased genotypes into haplotypes, determining which alleles lie together on the same chromosome copy.

Installation

Method 1: build from source

git clone https://github.com/odelaneau/shapeit4.git
cd shapeit4
make

Method 2: install with conda

conda install -c bioconda shapeit4

Basic usage

shapeit4 --input input.vcf.gz \
         --map genetic_map.txt \
         --region 1 \
         --output output.bcf \
         --thread 4 \
         --seed 12345

Parameters

Genetic map files

Available from:

1000 Genomes: ftp://ftp.1000genomes.ebi.ac.uk/
HapMap: https://www.hapmap.org/

Output

Phased VCF/BCF file
Haplotype probability information
Phasing quality assessment

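Quality assessment

# Compute the switch error rate against known haplotypes
# (shapeit4 has no built-in check; vcftools is used here instead)
vcftools --bcf output.bcf \
         --gzdiff reference.vcf.gz \
         --diff-switch-error \
         --out phasing_check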

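Applications

Haplotype analysis: haplotype-based analyses
IBD detection: haplotype-based IBD detection
Imputation: improves genotype imputation accuracy
Selection signals: detecting signatures of natural selection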

LD-prune with plink

Posted on 2017-05-27 in bioinformatics

LD pruning with PLINK

Purpose

Remove highly correlated SNPs, for:

PCA
Population structure analysis
Reducing the multiple-testing burden
Faster computation

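Commands

# LD pruning
plink --bfile input \
      --indep-pairwise 50 5 0.2 \
      --out pruned

# Extract the SNPs kept by pruning
plink --bfile input \
      --extract pruned.prune.in \
      --make-bed \
      --out input_pruned

Parameters

--indep-pairwise <window size> <step size> <r² threshold>

With 50 5 0.2, PLINK looks at a window of 50 SNPs, removes one SNP from every pair with r² above 0.2, then shifts the window by 5 SNPs and repeats.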

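Tuning the parameters

Stricter (keeps fewer SNPs)

plink --bfile input --indep-pairwise 50 5 0.1 --out pruned_strict

Looser (keeps more SNPs)

plink --bfile input --indep-pairwise 50 5 0.5 --out pruned_loose

Result files

pruned.prune.in: SNPs kept after pruning
pruned.prune.out: SNPs removed by pruning

Verification

# Count the SNPs kept
wc -l pruned.prune.in

# Inspect the remaining LD structure
plink --bfile input_pruned --r2 --ld-window 99999 --ld-window-kb 1000 --ld-window-r2 0 --out input_pruned_ld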
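The SNP count is easier to judge next to the number removed. The sketch below reports both, plus the percentage kept, assuming the .prune.in and .prune.out files written by --indep-pairwise hold one variant ID per line. `prune_summary` is a hypothetical helper name.

```shell
# Report how many variants --indep-pairwise kept and removed for an output prefix.
# The function body runs in a subshell so its variables do not leak.
prune_summary() (
  prefix="$1"
  kept=$(wc -l < "$prefix.prune.in")
  removed=$(wc -l < "$prefix.prune.out")
  total=$(( kept + removed ))
  # awk does the floating-point percentage, which shell arithmetic cannot.
  awk -v k="$kept" -v r="$removed" -v t="$total" 'BEGIN {
    pct = (t > 0) ? 100 * k / t : 0
    printf "kept=%d removed=%d kept_pct=%.1f\n", k, r, pct
  }'
)

# Usage, with the prefix passed to --out:
#   prune_summary pruned
```

Running it for the strict and loose prefixes side by side shows how strongly the r² threshold changes the number of retained SNPs.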

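Applications

PCA: removes the influence of LD
Population structure: preprocessing before ADMIXTURE
GWAS: reduces the multiple-testing burden
Phylogenetics: reduces bias from linked sites when building trees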

Ts-Tv

Posted on 2017-05-27 in bioinformatics

Transition/transversion ratio (Ts/Tv)

Definition

Expected values

Calculation

Using vcftools

vcftools --vcf input.vcf --TsTv-summary

Using GATK

gatk VariantsToTable \
  -V input.vcf \
  -F CHROM -F POS -F REF -F ALT -F TYPE \
  -O variants.table

# Then compute Ts/Tv from the REF and ALT columns
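Transitions are A<->G and C<->T changes; every other single-base substitution is a transversion. The "then compute Ts/Tv" step could be sketched as below, assuming a tab-separated table with a header row that includes REF and ALT columns (for example one exported by VariantsToTable with -F REF -F ALT). `tstv` is a hypothetical function name, and skipping indels and multi-allelic rows is a simplifying assumption.

```shell
# Compute Ts/Tv from a tab-separated table (with a header row) read on stdin.
tstv() {
  awk -F '\t' '
    NR == 1 {
      # Locate the REF and ALT columns by header name.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("REF" in col) || !("ALT" in col)) { bad = 1; exit }
      next
    }
    {
      ref = toupper($col["REF"]); alt = toupper($col["ALT"])
      # Only rows with a single-base REF and a single-base ALT are counted.
      if (length(ref) != 1 || length(alt) != 1) next
      pair = ref alt
      if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ts++
      else if (index("ACGT", ref) && index("ACGT", alt) && ref != alt) tv++
    }
    END {
      if (bad) { print "REF and ALT columns are required" > "/dev/stderr"; exit 1 }
      printf "Ts=%d Tv=%d", ts, tv
      if (tv > 0) printf " Ts/Tv=%.2f", ts / tv
      printf "\n"
    }'
}

# Small inline demo: one transition (A>G) and one transversion (C>A).
printf 'REF\tALT\nA\tG\nC\tA\n' | tstv    # prints Ts=1 Tv=1 Ts/Tv=1.00
```

On a real table it would be run as `tstv < variants.table`.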

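Interpretation

Applications

Quality control: assess the quality of variant calls
Filter tuning: adjust filtering thresholds
Method comparison: compare how different pipelines perform

Notes

Reference database: use an appropriate database of known variants
Sequencing depth: depth affects the Ts/Tv estimate
Population differences: the ratio can differ between populations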

Publications

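Posted on 2017-05-02 in tools

List of publications

Peer-reviewed papers

(published papers are recorded here)

Conference abstracts

(conference presentations are recorded here)

Preprints

(bioRxiv and other preprints are recorded here)

Research interests

Population genetics
Genomics
Bioinformatics method development

Collaboration

Happy to discuss and collaborate!

Email: [contact email]
GitHub: https://github.com/wlz0726
ResearchGate: [profile page]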

bwa-aln

Posted on 2017-04-28 in bioinformatics

The BWA aligner

Three algorithms

BWA-backtrack (bwa aln), BWA-SW (bwa bwasw), and BWA-MEM (bwa mem)

Installation

# macOS
brew install bwa

# Ubuntu/Debian
sudo apt-get install bwa

# Build from source
wget https://github.com/lh3/bwa/releases/download/v0.7.17/bwa-0.7.17.tar.bz2
tar -xjf bwa-0.7.17.tar.bz2
cd bwa-0.7.17
make

Alignment workflow

1. Build the index

bwa index reference.fasta

Output files:

.amb - positions of ambiguous (non-ACGT) bases
.ann - sequence names and lengths
.bwt - Burrows-Wheeler transform
.pac - packed sequence
.sa - suffix array

2. Align (bwa mem)

# Single-end
bwa mem -t 8 reference.fasta reads.fastq > output.sam

# Paired-end
bwa mem -t 8 reference.fasta read1.fastq read2.fastq > output.sam

# Paired-end with read group information
bwa mem -t 8 -M \
  -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
  reference.fasta read1.fastq read2.fastq > output.sam

3. Convert SAM to BAM and sort

samtools view -b output.sam | samtools sort -o output.sorted.bam -

Common options

Read group format

@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1\tPU:unit1
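bwa mem -R expects the fields to be separated by a literal backslash followed by t, not by real tab characters, which is easy to get wrong when the string is assembled in a script. A minimal sketch of a builder follows; `read_group` is a hypothetical name, and defaulting the platform to ILLUMINA is an assumption taken from the example above.

```shell
# Build a read-group string for `bwa mem -R`.
# Fields are joined by a literal backslash-t, which bwa converts to tabs itself.
# The function body runs in a subshell so its variables do not leak.
read_group() (
  id="$1"; sample="$2"; platform="${3:-ILLUMINA}"; library="$4"; unit="$5"
  rg="@RG\\tID:${id}\\tSM:${sample}\\tPL:${platform}"
  if [ -n "$library" ]; then rg="${rg}\\tLB:${library}"; fi
  if [ -n "$unit" ]; then rg="${rg}\\tPU:${unit}"; fi
  # printf with %s prints the backslashes as-is; echo may interpret them.
  printf '%s\n' "$rg"
)

read_group sample1 sample1 ILLUMINA lib1 unit1
read_group sample2 sample2
```

The result can be passed straight through, for example `bwa mem -R "$(read_group sample1 sample1)" reference.fasta read1.fastq read2.fastq`.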

CNVnator Based Analysis

Posted on 2017-04-27 in bioinformatics

Copy number variation analysis with CNVnator

Introduction

CNVnator is a read-depth-based CNV caller, suited to:

Genome-wide CNV detection
Large copy number variants
Population-level CNV analysis

Installation

# Clone the repository
git clone https://github.com/abyzovlab/CNVnator.git
cd CNVnator/src

# Build (requires the ROOT library)
make

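Analysis workflow

1. Build the ROOT file

cnvnator -root output.root -tree input.bam

2. Build the histogram and compute statistics

# -d names a directory of per-chromosome reference FASTA files (genome_dir is a placeholder)
cnvnator -root output.root -his 100 -d genome_dir
cnvnator -root output.root -stat 100

Choosing the bin size:

100 bp - high resolution, suits small CNVs
500 bp - medium resolution
1 kb - suits large CNVs

3. Partition

cnvnator -root output.root -partition 100

4. Call CNVs

cnvnator -root output.root -call 100 > output.calls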

Interpreting the results

Output columns:

CNV_type coordinates CNV_size normalized_RD e-val1 e-val2 e-val3 e-val4 q0

Suggested filters

# Keep high-quality CNVs
- e-val1 < 0.05
- q0 < 0.5
- size > 1 kb
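The thresholds above can be applied with a one-line awk filter. This sketch assumes the calls follow CNVnator's native column order (CNV_type, coordinates, CNV_size, normalized_RD, e-val1 to e-val4, q0), with e-val1 in column 5 standing in for the p-value and q0 in column 9. `filter_cnv` is a hypothetical name, and treating a negative q0 as undefined is an assumption.

```shell
# Keep calls longer than 1 kb with e-val1 < 0.05 and 0 <= q0 < 0.5 (reads stdin).
filter_cnv() {
  # Adding 0 forces a numeric comparison even for values such as 1e-10.
  awk '($3 + 0) > 1000 && ($5 + 0) < 0.05 && ($9 + 0) >= 0 && ($9 + 0) < 0.5'
}

# Small inline demo: only the first call passes; the second is shorter than 1 kb.
printf 'deletion\tchr1:1000-5000\t4000\t0.5\t1e-10\t0.001\t1e-09\t0.002\t0.01\nduplication\tchr1:9000-9500\t500\t1.6\t1e-10\t0.001\t1e-09\t0.002\t0\n' | filter_cnv
```

On real output it would be run as `filter_cnv < calls.txt > calls.filtered.txt`, where both file names are placeholders.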

Visualization

# View read depth interactively; at the prompt, enter a region such as chr1:1000000-2000000
cnvnator -root output.root -view 100

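Notes

GC correction: CNVnator applies GC correction automatically
Bin size: choose it according to the goal of the study
Control sample: using a control is recommended to improve accuracy
Validation: important CNVs should be confirmed with another method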
