Bioinformatics Data Skills Cheatsheets

Code should be readable, broken down into small contained components (modular), and reusable (so you’re not rewriting code to do the same tasks over and over again).

Testing Code Strategy:

  • How many times is this code called by other code?
  • If this code were wrong, how detrimental to the final results would it be?
  • How noticeable would an error be if one occurred?

It’s important to never assume a dataset is high quality. Rather, data’s quality should be proved through exploratory data analysis (known as EDA). EDA is not complex or time consuming, and will make your research much more robust to lurking surprises in large datasets.
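
A first pass at EDA can be a handful of shell commands. The sketch below is only an illustration: it builds a tiny made-up tab-delimited file (the name example.tsv and its three columns are assumptions, not a real dataset), then counts records, checks that every row has the same number of columns, and tallies records per chromosome.

```shell
# Scratch directory so nothing in a real project is touched.
eda_dir=$(mktemp -d)

# Hypothetical toy dataset: chromosome, position, quality.
printf 'chr1\t100\t30\nchr1\t200\t12\nchr2\t150\t40\n' > "$eda_dir/example.tsv"

# How many records are there?
n_rows=$(wc -l < "$eda_dir/example.tsv" | tr -d ' ')

# Does every row have the same number of columns? One distinct value means yes.
n_col_counts=$(awk -F'\t' '{ print NF }' "$eda_dir/example.tsv" | sort -u | wc -l | tr -d ' ')

# How many records fall on each chromosome?
chrom_counts=$(cut -f1 "$eda_dir/example.tsv" | sort | uniq -c \
    | awk '{ print $2 "=" $1 }' | paste -s -d ' ' -)

echo "rows: $n_rows; distinct column counts: $n_col_counts; per chromosome: $chrom_counts"
```

If the distinct column count is anything other than 1, some rows are malformed and deserve a closer look before any analysis.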

Make Figures and Statistics the Results of Scripts

It’s important to always use relative paths (e.g., ../data/stats/qual.txt) rather than absolute paths (e.g., /home/vinceb/projects/zmays-snps/data/stats/qual.txt).
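
As a small sketch of why relative paths travel well, the commands below build a throwaway project in a scratch directory (the layout and the file contents are invented for illustration) and read the data from analysis/ using a relative path. The same command keeps working wherever the project directory ends up.

```shell
# Hypothetical project layout, created in a scratch directory.
project=$(mktemp -d)
mkdir -p "$project/data/stats" "$project/analysis"
echo "qual 30" > "$project/data/stats/qual.txt"

# From analysis/, reach the data with a relative path; no home directory
# or user name is baked into the command.
cd "$project/analysis" || exit 1
contents=$(cat ../data/stats/qual.txt)
echo "$contents"
```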

Document/Readme in project’s main directories

  • methods and workflows (command-line)
  • origin of all data
  • when and how you downloaded the data, and which version of the data it is
  • software version
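
One way to keep these notes from being forgotten is to write them at download time. The sketch below appends a record to a README in a scratch directory standing in for data/. The URL is a placeholder, nothing is actually downloaded, and the shell version stands in for whatever software versions matter to the project.

```shell
# Scratch directory standing in for a project's data/ directory.
data_dir=$(mktemp -d)

# Placeholder URL (an assumption for illustration only).
url="https://example.org/placeholder/genome.fa.gz"

# Record what was downloaded, when, how, and with which tool version.
{
    echo "Source: $url"
    echo "Downloaded: $(date +%Y-%m-%d)"
    echo "Command: curl -O $url"
    echo "Shell: bash ${BASH_VERSION:-unknown}"
} >> "$data_dir/README"

cat "$data_dir/README"
```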

Leverage directories to help stay organized.

Shell Expansion Tips

$ echo dog-{gone,bowl,bark}
dog-gone dog-bowl dog-bark

$ mkdir -p zmays-snps/{data/seqs,scripts,analysis}

$ touch seqs/zmays{A,B,C}_R{1,2}.fastq

$ ls seqs/
zmaysA_R1.fastq zmaysB_R1.fastq zmaysC_R1.fastq
zmaysA_R2.fastq zmaysB_R2.fastq zmaysC_R2.fastq

shell wildcards

Wildcard What it matches

  • *: Zero or more characters (but ignores hidden files starting with a period).

  • ?: One character (also ignores hidden files).

  • [A-Z]: Any character between the supplied alphanumeric range (in this case, any character between A and Z); this works for any alphanumeric character range (e.g., [0-9] matches any character between 0 and 9).

  • Best to be as restrictive as possible with wildcards.
    Instead of zmaysB*, use zmaysB*fastq or zmaysB_R?.fastq (the ? only matches a single character).

$ ls zmays[AB]_R1.fastq

zmaysA_R1.fastq zmaysB_R1.fastq

$ ls zmays[A-B]_R1.fastq

zmaysA_R1.fastq zmaysB_R1.fastq
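
To see why restrictive wildcards matter, the sketch below creates three empty files in a scratch directory (zmaysB_notes.txt is a made-up stray file) and compares how many names a loose pattern and a strict pattern match.

```shell
# Scratch directory with two FASTQ files and one unrelated file.
wild_dir=$(mktemp -d)
cd "$wild_dir" || exit 1
touch zmaysB_R1.fastq zmaysB_R2.fastq zmaysB_notes.txt

# Loose pattern: picks up the notes file as well.
set -- zmaysB*
n_loose=$#

# Strict pattern: ? matches exactly one character, so only the reads match.
set -- zmaysB_R?.fastq
n_strict=$#
strict_files="$*"

echo "loose matched $n_loose files, strict matched $n_strict: $strict_files"
```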

Leading Zeros and Sorting

e.g., file-0021.txt rather than file-21.txt

$ ls -l

-rw-r--r-- 1 vinceb staff 0 Feb 21 21:23 genes-001.txt
-rw-r--r-- 1 vinceb staff 0 Feb 21 21:23 genes-002.txt
[…]
-rw-r--r-- 1 vinceb staff 0 Feb 21 21:23 genes-013.txt
-rw-r--r-- 1 vinceb staff 0 Feb 21 21:23 genes-014.txt
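
Zero-padded names like these can be generated with printf rather than typed by hand. The following is a minimal sketch, assuming three digits are enough for the number of files; it works in a scratch directory and checks that the plain ls order matches numeric order.

```shell
# Scratch directory so the example leaves the project untouched.
pad_dir=$(mktemp -d)

# Create genes-001.txt ... genes-014.txt; %03d pads each number to three digits.
for i in $(seq 1 14); do
    touch "$pad_dir/$(printf 'genes-%03d.txt' "$i")"
done

# With padding, lexicographic order and numeric order agree.
first=$(ls "$pad_dir" | head -n 1)
last=$(ls "$pad_dir" | tail -n 1)
echo "$first ... $last"
```

Without the padding, genes-10.txt would sort before genes-2.txt.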

use markdown to

Using pipelines

tee

$ program1 input.txt | tee intermediate-file.txt | program2 > results.txt

Here, program1’s standard output is both written to intermediate-file.txt and piped directly into program2’s standard input.
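
The same pattern can be tried with ordinary tools. In this sketch sort stands in for program1 and uniq -c for program2; the input lines are made up, and everything is written to a scratch directory.

```shell
# Scratch directory and a small made-up input file.
tee_dir=$(mktemp -d)
printf 'b\na\nb\n' > "$tee_dir/input.txt"

# tee saves the sorted stream to a file and passes it on unchanged to uniq -c.
sort "$tee_dir/input.txt" \
    | tee "$tee_dir/intermediate-file.txt" \
    | uniq -c > "$tee_dir/results.txt"

cat "$tee_dir/intermediate-file.txt"   # the sorted lines captured by tee
cat "$tee_dir/results.txt"             # the counts produced downstream
```

Keeping the intermediate file makes it possible to debug the second step without rerunning the first.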

Tmux

new session with a name

$ tmux new-session -s maize-snps

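Key sequence / command    Action
Control-a d               Detach
Control-a c               Create new window
tmux ls                   List all sessions
tmux new -s new           Create a session named “new”
tmux att -t new           Attach to the session named “new”
tmux att -d -t new        Attach to the session named “new”, detaching it from other clients first

(Control-a assumes the prefix key has been remapped; tmux’s default prefix is Control-b.)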

Change default key bindings with .tmux.conf

set -g history-limit 10000
# Automatically set window title
#set-window-option -g automatic-renam
#set-window-option -g xterm-keys on
set-option -g set-titles on
# Shift+arrow to switch windows (key prefixes: S = Shift, M = Alt, C = Ctrl)
unbind-key -n S-Left
unbind-key -n S-Right
#bind -n C-Left previous-window
#bind -n C-Right next-window
bind -n F2 new-window
bind -n F3 previous-window
bind -n F4 next-window
bind -n F7 copy-mode
# kill window (prefix Ctrl+q)
#bind ^q killw
# display
#set -g status-utf8 on
set -g status-keys vi
set -g status-interval 1
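# status-attr was removed in tmux 2.9; status-style (tmux 1.9 or later)
# sets the attribute, foreground and background in one option
set -g status-style fg=white,bg=black,bright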
set -g status-left-length 20
set -g status-left '#[fg=green][#[fg=red]#S#[fg=green]]#[default]'
set -g status-justify centre
set -g status-right '#[fg=green][ %m/%d %H:%M:%S ]#[default]'
setw -g window-status-current-format '#[fg=yellow](#I.#P#F#W)#[default]'
setw -g window-status-format '#I#F#W'