Bioinformatics Bootcamp 2025

Schedule

Date	Time	Activity
Mon 03/11/2024	10:00 to 16:00	Unix-like systems, bash, connecting to SCWales
Tue 04/11/2024	10:00 to 16:00	More bash, SCWales, using SCWales and slurm
Wed 05/11/2024	10:00 to 16:00	Illumina data, BEARCAVE, data processing
Thu 06/11/2024	10:00 to 16:00	ANGSD, covariance and distance matrices, heterozygosity, intro to R
Fri 07/11/2024	10:00 to 16:00	Maps, PCA’s, NJ trees, Manhattan plots and Rmarkdown

Prequisites

a computer with internet connection
an account on super-computing Wales
basic knowledge of DNA and genome structure

Resources

Supercomputing Wales shared project

please request access to: scw2141 Bioinformatics training

Introductions

Axel Barlow, Lecturer in Zoology at Bangor University

Self taught post-PhD
No proper bioinformatics/computer science background
Knowledge of bash and R.

Johanna Paijmans, Lecturer in Zoology at Bangor University

Self taught (rebelling against a Linux-expert dad)
No proper bioinformatics/computer science background
Knowledge of bash and R.

Introductions

Unix-like systems and bash

Unix

Operating system developed in 1969 by Bell Labs
Unix philosophy: operating system should provide a set of simple tools, each of which performs a limited, well-defined function.
Modular (small programs strung together)
Inter-process communication: “pipes”
Separate normal and “super” users (sudo)
Hierarchical filesystem
A shell for executing and combining tools
The basis of many subsequent OS

Unix

Unix-like systems

Mac OS

Released 1984
Developed from NeXTSTEP, which is developed from Unix
Proprietary, Apple hardware only

Linux

21 year old Linus Torvalds coded a Unix inspired OS in 1991
Free and open source
The core linux kernel available under many distributions: Ubuntu, Mint, Arch, Android, Tesla, etc.

MS-DOS (Windows)

Developed by Microsoft, released 1981
Main OS for IBM PCs in 1980s
GUI introduced with Windows, released 1985
Largest market share (70% of PCs)
Some bioinformatics possible (e.g. R typically via Rstudio)
No bash
Encoding of text files is different
Majority of bioinfomatics software unsupported
Windows subsystem for linux https://learn.microsoft.com/en-us/windows/wsl/install
Seamless transfer between DOS and Unix not yet possible

Windows Subsystem for Linux

Released in 2016
Allows installation of Linux distributions (e.g. Ubuntu) within Windows 10/11
Provides a bash terminal and Linux environment
Transfer between Windows and Linux filesystems possible
Essentially WSL is Microsoft’s admittance to Linux superiority

OS comparison

	Windows	Mac	Linux
standard PC functions	yes	yes	yes
cost	yes	yes	free
hardware choice	yes	no	yes
bioinformatics	no	yes	yes
HPC	no	no	yes
open source	no	no	yes
active community	no	no	yes
games	yes	no	no

Terminal emulators and Bash

A shell allows users to execute OS tools
Accessed using a terminal
Unix terminal came with the Bourne shell (sh), developed by Steven Bourne in 1979
In 1979 Brian Fox in improved version: the Bourne again shell (bash)

Most Unix-like OS (incl Mac) use bash or something like it
- execute standard OS functions and installed programs
- access filesystem
- supports bash scripts
- pipes, auto-completion, loops, wildcards, etc.

Supercomputing Wales

Hawk

Computational nodes
- >134x Intel nodes with 40 cpus + 192 Gb RAM each
- >64x AMD nodes with 64 cpus + 256 Gb RAM each
- >26x Highmem nodes with 384 Gb RAM each
- >28x Nvidia GPU nodes
Storage space
- 1192TB (usable) scratch space
- 420TB of home directory space

Support
- Online documentation https://www.supercomputing.wales/
- eResearch manager Ade Fewings (a.fewings@bangor.ac.uk)

Hawk usage

Command-line only (via ssh)
Access through SSH jump host
File transfer using scp or sftp
Job submission using slurm job scheduler
Software installed as modules
No super user access
Space & quota
- Home directory (small, persistent): 50Gb
- Scratch space (bigger, temporary): SCW projects, few Tb
- Max 10 active jobs, max 30 in the queue
- Max 3 day runtime

Secure shell (SSH)

SSH via jump host

Getting logged on…

Connecting to the jump host (with MFA)

ssh you25usr@ssh.bangor.ac.uk

Note: most UNIX systems do not show anything when you’re typing your password!

If successful, connecting to Hawk

ssh b.you25usr@hawklogin.cf.ac.uk

Raise your hand if you are having issues 🙌

AI

A brief note on AI in Bioinformatics

AI tools are becoming increasingly common in bioinformatics (and everywhere else)
Different models are better or worse for different tasks (e.g. I find ChatGPT 4 is not great at programming)
But it doesn’t know everything, and it can’t do everything
You need knowledge of coding, assessing output, error messages, structure, logic, etc
Also the standard AI considerations: ethics, reproducibility, accuracy, privacy…

A brief note on AI in Bioinformatics

Filesystem

/ [root] is uppermost level of filesystem
Everything is contained in /
Directories exist within the filesystem, they can contain files and other directories
We specify a path through this hierarchy using forward-slashes
Our current directory is called the working directory

/home/b.xlb21brx/
/scratch/b.xlb21brx/

We can navigate through the filesystem (change working directory)
Or we can specify the patch to directories or files remotely

Copying files via SCP

Slurm

Simple Linux Utility for Resource Management: slurm
Free open source job scheduler for linux systems
Used on 60% of World’s top 500 computers
Assigns user jobs to computer resources
Submit to queue
Short, low-resource jobs move faster through the queue
Useful tools for scheduling, reporting, etc

Illumina data

Illumina sequencing platforms

Illumina summary

The current market leader
Massive output
Many applications (genome resequencing, RADseq, transcriptomes, metabarcoding)
Cheap (£10 per Gb)
Major limitation is the read length

Data output

Platform	Million reads	Read length	Gb data	Genome coverage
iSeq	4	2 x 150 bp	1.2	0.4x
MiniSeq	25	2 x 150 bp	7.5	2.5x
MiSeq	100	2 x 500 bp	30	10x
Nextseq 550	400	2 x 150 bp	120	40x
NextSeq 1000/2000	1800	2 x 300 bp	540	180x
NovaSeq 6000	20000	2 x 250 bp	3000	1000x
NovaSeq X	52000	2 x 150 bp	8000	2667x

Sequencing by synthesis

Sample preparation
Bind DNA to flowcell
Generate clusters
Sequencing by synthesis
Image analysis

Sample preparation

*Indexes allow multiple samples to be sequenced at the same time

Flow cell

Cluster generation

Sequencing by synthesis

Data analysis (in the machine)

What do we sequence?

[Not an exhaustive list]

Whole genome sequencing (pure DNA sample from a single individual)
Reduced representation genome data (RADseq, targeted SNPs, single individual)
Poolseq (multiple individuals)
Transcriptome (RNA sample from single tissue/individual)
Metabarcoding (PCR amplicon, multiple individuals/species)
Metagenomics (whole genomes, multiple individuals/species)

Whole genome sequencing

Short reads from a single individual can be mapped to a reference genome assembly

Whole genome sequencing

Our project: adder population genomics

Adders (Vipera berus berus) widespread across northern Eurasia
Threatened or near-threatened in UK
Illumina PE data from 47 individuals
Plus one outgroup (Vipera berus bosniensis)
12 locations
Our tasks
- Data format
- Adapter trimming and read merging
- Map to reference genome: chr7

Adder locations

      +-------------+---------------+
      |   Sample    |   Locality    |
      +-------------+---------------+
      | adder01-04  | Leeds         |
      | adder05-08  | Wensleydale   |
      | adder09-12  | Manchester    |
      | adder13-16  | Caerphilly    |
      | adder17-20  | Gouda         |
      | adder21-24  | Stockholm     |
      | adder25-27  | Cheddar       |
      | adder28-31  | Huddersfield  |
      | adder32-35  | Sheffield     |
      | adder36-39  | Leicester     |
      | adder40-43  | Nottingham    |
      | adder44-47  | Stilton       |
      | adderout    | outgroup      |
      +-------------+---------------+

Illumina data processing

.fastq file format

fastq is the standard output format for data from Illumina (and other) platforms

@A00551:758:HKTVJDSX7:4:1101:3595:6872 1:N:0:CCTGAGATGT+GGTCTAGTTG
CTGAATATGGATTTTAATTGAATCCTAAGATATTATAGCATCTTTCACTCCCTGTCCTGTGCATGTCAGA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Line 1: info on sequencer, flowcell, cluster position, indexes (sometimes)
Line 2: called bases
Line 3: a +
Line 4: quality scores on Phred scale
10 = 90% accuracy; 20 = 99% accuracy; 30 = 99.9% accuracy
Recoded as single character: F = 37; ? = 30; 5 = 20; + = 10

BEARCAVE

Nikolas Basler, Achim Klittich, Axel Barlow
An environment for organising, processing, and archiving Illumina data
BEARCAVE philosophy
- All users can access all data
- Avoid data redundancy
- All samples processed using identical software programs and parameters
- Incorporates sample metadata
- Documents results of data processing
- Easy to use wrapper scripts for programs
- Publicly available
- Safeguards in place to ensure consistency
Consequently BEARCAVE is not for everyone, and has idiosyncrasies in use

Adapter trimming and read merging

DNA fragment length distribution

DNA can be fragmented
The fragment lengths have a distribution

45 ka cave bear (Ursus kudarensis)

Effect of insert size

Adapter trimming

Overlapping reads are merged

BEARCAVE script

decompress fastqs
trim adapter seqs using Cutadapt
- 30 bp min length
- min overlap 1 bp
merge overlapping read pair using FLASH
recompress files and clean up
save appropriate log files

Expected output in /BEARCAVE2/trimdata/*processing/

merged reads *_mappable.fastq.gz [big file]
unmerged R1 *_mappable_R1.fastq.gz [big file]
unmerged R2 *_mappable_R2.fastq.gz [big file]
trim report *_trim_report.log and merge report *_merge_report.log

Cutadapt

FLASH

Mapping

BEARCAVE mapping script

decompress fastqs
merged (SE) and PE data processed separately
- mapping using bwa mem algorithm
- PCR duplicates identified and removed using samtools
- Reads with poor mapping quality (Q30) removed using samtools
SE and PE data merged
mapping log file generated
file cleanup and renaming

Expected output in /BEARCAVE2/mapped*/*processing/

mapped filtered data *.bam [big file]
bam index *.bam.bai
mapping log *_mapping.log

bwa

samtools (48,024 citations)

Population genomics using ANGSD

angsd

Widely used program
MANY population genetics analyses possible
Tends to work directly from bam files (unlike plink, admixtools, etc)
MANY filters available
Genotype likelihood approach is a particular speciality

Allele1|Allele2|prob11|prob12|prob22 |||| A|T|0.05|0.9|0.05

Several spin off programs that use GLs
- NGSadmix
- PCangsd
- NGSrelate
- realSFS

angsd website

angsd paper

Analyses

Covariance matrix

All indviduals
Allele frequency covariance matrix
Used for PCA

Distance matrix

All individuals
absolute genetic distance between populations
used as input for NJ algorithm

Heterozygosity

Your single adder
Calculate GLs, ML estimation of SFS along a sliding window using realSFS

Intro to R

What is `R`

R was born from S in the Bell labs in 1992
Statistical analysis
Data visualisation
Free and open source
Linux, Mac, Windows
Many additional packages
Several GUIs e.g. Rstudio
Graphics (even interactive), text documents, websites
This presentation!

`R` from a heretic and a pragmatist

(guess who is who)

Most people disagree (in some cases strongly)

Base R is really good!
R ≠ Rstudio
tidyverse is not the way to teach R to beginners
Data doesn’t have to be tidy
ggplot2 code is restrictive
It’s boring if everyone’s graphs look the same
It’s OK to type commands directly into the terminal
It’s OK to modify a raw data file
It’s OK to combine other tools and languages

How `R` works

“Everything is an object; everything that happens is a function”

How `R` works

Objects and functions

Objects

Contain data and results
Created with <-
Stored in active memory
Names include letters numbers _ .
Names must begin with a letter

Functions

Carry out operations on objects
Often generates new objects
function()
?function, example(function)

Data structures: vector and matrix

Vector

List of values of the same type
Numbers, strings, or logical values
Can be generated using c()
Indexing vector objects my_vector[]

Matrix

2D data of same type in rows and columns
Indexing matrix objects my_matrix[row, column]

Data structures: dataframe and list

Dataframe

Rectangular “table”
Mixture of data types
Set of vectors of equal length
Extract columns with $, which can then be indexed like vectors

List

Set of components with different structures
Extract named components with $

Data structures: as shapes

`Rstudio`

GUI for R
Preferred by many
Linked to tidyverse
Linked to R markdown
Version control (git) and other development tools

`Rstudio`

Population genetics: PCA

Input is allele covariance matrix
eigen decomposition eigen()
Actually a recent method
No knowledge of PC loadings in terms of SNPs/sites
PC scatterplot
PC variance explained

Population genetics

Population genetics: Neighbour-joining clustering

Input is distance matrix
Neighbour joining algorithm
Clusters based on genetic similarity
Rooted using outgroup
Requires ape library
Basic estimate of phylogeny

Schedule

Prequisites

Resources

Lecture slides

Bootcamp worksheet

Github repo

Supercomputing Wales shared project

Introductions

Axel Barlow, Lecturer in Zoology at Bangor University

Johanna Paijmans, Lecturer in Zoology at Bangor University

Introductions

Unix-like systems and bash

Unix

Unix

Unix-like systems

MS-DOS (Windows)

Windows Subsystem for Linux

OS comparison

Terminal emulators and Bash

Supercomputing Wales

Hawk

Hawk

Hawk usage

Secure shell (SSH)

SSH via jump host

SSH via jump host

Getting logged on…

AI

A brief note on AI in Bioinformatics

A brief note on AI in Bioinformatics

Filesystem

Copying files via SCP

Slurm

Illumina data

Illumina sequencing platforms

Illumina summary

Data output

Sequencing by synthesis

Sample preparation

Flow cell

Cluster generation

Sequencing by synthesis

Data analysis (in the machine)

What do we sequence?

Whole genome sequencing

Whole genome sequencing

Our project: adder population genomics

Adder locations

Illumina data processing

.fastq file format

BEARCAVE

BEARCAVE

Adapter trimming and read merging

DNA fragment length distribution

Effect of insert size

Effect of insert size

Effect of insert size

Effect of insert size

Effect of insert size

Effect of insert size

Adapter trimming

Overlapping reads are merged

BEARCAVE script

Cutadapt

FLASH

Mapping

BEARCAVE mapping script

bwa

samtools (48,024 citations)

Population genomics using ANGSD

angsd

angsd website

angsd paper

Analyses

Intro to R

What is R

R from a heretic and a pragmatist

How R works

How R works

Objects and functions

What is `R`

`R` from a heretic and a pragmatist

How `R` works

How `R` works

`Rstudio`

`Rstudio`