Schedule

Date Time Activity
Mon 03/11/2024 10:00 to 16:00 Unix-like systems, bash, connecting to SCWales
Tue 04/11/2024 10:00 to 16:00 More bash, SCWales, using SCWales and slurm
Wed 05/11/2024 10:00 to 16:00 Illumina data, BEARCAVE, data processing
Thu 06/11/2024 10:00 to 16:00 ANGSD, covariance and distance matrices, heterozygosity, intro to R
Fri 07/11/2024 10:00 to 16:00 Maps, PCA’s, NJ trees, Manhattan plots and Rmarkdown

Prequisites

  • a computer with internet connection
  • an account on super-computing Wales
  • basic knowledge of DNA and genome structure

Resources

Introductions

Axel Barlow, Lecturer in Zoology at Bangor University

  • Self taught post-PhD
  • No proper bioinformatics/computer science background
  • Knowledge of bash and R.

Johanna Paijmans, Lecturer in Zoology at Bangor University

  • Self taught (rebelling against a Linux-expert dad)
  • No proper bioinformatics/computer science background
  • Knowledge of bash and R.

Introductions

Unix-like systems and bash

Unix

  • Operating system developed in 1969 by Bell Labs
  • Unix philosophy: operating system should provide a set of simple tools, each of which performs a limited, well-defined function.
  • Modular (small programs strung together)
  • Inter-process communication: “pipes”
  • Separate normal and “super” users (sudo)
  • Hierarchical filesystem
  • A shell for executing and combining tools
  • The basis of many subsequent OS

Unix

Unix-like systems

Mac OS

  • Released 1984
  • Developed from NeXTSTEP, which is developed from Unix
  • Proprietary, Apple hardware only

Linux

  • 21 year old Linus Torvalds coded a Unix inspired OS in 1991
  • Free and open source
  • The core linux kernel available under many distributions: Ubuntu, Mint, Arch, Android, Tesla, etc.

MS-DOS (Windows)

  • Developed by Microsoft, released 1981
  • Main OS for IBM PCs in 1980s
  • GUI introduced with Windows, released 1985
  • Largest market share (70% of PCs)
  • Some bioinformatics possible (e.g. R typically via Rstudio)
  • No bash
  • Encoding of text files is different
  • Majority of bioinfomatics software unsupported
  • Windows subsystem for linux https://learn.microsoft.com/en-us/windows/wsl/install
  • Seamless transfer between DOS and Unix not yet possible

Windows Subsystem for Linux

  • Released in 2016
  • Allows installation of Linux distributions (e.g. Ubuntu) within Windows 10/11
  • Provides a bash terminal and Linux environment
  • Transfer between Windows and Linux filesystems possible
  • Essentially WSL is Microsoft’s admittance to Linux superiority

OS comparison

Windows Mac Linux
standard PC functions yes yes yes
cost yes yes free
hardware choice yes no yes
bioinformatics no yes yes
HPC no no yes
open source no no yes
active community no no yes
games yes no no

Terminal emulators and Bash

  • A shell allows users to execute OS tools
  • Accessed using a terminal
  • Unix terminal came with the Bourne shell (sh), developed by Steven Bourne in 1979
  • In 1979 Brian Fox in improved version: the Bourne again shell (bash)

  • Most Unix-like OS (incl Mac) use bash or something like it
    • execute standard OS functions and installed programs
    • access filesystem
    • supports bash scripts
    • pipes, auto-completion, loops, wildcards, etc.

Supercomputing Wales

Hawk

Hawk

  • Computational nodes
    • >134x Intel nodes with 40 cpus + 192 Gb RAM each
    • >64x AMD nodes with 64 cpus + 256 Gb RAM each
    • >26x Highmem nodes with 384 Gb RAM each
    • >28x Nvidia GPU nodes
  • Storage space
    • 1192TB (usable) scratch space
    • 420TB of home directory space

Hawk usage

  • Command-line only (via ssh)
  • Access through SSH jump host
  • File transfer using scp or sftp
  • Job submission using slurm job scheduler
  • Software installed as modules
  • No super user access
  • Space & quota
    • Home directory (small, persistent): 50Gb
    • Scratch space (bigger, temporary): SCW projects, few Tb
    • Max 10 active jobs, max 30 in the queue
    • Max 3 day runtime

Secure shell (SSH)

SSH via jump host

SSH via jump host

Getting logged on…

Connecting to the jump host (with MFA)

ssh you25usr@ssh.bangor.ac.uk

Note: most UNIX systems do not show anything when you’re typing your password!

If successful, connecting to Hawk

ssh b.you25usr@hawklogin.cf.ac.uk

Raise your hand if you are having issues đŸ™Œ

AI

A brief note on AI in Bioinformatics

  • AI tools are becoming increasingly common in bioinformatics (and everywhere else)
  • Different models are better or worse for different tasks (e.g. I find ChatGPT 4 is not great at programming)
  • But it doesn’t know everything, and it can’t do everything
  • You need knowledge of coding, assessing output, error messages, structure, logic, etc
  • Also the standard AI considerations: ethics, reproducibility, accuracy, privacy…

A brief note on AI in Bioinformatics

Filesystem

  • / [root] is uppermost level of filesystem
  • Everything is contained in /
  • Directories exist within the filesystem, they can contain files and other directories
  • We specify a path through this hierarchy using forward-slashes
  • Our current directory is called the working directory
/home/b.xlb21brx/
/scratch/b.xlb21brx/
  • We can navigate through the filesystem (change working directory)
  • Or we can specify the patch to directories or files remotely

Copying files via SCP

Slurm

  • Simple Linux Utility for Resource Management: slurm
  • Free open source job scheduler for linux systems
  • Used on 60% of World’s top 500 computers
  • Assigns user jobs to computer resources
  • Submit to queue
  • Short, low-resource jobs move faster through the queue
  • Useful tools for scheduling, reporting, etc

Illumina data

Illumina sequencing platforms

Illumina summary

  • The current market leader
  • Massive output
  • Many applications (genome resequencing, RADseq, transcriptomes, metabarcoding)
  • Cheap (£10 per Gb)
  • Major limitation is the read length

Data output

Platform Million reads Read length Gb data Genome coverage
iSeq 4 2 x 150 bp 1.2 0.4x
MiniSeq 25 2 x 150 bp 7.5 2.5x
MiSeq 100 2 x 500 bp 30 10x
Nextseq 550 400 2 x 150 bp 120 40x
NextSeq 1000/2000 1800 2 x 300 bp 540 180x
NovaSeq 6000 20000 2 x 250 bp 3000 1000x
NovaSeq X 52000 2 x 150 bp 8000 2667x

Sequencing by synthesis

  1. Sample preparation
  2. Bind DNA to flowcell
  3. Generate clusters
  4. Sequencing by synthesis
  5. Image analysis

Sample preparation

*Indexes allow multiple samples to be sequenced at the same time

Flow cell

Cluster generation

Sequencing by synthesis

Data analysis (in the machine)

What do we sequence?

[Not an exhaustive list]

  • Whole genome sequencing (pure DNA sample from a single individual)
  • Reduced representation genome data (RADseq, targeted SNPs, single individual)
  • Poolseq (multiple individuals)
  • Transcriptome (RNA sample from single tissue/individual)
  • Metabarcoding (PCR amplicon, multiple individuals/species)
  • Metagenomics (whole genomes, multiple individuals/species)

Whole genome sequencing

Short reads from a single individual can be mapped to a reference genome assembly

Whole genome sequencing

Our project: adder population genomics

  • Adders (Vipera berus berus) widespread across northern Eurasia
  • Threatened or near-threatened in UK
  • Illumina PE data from 47 individuals
  • Plus one outgroup (Vipera berus bosniensis)
  • 12 locations
  • Our tasks
    • Data format
    • Adapter trimming and read merging
    • Map to reference genome: chr7

Adder locations

      +-------------+---------------+
      |   Sample    |   Locality    |
      +-------------+---------------+
      | adder01-04  | Leeds         |
      | adder05-08  | Wensleydale   |
      | adder09-12  | Manchester    |
      | adder13-16  | Caerphilly    |
      | adder17-20  | Gouda         |
      | adder21-24  | Stockholm     |
      | adder25-27  | Cheddar       |
      | adder28-31  | Huddersfield  |
      | adder32-35  | Sheffield     |
      | adder36-39  | Leicester     |
      | adder40-43  | Nottingham    |
      | adder44-47  | Stilton       |
      | adderout    | outgroup      |
      +-------------+---------------+

Illumina data processing

.fastq file format

  • fastq is the standard output format for data from Illumina (and other) platforms
@A00551:758:HKTVJDSX7:4:1101:3595:6872 1:N:0:CCTGAGATGT+GGTCTAGTTG
CTGAATATGGATTTTAATTGAATCCTAAGATATTATAGCATCTTTCACTCCCTGTCCTGTGCATGTCAGA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  • Line 1: info on sequencer, flowcell, cluster position, indexes (sometimes)
  • Line 2: called bases
  • Line 3: a +
  • Line 4: quality scores on Phred scale
  • 10 = 90% accuracy; 20 = 99% accuracy; 30 = 99.9% accuracy
  • Recoded as single character: F = 37; ? = 30; 5 = 20; + = 10

BEARCAVE

BEARCAVE

  • Nikolas Basler, Achim Klittich, Axel Barlow
  • An environment for organising, processing, and archiving Illumina data
  • BEARCAVE philosophy
    • All users can access all data
    • Avoid data redundancy
    • All samples processed using identical software programs and parameters
    • Incorporates sample metadata
    • Documents results of data processing
    • Easy to use wrapper scripts for programs
    • Publicly available
    • Safeguards in place to ensure consistency
  • Consequently BEARCAVE is not for everyone, and has idiosyncrasies in use

Adapter trimming and read merging

DNA fragment length distribution

  • DNA can be fragmented
  • The fragment lengths have a distribution

45 ka cave bear (Ursus kudarensis)

Effect of insert size

Effect of insert size

Effect of insert size

Effect of insert size

Effect of insert size

Effect of insert size

Adapter trimming

Overlapping reads are merged

BEARCAVE script

  • decompress fastqs
  • trim adapter seqs using Cutadapt
    • 30 bp min length
    • min overlap 1 bp
  • merge overlapping read pair using FLASH
  • recompress files and clean up
  • save appropriate log files

Expected output in /BEARCAVE2/trimdata/*processing/

  • merged reads *_mappable.fastq.gz [big file]
  • unmerged R1 *_mappable_R1.fastq.gz [big file]
  • unmerged R2 *_mappable_R2.fastq.gz [big file]
  • trim report *_trim_report.log and merge report *_merge_report.log

Cutadapt

FLASH

Mapping

BEARCAVE mapping script

  • decompress fastqs
  • merged (SE) and PE data processed separately
    • mapping using bwa mem algorithm
    • PCR duplicates identified and removed using samtools
    • Reads with poor mapping quality (Q30) removed using samtools
  • SE and PE data merged
  • mapping log file generated
  • file cleanup and renaming

Expected output in /BEARCAVE2/mapped*/*processing/

  • mapped filtered data *.bam [big file]
  • bam index *.bam.bai
  • mapping log *_mapping.log

bwa

samtools (48,024 citations)

Population genomics using ANGSD

angsd

  • Widely used program
  • MANY population genetics analyses possible
  • Tends to work directly from bam files (unlike plink, admixtools, etc)
  • MANY filters available
  • Genotype likelihood approach is a particular speciality

Allele1|Allele2|prob11|prob12|prob22 |||| A|T|0.05|0.9|0.05

  • Several spin off programs that use GLs
    • NGSadmix
    • PCangsd
    • NGSrelate
    • realSFS

angsd website

angsd paper

Analyses

Covariance matrix

  • All indviduals
  • Allele frequency covariance matrix
  • Used for PCA

Distance matrix

  • All individuals
  • absolute genetic distance between populations
  • used as input for NJ algorithm

Heterozygosity

  • Your single adder
  • Calculate GLs, ML estimation of SFS along a sliding window using realSFS

Intro to R

What is R

  • R was born from S in the Bell labs in 1992
  • Statistical analysis
  • Data visualisation
  • Free and open source
  • Linux, Mac, Windows
  • Many additional packages
  • Several GUIs e.g. Rstudio
  • Graphics (even interactive), text documents, websites
  • This presentation!

R from a heretic and a pragmatist

(guess who is who)

Most people disagree (in some cases strongly)

  • Base R is really good!
  • R ≠ Rstudio
  • tidyverse is not the way to teach R to beginners
  • Data doesn’t have to be tidy
  • ggplot2 code is restrictive
  • It’s boring if everyone’s graphs look the same
  • It’s OK to type commands directly into the terminal
  • It’s OK to modify a raw data file
  • It’s OK to combine other tools and languages

How R works

“Everything is an object; everything that happens is a function”

How R works

Objects and functions

Objects

  • Contain data and results
  • Created with <-
  • Stored in active memory
  • Names include letters numbers _ .
  • Names must begin with a letter

Functions

  • Carry out operations on objects
  • Often generates new objects
  • function()
  • ?function, example(function)

Data structures: vector and matrix

Vector

  • List of values of the same type
  • Numbers, strings, or logical values
  • Can be generated using c()
  • Indexing vector objects my_vector[]

Matrix

  • 2D data of same type in rows and columns
  • Indexing matrix objects my_matrix[row, column]

Data structures: dataframe and list

Dataframe

  • Rectangular “table”
  • Mixture of data types
  • Set of vectors of equal length
  • Extract columns with $, which can then be indexed like vectors

List

  • Set of components with different structures
  • Extract named components with $

Data structures: as shapes

Rstudio

  • GUI for R
  • Preferred by many
  • Linked to tidyverse
  • Linked to R markdown
  • Version control (git) and other development tools

Rstudio

Population genetics: PCA

  • Input is allele covariance matrix
  • eigen decomposition eigen()
  • Actually a recent method
  • No knowledge of PC loadings in terms of SNPs/sites
  • PC scatterplot
  • PC variance explained

Population genetics

Population genetics: Neighbour-joining clustering

  • Input is distance matrix
  • Neighbour joining algorithm
  • Clusters based on genetic similarity
  • Rooted using outgroup
  • Requires ape library
  • Basic estimate of phylogeny

Heterozygosity

  • Input is sliding window estimates of heterozygous sites

Rmarkdown

Rmarkdown

  • Dynamic documents in Rstudio
  • Combine text, code, and results
  • Publishers are increasingly asking for reproducible code capsules

Reproducability

Rmarkdown example

That’s all folks!

See you next year :)