This lesson is still being designed and assembled (Pre-Alpha version)

04 Visualizing and counting of the alignments

Overview

Teaching: 45 min
Exercises: 15 min
Questions
  • What is a BAM file?

  • What do all these numbers mean?

  • how to prepare BAM files for visualisation?

  • How to use IGV, an interactive genome browser?

Objectives
  • Learn how to upload and view your BAM alignment file in IGV.

  • Create a count file to extract the number of reads that maps within a gene.

Table of contents

1. The SAM/BAM format

SAM files, are tab-delimited text files that contain information for each individual read and its alignment to the genome. While we do not have time to go in detail of the features of the SAM format, the paper by Heng Li et al. provides a lot more detail on the specification.

The compressed binary version of SAM is called a BAM file. We use this version to reduce size and to allow for indexing, which enables efficient random access of the data contained within the file.

1.1 What’s in a SAM/BAM file

The file begins with a header, which is optional. The header is used to describe source of data, reference sequence, method of alignment, etc., this will change depending on the aligner being used. Following the header is the alignment section. Each line that follows corresponds to alignment information for a single read. Each alignment line has 11 mandatory fields for essential mapping information and a variable number of other fields for aligner specific information. An example entry from a SAM file is displayed below with the different fields highlighted.



Additionally tags (or attribute) can be aded to each of of the lines. These tags give some aditional information on the alignment. The number and type of tags varies between different alinment tools and the settings within these tools. Here a list of tags that are commonly used.

Tag:Type Meaning
NM:i Edit distance
MD:i Mismatching positions/bases
AS:i Alignment score
BC:z Barcode sequence
X0:i Number of best hits
X1:i Number of suboptimal hits found by BWA
XN:i Number of ambiguous bases in the reference
XM:i Number of mismatches in the alignment
XO:i Number of gap opens
XG:i Number of gap extentions
XT Type: Unique/Repeat/N/Mate-sw
XA:z Alternative hits; format: (chr,pos,CIGAR,NM;)
XS:i Suboptimal alignment score
XF:i Support from forward/reverse alignment
XE:i Number of supporting seeds


To start of we’ll have a look at how to use samtools to have a peak at the the contents of the bam files.

As these file are binary you can not simply use:

$ head Arabidopsis_sample1.bam 
?BC?mRK??0t
=8?W???F?????BlӔ?^;?n
                     ?#??ٟ۟T??CA??h?????????$?|?K??
?????чa??z?9|M???4??~WH??5??7???ǻ?Ԇr?~wE?Bd"???}q????.??c?K^?}?GM?s???@(}??ql&R??=RF?H??I???9?????Q:5eM?M?4?ጃ3??=??7?^?+x????sp
??
8??$???0?g?V?Xy?!???hm?#?P2?qX?`<t???	-?<???????
?@?81??
       ? ???+?
c:??G?!       H
       ???v*???І?Pv???c?x????y?a)?/??S?8[??ޒ?y!?P:,??-5??????֫$I^ǽ???ͧ_?ߗ??<??BChc?\k?eYU????a???kw?}????}??????8?:???!?
3(
  QD??(?T??p?C??
?D?"?ф  0?? F???? ?0h&d?o?}έ?==u?F?tUݺU???k?????o??F???q????v????U????A????p{????ޕg?^p?ht?n?zj4j?W{L?mٕ??
!M)J???n:?3M??*5???U?>P???!???ٍi?$I?eY?-
                                       c???0
                                            ?H?????	=?R?}WG/~>??????ޫ????s??/l?^r?/z??????x^u???'?'U?x@???N`o??G#?m??P?)ӕ?S??U?H?5?ѕ?\xJy???NH??????
?f?>?R?I

This will give an unreadable result. SAMtools can help us to make the content readable.

1.2 SAMtools

SAMtools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.

Like many Unix commands, samtools commands follow a stream model, where data runs through each command as if carried on a conveyor belt. This allows combining multiple commands into a data processing pipeline. Although the final output can be very complex, only a limited number of simple commands are needed to produce it. If not specified, the standard streams (stdin, stdout, and stderr) are assumed. Data sent to stdout are printed to the screen by default but are easily redirected to another file using the normal Unix redirectors (> and >>), or to another command via a pipe (|).

1.3 SAMtools commands

SAMtools provides the following commands, each invoked as “samtools some_command”.

Looking at the content of the file using samtools view:

$ samtools view Arabidopsis_sample1.bam | head
ERR1406259.2	16	Chr1	17425094	60	85M228N16M	*	0	0	TGAGAATAAAACCATATGGGTTGGTGATTTGCATCACTGGATGGATGAGGCTTATCTTAATTCTTCTTTTGCTTCCGGCGACGAGAGAGAGATTGTTTCGG	CDEDDDDDB@4DEDDDDDDDDDDDDDDDDDDDDDCAEEDEFFFEFFHHHGHJJJIHJJJIIIGIIJIJJJJJJJFJJJIIIGJJIJJJHHHHHFFFFFCCC	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:101	YT:Z:UU	XS:A:+	NH:i:1
ERR1406259.5	0	Chr1	25446585	60	101M	*	0	0	AATTATTGGGCCATATCCCGACCCCTTTGGCAAACCCATCGACCGTTCCAAGAATCCCTCCGGTAATCCCTCCGGCAACCCCAATAATAAGCTTATCAAGC	CCCFFFFFHHHHGJIJJJJJJJJIJJJIIIDHJIIJJJJJGIJGIFGIGIHHHHHHFFFFDDDB?BD@CDDDBBDDDDDDDDDDDDDEEDDDDDDDDEDDC	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:101	YT:Z:UU	NH:i:1
ERR1406259.11	0	Chr1	22700646	60	101M	*	0	0	GCCATGTTGGGTGCAGCTGGAGCTATTGCTCCTGAGATTTTAGGAAAGGCTGGTCTGATTCCAGCAGAGACTGCTCTTCCTTGGTTCCAAACCGGTGTGAT	@@@DFDDEFDHBFHIIJJGIIJJIIJJFJFJJJJJJGIJJJJIGGHJ@FGHIE=FHIJJIGEHGJJGH=A3=CED?DEFF<CCCCCDCEDDDDBD385@DD	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:101	YT:Z:UU	NH:i:1
ERR1406259.10	0	Chr1	11215252	60	69M263N32M	*	0	0	CCGTTCACTATCAAGTGCGGTAAACCGTCTGACCCGCATAACAATCCGTACTTCCCGTAGTTGTCGAACCTGCGTTTGGTTTTCTCGATCTGAGCATTGAG	@@CFFFFFHHHHHJJBGIHIFFIJJGGIIIIJIGGGIIIIJIIGIIJIGIIGGJJHHHFFBDEECBA;ABCDDD289@D<?BB>@AB:>BBD>>>CCC@>C	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:101	YT:Z:UU	XS:A:-	NH:i:1
ERR1406259.12	16	Chr1	30360823	60	101M	*	0	0	AAGAGTGTCACAGAGCTTGAGAAGGAATATGAGATTAGGGGCTGCGGCAGAAAGGACTGGATTGATAAAAGAGGAGACTGGAGGTCTAAGGCTTATGGTTG	CDDDDDCDCCCDCCDDEEEEEEFFDDFEFFHEEC;ACGHFGGFJJIIHGF?*JJIHGGJJIJIGHBJJIJIJJIHFBIHJJJIGGGJJHHHHHFDFFF@@?	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:101	YT:Z:UU	NH:i:1
ERR1406259.13	0	Chr1	25048653	60	49M136N52M	*	0	0	CATCATAGTATCCGGGTGAGTTACCGTGCTCACGGTACACAAATCCGTGCTCCAACTCGAATTCAACACAAGGAATCCACTTGTTGCGGATAAGGTAGTCA	CCCFFFFFFHHHHJJJEFHJJIJJJIGIJJJJJJJGIIJJJJJJJJJJJJJJJJGIJJHHHHFFFFFFDDDEDDDCDDDDDDDDDDDDDDDDDDD@@CDED	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:101	YT:Z:UU	XS:A:-	NH:i:1
ERR1406259.17	16	Chr1	19517907	60	101M	*	0	0	TCCAAGCTGAGGGAAGAACTCTAGATGATCAAGAATCCTATCGGGACAAAATAATTGAGTACAAGAGTCTTGAAGATGTTGTTGAGGATAATATCAGTTTG	ADDDDDDDDEEEEEEFFDFFFHHHHHHHHJJJIIGGIJJJJJJJJJJJJJIJJJJJJJJJIJJIJIHIJJJJJJJJJJJJJJJJJJJJHHHHHFFFFFCCC	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:101	YT:Z:UU	NH:i:1
ERR1406259.24	16	Chr1	6804452	60	49M153N52M	*	0	0	CAGCATCGTCCACTTCCACTTGCCCTCCCGGCGGCAATAATTTGCACAACTGTGGGGCTACAAGAATGTAACCATGCGAAGCGATGTGGTTAAGAACGTCA	:4?8(0>@CCC@CC@>>4<82<@<8=?8896=7E>EC?CD@IGGC=@C7CAF@@BDB9?9FHGD>BBHDFBCIIHHBGHEHBGA<+ACBBHDFDBA;D@<@	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:101	YT:Z:UU	XS:A:-	NH:i:1
ERR1406259.25	0	Chr1	7738049	60	101M	*	0	0	GCTGAACGAATGGCCCGGAGAAGCGATTTTGATCAGNATTCATCAGGAGGAGCCAGTTATCCATCACACGGTGAGATNTACGAAGTTGTTCTCTACTTTGG	CCCFFFFFGFHHHJJJJJGGIJBBHIGIIIEAEDHI#0<FHEIEGIEGIICFCACE>BEFCCEDEEDCD@B?B@BCC#,8??@AB?CDDDDECCACDDEAC	AS:i:-2	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:36T40G23	YT:Z:UU	NH:i:1
ERR1406259.32	16	Chr1	4393371	60	99M2S	*	0	0	TAAGAACCTTCTTGGCTCCAGCCTGAAGGTGCTTCCCAGCACCGTCTCTGTCAACAAACACTCCGGTTCCTTCGATAACTAAGTCAATGCCTAGTTCCCCC	ADCA>3BC?BDCDDDDBDDBDDDDCCCDDDDBCBDEEED?5GGHHIIJJJIGHIJHF=FCJHJJJJJIIIJJIJIJJJJIJJJIIIGJGHFFHFFFFFCC@	AS:i:-2	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:99	YT:Z:UU	NH:i:1

SAMtools will make the data readeble, this data is then piped through head to show the first 10 lines of the file.

2.4 Counting and sorting

SAMtools view can be used to filter the alignment based on characters like mapping quality, chromosome, orientation etc. When the -c option is added the filtered selection is counted.


Count the total number of records:

$ samtools view -c Arabidopsis_sample1.bam 
$ 263657


Count with flagstat for additional information:

$ samtools flagstat arabidopsis1.bam
263657 + 0 in total (QC-passed reads + QC-failed reads)
14232 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
263142 + 0 mapped (99.80% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


Count the records using the FLAG argument. Count the alignments that don’t align.

$ samtools view -f 4 -c Arabidopsis_sample1.bam
$ 515

The argument -f includes reads that fit samflag 4, read unmapped.

Count the reads that do align:

$ samtools view -F 4 -c Arabidopsis_sample1.bam
$ 263657

Here -F is used to exclude reads that fit samflag 4, read unmapped. Everything else is included.

Question

Sometimes you will see that this number of alignments is higher then the number of sequences in your fastq file. How can this be?

Solution

When a read multimaps (aligned to multiple positions in the genome), each of these positions is included as a separate alignment.


Write your selection to a new file:

samtools view -Sb -F 4 -o Arabidopsis_sample1_mapped.bam Arabidopsis_sample1.bam

In this command -Sb is needed to keep the file binairy(compressed), and -o specifies the output filename (a bam file again).

Count the reads that align to the forward strand:

$ samtools view -F 20 -c Arabidopsis_sample1.bam
$ 131631

Use -F 20 to exclude “read reverse strand” and “read unmapped”.

Count the reads that align to the reverse strand:

$ samtools view -f 16 -c Arabidopsis_sample1.bam 
$ 131511

With -f 16 you select for “read reverse strand”.

With SAMtools it is also posible to select for alignments with a minimal mapping quality.
Alignments with a maximal score (60 for hisat2 output files and 255 for STAR output files) are truly unique:

$ samtools view -q 60 -c Arabidopsis_sample1.bam
$ 240224

This number can never be bigger then the number of reads in the fastq file, as all reads in the output give a single alignment.

Question

BAM files usually contain a tag or attribute that gives the number of mismatches between the read and the reference genome. With SAMtools it is unfortunately not possible to filter on these values. Could you think of an other way to select for alignments that align without any mismatches?
Hint: make use of grep "XM:i:0" among others.

Solution

samtools view Arabidopsis_sample1.bam | grep "XM:i:0" | wc -l


3. Visualization of a BAM file

A BAM file can be visualized using a genome viewer like IGV.
We can’t just upload the files in the viewer. We first need the files to be sorted and indexed. This can be done making use of SAMtools sort and index commands.

3.1. Preparation of the BAM file for IGV

Sorting

Samtools can also be used to sort the read alignments. The aliments will be sorted based on the position ofv the alignment on the reference genome, starting from the beginning of chromosome 1 to the end of the last chromosome.

$ samtools sort -o arabidopsis1_sorted.bam arabidopsis1.bam

where -o defines the name of the output file (also a BAM file).
The default for samtools sort is sorting by position. There are more sorting posibilities to be found with samtools sort --help.

Indexing

An index file is needed to get access rapidly to different alignment regions in the BAM alignment file.

samtools index arabidopsis1_sorted.bam

Only the input file name needs to be specified, based on this name a .bai (BAM index) is produced.

Transfer the files to your local computer

Next the files need to be downloaded to your local computer.
First, we need to exit the container (ctrl/cmd + d) and next create a folder on your VM outside the container:

$ mkdir IGVfiles


Copy the files (both the .bam and the .bai), or just the whole content of the directory mapped:

$ docker cp bioinfo:/home/mapped/ ~/IGVfiles/


Go to your local computer, create a directory and download the files from the VM:

$ mkdir IGV


Download the sorted .sorted.bam files:

$  scp -r root@178.128.240.207:~/home/tutorial/IGVfiles/mapped/*sorted.bam ~/Desktop/IGV


And the index (.bai) files:

scp -r root@178.128.240.207:~/home/tutorial/IGVfiles/mapped/*.bai ~/Desktop/IGV

3.2. IGV

For this exercise we’ll be making use of an online version of IGV. The Arabidopsis genome that we used for the mapping is available in this web app.
It can be found under “Genome -> A. thaliana (TAIR 10)”.

The BAM files can be added as a track. Choose “Tracks -> local file”. Select both the .bam file and the accompanying .bai.

You should get something like this:



Mapping has in this case only been done against chromosome 1. So if we want to see are reads choose chromosome 1. Zoom in to see the reads.


It is also posible to search for genes. Just write the name in the search box and click the magnification glass. Try with the genes:



Question

Pick a sample and visualize the forward and reverse alignments separately in IGV.

Solution

Select and sort forward reads in one go.
samtools view -Sb -F 20 Arabidopsis_sample1.bam | samtools sort -o Arabidopsis_FW_sorted.bam
Index the forward reads.
samtools index Arabidopsis_FW_sorted.bam
Select and sort the reverse reads.
samtools view -Sb -f 16 Arabidopsis_sample1.bam | samtools sort -o Arabidopsis_RV_sorted.bam
Index the reverse reads.
samtools index Arabidopsis_RV_sorted.bam
From the Desktop of your local computer create directory and download the files.
mkdir IGV
scp -r root@178.128.240.207:~/home/tutorial/IGVfiles/mapped/*sorted.bam* ~/Desktop/IGV

Upload the files in IGV and you should get something like this.


4. Creating the counts file

For downstream applications e.g. differential expression analysis, the number of reads that maps within a gene has to be determined for each sample.

The featureCounts program from the Subread package can do this. The complete user guide is available here and featureCounts is in section 6.2.

featureCounts can…count (!) the number of reads that map within a feature. The Arabidopsis genome annotation in the GFF3 format contain three different features to choose from.

Depending on the downstream applications the choice is gene, transcript or exon. In this study we are just looking for differentially expressed genes so our feature of interest specified by the -t will be gene.

$ cd /home/

$ gunzip ath_annotation.gff3.gz

$ featureCounts -O -t gene -g ID -a ath_annotation.gff3 -o counts.txt mapped/*.bam

Here is an explanation of the different arguments used:

The output file produced by featureCounts is a tab-delimited file that can be opened in a spreadsheet program like Excel.

Key Points

  • The SAM/BAM format is the end-result of a read alignment to a reference genome.

  • The samtools software can be used to view, filter and order aligments in a .bam file.

  • The aligments can be visualized in a genome using an genome viewer like IGV

  • The resulting BAM files are used to generate a count table for use in differential expression analyses.