Introduction to SAM and BAM files

The tutorial is intended as a gentle introduction to Sequence Alignment/Map (SAM) formatted files and their binary equivalents BAM.

Methods used in this tutorial include:

The computational requirements for this tutorial are:

⚠️ Warning: This notebook has been saved with its outputs for demostration purposed. It is recommeded to select Edit > Clear all outputs before using the notebook to analyse your own data.

Introduction

This tutorial aims to elucidate the information stored with a SAM and BAM files, and how such files can be read, or parsed, within the Python programming language and on the command line.

The goals from this tutorial include understanding:

The tutorial includes a small E.coli dataset (100x) in the form of a compressed fastq file to demonstrate the key learning objectives.

Getting started

Before anything else we will create and set a working directory:

Sample Data

For the purpose of an introduction to the SAM/BAM format, we have uploaded a small E.coli dataset (100x) which is available to download here in the form of a compressed fastq.

To download the sample file we run the linux command wget. To execute the command click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.

Before we move on, let's quickly inspect the sample .FASTQ we have downloaded using seqkit. This should give us an overview which we can refer back to later. Again, click play to run the following command.

SAM Files

The SAM format) is a file type most commonly used to store the results of nucleotide sequence alignments. However, the SAM format is also used to store unaligned nucleotide sequence data, and this is use-case is steadily increasing in popularity. In this tutorial, we'll focus on aligned-SAM files.

SAM is an important biological file format, not least because of the extensive ecosystem of tools which require it as input, but also thanks to the range of utilities which make it easy to write, query and transform them. In this tutorial we will introduce the format, provide useful links to tools and documentation, as well as help you to get started with some basic examples of creation and inspection.

Basic Format

With our introduction out of the way, let's dive into what a SAM file looks like. However, before we do, arm yourself with this link to the SAM specification, and keep it handy. It may look daunting, but its a very clearly laid out document which you can consult with any time you have a question related to SAM formatting.

Quoting the spec, a SAM file is described as a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. Let's go through each section in turn.

Header section

The header section naturally precedes the alignment section. If it exists, it can contain any information but is primarily the home of meta data, like SAM versioning info, the commands used to generate the file, or information about the references used in alignment. Header lines start with an @ character.

Here's an example header, in this case generated from an alignment of the E. coli dataset we provided to the E. coli K12 reference sequence:

@SQ     SN:U00096.3     LN:4641652
@PG     ID:minimap2     PN:minimap2     VN:2.17-r941    CL:minimap2 -ax map-ont ../ecoli_k12.fasta reads.part_001.fastq

The first two letters of a header line are known as 'header record type codes', and denote the kind of information each header line will contain. For instance, SQ refers to Reference sequence dictionary, which stores info about the reference sequence against which the alignments in this file were made.

Thereafter, each entry within a header line follows the format key:value, e.g. LN:4641652 which gives us the length of the reference in bases. It's as simple as that!

By looking at the SAM specification, can you figure out what PG stands for?

Various command-line tools and software libraries make use of the header section by writing all sorts of meta-data for later use, either by a human manually inspecting the header, or by another tool looking for certain key bits of information. However, it is important to remember that the header and the information it contains is optional.

Alignment section

The alignment section is where the bulk of the information in the file is stored. A SAM record is a description of an alignment (or an unmapped read, if it was unaligned). There is one record per line, and each record consists of a number of tab-separated columns. There are 11 required columns. Here's an overview of those, with informal descriptions for each:

Column Descripton
1 QNAME Query name, i.e. the name of the read which was mapped (copied from the read's FASTQ record header)
2 FLAG An integer which is a short hand way of describing the type of alignment this is. See below for an explanation of FLAGS.
3 RNAME Name of the reference to which the read was mapped, e.g. chr1
4 POS 1-based leftmost mapping position with respect to reference, e.g. a POS of 100 would indicate the read mapping starts at the 100th position of RNAME
5 MAPQ An integer which should be a reflection of the quality of the mapping, but for which the scale and meaning may differ by aligner
6 CIGAR A string of letters and numbers describing the matches and mismatches of the alignment from left to right.
7 RNEXT Ignore for now. Becomes relevant when working with pair-end data.
8 PNEXT ""
9 TLEN ""
10 SEQ The sequence of bases constituting the underlying read
11 QUAL The sequence of ASCII characters encoding the PHRED quality scores for each base of the underlying read

The formal descriptions for these columns (and everything) can always be found in the SAM specification.

It should be noted that there is also an optional section, placed after the 11 above, which not unlike the header is used flexibly and optionally by tools to store pieces of additional information in the shape of TAGs.

Each tag takes the form TAG:TYPE:VALUE format where TAG is a two-character string, TYPE is a symbol denoting the type which the value may be (e.g. an integer, or a word), and VALUE is the VALUE of the TAG for this alignment. This area is often used by tools to store some additional meta data per alignment, e.g. the edit distance between the read and the reference:

NM:i:2

There are some pre-defined TAGs that, if set, should have a certain consistent meaning, but equally there is great scope for tools to come up with their own tags. That's all we need to know for now.

In order to illustrate everything we have just explored, here's an example alignment record from our E.coli sample:

e813406f-821c-4cbb-85f6-c7fc811703a3    0       U00096.3        1958697 60      69S25M2D8M4I7M1I3M2I42M1I36M1D36M2D34M1I14M2D129M2D22M2I30M1D2M1D75M1D15M4I4M3D2M1D108M1I58M1D54M1I3M1I56M1I5M1D3M1I19M1I30M1D95M1I73M2I64M1D91M1I114M1D19M1D185M1D3M1I110M1I21M1D291M1I123M1D6M2D5M1I32M1I4M1I77M1D35M1I27M2D2M1I50M1D36M1D5M1I33M1D21M1D34M1I31M1D41M2I16M1D105M1D11M1I42M1I62M3I2M1D12M1I91M1I12M2I69M3D302M2S       *       0       0       TTCAGACCTTCGTTCAGTTACGTATTGCTAAGGTTAAACAGACAATTAAACGGAATGACAGCACCTACTGGCATCGAGGAAGTGGTTCAGGTTTTGCACCGAAGAGCTCCTGCTTCCAGGGCCATAATGATTGGAGCATCACGATGGGCAATAAGTTTCGTAAATTTTTGATAAACCTTGCAGTGCGTCTGATTTTTGCCTGATGTCAGTACCCGGGCAATGCCCAATTCCGCAATTTCTGAGTGTATATAAAGGGTTAGCGCGCATACTCGAAGGCGCGATGAAGACGCTGCCAGCGGACCGGCAGCAGCCATTTCCTTTTCCATTGGTGGCATATCAACATTCCATTCAACATCGAGAACGCCCGTCACCAGTCCAGGAAAACCTAATTCGCGGACCGTGCGCACATCCTGGAGTGGCGGCAAACTCACCGTCGCGCTGTCGCGGAAATCGCCACCGCGTGGGCAATATTGGATGCACCGGGATCGTCACCCGCTGGCGCACGGATTTCAGTACACCCAACGACGGCGTTAAGCCCCCCTCTTTGGGGCTACGCGTGGTGACTTCTTGTCTGCGCCGTTTTGCTGCGCCGTTAGTGCACATTCCATGCTGTAACAGCAAATTTCCAGTAATGCCATTTTTTCTCCTTAATTGCGCCGACTGCTCGCTGGCAACGAATCTCTTCAATGGACCACGGATGAAACTGCAACGTTGTTTTGCCGTTCATAACCGCCAAGTAGGATTCGGTAATCGCTCATGCTCACCTTTTGGCGAACTGGTTTTCACTGACAATCCCCCGGTAGGCTGAGTCCTTCATCAGAAAGCAATGCCAGCGCACGGGCGTCCAGCGTTTTCTGATACACCCGGCAGAACGATTTCAAATATGTTACCAACCTTCGTGTGGGTAACGTTTTCCCCGGGCCACGGTAGCTCCACAATAGAAAACTGCCAGTGCGCAACCTGTACCGGTTCATGCAATTTAAACAGACAAATCGGTCTGCCATTGGATCATATTTTCTGACAAAAGCTCGCCACACTGTTCAAACCCGCGACGCCAGCGTTCAGCAGTGGCGTTTTGATGTGGCAACGCAAAAAAATGTGATCGGCAGTCAGCGGAGTGATATTCAACCCCAGACGGCGGGAAACCCATCTAATGCGTGGATAAATCGCGGTAAATCCGATGCAATATCCTGCAGCTCGTCGATAGATTGCCAGTTCGCCATAATCACTCTTCGTACTTTCAGTAAAAGCGTTAATTTACCCTGTTGCCCTGTGCCAACCAACCGCTGATTTCACGCCGCTTCTGATGCAATCGTGAAAACGGCAATACGCCGCGCGCACGTTGCTGACGGAGCGGCCATTTGCGGTGTCTCCCGCCCTAATTTCTTTAACTGGTGCGGGCAATTTTTGCTCGCTTCATCAATGTAAGGTATTCCGGTGAATATTCAGGCTCTTCTCTCAGAAAAAGTCCGTCAGGCCATGATTGCGGCAGGCGCGCCTGCGGATTGCGAACCGCAGGTTCGTCAGTCAGCAAAAGTTCAGTTCGGCGACTATCGGCTTAACGGCATGATGGCAGTTGCTAAAAAACTGGGTATGGCACCGCGACAATTAGCAGAGCAGGTGCTGACTCATCTGGATATTAACGGTATCGCCAGCAAAGTTGAGATCGCCCAGTACAGGCTTTATCAACATTTCCTTGATCCGGCATTCCTGGCTGAACATGTTCAGCGGGCGCTAGCATCCGATCGTCTCGGTGTTGCTACGCCAGAAAAACAGACCATTGTGGTTGACTACTCTGCGCCAAACGTGGCGAAAGAGATGCATGTCGGTCACCTGCGCTCTACCATTATTGGTGACGCAGCAGTGCGTACTCTGGAGTTCCTCGGTCACAAAGTGATTCGCGCAAACCACGTCGGCGACTGGGGCACTCAGTTGGGTATGCTGATTGCATGGCTGGAAAAGCAGCAGCAGGAAAACGCCGGTTGAAATGGAGCTGGCTGACCTTGAAGGTTTCTACCGCGATGCGAAAAAGCATTACGATGAAGATGAAGAGTTCGCCGAGCGCGCACGTAACTACGTGGTAAAACTGCAAAGCGGTGACGAATATTCCGCGATGTCAGCACAAACTGGTCGACATCACCATGACGCAGAAACCGGGATCACCTACGATCGTCTCAACGTGACGCTGACCCGTGATGACGTGATGGGCGAAAGCCTCTACAACCCGATGCTGCAGGAATTGTGGCGGATCTCAAAGCCAAAGGTCTGAGCAGTAAGAAGCGAAGGGGCGACCGTCATCTCCTTGATGAGTTTAAAAACAAGGAAGGCGAACCGATGGGCGTGATCATTAGAAGAAAGATGGCGGCTATCTCTACACCACCACTGCGTCGCCCTGTGCGAAATATCGTTATGAAACACTGCATGCGATCGCGTGCTGTATTACATGACTCCCGTCAGCATCAACACCTGATGCAGGCGTGGGGCGATCGTCCGTAAAGCAGGCTATGTACCGAATCCGTACCGCTGGAACACCACATGTTCGGCATGATGCTGCGGGTGAAGACGGCAAACGTTCAAAACCCGCGCGGGTGGTACAGTGAAACTGGCCGATCTGCTGGATGAAACCCTGGAACGTGCACGCCGTCTGGTGGCAGAAAAGAACCCGGATATGCCAGCGACGAGCTGGAAAAAACTGGCTAACGCGGTTGGTATTGGTGCGGTGAAATATGCCGGATCTCTCCAAAAACCGCACCACGGACTACATCTTCGACTGGGACAACATGCTGGCGTTTAAAGAGGTAATACCGCGCCCATACATGCAGTATGCATACACGCGTGTATTGTCCGTGTTCCGTAAAGCAGAAATTGACGAAGAGCAACTGGCTGCAGCTCCGGTTCTGAGTCCGTGAAAATCAGGTGAAGCGCAACTGGCAGCTCGCCTGCTGCAGTTTGAAGAAACCCTCACCGTGGTTGCCCGTGAAGGCACGCATGTAATGTGTGCTTACCTGTACAATCTGGCCGGTCTGTTCTCTGGCTTCTACGAGCACTGCCCGATCCTCAGCGCAGAAAACGAAGAAGTGCGTAACAGCCGTCTAAAACTGGCACAACTGACGGCGAAGACGCTGAAGCTGGGTCTGGATACGCTGGGTATTGAGACTGTAGAGCGTATGTAATCGATTTTTCGTGAGAGTAAAGCCTGATCAGGGTTAGCCGATCAGGCTTTTTTATTGCCATCTAAATGTATTCTGAAAATGGACATGCCATTGTTTTCTCACTGTTGGATAAGAAC       %$&#&#"'/,&*')5869;<?=<:@@?;;<<99JG00@/#%(5$&+**)@'#$@(%(AFB@EH@@DA?>(/7.*#$-+,30'+'$/$$$05765-%&%'*%38,%*)0("''*%&)-###.8401&&+,89:>?>;D77890/DDGFCEDC)))0))&*001494:DH*9JENJLCC@@BCG?FBB:::?>G@GUMC8%7649996>@?@>>F=57F=35CBDC:-2<;$5+3*().$%0+(-8770,+'*0<,-222'&/16+5,/.&(%*.5+/,-+%)&$+/#&'$<=7CE=AA68CA/9DA;EA?>D6,/8(*(6C<,-;057%(<>AA89.-0$(7#8/01,.%(-1><:8;?8?CC/$A+(2.(77-/2-),--,8'*-+**+3,04<>@<;=C>DF@CBD..9::$%,-44<CFF867A894(==9:.(-$*-#&$+($%64308%>C6A:9>00&(1($'0*-'&4?<7>/(+6=:<11:9??,@?<FCFF@DDB=2@=RTG88==7<788F9<65DB?GGDIHLJEJ=C?>:<2EG-*2/206('**'*%%)&'4*,$525.067=4G@@@MRJC;@BA;@EJIK./=B9?GGTK?FD>@>AE<@BE@F@78E4MLCL;;:FF8557674//'$%57@LF&.11-?5=A<?=;4246/=>@>=,/(#%-=7.@A>>;?EJCN4CB6?B:77CJD+%"$/&*1887=B74A<<122+%&#+4:%&&#+,,#/4<==88A?CC=E??>7=>@84C/A:6<G/0H::O:JCKKB?G6G<A=%#$'+?9BG92:A<F>O??EF::D?KBDCBHCMRSTCBIH?K@OK@><C?6992>.1-2&>@=BF?--*(32$&&$>??HELCOME@DKK;6*+/97==;>%:?B7GAGD=6>><43:$$/647)'$.BI:'4=5HC;=CE::A@>G5.%+344@BBE=@A@LKEHG@DBIKECCJDF@F@HQPEEFKEQKKQA88M:9@A@ABIA?>>A@@>*-(/.&(11,13486631@CCDA<MID>BC9GFDB@;@A?H@3FH4E:@:9B1AGDGAE>DI>:>FHC:PFJ?QOS@*-((#%%'&*A6>%&:9899=>ACHAE?LLC=<NGGEEFJ>@EKHKHH>GHJI@HC??312)(2::-(**9++*;'2==8>DHB1EOMJCA>=;><EEICB<>@<<8AA?CHCCDEHHAA?<;;-?G?++8BCHKP?>KHL?HCGFDEFK43:@KG:36&35E835+>?@RG:GBOA@@AM><+?@?ABCGSHB@E=GG@6::BBBIFEEBKDNPCFBDG@+=:I>HBGE?DG9;C/8<<FEHFB@B@?@22G<9+'%-(+1.4*>29;:0<.')+'/0?GD.<@9,&*%%%#%)(?<(8;67;69A4,7G:0>JCH>A@JBEAE/0137,(+8=GEDAB=FIAFHB?0=>=B@ANFD:CG?DQDKA2439>>B@=B6<=:>+(/,0021/41=FD:..B@?,536@9$-,678;@><;:<A=<B>@@BA@:?J>D>JEIGDDN7<<42:6/273))@HAIE68;146AE8=95:2)(01#'&.968BE<BDOEKI?$C7:0&5-/@=;=CAEC.;+9*,E@@AF@H@:@BKG@<BLIABF>4;<<=?AD875007B1/<6##&%())?7)(&*8>.3.+,0()*%(&$%&,'4)#*+$$%,755G$>:,'()#39*'0:2.19:?CEDDKD';AADG<DI;6$+./.+6=0-/<@;55&)*'(.4,4#$7:635DA?>GH;;>=CLAQC;FLKG@BLFB;>EG@FAB=KD?EB==B?=BJEHHA/AA0AEDB'&DDC<:9FDDLFACAE=<<C.<AFJ>>ECA?;<@::3,,.>B@;B?6F(531:@*+'+3=<@BI@85FA>BK@@@?JILFHRE?IBN@;B=@?&'D,D5>/0*$(,(+)33414<0033(301$%A;CAH9;GB>>:>9BGHEAN;34983@)G?=9DC=<7EHFB:B@CCBB6A:622==7JDDAB=0?66*-@DI@@=F34A=@CD@DGDE=?HJHGIBI:G5A:<IFHIHIFHJRL/>>=><,0&'&&7)?:CD@C;JJNA==?@FE@FFYG?>GC>?H.9:;=0'+.68?><8.33/4D??<$'0%$C23C4DH>=DA@45FKC@<?AEG>:C($-/%37$%,:;<&60:*'&>98>>77@CC3FHG<<FA?PGPL<D<=GBC>?@F>/,,0*.$%*-'.22**+.04<;///++--+<*,.%+:9'.-.1.0<=4560369GDAA>987%%%%/*%4+&&&76//.;>33/.'1''*&'/&&&=,4?6#%&%$$/31&+G=@F???@GLBDADE?=86C<A?:I:?07*)*.*&-<<6<<=3<<><&$6797711.)+9'==3:<--8+#&)'%$964'8=65JMMDLH7@CCC>=HPD@,;=<71;*0(A698=;<1<<@A.3699&$&;680=9;=8>2>22:ABK=5DCFG@LEMEE?IA>@B</.B:/%%=;DAABKAE=IB>=?A@>/5,)40FC>?GACGBF@LJFBJ9GF3C/A:-0?;FG7DGB?<,,*$-.#%$+-.$39?@HHC4(=?64CFHDA;<A>=4<=CDDC?E;;68;<4548<81)84:?E8FECF@='8(1,1259:77+(##388<?EC7767;EHB312=0&$5,:>DH>J@?;:E10*0/A7,-,BEC.3339</8530*+'(+3:;=H@E@>?BBIA@B=>=A>+:11/78(1/2<;00037D94ULKD??C;:C6;68A=>@@EDEMIA:GHCH9@FDHHK5GE44,+.(-,$#$'((.&&#$$$*,$.&&@<:9AFMFF=6H@CCFB338C5?C==<<?HLKEDIFJ@;KRI:@=A@FADEIQHFB1BK::FIMF?@YVNEG=C@:>AB8G9='5.0#$#&&+597=8)B2694*+-E<8E>DA8H:DBGFG@@98/169;9AC?@F@=D?ABFCDQJCB8CDDCEC<E=QCFK@3997DB*43,$$$28;;*,5661,$+4337=AICE>-/ACEFKG<CDDF@CML<A@AJIHLGG@ABGF?ECABD>?=.<98=?73=E1K17;;>B:9B@=HD8EFDA=BB@&<=844,,23I8;?>JB4@:;DADH<DB<=JHB@EDCBGLL=ICBB<9>@KEG<<=BGB48;;<?1+$4:>@B=3+,,B<<D<?;KKGEDNLLXNE;??32.9@:$$D97@>:;:0*+17;797>:.;9**/:73/2((&)+((*+))+/''((3##%(,943CKGDG?LDB@HBHAJD=:JJ3@DF<BABF>43:1//&&(*       NM:i:163        ms:i:5458       AS:i:5458       nn:i:0  tp:A:P  cm:i:366        s1:i:2411       s2:i:0  de:f:0.0433  rl:i:0

What is the value of the FLAG column?
What is the edit distance of this read to the reference?

What the flag?

The FLAG column explained.

Above we encountered FLAG, one of the required columns in a SAM record. As we noted, it is used to denote what kind of SAM record we are looking at. To the end-user, the practical value of this is that the value of FLAG may be used to filter alignments with samtools. This is especially useful since you may not want to carry forward all alignments in your SAM file to the next step of your analysis, you may require only a specific subset of them.

So with that in mind, what are the main types of alignment record and what values may FLAG take? Primarily a SAM file will contain four broad categories of alignments:

⚠️ Note: Not all aligners operate in the same way. The above information is based on Minimap2's behaviour. Some aligners, for instance, may not always write out unaligned records. If you choose to use a different aligner for any reason, it is always important to read the documentation first.

As mentioned above flags are a short-hand way of denoting the type of alignment recorded in each record. They can be referenced as integers or in hexadecimal form.

Integer Bit Description
1 0x1 template having multiple segments in sequencing
2 0x2 each segment properly aligned according to the aligner
4 0x4 segment unmapped
8 0x8 next segment in the template unmapped
16 0x10 SEQ being reverse complemented
32 0x20 SEQ of the next segment in the template being reverse complemented
64 0x40 the first segment in the template
128 0x80 the last segment in the template
256 0x100 secondary alignment
512 0x200 not passing filters, such as platform/vendor quality controls
1024 0x400 PCR or optical duplicate
2048 0x800 supplementary alignment

For instance, here you can see that 4 or 0x4 refers to unmapped reads, whereas 256 or 0x100 refers to secondary alignments. Later we'll provide examples of how flags can be combined and provided as an argument to samtools, which permits us to view only matching alignments. For now, here's the rule to remember:

For instance, if we wanted to address secondary alignments and unmapped, we would add the flags together, to arrive at 260 or 0x104.

If you ever need a helping hand to figure out which FLAG is right for you, this webpage should be useful.

CIGAR, anyone?

The CIGAR column explained.

Above we encountered the concept of the CIGAR string, which is a condensed alphanumeric representation of an alignment.

Let's take the example from above:

69S25M2D8M4I7M1I3M2I42M1I36M1D36M2D34M1I14M2D129M2D22M2I30M1D2M1D75M1D15M4I4M3D2M1D108M1I58M1D54M1I3M1I56M1I5M1D3M1I19M1I30M1D95M1I73M2I64M1D91M1I114M1D19M1D185M1D3M1I110M1I21M1D291M1I123M1D6M2D5M1I32M1I4M1I77M1D35M1I27M2D2M1I50M1D36M1D5M1I33M1D21M1D34M1I31M1D41M2I16M1D105M1D11M1I42M1I62M3I2M1D12M1I91M1I12M2I69M3D302M2S

What we see is a pattern of integers followed by a single letter, e.g. 69S which is followed by 25M, and so on. The letter denotes the type of operation that the aligner performed when constructing the alignment, and for how many positions that operation continued. For instance 25M means that the aligner was able to align 25 bases of the query to the reference in a single stretch. Note however that:

In SAM-speak the idea of "Match" simply means a base in the query sequence is aligned to a base in the reference. In does not mean the two bases are the same.

In our example 25M is followd by 2D. What does this mean? When an aligner makes an alignment, there might be errors or variation in the read, meaning that the alignment to the target may be imperfect. For instance if some bases appear to be missing in the read, the aligner may mark that a deletion has occured, thereby adding 2D into the CIGAR string.

The set of possible operations are listed within SAM specification, as follows:

Operation Description Consumes query Consumes reference
M alignment match (can be a sequence match or mismatch) yes yes
I insertion to the reference yes no
D deletion from the reference no yes
N skipped region from the reference no yes
S soft clipping (clipped sequences present in SEQ) yes no
H hard clipping (clipped sequences NOT present in SEQ) no no
P padding (silent deletion from padded reference) no no
= sequence match yes yes
X sequence mismatch yes yes

The clipping operations S and H warrant further explanation:

These, however, are not hard and fast rules, but just general rules of thumb to get you acquainted with the concepts. Again, aligners may employ hard or soft clipping strategies differently.

Creating a SAM/BAM file via Alignment

Alignment (or mapping) is one of the key methods in bioinformatics, and is essential for making comparisons between sequence data. This is neither the time nor place to cover mapping in full detail, and there's a lot one could say about it, but in-so-far as it pertains to aligning Oxford Nanopore Technologies' reads and generating a BAM file, the process is actually quite straightforward. So, let's get started!

Aligning with Minimap2

In this demonstration we use minimap2, the aligner most commonly used with ONT data. It is super fast, accurate and easy to use. Minimap2 lives here on github, and the README there provides detailed usage instructions.

Let's break down the basic invocation of the tool:

minimap2 -a -x map-ont ref.fa query.fq > alignment.sam

Substituting that example invocation for our own, let's map the sample data downloaded above against the E. coli K12 reference. Run the cell as before:

Compressing, sorting, indexing with Samtools

We should now have a SAM file. We can do quite a lot with it already, but in general we want to compress the file to make best use of disk space, and because many tools (such as IGV) require BAM files as input. As such, its always good practice to convert SAM to BAM.

Let's introduce samtools to take care of that for us. Here's the command:

samtools view -S -b alignment.sam > alignment.bam

For every samtools subcommand, there are many flags and arguments that you may provide to customise its behaviour. Whenever you need to do something, but don't immediately know how to, you should consult the documentation.

There a few steps we want to perform during our conversion from SAM to BAM, as follows:

  1. Compression - this step reduces the size of the SAM file.
  2. Sorting - this step sorts the records by position on the reference, rather than by the original order that the reads in the .FASTQ file had. This makes it much quicker for downstream tools to access data for reads co-located on the reference sequence.
  3. Indexing - this step creates an extra file (.bai or BAM index) allowing programs to find reads spanning a genomic location without having to read the whole file.

In fact it is possible to compress and sort a BAM file in a single command:

If given the -o argument with a filename ending in .bam, samtools sort will output a BAM file, and of course take care of the sorting at the same time. Great! Now we just need to index our sorted file.

We should now see two files sorted.alignment.bam and sorted.alignment.bam.bai.

Manipulating BAM files on the Command Line

Now that we have our sorted, indexed BAM file we can continue on to interrogating the data stored within by using other samtools commands.

Here is a number of simple examples to whet your appetite.

More samtools view

To view the binary BAM file as human readable text we can use the view command. Adding the -h option prints also the header:

By providing -c we can tell samtools view to count the number of records in the BAM file.

Does this tally with the number of reads in our fastq file? If not, why not? Consider what we have learned about primary, secondary and supplementary alignments

Get a summary of the records in the file

As you can see from the output of flagstat below, we have a few thousand secondary and supplementary alignments each, and a mapping rate of 93.98%. The remainder, i.e. the total number of records minus the mapped reads, 65646 - 61694, should give us the number of unmapped reads.

Access only the primary alignments

In this example, we are filtering only the primary alignments by addition of two FLAGs, 0x100 and 0x800, as an argument to -F, which excludes any alignments matching the FLAG provided.

Access only the unmapped reads

This time we're using -f to match the flag we provide, giving us any unmapped reads.

Does our calculation from flagstat agree with the number of reads output by this step?

Convert to .FASTQ

What if we want to convert our unmapped reads back to fastq, perhaps so that we can try aligning them to something else? samtools fastq provides a convenient method. By default it will only output primaries and unmapped, i.e. the unique sequences belonging to the underlying reads.

Merge two BAM files together

And what if, hypothetically, we wanted to stick our unmapped and primary-only BAM files back together? We could use samtools merge.

This is a fairly unrealistic use-case, but more common might be combining samples from two different sequencing runs.

Select alignments that overlap a given region

Now, what if we're only interested in a certain region of the genome, e.g. positions 50000 to 150000 in the E. coli reference? We can yet again use samtools view. We just need to specify our region as: chromosome:start-end.

Downsampling a BAM file

You may want to only carry forward a subset of data into the next stage of your experiment. For instance, if you are planning to perform variant calling, you will find that some tools operate better within certain depth ranges. As such, if you have more coverage than needed, you may need to 'down-' or 'sub-' sample your alignments.

The simplest way to do this is using samtools view and using the -s, which will randomly subsample all input alignments.

For instance an -s of 0.25 should output only 25% of the input reads:

Downsampling a BAM to uniform coverage

Sampling reads to create uniform coverage across a genome is a form of the box packing problem: how do we best fit a selection of reads into a box with sides the length of the genome and the required depth of coverage.

The subsample_bam program from pomoxis implements a heuristic to approximately solve this problem:

This program also has convenient filters for read quality, accuracy, and length.

Reading our BAM file in Python with pysam

Sometimes, one needs more powerful programmatic access than that which is available via command line tools alone. The good news is that there are several ways to access SAM and BAM files using Python and other languages.

For example, pysam is a python library which makes the manipulation of various biological formats as straightforward as possible. Specifically, Pysam provides the pysam.AlignmentFile parser.

We can use the parser to iterate over the alignments within our BAM file, and quite easily visualise the alignment results, as below:

The documentation for pysam is generally quite helpful for summarising the information contained in BAM records.

To create a simple tabular summary of per-read statistics the stats_from_bam program in pomoxis is useful:

Plotting depth of coverage across a BAM file with Mosdepth

A very common requirement is to investigate the depth of coverage across the genome, i.e. the number of reads on average overlapping any given locus. This may be in order to determine how well a sequencing run performed in terms of on-target throughput, to check if we have enough depth of coverage to support variant calling, or to find out if there are any areas of the genome that have been under-represented in our sequencing run.

Whilst samtools does provide tools for performing these checks, mosdepth is simpler to use whilst also being faster, making it a natural choice for running coverage analysis.

Fundamentally mosdepth accepts a BAM file and in turn creates several output files containing coverage information. Before we look at those, let's go over the command we'll run:

We should now be seeing a set of new files all prefixed with ecoli_cov, but what do they contain?

First, let's examine the summary info:

Once again, we only have one chromosome U00096.3, and we can see that we have mean coverage of 21.61 across it, with highs of 105 and lows of 0.

What could cause such highs and lows? Is this unexpected?

Usually, we want a bit more insight into the our coverage data, whilst maintaining a high level view. So let's make some plots, and for that we'll drop back into python.

The visualisations above should provide even greater insight into the depth of your alignment data across the genome.

The cumulative coverage plot (top) reassures us that a high proportion of the genome is well covered, and that the mean we saw in the summary file is reasonably representative.

However, with the coverage across genome plot (bottom) it is easy to see that whilst coverage remains quite even across the genome, there are dips and troughs in certain regions, as well as regular outlying areas of extremely high or low depth. In your experiment these regions could lend themselves to further investigation.

To show the effect of the subsample_bam program from pomoxis lets also plot the coverage trace from the bam produced: