The tutorial is intended as a gentle introduction to Sequence Alignment/Map (SAM) formatted files and their binary equivalents BAM.
Methods used in this tutorial include:
The computational requirements for this tutorial are:
⚠️ Warning: This notebook has been saved with its outputs for demostration purposed. It is recommeded to select
Edit > Clear all outputs
before using the notebook to analyse your own data.
This tutorial aims to elucidate the information stored with a SAM and BAM files, and how such files can be read, or parsed, within the Python programming language and on the command line.
The goals from this tutorial include understanding:
The tutorial includes a small E.coli dataset (100x) in the form of a compressed fastq file to demonstrate the key learning objectives.
Before anything else we will create and set a working directory:
from epi2melabs import ping
tutorial_name = "bam_tutorial"
pinger = ping.Pingu()
pinger.send_notebook_ping('start', tutorial_name)
# create a work directory and move into it
working_dir = '/epi2melabs/{}/'.format(tutorial_name)
!mkdir -p "$working_dir"
%cd "$working_dir"
/epi2melabs/bam_tutorial
For the purpose of an introduction to the SAM/BAM format, we have uploaded a small E.coli dataset (100x) which is available to download here in the form of a compressed fastq.
To download the sample file we run the linux
command wget
. To execute the command click on the cell and then press Command/Ctrl-Enter, or click the Play symbol to the left-hand side.
bucket = "ont-exd-int-s3-euwst1-epi2me-labs"
domain = "s3-eu-west-1.amazonaws.com"
site = "https://{}.{}".format(bucket, domain)
# E. coli sample dataset
!wget -O 100X_ecoli_zymo_r103.fastq.gz \
"$site/$tutorial_name/100X_ecoli_zymo_r103.fastq.gz"
# Reference sequence
!wget -O ecoli_k12.fasta \
"$site/$tutorial_name/ecoli_k12.fasta"
Before we move on, let's quickly inspect the sample .FASTQ we have downloaded using seqkit
. This should give us an overview which we can refer back to later. Again, click play to run the following command.
!seqkit stats *.fastq.gz
file format type num_seqs sum_len min_len avg_len max_len 100X_ecoli_zymo_r103.fastq.gz FASTQ DNA 52,061 448,038,968 103 8,606 111,547
The SAM format is a file type most commonly used to store the results of nucleotide sequence alignments. However, the SAM format is also used to store unaligned nucleotide sequence data, and this is use-case is steadily increasing in popularity. In this tutorial, we'll focus on aligned-SAM files.
SAM is an important biological file format, not least because of the extensive ecosystem of tools which require it as input, but also thanks to the range of utilities which make it easy to write, query and transform them. In this tutorial we will introduce the format, provide useful links to tools and documentation, as well as help you to get started with some basic examples of creation and inspection.
Sequence Alignment Map
Binary Alignment Map
, and is the compressed, binary version of a SAM file. This compression helps reduce the file size significantly. Most of the tools which operate on SAM files also work on BAM files interchangeably, but equally many tools require BAM instead of SAM.With our introduction out of the way, let's dive into what a SAM file looks like. However, before we do, arm yourself with this link to the SAM specification, and keep it handy. It may look daunting, but its a very clearly laid out document which you can consult with any time you have a question related to SAM formatting.
Quoting the spec, a SAM file is described as a TAB-delimited text format consisting of a header section, which is optional, and an alignment section.
Let's go through each section in turn.
The header section naturally precedes the alignment section. If it exists, it can contain any information but is primarily the home of meta data, like SAM versioning info, the commands used to generate the file, or information about the references used in alignment. Header lines start with an @
character.
Here's an example header, in this case generated from an alignment of the E. coli dataset we provided to the E. coli K12 reference sequence:
@SQ SN:U00096.3 LN:4641652
@PG ID:minimap2 PN:minimap2 VN:2.17-r941 CL:minimap2 -ax map-ont ../ecoli_k12.fasta reads.part_001.fastq
The first two letters of a header line are known as 'header record type codes', and denote the kind of information each header line will contain. For instance, SQ
refers to Reference sequence dictionary
, which stores info about the reference sequence against which the alignments in this file were made.
Thereafter, each entry within a header line follows the format key:value
, e.g. LN:4641652
which gives us the length of the reference in bases. It's as simple as that!
❓ By looking at the SAM specification, can you figure out what PG stands for?
Various command-line tools and software libraries make use of the header section by writing all sorts of meta-data for later use, either by a human manually inspecting the header, or by another tool looking for certain key bits of information. However, it is important to remember that the header and the information it contains is optional.
The alignment section is where the bulk of the information in the file is stored. A SAM record is a description of an alignment (or an unmapped read, if it was unaligned). There is one record per line, and each record consists of a number of tab-separated columns. There are 11 required columns. Here's an overview of those, with informal descriptions for each:
Column | Descripton | |
---|---|---|
1 | QNAME | Query name, i.e. the name of the read which was mapped (copied from the read's FASTQ record header) |
2 | FLAG | An integer which is a short hand way of describing the type of alignment this is. See below for an explanation of FLAGS. |
3 | RNAME | Name of the reference to which the read was mapped, e.g. chr1 |
4 | POS | 1-based leftmost mapping position with respect to reference, e.g. a POS of 100 would indicate the read mapping starts at the 100th position of RNAME |
5 | MAPQ | An integer which should be a reflection of the quality of the mapping, but for which the scale and meaning may differ by aligner |
6 | CIGAR | A string of letters and numbers describing the matches and mismatches of the alignment from left to right. |
7 | RNEXT | Ignore for now. Becomes relevant when working with pair-end data. |
8 | PNEXT | "" |
9 | TLEN | "" |
10 | SEQ | The sequence of bases constituting the underlying read |
11 | QUAL | The sequence of ASCII characters encoding the PHRED quality scores for each base of the underlying read |
The formal descriptions for these columns (and everything) can always be found in the SAM specification.
It should be noted that there is also an optional section, placed after the 11 above, which not unlike the header is used flexibly and optionally by tools to store pieces of additional information in the shape of TAGs.
Each tag takes the form TAG:TYPE:VALUE
format where TAG
is a two-character string, TYPE
is a symbol denoting the type which the value may be (e.g. an integer, or a word), and VALUE
is the VALUE
of the TAG
for this alignment. This area is often used by tools to store some additional meta data per alignment, e.g. the edit distance between the read and the reference:
NM:i:2
There are some pre-defined TAGs that, if set, should have a certain consistent meaning, but equally there is great scope for tools to come up with their own tags. That's all we need to know for now.
In order to illustrate everything we have just explored, here's an example alignment record from our E.coli sample:
e813406f-821c-4cbb-85f6-c7fc811703a3 0 U00096.3 1958697 60 69S25M2D8M4I7M1I3M2I42M1I36M1D36M2D34M1I14M2D129M2D22M2I30M1D2M1D75M1D15M4I4M3D2M1D108M1I58M1D54M1I3M1I56M1I5M1D3M1I19M1I30M1D95M1I73M2I64M1D91M1I114M1D19M1D185M1D3M1I110M1I21M1D291M1I123M1D6M2D5M1I32M1I4M1I77M1D35M1I27M2D2M1I50M1D36M1D5M1I33M1D21M1D34M1I31M1D41M2I16M1D105M1D11M1I42M1I62M3I2M1D12M1I91M1I12M2I69M3D302M2S * 0 0 TTCAGACCTTCGTTCAGTTACGTATTGCTAAGGTTAAACAGACAATTAAACGGAATGACAGCACCTACTGGCATCGAGGAAGTGGTTCAGGTTTTGCACCGAAGAGCTCCTGCTTCCAGGGCCATAATGATTGGAGCATCACGATGGGCAATAAGTTTCGTAAATTTTTGATAAACCTTGCAGTGCGTCTGATTTTTGCCTGATGTCAGTACCCGGGCAATGCCCAATTCCGCAATTTCTGAGTGTATATAAAGGGTTAGCGCGCATACTCGAAGGCGCGATGAAGACGCTGCCAGCGGACCGGCAGCAGCCATTTCCTTTTCCATTGGTGGCATATCAACATTCCATTCAACATCGAGAACGCCCGTCACCAGTCCAGGAAAACCTAATTCGCGGACCGTGCGCACATCCTGGAGTGGCGGCAAACTCACCGTCGCGCTGTCGCGGAAATCGCCACCGCGTGGGCAATATTGGATGCACCGGGATCGTCACCCGCTGGCGCACGGATTTCAGTACACCCAACGACGGCGTTAAGCCCCCCTCTTTGGGGCTACGCGTGGTGACTTCTTGTCTGCGCCGTTTTGCTGCGCCGTTAGTGCACATTCCATGCTGTAACAGCAAATTTCCAGTAATGCCATTTTTTCTCCTTAATTGCGCCGACTGCTCGCTGGCAACGAATCTCTTCAATGGACCACGGATGAAACTGCAACGTTGTTTTGCCGTTCATAACCGCCAAGTAGGATTCGGTAATCGCTCATGCTCACCTTTTGGCGAACTGGTTTTCACTGACAATCCCCCGGTAGGCTGAGTCCTTCATCAGAAAGCAATGCCAGCGCACGGGCGTCCAGCGTTTTCTGATACACCCGGCAGAACGATTTCAAATATGTTACCAACCTTCGTGTGGGTAACGTTTTCCCCGGGCCACGGTAGCTCCACAATAGAAAACTGCCAGTGCGCAACCTGTACCGGTTCATGCAATTTAAACAGACAAATCGGTCTGCCATTGGATCATATTTTCTGACAAAAGCTCGCCACACTGTTCAAACCCGCGACGCCAGCGTTCAGCAGTGGCGTTTTGATGTGGCAACGCAAAAAAATGTGATCGGCAGTCAGCGGAGTGATATTCAACCCCAGACGGCGGGAAACCCATCTAATGCGTGGATAAATCGCGGTAAATCCGATGCAATATCCTGCAGCTCGTCGATAGATTGCCAGTTCGCCATAATCACTCTTCGTACTTTCAGTAAAAGCGTTAATTTACCCTGTTGCCCTGTGCCAACCAACCGCTGATTTCACGCCGCTTCTGATGCAATCGTGAAAACGGCAATACGCCGCGCGCACGTTGCTGACGGAGCGGCCATTTGCGGTGTCTCCCGCCCTAATTTCTTTAACTGGTGCGGGCAATTTTTGCTCGCTTCATCAATGTAAGGTATTCCGGTGAATATTCAGGCTCTTCTCTCAGAAAAAGTCCGTCAGGCCATGATTGCGGCAGGCGCGCCTGCGGATTGCGAACCGCAGGTTCGTCAGTCAGCAAAAGTTCAGTTCGGCGACTATCGGCTTAACGGCATGATGGCAGTTGCTAAAAAACTGGGTATGGCACCGCGACAATTAGCAGAGCAGGTGCTGACTCATCTGGATATTAACGGTATCGCCAGCAAAGTTGAGATCGCCCAGTACAGGCTTTATCAACATTTCCTTGATCCGGCATTCCTGGCTGAACATGTTCAGCGGGCGCTAGCATCCGATCGTCTCGGTGTTGCTACGCCAGAAAAACAGACCATTGTGGTTGACTACTCTGCGCCAAACGTGGCGAAAGAGATGCATGTCGGTCACCTGCGCTCTACCATTATTGGTGACGCAGCAGTGCGTACTCTGGAGTTCCTCGGTCACAAAGTGATTCGCGCAAACCACGTCGGCGACTGGGGCACTCAGTTGGGTATGCTGATTGCATGGCTGGAAAAGCAGCAGCAGGAAAACGCCGGTTGAAATGGAGCTGGCTGACCTTGAAGGTTTCTACCGCGATGCGAAAAAGCATTACGATGAAGATGAAGAGTTCGCCGAGCGCGCACGTAACTACGTGGTAAAACTGCAAAGCGGTGACGAATATTCCGCGATGTCAGCACAAACTGGTCGACATCACCATGACGCAGAAACCGGGATCACCTACGATCGTCTCAACGTGACGCTGACCCGTGATGACGTGATGGGCGAAAGCCTCTACAACCCGATGCTGCAGGAATTGTGGCGGATCTCAAAGCCAAAGGTCTGAGCAGTAAGAAGCGAAGGGGCGACCGTCATCTCCTTGATGAGTTTAAAAACAAGGAAGGCGAACCGATGGGCGTGATCATTAGAAGAAAGATGGCGGCTATCTCTACACCACCACTGCGTCGCCCTGTGCGAAATATCGTTATGAAACACTGCATGCGATCGCGTGCTGTATTACATGACTCCCGTCAGCATCAACACCTGATGCAGGCGTGGGGCGATCGTCCGTAAAGCAGGCTATGTACCGAATCCGTACCGCTGGAACACCACATGTTCGGCATGATGCTGCGGGTGAAGACGGCAAACGTTCAAAACCCGCGCGGGTGGTACAGTGAAACTGGCCGATCTGCTGGATGAAACCCTGGAACGTGCACGCCGTCTGGTGGCAGAAAAGAACCCGGATATGCCAGCGACGAGCTGGAAAAAACTGGCTAACGCGGTTGGTATTGGTGCGGTGAAATATGCCGGATCTCTCCAAAAACCGCACCACGGACTACATCTTCGACTGGGACAACATGCTGGCGTTTAAAGAGGTAATACCGCGCCCATACATGCAGTATGCATACACGCGTGTATTGTCCGTGTTCCGTAAAGCAGAAATTGACGAAGAGCAACTGGCTGCAGCTCCGGTTCTGAGTCCGTGAAAATCAGGTGAAGCGCAACTGGCAGCTCGCCTGCTGCAGTTTGAAGAAACCCTCACCGTGGTTGCCCGTGAAGGCACGCATGTAATGTGTGCTTACCTGTACAATCTGGCCGGTCTGTTCTCTGGCTTCTACGAGCACTGCCCGATCCTCAGCGCAGAAAACGAAGAAGTGCGTAACAGCCGTCTAAAACTGGCACAACTGACGGCGAAGACGCTGAAGCTGGGTCTGGATACGCTGGGTATTGAGACTGTAGAGCGTATGTAATCGATTTTTCGTGAGAGTAAAGCCTGATCAGGGTTAGCCGATCAGGCTTTTTTATTGCCATCTAAATGTATTCTGAAAATGGACATGCCATTGTTTTCTCACTGTTGGATAAGAAC %$&#&#"'/,&*')5869;<?=<:@@?;;<<99JG00@/#%(5$&+**)@'#$@(%(AFB@EH@@DA?>(/7.*#$-+,30'+'$/$$$05765-%&%'*%38,%*)0("''*%&)-###.8401&&+,89:>?>;D77890/DDGFCEDC)))0))&*001494:DH*9JENJLCC@@BCG?FBB:::?>G@GUMC8%7649996>@?@>>F=57F=35CBDC:-2<;$5+3*().$%0+(-8770,+'*0<,-222'&/16+5,/.&(%*.5+/,-+%)&$+/#&'$<=7CE=AA68CA/9DA;EA?>D6,/8(*(6C<,-;057%(<>AA89.-0$(7#8/01,.%(-1><:8;?8?CC/$A+(2.(77-/2-),--,8'*-+**+3,04<>@<;=C>DF@CBD..9::$%,-44<CFF867A894(==9:.(-$*-#&$+($%64308%>C6A:9>00&(1($'0*-'&4?<7>/(+6=:<11:9??,@?<FCFF@DDB=2@=RTG88==7<788F9<65DB?GGDIHLJEJ=C?>:<2EG-*2/206('**'*%%)&'4*,$525.067=4G@@@MRJC;@BA;@EJIK./=B9?GGTK?FD>@>AE<@BE@F@78E4MLCL;;:FF8557674//'$%57@LF&.11-?5=A<?=;4246/=>@>=,/(#%-=7.@A>>;?EJCN4CB6?B:77CJD+%"$/&*1887=B74A<<122+%&#+4:%&&#+,,#/4<==88A?CC=E??>7=>@84C/A:6<G/0H::O:JCKKB?G6G<A=%#$'+?9BG92:A<F>O??EF::D?KBDCBHCMRSTCBIH?K@OK@><C?6992>.1-2&>@=BF?--*(32$&&$>??HELCOME@DKK;6*+/97==;>%:?B7GAGD=6>><43:$$/647)'$.BI:'4=5HC;=CE::A@>G5.%+344@BBE=@A@LKEHG@DBIKECCJDF@F@HQPEEFKEQKKQA88M:9@A@ABIA?>>A@@>*-(/.&(11,13486631@CCDA<MID>BC9GFDB@;@A?H@3FH4E:@:9B1AGDGAE>DI>:>FHC:PFJ?QOS@*-((#%%'&*A6>%&:9899=>ACHAE?LLC=<NGGEEFJ>@EKHKHH>GHJI@HC??312)(2::-(**9++*;'2==8>DHB1EOMJCA>=;><EEICB<>@<<8AA?CHCCDEHHAA?<;;-?G?++8BCHKP?>KHL?HCGFDEFK43:@KG:36&35E835+>?@RG:GBOA@@AM><+?@?ABCGSHB@E=GG@6::BBBIFEEBKDNPCFBDG@+=:I>HBGE?DG9;C/8<<FEHFB@B@?@22G<9+'%-(+1.4*>29;:0<.')+'/0?GD.<@9,&*%%%#%)(?<(8;67;69A4,7G:0>JCH>A@JBEAE/0137,(+8=GEDAB=FIAFHB?0=>=B@ANFD:CG?DQDKA2439>>B@=B6<=:>+(/,0021/41=FD:..B@?,536@9$-,678;@><;:<A=<B>@@BA@:?J>D>JEIGDDN7<<42:6/273))@HAIE68;146AE8=95:2)(01#'&.968BE<BDOEKI?$C7:0&5-/@=;=CAEC.;+9*,E@@AF@H@:@BKG@<BLIABF>4;<<=?AD875007B1/<6##&%())?7)(&*8>.3.+,0()*%(&$%&,'4)#*+$$%,755G$>:,'()#39*'0:2.19:?CEDDKD';AADG<DI;6$+./.+6=0-/<@;55&)*'(.4,4#$7:635DA?>GH;;>=CLAQC;FLKG@BLFB;>EG@FAB=KD?EB==B?=BJEHHA/AA0AEDB'&DDC<:9FDDLFACAE=<<C.<AFJ>>ECA?;<@::3,,.>B@;B?6F(531:@*+'+3=<@BI@85FA>BK@@@?JILFHRE?IBN@;B=@?&'D,D5>/0*$(,(+)33414<0033(301$%A;CAH9;GB>>:>9BGHEAN;34983@)G?=9DC=<7EHFB:B@CCBB6A:622==7JDDAB=0?66*-@DI@@=F34A=@CD@DGDE=?HJHGIBI:G5A:<IFHIHIFHJRL/>>=><,0&'&&7)?:CD@C;JJNA==?@FE@FFYG?>GC>?H.9:;=0'+.68?><8.33/4D??<$'0%$C23C4DH>=DA@45FKC@<?AEG>:C($-/%37$%,:;<&60:*'&>98>>77@CC3FHG<<FA?PGPL<D<=GBC>?@F>/,,0*.$%*-'.22**+.04<;///++--+<*,.%+:9'.-.1.0<=4560369GDAA>987%%%%/*%4+&&&76//.;>33/.'1''*&'/&&&=,4?6#%&%$$/31&+G=@F???@GLBDADE?=86C<A?:I:?07*)*.*&-<<6<<=3<<><&$6797711.)+9'==3:<--8+#&)'%$964'8=65JMMDLH7@CCC>=HPD@,;=<71;*0(A698=;<1<<@A.3699&$&;680=9;=8>2>22:ABK=5DCFG@LEMEE?IA>@B</.B:/%%=;DAABKAE=IB>=?A@>/5,)40FC>?GACGBF@LJFBJ9GF3C/A:-0?;FG7DGB?<,,*$-.#%$+-.$39?@HHC4(=?64CFHDA;<A>=4<=CDDC?E;;68;<4548<81)84:?E8FECF@='8(1,1259:77+(##388<?EC7767;EHB312=0&$5,:>DH>J@?;:E10*0/A7,-,BEC.3339</8530*+'(+3:;=H@E@>?BBIA@B=>=A>+:11/78(1/2<;00037D94ULKD??C;:C6;68A=>@@EDEMIA:GHCH9@FDHHK5GE44,+.(-,$#$'((.&&#$$$*,$.&&@<:9AFMFF=6H@CCFB338C5?C==<<?HLKEDIFJ@;KRI:@=A@FADEIQHFB1BK::FIMF?@YVNEG=C@:>AB8G9='5.0#$#&&+597=8)B2694*+-E<8E>DA8H:DBGFG@@98/169;9AC?@F@=D?ABFCDQJCB8CDDCEC<E=QCFK@3997DB*43,$$$28;;*,5661,$+4337=AICE>-/ACEFKG<CDDF@CML<A@AJIHLGG@ABGF?ECABD>?=.<98=?73=E1K17;;>B:9B@=HD8EFDA=BB@&<=844,,23I8;?>JB4@:;DADH<DB<=JHB@EDCBGLL=ICBB<9>@KEG<<=BGB48;;<?1+$4:>@B=3+,,B<<D<?;KKGEDNLLXNE;??32.9@:$$D97@>:;:0*+17;797>:.;9**/:73/2((&)+((*+))+/''((3##%(,943CKGDG?LDB@HBHAJD=:JJ3@DF<BABF>43:1//&&(* NM:i:163 ms:i:5458 AS:i:5458 nn:i:0 tp:A:P cm:i:366 s1:i:2411 s2:i:0 de:f:0.0433 rl:i:0
❓ What is the value of the FLAG column?
❓ What is the edit distance of this read to the reference?
The FLAG column explained.
Above we encountered FLAG
, one of the required columns in a SAM record. As we noted, it is used to denote what kind of SAM record we are looking at. To the end-user, the practical value of this is that the value of FLAG may be used to filter alignments with samtools. This is especially useful since you may not want to carry forward all alignments in your SAM file to the next step of your analysis, you may require only a specific subset of them.
So with that in mind, what are the main types of alignment record and what values may FLAG
take? Primarily a SAM file will contain four broad categories of alignments:
⚠️ *Note*: Not all aligners operate in the same way. The above information is based on Minimap2's behaviour. Some aligners, for instance, may not always write out unaligned records. If you choose to use a different aligner for any reason, it is always important to read the documentation first.
As mentioned above flags are a short-hand way of denoting the type of alignment recorded in each record. They can be referenced as integers or in hexadecimal form.
Integer | Bit | Description |
---|---|---|
1 | 0x1 | template having multiple segments in sequencing |
2 | 0x2 | each segment properly aligned according to the aligner |
4 | 0x4 | segment unmapped |
8 | 0x8 | next segment in the template unmapped |
16 | 0x10 | SEQ being reverse complemented |
32 | 0x20 | SEQ of the next segment in the template being reverse complemented |
64 | 0x40 | the first segment in the template |
128 | 0x80 | the last segment in the template |
256 | 0x100 | secondary alignment |
512 | 0x200 | not passing filters, such as platform/vendor quality controls |
1024 | 0x400 | PCR or optical duplicate |
2048 | 0x800 | supplementary alignment |
For instance, here you can see that 4
or 0x4
refers to unmapped reads, whereas 256
or 0x100
refers to secondary alignments. Later we'll provide examples of how flags can be combined and provided as an argument to samtools, which permits us to view only matching alignments. For now, here's the rule to remember:
For instance, if we wanted to address secondary alignments and unmapped, we would add the flags together, to arrive at 260
or 0x104
.
If you ever need a helping hand to figure out which
FLAG
is right for you, this webpage should be useful.
The CIGAR column explained.
Above we encountered the concept of the CIGAR string, which is a condensed alphanumeric representation of an alignment.
Let's take the example from above:
69S25M2D8M4I7M1I3M2I42M1I36M1D36M2D34M1I14M2D129M2D22M2I30M1D2M1D75M1D15M4I4M3D2M1D108M1I58M1D54M1I3M1I56M1I5M1D3M1I19M1I30M1D95M1I73M2I64M1D91M1I114M1D19M1D185M1D3M1I110M1I21M1D291M1I123M1D6M2D5M1I32M1I4M1I77M1D35M1I27M2D2M1I50M1D36M1D5M1I33M1D21M1D34M1I31M1D41M2I16M1D105M1D11M1I42M1I62M3I2M1D12M1I91M1I12M2I69M3D302M2S
What we see is a pattern of integers followed by a single letter, e.g. 69S
which is followed by 25M
, and so on. The letter denotes the type of operation that the aligner performed when constructing the alignment, and for how many positions that operation continued. For instance 25M
means that the aligner was able to align 25 bases of the query to the reference in a single stretch. Note however that:
In SAM-speak the idea of "Match" simply means a base in the query sequence is aligned to a base in the reference. In does not mean the two bases are the same.
In our example 25M
is followd by 2D
. What does this mean? When an aligner makes an alignment, there might be errors or variation in the read, meaning that the alignment to the target may be imperfect. For instance if some bases appear to be missing in the read, the aligner may mark that a deletion has occured, thereby adding 2D
into the CIGAR string.
The set of possible operations are listed within SAM specification, as follows:
Operation | Description | Consumes query | Consumes reference |
---|---|---|---|
M | alignment match (can be a sequence match or mismatch) | yes | yes |
I | insertion to the reference | yes | no |
D | deletion from the reference | no | yes |
N | skipped region from the reference | no | yes |
S | soft clipping (clipped sequences present in SEQ) | yes | no |
H | hard clipping (clipped sequences NOT present in SEQ) | no | no |
P | padding (silent deletion from padded reference) | no | no |
= | sequence match | yes | yes |
X | sequence mismatch | yes | yes |
The clipping operations S
and H
warrant further explanation:
*Soft clipping* is more common, and refers to the instances in which alignments do not start at the end of the read, but somewhere within it. Hence, the portions of the read flanking the alignment are marked as 'soft clipped' within the CIGAR string, which lets other software know not to use it, e.g. for variant calling.
*Hard clipping* is employed when reporting a supplementary alignment, i.e. when the read is split to create more than one alignment, the rest of the read which aligns elsewhere is hard clipped, i.e. omitted from the recorded SEQ.
These, however, are not hard and fast rules, but just general rules of thumb to get you acquainted with the concepts. Again, aligners may employ hard or soft clipping strategies differently.
Alignment (or mapping) is one of the key methods in bioinformatics, and is essential for making comparisons between sequence data. This is neither the time nor place to cover mapping in full detail, and there's a lot one could say about it, but in-so-far as it pertains to aligning Oxford Nanopore Technologies' reads and generating a BAM file, the process is actually quite straightforward. So, let's get started!
In this demonstration we use minimap2, the aligner most commonly used with ONT data. It is super fast, accurate and easy to use. Minimap2 lives here on github, and the README there provides detailed usage instructions.
Let's break down the basic invocation of the tool:
minimap2 -a -x map-ont ref.fa query.fq > alignment.sam
minimap2
is the base command.-a
is a flag which turns on SAM output, which is what we want, otherwise minimap2 outputs an alternative PAF format-x map-ont
tells minimap2 that we want to align ONT data, allowing the tool to modify its internal parameters to make the best alignmentsref.fa
is a positional argument, which means it must come first after any flags such as -a
. This term refers to the reference, the sequence(s) against which your reads will be aligned.query.fq
represents the .FASTQ file containing your reads> alignment.sam
is a statement saying 'redirect the output to a file called alignment.sam'. Don't worry, alignment.sam
will be created if it doesn't already exist.Substituting that example invocation for our own, let's map the sample data downloaded above against the E. coli K12 reference. Run the cell as before:
reference = 'ecoli_k12.fasta'
query = '100X_ecoli_zymo_r103.fastq.gz'
!minimap2 -ax map-ont $reference $query > alignment.sam
[M::mm_idx_gen::0.668*0.27] collected minimizers [M::mm_idx_gen::0.704*0.39] sorted minimizers [M::main::0.704*0.39] loaded/built the index for 1 target sequence(s) [M::mm_mapopt_update::0.720*0.40] mid_occ = 12 [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1 [M::mm_idx_stat::0.729*0.40] distinct minimizers: 838542 (98.18% are singletons); average occurrences: 1.034; average spacing: 5.352 [M::worker_pipeline::115.487*1.61] mapped 52061 sequences [M::main] Version: 2.17-r941 [M::main] CMD: minimap2 -ax map-ont ecoli_k12.fasta 100X_ecoli_zymo_r103.fastq.gz [M::main] Real time: 115.517 sec; CPU: 186.249 sec; Peak RSS: 1.208 GB
We should now have a SAM file. We can do quite a lot with it already, but in general we want to compress the file to make best use of disk space, and because many tools (such as IGV) require BAM files as input. As such, its always good practice to convert SAM to BAM.
Let's introduce samtools to take care of that for us. Here's the command:
samtools view -S -b alignment.sam > alignment.bam
samtools
is the base command.view
is the subcommand. Samtools has many different sub-commands. This one, view
, is used for converting SAM/BAM files between formats, as well as accessing certain records within the file, e.g. those aligned to a specific chromosome.-S
is a flag that tells samtools view
that our input will be a SAM file.-b
is a flag that tells samtools view
that it should output BAM records.alignment.sam
is the SAM file we made, which we inputting to this command.alignment.bam
is the file to which we will redirect the BAM output.For every samtools subcommand, there are many flags and arguments that you may provide to customise its behaviour. Whenever you need to do something, but don't immediately know how to, you should consult the documentation.
There a few steps we want to perform during our conversion from SAM to BAM, as follows:
In fact it is possible to compress and sort a BAM file in a single command:
!samtools sort -o sorted.alignment.bam alignment.sam
If given the -o
argument with a filename ending in .bam
, samtools sort
will output a BAM file, and of course take care of the sorting at the same time. Great! Now we just need to index our sorted file.
!samtools index sorted.alignment.bam
We should now see two files sorted.alignment.bam
and sorted.alignment.bam.bai
.
Now that we have our sorted, indexed BAM file we can continue on to interrogating the data stored within by using other samtools commands.
Here is a number of simple examples to whet your appetite.
samtools view
¶To view the binary BAM file as human readable text we can use the view command. Adding the -h
option prints also the header:
# view the first 10 lines of the bam file
!samtools view -h sorted.alignment.bam 2> /dev/null | head -n 6
@HD VN:1.6 SO:coordinate @SQ SN:U00096.3 LN:4641652 @PG ID:minimap2 PN:minimap2 VN:2.17-r941 CL:minimap2 -ax map-ont ecoli_k12.fasta 100X_ecoli_zymo_r103.fastq.gz @PG ID:samtools PN:samtools PP:minimap2 VN:1.10 CL:samtools sort -o sorted.alignment.bam alignment.sam @PG ID:samtools.1 PN:samtools PP:samtools VN:1.10 CL:samtools view -h sorted.alignment.bam adbc3e89-f487-40f7-b99b-adbc3cb66a9f 2048 U00096.3 1 60 17273H22M1I4M1I20M1D49M1D52M3D12M1D10M1D5M1I11M1I4M1D4M1D1M2D5M1D2M1D2M1D6M1D5M4D13M1D2M2I52M2D59M4D35M3D8M1D5M1D3M2I22M2D5M1D7M1D7M1I35M1I3M2D8M1D62M3D6M4D47M1D85M1I16M2I1M1D5M1D131M1I3M1D120M2D51M1D7M1I40M1I9M1D8M1D3M1D40M1I97M1D39M2D3M1I47M2D11M1D101M1D94M1D63M2D108M1D159M1D101M2D30M1I74M1D5M1I25M1D133M2I4M1D71M1I82M1D60M2I26M3D105M2D1M1I146M2D12M1I56M1I112M1D6M1D24M1D8M1I23M1I37M1I1M2D6M1D2M1D130M2I4M1I4M1I3M1D4M2D3M1I63M1I108M1D52M1I5M1I11M3D99M1I66M2D19M3D4M1D11M1I8M1D75M1D57M2D77M3D33M1I6M1D99M1D6M1I71M1I8M2D196M1D2M3D114M1D26M3I4M1D129M1D13M1D49M4D14M1I199M1D77M2D2M1D51M1I8M3D7M2D126M1I31M1D126M1D20M4I2M1I133M1I3M1D5M1I15M1I16M1I47M1I51M2D18M1D42M1I34M1D96M2I8M1I2M1I1M1I13M1I56M1D11M2D32M1I6M1D17M1I8M1I1M3D49M1D31M1D9M1I37M1D95M2D7M1I41M1I6M1I43M1D28M1D19M1D32M1D5M2I4M3I23M2I1M1I41M1I14M2I7M1I34M1I15M1I223M1I86M1D79M1D6M1D2M1D14M1I57M2D44M2I57M1I17M1D50M1I139M1D138M1D10M1H * 0 0 AGCTTTTCATTCTGACTGCAACGGGGCAAATATGTCTCTGTGTGGATTAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATCAGCATGCACAAACCAATGAGATTACAGTTACAGCAACTTCAGTGCAAACAATTGCCACCTTCCGCGCCACGCCACGCGGGCAGCGCTCGGAGGCTGGCTCGGATCGGAAACGGCGAAGAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTGGGCGGTACGTGGTGAATGCAGAACGTTTTCTGCGTGTTGCCGATCTGGAAACAATGCAGGAGCAGGGGCAGGTGGCCGCCGTCCTCTGCCCCGCCAAATCACGCGACCACCTGGTGGCGATGATTGAAAAAACCATTAGGCGCAGATTGCTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGGAACTCGCCGCCGCGGCGGGCCGCTGGCGCGACTGAAAACTTTCGTCGATCGGGAATTTGCCCAAATAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGAATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTGCGATCGCCATTATCGCAACAAGTATAGAAGCGCGCGGTCACAACGTTCCTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCTGAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGGATGCATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGCTGCCTGTTTACGCGCCGATTGTTGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGTCAGGTGCCGATGCAGAGGTTGTTGAAATTGATGTCCTACCAGGAAGCGATGGAGTTTTTCCTACTCAGCGCTGAGCCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGAATCCCTTGCCTGATTAAAAATACCGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGGGCATTTCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGATTGAAAGGGATGGTCGGCATGGCGGCGCGCGTCTTTGCAGCGATGTCACGCCGGTATTTCATGGTGCTGATTACGCAATCATCTTCCGAATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTCTCCTGGAACTGAAAGAAGGCTTACTGGAGCCGCTGGCAGTGACGGAACAGCTGGCCATTATCTCGGTGGTAGGTGATGGTCTGCGCACCTTGCGTGGATCTCGGCGAAATTCTTTGCCGCACTGGCCCGCGCCAATATCAACATTGTCGCCATTGCTCGGATCTTCTGAACGCTCAATCTCTGTCGTGGTCAATAACGATGATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTCAATACCGATCAGGTTATCAAAGTGTTGTGATTGGCGTCGGTGGCGTTGGCGGTGCGCTGCTGGAGCAACTGAAGCGTCAGCAAAGCTGGCTGAAGAATAAACATATCGACTTACGTGTCTGCGGTGTTGCCAACTCGAAGGCTCTGCTCACCAATGTACATGGCCTTAATCTGGAAAACTGGCGGAAGAACTGGCGCAAGCCAAAGAGCCGTTTAATCTCGGGCGCTTAATTCGCCTCGTGAAAGAATATCATCTGCTGAACCCGGTCATTGTTGACTGCGCTTAGCCAGGCGGTGGCGGAATGATATGCCGACTTTATTGCGCGAAGGTTTCCACGTTGTCACGCCGAACAAAAAGGCCAACACCTCGTCGATGGATTACTACCATCATTGCGATTATGCGGCGGAAAAATCGCGGCGTAATTCCTCTATGACACCAACGTTGGGGCTGGATTACCGGTTATTGAGAACCTGCAAAATCTGCTCAATGCAGGTGATGAATTGATGAACTTCTACGGCATTCTTTCTGGTTCGCTTTCTTATATCTTCGACCACGGTGAACGAAGGCATGAGTTTCTCCGAGGCGACCACGCTGGCGCGGGAAATGGGTTATACCGAACCGGACCCGCGAGGATGATCTTTCTGGTATGGATGTGGCGCGTAAACTATTGATTCTCGCTCGTGAAACGGGACGTGAACTGGAGCTGGCGGATATGAAATTGAACCTGTGCTGCCCGCAGAGTTTAACGCCGAGGGTGATGTTGCCGCTTTTATAGGGAAAATCTGTCGCAACTCAACGATCTTGCCGCGCGCGTGGCGAAGGCCCGTGATGAAGGAAAAGTTTTGCGCTATGTTGGCAATATTGATGAAGATGGCGTCTGCCGCGTGAAGATTGCCAAAGTGGATGACAGCATCCGCTGTTCAAAGTGAAAAATGGCGAAAACGCCCTGGCCTTCTATAGCCGCTATTATCAGCCGCTGCCGTTGGTACTGCGCGGATATGGTGCGGGCAATGACGTTACAGCTGCCGGTGTCTTTGCTGATCTGCTACGTACCCTCATGGAAGTTCAGGAGTCTGACATGGTTAAAGTTTATGCCCCGGCTTCCAGTGCCAATATGAGCGTCAGGGTTTGATGTGCTCGGGGCGGCGGTGACACCTGTTGATGGTGCATTGCTCGGAGATGTAGTCACGGTTGAGGCGGCAGAGACATTCAGTCTCAACAACCTCGGACGCTTTGCAATAACTGCCGTCAGAACCACGGGAAAATGCGTATCTGCAGTGCTGGGAGCGTTTTTGCCAGGGAACTGGGTAAGCAAATTCCAGTGGCGATGACCCTGCGAAGGATTGTGATCGGTTCGGGCTTAGGCTCCAGTGCCTGTTCGGTGGTCGCGGCGCTGATGGCGATGAATGAACACTGCGGCAAGCCGCTTAATGACACTCGTTTGCTGGCTTTGATGGGCGAGCTGGAAGGCCGTATTGTTCCGGGCAGGCATCATCGAGCAACGTGGCACCGTGTTTTCTCGGTGGTATGCAGTTGATGATCGAAGAAAACGACGTCATCGGCCCAGGCAGTGCAAGGGTTTGATGAGTGGCTGTGGGTGCTGGCGTATCCGGGGATTAAAGTCTCAACGGCAGAAGCGAGGGCTATTTTGCCGGCGCAGTATCGCCGCCAGATTGCATTGCGCACGGGCGACATCTGGCAGGCTTCATTCACGCCTGCTATTCCCCGTGCGGCCTGAACTCACGAAGCTGATGAAAGATGTTATCGCTGAACCCTACCGTGAACGGTTACTGCCAGGCTTCCGGCAGGCGCGGCAGGCGGTCGCGGAAATCGGCGCGGTCAGAGAGCGGTATCTCCGGCTCCGGCCCGACCTTGTTCGCTCTGTGTGACAAGCCGGAAACCGCCCAGCGTTGCCGACTGGTCGTGGAGCCCCTGCAAAATCCAGGAAGGTTTATTCATATTTGCCGACTGGATGCGGCGGGCGCACGAGTACTGGAAAACTCAATGAAACTCTACAATCTGAAAATCACGACGAGCAGGTCAGCTTTGCGCAAGCCGTAACCCAGGGGTTGGGCAAAAATCGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTGACTGAAATTGATGAGATGCTGAAGCTGGATTTTGTCACCAGTGCGAAAATCCTCTCGGCGTTTATTGGTAATTGAAATCCACAGGAAATCCTGGAAGAGCGCGTGCGCGCGGCGTTTGCCTTCCCGGCTCCGGTCGCCAATGTTGAAAGCAATGTCGGTTGTCTGGAATTGTTCCACGGCCAAGCGCTGGCATTTAAAGATTTCGGCGGTCGCTTTATGGCACAAATGCTGACCCATATTGCGGGTGATAAGCCAGGTGACCATTGACCGCGACCTCCGGTGATACCGGAGCGGCAGTGGCTCATGCTTTCTACGGTTTACCGAATGTGAAAGTGGTTATCCTCTATCCACGAGGCAAAATCAGTCCACTGCAAGAAAAACTGTTCTGTACATTGGGCGGCAATATCGAAACTGTTGCCATCGACGGCGATTTCGATGCCTGTCAGGCGCTGGTGAAGCAGCTGATGATGAAGAACTGAAAGTGGCGCTAGGGTTAAACTCGGCTAACTCGATTAACGTGAGCCGTTTGCTGGCGCAGATTTGCTACTACTTTGAAGCTGTTGCGCAGCTGCCGCAGAGACGCGCAACCAGCTGGTGTTATCAGGGGTGCGAGCCGAAACTTCGGCGATTTGACGGCGGGTCTGCTGGCGAAGTCACTCGGTCTGCCGGTGAAACGTTTTATTGCTGCGACCAACGTGAACGATACCGTGCCACGTTTCCTGCACGACGGTCAGTGGTAACCAAAGCGACTAGGCGACGTTATCCAACGCGATGGACGTGAGTCAGCCGAACAGGTGGCCATGGAAGAGTTGTTCCCGCCGCAAAATCTGGCAACTGAAAGAGCTGGGTTATGCAGCCGTCGATGATGAAACCACGCAACAGACGATGCGTGAGTTAAAAGAACTGGGCTACGCTTCAGAGCCGCACGCTGCCGTAGCTTATCGTGCGCTGCGTGATCAGTTGAATCCAGGCGAATATGGCTTGTTCCTCGGCACCGCGCATCCGGCGAAATTTAAGAGAGCGTGGAAGCGATTCTCGGTGAAATGCTGGATCTGCCAAAAGAGCTGGCAGAACGTGCTGATTTACCCTTGTTACATAATCTGCCCGCCGATTTTGCTGCGTTGCGTAAATTGATGATGAATCATTCGGTCGATATTCATTGTCAATCAGGCCGGGTTTGCTTTTATGCAGCCCGGCTTTTTTATGAAGAAATTATGGAGAAAAATGACAGGGAAAAAGGAGAAATTCTCAATAAATGCGGTAACTTAGAGATTCAGATTGCGGAGAAATAACAACCGCCGTTCTCATCGAGTAATCTCGGATATCGACCCATAACGGGCAATGATAAAAGGAGTAACCTGTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCATGGTTCTGGTCGCTCCCATGGCAGCACAGGCTGCGGAAATTACGTCATCCCGTCAGTGAAATTACAGAAGAATCAGGCGATCGTGATAATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACGACGTTGTGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCATCAAGAAGCTGATTCATGATCGTCACAGGCGGTCATGGTCCAGAGCAAACATCACCGTTAAATGACAAATGCCGGGTGACGATGCGGCATTACAGCGCCTGATGCGACGCTGGCGCGTCTTATCAGGCCTACGTTAATTATCCGATATTGAATCTGCATGCTTTGTAGGCAGGATAAGGCGTTCACGCCGCATCCGGCATTGAGCTGCAAAATTAACGCTATCCGTAGCGTTTAAACACAGTTCGCCATTGCTGGAGGAATCTTCATCAAAGAAGTAACCTTCGCTGTAAAAACCCGTCAGTTGCTCTGGTTTGGTCAGCCGATTTTCAATGATGCGAAACGACTTCGAGTCAGACCGCGTGCTTTTCTTAGCGTAGAAGCTGATGATCTTCAATTTGCCGTTCTTCTCATCGAGGAACACGGCTTAATAATCGGCATTCAATTTCTTCGGCTTCACCGATTTAAAAATATCATCTGACGCCAGATTAAATCACCAGCATCGCCTTGTGCTGCGAGCGCCTCGTTCAGCTTGTTGGTGATGATATCTCCCAGAATTGATACAGATCTTTCCCTCGGGCGTCTTGAGACCGGATCCCCATTTCCAGACGATAAGGCTGCATTAAATGAGCGGGCGGAGTACGCCATACAAGCCGGAAAGCATTCGCAAATGCTGTTGGGCAAAATCGAAATCGTCTTCGCTGAAGGTTTCGGCCTGCAAGCGTGTCGAGCATCACCTTTAAACGCCAGAATCGCCTGGCGGGCATTCGCCAAGCGTGAAAATCTGGCTGCCAGTCATGAAAGCGAGCGGCGTTGATACCCGCAGTTTGTCGCTGATGCGCATCAGCGTGTAATCTGCGGAGGCGTCAGTTCCGCGCCTCATGGATGAGCTGCTGGGAATTTATAACGCAGCTTTTCCGGCAGCGTATAGCGCGTGATAAGAGTCAACGGGCTTTGGTAATCAAGCGTTTTCGCAGGTGAAATAAAGATGCGGCATATTGCCAGTCCTTTCCAGGAAATTTATGCCGACTTTAGCAAAAAATCGAGAATGAGTTGATCGGATCATATTGAGTTCTCCTGCGAAACATCATCCCACGCGTCCGGAGAAAGCTGGCGACCAATATCCGGATAACGCAATGGATAGAACACCGGGCGCACGCCGAGTTTACGCTGGCGTAGATAATCACTGGCAATGGTATGAACCACAGGCGAGAGCAGTAAAATGGCGGTCAAATTGGTAATAGCCATGCAGGCCATTATGATATCTGCCAGTTGCCACATCAGGCGGAAGGCTTAGCAAGGTGCCGCCGATGACCGTTGCGAAGGTGCAGATCCGCAAACACCAGATCGCTTTAGGGTTGTTCAGGCGTAAAAGAAGAGATTGTTTTCGGCATAAATGTAGTTGGCAACGATGGAGCTAAAGGCAAACAGAATGACCACAAGGGTAACAACTCACACCCAGGAACCCGTTGCGCGCCCGCATCGCCTTCTGGATAGGCTGAATACCTTCCAGCGGCATGTAGGTTGTGGTTACCCGCCAGTAATATCAGCATGGCGCTTGCCGTACAGATAACGCCAAGATATCGATAAAAATGCCAATCATCTGGACAATCCCTTGCGCTGCCGGATGCGAAAGGCGCGAACGCCGCTCCGCTGCCGCGTTTGGCGTCGAACCCATTCCCGCCTCATTGAAAGACATACCTGCGCTGAAAACCGTTCGTAATCGCCTGGCTTAAGGTATATCCCGCCGCGCCGCCTGCCGCTTCCTGCCAGCCAAAAGCACTCTCAAAAATAGACCAAATGACGTGGGAAAGTTGCCCAATATTCATTACGCGAATTACAGGCTGGTCAGTACCCAGATTATCGCCATCAACGGGACAAACCCCTGCATGAGCCGGGCGACGCCATGAAAACCGCGAGTGATTGCCAGCCGAGTAAAGACAGCGAGAATAATGCCTGTCACCAGCGGGGAAAGATCAAAGAAAAAC L=7BKR68DG=6@<@@@<E9:8./6;(55:*&'/:25&&&:;@35@A@<:KGEFBDCBBB58I><AAB>5*085@DRCEA?9FB*%.2112.44$52''(*0341&))'.8:=EC>@0=<<GDA:@?A=;:?J8,79<+2A:($"'1#,&*#%#&)2)%2/$-%'''",'(?$#'"'+&2(2$&$(3##%+*($.8$%$(#-&1$$'$.%&,$&%)*$)$*#$(('$+&++%''#&(".%$##)$()&'"$+*,*$(#%'%',39<DH?C@<9FDF>;79888)+9<>>CAA6-)45>C)HK6<;CJCD=@G<??;/3.B7:A>BBA:@IE9DB8,/%&,(&$$$.#%#$#%+'(*600=8F(5D0&5:9>7<7GC9(/)<1&78=G6'6(*-+<&'1.=1%24B9E62<@<>8<(81./99<+((#';1)+*$(($$+*')(*:)$.121//+.11-*)&97;FMQONMBB8B@.#(107'',-($258*-5@:<<44?C999HDMGAD9=B77?FEF633=<9H9EIIO=08-'($*22)2;7,#$%'&/)$%%$%(+,$&,)'$.,0'0:+,$.-#())%%-&$#'($*46&.&,*%''#%%'(+4,&6ABAH1HIID7<85//68+?319:AF0=<=C5')&(&'$(((,,0:1563C@GCFHDDE:B6>@D@.())((()&&')$/=959HAAHAA42*9*.<-.576)$%.544$2%=@@=5.<25<0./+:?@B@$$$(673=8AD8=AD35A<>7B@>.3%*778C8=*23=>::-*(-0+56;0))%?GCFF<C;:,.--<E?7E>8C8;:86('->?A<=?==IFE?>6:A@<;%%'$$546?B@ADCMFAD8=8EB48@DC<G=C@AA:DD@98B*=9AA=A>@DFMK7CAA?=A>@DH?FBDD@DAKM2>@>B:7=;/:7:;@7@>C6==8CA>CC5@B<A41C<CJ@:@DNNAKE&-02227<6C7<?B<@EB?C5<=B68/0///>=?,(22:5?4('(74@>C07.)97+5%,FB7A><;;*6)&&''4456.89'3,$5;:B6/.753/''#%8;)-6;&(?+'36/33&&%'1.046?CC?:@?@AEGLF?7;=@+((0:9G<38<<,5#'(*7875=72:89<FOC77;55=7(<CDPNMN;<=5*87A.GC>BB@DBE>=CAK@=8401;**?3,0-;=1;<;6/1,/A?2;-C>830/..140;7<;.3I4AFLCDIFJ>DKDAABANJA::?FCEN?)**'2(*$%+%(2ACJH=?<77;<AB63788:+).)(/42.1$/1(+A6>77@@57+-8)=,&-2..1%-0:;C>C??>ICC7?7>&*31666><589><C?9*=1++2A<;BFABA<BGN?<GA69KG??BAA?7@:A@AH=CACJ?9EHEE:;C@D**15:%'%&&$'-;7;=NICFECEPINL>GDFE+DAJED;AHMEBFCBF<>JE;9:KFG-?;@::24'89)%&'#*53315,,12(&%%.3,,,87G;>7),05)*)(=7;:1?A9:GJNGE.5B&==:%6*:B8?>=5F@BN@;@EF:825D6;57@/?GHCK=357.311,)*,,17HAB96<:69A4@8;;:3>EB>;%+&*)+&%GCHHEI=BD8?DA@NLDICC@>A<IC50='((,*?ED9?=5C;,%''"&(+77..4>@A8::+FB?BH8@,<DEPHLC?>D3A110-*=:;==:>AB??FDAHDDO@J;D>?CCAHR=?C>EB>BIJM>FBRCF@JEV-%($$$<'.1''.57...;<9E??@DB@@=@ANHDDHHG>;5.<B;;22=>?7:KFBJ>?7F@EBPJPCB86:8;CD;?ONC@>IAC9)8>AG>ECAB<@AAIII6<78HG:87B==?P766;051BEEF59G?=>?<IBE:90<@BIVMDHIDFEIGAIBEDCHWJFE7@DCAGIE@441):@<<$;C@&&()$29+,)-&(2/0(+++688;;:2@4:44'((+=AA65NJ>@JN@CKLC4E:@>8<A5<8&(*0<>SOI?8ANOIK>9<>;GEAENIME669<B?15+**)$))+,+(35:6676A:894>;9=77'1(+*%+'()',,347AABCC;?F:=C;D?B;?B?EKBFVL3;6@CEA4==23,)<EBEFLFFGC8>BECB67:FED@>=BC?E:74<77=@=>''./,,#*//8'(*08B5/324784;<DED<<>7530.$'$&$),(().*&%'03:>286(--%(%&(,0:>8?>>9ELBA7AFG??<87B=EEKHD>?@7<@CB0747678;8;926:=8769<DA=C>((8.3@>FB<HGGJGFMMKNMDA<>?G?A?C@6FFG9?8C88:C>?HNE:CEEJAB?CIFF?LBB9DCM>C(*;;;ABNHIFC:N:>;>6@<'(/,645>:7<75B?15H8BBEA005>BABBEB1BDG?DIB46'&%(($$/2,*-&&*%6*-58*&*0'/3+?A-1:<>EAA<9?DE98GH:B088H@7@<:,F?@<JCB?>FG>BHFA?FD?55/39349,++/=?@SIGNFLLD>=CDKL<9:++*9=7/11')*$81*1*+121$%$$#$'0?<@B>JMIDNL?B@<G9:=;>>:=@A222/%&164EBHAB<91%"($));:554008:;:7<?=6).7:;B?=@?BBCD779=86)%*42==?AB56?@B;*.0+4,*680/0=<?69-,1$%064=<8;<49;0;7,2776%#112;2=.1'AEBK:::@GJFAHMAM?126013++338A?7@?;@=AC?:<;B?25</''%%+&,,&)*8JL9(322.)6;6:;<@75B?B@>CBI87FFAEFNLFL;<=M3=;51>645:44110%)+.&42<-/,<@CAA<5>B??:G*;6,(-3+../3D89I>@2785G1C>H7,1'*:;4=>2B<CBC?CK?B@4016)+68,%$#$'$"#%$%$4>>??980:532566*%')+%'6.155:)+)*-.9B@AE68:8:?:FA;=7446-)$#$$).13-33)(4%,*,EB<<145D344FCEEA<FIE@BE@@IGG=:21138<988C=<A>JGEK=49CDDBBHJOCB;A>BHH<?>?D?AD;DAADBH=G9<<FC@IHCE@Q@BFE@DIFOQF?>B<CCEA@?-:%+%%**%&%(50#%8;&$(*$%%&&%('(213G/A@@F:A;>?@@FA,@B>B>>B>?69111:68<31122-<9A?2LHF55>>55:<8<'3)4)'/,4$&638/055@@P@?78800@DKA@A;;<;<FBBMLB=9AE<;>@;5,>/)+<22)+-%*/3/)-'>83,%&/.;=<51=3*&@;7A@B@<?<66:6;93))%.0122@+.//;7C279;@658044$8+C@2A>596+876;:9=E@<:<:3<?>@@27:+1+=:DA1*%%)$"&%6;+7>DCBBI43CFBA@=JIAJG@IQGB>AA=?8>@?F=6;?FH3<<*=&%7>:@>CA8>CCA>C70-8??A=:A??=0624:?>3.*9520'(("%#"%$..069+)'35=1>==7?<G6@2?BJDLD>NIHB@>CDHDC@FCFDEDH<::MSPC7;CJ@569%;99@@=D?=5,2$&)%&%$#$?==?ABBHIC@-$%%:92+5=.$&6(%%*>C<;72/(/79375.<+:CAE?D=@898867413%'%1(()2#(&((9>A61$"'').;/9,28',./*(&DDDE?@EH@@=AE@HIB:@G;PN<C=FBEBJG?;ED:=CCDCCJ?@G75.())@FD?AFXLOL:FLDN?AA<=CF8A/0B=9:77;G<;69:9*%/479<:9:JAEDHFC>BD<LGN=JFJIE==3)4&$%2-('&?4:B69:<?BGE@CCI8////15%&$&.)797892?%++7-16/469:FE.'(9>:@;9B@C=2>AEDING<=@DB=DC?F98>@:CAB/(.+0>A;;?NIICG&(>D>C=@CKLB@BGCKK<,BJKCA@->A8,4%%55-43;<8<>>.6IAD=NG>GH<JA>FD@JJGJFLF:?GUSK?A<=;0,3CB=IPCO>>?;:;A835&*'$%6<6?8=8?0,C@?B659;=>//+4/..3'&012C<88HC=A;<8+(+AAF?GK>AA?9@@7C6,%*'-+53&''7.41//E<1106:+-/0090/3??F?67HED;37:@<85?5@00&&),:2?=,BFC@=<>D,12/4@76@969;9:*8:ED<BDH:E>223.8?=AABD<DC?HBJCB=.ACAC??JGE=?99>96B/4$&&,9HINBEHRNFPJ@APNNK;>@>@>=;6@A?>CEDE/)<GCB=AA@=;)AB?0/9$%$-4=>>AKIDEBE=A@IJGIJLD@8@CA?B;>A6HOM629=B?.)->:>;=9.-6;6:8*8:9==:82FE;>CBC:B?57,,.%'(76:464=8&'398)'-058C<=B?D**(<?G@CED@CCHA?L=BDOEJ?MB<;BCCB;=F?7AG<AIMFACCB=4*.68?<A?AA<LGFH@CCFA0/F@6/1>A7*,4@>GWQ?A@EBG:B==F?>@A<AA17'(>@2IM:99-:7$??76H9D-C>A@?ID@<?GFBC:CB22%#%3./48+#++($(%500-'$316EF+&)/,&>9/F63>AHKIID67><?<15?JCRBPM6?>DA978@E1&&)%39+%%,1AFABBDLUI=?;A=?30C*73&%/**;55)*-=;+-CA6E:C@2G>B>9<-.0//#./*<6,>?A;ACH4:<<21@<E2/.789A:=@;:&'&5.*,0$',.0>9)*6<89D::7;;;DE<7EEC44:>GC>B;7<:C4333@G;8;LIF><7<''(=AA:DFJKHA::CB@-95$';.13+(*"*('87199HB:KKEBC;DCJP@C.EAFHJ5:9+3/1-./><<<,54%)###,1(/7/22/$1+86=.%'0956;HEA;A?98>9?30-,)$$'(*&%$&%+*+#$('$&%%%))(,,&);/)6;$/365;B:6)52>)C?4H1EGE>ACB@C?5;14BJER?KDH>9@BHAF;(-EEDCDC?*+--%9?CB??EFKCEOL<<7AEAB@J=BA;:83+*:5%)0.7<;&%4))*-3#%)*$;://1>./35.&'&+-*'*6-.%+1-/03&/(6.@C>313<@?4211<?@CCH8:D=@/,,F=33.8=787;A7,)):,*=ACCH??@C8<=G8??<99=HB9?:F)99#*,4@@9<::;<9=<:??645@?*8*)@1/8=<0-136/8./)/).%##%>)'&('*1%''.66*8(##%'(*24*0@=@KG)A?EHAAA77;ANI><888A@@4;FG>A@I=@;B<@<?;;4519<@?:@=@BJEDBCK8621B;4G89@?58=<@<<@6=C02@?@E?69=7@34;D::>AG8D>A:6665AA?=63>3<=+;4,($%#(%8665&'')+&)),,,%%0$3&><67=;2,(()7/(%(%B2()8*1193A,6+.<>B<D@F,,/%/6=9:99-<43-,+,30;368%@3CGDC@EBE@>5<;=3@@GED@D?@:FEE@?6@:<07,3021'(;;$$(6'=9,*')77EEDEI9?B8;F1*/2=0++0=:../5?:9/2@C=@=A@>BA=(9;@342+)(25&%'1&$,2+8:477**/<DC@=BEFN=46;-$6#6:<?><>0=??7=814342693GA79907EE1>?>:;?;@9;6'25)0(*>))991>:;7BFIB349:@B=@=:<=@=@+64;:9=.0()'&((&#%&B?(.36$'$#$##*,-3A@,:4>;>;E?%-)+.5?4577A@5:;.()(*.'23%-')*(-@C;FEGNKB@;HB@=C::8<C*08;=.1$(-,-2&'8<&(*/,7/5()549E?;C=?=C>>DFMA//1274($+@@;@=?G=>F7;436)))'7%;@50A,-,=BDDA9<?=A@@AGF>E699:F@=8<A8B:9?7<9@AFIHEF@2,1+;;&'&*/10:;%'#$,,,8D11=<;/*+7:$4'%(#$$%'3+(/1-0A1,<='36.0;1<;977BG=:/B0?AAA@+&=<<?A?F?BB=<4E:CG=>6A?>7-@@6IIL<J0E>0BEKKM=@???B?'):8'(@GE<FC@FE@??=TQAA??>AI:??D(0*/115%%#-05(0+1$/0/K9960BB:J4-<E;FDJC@B@880+3E??7%++%$6-'-%.(*(*%'&/A669;AIKC>DDNC::>>AADCEA42MG?=>C6>=GFG;:C8(A&65C-+-?*@?:0,,19;%%+6=@;?2779.)--@<<E8C:8881:4A1E?AE<:;9:8:=-''0'?EF@>2;=278:50+'(*#)(-1//+$5<.ABDEFGDFHGAFCC=?:::--'-12?CB@47<AHD>A@?)*)+<A>==BRG+?;E7A=EAIHH31/,$#$$'%/)##%'*/%'(510>4E**/(><=?A=FF6?B><M@BDL?=@>%%C4/4%%0*2#%++&*&(#-($##'%##$"%'&'%5%&+5*;-,().<3()(''&:=7''6562,0189DCG>A?C?=>B?1%*1.,((&&/4135'$&(**(/)&#.&7)3=795C??AG@50.21:EF=>BC@ICJCB?=CKML>E7>:*)4./>B??.//=-5>3>9BF==ECC<G93;34H2CEC??:9;:DD;BD8821?<:JK<:7;0F@G;6F=--6>>343>DA499C?56**5///,.E7;=JK:CAEC0;>9;A::=85'<77867??:)+H/.-5F89?43'*377,=C35B6:D;=;828:>=L:8<>:79<MQ>CFGH@:CEEC=?FBLBIICDBM:'+9BEGHHJKBF>::=;@;@EG/.;67.%&%*$%13::''))$)%($$439&@+-:<53(,:+),:*&:89?5##),.0-4&,,$%&,(&/.*2<;;@:=C@CCG=A;),53?2',--12%(4>=:;:B<<AEC78;@?7=)*683(669<CA<<LOG??GFACJCCDFE;ECA?==<+.,4.$##"$'+'$0$%%%%2>=--GD>BM=G8;A@007?KIH0&'H86?8:?CD=:5A>@AIB=(.&/%&,,$$$$(2$'..35&?:7<33@;>=BI;'*#''/'4)@?;1@45@;&501$,133)50%*6--+'56.+@4>APSTN>BHGC'99@AC>:77ACE;8?6C8???CAB=G@24@EEC;@6@@@HKF@:K<5<EE<BE63682B>DB$=<9>;/AGFIGA><FE:AA:B85$''-+-461,..%<+0654-0>9GE=C4&#&&&&$)%5@CAG@;43//<CK?,8=>@BBH7?D;:.14?6<?EDBCE65;BC8B.2388G=:;??>BA5=;2/0/D%-3=(("%)&%&;$$*5389;@8>856A@?A8C7:9?;;77<AJ@?<:8C6DE??9'(;=198/=7/%::3%,25&% NM:i:464 ms:i:12072 AS:i:12072 nn:i:0 tp:A:P cm:i:800 s1:i:5252 s2:i:0 de:f:0.0525 SA:Z:U00096.3,4624298,+,70S17203M152D7379S,60,1241; rl:i:137
By providing -c
we can tell samtools view
to count the number of records in the BAM file.
❓ Does this tally with the number of reads in our fastq file? If not, why not? Consider what we have learned about primary, secondary and supplementary alignments
# count the alignments in a file
!samtools view -bc sorted.alignment.bam
65646
As you can see from the output of flagstat
below, we have a few thousand secondary and supplementary alignments each, and a mapping rate of 93.98%
. The remainder, i.e. the total number of records minus the mapped reads, 65646 - 61694
, should give us the number of unmapped reads.
!samtools flagstat sorted.alignment.bam
65646 + 0 in total (QC-passed reads + QC-failed reads) 6661 + 0 secondary 6924 + 0 supplementary 0 + 0 duplicates 61694 + 0 mapped (93.98% : N/A) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (N/A : N/A) 0 + 0 with itself and mate mapped 0 + 0 singletons (N/A : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
In this example, we are filtering only the primary alignments by addition of two FLAGs, 0x100
and 0x800
, as an argument to -F
, which excludes any alignments matching the FLAG provided.
!samtools view -bF 0x900 -q 1 sorted.alignment.bam > primaries.sorted.alignment.bam
This time we're using -f
to match the flag we provide, giving us any unmapped reads.
❓ Does our calculation from
flagstat
agree with the number of reads output by this step?
!samtools view -bf 0x4 sorted.alignment.bam > unmapped.sorted.alignment.bam
What if we want to convert our unmapped reads back to fastq, perhaps so that we can try aligning them to something else? samtools fastq
provides a convenient method. By default it will only output primaries and unmapped, i.e. the unique sequences belonging to the underlying reads.
!samtools fastq unmapped.sorted.alignment.bam > unmapped.sorted.alignment.fastq
[M::bam2fq_mainloop] discarded 0 singletons [M::bam2fq_mainloop] processed 3952 reads
And what if, hypothetically, we wanted to stick our unmapped and primary-only BAM files back together? We could use samtools merge
.
This is a fairly unrealistic use-case, but more common might be combining samples from two different sequencing runs.
!samtools merge merged.bam unmapped.sorted.alignment.bam primaries.sorted.alignment.bam
Now, what if we're only interested in a certain region of the genome, e.g. positions 50000 to 150000 in the E. coli reference? We can yet again use samtools view
. We just need to specify our region as: chromosome:start-end
.
!samtools view sorted.alignment.bam "U00096.3:50000-150000" > region_extracted.sam
!tail -n 1 region_extracted.sam
de2f3d0e-b02a-4190-86e7-abb781b10171 0 U00096.3 149775 60 77S33M1I35M1I14M1I7M1I4M1D38M1I16M1D2M1D10M3D12M1I65M5I9M2D6M1D22M2D7M1I35M1I11M1D10M1D9M2I38M1D7M1I23M3I2M1D34M1D6M1I10M1I3M4D8M1D20M1I4M1I6M1I36M1D3M1I12M1I6M2D28M1I7M1I7M1D29M2D10M1D4M1I9M1I10M1D4M1D7M2I6M1I4M1D4M2I12M2D1M1D15M2I1M1I8M1I14M1D29M2D4M1I8M1I4M1D12M1I14M1D33M1D2M1D28M1I42M1D4M1D5M1D17M1D11M4I38M1I21M1I15M1I1M1I4M1I26M1I20M1I7M3D3M1D9M2I6M1D16M1I6M1I40M1D12M1I8M1D21M1D27M1D18M1I16M2D3M1D7M1D8M1I12M2I15M1D19M1D19M1D22M1D49M1D14M1I8M1I33M2I20M3I3M3I3M1D8M1D2M1D13M1D26M1D21M1D17M2D4M1I43M3D10M2D17M1I46M1I5M1D18M2I4M1D6M2D19M1D3M1I14M1I3M1I4M1I11M1I4M2I7M1D8M1D1M1D1M1D4M1D5M2D6M6D3M1D4M1D4M1I1M2D30M1D40M1I9M1I8M1I52M1D11M2D12M2D3M4D17M1D21M2I11M2D3M1D15M1D5M1I7M2D2M1D13M2D11M1I1M1I3M3I42M2I11M3I7M1I6M1D10M1I8M2I2M1D8M2D21M1I21M1I2M1D25M1I61M3D28M1D3M1D18M1D19M1I12M1D4M2D11M1D6M1D11M1I23M3D34M2D46M1I2M1D6M2D4M1D13M1D7M1D46M1I7M1I25M1I3M1I57M2D30M1D50M1D8M1I30M2D3M1I2M1D13M2D8M1D42M1D26M1D2M1I18M1I27M1D8M1D5M1D22M1D41M1D13M1I2M1I27M1I14M1D36M1I7M1I61M1I4M2D13M1I6M1D12M1D1M1D23M1I3M2D24M1D18M2D4M4D4M1D17M1D37M2I42M1I11M1D6M1I5M1D27M1I22M1D22M2I2M2I17M1I75M3D12M1D1M1D3M3D22M1D12M1D2M1D3M1D31M2D17M1I38M1I34M2D20M1I7M1D19M5D4M1D9M1I8M1D3M2D3M1D21M4D10M1I8M2I7M1I32M1I19M1D3M1I3M1D2M1D6M1D40M1D8M1I44M1D7M1D11M2D14M1D47M1I8M1I27M1D5M2D41M2I8M3I9M1D20M1D1M2I5M1I30M1D13M1D8M1D26M1I19M1I17M3D9M1D6M3D3M1I29M1D17M1D36M1D5M1D5M1I3M1I32M2D2M1I46M1I10M1I9M1I3M2I4M1I41M1I25M1I11M1I2M1D19M1D36M1D5M1D11M1I25M1D44M1D2M1I5M1I30M1I25M2D25M1D26M4D7M2I49M1D2M1I23M1D7M1I9M3I7M1I3M1D5M2D5M1D9M1D31M1I12M1D4M4D6M1D46M1D13M1D33M4D2M1D6M1I8M1D9M1I14M1D30M1D6M2D9M2I48M2I11M2I1M1D18M3I5M1I51M1I1M2I25M1D8M1I16M1D7M2D29M1I27M1D24M1D15M3D11M1D46M1D30M1D1M1D6M1I7M2D15M1D19M2D3M1D12M2I2M1I23M1I15M2D2M1I17M1D5M2D54M1D34M2I7M1I19M1D2M1D17M1D5M2D59M1D29M1I51M1I16M3D6M1I5M1I4M1I31M1D18M1D12M1I27M1I21M1I23M1I4M1D32M1I22M1D33M1I2M1I2M2I12M1D3M1I7M1D60M1D1M3I15M1D7M1I4M1D27M6I5M2I16M1D6M1I1M2D6M1D3M1D4M1D6M1D1M2D8M1I15M1D9M1D3M2D20M2I6M1D3M1D2M1D12M2I5M1D5M1D1M1D7M2I7M1I6M1I19M1I3M1D16M1D3M1I8M1D16M3D6M1D4M1I15M1I8M1D1M1D9M1D10M1I11M1D3M1I6M1D22M1D28M2I6M1I9M1D14M1D5M1I12M3D7M1I2M2D11M2I3M3I4M1D11M3I3M1D12M1I6M1D6M1I3M2D21M1I19M1D6M1D4M3I1M1I6M1D14M1I4M1D6M7D3M3D4M1D10M2D7M5I11M3D8M1I11M1D6M1D24M2D2M1D12M2I3M1I4M2D12M1D8M1D27M2D5M1I5M1D7M1I7M1I4M1D3M2D1M1I17M2I2M5I10M3D6M2D4M1I11M2D2M2I2M1I13M1I5M1D6M4I3M1D8M2D7M3D2M1D15M7D2M1D5M1D9M6D10M1D4M1D6M1D13M1D28M1D10M1I18M1D3M1D2M2D5M1D10M3D5M1I6M1D6M1I11M1D4M1I11M1D15M2I10M1D8M1I7M1I10M3I8M3I2M1D25M1D7M1I5M2D13M3D9M2D12M2D7M1D2M1D7M1D22M1I23M2I6M3I4M1D5M1I8M1D38M1I18M1I14M2D2M1I7M1I10M2D124M1I36M1I106M1I9M1I15M1D32M3D53M1D13M2I5M1D18M1D6M1D8M1D4M1I6M1I3M2D19M1I16M1D46M3I1M1D8M1I75M1I22M2I1M1D32M1I6M1D16M1I10M1D65M1D9M1D21M1D10M2I9M1I4M2D10M1D4M2I6M2I14M4I8M2D28M2D10M1D9M1I3M1I3M1I14M1D10M3D23M1D6M2D54M1D3M1I10M1I12M1D23M1I5M1I2M1D23M1D16M2I5M1I3M1I35M1D11M1D10M1I14M1D18M1D8M1I48M2D12M1I25M1I34M1D7M1I9M2I9M2I9M1D23M1D19M2D5M1D25M1D17M1I10M1D26M1D38M1I1M1I25M1I7M1I3M1D29M1I7M1I20M2D9M2D37M1D22M54S * 0 0 TTAAATATTATTTGGTTCAGTTTTACATATTGCTAAGGTTAAAACGACGACTAAACGGAATCGACAGCACCTAGAAGATTTCCATCTTCTTTTAATGTTGCCTGGAAATGAAAGAGGCTGTGATGTGCCGTTACCTTGATTAGGGTAAAACCCCATCGGAGAGTGTGGTGCTGCGTTTCATAATCTATATAAACAGAATTGGTGTGCGTAGGTTTTTAATGTCATTAGCGATTTTGCTGGTTAAACCTTCAATGGGGTACGCCAACGCCTTTGGCGGCAGTGCTTGCAGTAAGCGTGTTACCAAGCAGTTATGTTTTTTCTCACGAGTCCTTATTTCCATGACGAGTTTTGTTTCAATATTTCACACGCAATACAATTCTGAAGCGAGATATCAAAAAGAACAGAGTGAAGCGCCATTTTAATTGTCCAGAGCTATGTATTCTCCCATTCTAACCGTTGAACCATTGACTGATGGCCAGTAACGTATGGAGGTAAAACATGTTGGTCGCCATTTAAATTGACATTATTCAGTGTGGAATTGACATGATGTGATTTGCTGTGATTCTCGTTTGTACAGGACTGTAGGCCTTAAACAGAATGAGAGGTCAATTTGATGAACTGGATAACTGGACTTGCTGATTCATGTCAGGATCAAAATTCATTATCTATAAAATTCAACAGTTATGTTGTGTACTTATACCAGCCAATCGCCAAAACTTATCATCATCATCTGCTTTGTAAACCCTTAGTTGTCAAATCACTGTTCTGTAACAGAAAGAAAATTCTTTTCGTTCAGAAGATCTTTAGGATATGGGTCAGCAATTGAATGTCAAGGATTATTCTATTGGTATCCAGACCAACCCTTGATATTAACATAGTCTAATACAGCAGGAAACAGATGTTATTGATGGTTTATGTGCCGCCATAATCTTTCCTGAATAGACCATGGCATTTTAAATGGTTCTTGACAGGATTCGCTCTGCCAGATGTATTGCAGGTAAAGAATACCACCACCGGTACCTGTATAAGGGTATGACCATACATTTTACCCAGTCAGGCCAAGCTAGTAGGTTTATGCACAAAAAACCAAGTTGCTTGCTGTTGATGTGTAATTAAAGATTCATCTGTGCAGCAGTGAAGAGGGCCTGGATAGTTACTGCAACTTCCTGCATTATTTCCCGCAATAAAACTTTGTATATGCACTGACTGTATTTACAACAATTAGTGCTAAAGAAAAAGATGTACCTGAAGATAGTTGTTCATTTTTTACTCCAATGTTTCTTTATTGTTATGGTTTGCGCTGTTTTGTTTGTTATTTCGTAGTAATATCCACTAACACATGGCTAGTAGTTGACCAGATGCAACACTAAATCAGGATCGATTTTTTGCATTTTGGGTGCTAAATTTAGGTGAACCATCATGTCATTCAGACATTACTCCTGATAAACGGAACGAATCCCGCTGAAGCATTAAGACATTGAATGTATTCGCCGATCCTGAGTAGAAAATATCGCTCTCCTACACGTCTTGCTCCTGACCGAATTTCATTAGCAAATATTTGCCTGAGCCAGCAGCCATCAACACCGCTTAAGGGCTCTAAAATCAATATGTAACTGATTTACGGCGCATCGCCAGTTCCCTGTAGCAATCCAGCAAGTCGGATCATGATCATGAACTCTTTCCACCTGGTTGATAACTTCCAAGGTAACATTACCTGGAGAATATCCACCCCAACGACGCCAATCAGATATTACCCTGGTTACTGATTCCGCTTTTGCAGGTCCTGTTTTCACTGGCAACGGTCAATATCCTGTCCAGGCGATGGCTGGACAGGCGATGCAGAACGGTAATGACAAAAACAGAATTTATTTCTTCATCAGCTTACGTTAAGAATCGATCCATATCGTTCCTGAATCTGTAGTACTGTTCATTATCGGTTTACCTGTAAAATGTGGCTGATTGCCAAGAAGACTACCCGCATAATTGTTACGTTTACTCGCAGCAGCGTAGCTCATCGGATAATAAACGACCGCATTGGCTCAGGTGAAATTGTTACCTCTTGTGATGCTGGTGTTGCACAGGCTTGATTGCTGTTGCACTTCCCTGTTGCAAGAGTTACAGTGCTCCAAGCTTCAACCACAACAGCACTTGCTTATCTGCGGCTGGAACCATTTGAACCACCGTCATTAGCAGTTCCTCGCATGTGGCTCGCTTGGTCTTCATTTTCTTCGCTTTTTATTCGGGATACCGCACCAGTCTTAATTGAGTTTGAATGTTTACTTTGGTACTTTGTTAGCACTGATTTCGGGAAGATTGTCATCACCAAAATCAACAACACGGAAATGTCCTGAATATCAAAGTTCATTTCTAATGTAGCAGTTACAAGTTTACCATTTCTAGTAGTAATAACGGTCAGTCCAAATACTGTCTGTTGCCGCATAGAGGAGTACGCTGAACAATGAGCAAGCCAAAAACGACACTGCTGGATAATCCATAACGTCAATTTTTAAATGTCATCATAATTATCCTTCTTGATAGATGAAATTAAACGTCGCCAATCACAAAATTCCCGGTTACGCCTGCCCTGCATCTGTTTCACATTAGCACAACAGCATTAAAACCCTTCTGGGCTGGTCTGGTCTATATCTCCAGCGTATCTTTTCCGCACTAGGTTTAAGGAAAATCGCATCATCAGTAGTCCGTTTTGAAACCCATACCGATATTACTTGTCGTCGAAGATGGATCACCGAGCTGCGGTAATAGCTTTAATGAGCTGATGATCATTTCCGGTCAGAGTGGTATCAATCCAACTGATGCCACTGCTGCAGCCCACTGCGCAACCAGTTTAAAATCAGCCTGAGAATTACTGTTAGCTTATTCGCGATCTTATCCAGACCCATCTCGGGGATGCTCAATGTGTCGTCTATTGCCATCATTAGTGACGTTATTACCATAAGTGTGATGTTACAGGTTATCGCTTTGACCATCACTGTTGACTCTACATAATATCCATTTCCAAATACTGAGGGTGATAATAATATTCATGCATATCACTATTTTATGGCATTGTTTTATCGTTTGACATCCCTTGTAGTTACTGAATCTGACACCCATTCCATTAGGAATAATAGATTGTGTATTTTTTTTTGCTTCTGGGCTTTTGCTGATAATGCGCAAGACAACTTAGGTTTTACTTGTTTAACCATTTGATGCTGATATTTCCTGCTGCTCAATGCCACGAACGAATGCTTGTCCACCCTGTTGACGTTGCCAATGGAGCATTGCCTTGTTCATCATAAATCTGTGCCAGCAAATGGAATATTTTACCATCACTTCGTGTGATGTTCGTAATGGCTGAATTGCCCTTTGCACGGTTTCAAAATAGGCAAAAACGACTGAACCCTGACACAGTACAGCTACTGCACTGATTACTTTAATTCAACATGCGTTCTAAGATCGTTGATTAAGCGCGATACGGTTTTCATGATCAAGGAAAGAGCGCTGGTGACACCATAACCCATCAATGGATGGTACTCGCATTTCTCGCGCTCCTTGAGCACTGGAGCCTGAACTACGGCCAGTGTATCGGAGTCGCTAAAAAACTATCATTACTGAAAGTCAGTCCACCGCTATGCAATCCGGGAACCACCGTGGTGCTGGAGAGAACTTGACAGCTGTTATCGCTATTTGCGAGAAGCTGAACCTGCAGGCGTTCCCATGGTGACTCATGGCTGGCCGTTCCAGCCCCCAACATAACTCGAATTGTTTGCTAGCTTTATTCATCGTATAGCCAGTATTCACGCTATAACTGACGCGGGCGTTATCGCTATAGCCACTGCTAACGTTGGTGGTTACCCTTAAAGTCACTGCTTATTCAGTATCAATCTTCGAACCTGAAGTACGTTGTTCAGTGCCAAGTAATTTCAATTGGAATGGTGAAAACTAAGATAAACGCTATCATCAGTGTCGCCGTCTTCAATTCCATGAACGCTGGGCACTGACACTGTAGCTGCCAGGATGTACTGTTACTGTCACCCAATGAGTAATTGCTACGATTTTGAAGCCAGTAATCGGGACCAACTCCAAAATAAAATGAACCGTAATCTTTTCAAATTTCAGACGGTTTGCATTGATACTTGACCGTAACCTGATTTTTCATGCATGAGTAATTTACGCATGGATTTGAGTTAAGAATCTGTCAGGAATATCACTTCATCAATTAGAGTTCGTGCATCATTAAGACCAGTTAATTACTGTGTCGAATAGCGATCAGCCGCGATATTTCGTGGAATACCTGTTCTTCGATGACTTGTTCGGGAACACGATAACTTGCCCCTCGTATGTTTTATCATCCGGGATACGAACATTGGAATGAGATCACATCGAAAAGAAAATGCACCAACTGAAGTATTCGACCAACCTAACGAACCAGCGGTATAGTTCTTCTCGGTTATCTGAAAGTACCGGTATCGTTACCCGTCGATAGTTATTCAGGCCGTAGAGCGTAGCTTTGCTTGAAATCAATTAGGCTCATCCTGAAATCATCTTTTAAGCCTGACCGCGCTAATATCCCAACGTCCAACGCCAGGGGCGTAACATTTGAACAAGCGGATAAGAAAGGTTGCAATATCCGCTTGAGCCGGTAGTTTGGGATGGTAACGATAAGCTCACTGCATACCCTGACGGACTCGATCATCAATGACGAAAGCGCCTGGCGGCACCGTCGTTCATAATCTGTATCAGCCACCTTGCGTAATAGTGACTTTGGCGTTGATCTGGCAACGCCATGAATGATAGGCGCAAAGCTGGCTAAAGTCGGAGGCCGACATGCGGCCTGTCACTGCTCTGGAGTCGGGATGCCTCGGATACTGACGGAATCAAAGGTTTCGCCCGTCGGTATGAGCTTCGCCGAGAATGAGTTCGAGAACGCAGCAGAGCGATATCACGCTGAACGTCCGATTCTTGAAATCATTGGTACTGCAAGAATCGGTAACCATTGTAGTTGCCAAGAGGCACGCAGTCGCCATGCACCTAATTCATCCCACCGTTAAATGCAGCATAAATGCTTTCATTTTTTGAGCAAGGAGGTTTCACTATGATATCCGTTGAGGTTGTACTGGCAACATAGCCGCATTAATGCCGTTCCCATAACGATGGATCAACATAGTTTGGTAATTTTTCATTCTCCAGGCTCAACATCTATATATCCAGACGTACATCGTTAACATCATAACGAACAGAAGCCTGAGGGTAGATTTCCGTCGAATTGAGGCAATTACAAGCGTTTTCATCCCTTTTGGCAAGCCAGACGGCTTTCGTCTTTATCTGGGAGAATTGATATGAAACTGCAATAAATTCTTTTAATGTGATAAAGCACATTCTTTTACCTTCAATTGCGACAAATGTAATATTTTCGTTAATGATTGGTAGTATTTTACATGACACTGACGTCGTAAACACCCGGTATAGCGAGCCTCACTGTTGCAGGCTAGAATGATATTTACAATCATTCTGCCCATCAGGAAGGTATGGTCATATTCAACGGTTCAGCTGAAAGCAGGGTATTGCAATACAGCAGCGCGCAAAACGTGGCGATCAGGTTCAGATGAACTGATAAATTTTGCATATATTCTATCGTCACGCTATTATGCTTTCCTGCGGGATTATATCCTTGCCTGATTACAGCCTGGCATTCCTTTCAATCTCCGCGCCACCAAAGTCATTAATGGCGTAAAATGCAGCTTTTGCAGAGTTCGCTTGCCATCGGCCATTGACTTTCATGACCTCATCACTCCAATGGTGCAATCATTTTCACATCAATGGATGGCGTTTACCGCTAGCTTCTAATAGCCACTGCTAAAGACGTAGTAAGGGTTGGATTGGTCACTCGTAATGACGCCTTACCTTCTGAACCTGACAGAACCACTTCAGGGCTAACGGGGCTTCAAGGATTTCCCCTTGACCATCCGGGCAATAGAAAGTTTTATACATATCCGATGCAGTTGCAGCAGCCGCTTTTGATTCGCGACCTTTTCTGCATTGTGGTTTTGGTGGATTACCAGTACGTTAAACCAGACACGCTCTCTGTCTTTAGGCAGTGAGGTACTGGCTGTGTACATTAATTTGATTGTTTGCCACGTTTGGCATCGATACACGATACTGGCAGCGATTGGCGGTGAAAGGGGACTATGATTCTGCAGCTCAGCGTTGTCATCGCAGTATAACCAACTCTGGACAAGCGACGGGTTACTCCCCTTATTTTCCAGACGTACGTTGACACTTTTTGATCGCTTTTATATATTACGCGAGTTGCCGGAGATGACAATGTCCGCAATAGACGATGAACTAAAAGCCATACAGGTTTAAGAAGCATAAAGCTGTATGCTTTGGTTGTTCAAAAGACATGGACCATCCCTGATAGAGTTCATAAAACAGGGAATAATATGAATAAAAATTACTTCTTCCCTGTATCTTCATTCAATCAATTTTAACTGGTGAATACTTACTAAATAAGTAATGGTGTATGCAGTGTTTAGTTTTACATAACCAGCAGTTGCTGTTTGGTCTGCCGACGGCACGAACGTAAGACGCTTACAATCATAAACAGCCGGGTTTTTTGTGGCAGTCTCATTAGGCTTTCGTATCGCGGTCGCCAGGTTGTCAATTTGAACCTGTTTGATCATGGAACCGTCGATATTGTGTCAAGCAATATTAAGTTACCGTTAGATCGGTTGTAATAAAGCATGCGTTGTTCAAAGTCCTTTTGCTGTTACAAAAAAAAAAGAAAAAACAGAACCGAAAGCATTTTCACATGCTCCTGACTAAATTTCTGCTGCAGATCGACAATCATGCTGAAGGTTTCGTTTCGCCAACAGTGTCATCGATTCCACAGCACATGATTCGCCAACAGTTGTGCGGTTGCACGGATTGGGCACCATCCTTTGTTACCCGCGTTCGATCCGGTTTTCCACGGTGTTATCAACAACCGACGCACTGATATTGACTGGCCACCGTCATCGGCGAAGACGGTACCTGCAACCATTCGCCATCTCGGCCAGACGGGCAAGACCTTCAGTTTTCAGCATCAAAAATATTTGATGAGATTCCGTTCATCGACTACATGTTTGAAAAATGACTATTCATCTGACACGGAATAAAAGAACAAATGTCCGGTGTAATGGCGCAATGGTGTTTACCCAGAACCCAACCATTGCTTTCTTGATCAGTTGCCTATTGTATTGCCTTTAAAAGCTTCTTGAACAATTAAGATTCTTGGTTCAAAAATACGTAAAGAAGTGGTTCGCAGGTGCAATCGTTAGGTACACTGGGAGATGAATCGATATTTGTCTTATGAAAGTCTTCACAAGATGTTCAAGAAACTCGATGGGTAATTCAGGAAGATGGGAGTATGGTTCTCAAAGATATTATATTCGCAATATGAATGAATTCATTGTTTCGTTTGTCTCAGAATGCTTGTATGTAAGCGTCAAGCACAACGTCCCATGGCGGAACACCAACTGGCACACATTTGGACAGCAGGCCACAGGCGTAATAGCGATTTTTCATATGGTATCATCAAAAGCATAACGTTTCGATCTTTTGTAGGCTTTCATCCAGAACACACCATGATGTCAGGGATCAGCGTCATCGCGGGCATCCAGTTCAACGCGACGTTATCGGGCTGATCGCTGATCTGATCGTTTTCAGGTGAAGAAGGTTTCAGCGCCACGGCTCGTTTCGATGATCGGGTTGATATTGCGGGCCAGCAGTGCAGGTGGTGAAGATCGAAGATCGTGAATGCGCTTTCAGGATCGCTGTCATGTTTCAGCGGGGGCGTTGACTGCTTCCAGCGGAGAAACAGGTTGCTGCCTGCCTGGCGATGTCGCCACTGTTCATACGATTTTCCTCGCGCGGACATGGTGTAGCCGTTTTCGTGGACAACAAGTACAACAAACGGTCAACAGCTTCATCCAGCTCGAGCATCCCTTTGGTCTGGTGGCGGAAACTGAACTCACCCACCATTTCACCAGACGCTGCCAGTTCAGCGTTACGCTCAACTTCGCGGCTCGCGGCGAGCCAAGCAGGTCATAGCCGCACGGAACTTCAGATGCTCCAGCAGTTTCCATGTCGCGTTTACCCTGACAACAGGGACATACGCAACCAGACTGCCAGGATATCGCGGTAATGTTGGTAAACGTTTCGGGATTGCCGGTGAACGGCAGGCTTCGTCCAGCACGTCGTTAATCGCCAGCGCGAAAGCGTCGTGATAGGTCAGGCCGCTTTCCTGGGCAATCTTCTGTGCCGTCTTCCAGCAGTGGGTACCAGAACATGGCGGCAAACAGGAAACACCGGGTTCACGCGCATATCGTTATGGATACGCGTATCGGTATTCTTTGGCACCTGTTCAATGATCCGCTCCATCGGGCTGTCGCCATTTTCCGTGAAGTAGCAGGGTAATGGTTCGGGAACGGCGGCTGAACAGATGATATTGGCGCGACAGCTTATGGGCGTAACCGTCACCCGCTTGTAGCGGTTTAAGCGATTCTTCAAACAGGCATACCGTGGGATCTGGTTAAGAGCAGGTAGCAAGGCAAGGGATGGTTATCAGCTTCCGGCTCGATGCGCCGTCCGATTCGGCGGCAAAACGTTACCGCGCGCGGCGTAGTTACGAATCTTCGCGGTAGCGCGTTTCCAGGTTTCCGATCGGACGATCGTAACGCGGTTCCTTCAAATCCTTCATGCCGCCAACGTAATCACGGACGGTAAAATCCGCTACGCTGTCATCCAGGCTGTTGATACGTGAAATCGCGGCGCTGGGCGTTGCTCTTCGATGGAGCCGAAAATGTTGTCGCGCAGCCGACATCCGTTTTGCCCGCGTTCGGGAGGTCGTCGGTCGCTGACGTTACCTTCGTGGTGTCCACGGAAGGTGGCAACTTCGATAATCTCCGGGCCAAAACTACGTGACCAGACGGAAACGGCGACCCACAGGAGCAAGGATTACGGAACAAGTTCGCACCTGCTACGCATATCGCGTTTTAGTCGTTACGTCGAGGAAAATCTTGGCTTTTTGCCAAGTAACAGGTCGCGCACCGCCACCAACAGCAGGGCTTTCGGTATTCCCCCTTTTATGACCTGTACATTTGCTGGGCATTTTCGCTGATATCTTGCGGAATAGCATGCTGCTCACGCGGGATAGCCGTCACCTGTGGACGGGCGACTGCCTGTCAGGCCTCGCTTTACCTCGCGGCTTACACCCTGCGGCGAAAACTCGCGAGCTCGGAGTAAAATCGTACACCTCGGTCATGTAAACGTAGTTCAGGACGAAAAAAGAATCAGCGGCTAATCATAGCTCAGCATGACGCATTTGAGATGTTGAATTTCAATTGCCGAGCTCGGGCACGGCGGACGGCGCAGGCTTTTGACGCTGACTGAAAGGGTTTGCTCGACGCTGAAATCCTGCAGGTGTGCTTCTGCCTGCTGCCAGGAATTGAAGGTGCCGCGGTCATTACCGGGCGTGGAATCGCCTTTCGGCAACGCAGGCGCATGATTATGCTGGAAAGCTTTAGCGCCTTTTATCGATTTAAAGCGCCAGGGCAAGTGAATGTAATCTGGCACTTCCAGCCAAAAAGCTGGTAGCGATTTTGCCTTACTGTTCGTTCAATCAATCAGCCCCACGCACTATTTTCTGTAACACCTGGAAATAATCATCGACCACAGCGCCAAGTTGTAGGCGAACAACCCATAGCGGCGATGAATAGAATAAAATCTTCCCGTGCCAGTTTTTCCGTCAGCGGTGATAATGCCGCGCAGCTGGTCAGTAAATTGCCGTAACCAGGATGCTGCTGGCGAATTCGCTGCGGCATTCTGGTAGGTGATGCGACACCCGGCAATGACCGTCGTAATACCGCCAATGCTTTGAATAGGTGCTGTCGATTCCCGTTTGTAGTCATTCTGTTTAACCTTGGCGATCGTAACC <*#))$#$$#;=&*/9:''/22.%&'**.;;:79./88>=&)<D)&.+%2/5,,1&&&@70((8B.90--),%#,0%&$%%*5**(:7-3<.'&/6646..,./31;752)&&$'.718+.&-/.--2'$85+8:&)5>?6%$%*7%,;/+%7:30.,+/%$(&$,$$+%%+#%/5:7+<**+2'410--''+*%((*10&&%#$#'$(%&/9<C/%),887((<9,/&,<@E:+&$112?A%)*%*$'**-54/*..+-.*-2)/,483=((,0(37.4;7:(+)'%(+%&%*%�..+809*9)8%(%&&,-/)%####$#$#$$(-%($+,)'(%76I('/38@?--55/&&'/--+&$$'$'$&(2)#%('+-$&-EG0'#$&012.,%+4..0/39&$"(.(+6(8$89;DJN4'04))824(+)(05+'&()+-*0:=8$<:&*2(%(+B;4A797:B>9DB@'&($(%'%(+#%#&'(((%1/-3#%6,6(87103./+)(((.7&)-0(4%9<'(*),.($/32362:=2/-5=$)'-545+))1#$&''-(*)%#,$"-/-,736)$&9??&&435?<35&&'(+%-&$$)#0/*.$%&'-%.333430>03D@22>?01=>258*C.9?;$$>;9(,-251'+12/+,()+&$&&/-+**-*+.(2*>13037>=553546,-(*45&%'##.,+,&$$2..+266%(++(%%(,,+''(:9CDD%,4#%0&.6%($%A3')$&33%-&57*#&'+(0'*3)%1/.)1356>@)?C&"$-'*#$*/8,11,)+#$#$/6+&&'-1%$%+*)5351.00)&#%))'31$%&#$%$&244/,+,(7(+)&)13=89---,-.-,'-&#%%&+$&50<.62**-$(8A'-:;:*-/01))))*4;&1<5-%06,.46736:-46((67103.-//<85;:6B:;=5+$%'$&)34#/7457*'(,$()/48-.,/3$%'657((#$&&%//01;/$):5=5-730%*(//*><?62(%###%1/-+)673,-,#$12/(/@D6666(#%34&&%&$(#%%$3667$$%$%%%%%2++5;/.5=@?4?DA5;=;SH77650798*<<-2C1$$'(**+*)*(*$*2%'-)$/''--+.39$$*-'((*&#%$))+1-/+@@2:7B:000.2>4)*&'971J86-/')*67/((.)9?-::;&&%%(#+5()B&&.%$*61,0()5%&(++,'.--.46;<8EI;G<8934&*,+1485/,&,-684.4?@;=*(7=(,12089,()*?I<*<9=C>=K>$.4D'0/0,.4%$&%#$&#$20.(D>.>*AA6C<DD=:<%$$27GD1589)*+,&$$)<,6;$$2&$&/750+$.:HIB:17;2362##&+32/8)))%*)%%'4$;*+2%%*-../,+.(&$*36((*-1/%&.,)'(*-(##;1%&$&6494$162BH6;&&)--*9?..-/F39:9/13+.#&%:<18:+%*)(%))),*&%$#%0&1247,-.5%/&-.&.67:<-'#&#,'(%)##)*#%2()6'--0*0>6;(+6<:75='->1.(&%*B0273<&*':;<:C?/8--:2909G?60072>?B8?@7,$#&,(':1./064()>:7;:;,$#$%')&$*#&/&'100)+30%%,*00'+,+C'(,#%&$&2$==@*-*,37+%$"..*)''&(.%,4,63)''/,'5$02/.,++)7=@3/4.*/D0:+)**+$)/(&+.%040///.0$/96>B?7<5AFK3,%=%&+*#-.-592%%#'*&08+($'/1//2@>=;>?<?<50=;-,)&&(*6:9@?.%2&))&&&#%(%&&&(%)248@%%5*.8((**'2'163$.?=;E@A8?CBA7;.)1%'''*+)*2$$-&&%&(5138BD9>9>0&+7&()%'''&(##./###/*%&$"(*+)/%&,%$#))*7<)01+$$#$(%''$%#-(.C@4<-$$$$#%++,25<>,#%&',8*885@?22$$691.('&(=873-&'.'%79;4>&,/&+88#%%(30).-.*)?=64::5986>8B@=<>2421'($/*6-$89%(%+++'#--$&#.)0%'/::;2+-(7#$%&-44)/#%,)(-+*$&(%**(*,/'&6$&?%1.$687*)53$$"'&)#&8</0-*-.(+-##*'-4'+0./.673-.=06<:0)=*')$665.5<>.;:--&&6%"$$/1??%(*-**DIJ6.%')'").+-*$1+/>.%($%'&-/$'/))&,128'%)=>AG-&06<**-3*434%+'(%*-*,.&)'(&&&(,.6$(()%&'$#$$:/*2A72'&(>&(9$'&'/4844528:*%'.%"%$&'#)($.0>0$4'.978,+;(6*++*&&*&&##3.1<?4*7%+-$*-$&($$%&03%+*+..'33*1>B:;.32963101<2+(&83/*((:A2?A.>25344/$83'(/-0)2B.4..3;18@5,;%'..)*1%%))).4--*8-AHJ=&-1;&..000,0/-.340365>8-46'--.0433<?E4578?8%%2-%&%&$$'+%&3'*-7;6-703A>5>2@29?/B.9AA>@>CJ8)&8%%(+0.%?(*=$))&)&&(/%.19*,%+$*:<=+*2./%%.&'%%$#'4,1&'3&&$$$87%#0)%&$+0&)512%'&-+(-23.').1167685970*((/9?DG+.?A&%-20204--8*%%(*:=<4;*#&('24)$&&)49.&2*').6;-01(.3-%%+536&&%%%&%'6''$$&%&'4@,&+)-6+%%)'*)+6..-&(.+)&*-1-=A>-1'8?=?@21$$$21+:;&'97,-<8**21($%/,,,*07:9+/.$),,&6<985%&../%&)1-(&><-A>+.,)./2@>:8<)1*4./%+(&),$&&**&$%%$9',&'&(<-*+.011(*3/64*9/%%((=)$$#%&522.==<9:54(1*1.$$)%<8H,286%##$"9;&$&%9.67:7=+(#(---,,./-%/4+32+&+3+=88:C:K<<;+)),&('()'''(('(/*9;5=@@&/.'%*&/*$451&.0%.1F--=@''+93;634(,+-$+,%126,40;<1'>)24673366()*-1(&'/)&+543-274<>;3/,(&$'/84$'&*/3390$%#-&642/))+'92302*23?>A,/50?+$$589&5:'$'3,)1<+,;9:@C337(?C8+=<5:=*0;6,.,+5@)1.41%&6'67:D;272.,1B164-.(0;&;'%$''-//-..<,>93@ELE+),1,*''(&'-)#$$((&)%0)'&%%56,1@@ALE8+?@<;B6@=?+;9?A>GF=><122C>:85;=4(+.'%+0-07%30,)'.1522%/?'9%74A,/9*+--81=6E501<DAL9%5(,,)4-/<>3=='2/8;***(&-;A*$,:</-0?-((/>;=57'&&*$#->7-52$$$16$*'>9/+)38JH+81//--'"&&'$+)-$+8%65''64@?.:952134<?:3-&'$$%0..+-('+##%&)*$%++0=?@@>8B9/''.#14336+;@C3/1:=+,#&&''<:57))&&(4::)39:"#%(9,<>;>93463BA4%>5+7(,017@)23(36771A./=*-)),798=(60,2./778>'&3)*.)++,,/@B'<<<8.1,./397(,7,&&#)&0-'%3,&&$-/+<)(=@6871A=(&&'0+/+7*'$%.6:651,%&*$$*'4;@:6(-C$$$+,7-#&(4''3C%788$=99K93=90:10994*(0.,'&,)0/))7/5>6;78;>(6;+'"#140A71/-/:;5---@I.89'&+$'5+$&&&-&$&(&-4:77)1G:8@=EB<>GB.0/$&+04%$$)1&$'$$8),28.'*((-(/>32(*#*.0+<1%'(,'(244+<.J??9I>:?32-/4))48--6/&&&@'2>A564$$,+21((0387:=:;>0+-6941/CB=:A%8=58DG-8>6>A09B?>348?3:@=:89999A@C>C>'&&&()$&<1/'##&+1&'76//;,/0AF>6$*>;(96)*+340%)003565+48-.-941../B.@A'*6;9955?3<A9+(%22&D;"%#'/<9@G1+<9>=$9</0<48>><99D@C?B3AHC6#"$$.-1&'&)(')+8E$5;881--.?@E8$"#**$'*%*3$)'7($(('($$%--@EF,30,),34&%8202%&*);66,22-(.1251%$%&'(/0%)))++0.88:D@:31:2$#31.)0-@A+/1-+6=8/$&&&#&*',.%&)).*&&.0;99')*$)&00;0+*<112>ED557905)9;?B>9B=*%%')**/453;:)&)3->&,-+*0-((&-++-14&%$$%,-0<%3:8;9@19=69AA6D@44''*')3<%%)=6>#$*+++930=7834;61368-><'($;5>;+'%%)(5.%*#&%*(,''%$#$#%,*')&%12''81(..0/58:,%-)%,,,--',89F'18)*)&893%-(..&&;DA9;;<AB5$843.@@/@G1$'(&9?@08A@+/*68+,6'''@548A66@DBA;9&''(*==?,.3@A57F0/A.=34&3124&#$$&*1),%'"%&(-%$*-$141*31>1,$$$%0:%#$(+()*,'))'76243;<3$<;('%%%'5<=@B4&(//1)(1:+.-4,0;;=)'3496;'953-1-*/4**)++'.),.+++*/%##025$%(?,&6<&)&++++-:%%>>AACG44371E8,&&$+)78+-A2</#&&%'-.4.$+:47B,+717=4;=8./884-%&5*9$(%26143'34&23',1'9<8+/''0+-(-$((%1.-037-)+21-8;9=2*+424*+2>3=CR5.&.A445$&*1+&&%&&%'',=.%1.)+&&+./1$(((/3?7:,$*&'/8&$?8772*;>A?77)3&'%%(678024$)54303(*22:4:?/<)'6,.(-342,)42'015>67497D=,1045;E><<99:82,924==><<=>31@@0..3=8+676::@AE<:95?631,.1*?=AK29<H<;2($'*)7)+.,%'58A8((<=;?<@@/.289C72>=5A;-3&$&#&1/0?)()6;=0?,6=#)11-,-22-<9///81/1..+6+$#)1)-*18:614:8;:DJ>71**-'(&&#%,/8).27$$*7,,/10((*;8:B0:-)(()0650-9&&-//,0('$)?4:<;5:?C3756&&'$+*6532./+$,(6215768#$$&&0%%((&176%(%$#'%(((&,5@:4:#'$&,2$&3C;'$#4%)<'**#&%'),(561-$?0/7/25)&*,4'()$%&%%&?C/%)6<>8<//-.3/1%3*30+@A/'%5735#-*:652375&&.'75()4&6A:'*3A@3B::=>3+*/')5663:+99'(&.8+7=)5>))')*((/)3'=356:9513$$#&&*5.%4;2')$;&$6/5$&'&%%*.,-0%,)/++,//14;30=?<92))*>A:9;,,,,64>0''))(,(&5,//.()?+,=@8>'$%22--/,+##?28+?-,=D71%@-.A56*,),)$/3*$$+()#&0$''$%%//</&')&AN;#'&&((&%(,,*)(8:99959:$&'**,,<9(/78+,,-$&'*&$##,9;:4?6@3;;;(&$&($%#(2(86+578%)&(5$$$&01%8=/9<7@@22,,8978>?:1'1&*0/+)&'&387,&&<'&&)42*-&')+(/)(*%0/2696;==3>,)),,2;/>/)(M$$*'**,0.*+,,016B?*0)B/,)*,../+((//';;$&/0+99'':;;)-),.+,@C3<=;,9()5*&6%&'&'02885??$%.52*)***76/88:911D96<:AE3:;<C&'%'&;:D98:%$+&>*(;),,#$%?;;300,-.==8:+*-60.-,',=74'**004453<)*7;&'(%&)&%%4GB'*,'/##&'')&46/<(9-/):$6'%7$$74.>5.88:8$*(&8698@/%*0*5&%$&/)'&(0'()0,'-14##$*0,)-%,-'),&2$&*&(%19:(/,/,.,>>+)09;-23#01(6%$5-4=?+::?;?--00049.09=9,.4)-32,25AB719780.*9=,0$&)72.8>;:?'8,*...+1//33&&H59>919??&##&%&(+$6$5(:,04@-9/--9;52+@&61/$#$2;04=69:BAB@32@%%:-()2:;%%-%>?6;81JMA&,-0-1><;=(635;;=7?@BSF2)754(ADM6,/?01114>;0$%#2%($+.#$6#-))/;%'<51+//0$%<(+9+(744/),4//@=0;04:4<,*968$''(*/.,''&,2/.0+-&43$03(#%$&(&$%#%0-#')$&$$(/&&&&388'*)$$&'&&$%%"#)'9$$,,%(($%%%"#0+$#$5.513%'45*%&.<;;82)%#&)*;976439*+-6;8)';@@(&%%@):,-876=</020C>:;>=:::@>=/)&%##-,++%&(197,+3))(./,317'-6422602A-=DC:<8(80D=:9A05777777,,*/=5;:8*(0+-,111:B96;)479;A<L?:-%'$&(#%(##%$$$%&(%##&##'#"%#%$')(,$%#+(86%+#&'.8++//3=C:>K,/--%%-,,/22*''&';:388:3:&%*014-')&'(#$#%%#%&*++%&.('10,-&-#'%##&#.)8,),85.24)(/+&'0/1C3%-%-.#$#'""##$$&%.-1#*($.-:&.+-,-$&53$*%77*$$%0$$##('#'#(#+##%#)&$'%+0%2#$+*$$&)%$#"&((#&$)+'%+,%$$'(*"#$'&)0&'.&*$####%$$&'$%$$&"*$+)'''($'&%%2%(()'#%*(,),$%"0"$+'##$%$&#(%%$%11$.$%%'1&%&&&'#$*($)#%--($'.%%&)0/(&%%#%'8%'4)$%'%0%'*&)'&..''$,%$/0''(&*($$$%*$$),'#'%(+&.+-$'#&).(*)%'&%%%"%$&.%)"&+,$&47''#%&%#&&$$-09)&$'-*)"#%+(*)#')&#$%%##%(*''%#&#&&$$&$,&7%43=:,($$,)'%$&$'"$$%&%"'$*,%2"#,'%$$&'$'*7#$)$&+$%$%(&,$&##%*#&(('&&'&,0(&$%)"$&%$0'&()-)*%$$22+&$,$0.%'*')$013$))###%$#$%-3$##1+('')$('$+,:'(0*&'%$()%+-#$'($#%/,1,'()-')'$*)#)((&)%$+#$*+'%''(#&%&#$%(#'#)0&'&)%%##'&*#'.+)01)''%..(&&%-((*"##,$-)()%,+&($$%#$%&"'#$&$$%#&#%+'$$'&,/)(%%%#*',(&&&&($'%$'-$,)'03('&)+*%'(-)'$$'&&'%&))%#($$58'#+01))$$-'#&%()&$''''%&&/"$$&$)"$'#*$'&)1#(&(0&$$%+((**6'.+$$$+#$"$/$$$,/$$%&-$$,'%-*+(+)#+/%%$&'$#%&1.1*(&'&&$$$%+%##($$"(3*$&##"$($,.#"%$$$%$)'''#%((*,$$$#%("$%&*$(##&%(&%''*#"%%','%&*2$%#&$##/#$*&&$%,"%24'$(-")&*+$%&$#%#$*(-%(%%&'*'$$"#$%$##'&*&(#&*+&%+&',(**1(*$).$$'$'$/**%$%'##%$$$$%#$%)%'#'&#%)$-"#%*2#$,##+*2#,,.&&&$&)$$$"$$))%$"#$#&&%$#*0+,#&$&'&&#()-$$#$&%37%*&.*#&')*%%'(#&###&&(%('+.#$#$$&%5$&%%%#$&$001$##*%%#')(('&%).$43.%''%%#$%$%&'#,-1)#%&&&'(.$'&(-))%'291;:2#%%&%&5?&'9'''(&.$'/())%&/2%%'-#%#-'%$$#%$$$&#.#3/&&+7>BA+''),+)'.(,/..4303%%,+3400-AC7?<7:C?6+7D9>=C;?A@$%)<12*(.8+$&$-1&+'),./1#&($*(3)$$$/%#).$'#%#:'.43($#+3>>9E..016:?99$$-'++)2+3-00+*5C-3.,54%$4C57B+/(?471/+&)7$&:38'88;?.C(,B6$$-5<7>7$$'8>>:12&;)+#(3.$##%&$%&$&'''%&().-,%=3$*12(-0+5:-.61+.)--:E6:8%=003.34$%(,)>@?=C51,66<9<03*194=86>;9)''/.0*1+-'275-9?>*+7456..-&&2,++6$'16/6/,;822971=B:<9<;999;@@C>--D-/21&,8>'+8,.<:9>?A@BBHDC>32CID1**9--&3-;&+$$&&/2&&&$*-##&)&5)+'02(+,,.@?=7<=:56;3-(*889..8995-06/'0/175&2''$$;%%&78/)##$%%++2((,(#%%(-+0.)*:+/&%%')$$'*,0()&1;-:==>&$$&).$$)**)$$'%20)4,9/..$)8+/,4&)*-%%('*)66$&//*.,56*+&8&#%$&1//&001@>$$#23<#.)3)08%%7A:-.211*''*'*$%'$%#8(+),6(-$.*77<3*,))(*$..%&&+/&,3*+?->)'*(43,976)%,87&+#'$'&&'&(4$++)$&%+,+.:39,=?/3@%252:%&-$++$'(1%&'$"$"#$*'.#&#%)'*,*#''&*++-'&$$181.1$)C-.14><5425E31@912C@62?7=@=B>FD?=%/*'.%(<<;*&)%55$&,>9322,*-%$)$#<<,.79*')#$5+**++6337;3@,$112.1-;:36*85-8'*$'#26.$$''13.#'(:./((&(&$$$2,%&(%&$)327&990<B&,/4=A:<4:;58886..,534?I90/%)523ACC,$<.*+0-1.6&(-20#);%'/%0%2,##%/*.6/0(@-'&*&&9:=7./)'0%#1&'%%##$%&%&-.3$$0-/),0(224$+7$1&0.-%%&'&%-4*#%%%'73618$$$"$$$&1///,68'.=;:-&%0$&&($%*,<-&21$((+&$&'%(*&56+.=*-3$'&,224&&&$&%-.%42+50;3%%/.+;@G@@3-',+#$)-#$&0#,--%%)<?7<;**-?@,81-<=1**$%<;<@7,:.0:6322/;0$&30-.:+48?:9/.:5#&'0./<?9==<,3)')-34;-1=>>$#&&''0**4598,/))-%0*+-1-%&3/+)+#$*&'%+.++,#*4*/5%(%/06@2@2:)021(&..5$%5*,,%&34:--/52'(-/065&)/5%&+//>><<=9%%%-=5@729?530,A<C)78$%$$&67A<:97,+%($)<++'&%%#'5+67<A3+(.%#%*($,/*+$$'$38/''5,6496??=:$79-/'-21&45)%'=.<?7'>=6+-4(<#$/-(7%503*,)-:?8'8&;7#&&0256'5?+*4/+.--+,4.*,'%5:)**:,;67&.1++7ACB+'2>?<7=2@%(7<:?):<557'43-9,*38>+1*,%$#928<<D-(''&$,0-'+-70*.14(')->&&%)):3102*,;-2-2D9BJ:+7:=E;.2H:A=FC?75$$()*++,$*=9<;+571$%$1),&'64)%(2'+386CLEA?C5<?BBGJ%/>;><=@%:'*+,(=5&%&*$,'%%1>F6=%&)%/1+,166/7%%3((3*2%'%%2))#%*,0.,445/0%#$#%%#$;%$$%(7?97?<E1$$8IG:'*/''*%,-$&0,2-,-44<<709:&9=?+,6*+42223(,)&++,(*/-#'./+*-55@@:8&&+'43&*-;-)0:5$()%'&(5++36,&%$%*(2%)=3;B=13:9EA/..,433.4/>A14104A?;=I(7+0=..-:/56340-81%*+,&3;2.6()+-%%'+--:F%,16<35$&&(:6*22--50 NM:i:1605 ms:i:10336 AS:i:10336 nn:i:0 tp:A:P cm:i:297 s1:i:2806 s2:i:0 de:f:0.1327 rl:i:0
You may want to only carry forward a subset of data into the next stage of your experiment. For instance, if you are planning to perform variant calling, you will find that some tools operate better within certain depth ranges. As such, if you have more coverage than needed, you may need to 'down-' or 'sub-' sample your alignments.
The simplest way to do this is using samtools view
and using the -s
, which will randomly subsample all input alignments.
For instance an -s
of 0.25
should output only 25% of the input reads:
!samtools view -s 0.25 -b sorted.alignment.bam > subsampled.sorted.alignment.bam
!samtools index subsampled.sorted.alignment.bam
Sampling reads to create uniform coverage across a genome is a form of the box packing problem: how do we best fit a selection of reads into a box with sides the length of the genome and the required depth of coverage.
The subsample_bam
program from pomoxis implements a heuristic to approximately solve this problem:
!subsample_bam -o pomoxis_sampled sorted.alignment.bam 25
[09:14:17 - U00096.3] Building interval tree. [09:14:24 - U00096.3] Starting pileup. [09:14:29 - U00096.3] Hit target depth 25.
This program also has convenient filters for read quality, accuracy, and length.
Sometimes, one needs more powerful programmatic access than that which is available via command line tools alone. The good news is that there are several ways to access SAM and BAM files using Python and other languages.
For example, pysam
is a python library which makes the manipulation of various biological formats as straightforward as possible. Specifically, Pysam provides the pysam.AlignmentFile
parser.
We can use the parser to iterate over the alignments within our BAM file, and quite easily visualise the alignment results, as below:
import pysam
import numpy as np
read_lengths = []
with pysam.AlignmentFile("sorted.alignment.bam", "rb") as bam:
for read in bam.fetch('U00096.3'):
read_lengths.append(read.query_length)
print("Mean read length: {:.0f} bases.".format(np.mean(read_lengths)))
Mean read length: 7250 bases.
The documentation for pysam is generally quite helpful for summarising the information contained in BAM records.
To create a simple tabular summary of per-read statistics the stats_from_bam
program in pomoxis is useful:
# run the alignment summarizer program
!stats_from_bam sorted.alignment.bam > sorted.alignment.bam.stats
# use python to plot the results
import pandas as pd
import aplanat
from aplanat.hist import histogram
from bokeh.layouts import gridplot
df = pd.read_csv("sorted.alignment.bam.stats", sep="\t")
p1 = histogram(
[df['read_length']], title="Read lengths",
x_axis_label="read length / bases", y_axis_label="count")
p1.xaxis.formatter.use_scientific = False
p2 = histogram(
[df['acc']], title="Read accuracy",
x_axis_label="% accuracy", y_axis_label="count")
aplanat.show(gridplot((p1, p2), ncols=2))
Mapped/Unmapped/Short/Masked: 48109/3952/0/0
A very common requirement is to investigate the depth of coverage across the genome, i.e. the number of reads on average overlapping any given locus. This may be in order to determine how well a sequencing run performed in terms of on-target throughput, to check if we have enough depth of coverage to support variant calling, or to find out if there are any areas of the genome that have been under-represented in our sequencing run.
Whilst samtools
does provide tools for performing these checks, mosdepth
is simpler to use whilst also being faster, making it a natural choice for running coverage analysis.
Fundamentally mosdepth
accepts a BAM
file and in turn creates several output files containing coverage information. Before we look at those, let's go over the command we'll run:
mosdepth
is the base command,-n
means don't output per base coverage information, as it is unneeded, instead mosdepth will output coverage by intervals,--by
sets the interval size,--fast-mode
makes some performance optimizations,ecoli_cov
is the name we'll give to mosdepth
to prefix our output files with.sorted.alignment.bam
is our trusty BAM which will be analysed.# calculate coverage for both forms of sampling used above.
!mosdepth -n --fast-mode --by 500 ecoli_cov subsampled.sorted.alignment.bam
!ls ./ecoli_cov*
./ecoli_cov.mosdepth.global.dist.txt ./ecoli_cov.regions.bed.gz ./ecoli_cov.mosdepth.region.dist.txt ./ecoli_cov.regions.bed.gz.csi ./ecoli_cov.mosdepth.summary.txt
We should now be seeing a set of new files all prefixed with ecoli_cov
, but what do they contain?
ecoli_cov.mosdepth.summary.txt
contains some basic summary information
ecoli_cov.mosdepth.global.dist.txt
and ecoli_cov.mosdepth.region.dist.txt
contain cumulative distributions of the proportion of total bases at increasing coverage values.
ecoli_cov.regions.bed.gz
and ecoli_cov.regions.bed.gz.csi
are BED
files containg the depth of coverage at increasing intervals of whatever we set as --by
, in our case 500bp
.
First, let's examine the summary info:
!cat ecoli_cov.mosdepth.summary.txt
chrom length bases mean min max U00096.3 4641652 100323185 21.61 0 105 U00096.3_region 4641652 100323185 21.61 0 105 total 4641652 100323185 21.61 0 105 total_region 4641652 100323185 21.61 0 105
Once again, we only have one chromosome U00096.3
, and we can see that we have mean coverage of 21.61
across it, with highs of 105
and lows of 0
.
❓ What could cause such highs and lows? Is this unexpected?
Usually, we want a bit more insight into the our coverage data, whilst maintaining a high level view. So let's make some plots, and for that we'll drop back into python.
# Plotting the results of mosdepth (click play)
import pysam
import aplanat
import numpy as np
import pandas as pd
from aplanat.lines import steps
cumulative_depth = pd.read_csv(
'ecoli_cov.mosdepth.region.dist.txt', sep='\t',
names=['ref', 'depth', 'proportion'])
binned_depth = pd.read_csv(
'ecoli_cov.regions.bed.gz', sep='\t',
names=['ref', 'start', 'end', 'depth'])
def make_coverage_plot(cumulative_depth, binned_depth):
# Plot the proportion of the genome at coverage levels
p1 = steps(
[cumulative_depth['depth']],
[cumulative_depth['proportion']],
colors=['darkolivegreen'],
x_axis_label='Depth of coverage',
y_axis_label='Proportion of genome at coverage')
# Plot the binned coverage levels across the genome
p2 = steps(
[binned_depth['start']],
[binned_depth['depth']],
colors=['darkolivegreen'],
x_axis_label='Position along reference',
y_axis_label='sequencing depth / bases')
p2.xaxis.formatter.use_scientific = False
return gridplot((p1, p2), ncols=2)
aplanat.show(make_coverage_plot(cumulative_depth, binned_depth), background="#f4f4f4")
The visualisations above should provide even greater insight into the depth of your alignment data across the genome.
The cumulative coverage plot (top) reassures us that a high proportion of the genome is well covered, and that the mean we saw in the summary file is reasonably representative.
However, with the coverage across genome plot (bottom) it is easy to see that whilst coverage remains quite even across the genome, there are dips and troughs in certain regions, as well as regular outlying areas of extremely high or low depth. In your experiment these regions could lend themselves to further investigation.
To show the effect of the subsample_bam
program from pomoxis
lets also plot the coverage trace from the bam produced:
# Demonstration of pomoxis uniform sampling
!mosdepth -n --fast-mode --by 500 ecoli_cov_pomoxis pomoxis_sampled_25X_U00096.3.sorted.alignment.bam
cumulative_depth = pd.read_csv(
'ecoli_cov_pomoxis.mosdepth.region.dist.txt', sep='\t',
names=['ref', 'depth', 'proportion'])
binned_depth = pd.read_csv(
'ecoli_cov_pomoxis.regions.bed.gz', sep='\t',
names=['ref', 'start', 'end', 'depth'])
aplanat.show(make_coverage_plot(cumulative_depth, binned_depth), background="#f4f4f4")
In summary, we have explored what SAM and BAM files are, how they are structured and how one might be created through the process of alignment. In addition, we have taken a brief look at manipulating BAM files in order to extract only the information in which we are interested, both using commmand line utilities and with Python.
The code tools presented here can be run on any dataset from an Oxford Nanopore Technologies' device. The code will run within th EPI2ME Labs notebook server environment.