BCH709 Introduction to Bioinformatics: 13_Sequence_Alignment

Pairwise alignment

The process or result of matching up the nucleotide or amino acid residues of two or more biological sequences to achieve maximal levels of identity and, in the case of amino acid sequences, conservation, for the purpose of assessing the degree of similarity and the possibility of homology.

One of the most fundamental problems in bioinformatics is determining how “similar” a pair of biological sequences are. There are many applications for this, including inferring the function or source organism of an unknown gene sequence, developing hypotheses about the relatedness of organisms, or grouping sequences from closely related organisms. On the surface this seems like a pretty straight-forward problem, not one that would have been at the center of decades of research and the subject of one of the most cited papers in modern biology. In this chapter we’ll explore why determining sequence similarity is harder than it might initially seem, and learn about pairwise sequence alignment, the standard approach for determining sequence similarity.

Protein vs. DNA sequence alignment

Protein amino acid sequences are preferred over DNA sequences for a list of reasons.

However, there are some obvious instances when DNA alignments are needed.

insertion mismatch deletion

Hamming Distance

Metric to compare sequences (DNA, AA, ASCII, binary, etc…)

hamming

Here end-to-end (global) for query, partial (local) for reference

Local vs Global alignment

ALIGNMENT

local_example

The Smith–Waterman algorithm finds the segments in two sequences that have similarities while the Needleman–Wunsch algorithm aligns two complete sequences. Therefore, they serve different purposes. Both algorithms use the concepts of a substitution matrix, a gap penalty function, a scoring matrix, and a traceback process.

alignment_compare gap transitions

Take the alignment of DNA sequences TGTTACGG and GGTTGACTA as an example. Use the following scheme:

penalty

Initialize and fill the scoring matrix, shown as below. This figure shows the scoring process of the first three elements. The yellow color indicates the bases that are being considered. The red color indicates the highest possible score for the cell being scored.

Smith-Waterman-Algorithm-Example-Step1 matrix_finish backtrace

location

 mkdir /data/gpfs/assoc/bch709/<YOURID>/alignment
 cd $!

Install software

create alignment env, activate and install blast muscle clustalo from bioconda

Actual Alignment

Match score: 3
Mismatch score: -3
Gap penalty: -2
TTAGTAT
TAGCTAT

  T T A G T A T
T              
A              
G              
C              
T              
A              
T              

.

Example Protein sequences

>sp|P10242|MYB_HUMAN Transcriptional activator Myb OS=Homo sapiens OX=9606 GN=MYB PE=1 SV=2
MARRPRHSIYSSDEDDEDFEMCDHDYDGLLPKSGKRHLGKTRWTREEDEKLKKLVEQNGT
DDWKVIANYLPNRTDVQCQHRWQKVLNPELIKGPWTKEEDQRVIELVQKYGPKRWSVIAK
HLKGRIGKQCRERWHNHLNPEVKKTSWTEEEDRIIYQAHKRLGNRWAEIAKLLPGRTDNA
IKNHWNSTMRRKVEQEGYLQESSKASQPAVATSFQKNSHLMGFAQAPPTAQLPATGQPTV
NNDYSYYHISEAQNVSSHVPYPVALHVNIVNVPQPAAAAIQRHYNDEDPEKEKRIKELEL
LLMSTENELKGQQVLPTQNHTCSYPGWHSTTIADHTRPHGDSAPVSCLGEHHSTPSLPAD
PGSLPEESASPARCMIVHQGTILDNVKNLLEFAETLQFIDSFLNTSSNHENSDLEMPSLT
STPLIGHKLTVTTPFHRDQTVKTQKENTVFRTPAIKRSILESSPRTPTPFKHALAAQEIK
YGPLKMLPQTPSHLVEDLQDVIKQESDESGIVAEFQENGPPLLKKIKQEVESPTDKSGNF
FCSHHWEGDSLNTQLFTQTSPVADAPNILTSSVLMAPASEDEDNVLKAFTVPKNRSLASP
LQPCSSTWEPASCGKMEEQMTSSSQARKYVNAFSARTLVM

>sp|P46200|MYB_BOVIN Transcriptional activator Myb OS=Bos taurus OX=9913 GN=MYB PE=2 SV=1
MARRPRHSIYSSDEDDEDIEMCDHDYDGLLPKSGKRHLGKTRWTREEDEKLKKLVEQNGT
DDWKVIANYLPNRTDVQCQHRWQKVLNPELIKGPWTKEEDQRVIELVQKYGPKRWSVIAK
HLKGRIGKQCRERWHNHLNPEVKKTSWTEEEDRIIYQAHKRLGNRWAEIAKLLPGRTDNA
IKNHWNSTMRRKVEQEGYLQESSKASQPAVTTSFQKNSHLMGFTHAPPSAQLPPAGQPSV
NSDYPYYHISEAQNVSSHVPYPVALHVNIVNVPQPAAAAIQRHYNDEDPEKEKRIKELEL
LLMSTENELKGQQALPTQNHTCSYPGWHSTTIADHTRPHGDSAPVSCLEEHHSTPSLPAD
PGSLPEESASPARCMIFHQSTILDNVKNLLEFAETLQFIDSFLNTSNNHENLDLEMPSLT
STPLNGHKLTVTTPFHRDQTVKIQKENTIFRTPAIKRSILEGSPRTPTPFKHALTAQEIK
YGPLKMLPQTPSHLVEDLQDEIKQESDESGIVAEFQENGQPLLKKIKQEVESPTDKAGNF
FCSNHWEGDSLNTQLFTQASPVADMPNILTSSVLMTPVSEDEDNVLKAFTVPKSRSLASP
LQPCNGAWESASCGKTDDQMTASGQSRKYVNAFSTRTLVM

>tr|A0A2I2ZIL5|A0A2I2ZIL5_GORGO MYB proto-oncogene, transcription factor OS=Gorilla gorilla gorilla OX=9595 PE=4 SV=1
MARRPRHSIYSSDEDDEDFEMCDHDYDGLLPKSGKRHLGKTRWTREEDEKLKKLVEQNGT
DDWKVIANYLPNRTDVQCQHRWQKVLNPELIKGPWTKEEDQRVIELVQKYGPKRWSVIAK
HLKGRIGKQCRERWHNHLNPEVKKTSWTEEEDRIIYQAHKRLGNRWAEIAKLLPGRTDNA
IKNHWNSTMRRKVEQEGYLQESSKASQPAVATSFQKNSHLMGFAQAPPTAQLPATGQPTV
NNDYSYYHISEAQNVSSHVPYPVALHVNIVNVPQPAAAAIQRHYNDEDPEKEKRIKELEL
LLMSTENELKGQQVLPTQNHTCSYPGWHSTTIADHTRPHGDSAPVSCLGEHHSTPSLPAD
PGSLPEESASPARCMIVHQGTILDNVKNLLEFAETLQFIDSFLNTSSNHENSDLEMPSLT
STPLIGHKLTVTTPFHRDQTVKTQKENTVFRTPAIKRSILESSPRTPTPFKHALAAQEIK
YGPLKMLPQTPSHLVEDLQDVIKQESDESGIVAEFQENGPPLLKKIKQEVESPTDKSGNF
FCSHHWEGDGLNTQLFTQTSPVADAPNILTSSVLMAPASEDEDNVLKAFTVPKNRSLASP
LQPCSSTWEPASCGKLEEQMTASSQARKYVNAFSARTLVM

>XP_012418097.1 PREDICTED: transcriptional activator Myb isoform X5 [Odobenus rosmarus divergens]
MARRPRHSIYSSDEDDEDIEMCDHDYDGLLPKSGKRHLGKTRWTREEDEKLKKLVEQNGTDDWKVIANYL
PNRTDVQCQHRWQKVLNPELIKGPWTKEEDQRVIELVQKYGPKRWSVIAKHLKGRIGKQCRERWHNHLNP
EVKKTSWTEEEDRIIYQAHKRLGNRWAEIAKLLPGRTDNAIKNHWNSTMRRKVEQEGYLQESSKASPPTV
ATSFQKNSHLMGFAHAPPSAHLPPAGQPSVNNDYSYYHISEAQNVSSHVPYPVALHVNIVNVPQPAAAAI
QRHYNDEDPEKEKRIKELELLLMSTENELKGQQTQNHTCSYPGWHSTTIADHTRPHGDSAPVSCLEEHHS
TPSLPVDPGSLPEESASPARCMIVHQGTILDNVKNLLEFAETLQFIDSFLNTSGNHENLDLEMPSLTSTP
LNGHKLTVTTPFHRDQTVKTQKENTIFRTPAIKRSILESSPRTPTPFKHGLAAQEIKYGPLKMLPQTPSH
LVEDLQDVIKQESDEPGIVAEFQENGPPLLKKIKQEVESPTDKAGNFFCSSHWEGESLNTQLFPQALPVT
DVPNILTSSVLMTPVSEDEDNVLKAFTVPKNRSLASPLQPCGGAWEAASCGKTEDQMTASGQARKYVNAF
STRTLVM

Example Nucleotide Sequence

>NM_001130173.1
ATGGCCCGAAGACCCCGGCACAGCATATATAGCAGTGACGAGGATGATGAGGACTTTGAGATGTGTGACC
ATGACTATGATGGGCTGCTTCCCAAGTCTGGAAAGCGTCACTTGGGGAAAACAAGGTGGACCCGGGAAGA
GGATGAAAAACTGAAGAAGCTGGTGGAACAGAATGGAACAGATGACTGGAAAGTTATTGCCAATTATCTC
CCGAATCGAACAGATGTGCAGTGCCAGCACCGATGGCAGAAAGTACTAAACCCTGAGCTCATCAAGGGTC
CTTGGACCAAAGAAGAAGATCAGAGAGTGATAGAGCTTGTACAGAAATACGGTCCGAAACGTTGGTCTGT
TATTGCCAAGCACTTAAAGGGGAGAATTGGAAAACAATGTAGGGAGAGGTGGCATAACCACTTGAATCCA
GAAGTTAAGAAAACCTCCTGGACAGAAGAGGAAGACAGAATTATTTACCAGGCACACAAGAGACTGGGGA
ACAGATGGGCAGAAATCGCAAAGCTACTGCCTGGACGAACTGATAATGCTATCAAGAACCACTGGAATTC
TACAATGCGTCGGAAGGTCGAACAGGAAGGTTATCTGCAGGAGTCTTCAAAAGCCAGCCAGCCAGCAGTG
GCCACAAGCTTCCAGAAGAACAGTCATTTGATGGGTTTTGCTCAGGCTCCGCCTACAGCTCAACTCCCTG
CCACTGGCCAGCCCACTGTTAACAACGACTATTCCTATTACCACATTTCTGAAGCACAAAATGTCTCCAG
TCATGTTCCATACCCTGTAGCGTTACATGTAAATATAGTCAATGTCCCTCAGCCAGCTGCCGCAGCCATT
CAGAGACACTATAATGATGAAGACCCTGAGAAGGAAAAGCGAATAAAGGAATTAGAATTGCTCCTAATGT
CAACCGAGAATGAGCTAAAAGGACAGCAGACACAGAACCACACATGCAGCTACCCCGGGTGGCACAGCAC
CACCATTGCCGACCACACCAGACCTCATGGAGACAGTGCACCTGTTTCCTGTTTGGGAGAACACCACTCC
ACTCCATCTCTGCCAGCGGATCCTGGCTCCCTACCTGAAGAAAGCGCCTCGCCAGCAAGGTGCATGATCG
TCCACCAGGGCACCATTCTGGATAATGTTAAGAACCTCTTAGAATTTGCAGAAACACTCCAATTTATAGA
TTCTTTCTTAAACACTTCCAGTAACCATGAAAACTCAGACTTGGAAATGCCTTCTTTAACTTCCACCCCC
CTCATTGGTCACAAATTGACTGTTACAACACCATTTCATAGAGACCAGACTGTGAAAACTCAAAAGGAAA
ATACTGTTTTTAGAACCCCAGCTATCAAAAGGTCAATCTTAGAAAGCTCTCCAAGAACTCCTACACCATT
CAAACATGCACTTGCAGCTCAAGAAATTAAATACGGTCCCCTGAAGATGCTACCTCAGACACCCTCTCAT
CTAGTAGAAGATCTGCAGGATGTGATCAAACAGGAATCTGATGAATCTGGAATTGTTGCTGAGTTTCAAG
AAAATGGACCACCCTTACTGAAGAAAATCAAACAAGAGGTGGAATCTCCAACTGATAAATCAGGAAACTT
CTTCTGCTCACACCACTGGGAAGGGGACAGTCTGAATACCCAACTGTTCACGCAGACCTCGCCTGTGGCA
GATGCACCGAATATTCTTACAAGCTCCGTTTTAATGGCACCAGCATCAGAAGATGAAGACAATGTTCTCA
AAGCATTTACAGTACCTAAAAACAGGTCCCTGGCGAGCCCCTTGCAGCCTTGTAGCAGTACCTGGGAACC
TGCATCCTGTGGAAAGATGGAGGAGCAGATGACATCTTCCAGTCAAGCTCGTAAATACGTGAATGCATTC
TCAGCCCGGACGCTGGTCATGTGA

>NM_005375.3
ATGGCCCGAAGACCCCGGCACAGCATATATAGCAGTGACGAGGATGATGAGGACTTTGAGATGTGTGACC
ATGACTATGATGGGCTGCTTCCCAAGTCTGGAAAGCGTCACTTGGGGAAAACAAGGTGGACCCGGGAAGA
GGATGAAAAACTGAAGAAGCTGGTGGAACAGAATGGAACAGATGACTGGAAAGTTATTGCCAATTATCTC
CCGAATCGAACAGATGTGCAGTGCCAGCACCGATGGCAGAAAGTACTAAACCCTGAGCTCATCAAGGGTC
CTTGGACCAAAGAAGAAGATCAGAGAGTGATAGAGCTTGTACAGAAATACGGTCCGAAACGTTGGTCTGT
TATTGCCAAGCACTTAAAGGGGAGAATTGGAAAACAATGTAGGGAGAGGTGGCATAACCACTTGAATCCA
GAAGTTAAGAAAACCTCCTGGACAGAAGAGGAAGACAGAATTATTTACCAGGCACACAAGAGACTGGGGA
ACAGATGGGCAGAAATCGCAAAGCTACTGCCTGGACGAACTGATAATGCTATCAAGAACCACTGGAATTC
TACAATGCGTCGGAAGGTCGAACAGGAAGGTTATCTGCAGGAGTCTTCAAAAGCCAGCCAGCCAGCAGTG
GCCACAAGCTTCCAGAAGAACAGTCATTTGATGGGTTTTGCTCAGGCTCCGCCTACAGCTCAACTCCCTG
CCACTGGCCAGCCCACTGTTAACAACGACTATTCCTATTACCACATTTCTGAAGCACAAAATGTCTCCAG
TCATGTTCCATACCCTGTAGCGTTACATGTAAATATAGTCAATGTCCCTCAGCCAGCTGCCGCAGCCATT
CAGAGACACTATAATGATGAAGACCCTGAGAAGGAAAAGCGAATAAAGGAATTAGAATTGCTCCTAATGT
CAACCGAGAATGAGCTAAAAGGACAGCAGGTGCTACCAACACAGAACCACACATGCAGCTACCCCGGGTG
GCACAGCACCACCATTGCCGACCACACCAGACCTCATGGAGACAGTGCACCTGTTTCCTGTTTGGGAGAA
CACCACTCCACTCCATCTCTGCCAGCGGATCCTGGCTCCCTACCTGAAGAAAGCGCCTCGCCAGCAAGGT
GCATGATCGTCCACCAGGGCACCATTCTGGATAATGTTAAGAACCTCTTAGAATTTGCAGAAACACTCCA
ATTTATAGATTCTGATTCTTCATCATGGTGTGATCTCAGCAGTTTTGAATTCTTTGAAGAAGCAGATTTT
TCACCTAGCCAACATCACACAGGCAAAGCCCTACAGCTTCAGCAAAGAGAGGGCAATGGGACTAAACCTG
CAGGAGAACCTAGCCCAAGGGTGAACAAACGTATGTTGAGTGAGAGTTCACTTGACCCACCCAAGGTCTT
ACCTCCTGCAAGGCACAGCACAATTCCACTGGTCATCCTTCGAAAAAAACGGGGCCAGGCCAGCCCCTTA
GCCACTGGAGACTGTAGCTCCTTCATATTTGCTGACGTCAGCAGTTCAACTCCCAAGCGTTCCCCTGTCA
AAAGCCTACCCTTCTCTCCCTCGCAGTTCTTAAACACTTCCAGTAACCATGAAAACTCAGACTTGGAAAT
GCCTTCTTTAACTTCCACCCCCCTCATTGGTCACAAATTGACTGTTACAACACCATTTCATAGAGACCAG
ACTGTGAAAACTCAAAAGGAAAATACTGTTTTTAGAACCCCAGCTATCAAAAGGTCAATCTTAGAAAGCT
CTCCAAGAACTCCTACACCATTCAAACATGCACTTGCAGCTCAAGAAATTAAATACGGTCCCCTGAAGAT
GCTACCTCAGACACCCTCTCATCTAGTAGAAGATCTGCAGGATGTGATCAAACAGGAATCTGATGAATCT
GGAATTGTTGCTGAGTTTCAAGAAAATGGACCACCCTTACTGAAGAAAATCAAACAAGAGGTGGAATCTC
CAACTGATAAATCAGGAAACTTCTTCTGCTCACACCACTGGGAAGGGGACAGTCTGAATACCCAACTGTT
CACGCAGACCTCGCCTGTGGCAGATGCACCGAATATTCTTACAAGCTCCGTTTTAATGGCACCAGCATCA
GAAGATGAAGACAATGTTCTCAAAGCATTTACAGTACCTAAAAACAGGTCCCTGGCGAGCCCCTTGCAGC
CTTGTAGCAGTACCTGGGAACCTGCATCCTGTGGAAAGATGGAGGAGCAGATGACATCTTCCAGTCAAGC
TCGTAAATACGTGAATGCATTCTCAGCCCGGACGCTGGTCATGTGA

>NM_001130172.1
ATGGCCCGAAGACCCCGGCACAGCATATATAGCAGTGACGAGGATGATGAGGACTTTGAGATGTGTGACC
ATGACTATGATGGGCTGCTTCCCAAGTCTGGAAAGCGTCACTTGGGGAAAACAAGGTGGACCCGGGAAGA
GGATGAAAAACTGAAGAAGCTGGTGGAACAGAATGGAACAGATGACTGGAAAGTTATTGCCAATTATCTC
CCGAATCGAACAGATGTGCAGTGCCAGCACCGATGGCAGAAAGTACTAAACCCTGAGCTCATCAAGGGTC
CTTGGACCAAAGAAGAAGATCAGAGAGTGATAGAGCTTGTACAGAAATACGGTCCGAAACGTTGGTCTGT
TATTGCCAAGCACTTAAAGGGGAGAATTGGAAAACAATGTAGGGAGAGGTGGCATAACCACTTGAATCCA
GAAGTTAAGAAAACCTCCTGGACAGAAGAGGAAGACAGAATTATTTACCAGGCACACAAGAGACTGGGGA
ACAGATGGGCAGAAATCGCAAAGCTACTGCCTGGACGAACTGATAATGCTATCAAGAACCACTGGAATTC
TACAATGCGTCGGAAGGTCGAACAGGAAGGTTATCTGCAGGAGTCTTCAAAAGCCAGCCAGCCAGCAGTG
GCCACAAGCTTCCAGAAGAACAGTCATTTGATGGGTTTTGCTCAGGCTCCGCCTACAGCTCAACTCCCTG
CCACTGGCCAGCCCACTGTTAACAACGACTATTCCTATTACCACATTTCTGAAGCACAAAATGTCTCCAG
TCATGTTCCATACCCTGTAGCGTTACATGTAAATATAGTCAATGTCCCTCAGCCAGCTGCCGCAGCCATT
CAGAGACACTATAATGATGAAGACCCTGAGAAGGAAAAGCGAATAAAGGAATTAGAATTGCTCCTAATGT
CAACCGAGAATGAGCTAAAAGGACAGCAGGTGCTACCAACACAGAACCACACATGCAGCTACCCCGGGTG
GCACAGCACCACCATTGCCGACCACACCAGACCTCATGGAGACAGTGCACCTGTTTCCTGTTTGGGAGAA
CACCACTCCACTCCATCTCTGCCAGCGGATCCTGGCTCCCTACCTGAAGAAAGCGCCTCGCCAGCAAGGT
GCATGATCGTCCACCAGGGCACCATTCTGGATAATGTTAAGAACCTCTTAGAATTTGCAGAAACACTCCA
ATTTATAGATTCTTTCTTAAACACTTCCAGTAACCATGAAAACTCAGACTTGGAAATGCCTTCTTTAACT
TCCACCCCCCTCATTGGTCACAAATTGACTGTTACAACACCATTTCATAGAGACCAGACTGTGAAAACTC
AAAAGGAAAATACTGTTTTTAGAACCCCAGCTATCAAAAGGTCAATCTTAGAAAGCTCTCCAAGAACTCC
TACACCATTCAAACATGCACTTGCAGCTCAAGAAATTAAATACGGTCCCCTGAAGATGCTACCTCAGACA
CCCTCTCATCTAGTAGAAGATCTGCAGGATGTGATCAAACAGGAATCTGATGAATCTGGAATTGTTGCTG
AGTTTCAAGAAAATGGACCACCCTTACTGAAGAAAATCAAACAAGAGGTGGAATCTCCAACTGATAAATC
AGGAAACTTCTTCTGCTCACACCACTGGGAAGGGGACAGTCTGAATACCCAACTGTTCACGCAGACCTCG
CCTGTGGCAGATGCACCGAATATTCTTACAAGCTCCGTTTTAATGGCACCAGCATCAGAAGATGAAGACA
ATGTTCTCAAAGCATTTACAGTACCTAAAAACAGGTCCCTGGCGAGCCCCTTGCAGCCTTGTAGCAGTAC
CTGGGAACCTGCATCCTGTGGAAAGATGGAGGAGCAGATGACATCTTCCAGTCAAGCTCGTAAATACGTG
AATGCATTCTCAGCCCGGACGCTGGTCATGTGA

Muscle

muscle Iterative Progressive Alignment

muscle --help
muscle -in example.aa -out example.aa.muscle
muscle -in example.aa -out example.aa.muscle.clw -clw
muscle -in example.fasta -out example.fasta.muscle
muscle -in example.fasta -out example.fasta.muscle.clw -clw

ClustalO

HMM Hidden Markov Model

Global_vs_Local

local

https://www.ebi.ac.uk/Tools/msa/clustalo/

clustalo --help
clustalo -i example.aa -o example.aa.clustao
clustalo -i example.aa -o example.aa.clustao.clw --outfmt=clu
clustalo -i example.fasta -o example.fasta.clustalo
clustalo -i example.fasta -o example.fasta.clustalo.clw --outfmt=clu

Alignmnet Visualization

https://www.ebi.ac.uk/Tools/msa/mview/