Local Optimal Alignment Sequence Alignment: The Smith & Waterman Algorithm


Contents of this section:

Local Optimal Alignment

The local optimal alignment is a variation of the dynamic programming approach described by Needleman & Wunsch, to generate optimal local alignments. This is achieved almost entirely through a different choice of the scoring matrix and thus is a fine example how important the scoring matrix is in the process of sequence alignments.

 The Smith & Waterman algorithm constructs a path matrix just like the Needleman & Wunsch algorithm. The most important trick for local alignments is to provide an  expectation value for a random alignment as  negative. That causes random alignments and other stretches of mismatches to decrease the path score. After the whole path matrix is filled, the optimal local alignment is simply given by a path starting at the highest score overall in the path matrix(top left), containing all the contributing cells (considering gap creation and extension penalties) and then retracing the path backthru the matrix until the path score has dropped to zero.(bottom right)

 This algorithm is implemented in the UWGCG program BESTFIT. The following description is adapted from the help library of the GCG package. Look there for further deatails, especially on the alignment of very long sequences.
 
 

GCG program BESTFIT

BestFit uses the local homology algorithm of Smith and Waterman (Advances in Applied Mathematics 2:482-489 (1981)) to find the best segment of similarity between two sequences. BestFit inserts gaps to obtain the optimal alignment of the best region of similarity between two sequences, and then displays the alignment. The sequences can be of very different lengths and have only a small segment of similarity between them. You could take a short RNA sequence, for example, and run it against a whole mitochondrial genome. BestFit is the most powerful method in the UWGCG program package for identifying the best region of similarity between two sequences whose relationship is unknown.

 BestFit reads a scoring matrix that contains values for every possible GCG symbol match. The program uses these values to construct a path matrix that represents the entire surface of comparison with a score at every position for the best possible alignment to that point. The quality score for the best alignment to any point is equal to:
 
 

The gap creation and gap extension penalties are set by you. If the best path to any point has a negative value, a zero is put in that position.

 After the path matrix is complete, the highest value on the surface of comparison represents the end of the best region of similarity between the sequences. The best path from this highest value backwards to the point where the values revert to zero is the alignment shown by BestFit. This alignment is the best segment of similarity between the two sequences.

 For nucleic acids, the default scoring matrix has a match value of 1.0 for each identical symbol comparison and -0.90 for each non-identical comparison (not considering nucleotide ambiguity symbols for this example). The quality score for a nucleic acid alignment can, therefore, be determined using the following equation:
 
 

Quality = 
   1.00 x TotalMatches 
+ -0.90 x TotalMismatches 
- (GapCreationPenalty x GapNumber) 
- (GapExtensionPenalty x TotalLengthOfGaps)
The quality score for a protein alignment is calculated in a similar manner. However, while the default nucleic acid scoring matrix has a single value for all non-identical comparisons, the default protein scoring matrix has different values for the various non-identical amino acid comparisons. The quality score for a protein alignment can therefore be determined using the following equation (where Total(AA) is the total number of A-A (Ala-Ala) matches in the alignment, CmpVal(AA) is the value for an A-A comparison in the scoring matrix, Total(AB) is the total number of A-B (Ala-Asx) matches in the alignment, CmpVal(AB) is the value for an A-B comparison in the scoring matrix, ...) :
 
 
Quality = 
  CmpVal(AA) x Total(AA) 
+ CmpVal(AB) x Total(AB) 
+ CmpVal(AC) x Total(AC) 
. . . 
+ CmpVal(ZZ) x Total(ZZ) 
- (GapCreationPenalty x GapNumber) 
- (GapExtensionPenalty x TotalLengthOfGaps)
Gap and BestFit were originally written by Paul Haeberli from a careful reading of the Needleman and Wunsch (J. Mol. Biol. 48:443-453 (1970)) and the Smith and Waterman (Adv. Appl. Math. 2:482-489 (1981)) papers. Limited alignments were designed by Paul Haeberli and added to the Package for Version 3.0. They were united into a single program by Philip Delaquess for Version 4.0. Default gap penalties for protein alignments were modified according to the suggestions of Rechid, Vingron and Argos (CABIOS 5:107-113 (1989)).
 
 

BestFit Considerations

BESTFIT ALWAYS FINDS SOMETHING

 BestFit always finds an alignment for any two sequences you compare -- even if there is no significant similarity between them! You must evaluate the results critically to decide if the segment shown is not just a random region of relative
similarity.

 THE SEGMENTS SHOWN OBSCURE ALTERNATIVE SEGMENTS

 BestFit only shows one segment of similarity, so if there are several, all but one is obscured. You can approach this problem with graphic matrix analysis (see the Compare and DotPlot programs). Alternatively, you can run BestFit on ranges outside the ranges of similarity found in earlier runs to bring other segments out of the shadow of the best segment.

 THE BEST FIT IS ONLY ONE MEMBER OF A FAMILY

 Like all fast gapping algorithms, the alignment displayed is a member of the family of best alignments. This family may have other members of equal quality, but will not have any member with a higher quality. The family is usually significantly different for different choices of gap creation and gap extension penalties. See the CONSIDERATIONS topic in the entry for the Gap program in the PROGRAM MANUAL to learn more about how to assign gap creation and gap extension penalties.

 THE SURFACE OF COMPARISON

 The magnitude of the computer's job is proportional to the area of the surface of comparison. That area is determined by the product of the lengths of the two sequences compared. BestFit can evaluate a surface of up to 3.5 million elements. This surface would be large enough to compare two sequences approximately 1,870-symbols long, or one sequence 200-symbols long with another sequence 17,500-symbols long. When you have much longer sequences that are known to align well, you can use the command-line option /LIMit to use the surface more efficiently.

 THE PUBLIC SCORING MATRIX FOR NUCLEIC ACID COMPARISONS IS VERY STRINGENT
to do a 'real' Smith-Waterman alignment, choose a scoring matrix that is less stringent.

 The scoring matrix (called Genrundata:SWGapDNA.Cmp) penalizes mismatches -0.9 so the segments found may be very brief. This penalty means that the alignment cannot be extended by three bases to pick one extra match. The scoring matrix used by Smith and Waterman, when local alignments were first described, used -0.333 for the mismatch penalty. You can use Fetch to copy RandomDNA.Cmp and rename it SWGapDNA.Cmp to use these values, or use NWSGapDNA.Cmp, which has no mismatch penalty at all.
 
 

BestFit Example

The sequence Gamma.Seq contains an Alu family sequence somewhere in the first 500 bases. Alu.Seq contains a generic human Alu family repeat. The two sequences are aligned and the best segment of similarity is found with BestFit.

$ BestFit
BESTFIT of what sequence 1 ? Gamma.Seq
Begin (* 1 *) ?
End (* 11375 *) ? 500
Reverse (* No *) ?
to what sequence 2 (* Gamma.Seq *) ? Alu.Seq
Begin (* 1 *) ?
End (* 207 *) ?
Reverse (* No *) ?
What is the gap creation penalty (* 5.00 *) ?
What is the gap extension penalty (* 0.30 *) ?
What should I call the paired output display file (* Gamma.pair *)

 Aligning ..........-..

Gaps: 3
Quality: 129.3
Quality Ratio: 0.625 %
Similarity: 84.466
Length: 209
 
 

BestFit Output

Here is the output file. Notice how BestFit finds and displays only the best segments of similarity:

BESTFIT of: Gamma.Seq check: 6474 from: 1 to: 500

Human fetal beta globins G and A gamma from Shen, Slightom and Smithies, Cell 26. 191-203. Analyzed by Smithies et al. Cell 26. 345-353.
to: Alu.Seq check: 4238 from: 1 to: 207

HSREP2 from the EMBL data library Human Alu repetitive sequence located near the insulin gene Dhruva D.R., Shenk T., Subramanian K.N.. "Integration in vivo into Simian virus 40 DNA of a sequence that resembles a certain family of genomic interspersed repeated sequences". Proc. Natl. Acad. Sci. USA 77:4514-4518(1980)

 . . . .

Symbol comparison table: Gencoredisk:[Gcgcore.Data.Rundata]Swgapdna.Cmp CompCheck: 5234
 

Gap Weight: 5.000
Average Match: 1.000
Length Weight: 0.300
Average Mismatch: -0.900
Quality: 129.3
Length: 209
Ratio: 0.625
Gaps: 3
Percent Similarity: 84.466
Percent Identity: 84.466

Gamma.Seq x Alu.Seq June 20, 1994 15:15 .. . . . . . 
137 AGACCAACCTGGCCAACATGGTGAAATCCCATCTCTAC.AAAAATACAAA 185
    |||||| |||||||||||||||||||  |||||||||| |||||||||| 
  1 AGACCAGCCTGGCCAACATGGTGAAACTCCATCTCTACTGAAAATACAAA 50 

186 AATTAGACAGGCATGATGGCAAGTGCCTGTAATCCCAGCTACTTGGGAGG 235
    |||||| |||||||| ||    ||||||| |||||||||||||| ||||| 
 51 AATTAGCCAGGCATGGTGATGCGTGCCTGGAATCCCAGCTACTTAGGAGG 100

236 CTGAGGAAGGAGAATTGCTTGAACCTGGAAGGCAGGAGTTGCAGTGAGCC 285
    |||||  || |||||  ||| ||||  | |||  |  ||||||||||||| 
101 CTGAGACAGAAGAATCCCTTAAACCAAG.AGGTGGAGGTTGCAGTGAGCC 149

286 GAGATCATACCACTGCACTCCAGCCTGGGTGACAGAACAAGACTCTGTCT 335
    ||||||  ||  |||||||||||||| ||||||||| | ||||||  ||| 
150 GAGATCGCACGGCTGCACTCCAGCCT.GGTGACAGAGCGAGACTCCATCT 198

336 CAAAAAAAA 344 
    ||||||||| 
199 CAAAAAAAA 207

Quality, Ratio, Identity, and Similarity

BestFit and Gap display four figures of merit for alignments: Quality, Ratio, Identity, and Similarity.

 The Quality (described above) is the metric maximized in order to align the sequences.

 Ratio is the quality divided by the number of bases in the shorter segment.

 Percent Identity is the percent of the symbols that actually match.

 Percent Similarity is the percent of the symbols that are similar. Symbols that are across from gaps are ignored. A similarity is scored when the scoring matrix value for a pair of symbols is greater than or equal to 0.50, the similarity threshold. This threshold is also used by the display procedure to decide when to put a ':' (colon) between two aligned symbols. You can reset it from the command line with the second optional parameter of /PAIr. For instance, the expression /PAIr=1.0,0.5 would set the similarity threshold to 0.5.

 The similarity and identity metrics are not optimized by alignment programs so they should not be used to compare alignments.
 
 

Evaluating Alignment Significance

This program can help you evaluate the significance of the alignment, using a simple statistical method, with the /RANdomizations command line option. The second sequence is repeatedly shuffled, maintaining its length and composition, and then realigned to the first sequence. The average alignment score, plus or minus the standard deviation, of all randomized alignments is reported in the output file. You can compare this average quality score to the quality score of the actual alignment to help evaluate the significance of the alignment. The number of randomizations can be specified along with the /RANdomizations command line qualifier. the default is 10.

 The score of each randomized alignment is reported to the screen. You can use <CTRL>C to interrupt the randomizations and output the results from those randomized alignments that have been completed.

 By ignoring the statistical properties of biological sequences, this simple Monte Carlo statistical method may give misleading results. Please see Lipman, D.J, Wilbur, W.J., Smith, T.F., and Waterman, M.S. (Nucl. Acids Res. 12:215-226 (1984)) for a discussion of the statistical significance of nucleic acid similarities.
 
 

Local Data Files

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either
1) have a data file with exactly the same name in your current working directory. or
2) name a file on the command line with an expression like /DATa1=MYFILE.DAT.

For more information see Chapter 4, Using Data Files in the USER'S GUIDE. If the first sequence you name is a nucleic acid, BestFit uses the scoring matrix in the public file SWGapDNA.Cmp. (SW stands for Smith and Waterman.) If the first sequence you name is a peptide sequence, BestFit reads SWGapPep.Cmp instead. The presence of these files in your current working directory causes BestFit to read your version instead. (See the DATA FILES manual for more information about scoring matrices.)
 
 

Command Line Options

All parameters for this program may be put on the command line. Use the option /CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the qualifier names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose qualifiers or parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Basic Concepts: Using Programs in the USER'S GUIDE.

 Minimal Syntax:

$ BESTfit [/INfile1=]Gamma.Seq [/INfile2=]Alu.Seq /Default

 Prompted Parameters:

/BEGin1=1 /BEGin2=1 beginning of each sequence
/END1=500 /END2=207 end of each sequence
/NOREV1 /NOREV2 strand of each sequence
/GAPweight=5.0 gap creation penalty (3.0 is protein default)
/LENgthweight=0.3 gap extension penalty (0.1 is protein default)
[/OUTfile1=]Gamma.Pair output file for alignment

Local Data Files:

/DATa=SWGapDna.Cmp scoring matrix for nucleic acids
/DATa=SWGapPep.Cmp scoring matrix for peptides Optional Parameters:
/OUTfile2=Gamma.Gap new sequence file for sequence 1 with gaps added
/OUTfile3=Alu.Gap new sequence file for sequence 2 with gaps added
/LIMit1=499 /LIMit2=206 limit the surface of comparison
/RANdomizations[=10] determine average score from 10 randomized alignments
/PAIr=1.0,0.5,0.1 thresholds for displaying '|', ':', and '.'
/WIDth=50 the number of sequence symbols per line
/PAGe=60 adds a line with a form feed every 60 lines
/NOBIGGaps suppresses abbreviation of large gaps with '.'s
/HIGhroad makes the top alignment for your parameters
/LOWroad makes the bottom alignment for your parameters
/NOSUMmary suppresses the screen summary

 /LIMit1=20 and /LIMit2=20

let you set gap shift limits for each sequence. When you already know of a long similarity between two sequences you can "zip" them together using this mode. The beginning coordinates for each sequence must be near the beginning of the alignment you want to see. The alignment continues so that gaps inserted do not require the sequences to get out of step by more than the gap shift limits. You can align very long sequences rapidly. The surface of comparison is still limited to 3.5 million. The size of a comparison can be predicted by multiplying the average length of the two sequences by the sum of the two shift limits. If you add /LIMit to the command line without any qualifier value, the program prompts you to enter gap shift limits for each sequence.

 /RANdomizations=10

 reports the average alignment score and standard deviation from 10 randomized alignments in which the second sequence is repeatedly shuffled, maintaining the length and composition of the original sequence, and then aligned to the first sequence. You can use the optional parameter to set the number of randomized alignment to some number other than 10.

 /OUTfile2=SEQNAME1.GAP /OUTfile3=SEQNAME2.GAP

 This program can write three different output files. The first displays the alignment of sequence one with sequence two. The second is a new sequence file for sequence one, possibly expanded by gaps to make it align with sequence two. The third, like the second, is a new sequence file for sequence two, possibly expanded by gaps to make it align with sequence one. The program writes only the first file unless there are output file options on the command line. If there are any output files named on the command line, only those output files are written. If you add /OUT to the command line without any qualifying filename, then the program will write the second and third output files after prompting you for their names. Aligned sequences (in sequence files) can be displayed with GapShow. Their similarity can be displayed with PlotSimilarity.

 /PAIr=1.0,0.5,0.1

 The paired output file from this program displays sequence similarity by printing one of three characters between similar sequence symbols: a pipe character(|), a colon (:), or a period (.). Normally a pipe character is put between symbols that are the same, a colon is put between symbols whose comparison value is greater than or equal to 0.50, and a period is put between symbols whose comparison value is greater than or equal to 0.10. You can change these match display thresholds from the command line. The three parameters for /PAIr are the display thresholds for the pipe character, colon, and period. The match display criterion for a pipe character changes from symbolic identity (the default) to the quantitative threshold you have set in the first parameter. A pipe character will no longer be inserted between identical symbols unless their comparison values are greater than or equal to this threshold. If you still want a pipe character to connect identical symbols, use x instead of a number as the first parameter. (See the DATA FILES manual for more information about scoring matrices.)

 /PAGe=64

 When you print the output from this program, it may cross from one page to another in a frustrating way -- especially when you print on individual sheets. This option adds form feeds to the output file in order to try to keep clusters of related information together. You can set the number of lines per page by supplying a number after the /PAGe qualifier.

 /WIDth=50

 puts 50 sequence symbols on each line of the output file. You can set the width to anything from 10 to 150 symbols.

 /NOBIGGaps

 suppresses large gap abbreviations, showing all the sequence characters across from large gaps. Usually, gaps that extend one sequence by more than one complete line of output are abbreviated with three dots arranged in a vertical line.

 /LOWroad and /HIGhroad

 The insertion of gaps is, in many cases, arbitrary, and equally optimal alignments can be generated by inserting gaps differently. When equally optimal alignments are possible, this program can insert the gaps differently if you select either the /LOWroad or the /HIGhroad options.

Here are examples for the alignment of
GACCAT with
GACAT
with different parameters.

For: Match = 1.0 MisMatch = -0.9 Gap weight = 1.0 Length Weight = 0.0 

LowRoad: 
1 GACCAT 6 
  || |||      Quality = 4.0 
1 GA.CAT 5 

HighRoad: 
1 GACCAT 6 
  ||| ||      Quality = 4.0 
1 GAC.AT 5 


For: Match = 1.0 MisMatch = 0.0 Gap weight = 3.0 Length Weight = 0.0

HighRoad: 
1 GACCAT 6 
  |||         Quality = 3.0 
1 GACAT. 5 

LowRoad: 
1 GACCAT 6 
     |||      Quality = 3.0 
1 .GACAT 5
Essentially the low road shifts all of the arbitrary gaps in sequence two to the left and all of the arbitrary gaps in sequence one to the right. The high road does exactly the opposite. When neither high road nor low road is selected, the program tries not to insert a gap whenever that is possible and uses the high road alternative for all collisions.

 /SUMmary

 writes a summary of the program's work to the screen when you've used the /Default qualifier to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with /NOSUMmary. Use this qualifier also to include a summary of the program's work in the log file for a program run in batch.
 
 

Related Programs

When you want an alignment that covers the whole length of both sequences, use Gap. When you are trying to find only the best segment of similarity between two sequences, use BestFit. PileUp creates a multiple sequence alignment of a group of related sequences, aligning the whole length of all sequences. DotPlot displays the entire surface of comparison for a comparison of two sequences. GapShow displays the pattern of differences between two aligned sequences. PlotSimilarity plots the average similarity of two or more aligned sequences at each position in the alignment. Pretty displays alignments of several sequences. LineUp is an editor for editing multiple sequence alignments. CompTable helps generate scoring matrices for peptide comparison.