The Smith & Waterman algorithm constructs a path matrix just like the Needleman & Wunsch algorithm. The most important trick for local alignments is to provide an expectation value for a random alignment as negative. That causes random alignments and other stretches of mismatches to decrease the path score. After the whole path matrix is filled, the optimal local alignment is simply given by a path starting at the highest score overall in the path matrix(top left), containing all the contributing cells (considering gap creation and extension penalties) and then retracing the path backthru the matrix until the path score has dropped to zero.(bottom right)
This algorithm is implemented in the UWGCG program BESTFIT.
The following description is adapted from the help library of the GCG package.
Look there for further deatails, especially on the alignment of very long
sequences.
BestFit reads a scoring matrix that contains values for every
possible GCG symbol match. The program uses these values to construct a
path matrix that represents the entire surface of comparison
with a score at every position for the best possible alignment to that
point. The quality score for the best alignment to any point is equal to:
After the path matrix is complete, the highest value on the surface of comparison represents the end of the best region of similarity between the sequences. The best path from this highest value backwards to the point where the values revert to zero is the alignment shown by BestFit. This alignment is the best segment of similarity between the two sequences.
For nucleic acids, the default scoring matrix has a match value
of 1.0 for each identical symbol comparison and -0.90 for each non-identical
comparison (not considering nucleotide ambiguity symbols for this example).
The quality score for a nucleic acid alignment can, therefore, be determined
using the following equation:
Quality = 1.00 x TotalMatches + -0.90 x TotalMismatches - (GapCreationPenalty x GapNumber) - (GapExtensionPenalty x TotalLengthOfGaps)The quality score for a protein alignment is calculated in a similar manner. However, while the default nucleic acid scoring matrix has a single value for all non-identical comparisons, the default protein scoring matrix has different values for the various non-identical amino acid comparisons. The quality score for a protein alignment can therefore be determined using the following equation (where Total(AA) is the total number of A-A (Ala-Ala) matches in the alignment, CmpVal(AA) is the value for an A-A comparison in the scoring matrix, Total(AB) is the total number of A-B (Ala-Asx) matches in the alignment, CmpVal(AB) is the value for an A-B comparison in the scoring matrix, ...) :
Quality = CmpVal(AA) x Total(AA) + CmpVal(AB) x Total(AB) + CmpVal(AC) x Total(AC) . . . + CmpVal(ZZ) x Total(ZZ) - (GapCreationPenalty x GapNumber) - (GapExtensionPenalty x TotalLengthOfGaps)Gap and BestFit were originally written by Paul Haeberli from a careful reading of the Needleman and Wunsch (J. Mol. Biol. 48:443-453 (1970)) and the Smith and Waterman (Adv. Appl. Math. 2:482-489 (1981)) papers. Limited alignments were designed by Paul Haeberli and added to the Package for Version 3.0. They were united into a single program by Philip Delaquess for Version 4.0. Default gap penalties for protein alignments were modified according to the suggestions of Rechid, Vingron and Argos (CABIOS 5:107-113 (1989)).
BestFit always finds an alignment for any two sequences you compare
-- even if there is no significant similarity between them! You must evaluate
the results critically to decide if the segment shown is not just a random
region of relative
similarity.
THE SEGMENTS SHOWN OBSCURE ALTERNATIVE SEGMENTS
BestFit only shows one segment of similarity, so if there are several, all but one is obscured. You can approach this problem with graphic matrix analysis (see the Compare and DotPlot programs). Alternatively, you can run BestFit on ranges outside the ranges of similarity found in earlier runs to bring other segments out of the shadow of the best segment.
THE BEST FIT IS ONLY ONE MEMBER OF A FAMILY
Like all fast gapping algorithms, the alignment displayed is a member of the family of best alignments. This family may have other members of equal quality, but will not have any member with a higher quality. The family is usually significantly different for different choices of gap creation and gap extension penalties. See the CONSIDERATIONS topic in the entry for the Gap program in the PROGRAM MANUAL to learn more about how to assign gap creation and gap extension penalties.
THE SURFACE OF COMPARISON
The magnitude of the computer's job is proportional to the area of the surface of comparison. That area is determined by the product of the lengths of the two sequences compared. BestFit can evaluate a surface of up to 3.5 million elements. This surface would be large enough to compare two sequences approximately 1,870-symbols long, or one sequence 200-symbols long with another sequence 17,500-symbols long. When you have much longer sequences that are known to align well, you can use the command-line option /LIMit to use the surface more efficiently.
THE PUBLIC SCORING MATRIX FOR NUCLEIC ACID COMPARISONS IS VERY
STRINGENT
to do a 'real' Smith-Waterman alignment, choose a scoring matrix that
is less stringent.
The scoring matrix (called Genrundata:SWGapDNA.Cmp)
penalizes mismatches -0.9 so the segments found may be very brief. This
penalty means that the alignment cannot be extended by three bases to pick
one extra match. The scoring matrix used by Smith and Waterman, when local
alignments were first described, used -0.333 for the mismatch penalty.
You can use Fetch to copy RandomDNA.Cmp and rename it SWGapDNA.Cmp
to use these values, or use NWSGapDNA.Cmp, which has no
mismatch penalty at all.
$ BestFit
BESTFIT of what sequence 1 ? Gamma.Seq
Begin (* 1 *) ?
End (* 11375 *) ? 500
Reverse (* No *) ?
to what sequence 2 (* Gamma.Seq *) ? Alu.Seq
Begin (* 1 *) ?
End (* 207 *) ?
Reverse (* No *) ?
What is the gap creation penalty (* 5.00 *) ?
What is the gap extension penalty (* 0.30 *) ?
What should I call the paired output display file (* Gamma.pair
*)
Aligning ..........-..
Gaps: 3
Quality: 129.3
Quality Ratio: 0.625 %
Similarity: 84.466
Length: 209
BESTFIT of: Gamma.Seq check: 6474 from: 1 to: 500
Human fetal beta globins G and A gamma from Shen, Slightom and Smithies,
Cell 26. 191-203. Analyzed by Smithies et al. Cell 26. 345-353.
to: Alu.Seq check: 4238 from: 1 to: 207
HSREP2 from the EMBL data library Human Alu repetitive sequence located near the insulin gene Dhruva D.R., Shenk T., Subramanian K.N.. "Integration in vivo into Simian virus 40 DNA of a sequence that resembles a certain family of genomic interspersed repeated sequences". Proc. Natl. Acad. Sci. USA 77:4514-4518(1980)
. . . .
Symbol comparison table: Gencoredisk:[Gcgcore.Data.Rundata]Swgapdna.Cmp
CompCheck: 5234
Gap Weight: 5.000
Average Match: 1.000
Length Weight: 0.300
Average Mismatch: -0.900
Quality: 129.3
Length: 209
Ratio: 0.625
Gaps: 3
Percent Similarity: 84.466
Percent Identity: 84.466
Gamma.Seq x Alu.Seq June 20, 1994 15:15 .. . . . . . 137 AGACCAACCTGGCCAACATGGTGAAATCCCATCTCTAC.AAAAATACAAA 185 |||||| ||||||||||||||||||| |||||||||| |||||||||| 1 AGACCAGCCTGGCCAACATGGTGAAACTCCATCTCTACTGAAAATACAAA 50 186 AATTAGACAGGCATGATGGCAAGTGCCTGTAATCCCAGCTACTTGGGAGG 235 |||||| |||||||| || ||||||| |||||||||||||| ||||| 51 AATTAGCCAGGCATGGTGATGCGTGCCTGGAATCCCAGCTACTTAGGAGG 100 236 CTGAGGAAGGAGAATTGCTTGAACCTGGAAGGCAGGAGTTGCAGTGAGCC 285 ||||| || ||||| ||| |||| | ||| | ||||||||||||| 101 CTGAGACAGAAGAATCCCTTAAACCAAG.AGGTGGAGGTTGCAGTGAGCC 149 286 GAGATCATACCACTGCACTCCAGCCTGGGTGACAGAACAAGACTCTGTCT 335 |||||| || |||||||||||||| ||||||||| | |||||| ||| 150 GAGATCGCACGGCTGCACTCCAGCCT.GGTGACAGAGCGAGACTCCATCT 198 336 CAAAAAAAA 344 ||||||||| 199 CAAAAAAAA 207
The Quality (described above) is the metric maximized in order to align the sequences.
Ratio is the quality divided by the number of bases in the shorter segment.
Percent Identity is the percent of the symbols that actually match.
Percent Similarity is the percent of the symbols that are similar. Symbols that are across from gaps are ignored. A similarity is scored when the scoring matrix value for a pair of symbols is greater than or equal to 0.50, the similarity threshold. This threshold is also used by the display procedure to decide when to put a ':' (colon) between two aligned symbols. You can reset it from the command line with the second optional parameter of /PAIr. For instance, the expression /PAIr=1.0,0.5 would set the similarity threshold to 0.5.
The similarity and identity metrics are not optimized by alignment
programs so they should not be used to compare alignments.
The score of each randomized alignment is reported to the screen. You can use <CTRL>C to interrupt the randomizations and output the results from those randomized alignments that have been completed.
By ignoring the statistical properties of biological sequences,
this simple Monte Carlo statistical method may give misleading results.
Please see Lipman, D.J, Wilbur, W.J., Smith, T.F., and Waterman, M.S. (Nucl.
Acids Res. 12:215-226 (1984)) for a discussion of the statistical
significance of nucleic acid similarities.
For more information see Chapter 4, Using Data Files in the USER'S GUIDE.
If the first sequence you name is a nucleic acid, BestFit uses the scoring
matrix in the public file SWGapDNA.Cmp. (SW stands for
Smith and Waterman.) If the first sequence you name is a peptide sequence,
BestFit reads SWGapPep.Cmp instead. The presence of these
files in your current working directory causes BestFit to read your version
instead. (See the DATA FILES manual for more information about scoring
matrices.)
Minimal Syntax:
$ BESTfit [/INfile1=]Gamma.Seq [/INfile2=]Alu.Seq /Default
Prompted Parameters:
/BEGin1=1 /BEGin2=1 beginning of each sequence
/END1=500 /END2=207 end of each sequence
/NOREV1 /NOREV2 strand of each sequence
/GAPweight=5.0 gap creation penalty (3.0 is protein
default)
/LENgthweight=0.3 gap extension penalty (0.1 is protein
default)
[/OUTfile1=]Gamma.Pair output file for alignment
Local Data Files:
/DATa=SWGapDna.Cmp scoring matrix for nucleic acids
/DATa=SWGapPep.Cmp scoring matrix for peptides Optional
Parameters:
/OUTfile2=Gamma.Gap new sequence file for sequence
1 with gaps added
/OUTfile3=Alu.Gap new sequence file for sequence 2
with gaps added
/LIMit1=499 /LIMit2=206 limit the surface of comparison
/RANdomizations[=10] determine average score from 10
randomized alignments
/PAIr=1.0,0.5,0.1 thresholds for displaying '|', ':',
and '.'
/WIDth=50 the number of sequence symbols per line
/PAGe=60 adds a line with a form feed every 60 lines
/NOBIGGaps suppresses abbreviation of large gaps with
'.'s
/HIGhroad makes the top alignment for your parameters
/LOWroad makes the bottom alignment for your parameters
/NOSUMmary suppresses the screen summary
/LIMit1=20 and /LIMit2=20
let you set gap shift limits for each sequence. When you already know of a long similarity between two sequences you can "zip" them together using this mode. The beginning coordinates for each sequence must be near the beginning of the alignment you want to see. The alignment continues so that gaps inserted do not require the sequences to get out of step by more than the gap shift limits. You can align very long sequences rapidly. The surface of comparison is still limited to 3.5 million. The size of a comparison can be predicted by multiplying the average length of the two sequences by the sum of the two shift limits. If you add /LIMit to the command line without any qualifier value, the program prompts you to enter gap shift limits for each sequence.
/RANdomizations=10
reports the average alignment score and standard deviation from 10 randomized alignments in which the second sequence is repeatedly shuffled, maintaining the length and composition of the original sequence, and then aligned to the first sequence. You can use the optional parameter to set the number of randomized alignment to some number other than 10.
/OUTfile2=SEQNAME1.GAP /OUTfile3=SEQNAME2.GAP
This program can write three different output files. The first displays the alignment of sequence one with sequence two. The second is a new sequence file for sequence one, possibly expanded by gaps to make it align with sequence two. The third, like the second, is a new sequence file for sequence two, possibly expanded by gaps to make it align with sequence one. The program writes only the first file unless there are output file options on the command line. If there are any output files named on the command line, only those output files are written. If you add /OUT to the command line without any qualifying filename, then the program will write the second and third output files after prompting you for their names. Aligned sequences (in sequence files) can be displayed with GapShow. Their similarity can be displayed with PlotSimilarity.
/PAIr=1.0,0.5,0.1
The paired output file from this program displays sequence similarity by printing one of three characters between similar sequence symbols: a pipe character(|), a colon (:), or a period (.). Normally a pipe character is put between symbols that are the same, a colon is put between symbols whose comparison value is greater than or equal to 0.50, and a period is put between symbols whose comparison value is greater than or equal to 0.10. You can change these match display thresholds from the command line. The three parameters for /PAIr are the display thresholds for the pipe character, colon, and period. The match display criterion for a pipe character changes from symbolic identity (the default) to the quantitative threshold you have set in the first parameter. A pipe character will no longer be inserted between identical symbols unless their comparison values are greater than or equal to this threshold. If you still want a pipe character to connect identical symbols, use x instead of a number as the first parameter. (See the DATA FILES manual for more information about scoring matrices.)
/PAGe=64
When you print the output from this program, it may cross from one page to another in a frustrating way -- especially when you print on individual sheets. This option adds form feeds to the output file in order to try to keep clusters of related information together. You can set the number of lines per page by supplying a number after the /PAGe qualifier.
/WIDth=50
puts 50 sequence symbols on each line of the output file. You can set the width to anything from 10 to 150 symbols.
/NOBIGGaps
suppresses large gap abbreviations, showing all the sequence characters across from large gaps. Usually, gaps that extend one sequence by more than one complete line of output are abbreviated with three dots arranged in a vertical line.
/LOWroad and /HIGhroad
The insertion of gaps is, in many cases, arbitrary, and equally optimal alignments can be generated by inserting gaps differently. When equally optimal alignments are possible, this program can insert the gaps differently if you select either the /LOWroad or the /HIGhroad options.
Here are examples for the alignment of
GACCAT with
GACAT
with different parameters.
For: Match = 1.0 MisMatch = -0.9 Gap weight = 1.0 Length Weight = 0.0 LowRoad: 1 GACCAT 6 || ||| Quality = 4.0 1 GA.CAT 5 HighRoad: 1 GACCAT 6 ||| || Quality = 4.0 1 GAC.AT 5 For: Match = 1.0 MisMatch = 0.0 Gap weight = 3.0 Length Weight = 0.0 HighRoad: 1 GACCAT 6 ||| Quality = 3.0 1 GACAT. 5 LowRoad: 1 GACCAT 6 ||| Quality = 3.0 1 .GACAT 5Essentially the low road shifts all of the arbitrary gaps in sequence two to the left and all of the arbitrary gaps in sequence one to the right. The high road does exactly the opposite. When neither high road nor low road is selected, the program tries not to insert a gap whenever that is possible and uses the high road alternative for all collisions.
/SUMmary
writes a summary of the program's work to the screen when you've
used the /Default qualifier to suppress all program interaction.
A summary typically displays at the end of a program run interactively.
You can suppress the summary for a program run interactively with /NOSUMmary.
Use this qualifier also to include a summary of the program's work in the
log file for a program run in batch.