How to Implement Smith-Waterman Algorithm in Python

Aditya Raj Feb 15, 2024

The Smith-Waterman algorithm is a dynamic programming algorithm used for local sequence alignment. It is particularly useful in bioinformatics for comparing biological sequences, such as DNA, RNA, or protein sequences.

This algorithm is capable of identifying the optimal local alignment between two sequences, taking into account gaps, mismatches, and matches.

In this article, we will explore the step-by-step implementation of the Smith-Waterman algorithm in Python. We will cover the algorithm’s key concepts, provide a Python implementation, and discuss potential applications.

Understanding the Smith-Waterman Algorithm

The Smith-Waterman algorithm, developed by Temple F. Smith and Michael S. Waterman in 1981, is a dynamic programming algorithm designed for local sequence alignment. Local alignment aims to identify the most significant similarities between subsequences of two given sequences, making it a valuable tool in various fields, particularly bioinformatics.

While global alignment algorithms like Needleman-Wunsch find the best alignment across the entire length of both sequences, Smith-Waterman looks for the best-scoring pair of subsequences. This makes it particularly useful when the sequences differ greatly in length or share only short regions of similarity.

The algorithm employs a dynamic programming approach to find the optimal local alignment. This involves breaking down the problem into smaller subproblems and solving them iteratively, using the solutions of overlapping subproblems to derive the overall optimal solution.

Smith-Waterman uses a scoring system to assign values to matches, mismatches, and gap penalties. This scoring system allows for the quantification of the quality of an alignment.

The user can customize the scoring parameters based on the specific characteristics of the sequences being compared.
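As a minimal sketch of such a scoring scheme, the snippet below uses match = 2, mismatch = -1, and a gap penalty of -1, the same values as the worked example in this article. The names substitution_score and GAP_PENALTY are illustrative and not part of any library.

```python
def substitution_score(a, b, match=2, mismatch=-1):
    """Return the score for aligning character a against character b."""
    return match if a == b else mismatch


# Added to the score for every position that is aligned against a gap.
GAP_PENALTY = -1

# Score a gapless pairing of two short strings, column by column.
total = sum(substitution_score(a, b) for a, b in zip("TACGC", "TATGC"))
print(total)  # 4 matches and 1 mismatch: 4 * 2 - 1 = 7
```

Changing the default values is all it takes to make matches more valuable or mismatches more costly for a particular kind of sequence.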

Let’s delve into the key steps of the algorithm with the help of illustrations to enhance our understanding.

Initialize a Matrix to Store Alignment Scores

The algorithm begins by creating a matrix to store alignment scores.

The matrix has one more row and one more column than the lengths of the two sequences being aligned. This extra first row and first column are initialized with zeros.

For illustration, let’s consider two DNA sequences: AGTACGCA and TATGC. We initialize a matrix with dimensions (length(seq1) + 1) x (length(seq2) + 1).

   |   | T | A | T | G | C |
---|---|---|---|---|---|---|
   | 0 | 0 | 0 | 0 | 0 | 0 |
 A | 0 |   |   |   |   |   |
 G | 0 |   |   |   |   |   |
 T | 0 |   |   |   |   |   |
 A | 0 |   |   |   |   |   |
 C | 0 |   |   |   |   |   |
 G | 0 |   |   |   |   |   |
 C | 0 |   |   |   |   |   |
 A | 0 |   |   |   |   |   |

The first row and column are initialized with zeros since they represent the scores for aligning with an empty sequence.
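In Python, this initialization step could look like the following sketch, which uses plain nested lists. The variable names are illustrative.

```python
seq1 = "AGTACGCA"  # labels the rows
seq2 = "TATGC"     # labels the columns

rows, cols = len(seq1) + 1, len(seq2) + 1

# Build each row separately so the rows do not share one underlying list.
score_matrix = [[0] * cols for _ in range(rows)]

print(rows, cols)  # 9 6
```

Because every cell starts at 0, the first row and column need no further work; only the remaining cells are filled in during the scoring step.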

Calculate Scores Based on the Scoring Scheme

Next, we iterate through each cell of the matrix, calculating scores based on matches, mismatches, and gap penalties. Each cell takes the largest of four values: 0, the diagonal neighbor's score plus the match or mismatch score, the score of the cell above plus the gap penalty, or the score of the cell to the left plus the gap penalty. The floor at 0 is what makes the alignment local: a poorly scoring stretch resets the score instead of dragging it negative.
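The fill step can be sketched as below, assuming a linear gap penalty (every gap position costs the same). Alongside the diagonal, up, and left options, each cell is floored at zero, which is what distinguishes local from global alignment. The function name fill_matrix and its default parameters are illustrative.

```python
def fill_matrix(seq1, seq2, match=2, mismatch=-1, gap=-1):
    """Fill a Smith-Waterman score matrix using a linear gap penalty."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh local alignment
                H[i - 1][j - 1] + s,  # align seq1[i-1] with seq2[j-1]
                H[i - 1][j] + gap,    # seq1[i-1] aligned against a gap
                H[i][j - 1] + gap,    # seq2[j-1] aligned against a gap
            )
    return H


H = fill_matrix("AGTACGCA", "TATGC")
for row in H:
    print(row)
print(max(max(row) for row in H))  # 7
```

The two nested loops visit every cell once, so the running time grows with the product of the two sequence lengths.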

Here, the scores are determined by comparing the corresponding characters of the two sequences.

   |   | T | A | T | G | C |
---|---|---|---|---|---|---|
   | 0 | 0 | 0 | 0 | 0 | 0 |
 A | 0 | 0 | 2 | 1 | 0 | 0 |
 G | 0 | 0 | 1 | 1 | 3 | 2 |
 T | 0 | 2 | 1 | 3 | 2 | 2 |
 A | 0 | 1 | 4 | 3 | 2 | 1 |
 C | 0 | 0 | 3 | 3 | 2 | 4 |
 G | 0 | 0 | 2 | 2 | 5 | 4 |
 C | 0 | 0 | 1 | 1 | 4 | 7 |
 A | 0 | 0 | 2 | 1 | 3 | 6 |

Here, we use match = 2, mismatch = -1, and gap_penalty = -1 as scoring parameters. Each cell holds the best score of any local alignment that ends at that pair of positions. For instance, the cell in the first G row and the G column is 3 because its diagonal neighbor holds 1 and the two characters match (1 + 2 = 3).

Traceback to Find the Aligned Subsequences

After scoring, we identify the highest-scoring cell in the matrix.

A traceback then starts from this cell and, at each step, moves to the neighboring cell (diagonal, up, or left) from which the current score was derived. It stops when it reaches a cell with a score of 0. This traceback reconstructs the optimal local alignment.

In our example, the maximum score is 7 at (7, 5), that is, row 7 (the second C of AGTACGCA) and column 5 (the C of TATGC). The cells visited by the traceback are shown in brackets.

   |   | T | A | T | G | C |
---|---|---|---|---|---|---|
   | 0 | 0 | 0 | 0 | 0 | 0 |
 A | 0 | 0 | 2 | 1 | 0 | 0 |
 G | 0 | 0 | 1 | 1 | 3 | 2 |
 T | 0 |[2]| 1 | 3 | 2 | 2 |
 A | 0 | 1 |[4]| 3 | 2 | 1 |
 C | 0 | 0 | 3 |[3]| 2 | 4 |
 G | 0 | 0 | 2 | 2 |[5]| 4 |
 C | 0 | 0 | 1 | 1 | 4 |[7]|
 A | 0 | 0 | 2 | 1 | 3 | 6 |

By tracing back from the highest-scoring cell (7, 5), we reconstruct the aligned sequences:

Aligned Sequence 1: TACGC
Aligned Sequence 2: TATGC

This represents the optimal local alignment between the two sequences with a score of 7: four matches (4 x 2 = 8) and one mismatch (-1), with no gaps.
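The fill and traceback steps can be combined into one self-contained sketch. It assumes a linear gap penalty and, when several cells tie for the maximum, keeps the first one found; the function name smith_waterman and its return layout are illustrative choices, not a library API.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-1):
    """Return (score, end_cell, aligned1, aligned2) for the best local alignment."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix and remember the highest-scoring cell.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Walk back from the best cell until a zero-valued cell is reached.
    aligned1, aligned2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:   # came from the diagonal
            aligned1.append(seq1[i - 1])
            aligned2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:   # came from above: gap in seq2
            aligned1.append(seq1[i - 1])
            aligned2.append("-")
            i -= 1
        else:                                # came from the left: gap in seq1
            aligned1.append("-")
            aligned2.append(seq2[j - 1])
            j -= 1

    return best, best_pos, "".join(reversed(aligned1)), "".join(reversed(aligned2))


score, end_cell, a1, a2 = smith_waterman("AGTACGCA", "TATGC")
print(score, end_cell)  # 7 (7, 5)
print(a1)               # TACGC
print(a2)               # TATGC
```

The traceback collects characters from the end of the alignment toward its start, so both lists are reversed before being joined.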

Implement the Smith-Waterman Algorithm in Python

The swalign module provides a convenient way to implement the Smith-Waterman algorithm in Python. Follow the steps below to integrate the algorithm into your Python program using the swalign module.

To get started, we need to install the swalign module. Open your command line or terminal and execute the following command for Python 3:

pip3 install swalign

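If pip on your system already points to Python 3, or if you are maintaining a legacy Python 2 environment, use the following command instead: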

pip install swalign

Now that you have the swalign module installed, follow these steps to see a sample implementation of the Smith-Waterman algorithm in Python:

  • Start by importing the swalign module into your Python script:
import swalign
  • To perform the alignment, a scoring matrix for nucleotides must be defined, specifying scores for matches and mismatches. In this example, we’ll use a match score of 2 and a mismatch score of -1.

    We will create the nucleotide scoring matrix by instantiating the NucleotideScoringMatrix class:

match_score = 2
mismatch_score = -1
matrix = swalign.NucleotideScoringMatrix(match_score, mismatch_score)
  • Next, create a LocalAlignment object using the scoring matrix:
lalignment_object = swalign.LocalAlignment(matrix)
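  • Now, apply the Smith-Waterman algorithm using the align() method on the LocalAlignment object. Provide two strings representing the DNA strands as input. Note that align() treats its first argument as the reference and its second as the query, which is why dna_string appears on the Ref line of the output shown later, despite its name: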
dna_string = "ATCCACAGC"
reference_string = "ATGCAGCGC"
alignment_object = lalignment_object.align(dna_string, reference_string)
  • The result of the alignment is stored in an Alignment object. Calling its dump() method prints the aligned sequences along with the alignment score, the match and mismatch counts, and a CIGAR string:
alignment_object.dump()

Here’s the complete Python script demonstrating the implementation of the Smith-Waterman algorithm:

import swalign

match_score = 2
mismatch_score = -1
matrix = swalign.NucleotideScoringMatrix(match_score, mismatch_score)

lalignment_object = swalign.LocalAlignment(matrix)

dna_string = "ATCCACAGC"
reference_string = "ATGCAGCGC"
alignment_object = lalignment_object.align(dna_string, reference_string)

alignment_object.dump()

Code Output:

Query:  1 ATGCAGC-GC 9
          ||.|| | ||
Ref  :  1 ATCCA-CAGC 9

Score: 11
Matches: 7 (70.0%)
Mismatches: 3
CIGAR: 5M1I1M1D2M

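This output provides information about the alignment score, matches, mismatches, and the CIGAR string, offering insights into the local alignment between the two DNA sequences. The score of 11 is consistent with 7 matches (7 x 2 = 14), 1 mismatch (-1), and 2 gap positions at -1 each, which suggests that LocalAlignment applied a gap penalty of -1 by default here. The Mismatches count of 3 appears to include the two gap columns. The swalign module simplifies the implementation of the Smith-Waterman algorithm, making it accessible for a wide range of applications.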
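To make the CIGAR string easier to read, here is a small sketch that splits it into its operations using only the standard library. It assumes the usual CIGAR meanings: M is a column where both sequences have a character (a match or a mismatch), I is an insertion in the query relative to the reference, and D is a deletion from the query relative to the reference. The helper parse_cigar is hypothetical and not part of swalign.

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string such as '5M1I1M1D2M' into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MID])", cigar)]


ops = parse_cigar("5M1I1M1D2M")
columns = sum(n for n, _ in ops)                 # total alignment columns
gaps = sum(n for n, op in ops if op in "ID")     # columns containing a gap

print(ops)      # [(5, 'M'), (1, 'I'), (1, 'M'), (1, 'D'), (2, 'M')]
print(columns)  # 10
print(gaps)     # 2
```

The 10 columns and 2 gap positions agree with the printed alignment, where the Ref line has one gap (the I operation) and the Query line has one gap (the D operation).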

Conclusion

This guide has walked you through how the Smith-Waterman algorithm works and how to run it in Python using the swalign module. By following these steps, you can perform local sequence alignment for DNA strands or, with a suitable scoring matrix, protein sequences, gaining insights into matching and mismatching regions.

The Smith-Waterman algorithm is a powerful tool in bioinformatics, and Python’s swalign module makes it accessible for researchers and developers alike.
