How to Find the Longest Common Substring in C++

Muhammad Husnain Feb 02, 2024
  1. Longest Common Substring
  2. Problem Formulation of the Longest Common Substring
How to Find the Longest Common Substring in C++

In this article, we will discuss the longest common substring problem and its solution using dynamic programming in C++.

Longest Common Substring

The longest common substring (LCS) is a well-known problem in computer science. A maximum length substring in two or more strings is the longest common substring.

For example, consider the following three strings:

  1. XYXYZ
  2. YXYZX
  3. XYZYX

The longest common substring of these three strings is XYZ. Other common substrings exist in these three strings like X, XY, Y, YX, YZ, and Z; however, the longest common substring is XYZ only.

Figure 1 shows the longest common substring of the above three strings:

Longest Common Substring

Problem Formulation of the Longest Common Substring

Consider two strings, S1 and S2, with lengths m and n, respectively, to find the exact substring from both S1 and S2, with the maximum length.

We will present two solutions to this problem:

  1. A Simple Solution
  2. Dynamic Programming Solution

A Simple Solution

Following are the steps to find the longest common substring from two strings using the simple solution:

  • Find all substrings of the first string S1.
  • For every substring of S1, check whether it is a substring of S2 or not.
  • Note that the maximum length substring matches the S2 substring.
  • As the length of the string S1 is m, there are $\mathcal{O}(m^2)$ substrings.
  • The length of string S2 is n, and $\mathcal{O}(n)$ is required to find whether a string is a substring or not.
  • The total time complexity required to perform this solution is $\mathcal{O}(nm^2)$.

This solution can find the longest common substring; however, it will take time, so let’s look at the dynamic programming solution to achieve the same goal in less time.

Find the Longest Common Substring Through Dynamic Programming

We can use dynamic programming (DP) to find the longest common substring (LCS) from two strings. Following is the pseudocode for finding the longest common substring using the dynamic programming approach:

function Longest_Common_Substring(S1[1..m], S2[1..n])
    DP_Matrix = array(1..m, 1..n)
    len = 0
    result_substring = {}
    for i = 1..m
        for j = 1..n
            if S1[i] == S2[j]
                if i = 1 or j = 1
                    DP_Matrix[i, j] = 1
                else
                    DP_Matrix[i, j] = DP_Matrix[i − 1, j − 1] + 1
                if DP_Matrix[i, j] > len
                    len = DP_Matrix[i, j]
                    result_substring = {S1[i − len + 1..i]}
                else if DP_Matrix[i, j] == len
                    result_substring = result_substring ∪ {S1[i − len + 1..i]}
            else
                DP_Matrix[i, j] = 0
    return result_substring

The above solution takes quadratic (i.e., $\mathcal{O}(mn)$) time complexity to calculate the longest common substring. The solution is optimal and always guarantees the longest sub-match.

Note: The arrays in pseudocodes start from index 1. Moreover, this algorithm works fine only with non-empty strings.

The DP_Matrix is a 2D array or matrix used as a dynamic programming table to store the substring length for the prefixes S1[1..i] and S2[1..j] with ending positions S1[i] and S2[j].

The variable len is used to store the current largest result_substring length, which has been found so far. The substring is used to store the string of length len.

All the cells in the first row and column of the DP_matrix are initialized to 1. An objective function is used to calculate the remaining cells.

The objective function for this formulation can be presented as:

  1. For a given cell DP_Matrix[i,j], if the corresponding characters of given string sequences (i.e., characters at S1[i] and S2[j]) are matching, then we add 1 to the value of cell DP_Matrix[i-1,j-1].
    • If the updated value of cell DP_Matrix[i,j] is greater than the previous length of substring, then update len with this new value.
    • Simultaneously, save this new maximum string to an auxiliary string holder (i.e., result_substring).
  2. If the corresponding characters are not matching (i.e., a mismatch case), then set the cell DP_Matrix[i, j] with a zero value, indicating that only a new sub-string can start from here.

After all the M*N iterations (i.e., end of the for loops), the result_substring will have an optimally largest common sub-string.

Longest Common Substring Dynamic Programming Solution in C++

The following code is used to find the longest common substring from the strings using dynamic programming in C++.

#include <iostream>
#include <string>
using namespace std;

string Longest_Common_Substring(string Str1, string Str2) {
  string result_substring = "";
  int m = Str1.length();
  int n = Str2.length();
  int DP_Matrix[m + 1][n + 1];
  int len = 0;
  int end = 0;

  for (int i = 0; i <= m; i++) {
    for (int j = 0; j <= n; j++) {
      if (Str1[i] == Str2[j]) {
        if ((i == 0) || (j == 0))
          DP_Matrix[i][j] = 1;
        else
          DP_Matrix[i][j] = 1 + DP_Matrix[i - 1][j - 1];

        if (DP_Matrix[i][j] > len) {
          len = DP_Matrix[i][j];
          int start = i - DP_Matrix[i][j] + 1;
          if (end == start) {
            result_substring.push_back(Str1[i]);
          } else {
            end = start;
            result_substring.clear();
            result_substring.append(Str1.substr(end, (i + 1) - end));
          }
        }
      } else {
        DP_Matrix[i][j] = 0;
      }
    }
  }
  return result_substring;
}

int main() {
  string String1, String2;
  cout << "Enter the first string: ";
  cin >> String1;
  cout << "Enter the second string: ";
  cin >> String2;
  string substring = Longest_Common_Substring(String1, String2);
  cout << "LCS of " << String1 << " and " << String2 << " is " << substring
       << " of Length " << substring.length() << endl;
  return 0;
}

The above code contains the function Longest_Common_Substring, which accepts two strings (i.e., String1 and String2) as a parameter and returns their optimal largest common substring. The Longest_Common_Substring function uses the dynamic programming table method to find this LCS.

In the above code, the DP_Matrix is a 2D integer array that stores the current best substring length for the strings sub-pair S1[1..i] and S2[1..j]. The result_substring, as discussed in the pseudocode explanation, is used to store the longest substring found.

Let’s look at the outputs of the above code for different string pairs:

Output 1:

Enter the first string: ATGCACATA
Enter the second string: GCAATGCACT
LCS of ATGCACATA and GCAATGCACT is ATGCAC of Length 6

Output 2:

Enter the first string: ABCDDFA
Enter the second string: DCDDACDDF
LCS of ABCDDFA and DCDDACDDF is CDDF of Length 4

Besides the theoretical or mathematical proof, the above outputs suggest that the discussed dynamic programming solution is optimal.

Muhammad Husnain avatar Muhammad Husnain avatar

Husnain is a professional Software Engineer and a researcher who loves to learn, build, write, and teach. Having worked various jobs in the IT industry, he especially enjoys finding ways to express complex ideas in simple ways through his content. In his free time, Husnain unwinds by thinking about tech fiction to solve problems around him.

LinkedIn

Related Article - C++ String