How does one detect when two strings overlap? In the case below, the four-letter suffix of string 1 matches the four-letter prefix of string 2.
1: "Fire at Will"
2: "William Riker is number one"
Sometimes there are several matches; finding the longest is not always straightforward.
1: "Have some CoCo and CoCo"
2: "CoCo and CoCo is here."
The naïve solution is to take ever smaller substrings of each string, compare them, then bail out when the first match is found. This is quick to program, very reliable, and very inefficient for long strings.
    def commonOverlapNaive(text1, text2):
        x = min(len(text1), len(text2))
        while x > 0:
            if text1[-x:] == text2[:x]:
                break
            x -= 1
        return x
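As a quick sanity check, here is the same function as a runnable snippet, applied to the two examples from the top of the article (four letters of "Will", then the thirteen-letter "CoCo and CoCo"):

```python
def commonOverlapNaive(text1, text2):
    # Try the longest possible overlap first, shrinking until a
    # suffix of text1 equals a prefix of text2.
    x = min(len(text1), len(text2))
    while x > 0:
        if text1[-x:] == text2[:x]:
            break
        x -= 1
    return x

print(commonOverlapNaive("Fire at Will", "William Riker is number one"))
# → 4
print(commonOverlapNaive("Have some CoCo and CoCo", "CoCo and CoCo is here."))
# → 13
```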
A more efficient solution is to use a modified version of the Knuth-Morris-Pratt algorithm to scan the strings while keeping track of partial matches. This is very tricky to program, with many opportunities for off-by-one errors.
    def commonOverlapKmp(text1, text2):
        # Cache the text lengths to prevent multiple calls.
        text1_length = len(text1)
        text2_length = len(text2)
        # Eliminate the null case.
        if text1_length == 0 or text2_length == 0:
            return 0
        # Truncate the longer string.
        if text1_length > text2_length:
            text1 = text1[-text2_length:]
        elif text1_length < text2_length:
            text2 = text2[:text1_length]
        text_length = min(text1_length, text2_length)
        # Quick check for the worst case.
        if text1 == text2:
            return text_length
        # Build partial match table from text2.
        table = [0] * text_length
        table[0] = -1
        #table[1] = 0
        pos = 2
        cnd = 0
        while pos < text_length:
            if text2[pos - 1] == text2[cnd]:
                cnd += 1
                table[pos] = cnd
                pos += 1
            elif cnd > 0:
                cnd = table[cnd]
            else:
                table[pos] = 0
                pos += 1
        # Search text1.
        m = 0
        i = 0
        while m + i < text_length:
            if text2[i] == text1[m + i]:
                i += 1
                if m + i == text_length:
                    return i
            else:
                m += i - table[i]
                if table[i] > -1:
                    i = table[i]
                else:
                    i = 0
        return 0
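The heart of KMP is the partial match table: table[i] records how many characters of the pattern's own prefix reappear as a suffix just before position i, which is what lets the search resume without backtracking. Here is the table-building step pulled out as a standalone sketch (the pattern "abababc" is just an illustrative choice):

```python
def partial_match_table(pattern):
    # Same construction as in commonOverlapKmp: table[i] is the length
    # of the longest proper prefix of pattern that is also a suffix of
    # pattern[:i], with the conventional sentinel table[0] = -1.
    table = [0] * len(pattern)
    table[0] = -1
    pos = 2
    cnd = 0
    while pos < len(pattern):
        if pattern[pos - 1] == pattern[cnd]:
            cnd += 1
            table[pos] = cnd
            pos += 1
        elif cnd > 0:
            cnd = table[cnd]
        else:
            table[pos] = 0
            pos += 1
    return table

print(partial_match_table("abababc"))
# → [-1, 0, 0, 1, 2, 3, 4]
```

After a mismatch at position i, the search jumps back to table[i] instead of restarting at zero; that is the entire source of KMP's linear running time.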
A somewhat simpler approach is to leverage the highly efficient indexOf function that is built into most languages (it is called find in Python). Start by assuming a single letter overlap and search for that letter in the second string. If found, then check the two substrings for equality. Then use indexOf to locate any instance of the substring plus one character. Keep repeating until no matches are found, then return the last confirmed substring match.
    def commonOverlapIndexOf(text1, text2):
        # Cache the text lengths to prevent multiple calls.
        text1_length = len(text1)
        text2_length = len(text2)
        # Eliminate the null case.
        if text1_length == 0 or text2_length == 0:
            return 0
        # Truncate the longer string.
        if text1_length > text2_length:
            text1 = text1[-text2_length:]
        elif text1_length < text2_length:
            text2 = text2[:text1_length]
        # Quick check for the worst case.
        if text1 == text2:
            return min(text1_length, text2_length)
        # Start by looking for a single character match
        # and increase length until no match is found.
        best = 0
        length = 1
        while True:
            pattern = text1[-length:]
            found = text2.find(pattern)
            if found == -1:
                return best
            length += found
            if text1[-length:] == text2[:length]:
                best = length
                length += 1
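The skip logic is easier to follow with the bookkeeping stripped away. This compact variant is my own sketch of the same idea, not the tuned version above; the key line is `length += found`, which uses the position reported by find to jump the candidate length forward rather than growing it one character at a time:

```python
def overlap_via_find(text1, text2):
    # Only the last n chars of text1 and first n of text2 can overlap.
    n = min(len(text1), len(text2))
    text1, text2 = text1[-n:], text2[:n]
    best = 0
    length = 1
    while length <= n:
        # Where does the current suffix of text1 appear in text2?
        found = text2.find(text1[-length:])
        if found == -1:
            return best
        # Any real overlap must be at least this long, so jump ahead.
        length += found
        if text1[-length:] == text2[:length]:
            best = length
        length += 1
    return best

print(overlap_via_find("Fire at Will", "William Riker is number one"))
# → 4
```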
Now that we have three functions that all do the same thing, which
one is faster? Well, that depends. Let's feed random strings to each
function, like this:
cAx&"[|J{aL[xJu081e:(grxnV`kOOe#&`y#AxfA/;o2~WVE1qMUVqk~ ^]>...
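The harness itself is straightforward. A sketch of how such a benchmark might look, shown here for the naïve function (the alphabet, sizes, seed, and the use of timeit are my assumptions, not details from the original experiment; the other two functions would be timed identically):

```python
import random
import string
import timeit

def commonOverlapNaive(text1, text2):
    x = min(len(text1), len(text2))
    while x > 0:
        if text1[-x:] == text2[:x]:
            break
        x -= 1
    return x

def random_text(rng, n):
    # A large printable alphabet keeps coincidental matches rare.
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return ''.join(rng.choice(alphabet) for _ in range(n))

rng = random.Random(42)
for n in (100, 1000, 10000):
    text1 = random_text(rng, n)
    text2 = random_text(rng, n)
    seconds = timeit.timeit(lambda: commonOverlapNaive(text1, text2),
                            number=10)
    print(n, seconds)
```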
The resulting logarithmic timing plots are fairly clear. The naïve
algorithm scales worse than O(n), though it beats KMP until string
lengths reach 10,000. KMP is solidly O(n) as advertised. The IndexOf
algorithm is also O(n) -- but one hundred times faster than KMP.
However, the IndexOf algorithm relies on there not being too many
coincidental matches. Let's see what happens when we feed it a
pathological input string, like this:
baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
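Such a pair is easy to construct (the exact benchmark input isn't shown, so this pair is an assumption of mine): against a text that starts with one odd character and then repeats, every find() call succeeds one position in, so the candidate length grows by only a character or two per scan, and each scan is itself linear. That is roughly O(n²) in total, for an overlap that turns out to be zero.

```python
n = 10000
text1 = "a" * n              # every suffix is a run of 'a'
text2 = "b" + "a" * (n - 1)  # find() always succeeds at index 1,
# so commonOverlapIndexOf rescans nearly the whole string for each
# candidate length, even though no suffix of text1 can ever equal a
# prefix of text2 (which starts with 'b').
```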
The resulting logarithmic timing plots show a sudden change in
behaviour. The naïve and KMP algorithms are unchanged, while IndexOf
becomes the worst of the bunch.
Now we have a problem. The best algorithm can become the worst
algorithm, depending on the inputs. KMP offers a consistently scalable
solution regardless of input, but potentially at a cost of 100x
performance. Let's take a closer look at the inputs. Can we recreate
the pathological behaviour with real-world data? Let's start by
creating fractal strings with lots of repetitions, like this:
N>NeN>N`N>NeN>NVN>NeN>N`N>NeN>NRN>NeN>N`N>NeN>NVN>NeN>N`N>Ne...
This is similar to patterns found in source code. The resulting
logarithmic timing plots show that performance returns to near-optimum
levels.
A major user of string manipulation utilities these days is the
genetics world. DNA has an alphabet of four letters, A, C, G and T.
This will certainly result in large numbers of coincidental matches.
Let's download the human genome, like this:
TCGGTCTCCTTTGGGTAATTTTCCATTATGTCATAACAGTAAATATTAATATGTGCTCCT...
The resulting logarithmic timing plots once again show this as
near-optimum. I fed it a fifth of a human and got an answer back in
fifteen seconds.
Computer Science (always suspect any discipline which feels the need to add the word 'science' to its name) is often an exercise in compromises. Does one choose the algorithm that works best most of the time (IndexOf)? Does one choose the algorithm that will never fail badly (KMP)? Does one choose the algorithm that's easiest to program and least likely to contain bugs (Naïve)? Does one spawn off two threads each running different algorithms to execute on separate processors with the winner terminating the loser?
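That last option can be sketched in a few lines. Python threads cannot be forcibly terminated (and the GIL keeps pure-Python threads from running in parallel, so a real implementation would use separate processes and kill the loser), but the shape of the idea looks like this, with hypothetical stand-in functions in place of the real algorithms:

```python
import concurrent.futures
import time

def commonOverlapRace(text1, text2, algorithms):
    # Submit every algorithm and return whichever answer arrives first.
    # The losers are simply left to finish in the background, since
    # Python offers no safe way to kill a running thread.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(algorithms))
    futures = [pool.submit(algo, text1, text2) for algo in algorithms]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False)
    return done.pop().result()

# Stand-ins for commonOverlapIndexOf and commonOverlapKmp: both return
# the same (correct) answer, just at very different speeds.
def quick(text1, text2):
    return 4

def slow(text1, text2):
    time.sleep(0.2)
    return 4

print(commonOverlapRace("Fire at Will", "William Riker is number one",
                        [quick, slow]))
# → 4
```

Since all the algorithms compute the same answer, correctness doesn't depend on which one wins; only the running time does.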
Special thanks to Tancred Lindholm for continually asking awkward questions every time I thought I was finished.

Source: http://neil.fraser.name/news/2010/11/04/?