Wednesday, December 12. 2007

Fun With String Searching

Jake Voytko posted an interesting article on string searching algorithms. He presents three algorithms: brute force, Boyer-Moore, and Rabin-Karp. However, his implementation has two small bugs, and his Rabin-Karp can be improved considerably.

Brute force is the easiest to implement: it compares the search pattern and the text character by character. If a character does not match, the search pattern is shifted forward by one character and the comparison starts again. As you might expect, this is not the fastest algorithm, and Jake provides a pathological case where brute force takes a very long time.
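For reference, the whole algorithm fits in a few lines (a minimal sketch with my own naming, not Jake's exact code):

#include <cstddef>
#include <string>
using namespace std;

// Brute force: try every starting position and compare character by character.
size_t brute_force_search(const string& text, const string& pattern)
{
    if (pattern.empty() || pattern.size() > text.size())
        return string::npos;

    for (size_t i = 0; i + pattern.size() <= text.size(); ++i)
    {
        size_t j = 0;
        while (j < pattern.size() && text[i + j] == pattern[j])
            ++j;
        if (j == pattern.size())
            return i;   // all characters matched: found at position i
    }
    return string::npos;
}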

The Boyer-Moore algorithm also does a character-by-character comparison, but it works from right to left. On a mismatch, the search pattern can be shifted not just by one position but by as many as needed to align the closest occurrence of the mismatching text character within the search pattern. This way, a mismatch usually lets you shift the search pattern by several characters, up to the full length of the pattern. As you might imagine, this increases search speed considerably. However, pathological cases still exist, one of which Jake provides in his article.
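The long shifts come from a precomputed table: for each possible character, it records how far the window may safely jump. Here is a minimal sketch of just this bad-character heuristic (essentially the Boyer-Moore-Horspool simplification, which always shifts on the text character aligned with the pattern's end; full Boyer-Moore adds a second, good-suffix rule):

#include <cstddef>
#include <string>
using namespace std;

size_t boyer_moore_horspool_search(const string& text, const string& pattern)
{
    const size_t m = pattern.size(), n = text.size();
    if (m == 0 || m > n)
        return string::npos;

    // Bad-character table: how far the window may jump, based on the
    // text character currently aligned with the last pattern position.
    size_t shift[256];
    for (size_t c = 0; c < 256; ++c)
        shift[c] = m;
    for (size_t i = 0; i + 1 < m; ++i)
        shift[(unsigned char)pattern[i]] = m - 1 - i;

    size_t pos = 0;
    while (pos + m <= n)
    {
        size_t j = m;   // compare the window right to left
        while (j > 0 && text[pos + j - 1] == pattern[j - 1])
            --j;
        if (j == 0)
            return pos; // the whole window matched
        pos += shift[(unsigned char)text[pos + m - 1]];
    }
    return string::npos;
}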

The Rabin-Karp algorithm takes a different approach: instead of comparing character by character, it compares hash values. First, it computes the hash value of the pattern. Then it computes the hash value of a substring window of the text of the same length as the pattern. If the hash values are equal, the algorithm does a brute-force comparison of the pattern and the text substring to verify the match. If there is a mismatch or the hash values differ, the substring window moves forward one character. Here comes the specialty of the algorithm: it can calculate the hash value of the new substring window by just subtracting one value and adding one value. Jake's article shows you the details. As the Rabin-Karp algorithm moves character by character through the text, it is not the fastest algorithm. However, as the hash values of the pattern and the substring window rarely match unless pattern and substring are actually equal, the comparison is really fast. Furthermore, it's hard to come up with a pathological case.
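For illustration, here is a minimal sketch of the classic scheme with a polynomial rolling hash (base and modulus are arbitrary choices of mine, not necessarily the constants Jake uses):

#include <cstddef>
#include <string>
using namespace std;

size_t rabin_karp_search(const string& text, const string& pattern)
{
    const size_t m = pattern.size(), n = text.size();
    if (m == 0 || m > n)
        return string::npos;

    const unsigned long long b = 256, q = 1000000007ULL; // base and modulus

    unsigned long long high = 1;   // b^(m-1) mod q, the weight of the leftmost character
    for (size_t i = 1; i < m; ++i)
        high = (high * b) % q;

    unsigned long long hp = 0, ht = 0;
    for (size_t i = 0; i < m; ++i)   // hash the pattern and the first window
    {
        hp = (hp * b + (unsigned char)pattern[i]) % q;
        ht = (ht * b + (unsigned char)text[i]) % q;
    }

    for (size_t pos = 0; ; ++pos)
    {
        // On a hash equality, verify with a direct comparison to rule out collisions.
        if (hp == ht && text.compare(pos, m, pattern) == 0)
            return pos;
        if (pos + m >= n)
            break;
        // Roll the window: subtract the leftmost character, append the next one.
        ht = (ht + q - ((unsigned char)text[pos] * high) % q) % q;
        ht = (ht * b + (unsigned char)text[pos + m]) % q;
    }
    return string::npos;
}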

Jake compares the three algorithms by loading large files and then searching for a pattern, carefully choosing pathological patterns to illustrate the weaknesses of the brute force and Boyer-Moore algorithms. However, his implementation has a small glitch which blurs the illustration. Furthermore, the text file for the pathological case of brute force is not so pathological. And last, Rabin-Karp can be improved considerably.

Jake reads in the text files like this:

string load_text(const string& filename)
{
    // In order to avoid expensive concatenations, we will load
    // the contents of the file into a vector, and then create a string
    // of the proper size
    vector<string> contents;

    ifstream infile(filename.c_str());
    if(!infile)
        throw runtime_error("Unable to open file");

    string temp;
    int size = 0;

    while(getline(infile, temp))
    {
        size+=temp.size();
        contents.push_back(temp);
    }
    infile.close();

    string to_ret(size, ' ');
    unsigned int ind = 0;
    for(unsigned int i = 0; i < contents.size(); ++i)
    {
        for(unsigned int j = 0; j < contents[i].size(); ++j)
        {
            to_ret[ind++] = contents[i][j];
        }
    }

    return to_ret;
}

First, he reads the text file line by line into a vector. Then he creates a string of an appropriate size, and finally he copies the file contents character by character into the string.

Besides being awfully slow (at the very least he could have concatenated the lines stored in the vector), this introduces a small bug: getline() drops the newline. Thus, the reported match position of the pattern does not match the position in the text file, because none of the newlines between the start of the file and the match position are counted.
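If you want to keep his line-by-line approach, the minimal fix is to put the dropped newline back, roughly like this (only the reading loop changes):

    while(getline(infile, temp))
    {
        temp += '\n';   // getline() strips the newline; restore it
        size += temp.size();
        contents.push_back(temp);
    }

(This appends a newline even after a final line that doesn't end in one, which is harmless for the benchmark.)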

Furthermore, since the missing newlines concatenate all lines of the file into one large line, his text file for the pathological case of Boyer-Moore becomes not so pathological: the concatenation of the first two lines in the file creates a match position which Boyer-Moore finds rather quickly. The pathological text file for the brute force algorithm already contains a match position at the beginning of the file. In both cases, having the match position at the end of the file strengthens the illustration of the algorithms' weaknesses: brute force and Boyer-Moore take several seconds, while Rabin-Karp still takes the same time of less than 0.5 seconds. By the way, a faster (and correct) way to read in a file is

string load_text(const string& filename) {
    ifstream infile(filename.c_str());
    if (!infile)
        throw runtime_error("Unable to open file");

    string result;
    static const int buffer_size = 1 << 13;
    char buffer[buffer_size];
    while (infile) {
        // read() keeps the newlines; gcount() reports how many
        // characters the last (possibly partial) read returned
        infile.read(buffer, buffer_size);
        result.append(buffer, infile.gcount());
    }

    return result;
}

The Rabin-Karp algorithm's speed can also be improved. Instead of calculating a more or less complex hash function which involves taking the modulo, you can use a very simple one: just add up the character values. On my machine, this resulted in a speedup of about 2.
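With a plain sum, the rolling update degenerates to one subtraction and one addition, without any multiplication or modulo. Here is a minimal sketch of this variant (my own code and naming, not Jake's implementation):

#include <cstddef>
#include <string>
using namespace std;

size_t rabin_karp_additive_search(const string& text, const string& pattern)
{
    const size_t m = pattern.size(), n = text.size();
    if (m == 0 || m > n)
        return string::npos;

    // The hash of a window is simply the sum of its character values.
    unsigned long hp = 0, ht = 0;
    for (size_t i = 0; i < m; ++i)
    {
        hp += (unsigned char)pattern[i];
        ht += (unsigned char)text[i];
    }

    for (size_t pos = 0; ; ++pos)
    {
        if (hp == ht && text.compare(pos, m, pattern) == 0)
            return pos;
        if (pos + m >= n)
            break;
        ht -= (unsigned char)text[pos];       // drop the leftmost character
        ht += (unsigned char)text[pos + m];   // add the next character
    }
    return string::npos;
}

Here are the results with the same example files as Jake used. First, the normal case: searching for a string in a novel.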

Loading Moby Dick
Text length is 1210051 characters
Loading text took 0.003307 seconds
Searching for pattern "devious-cruising Rachel"...459 hash equalities 
.
Timings:
Brute force:     0.048142 seconds found pattern at position 1209941
Boyer-Moore:     0.010583 seconds found pattern at position 1209941
Rabin-Karp:      0.045427 seconds found pattern at position 1209941
Rabin-Karp_AGB:  0.020273 seconds found pattern at position 1209941

Rabin-Karp_AGB is the Rabin-Karp variant with the simple additive hash function. Boyer-Moore is the fastest algorithm, closely followed by Rabin-Karp_AGB, even though 458 of the 459 hash equalities did not result in a string match. The regular Rabin-Karp takes about the same time as brute force.

Here's the pathological case for the brute force algorithm. Note that the match position is now at the end of the file, not at the beginning as in Jake's post:

Loading Brute Force torture test
Text length is 11015600 characters
Loading text took 0.031743 seconds
Searching for pattern "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"...1 hash equalities 
.
Timings:
Brute force:     6.877955 seconds found pattern at position 11015557
Boyer-Moore:     1.075159 seconds found pattern at position 11015557
Rabin-Karp:      0.421158 seconds found pattern at position 11015557
Rabin-Karp_AGB:  0.182705 seconds found pattern at position 11015557

Brute force takes over 6 seconds to find the pattern. Rabin-Karp takes about 10 times longer than in the novel example above, but the text is also about 10 times larger, so Rabin-Karp performs consistently well. So does Rabin-Karp_AGB, which is the fastest algorithm in this case.

Finally, the pathological case for the Boyer-Moore algorithm. Note again that the match position is at the end of the file, not at the beginning as in Jake's article:

Loading Boyer-Moore torture test
Text length is 11015656 characters
Loading text took 0.030001 seconds
Searching for pattern "baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"...141 hash equalities 
.
Timings:
Brute force:     0.372420 seconds found pattern at position 11015613
Boyer-Moore:     8.293481 seconds found pattern at position 11015613
Rabin-Karp:      0.421905 seconds found pattern at position 11015613
Rabin-Karp_AGB:  0.183333 seconds found pattern at position 11015613

Boyer-Moore takes over 8 seconds to find the pattern, while Rabin-Karp performs as well as in the pathological case for brute force (where brute force itself does much better). Still, Rabin-Karp_AGB is again the fastest algorithm.

To be fair, the 'improved' Rabin-Karp_AGB easily allows for pathological cases: searching for aaaaf in a text file which repeatedly contains baaaebaaaebaaae... results in a hash equality at every position, yet there is never a match (with ASCII values, 'a'+'a'+'a'+'a'+'f' = 490 = 'b'+'a'+'a'+'a'+'e', and every five-character window of that text has the same sum). Rabin-Karp_AGB performs as badly as brute force in that case. The reason it is difficult to find pathological cases for the regular Rabin-Karp is that the value a character adds to the hash depends on its position in the pattern. This is not the case for Rabin-Karp_AGB.

To conclude, Jake's small glitches are easily fixed, and the good and consistent performance of Rabin-Karp depends heavily on its position-dependent hash function.