Oncaphillis::niftyLib

Using std::set as a Suffix Array - The power of comparators

This might not be the most efficient implementation of a Suffix Array, but I thing it has its charme.

Suffix Array - The basic concept

The basic concept of a Suffix Array is a simple ordered set of suffix strings. If we have an ordered list of all possible suffixes within the string we can implement substring searches with a simple binary search to determine if a given substring can be found within the string.

So in a way one might build up a suffix array simply as.

std::string _str="banana";
{
// a simple collection of std::string
std::set<std::string> _set;
// Iterate over all suffix strings including the magic
// "end of string" terminus and insert
for(int i=0;i<=_str.length();i++) {
_set.insert(_str.substr(i));
}
for(auto s : _set) {
std::cerr << "'" << s << "'" << std::endl;;
}
return 0;
}

Using a custom comparator

Of course that's not achieving much since (1) we loose the index (the most precious part for later substring searches) and (2) we waste lots of space since each substring gets stored individually. So for a string of length 10 we would have to store sum(1...10)==55 characters for all suffixes. It doesn't even have academical values but may be some other variant of std::set<...> can be used instead:.

So what do we need ?

  • We only want to store the string once.
  • We only want to store the index of the first character of a particular suffix in our array.
  • The index has to be ordered by the lexicographic order of the suffix string it refers to.

All in all this sounds like a std::set<int> with a custom comparator class.

The declaration of std::set<...> is:

template < class T, // set::key_type/value_type
class Compare = less<T>, // set::key_compare/value_compare
class Alloc = allocator<T> // set::allocator_type
> class set;

With the Compare class being used to establish element order and uniqueness. It defaults to less<T> which tries to apply the < operator on it's two arguments.

Using it in a std::set

If however we provide our own custom comparator with the string we would like to construct a suffix array from we might use it to order the index value by the substring comparison like this:

// Our comparator for two index values referring to the
// same string referred to by _str.
public:
// The constructor takes the whole string to be indexed
// as an argument and holds a const reference.
SuffixComparator(const std::string & str) : _str(str) {
}
// Compare to index values. This is a simple
// substring comparison including the special
// "end of string" marker.
bool operator()(int a,int b) {
// We replace an index pointing behind the strings end
// with our special purpose "end of string" marker.
a = a == _str.length() ? -1 : a;
b = b == _str.length() ? -1 : b;
// if one of the two index values point after the string
// we just have to compare the two
if( a == -1 || b == -1 )
return a < b;
// Otherwise we do a simple string comparison
for(;a < _str.length() && b < _str.length();a++,b++) {
if(_str[a] != _str[b]) {
return _str[a] < _str[b];
}
}
// If we end up here the two substrings
// are completely identical up to the end of one
// of them. The shorter wins
return _str.length() - a < _str.length() - b;
}
private:
const std::string & _str;
};

When building the Suffix Array we might simply do:

int main() {
std::string str="banana";
std::set<int,SuffixComparator> se({},SuffixComparator(str));
for(int i=0;i<=str.length();i++) {
se.insert(i);
}
for(auto i : se) {
std::cerr << i << ":'" << str.sub-str(i) << "'" << std::endl;;
}
}

... which gives us:

6:''
5:'a'
3:'ana'
1:'anana'
0:'banana'
4:'na'
2:'nana'

Voila – suffix index values in the proper order.


Generated by  doxygen
© 2008-2011; Dr. Sebastian Kloska ( Oncaphillis )
Powered by: [?]