Suffix Arrays and their Construction
– A suffix array is a sorted array of all suffixes of a string.
– Suffix arrays were introduced as an alternative to suffix trees.
– The first in-place suffix array construction algorithm was developed in 2016.
– Enhanced suffix arrays reproduce the functionality of suffix trees.
– Suffix arrays require less space than suffix trees.
– A suffix array requires 4 bytes per integer, while a suffix tree requires 20 bytes per node.
– Compressed suffix arrays and BWT-based indices require less space than suffix arrays.
– The SA-IS algorithm is a fast and space-efficient suffix array construction algorithm.
– Suffix array construction algorithms differ based on the supported alphabet.
– Prefix doubling algorithms find prefixes that honor the lexicographic ordering of suffixes.
– Recursive algorithms recursively sort a subset of suffixes and merge the suffix arrays.
– Induced copying algorithms sort the remaining suffixes using an already sorted subset.
– The DC3/skew algorithm is a recursive algorithm for integer alphabets.
– Dynamic suffix arrays update the suffix array of an edited text instead of rebuilding it.
– Yuta Mori’s DivSufSort is the fastest known suffix sorting algorithm in main memory.
– Ilya Grebnov presented a faster implementation of the algorithm in 2021.
– The algorithm showed a 65% performance improvement over DivSufSort on Silesia Corpus.
– Dynamic suffix arrays are more efficient than rebuilding for inserting letters in the original text.
– Open source routines such as qsufsort and DivSufSort are commonly used for suffix array construction.
Generalized Suffix Arrays and their Applications
– A generalized suffix array contains all suffixes for a set of strings.
– It is lexicographically sorted with all suffixes of each string.
– Suffix sorting algorithms can be used to compute the Burrows-Wheeler transform (BWT).
– The BWT can be computed in linear time using a suffix array.
– The BWT is useful for data compression and string searching.
Enhanced Suffix Arrays and their Properties
– Enhanced suffix arrays consist of suffix arrays and a child table, improving space efficiency and time complexity.
– The child table contains information about the parent-child relationship in the suffix tree.
– Enhanced suffix arrays can be applied to any algorithm that uses a suffix tree by using lcp-interval trees.
– Searching a pattern in an enhanced suffix array has a time complexity of O(m|Σ|).
– An lcp-interval is associated with a node in the suffix tree and represents a range of suffixes with a common longest prefix.
– The lcp-interval has specific properties, such as the lcp-value and the relationship with other intervals.
– The lcp-array stores the lengths of the longest common prefixes between consecutive suffixes.
– The lcp-interval reflects the parent-child relationship in the suffix tree.
– The lcp-interval can be used to compute the child table in linear time.
– The child table, composed of arrays like cldtab, down, and nextlIndex, stores information about the edges of the suffix tree.
– The down and nextlIndex arrays maintain the parent-child relationship.
– The child table can be constructed by traversing the lcp-interval tree in a bottom-up manner.
– Separate algorithms can compute the up/down values and the nextlIndex values.
– Constructing the child table is essential for efficient operations on enhanced suffix arrays.
– The suffix links in an enhanced suffix array can be computed using suffix link intervals.
– Suffix link intervals [i,..j] are generated for each lcp-interval [i,..j] during preprocessing.
– The left and right elements of the interval are maintained in the first index of [i,..j].
– The suffix link table is constructed through a breadth-first traversal of the lcp-interval tree.
– The suffix link interval [l,..r] is represented by the interval [l,..r] in the l-list.
Recent Developments in Suffix Arrays
– Yuta Mori’s DivSufSort is the fastest known suffix sorting algorithm in main memory.
– Ilya Grebnov presented a faster implementation of the algorithm in 2021.
– The algorithm showed a 65% performance improvement over DivSufSort on Silesia Corpus.
– Dynamic suffix arrays are more efficient than rebuilding for inserting letters in the original text.
– Open source routines such as qsufsort and DivSufSort are commonly used for suffix array construction.
References
– Abouelhoda, Kurtz & Ohlebusch 2004
– I, Kärkkäinen & Kempa 2014
– Gawrychowski & Kociumaka 2017
– Abouelhoda, Kurtz & Ohlebusch 2002
– Kurtz 1999
– Puglisi, Smyth & Turpin 2007
– Fischer 2011
– Mori, Yuta. sais. Archived from the original on 9 Mar 2023. Retrieved 31 Aug 2023
– Burkhardt & Kärkkäinen 2003
– Kulla & Sanders 2007
– L. Ard, “Dynamic Extended Suffix Arrays,” Journal of Discrete Algorithms, vol. 8, no. 2, pp. 241, 2010.
– S. Burkhardt and J. Kärkkäinen, “Fast Lightweight Suffix Array Construction and Checking,” in Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 2676, pp. 55-69, 2003.
– R. M. Karp, R. E. Miller, and A. L. Rosenberg, “Rapid Identification of Repeated Patterns,” in Proceedings of the fourth annual ACM symposium on Theory of computing – STOC 72, pp. 125-136, 1972.
– M. Farach, “Optimal Suffix Tree Construction
In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.
Suffix array | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Type | Array | |||||||||
Invented by | Manber & Myers (1990) | |||||||||
Time complexity in big O notation | ||||||||||
|
Suffix arrays were introduced by Manber & Myers (1990) as a simple, space efficient alternative to suffix trees. They had independently been discovered by Gaston Gonnet in 1987 under the name PAT array (Gonnet, Baeza-Yates & Snider 1992).
Li, Li & Huo (2016) gave the first in-place time suffix array construction algorithm that is optimal both in time and space, where in-place means that the algorithm only needs additional space beyond the input string and the output suffix array.
Enhanced suffix arrays (ESAs) are suffix arrays with additional tables that reproduce the full functionality of suffix trees preserving the same time and memory complexity. The suffix array for a subset of all suffixes of a string is called sparse suffix array. Multiple probabilistic algorithms have been developed to minimize the additional memory usage including an optimal time and memory algorithm.