Glossary Term
Suffix array
Suffix Arrays and their Construction
- A suffix array is a sorted array of all suffixes of a string.
- Suffix arrays were introduced as an alternative to suffix trees.
- The first in-place suffix array construction algorithm was developed in 2016.
- Enhanced suffix arrays reproduce the functionality of suffix trees.
- Suffix arrays require less space than suffix trees.
- A suffix array requires 4 bytes per integer, while a suffix tree requires 20 bytes per node.
- Compressed suffix arrays and BWT-based indices require less space than suffix arrays.
- The SA-IS algorithm is a fast and space-efficient suffix array construction algorithm.
- Suffix array construction algorithms differ based on the supported alphabet.
- Prefix doubling algorithms find prefixes that honor the lexicographic ordering of suffixes.
- Recursive algorithms recursively sort a subset of suffixes and merge the suffix arrays.
- Induced copying algorithms sort the remaining suffixes using an already sorted subset.
- The DC3/skew algorithm is a recursive algorithm for integer alphabets.
- Dynamic suffix arrays update the suffix array of an edited text instead of rebuilding it.
- Yuta Mori's DivSufSort is the fastest known suffix sorting algorithm in main memory.
- Ilya Grebnov presented a faster implementation of the algorithm in 2021.
- The algorithm showed a 65% performance improvement over DivSufSort on Silesia Corpus.
- Dynamic suffix arrays are more efficient than rebuilding for inserting letters in the original text.
- Open source routines such as qsufsort and DivSufSort are commonly used for suffix array construction.
Generalized Suffix Arrays and their Applications
- A generalized suffix array contains all suffixes for a set of strings.
- It is lexicographically sorted with all suffixes of each string.
- Suffix sorting algorithms can be used to compute the Burrows-Wheeler transform (BWT).
- The BWT can be computed in linear time using a suffix array.
- The BWT is useful for data compression and string searching.
Enhanced Suffix Arrays and their Properties
- Enhanced suffix arrays consist of suffix arrays and a child table, improving space efficiency and time complexity.
- The child table contains information about the parent-child relationship in the suffix tree.
- Enhanced suffix arrays can be applied to any algorithm that uses a suffix tree by using lcp-interval trees.
- Searching a pattern in an enhanced suffix array has a time complexity of O(m|Σ|).
- An lcp-interval is associated with a node in the suffix tree and represents a range of suffixes with a common longest prefix.
- The lcp-interval has specific properties, such as the lcp-value and the relationship with other intervals.
- The lcp-array stores the lengths of the longest common prefixes between consecutive suffixes.
- The lcp-interval reflects the parent-child relationship in the suffix tree.
- The lcp-interval can be used to compute the child table in linear time.
- The child table, composed of arrays like cldtab, down, and nextlIndex, stores information about the edges of the suffix tree.
- The down and nextlIndex arrays maintain the parent-child relationship.
- The child table can be constructed by traversing the lcp-interval tree in a bottom-up manner.
- Separate algorithms can compute the up/down values and the nextlIndex values.
- Constructing the child table is essential for efficient operations on enhanced suffix arrays.
- The suffix links in an enhanced suffix array can be computed using suffix link intervals.
- Suffix link intervals are generated for each lcp-interval during preprocessing.
- The left and right elements of the interval are maintained in the first index of .
- The suffix link table is constructed through a breadth-first traversal of the lcp-interval tree.
- The suffix link interval is represented by the interval in the l-list.
Recent Developments in Suffix Arrays
- Yuta Mori's DivSufSort is the fastest known suffix sorting algorithm in main memory.
- Ilya Grebnov presented a faster implementation of the algorithm in 2021.
- The algorithm showed a 65% performance improvement over DivSufSort on Silesia Corpus.
- Dynamic suffix arrays are more efficient than rebuilding for inserting letters in the original text.
- Open source routines such as qsufsort and DivSufSort are commonly used for suffix array construction.
References
- Abouelhoda, Kurtz & Ohlebusch 2004
- I, Kärkkäinen & Kempa 2014
- Gawrychowski & Kociumaka 2017
- Abouelhoda, Kurtz & Ohlebusch 2002
- Kurtz 1999
- Puglisi, Smyth & Turpin 2007
- Fischer 2011
- Mori, Yuta. sais. Archived from the original on 9 Mar 2023. Retrieved 31 Aug 2023
- Burkhardt & Kärkkäinen 2003
- Kulla & Sanders 2007
- L. Ard, "Dynamic Extended Suffix Arrays," Journal of Discrete Algorithms, vol. 8, no. 2, pp. 241, 2010.
- S. Burkhardt and J. Kärkkäinen, "Fast Lightweight Suffix Array Construction and Checking," in Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 2676, pp. 55-69, 2003.
- R. M. Karp, R. E. Miller, and A. L. Rosenberg, "Rapid Identification of Repeated Patterns," in Proceedings of the fourth annual ACM symposium on Theory of computing - STOC 72, pp. 125-136, 1972.
- M. Farach, "Optimal Suffix Tree Construction