Abstract
Given a string x = x[1..n] on an alphabet of size α, and a threshold p min ≥ 1, we describe four variants of an algorithm PSY1 that, using a suffix array, computes all the complete nonextendible repeats in x of length p ≥ p min . The basic algorithm PSY1–1 and its simple extension PSY1–2 are fast on strings that occur in biological, natural language and other applications (not highly periodic strings), while PSY1–3 guarantees Θ(n) worst-case execution time. The final variant, PSY1–4, also achieves Θ(n) processing time and, over the complete range of strings tested, is the fastest of the four. The space requirement of all four algorithms is about 5n bytes, but all make use of the “longest common prefix” (LCP) array, whose construction requires about 6n bytes. The four algorithms are faster in applications and use less space than a recently-proposed algorithm (Narisawa in Proceedings of 18th Annual Symposium on Combinatorial Pattern Matching, pp. 340–351, 2007) that produces equivalent output. The suffix array is not explicitly used by algorithms PSY1, but may be required for postprocessing; in this case, storage requirements rise to 9n bytes. We also describe two variants of a fast Θ(n)-time algorithm PSY2 for computing all complete supernonextendible repeats in x.
Similar content being viewed by others
References
Abouelhoda M.I, Kurtz S., Ohlebusch E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2, 53–86 (2004)
Apostolico A., Lonardi S.: Off-line compression by greedy textual substitution. Proc. IEEE 88(11), 1733–1744 (2000)
Bernstein Y., Zobel J.: Accurate discovery of co-derivative documents via duplicate text detection. Inf. Syst. 31, 595–609 (2006)
Brodal G.S., Lyngso R.B., Pederesen C.N.S., Stoye J.: Finding maximal pairs with bounded gap. J. Discrete Algebras 1, 77–103 (2000)
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report 124. Digital Equipment Corporation (1994)
Franek, F., Simpson, R.J., Smyth, W.F.: The maximum number of runs in a string. In: Miller, M., Park, K. (eds.) Proceedings of 14th Australasian Workshop on Combinatorial Algebras, pp. 36–45 (2003)
Franek F., Smyth W.F., Tang Y.: Computing all repeats using suffix arrays. J. Autom. Lang. Comb. 8(4), 579–591 (2003)
Gusfield, D.: Algorithms on Strings, Trees & Sequences, 534 pp. Cambridge University Press, Cambridge (1997)
Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Proceedings of 30th International Colloquium on Automata, Languages & Programming, pp. 943–955 (2003)
Karlin S., Ghandour G., Ost F., Tavare S., Korn L.J.: New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. USA 80, 5660–5664 (1983)
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, LNCS, vol. 2089, pp. 181–192. Springer-Verlag, Berlin (2001)
Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, LNCS, vol. 2676, pp. 200–210. Springer-Verlag, Berlin (2003)
Larsson J., Moffat A.: Off-line dictionary-based compression. Proc. IEEE 88(11), 1722–1732 (2000)
Maniscalco, M., Puglisi, S.J.: Faster lightweight suffix array construction. In: Ryan, J., Dafik (eds.) Proceedings of the 17th Australasian Workshop on Combinatorial Algebras, pp. 16–29 (2006)
Manzini, G.: Two space-saving tricks for linear time LCP computation. In: Hagerup, T., Katajainen, J. (eds.) Proceedings of the 9th Scandinavian Workshop on Algebra Theory, LNCS, vol. 3111, pp. 372–383. Springer-Verlag, New York (2004)
Manzini G., Ferragina P.: Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 33–50 (2004)
Narisawa, K., Inenaga, S., Bannai, H., Takeda, M.: Efficient computation of substring equivalence classes with suffix arrays. In: Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching, pp. 340–351 (2007)
Puglisi S.J., Smyth W.F., Turpin A.: A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 39(2), 1–31 (2007)
Puglisi, S.J., Smyth, W.F., Yusufu, M.: Fast optimal algorithms for computing all the repeats in a string. In: Proceedings of the Prague Stringology Conference, pp. 161–169 (2008)
Puglisi, S.J., Turpin, A.: Space-time tradeoffs for longest-common-prefix array computation. In: Hong, S-H., Nagamochi, H., Fukunaga, T. (eds.) Proceedings of the 19th International Symposium on Algorithms & Computation, LNCS, vol. 5369, pp. 124–135. Springer-Verlag, New York (2008)
Shin S.W., Kim S.M.: A new algorithm for detecting low-complexity regions in protein sequences. Bioinformatics 21(2), 160–170 (2005)
Smyth, B.: Computing Patterns in Strings, 423 pp. Pearson Addison-Wesley (2003)
Turpin, A., Smyth, W.F.: An approach to phrase selection for offline data compression. In: Oudshoorn, M. (ed.) Proceedings of the 25th Australasian Computer Science Conference, pp. 267–273 (2002)
Author information
Authors and Affiliations
Corresponding authors
Additional information
The work of the first author is supported by the Australian Research Council. The work of the second and third authors was supported in part by grants from the Natural Sciences & Engineering Research Council of Canada. A preliminary version of this paper appeared in [19]. The authors thank an anonymous referee whose comments have materially improved this paper.
Rights and permissions
About this article
Cite this article
Puglisi, S.J., Smyth, W.F. & Yusufu, M. Fast, Practical Algorithms for Computing All the Repeats in a String. Math.Comput.Sci. 3, 373–389 (2010). https://doi.org/10.1007/s11786-010-0033-6
Received:
Revised:
Accepted:
Published:
Issue date:
DOI: https://doi.org/10.1007/s11786-010-0033-6