Fast, Practical Algorithms for Computing All the Repeats in a String

Puglisi, Simon J.; Smyth, W. F.; Yusufu, Munina

doi:10.1007/s11786-010-0033-6

Fast, Practical Algorithms for Computing All the Repeats in a String

Published: 15 April 2010

Volume 3, pages 373–389, (2010)
Cite this article

Mathematics in Computer Science Aims and scope Submit manuscript

144 Accesses
10 Citations
Explore all metrics

Abstract

Given a string x = x[1..n] on an alphabet of size α, and a threshold p _min ≥ 1, we describe four variants of an algorithm PSY1 that, using a suffix array, computes all the complete nonextendible repeats in x of length p ≥ p _min. The basic algorithm PSY1–1 and its simple extension PSY1–2 are fast on strings that occur in biological, natural language and other applications (not highly periodic strings), while PSY1–3 guarantees Θ(n) worst-case execution time. The final variant, PSY1–4, also achieves Θ(n) processing time and, over the complete range of strings tested, is the fastest of the four. The space requirement of all four algorithms is about 5n bytes, but all make use of the “longest common prefix” (LCP) array, whose construction requires about 6n bytes. The four algorithms are faster in applications and use less space than a recently-proposed algorithm (Narisawa in Proceedings of 18th Annual Symposium on Combinatorial Pattern Matching, pp. 340–351, 2007) that produces equivalent output. The suffix array is not explicitly used by algorithms PSY1, but may be required for postprocessing; in this case, storage requirements rise to 9n bytes. We also describe two variants of a fast Θ(n)-time algorithm PSY2 for computing all complete supernonextendible repeats in x.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+

from $39.99 /Month

Starting from 10 chapters or articles per month
Access and download chapters and articles from more than 300k books and 2,500 journals
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

Property Suffix Array with Applications

Optimal In-Place Suffix Sorting

References

Abouelhoda M.I, Kurtz S., Ohlebusch E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2, 53–86 (2004)
Article MATH MathSciNet Google Scholar
Apostolico A., Lonardi S.: Off-line compression by greedy textual substitution. Proc. IEEE 88(11), 1733–1744 (2000)
Article Google Scholar
Bernstein Y., Zobel J.: Accurate discovery of co-derivative documents via duplicate text detection. Inf. Syst. 31, 595–609 (2006)
Article Google Scholar
Brodal G.S., Lyngso R.B., Pederesen C.N.S., Stoye J.: Finding maximal pairs with bounded gap. J. Discrete Algebras 1, 77–103 (2000)
Google Scholar
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report 124. Digital Equipment Corporation (1994)
Franek, F., Simpson, R.J., Smyth, W.F.: The maximum number of runs in a string. In: Miller, M., Park, K. (eds.) Proceedings of 14th Australasian Workshop on Combinatorial Algebras, pp. 36–45 (2003)
Franek F., Smyth W.F., Tang Y.: Computing all repeats using suffix arrays. J. Autom. Lang. Comb. 8(4), 579–591 (2003)
MATH MathSciNet Google Scholar
Gusfield, D.: Algorithms on Strings, Trees & Sequences, 534 pp. Cambridge University Press, Cambridge (1997)
Book MATH Google Scholar
Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Proceedings of 30th International Colloquium on Automata, Languages & Programming, pp. 943–955 (2003)
Karlin S., Ghandour G., Ost F., Tavare S., Korn L.J.: New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. USA 80, 5660–5664 (1983)
Article MATH Google Scholar
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, LNCS, vol. 2089, pp. 181–192. Springer-Verlag, Berlin (2001)
Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, LNCS, vol. 2676, pp. 200–210. Springer-Verlag, Berlin (2003)
Larsson J., Moffat A.: Off-line dictionary-based compression. Proc. IEEE 88(11), 1722–1732 (2000)
Article Google Scholar
Maniscalco, M., Puglisi, S.J.: Faster lightweight suffix array construction. In: Ryan, J., Dafik (eds.) Proceedings of the 17th Australasian Workshop on Combinatorial Algebras, pp. 16–29 (2006)
Manzini, G.: Two space-saving tricks for linear time LCP computation. In: Hagerup, T., Katajainen, J. (eds.) Proceedings of the 9th Scandinavian Workshop on Algebra Theory, LNCS, vol. 3111, pp. 372–383. Springer-Verlag, New York (2004)
Manzini G., Ferragina P.: Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 33–50 (2004)
Article MATH MathSciNet Google Scholar
Narisawa, K., Inenaga, S., Bannai, H., Takeda, M.: Efficient computation of substring equivalence classes with suffix arrays. In: Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching, pp. 340–351 (2007)
Puglisi S.J., Smyth W.F., Turpin A.: A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 39(2), 1–31 (2007)
Article Google Scholar
Puglisi, S.J., Smyth, W.F., Yusufu, M.: Fast optimal algorithms for computing all the repeats in a string. In: Proceedings of the Prague Stringology Conference, pp. 161–169 (2008)
Puglisi, S.J., Turpin, A.: Space-time tradeoffs for longest-common-prefix array computation. In: Hong, S-H., Nagamochi, H., Fukunaga, T. (eds.) Proceedings of the 19th International Symposium on Algorithms & Computation, LNCS, vol. 5369, pp. 124–135. Springer-Verlag, New York (2008)
Shin S.W., Kim S.M.: A new algorithm for detecting low-complexity regions in protein sequences. Bioinformatics 21(2), 160–170 (2005)
Article MathSciNet Google Scholar
Smyth, B.: Computing Patterns in Strings, 423 pp. Pearson Addison-Wesley (2003)
Turpin, A., Smyth, W.F.: An approach to phrase selection for offline data compression. In: Oudshoorn, M. (ed.) Proceedings of the 25th Australasian Computer Science Conference, pp. 267–273 (2002)

Download references

Author information

Authors and Affiliations

School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne, VIC, 3001, Australia
Simon J. Puglisi
Digital Ecosystems and Business Intelligence Institute, Curtin University of Technology, GPO Box U1987, Perth, WA, 6845, Australia
W. F. Smyth & Munina Yusufu
Algorithms Research Group, Department of Computing and Software, McMaster University, Hamilton, ON, L8S 4K1, Canada
W. F. Smyth & Munina Yusufu

Authors

Simon J. Puglisi
View author publications
Search author on:PubMed Google Scholar
W. F. Smyth
View author publications
Search author on:PubMed Google Scholar
Munina Yusufu
View author publications
Search author on:PubMed Google Scholar

Corresponding authors

Correspondence to Simon J. Puglisi or W. F. Smyth.

Additional information

The work of the first author is supported by the Australian Research Council. The work of the second and third authors was supported in part by grants from the Natural Sciences & Engineering Research Council of Canada. A preliminary version of this paper appeared in [19]. The authors thank an anonymous referee whose comments have materially improved this paper.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Puglisi, S.J., Smyth, W.F. & Yusufu, M. Fast, Practical Algorithms for Computing All the Repeats in a String. Math.Comput.Sci. 3, 373–389 (2010). https://doi.org/10.1007/s11786-010-0033-6

Download citation

Received: 23 March 2009
Revised: 13 January 2010
Accepted: 15 February 2010
Published: 15 April 2010
Issue date: June 2010
DOI: https://doi.org/10.1007/s11786-010-0033-6

Keywords

Mathematics Subject Classification (2000)

Nonnumerical algorithms 68W05

Access this article

Log in via an institution

Subscribe and save

Springer+

from $39.99 /Month

Starting from 10 chapters or articles per month
Access and download chapters and articles from more than 300k books and 2,500 journals
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fast, Practical Algorithms for Computing All the Repeats in a String

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

Property Suffix Array with Applications

Optimal In-Place Suffix Sorting

Explore related subjects

References

Author information

Authors and Affiliations

Corresponding authors

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification (2000)

Subscribe and save

Buy Now