这是indexloc提供的服务,不要输入任何密码
Skip to main content
Log in

Fast, Practical Algorithms for Computing All the Repeats in a String

  • Published:
Mathematics in Computer Science Aims and scope Submit manuscript

Abstract

Given a string xx[1..n] on an alphabet of size α, and a threshold p min ≥ 1, we describe four variants of an algorithm PSY1 that, using a suffix array, computes all the complete nonextendible repeats in x of length pp min . The basic algorithm PSY1–1 and its simple extension PSY1–2 are fast on strings that occur in biological, natural language and other applications (not highly periodic strings), while PSY1–3 guarantees Θ(n) worst-case execution time. The final variant, PSY1–4, also achieves Θ(n) processing time and, over the complete range of strings tested, is the fastest of the four. The space requirement of all four algorithms is about 5n bytes, but all make use of the “longest common prefix” (LCP) array, whose construction requires about 6n bytes. The four algorithms are faster in applications and use less space than a recently-proposed algorithm (Narisawa in Proceedings of 18th Annual Symposium on Combinatorial Pattern Matching, pp. 340–351, 2007) that produces equivalent output. The suffix array is not explicitly used by algorithms PSY1, but may be required for postprocessing; in this case, storage requirements rise to 9n bytes. We also describe two variants of a fast Θ(n)-time algorithm PSY2 for computing all complete supernonextendible repeats in x.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+
from $39.99 /Month
  • Starting from 10 chapters or articles per month
  • Access and download chapters and articles from more than 300k books and 2,500 journals
  • Cancel anytime
View plans

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Abouelhoda M.I, Kurtz S., Ohlebusch E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2, 53–86 (2004)

    Article  MATH  MathSciNet  Google Scholar 

  2. Apostolico A., Lonardi S.: Off-line compression by greedy textual substitution. Proc. IEEE 88(11), 1733–1744 (2000)

    Article  Google Scholar 

  3. Bernstein Y., Zobel J.: Accurate discovery of co-derivative documents via duplicate text detection. Inf. Syst. 31, 595–609 (2006)

    Article  Google Scholar 

  4. Brodal G.S., Lyngso R.B., Pederesen C.N.S., Stoye J.: Finding maximal pairs with bounded gap. J. Discrete Algebras 1, 77–103 (2000)

    Google Scholar 

  5. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report 124. Digital Equipment Corporation (1994)

  6. Franek, F., Simpson, R.J., Smyth, W.F.: The maximum number of runs in a string. In: Miller, M., Park, K. (eds.) Proceedings of 14th Australasian Workshop on Combinatorial Algebras, pp. 36–45 (2003)

  7. Franek F., Smyth W.F., Tang Y.: Computing all repeats using suffix arrays. J. Autom. Lang. Comb. 8(4), 579–591 (2003)

    MATH  MathSciNet  Google Scholar 

  8. Gusfield, D.: Algorithms on Strings, Trees & Sequences, 534 pp. Cambridge University Press, Cambridge (1997)

    Book  MATH  Google Scholar 

  9. Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Proceedings of 30th International Colloquium on Automata, Languages & Programming, pp. 943–955 (2003)

  10. Karlin S., Ghandour G., Ost F., Tavare S., Korn L.J.: New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. USA 80, 5660–5664 (1983)

    Article  MATH  Google Scholar 

  11. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, LNCS, vol. 2089, pp. 181–192. Springer-Verlag, Berlin (2001)

  12. Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, LNCS, vol. 2676, pp. 200–210. Springer-Verlag, Berlin (2003)

  13. Larsson J., Moffat A.: Off-line dictionary-based compression. Proc. IEEE 88(11), 1722–1732 (2000)

    Article  Google Scholar 

  14. Maniscalco, M., Puglisi, S.J.: Faster lightweight suffix array construction. In: Ryan, J., Dafik (eds.) Proceedings of the 17th Australasian Workshop on Combinatorial Algebras, pp. 16–29 (2006)

  15. Manzini, G.: Two space-saving tricks for linear time LCP computation. In: Hagerup, T., Katajainen, J. (eds.) Proceedings of the 9th Scandinavian Workshop on Algebra Theory, LNCS, vol. 3111, pp. 372–383. Springer-Verlag, New York (2004)

  16. Manzini G., Ferragina P.: Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 33–50 (2004)

    Article  MATH  MathSciNet  Google Scholar 

  17. Narisawa, K., Inenaga, S., Bannai, H., Takeda, M.: Efficient computation of substring equivalence classes with suffix arrays. In: Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching, pp. 340–351 (2007)

  18. Puglisi S.J., Smyth W.F., Turpin A.: A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 39(2), 1–31 (2007)

    Article  Google Scholar 

  19. Puglisi, S.J., Smyth, W.F., Yusufu, M.: Fast optimal algorithms for computing all the repeats in a string. In: Proceedings of the Prague Stringology Conference, pp. 161–169 (2008)

  20. Puglisi, S.J., Turpin, A.: Space-time tradeoffs for longest-common-prefix array computation. In: Hong, S-H., Nagamochi, H., Fukunaga, T. (eds.) Proceedings of the 19th International Symposium on Algorithms & Computation, LNCS, vol. 5369, pp. 124–135. Springer-Verlag, New York (2008)

  21. Shin S.W., Kim S.M.: A new algorithm for detecting low-complexity regions in protein sequences. Bioinformatics 21(2), 160–170 (2005)

    Article  MathSciNet  Google Scholar 

  22. Smyth, B.: Computing Patterns in Strings, 423 pp. Pearson Addison-Wesley (2003)

  23. Turpin, A., Smyth, W.F.: An approach to phrase selection for offline data compression. In: Oudshoorn, M. (ed.) Proceedings of the 25th Australasian Computer Science Conference, pp. 267–273 (2002)

Download references

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Simon J. Puglisi or W. F. Smyth.

Additional information

The work of the first author is supported by the Australian Research Council. The work of the second and third authors was supported in part by grants from the Natural Sciences & Engineering Research Council of Canada. A preliminary version of this paper appeared in [19]. The authors thank an anonymous referee whose comments have materially improved this paper.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Puglisi, S.J., Smyth, W.F. & Yusufu, M. Fast, Practical Algorithms for Computing All the Repeats in a String. Math.Comput.Sci. 3, 373–389 (2010). https://doi.org/10.1007/s11786-010-0033-6

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue date:

  • DOI: https://doi.org/10.1007/s11786-010-0033-6

Keywords

Mathematics Subject Classification (2000)