Succinct BWT-Based Sequence Prediction

Ktistakis, Rafael; Fournier-Viger, Philippe; Puglisi, Simon J.; Raman, Rajeev

doi:10.1007/978-3-030-27618-8_7

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11707))

Included in the following conference series:

International Conference on Database and Expert Systems Applications

894 Accesses
5 Citations
5 Altmetric

Abstract

Sequences of symbols can be used to represent data in many domains such as text documents, activity logs, customer transactions and website click-streams. Sequence prediction is a popular task, which consists of predicting the next symbol of a sequence, given a set of training sequences. Although numerous prediction models have been proposed, many have a low accuracy because they are lossy models (they discard information from training sequences to build the model), while lossless models are often more accurate but typically consume a large amount of memory. This paper addresses these issues by proposing a novel sequence prediction model named SuBSeq that is lossless and utilizes the succinct Wavelet Tree data structure and the Burrows-Wheeler Transform to compactly store and efficiently access training sequences for prediction. An experimental evaluation shows that SuBSeq has a very low memory consumption and excellent accuracy when compared to eight state-of-the-art predictors on seven real datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+

from $39.99 /Month

Starting from 10 chapters or articles per month
Access and download chapters and articles from more than 300k books and 2,500 journals
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

An Experiment on Transfer Learning for Suffix Prediction on Event Logs

A Novel Deep Learning Attention Based Sequence to Sequence Model for Automatic Abstractive Text Summarization

Article 26 May 2024

Enhancing the Accuracy of Predictors of Activity Sequences of Business Processes

Notes

1.
Logs are to base 2 unless stated otherwise.
2.
Available at http://www.philippe-fournier-viger.com/spmf.
3.
Details about QUEST exported data, are available at github.com/rafkt/SUBSEQ.
4.
Same evaluation approach was followed for CPT+[6].

References

Balle, B., Eyraud, R., Luque, F.M., Quattoni, A., Verwer, S.: Results of the sequence PredIction ChallengE (SPiCe): a competition on learning the next symbol in a sequence. In: Proceedings 13th International Conference in Grammatical Inference, vol. 57. JMLR W&CP, Delft (2016)
Google Scholar
Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Technical report, 124, Digital Equiptment Corporation (1994)
Google Scholar
Cleary, J., Witten, I.: Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32(4), 396–402 (1984)
Article Google Scholar
Gog, S.: simongog/sdsl-lite (2015). https://github.com/simongog/sdsl-lite
Gog, S., Petri, M.: Optimized succinct data structures for massive data. Softw. Pract. Experience 44(11), 1287–1314 (2014)
Article Google Scholar
Gueniche, T., Fournier-Viger, P., Raman, R., Tseng, V.S.: CPT+: decreasing the time/space complexity of the compact prediction tree. In: Cao, T., Lim, E.-P., Zhou, Z.-H., Ho, T.-B., Cheung, D., Motoda, H. (eds.) PAKDD 2015. LNCS (LNAI), vol. 9078, pp. 625–636. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18032-8_49
Chapter Google Scholar
Gueniche, T., Fournier-Viger, P., Tseng, V.S.: Compact prediction tree: a lossless model for accurate sequence prediction. In: Motoda, H., Wu, Z., Cao, L., Zaiane, O., Yao, M., Wang, W. (eds.) ADMA 2013. LNCS (LNAI), vol. 8347, pp. 177–188. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-53917-6_16
Chapter Google Scholar
Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Hybrid compression of bitvectors for the FM-index. In: Proceedings DCC, pp. 302–311. IEEE (2014)
Google Scholar
Laird, P., Saul, R.: Discrete sequence prediction and its applications. Mach. Learn. 15(1), 43–68 (1994)
MATH Google Scholar
Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Article MathSciNet MATH Google Scholar
Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407–430 (2001)
Article MathSciNet MATH Google Scholar
Navarro, G.: Wavelet trees for all. J. Discrete Algorithms 25, 2–20 (2014)
Article MathSciNet MATH Google Scholar
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1) (2007). Article 2
Article MATH Google Scholar
Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: Proceedings ALENEX, pp. 60–70. SIAM (2007)
Google Scholar
Padmanabhan, V.N., Mogul, J.C.: Using predictive prefetching to improve world wide web latency. SIGCOMM Comput. Commun. Rev. 26(3), 22–36 (1996)
Article Google Scholar
Pitkow, J., Pirolli, P.: Mining longest repeating subsequences to predict world wide web surfing. In: Proceedings of the 2nd Conference on USENIX Symposium on Internet Technologies and Systems - Volume 2, USITS 1999, pp. 139–150 (1999)
Google Scholar
Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), 43 (2007)
Article MathSciNet MATH Google Scholar
Rjeily, C.B., Badr, G., Al Hassani, A.H., Andres, E.: Predicting heart failure class using a sequence prediction algorithm. In: 2017 Fourth International Conference on Advances in Biomedical Engineering (ICABME), pp. 1–4. IEEE (2017)
Google Scholar
Tax, N.: Human activity prediction in smart home environments with LSTM neural networks. In: 2018 14th International Conference on Intelligent Environments (IE), pp. 40–47. IEEE (2018)
Google Scholar
Zheng, Z., Kohavi, R., Mason, L.: Real world performance of association rule algorithms. In: Proceedings ACM SIGKDD, pp. 401–406. ACM (2001)
Google Scholar
Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theor. 24(5), 530–536 (1978)
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Department of Informatics, University of Leicester, Leicester, UK
Rafael Ktistakis & Rajeev Raman
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Philippe Fournier-Viger
Department of Computer Science, University of Helsinki, Helsinki, Finland
Simon J. Puglisi

Authors

Rafael Ktistakis
View author publications
Search author on:PubMed Google Scholar
Philippe Fournier-Viger
View author publications
Search author on:PubMed Google Scholar
Simon J. Puglisi
View author publications
Search author on:PubMed Google Scholar
Rajeev Raman
View author publications
Search author on:PubMed Google Scholar

Corresponding author

Correspondence to Rafael Ktistakis .

Editor information

Editors and Affiliations

Clausthal University of Technology, Clausthal-Zellerfeld, Germany
Sven Hartmann
Johannes Kepler University of Linz, Linz, Austria
Josef Küng
The University of Texas at Arlington, Arlington, TX, USA
Sharma Chakravarthy
Johannes Kepler University of Linz, Linz, Austria
Gabriele Anderst-Kotsis
Software Competence Center Hagenberg, Hagenberg im Mühlkreis, Austria
A Min Tjoa
Johannes Kepler University of Linz, Linz, Austria
Ismail Khalil

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ktistakis, R., Fournier-Viger, P., Puglisi, S.J., Raman, R. (2019). Succinct BWT-Based Sequence Prediction. In: Hartmann, S., Küng, J., Chakravarthy, S., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2019. Lecture Notes in Computer Science(), vol 11707. Springer, Cham. https://doi.org/10.1007/978-3-030-27618-8_7

Download citation

DOI: https://doi.org/10.1007/978-3-030-27618-8_7
Published: 06 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27617-1
Online ISBN: 978-3-030-27618-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics