Abstract
Voice pathology detection (VPD) aims to accurately identify voice impairments by analyzing speech signals. This study proposes deep learning (DL) models for binary classification to distinguish between healthy and pathological voices, with a particular focus on integrating phone posterior probabilities (PPPs), as phonetic-based features, alongside Mel-frequency cepstral coefficients (MFCCs), as acoustic-based features, at the model input. By incorporating PPPs as supplementary information, we investigate model performance across spontaneous speech, sustained vowel, and read speech datasets, addressing the gap in comparing these speech types for VPD. Our results highlight that PPPs significantly enhance classification accuracy, particularly for read and spontaneous speech. Using the AVFAD database, we show that the proposed CNN-based model achieves its highest performance on spontaneous speech, with an accuracy of approximately 87% on test data and 93% on validation data. This study emphasizes the impact of PPPs as phonetic-based features in VPD tasks and clarifies which types of speech benefit most from their inclusion, paving the way for more refined models in this field.
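To make the feature setup concrete, the following is a minimal Python sketch, not the authors' exact pipeline, of how frame-level MFCCs and PPPs could be concatenated and passed to a small CNN for binary healthy/pathological classification. It assumes librosa and TensorFlow/Keras are available; the PPP extractor is a hypothetical placeholder (real posteriors would come from a pretrained phone recognizer), and the frame count, phone-set size, and layer sizes are illustrative assumptions.

import numpy as np
import librosa
import tensorflow as tf

N_MFCC = 13      # acoustic features per frame
N_PHONES = 40    # assumed phone-set size for the PPP vector (illustrative)
N_FRAMES = 300   # fixed number of frames per utterance (pad/trim)

def extract_mfcc(path, sr=16000):
    # Frame-level MFCCs as the acoustic-based features.
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T        # (T, N_MFCC)

def extract_ppp(num_frames):
    # Hypothetical placeholder: real PPPs are frame-level posteriors from a
    # pretrained phone recognizer; random Dirichlet vectors only mimic the shape.
    return np.random.dirichlet(np.ones(N_PHONES), size=num_frames)  # (T, N_PHONES)

def make_input(path):
    mfcc = extract_mfcc(path)
    feats = np.concatenate([mfcc, extract_ppp(mfcc.shape[0])], axis=1)
    feats = librosa.util.fix_length(feats, size=N_FRAMES, axis=0)   # pad/trim in time
    return feats[..., np.newaxis]       # (N_FRAMES, N_MFCC + N_PHONES, 1)

def build_cnn():
    # Small 2-D CNN with a sigmoid output for healthy vs. pathological.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(N_FRAMES, N_MFCC + N_PHONES, 1)),
        tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

model = build_cnn()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val)) once utterance features are stacked.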
Data availability
The AVFAD dataset is distributed through the ACSA platform (https://acsa.web.ua.pt/AVFAD.htm).
Funding
No funding was received to assist with the preparation of this manuscript.
Author information
Contributions
S. Farazi prepared the data, conducted the experiments, and produced the figures and tables. Y. Shekofteh planned and directed the project. S. Farazi and Y. Shekofteh wrote the manuscript text.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Farazi, S., Shekofteh, Y. Evaluation of phone posterior probabilities for pathology detection in speech data using deep learning models. Int J Speech Technol 28, 99–116 (2025). https://doi.org/10.1007/s10772-024-10166-w