Evaluation of phone posterior probabilities for pathology detection in speech data using deep learning models

International Journal of Speech Technology

Abstract

Voice pathology detection (VPD) aims to accurately identify voice impairments by analyzing speech signals. This study proposes deep learning (DL) models for binary classification of healthy versus pathological voices, with a particular focus on integrating phone posterior probabilities (PPPs, as phonetic features) alongside Mel-frequency cepstral coefficients (MFCCs, as acoustic features) as model inputs. By incorporating PPPs as supplementary information, we investigate model performance across spontaneous speech, sustained vowel, and read speech datasets, addressing the gap in comparing these speech types for VPD. Our results show that PPPs significantly enhance classification accuracy, particularly for read and spontaneous speech. Using the AVFAD database, we show that the proposed CNN-based model achieves its highest performance on spontaneous speech, with an accuracy of approximately 87% on test data and 93% on validation data. This study emphasizes the impact of PPPs as phonetic features in VPD tasks and clarifies which types of speech benefit most from their inclusion, paving the way for more refined models in this field.
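As an illustration of what supplying PPPs alongside MFCCs could look like, the following minimal sketch concatenates the two feature types frame by frame. It is not the authors' pipeline: the abstract does not state how the features are combined, so frame-level concatenation is an assumption, as are the function names (`softmax`, `fuse_features`), the feature sizes (13 MFCCs, 40 phone classes, 200 frames), and the random arrays standing in for real MFCCs and phone-recognizer outputs.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Turn per-frame phone scores into posterior probabilities."""
    shifted = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def fuse_features(mfcc, ppp):
    """Concatenate frame-aligned MFCC and PPP matrices along the feature axis."""
    if mfcc.shape[0] != ppp.shape[0]:
        raise ValueError("MFCC and PPP matrices must have the same number of frames")
    return np.concatenate([mfcc, ppp], axis=1)

# Hypothetical sizes, chosen only for illustration.
n_frames, n_mfcc, n_phones = 200, 13, 40

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(n_frames, n_mfcc))            # stand-in for real MFCCs
phone_logits = rng.normal(size=(n_frames, n_phones))  # stand-in for phone-recognizer scores
ppp = softmax(phone_logits)                           # each row sums to 1

fused = fuse_features(mfcc, ppp)
print(fused.shape)  # (200, 53)
```

The fused matrix (frames by features) is the kind of input a CNN-based classifier could consume. Other fusion strategies, such as separate input branches per feature type, are equally plausible readings of the abstract.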

Data availability

The AVFAD dataset was distributed through the ACSA platform (https://acsa.web.ua.pt/AVFAD.htm).

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Contributions

S. Farazi prepared the data, experiments, figures, and tables. Y. Shekofteh planned and directed the project. S. Farazi and Y. Shekofteh wrote the manuscript text.

Corresponding author

Correspondence to Yasser Shekofteh.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this paper.

About this article

Cite this article

Farazi, S., Shekofteh, Y. Evaluation of phone posterior probabilities for pathology detection in speech data using deep learning models. Int J Speech Technol 28, 99–116 (2025). https://doi.org/10.1007/s10772-024-10166-w

  • DOI: https://doi.org/10.1007/s10772-024-10166-w
