Abstract
Voice pathology detection (VPD) aims to accurately identify voice impairments by analyzing speech signals. This study proposes deep learning (DL) models for binary classification to distinguish between healthy and pathological voices, with a particular focus on integrating phone posterior probabilities (PPPs), as phonetic-based features, alongside Mel-frequency cepstral coefficients (MFCCs), as acoustic-based features, at the model input. By incorporating PPPs as supplementary information, we investigate model performance across spontaneous speech, sustained vowel, and read speech datasets, addressing the gap in comparing these speech types for VPD. Our results highlight that PPPs significantly enhance classification accuracy, particularly for read and spontaneous speech. Using the AVFAD database, we show that the proposed CNN-based model achieves its highest performance on spontaneous speech, with an accuracy of approximately 87% on test data and 93% on validation data. This study emphasizes the impact of PPPs as phonetic-based features in VPD tasks and clarifies which types of speech benefit most from their inclusion, paving the way for more refined models in this field.
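To make the feature setup concrete, the following is a minimal Python sketch, not the authors' exact pipeline, of how frame-level MFCCs and PPPs could be concatenated and passed to a small CNN for binary healthy/pathological classification. It assumes librosa and TensorFlow/Keras are available; the PPP extractor is a hypothetical placeholder (real posteriors would come from a pretrained phone recognizer), and the frame count, phone-set size, and layer sizes are illustrative assumptions.

import numpy as np
import librosa
import tensorflow as tf

N_MFCC = 13      # acoustic features per frame
N_PHONES = 40    # assumed phone-set size for the PPP vector (illustrative)
N_FRAMES = 300   # fixed number of frames per utterance (pad/trim)

def extract_mfcc(path, sr=16000):
    # Frame-level MFCCs as the acoustic-based features.
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T        # (T, N_MFCC)

def extract_ppp(num_frames):
    # Hypothetical placeholder: real PPPs are frame-level posteriors from a
    # pretrained phone recognizer; random Dirichlet vectors only mimic the shape.
    return np.random.dirichlet(np.ones(N_PHONES), size=num_frames)  # (T, N_PHONES)

def make_input(path):
    mfcc = extract_mfcc(path)
    feats = np.concatenate([mfcc, extract_ppp(mfcc.shape[0])], axis=1)
    feats = librosa.util.fix_length(feats, size=N_FRAMES, axis=0)   # pad/trim in time
    return feats[..., np.newaxis]       # (N_FRAMES, N_MFCC + N_PHONES, 1)

def build_cnn():
    # Small 2-D CNN with a sigmoid output for healthy vs. pathological.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(N_FRAMES, N_MFCC + N_PHONES, 1)),
        tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

model = build_cnn()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val)) once utterance features are stacked.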
Data availability
The AVFAD dataset is distributed through the ACSA platform (https://acsa.web.ua.pt/AVFAD.htm).
Funding
No funding was received to assist with the preparation of this manuscript.
Author information
Contributions
S. Farazi prepared the data, conducted the experiments, and produced the figures and tables. Y. Shekofteh planned and directed the project. S. Farazi and Y. Shekofteh wrote the manuscript text.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Farazi, S., Shekofteh, Y. Evaluation of phone posterior probabilities for pathology detection in speech data using deep learning models. Int J Speech Technol 28, 99–116 (2025). https://doi.org/10.1007/s10772-024-10166-w