Abstract
The abuse of deepfakes poses severe security and privacy risks, and the threat is growing: attackers are no longer limited to unimodal deepfakes but combine audio and video forgery in multimodal deepfakes to better achieve malicious ends. Existing unimodal and ensemble deepfake detectors lack the fine-grained classification capability that such multimodal forgeries demand. To address this gap, we propose a graph attention network built on a heterogeneous graph for fine-grained multimodal deepfake classification, i.e., not only distinguishing real samples from fake ones but also identifying the forged modality: video, audio, or both. To this end, we propose a positional-encoding-based heterogeneous graph construction method that converts an audio-visual sample into a multimodal heterogeneous graph according to relevant hyperparameters. Moreover, a cross-modal graph interaction module exploits audio-visual synchronization patterns to capture complementary inter-modal information. A de-homogenization graph pooling operation is elaborately designed to preserve differences among graph node features, enriching the graph-level representation. Through the heterogeneous graph attention network, we efficiently model intra- and inter-modal relationships of multimodal data at both spatial and temporal scales. Extensive experiments on two audio-visual datasets, FakeAVCeleb and LAV-DF, demonstrate that our model achieves significant performance gains over state-of-the-art competitors. The code is available at https://github.com/yinql1995/Fine-grained-Multimodal-DeepFake-Classification/.
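The cross-modal graph interaction described above can be pictured as attention between the node sets of the two modalities: each video node attends over the audio nodes (and vice versa) so that synchronization cues flow across the graph. Below is a minimal, self-contained PyTorch sketch of that idea; the class name CrossModalGraphAttention, the single-head dot-product formulation, and all dimensions are illustrative assumptions, not the paper's actual implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGraphAttention(nn.Module):
    """Hypothetical sketch of cross-modal attention between audio and
    video graph nodes; a simplification for illustration only."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # queries from the receiving modality
        self.k = nn.Linear(dim, dim)  # keys from the complementary modality
        self.v = nn.Linear(dim, dim)  # values from the complementary modality

    def forward(self, x_src: torch.Tensor, x_dst: torch.Tensor) -> torch.Tensor:
        # x_src: (N_src, dim) nodes receiving information
        # x_dst: (N_dst, dim) nodes providing complementary information
        attn = self.q(x_src) @ self.k(x_dst).T / x_src.size(-1) ** 0.5
        attn = F.softmax(attn, dim=-1)        # soft edges across modalities
        return x_src + attn @ self.v(x_dst)   # residual cross-modal update

# Toy usage: 16 video nodes and 20 audio nodes with 64-d features.
video_nodes = torch.randn(16, 64)
audio_nodes = torch.randn(20, 64)
cma = CrossModalGraphAttention(64)
video_updated = cma(video_nodes, audio_nodes)  # video attends to audio
print(video_updated.shape)                     # torch.Size([16, 64])
```

In the paper's setting one would run such an update in both directions (video-to-audio and audio-to-video), letting each modality absorb the other's synchronization patterns before graph-level pooling.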
Acknowledgements
This work was jointly supported by the National Natural Science Foundation of China (Grant Nos. U2001202, 62072480, and 62172435).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by Sergio Escalera.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yin, Q., Lu, W., Cao, X. et al. Fine-Grained Multimodal DeepFake Classification via Heterogeneous Graphs. Int J Comput Vis 132, 5255–5269 (2024). https://doi.org/10.1007/s11263-024-02128-1