Abstract
This paper addresses fake news detection on short video platforms. Although significant research effort has been devoted to this task in recent years, detection accuracy remains suboptimal because content manipulation and generation technologies evolve rapidly. Existing approaches typically adopt a cross-modal fusion strategy that directly combines raw video data with metadata before applying a classification layer. However, our empirical observations reveal a critical oversight: manipulated content frequently exhibits inter-modal inconsistencies that could serve as valuable discriminative features, yet these remain underutilized in contemporary detection frameworks. Motivated by this insight, we propose a novel detection paradigm that explicitly identifies and leverages cross-modal contradictions as discriminative cues. Our approach consists of two core modules: Cross-modal Consistency Learning (CMCL) and Multi-modal Collaborative Diagnosis (MMCD). CMCL comprises Pseudo-label Generation (PLG) and Cross-modal Consistency Diagnosis (CMCD). In PLG, a multimodal large language model generates pseudo-labels for evaluating cross-modal semantic consistency. CMCD then extracts [CLS] tokens and computes a cosine loss to quantify cross-modal inconsistencies. MMCD further integrates multimodal features through Multimodal Feature Fusion (MFF) and Probability Scores Fusion (PSF). MFF employs a co-attention mechanism to enhance semantic interaction across modalities, while a Transformer performs comprehensive feature fusion; PSF then integrates the fake-news probability scores obtained in the previous step. Extensive experiments on the established benchmarks FakeSV and FakeTT demonstrate that our model achieves outstanding performance in fake video detection. Our code is available at https://github.com/Sakura-not-sleep/CA_FVD.
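The two quantitative ingredients named above, a cosine loss over [CLS] embeddings (CMCD) and late fusion of fake-news probability scores (PSF), can be sketched as follows. This is a minimal illustration only: the function names, the binary pseudo-label convention, and the fusion weight `alpha` are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two [CLS] embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def consistency_loss(cls_a, cls_b, pseudo_label):
    """Cosine-style consistency loss between two modalities' [CLS] tokens.

    pseudo_label is assumed binary: 1 if the MLLM judged the modalities
    semantically consistent, 0 otherwise. Consistent pairs are pulled
    together (loss = 1 - sim); inconsistent pairs are pushed apart.
    """
    sim = cosine_similarity(cls_a, cls_b)
    return (1.0 - sim) if pseudo_label == 1 else max(0.0, sim)

def fuse_probabilities(p_fused, p_consistency, alpha=0.5):
    """Late fusion (PSF-style) of two fake-news probability scores
    as a convex combination; alpha is an illustrative weight."""
    return alpha * p_fused + (1.0 - alpha) * p_consistency
```

For example, two identical [CLS] vectors with a "consistent" pseudo-label incur zero loss, while the same pair labeled "inconsistent" is penalized by its full cosine similarity; the fused score is simply a weighted average of the two branch probabilities.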
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Wang, J., Liu, J., Zhang, N., Wang, Y. (2025). Consistency-Aware Fake Videos Detection on Short Video Platforms. In: Huang, DS., Zhang, Q., Zhang, C., Chen, W. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2025. Lecture Notes in Computer Science, vol 15859. Springer, Singapore. https://doi.org/10.1007/978-981-96-9812-7_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-9811-0
Online ISBN: 978-981-96-9812-7