Abstract
The part-whole relational property endowed by Capsule Networks (CapsNets) has proven successful for camouflaged object detection (COD) owing to its segmentation integrity. However, the previous Expectation-Maximization (EM) capsule routing algorithm, with its heavy computation and large parameter count, obstructs this trend. The root cause lies in pixel-level capsule routing. In this paper, we instead propose a novel mamba capsule routing at the type level. Specifically, we first extract the implicit hidden states in mamba as capsule vectors, which abstract type-level capsules from their pixel-level counterparts. These type-level mamba capsules are fed into the EM routing algorithm to obtain the higher-layer mamba capsules, which greatly reduces the computation and parameters incurred by pixel-level capsule routing for part-whole relationship exploration. On top of that, to retrieve pixel-level capsule features for the final camouflaged prediction, we reconstruct them from the low-layer pixel-level capsules under the guidance of correlations between adjacent-layer type-level mamba capsules. Extensive experiments on three widely used COD benchmark datasets demonstrate that our method significantly outperforms the state-of-the-art. Code is available at https://github.com/Liangbo-Cheng/mamba_capsule.
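The EM routing step over type-level capsules can be sketched as follows. This is a minimal NumPy illustration of EM routing — clustering a small set of low-layer capsule vectors into higher-layer capsules — under simplifying assumptions (identity vote transforms instead of learned per-pair transformation matrices, no capsule activations, and no mamba hidden-state extraction); it is not the authors' implementation.

```python
import numpy as np

def em_routing(low_caps, n_high, n_iters=3, seed=0):
    """Cluster N_low capsule vectors into n_high higher-layer capsules.

    low_caps: (N_low, d) array of low-layer capsule pose vectors.
    Returns (mu, r): higher-layer poses (n_high, d) and the
    assignment (routing) probabilities (N_low, n_high).
    """
    rng = np.random.default_rng(seed)
    n_low, d = low_caps.shape
    # Votes: identity transform per (low, high) pair for brevity;
    # a real CapsNet learns a transformation matrix per pair.
    votes = np.repeat(low_caps[:, None, :], n_high, axis=1)   # (N_low, n_high, d)
    # Random initial assignments (rows sum to 1) break the symmetry.
    r = rng.dirichlet(np.ones(n_high), size=n_low)            # (N_low, n_high)
    for _ in range(n_iters):
        # M-step: higher-capsule pose = assignment-weighted mean of votes.
        w = r / (r.sum(axis=0, keepdims=True) + 1e-8)         # columns sum to ~1
        mu = np.einsum('ij,ijd->jd', w, votes)                # (n_high, d)
        var = np.einsum('ij,ijd->jd', w, (votes - mu[None]) ** 2) + 1e-8
        # E-step: re-estimate assignments from diagonal-Gaussian likelihoods.
        log_p = (-0.5 * (((votes - mu[None]) ** 2) / var[None]).sum(-1)
                 - 0.5 * np.log(var).sum(-1)[None])           # (N_low, n_high)
        r = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        r = r / r.sum(axis=1, keepdims=True)
    return mu, r
```

With two well-separated groups of input vectors, a few iterations suffice for the routing probabilities to assign each group coherently to one higher-layer capsule; the point of performing this at the type level, as in the paper, is that N_low is the number of capsule types rather than the number of pixels.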
Data Availability
The datasets used and analyzed during the current study are available in the following public domain resources:

- CAMO: https://github.com/ondyari/FaceForensics
- COD10K: https://github.com/DengPingFan/SINet
- NC4K: https://github.com/JingZhang617/COD-Rank-Localize-and-Segment

The models and source data generated and analyzed during the current study are available from the corresponding author upon reasonable request.
Notes
The terms "high-layer" and "whole-level" are used interchangeably.
The terms "low-layer" and "part-level" are used interchangeably.
References
Achanta, R., Hemami, S., Estrada, F., & Susstrunk, S. (2009). Frequency-tuned salient region detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1597–1604.
Bideau, P., Learned-Miller, E., Schmid, C., & Alahari, K. (2024). The right spin: Learning object motion from rotation-compensated flow fields. International Journal of Computer Vision, 132(1), 40–55.
Cai, L., McGuire, N. E., Hanlon, R., Mooney, T. A., & Girdhar, Y. (2023). Semi-supervised visual tracking of marine animals using autonomous underwater vehicles. International Journal of Computer Vision, 131(6), 1406–1427.
Chen, R., Shen, H., Zhao, Z. Q., Yang, Y., & Zhang, Z. (2024). Global routing between capsules. Pattern Recognition, 148, Article 110142.
Chen, X., & Schmitt, F. (1993). Vision-based construction of cad models from range images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 129–136.
Cong, R., Sun, M., Zhang, S., Zhou, X., Zhang, W., & Zhao, Y. (2023). Frequency perception network for camouflaged object detection. In: Proceedings of the ACM International Conference on Multimedia, pp 1179–1189.
Fan, D. P., Cheng, M. M., Liu, Y., Li, T., & Borji, A. (2017). Structure-measure: A new way to evaluate foreground maps. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 4558–4567.
Fan, D. P., Gong, C., Cao, Y., Ren, B., Cheng, M. M., & Borji, A. (2018). Enhanced-alignment measure for binary foreground map evaluation. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp 698–704.
Fan, D. P., Ji, G. P., Sun, G., Cheng, M. M., Shen, J., & Shao, L. (2020). Camouflaged object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2774–2784.
Fan, D. P., Ji, G. P., Cheng, M. M., & Shao, L. (2022). Concealed object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10), 6024–6042.
Fazekas, S., Amiaz, T., Chetverikov, D., & Kiryati, N. (2009). Dynamic texture detection based on motion analysis. International Journal of Computer Vision, 82(1), 48–63.
Gan, D., Chang, M., & Chen, J. (2024). 3d-effivitcaps: 3d efficient vision transformer with capsule for medical image segmentation. In: International Conference on Pattern Recognition, pp 141–156.
Geng, X., Wang, J., Gong, J., Xue, Y., Xu, J., Chen, F., & Huang, X. (2024). Orthcaps: An orthogonal capsnet with sparse attention routing and pruning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6037–6046.
Girshick, R., Iandola, F., Darrell, T., & Malik, J. (2015). Deformable part models are convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 437–446.
Gu, A., & Dao, T. (2024). Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752
Gu, A., Goel, K., & Re, C. (2022). Efficiently modeling long sequences with structured state spaces. In: Proceedings of the International Conference on Learning Representations.
Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., & Ré, C. (2024). Combining recurrent, convolutional, and continuous-time models with linear state-space layers. In: Advances in Neural Information Processing Systems.
Guo, G., Zhang, D., Han, L., Liu, N., Cheng, M. M., & Han, J. (2024). Pixel distillation: Cost-flexible distillation across image sizes and heterogeneous networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Hahn, T., Pyeon, M., & Kim, G. (2019). Self-routing capsule networks. In: Advances in Neural Information Processing Systems, 32.
Han, Y., Meng, W., & Tang, W. (2023). Capsule-inferenced object detection for remote sensing images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 16, 5260–5270.
Hao, C., Yu, Z., Liu, X., Xu, J., Yue, H., & Yang, J. (2025). A simple yet effective network based on vision transformer for camouflaged object and salient object detection. IEEE Transactions on Image Processing.
He, C., Li, K., Zhang, Y., Tang, L., Zhang, Y., Guo, Z., & Li, X. (2023). Camouflaged object detection with feature decomposition and edge reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22046–22055.
He, C., Li, K., Zhang, Y., Xu, G., Tang, L., Zhang, Y., Guo, Z., & Li, X. (2023). Weakly-supervised concealed object segmentation with sam-based pseudo labeling and multi-scale feature grouping. Advances in Neural Information Processing Systems, 36, 30726–30737.
He, C., Li, K., Zhang, Y., Zhang, Y., You, C., Guo, Z., Li, X., Danelljan, M., & Yu, F. (2024). Strategic preys make acute predators: Enhancing camouflaged object detectors by generating camouflaged objects. In: Proceedings of the International Conference on Learning Representations.
Hinton, G. (2019). Talk on capsule networks. York University. https://www.youtube.com/watch?v=x5Vxk9twXlE
Hinton, G. E., Krizhevsky, A., & Wang, S. D. (2011). Transforming auto-encoders. In: Proceedings of the International Conference on Artificial Neural Networks, pp 44–51.
Hinton, G. E., Sabour, S., & Frosst, N. (2018). Matrix capsules with EM routing. In: Proceedings of the International Conference on Learning Representations, pp 3856–3866.
Hu, J., Lin, J., Gong, S., & Cai, W. (2024). Relax image-specific prompt requirement in sam: A single generic prompt for segmenting camouflaged objects. Proceedings of the AAAI Conference on Artificial Intelligence, 38, 12511–12518.
Hu, X., Wang, S., Qin, X., Dai, H., Ren, W., Luo, D., Tai, Y., & Shao, L. (2023). High-resolution iterative feedback network for camouflaged object detection. Proceedings of the AAAI Conference on Artificial Intelligence, 37, 881–889.
Huang, P., Zhang, D., Cheng, D., Han, L., Zhu, P., & Han, J. (2024). M-rrfs: A memory-based robust region feature synthesizer for zero-shot object detection. International Journal of Computer Vision, pp 1–22.
Huerta, I., Rowe, D., Mozerov, M., & Gonzàlez, J. (2007). Improving background subtraction based on a casuistry of colour-motion segmentation problems. In: Pattern Recognition and Image Analysis, pp 475–482.
Jia, Q., Yao, S., Liu, Y., Fan, X., Liu, R., & Luo, Z. (2022). Segment, magnify and reiterate: Detecting camouflaged objects the hard way. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 4703–4712.
Kalman, R. E. (1960). A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering, 82(1), 35–45.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations.
Krivic, J., & Solina, F. (2004). Part-level object recognition using superquadrics. Computer Vision and Image Understanding, 95(1), 105–126.
LaLonde, R., Khosravan, N., & Bagci, U. (2024). Deformable capsules for object detection. Advanced Intelligent Systems, 6(9), 2400044.
Le, T. N., Nguyen, T. V., Nie, Z., Tran, M. T., & Sugimoto, A. (2019). Anabranch network for camouflaged object segmentation. Computer Vision and Image Understanding, 184, 45–56.
Liang, D., Zhou, X., Xu, W., Zhu, X., Zou, Z., Ye, X., Tan, X., & Bai, X. (2024). Pointmamba: A simple state space model for point cloud analysis. In: Advances in Neural Information Processing Systems.
Liu, J., Lin, R., Wu, G., Liu, R., Luo, Z., & Fan, X. (2024). Coconet: Coupled contrastive learning network with multi-level feature ensemble for multi-modality image fusion. International Journal of Computer Vision, 132(5), 1748–1775.
Liu, N., Zhang, N., Wan, K., Shao, L., & Han, J. (2021). Visual saliency transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 4702–4712.
Liu, W., Shen, X., Pun, C. M., & Cun, X. (2023). Explicit visual prompting for low-level structure segmentations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19434–19445.
Liu, Y., Zhang, Q., Zhang, D., & Han, J. (2019). Employing deep part-object relationships for salient object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1232–1241.
Liu, Y., Zhang, D., Zhang, Q., & Han, J. (2021). Integrating part-object relationship and contrast for camouflaged object detection. IEEE Transactions on Information Forensics and Security, 16, 5154–5166.
Liu, Y., Zhang, D., Liu, N., Xu, S., & Han, J. (2022). Disentangled capsule routing for fast part-object relational saliency. IEEE Transactions on Image Processing, 31, 6719–6732.
Liu, Y., Zhang, D., Zhang, Q., & Han, J. (2022). Part-object relational visual saliency. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7), 3688–3704.
Liu, Y., Cheng, D., Zhang, D., Xu, S., & Han, J. (2024). Capsule networks with residual pose routing. IEEE Transactions on Neural Networks and Learning Systems, pp 1–14.
Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Jiao, J., & Liu, Y. (2024). Vmamba: Visual state space model. Advances in Neural Information Processing Systems, 37, 103031–103063.
Liu, Y., Li, C., Dong, X., Li, L., Zhang, D., Xu, S., & Han, J. (2025). Seamless detection: Unifying salient object detection and camouflaged object detection. Expert Systems with Applications, 274, Article 126912.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 9992–10002.
Liu, Z., Zhang, Z., Tan, Y., & Wu, W. (2022). Boosting camouflaged object detection with dual-task interactive transformer. In: Proceedings of the International Conference on Pattern Recognition, pp 140–146.
Liu, Z., Deng, X., Jiang, P., Lv, C., Min, G., & Wang, X. (2024). Edge perception camouflaged object detection under frequency domain reconstruction. IEEE Transactions on Circuits and Systems for Video Technology, 34(10), 10194–10207.
Luo, N., Pan, Y., Sun, R., Zhang, T., Xiong, Z., & Wu, F. (2023). Camouflaged instance segmentation via explicit de-camouflaging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 17918–17927.
Luo, Z., Liu, N., Zhao, W., Yang, X., Zhang, D., Fan, D. P., Khan, F., & Han, J. (2024). Vscode: General visual salient and camouflaged object detection with 2d prompt learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 17169–17180.
Lv, Y., Zhang, J., Dai, Y., Li, A., Liu, B., Barnes, N., & Fan, D. P. (2021). Simultaneously localize, segment and rank the camouflaged objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11586–11596.
Ma, J., Li, F., & Wang, B. (2024). U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv:2401.04722
Margolin, R., Zelnik-Manor, L., & Tal, A. (2014). How to evaluate foreground maps. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 248–255.
Mei, H., Ji, G. P., Wei, Z., Yang, X., Wei, X., & Fan, D. P. (2021). Camouflaged object segmentation with distraction mining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 8772–8781.
Mei, H., Xu, K., Zhou, Y., Wang, Y., Piao, H., Wei, X., & Yang, X. (2023). Camouflaged object segmentation with omni perception. International Journal of Computer Vision, 131(11), 3019–3034.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems.
Pan, Y., Chen, Y., Fu, Q., Zhang, P., & Xu, X. (2011). Study on the camouflaged target detection method based on 3d convexity. Mathematical Models and Methods in Applied Sciences, 5, 152.
Pang, Y., Zhao, X., Xiang, T. Z., Zhang, L., & Lu, H. (2022). Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2160–2170.
Rahmon, G., Palaniappan, K., Toubal, I. E., Bunyak, F., Rao, R., & Seetharaman, G. (2024). Deepftsg: Multi-stream asymmetric use-net trellis encoders with shared decoder feature fusion architecture for video motion segmentation. International Journal of Computer Vision, 132(3), 776–804.
Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules. In: Advances in Neural Information Processing Systems, pp 3859–3869.
Sengottuvelan, P., Wahi, A., & Shanmugam, A. (2008). Performance of decamouflaging through exploratory image analysis. In: International Conference on Emerging Trends in Engineering and Technology, pp 6–10.
Singh, S. K., Dhawale, C. A., & Misra, S. (2013). Survey of object detection methods in camouflaged image. Ieri Procedia, 4, 351–357.
Sun, Y., Chen, G., Zhou, T., Zhang, Y., & Liu, N. (2021). Context-aware cross-level fusion network for camouflaged object detection. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp 1025–1031.
Sun, Y., Wang, S., Chen, C., & Xiang, T. Z. (2022). Boundary-guided camouflaged object detection. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp 1335–1341.
Sun, Y., Xu, C., Yang, J., Xuan, H., & Luo, L. (2024). Frequency-spatial entanglement learning for camouflaged object detection. In: Proceedings of the European Conference on Computer Vision, pp 343–360.
Tsai, Y. H. H., Srivastava, N., Goh, H., & Salakhutdinov, R. (2020). Capsules with inverted dot-product attention routing. In: Proceedings of the International Conference on Learning Representations.
Wang, J., Liu, X., Yin, Z., Wang, Y., Guo, J., Qin, H., Wu, Q., & Liu, A. (2024). Generate transferable adversarial physical camouflages via triplet attention suppression. International Journal of Computer Vision.
Xiao, F., Hu, S., Shen, Y., Fang, C., Huang, J., Tang, L., Yang, Z., Li, X., & He, C. (2024). A survey of camouflaged object detection and beyond. CAAI Artificial Intelligence Research, 3, 9150044.
Yang, F., Zhai, Q., Li, X., Huang, R., Luo, A., Cheng, H., & Fan, D. P. (2021). Uncertainty-guided transformer reasoning for camouflaged object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 4126–4135.
Yang, X., Burghardt, T., & Mirmehdi, M. (2023). Dynamic curriculum learning for great ape detection in the wild. International Journal of Computer Vision, 131(5), 1163–1181.
Yin, B., Zhang, X., Fan, D. P., Jiao, S., Cheng, M. M., Van Gool, L., & Hou, Q. (2024). Camoformer: Masked separable attention for camouflaged object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Yu, J., Jiang, Y., Wang, Z., Cao, Z., & Huang, T. (2016). Unitbox: An advanced object detection network. In: Proceedings of the ACM International Conference on Multimedia, pp 516–520.
Zhai, Q., Li, X., Yang, F., Chen, C., Cheng, H., & Fan, D. P. (2021). Mutual graph learning for camouflaged object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12997–13007.
Zhang, Q., Ge, Y., Zhang, C., & Bi, H. (2023). Tprnet: camouflaged object detection via transformer-induced progressive refinement network. The Visual Computer, 39(10), 4593–4607.
Zhang, X., Han, L., Xu, C., Zheng, Z., Ding, J., Fu, X., & Zhang, D. (2024). Mhkd: Multi-step hybrid knowledge distillation for low-resolution whole slide images glomerulus detection. IEEE Journal of Biomedical and Health Informatics.
Zhao, J., Li, X., Yang, F., Zhai, Q., Luo, A., Jiao, Z., & Cheng, H. (2024). Focusdiffuser: Perceiving local disparities for camouflaged object detection. In: Proceedings of the European Conference on Computer Vision, pp 181–198.
Zhao, X., Zhang, L., & Lu, H. (2021). Automatic polyp segmentation via multi-scale subtraction network. Medical Image Computing and Computer Assisted Intervention, 12901, 120–130.
Zhao, X., Pang, Y., Ji, W., Sheng, B., Zuo, J., Zhang, L., & Lu, H. (2024). Spider: A unified framework for context-dependent concept segmentation. In: Proceedings of the International Conference on Machine Learning.
Zhao, X., Pang, Y., Zhang, L., Lu, H., & Zhang, L. (2024). Towards diverse binary segmentation via a simple yet general gated network. International Journal of Computer Vision, 132, 1–20.
Zhou, T., Zhou, Y., Gong, C., Yang, J., & Zhang, Y. (2022). Feature aggregation and propagation network for camouflaged object detection. IEEE Transactions on Image Processing, 31, 7036–7047.
Zhou, X., Wu, Z., & Cong, R. (2024). Decoupling and integration network for camouflaged object detection. IEEE Transactions on Multimedia, 26, 7114–7129.
Zhu, H., Li, P., Xie, H., Yan, X., Liang, D., Chen, D., Wei, M., & Qin, J. (2022). I can find you! boundary-guided separated attention network for camouflaged object detection. Proceedings of the AAAI Conference on Artificial Intelligence, 36, 3608–3616.
Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., & Wang, X. (2024). Vision mamba: Efficient visual representation learning with bidirectional state space model. In: Proceedings of the International Conference on Machine Learning.
Zhuge, M., Fan, D. P., Liu, N., Zhang, D., Xu, D., & Shao, L. (2023). Salient object detection via integrity learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3738–3752.
Acknowledgements
This work was supported in part by the National Science and Technology Major Project (2022ZD0119004), in part by the National Natural Science Foundation of China under Grant 62322605, in part by the Qing Lan Project of Jiangsu Province, in part by the Natural Science Foundation of Jiangsu Province under Grant No. BK20221379, and in part by the Jiangsu Province Youth Science and Technology Talent Support Project.
Additional information
Communicated by Boxin Shi.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, D., Cheng, L., Liu, Y. et al. Mamba Capsule Routing Towards Part-Whole Relational Camouflaged Object Detection. Int J Comput Vis 133, 7201–7221 (2025). https://doi.org/10.1007/s11263-025-02530-3