
Ultra-High Resolution Image Segmentation via Locality-Aware Context Fusion and Alternating Local Enhancement

Published in: International Journal of Computer Vision

Abstract

Ultra-high resolution image segmentation has attracted increasing interest in recent years due to its practical applications. In this paper, we improve upon the widely used pipeline for high-resolution image segmentation, in which an ultra-high resolution image is partitioned into regular patches for local segmentation and the local results are then merged into a high-resolution semantic mask. In particular, we introduce a novel segmentation model based on locality-aware context fusion to process local patches, where the relevance between a local patch and its various contexts is exploited jointly and complementarily to handle semantic regions with large variations. Additionally, we present an alternating local enhancement module that restricts the negative impact of redundant information introduced from the contexts and is thereby able to refine the locality-aware features and produce improved results. Furthermore, comprehensive experiments demonstrate that our model outperforms other state-of-the-art methods on public benchmarks and verify the effectiveness of the proposed modules. Our code is available at: https://github.com/liqiokkk/FCtL.
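To make the pipeline described above concrete, below is a minimal sketch (in Python/NumPy, not the authors' released implementation) of the generic partition-and-merge scaffold: the ultra-high resolution image is tiled into regular patches, each patch is segmented by a local model, and overlapping local predictions are averaged back into a full-resolution mask. The patch size, stride, and the local_segmenter callable are illustrative assumptions; in the paper, the per-patch model would additionally consume contextual crops around each patch.

```python
# Minimal sketch of the patch-based segmentation scaffold (illustrative only).
import numpy as np


def _starts(length, patch, stride):
    """Start offsets along one axis that cover it fully, clamping the last patch to the border."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)
    return starts


def segment_by_patches(image, local_segmenter, num_classes, patch=512, stride=256):
    """image: (H, W, C) array; local_segmenter maps a patch to (h, w, num_classes) logits."""
    h, w = image.shape[:2]
    logits = np.zeros((h, w, num_classes), dtype=np.float32)
    counts = np.zeros((h, w, 1), dtype=np.float32)
    for y in _starts(h, patch, stride):
        for x in _starts(w, patch, stride):
            tile = image[y:y + patch, x:x + patch]
            pred = local_segmenter(tile)               # local (per-patch) segmentation
            logits[y:y + patch, x:x + patch] += pred   # accumulate overlapping predictions
            counts[y:y + patch, x:x + patch] += 1.0
    logits /= np.maximum(counts, 1.0)                  # average where patches overlap
    return logits.argmax(axis=-1)                      # (H, W) full-resolution semantic mask
```

Averaging overlapping logits is one common merging choice; the paper's contribution lies in the per-patch model itself rather than in this scaffold.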

Data Availability

The datasets used in our work, DeepGlobe (https://competitions.codalab.org/competitions/18468), Inria Aerial (https://project.inria.fr/aerialimagelabeling), and ISIC (https://challenge.isic-archive.com/data), are publicly available for research.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 62072110, U21A20471, U21A20472).

Author information

Corresponding author

Correspondence to Yuanlong Yu.

Additional information

Communicated by Yasuyuki Matsushita.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, W., Li, Q., Lin, X. et al. Ultra-High Resolution Image Segmentation via Locality-Aware Context Fusion and Alternating Local Enhancement. Int J Comput Vis 132, 5030–5047 (2024). https://doi.org/10.1007/s11263-024-02045-3
