
Ultra-High Resolution Image Segmentation via Locality-Aware Context Fusion and Alternating Local Enhancement

Published in: International Journal of Computer Vision

Abstract

Ultra-high resolution image segmentation has attracted increasing interest in recent years due to its practical applications. In this paper, we improve upon the widely used pipeline for high-resolution image segmentation, in which an ultra-high resolution image is partitioned into regular patches for local segmentation and the local results are then merged into a high-resolution semantic mask. In particular, we introduce a novel segmentation model based on locality-aware context fusion to process local patches, where the relevance between a local patch and its various contexts is exploited jointly and complementarily to handle semantic regions with large variations. Additionally, we present an alternating local enhancement module that restricts the negative impact of redundant information introduced from the contexts and is thereby able to refine the locality-aware features and produce improved results. Furthermore, comprehensive experiments demonstrate that our model outperforms other state-of-the-art methods on public benchmarks and verify the effectiveness of the proposed modules. Our code is available at: https://github.com/liqiokkk/FCtL.
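To make the pipeline described above concrete, below is a minimal sketch (in Python/NumPy, not the authors' released implementation) of the generic partition-and-merge scaffold: the ultra-high resolution image is tiled into regular patches, each patch is segmented by a local model, and overlapping local predictions are averaged back into a full-resolution mask. The patch size, stride, and the local_segmenter callable are illustrative assumptions; in the paper, the per-patch model would additionally consume contextual crops around each patch.

```python
# Minimal sketch of the patch-based segmentation scaffold (illustrative only).
import numpy as np


def _starts(length, patch, stride):
    """Start offsets along one axis that cover it fully, clamping the last patch to the border."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)
    return starts


def segment_by_patches(image, local_segmenter, num_classes, patch=512, stride=256):
    """image: (H, W, C) array; local_segmenter maps a patch to (h, w, num_classes) logits."""
    h, w = image.shape[:2]
    logits = np.zeros((h, w, num_classes), dtype=np.float32)
    counts = np.zeros((h, w, 1), dtype=np.float32)
    for y in _starts(h, patch, stride):
        for x in _starts(w, patch, stride):
            tile = image[y:y + patch, x:x + patch]
            pred = local_segmenter(tile)               # local (per-patch) segmentation
            logits[y:y + patch, x:x + patch] += pred   # accumulate overlapping predictions
            counts[y:y + patch, x:x + patch] += 1.0
    logits /= np.maximum(counts, 1.0)                  # average where patches overlap
    return logits.argmax(axis=-1)                      # (H, W) full-resolution semantic mask
```

Averaging overlapping logits is one common merging choice; the paper's contribution lies in the per-patch model itself rather than in this scaffold.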

Data Availability

The datasets used in our work, DeepGlobe (https://competitions.codalab.org/competitions/18468), Inria Aerial (https://project.inria.fr/aerialimagelabeling), and ISIC (https://challenge.isic-archive.com/data), are publicly available for research.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 62072110, U21A20471, U21A20472).

Author information

Corresponding author

Correspondence to Yuanlong Yu.

Additional information

Communicated by Yasuyuki Matsushita.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, W., Li, Q., Lin, X. et al. Ultra-High Resolution Image Segmentation via Locality-Aware Context Fusion and Alternating Local Enhancement. Int J Comput Vis 132, 5030–5047 (2024). https://doi.org/10.1007/s11263-024-02045-3
