Abstract
We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of the diffusion process to generate multiple instances simultaneously, conditioned on different text prompts. Second, we obtain the corresponding instance masks by aggregating the cross-attention maps associated with the object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement. Without bells and whistles, MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion significantly improves the performance of existing instance segmentation models, especially for rare and novel categories. Code: https://github.com/Jiahao000/MosaicFusion.
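To make the second design concrete, the following is a minimal, self-contained sketch of deriving an instance mask from cross-attention maps. It is an illustration under our own assumptions (function names, nearest-neighbor upsampling, and a fixed threshold), not the paper's exact implementation, which additionally applies edge-aware refinement.

```python
# Schematic of mask extraction from cross-attention maps: aggregate the maps
# for one object's text token across layers and diffusion time steps, then
# binarize. All names and the fixed threshold are illustrative assumptions.
import numpy as np

def mask_from_attention(attn_maps, out_size=(512, 512), threshold=0.5):
    """attn_maps: list of 2D arrays (one per layer/time step), possibly at
    different spatial resolutions whose sides divide out_size."""
    acc = np.zeros(out_size, dtype=np.float64)
    for a in attn_maps:
        # Upsample each map to a common resolution (nearest-neighbor repeat
        # here; a real implementation would interpolate bilinearly).
        ry, rx = out_size[0] // a.shape[0], out_size[1] // a.shape[1]
        acc += np.kron(a, np.ones((ry, rx)))
    acc /= len(attn_maps)
    # Normalize to [0, 1] and apply a simple threshold; the paper follows
    # this with edge-aware refinement, which is omitted here.
    acc = (acc - acc.min()) / (acc.max() - acc.min() + 1e-8)
    return acc >= threshold

# Toy usage with fake attention maps at two resolutions.
maps = [np.random.rand(16, 16), np.random.rand(32, 32)]
print(mask_from_attention(maps).mean())
```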
Data Availability Statements
All data supporting the findings of this study are available online. The LVIS dataset can be downloaded from https://www.lvisdataset.org/dataset. The V3Det dataset can be downloaded from https://v3det.openxlab.org.cn/. The RefCOCO, RefCOCO+, and RefCOCOg datasets can be downloaded from https://github.com/lichengunc/refer. The Stable Diffusion models used to generate data are available at https://github.com/CompVis/stable-diffusion and https://github.com/Stability-AI/generative-models.
Notes
We use the word “single” to encourage the diffusion model to generate a single object at each specific location.
We empirically find that appending a category definition after the category name reduces the semantic ambiguity of the generated images caused by the polysemy of some category names. For LVIS (Gupta et al., 2019), the category definitions are readily available in the annotations, where the meanings are mostly derived from WordNet (Miller, 1995). See Table 2d for ablations on the effect of different prompt templates.
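As a small illustration of these two notes, a per-region prompt could be assembled as below. The helper name is ours, and the template wording follows the form implied by the notes; see Table 2d for the actual ablated templates.

```python
# Hypothetical prompt builder reflecting the two notes above: the word
# "single" nudges the model toward one object per region, and appending the
# LVIS/WordNet definition disambiguates polysemous category names.
def build_prompt(name: str, definition: str) -> str:
    return f"a photo of a single {name.replace('_', ' ')}, {definition}"

# Example with an LVIS-style category entry (definition paraphrased).
print(build_prompt("bow_(weapon)", "a weapon for shooting arrows"))
# -> "a photo of a single bow (weapon), a weapon for shooting arrows"
```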
The image resolution used during training in Stable Diffusion is \(512\times 512\). We notice that generation quality degrades if one deviates too far from this training resolution. Thus, we simply adopt the aspect ratio of the average LVIS image and keep the longer dimension at 512.
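The computation behind this choice is simple; the sketch below uses a placeholder average image size and assumes each side is rounded to a multiple of 8, as Stable Diffusion's latent grid requires.

```python
# Back-of-envelope sketch of the canvas-size rule above: keep the average
# LVIS aspect ratio, clamp the longer side to 512, and round each side to a
# multiple of 8 (Stable Diffusion's latent grid is downsampled by 8).
def canvas_size(avg_w: float, avg_h: float, long_side: int = 512):
    scale = long_side / max(avg_w, avg_h)
    round8 = lambda x: max(8, int(round(x / 8)) * 8)
    return round8(avg_w * scale), round8(avg_h * scale)

# Placeholder average size; the paper uses the actual LVIS average.
print(canvas_size(640, 480))  # -> (512, 384)
```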
We are aware that different works may use different notations for a \(1\times \) training schedule. In this work, a \(1\times \) schedule always refers to a total of 16 \(\times \) 90k images.
Note that the final number of synthetic images per category is not fixed, as some images are filtered out during the mask filtering process introduced in Sect. 3.3. We observe that \(\sim \)50% of the generated images and masks are discarded accordingly.
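For illustration, a filter in the spirit of this step might look as follows; the concrete relative-area bounds are our assumption, not necessarily the criteria of Sect. 3.3.

```python
# Illustrative mask filter: discard masks whose foreground covers too little
# or too much of their region. The bounds are assumptions for illustration.
import numpy as np

def keep_mask(mask: np.ndarray, min_frac: float = 0.05, max_frac: float = 0.95) -> bool:
    frac = float(mask.mean())  # foreground fraction of a binary mask
    return min_frac <= frac <= max_frac

print(keep_mask(np.zeros((64, 64), dtype=bool)))  # False: empty mask
print(keep_mask(np.ones((64, 64), dtype=bool)))   # False: degenerate mask
```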
We find that the baseline and X-Paste cannot be fully reproduced using the official code (https://github.com/yoctta/XPaste/issues/2). Therefore, our reproduced results are somewhat lower than the originally reported performance. Nevertheless, all experiments are conducted under the same settings for a fair comparison.
References
Baranchuk, D., Rubachev, I., Voynov, A., Khrulkov, V., Babenko, A. (2022) Label-efficient semantic segmentation with diffusion models. In ICLR
Barron, J.T., Poole, B. (2016) The fast bilateral solver. In ECCV
Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M. (2020) Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934
Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., & Cohen-Or, D. (2023). Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4), 1–10.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D. (2019) MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155
Chen, P., Sheng, K., Zhang, M., Shen, Y., Li, K., Shen, C. (2022) Open vocabulary object detection with proposal mining and prediction equalization. arXiv preprint arXiv:2206.11134
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B. (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L. (2009) Imagenet: A large-scale hierarchical image database. In CVPR
Di Stefano, L., Bulgarelli, A. (1999) A simple and efficient connected components labeling algorithm. In ICIAP
Du, Y., Wei, F., Zhang, Z., Shi, M., Gao, Y., Li, G. (2022) Learning to prompt for open-vocabulary object detection with vision-language model. In CVPR
Dvornik, N., Mairal, J., Schmid, C. (2018) Modeling visual context is key to augmenting object detection datasets. In ECCV
Dwibedi, D., Misra, I., Hebert, M. (2017) Cut, paste and learn: Surprisingly easy synthesis for instance detection. In ICCV
Fang, H. S., Sun, J., Wang, R., Gou, M., Li, Y. L., Lu, C. (2019) Instaboost: Boosting instance segmentation via probability map guided copy-pasting. In ICCV
Gal, R., Arar, M., Atzmon, Y., Bermano, A.H., Chechik, G., Cohen-Or, D. (2023) Designing an encoder for fast personalization of text-to-image models. arXiv preprint arXiv:2302.12228
Gao, M., Xing, C., Niebles, J. C., Li, J., Xu, R., Liu, W., Xiong, C. (2022) Open vocabulary object detection with pseudo bounding-box labels. In ECCV
Ge, Y., Xu, J., Zhao, B. N., Itti, L., Vineet, V. (2022) Dall-e for detection: Language-driven context image synthesis for object detection. arXiv preprint arXiv:2206.09592
Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.Y., Cubuk, E.D., Le, Q.V., Zoph, B. (2021) Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR
Ghiasi, G., Gu, X., Cui, Y., Lin, T.Y. (2022) Scaling open-vocabulary image segmentation with image-level labels. In ECCV
Gu, X., Lin, T.Y., Kuo, W., Cui, Y. (2022) Open-vocabulary object detection via vision and language knowledge distillation. In ICLR
Gupta, A., Dollar, P., Girshick, R. (2019) Lvis: A dataset for large vocabulary instance segmentation. In CVPR
Han, J., Niu, M., Du, Z., Wei, L., Xie, L., Zhang, X., Tian, Q. (2020) Joint coco and lvis workshop at eccv 2020: Lvis challenge track technical report: Asynchronous semi-supervised learning for large vocabulary instance segmentation. In ECCVW
He, K., Zhang, X., Ren, S., Sun, J. (2016) Deep residual learning for image recognition. In CVPR
He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017) Mask r-cnn. In ICCV
Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D. (2023) Prompt-to-prompt image editing with cross attention control. In ICLR
Hinterstoisser, S., Lepetit, V., Wohlhart, P., Konolige, K. (2018) On pre-trained image features and synthetic images for deep learning. In ECCV Workshops
Hu, X., Jiang, Y., Tang, K., Chen, J., Miao, C., Zhang, H. (2020) Learning to segment the tail. In CVPR
Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T. (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In ICML
Karras, T., Aittala, M., Aila, T., Laine, S. (2022) Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T. (2014) Referitgame: Referring to objects in photographs of natural scenes. In EMNLP
Kingma, D.P., Welling, M. (2014) Auto-encoding variational bayes. In ICLR
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al (2023) Segment anything. In ICCV
Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y. (2022) Multi-concept customization of text-to-image diffusion. arXiv preprint arXiv:2212.04488
Kuo, W., Cui, Y., Gu, X., Piergiovanni, A., Angelova, A. (2023) F-vlm: Open-vocabulary object detection upon frozen vision and language models. In ICLR
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al. (2020). The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7), 1956–1981.
Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J. (2024) Lisa: Reasoning segmentation via large language model. In CVPR
Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R. (2022a) Language-driven semantic segmentation. In ICLR
Li, D., Ling, H., Kim, S.W., Kreis, K., Fidler, S., Torralba, A. (2022b) Bigdatasetgan: Synthesizing imagenet with pixel-wise annotations. In CVPR
Li, Y., Wang, T., Kang, B., Tang, S., Wang, C., Li, J., Feng, J. (2020) Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In CVPR
Li, Z., Zhou, Q., Zhang, X., Zhang, Y., Wang, Y., Xie, W. (2023) Guiding text-to-image diffusion model towards grounded generation. arXiv preprint arXiv:2301.05221
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L. (2014) Microsoft coco: Common objects in context. In ECCV
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S. (2017) Feature pyramid networks for object detection. In CVPR
Liu, J., Sun, Y., Han, C., Dou, Z., Li, W. (2020) Deep representation learning on long-tailed data: A learnable embedding augmentation perspective. In CVPR
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B. (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV
Loshchilov, I., Hutter, F. (2017) Sgdr: Stochastic gradient descent with warm restarts. In ICLR
Loshchilov, I., Hutter, F. (2019) Decoupled weight decay regularization. In ICLR
Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Sun, Y., et al (2024) Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K. (2016) Generation and comprehension of unambiguous object descriptions. In CVPR
Miller, G. A. (1995). Wordnet: A lexical database for english. Communications of the ACM, 38(11), 39–41.
Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., et al (2022) Simple open-vocabulary object detection with vision transformers. In ECCV
OpenAI (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774
Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1), 62–66.
Parmar, G., Singh, K.K., Zhang, R., Li, Y., Lu, J., Zhu, J.Y. (2023) Zero-shot image-to-image translation. arXiv preprint arXiv:2302.03027
Phung, Q., Ge, S., Huang, J. B. (2023) Grounded text-to-image synthesis with attention refocusing. arXiv preprint arXiv:2306.05427
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R. (2023) Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al (2021) Learning transferable visual models from natural language supervision. In ICML
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M. (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
Rasheed, H., Maaz, M., Khattak, M.U., Khan, S., Khan, F.S. (2022) Bridging the gap between object and image-level representations for open-vocabulary detection. In NeurIPS
Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R.M., Xing, E., Yang, M.H., Khan, F.S. (2024) Glamm: Pixel grounding large multimodal model. In CVPR
Ren, J., Yu, C., Ma, X., Zhao, H., Yi, S., et al (2020) Balanced meta-softmax for long-tailed visual recognition. In NeurIPS
Richter, S. R., Vineet, V., Roth, S., Koltun, V. (2016) Playing for data: Ground truth from computer games. In ECCV
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B. (2022) High-resolution image synthesis with latent diffusion models. In CVPR
Ronneberger, O., Fischer, P., Brox, T. (2015) U-net: Convolutional networks for biomedical image segmentation. In MICCAI
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., et al (2022) Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J. (2019) Objects365: A large-scale, high-quality dataset for object detection. In ICCV
Sharma, P., Ding, N., Goodman, S., Soricut, R. (2018) Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL
Su, H., Qi, C.R., Li, Y., Guibas, L.J. (2015) Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In ICCV
Tan, J., Wang, C., Li, B., Li, Q., Ouyang, W., Yin, C., Yan, J. (2020a) Equalization loss for long-tailed object recognition. In CVPR
Tan, J., Zhang, G., Deng, H., Wang, C., Lu, L., Li, Q., Dai, J. (2020b) 1st place solution of lvis challenge 2020: A good box is not a guarantee of a good mask. arXiv preprint arXiv:2009.01559
Tan, J., Lu, X., Zhang, G., Yin, C., Li, Q. (2021) Equalization loss v2: A new gradient balance approach for long-tailed object detection. In CVPR
Tan, M., Pang, R., Le, Q.V. (2020c) Efficientdet: Scalable and efficient object detection. In CVPR
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F. (2023a) Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al (2023b) Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288
Wang, J., Zhang, W., Zang, Y., Cao, Y., Pang, J., Gong, T., Chen, K., Liu, Z., Loy, C.C., Lin, D. (2021a) Seesaw loss for long-tailed instance segmentation. In CVPR
Wang, J., Zhang, P., Chu, T., Cao, Y., Zhou, Y., Wu, T., Wang, B., He, C., Lin, D. (2023) V3det: Vast vocabulary visual detection dataset. In ICCV
Wang, T., Li, Y., Kang, B., Li, J., Liew, J., Tang, S., Hoi, S., Feng, J. (2020) The devil is in classification: A simple framework for long-tail instance segmentation. In ECCV
Wang, T., Zhu, Y., Zhao, C., Zeng, W., Wang, J., Tang, M. (2021b) Adaptive class suppression loss for long-tail object detection. In CVPR
Waqas, Z. S., Arora, A., Gupta, A., Khan, S., Sun, G., Khan, S. F., Zhu, F., Shao, L., Xia, G. S., Bai, X. (2019) isaid: A large-scale dataset for instance segmentation in aerial images. In CVPRW
Wei, F., Gao, Y., Wu, Z., Hu, H., Lin, S. (2021) Aligning pretraining for detection via object-level contrastive learning. In NeurIPS
Wu, J., Song, L., Wang, T., Zhang, Q., Yuan, J. (2020) Forest r-cnn: Large-vocabulary long-tailed object detection and instance segmentation. In ACM-MM
Wu, J., Li, X., Xu, S., Yuan, H., Ding, H., Yang, Y., Li, X., Zhang, J., Tong, Y., Jiang, X., Ghanem, B., Tao, D. (2023) Towards open vocabulary learning: A survey. arXiv preprint arXiv:2306.15880
Wu, S., Jin, S., Zhang, W., Xu, L., Liu, W., Li, W., Loy, C. C. (2024) F-lmm: Grounding frozen large multimodal models. arXiv preprint arXiv:2406.05821
Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R. (2019) Detectron2. https://github.com/facebookresearch/detectron2
Xie, J., Li, Y., Huang, Y., Liu, H., Zhang, W., Zheng, Y., Shou, M.Z. (2023) Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In ICCV
Zang, Y., Huang, C., Loy, C.C. (2021) Fasa: Feature augmentation and sampling adaptation for long-tailed instance segmentation. In ICCV
Zang, Y., Li, W., Zhou, K., Huang, C., Loy, C.C. (2022) Open-vocabulary detr with conditional matching. In ECCV
Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F. (2021) Open-vocabulary object detection using captions. In CVPR
Zhang, C., Pan, T. Y., Li, Y., Hu, H., Xuan, D., Changpinyo, S., Gong, B., Chao, W. L. (2021a) Mosaicos: A simple and effective use of object-centric images for long-tailed object detection. In ICCV
Zhang, J., Huang, J., Jin, S., Lu, S. (2023) Vision-language models for vision tasks: A survey. arXiv preprint arXiv:2304.00685
Zhang, S., Li, Z., Yan, S., He, X., Sun, J. (2021b) Distribution alignment: A unified framework for long-tail visual recognition. In CVPR
Zhang, Y., Ling, H., Gao, J., Yin, K., Lafleche, J. F., Barriuso, A., Torralba, A., Fidler, S. (2021c) Datasetgan: Efficient labeled data factory with minimal human effort. In CVPR
Zhao, H., Sheng, D., Bao, J., Chen, D., Chen, D., Wen, F., Yuan, L., Liu, C., Zhou, W., Chu, Q., Zhang, W., Yu, N. (2023) X-paste: Revisiting scalable copy-paste for instance segmentation using clip and stablediffusion. In ICML
Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., et al (2022) Regionclip: Region-based language-image pretraining. In CVPR
Zhou, K., Yang, J., Loy, C.C., Liu, Z. (2022a) Conditional prompt learning for vision-language models. In CVPR
Zhou, X., Koltun, V., Krähenbühl, P. (2021) Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461
Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I. (2022b) Detecting twenty-thousand classes using image-level supervision. In ECCV
Zong, Z., Song, G., Liu, Y. (2023) Detrs with collaborative hybrid assignments training. In ICCV
Acknowledgements
This study is supported under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). The project is also supported by NTU NAP and Singapore MOE AcRF Tier 2 (MOE-T2EP20120-0001, MOE-T2EP20221-0012).
Additional information
Communicated by Zhun Zhong.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xie, J., Li, W., Li, X. et al. MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation. Int J Comput Vis 133, 1456–1475 (2025). https://doi.org/10.1007/s11263-024-02223-3