
MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation

International Journal of Computer Vision

Abstract

We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of the diffusion process to generate multiple instances simultaneously, conditioned on different text prompts. Second, we obtain the corresponding instance masks by aggregating the cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement. Without bells and whistles, MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion significantly improves the performance of existing instance segmentation models, especially for rare and novel categories. Code: https://github.com/Jiahao000/MosaicFusion.
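To make the second design concrete, the sketch below (ours, not the authors' released code) averages an object token's cross-attention maps over all layers and diffusion time steps and binarizes the result with Otsu's method (Otsu, 1979); the function name, array shapes, and min-max normalization are illustrative assumptions, and the subsequent edge-aware refinement step is omitted.

    import numpy as np
    from skimage.filters import threshold_otsu  # Otsu's method (Otsu, 1979)

    def mask_from_cross_attention(attn_maps, token_idx):
        """Hypothetical sketch of the mask-extraction step.

        attn_maps: sequence of arrays, one per (layer, time step) pair,
                   each of shape (H, W, num_text_tokens), already
                   upsampled to a common spatial resolution.
        token_idx: index of the object's token in the text prompt.
        """
        # Aggregate: average the object token's attention over all
        # layers and diffusion time steps.
        avg = np.stack([a[..., token_idx] for a in attn_maps]).mean(axis=0)

        # Normalize to [0, 1] so the threshold operates on a fixed range.
        avg = (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)

        # Binarize; an edge-aware refinement (e.g., the fast bilateral
        # solver of Barron & Poole, 2016) would follow in the full pipeline.
        return avg > threshold_otsu(avg)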


Data Availability Statement

All data supporting the findings of this study are available online. The LVIS dataset can be downloaded from https://www.lvisdataset.org/dataset. The V3Det dataset can be downloaded from https://v3det.openxlab.org.cn/. The RefCOCO, RefCOCO+, and RefCOCOg datasets can be downloaded from https://github.com/lichengunc/refer. The Stable Diffusion models used to generate data are available at https://github.com/CompVis/stable-diffusion and https://github.com/Stability-AI/generative-models.

Notes

  1. We use the word “single” to encourage the diffusion model to generate a single object at each specific location.

  2. We empirically find that appending a category definition after the category name reduces the semantic ambiguity of the generated images caused by the polysemy of some category names. For LVIS (Gupta et al., 2019), the category definitions are readily available in the annotations, where the meanings are mostly derived from WordNet (Miller, 1995). See Table 2d for ablations on the effect of different prompt templates; a concrete prompt-construction sketch follows these notes.

  3. We show in the experiment section (see Table 2g, Table 2h, and Fig. 3) that averaging cross-attention maps across all layers and time steps is necessary to achieve the best performance.

  4. The image resolution used during training in Stable Diffusion is \(512\times 512\). We notice that the generated results degrade if one deviates too much from this training resolution. Thus, we simply adopt the aspect ratio of the average LVIS image and fix the longer dimension at 512 (a worked example follows these notes).

  5. We are aware that different works may use different notations for a \(1\times \) training schedule. In this work, a \(1\times \) schedule always refers to a total of 16 \(\times \) 90k = 1.44M training images.

  6. Note that the final number of synthetic images per category is not fixed, as some images are filtered out during the mask filtering process introduced in Sect. 3.3. We observe that \(\sim \)50% of the generated images and masks are discarded accordingly.

  7. We choose X-Paste for comparison due to its open-sourced implementation. Note that Li et al. (2023) use a different setting, training and testing models on their own synthetic datasets. Thus, an apples-to-apples quantitative comparison with Li et al. (2023) is infeasible.

  8. We find that the baseline and X-Paste cannot be fully reproduced using the official code (https://github.com/yoctta/XPaste/issues/2). Therefore, our reproduced results are somewhat lower than the originally reported performance. Nevertheless, all experiments are conducted under the same settings for a fair comparison.
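As a concrete illustration of the prompt template discussed in Note 2, the following sketch assembles a prompt from an LVIS-style category entry; the exact wording ("a photo of a single ...") and the example definition are illustrative assumptions rather than the paper's verbatim template.

    def build_prompt(name: str, definition: str) -> str:
        # "single" nudges the model toward one object per region (Note 1);
        # the appended definition disambiguates polysemous names (Note 2).
        return f"a photo of a single {name}, {definition}"

    # "bat" is polysemous (animal vs. sports equipment); the definition
    # pins down the intended sense. The values here are illustrative.
    print(build_prompt("bat", "a club used for hitting a ball in various games"))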
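For the canvas geometry in Note 4, the shorter side can be derived from the training resolution and the dataset's average aspect ratio; the 4:3 ratio below is an assumed value for illustration, and rounding to a multiple of 64 reflects the spatial stride commonly required by Stable Diffusion's architecture.

    # Assumed average LVIS aspect ratio (width : height); illustrative only.
    avg_aspect = 4 / 3
    long_side = 512  # Stable Diffusion's training resolution
    # Round the shorter side to a multiple of 64 (the UNet's total stride).
    short_side = round(long_side / avg_aspect / 64) * 64
    print(long_side, short_side)  # -> 512 384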

References

  • Baranchuk, D., Rubachev, I., Voynov, A., Khrulkov, V., Babenko, A. (2022) Label-efficient semantic segmentation with diffusion models. In ICLR

  • Barron, J.T., Poole, B. (2016) The fast bilateral solver. In ECCV

  • Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M. (2020) Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934

  • Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., & Cohen-Or, D. (2023). Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4), 1–10.


  • Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D. (2019) MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155

  • Chen, P., Sheng, K., Zhang, M., Shen, Y., Li, K., Shen, C. (2022) Open vocabulary object detection with proposal mining and prediction equalization. arXiv preprint arXiv:2206.11134

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B. (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR

  • Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L. (2009) Imagenet: A large-scale hierarchical image database. In CVPR

  • Di Stefano, L., Bulgarelli, A. (1999) A simple and efficient connected components labeling algorithm. In ICIAP

  • Du, Y., Wei, F., Zhang, Z., Shi, M., Gao, Y., Li, G. (2022) Learning to prompt for open-vocabulary object detection with vision-language model. In CVPR

  • Dvornik, N., Mairal, J., Schmid, C. (2018) Modeling visual context is key to augmenting object detection datasets. In ECCV

  • Dwibedi, D., Misra, I., Hebert, M. (2017) Cut, paste and learn: Surprisingly easy synthesis for instance detection. In ICCV

  • Fang, H. S., Sun, J., Wang, R., Gou, M., Li, Y. L., Lu, C. (2019) Instaboost: Boosting instance segmentation via probability map guided copy-pasting. In ICCV

  • Gal, R., Arar, M., Atzmon, Y., Bermano, A.H., Chechik, G., Cohen-Or, D. (2023) Designing an encoder for fast personalization of text-to-image models. arXiv preprint arXiv:2302.12228

  • Gao, M., Xing, C., Niebles, J. C., Li, J., Xu, R., Liu, W., Xiong, C. (2022) Open vocabulary object detection with pseudo bounding-box labels. In ECCV

  • Ge, Y., Xu, J., Zhao, B. N., Itti, L., Vineet, V. (2022) Dall-e for detection: Language-driven context image synthesis for object detection. arXiv preprint arXiv:2206.09592

  • Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.Y., Cubuk, E.D., Le, Q.V., Zoph, B. (2021) Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR

  • Ghiasi, G., Gu, X., Cui, Y., Lin, T.Y. (2022) Scaling open-vocabulary image segmentation with image-level labels. In ECCV

  • Gu, X., Lin, T.Y., Kuo, W., Cui, Y. (2022) Open-vocabulary object detection via vision and language knowledge distillation. In ICLR

  • Gupta, A., Dollar, P., Girshick, R. (2019) Lvis: A dataset for large vocabulary instance segmentation. In CVPR

  • Han, J., Niu, M., Du, Z., Wei, L., Xie, L., Zhang, X., Tian, Q. (2020) Joint coco and lvis workshop at eccv 2020: Lvis challenge track technical report: Asynchronous semi-supervised learning for large vocabulary instance segmentation. In ECCVW

  • He, K., Zhang, X., Ren, S., Sun, J. (2016) Deep residual learning for image recognition. In CVPR

  • He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017) Mask r-cnn. In ICCV

  • Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D. (2023) Prompt-to-prompt image editing with cross attention control. In ICLR

  • Hinterstoisser, S., Lepetit, V., Wohlhart, P., Konolige, K. (2018) On pre-trained image features and synthetic images for deep learning. In ECCV Workshops

  • Hu, X., Jiang, Y., Tang, K., Chen, J., Miao, C., Zhang, H. (2020) Learning to segment the tail. In CVPR

  • Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T. (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In ICML

  • Karras, T., Aittala, M., Aila, T., Laine, S. (2022) Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364

  • Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T. (2014) Referitgame: Referring to objects in photographs of natural scenes. In EMNLP

  • Kingma, D.P., Welling, M. (2014) Auto-encoding variational bayes. In ICLR

  • Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al (2023) Segment anything. In ICCV

  • Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y. (2022) Multi-concept customization of text-to-image diffusion. arXiv preprint arXiv:2212.04488

  • Kuo, W., Cui, Y., Gu, X., Piergiovanni, A., Angelova, A. (2023) F-vlm: Open-vocabulary object detection upon frozen vision and language models. In ICLR

  • Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al. (2020). The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7), 1956–1981.


  • Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J. (2024) Lisa: Reasoning segmentation via large language model. In CVPR

  • Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R. (2022a) Language-driven semantic segmentation. In ICLR

  • Li, D., Ling, H., Kim, SW., Kreis, K., Fidler, S., Torralba, A. (2022b) Bigdatasetgan: Synthesizing imagenet with pixel-wise annotations. In CVPR

  • Li, Y., Wang, T., Kang, B., Tang, S., Wang, C., Li, J., Feng, J. (2020) Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In CVPR

  • Li, Z., Zhou, Q., Zhang, X., Zhang, Y., Wang, Y., Xie, W. (2023) Guiding text-to-image diffusion model towards grounded generation. arXiv preprint arXiv:2301.05221

  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L. (2014) Microsoft coco: Common objects in context. In ECCV

  • Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S. (2017) Feature pyramid networks for object detection. In CVPR

  • Liu, J., Sun, Y., Han, C., Dou, Z., Li, W. (2020) Deep representation learning on long-tailed data: A learnable embedding augmentation perspective. In CVPR

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B. (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV

  • Loshchilov, I., Hutter, F. (2017) Sgdr: Stochastic gradient descent with warm restarts. In ICLR

  • Loshchilov, I., Hutter, F. (2019) Decoupled weight decay regularization. In ICLR

  • Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Sun, Y., et al (2024) Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525

  • Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K. (2016) Generation and comprehension of unambiguous object descriptions. In CVPR

  • Miller, G. A. (1995). Wordnet: A lexical database for english. Communications of the ACM, 38(11), 39–41.


  • Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., et al (2022) Simple open-vocabulary object detection with vision transformers. In ECCV

  • OpenAI (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  • Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1), 62–66.


  • Parmar, G., Singh, K.K., Zhang, R., Li, Y., Lu, J., Zhu, J.Y. (2023) Zero-shot image-to-image translation. arXiv preprint arXiv:2302.03027

  • Phung, Q., Ge, S., Huang, J. B. (2023) Grounded text-to-image synthesis with attention refocusing. arXiv preprint arXiv:2306.05427

  • Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R. (2023) Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952

  • Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al (2021) Learning transferable visual models from natural language supervision. In ICML

  • Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M. (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125

  • Rasheed, H., Maaz, M., Khattak, M.U., Khan, S., Khan, F.S. (2022) Bridging the gap between object and image-level representations for open-vocabulary detection. In NeurIPS

  • Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R.M., Xing, E., Yang, M.H., Khan, F.S. (2024) Glamm: Pixel grounding large multimodal model. In CVPR

  • Ren, J., Yu, C., Ma, X., Zhao, H., Yi, S., et al (2020) Balanced meta-softmax for long-tailed visual recognition. In NeurIPS

  • Richter, S. R., Vineet, V., Roth, S., Koltun, V. (2016) Playing for data: Ground truth from computer games. In ECCV

  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B. (2022) High-resolution image synthesis with latent diffusion models. In CVPR

  • Ronneberger, O., Fischer, P., Brox, T. (2015) U-net: Convolutional networks for biomedical image segmentation. In MICCAI

  • Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., et al (2022) Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS

  • Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J. (2019) Objects365: A large-scale, high-quality dataset for object detection. In ICCV

  • Sharma, P., Ding, N., Goodman, S., Soricut, R. (2018) Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL

  • Su, H., Qi, C.R., Li, Y., Guibas, L.J. (2015) Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In ICCV

  • Tan, J., Wang, C., Li, B., Li, Q., Ouyang, W., Yin, C., Yan, J. (2020a) Equalization loss for long-tailed object recognition. In CVPR

  • Tan, J., Zhang, G., Deng, H., Wang, C., Lu, L., Li, Q., Dai, J. (2020b) 1st place solution of lvis challenge 2020: A good box is not a guarantee of a good mask. arXiv preprint arXiv:2009.01559

  • Tan, J., Lu, X., Zhang, G., Yin, C., Li, Q. (2021) Equalization loss v2: A new gradient balance approach for long-tailed object detection. In CVPR

  • Tan, M., Pang, R., Le, Q.V. (2020c) Efficientdet: Scalable and efficient object detection. In CVPR

  • Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F. (2023a) Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  • Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al (2023b) Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  • Wang, J., Zhang, W., Zang, Y., Cao, Y., Pang, J., Gong, T., Chen, K., Liu, Z., Loy, C.C., Lin, D. (2021a) Seesaw loss for long-tailed instance segmentation. In CVPR

  • Wang, J., Zhang, P., Chu, T., Cao, Y., Zhou, Y., Wu, T., Wang, B., He, C., Lin, D. (2023) V3det: Vast vocabulary visual detection dataset. In ICCV

  • Wang, T., Li, Y., Kang, B., Li, J., Liew, J., Tang, S., Hoi, S., Feng, J. (2020) The devil is in classification: A simple framework for long-tail instance segmentation. In ECCV

  • Wang, T., Zhu, Y., Zhao, C., Zeng, W., Wang, J., Tang, M. (2021b) Adaptive class suppression loss for long-tail object detection. In CVPR

  • Waqas, Z. S., Arora, A., Gupta, A., Khan, S., Sun, G., Khan, S. F., Zhu, F., Shao, L., Xia, G. S., Bai, X. (2019) isaid: A large-scale dataset for instance segmentation in aerial images. In CVPRW

  • Wei, F., Gao, Y., Wu, Z., Hu, H., Lin, S. (2021) Aligning pretraining for detection via object-level contrastive learning. In NeurIPS

  • Wu, J., Song, L., Wang, T., Zhang, Q., Yuan, J. (2020) Forest r-cnn: Large-vocabulary long-tailed object detection and instance segmentation. In ACM-MM

  • Wu, J., Li, X., Xu, S., Yuan, H., Ding, H., Yang, Y., Li, X., Zhang, J., Tong, Y., Jiang, X., Ghanem, B., Tao, D. (2023) Towards open vocabulary learning: A survey. arXiv preprint arXiv:2306.15880

  • Wu, S., Jin, S., Zhang, W., Xu, L., Liu, W., Li, W., Loy, C. C. (2024) F-lmm: Grounding frozen large multimodal models. arXiv preprint arXiv:2406.05821

  • Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R. (2019) Detectron2. https://github.com/facebookresearch/detectron2

  • Xie, J., Li, Y., Huang, Y., Liu, H., Zhang, W., Zheng, Y., Shou, M.Z. (2023) Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In ICCV

  • Zang, Y., Huang, C., Loy, C.C. (2021) Fasa: Feature augmentation and sampling adaptation for long-tailed instance segmentation. In ICCV

  • Zang, Y., Li, W., Zhou, K., Huang, C., Loy, C.C. (2022) Open-vocabulary detr with conditional matching. In ECCV

  • Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F. (2021) Open-vocabulary object detection using captions. In CVPR

  • Zhang, C., Pan, T. Y., Li, Y., Hu, H., Xuan, D., Changpinyo, S., Gong, B., Chao, W. L. (2021a) Mosaicos: A simple and effective use of object-centric images for long-tailed object detection. In ICCV

  • Zhang, J., Huang, J., Jin, S., Lu, S. (2023) Vision-language models for vision tasks: A survey. arXiv preprint arXiv:2304.00685

  • Zhang, S., Li, Z., Yan, S., He, X., Sun, J. (2021b) Distribution alignment: A unified framework for long-tail visual recognition. In CVPR

  • Zhang, Y., Ling, H., Gao, J., Yin, K., Lafleche, J. F., Barriuso, A., Torralba, A., Fidler, S. (2021c) Datasetgan: Efficient labeled data factory with minimal human effort. In CVPR

  • Zhao, H., Sheng, D., Bao, J., Chen, D., Chen, D., Wen, F., Yuan, L., Liu, C., Zhou, W., Chu, Q., Zhang, W., Yu, N. (2023) X-paste: Revisiting scalable copy-paste for instance segmentation using clip and stablediffusion. In ICML

  • Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., et al (2022) Regionclip: Region-based language-image pretraining. In CVPR

  • Zhou, K., Yang, J., Loy, C.C., Liu, Z. (2022a) Conditional prompt learning for vision-language models. In CVPR

  • Zhou, X., Koltun, V., Krähenbühl, P. (2021) Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461

  • Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I. (2022b) Detecting twenty-thousand classes using image-level supervision. In ECCV

  • Zong, Z., Song, G., Liu, Y. (2023) Detrs with collaborative hybrid assignments training. In ICCV


Acknowledgements

This study is supported under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). The project is also supported by NTU NAP and Singapore MOE AcRF Tier 2 (MOE-T2EP20120-0001, MOE-T2EP20221-0012).

Author information


Corresponding author

Correspondence to Chen Change Loy.

Additional information

Communicated by Zhun Zhong.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xie, J., Li, W., Li, X. et al. MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation. Int J Comput Vis 133, 1456–1475 (2025). https://doi.org/10.1007/s11263-024-02223-3
