Abstract
We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of the diffusion process to generate multiple instances simultaneously, conditioned on different text prompts. Second, we obtain the corresponding instance masks by aggregating the cross-attention maps associated with the object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement. Without bells and whistles, MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion significantly improves the performance of existing instance segmentation models, especially for rare and novel categories. Code: https://github.com/Jiahao000/MosaicFusion.
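To make the second design concrete, the following is a minimal, self-contained sketch of deriving an instance mask from cross-attention maps. It is an illustration under our own assumptions (function names, nearest-neighbor upsampling, and a fixed threshold), not the paper's exact implementation, which additionally applies edge-aware refinement.

```python
# Schematic of mask extraction from cross-attention maps: aggregate the maps
# for one object's text token across layers and diffusion time steps, then
# binarize. All names and the fixed threshold are illustrative assumptions.
import numpy as np

def mask_from_attention(attn_maps, out_size=(512, 512), threshold=0.5):
    """attn_maps: list of 2D arrays (one per layer/time step), possibly at
    different spatial resolutions whose sides divide out_size."""
    acc = np.zeros(out_size, dtype=np.float64)
    for a in attn_maps:
        # Upsample each map to a common resolution (nearest-neighbor repeat
        # here; a real implementation would interpolate bilinearly).
        ry, rx = out_size[0] // a.shape[0], out_size[1] // a.shape[1]
        acc += np.kron(a, np.ones((ry, rx)))
    acc /= len(attn_maps)
    # Normalize to [0, 1] and apply a simple threshold; the paper follows
    # this with edge-aware refinement, which is omitted here.
    acc = (acc - acc.min()) / (acc.max() - acc.min() + 1e-8)
    return acc >= threshold

# Toy usage with fake attention maps at two resolutions.
maps = [np.random.rand(16, 16), np.random.rand(32, 32)]
print(mask_from_attention(maps).mean())
```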
Data Availability Statements
All data supporting the findings of this study are available online. The LVIS dataset can be downloaded from https://www.lvisdataset.org/dataset. The V3Det dataset can be downloaded from https://v3det.openxlab.org.cn/. The RefCOCO, RefCOCO+, and RefCOCOg datasets can be downloaded from https://github.com/lichengunc/refer. The Stable Diffusion models used to generate data are available at https://github.com/CompVis/stable-diffusion and https://github.com/Stability-AI/generative-models.
Notes
We use the word “single” to encourage the diffusion model to generate a single object at each specific location.
We empirically find that appending a category definition after the category name reduces the semantic ambiguity of the generated images caused by the polysemy of some category names. For LVIS (Gupta et al., 2019), the category definitions are readily available in the annotations, where the meanings are mostly derived from WordNet (Miller, 1995). See Table 2d for ablations on the effect of different prompt templates.
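As a small illustration of these two notes, a per-region prompt could be assembled as below. The helper name is ours, and the template wording follows the form implied by the notes; see Table 2d for the actual ablated templates.

```python
# Hypothetical prompt builder reflecting the two notes above: the word
# "single" nudges the model toward one object per region, and appending the
# LVIS/WordNet definition disambiguates polysemous category names.
def build_prompt(name: str, definition: str) -> str:
    return f"a photo of a single {name.replace('_', ' ')}, {definition}"

# Example with an LVIS-style category entry (definition paraphrased).
print(build_prompt("bow_(weapon)", "a weapon for shooting arrows"))
# -> "a photo of a single bow (weapon), a weapon for shooting arrows"
```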
The image resolution used during training in Stable Diffusion is \(512\times 512\). We notice that generation quality degrades if one deviates too far from this training resolution. Thus, we simply adopt the aspect ratio of the average LVIS image and keep the longer dimension at 512.
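The computation behind this choice is simple; the sketch below uses a placeholder average image size and assumes each side is rounded to a multiple of 8, as Stable Diffusion's latent grid requires.

```python
# Back-of-envelope sketch of the canvas-size rule above: keep the average
# LVIS aspect ratio, clamp the longer side to 512, and round each side to a
# multiple of 8 (Stable Diffusion's latent grid is downsampled by 8).
def canvas_size(avg_w: float, avg_h: float, long_side: int = 512):
    scale = long_side / max(avg_w, avg_h)
    round8 = lambda x: max(8, int(round(x / 8)) * 8)
    return round8(avg_w * scale), round8(avg_h * scale)

# Placeholder average size; the paper uses the actual LVIS average.
print(canvas_size(640, 480))  # -> (512, 384)
```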
We are aware that different works may use different notations for a \(1\times \) training schedule. In this work, a \(1\times \) schedule always refers to a total of 16 \(\times \) 90k images.
Note that the final number of synthetic images per category is not fixed, as some images are filtered out during the mask filtering process introduced in Sect. 3.3. We observe that \(\sim \)50% of the generated images and masks are discarded accordingly.
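For illustration, a filter in the spirit of this step might look as follows; the concrete relative-area bounds are our assumption, not necessarily the criteria of Sect. 3.3.

```python
# Illustrative mask filter: discard masks whose foreground covers too little
# or too much of their region. The bounds are assumptions for illustration.
import numpy as np

def keep_mask(mask: np.ndarray, min_frac: float = 0.05, max_frac: float = 0.95) -> bool:
    frac = float(mask.mean())  # foreground fraction of a binary mask
    return min_frac <= frac <= max_frac

print(keep_mask(np.zeros((64, 64), dtype=bool)))  # False: empty mask
print(keep_mask(np.ones((64, 64), dtype=bool)))   # False: degenerate mask
```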
We find that the baseline and X-Paste cannot be fully reproduced using the official code (https://github.com/yoctta/XPaste/issues/2). Therefore, our reproduced results are somewhat lower than the originally reported performance. Nevertheless, all experiments are conducted under the same settings for a fair comparison.
References
Baranchuk, D., Rubachev, I., Voynov, A., Khrulkov, V., Babenko, A. (2022) Label-efficient semantic segmentation with diffusion models. In ICLR
Barron, J.T., Poole, B. (2016) The fast bilateral solver. In ECCV
Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M. (2020) Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934
Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., & Cohen-Or, D. (2023). Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4), 1–10.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D. (2019) MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155
Chen, P., Sheng, K., Zhang, M., Shen, Y., Li, K., Shen, C. (2022) Open vocabulary object detection with proposal mining and prediction equalization. arXiv preprint arXiv:2206.11134
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B. (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L. (2009) Imagenet: A large-scale hierarchical image database. In CVPR
Di Stefano, L., Bulgarelli, A. (1999) A simple and efficient connected components labeling algorithm. In ICIAP
Du, Y., Wei, F., Zhang, Z., Shi, M., Gao, Y., Li, G. (2022) Learning to prompt for open-vocabulary object detection with vision-language model. In CVPR
Dvornik, N., Mairal, J., Schmid, C. (2018) Modeling visual context is key to augmenting object detection datasets. In ECCV
Dwibedi, D., Misra, I., Hebert, M. (2017) Cut, paste and learn: Surprisingly easy synthesis for instance detection. In ICCV
Fang, H. S., Sun, J., Wang, R., Gou, M., Li, Y. L., Lu, C. (2019) Instaboost: Boosting instance segmentation via probability map guided copy-pasting. In ICCV
Gal, R., Arar, M., Atzmon, Y., Bermano, A.H., Chechik, G., Cohen-Or, D. (2023) Designing an encoder for fast personalization of text-to-image models. arXiv preprint arXiv:2302.12228
Gao, M., Xing, C., Niebles, J. C., Li, J., Xu, R., Liu, W., Xiong, C. (2022) Open vocabulary object detection with pseudo bounding-box labels. In ECCV
Ge, Y., Xu, J., Zhao, B. N., Itti, L., Vineet, V. (2022) Dall-e for detection: Language-driven context image synthesis for object detection. arXiv preprint arXiv:2206.09592
Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.Y., Cubuk, E.D., Le, Q.V., Zoph, B. (2021) Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR
Ghiasi, G., Gu, X., Cui, Y., Lin, T.Y. (2022) Scaling open-vocabulary image segmentation with image-level labels. In ECCV
Gu, X., Lin, T.Y., Kuo, W., Cui, Y. (2022) Open-vocabulary object detection via vision and language knowledge distillation. In ICLR
Gupta, A., Dollar, P., Girshick, R. (2019) Lvis: A dataset for large vocabulary instance segmentation. In CVPR
Han, J., Niu, M., Du, Z., Wei, L., Xie, L., Zhang, X., Tian, Q. (2020) Joint coco and lvis workshop at eccv 2020: Lvis challenge track technical report: Asynchronous semi-supervised learning for large vocabulary instance segmentation. In ECCVW
He, K., Zhang, X., Ren, S., Sun, J. (2016) Deep residual learning for image recognition. In CVPR
He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017) Mask r-cnn. In ICCV
Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D. (2023) Prompt-to-prompt image editing with cross attention control. In ICLR
Hinterstoisser, S., Lepetit, V., Wohlhart, P., Konolige, K. (2018) On pre-trained image features and synthetic images for deep learning. In ECCV Workshops
Hu, X., Jiang, Y., Tang, K., Chen, J., Miao, C., Zhang, H. (2020) Learning to segment the tail. In CVPR
Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T. (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In ICML
Karras, T., Aittala, M., Aila, T., Laine, S. (2022) Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T. (2014) Referitgame: Referring to objects in photographs of natural scenes. In EMNLP
Kingma, D.P., Welling, M. (2014) Auto-encoding variational bayes. In ICLR
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al (2023) Segment anything. In ICCV
Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y. (2022) Multi-concept customization of text-to-image diffusion. arXiv preprint arXiv:2212.04488
Kuo, W., Cui, Y., Gu, X., Piergiovanni, A., Angelova, A. (2023) F-vlm: Open-vocabulary object detection upon frozen vision and language models. In ICLR
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al. (2020). The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7), 1956–1981.
Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J. (2024) Lisa: Reasoning segmentation via large language model. In CVPR
Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R. (2022a) Language-driven semantic segmentation. In ICLR
Li, D., Ling, H., Kim, S.W., Kreis, K., Fidler, S., Torralba, A. (2022b) Bigdatasetgan: Synthesizing imagenet with pixel-wise annotations. In CVPR
Li, Y., Wang, T., Kang, B., Tang, S., Wang, C., Li, J., Feng, J. (2020) Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In CVPR
Li, Z., Zhou, Q., Zhang, X., Zhang, Y., Wang, Y., Xie, W. (2023) Guiding text-to-image diffusion model towards grounded generation. arXiv preprint arXiv:2301.05221
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L. (2014) Microsoft coco: Common objects in context. In ECCV
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S. (2017) Feature pyramid networks for object detection. In CVPR
Liu, J., Sun, Y., Han, C., Dou, Z., Li, W. (2020) Deep representation learning on long-tailed data: A learnable embedding augmentation perspective. In CVPR
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B. (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV
Loshchilov, I., Hutter, F. (2017) Sgdr: Stochastic gradient descent with warm restarts. In ICLR
Loshchilov, I., Hutter, F. (2019) Decoupled weight decay regularization. In ICLR
Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Sun, Y., et al (2024) Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K. (2016) Generation and comprehension of unambiguous object descriptions. In CVPR
Miller, G. A. (1995). Wordnet: A lexical database for english. Communications of the ACM, 38(11), 39–41.
Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., et al (2022) Simple open-vocabulary object detection with vision transformers. In ECCV
OpenAI (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774
Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1), 62–66.
Parmar, G., Singh, K.K., Zhang, R., Li, Y., Lu, J., Zhu, J.Y. (2023) Zero-shot image-to-image translation. arXiv preprint arXiv:2302.03027
Phung, Q., Ge, S., Huang, J. B. (2023) Grounded text-to-image synthesis with attention refocusing. arXiv preprint arXiv:2306.05427
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R. (2023) Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al (2021) Learning transferable visual models from natural language supervision. In ICML
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M. (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
Rasheed, H., Maaz, M., Khattak, M.U., Khan, S., Khan, F.S. (2022) Bridging the gap between object and image-level representations for open-vocabulary detection. In NeurIPS
Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R.M., Xing, E., Yang, M.H., Khan, F.S. (2024) Glamm: Pixel grounding large multimodal model. In CVPR
Ren, J., Yu, C., Ma, X., Zhao, H., Yi, S., et al (2020) Balanced meta-softmax for long-tailed visual recognition. In NeurIPS
Richter, S. R., Vineet, V., Roth, S., Koltun, V. (2016) Playing for data: Ground truth from computer games. In ECCV
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B. (2022) High-resolution image synthesis with latent diffusion models. In CVPR
Ronneberger, O., Fischer, P., Brox, T. (2015) U-net: Convolutional networks for biomedical image segmentation. In MICCAI
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., et al (2022) Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J. (2019) Objects365: A large-scale, high-quality dataset for object detection. In ICCV
Sharma, P., Ding, N., Goodman, S., Soricut, R. (2018) Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL
Su, H., Qi, C.R., Li, Y., Guibas, L.J. (2015) Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In ICCV
Tan, J., Wang, C., Li, B., Li, Q., Ouyang, W., Yin, C., Yan, J. (2020a) Equalization loss for long-tailed object recognition. In CVPR
Tan, J., Zhang, G., Deng, H., Wang, C., Lu, L., Li, Q., Dai, J. (2020b) 1st place solution of lvis challenge 2020: A good box is not a guarantee of a good mask. arXiv preprint arXiv:2009.01559
Tan, J., Lu, X., Zhang, G., Yin, C., Li, Q. (2021) Equalization loss v2: A new gradient balance approach for long-tailed object detection. In CVPR
Tan, M., Pang, R., Le, Q.V. (2020c) Efficientdet: Scalable and efficient object detection. In CVPR
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F. (2023a) Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al (2023b) Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288
Wang, J., Zhang, W., Zang, Y., Cao, Y., Pang, J., Gong, T., Chen, K., Liu, Z., Loy, C.C., Lin, D. (2021a) Seesaw loss for long-tailed instance segmentation. In CVPR
Wang, J., Zhang, P., Chu, T., Cao, Y., Zhou, Y., Wu, T., Wang, B., He, C., Lin, D. (2023) V3det: Vast vocabulary visual detection dataset. In ICCV
Wang, T., Li, Y., Kang, B., Li, J., Liew, J., Tang, S., Hoi, S., Feng, J. (2020) The devil is in classification: A simple framework for long-tail instance segmentation. In ECCV
Wang, T., Zhu, Y., Zhao, C., Zeng, W., Wang, J., Tang, M. (2021b) Adaptive class suppression loss for long-tail object detection. In CVPR
Waqas, Z. S., Arora, A., Gupta, A., Khan, S., Sun, G., Khan, S. F., Zhu, F., Shao, L., Xia, G. S., Bai, X. (2019) isaid: A large-scale dataset for instance segmentation in aerial images. In CVPRW
Wei, F., Gao, Y., Wu, Z., Hu, H., Lin, S. (2021) Aligning pretraining for detection via object-level contrastive learning. In NeurIPS
Wu, J., Song, L., Wang, T., Zhang, Q., Yuan, J. (2020) Forest r-cnn: Large-vocabulary long-tailed object detection and instance segmentation. In ACM-MM
Wu, J., Li, X., Xu, S., Yuan, H., Ding, H., Yang, Y., Li, X., Zhang, J., Tong, Y., Jiang, X., Ghanem, B., Tao, D. (2023) Towards open vocabulary learning: A survey. arXiv preprint arXiv:2306.15880
Wu, S., Jin, S., Zhang, W., Xu, L., Liu, W., Li, W., Loy, C. C. (2024) F-lmm: Grounding frozen large multimodal models. arXiv preprint arXiv:2406.05821
Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R. (2019) Detectron2. https://github.com/facebookresearch/detectron2
Xie, J., Li, Y., Huang, Y., Liu, H., Zhang, W., Zheng, Y., Shou, M.Z. (2023) Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In ICCV
Zang, Y., Huang, C., Loy, C.C. (2021) Fasa: Feature augmentation and sampling adaptation for long-tailed instance segmentation. In ICCV
Zang, Y., Li, W., Zhou, K., Huang, C., Loy, C.C. (2022) Open-vocabulary detr with conditional matching. In ECCV
Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F. (2021) Open-vocabulary object detection using captions. In CVPR
Zhang, C., Pan, T. Y., Li, Y., Hu, H., Xuan, D., Changpinyo, S., Gong, B., Chao, W. L. (2021a) Mosaicos: A simple and effective use of object-centric images for long-tailed object detection. In ICCV
Zhang, J., Huang, J., Jin, S., Lu, S. (2023) Vision-language models for vision tasks: A survey. arXiv preprint arXiv:2304.00685
Zhang, S., Li, Z., Yan, S., He, X., Sun, J. (2021b) Distribution alignment: A unified framework for long-tail visual recognition. In CVPR
Zhang, Y., Ling, H., Gao, J., Yin, K., Lafleche, J. F., Barriuso, A., Torralba, A., Fidler, S. (2021c) Datasetgan: Efficient labeled data factory with minimal human effort. In CVPR
Zhao, H., Sheng, D., Bao, J., Chen, D., Chen, D., Wen, F., Yuan, L., Liu, C., Zhou, W., Chu, Q., Zhang, W., Yu, N. (2023) X-paste: Revisiting scalable copy-paste for instance segmentation using clip and stablediffusion. In ICML
Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., et al (2022) Regionclip: Region-based language-image pretraining. In CVPR
Zhou, K., Yang, J., Loy, C.C., Liu, Z. (2022a) Conditional prompt learning for vision-language models. In CVPR
Zhou, X., Koltun, V., Krähenbühl, P. (2021) Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461
Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I. (2022b) Detecting twenty-thousand classes using image-level supervision. In ECCV
Zong, Z., Song, G., Liu, Y. (2023) Detrs with collaborative hybrid assignments training. In ICCV
Acknowledgements
This study is supported under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). The project is also supported by NTU NAP and Singapore MOE AcRF Tier 2 (MOE-T2EP20120-0001, MOE-T2EP20221-0012).
Additional information
Communicated by Zhun Zhong.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xie, J., Li, W., Li, X. et al. MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation. Int J Comput Vis 133, 1456–1475 (2025). https://doi.org/10.1007/s11263-024-02223-3