
Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing

arXiv · Code · Hugging Face Datasets · Hugging Face Models

DIM-Edit

📰 News

[2025-10-08] We release the DIM-Edit dataset and the DIM-4.6B-T2I / DIM-4.6B-Edit models.

[2025-09-26] We upload a new version of the paper, including more results across various designers.

[2025-09-02] The DIM paper is released.

Introduction

Unified models achieve strong results in text-to-image generation but remain weak in precise editing. This limitation arises from an imbalanced division of responsibilities. The understanding module is usually treated as a translator that encodes instructions into conditions, while the generation module must act as both designer and painter. The result is that the generation module carries too much responsibility, even though it is not optimized for complex reasoning.

To address this, we introduce Draw-In-Mind (DIM), a dataset with two complementary parts:

  • DIM-T2I: 14M long-context image–text pairs that strengthen instruction comprehension.
  • DIM-Edit: 233K chain-of-thought imaginations from GPT-4o that provide explicit design blueprints.

We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight MLP, forming DIM-4.6B-T2I/Edit. With this setup, the understanding module takes on the designer responsibility, while the generation module focuses on rendering. Despite its modest size, DIM-4.6B-Edit achieves SOTA or competitive results on ImgEdit and GEdit-Bench, outperforming much larger models.

Performance

GenEval and MJHQ-30K

*: denotes using an LLM rewriter. For MJHQ(-30K), we report FID.

| Model | Params | Sin. | Two | CT. | Colors | Pos. | Attr. | Overall | MJHQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gen. Only | | | | | | | | | |
| PixArt-α | 0.6B🔥 | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 | 6.14 |
| SDXL | 2.6B🔥 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 8.76 |
| DALL-E 3 | - | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 | - |
| SD3-Medium | 2.0B🔥 | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 | 11.92 |
| Unified | | | | | | | | | |
| Janus | 1.3B🔥 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 | 10.10 |
| Emu3-Gen | 8.0B🔥 | 0.99 | 0.81 | 0.42 | 0.80 | 0.49 | 0.45 | 0.66 | - |
| Show-o | 1.3B🔥 | 0.98 | 0.80 | 0.66 | 0.84 | 0.31 | 0.50 | 0.68 | 15.18 |
| Show-o2-7B | 7.0B🔥 | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 | - |
| Janus-Pro-7B | 7.0B🔥 | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 | 13.48 |
| BAGEL | 14.0B🔥 | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 | 0.82 | - |
| MetaQuery-L | 3.0B❄️ \| 3.2B🔥 | - | - | - | - | - | - | 0.78 | 6.35 |
| DIM-4.6B-T2I | 3.0B❄️ \| 1.6B🔥 | 0.99 | 0.89 | 0.63 | 0.86 | 0.62 | 0.61 | 0.77 | 5.50 |
ImgEdit Overall

*: Q3/7B indicates using Qwen2.5-VL-3/7B as the external designer during inference. By default, GPT-4o is employed as the external designer to ensure the best performance. All models are evaluated using GPT-4.1.

| Model | Add | Adj. | Ext. | Rep. | Rem. | Back. | Sty. | Hyb. | Act. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MagicBrush | 2.84 | 1.58 | 1.51 | 1.97 | 1.58 | 1.75 | 2.38 | 1.62 | 1.22 | 1.83 |
| Instruct-P2P | 2.45 | 1.83 | 1.44 | 2.01 | 1.50 | 1.44 | 3.55 | 1.20 | 1.46 | 1.88 |
| AnyEdit | 3.18 | 2.95 | 1.88 | 2.47 | 2.23 | 2.24 | 2.85 | 1.56 | 2.65 | 2.45 |
| UltraEdit | 3.44 | 2.81 | 2.13 | 2.96 | 1.45 | 2.83 | 3.76 | 1.91 | 2.98 | 2.70 |
| Step1X-Edit | 3.88 | 3.14 | 1.76 | 3.40 | 2.41 | 3.16 | 4.63 | 2.64 | 2.52 | 3.06 |
| BAGEL | 3.56 | 3.31 | 1.70 | 3.30 | 2.62 | 3.24 | 4.49 | 2.38 | 4.17 | 3.20 |
| UniWorld-V1 | 3.82 | 3.64 | 2.27 | 3.47 | 3.24 | 2.99 | 4.21 | 2.96 | 2.74 | 3.26 |
| Janus-4o | 3.35 | 3.35 | 2.25 | 3.01 | 2.18 | 3.32 | 4.71 | 2.49 | 4.04 | 3.19 |
| GPT-4o-Image | 4.61 | 4.33 | 2.90 | 4.35 | 3.66 | 4.57 | 4.93 | 3.96 | 4.89 | 4.20 |
| DIM-4.6B-Edit | 4.09 | 3.47 | 2.30 | 4.00 | 3.43 | 3.87 | 4.92 | 2.85 | 4.08 | 3.67 |
ImgEdit Designer Ablation

Note: GPT-4o is the default setting.

| Designer | Add | Adj. | Ext. | Rep. | Rem. | Back. | Sty. | Hyb. | Act. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | 3.53 | 3.23 | 2.01 | 3.49 | 1.47 | 3.42 | 4.79 | 2.35 | 3.64 | 3.10 |
| Qwen2.5-VL-3B | 3.80 | 3.24 | 2.03 | 3.89 | 3.21 | 3.52 | 4.92 | 2.71 | 4.05 | 3.49 |
| Qwen2.5-VL-7B | 3.95 | 3.35 | 2.25 | 3.85 | 3.31 | 3.57 | 4.88 | 2.81 | 4.02 | 3.55 |
| MiMo-VL-7B | 3.95 | 3.32 | 2.20 | 3.75 | 2.46 | 3.82 | 4.88 | 2.52 | 3.93 | 3.43 |
| InternVL3.5-8B | 3.98 | 3.40 | 2.05 | 4.14 | 3.30 | 3.84 | 4.94 | 2.77 | 3.89 | 3.59 |
| GLM-4.1V-9B | 3.95 | 3.27 | 2.23 | 3.90 | 2.64 | 3.81 | 4.92 | 2.23 | 4.02 | 3.44 |
| GPT-4o | 4.09 | 3.47 | 2.30 | 4.00 | 3.43 | 3.87 | 4.92 | 2.85 | 4.08 | 3.67 |
Visualization

*: Green and Blue denote the edits of Janus-4o and Step1X-Edit respectively; Red denotes the edits of our models trained on different data corpora.

Figure panels: Overall · Add · Change · Remove · Replace · Transfer.

Dataset Usage

DIM-T2I

Not available yet.

DIM-Edit

Please first download DIM-Edit from our 🤗HF repo. You can use huggingface-cli to download it quickly:

# 1. Install the huggingface hub tools (if not yet installed)
pip install -U huggingface_hub

# 2. Log in with your Hugging Face account token
huggingface-cli login

# 3. Download the dataset
huggingface-cli download stdKonjac/DIM-Edit --repo-type dataset --local-dir ./DIM-Edit
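
If you prefer to stay in Python, the same download can be done with the huggingface_hub API. A minimal sketch that fetches the full dataset snapshot into ./DIM-Edit:

from huggingface_hub import snapshot_download

# Download the full DIM-Edit dataset snapshot into ./DIM-Edit
snapshot_download(
    repo_id="stdKonjac/DIM-Edit",
    repo_type="dataset",
    local_dir="./DIM-Edit",
)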

After downloading, navigate into the dataset folder, merge and extract the split archives using the following bash commands:

cd DIM-Edit
cat images.tar.gz.part* > images.tar.gz
tar -xvzf images.tar.gz

You will also find a JSONL file named tos_dataset_edit.jsonl in the root directory, which records all image-editing samples. Each line corresponds to a single sample with four fields:

| Field | Description |
| --- | --- |
| id | Unique identifier for each sample. |
| image_path | Path to the source image, beginning with image/. |
| image_path_target | Path to the target image, beginning with image/. |
| prompt | The CoT-style instruction describing how to transform the source into the target. |

We recommend using the huggingface datasets library to load the dataset efficiently:

from datasets import load_dataset, Features, Value

features = Features({
    "id": Value("string"),
    "image_path": Value("string"),
    "image_path_target": Value("string"),
    "prompt": Value("string"),
})

ds = load_dataset(
    "json",
    data_files="DIM-Edit/tos_dataset_edit.jsonl",
    features=features,
    split="train",
)

print(ds[0])
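
For a quick sanity check, here is a minimal sketch that opens one sample's source and target images with Pillow. It assumes the paths recorded in the JSONL resolve relative to the extracted DIM-Edit folder:

import os
from PIL import Image

sample = ds[0]
root = "DIM-Edit"  # folder containing the extracted images (assumption)

# image_path and image_path_target are the relative paths recorded in the JSONL.
src = Image.open(os.path.join(root, sample["image_path"])).convert("RGB")
tgt = Image.open(os.path.join(root, sample["image_path_target"])).convert("RGB")

print(sample["prompt"])
print(src.size, tgt.size)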

Model Usage

Environment Setup

Run the following script to set up the Python environment.

pip install -r requirements.txt

🦙 Model Zoo

Please first create a checkpoints folder in the root directory:

mkdir checkpoints

Then download the models from our 🤗HF repo below, and move them to the checkpoints folder.

*: To facilitate reproducibility, we release DIM-4.6B-Edit-Stage1, which is trained solely on the UltraEdit dataset. By fine-tuning this checkpoint on our proposed DIM-Edit dataset, you should obtain DIM-4.6B-Edit.

| Model | Task | Training Data | ImgEdit | Parameters |
| --- | --- | --- | --- | --- |
| DIM-4.6B-T2I | Text-to-Image | DIM-T2I + 6.9M Public Data | - | 3.0B❄️ + 1.6B🔥 |
| DIM-4.6B-Edit-Stage1 | Image Editing | UltraEdit | 2.76 | 3.0B❄️ + 1.6B🔥 |
| DIM-4.6B-Edit | Image Editing | UltraEdit → DIM-Edit | 3.67 | 3.0B❄️ + 1.6B🔥 |

The checkpoints should be organized as follows:

DIM/
└── checkpoints/
    ├── DIM-4.6B-T2I/
    │   ├── model.safetensors
    │   └── ...
    ├── DIM-4.6B-Edit-Stage1/
    │   ├── model.safetensors
    │   └── ...
    └── DIM-4.6B-Edit/
        ├── model.safetensors
        └── ...

Inference

T2I Generation

The demo T2I instructions are provided in cache/demo/tos_dataset_demo.jsonl, where each line is a JSON-formatted instruction such as:

{"id": "0000", "image_path": "./cache/demo/edit_demo_0000.png", "prompt": "A yummy cupcake floating in the air dark background"}

The image_path is just a placeholder, and you can modify prompt to create your own image.
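
To batch-generate from your own prompts, here is a minimal sketch that rewrites the demo JSONL in the same format. The prompts below are arbitrary examples, and the demo image is reused as the placeholder image_path; it assumes scripts/demo_t2i.sh reads the default demo file, as described above:

import json

prompts = [
    "A red bicycle leaning against a stone wall at sunset",
    "A bowl of ramen with a soft-boiled egg, studio lighting",
]

# Overwrite the demo prompt file so scripts/demo_t2i.sh picks it up unchanged.
# The demo image is reused as a dummy image_path (it is only a placeholder for T2I).
with open("cache/demo/tos_dataset_demo.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        record = {
            "id": f"{i:04d}",
            "image_path": "./cache/demo/edit_demo_0000.png",
            "prompt": prompt,
        }
        f.write(json.dumps(record) + "\n")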

To generate images from the jsonl file, run the following script:

bash scripts/demo_t2i.sh

For each instruction, the generated image will be saved at cache/inference/demo/DIM-4.6B-T2I/{id}_gen.jpg.

Image Editing

The demo edit instructions are provided in cache/demo/tos_dataset_edit_demo.jsonl, where each line is a JSON-formatted instruction such as:

{"id": "0", "image_path": "./cache/demo/edit_demo_0000.png", "prompt": "Remove the lemons on the table.", "image_path_target": "./cache/demo/edit_demo_0000.png"}

The image_path corresponds to the source image, and the prompt is the edit instruction. The image_path_target is just a placeholder.

In infer/demo_edit.py, use the set_designer_gpt API with your own key to set GPT-4o as the external designer for optimal performance.

# GPT-4o as external designer
model.set_designer_gpt(api_key='')
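
Rather than hard-coding the key, you may prefer to read it from an environment variable. A small sketch; the variable name OPENAI_API_KEY is only a convention here, not something the repo requires:

import os

# Pass the key from the environment instead of embedding it in the script.
model.set_designer_gpt(api_key=os.environ["OPENAI_API_KEY"])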

You can also use the set_designer_X API to set various open-source VLMs as the external designer. The VLMs will be downloaded to the local disk automatically.

# Qwen2.5-VL as external designer
model.set_designer_qwen(version='Qwen/Qwen2.5-VL-3B-Instruct')
model.set_designer_qwen(version='Qwen/Qwen2.5-VL-7B-Instruct')

# InternVL3.5 as external designer (recommend using transformers==4.53.0)
model.set_designer_internvl(version='OpenGVLab/InternVL3_5-8B-HF')

# MiMo-VL as external designer
model.set_designer_mimo(version='XiaomiMimo/MiMo-VL-7B-RL-2508')

# GLM-4.1V as external designer (recommend using transformers==4.53.1)
model.set_designer_glm(version='THUDM/GLM-4.1V-9B-Thinking')

To generate edited images from the jsonl file, run the following script:

bash scripts/demo_edit.sh

The model will first generate a CoT-guided edit instruction for each prompt and save it to cache/inference/demo/DIM-4.6B-Edit/tos_dataset_edit_cot_demo_gen.jsonl. Then the generated images will be saved at cache/inference/demo/DIM-4.6B-Edit/{id}_edited.jpg.

We also provide a sample GPT-4o generated CoT jsonl file at cache/demo/tos_dataset_edit_cot_demo.jsonl for reference.
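
To see what a designer blueprint looks like, here is a minimal sketch that pretty-prints the records in that sample file (plain JSONL parsing; no assumptions about extra fields):

import json

# Print the sample CoT-style edit instructions for inspection.
with open("cache/demo/tos_dataset_edit_cot_demo.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(json.dumps(record, indent=2, ensure_ascii=False))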

Evaluation

GenEval

We provide two evaluation jsonl files according to prompt types in cache/GenEval:

  1. tos_dataset.jsonl: Original prompts.
  2. tos_dataset_rewritten.jsonl: LLM-rewritten prompts.

The image_path field in each line of the JSONL is just a placeholder; before running, replace it with the path to a placeholder image on your local disk.
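
For example, a minimal sketch that points every image_path at the demo image shipped with the repo (any existing local image works equally well):

import json

placeholder = "./cache/demo/edit_demo_0000.png"  # any existing local image

for name in ("tos_dataset.jsonl", "tos_dataset_rewritten.jsonl"):
    path = f"cache/GenEval/{name}"
    with open(path) as f:
        records = [json.loads(line) for line in f]
    for r in records:
        r["image_path"] = placeholder
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")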

Run the following script to generate images:

bash scripts/eval_geneval.sh

The generated images will be saved to cache/inference/DIM-4.6B-T2I/GenEval(_rewritten). Please follow the guide in the GenEval official repo for metric calculation.

MJHQ-30K

First download MJHQ-30K from the HF repo. You only need mjhq30k_imgs.zip. Then extract all images into the cache folder and organize them as follows:

cache
└── MJHQ-30K
    ├── animals
    │   ├── {id}.jpg
    │   ├── {id}.jpg
    │   └── ...
    ├── art
    ├── fashion
    ├── food
    ├── indoor
    ├── landscape
    ├── logo
    ├── people
    ├── plants
    └── vehicles

We have provided all prompts of MJHQ-30K in cache/MJHQ-30K/tos_dataset.jsonl. Run the following script to generate images:

bash scripts/eval_mjhq30k.sh

The generated images will be saved to cache/inference/DIM-4.6B-T2I/MJHQ-30K. We use pytorch-fid to calculate the FID on MJHQ-30K.

ImgEdit

First download ImgEdit from the HF repo. Put the dataset in the cache folder, and organize it as follows:

cache
└── ImgEdit
    └── Benchmark
        ├── hard
        ├── multiturn
        └── singleturn
            ├── animal
            │   ├── {id}.jpg
            │   └── ...
            ├── architecture
            ├── clothes
            ├── compose
            ├── daily object
            ├── for_add
            ├── human
            ├── style
            ├── transport
            ├── judge_prompt.json
            └── singleturn.json

We provide four evaluation jsonl files according to prompt types in cache/ImgEdit:

  1. tos_dataset_edit.jsonl: Original prompts.
  2. tos_dataset_edit_cot.jsonl: CoT-style prompts generated by GPT-4o.
  3. tos_dataset_edit_cot_Qwen2.5-VL-3B-Instruct.jsonl: CoT-style prompts generated by Qwen2.5-VL-3B.
  4. tos_dataset_edit_cot_Qwen2.5-VL-7B-Instruct.jsonl: CoT-style prompts generated by Qwen2.5-VL-7B.

Run the following script to generate images:

bash scripts/eval_imgedit.sh

The generated images will be saved to cache/inference/DIM-4.6B-Edit/ImgEdit. Please follow the guide in the ImgEdit official repo for metric calculation.

GEdit-Bench-EN

First download GEdit-Bench from the HF repo. Extract all raw images from the dataset and put them in the cache folder. Organize them as follows:

cache
└── GEdit-Bench
    └── input_image_raw
        ├── {id}.png
        ├── {id}.png
        ├── {id}.png
        ├── {id}.png
        └── ...

We provide four evaluation jsonl files according to prompt types in cache/GEdit-Bench:

  1. tos_dataset_edit_en.jsonl: Original prompts.
  2. tos_dataset_edit_en_cot.jsonl: CoT-style prompts generated by GPT-4o.
  3. tos_dataset_edit_en_cot_Qwen2.5-VL-3B-Instruct.jsonl: CoT-style prompts generated by Qwen2.5-VL-3B.
  4. tos_dataset_edit_en_cot_Qwen2.5-VL-7B-Instruct.jsonl: CoT-style prompts generated by Qwen2.5-VL-7B.

Run the following script to generate images:

bash scripts/eval_gedit_bench.sh

The generated images will be saved to cache/inference/DIM-4.6B-Edit/GEdit-Bench. Please follow the guide in the GEdit-Bench official repo for metric calculation.

License

Dataset

The dataset is licensed under the CC-BY-NC 4.0 license.

Model

The models are developed based on Qwen2.5-VL-3B-Instruct (subject to Qwen RESEARCH LICENSE AGREEMENT) and SANA1.5_1.6B_1024px (subject to NVIDIA License). We retain ownership of all intellectual property rights in and to any derivative works and modifications that we made.

Citation

If you find our work useful or helpful for your research, please feel free to cite our paper as follows.

@misc{zeng2025drawinmindrebalancingdesignerpainterroles,
      title={Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing}, 
      author={Ziyun Zeng and Junhao Zhang and Wei Li and Mike Zheng Shou},
      year={2025},
      eprint={2509.01986},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.01986}, 
}
