Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing
[2025-10-08] We release the DIM-Edit dataset and the DIM-4.6B-T2I / DIM-4.6B-Edit models.
[2025-09-26] We upload a new version of the paper, including more results across various designers.
[2025-09-02] The DIM paper is released.
Unified models achieve strong results in text-to-image generation but remain weak in precise editing. This limitation arises from an imbalanced division of responsibilities. The understanding module is usually treated as a translator that encodes instructions into conditions, while the generation module must act as both designer and painter. The result is that the generation module carries too much responsibility, even though it is not optimized for complex reasoning.
To address this, we introduce Draw-In-Mind (DIM), a dataset with two complementary parts:
- DIM-T2I: 14M long-context image–text pairs that strengthen instruction comprehension.
- DIM-Edit: 233K chain-of-thought imaginations from GPT-4o that provide explicit design blueprints.
We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight MLP, forming DIM-4.6B-T2I/Edit. With this setup, the understanding module takes on the designer responsibility, while the generation module focuses on rendering. Despite its modest size, DIM-4.6B-Edit achieves SOTA or competitive results on ImgEdit and GEdit-Bench, outperforming much larger models.
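For intuition, the role split can be pictured as a frozen designer, a small trainable connector, and a trainable painter. Below is a minimal, purely illustrative PyTorch sketch; the module classes and hidden sizes are placeholders, not the released implementation:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: the real modules are Qwen2.5-VL-3B and SANA1.5-1.6B.
# The hidden sizes (2048 / 2240) are placeholders, not the actual model dimensions.
class DummyUnderstanding(nn.Module):          # "designer" (kept frozen)
    def __init__(self, dim=2048):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)
    def forward(self, hidden_states):
        return self.backbone(hidden_states)

class DummyPainter(nn.Module):                # "painter" (trainable)
    def __init__(self, cond_dim=2240):
        super().__init__()
        self.denoiser = nn.Linear(cond_dim, cond_dim)
    def forward(self, condition):
        return self.denoiser(condition)

class DIMSketch(nn.Module):
    def __init__(self, vlm_dim=2048, painter_dim=2240):
        super().__init__()
        self.designer = DummyUnderstanding(vlm_dim)
        self.designer.requires_grad_(False)   # freeze the understanding module
        self.connector = nn.Sequential(       # lightweight MLP bridge
            nn.Linear(vlm_dim, painter_dim),
            nn.GELU(),
            nn.Linear(painter_dim, painter_dim),
        )
        self.painter = DummyPainter(painter_dim)

    def forward(self, vlm_hidden_states):
        with torch.no_grad():                 # the designer only "draws in mind"
            design = self.designer(vlm_hidden_states)
        condition = self.connector(design)    # project the design into the painter's space
        return self.painter(condition)        # the painter focuses on rendering

model = DIMSketch()
print(model(torch.randn(1, 16, 2048)).shape)  # torch.Size([1, 16, 2240])
```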
GenEval and MJHQ-30K
*: † denotes using an LLM prompt rewriter. For MJHQ-30K, we report FID (lower is better).
Model | Params | Single | Two | Counting | Colors | Position | Attr. | Overall | MJHQ |
---|---|---|---|---|---|---|---|---|---|
Gen. Only | |||||||||
PixArt-α | 0.6B🔥 | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 | 6.14 |
SDXL | 2.6B🔥 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 8.76 |
DALL-E·3 | - | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 | - |
SD3-Medium | 2.0B🔥 | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 | 11.92 |
Unified | |||||||||
Janus | 1.3B🔥 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 | 10.10 |
Emu3-Gen† | 8.0B🔥 | 0.99 | 0.81 | 0.42 | 0.80 | 0.49 | 0.45 | 0.66 | - |
Show-o | 1.3B🔥 | 0.98 | 0.80 | 0.66 | 0.84 | 0.31 | 0.50 | 0.68 | 15.18 |
Show-o2-7B | 7.0B🔥 | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 | - |
Janus-Pro-7B | 7.0B🔥 | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 | 13.48 |
BAGEL | 14.0B🔥 | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.63 | 0.82 | - |
MetaQuery-L† | 3.0B❄️ + 3.2B🔥 | - | - | - | - | - | - | 0.78 | 6.35 |
DIM-4.6B-T2I† | 3.0B❄️ + 1.6B🔥 | 0.99 | 0.89 | 0.63 | 0.86 | 0.62 | 0.61 | 0.77 | 5.50 |
ImgEdit Overall
*: Q3/7B indicates using Qwen2.5-VL-3/7B as the external designer during inference. By default, GPT-4o is employed as the external designer to ensure the best performance. All models are evaluated using GPT-4.1.
Model | Add | Adj. | Ext. | Rep. | Rem. | Back. | Sty. | Hyb. | Act. | Overall |
---|---|---|---|---|---|---|---|---|---|---|
MagicBrush | 2.84 | 1.58 | 1.51 | 1.97 | 1.58 | 1.75 | 2.38 | 1.62 | 1.22 | 1.83 |
Instruct-P2P | 2.45 | 1.83 | 1.44 | 2.01 | 1.50 | 1.44 | 3.55 | 1.20 | 1.46 | 1.88 |
AnyEdit | 3.18 | 2.95 | 1.88 | 2.47 | 2.23 | 2.24 | 2.85 | 1.56 | 2.65 | 2.45 |
UltraEdit | 3.44 | 2.81 | 2.13 | 2.96 | 1.45 | 2.83 | 3.76 | 1.91 | 2.98 | 2.70 |
Step1X-Edit | 3.88 | 3.14 | 1.76 | 3.40 | 2.41 | 3.16 | 4.63 | 2.64 | 2.52 | 3.06 |
BAGEL | 3.56 | 3.31 | 1.70 | 3.30 | 2.62 | 3.24 | 4.49 | 2.38 | 4.17 | 3.20 |
UniWorld-V1 | 3.82 | 3.64 | 2.27 | 3.47 | 3.24 | 2.99 | 4.21 | 2.96 | 2.74 | 3.26 |
Janus-4o | 3.35 | 3.35 | 2.25 | 3.01 | 2.18 | 3.32 | 4.71 | 2.49 | 4.04 | 3.19 |
GPT-4o-Image | 4.61 | 4.33 | 2.90 | 4.35 | 3.66 | 4.57 | 4.93 | 3.96 | 4.89 | 4.20 |
DIM-4.6B-Edit | 4.09 | 3.47 | 2.30 | 4.00 | 3.43 | 3.87 | 4.92 | 2.85 | 4.08 | 3.67 |
ImgEdit Designer Ablation
†: The default setting.
Designer | Add | Adj. | Ext. | Rep. | Rem. | Back. | Sty. | Hyb. | Act. | Overall |
---|---|---|---|---|---|---|---|---|---|---|
– | 3.53 | 3.23 | 2.01 | 3.49 | 1.47 | 3.42 | 4.79 | 2.35 | 3.64 | 3.10 |
Qwen2.5-VL-3B | 3.80 | 3.24 | 2.03 | 3.89 | 3.21 | 3.52 | 4.92 | 2.71 | 4.05 | 3.49 |
Qwen2.5-VL-7B | 3.95 | 3.35 | 2.25 | 3.85 | 3.31 | 3.57 | 4.88 | 2.81 | 4.02 | 3.55 |
MiMo-VL-7B | 3.95 | 3.32 | 2.20 | 3.75 | 2.46 | 3.82 | 4.88 | 2.52 | 3.93 | 3.43 |
InternVL3.5-8B | 3.98 | 3.40 | 2.05 | 4.14 | 3.30 | 3.84 | 4.94 | 2.77 | 3.89 | 3.59 |
GLM-4.1V-9B | 3.95 | 3.27 | 2.23 | 3.90 | 2.64 | 3.81 | 4.92 | 2.23 | 4.02 | 3.44 |
GPT-4o† | 4.09 | 3.47 | 2.30 | 4.00 | 3.43 | 3.87 | 4.92 | 2.85 | 4.08 | 3.67 |
Visualization
*: Green and Blue denote the edits of Janus-4o and Step1X-Edit, respectively; Red denotes the edits of our models trained on different data corpora.
Not available yet.
Please first download DIM-Edit from our 🤗HF repo. You can use `huggingface-cli` to download it quickly:
# 1. Install the huggingface hub tools (if not yet installed)
pip install -U huggingface_hub
# 2. Log in with your Hugging Face account token
huggingface-cli login
# 3. Download the dataset
huggingface-cli download stdKonjac/DIM-Edit --repo-type dataset --local-dir ./DIM-Edit
After downloading, navigate into the dataset folder, merge and extract the split archives using the following bash commands:
cd DIM-Edit
cat images.tar.gz.part* > images.tar.gz
tar -xvzf images.tar.gz
In the meantime, you will find a JSONL file named `tos_dataset_edit.jsonl` in the root directory, which records all image editing samples. Each line in this file corresponds to a single sample containing four fields:
Field | Description |
---|---|
`id` | Unique identifier for each sample. |
`image_path` | Path to the source image, beginning with `image/`. |
`image_path_target` | Path to the target image, beginning with `image/`. |
`prompt` | The CoT-style instruction describing how to transform the source image into the target image. |
We recommend using the Hugging Face `datasets` library to load the dataset efficiently:
from datasets import load_dataset, Features, Value
features = Features({
"id": Value("string"),
"image_path": Value("string"),
"image_path_target": Value("string"),
"prompt": Value("string"),
})
ds = load_dataset(
"json",
data_files="DIM-Edit/tos_dataset_edit.jsonl",
features=features,
split="train",
)
print(ds[0])
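To access the underlying images, you can resolve the path fields against the dataset root. The following is a minimal sketch, assuming Pillow is installed and the archives were extracted inside the `DIM-Edit` folder:

```python
import os
from PIL import Image

root = "DIM-Edit"
sample = ds[0]
# Paths in the jsonl begin with image/, so join them with the dataset root.
source = Image.open(os.path.join(root, sample["image_path"]))
target = Image.open(os.path.join(root, sample["image_path_target"]))
print(sample["prompt"])
print(source.size, target.size)
```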
Run the following command to set up the Python environment:
pip install -r requirements.txt
Please first create a `checkpoints` folder in the root directory:
mkdir checkpoints
Then download the models from our 🤗HF repo below, and move them to the `checkpoints` folder.
*: To facilitate reproducibility, we release DIM-4.6B-Edit-Stage1, which is trained solely on the UltraEdit dataset. By fine-tuning this checkpoint on our proposed DIM-Edit dataset, you should obtain DIM-4.6B-Edit.
Model | Task | Training Data | ImgEdit | Parameters |
---|---|---|---|---|
DIM-4.6B-T2I | Text-to-Image | DIM-T2I + 6.9M Public Data | – | 3.0B❄️ + 1.6B🔥 |
DIM-4.6B-Edit-Stage1 | Image Editing | UltraEdit | 2.76 | 3.0B❄️ + 1.6B🔥 |
DIM-4.6B-Edit | Image Editing | UltraEdit → DIM-Edit | 3.67 | 3.0B❄️ + 1.6B🔥 |
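As one option, the checkpoints can also be fetched with `huggingface-cli`. This is only a sketch: the model repo IDs below are assumptions based on the dataset repo's naming and should be replaced with the actual IDs linked on the 🤗HF pages.

```bash
mkdir -p checkpoints
# Replace the repo IDs with the ones listed on the 🤗HF model pages.
huggingface-cli download stdKonjac/DIM-4.6B-T2I --local-dir ./checkpoints/DIM-4.6B-T2I
huggingface-cli download stdKonjac/DIM-4.6B-Edit-Stage1 --local-dir ./checkpoints/DIM-4.6B-Edit-Stage1
huggingface-cli download stdKonjac/DIM-4.6B-Edit --local-dir ./checkpoints/DIM-4.6B-Edit
```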
The checkpoints should be organized like:
DIM/
└── checkpoints/
├── DIM-4.6B-T2I/
│ ├── model.safetensors
│ └── ...
├── DIM-4.6B-Edit-Stage1/
│ ├── model.safetensors
│ └── ...
└── DIM-4.6B-Edit/
├── model.safetensors
└── ...
T2I Generation
The demo T2I instructions are provided in `cache/demo/tos_dataset_demo.jsonl`, where each line is an instruction in `json` format like:
{"id": "0000", "image_path": "./cache/demo/edit_demo_0000.png", "prompt": "A yummy cupcake floating in the air dark background"}
The `image_path` is just a placeholder, and you can modify `prompt` to create your own image.
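For example, a custom instruction can be appended to the demo file with a few lines of Python. This is a minimal sketch; the `id` and `prompt` values are made up, and `image_path` remains a placeholder:

```python
import json

record = {
    "id": "0001",
    "image_path": "./cache/demo/edit_demo_0000.png",  # placeholder, not used for T2I
    "prompt": "A corgi wearing sunglasses on a sunny beach",
}
with open("cache/demo/tos_dataset_demo.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```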
To generate images from the jsonl file, run the following script:
bash scripts/demo_t2i.sh
For each instruction, the generated image will be saved at `cache/inference/demo/DIM-4.6B-T2I/{id}_gen.jpg`.
Image Editing
The demo edit instructions are provided in `cache/demo/tos_dataset_edit_demo.jsonl`, where each line is an instruction in `json` format like:
{"id": "0", "image_path": "./cache/demo/edit_demo_0000.png", "prompt": "Remove the lemons on the table.", "image_path_target": "./cache/demo/edit_demo_0000.png"}
The `image_path` corresponds to the source image, and the `prompt` is the edit instruction. The `image_path_target` is just a placeholder.
In `infer/demo_edit.py`, use the `set_designer_gpt` API with your own key to set GPT-4o as the external designer for optimal performance.
# GPT-4o as external designer
model.set_designer_gpt(api_key='')
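To avoid hard-coding the key, you could read it from an environment variable instead (a small variation, assuming the key is exported as `OPENAI_API_KEY`):

```python
import os

# GPT-4o as external designer, with the key read from the environment.
model.set_designer_gpt(api_key=os.environ["OPENAI_API_KEY"])
```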
You can also use the `set_designer_X` API to set various open-source VLMs as the external designer. The VLMs will be automatically downloaded to your local disk.
# Qwen2.5-VL as external designer
model.set_designer_qwen(version='Qwen/Qwen2.5-VL-3B-Instruct')
model.set_designer_qwen(version='Qwen/Qwen2.5-VL-7B-Instruct')
# InternVL3.5 as external designer (recommend using transformers==4.53.0)
model.set_designer_internvl(version='OpenGVLab/InternVL3_5-8B-HF')
# MiMo-VL as external designer
model.set_designer_mimo(version='XiaomiMimo/MiMo-VL-7B-RL-2508')
# GLM-4.1V as external designer (recommend using transformers==4.53.1)
model.set_designer_glm(version='THUDM/GLM-4.1V-9B-Thinking')
To generate edited images from the jsonl file, run the following script:
bash scripts/demo_edit.sh
The model will first generate a CoT-guided edit instruction for each prompt and save it to `cache/inference/demo/DIM-4.6B-Edit/tos_dataset_edit_cot_demo_gen.jsonl`. Then the generated images will be saved at `cache/inference/demo/DIM-4.6B-Edit/{id}_edited.jpg`.
We also provide a sample GPT-4o-generated CoT jsonl file at `cache/demo/tos_dataset_edit_cot_demo.jsonl` for reference.
GenEval
We provide two evaluation jsonl files according to prompt types in `cache/GenEval`:
- `tos_dataset.jsonl`: Original prompts.
- `tos_dataset_rewritten.jsonl`: LLM-rewritten prompts.
The `image_path` field in each line of the jsonl is just a placeholder; please replace it with a pseudo image on your local disk first.
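For instance, you could create a small blank image and point every record at it (a minimal sketch, assuming Pillow is installed):

```python
import json
from PIL import Image

# Create a small pseudo image to serve as the placeholder.
Image.new("RGB", (64, 64)).save("cache/GenEval/placeholder.png")

# Point the image_path field of every record at the pseudo image.
for name in ["tos_dataset.jsonl", "tos_dataset_rewritten.jsonl"]:
    path = f"cache/GenEval/{name}"
    with open(path) as f:
        records = [json.loads(line) for line in f]
    for r in records:
        r["image_path"] = "cache/GenEval/placeholder.png"
    with open(path, "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in records)
```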
Run the following script to generate images:
bash scripts/eval_geneval.sh
The generated images will be saved to `cache/inference/DIM-4.6B-T2I/GenEval(_rewritten)`.
Please follow the guide in the GenEval official repo for metrics calculation.
MJHQ-30K
First download MJHQ-30K from the HF repo. You only need to download `mjhq30k_imgs.zip`. Then extract all images into the `cache` folder and organize them as follows:
cache
└── MJHQ-30K
├── animals
│ ├── {id}.jpg
│ ├── {id}.jpg
│ └── ...
├── art
├── fashion
├── food
├── indoor
├── landscape
├── logo
├── people
├── plants
└── vehicles
We have provided all prompts of MJHQ-30K in `cache/MJHQ-30K/tos_dataset.jsonl`. Run the following script to generate images:
bash scripts/eval_mjhq30k.sh
The generated images will be saved to `cache/inference/DIM-4.6B-T2I/MJHQ-30K`. We use pytorch-fid to calculate the FID on MJHQ-30K.
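A typical invocation looks like the following (a sketch, assuming `pytorch-fid` is installed and the reference file names are unique across categories; `pytorch-fid` reads flat image folders, so the per-category reference images are first copied into one directory):

```bash
pip install pytorch-fid
# Flatten the per-category MJHQ-30K reference images into a single folder.
mkdir -p cache/MJHQ-30K_flat
find cache/MJHQ-30K -name '*.jpg' -exec cp {} cache/MJHQ-30K_flat/ \;
# Compute FID between the reference images and the generated images.
python -m pytorch_fid cache/MJHQ-30K_flat cache/inference/DIM-4.6B-T2I/MJHQ-30K
```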
ImgEdit
First download ImgEdit from the HF repo. Put the dataset in the `cache` folder, and organize it as follows:
cache
└── ImgEdit
└── Benchmark
├── hard
├── multiturn
└── singleturn
├── animal
│ ├── {id}.jpg
│ └── ...
├── architecture
├── clothes
├── compose
├── daily object
├── for_add
├── human
├── style
├── transport
├── judge_prompt.json
└── singleturn.json
We provide four evaluation jsonl files according to prompt types in `cache/ImgEdit`:
- `tos_dataset_edit.jsonl`: Original prompts.
- `tos_dataset_edit_cot.jsonl`: CoT-style prompts generated by GPT-4o.
- `tos_dataset_edit_cot_Qwen2.5-VL-3B-Instruct.jsonl`: CoT-style prompts generated by Qwen2.5-VL-3B.
- `tos_dataset_edit_cot_Qwen2.5-VL-7B-Instruct.jsonl`: CoT-style prompts generated by Qwen2.5-VL-7B.
Run the following script to generate images:
bash scripts/eval_imgedit.sh
The generated images will be saved to `cache/inference/DIM-4.6B-Edit/ImgEdit`. Please follow the guide in the ImgEdit official repo for metrics calculation.
GEdit-Bench-EN
First download GEdit-Bench from the HF repo. Extract all raw images from the dataset and put them in the `cache` folder. Organize them as follows:
cache
└── GEdit-Bench
└── input_image_raw
├── {id}.png
├── {id}.png
├── {id}.png
├── {id}.png
└── ...
We provide four evaluation jsonl files according to prompt types in `cache/GEdit-Bench`:
- `tos_dataset_edit_en.jsonl`: Original prompts.
- `tos_dataset_edit_en_cot.jsonl`: CoT-style prompts generated by GPT-4o.
- `tos_dataset_edit_en_cot_Qwen2.5-VL-3B-Instruct.jsonl`: CoT-style prompts generated by Qwen2.5-VL-3B.
- `tos_dataset_edit_en_cot_Qwen2.5-VL-7B-Instruct.jsonl`: CoT-style prompts generated by Qwen2.5-VL-7B.
Run the following script to generate images:
bash scripts/eval_gedit_bench.sh
The generated images will be saved to `cache/inference/DIM-4.6B-Edit/GEdit-Bench`. Please follow the guide in the GEdit-Bench official repo for metrics calculation.
The dataset is licensed under the CC-BY-NC 4.0 license.
The models are developed based on Qwen2.5-VL-3B-Instruct (subject to Qwen RESEARCH LICENSE AGREEMENT) and SANA1.5_1.6B_1024px (subject to NVIDIA License). We retain ownership of all intellectual property rights in and to any derivative works and modifications that we made.
If you find our work useful or helpful for your research, please feel free to cite our paper as below.
@misc{zeng2025drawinmindrebalancingdesignerpainterroles,
title={Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing},
author={Ziyun Zeng and Junhao Zhang and Wei Li and Mike Zheng Shou},
year={2025},
eprint={2509.01986},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.01986},
}