
Question: Why inject 2D features at decoder skip connections (sum) instead of encoder / cross-attention? Any ablations? #8

@Xin200203

Description

Hi and thank you for the very inspiring work! I’m really looking forward to the code release.

I’m working on multi-modal fusion for 3D segmentation, and I’m curious about the design choice for how and where you inject the 2D DINO features in DITR.

From the paper, my understanding is:

2D features are assigned to points via patch index, then max-pooled to obtain a multi-scale hierarchy aligned with the 3D U-Net hierarchy.

At each decoder level l, you linearly project the three inputs — the upsampled features from D_{l+1}, the skip connection from E_l, and the pooled 2D features — apply normalization + GELU, and then element-wise add them before the decoder block.
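To make sure I'm reading the fusion step correctly, here is a minimal numpy sketch of my understanding (my own reconstruction, not your code — the LayerNorm choice, weight names, and the tanh GELU approximation are all my assumptions):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # per-token channel normalization (assuming LayerNorm; the paper may use another norm)
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fuse_level(up, skip, feat2d, W_up, W_skip, W_2d):
    """Project each stream to the decoder width, apply norm + GELU, then sum element-wise."""
    return (gelu(layer_norm(up @ W_up))
            + gelu(layer_norm(skip @ W_skip))
            + gelu(layer_norm(feat2d @ W_2d)))
```

Please correct me if, e.g., the normalization happens before the projection or the activation is applied only once after the sum.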

Questions

1. Rationale for decoder-side injection via element-wise addition.

What considerations led you to inject at the decoder skip connections with simple summation rather than earlier in the encoder or at the bottleneck? Was this mainly for stability/compute, or did you observe accuracy differences?

2. Ablations you may have tried.

  • Injecting 2D features in the encoder (before/after downsampling blocks)?

  • Bottleneck-only injection?

  • Replacing the element-wise sum with concatenation + MLP/conv, gating/FiLM, or a lightweight cross-attention module (e.g., 3D as Q, 2D as K/V)?

If you ran any of these, could you share qualitative/quantitative trends (e.g., mIoU changes, stability, runtime/memory)?
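For concreteness on the last bullet, this is the kind of lightweight module I have in mind — a rough single-head sketch with 3D decoder features as queries and pooled 2D features as keys/values (all names, dimensions, and the residual placement are my assumptions, not anything from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(feats3d, feats2d, Wq, Wk, Wv, Wo):
    """Single-head cross-attention: 3D tokens query the 2D tokens, with a residual."""
    Q = feats3d @ Wq                                 # (N3, d)
    K = feats2d @ Wk                                 # (N2, d)
    V = feats2d @ Wv                                 # (N2, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N3, N2), rows sum to 1
    # project attended 2D context back to the 3D width and add residually
    return feats3d + (attn @ V) @ Wo
```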

3. On cross-attention variants.
Many prior fusion works use cross-attention in the encoder for data-dependent selection/alignment. Did you try such a variant and find drawbacks (cost, sensitivity to noisy matches, training instability), or was it simply out of scope?

4. Practical guidance.
If I wanted to prototype an encoder-side or cross-attention injector on top of your code once released, are there pitfalls you'd recommend avoiding (normalization type, absolute vs. relative positional encoding, where to put residuals, etc.)?

Thanks again for the great paper! If you’re open to it, I’d be happy to run a small set of controlled ablations after the release and share back results.
