
Question: Why inject 2D features at decoder skip connections (sum) instead of encoder / cross-attention? Any ablations? #8

@Xin200203

Description

Hi and thank you for the very inspiring work! I’m really looking forward to the code release.

I’m working on multi-modal fusion for 3D segmentation, and I’m curious about the design choice for how and where you inject the 2D DINO features in DITR.

From the paper, my understanding is:

2D features are assigned to points via patch index, then max-pooled to obtain a multi-scale hierarchy aligned with the 3D U-Net hierarchy.

At each decoder level l, you linearly project the three inputs — the upsampled features from D_{l+1}, the skip connection from E_l, and the pooled 2D features — apply normalization + GELU, and then element-wise add them before the decoder block.
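To make sure I'm reading the fusion step correctly, here is a minimal numpy sketch of my understanding (my own reconstruction, not your code — the LayerNorm choice, weight names, and the tanh GELU approximation are all my assumptions):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # per-token channel normalization (assuming LayerNorm; the paper may use another norm)
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fuse_level(up, skip, feat2d, W_up, W_skip, W_2d):
    """Project each stream to the decoder width, apply norm + GELU, then sum element-wise."""
    return (gelu(layer_norm(up @ W_up))
            + gelu(layer_norm(skip @ W_skip))
            + gelu(layer_norm(feat2d @ W_2d)))
```

Please correct me if, e.g., the normalization happens before the projection or the activation is applied only once after the sum.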

Questions

1. Rationale for decoder-side injection via element-wise addition.

What considerations led you to inject at the decoder skip connections with simple summation rather than earlier in the encoder or at the bottleneck? Was this mainly for stability/compute, or did you observe accuracy differences?

2. Ablations you may have tried.

  • Injecting 2D features in the encoder (before/after downsampling blocks)?

  • Bottleneck-only injection?

  • Replacing the element-wise sum with concatenation + MLP/conv, gating/FiLM, or a lightweight cross-attention module (e.g., 3D as Q, 2D as K/V)?

If you ran any of these, could you share qualitative/quantitative trends (e.g., mIoU changes, stability, runtime/memory)?
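For concreteness on the last bullet, this is the kind of lightweight module I have in mind — a rough single-head sketch with 3D decoder features as queries and pooled 2D features as keys/values (all names, dimensions, and the residual placement are my assumptions, not anything from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(feats3d, feats2d, Wq, Wk, Wv, Wo):
    """Single-head cross-attention: 3D tokens query the 2D tokens, with a residual."""
    Q = feats3d @ Wq                                 # (N3, d)
    K = feats2d @ Wk                                 # (N2, d)
    V = feats2d @ Wv                                 # (N2, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N3, N2), rows sum to 1
    # project attended 2D context back to the 3D width and add residually
    return feats3d + (attn @ V) @ Wo
```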

3. On cross-attention variants.
Many prior fusion works use cross-attention in the encoder for data-dependent selection/alignment. Did you try such a variant and find drawbacks (cost, sensitivity to noisy matches, training instability), or was it simply out of scope?

4. Practical guidance.
If I wanted to prototype an encoder-side or cross-attention injector on top of your code once released, are there pitfalls you'd recommend avoiding (normalization type, absolute vs. relative positional encoding, where to put residuals, etc.)?

Thanks again for the great paper! If you’re open to it, I’d be happy to run a small set of controlled ablations after the release and share back results.
