
Robust Object Detection with Domain-Invariant Training and Continual Test-Time Adaptation


Abstract

Real-world environments can be highly dynamic, causing substantial domain shifts. Such shifts can span over time and across multiple domains, manifesting as changes in content, style, or both, where content refers to the underlying image layout and style refers to domain-specific properties such as color and texture. Safety-critical applications, especially robust object detection systems in autonomous driving, must adapt to such test-time domain shifts. However, our empirical analysis shows that existing domain adaptation and generalization methods fail to handle domain changes with substantial style or content shifts. In this paper, we first analyze and investigate effective solutions to overcome domain overfitting for robust object detection without the above shortcomings. To simultaneously address temporal and multi-domain shifts, we propose a continual test-time generalizable domain adaptation (CoTGA) method for robust object detection: 1) the domain-invariant training (DIT) module leverages the Normalization Perturbation (NP) method to initialize a style-invariant object detection model by perturbing the channel statistics of source-domain low-level features to synthesize diverse latent styles. The trained deep model thus perceives diverse potential domains and generalizes well even without observing target-domain data in training; 2) the test-time adaptation (TTA) module updates the DIT-trained model online during inference, through consistency regularization between the predictions on weakly and strongly augmented unlabeled images. TTA addresses the content discrepancies remaining in the DIT-initialized generalizable model; 3) the generalizable weights preservation (GWP) module retains the learned generalizable weights to avoid domain overfitting when generalizing across multiple domains. Extensive experiments demonstrate that these three modules collaboratively enable a deep model to generalize well under challenging real-world domain shifts.
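To make the TTA objective concrete, here is a minimal PyTorch sketch of weak-to-strong consistency regularization as described above. It is an illustration only, not the authors' released code: the detection head is replaced by a generic classifier for brevity, and the augmentation recipes, `model`, and `images` are assumed placeholders.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

# Illustrative weak/strong augmentation pipelines (assumed, not the paper's exact recipe).
weak_aug = T.RandomHorizontalFlip()
strong_aug = T.Compose([
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
])

def consistency_loss(model, images):
    """Pseudo-labels from the weak view supervise predictions on the strong view."""
    with torch.no_grad():
        pseudo = model(weak_aug(images)).softmax(dim=-1)   # stop-gradient target
    pred = model(strong_aug(images)).log_softmax(dim=-1)   # adapted prediction
    return F.kl_div(pred, pseudo, reduction="batchmean")

# Toy usage: a batch of unlabeled test images drives an online update.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
images = torch.rand(4, 3, 32, 32)
loss = consistency_loss(model, images)
loss.backward()  # an optimizer step on the adapted parameters would follow
```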


Data Availability

We evaluate our method on multiple datasets under diverse real-world and synthetic domain shifts. The datasets that support the findings of this study are publicly available. Table 14 lists the details of the datasets used in our experiments, including Cityscapes (Cordts et al., 2016) (https://www.cityscapes-dataset.com/downloads/), Foggy Cityscapes (Sakaridis et al., 2018) (https://people.ee.ethz.ch/~csakarid/SFSU_synthetic/), Sim10k (Johnson-Roberson et al., 2017) (https://fcav.engin.umich.edu/projects/driving-in-the-matrix), BDD100k (Yu et al., 2020) (https://doc.bdd100k.com/download.html), Waymo (Sun et al., 2020) (https://waymo.com/open/download), GTAV (Richter et al., 2016) (https://download.visinf.tu-darmstadt.de/data/from_games/), Mapillary Vistas (Neuhold et al., 2017) (https://www.mapillary.com/dataset/vistas), Synthia (Ros et al., 2016) (https://synthia-dataset.net/), ACDC (Sakaridis et al., 2021) (https://acdc.vision.ee.ethz.ch/download) and PACS (Li et al., 2017) (https://domaingeneralization.github.io/#data). For Foggy Cityscapes (Sakaridis et al., 2018), we evaluate models on the highest fog-intensity images (i.e., with the least visibility). The ACDC (Sakaridis et al., 2021) dataset is used in the additional semantic segmentation domain generalization experiments.

Notes

  1. For all t-SNE visualizations in this paper, the features from multiple models are mapped jointly into a unified space but are separately visualized for clarity.

  2. To compute \(\Sigma_\mu^2(x)\) and \(\Sigma_\sigma^2(x)\) for a dataset, we first shuffle the train set, then sample a batch of 64 images and extract their ResNet stage-1 features, and finally compute the variance of their feature channel statistics, i.e., mean and standard deviation, respectively. We traverse all image batches of the train set and average all the computed variance values as the output.
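A minimal sketch of this procedure, under the assumption that "ResNet stage-1" means the output of `layer1` in a torchvision ResNet-50; the data loader is a placeholder:

```python
import torch
from torchvision.models import resnet50

backbone = resnet50(weights="IMAGENET1K_V1")
stage1 = torch.nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                             backbone.maxpool, backbone.layer1).eval()

@torch.no_grad()
def channel_stats_variance(loader):
    """Average, over batches of 64 images, the variance of per-image channel statistics."""
    var_mu, var_sigma, n = 0.0, 0.0, 0
    for images, _ in loader:                 # shuffled train set, batch size 64
        feat = stage1(images)                # (B, C, H, W) stage-1 features
        mu = feat.mean(dim=(2, 3))           # per-image channel means, (B, C)
        sigma = feat.std(dim=(2, 3))         # per-image channel stds, (B, C)
        var_mu += mu.var(dim=0).mean().item()       # variance across the batch
        var_sigma += sigma.var(dim=0).mean().item()
        n += 1
    return var_mu / n, var_sigma / n
```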

  3. Although the Sim10k dataset is a single-domain dataset, it contains diverse styles synthesized by a graphics engine, e.g., daytime, night, dawn, dusk, clear, snowy and rainy. Thus Sim10k has a larger inter-image feature-statistics variance than Cityscapes, though still smaller than that of PACS.

References

  • An, J., Huang, S., Song, Y., Dou, D., Liu, W. & Luo, J. (2021). Artflow: Unbiased image style transfer via reversible neural flows. In: CVPR.

  • Balaji, Y., Sankaranarayanan, S. & Chellappa, R. (2018). Metareg: Towards domain generalization using meta-regularization. In: NeurIPS.

  • Bartler, A., Bühler, A., Wiewel, F., Döbler, M., & Yang, B. (2022). Mt3: Meta test-time training for self-supervised test-time adaption. In: ICAIS.

  • Bochkovskiy, A., Wang, C.-Y., & Liao, H.-Y.M. (2020). Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.

  • Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H.-P., Schölkopf, B., & Smola, A.J. (2006). Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics.

  • Borlino, F.C., Polizzotto, S., Caputo, B., & Tommasi, T. (2022). Self-supervision & meta-learning for one-shot unsupervised cross-domain detection. CVIU.

  • Bui, M.-H., Tran, T., Tran, A., & Phung, D. (2021). Exploiting domain-specific features to enhance domain generalization. NeurIPS.

  • Cai, Q., Pan, Y., Ngo, C.-W., Tian, X., Duan, L. & Yao, T. (2019). Exploring object relation in mean teacher for cross-domain detection. In: CVPR.

  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In: ECCV.

  • Carlucci, F.M., D’Innocente, A., Bucci, S., Caputo, B. & Tommasi, T. (2019). Domain generalization by solving jigsaw puzzles. In: CVPR.

  • Carlucci, F.M., Russo, P., Tommasi, T., & Caputo, B. (2019). Hallucinating agnostic images to generalize across domains. In: ICCV Workshop.

  • Chen, Y., Li, W., Sakaridis, C., Dai, D., & Van Gool, L. (2018). Domain adaptive faster r-cnn for object detection in the wild. In: CVPR.

  • Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F. & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In: ECCV.

  • Choi, S., Jung, S., Yun, H., Kim, J.T., Kim, S., & Choo, J. (2021). Robustnet: Improving domain generalization in urban-scene segmentation via instance selective whitening. In: CVPR.

  • Choi, J., Kim, T., & Kim, C. (2019). Self-ensembling with gan-based data augmentation for domain adaptation in semantic segmentation. In: ICCV.

  • Choi, S., Kim, T., Jeong, M., Park, H., & Kim, C. (2021). Meta batch-instance normalization for generalizable person re-identification. In: CVPR.

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S. & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In: CVPR.

  • Dayal, A., KB, V., Cenkeramaddi, L.R., Mohan, C., Kumar, A., & N Balasubramanian, V. (2024). Madg: margin-based adversarial learning for domain generalization. NeurIPS.

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In: CVPR.

  • DeVries, T. & Taylor, G.W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.

  • D’Innocente, A., Borlino, F.C., Bucci, S., Caputo, B. & Tommasi, T. (2020). One-shot unsupervised cross-domain detection. In: ECCV.

  • Fahes, M., Vu, T.-H., Bursuc, A., Pérez, P. & De Charette, R. (2023). Poda: Prompt-driven zero-shot domain adaptation. In: ICCV.

  • Fan, Q., Segu, M., Tai, Y.-W., Yu, F., Tang, C.-K., Schiele, B., & Dai, D. (2022). Normalization perturbation: A simple domain generalization method for real-world domain shifts. arXiv preprint arXiv:2211.04393.

  • Fan, X., Wang, Q., Ke, J., Yang, F., Gong, B., & Zhou, M. (2021). Adversarially adaptive normalization for single domain generalization. In: CVPR.

  • Fang, H.-S., Sun, J., Wang, R., Gou, M., Li, Y.-L. & Lu, C. (2019). Instaboost: Boosting instance segmentation via probability map guided copy-pasting. In: ICCV.

  • Ganin, Y., & Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In: ICML.

  • Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A. & Brendel, W. (2018). Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In: ICLR.

  • Ghiasi, G., Lin, T.-Y. & Le, Q.V. (2018). Dropblock: A regularization method for convolutional networks. NeurIPS.

  • Gholami, B., Sahu, P., Rudovic, O., Bousmalis, K., & Pavlovic, V. (2020). Unsupervised multi-target domain adaptation: An information theoretic approach. IEEE TIP.

  • Gong, R., Li, W., Chen, Y., & Gool, L.V. (2019). Dlow: Domain flow for adaptation and generalization. In: CVPR.

  • Grigorescu, S., Trasnea, B., Cocias, T., & Macesanu, G. (2020). A survey of deep learning techniques for autonomous driving. Journal of Field Robotics.

  • He, M., Wang, Y., Wu, J., Wang, Y., Li, H., Li, B., Gan, W., Wu, W. & Qiao, Y. (2022). Cross domain object detection by target-perceived dual branch distillation. In: CVPR.

  • He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep residual learning for image recognition. In: CVPR.

  • Hsu, C.-C., Tsai, Y.-H., Lin, Y.-Y. & Yang, M.-H. (2020). Every pixel matters: Center-aware feature alignment for domain adaptive object detector. In: ECCV.

  • Huang, X. & Belongie, S. (2017). Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV.

  • Huang, L., Zhou, Y., Zhu, F., Liu, L. & Shao, L. (2019). Iterative normalization: Beyond standardization towards efficient whitening. In: CVPR.

  • Ioffe, S. & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML.

  • Isobe, T., Jia, X., Chen, S., He, J., Shi, Y., Liu, J., Lu, H., & Wang, S. (2021). Multi-target domain adaptation with collaborative consistency learning. In: CVPR.

  • Iwasawa, Y., & Matsuo, Y. (2021). Test-time classifier adjustment module for model-agnostic domain generalization. In: NeurIPS.

  • Jackson, P.T., Abarghouei, A.A., Bonner, S., Breckon, T.P., & Obara, B. (2019). Style augmentation: data augmentation via style randomization. In: CVPR Workshops.

  • Jin, X., Lan, C., Zeng, W., & Chen, Z. (2021). Style normalization and restitution for domain generalization and adaptation. TMM.

  • Johnson-Roberson, M., Barto, C., Mehta, R., Sridhar, S.N., Rosaen, K. & Vasudevan, R. (2017). Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? In: ICRA.

  • Kaggle: Painter by Numbers. https://www.kaggle.com/c/painter-by-numbers/

  • Kang, G., Zheng, L., Yan, Y., & Yang, Y. (2018). Deep adversarial attention alignment for unsupervised domain adaptation: the benefit of target expectation maximization. In: ECCV.

  • Kim, T., Jeong, M., Kim, S., Choi, S., & Kim, C. (2019). Diversify and match: A domain adaptive representation learning paradigm for object detection. In: CVPR.

  • Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., & Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences.

  • Lee, W., Hong, D., Lim, H., & Myung, H. (2024). Object-aware domain generalization for object detection. In: AAAI.

  • Li, X., Dai, Y., Ge, Y., Liu, J., Shan, Y., & Duan, L.-Y. (2022). Uncertainty modeling for out-of-distribution generalization. In: ICLR.

  • Li, Y.-J., Dai, X., Ma, C.-Y., Liu, Y.-C., Chen, K., Wu, B., He, Z., Kitani, K. & Vajda, P. (2022). Cross-domain adaptive teacher for object detection. In: CVPR.

  • Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., Ke, Z., Li, Q., Cheng, M., Nie, W., et al. (2022). Yolov6: a single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976.

  • Li, P., Li, D., Li, W., Gong, S., Fu, Y., & Hospedales, T.M. (2021). A simple feature augmentation for domain generalization. In: ICCV.

  • Li, H., Pan, S.J., Wang, S. & Kot, A.C. (2018). Domain generalization with adversarial feature learning. In: CVPR.

  • Li, Y., Wang, N., Shi, J., Liu, J., & Hou, X. (2016). Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779.

  • Li, D., Yang, Y., Song, Y.-Z. & Hospedales, T.M. (2017). Deeper, broader and artier domain generalization. In: ICCV.

  • Li, Y., Zhang, D., Keuper, M., & Khoreva, A. (2024). Intra- & extra-source exemplar-based style synthesis for improved domain generalization. IJCV.

  • Li, D., Zhang, J., Yang, Y., Liu, C., Song, Y.-Z. & Hospedales, T.M. (2019). Episodic training for domain generalization. In: ICCV.

  • Liang, J., Hu, D. & Feng, J. (2020). Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In: ICML.

  • Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B. & Belongie, S. (2017). Feature pyramid networks for object detection. In: CVPR.

  • Lin, C., Yuan, Z., Zhao, S., Sun, P., Wang, C., & Cai, J. (2021). Domain-invariant disentangled network for generalizable object detection. In: ICCV.

  • Liu, Q., Dou, Q., Yu, L., & Heng, P.A. (2020). Ms-net: multi-site network for improving prostate segmentation with heterogeneous mri data. TMI.

  • Liu, Y., Kothari, P., Delft, B., Bellot-Gurlet, B., Mordan, T., & Alahi, A. (2021). Ttt++: When does self-supervised test-time training fail or thrive? NeurIPS.

  • Long, M., Cao, Y., Wang, J., & Jordan, M. (2015). Learning transferable features with deep adaptation networks. In: ICML.

  • Luo, Y., Liu, P., & Yang, Y. (2024). Kill two birds with one stone: Domain generalization for semantic segmentation via network pruning. IJCV.

  • Luo, P., Zhang, R., Ren, J., Peng, Z., & Li, J. (2019). Switchable normalization for learning-to-normalize deep representation. TPAMI.

  • Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research.

  • Mancini, M., Akata, Z., Ricci, E., & Caputo, B. (2020). Towards recognizing unseen categories in unseen domains. In: ECCV.

  • Maria Carlucci, F., Porzi, L., Caputo, B., Ricci, E., & Rota Bulo, S. (2017). Autodial: Automatic domain alignment layers. In: ICCV.

  • Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A.S., Bethge, M., & Brendel, W. (2019). Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484.

  • MixStyle: MixStyle. https://github.com/KaiyangZhou/mixstyle-release

  • Motiian, S., Piccirilli, M., Adjeroh, D.A. & Doretto, G. (2017). Unified deep supervised domain adaptation and generalization. In: ICCV.

  • Muhammad, U., Laaksonen, J., Romaissa Beddiar, D., & Oussalah, M. (2024). Domain generalization via ensemble stacking for face presentation attack detection. IJCV.

  • Mummadi, C.K., Hutmacher, R., Rambach, K., Levinkov, E., Brox, T. & Metzen, J.H. (2021). Test-time adaptation to distribution shift by confidence maximization and input transformation. arXiv preprint arXiv:2106.14999.

  • Neuhold, G., Ollmann, T., Rota Bulo, S. & Kontschieder, P. (2017). The mapillary vistas dataset for semantic understanding of street scenes. In: ICCV.

  • Nuriel, O., Benaim, S., & Wolf, L. (2021). Permuted adain: reducing the bias towards global statistics in image classification. In: CVPR.

  • Otálora, S., Atzori, M., Andrearczyk, V., Khan, A., & Müller, H. (2019). Staining invariant features for improving generalization of deep convolutional neural networks in computational pathology. Frontiers in Bioengineering and Biotechnology.

  • Pan, X., Luo, P., Shi, J., & Tang, X. (2018). Two at once: Enhancing learning and generalization capacities via ibn-net. In: ECCV.

  • Pan, X., Zhan, X., Shi, J., Tang, X. & Luo, P. (2019). Switchable whitening for deep representation learning. In: ICCV.

  • Peng, D., Lei, Y., Hayat, M., Guo, Y., & Li, W. (2022). Semantic-aware domain generalized segmentation. In: CVPR.

  • Peng, X., Li, Y., & Saenko, K. (2020). Domain2vec: Domain embedding for unsupervised domain adaptation. In: ECCV.

  • Raghunandan, A., Raghav, P., & Aradhya, H.R., et al. (2018). Object detection algorithms for video surveillance applications. In: ICCSP.

  • Rao, Z., Guo, J., Tang, L., Huang, Y., Ding, X. & Guo, S. (2024). Srcd: Semantic reasoning with compound domains for single-domain generalized object detection. TNNLS.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In: NeurIPS.

  • Rezaeianaran, F., Shetty, R., Aljundi, R., Reino, D.O., Zhang, S., & Schiele, B. (2021). Seeking similarities over differences: Similarity-based domain alignment for adaptive object detection. In: ICCV.

  • Richter, S.R., Vineet, V., Roth, S. & Koltun, V. (2016). Playing for data: Ground truth from computer games. In: ECCV.

  • Ros, G., Sellart, L., Materzynska, J., Vazquez, D. & Lopez, A.M. (2016). The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In: CVPR.

  • Roy, S., Krivosheev, E., Zhong, Z., Sebe, N., & Ricci, E. (2021). Curriculum graph co-teaching for multi-target domain adaptation. In: CVPR.

  • Saito, K., Ushiku, Y., Harada, T., & Saenko, K. (2019). Strong-weak distribution alignment for adaptive object detection. In: CVPR.

  • Sakaridis, C., Dai, D. & Van Gool, L. (2021). Acdc: The adverse conditions dataset with correspondences for semantic driving scene understanding. In: ICCV.

  • Sakaridis, C., Dai, D., & Van Gool, L. (2018). Semantic foggy scene understanding with synthetic data. IJCV.

  • Schneider, S., Rusak, E., Eck, L., Bringmann, O., Brendel, W., & Bethge, M. (2020). Improving robustness against common corruptions by covariate shift adaptation. In: NeurIPS.

  • Segu, M., Tonioni, A., & Tombari, F. (2020). Batch normalization embeddings for deep domain generalization. arXiv preprint arXiv:2011.12672.

  • Shankar, S., Piratla, V., Chakrabarti, S., Chaudhuri, S., Jyothi, P. & Sarawagi, S. (2018). Generalizing across domains via cross-gradient training. In: ICLR.

  • Somavarapu, N., Ma, C.-Y., & Kira, Z. (2020). Frustratingly simple domain generalization via image stylization. arXiv preprint arXiv:2006.11207.

  • Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al. (2020). Scalability in perception for autonomous driving: Waymo open dataset. In: CVPR.

  • Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A., & Hardt, M. (2020). Test-time training with self-supervision for generalization under distribution shifts. In: ICML.

  • Tang, Z., Gao, Y., Zhu, Y., Zhang, Z., Li, M., & Metaxas, D.N. (2021). Crossnorm and selfnorm for generalization under distribution shifts. In: ICCV.

  • Ulyanov, D., Vedaldi, A. & Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.

  • VS, V., Gupta, V., Oza, P., Sindagi, V.A., & Patel, V.M. (2021). Mega-cda: Memory guided attention for category-aware unsupervised domain adaptive object detection. In: CVPR.

  • Venkateswara, H., Eusebio, J., Chakraborty, S., & Panchanathan, S. (2017). Deep hashing network for unsupervised domain adaptation. In: CVPR.

  • Verma, V., Lamb, A., Beckham, C., Najafi, A., Mitliagkas, I., Lopez-Paz, D. & Bengio, Y. (2019). Manifold mixup: Better representations by interpolating hidden states. In: ICML.

  • Vidit, V., Engilberge, M., & Salzmann, M. (2023). Clip the gap: A single domain generalization approach for object detection. In: CVPR.

  • Volpi, R., & Murino, V. (2019). Addressing model vulnerability to distributional shifts over image transformation sets. In: ICCV.

  • Volpi, R., Namkoong, H., Sener, O., Duchi, J.C., Murino, V., & Savarese, S. (2018). Generalizing to unseen domains via adversarial data augmentation. In: NeurIPS.

  • Wan, Z., Li, L., Li, H., He, H., & Ni, Z. (2020). One-shot unsupervised domain adaptation for object detection. In: IJCNN.

  • Wang, Q., Fink, O., Van Gool, L., & Dai, D. (2022). Continual test-time domain adaptation. In: CVPR.

  • Wang, X., Huang, T.E., Liu, B., Yu, F., Wang, X., Gonzalez, J.E., & Darrell, T. (2021). Robust object detection via instance-level temporal cycle confusion. In: ICCV.

  • Wang, D., Shelhamer, E., Liu, S., Olshausen, B., & Darrell, T. (2020). Tent: Fully test-time adaptation by entropy minimization. In: ICLR.

  • Wang, K., Yang, C., & Betke, M. (2021). Consistency regularization with high-dimensional nonadversarial source-guided perturbation for unsupervised domain adaptation in segmentation. In: AAAI.

  • Wu, Y. & He, K. (2018). Group normalization. In: ECCV.

  • Wu, A., & Deng, C. (2022). Single-domain generalized object detection in urban scene via cyclic-disentangled self-distillation. In: CVPR.

  • Wu, Y., & Johnson, J. (2021). Rethinking "batch" in batchnorm. arXiv preprint arXiv:2105.07576.

  • Wu, Y., Chen, Y., Yuan, L., Liu, Z., Wang, L., Li, H. & Fu, Y. (2020). Rethinking classification and localization for object detection. In: CVPR.

  • Xie, X., Chen, J., Li, Y., Shen, L., Ma, K., & Zheng, Y. (2020). Self-supervised cyclegan for object-preserving image-to-image domain adaptation. In: ECCV.

  • Xu, Z., Liu, D., Yang, J., Raffel, C., & Niethammer, M. (2020). Robust and generalizable visual representation learning via random convolutions. In: ICLR.

  • Xu, M., Wang, H., Ni, B., Tian, Q. & Zhang, W. (2020). Cross-domain detection via graph-induced prototype alignment. In: CVPR.

  • Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V. & Darrell, T. (2020). Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In: CVPR.

  • Yue, X., Zhang, Y., Zhao, S., Sangiovanni-Vincentelli, A., Keutzer, K., & Gong, B. (2019). Domain randomization and pyramid consistency: Simulation-to-real generalization without accessing target domain data. In: ICCV.

  • Yue, X., Zheng, Z., Zhang, S., Gao, Y., Darrell, T., Keutzer, K., & Vincentelli, A.S. (2021). Prototypical cross-domain self-supervised learning for few-shot unsupervised domain adaptation. In: CVPR.

  • Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J. & Yoo, Y. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. In: ICCV.

  • Zeiler, M.D. & Fergus, R. (2014). Visualizing and understanding convolutional networks. In: ECCV.

  • Zhang, H., Cisse, M., Dauphin, Y.N. & Lopez-Paz, D. (2018). mixup: Beyond empirical risk minimization. In: ICLR.

  • Zhang, Y., David, P., & Gong, B. (2017). Curriculum domain adaptation for semantic segmentation of urban scenes. In: ICCV.

  • Zhang, M.M., Levine, S., & Finn, C. (2021). Memo: Test time robustness via adaptation and augmentation. In: NeurIPS Workshop.

  • Zhang, X., Xu, Z., Xu, R., Liu, J., Cui, P., Wan, W., Sun, C., & Li, C. (2022). Towards domain generalization in object detection. arXiv preprint arXiv:2203.14387.

  • Zhao, Y., Zhong, Z., Zhao, N., Sebe, N., & Lee, G.H. (2024). Style-hallucinated dual consistency learning: A unified framework for visual domain generalization. IJCV.

  • Zhou, K., Yang, Y., Hospedales, T. & Xiang, T. (2020). Learning to generate novel domains for domain generalization. In: ECCV.

  • Zhou, K., Yang, Y., Hospedales, T., & Xiang, T. (2020). Deep domain-adversarial image generation for domain generalisation. In: AAAI.

  • Zhou, K., Yang, Y., Qiao, Y., & Xiang, T. (2020). Domain generalization with mixstyle. In: ICLR.

  • Zhou, K., Yang, Y., Qiao, Y., & Xiang, T. (2024). Mixstyle neural networks for domain generalization and adaptation. IJCV.

  • Zhu, X., Pang, J., Yang, C., Shi, J. & Lin, D. (2019). Adapting object detectors via selective cross-domain alignment. In: CVPR.


Funding

This work is supported in part by the National Natural Science Foundation of China (62406140), Natural Science Foundation of Jiangsu Province (BK20241200), and the Research Grant Council of the Hong Kong SAR (16201420).

Author information


Corresponding authors

Correspondence to Yu-Wing Tai or Chi-Keung Tang.

Additional information

Communicated by Yen-Yu Lin.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 More Problem Analysis

We perform further problem analysis for robust object detection. We observe that the domain style overfitting problem is mainly caused by the biased distribution of low-level features learned in shallow CNN layers. Table 15 shows four Faster R-CNN models used for analysis and their performance on three datasets.

Biased Model Impedes Domain Generalization. When trained on data of a single domain style, the learned model performs well on test data drawn from the same distribution, grouping in-domain features together. However, the learned model tends to separate distinct domains and thus hardly generalizes from the source to the target domain. Figure 2 shows that image feature channel statistics of the same domain are grouped together, while those of different domains are separated. Figure 11 and Table 15 show that the biased distributions of the Baseline and Overfit models cause large domain feature-statistic discrepancies, which impede generalization to unseen domains.

Fig. 11: Accumulated Maximum Mean Discrepancy (MMD) of the feature channel statistics for different dataset pairs. Four models are evaluated at different convolutional stages; a smaller MMD means a smaller feature-level domain/style gap among datasets.

Shallow CNN Features Matter for Generalization. Figures 2 and 11 show that shallow CNN layers exhibit larger domain feature-statistic discrepancies. Such discrepancies propagate from the shallow to the deep layers and ultimately result in poor target-domain performance. The shallow CNN layers suffer from an increasingly biased distribution the longer they are trained on the source domain: note in particular that Figure 11 shows the Overfit model has larger domain feature gaps at all layers, and Table 15 further shows quantitatively that this overfitted model generalizes worse to unseen target domains while producing better source-domain performance. Thus shallow CNN layers do matter for generalizing a model across domain styles, because they preserve more style information by encoding local structures such as corners, edges, color and texture, which are closely related to style (Zeiler & Fergus, 2014). Although the deep CNN layers encode more semantic information that is less sensitive to style, if the model is trained on biased shallow CNN features, the deep layers cannot effectively calibrate the style-biased semantic information, and the entire model overfits to the source domain.
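For reference, here is a hedged sketch of the quantity plotted in Fig. 11: a (biased) RBF-kernel MMD estimate (Borgwardt et al., 2006) between per-image feature channel statistics collected from two datasets. The shapes and the `gamma` bandwidth below are illustrative assumptions:

```python
import torch

def rbf_mmd(x, y, gamma=1.0):
    """Biased MMD^2 estimate between statistic sets x: (n, d) and y: (m, d)."""
    def k(a, b):
        return torch.exp(-gamma * torch.cdist(a, b).pow(2))  # RBF kernel matrix
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

# Toy usage: per-image channel means from one CNN stage of two datasets.
stats_a = torch.randn(64, 256)           # e.g., source-domain stage-1 channel means
stats_b = torch.randn(64, 256) + 0.5     # offset mimics a feature-level style gap
print(rbf_mmd(stats_a, stats_b).item())  # larger value = larger domain/style gap
```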

Table 15 Four Faster R-CNN models with different settings. All are trained on the Cityscapes train set and evaluated on the Cityscapes (C), Foggy Cityscapes (F) and BDD100k (B) val sets
Table 16 Leave-one-domain-out generalization results on the PACS dataset

Reducing Domain Style Overfitting. Diverse training domains help deep models learn domain-invariant representations and thus reduce domain style overfitting. Our NP efficiently synthesizes diverse latent domain styles and effectively reduces the inherent domain style overfitting (see the sketch below). Figures 2 and 11 show that NP significantly reduces the domain feature gap, especially in the shallow and deep CNN layers. Table 15 shows that our model with NP generalizes well to unseen target domains while maintaining source-domain performance. The image-level domain style synthesis method StyleRD also reduces domain style gaps and improves domain generalization; however, as we show in the main paper, it is not as effective as ours.
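For intuition, here is a minimal sketch in the spirit of NP: during training, the channel mean and standard deviation of low-level features are randomly rescaled to synthesize latent styles. The Gaussian noise scale below is an illustrative assumption, not necessarily the paper's exact hyperparameter:

```python
import torch

def normalization_perturbation(feat, noise_std=0.75):
    """feat: (B, C, H, W) shallow-stage features; returns style-perturbed features.
    Apply only during training, and only on shallow CNN layers."""
    mu = feat.mean(dim=(2, 3), keepdim=True)            # per-image channel means
    sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-6   # per-image channel stds
    B, C = feat.shape[:2]
    alpha = 1.0 + noise_std * torch.randn(B, C, 1, 1, device=feat.device)  # mean scale
    beta = 1.0 + noise_std * torch.randn(B, C, 1, 1, device=feat.device)   # std scale
    normalized = (feat - mu) / sigma                    # strip the instance style
    return (beta * sigma) * normalized + alpha * mu     # re-style with perturbed stats
```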

Table 17 Semantic segmentation domain generalization results. Train datasets are underlined

1.2 Classification Domain Generalization

We compare our method to other DG techniques on the classification domain generalization (DG) task, i.e., MMD-AAE Li et al. (2018), CCSA Motiian et al. (2017), JiGen Carlucci et al. (2019), CrossGrad Shankar et al. (2018), Epi-FCR Li et al. (2019), Metareg Balaji et al. (2018), L2A-OT Zhou et al. (2020), Manifold Mixup Verma et al. (2019), Cutout DeVries and Taylor (2017), CutMix Yun et al. (2019), Mixup Zhang et al. (2018), DropBlock Ghiasi et al. (2018), and DSU Li et al. (2022). We follow MixStyle Zhou et al. (2020) to implement our method on the popular PACS Li et al. (2017) dataset. Specifically, we use the public MixStyle codebase (https://github.com/KaiyangZhou/mixstyle-release) to train and evaluate our method by directly replacing the mixstyle module with our NP/NP+ module, keeping all other settings unchanged. We remove the default photometric data augmentation of NP+ for a fair comparison. The model is trained on three domains and evaluated on the left-out domain. Table 16 shows that our NP substantially improves classification DG performance, and our NP+ further boosts it to 84.0. Note that the PACS domain shifts in classification DG are distinct from the real-world domain shifts in dense prediction tasks. Although not specifically designed for classification DG, our method still performs better than or comparably to previous classification DG methods, thanks to the diverse latent styles generated by our perturbation operation.

1.3 Semantic Segmentation Domain Generalization

We follow the previous semantic segmentation domain generalization SOTA method RobustNet Choi et al. (2021) to train and evaluate our method. The model is trained on GTAV/Cityscapes datasets, and evaluated on various datasets, i.e., GTAV (G) Richter et al. (2016), Cityscapes (C) Cordts et al. (2016), BDD100k (B) Yu et al. (2020), Mapillary Vistas (M) Neuhold et al. (2017), and Synthia (S) Ros et al. (2016). We compare our method to UDA segmentation methods, i.e., SW Pan et al. (2019), IBN-Net Pan et al. (2018), IterNorm Huang et al. (2019), and ISW Choi et al. (2021). We also compare our method to classification DG methods by applying them on our baseline for a fair comparison, i.e., SFA Li et al. (2021), pAdaIN Nuriel et al. (2021), Mixstyle Zhou et al. (2020), and DSU Li et al. (2022). Table 17 shows that our method performs the best.

We also evaluate our method on the recently proposed ACDC Sakaridis et al. (2021) dataset, which contains four adverse weather types: fog, night, rain, and snow. Specifically, we directly apply our DeepLabv3+ Chen et al. (2018) model trained on the Cityscapes Cordts et al. (2016) dataset to the ACDC val set. Table 18 shows that our method significantly improves semantic segmentation generalization performance under adverse-weather domain shifts.

Table 18 Semantic segmentation domain generalization results on ACDC dataset

1.4 Limitation Discussion

In most real-world scenarios, domain discrepancy is mainly caused by differing styles, which is the fundamental assumption of our method and of other feature-statistic perturbation methods. However, discrepancies can also occur in object and background content, e.g., automobiles of the 1950s and of today are very different, even from the same car manufacturer. Background content can likewise vary widely across domains, such as forest, desert, countryside and city. All feature-statistic perturbation based DG methods, including ours, cannot handle such content discrepancy well. Addressing domain content discrepancy is thus a significant direction for future DG research, especially in the age of globalization, and it is precisely why we pair Domain-Invariant Training with Continual Test-Time Adaptation for robust object detection.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Fan, Q., Segu, M., Schiele, B. et al. Robust Object Detection with Domain-Invariant Training and Continual Test-Time Adaptation. Int J Comput Vis 133, 6768–6793 (2025). https://doi.org/10.1007/s11263-025-02465-9
