Abstract
The attention mechanism has proven effective on various visual tasks in recent years. In semantic segmentation, it is employed by a variety of methods, with both convolutional neural networks and vision transformers serving as backbones. However, we observe that the attention mechanism is vulnerable to patch-based adversarial attacks. Through an analysis of the effective receptive field, we attribute this vulnerability to the wide receptive field brought by global attention, which allows the influence of an adversarial patch to spread across the image. To address this issue, we propose a robust attention mechanism (RAM) that improves the robustness of semantic segmentation models and notably mitigates their vulnerability to patch-based attacks. Compared to the vanilla attention mechanism, RAM introduces two novel modules, max attention suppression and random attention dropout, both of which refine the attention matrix and limit the influence of a single adversarial patch on the segmentation results at other positions. Extensive experiments demonstrate the effectiveness of RAM in improving the robustness of semantic segmentation models against various patch-based attack methods under different attack settings.
References
Andriushchenko, M., Croce, F., Flammarion, N., & Hein, M. (2020). Square attack: A query-efficient black-box adversarial attack via random search. In ECCV (Vol. 12368, pp. 484–501).
Athalye, A., Engstrom, L., Ilyas, A., & Kwok, K. (2018). Synthesizing robust adversarial examples. In ICML (Vol. 80, pp. 284–293).
Bai, Y., Mei, J., Yuille, A. L., & Xie, C. (2021). Are transformers more robust than cnns? In NeurIPS (pp. 26831–26843).
Bai, J., Yuan, L., Xia, S., Yan, S., Li, Z., & Liu, W. (2022). Improving vision transformers by revisiting high-frequency components. arXiv preprint arXiv:2204.00993.
Benz, P., Ham, S., Zhang, C., Karjauv, A., & Kweon, I. S. (2021). Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. In BMVC (p. 25).
Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., & Veit, A. (2021). Understanding robustness of transformers for image classification. In ICCV (pp. 10211–10221).
Brown, T.B., Mané, D., Roy, A., Abadi, M., & Gilmer, J. (2017). Adversarial patch. arXiv preprint arXiv:1712.09665.
Chen, Z., Li, B., Xu, J., Wu, S., Ding, S., & Zhang, W. (2022). Towards practical certifiable patch defense with vision transformer. In CVPR (pp. 15127–15137).
Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2015). Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR.
Chen, L., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., & Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In CVPR (pp. 1280–1289).
Cheng, B., Schwing, A. G., & Kirillov, A. (2021). Per-pixel classification is not all you need for semantic segmentation. In NeurIPS (pp. 17864–17875).
Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI, 40(4), 834–848.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In CVPR (pp. 3213–3223).
Croce, F., & Hein, M. (2020). Minimally distorted adversarial examples with a fast adaptive boundary attack. In ICML (Vol. 119, pp. 2196–2205).
Croce, F., & Hein, M. (2020). Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML (Vol. 119, pp. 2206–2216).
Debenedetti, E., Sehwag, V., & Mittal, P. (2022). A light recipe to train robust vision transformers. arXiv preprint arXiv:2209.07399.
Deng, J., Guo, J., Xue, N., & Zafeiriou, S. (2019). Arcface: Additive angular margin loss for deep face recognition. In CVPR (pp. 4690–4699).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR.
Everingham, M., Eslami, S. M. A., Gool, L. V., Williams, C. K. I., Winn, J. M., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. IJCV, 111(1), 98–136.
Fu, Y., Zhang, S., Wu, S., Wan, C., & Lin, Y. (2022). Patch-fool: Are vision transformers always robust against adversarial perturbations? In ICLR.
Gu, J., Tresp, V., & Qin, Y. (2022). Are vision transformers robust to patch perturbations? In ECCV (Vol. 13672, pp. 404–421).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).
Herrmann, C., Sargent, K., Jiang, L., Zabih, R., Chang, H., Liu, C., Krishnan, D., & Sun, D. (2022). Pyramid adversarial training improves vit performance. In CVPR (pp. 13409–13419).
Hu, Y., Chen, J., Kung, B., Hua, K., & Tan, D. S. (2021). Naturalistic physical adversarial patch for object detectors. In ICCV (pp. 7828–7837).
Huang, Y., & Li, Y. (2021). Zero-shot certified defense against adversarial patches with vision transformers. arXiv preprint arXiv:2111.10481.
Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., & Liu, W. (2019). Ccnet: Criss-cross attention for semantic segmentation. In ICCV (pp. 603–612).
Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., & Shi, H. (2021). Semask: Semantically masked transformers for semantic segmentation. arXiv preprint arXiv:2112.12782.
Kamann, C., & Rother, C. (2020). Benchmarking the robustness of semantic segmentation models. In CVPR (pp. 8825–8835).
Kamann, C., & Rother, C. (2020). Increasing the robustness of semantic segmentation models with painting-by-numbers. In ECCV (Vol. 12355, pp. 369–387).
Karmon, D., Zoran, D., & Goldberg, Y. (2018). Lavan: Localized and visible adversarial noise. In ICML (Vol. 80, pp. 2512–2520).
Kirillov, A., Girshick, R. B., He, K., & Dollár, P. (2019). Panoptic feature pyramid networks. In CVPR (pp. 6399–6408).
Lee, M., & Kolter, J. Z. (2019). On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897.
Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., & Liu, H. (2019). Expectation-maximization attention networks for semantic segmentation. In ICCV (pp. 9166–9175).
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV (pp. 9992–10002).
Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet for the 2020s. In CVPR (pp. 11966–11976).
Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., & Song, L. (2017). Sphereface: Deep hypersphere embedding for face recognition. In CVPR (pp. 6738–6746).
Liu, X., Yang, H., Liu, Z., Song, L., Chen, Y., & Li, H. (2019). DPATCH: An adversarial patch attack on object detectors. In AAAI Workshop on Artificial Intelligence Safety (Vol. 2301).
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR (pp. 3431–3440).
Lovisotto, G., Finnie, N., Munoz, M., Mummadi, C. K., & Metzen, J. H. (2022). Give me your attention: Dot-product attention considered harmful for adversarial patch robustness. In CVPR (pp. 15213–15222).
Luo, W., Li, Y., Urtasun, R., & Zemel, R. S. (2016). Understanding the effective receptive field in deep convolutional neural networks. In NeurIPS (pp. 4898–4906).
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. In ICLR.
Mahmood, K., Mahmood, R., & Dijk, M. (2021). On the robustness of vision transformers to adversarial examples. In ICCV (pp. 7818–7827).
Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., & Xue, H. (2022). Towards robust vision transformer. In CVPR (pp. 12032–12041).
Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A. S., Bethge, M., & Brendel, W. (2019). Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484.
Mirsky, Y. (2021). Ipatch: A remote adversarial patch. arXiv preprint arXiv:2105.00113.
Nakka, K. K., & Salzmann, M. (2020). Indirect local attacks for context-aware semantic segmentation networks. In ECCV (Vol. 12350, pp. 611–628).
Nesti, F., Rossolini, G., Nair, S., Biondi, A., & Buttazzo, G. C. (2022). Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In WACV (pp. 2826–2835).
Rando, J., Naimi, N., Baumann, T., & Mathys, M. (2022). Exploring adversarial attacks and defenses in vision transformers trained with DINO. arXiv preprint arXiv:2206.06761.
Salman, H., Jain, S., Wong, E., & Madry, A. (2022). Certified patch robustness via smoothed vision transformers. In CVPR (pp. 15116–15126).
Sato, T., Shen, J., Wang, N., Jia, Y., Lin, X., & Chen, Q. A. (2021). Dirty road can attack: Security of deep learning based automated lane centering under physical-world attack. In USENIX Security Symposium (pp. 3309–3326).
Serrurier, M., Mamalet, F., González-Sanz, A., Boissin, T., Loubes, J., & Barrio, E. (2021). Achieving robustness in classification using optimal transport with hinge regularization. In CVPR (pp. 505–514).
Shao, R., Shi, Z., Yi, J., Chen, P.-Y., & Hsieh, C.-J. (2022). On the adversarial robustness of vision transformers. arXiv preprint arXiv:2103.15670.
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
Strudel, R., Pinel, R.G., Laptev, I., & Schmid, C. (2021). Segmenter: Transformer for semantic segmentation. In ICCV (pp. 7242–7252).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In CVPR (pp. 2818–2826).
Tan, M., & Le, Q. V. (2021). Efficientnetv2: Smaller models and faster training. In ICML (pp. 10096–10106).
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In ICML (Vol. 139, pp. 10347–10357).
Wang, Z., Bai, Y., Zhou, Y., & Xie, C. (2022). Can cnns be more robust than transformers? arXiv preprint arXiv:2206.03452.
Wang, X., Girshick, R. B., Gupta, A., & He, K. (2018). Non-local neural networks. In CVPR (pp. 7794–7803).
Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., & Gu, Q. (2019). On the convergence and robustness of adversarial training. In ICML (Vol. 97, pp. 6586–6595).
Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., & Liu, W. (2018). Cosface: Large margin cosine loss for deep face recognition. In CVPR (pp. 5265–5274).
Wei, X., Guo, Y., & Yu, J. (2022). Adversarial sticker: A stealthy attack method in the physical world. IEEE TPAMI.
Wu, B., Gu, J., Li, Z., Cai, D., He, X., & Liu, W. (2022). Towards efficient adversarial training on vision transformers. arXiv preprint arXiv:2207.10498.
Xiao, Z., Gao, X., Fu, C., Dong, Y., Gao, W., Zhang, X., Zhou, J., & Zhu, J. (2021). Improving transferability of adversarial patches on face recognition with generative models. In CVPR (pp. 11845–11854).
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified perceptual parsing for scene understanding. In ECCV (Vol. 11209, pp. 432–448).
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., & Luo, P. (2021). Segformer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS (pp. 12077–12090).
Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., & Yuille, A. L. (2017). Adversarial examples for semantic segmentation and object detection. In ICCV (pp. 1378–1387).
Xu, X., Zhao, H., & Jia, J. (2021). Dynamic divide-and-conquer adversarial training for robust semantic segmentation. In ICCV (pp. 7466–7475).
Yang, C., Kortylewski, A., Xie, C., Cao, Y., & Yuille, A. L. (2020). Patchattack: A black-box texture-based attack with reinforcement learning. In ECCV (Vol. 12371, pp. 681–698).
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., & Yan, S. (2022). Metaformer is actually what you need for vision. In CVPR (pp. 10809–10819).
Yuan, Y., & Wang, J. (2018). Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916.
Zhang, B., Liu, L., Phan, M. H., Tian, Z., Shen, C., & Liu, Y. (2024). Segvit v2: Exploring efficient and continual semantic segmentation with plain vision transformers. IJCV, 132(4), 1126–1147.
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In CVPR (pp. 6230–6239).
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P. H. S., & Zhang, L. (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR (pp. 6881–6890).
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ADE20K dataset. In CVPR (pp. 5122–5130).
Zoph, B., Vasudevan, V., Shlens, J., & Le, Q. V. (2018). Learning transferable architectures for scalable image recognition. In CVPR (pp. 8697–8710).
Acknowledgements
This work is partially supported by National Key R&D Program of China (No. 2021YFC3310100), Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB0680000), Beijing Nova Program (20230484368), National Natural Science Foundation of China (No. 62176251), and Youth Innovation Promotion Association CAS.
Additional information
Communicated by Kaiyang Zhou.
Appendices
Appendix A The Details of Attack Methods
In this section, we first introduce the general framework of the attack methods, and then describe the ten attack methods (i.e., PGD (Madry et al., 2018), DAG (Xie et al., 2017), IPatch (Mirsky, 2021), SSPA (Nesti et al., 2022), Patch-Fool (Fu et al., 2022), Attention-Fool (Lovisotto et al., 2022), AutoAttack (Croce & Hein, 2020), EOT (Athalye et al., 2018), MaxVarDAG, and MaxAttnDAG) in detail.
In order to comprehensively evaluate the defense effect of our proposed RAM, we first use the widely adopted PGD (Madry et al., 2018) method, together with three strong attack methods designed for semantic segmentation models (i.e., DAG (Xie et al., 2017), IPatch (Mirsky, 2021), and SSPA (Nesti et al., 2022)), to evaluate the robustness of models. In addition, we adopt recently proposed attention-based attack methods (i.e., Patch-Fool (Fu et al., 2022) and Attention-Fool (Lovisotto et al., 2022)), together with several strong attack methods that are often used as benchmarks (i.e., AutoAttack (Croce & Hein, 2020) and EOT (Athalye et al., 2018)), to further evaluate the robustness of segmentation models. Moreover, to evaluate the effectiveness of our defense method more comprehensively, we employ several adaptive attacks. Specifically, based on the DAG method, we propose two adaptive attack methods, MaxVarDAG and MaxAttnDAG, which specifically target the attention matrix.
1.1 A.1 The General Framework of Attack Methods
We use the adversarial attack procedure against semantic segmentation models in the DAG method (Xie et al., 2017) as the general framework, and add the constraint of a patch-based mask to it. The algorithm is shown in Algorithm 1, where T is the number of iterative steps of the attack, \(\gamma \) is a hyperparameter that controls the scale of the gradient, and \(\text {Clip}\) denotes clipping the generated adversarial examples to the valid image range, i.e., [0, 1]. The specific loss functions used by the different attack methods are introduced below.
In the experiments, we set \(T=400\) for all attack methods. For the hyperparameter \(\gamma \), we set \(\gamma =0.005\) for PGD (Madry et al., 2018) and \(\gamma =1\) for all other attack methods.
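As a reference for the descriptions below, a minimal PyTorch-style sketch of this general framework is given here; the function name, the gradient normalization, and the tensor shapes are assumptions of the sketch rather than the exact implementation of Algorithm 1.

```python
import torch

def patch_attack(model, x, loss_fn, patch_mask, T=400, gamma=1.0):
    """Generic iterative patch attack (sketch of the framework in Algorithm 1).

    model      : semantic segmentation model returning per-pixel predictions
    x          : clean image of shape (1, 3, H, W) with values in [0, 1]
    loss_fn    : attack-specific objective to be minimized (methods that
                 maximize their objective can simply negate it)
    patch_mask : binary tensor of shape (1, 1, H, W), 1 inside the patch region
    """
    x_adv = x.clone().detach()
    for _ in range(T):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv))
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # gamma controls the scale of the gradient step; normalizing by
            # the gradient norm is an assumption of this sketch
            step = gamma * grad / (grad.norm() + 1e-12)
            # restrict the update to the patch region and clip to [0, 1]
            x_adv = torch.clamp(x_adv - step * patch_mask, 0.0, 1.0).detach()
    return x_adv
```

The per-method losses sketched in the following subsections can be passed as `loss_fn` to this loop.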
1.2 A.2 Projected Gradient Descent
Projected Gradient Descent (PGD) (Madry et al., 2018) is one of the earliest classic adversarial attack methods, and it focuses on attacking classification models under an \(L_p\)-norm constraint. In the targeted attack scenario, it generates adversarial examples by optimizing the cross-entropy loss between the model predictions and the target labels. In this paper, we migrate it to the task of patch-based attacks on semantic segmentation models. Assume that the predicted segmentation result of the model for the input image \(\textbf{X}\in \mathbb {R}^{H\times W\times 3}\) is \(\textbf{Y}\in [0,1]^{H\times W\times C}\), the ground-truth label is \(\mathbf {Y^g}\in \{0,1,\cdots , C-1\}^{H\times W}\), and the target label specified by the attacker is \(\mathbf {Y^t}\in \{0,1,\cdots , C-1\}^{H\times W}\), where H and W are the height and width of the input image, and C is the number of segmentation categories. The loss function of the PGD method is as follows:
where \(\mathbf {Y^t}_{ij}\) represents the target label at position (i, j) in the image, and \(\textbf{Y}_{ijc}\) represents the probability predicted by the model that the pixel at position (i, j) in the image belongs to category c.
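For illustration, a minimal PyTorch-style sketch of a targeted per-pixel cross-entropy consistent with these definitions is shown below; the averaging over pixels and the sign convention are assumptions of the sketch.

```python
import torch

def pgd_target_ce(Y, Y_t, eps=1e-12):
    """Targeted cross-entropy over all pixels (sketch).

    Y   : predicted probabilities, shape (H, W, C)
    Y_t : attacker-chosen target labels, shape (H, W), integer class ids
    """
    # probability the model assigns to the attacker's target class at each pixel
    p_t = torch.gather(Y, dim=2, index=Y_t.unsqueeze(-1)).squeeze(-1)
    # the attacker drives this value down, i.e., pushes the prediction at
    # every pixel toward the target label
    return -torch.log(p_t + eps).mean()
```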
1.3 A.3 Dense Adversary Generation
Dense Adversary Generation (DAG) (Xie et al., 2017) is an attack method designed for dense prediction tasks, such as semantic segmentation and object detection. The basic idea is to define a dense set of targets together with a different set of desired labels, and to optimize a loss function so as to produce incorrect recognition results on all targets simultaneously. The loss function used in DAG is as follows:
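The following PyTorch-style sketch only illustrates the standard DAG formulation applied pixel-wise, i.e., a logit margin between the correct class and the adversarial class accumulated over pixels that are not yet fooled; the exact form used here may differ.

```python
import torch

def dag_loss(logits, y_true, y_target):
    """DAG-style objective (sketch): suppress the true-class score and promote
    the adversarial-class score on every still-correctly-classified pixel.

    logits   : per-pixel class scores, shape (H, W, C)
    y_true   : ground-truth labels, shape (H, W)
    y_target : adversarial target labels, shape (H, W)
    """
    active = logits.argmax(dim=-1) == y_true            # pixels not yet fooled
    z_true = torch.gather(logits, 2, y_true.unsqueeze(-1)).squeeze(-1)
    z_target = torch.gather(logits, 2, y_target.unsqueeze(-1)).squeeze(-1)
    # the attacker minimizes this margin, accumulated over the active set only
    return (z_true - z_target)[active].sum()
```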
1.4 A.4 IPatch
IPatch (Mirsky, 2021) aims to change the semantics of locations far from the patch to a specific category by placing a universal adversarial patch anywhere within an image. When we use IPatch to conduct patch-based attacks in our work, we fix the adversarial patch at the lower-right corner of the image and generate it for each image separately. The loss function used in IPatch is as follows:
where \(\epsilon =1\textrm{e}{-10}\).
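A short sketch of the fixed patch placement described above is given below; the default patch size is an illustrative assumption. The resulting mask can be plugged into the generic attack loop sketched in Appendix A.1.

```python
import torch

def lower_right_patch_mask(h, w, patch_h=200, patch_w=200):
    """Binary mask that places the adversarial patch at the lower-right corner
    of an h x w image, matching the IPatch setting described above (sketch)."""
    mask = torch.zeros(1, 1, h, w)
    mask[..., h - patch_h:, w - patch_w:] = 1.0
    return mask
```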
1.5 A.5 Scene-specific Patch Attack
The Scene-specific Patch Attack (SSPA) (Nesti et al., 2022) method designs a new loss function, which considers the pixels that have been successfully attacked and the pixels that have not yet been successfully attacked simultaneously, and adjusts the weights of the two terms through adaptive coefficients. The attack process is formalized as follows:
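A PyTorch-style sketch of this adaptive weighting is given below; the exact adaptive coefficient used by SSPA differs from the simple success-rate weight assumed here, which is only for illustration.

```python
import torch
import torch.nn.functional as F

def sspa_loss(logits, y_target):
    """SSPA-style objective (sketch): split pixels into those already
    predicting the target label and those that do not, and balance the two
    cross-entropy terms with an adaptive coefficient.

    logits   : per-pixel class scores, shape (C, H, W) -- channel-first
    y_target : target labels, shape (H, W), integer class ids
    """
    pred = logits.argmax(dim=0)
    fooled = (pred == y_target)                      # pixels already attacked
    ce = F.cross_entropy(logits.unsqueeze(0), y_target.unsqueeze(0),
                         reduction="none")[0]
    gamma = fooled.float().mean()                    # adaptive weight (assumed form)
    # keep already-fooled pixels fooled while pushing the remaining ones
    return gamma * ce[fooled].sum() + (1.0 - gamma) * ce[~fooled].sum()
```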
1.6 A.6 Patch-Fool
Given the reported insensitivity of ViTs' self-attention mechanism to local perturbations in classification tasks, Patch-Fool (Fu et al., 2022) is designed to fool the self-attention mechanism by attacking its basic component (i.e., a single patch) with a series of attention-aware optimization techniques. It should be noted that Patch-Fool was designed to attack classification models, so we partially modify its loss function to make it suitable for attacking semantic segmentation models.
The loss function used in Patch-Fool consists of two parts, one of which is the commonly used cross-entropy loss:
The other one is the attention-aware loss:
where \(\mathbf {a^{(l,h,i)}}=[a_1^{(l,h,i)}, \cdots , a_n^{(l,h,i)}]\in \mathbb {R}^n\) is the attention distribution for the i-th token of the h-th head in the l-th layer, \(a_p^{(l,h,i)}\) denotes the attention value corresponding to the adversarial patch p.
The adversarial patch is then updated based on both the cross-entropy loss and the layer-wise attention-aware loss:
where \(l\) is the predefined self-attention layer and \(\alpha \) is a weighted coefficient.
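A PyTorch-style sketch of the attention-aware term and the combined objective is given below; the attention-matrix layout and the value of \(\alpha \) are assumptions of the sketch.

```python
import torch

def patch_fool_attn_loss(attn, patch_idx):
    """Attention-aware term of Patch-Fool (sketch): total attention that the
    query tokens of the predefined layer assign to the adversarial patch token.

    attn      : attention matrix of the predefined layer, shape (heads, n, n),
                where attn[h, i, j] is the attention from token i to token j
    patch_idx : token index of the adversarial patch p
    """
    # sum of a_p^{(l,h,i)} over heads h and query tokens i
    return attn[:, :, patch_idx].sum()

def patch_fool_objective(ce_loss, attn, patch_idx, alpha=0.002):
    """Combined objective (sketch), maximized by the attacker: cross-entropy
    plus the weighted attention-aware loss; alpha = 0.002 is illustrative."""
    return ce_loss + alpha * patch_fool_attn_loss(attn, patch_idx)
```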
1.7 A.7 Attention-Fool
Attention-Fool (Lovisotto et al., 2022) argues that Vision Transformer (ViT) models may give a false sense of robustness due to dot-product attention, and that they can be easily fooled by a specific kind of attack termed the Attention-Fool attack. It should be noted that Attention-Fool was also designed to attack classification models, so we partially modify its loss function to make it suitable for attacking semantic segmentation models.
We denote the projected queries by \(\mathbf {P_Q^{hl}=X^{hl}W_Q^{hl}}\in \mathbb {R}^{n\times d_q}\) and the projected keys by \(\mathbf {P_K^{hl}=X^{hl}W_K^{hl}}\in \mathbb {R}^{n\times d_k}\), where \(\mathbf {X^{hl}}\) is the image feature fed to the h-th attention head of the l-th layer, and \(\mathbf {W_Q^{hl}}\) and \(\mathbf {W_K^{hl}}\) are the learned projection matrices. Then \(\mathbf {B^{hl}}=\frac{\mathbf {P_Q^{hl}}(\mathbf {P_K^{hl}})^\top }{\sqrt{d_k}}\in \mathbb {R}^{n\times n}\) quantifies the dot-product similarity between each pair of query and key in attention head h and layer l.
\(\mathcal {L}_{kq}\) is proposed to attack all attention layers and heads simultaneously, which is formalized as follows:
where \(\mathbf {B_{jp}^{hl}}\) denotes the entries of \(\mathbf {B^{hl}}\) corresponding to the adversarial patch p, i.e., the similarities between the queries of all tokens j and the key of the patch.
The loss function used in Attention-Fool attack is the combination of the cross-entropy loss and \(\mathcal {L}_{kq}\):
where \(\alpha \) is a weighted coefficient.
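A PyTorch-style sketch of \(\mathcal {L}_{kq}\) and the combined objective is given below; the aggregation by plain averaging over queries, heads, and layers is a simplification assumed for illustration.

```python
import torch

def attention_fool_kq_loss(B_all, patch_idx):
    """L_kq sketch: drive every query token to attend to the key of the
    adversarial patch, aggregated over all layers and heads.

    B_all     : list over layers of tensors B^{hl}, each of shape (heads, n, n),
                with B[h, j, k] = <query_j, key_k> / sqrt(d_k)
    patch_idx : token index of the adversarial patch p
    """
    per_layer = [B[:, :, patch_idx].mean() for B in B_all]
    # maximizing this value concentrates attention on the patch token
    return torch.stack(per_layer).mean()

def attention_fool_objective(ce_loss, B_all, patch_idx, alpha=1.0):
    """Combined objective (sketch), maximized by the attacker; the weighting
    coefficient alpha is illustrative."""
    return ce_loss + alpha * attention_fool_kq_loss(B_all, patch_idx)
```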
1.8 A.8 AutoAttack
AutoAttack (Croce & Hein, 2020) is a strong adaptive attack method that has been widely used in recent years to evaluate model robustness. It is a parameter-free ensemble attack consisting of four attack methods, i.e., APGD\(_{\text {CE}}\), APGD\(_{\text {DLR}}\), FAB (Croce & Hein, 2020), and Square Attack (Andriushchenko et al., 2020). It should be noted that AutoAttack was also designed to attack classification models, so we partially modify the loss function to make it suitable for attacking semantic segmentation models.
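Conceptually, the ensemble keeps the worst-case result over its component attacks; a minimal sketch of this behavior is given below, where the attack and evaluation interfaces are assumptions.

```python
def worst_case_ensemble(model, x, attacks, eval_fn):
    """AutoAttack-style worst-case ensemble (sketch): run each component attack
    and keep the adversarial example that degrades the segmentation most.

    attacks : list of callables, each mapping (model, x) to an adversarial image
    eval_fn : segmentation quality metric (e.g., mIoU); lower means a stronger attack
    """
    best_x, best_score = x, eval_fn(model, x)
    for attack in attacks:
        x_adv = attack(model, x)
        score = eval_fn(model, x_adv)
        if score < best_score:
            best_x, best_score = x_adv, score
    return best_x
```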
1.9 A.9 EOT
The Expectation Over Transformation (EOT) method (Athalye et al., 2018) calculates the expected value of the loss function over a distribution of possible transformations (e.g., random rotation, translation, or added noise), and then uses this expected value to guide the generation of adversarial examples. EOT allows the attacker to craft adversarial examples that remain effective even when the defense mechanism introduces randomness into the input image.
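A minimal PyTorch-style sketch of the resulting Monte Carlo objective is given below; the transformation interface and the number of samples are assumptions of the sketch.

```python
import random
import torch

def eot_loss(model, x_adv, loss_fn, transforms, n_samples=8):
    """EOT sketch: Monte Carlo estimate of the expected attack loss over a
    distribution of input transformations, so that the patch remains effective
    under randomized defenses.

    transforms : list of callables, each mapping an image batch to a randomly
                 transformed image batch (e.g., rotation, translation, noise)
    """
    total = torch.zeros(())
    for _ in range(n_samples):
        t = random.choice(transforms)
        total = total + loss_fn(model(t(x_adv)))
    # expectation over the transformation distribution (Monte Carlo estimate)
    return total / n_samples
```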
1.10 A.10 MaxVarDAG
In order to further explore the defense effect of our method under adaptive attacks, we design two additional adaptive attack methods based on DAG (i.e., MaxVarDAG and MaxAttnDAG) that specifically target the design of our attention mechanism, so as to evaluate the effectiveness of our defense more comprehensively. Specifically, MaxVarDAG extends the DAG method by introducing an additional regularization term that maximizes the variance of the attention matrix in order to amplify the influence of the dirty patch. The loss function of MaxVarDAG is formalized as follows:
where \(\textbf{M}\) is the normalized attention matrix in Eq. (5), \(\text {Var}\) denotes the variance function, and \(\alpha \) is a weighted coefficient.
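A PyTorch-style sketch of this combined objective is given below; the sign convention assumes the attacker performs gradient descent on the returned value, and the default \(\alpha \) is illustrative.

```python
import torch

def max_var_dag_loss(dag_term, M, alpha=1.0):
    """MaxVarDAG objective (sketch): the DAG loss plus a regularizer that
    maximizes the variance of the normalized attention matrix M.

    dag_term : scalar DAG loss for the current iterate (to be minimized)
    M        : normalized attention matrix, shape (N, N)
    alpha    : weighting coefficient (value illustrative)
    """
    # descending on this value increases Var(M), concentrating attention
    # on a few positions such as the adversarial patch
    return dag_term - alpha * torch.var(M)
```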
1.11 A.11 MaxAttnDAG
MaxAttnDAG also extends DAG by directly adding a regularization term that maximizes the attention value corresponding to the dirty-patch position. Through this design, the influence of the dirty patch on the semantic segmentation result is maximized, so as to conduct an adaptive attack tailored to our proposed defense. The loss function of MaxAttnDAG is formalized as follows:
where \(\textbf{M} \in \mathbb {R}^{N\times N}\) is the normalized attention matrix in Eq. (5), \(\text {resize}\) refers to resizing the vector of attention values received by each position, \(\sum _i \mathbf {M_{i\cdot }}\), to a new size of \(H\times W\), \(\mathbf {M_{mask}} \in \{0,1\}^{H\times W}\) is the mask of the dirty patch, and \(\alpha \) is a weighted coefficient.
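A PyTorch-style sketch of this combined objective is given below; the square token grid, the bilinear resizing, and the sign convention are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def max_attn_dag_loss(dag_term, M, patch_mask, alpha=1.0):
    """MaxAttnDAG objective (sketch): the DAG loss plus a regularizer that
    maximizes the attention received by the dirty-patch region.

    dag_term   : scalar DAG loss for the current iterate (to be minimized)
    M          : normalized attention matrix, shape (N, N)
    patch_mask : binary mask of the dirty patch, shape (H, W)
    """
    # total attention received by each position: sum_i M_{i, .}
    received = M.sum(dim=0)
    side = int(received.numel() ** 0.5)          # assume a square token grid
    attn_map = received.view(1, 1, side, side)
    attn_map = F.interpolate(attn_map, size=patch_mask.shape,
                             mode="bilinear", align_corners=False)[0, 0]
    # descending on this value increases the attention paid to the patch
    return dag_term - alpha * (attn_map * patch_mask).sum()
```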
Appendix B More Visualization Results
In this section, we visualize more segmentation results of semantic segmentation models under different attack scenarios in Fig. 5.
In Fig. 5a, we compare the segmentation results of the UPerNet (Xiao et al., 2018)/DeiT (Touvron et al., 2021) model with our proposed RAM and the baseline under the Permute attack setting. When the baseline model is presented with an input image containing the adversarial patch, the segmentation result almost completely meets the adversary's intention, that is, most areas of the segmentation result are consistent with the target labels chosen by the adversary. In contrast, when the model is equipped with our proposed RAM, the adversarial patch has much less effect on the output, and most regions retain correct segmentation results.
In Fig. 5b, we compare the segmentation results of the SegFormer (Xie et al., 2021)/MiT (Xie et al., 2021) model with our proposed RAM and the baseline under the Strip attack setting. The segmentation results of the baseline model closely follow the strip pattern designed by the adversary, while the segmentation results of our RAM are barely affected.
Both of the visualization results demonstrate that the baseline model is quite vulnerable to patch-based attacks, while our RAM method significantly improves the robustness of semantic segmentation models against various attack scenarios.
About this article
Cite this article
Yuan, Z., Zhang, J., Wang, Y. et al. Towards Robust Semantic Segmentation against Patch-Based Attack via Attention Refinement. Int J Comput Vis 132, 5270–5292 (2024). https://doi.org/10.1007/s11263-024-02120-9