1 Introduction

The emergence of vision transformers in the field of computer vision has driven improvements in model performance (Carion et al., 2020; Dosovitskiy et al., 2020). Unlike a CNN model, a vision transformer learns the associations between image patches and classifies images using a [CLS] token. A CNN model and a vision transformer have structural differences that lead to differences in visualization methods (e.g., for segmentation and localization). A representative approach is GradCAM, which demonstrates the explainability of CNN models by reflecting pixel-level importance using the feature maps and gradients of the models. However, it is difficult to effectively apply GradCAM to vision transformers because their structural characteristics pose several challenges, such as skip connections, dependency on attention operators, and unstable learning due to non-linearities.

To overcome these challenges, previous research has mainly used attention score information between [CLS] tokens and other patch embeddings to discriminate patch embeddings with a significant impact on learning and to visualize the explainability of vision transformers (Carion et al., 2020; Vaswani et al., 2017). Subsequent studies have evaluated the degree to which each attention head contributes to performance (Voita et al., 2019) or integrated the relevance and attention scores in layers by proposing a relevance propagation rule (Chefer et al., 2021). Another study, which evaluated explainability on the localization task, used the long-range visual dependency features of the vision transformer to capture semantically perceived locations (Gao et al., 2021). More recently, relevance map optimization has improved the explainability of vision transformers by assigning low relevance to the background region of an image while placing high relevance on the foreground region (Chefer et al., 2022).

Despite the advantage of this optimization, challenges to explainability methods for vision transformers remain due to their structural characteristics (Chefer et al., 2021; Chefer et al., 2022). We identified two primary limitations in the visualization methods of previous research. The first limitation arises from the discrepancy between prediction results before and after fine-tuning the model for visualization. Prior studies conducted additional fine-tuning to visualize explainability for classification models, which changed the prediction results (Chefer et al., 2022; Gao et al., 2021; Naseer et al., 2021; Choi et al., 2023). The variability of prediction results caused by fine-tuning can undermine the reliability of the model. Explainability visualizations aim to verify the reasoning behind the prediction results of the model, which requires preserving the prediction results and properties (e.g., model weights) of the model as much as possible. However, previous research has made little effort to preserve the characteristics of the original model and has not adequately discussed related issues.

Second, prior studies optimized the threshold by repeatedly using localization labels in test data (Kumar Singh & Jae Lee, 2017; Zhang et al., 2018; Choe & Shim, 2019). The threshold is used to differentiate between the foreground and background in segmentation tasks using the model’s score map. For object localization, the threshold is also used to generate bounding boxes based on the score map. By applying multiple thresholds to a single model through the repeated use of test data, previous studies attempted to find the optimal threshold that improves performance even when the model highlights areas that are disproportionately large or small relative to the object size. Choe et al. (2020, 2022) suggested that this process of selecting the optimal threshold leads to an illusory improvement in localization performance. We argue that prior research has not yet adequately discussed or considered visualization methods that do not use thresholds.

We note that the output patch embeddings for each input image patch in a vision transformer retain the image information of each patch location, and these embeddings can help predict image classes. Based on this motivation, we proposed ICE (I Can visualize Everything) (Choi et al., 2023), a novel method that uses the output patch embeddings of a vision transformer for each image patch, excluding the [CLS] token, to visualize explainability. ICE initially assumes that the class of every patch embedding is background and gradually learns to predict the class of each patch embedding in an image. With this approach, we proposed a loss function for adversarial normalization that combines background and classification losses for each patch embedding. ICE predicts a class for each patch embedding in the foreground region of an image, where the object of the class is likely to exist, and classifies the other regions as background.

Our preliminary version, ICE (Choi et al., 2023), demonstrated state-of-the-art performance in explainability visualization, validating its effectiveness. However, ICE still has limitations due to its sensitivity to hyperparameters and its inability to preserve the model’s properties (e.g., weights and prediction results) when fine-tuning the model. In this paper, we introduce ICEv2 (Interpretability, Comprehensiveness, and Explainability in Vision Transformer). ICEv2 demonstrates higher efficiency, performance, robustness, and scalability, achieved by minimizing the number of trained encoder layers, redesigning the MLP layer with respect to width and depth scaling, and optimizing hyperparameters across various model sizes. ICEv2 proves its effectiveness by training only a minimal set of weights (i.e., the last three encoder layers of the vision transformer) and achieving higher performance than ICE on ImageNet-Segmentation, Jaccard similarity, and unsupervised object discovery. In contrast to ICE, which proved its effectiveness on smaller models, we tested ICEv2 on larger and various sizes of vision transformer models to assess its robustness and scalability. We found that even with a larger number of encoder layers and an increased patch embedding size (i.e., ViT-Large), visualization using ICEv2 is feasible. Finally, by analyzing the hyperparameters that significantly affect ICEv2 training through its two distinct losses (i.e., \(L_{class}\) and \(L_{bg}\)) that separate background and classes, we confirmed that ICEv2 training is feasible for most models.

The preliminary version of this study is presented in ICE (Choi et al., 2023). This research extends previous work by introducing the following three contributions.

  • We propose ICEv2, which can be applied to vision transformers based on the notion of patch-wise classification and adversarial normalization. DeiT-S models with ICEv2 improve class-specific explainability visualization performance.

  • We verified the effectiveness of ICEv2, which achieves state-of-the-art performance while maintaining the weight of the model as much as possible.

  • We confirmed the robustness and scalability of ICEv2, which trains with two opposing losses through the convergence of the number of class patches, by demonstrating its ability to be trained within a similar range of hyperparameters across various models.

2 Related Work

The explainability visualization methods applicable to vision transformers can be divided into two categories: (a) gradient or attribute propagation-based visualization, primarily applied to CNN-based models, and (b) visualization methods that consider the transformer structure. Both categories of research on explainability visualization cover two primary tasks: segmentation and localization.

2.1 Explainability in Computer Vision

The gradient-based method, normally used with CNNs for the segmentation task, uses gradients calculated for each layer through backpropagation. Initially, studies proposed explainability visualization that uses the input multiplied by its gradient within the learning process of image classification models (Shrikumar et al., 2016). Later research advocated visualization that adopts the average value of gradients (Sundararajan et al., 2017; Smilkov et al., 2017). However, these methods are class-agnostic: they visualize explainability regardless of the predicted class. Among gradient-based methods, a representative class-specific approach is GradCAM (Selvaraju et al., 2017). GradCAM uses a gradient-weighted feature map that combines the gradients and input features of a network layer. However, GradCAM has not been effectively applied to explainability visualization for vision transformers because of the structural nature of the transformers, which classify images using [CLS] tokens (Chefer et al., 2021).
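For reference, the GradCAM computation described above can be sketched as follows; the backbone (a torchvision ResNet-50), the choice of its last convolutional block as the target layer, and the placeholder input are illustrative assumptions rather than the exact setup used in this paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

# Hypothetical GradCAM-style sketch on a CNN; the model and target layer are illustrative choices.
model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
features, grads = {}, {}

# Capture the activations and gradients of the last convolutional block.
model.layer4.register_forward_hook(lambda m, i, o: features.update(map=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(map=go[0]))

x = torch.randn(1, 3, 224, 224)          # placeholder for a preprocessed input image
score = model(x).max(dim=1).values       # logit of the predicted class
score.backward()

weights = grads["map"].mean(dim=(2, 3), keepdim=True)         # channel-wise average of the gradients
cam = F.relu((weights * features["map"]).sum(dim=1))          # gradient-weighted feature map, (1, 7, 7)
cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0]  # upsample to the image size
```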

The attribute-based method, another methodology used with CNNs, visualizes a model’s explainability by decomposing the contributions of previous layers from the prediction back to the input. A typical approach in this respect is Layer-wise Relevance Propagation (LRP) (Voita et al., 2019), which propagates the relevance score obtained from a predicted class back to the input image. Other attribute-based methods include RAP (Nam et al., 2020), AGF (Gur et al., 2021), DeepLIFT (Shrikumar et al., 2017), and DeepSHAP (Lundberg & Lee, 2017); however, all of these are class-agnostic. Attribute-based methods with class-specific characteristics include Contrastive-LRP (CLRP) (Gu et al., 2018) and Softmax-Gradient-LRP (SGLRP) (Iwana et al., 2019); their applicability is constrained by the fact that they visualize the LRP propagation results for a class by contrasting them with the results of all other classes to highlight differences between classes.

The localization task (i.e., weakly supervised object localization (WSOL)) aims to localize the position of an object in an image using bounding boxes. These methods use the gradients and activation maps given an image and its corresponding class. A representative line of work uses the class-specific feature maps of CNN models for localization, such as CAM (Zhou et al., 2016). CAM-based methods (Choe & Shim, 2019; Zhang et al., 2018; Kumar Singh & Jae Lee, 2017) attempted to localize objects using the simple and effective feature maps of CAM but faced limitations in distinguishing small and distinct information due to the local focus of CNN models. These methods struggled to capture the full object information present in images, as CNN models primarily activate local discriminative regions for effective classification.

Various visualization methods have been successfully applied to CNNs. However, they are not optimized for the structural characteristics of vision transformers, which utilize [CLS] tokens in prediction and operate on discrete tokens of input data. In this work, we evaluated explainability visualization performance by comparing against GradCAM, a class-specific method and one of the most effective CNN-oriented approaches for visualizing the explainability of vision transformers. Since our method directly visualizes class-specific explainability without additional contrasting stages, we did not compare its performance with attribute-based, class-specific methods (e.g., CLRP, SGLRP).

2.2 Explainability for Vision Transformers

Existing studies on segmentation for vision transformers focused on attention scores (Abnar & Zuidema, 2020; Dosovitskiy et al., 2020; Caron et al., 2021). However, using information from raw attention limits the full use of the structural characteristics of vision transformers, which include multiple learning modules (Pruthi et al., 2019). Given the nature of a transformer layer, information is continuously mixed across layers, making it difficult to apply explainability visualization that relies only on attention scores to vision transformers (Abnar & Zuidema, 2020). The Rollout method (Abnar & Zuidema, 2020) quantifies how attention information propagates from the input layer to the prediction layer. It assumes that attentions are linearly combined into subsequent contexts. However, it often tends to highlight unrelated tokens.
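The rollout computation can be sketched as follows, assuming the per-layer attention matrices (averaged over heads and including the [CLS] token) have already been extracted from the model; the equal mixing with the identity matrix reflects the linear-combination and skip-connection assumptions mentioned above.

```python
import torch

def attention_rollout(attn_per_layer):
    """Rollout sketch: attn_per_layer is a list of (N+1, N+1) head-averaged
    attention matrices, one per encoder layer, including the [CLS] token."""
    n = attn_per_layer[0].shape[-1]
    rollout = torch.eye(n)
    for attn in attn_per_layer:
        # Add the identity to account for skip connections, then renormalize the rows.
        attn = 0.5 * attn + 0.5 * torch.eye(n)
        attn = attn / attn.sum(dim=-1, keepdim=True)
        rollout = attn @ rollout
    # Relevance of each image patch to [CLS]: first row, excluding [CLS] itself.
    return rollout[0, 1:]
```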

Partial LRP (Voita et al., 2019) visualizes explainability through relevance scores that quantify how much the individual attention heads of a vision transformer-based encoder contribute to the overall performance of a model. However, the relevance score of each attention head does not reflect the propagation from the prediction back to the model input, suggesting that relevance scores are insufficiently reflected in a model’s explainability. Chefer et al. (2021) proposed several methods (e.g., a relevance propagation rule and the integration of propagated information, relevance, and attention scores) and solved issues arising from the structural characteristics of vision transformers, such as the dependence on non-positive values and skip connections propagated during the learning process. Subsequently, RobustViT was developed to visualize explainability by assigning low relevance to the background region of the image and optimizing the relevance map to assign high relevance to the foreground region (Chefer et al., 2022). However, our quantitative and qualitative analyses showed that RobustViT does not adequately highlight the foreground region that determines the classes of images.

On the other hand, applying vision transformers to WSOL has shown the ability to learn overall object information regardless of the distance between patch embeddings, unlike CNN-based models. In the field of WSOL, methods using vision transformers to activate global information have outperformed their CNN counterparts. TS-CAM (Gao et al., 2021), based on a vision transformer, proposes a visualization method that combines semantic-aware patch information and semantic-agnostic attention maps within the encoder, outperforming CNN-based models. However, TS-CAM still relies on applying a threshold to the result of combining score maps and attention scores, which fails to overcome the structural limitations of vision transformers and the inherent constraints of using thresholds.

Although various methods have been proposed to effectively visualize the explainability using information from the prediction layer of a vision transformer to the input layer, previous studies faced challenges because of the structural characteristics of such vision transformers. In this paper, we propose ICEv2, a novel method that visualizes the explainability of a vision transformer by directly predicting classes of foreground and background regions for each image patch embedding. Section 4 reports our comparison of ICEv2’s quantitative performance with that of other explainability visualization methods, and presents the qualitative analysis using visualization examples.

2.3 Explainability of ICEs

Both ICE and ICEv2 (i.e., ICEs) use the rich foreground information contained in patch embeddings to clarify the evidence for the vision transformer’s decisions, thereby effectively providing interpretability and explainability. Interpretability and explainability offer different perspectives for understanding and explaining the process of AI models, with their own unique meanings and scope, but are sometimes used interchangeably (Rigotti et al., 2021).

First, interpretability is the property required when trying to understand the internals of a model (e.g., weights and features). ICEs use patch embeddings within the vision transformer, which focuses on learning the shape of objects (Tuli et al., 2021), to provide an understanding of the processing and representation of information. Through ICEs, it is possible to visualize the foreground and the meaning of the foreground (i.e., the class of each patch) within patch embeddings, providing an intuitive understanding of how the vision transformer recognizes and distinguishes objects within images. This allows for a clear understanding of the learning and representation of the semantic and spatial relationships between patches. Second, explainability is the ability to explain the model’s decision-making process in a way that is understandable to humans. ICEs allow visualization of how the vision transformer trains specific patch embeddings with respect to class, based on the information contained in each patch embedding. By identifying the significance between the input image patches and the output patch embeddings, the reasoning behind the vision transformer’s decisions can be explained. Providing the rationale behind the decisions made by the model can help the users to trust the model’s predictions.

3 Method

In this section, we propose the fine-tuning process of ICEs, which learns to distinguish foreground from background patches by comparing the background and class probabilities of each patch embedding through adversarial normalization. Figure 1 illustrates an overview of ICE. We introduce a background label, and ICE predicts patch embeddings that are unrelated to the image class as the background class. ICE initially assumes that the class of all patch embeddings is background and gradually learns the class of each patch embedding. To reflect the background probability of all patch embeddings, we let all patch embeddings continuously receive gradients of the background label during training. The model learns to separate the background from the foreground by performing adversarial normalization, which accounts for both the background and classification losses.

Fig. 1 The overview of ICE. First, ICE performs patch-wise classification using an output patch embedding of a vision transformer. We set an additional background label, and ICE predicts the total number of [classes + one background class] in patch-wise classification. Next, ICE is trained to distinguish the background and the class region of an image through adversarial normalization. ICE has visualization characteristics that highlight overall explainable regions with high relevance to a class label as much as possible

3.1 Vision Transformers

In this section, we describe the structure of the vision transformer (Dosovitskiy et al., 2020). A vision transformer uses a linear projection to embed an input image \(x\in {\mathbb {R}}^{{H}\times {W}\times {C}}\) into image patches \(V\in {\mathbb {R}}^{k^2\times {d}}\) (\(k^2=N\)). The vision transformer concatenates a learnable patch, [CLS]\(\in {{\mathbb {R}}^d}\), with the image patches. The N+1 patches pass through the transformer encoders, and the [CLS] token learns the global features of the image through the self-attention mechanism. The [CLS] token that has passed through the encoders is fed into the classification head to obtain the final classification result.

$$\begin{aligned} {V} = \left[ {v_1; v_2;...; v_N}\right] \end{aligned}$$
(1)
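As a minimal sketch of this embedding step (sizes chosen to resemble DeiT-S; the convolution-as-linear-projection implementation is a common choice and an assumption here, not necessarily the exact code of the models we use):

```python
import torch
import torch.nn as nn

H = W = 224; C = 3; patch = 16; d = 384              # DeiT-S-like sizes (assumed for illustration)
N = (H // patch) * (W // patch)                      # k^2 = N = 196 patches

proj = nn.Conv2d(C, d, kernel_size=patch, stride=patch)   # linear projection of each 16x16 patch
cls_token = nn.Parameter(torch.zeros(1, 1, d))             # learnable [CLS] token

x = torch.randn(1, C, H, W)                          # input image x in R^{H x W x C}
V = proj(x).flatten(2).transpose(1, 2)               # (1, N, d): patch embeddings v_1, ..., v_N (Eq. 1)
tokens = torch.cat([cls_token, V], dim=1)            # (1, N+1, d): fed into the transformer encoders
```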

3.2 Patch-Wise Classification

ICE visualizes an image by training each patch embedding to produce prediction results for the classes and the background. Since not all patches of an image are related to the classes, patch embeddings with low class relevance should be predicted as background. Therefore, ICE predicts each patch embedding among \(c+1\) classes, including the background, and guides patch embeddings unrelated to the class to be predicted as background. We constructed our model by adding a classification head to the structure of a standard vision transformer without using the [CLS] token.

The number of classes is c, and a background class is added, for a total of \(c+1\) classes. Each d-dimensional patch embedding is then transformed into \(c+1\) dimensions through an MLP, so that each patch in V predicts over the \(c+1\) classes.

$$\begin{aligned} {Z = MLP(V)} \end{aligned}$$
(2)

where \(Z\in {\mathbb {R}}^{N\times {(c+1)}}\).
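A minimal sketch of Eq. 2, assuming DeiT-S-like dimensions and the single linear layer used as the ICE head:

```python
import torch
import torch.nn as nn

c, d, N = 1000, 384, 196                  # ImageNet classes and DeiT-S-like sizes (assumed)
mlp = nn.Linear(d, c + 1)                 # single linear layer, as in the original ICE head

V = torch.randn(1, N, d)                  # output patch embeddings of the encoder, [CLS] excluded
Z = mlp(V)                                # (1, N, c+1): per-patch logits over the classes + background (Eq. 2)
```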

3.3 Adversarial Normalization

We introduce adversarial normalization, which distinguishes between a class and the background. ICE softmax-normalizes the prediction probability of each patch embedding while adversarially reflecting the background probability in the cross-entropy loss. Adversarial normalization gives all patches prediction results and, at the same time, reflects the probability of being background, leading unnecessary patch embeddings to be trained as background. Adversarial normalization consists of two phases: all patch embeddings are considered background (background phase), and some patch embeddings are identified as class-related by comparing the probabilities between a class and the background (class phase).

3.3.1 Background Phase

In the background phase, all patches may be considered background at the beginning of training, but our assumption is that, by simultaneously applying two distinct losses (i.e., \(L_{class}\) and \(L_{bg}\)), foreground patches can be distinguished as training progresses. We design \(L_{bg}\) as the cross entropy between a background label and the average of the \(c+1\) probabilities over the patches. The background label is a one-hot encoded vector in which only the index representing the background class (the \((c+1)\)-th index) is set to 1, and all others are set to 0. The background label, which is used as the ground truth for the background class, is specifically created for use during background training. The background loss continuously propagates gradients to reflect the possibility that every patch embedding is background.

In Sect. 3.2, each of the N patches is fed into an MLP, resulting in a distribution of class probabilities with \(c+1\) dimensions. The collection of these distributions forms the matrix Z defined in Eq. 3, where the \(j\)-th column denotes the class probability distribution of the \(j\)-th patch. We compute the average of the probabilities across all patches for each class index (i.e., \(p_i\)). Subsequently, we apply a softmax function with a temperature parameter to the averaged probabilities to obtain \(\hat{B_i}\). We use the temperature parameter \(\tau \) in the softmax function to maintain a sufficient degree of loss, considering that there are generally more patch embeddings in the background region than in the foreground. Since we experimentally found that the background loss becomes significantly small, we employed \(\tau \) to continuously propagate the probability that any patch embedding could be background and to keep the background loss function effective. We experimentally found the optimal \(\tau \) to be 0.5. We calculate the background loss as the cross entropy between \({\hat{B}}\) and a background label B.

$$\begin{aligned} Z = \begin{bmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,N} \\ z_{2,1} & z_{2,2} & \cdots & z_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ z_{c+1,1} & z_{c+1,2} & \cdots & z_{c+1,N} \end{bmatrix} \end{aligned}$$
(3)
$$\begin{aligned} p_i = \frac{1}{N} \sum _{j=1}^{N} z_{ij}, \quad p \in {\mathbb {R}}^{1\times (c+1)} \end{aligned}$$
(4)
$$\begin{aligned} \hat{B_i} = softmax(p_i \times \tau ) = \frac{e^{p_i \times \tau }}{\sum _{j=1}^{c+1} e^{p_j \times \tau }} \end{aligned}$$
(5)
$$\begin{aligned} L_{bg} = CrossEntropy(B, \hat{B}) \end{aligned}$$
(6)
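A sketch of the background phase (Eqs. 4–6) might look as follows; the tensor layout and the log-space cross entropy are implementation assumptions and may differ from the released code.

```python
import torch
import torch.nn.functional as F

def background_loss(Z, tau=0.5):
    """Sketch of Eqs. 4-6. Z: (N, c+1) per-patch predictions; tau: softmax temperature."""
    p = Z.mean(dim=0)                             # Eq. 4: average over the N patches, shape (c+1,)
    log_B_hat = F.log_softmax(p * tau, dim=-1)    # Eq. 5: temperature-scaled softmax (in log space)
    B = torch.zeros_like(p)
    B[-1] = 1.0                                   # one-hot background label (last index = background)
    return -(B * log_B_hat).sum()                 # Eq. 6: cross entropy against the background label
```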
Fig. 2 An overview of ICEv2. In ICEv2, among the N layers of the vision transformer, L layers are frozen, and fine-tuning is performed on the remaining \(N-L\) layers. Patches passed through the encoder of the vision transformer perform patch-wise classification and adversarial normalization. Unlike ICE, which employs a single linear MLP layer for patch-wise classification, ICEv2 uses two MLP layers, including a wider MLP, to conduct patch-wise classification

3.3.2 Class Phase

We select the parts of Z associated with the class by constructing a binary decision mask \(\tilde{D}\in \{0,1\}^N\). The mask value of a patch embedding whose highest prediction is the background class is set to 0; otherwise, it is set to 1. Since we randomly initialized the parameters of the MLP layers, the initial binary decision mask was also determined randomly. Most decision mask values were 0 at the beginning of training, but the number of class patch embeddings gradually increased as training continued.

We assume that non-background patch embeddings are associated with classes. We thus average the corresponding columns of Z into one vector \(\hat{Y}\). Since \(\hat{Y}\) has \(c+1\) dimensions and the one-hot encoded class label Y has c dimensions, we extend Y to \(c+1\) dimensions by appending a 0 for the background class at the end of Y in one-hot form. Excluding the less associated patch embeddings, the average of the remaining patch embeddings is computed as one vector \(\hat{Y}\in {\mathbb {R}}^{1\times (c+1)}\), and \(L_{class}\) is calculated as the cross entropy between \(Y\in {\mathbb {R}}^{1\times (c+1)}\) and \(\hat{Y}\in {\mathbb {R}}^{1\times (c+1)}\).

$$\begin{aligned} \hat{Y_i} = \frac{\sum _{j=1}^{N}\tilde{D}_{j} z_{ij}}{\sum _{j=1}^{N}\tilde{D}_{j}} \end{aligned}$$
(7)
$$\begin{aligned} L_{class} = CrossEntropy(Y, \hat{Y}) \end{aligned}$$
(8)

We can get the final loss \(L_{total}\) by adding \(L_{class}\) and \(L_{bg}\) multiplied by background weight, \(\lambda _{bg}\).

$$\begin{aligned} L_{total} = L_{class} + L_{bg} \times \lambda _{bg} \end{aligned}$$
(9)
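A sketch of the class phase and the total loss (Eqs. 7–9) is shown below; the softmax normalization of \(\hat{Y}\) inside the cross entropy and the handling of the all-background corner case are assumptions on our part.

```python
import torch
import torch.nn.functional as F

def class_and_total_loss(Z, y, l_bg, lambda_bg=2.5e-2):
    """Sketch of Eqs. 7-9. Z: (N, c+1) per-patch predictions; y: image class index; l_bg: background loss."""
    bg = Z.shape[-1] - 1
    D = (Z.argmax(dim=-1) != bg).float()                     # decision mask: 0 where background wins
    if D.sum() == 0:                                         # degenerate early-training case (assumption)
        D = torch.ones_like(D)
    Y_hat = (D[:, None] * Z).sum(dim=0) / D.sum()            # Eq. 7: average of class-related patches
    Y = F.one_hot(torch.tensor(y), num_classes=bg + 1).float()   # class label padded with a 0 background slot
    l_class = -(Y * F.log_softmax(Y_hat, dim=-1)).sum()      # Eq. 8: cross entropy (softmax-normalized here)
    return l_class + lambda_bg * l_bg                        # Eq. 9: total loss
```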

3.4 ICEv2

In this paper, we introduce ICEv2 that addresses the limitations of existing visualization methods, including ICE. Figure 2 illustrates the overview of ICEv2. ICEv2 is an extended visualization method based on ICE, showing efficiency, high performance, and scalability with four main differences.

Table 1 Implementation details of ICEv2

First, while previous research fine-tuned all the weights of the model, ICEv2 fine-tunes only the last three encoder layers. Through our experiments, we verified that ICEv2 achieves high performance (Table 9) by fine-tuning only these encoder layers. Second, we redesigned the MLP layer of ICE. Instead of training more encoder layers, we found that expanding the MLP used for patch-wise classification while fine-tuning only three encoder layers maximized the performance of ICEv2. While the MLP in the original ICE consisted of a single linear layer, ICEv2 includes two wider MLP layers with LayerNorm. The first MLP layer in patch-wise classification is a linear layer that expands the output patch embeddings to four times the embedding dimension, and the second layer projects them to the \(classes+background\) dimension, with LayerNorm applied between the two layers. Third, during testing, we applied clustering from scipy (Pedregosa et al., 2011) to the results predicted by ICEv2, using the largest cluster among the connected components. We found that the ability of ICE to distinguish between foreground and background was degraded by small foreground objects surrounding the main object; to address this, we applied clustering to select the largest object. Lastly, ICEv2 demonstrates stable training across various model sizes with similar hyperparameter settings and without the need for threshold exploration. While ICE was sensitive to hyperparameters and was only validated on small models, ICEv2 trains stably across various model sizes with the proposed hyperparameter settings (i.e., SGD optimizer, learning rate=\(1e^{-3}\), weight decay=\(5e^{-2}\), temperature parameter \(\tau \)=\(7.5e^{-2}\)).
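A sketch of the first two changes, together with the stated optimizer settings, is given below; the timm model name and attribute names are assumptions about the backbone implementation, not the authors' released code.

```python
import torch
import torch.nn as nn
import timm  # assumed backbone source; any DeiT-S implementation exposing its encoder blocks would do

num_classes = 1000
backbone = timm.create_model("deit_small_patch16_224", pretrained=True)
d = backbone.embed_dim                                 # 384 for DeiT-S

for p in backbone.parameters():                        # freeze the whole backbone ...
    p.requires_grad = False
for blk in backbone.blocks[-3:]:                       # ... then unfreeze only the last three encoder layers
    for p in blk.parameters():
        p.requires_grad = True

# ICEv2 patch-wise head: widen to 4x the embedding dimension, LayerNorm, then project to classes + background.
patch_head = nn.Sequential(
    nn.Linear(d, 4 * d),
    nn.LayerNorm(4 * d),
    nn.Linear(4 * d, num_classes + 1),
)

optimizer = torch.optim.SGD(
    [p for p in backbone.parameters() if p.requires_grad] + list(patch_head.parameters()),
    lr=1e-3, weight_decay=5e-2,
)
```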

Table 2 Segmentation performance on the ImageNet-Segmentation (Guillaumin et al., 2014) dataset
Table 3 Extensive segmentation performance evaluation using the ImageNet-Segmentation dataset

4 Experiments

4.1 Experimental Setting

4.1.1 Datasets

We trained ICE using the ImageNet (Russakovsky et al., 2015) (ILSVRC) 2012 and CUB-200-2011 (Wah et al., 2011) training datasets, and for ICEv2, we only used the ImageNet (ILSVRC) 2012 training dataset. We did not use any additional data related to segmentation maps or object locations. To evaluate the explainability visualization performance of ICEs, we used seven datasets: ImageNet-Segmentation (Guillaumin et al., 2014), CUB-200-2011 (Wah et al., 2011), ECSSD (Shi et al., 2015), DUTS (Wang et al., 2017), DUT-OMRON (Yang et al., 2013), and Pascal VOC 07/12 (Everingham et al., 2010, 2015). ImageNet-Segmentation, created from part of ImageNet (ILSVRC) 2012 to evaluate segmentation, contains 4,276 images covering 445 classes. We measured segmentation performance using pixel-wise accuracy and mean IoU, and localization performance as in previous studies. The Pascal VOC 2012 validation set consists of 1,449 images with 20 classes and one or more objects in each image. We measured semantic layout performance using the Jaccard index, which measures the similarity between two sets by dividing the intersection of the model output and the ground truth by the size of their union. The ECSSD dataset contains 1,000 real images, and the DUTS dataset contains 5,019 test images collected from the ImageNet and SUN datasets. The DUT-OMRON dataset contains 5,168 high-quality natural images.
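For reference, the Jaccard index used for the semantic layout evaluation reduces to the following computation on binary foreground masks (a minimal sketch, not the evaluation script used in the paper).

```python
import torch

def jaccard_index(pred_fg, gt_fg):
    """Jaccard index between two boolean foreground masks of the same shape."""
    intersection = (pred_fg & gt_fg).sum().float()
    union = (pred_fg | gt_fg).sum().float()
    return (intersection / union.clamp(min=1)).item()
```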

Fig. 3 Semantic segmentation maps highlighted by image classes and background predictions for each patch in the images. These maps show that ICEv2 visualizes the explainability of DeiT-S (Touvron et al., 2021), leading to unsupervised foreground and background segmentation

Fig. 4 Results of explainability visualization of the images with the same class by ICEv2

Fig. 5 Results of explainability visualization for the presence of two different class objects in the image by ICEv2

4.1.2 Implementation Details

For the training strategies and optimization methods of ICE, we employed the pre-trained DeiT-S and set the hyperparameters as follows: temperature parameter \(\tau \)=0.5, background weight \(\lambda _{bg}\)=\(2.5e^{-2}\), learning rate=\(1e^{-5}\), and batch size=256. We followed the other hyperparameters specified in the official DeiT repository. We set the background weight \(\lambda _{bg}\)=\(1e^{-2}\) in order to train ICE-f, a variant that freezes the parameters of DeiT-S and trains only the two MLP layers. Our intention in considering ICE-f was to verify that the output patch embeddings for each input image patch in a vision transformer retain the image information of each patch location, which can facilitate the prediction of an image class. The other hyperparameters were the same as those applied to ICE. Table 1 shows the implementation details of ICEv2. Note that while AR-L consists of a total of 24 layers, ICEv2 employs 18 layers and fine-tunes 3 of these layers.

To visualize the explainability of the other methods, we used the official repositories of Chefer et al. (2021) and Chefer et al. (2022). We ran our experiments on a machine equipped with two NVIDIA RTX3090 GPUs.

4.2 Explainability on ImageNet

To verify the effectiveness of the explainability visualization of ICEs, we measured quantitative performance and analyzed visualization examples compared with existing explainability visualization methodologies on ImageNet-Segmentation.

4.2.1 Quantitative Analysis

Table 2 shows that ICEs outperformed all explainability visualization methods (all methods used the DeiT-S model). ICE and ICEv2 performed 4.05% and 5.50% better on pixel-wise accuracy and 3.94% and 7.16% better on mean IoU, respectively, compared to RobustViT, the state-of-the-art visualization method. The explainability visualization of ICE-f showed segmentation performance comparable to RobustViT without modifying the original model. ICEv2 performed 1.45% better on pixel-wise accuracy and 3.22% better on mean IoU than ICE while minimizing changes in model weights by training 75% fewer encoder layers than ICE. We experimentally found that, by maintaining a significant portion of the original model and only fine-tuning some of the later layers, we can still achieve high accuracy in predicting classes in a patch-wise manner. These results demonstrate that the output patch embeddings for each image patch in a vision transformer-based model sufficiently preserve the information in the original image and can be effectively used to visualize the explainability of the model.

We also evaluated the performance improvement of explainability visualization using ICE on the models fine-tuned by RobustViT, the state-of-the-art visualization method. Table 3 demonstrates that ICE can improve explainability visualization even when applied to models fine-tuned by RobustViT.

Table 4 The Jaccard similarity between the ground truth and predicted foreground on the Pascal VOC 12 validation set (Everingham et al., 2015).

4.2.2 Qualitative Analysis

Figure 3 shows examples of semantic segmentation maps highlighted by ICEv2. ICEv2 visualized the explainability of DeiT-S (Touvron et al., 2021) by separating foreground and background regions despite the presence of multiple objects of different sizes and classes. Figure 4 shows that other methods tend to focus on only a small portion of the image or unmatched areas in the examples of multiple object detection. On the other hand, ICEs adequately distinguish between foreground and background regions. We found that ICEs can highlight object areas, regardless of object size.

Fig. 6 Sampled segmentation maps. The input image in the first row is the one presented in the previous research (Naseer et al., 2021), and the input images in the second and third rows are from Pascal VOC 12

Figure 5 illustrates a case where objects of two different classes exist in one image. The first and second rows show the explainability visualization results for each class. The Rollout and Partial LRP methods show class-agnostic characteristics, highlighting the same region regardless of the predicted class of the model, while GradCAM, RobustViT, and ICEs show class-specific characteristics. Overall, ICEs highlight the area of the objects in the image. ICE-f also shows results comparable to existing visualization methods. However, when it was necessary to distinguish fine-grained characteristics (e.g., tusker, African elephant, Indian elephant), ICEs still showed a tendency to predict patches as a single class, which also occurred with other models.

4.3 Discovering Semantic Layouts

4.3.1 Quantitative Analysis

To verify the effectiveness of ICEs in selecting background patches from images containing new classes, we evaluated the performance of background and foreground separation on Pascal VOC 12, which contains classes not learned by our DeiT-S model with ICEs. Table 4 shows that ICE outperforms both DeiT-S-Raw-Attention and DeiT-S-SIN (Naseer et al., 2021) on the Jaccard similarity index, achieving results that are higher by 18.15 and 7.02, respectively. Furthermore, ICEv2 outperforms them by an even greater margin, achieving results that are 20.23 and 9.10 higher than DeiT-S-Raw-Attention and DeiT-S-SIN, respectively. DeiT-S-Raw-Attention serves as the baseline model. DeiT-S-SIN uses a shape distillation token in DeiT-S and employs Resnet50-SIN (Geirhos et al., 2018), trained on the SIN dataset with strong shape characteristics. However, we achieved better performance in foreground and background separation by applying the ICE and ICEv2 methods to the same DeiT-S without additional datasets or models.

As shown in Table 4, ICE outperforms DINO (Caron et al., 2021), the standard self-supervision method used in prior research. Moreover, ICEv2 exceeds DINO by an even larger margin, achieving 3.10% higher performance. Other methods have shown experimental tendencies where performance varies with key hyperparameters (i.e., threshold and output head type). ICEs do not have such constraints, implying that they can be adapted to vision transformers more flexibly and easily.

4.3.2 Qualitative Analysis

Figure 6 shows samples visualizing semantic layouts on the Pascal VOC 12 validation set. Compared to the visualization by applying raw attention to the original DeiT-S, ICEs significantly improved the performance of semantic map segmentation by separating background and foreground regions. Furthermore, we found a tendency that our method distinguishes background relatively well compared to self-supervised learning methods.

4.4 Single Object Discovery

We conducted experiments on weakly-supervised single-object localization and unsupervised object discovery to evaluate the performance of explainability visualization and compare it to other state-of-the-art methods (Tables 5 and 6).

Table 5 Comparisons with other methods for weakly-supervised single object localization of ICE  
Table 6 Comparisons with other methods for unsupervised object discovery  

4.4.1 Weakly-Supervised Single Object Localization

We evaluated the performance of weakly-supervised single-object localization on CUB-200-2011 using the Top-1 Cls, GT Loc, and Top-1 Loc metrics. Top-1 Cls refers to the top-1 classification accuracy. GT Loc is considered correct if the intersection over union (IoU) between a predicted bounding box and the ground truth bounding box exceeds 0.5. Top-1 Loc is counted as correct if both the Top-1 Cls and GT Loc are correct. We conducted comparative experiments focusing on models that use a pure vision transformer structure without any modifications to the encoder architecture. On CUB-200-2011, ICE outperformed other methods on GT Loc and achieved comparable performance on Top-1 Loc. We note the relatively high performance of ICE on Top-1 Loc, even though its Top-1 Cls result was the lowest. This implies that, once an object in an image is correctly classified, ICE does a good job of locating it. Note that, to evaluate ICEv2 on CUB, it is necessary to apply ICEv2 to a pre-trained standard baseline model and then fine-tune it. However, due to the unavailability of such baseline models, there were limitations in conducting this performance evaluation.

Fig. 7 Unsupervised semantic segmentation maps by ICEv2. Images are taken from a previous generative AI model (Li et al., 2024)

Fig. 8 Convergence of patch ratios

Table 7 Comparisons on ICE and EViT (Liang et al., 2021). For fair comparisons, all models are initialized with a pre-trained DeiT-S and trained 30 epochs using 2 GPUs, and the throughput (img/s) is measured on the same machine with the same setting using a maximum batch size

4.4.2 Unsupervised Object Discovery

We evaluated unsupervised object discovery on the Pascal VOC 07/12, ECSSD, DUTS, and DUT-OMRON datasets. Our intention here is to highlight the potential of ICEs for open foreground object discovery from classification-based backbones compared to methods based on self-supervised learning (i.e., DINO). On Pascal VOC 07/12, ICE achieved comparable (although lower) performance on CorLoc compared to the state-of-the-art methods. CorLoc counts a prediction as correct when the intersection over union between a predicted bounding box and the ground truth bounding box is greater than 0.5. Following ICE, which demonstrated a slight performance improvement over Lost, ICEv2 achieved performance closer to Tokencut than to Lost. On the ECSSD, DUTS, and DUT-OMRON datasets, we used IoU, Acc, and \(maxF_\beta \) following previous works (Siméoni et al., 2021; Wang et al., 2022). IoU is the intersection over union between the prediction and the ground truth. Acc measures the pixel-wise ratio between the object and the background. The \(maxF_\beta \) score is calculated using the formula \(F_{\beta } = \frac{(1+\beta ^2) \cdot \text {Precision} \cdot \text {Recall}}{\beta ^2 \cdot \text {Precision} + \text {Recall}}\). In contrast to ICE, which showed performance similar to Lost, ICEv2 demonstrated superior performance across the majority of metrics compared to both Lost and ICE. ICEv2, which uses only the 1,000 classes from ImageNet to distinguish between foreground and background, presents potential applicability for performance evaluation on real-world images containing objects not encountered during training. Overall, we have verified the explainability visualization performance of ICEv2 from various perspectives by showing the superiority of its foreground segmentation performance.
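A sketch of the \(maxF_\beta \) computation is given below; the threshold sweep over a [0, 1] score map and \(\beta ^2=0.3\) follow the common saliency-evaluation convention and are assumptions here, not a description of the exact evaluation code.

```python
import torch

def max_f_beta(score_map, gt_mask, beta2=0.3, steps=255):
    """Sketch of maxF_beta: sweep binarization thresholds over a score map scaled to [0, 1]."""
    best = 0.0
    for t in torch.linspace(0.0, 1.0, steps):
        pred = score_map >= t
        tp = (pred & gt_mask).sum().float()
        precision = tp / pred.sum().clamp(min=1)
        recall = tp / gt_mask.sum().clamp(min=1)
        f_beta = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f_beta.item())
    return best
```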

4.4.3 Qualitative Analysis

We also tested ICEv2 on the images shown in a previous generative AI paper (Li et al., 2024), as presented in Fig. 7. We verified ICEv2’s unsupervised ability to distinguish between foreground and background in images containing untrained objects. In images that differ significantly from the ImageNet training dataset and are rendered with visual effects (e.g., animation, cyberpunk, fire, and oil painting), ICEv2 prominently highlights the foreground areas where objects are likely to be present. This demonstrates the effectiveness of ICEv2 in detecting and highlighting potential object locations in visually unconventional scenes.

4.5 Efficiency Improvement

To examine another key aspect of the effectiveness of background patch selection by ICE, we evaluated accuracy and efficiency on ImageNet by applying ICE’s background patch selection within the encoder of the original DeiT-S and compared the results with DeiT (baseline) and EViT (Liang et al., 2021), a state-of-the-art model. Background patch selection is one of the key requirements in EViT; thus, by comparing the accuracy and efficiency of ICE and EViT, we can verify the role of ICE in visualizing the explainability of a vision transformer. We trained ICE in the same environment as EViT by referring to EViT’s official repository code and set ICE to maintain the keep rates after the 4th, 7th, and 10th layers in the pre-trained DeiT-S.

Table 7 shows our experimental results. By applying ICE to DeiT-S, inference throughput improved significantly by 44.01% while maintaining comparable accuracy (only a 0.46% decrease compared to the original DeiT-S). ICE showed 0.1% higher accuracy than EViT under the same keep rate condition. This result means that the patch selection of ICE can improve classification performance more than the patch selection of EViT. However, the throughput of ICE is 8.7% lower than that of EViT under the same keep rate condition, which may be because ICE trains additional layers.

Table 8 Ablation study on the instance normalization (Ulyanov et al., 2016) and \(\lambda _{bg}\) of ICE
Fig. 9 Ablation study on \(\lambda _{bg}\) and softmax temperature of ICEv2

4.6 Analysis and Ablation Study

4.6.1 Convergence of Patch Ratio

To further explain the learning process of ICEs, we visualized how the ratio of foreground to background patches varies over the learning iterations (Fig. 8). For both the ImageNet-Segmentation and CUB datasets, the ratios of ICEs converge to around 0.3, which is reasonable given that the mean foreground ratios over the whole images of the two datasets are 0.29 and 0.34, respectively. ICE was trained for 5 epochs (1 epoch = 2,500 iterations) on ImageNet and for 60 epochs (1 epoch = 60 iterations) on CUB, achieving the best performance between 1 and 3 epochs and between 25 and 30 epochs, respectively. ICEv2 was trained for 15 epochs (1 epoch = 1,250 iterations), achieving the best performance between 9 and 12 epochs.
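The ratio plotted in Fig. 8 can be computed from the per-patch predictions as in this small sketch, assuming (as above) that the last index is the background class.

```python
import torch

def foreground_patch_ratio(Z):
    """Fraction of patches not predicted as background -- the ratio tracked in Fig. 8."""
    background_idx = Z.shape[-1] - 1
    return (Z.argmax(dim=-1) != background_idx).float().mean().item()
```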

4.6.2 Effectiveness of Instance Normalization

In the original ICE paper (Choi et al., 2023), we considered that the sum of the prediction probabilities for each patch embedding varies, and applied instance normalization (Ulyanov et al., 2016) to the vectors transformed by the MLP during the patch-wise classification process. To evaluate the impact of the presence or absence of instance normalization on visualization performance, we conducted an ablation study varying the use of instance normalization in relation to the \(\lambda _{bg}\) values. As shown in Table 8, we observed that removing instance normalization resulted in improved performance, particularly at \(\lambda _{bg}\) values of 0.02, 0.025, and 0.03, where the explainability visualization performance was superior. Given the improved performance of ICE without instance normalization, we decided not to apply instance normalization in the patch-wise classification process of ICEv2.

4.6.3 Number of Encoder Layers

We conducted an ablation study to determine the optimal number of training layers in ICEv2 (Table 9). We found that ICEv2 shows peak performance when trained with three encoder layers, with performance degradation observed when the number of encoder layers is increased or decreased from this optimal number. Training more encoder layers resulted in lower explainability performance and increased computational cost of training. Notably, ICEv2 demonstrated a pattern similar to previous research findings, where training a subset of components (i.e., two MLP layers of patch-wise classification) while keeping the rest fixed is advantageous in preventing catastrophic forgetting, especially when new components are introduced (Alayrac et al., 2022; Tsimpoukelli et al., 2021; Mokady et al., 2021; Luo et al., 2022; Eichenberg et al., 2022).

4.6.4 \(\lambda _{bg}\) and Softmax Temperature

We conducted an ablation study on the core hyperparameters of ICEv2, namely \(\lambda _{bg}\) and the softmax temperature (Fig. 9). Our results show that \(\lambda _{bg}\) yields superior visualization performance at values above 0.7, peaking at 0.75. The importance of \(\lambda _{bg}\) is underscored by the gradual decline in performance when this parameter is not applied or when its value exceeds 0.75. In addition, as shown in Fig. 9b, visualization performance improves with softmax temperature values above 0.45, reaching its optimum at 0.5. Although setting the temperature to 1.0 also produced high performance, a softmax temperature of 0.5 yielded the highest. Hence, we set \(\lambda _{bg}\) to 0.75 and the softmax temperature to 0.5. This result shows the importance of selecting an appropriate \(\lambda _{bg}\) and softmax temperature to improve the explainability of ICEv2. We acknowledge that our study is limited to specific datasets and point out the need for further research to explore the applicability of these hyperparameters in broader contexts. Future work should also investigate the potential impact of other hyperparameters of ICEv2 to provide a more comprehensive understanding.

Table 9 Ablation study on the number of training encoder layers of ICEv2. ICEv2 shows peak performance when trained with three encoder layers, with any variation causing performance degradation
Fig. 10 Effect of patch resolution. The number of patches in the image is determined by the resolution of the input image

4.6.5 Patch Resolution

ICEv2 provides visualization through patch-wise prediction on the 16\(\times \)16 patches of the vision transformer. Thus, the number of patches in a single image is determined by the resolution of the input image. We used a 16\(\times \)16 patch size following the representative vision transformer architectures (Dosovitskiy et al., 2020; Steiner et al., 2021; Touvron et al., 2021). The patch size is fixed during the pre-training and cannot be changed. However, the number of output patches (i.e., the patch resolution of the explanation) can be modified by changing the number of pixels in the input image through interpolation. If the number of patches is increased, the size of each patch decreases, reducing the amount of information each patch contains. Figure 10 shows how the visualization performance varies with the number of patches in the visualization applied to DeiT-S using ICEv2. The resolution of the input images used in the experiments can be determined by multiplying the number of patches by the patch size, which is 16. ICEv2 consistently shows high visualization performance at the commonly used resolutions of 224\(\times \)224 (i.e., 14\(\times \)14 patches) and 512\(\times \)512 (i.e., 32\(\times \)32 patches), with the highest performance observed at a resolution of 352\(\times \)352 (i.e., 22\(\times \)22 patches). Using fewer patches makes detailed segmentation of the foreground difficult, resulting in lower visualization performance. On the other hand, using more patches up to a certain point can maintain performance, but exceeding that point may reduce the amount of information each patch contains, leading to decreased performance. Note that for a fair comparison of ImageNet-Segmentation performance as shown in Table 2, we have reported the results using the same resolution of 224\(\times \)224 (i.e., 14\(\times \)14 patches) as used in previous studies (Chefer et al., 2021; Chefer et al., 2022).
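A small sketch of how the input resolution determines the patch grid is shown below (using the fixed 16\(\times \)16 patch size); note that changing the resolution typically also requires interpolating the backbone’s positional embeddings to the new grid, which common vision transformer implementations support.

```python
import torch
import torch.nn.functional as F

patch_size = 16
image = torch.randn(1, 3, 300, 300)                       # an arbitrary input image (placeholder)
for side in (224, 352, 512):                              # resolutions discussed above
    resized = F.interpolate(image, size=(side, side), mode="bilinear", align_corners=False)
    k = side // patch_size
    print(f"{side}x{side} -> {k}x{k} = {k * k} patches")  # 14x14=196, 22x22=484, 32x32=1024
```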

5 Discussion

ICEs show superior explainability visualization performance compared to other explainability visualization methods in the case where multiple objects of a single class exist in different sizes within an image, as shown in Fig. 4. Our methods adequately predict the learned image classes or background classes for all patch locations in the image. Furthermore, our methods separate the background region and highlight the region that determines the class of the image as much as possible, regardless of the size of the object in the image.

We expect our methodology to be useful for visualizing the explainability of image classification models when multiple objects of different classes and sizes exist in an image. For example, in tasks related to medical image classification (Shamshad et al., 2022), defect classification (Haurum et al., 2021), and fashion style classification (Jeon et al., 2021), key objects of various sizes may exist in a target image. In these examples, it may be necessary to visualize the regions that determine image classes as much as possible when the end-user (e.g., a domain expert) of the classification model needs to check the visualized explainability of the model. Since ICEs visualize the explainability of all image regions that have features of an image class, they can be useful and well applicable to many domains.

6 Conclusion

In this paper, we proposed ICEv2, an explainability visualization method that demonstrates efficiency, high performance, robustness, and scalability. ICEv2 achieves state-of-the-art performance while training 75% fewer encoder layers, thereby preserving the characteristics of the original model as much as possible. This indicates that ICEv2 enables explainability visualization with significantly lower computational complexity than traditional fine-tuning visualization methods. We demonstrated the effectiveness and superiority of ICEv2 in the visualization of class-specific explainability and the separation of a background region from a foreground region through quantitative and qualitative analyses. We showed that the output patch embedding for each image patch preserves sufficient image information at each patch location. We also presented the scalability of ICEv2 to larger and various sizes of vision transformer models. By analyzing the hyperparameters that significantly affect ICEv2, we confirmed that ICEv2 training is feasible for most models. Since ICEv2 does not use information propagated from the prediction layer to the input layers, it is rarely affected by penalties derived from the structural properties of a vision transformer. Based on these results, we expect that ICEv2 can be employed by vision transformers of various structures for explainability visualization.