1 Introduction

Neural Architecture Search (NAS) (Elsken et al., 2019; White et al., 2023) seeks to automate the process of designing high-performing neural networks. The field has gained immense popularity in recent years, growing from only a few NAS papers in 2016 to almost 700 in 2022 (White et al., 2023). The ability of NAS to find high-performing architectures capable of outperforming hand-designed ones, notably on image classification (Zoph and Le, 2017), makes this research field highly valuable. However, classical approaches like reinforcement learning (Zoph and Le, 2017; Li et al., 2018) or evolutionary algorithms (Real et al., 2019) are expensive, which led to a shift in focus towards improving search efficiency. Different approaches such as one-shot methods (Liu et al., 2019) or performance prediction methods (White et al., 2021b) were therefore introduced. However, these methods run into issues when they need to be adapted to larger search spaces and to different and larger tasks. Therefore, zero-cost proxies (ZCPs) (Mellor et al., 2021) were recently developed as part of performance prediction strategies. These proxies build upon fast computations, mostly a single forward and backward pass on an untrained model, and attempt to predict the accuracy that the underlying architecture will reach after training; they overcome the issues of traditional prediction methods by allowing for an easy transfer to different search spaces and different tasks.

In recent years, several zero-cost proxies and network features were introduced (Mellor et al., 2021; Abdelfattah et al., 2021; Roberts et al., 2021; Kadlecová et al., 2024), including simple architectural baselines such as FLOPS or the number of parameters. NAS-Bench-Suite-Zero (Krishnakumar et al., 2022) provides a more in-depth analysis and evaluates 13 different ZCPs on 28 different tasks, demonstrating their effectiveness as a performance prediction technique.

So far, the main focus of NAS research has been the resulting performance of the architectures on a downstream task. Another important aspect of networks, namely their robustness, has received less attention in NAS so far. Most works targeting both high accuracy and a robust network rely on one-shot methods (Hosseini et al., 2021; Mok et al., 2021); using only ZCPs as a performance prediction model for multi-objective tasks has not been addressed. The search for architectures that are robust against adversarial attacks is especially important for computer vision and image classification, since networks can easily be fooled when the input data is changed using slight perturbations that are even invisible to the human eye. This can lead to networks making false predictions with high confidence.

This aspect is particularly important in the context of NAS and ZCPs because the search for robust architectures is significantly more expensive: the architecture’s robustness is evaluated against different adversarial attacks after the architecture is trained (Goodfellow et al., 2015; Kurakin et al., 2017; Croce and Hein, 2020).

In this paper, we therefore address the question: How transferable are the accuracy-focused ZCPs to the architecture’s robustness? A high-performing architecture is not necessarily robust. Therefore, we analyze which ZCPs perform well for predicting clean accuracy, which are good at predicting robustness, and which do well at both.

For our evaluation, we leverage the recently published robustness datasets (Jung et al., 2023; Wu et al., 2024), which allow for easily accessible robustness evaluations on an established NAS search space (Dong and Yang, 2020), for both clean and adversarially trained networks. Since every ZCP provides a low-dimensional (scalar) measure per architecture, we treat each ZCP as one feature dimension in a concatenated feature vector and employ random forest regression as a predictor of clean and robust accuracy. This facilitates not only the evaluation of the performance or correlation of the different measures with the prediction target, but also grants direct access to every proxy's feature importance relative to all others.

Our evaluation of the most popular ZCPs allows us to make the following observations:

  • While the correlation of every single ZCP with the target is not very strong, the random forest regression allows predicting the clean accuracy with very high and the robust accuracy with good precision.

  • When analyzing the feature importance, ZCPs using Jacobian-based information generally carry the information the prediction model relies on most.

  • The feature importance distribution shows that the clean accuracy can be predicted from one or a few ZCPs, while regressors trained to predict the robust accuracy tend to rely on all available information.

2 Related Work

2.1 Zero-Cost Proxies for NAS

NAS automates the design of neural architectures with the goal of finding a high-performing architecture for a particular dataset. In the last few years, this research field has gained immense popularity, and NAS-found architectures are able to surpass hand-designed ones on different tasks, especially on image classification; see White et al. (2023) for a survey. For fast search and evaluation of found architectures, many NAS methods make use of performance prediction techniques via surrogate models. The surrogate model predicts the performance of an architecture without the need to train it, with the goal of keeping the query amount (i.e., the amount of training data necessary to fit the surrogate model) low. Each query means one full training of an architecture; therefore, successful performance prediction-based NAS methods are able to use only a few queries. Wen et al. (2020); Ru et al. (2021); White et al. (2021a); Wu et al. (2021); Lukasik et al. (2022) show improved results using surrogate models while keeping the query amount low. However, these methods use the validation or test accuracy of the architecture as a target and thus require high computation time. In order to predict the performance of an architecture without full training, zero-cost proxies (ZCPs), measured on untrained architectures, can be used (Mellor et al., 2021). The idea is that these ZCPs, which often require only one forward and one backward pass on a single mini-batch, are correlated with the resulting performance of the architecture after full training. Mellor et al. (2021) originally used ZCPs for NAS by analyzing the linear regions of a network and how well they are separated. In contrast, Abdelfattah et al. (2021) uses pruning-at-initialization techniques (Lee et al., 2019; Wang et al., 2020; Tanaka et al., 2020) as ZCPs. The best-performing ZCP in Abdelfattah et al. (2021), synflow, is data-independent, i.e., it does not consider the input data for the proxy calculation. Another data-independent proxy was proposed by Lin et al. (2021). Other approaches use the neural tangent kernel for faster architecture search (Chen et al., 2021).

The benchmark NAS-Bench-Suite-Zero (Krishnakumar et al., 2022) compares different ZCPs on different NAS search spaces and shows how they can be integrated into different NAS frameworks. Other works include ZCPs in Bayesian optimization NAS approaches (for example Shen et al. (2021); White et al. (2021b)) and in one-shot architecture search (Xiang et al., 2021). Kadlecová et al. (2024) recently proposed neural graph features (GRAF), which are properties of the architectural topology, e.g., the minimal path length from the input to the output. The combination of GRAF and previous ZCPs in a Bayesian optimization approach leads to a sample-efficient search method.

In contrast, this paper evaluates how well ZCPs can predict the robust accuracy of a model under adversarial attacks, and demonstrates (surprising) success in combining different ZCP features in a random forest regressor.

2.2 Robustness in NAS

Compared to searching for an architecture with the single objective of high performance, including the architecture's robustness results in an even more challenging task that requires a multi-objective search. Recent works that search for both high-performing and robust architectures combine both objectives in one-shot search approaches (Guo et al., 2020; Dong et al., 2020; Mok et al., 2021; Hosseini et al., 2021). Dong et al. (2020) includes a parameter constraint in the supernet training in order to reduce the Lipschitz constant. Hosseini et al. (2021) likewise adds additional maximization objectives, the certified lower bound and the Jacobian norm bound, to the supernet training. Mok et al. (2021) includes the smoothness of the loss landscape, measured via the Hessian, in the bi-level optimization approach of Liu et al. (2019). In contrast to these additional objectives, Guo et al. (2020) proposes adversarial training of the supernet for increased network robustness. The first robustness dataset (Jung et al., 2023) facilitates this research area: all \(6\,466\) unique architectures in the popular NAS cell-based search space NAS-Bench-201 (Dong and Yang, 2020) are evaluated against four different adversarial attacks (FGSM (Goodfellow et al., 2015), PGD (Kurakin et al., 2017), APGD and Square (Croce and Hein, 2020)) with different attack strengths on three different image datasets, CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), and ImageNet16-120 (Chrabaszcz et al., 2017). Based on the robustness dataset (Jung et al., 2023), Wu et al. (2024) construct NAS-RobBench-201, for which the \(6\,466\) unique architectures are trained adversarially, i.e., with 7-step \(L_{\infty }\) PGD with step size 2/255 and perturbation \(\epsilon = 8/255\). Each architecture is evaluated on FGSM with \(L_{\infty }, \epsilon \in \{3/255, 8/255\}\), PGD with \(L_{\infty }, \epsilon \in \{3/255, 8/255\}\), 20 steps and step size \(2.5 \times \epsilon /20\), and AutoAttack (Croce and Hein, 2020) with \(\epsilon = 8/255\). We leverage both Jung et al. (2023) and Wu et al. (2024) for our evaluation.

This paper is a significantly extended and consolidated version of our previous conference publication (Lukasik et al., 2023), in which we evaluated the ability of ZCPs to act as predictors for a model's robustness. We provide additional evaluations on the feature importance of ZCPs for adversarially trained neural networks (see Tables 2, 3, 4, Figs. 8, 11). Furthermore, we add the recently proposed neural graph features (Kadlecová et al., 2024) as additional zero-cost proxies to our evaluations (see Tables 6, 7, 8).

3 Background on Zero-Cost Proxies

As presented in NAS-Bench-Suite-Zero (Krishnakumar et al., 2022), we can differentiate the proxies into different types: Jacobian-based, pruning-based, baseline, piecewise-linear, and Hessian-based zero-cost proxies, as well as neural graph features. In the following, we provide more information about these ZCPs, which we evaluate in this paper.

3.1 Jacobian-Based

Mellor et al. (2021) were the first to introduce ZCPs into NAS, analyzing the network behavior using local linear operations for each input \({\textbf {x}}_i \in \mathbb {R}^D\) in a training data mini-batch, which can be computed via the Jacobian of the network for each input. The idea is that an untrained network which will perform well after training should already be able to distinguish the local linear operations of different data points. For that, the metric jacov was introduced, which uses the correlation matrix of the Jacobians as a Gaussian kernel. The score itself is the Kullback–Leibler divergence between a standard Gaussian and a Gaussian with the mentioned kernel. The higher the score, the better the architecture is likely to be.
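To make the computation concrete, the following is a minimal sketch of such a Jacobian-covariance score in PyTorch. It follows the description above; the actual implementation in NAS-Bench-Suite-Zero differs in details, and `net` and the mini-batch `x` are placeholders.

```python
import torch

def jacov_score(net, x):
    """Minimal jacov sketch: score an untrained net by how decorrelated
    the per-sample input Jacobians are (higher = likely better)."""
    x = x.clone().requires_grad_(True)
    y = net(x)                                   # (N, C) logits
    y.backward(torch.ones_like(y))               # grad of summed logits w.r.t. x
    J = x.grad.flatten(1)                        # (N, D) per-sample Jacobians
    J = J - J.mean(dim=1, keepdim=True)
    J = J / (J.norm(dim=1, keepdim=True) + 1e-9)
    corr = J @ J.t()                             # (N, N) correlation matrix
    # KL( N(0, corr) || N(0, I) ) = 0.5 * sum(lambda - log(lambda) - 1)
    lam = torch.linalg.eigvalsh(corr) + 1e-5
    return (0.5 * (lam - torch.log(lam) - 1.0).sum()).item()
```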

Building on that, Mellor et al. (2021) further introduced nwot (Neural Architecture Search without Training), which forms binary codes, depending on whether a rectified linear unit is active or not, that define the linear regions of the network. Similar binary codes for two input points indicate that it is more challenging to learn to separate these points. Lopes et al. (2021) developed nwot further by introducing epe-nas (Efficient Performance Estimation Without Training for Neural Architecture Search). The goal of epe-nas is to evaluate whether an untrained network is able to distinguish two points from different classes while treating points from the same class as similar. This is measured by the correlation of the Jacobians (as in jacov) for input data from the same class. The resulting correlation matrix can therefore be used to analyze how the network behaves for each class and thus may indicate whether the network can also distinguish between different classes.
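The binary-code construction of nwot can be sketched as follows, assuming the network uses `nn.ReLU` modules; this is a simplified reading of the method, not the reference implementation.

```python
import torch

def nwot_score(net, x):
    """Sketch of nwot: record which ReLUs fire per input and score the
    log-determinant of the code-agreement kernel (higher = likely better)."""
    codes = []
    hooks = [m.register_forward_hook(
                 lambda mod, inp, out: codes.append((out > 0).flatten(1).float()))
             for m in net.modules() if isinstance(m, torch.nn.ReLU)]
    with torch.no_grad():
        net(x)
    for h in hooks:
        h.remove()
    c = torch.cat(codes, dim=1)          # (N, #ReLU units) binary codes
    # K[i, j] = number of units on which the codes of samples i and j agree
    K = c @ c.t() + (1 - c) @ (1 - c).t()
    return torch.slogdet(K)[1].item()
```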

As an alternative to these methods, Abdelfattah et al. (2021) proposed a simple proxy, grad-norm, which is the sum of the Euclidean norms of the weight gradients.

So far, these Jacobian-based measurements have focused on the correlation with the resulting clean performance of architectures. In this paper, we are also interested in the correlation with, and influence on, the robust accuracy of the architecture. As also used in Jung et al. (2023), Hosseini et al. (2021) combined the search for a high-performing architecture with robustness against adversarial attacks by including the Frobenius norm of the Jacobian (jacob-fro) in the search. As introduced in Hoffman et al. (2019), the change of the network output when a perturbed data point \({\textbf {x}}_i + \mathbf {\epsilon }, \mathbf {\epsilon } \in \mathbb {R}^D\), is fed to the network instead of the clean data point \({\textbf {x}}_i\) can be used as a measure of the robustness of the architecture: the larger the change, the more unstable the network is for perturbed input data. This change can be measured by the square of the Frobenius norm of the difference between the network's predictions on perturbed and unperturbed data.
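A simple finite-difference sketch of this quantity is given below; the number of random directions `n_dirs` and the step size `eps` are illustrative choices, and 4-D image inputs are assumed.

```python
import torch

def jacob_fro_score(net, x, eps=1e-3, n_dirs=8):
    """Sketch of jacob-fro: finite-difference estimate of the squared
    Frobenius norm of the output change under small input perturbations
    (larger change = less stable, i.e., less robust)."""
    with torch.no_grad():
        y = net(x)
        total = 0.0
        for _ in range(n_dirs):
            u = torch.randn_like(x)                # random direction, x is (N, C, H, W)
            u = eps * u / (u.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-9)
            total += ((net(x + u) - y) ** 2).sum().item()
    return total / n_dirs
```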

3.2 Pruning-Based

Pruning-based ZCPs are based on network pruning metrics, which identify the least important parameters in a network at initialization time. Lee et al. (2019) introduced snip (Single-shot Network Pruning), which uses a connection sensitivity metric to approximate the change in the loss when weights with a small gradient magnitude are pruned. Building on that, Wang et al. (2020) improve on snip with their grasp (Gradient Signal Preservation) metric, which approximates the change in the gradient norm, rather than the loss, after weight pruning. Lastly, Tanaka et al. (2020) investigated these two pruning-based metrics in terms of layer collapse and proposed synflow, which multiplies the absolute values of the parameters in the network and is independent of the training data.
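As an example of how such a pruning metric is aggregated into a single scalar proxy, here is a sketch of snip as used in the zero-cost setting; the loss function and the summation over all parameters follow common practice rather than any one reference implementation.

```python
import torch
import torch.nn.functional as F

def snip_score(net, x, targets):
    """Sketch of snip as a zero-cost proxy: sum of connection sensitivities
    |dL/dw * w| from a single forward/backward pass on one mini-batch."""
    net.zero_grad()
    loss = F.cross_entropy(net(x), targets)
    loss.backward()
    return sum((p.grad * p).abs().sum().item()
               for p in net.parameters() if p.grad is not None)
```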

In contrast to that, Turner et al. (2020) obtains and aggregates the Fisher information (fisher) over all channels in a convolution block to identify the channels with the least important effect on the network's loss.

3.3 Piecewise Linear

Lin et al. (2021) proposes the zen score, motivated by the observation that a CNN can also be seen as a composition of piecewise linear functions conditioned on activation patterns. They propose to measure the network's ability to express complex functions by its Gaussian complexity. This score is data-independent, since both the network weights and the input data are sampled from a standard Gaussian distribution.
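A heavily simplified sketch of this idea follows; the original zen score additionally accounts for batch-normalization variance terms, which are omitted here, and the input shape and perturbation scale `alpha` are illustrative.

```python
import torch

def zen_score(net, input_shape=(16, 3, 32, 32), alpha=1e-2):
    """Simplified zen sketch: with Gaussian weights and Gaussian inputs,
    measure the expected output change under a small Gaussian input
    perturbation (BN variance terms of the original score omitted)."""
    for p in net.parameters():           # data-independent: resample all weights
        torch.nn.init.normal_(p)
    x = torch.randn(input_shape)
    delta = alpha * torch.randn(input_shape)
    with torch.no_grad():
        score = (net(x + delta) - net(x)).norm() / alpha
    return torch.log(score + 1e-9).item()   # higher = more expressive
```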

3.4 Hessian-Based

As mentioned in Sect. 3.1, ZCP research was mainly motivated by finding measurements that correlate with the network's performance after training. In this paper, we want to shift the focus towards the robustness of architectures, which is also a crucial aspect of neural architectures. Mok et al. (2021) also included robustness as a target for architecture search by considering the smoothness of the loss landscape of the architecture. Zhao et al. (2020) investigated the connection between the smoothness of the loss landscape of a network and its robustness, and show that the adversarial loss is correlated with the largest eigenvalue of the Hessian. A small Hessian spectrum implies a flat minimum, whereas a large Hessian spectrum implies a sharp minimum, which is more sensitive to changes in the input and can therefore be fooled more easily by adversarial attacks.
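The largest Hessian eigenvalue is typically estimated without forming the Hessian, via power iteration on Hessian-vector products; the following is a generic sketch of that standard technique, not the exact procedure used to build the benchmark values.

```python
import torch
import torch.nn.functional as F

def top_hessian_eigenvalue(net, x, targets, iters=20):
    """Sketch: estimate the largest eigenvalue of the loss Hessian w.r.t.
    the weights by power iteration on Hessian-vector products
    (larger eigenvalue = sharper minimum)."""
    params = [p for p in net.parameters() if p.requires_grad]
    loss = F.cross_entropy(net(x), targets)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        norm = torch.sqrt(sum((h * h).sum() for h in hv)) + 1e-9
        v = [h / norm for h in hv]       # normalized power-iteration step
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    return sum((a * b).sum() for a, b in zip(hv, v)).item()  # Rayleigh quotient
```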

3.5 Baselines

In addition to the above-mentioned purpose-built ZCPs, basic network information has also been used successfully as ZCPs (Abdelfattah et al., 2021; Ning et al., 2021). The most common baseline proxies are the number of FLOPS (flops) and the number of parameters (params). In addition to these, Abdelfattah et al. (2021) also considers the sum of the L2-norms of the untrained network weights, l2-norm, and the product of the network weights and their gradients, plain.

Fig. 1 Kendall tau rank correlation in absolute values between all zero-cost proxies computed on all architectures in the robustness dataset (Jung et al., 2023) and the test accuracy and adversarial attack accuracies for CIFAR-10

3.6 Neural Graph Features

Very recently, proxies based on the architectural topology (GRAF), which are very simple to compute, were proposed in Kadlecová et al. (2024) for cell-based search spaces such as NAS-Bench-201 (Dong and Yang, 2020). Different features are calculated based on the presence or absence of operations in each cell. In total, the following features are calculated for all possible operation subsets \(\mathcal {O}' \subseteq \mathcal {O}\) of the operation set \(\mathcal {O}\) (a toy example of one such feature follows the list):

  • Number of times each operation \(o \in \mathcal {O}\) is used in the cell

  • Minimum/maximum path length from the input to the output node using only operations in a predefined \(\mathcal {O}' \subseteq \mathcal {O}\)

  • Input degree of the output node, considering only operations in \(\mathcal {O}'\)

  • Output degree of the input node, considering only operations in \(\mathcal {O}'\)

  • Mean input and output degree of the intermediate nodes, considering only operations in \(\mathcal {O}'\).
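To illustrate, here is a small sketch of one such feature, the minimum path length under an operation subset, for a NAS-Bench-201 cell. The edge encoding and operation names follow the usual NAS-Bench-201 convention, but the dictionary-based cell representation is our own illustration, not the GRAF implementation.

```python
# NAS-Bench-201 cell edges (source, target) for nodes 0..3, topologically sorted.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

def min_path_len(cell_ops, allowed):
    """One GRAF-style feature: minimum number of edges on a path from the
    input node (0) to the output node (3) using only `allowed` operations."""
    inf = float("inf")
    dist = {0: 0, 1: inf, 2: inf, 3: inf}
    for u, v in EDGES:                    # relax edges in topological order
        if cell_ops[(u, v)] in allowed:
            dist[v] = min(dist[v], dist[u] + 1)
    return dist[3]                        # inf if no such path exists

# Example cell; banning {none (zero), avg_pool_3x3} leaves the path 0 -> 1 -> 3:
ops = {(0, 1): "nor_conv_3x3", (0, 2): "skip_connect", (0, 3): "none",
       (1, 2): "avg_pool_3x3", (1, 3): "nor_conv_1x1", (2, 3): "skip_connect"}
print(min_path_len(ops, {"skip_connect", "nor_conv_1x1", "nor_conv_3x3"}))  # -> 2
```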

4 Feature Collection and Evaluation

In the following, we describe our evaluation setting.

4.1 NAS-Bench-201

NAS-Bench-201 (Dong and Yang, 2020) is a cell-based search space (White et al., 2023) in which each cell has 4 nodes and 6 edges. Each node represents a feature map and each edge represents one operation from a predefined operation set. This operation set contains 5 different operations: \(1 \times 1\) convolution, \(3 \times 3\) convolution, \(3 \times 3\) average pooling, skip-connection, and zero. The cells are integrated into a macro-architecture. The overall search space has a size of \(5^6=15\,625\) different architectures, of which \(6\,466\) are unique and non-isomorphic. Isomorphic architectures have the same information flow in the network, resulting in similar outcomes that differ only for numerical reasons. All architectures are trained on three different image datasets, CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), and ImageNet16-120 (Chrabaszcz et al., 2017).

4.2 Neural Architecture Design and Robustness Datasets

The dataset by Jung et al. (2023) evaluates all unique architectures in NAS-Bench-201 (Dong and Yang, 2020) against three white-box attacks, i.e., FGSM (Goodfellow et al., 2015), PGD (Kurakin et al., 2017), and APGD (Croce and Hein, 2020), as well as one black-box attack, Square Attack (Croce and Hein, 2020), to evaluate the adversarial robustness induced by the different topologies. In addition, all architectures were also evaluated on corrupted image datasets, CIFAR-10-C and CIFAR-100-C (Hendrycks and Dietterich, 2019). Along with these evaluations, the dataset also presents three different use cases for the data: evaluation of ZCPs for robustness, NAS for robustness, and an analysis of how the topology and design of an architecture influence its resulting robustness. For the first use case, the dataset also provides evaluations of the Frobenius norm of the Jacobian from Sect. 3.1 and the largest eigenvalue of the Hessian from Sect. 3.4 as two zero-cost proxies. Note that the latter proxy, the Hessian, was only evaluated on the CIFAR-10 image data (Krizhevsky, 2009).

The dataset NAS-RobBench-201 (Wu et al., 2024) trains all unique architectures adversarially and evaluates them on FGSM (Goodfellow et al., 2015), PGD (Kurakin et al., 2017), and AutoAttack (Croce and Hein, 2020) with different perturbation strengths. This dataset therefore provides an extension of the robustness dataset by Jung et al. (2023).

4.3 Collection

Both datasets (Krishnakumar et al., 2022; Jung et al., 2023) provide us with the necessary proxies for the NAS-Bench-201 (Dong and Yang, 2020) search space. In total, we analyze 15 different proxies for \(6\,466\) architectures on three different image datasets, for clean and adversarially trained networks (Jung et al., 2023; Wu et al., 2024).

5 Evaluations of Zero-Cost Proxies

As presented in Sect. 4, we can directly gather all proxies. Having these at hand allows us to evaluate the correlation, influence and importance of each proxy not only on the clean accuracy of the architectures but also on their robust accuracy.

5.1 Correlation

Clean Trained Networks on the Robustness Dataset Fig. 1 shows the Kendall tau rank correlation in absolute values between each ZCP and all available accuracies, from clean validation accuracy to all adversarial attack accuracies (fgsm, pgd, aa-apgd-ce, aa-square) on CIFAR-10 (Jung et al., 2023). The Jacobian-based proxies (especially jacov and nwot) and the baseline proxies show the highest correlation over all datasets, especially for the FGSM attack. Furthermore, the correlation within the same attack stays steady over different strengths \(\epsilon \). However, the more difficult the attack gets, from FGSM over PGD to APGD, the more the correlation decreases. Interestingly, the zen proxy has the lowest correlation with each accuracy.

Fig. 2 Kendall tau rank correlation in absolute values between all zero-cost proxies computed on all architectures in the NAS-RobBench-201 dataset (Wu et al., 2024) and the validation accuracy and adversarial attack accuracies for CIFAR-10, calculated with clean input image data (left) and perturbed input data (right)

Adversarially Trained Networks in NAS-RobBench-201 Fig. 2 shows the Kendall tau rank correlation in absolute values between each ZCP and all available accuracies, from clean validation accuracy to all adversarial attack accuracies (fgsm, pgd, auto-attack), on clean CIFAR-10 (left) and PGD-perturbed CIFAR-10 (right) (Wu et al., 2024). Similar to the robustness dataset, the Jacobian-based proxies (jacov, nwot) and the baseline proxies (flops, params) show the highest correlation over all attacks and perturbation strengths. The hessian proxy shows a lower correlation than for the previously evaluated robustness dataset. Here, too, zen has the lowest correlation.

The dataset NAS-RobBench-201 (Wu et al., 2024) provides the robust accuracies of the architectures after adversarial training. Accordingly, we calculate the zero-cost proxies from Sect. 3 using perturbed instead of clean input data to evaluate the influence on the correlation. We perturb the input data by a PGD \(L_{\infty }\) attack with \(\epsilon =8/255\), step size 2/255, and 7 steps, and recalculate each proxy. Interestingly, using perturbed data for the proxy calculation does not have any influence on the correlation (cf. Fig. 2, right).
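For reference, the perturbation step can be sketched as follows; this is a standard \(L_{\infty}\) PGD implementation matching the stated hyperparameters, not the exact code used to generate the perturbed data.

```python
import torch
import torch.nn.functional as F

def pgd_linf(net, x, y, eps=8/255, step=2/255, steps=7):
    """L_inf PGD: signed-gradient ascent on the loss, projected onto the
    eps-ball around x and clamped to the valid image range."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(net(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project onto eps-ball
            x_adv = x_adv.clamp(0, 1)                  # keep valid pixel range
    return x_adv.detach()
```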

Fig. 3 Random forest regression model with zero-cost proxies as input and the architecture's accuracy as prediction objective

5.2 Performance Prediction Using Random Forest

So far, the ZCPs in NAS-Bench-Suite-Zero are motivated by their correlation with the resulting clean accuracy of the architectures. Yet, how well do they transfer to the more challenging task of predicting robust accuracy, and, even more challenging, to the multi-objective task? High clean accuracy does not necessarily mean that a network is also robust. Therefore, we use all ZCPs from NAS-Bench-Suite-Zero (Krishnakumar et al., 2022) and the robustness dataset (Jung et al., 2023), which we already presented in Sect. 3, as feature inputs for a random forest prediction model: each architecture \(a\) in the NAS-Bench-201 search space \(\mathcal {A}\) is represented by its zero-cost proxies, which are then fed as input to the random forest with different robustness targets (single and multi-objective) (cf. Fig. 3). We focus on the random forest prediction model for several reasons: on the one hand, it is easy to train and works well without much effort for hyperparameter optimization; on the other hand, it provides interpretable results, which is particularly relevant for us. We use the default regression parameters, with 100 trees and the mean squared error as the split criterion, and train the random forest regression model on 3 different sample sizes (32, 128, \(1\,024\)) for each robustness dataset.
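The overall setup can be sketched with scikit-learn as follows; the feature matrix and targets below are random placeholders standing in for the real proxy values and benchmark accuracies.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(6466, 15))   # placeholder: one row per architecture, one column per ZCP
Y = rng.normal(size=(6466, 2))    # placeholder: e.g. [clean accuracy, robust accuracy]

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, train_size=128, random_state=0)
rf = RandomForestRegressor(n_estimators=100, criterion="squared_error",
                           random_state=0)  # sklearn defaults: 100 trees, MSE
rf.fit(X_tr, Y_tr)                          # multi-output regression works natively
print(r2_score(Y_te, rf.predict(X_te)))     # test R^2 as reported in the tables
```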

5.2.1 Robustness Dataset

For a better overview, we will consider the attack perturbation value of \(\epsilon = 1/255\) for all the following experiments on the robustness dataset (Jung et al., 2023).

We analyze the prediction ability in Table 1 by means of the \(R^2\) score of the prediction model on the test data. As we can see, the random forest prediction model is able to predict the single objectives (clean and robust accuracy) and the multi-objective accuracies properly for the largest training size, while predicting the clean accuracy appears to be the easiest task. Interestingly, the prediction ability in terms of \(R^2\) is similarly small for both smaller sample sizes (32 and 128). Furthermore, for all sample sizes, the multi-objective task results in better prediction ability than the single robust objective, with the biggest difference at the smallest sample size.

Table 1 Test \(R^2\) of the random forest prediction model for both single objective and multi objectives on the clean test accuracy and the robust test accuracy for \(\epsilon = 1/255\)
Table 2 Test \(R^2\) of the random forest prediction model for both single objective and multi objectives on the clean validation accuracy and the adversarially trained validation accuracy for FGSM and PGD, \(\epsilon \in \{3/255, 8/255\}\) respectively, and AutoAttack
Table 3 Test \(R^2\) of the random forest prediction model for both single objective and multi objectives on the clean validation accuracy and the adversarially trained validation accuracy for FGSM and PGD, \(\epsilon \in \{3/255, 8/255\}\), and AutoAttack, using perturbed input data respectively

5.2.2 NAS-RobBench-201 Dataset

Next, we analyze the performance prediction ability for the NAS-RobBench-201 dataset (Wu et al., 2024) for clean input data (Table 2) and perturbed input data (Table 3). As we can see, predicting the adversarially trained robust accuracy seems to be a much easier task than predicting the native robustness of the clean trained networks from the previous section. This also holds for stronger perturbations (\(\epsilon =8/255\)) as well as stronger attacks such as AutoAttack. Turning to Table 3, which presents the prediction ability using perturbed input data for the proxy calculation, we see similar prediction behavior. Interestingly, using perturbed input data only improves the prediction ability on the ImageNet16-120 dataset, especially for small sample sizes such as 32. This improvement can be explained by the influence of the individual input features on the prediction, as well as the degree to which they have an effect. In this case, the proxy jacov has the biggest influence on the prediction in terms of permutation importance in both cases (clean input, Table 2; perturbed input, Table 3). However, in the case of perturbed input data, this proxy has a much larger impact on the prediction than in the case of clean input data. We investigate the general importance of the features for the prediction further in Sect. 5.3.

5.2.3 Top 5 Architectures in the Test Set

So far, we have shown that predicting the clean and robust accuracy is possible even with a small sample size. In the following, we are interested in answering the question: how crucial is the partition into training and test set, i.e., what influence does it have when the best five architectures are in the test set? To answer this question, we repeat the previous experiment with a sample size of 128 for all image datasets on both robustness datasets (clean and adversarially trained), with the difference that the best 5 architectures for each objective are forced into the test set. Table 4 shows the results for the robustness dataset by Jung et al. (2023) (improvements and deteriorations over Table 1 are marked). Forcing the best architectures into the test set deteriorates the prediction for CIFAR-10, whereas for CIFAR-100 and ImageNet16-120 it shows mostly beneficial results. Interestingly, the prediction ability for the APGD robust accuracy as a single objective improves over Table 1 for all three image datasets. Since APGD is the strongest white-box attack considered, we hypothesize that this clear behavior might be due to slightly lower evaluation noise.

When we turn towards the NAS-RobBench-201 dataset (Wu et al., 2024), on which the prediction method already shows impressive prediction abilities in Table 2, the picture changes. Here (cf. Table 5), forcing the best five architectures into the test set has a negative influence on the prediction ability for almost all objectives.

5.2.4 Including Additional GRAF Features

The recently proposed neural graph features (GRAF) (Kadlecová et al., 2024) allow us to extend the input features of our prediction model. For NAS-Bench-201, GRAF consists of 191 features. For the robustness dataset (Jung et al., 2023), we perform the same experiment as in Sect. 5.2.1 with a sample size of 128, except that the input features of our prediction model are expanded by the GRAF features, and thus compare Table 6 with Table 1. As we can see, including the additional network topology features improves the prediction ability almost everywhere. Interestingly, when including GRAF in the zero-cost proxies for the prediction task on NAS-RobBench-201 (Wu et al., 2024) (cf. Table 7), the prediction on CIFAR-10 shows a slight decrease; note that the test \(R^2\) decreases only by 0.1. For CIFAR-100 and ImageNet16-120, the same holds as for the robustness dataset: an increase in prediction ability. It is important to mention that the \(R^2\) for the NAS-RobBench-201 dataset is already extremely high, which leaves little room for improvement by a big margin.

5.2.5 Analysis Across Different Proxy Categories

In the previous sections, we have seen that combining zero-cost proxies allows for good performance prediction of a network's clean and robust accuracy. This leads to the next question of whether certain types of zero-cost proxies alone suffice for performance prediction. Table 8 shows the results using the 6 different categories (Jacobian, Pruning, Piecewise-linear, Hessian, Baselines, GRAF) for the prediction task on the robustness dataset. We can see that using only GRAF leads to good prediction ability; note that this particular category contains 191 features. The second best category is the Jacobian category: using only \(\{\texttt {epe-nas}, \texttt {grad-norm}, \texttt {jacob-frobenius}, \texttt {jacov}, \texttt {nwot}\}\) as input features also allows for a good prediction. Interestingly, using only zen (the piecewise-linear proxy) or hessian hinders the prediction substantially. For low sample sizes, the Baseline category shows the best prediction ability for the clean accuracy and the multi-objective task, while with increasing sample size GRAF becomes better.

Table 4 Test \(R^2\) of the random forest prediction model on the Robustness Dataset (Jung et al., 2023) with best 5 architectures enforced to be in the test set, and 128 training data
Table 5 Test \(R^2\) of the random forest prediction model on the NAS-RobBench-201 Dataset (Wu et al., 2024) with best 5 architectures enforced to be in the test set, and 128 training data
Table 6 Test \(R^2\) of the random forest prediction model on the Robustness Dataset (Jung et al., 2023) using zero-cost-proxies and GRAF, trained on 128 training data
Table 7 Test \(R^2\) of the random forest prediction model on the NAS-RobBench-201 Dataset (Wu et al., 2024) using zero-cost-proxies and GRAF, trained on 128 training data

5.3 Feature Importance Based on Permutation

Fig. 4 Permutation feature importance of the random forest prediction model trained on 1024 training data points provided in Jung et al. (2023), with all zero-cost proxies as features and multiple targets: clean test accuracy and different adversarial attack accuracies for perturbation strength \(\epsilon =1/255\) on CIFAR-10

Fig. 5 Best 5 features based on their permutation feature importance of the random forest prediction model trained on 1024 training data points provided in Jung et al. (2023), with all zero-cost proxies and GRAF as features and multiple targets: clean test accuracy and different adversarial attack accuracies for perturbation strength \(\epsilon =1/255\) on CIFAR-10

Next, we are interested in the importance of the individual input features for the random forest prediction model. The permutation importance measures the prediction error of the trained random forest after randomly permuting a feature's values, which breaks the relationship between that feature and the target. The more important a feature is for the prediction, the larger the resulting prediction error.
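In scikit-learn terms, this corresponds to the following sketch, reusing the fitted model from the random-forest sketch above; the `proxy_names` labels are assumed placeholders for the 15 proxy identifiers.

```python
from sklearn.inspection import permutation_importance

# rf, X_te, Y_te as in the random-forest sketch above.
proxy_names = [f"zcp_{i}" for i in range(X_te.shape[1])]   # assumed labels
result = permutation_importance(rf, X_te, Y_te, n_repeats=10, random_state=0)
for name, imp in sorted(zip(proxy_names, result.importances_mean),
                        key=lambda t: -t[1]):
    print(f"{name:>10s}: {imp:.4f}")   # larger score drop = more important
```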

Figure 4 visualizes the feature importance on the CIFAR-10 image dataset for all adversarial attacks with \(\epsilon =1/255\) and all three use cases: clean test accuracy, robust test accuracy, and both jointly; the bars are sorted by the latter. As we can see, jacov is the most important feature for both the clean test accuracy and the multi-objective task, except for FGSM. Interestingly, the top 5 most important ZCPs differ between the considered adversarial attacks. Next, we include the GRAF features as additional input features for our performance prediction model, as presented in Sect. 5.2.4, and evaluate the new set of input features for their importance. We plot the 5 most important features out of a total of 206 in Fig. 5. The behavior is different from the case of only using the 15 ZCPs: for the multi-objective (clean and FGSM, \(\epsilon =3/255\)), jacob-fro is the most important feature, whereas flops is the most important feature for the single robust objective (FGSM only, \(\epsilon =1/255\)). Additionally, the most important features for the robust objectives are not important for the clean accuracy objective. We can also see that there is only a slight decrease in importance from one feature to the next, and no GRAF feature is in the top 5. Similar behavior holds for the Square attack. For the stronger attacks, PGD and APGD, the GRAF feature min-path-len-banned [1,4] is the most important feature for both robustness targets, single and multi-objective. This feature is the minimum path length from the input to the output node without the operations \(\{\textrm{zero}, \mathrm {avg.~pool.}\}\). For both attacks, the pruning-based methods synflow and fisher are among the most important features.

Looking at the feature importance for CIFAR-100 (Fig. 6), we can see that the top 4 most important features for clean and robust accuracy are always the same, with the Jacobian-based ZCP nwot being the most important one. Note that for CIFAR-100 the rank correlation decreases the most for more difficult attacks.

Turning to ImageNet16-120 in Fig. 7, the most important feature is always the Jacobian-based ZCP jacov, with zen and nwot always among the three most important features.

We additionally investigate the permutation feature importance for the NAS-RobBench-201 dataset in Fig. 8, using CIFAR-10 as an example. At first glance, we can see that the drop in importance from one feature to the next is much smaller than in the previous setting. In addition, the feature importances for FGSM and PGD with \(\epsilon =3/255\) are exactly the same. For the stronger perturbation \(\epsilon =8/255\), the top 4 most important features are always the same. In all cases, either jacob-fro or jacov is the most important feature. Here we also include the GRAF features as additional input features for the performance prediction model, as presented in Sect. 5.2.4, and evaluate the new set of input features for their importance in Fig. 9. The Jacobian-based proxies jacob-fro and jacov are the two most important features, similar to the prediction without the additional GRAF features. The evaluations on the robustness benchmark (cf. Fig. 5) show that similar proxies are in the top 5 most important features across all attacks. The most important GRAF features are min-path-len-banned [1,4], i.e., the minimum path length from the input to the output node without the operations \(\{\textrm{zero}, \mathrm {avg.~pool.}\}\), as well as max-op-on-path-allowed [2,3] and max-op-on-path-allowed [0,2,3], i.e., considering only paths from input to output that contain exclusively \(\{\textrm{conv}~3\times 3, \textrm{conv}~1\times 1\}\) and \(\{\textrm{skip}, \textrm{conv}~3\times 3, \textrm{conv}~1\times 1\}\), respectively.

Table 8 Test \(R^2\) of the random forest prediction model for both single objective and multi objectives on the clean test accuracy and the robust test accuracy for \(\epsilon = 1/255\), based on the different proxy categories

Overall, for both robustness datasets, the features that are most important for the clean accuracy are also important for the architecture's robust accuracy.

Interestingly, we can see that the task of robust accuracy prediction is more difficult on all three image datasets and attacks (see the smaller \(R^2\) values in Tables 1, 2). In addition, Figs. 4, 6, and 7 give the impression that solely the most important feature could be used to predict the clean accuracy, but not the robust accuracy.

Therefore, we evaluate the ability of using only the most important feature as input for the prediction task in Table 9, using Jung et al. (2023) as an example. As we can see, the joint consideration of several proxies is necessary in terms of \(R^2\) to predict a model's robustness, while the clean accuracy can be regressed from only one such feature.

5.3.1 Feature Importance Excluding Top 1

The previous section shows that one proxy alone is not sufficient to predict the robustness of an architecture. This leads to the next question of whether the regression model can still make good predictions using all ZCPs except the most important one.

The comparison of Table 1 with Table 10 shows only a slight decrease in \(R^2\) values. The prediction ability for the clean accuracy and the FGSM robust accuracy on CIFAR-10, as well as for the clean accuracy and the PGD accuracy on CIFAR-100, even remains the same.

5.3.2 Analysis Across Different Training Sizes

Next, we investigate which proxies are the most important ones across different sample sizes. Figure 10 shows the permutation feature importance for the robustness dataset, exemplified by FGSM with \(L_{\infty }, \epsilon =1/255\), over the three training data sizes (\(32, 128, 1\,024\)). For the small sample size 32, the baseline proxies flops and params are the most important proxies for the multi-objective target and are also important for predicting the robust accuracy, with snip being the proxy with the highest importance. The Jacobian-based proxy jacov, however, is by far the most important proxy for the clean accuracy target. For sample size 128, snip and flops are again the most important proxies for both multi-objective prediction and robust accuracy prediction, while jacov remains the most important proxy for the clean accuracy. As soon as the training size increases from 128 to \(1\,024\), the importance of the baseline proxies decreases substantially, and the pruning- and Jacobian-based proxies, here snip and jacov, become the most important ones. Furthermore, we observe that for the small sample sizes there is no single distinctively important feature for predicting the robustness as single or multi-objective target, but rather a steady decrease in importance from one feature to the next. This picture changes for large sample sizes, where mostly one feature is the most distinctive one for each respective target.

Turning to the NAS-RobBench-201 dataset, the behavior changes (cf. Fig. 11). For each sample size, jacov is the most important feature. Interestingly, for the small sample size of 32, jacov is by far the most distinctive feature. With increasing sample size, this changes to a more steady decrease across the different feature importances.

Fig. 6 Permutation feature importance of the random forest prediction model trained on 1024 training data points provided in Jung et al. (2023), with all zero-cost proxies as features and multiple targets: clean test accuracy and different adversarial attack accuracies for perturbation strength \(\epsilon =1/255\) on CIFAR-100

Fig. 7 Permutation feature importance of the random forest prediction model trained on 1024 training data points provided in Jung et al. (2023), with all zero-cost proxies as features and multiple targets: clean test accuracy and different adversarial attack accuracies for perturbation strength \(\epsilon =1/255\) on ImageNet16-120

Fig. 8 Permutation feature importance of the random forest prediction model trained on 1024 training data points provided in Wu et al. (2024), with all zero-cost proxies as features and multiple targets: clean test accuracy and different adversarial accuracies on CIFAR-10

Fig. 9 Best 5 features based on their permutation feature importance of the random forest prediction model trained on 1024 training data points provided in Wu et al. (2024), with all zero-cost proxies and GRAF as features and multiple targets: clean test accuracy and different adversarial attack accuracies for perturbation strength \(\epsilon =1/255\) on CIFAR-10

6 Conclusion

In this paper, we presented an analysis of 15 zero-cost proxies and additional graph features with regard to their ability to act as performance prediction techniques for architecture robustness using a random forest. We made use of two published robustness datasets, the robustness dataset by Jung et al. (2023) and the adversarially trained robustness dataset by Wu et al. (2024), which allow for fast evaluations on the NAS-Bench-201 search space considered here. In the literature, zero-cost proxies commonly target the clean accuracy of architectures; accordingly, our analysis shows that predicting robustness is a more difficult task. Additionally, we investigated the feature importance of these zero-cost proxies and found that a single feature is not sufficient to predict the robustness, and that the regression model tends to rely on all available features.

Limitations: This paper focuses only on adversarial attacks and not on other types of distribution shifts such as those considered in Koh et al. (2021); Yao et al. (2022); Wiles et al. (2022). While this would indeed be very interesting, we have to leave such aspects to future work for the simple lack of available benchmarks comparable to Jung et al. (2023); Wu et al. (2024) that provide model evaluations under the considered distribution shifts for a wide range of neural architectures trained in a comparable way.

Table 9 Test \(R^2\) of the random forest prediction model using the most important feature for both single objectives and multi objectives on the clean test accuracy and the robust test accuracy for \(\epsilon = 1/255\) on the robustness dataset for 1024 training data
Table 10 Test \(R^2\) of the random forest prediction model without the most important feature for both single objectives and multi objectives on the clean test accuracy and the robust test accuracy for \(\epsilon = 1/255\) on the robustness dataset for 1024 training data
Fig. 10 Permutation feature importance of the random forest prediction model trained over different training data sizes provided in Jung et al. (2023) on CIFAR-10 for FGSM, \(L_{\infty }, \epsilon =1/255\)

Fig. 11 Permutation feature importance of the random forest prediction model trained over different training data sizes provided in Wu et al. (2024) on CIFAR-10 for adversarially trained FGSM, \(L_{\infty }, \epsilon =3/255\)