Confidence Intervals for Error Rates in 1:1 Matching Tasks: Critical Statistical Analysis and Recommendations

International Journal of Computer Vision

Abstract

Matching algorithms predict relationships between items in a collection. For example, in 1:1 face verification, a matching algorithm predicts whether two face images depict the same person. Accurately assessing the uncertainty of the error rates of such algorithms can be challenging when test data are dependent and error rates are low, two aspects that have often been overlooked in the literature. In this work, we review methods for constructing confidence intervals for error rates in 1:1 matching tasks. We derive and examine the statistical properties of these methods, demonstrating how coverage and interval width vary with sample size, error rates, and degree of data dependence with experiments on synthetic and real-world datasets. Based on our findings, we provide recommendations for best practices for constructing confidence intervals for error rates in 1:1 matching tasks.

Notes

  1. The framework can be adapted to include conditioning on predefined attributes of identities, such as when it’s already known which demographic groups certain identities belong to.

  2. The framework can also apply to other losses such as cross-entropy. The principles and methods reviewed, including the bootstrap techniques, can be adapted or directly employed.

  3. This is a simplifying assumption that may not always hold true. For instance, in datasets containing mugshots like MORPH, individuals who have been arrested more frequently could be more identifiable because their facial images are more up-to-date.

References

  • Agresti, A., & Coull, B. A. (1998). Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52(2), 119–126.

  • Aronow, P. M., Samii, C., & Assenova, V. A. (2015). Cluster-robust variance estimation for dyadic data. Political Analysis, 23(4), 564–577.

  • Balakrishnan, G., Xiong, Y., Xia, W., & Perona, P. (2020). Towards causal benchmarking of bias in face analysis algorithms. In European conference on computer vision, pp. 547–563.

  • Bhattacharyya, S., & Bickel, P. J. (2015). Subsampling bootstrap of count features of networks. The Annals of Statistics, 43(6), 2384–2411.

  • Bickel, P. J., Chen, A., & Levina, E. (2011). The method of moments and degree distributions for network models. The Annals of Statistics, 39(5), 2280–2301.

  • Bolle, R. M., Pankanti, S., & Ratha, N. K. (2000). Evaluation techniques for biometrics-based authentication systems (FRR). In International conference on pattern recognition, pp. 831–837.

  • Bolle, R. M., Ratha, N. K., & Pankanti, S. (2004). Error analysis of pattern recognition systems – the subsets bootstrap. Computer Vision and Image Understanding, 93(1), 1–33.

  • Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science, 16(2), 101–133.

  • Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2011). Robust inference with multiway clustering. Journal of Business and Economic Statistics, 29(2), 238–249.

  • Cameron, A. C., & Miller, D. L. (2015). A practitioner’s guide to cluster-robust inference. Journal of Human Resources, 50(2), 317–372.

  • Casella, G., & Berger, R. L. (2021). Statistical inference. Cengage Learning.

  • Chouldechova, A., Deng, S., Wang, Y., Xia, W., & Perona, P. (2022). Unsupervised and semi-supervised bias benchmarking in face recognition. In European conference on computer vision, pp. 289–306.

  • Conti, J.-R., & Clémençon, S. (2022). Assessing performance and fairness metrics in face recognition - bootstrap methods. arXiv preprint arXiv:2211.07245.

  • Davezies, L., D’Haultfœuille, X., & Guyonvarch, Y. (2021). Empirical process results for exchangeable arrays. The Annals of Statistics, 49(2), 845–862.

  • Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge University Press.

  • Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143.

  • DiCiccio, T. J., & Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11(3), 189–228.

  • Fafchamps, M., & Gubert, F. (2007). Risk sharing and network formation. American Economic Review, 97(2), 75–79.

  • Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Machine Learning, 31(1), 1–38.

  • Field, C. A., & Welsh, A. H. (2007). Bootstrapping clustered data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(3), 369–390.

  • Graham, B. S. (2020). Network data. In Handbook of econometrics (vol. 7, pp. 111–218). Elsevier.

  • Green, A., & Shalizi, C. R. (2022). Bootstrapping exchangeable random graphs. Electronic Journal of Statistics, 16(1), 1058–1095.

  • Grother, P., Ngan, M., & Hanaoka, K. (2019). Face recognition vendor test (FVRT): Part 3, demographic effects. National Institute of Standards and Technology Gaithersburg.

  • Hoff, P. (2021). Additive and multiplicative effects network models. Statistical Science, 36(1), 34–50.

  • Hoff, P. D., Raftery, A. E., & Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460), 1090–1098.

  • Kearns, M., & Roth, A. (2019). The ethical algorithm: The science of socially aware algorithm design. Oxford University Press.

  • King, D. E. (2009). Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10, 1755–1758.

  • Krzanowski, W. J., & Hand, D. J. (2009). ROC curves for continuous data. Chapman and Hall/CRC.

  • Macskassy, S., Provost, F., & Rosset, S. (2005). Pointwise ROC confidence bounds: An empirical evaluation. In International conference on machine learning.

  • McCullagh, P. (2000). Resampling and exchangeable arrays. Bernoulli, pp. 285–301.

  • Menzel, K. (2021). Bootstrap with cluster-dependence in two or more dimensions. Econometrica, 89(5), 2143–2188.

  • Miao, W., & Gastwirth, J. L. (2004). The effect of dependence on confidence intervals for a population proportion. The American Statistician, 58(2), 124–130.

  • Mitra, S., Savvides, M., & Brockwell, A. (2007). Statistical performance evaluation of biometric authentication systems using random effects models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(4), 517–530.

  • Ni, J., Li, J., & McAuley, J. (2019). Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp. 188–197.

  • Owen, A. B., & Eckles, D. (2012). Bootstrapping data arrays of arbitrary order. The Annals of Applied Statistics, 6(3), 895–927.

  • Phillips, P. J., Flynn, P. J., Bowyer, K. W., Bruegge, R. W. V., Grother, P. J., Quinn, G. W., & Pruitt, M. (2011). Distinguishing identical twins by face recognition. In International conference on automatic face and gesture recognition, pp. 185–192.

  • Phillips, P. J., Grother, P., Micheals, R., Blackburn, D. M., Tabassi, E., & Bone, M. (2003). Face recognition vendor test 2002. In IEEE international workshop on analysis and modeling of faces and gestures.

  • Phillips, P. J., Yates, A. N., Hu, Y., Hahn, C. A., Noyes, E., Jackson, K., Cavazos, J. G., Jeckeln, G., Ranjan, R., Sankaranarayanan, S., et al. (2018). Face recognition accuracy of forensic examiners, superrecognizers, and face recognition algorithms. Proceedings of the National Academy of Sciences, 115(24), 6171–6176.

  • Poh, N., Martin, A., & Bengio, S. (2007). Performance generalization in biometric authentication using joint user-specific and sample bootstraps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3), 492–498.

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748–8763). PMLR.

  • Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J., Chou, J.-C., Yeh, S.-L., Fu, S.-W., Liao, C.-F., Rastorgueva, E., Grondin, F., Aris, W., Na, H., Gao, Y., Mori, R. D., & Bengio, Y. (2021). SpeechBrain: A general-purpose speech toolkit. arXiv:2106.04624.

  • Ricanek, K., & Tesafaye, T. (2006). MORPH: A longitudinal image database of normal adult age-progression. In International conference on automatic face and gesture recognition, pp. 341–345.

  • Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with python. In Proceedings of the 9th python in science conference.

  • Serengil, S. I., & Ozpinar, A. (2020). Lightface: A hybrid deep face recognition framework. In Innovations in intelligent systems and applications conference, pp. 23–27.

  • Snijders, T. A., Borgatti, S. P., et al. (1999). Non-parametric standard errors and tests for network statistics. Connections, 22(2), 161–170.

  • Tabord-Meehan, M. (2019). Inference with dyadic data: Asymptotic behavior of the dyadic-robust t-statistic. Journal of Business and Economic Statistics, 37(4), 671–680.

  • Van Horn, G., Cole, E., Beery, S., Wilber, K., Belongie, S., & Mac Aodha, O. (2021). Benchmarking representation learning for natural world image collections. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12884–12893.

  • Vangara, K., King, M. C., Albiero, V., & Bowyer, K. (2019). Characterizing the variability in face recognition accuracy relative to race. In Conference on computer vision and pattern recognition workshops.

  • Wasserman, L. (2004). All of statistics: A concise course in statistical inference. Springer.

  • Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158), 209–212.

  • Wu, J. C., Martin, A. F., Greenberg, C. S., & Kacker, R. N. (2016). The impact of data dependence on speaker recognition evaluation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(1), 5–18.

  • Xiao, S., Liu, Z., Zhang, P., & Muennighoff, N. (2023). C-pack: Packaged resources to advance general Chinese embedding.

  • Zeileis, A., Köll, S., & Graham, N. (2020). Various versatile variances: An object-oriented implementation of clustered covariances in R. Journal of Statistical Software, 95, 1–36.

Acknowledgements

The authors would like to thank Mathew Monfort and Yifan Xing for the insightful discussions and valuable feedback on the paper. The anonymous reviewers and the associate editor are also gratefully acknowledged for their constructive feedback that helped improve the clarity of the paper.

Author information

Corresponding authors

Correspondence to Riccardo Fogliato or Pratik Patil.

Additional information

Communicated by Zhouchen Lin.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Supplementary Material for “Confidence Intervals for Error Rates in 1:1 Matching Tasks: Critical Statistical Analysis and Recommendations”

This document acts as a supplement to the paper “Confidence Intervals for Error Rates in 1:1 Matching Tasks: Critical Statistical Analysis and Recommendations.” The supplement is organized as follows.

  (A) In Appendix A, we provide proofs of all the theoretical claims in the main paper.

    (1) Appendix A.1 contains proofs for the parametric methods in Sect. 4.1.

    (2) Appendix A.2 contains proofs for the resampling-based methods in Sect. 4.2.

    (3) Appendix A.3 contains proofs for the unbalanced datasets in Sect. 5.1.

  (B) In Appendix B, we describe protocol design strategies (i.e., sampling) for the estimation of error rates and their associated uncertainty on large datasets.

  (C) In Appendix C, we provide additional numerical experiments, supplementing those in Sect. 6.

Proofs of theoretical results

1.1 Proofs for parametric methods in Sect. 4.1

1.1.1 Proof of Proposition 1 (normality of scaled error rates)

As explained in the main paper, because identity-level observations are assumed to be independent, the case of \(\texttt{FRR}\) in Proposition 1 follows from applying the central limit theorem. The case of the \(\texttt{FAR}\) follows from Proposition 3.2 in Tabord-Meehan (2019).

1.1.2 Proof of Proposition 2 (consistency of plug-in variance estimators)

The convergence in probability of \(\widehat{{\mathop {\textrm{Var}}}}(\sqrt{G}\widehat{\texttt{FRR}})\) to \({\mathop {\textrm{Var}}}(\bar{Y}_{11})\) simply follows from an application of the weak law of large numbers. In the following, we will show the convergence in probability of \(\widehat{{\mathop {\textrm{Var}}}}(\sqrt{G} \widehat{\texttt{FAR}}) - (G - 1)^{-1}(2{\mathop {\textrm{Var}}}(\bar{Y}_{12}) + 4(G-2){\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13}))\) to 0. We begin by recalling the estimator:

$$\begin{aligned} \widehat{{\mathop {\textrm{Var}}}}(\sqrt{G}\widehat{\texttt{FAR}})&= \frac{2}{G-1}\left[ \widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12}) \right. \nonumber \\&\left. + 2(G-2) \widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13})\right] , \end{aligned}$$
(A1)

where the components \(\widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12})\) and \(\widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13})\) are defined as:

$$\begin{aligned} \widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12})&= \frac{1}{G(G-1)}\sum _{i=1}^G \sum _{j=1, j\ne i}^G (\bar{Y}_{ij} - \widehat{\texttt{FAR}})^2, \end{aligned}$$
(A2)
$$\begin{aligned}&\widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13})\nonumber \\&= \frac{1}{G(G-1)(G-2)}\sum _{i=1}^G\sum _{\begin{array}{c} j=1 \\ j\ne i \end{array}}^G\nonumber \\&\sum _{\begin{array}{c} k=1\\ k\ne i, j \end{array}}^G (\bar{Y}_{ij} - \widehat{\texttt{FAR}}) (\bar{Y}_{ik} - \widehat{\texttt{FAR}}). \end{aligned}$$
(A3)

We want to show that, as \(G\rightarrow \infty \), \(\widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12}) \xrightarrow {p }{\mathop {\textrm{Var}}}(\bar{Y}_{12})\) and \(\widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13}) \xrightarrow {p }{\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})\). If these convergences hold, then \( \widehat{{\mathop {\textrm{Var}}}}(\sqrt{G}\widehat{\texttt{FAR}}) - {\mathop {\textrm{Var}}}(\sqrt{G}\widehat{\texttt{FAR}}) \xrightarrow {p }0\) by Slutsky's theorem.
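For concreteness, the plug-in quantities in (A1)–(A3) and the resulting normal-approximation interval can be computed as in the following sketch. The code is ours, not the authors'; it assumes a symmetric \(G\times G\) matrix Ybar whose off-diagonal entries hold the per-identity-pair mean losses \(\bar{Y}_{ij}\), and the helper name is illustrative.

```python
import numpy as np
from scipy.stats import norm

def far_plugin_ci(Ybar, alpha=0.05):
    """Plug-in variance estimate (A1)-(A3) and normal CI for FAR.

    Ybar: G x G array of per-pair mean losses; the diagonal is ignored.
    Illustrative helper, not code from the paper.
    """
    G = Ybar.shape[0]
    off = ~np.eye(G, dtype=bool)                 # impostor (off-diagonal) pairs
    far = Ybar[off].mean()                       # \widehat{FAR}

    D = np.where(off, Ybar - far, 0.0)           # centered off-diagonal entries
    var_y12 = (D ** 2).sum() / (G * (G - 1))     # \widehat{Var}(Ybar_12), eq. (A2)
    row = D.sum(axis=1)                          # sum_{j != i} (Ybar_ij - FAR)
    cov_y12_y13 = (row ** 2 - (D ** 2).sum(axis=1)).sum() / (G * (G - 1) * (G - 2))  # eq. (A3)

    var_sqrtG_far = 2.0 / (G - 1) * (var_y12 + 2 * (G - 2) * cov_y12_y13)  # eq. (A1)
    se = np.sqrt(max(var_sqrtG_far, 0.0) / G)
    z = norm.ppf(1 - alpha / 2)
    return far, (far - z * se, far + z * se)
```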

Consistency of \(\widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12})\). By Chebyshev's inequality, we have

$$\begin{aligned}{} & {} \mathbb {P}(\vert \widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12}) -{\mathop {\textrm{Var}}}(\bar{Y}_{12}) \vert \ge t)\nonumber \\{} & {} \le \frac{\mathbb {E}\left[ \left( \widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12}) -{\mathop {\textrm{Var}}}(\bar{Y}_{12})\right) ^2 \right] }{t^2}, \end{aligned}$$
(A4)

for any \(t > 0\). We will now bound the numerator of (A4). Decompose the numerator into:

$$\begin{aligned} \mathbb {E}\left[ \left( \widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12}) - {\mathop {\textrm{Var}}}(\bar{Y}_{12})\right) ^2 \right]= & {} \underbrace{{\mathop {\textrm{Var}}}(\widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12}))}_{{\textbf {Term 1}}}\nonumber \\{} & {} + \underbrace{\left( \mathbb {E}\left[ \widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12})\right] - {\mathop {\textrm{Var}}}(\bar{Y}_{12}) \right) ^2}_{{\textbf {Term 2}}}. \end{aligned}$$
(A5)

We will show below that both terms on the right-hand side of (A5) are \(O(G^{-1})\).

\(\boxed {{\textbf {Term 1}}}\) The first term in (A5) is equal to

$$\begin{aligned}{} & {} {\mathop {\textrm{Var}}}(\widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12})) = \frac{1}{G(G-1)}\nonumber \\{} & {} \bigg [ 2{\mathop {\textrm{Var}}}((\bar{Y}_{12} - \widehat{\texttt{FAR}})^2) + 4(G-2){\mathop {\textrm{Cov}}}((\bar{Y}_{12} - \widehat{\texttt{FAR}})^2,\nonumber \\{} & {} (\bar{Y}_{13} - \widehat{\texttt{FAR}})^2)+ (G-2)(G-3){\mathop {\textrm{Cov}}}((\bar{Y}_{12} - \widehat{\texttt{FAR}})^2,\nonumber \\ {}{} & {} (\bar{Y}_{34} - \widehat{\texttt{FAR}})^2) \bigg ]. \end{aligned}$$
(A6)

It is easy to see that all terms are \(O(G^{-1})\) or of smaller order.

\(\boxed {{\textbf {Term 2}}}\) The second term on the right hand side of (A5) is equal to

$$\begin{aligned}&\left[ \mathbb {E}\left[ \widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12}) \right] - {\mathop {\textrm{Var}}}(\bar{Y}_{12})\right] ^2 \nonumber \\&= \left[ \frac{{\mathop {\textrm{Var}}}(\bar{Y}_{12}) +4(G-2){\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})}{G(G-1)}\right] ^2 = O(G^{-2}). \end{aligned}$$
(A7)

The consistency of \(\widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12})\) then follows by combining the results in (A6) and (A7) with the inequality in (A4).

Consistency of \(\widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13})\). By Chebyshev's inequality,

$$\begin{aligned}{} & {} \mathbb {P}(\vert \widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13}) -{\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13}) \vert \ge t)\nonumber \\{} & {} \le \frac{\mathbb {E}\left[ \left( \widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13}) - {\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})\right) ^2 \right] }{t^2}, \end{aligned}$$
(A8)

for any \(t > 0\). We now proceed to bound the numerator of (A8). Note that

$$\begin{aligned}&\mathbb {E}\left[ \left( \widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13}) - {\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})\right) ^2 \right] \nonumber \\&\quad =\underbrace{{\mathop {\textrm{Var}}}(\widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13}))}_{{\textbf {Term 3}}}\nonumber \\&\qquad +\,\underbrace{\left( \mathbb {E}\left[ \widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13}) \right] - {\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13}) \right) ^2}_{{\textbf {Term 4}}}. \end{aligned}$$
(A9)

To complete the proof, we will show below that each of the two terms on the right-hand side of (A9) is \(O(G^{-1})\).

\(\boxed {{\textbf {Term 3}}}\) We start with the first term, the variance of the covariance estimator. We can rewrite

$$\begin{aligned}&{\mathop {\textrm{Var}}}(\widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13})) =\sum _{i=1}^G\sum _{\begin{array}{c} j=1 \\ j\ne i \end{array}}^G\sum _{\begin{array}{c} k=1\\ k\ne i, j \end{array}}^G \sum _{l=1}^G\sum _{\begin{array}{c} m=1\\ m\ne l \end{array}}^G\sum _{\begin{array}{c} n=1\\ n\ne l, m \end{array}}^G\\&\times \frac{\displaystyle {\mathop {\textrm{Cov}}}\left\{ (\bar{Y}_{ij} - \widehat{\texttt{FAR}})(\bar{Y}_{ik} - \widehat{\texttt{FAR}}), (\bar{Y}_{lm} - \widehat{\texttt{FAR}}) (\bar{Y}_{ln} - \widehat{\texttt{FAR}})\right\} }{\displaystyle G^2(G-1)^2(G-2)^2}. \end{aligned}$$

In order to show that it converges to 0, we need to prove that the number of nonzero covariance terms is of order smaller than \(G^6\).

  • Terms involving \({\mathop {\textrm{Cov}}}(\bar{Y}_{ij}\bar{Y}_{ik}, \bar{Y}_{lm}\bar{Y}_{ln})\): These terms will be zero when all indices are different, that is in \(G!/(G-6)!\) cases. Thus, \([G(G-1)(G-2)]^2 - G!/(G-6)!=O(G^5)\) of the terms in the sum above will be nonzero.

  • Terms involving \({\mathop {\textrm{Cov}}}(\bar{Y}_{ij}\bar{Y}_{ik}, \bar{Y}_{lm}\widehat{\texttt{FAR}})\): We have

    $$\begin{aligned}&{\mathop {\textrm{Cov}}}(\bar{Y}_{ij}\bar{Y}_{ik}, \bar{Y}_{lm}\widehat{\texttt{FAR}}) \\&\quad = \frac{1}{G(G-1)} \\&\qquad {\mathop {\textrm{Cov}}}\left\{ \bar{Y}_{ij}\bar{Y}_{ik}, 2\bar{Y}_{lm}^2 + 4\sum _{\begin{array}{c} n=1 \\ n\ne l,m \end{array}}^G\bar{Y}_{lm}\bar{Y}_{ln} + \sum _{\begin{array}{c} n=1\\ n\ne l,m \end{array}}^G\sum _{\begin{array}{c} p=1\\ p\ne l,m,n \end{array}}^G \bar{Y}_{lm}\bar{Y}_{np}\right\} \\&\quad = \frac{{\mathop {\textrm{Cov}}}(\bar{Y}_{ij}\bar{Y}_{ik}, 2\bar{Y}_{lm}^2)}{G(G-1)} + 4\sum _{\begin{array}{c} n=1 \\ n\ne l,m \end{array}}^G\\&\qquad \frac{{\mathop {\textrm{Cov}}}\left( \bar{Y}_{ij}\bar{Y}_{ik}, \bar{Y}_{lm}\bar{Y}_{ln}\right) }{G(G-1)} + \sum _{\begin{array}{c} n=1\\ n\ne l,m \end{array}}^G\sum _{\begin{array}{c} p=1\\ p\ne l,m,n \end{array}}^G \frac{{\mathop {\textrm{Cov}}}(\bar{Y}_{ij}\bar{Y}_{ik}, \bar{Y}_{lm}\bar{Y}_{np})}{G(G-1)}. \end{aligned}$$

    The first term will be nonzero when \(\bar{Y}_{ij}\bar{Y}_{ik}\) and \(\bar{Y}_{lm}^2\) share any of the indices, hence

    $$\begin{aligned}&\frac{1}{G(G-1)}\sum _{i=1}^G\sum _{\begin{array}{c} j=1 \\ j\ne i \end{array}}^G\sum _{\begin{array}{c} k=1\\ k\ne i, j \end{array}}^G \sum _{l=1}^G\sum _{\begin{array}{c} m=1\\ m\ne l \end{array}}^G\sum _{\begin{array}{c} n=1\\ n\ne l,m \end{array}}^G {\mathop {\textrm{Cov}}}(\bar{Y}_{ij}\bar{Y}_{ik}, 2\bar{Y}_{lm}^2) \\&\quad = \frac{G(G-1)(G-2)^2}{G(G-1)}\sum _{l=1}^G \sum _{\begin{array}{c} m=1 \\ m\ne l \end{array}}^G {\mathop {\textrm{Cov}}}(\bar{Y}_{12}\bar{Y}_{13}, 2\bar{Y}_{lm}^2), \end{aligned}$$

    which is \(O(G^3)\). The second term is \(O(G^4)\), while the third term is \(O(G^5)\).

  • Terms involving \({\mathop {\textrm{Cov}}}(\widehat{\texttt{FAR}}^2, \widehat{\texttt{FAR}}^2)\): We have

    $$\begin{aligned}&{\mathop {\textrm{Cov}}}(\widehat{\texttt{FAR}}^2, \widehat{\texttt{FAR}}^2) \\&\quad = \frac{1}{G^2(G-1)^2} \sum _{i=1}^G \sum _{\begin{array}{c} j=1 \\ j \ne i \end{array}}^G {\mathop {\textrm{Cov}}}\left\{ 2\bar{Y}_{ij}^2 + 4\sum _{\begin{array}{c} k=1 \\ k\ne i, j \end{array}}^G \bar{Y}_{ij}\bar{Y}_{ik} +\sum _{\begin{array}{c} k=1\\ k\ne i, j \end{array}}^G\sum _{\begin{array}{c} l=1\\ l\ne i, j,k \end{array}}^G \bar{Y}_{ij}\bar{Y}_{kl} , \widehat{\texttt{FAR}}^2 \right\} \\&\quad = \frac{1}{G(G-1)} {\mathop {\textrm{Cov}}}\left\{ (2\bar{Y}_{12}^2 + 4\sum _{\begin{array}{c} k=1 \\ k\ne 1, 2 \end{array}}^G \bar{Y}_{12}\bar{Y}_{1k} + \sum _{\begin{array}{c} k=1\\ k\ne 1,2 \end{array}}^G \sum _{\begin{array}{c} l=1\\ l\ne 1,2,k \end{array}}^G \bar{Y}_{12}\bar{Y}_{kl}, \widehat{\texttt{FAR}}^2\right\} \\&\quad = \frac{1}{G(G-1)} {\mathop {\textrm{Cov}}}\left\{ (2\bar{Y}_{12}^2 + 4(G-2) \bar{Y}_{12}\bar{Y}_{13} + (G-2)(G-3)\bar{Y}_{12}\bar{Y}_{34}, \widehat{\texttt{FAR}}^2\right\} . \end{aligned}$$

    The leading term in this expression is

    $$\begin{aligned}{} & {} \frac{(G-2)(G-3)}{G^3(G-1)^3}\sum _{i=1}^G\sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^G\sum _{\begin{array}{c} k=1\\ k\ne i, j \end{array}}^G\sum _{\begin{array}{c} l=1\\ l\ne i, j, k \end{array}}^G \\{} & {} \quad {\mathop {\textrm{Cov}}}(\bar{Y}_{12}\bar{Y}_{34}, \bar{Y}_{ij}\bar{Y}_{kl}) = O(G^{-1}). \end{aligned}$$
  • Terms involving \({\mathop {\textrm{Cov}}}(\bar{Y}_{ij}\bar{Y}_{ik}, \widehat{\texttt{FAR}}^2)\) and \({\mathop {\textrm{Cov}}}(\bar{Y}_{ij}\widehat{\texttt{FAR}}, \widehat{\texttt{FAR}}^2)\): These terms are handled in a similar manner and their proofs are omitted.

Thus, we have shown that

$$\begin{aligned} {\mathop {\textrm{Var}}}(\widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13})) = O(G^5/G^6) = O(G^{-1}). \end{aligned}$$
(A10)

\(\boxed {{\textbf {Term 4}}}\) We now turn to the second term, which is the squared bias. We have

$$\begin{aligned}{} & {} \mathbb {E}\left[ \widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13}) \right] = \left[ 1 - \frac{4(G-2)}{G(G-1)}\right] \mathbb {E}[\bar{Y}_{12}\bar{Y}_{13}] \\{} & {} -\frac{2}{G(G-1)}\mathbb {E}[\bar{Y}_{12}^2] - \texttt{FAR}^2 \frac{(G-2)(G-3)}{G(G-1)}. \end{aligned}$$

It follows that the bias is given by

$$\begin{aligned}{} & {} \mathbb {E}\left[ \widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13}) \right] - {\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13}) \\{} & {} \quad = - \frac{4(G-2)}{G(G-1)} \mathbb {E}[\bar{Y}_{12}\bar{Y}_{13}] - \frac{2}{G(G-1)}\mathbb {E}[\bar{Y}_{12}^2] \\{} & {} \qquad - \frac{2(2G - 3)}{G(G-1)}\texttt{FAR}^2. \end{aligned}$$

Thus, we have

$$\begin{aligned} \left( \mathbb {E}\widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13}) - {\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})\right) ^2 = O(G^{-2}), \end{aligned}$$
(A11)

which goes to 0 as \(G\rightarrow \infty \).

Putting (A10) and (A11) together, along with (A8), the result then follows. This completes the proof of the consistency of \(\widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13})\).

1.1.3 Proof of Proposition 3 (equivalence of plug-in and jackknife variance estimators)

Recall the form of the jackknife variance estimator of \({\mathop {\textrm{Var}}}(\sqrt{G} \widehat{\texttt{FAR}})\):

$$\begin{aligned} \widehat{{\mathop {\textrm{Var}}}}_{JK}(\sqrt{G}\widehat{\texttt{FAR}}){} & {} = \frac{(G-2)^2}{G} \sum _{i=1}^G (\widehat{\texttt{FAR}}_{-i} - \widehat{\texttt{FAR}})^2\nonumber \\{} & {} \quad - 2\frac{\widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12})}{G-1}, \end{aligned}$$
(A12)

where \(\widehat{\texttt{FAR}}_{-i}\) is defined as

$$\begin{aligned} \widehat{\texttt{FAR}}_{-i} = \frac{1}{(G - 1) (G - 2)} \sum _{j=1}^G \sum _{\begin{array}{c} k=1,\\ k\ne j \end{array}}^{G} \bar{Y}_{jk}\mathbb {1}(\{j\ne i\}\cap \{k\ne i\}). \nonumber \\ \end{aligned}$$
(A13)

Recall also the estimator for \({\mathop {\textrm{Var}}}(\bar{Y}_{12})\):

$$\begin{aligned}{} & {} \widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12}) = \frac{1}{G(G-1)}\sum _{i=1}^G\sum _{\begin{array}{c} j=1,\\ j\ne i \end{array}}^{G} (\bar{Y}_{ij}-\widehat{\texttt{FAR}})^2. \end{aligned}$$
(A14)

Through a series of algebraic manipulations, we will show that after substituting (A13) and (A14), the expression (A12) simplifies to the plug-in estimator from (9).

Towards that end, we start by expanding the sum in the first term on the right-hand side of (A12):

$$\begin{aligned}&\sum _{i=1}^G ( \widehat{\texttt{FAR}}_{-i} - \widehat{\texttt{FAR}})^2 \nonumber \\&\quad = \sum _{i=1}^G \left( \frac{\sum _{k=1}^G\sum _{\begin{array}{c} l=1 \\ l \ne k \end{array}}^G \bar{Y}_{kl} - 2\sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^G \bar{Y}_{ij}}{(G-1)(G-2)} - \widehat{\texttt{FAR}}\right) ^2 \nonumber \\&\quad = \frac{4}{(G-2)^2} \sum _{i=1}^G \left( \sum _{\begin{array}{c} j=1 \\ j\ne i \end{array}}^G\frac{\bar{Y}_{ij}}{G-1} - \widehat{\texttt{FAR}}\right) ^2 \nonumber \\&\quad = \frac{4}{(G-2)^2(G-1)^2}\sum _{i=1}^G\left[ \sum _{\begin{array}{c} j=1 \\ j\ne i \end{array}}^G(\bar{Y}_{ij} - \widehat{\texttt{FAR}})^2\right. \nonumber \\&\qquad \left. +\sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^G\sum _{\begin{array}{c} k=1\\ k\ne i, j \end{array}}^G(\bar{Y}_{ij} - \widehat{\texttt{FAR}})(\bar{Y}_{ik} - \widehat{\texttt{FAR}}) \right] . \end{aligned}$$
(A15)

Moving the appropriate factor across and subtracting the second term on the right-hand side of (A12), we arrive at

$$\begin{aligned}&\frac{(G-2)^2}{G} \sum _{i=1}^G (\widehat{\texttt{FAR}}_{-i} - \widehat{\texttt{FAR}})^2 - 2\frac{\widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12})}{G-1}\nonumber \\&\quad = \frac{4\sum _{i=1}^G\sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^G(\bar{Y}_{ij} - \widehat{\texttt{FAR}})^2}{G(G-1)^2}\nonumber \\&\qquad + \frac{4\sum _{i=1}^G\sum _{\begin{array}{c} j = 1\\ j\ne i \end{array}}^G\sum _{\begin{array}{c} k = 1\\ k\ne i, j \end{array}}^G(\bar{Y}_{ij}-\widehat{\texttt{FAR}})(\bar{Y}_{ik} - \widehat{\texttt{FAR}})}{G(G-1)^2} \nonumber \\&\qquad - 2\frac{\widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12})}{G-1} = \frac{2\widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12}) +4\widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13})(G-2)}{G-1}\nonumber \\&\qquad + 2\frac{\widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12})}{G-1} - 2\frac{\widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12})}{G-1} \quad =\frac{2}{G-1}\widehat{{\mathop {\textrm{Var}}}}(\bar{Y}_{12})\nonumber \\&\qquad +\frac{4 (G-2)}{G-1} \widehat{{\mathop {\textrm{Cov}}}}(\bar{Y}_{12}, \bar{Y}_{13}), \end{aligned}$$
(A16)

Noting that (A16) matches (9), we have that \( \widehat{{\mathop {\textrm{Var}}}}_{JK}(\sqrt{G} \widehat{\texttt{FAR}}) = \widehat{{\mathop {\textrm{Var}}}}(\sqrt{G} \widehat{\texttt{FAR}})\), as claimed.
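The identity between (A12) and (9) can also be checked numerically. The following sketch is ours; a random symmetric matrix stands in for the pair-level means \(\bar{Y}_{ij}\), and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
G = 20
A = rng.random((G, G))
Ybar = (A + A.T) / 2                     # symmetric stand-in for pair-level means
off = ~np.eye(G, dtype=bool)
far = Ybar[off].mean()

# Plug-in estimator, eqs. (A1)/(A16)
D = np.where(off, Ybar - far, 0.0)
var_y12 = (D ** 2).sum() / (G * (G - 1))
row = D.sum(axis=1)
cov_y12_y13 = (row ** 2 - (D ** 2).sum(axis=1)).sum() / (G * (G - 1) * (G - 2))
plug_in = 2.0 / (G - 1) * (var_y12 + 2 * (G - 2) * cov_y12_y13)

# Jackknife estimator, eqs. (A12)-(A13): leave one identity out at a time
idx = [np.delete(np.arange(G), i) for i in range(G)]
far_minus = np.array([Ybar[np.ix_(ix, ix)][~np.eye(G - 1, dtype=bool)].mean() for ix in idx])
jackknife = (G - 2) ** 2 / G * ((far_minus - far) ** 2).sum() - 2 * var_y12 / (G - 1)

assert np.isclose(plug_in, jackknife)    # the two variance estimators agree
```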

1.2 Proofs for resampling-based methods in Sect. 4.2

1.2.1 Proof of Proposition 4 (bias of subsets bootstrap estimators)

Recall that \(\widehat{\texttt{FRR}}_b^*\) and \(\widehat{\texttt{FAR}}_b^*\) indicate the \(\texttt{FRR}\) and \(\texttt{FAR}\) estimates respectively based on the b-th bootstrap sample. The proofs for various statements in the proposition are separated below.

  • Showing that \({\mathop {\textrm{Bias}}}(\widehat{\texttt{FRR}}_b^*) = 0\) and \({\mathop {\textrm{Bias}}}({\mathop {\textrm{Var}}}^*(\sqrt{G}\widehat{\texttt{FRR}}^*)) \) \(= - {\mathop {\textrm{Var}}}(\widehat{\texttt{FRR}})\) is straightforward.

    • For \({\mathop {\textrm{Bias}}}(\widehat{\texttt{FRR}}_b^*)\), it is easy to see that

      $$\begin{aligned} \mathbb {E}[\mathbb {E}^*[\widehat{\texttt{FRR}}_b^*]]&= \frac{1}{G}\sum _{i=1}^G\mathbb {E}[\mathbb {E}^*[W_i]\bar{Y}_{ii}] \\&= \mathbb {E}[\bar{Y}_{11}] = \texttt{FRR}. \end{aligned}$$

      Hence, \({\mathop {\textrm{Bias}}}(\widehat{\texttt{FRR}}_b^*) = 0\).

    • Towards computing \({\mathop {\textrm{Bias}}}({\mathop {\textrm{Var}}}^*(\sqrt{G}\widehat{\texttt{FRR}}_b^*))\), observe that

      $$\begin{aligned} \mathbb {E}[{\mathop {\textrm{Var}}}^*[\widehat{\texttt{FRR}}_b^*]]&= \frac{1}{G^2}\sum _{i=1}^G \mathbb {E}\left\{ {\mathop {\textrm{Var}}}^*(W_i)\bar{Y}_{ii}^2 + \sum _{\begin{array}{c} k=1 \\ k\ne i \end{array}}^G{\mathop {\textrm{Cov}}}^*(W_{i}, W_{k})\bar{Y}_{ii}\bar{Y}_{kk}\right\} \\&= \frac{1}{G^2}\sum _{i=1}^G \mathbb {E}\left\{ \frac{G-1}{G} \bar{Y}_{ii}^2 - \frac{1}{G}\sum _{\begin{array}{c} k=1 \\ k\ne i \end{array}}^G\bar{Y}_{ii}\bar{Y}_{kk}\right\} = \frac{G-1}{G^2}\mathbb {E}[\bar{Y}_{11}^2] - \frac{G-1}{G^2}\texttt{FRR}^2 = \frac{G-1}{G}{\mathop {\textrm{Var}}}(\widehat{\texttt{FRR}}). \end{aligned}$$

      Thus, we have \( {\mathop {\textrm{Bias}}}({\mathop {\textrm{Var}}}^*(\sqrt{G} \widehat{\texttt{FRR}}_b^*)) = (G - 1) {\mathop {\textrm{Var}}}(\widehat{\texttt{FRR}}) - G {\mathop {\textrm{Var}}}(\widehat{\texttt{FRR}}) = - {\mathop {\textrm{Var}}}(\widehat{\texttt{FRR}})\), as claimed.

  • Obtaining expressions for \({\mathop {\textrm{Bias}}}(\widehat{\texttt{FAR}}_b^*)\) and \({\mathop {\textrm{Bias}}}({\mathop {\textrm{Var}}}^*\) \((\sqrt{G} \widehat{\texttt{FAR}}_b^*))\) is slightly more involved.

    • For \({\mathop {\textrm{Bias}}}(\widehat{\texttt{FAR}}_b^*)\), note that

      $$\begin{aligned} \mathbb {E}[\mathbb {E}^*[\widehat{\texttt{FAR}}_b^*]]= & {} \frac{1}{G(G-1)}\sum _{i=1}^G\mathbb {E}\left[ \mathbb {E}^*[W_i] \sum _{\begin{array}{c} j=1 \\ j\ne i \end{array}}^G \bar{Y}_{ij} \right] \\= & {} \frac{1}{G(G-1)}\sum _{i=1}^G\sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^G\mathbb {E}\left[ \bar{Y}_{ij}\right] = \texttt{FAR}. \end{aligned}$$

      Hence, \({\mathop {\textrm{Bias}}}(\widehat{\texttt{FAR}}_b^*) = 0\).

    • For \({\mathop {\textrm{Var}}}^*(\widehat{\texttt{FAR}}_b^*)\), observe that

      $$\begin{aligned}&\mathbb {E}[{\mathop {\textrm{Var}}}^*[\widehat{\texttt{FAR}}_b^*]] = \frac{1}{G^2}\sum _{i=1}^G \mathbb {E}\left\{ \frac{\left( \sum _{\begin{array}{c} j=1 \\ j \ne i \end{array}}^G \bar{Y}_{ij}\right) ^2}{(G-1)^2} {\mathop {\textrm{Var}}}^*\left( W_i \right) + \sum _{\begin{array}{c} k=1 \\ k\ne i \end{array}}^G \frac{\sum _{\begin{array}{c} j=1\\ j \ne i \end{array}}^G \bar{Y}_{ij}\sum _{\begin{array}{c} l=1\\ l \ne k \end{array}}^G \bar{Y}_{kl}}{(G-1)^2} {\mathop {\textrm{Cov}}}^*\left( W_i, W_k \right) \right\} \\&\quad = \frac{1}{G^2}\sum _{i=1}^G \mathbb {E}\left\{ \frac{\left( \sum _{\begin{array}{c} j=1 \\ j \ne i \end{array}}^G \bar{Y}_{ij}\right) ^2}{(G-1)^2} \frac{G-1}{G} - \sum _{\begin{array}{c} k=1 \\ k\ne i \end{array}}^G \frac{\sum _{\begin{array}{c} j=1\\ j \ne i \end{array}}^G \bar{Y}_{ij}\sum _{\begin{array}{c} l=1\\ l \ne k \end{array}}^G \bar{Y}_{kl}}{(G-1)^2} \frac{1}{G} \right\} . \end{aligned}$$

      We thus have

      $$\begin{aligned}&\mathbb {E}[{\mathop {\textrm{Var}}}^*[\widehat{\texttt{FAR}}_b^*]] = \frac{1}{G^2}\sum _{i=1}^G \mathbb {E}\left\{ \frac{\left( \sum _{\begin{array}{c} j=1 \\ j \ne i \end{array}}^G\bar{Y}_{ij}\right) ^2}{(G-1)^2} - \widehat{\texttt{FAR}}^2 \right\} \nonumber \\&\quad = \frac{1}{G} \mathbb {E}\left\{ \frac{\bar{Y}_{12}^2 + (G-2)\bar{Y}_{12}\bar{Y}_{13}}{G-1} - \widehat{\texttt{FAR}}^2\right\} = \frac{1}{G} \mathbb {E}\Bigg \{\frac{\bar{Y}_{12}^2 + (G-2)\bar{Y}_{12}\bar{Y}_{13}}{G-1} \nonumber \\&\qquad \qquad \qquad - \left[ \frac{2\bar{Y}_{12}^2 + 4(G-2)\bar{Y}_{12}\bar{Y}_{13} + (G-2)(G-3)\bar{Y}_{12}\bar{Y}_{34}}{G(G-1)} \right] \Bigg \}. \end{aligned}$$
      (A17)

      We can rewrite the first of the two terms in (A17) as

      $$\begin{aligned}&\frac{1}{G} \mathbb {E}\left\{ \frac{\bar{Y}_{12}^2 + (G-2)\bar{Y}_{12}\bar{Y}_{13}}{G-1}\right\} = {\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}}) - \frac{{\mathop {\textrm{Var}}}(\bar{Y}_{12}) + 3(G-2){\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})}{G(G-1)} + \frac{\texttt{FAR}^2}{G}, \end{aligned}$$
      (A18)

      and the second as

      $$\begin{aligned} \frac{1}{G}\mathbb {E}\left\{ \frac{2\bar{Y}_{12}^2 + 4(G-2)\bar{Y}_{12}\bar{Y}_{13} + (G-2)(G-3)\bar{Y}_{12}\bar{Y}_{34}}{G(G-1)} \right\} = \frac{{\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}})}{G} +\frac{\texttt{FAR}^2}{G}. \end{aligned}$$
      (A19)

      Thus, combining (A18) and (A19) with (A17), we obtain

      $$\begin{aligned}&\mathbb {E}[{\mathop {\textrm{Var}}}^*[\widehat{\texttt{FAR}}_b^*]] = \frac{G-1}{G}{\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}}) \\&\quad - \frac{{\mathop {\textrm{Var}}}(\bar{Y}_{12}) + 3(G-2){\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})}{G(G-1)}. \end{aligned}$$

      Therefore, we have

      $$\begin{aligned}&{\mathop {\textrm{Bias}}}({\mathop {\textrm{Var}}}^*(\sqrt{G} \widehat{\texttt{FAR}}_b^*)) = (G - 1) {\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}})\\&- \frac{{\mathop {\textrm{Var}}}(\bar{Y}_{12}) + 3(G-2){\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})}{(G-1)} - G {\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}}) \\&= - {\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}}) \\&- \frac{{\mathop {\textrm{Var}}}(\bar{Y}_{12}) + 3(G-2){\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})}{(G-1)}, \end{aligned}$$

      as promised.

This completes the bias derivations for subsets bootstrap estimators.
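For intuition, the replicates analyzed above can be generated as in the following sketch. This is our reading of the subsets bootstrap as it enters Proposition 4: the weights \(W_i\) are multinomial resampling counts (consistent with \({\mathop {\textrm{Var}}}^*(W_i) = (G-1)/G\) and \({\mathop {\textrm{Cov}}}^*(W_i, W_k) = -1/G\) used in the proof), and each drawn identity contributes its full row of impostor comparisons; variable names are ours.

```python
import numpy as np

def subsets_bootstrap_far(Ybar, n_boot=2000, seed=0):
    """FAR replicates with identity-level multinomial weights (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    G = Ybar.shape[0]
    off = ~np.eye(G, dtype=bool)
    row_sums = np.where(off, Ybar, 0.0).sum(axis=1)   # sum_{j != i} Ybar_ij
    reps = np.empty(n_boot)
    for b in range(n_boot):
        W = np.bincount(rng.integers(0, G, size=G), minlength=G)  # resampling counts
        reps[b] = (W * row_sums).sum() / (G * (G - 1))            # FAR_b^*
    return reps
```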

1.2.2 Proof of Proposition 5 (bias of vertex bootstrap estimators)

Recall from (13) the expression for \(\widehat{\texttt{FAR}}_b^*\), the estimator for \(\texttt{FAR}\) based on the b-th bootstrap sample using vertex bootstrap:

$$\begin{aligned} \widehat{\texttt{FAR}}_b^*= \sum _{i, j=1}^G W_i\bigg [ \frac{(W_i - 1) \widehat{\texttt{FAR}}\mathbb {1}(i=j)}{G(G-1)} + \frac{W_j \bar{Y}_{ij} \mathbb {1}(i\ne j) }{G(G-1)}\bigg ]. \end{aligned}$$
  • We start with deriving \({\mathop {\textrm{Bias}}}(\widehat{\texttt{FAR}}_b^*)\). Note that

    $$\begin{aligned}&\mathbb {E}\left\{ \mathbb {E}^*[\widehat{\texttt{FAR}}_b^*] \right\} \\&\quad = \frac{1}{G(G-1)}\sum _{i, j=1}^G\mathbb {E}\left\{ \mathbb {E}^*\left[ W_i(W_i-1)\mathbb {1}(i=j) \right] \right. \\&\qquad \left. \widehat{\texttt{FAR}}+ \bar{Y}_{ij} \mathbb {E}^*\left[ W_iW_j \mathbb {1}(i \ne j) \right] \right\} \\&\quad = \frac{1}{G(G-1)}\sum _{i=1}^G\mathbb {E}^*\left[ \widehat{\texttt{FAR}}\frac{G-1}{G} + \sum _{\begin{array}{c} j=1 \\ j \ne i \end{array}}^G\bar{Y}_{ij} \frac{G-1}{G} \right] \\&\quad = \frac{1}{G(G-1)} \left[ G\texttt{FAR}\frac{G-1}{G} + G(G-1)\texttt{FAR}\frac{G-1}{G}\right] \\&\quad = \texttt{FAR}. \end{aligned}$$

    Thus, \({\mathop {\textrm{Bias}}}(\widehat{\texttt{FAR}}_b^*) = 0\).

  • We next turn to deriving \({\mathop {\textrm{Bias}}}({\mathop {\textrm{Var}}}^*(\widehat{\texttt{FAR}}_b^*))\). Let \(\bar{Y}^*_{ij}\) denote the observation corresponding to the i-th and j-th identities in the b-th bootstrap sample, where the subscript b is omitted. It is easy to see that

    $$\begin{aligned}&\mathbb {E}\left\{ {\mathop {\textrm{Var}}}^*(\widehat{\texttt{FAR}}_b^*)\right\} \\&\quad = \frac{1}{G(G-1)}\mathbb {E}\bigg \{2 {\mathop {\textrm{Var}}}^*(\bar{Y}_{12}^*) + 4(G-2)\\&{\mathop {\textrm{Cov}}}^*\left( \bar{Y}_{12}^*, \bar{Y}_{13}^*\right) + (G-2)(G-3){\mathop {\textrm{Cov}}}^*\left( \bar{Y}_{12}^*, \bar{Y}_{34}^*\right) \bigg \}. \end{aligned}$$

    For the variance term \({\mathop {\textrm{Var}}}^*(\bar{Y}_{12}^*)\), we have

    $$\begin{aligned}&\mathbb {E}\left\{ {\mathop {\textrm{Var}}}^*(\bar{Y}_{12}^*)\right\} \nonumber \\&= \mathbb {E}\left\{ \mathbb {E}^*\left[ \bar{Y}_{12}^{2*}\right] - \mathbb {E}^*\left[ \bar{Y}_{12}^*\right] ^2\right\} \nonumber \\&= \frac{G-1}{G}\mathbb {E}\left[ \bar{Y}_{12}^2\right] \nonumber \\&\quad + \frac{1}{G}\mathbb {E}\left[ \widehat{\texttt{FAR}}^2\right] - \mathbb {E}\left[ \widehat{\texttt{FAR}}^2\right] \nonumber \\&= \frac{G-1}{G}\left\{ \mathbb {E}[\bar{Y}_{12}^2] - \mathbb {E}[\widehat{\texttt{FAR}}^2] \right\} \nonumber \\ {}&= \frac{G-1}{G} \left\{ {\mathop {\textrm{Var}}}(\bar{Y}_{12}) - {\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}}) \right\} . \end{aligned}$$
    (A20)

    For the covariance term \({\mathop {\textrm{Cov}}}^*\left( \bar{Y}_{12}^*, \bar{Y}_{13}^*\right) \), we can show that

    $$\begin{aligned}&\mathbb {E}\left\{ {\mathop {\textrm{Cov}}}^*(\bar{Y}_{12}^*, \bar{Y}_{13}^*)\right\} \\&= \mathbb {E}\left\{ \mathbb {E}^*[\bar{Y}_{12}^*\bar{Y}_{13}^*] - \mathbb {E}^*[\bar{Y}_{12}^*]\mathbb {E}^*[\bar{Y}_{13}^*]\right\} \\&= \mathbb {E}\left\{ \frac{2G-1}{G^2}\widehat{\texttt{FAR}}^2 + \frac{(G-1)^2}{G^2} \left( \frac{1}{G-1}\bar{Y}_{12}^2 + \frac{G-2}{G-1}\bar{Y}_{12}\bar{Y}_{13}\right) - \widehat{\texttt{FAR}}^2\right\} \\&= \mathbb {E}\left\{ \frac{(G-1)^2}{G^2} \left( \frac{1}{G-1}\bar{Y}_{12}^2 + \frac{G-2}{G-1}\bar{Y}_{12}\bar{Y}_{13}\right) - \frac{(G-1)^2}{G^2} \widehat{\texttt{FAR}}^2\right\} \\&=\frac{(G-1)^2}{G}\frac{1}{G}\mathbb {E}\left\{ \frac{\bar{Y}_{12}^2}{G-1} + \frac{(G-2)\bar{Y}_{12}\bar{Y}_{13}}{G-1} - \widehat{\texttt{FAR}}^2\right\} . \end{aligned}$$

    By following the same derivation as in (A17), we can further show that

    $$\begin{aligned}&\mathbb {E}\left\{ {\mathop {\textrm{Cov}}}^*(\bar{Y}_{12}^*, \bar{Y}_{13}^*)\right\} \nonumber \\&\quad = \frac{(G-1)^2}{G} \left\{ \frac{G-1}{G}{\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}})\right. \nonumber \\&\left. \quad - \frac{{\mathop {\textrm{Var}}}(\bar{Y}_{12}) + 3(G-2){\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})}{G(G-1)}\right\} . \end{aligned}$$
    (A21)

    Thus, combining (A20) and (A21), together with the fact that \({\mathop {\textrm{Cov}}}(\bar{Y}_{12}^*, \bar{Y}_{34}^*) = 0\) (by independence), yields

    $$\begin{aligned}&\mathbb {E}\left\{ {\mathop {\textrm{Var}}}^*(\widehat{\texttt{FAR}}_b^*)\right\} \\&\quad = \frac{1}{G(G-1)} \bigg \{ 2\frac{G-1}{G}\left[ {\mathop {\textrm{Var}}}(\bar{Y}_{12}) - {\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}}) \right] \\&\quad \quad + \frac{4(G-1)^2(G-2)}{G}\left[ \frac{G-1}{G}{\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}})\right. \\&\quad \left. - \frac{{\mathop {\textrm{Var}}}(\bar{Y}_{12}) + 3(G-2){\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})}{G(G-1)} \right] \bigg \} \\&\quad = \frac{2}{G^2}\left[ {\mathop {\textrm{Var}}}(\bar{Y}_{12}) - {\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}})\right] \\&\quad \quad + \frac{4(G-1)(G-2)}{G^2}\left[ \frac{G-1}{G}{\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}}) \right. \\&\left. \quad \quad - \frac{{\mathop {\textrm{Var}}}(\bar{Y}_{12}) + 3(G-2){\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})}{G(G-1)} \right] \\&\quad = \frac{2}{G^2}\left[ {\mathop {\textrm{Var}}}(\bar{Y}_{12}) - {\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}})\right] \\&\quad \quad + \frac{4(G-1)(G-2)}{G^2}\\&\quad \left[ \frac{{\mathop {\textrm{Var}}}(\bar{Y}_{12}) + (G-2){\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})}{G(G-1)} - \frac{{\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}})}{G} \right] \\&\quad = {\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}}) - \frac{2{\mathop {\textrm{Var}}}(\bar{Y}_{12})}{G^2(G-1)} - \frac{2{\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}})}{G^2} \\&\quad \quad - \frac{4(3G-2)(G-2)}{G^3(G-1)}{\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13}) \\&\quad \quad + \frac{4(G-2)}{G^3}{\mathop {\textrm{Var}}}(\bar{Y}_{12}) - \frac{4(G-1)(G-2)}{G^3}{\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}}). \end{aligned}$$

    We can rearrange the terms to obtain

    $$\begin{aligned}&{\mathop {\textrm{Bias}}}({\mathop {\textrm{Var}}}^*(\widehat{\texttt{FAR}}_b^*)) \\&\quad = {\mathop {\textrm{Var}}}(\bar{Y}_{12}) \left[ \frac{4(G-2)}{G^3} + O\left( G^{-3}\right) \right] \\&\qquad + {\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})\left[ -\frac{28}{G(G-1)} + O(G^{-3}) \right] . \end{aligned}$$
    (A22)

    Up to constants, the expression in (A22) matches with the expression in the statement.

This completes the derivations of the bias for the vertex bootstrap estimators.
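As a companion to the derivation, the vertex-bootstrap replicates in (13) can be simulated as in the sketch below; the code is ours, with comparisons between two copies of the same identity replaced by the observed \(\widehat{\texttt{FAR}}\), exactly as in (13).

```python
import numpy as np

def vertex_bootstrap_far(Ybar, n_boot=2000, seed=0):
    """FAR replicates under the vertex bootstrap of (13) (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    G = Ybar.shape[0]
    off = ~np.eye(G, dtype=bool)
    far = Ybar[off].mean()
    reps = np.empty(n_boot)
    for b in range(n_boot):
        W = np.bincount(rng.integers(0, G, size=G), minlength=G)   # identity multiplicities
        same = (W * (W - 1)).sum() * far                           # i = j terms in (13)
        cross = (np.outer(W, W) * np.where(off, Ybar, 0.0)).sum()  # i != j terms in (13)
        reps[b] = (same + cross) / (G * (G - 1))
    return reps  # e.g. np.percentile(reps, [2.5, 97.5]) gives a 95% interval
```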

1.2.3 Proof of Proposition 6 (bias of double-or-nothing bootstrap estimators)

Recall from (14) the expressions for \(\widehat{\texttt{FRR}}_b^*\) and \(\widehat{\texttt{FAR}}_b^*\), the estimators of \(\texttt{FRR}\) and \(\texttt{FAR}\) based on the b-th bootstrap sample using the double-or-nothing bootstrap:

$$\begin{aligned} \widehat{\texttt{FRR}}_b^* = \frac{\sum _{i=1}^G W_i \bar{Y}_{ii}}{\sum _{i=1}^G W_i}, \quad \text {and} \quad \widehat{\texttt{FAR}}_b^*= \frac{\sum _{\begin{array}{c} i, j=1 \\ j\ne i \end{array}}^G W_iW_j\bar{Y}_{ij}}{\sum _{\begin{array}{c} i, j=1\\ j\ne i \end{array}}^G W_iW_j}, \end{aligned}$$

where \(\mathbb {E}[W_i] = 1\), \({\mathop {\textrm{Var}}}(W_i) = \tau \), and \(W_i\) is independent of \(W_j\) whenever \(i\ne j\) for \(i, j\in \mathcal {G}\). The double-or-nothing bootstrap falls in this framework when \(\tau = 1\).

  • It is straightforward to show that \({\mathop {\textrm{Bias}}}(\widehat{\texttt{FRR}}_b^*) = 0\). Next, we examine \({\mathop {\textrm{Bias}}}({\mathop {\textrm{Var}}}^*(\widehat{\texttt{FRR}}_b^*))\). Let \(T^*=\sum _{i=1}^G W_i\bar{Y}_{ii}\) and \(N^*=\sum _{i=1}^GW_i\). Through an application of the Delta method, we obtain

    $$\begin{aligned}&\mathbb {E}\left\{ {\mathop {\textrm{Var}}}^*(\widehat{\texttt{FRR}}_b^*)\right\} = \frac{1}{G^2}\mathbb {E}\left\{ {\mathop {\textrm{Var}}}^*(T^*) \right. \\&\left. - 2\widehat{\texttt{FRR}}{\mathop {\textrm{Cov}}}^*(T^*, N^*) + \widehat{\texttt{FRR}}^2 {\mathop {\textrm{Var}}}^*(N^*)\right\} , \end{aligned}$$

    where

    $$\begin{aligned} \mathbb {E}[{\mathop {\textrm{Var}}}^*(T^*)]&= \mathbb {E}\left[ \sum _{i=1}^G\bar{Y}_{ii}^2\tau \right] = \tau G\mathbb {E}[\bar{Y}_{11}^2], \\&\mathbb {E}[\widehat{\texttt{FRR}}{\mathop {\textrm{Cov}}}^*(T^*, N^*)]\\&= \mathbb {E}[\widehat{\texttt{FRR}}^2{\mathop {\textrm{Var}}}^*(N^*)] = G\tau \mathbb {E}[\widehat{\texttt{FRR}}^2] \\&= \tau \left[ \mathbb {E}[\bar{Y}_{11}^2] + \texttt{FRR}^2(G-1) \right] . \end{aligned}$$

    Hence, we have

    $$\begin{aligned} \mathbb {E}[{\mathop {\textrm{Var}}}^*(\widehat{\texttt{FRR}}_b^*)] = \frac{G-1}{G} \tau {\mathop {\textrm{Var}}}(\widehat{\texttt{FRR}}). \end{aligned}$$

    Taking \(\tau =1\) yields the result.

  • With respect to the \(\texttt{FAR}\), let \(T^*=\sum _{i=1}^G\sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^GW_i W_j\bar{Y}_{ij}\) and \(N^*=\sum _{i=1}^G\sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^GW_iW_j\). Again, it is easy to see that \({\mathop {\textrm{Bias}}}(\widehat{\texttt{FAR}}_b^*) = 0\). An application of the Delta method yields

    $$\begin{aligned}{} & {} \mathbb {E}\left\{ {\mathop {\textrm{Var}}}^*(\widehat{\texttt{FAR}}_b^*)\right\} = \frac{1}{G^2(G-1)^2}\\{} & {} \mathbb {E}\left\{ {\mathop {\textrm{Var}}}^*(T^*) - 2\widehat{\texttt{FAR}}{\mathop {\textrm{Cov}}}^*(T^*, N^*) + \widehat{\texttt{FAR}}^2 {\mathop {\textrm{Var}}}^*(N^*)\right\} , \end{aligned}$$

    where

    $$\begin{aligned}&\mathbb {E}[{\mathop {\textrm{Var}}}^*(T^*)] =G(G-1)\mathbb {E}\left\{ 2{\mathop {\textrm{Var}}}^*(\bar{Y}_{12}W_1W_2) \right. \\&\left. + 4(G-2){\mathop {\textrm{Cov}}}^*(\bar{Y}_{12}W_1W_2, \bar{Y}_{13}W_1W_3) \right\} \\&=G(G-1)\mathbb {E}\left\{ 2\bar{Y}_{12}^2(\mathbb {E}^*[W_1^2]\mathbb {E}^*[W_2^2]-1) \right. \\&\left. + 4(G-2)\bar{Y}_{12}\bar{Y}_{13}(\mathbb {E}^*[W_1^2] - 1)\right\} \\&=G(G-1) [2\tau (\tau + 2) \mathbb {E}[\bar{Y}_{12}^2] + 4(G-2)\tau \mathbb {E}[\bar{Y}_{12}\bar{Y}_{13}]], \end{aligned}$$

    and

    $$\begin{aligned}&\mathbb {E}\left[ \widehat{\texttt{FAR}}{\mathop {\textrm{Cov}}}^*(T^*, N^*)\right] \\&= \mathbb {E}[\widehat{\texttt{FAR}}^2 {\mathop {\textrm{Var}}}^*(N^*)] \\&= G(G-1)[2\tau (\tau + 2) + 4(G-2)\tau ]\mathbb {E}[\widehat{\texttt{FAR}}^2]. \end{aligned}$$

    It then follows that

    $$\begin{aligned}&\mathbb {E}[{\mathop {\textrm{Var}}}^*(\widehat{\texttt{FAR}}_b^*)] = \frac{1}{G(G-1)}\left[ 2\tau (\tau + 2)\right. \\&\left. \mathbb {E}[\bar{Y}_{12}^2 - \widehat{\texttt{FAR}}^2] + 4(G-2)\tau \mathbb {E}[\bar{Y}_{12}\bar{Y}_{13} - \widehat{\texttt{FAR}}^2] \right] \\&= \frac{1}{G(G-1)}\left[ 2\tau (\tau + 2){\mathop {\textrm{Var}}}(\bar{Y}_{12}) + 4(G-2)\tau {\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})\right] \\ {}&\quad \quad - \frac{2\tau (\tau + 2G - 2)}{G(G-1)}\frac{2{\mathop {\textrm{Var}}}(\bar{Y}_{12}) + 4(G-2){\mathop {\textrm{Cov}}}(\bar{Y}_{12}, \bar{Y}_{13})}{G(G-1)}. \end{aligned}$$

    Choosing \(\tau = 1\) yields

    $$\begin{aligned}&{\mathop {\textrm{Bias}}}({\mathop {\textrm{Var}}}^*(\widehat{\texttt{FAR}}_b^*)) = -\frac{2(2G -1)}{G(G-1)}{\mathop {\textrm{Var}}}(\widehat{\texttt{FAR}}) \\&\quad + \frac{4{\mathop {\textrm{Var}}}(\bar{Y}_{12})}{G(G-1)}. \end{aligned}$$

This completes the bias derivations for the double-or-nothing bootstrap estimators.
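The replicates in (14) are equally easy to simulate. A minimal sketch follows, assuming Ybar stores the genuine-pair means \(\bar{Y}_{ii}\) on the diagonal and the impostor-pair means off the diagonal, and that each \(W_i\) is drawn independently as 0 or 2 with equal probability (so \(\mathbb {E}[W_i]=1\) and \(\tau =1\)); resamples with a vanishing denominator are skipped, and the percentile interval at the end is our choice of interval construction, not the paper's.

```python
import numpy as np

def double_or_nothing_cis(Ybar, n_boot=2000, alpha=0.05, seed=0):
    """Percentile intervals for FRR and FAR from the replicates in (14) (sketch)."""
    rng = np.random.default_rng(seed)
    G = Ybar.shape[0]
    off = ~np.eye(G, dtype=bool)
    frr_b, far_b = [], []
    for _ in range(n_boot):
        W = 2 * rng.integers(0, 2, size=G)        # each identity kept twice or dropped
        WW = np.outer(W, W)
        if W.sum() == 0 or WW[off].sum() == 0:
            continue                              # degenerate resample, skip
        frr_b.append((W * np.diag(Ybar)).sum() / W.sum())
        far_b.append((WW * Ybar)[off].sum() / WW[off].sum())
    q = [100 * alpha / 2, 100 * (1 - alpha / 2)]
    return np.percentile(frr_b, q), np.percentile(far_b, q)
```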

1.3 Proofs for unbalanced setting in Sect. 5.1

1.3.1 Proof of Proposition 7 (consistency of bootstrap estimators for \(\texttt{FRR}\))

We separate the proof into consistency of subsets and vertex bootstrap, and that of double-or-nothing bootstrap below.

  • Consistency of subsets and vertex bootstrap estimators. The resampling performed by these two bootstrap types for \(\texttt{FRR}\) computations is analogous, so we analyze the two together. By applying the Delta method, we obtain

    $$\begin{aligned}&{\mathop {\textrm{Var}}}^*(\widehat{\texttt{FRR}}_b^*) \\&\quad = \frac{\sum _{i=1}^G \tilde{M}_i^2 \bar{Y}_{ii}^2 - \left( \frac{1}{\sqrt{G}}\sum _{i=1}^G \tilde{M}_i\bar{Y}_{ii}\right) ^2}{\left( \sum _{i=1}^G \tilde{M}_i\right) ^2} \\&\quad \quad - 2 \frac{ \left( \sum _{i=1}^G \tilde{M}_i \bar{Y}_{ii}\right) \left[ \sum _{i=1}^G \tilde{M}_i^2 \bar{Y}_{ii} - (\sum _{i=1}^G \tilde{M}_i \bar{Y}_{ii}) (\sum _{i=1}^G \tilde{M}_i/G)\right] }{\left( \sum _{i=1}^G \tilde{M}_i\right) ^3} \\&\quad \quad \quad + \frac{\left( \sum _{i=1}^G \tilde{M}_i \bar{Y}_{ii}\right) ^2 \left[ \sum _{i=1}^G \tilde{M}_i^2 - \left( \frac{1}{\sqrt{G}}\sum _{i=1}^G \tilde{M}_i\right) ^2\right] }{\left( \sum _{i=1}^G \tilde{M}_i\right) ^4}. \end{aligned}$$

    Since \(M_i\) is finite for any \(i\in \mathcal {G}\), we can apply the weak law of large numbers and the continuous mapping theorem to obtain the following convergences in probability as \(G \rightarrow \infty \):

    $$\begin{aligned}&G\frac{\sum _{i=1}^G \tilde{M}_i^2 \bar{Y}_{ii}^2 - \left( \frac{1}{\sqrt{G}}\sum _{i=1}^G \tilde{M}_i\bar{Y}_{ii}\right) ^2}{\left( \sum _{i=1}^G \tilde{M}_i\right) ^2} \\&\xrightarrow {p }\frac{{\mathop {\textrm{Var}}}(\tilde{M}_1\bar{Y}_{11})}{\mathbb {E}[\tilde{M}_1]^2}, \\&G\frac{\left( \sum _{i=1}^G \tilde{M}_i \bar{Y}_{ii}\right) \sum _{i=1}^G \left( \tilde{M}_i^2 \bar{Y}_{ii} - (\sum _{i=1}^G \tilde{M}_i \bar{Y}_{ii}) (\sum _{i=1}^G \tilde{M}_i/G)\right) }{\left( \sum _{i=1}^G \tilde{M}_i\right) ^3} \\&\xrightarrow {p }\frac{\mathbb {E}[\tilde{M}_1\bar{Y}_{11}] {\mathop {\textrm{Cov}}}(\tilde{M}_1\bar{Y}_{11}, \tilde{M}_1)}{\mathbb {E}[\tilde{M}_1]^3}, \\&G\frac{\left( \sum _{i=1}^G \tilde{M}_i \bar{Y}_{ii}\right) ^2 \left[ \sum _{i=1}^G \tilde{M}_i^2 - \left( \frac{1}{\sqrt{G}}\sum _{i=1}^G \tilde{M}_i\right) ^2\right] }{\left( \sum _{i=1}^G \tilde{M}_i\right) ^4} \\&\xrightarrow {p }\frac{\mathbb {E}[\tilde{M}_1 \bar{Y}_{11}]^2{\mathop {\textrm{Var}}}(\tilde{M}_1)}{\mathbb {E}[\tilde{M}_1]^4}. \end{aligned}$$

    It then follows that

    $$\begin{aligned}{} & {} {\mathop {\textrm{Var}}}^*(\sqrt{G}\widehat{\texttt{FRR}}_b^*)\xrightarrow {p }\frac{{\mathop {\textrm{Var}}}(\tilde{M}_1\bar{Y}_{11})}{\mathbb {E}[\tilde{M}_1]^2} \\{} & {} \quad - 2\frac{\mathbb {E}[\tilde{M}_1\bar{Y}_{11}] {\mathop {\textrm{Cov}}}(\tilde{M}_1\bar{Y}_{11}, \tilde{M}_1)}{\mathbb {E}[\tilde{M}_1]^3} \\{} & {} \quad + \frac{\mathbb {E}[\tilde{M}_1 \bar{Y}_{11}]^2{\mathop {\textrm{Var}}}(\tilde{M}_1)}{\mathbb {E}[\tilde{M}_1]^4}. \end{aligned}$$

    This completes the proof for the consistency of the subsets and vertex bootstrap estimators.

  • Consistency of double-or-nothing bootstrap estimator. Assume that \(\mathbb {E}[W_i] = 1\) and \({\mathop {\textrm{Var}}}(W_i) = \tau \). In addition, let \(W_i\perp \!\!\!\!\perp W_j\) whenever \(i\ne j\) for \(i, j\in \mathcal {G}\). The double-or-nothing bootstrap is obtained by taking \(\tau = 1\). By applying the Delta method, we obtain

    $$\begin{aligned}{} & {} {\mathop {\textrm{Var}}}^*(\widehat{\texttt{FRR}}_b^*) = \tau \frac{\sum _{i=1}^G \tilde{M}_i^2\bar{Y}_{ii}^2}{\left( \sum _{i=1}^G\tilde{M}_i\right) ^2}\\{} & {} - 2\tau \frac{\left( \sum _{i=1}^G \tilde{M}_i^2\bar{Y}_{ii}\right) \left( \sum _{i=1}^G \tilde{M}_i \bar{Y}_{ii} \right) }{\left( \sum _{i=1}^G\tilde{M}_i\right) ^3} \\{} & {} + \tau \frac{\left( \sum _{i=1}^G \tilde{M}_i^2\right) \left( \sum _{i=1}^G \tilde{M}_i \bar{Y}_{ii} \right) ^2 }{\left( \sum _{i=1}^G\tilde{M}_i\right) ^4}. \end{aligned}$$

    Since \(\tilde{M}_i\) is finite, we can apply the weak law of large numbers and the continuous mapping theorem to obtain, as \(G\rightarrow \infty \),

    $$\begin{aligned} \tau G \frac{\sum _{i=1}^G \tilde{M}_i^2\bar{Y}_{ii}^2}{\left( \sum _{i=1}^G\tilde{M}_i\right) ^2}&\xrightarrow {p }\tau \frac{\mathbb {E}[\tilde{M}_1^2\bar{Y}_{11}^2]}{\mathbb {E}[\tilde{M}_1]^2}, \\ 2G\tau \frac{\left( \sum _{i=1}^G \tilde{M}_i^2\bar{Y}_{ii}\right) \left( \sum _{i=1}^G \tilde{M}_i \bar{Y}_{ii} \right) }{\left( \sum _{i=1}^G\tilde{M}_i\right) ^3}&\xrightarrow {p }2\tau \frac{\mathbb {E}[\tilde{M}_1^2\bar{Y}_{11}]\mathbb {E}[\tilde{M}_1\bar{Y}_{11}]}{\mathbb {E}[\tilde{M}_1]^3}, \\ \tau G \frac{\left( \sum _{i=1}^G \tilde{M}_i^2\right) \left( \sum _{i=1}^G \tilde{M}_i \bar{Y}_{ii} \right) ^2 }{\left( \sum _{i=1}^G\tilde{M}_i\right) ^4}&\xrightarrow {p }\tau \frac{\mathbb {E}[\tilde{M}_1^2]\mathbb {E}[\tilde{M}_1\bar{Y}_{11}]^2}{\mathbb {E}[\tilde{M}_1]^4}. \end{aligned}$$
    Fig. 6 Protocol design for selecting identities and instances in \(\texttt{FAR}\) computations. The selection is based on Algorithm 1. The number in each cell represents the iteration at which the given pair of identities and instances is chosen. Left panel: balanced setting with one instance for each of the five identities. Middle panel: balanced setting with two instances for each of the five identities. Right panel: unbalanced setting where the first four identities have one instance each and the fifth has three. Note that in all scenarios we iterate through all identities before selecting the same identity again. However, while in the balanced setting we pick the combination of identities 1–2 first, in the unbalanced setting we choose identities 1–5 first as the latter has more observations

    Fig. 7 Estimated coverage versus average width (top) and average width versus nominal coverage (bottom) for \(\texttt{FAR}\) intervals on synthetic data. Data contain \(G=50\) identities with \(M=5\) instances each. Colored lines and shaded regions indicate estimated coverage and corresponding 95% naive Wilson or Wald confidence intervals for the different methods

    Fig. 8 Mean squared error (MSE) of the \(\texttt{FAR}\) variance estimator on synthetic data. Data contain a varying number of identities G with \(M=5\) instances each. The lines correspond to the MSE of the \(\texttt{FAR}\) variance estimator as a function of G

    Fig. 9 Estimated coverage versus nominal coverage of confidence intervals for \(\texttt{FRR}\)@\(\texttt{FAR}\) on MORPH data. Samples were generated by resampling \(G=50\) identities from the original dataset without replacement

    Fig. 10 Estimated interval coverage versus nominal coverage for \(\texttt{FRR}=10^{-1}\) (top) and \(\texttt{FAR}=10^{-2}\) (bottom) on 1:1 object, speaker, and topic verification tasks respectively. Samples were generated by resampling \(G=50\) identities without replacement from the original dataset

    Putting everything together, as \(G \rightarrow \infty \), we have that

    $$\begin{aligned} {\mathop {\textrm{Var}}}^*(\sqrt{G}\widehat{\texttt{FRR}}_b^*)&\xrightarrow {p }\tau \frac{\mathbb {E}[\tilde{M}_1^2\bar{Y}_{11}^2]}{\mathbb {E}[\tilde{M}_1]^2} \\&- 2\tau \frac{\mathbb {E}[\tilde{M}_1^2\bar{Y}_{11}]\mathbb {E}[\tilde{M}_1\bar{Y}_{11}]}{\mathbb {E}[\tilde{M}_1]^3} \\&+ \tau \frac{\mathbb {E}[\tilde{M}_1^2]\mathbb {E}[\tilde{M}_1\bar{Y}_{11}]^2}{\mathbb {E}[\tilde{M}_1]^4} \\&= \tau \frac{{\mathop {\textrm{Var}}}(\tilde{M}_1\bar{Y}_{11})}{\mathbb {E}[\tilde{M}_1]^2} \\&- 2\tau \frac{{\mathop {\textrm{Cov}}}(\tilde{M}_1\bar{Y}_{11}, \tilde{M_1})\mathbb {E}[\tilde{M}_1\bar{Y}_{11}]}{\mathbb {E}[\tilde{M}_1]^3} \\&+ \tau \frac{{\mathop {\textrm{Var}}}(\tilde{M}_1)\mathbb {E}[\tilde{M}_1\bar{Y}_{11}]^2}{\mathbb {E}[\tilde{M}_1]^4}. \end{aligned}$$

    Choosing \(\tau =1\) yields the desired result. This completes the proof for the consistency of the double-or-nothing bootstrap estimator.

Protocol design

Many vision and audio datasets comprise hundreds of thousands of instances, making it computationally infeasible to estimate \(\texttt{FRR}\) and \(\texttt{FAR}\) on all the data. In such cases, the researcher has to decide which instance pairs their computational resources (i.e., budget) should be spent on. Since different combinations of pairwise comparisons between instances may lead to different estimates of model accuracy, dataset designers attach protocols specifying which comparisons to consider in computations. A natural question is then: For a given budget, which instance pairs offer the lowest-variance estimate of model accuracy?

Based on our theoretical analysis in Sect. 4 (and demonstrated by the empirical results in Sect. 6), it is clear that the dependence structure induced by the comparisons can significantly impact the coverage of the confidence intervals. This observation naturally leads to a protocol design strategy that tries to preserve the independence structure between comparisons. For simplicity, consider the computation of \(\texttt{FAR}\) on a sample where each identity has only one instance. Assume that a budget of \(B\le G(G-1)\) comparisons is available, and let \(B=\sum _{i=1}^G \sum _{j\ne i} b_{ij}\) with \(b_{ij}=b_{ji}=1\) when \(\bar{Y}_{ij}\) enters the \(\texttt{FAR}\) computations and 0 otherwise. Minimizing the variance of the estimated \(\texttt{FAR}\) under the budget constraint boils down to solving the following problem:

$$\begin{aligned} \mathop {\mathrm {arg\,min}}_{b_{12}, \dots , b_{G(G-1)}} \sum _{\begin{array}{c} i, j, k=1 \\ j \ne i\\ k \ne i, j \end{array}}^G b_{ij} b_{ik} \text { s.t. } \sum _{\begin{array}{c} i, j=1\\ i \ne j \end{array}}^G b_{ij} = B. \end{aligned}$$
(B23)

When \(B \le \lfloor G/2\rfloor \), one can choose instance pairs that are independent, e.g., \(\bar{Y}_{12}, \bar{Y}_{34}\), and so on. When \(B > \lfloor G/2\rfloor \), the objective in (B23) is minimized when the comparisons share as few instances as possible with each other. One approach to choosing the terms to include in the \(\texttt{FAR}\) computations is greedy: at each of the B iterations, select the observation that minimizes the objective evaluated on the allocation resulting from the previous iteration.
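
To make the objective in (B23) concrete, the following minimal sketch (our illustration, not code from the paper) counts how many selected comparisons share an identity, which, up to a constant factor, is the quantity being minimized:

```python
from itertools import combinations

def b23_objective(selected_pairs):
    """Number of (unordered) pairs of selected comparisons that share an
    identity; up to a constant factor, this is the objective in (B23)."""
    return sum(1 for (i, j), (k, l) in combinations(selected_pairs, 2)
               if {i, j} & {k, l})

# Disjoint comparisons such as Y_12 and Y_34 are independent: objective is 0.
print(b23_objective([(1, 2), (3, 4)]))  # 0
# Reusing identity 1 induces dependence: objective is 1.
print(b23_objective([(1, 2), (1, 3)]))  # 1
```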

Algorithm 1

Protocol Design Strategy to Select Identity Combinations (IDs) for \(\texttt{FAR}\) Estimation in the Balanced Setting

Algorithm 1 outlines the proposed approach for selecting the combinations of identities to be included in the \(\texttt{FAR}\) estimation in the balanced setting. We start by creating all possible combinations of identities from which the instances to be compared will be drawn. At each iteration, we use a priority queue to retrieve the identity candidates with the lowest number of visits. These candidates are sorted so that those with a larger number of instances are visited first, which helps minimize the number of times a given instance is reused in the estimation. Note that in the balanced setting this sorting step is not necessary, since all identities have the same number of instances. If the total budget exceeds \(\lfloor G(G-1)/2\rfloor \), we can iterate through the combinations yielded by Algorithm 1 again. Once the combinations of identities are available, we follow a similar strategy for selecting the pairs of instances within each pair of identities. Figure 6 shows three examples of protocols resulting from applying this strategy. For \(\texttt{FRR}\) estimation, we follow a similar idea: we first iterate through the identities, starting with those with the largest number of instances, and then use Algorithm 1 to select the comparisons within each identity.
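
The sketch below illustrates the round-robin idea behind Algorithm 1 (our simplification; tie-breaking details in the paper's pseudocode may differ). Identities with the fewest visits are paired first, and ties are broken in favor of identities with more instances, which matches the first picks described in the Fig. 6 caption (identities 1–2 first in the balanced setting, 1–5 first in the unbalanced one):

```python
from itertools import combinations

def select_identity_pairs(instance_counts, budget):
    """Greedy round-robin selection of identity pairs for FAR estimation.

    instance_counts: number of instances per identity (index 0 = identity 1).
    Pairs least-visited identities first; ties are broken in favor of
    identities with more instances (relevant only in the unbalanced setting).
    """
    G = len(instance_counts)
    visits = [0] * G
    unused = {frozenset(p) for p in combinations(range(G), 2)}
    selected = []
    while unused and len(selected) < budget:
        # Priority: fewest visits first, then larger number of instances.
        order = sorted(range(G), key=lambda i: (visits[i], -instance_counts[i]))
        pair = next(frozenset((order[a], order[b]))
                    for a in range(G) for b in range(a + 1, G)
                    if frozenset((order[a], order[b])) in unused)
        unused.remove(pair)
        for identity in pair:
            visits[identity] += 1
        selected.append(tuple(sorted(identity + 1 for identity in pair)))
    return selected

# Balanced: five identities with one instance each -> identities 1-2 come first.
print(select_identity_pairs([1, 1, 1, 1, 1], budget=3))
# Unbalanced: the fifth identity has three instances -> identities 1-5 come first.
print(select_identity_pairs([1, 1, 1, 1, 3], budget=3))
```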

Finally, a brief note on computing error metrics and the associated uncertainties on massive datasets under computational constraints. The proposed protocol design strategy can be applied to handle estimation in these settings: one selects a fixed number of instance pairs using the protocol design, estimates the error metrics on these pairs, and then uses Wilson or bootstrap methods to obtain confidence intervals. By following this approach, one can obtain reliable estimates of error metrics and their uncertainties while minimizing computational costs.
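
As a hedged illustration of the last step, the sketch below builds a Wilson-type interval from an error-rate estimate and a variance estimate (which, for \(\texttt{FAR}\), would come from a dependence-aware estimator such as the one in Eq. 9). The effective-sample-size plug-in used here is a standard construction and may differ in detail from the intervals studied in the paper:

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(p_hat, var_hat, alpha=0.05):
    """Wilson-type interval for an error rate estimate in (0, 1).

    var_hat estimates Var(p_hat); it is absorbed into an effective
    sample size n_eff = p(1-p)/var before applying the Wilson formula.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_eff = p_hat * (1 - p_hat) / var_hat
    denom = 1 + z**2 / n_eff
    center = (p_hat + z**2 / (2 * n_eff)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n_eff + z**2 / (4 * n_eff**2))
    return center - half, center + half

# Example: FAR estimated at 1e-3 with a dependence-aware variance estimate.
print(wilson_interval(1e-3, 4e-8))
```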

Additional numerical experiments

In this section, we present additional experimental details and results, supplementing those presented in Sect. 1 and Sect. 6.

1.1 Analysis of interval widths

The discussion in the main paper has focused on interval coverage and has only briefly mentioned width. In our experiments, we found that methods yielding intervals with higher coverage generally also produced larger widths, as expected for statistics that are asymptotically normal (see Proposition 1). Figure 7 shows the relationship between estimated coverage, average width, and nominal coverage for \(\texttt{FAR}\) intervals with \(G=50\) and \(M=5\) using the setup described in Sect. 6.1 (see Fig. 3 for estimated vs. nominal coverage). For \(\texttt{FAR}=10^{-3}, 10^{-4}\), a given estimated coverage corresponds to the same interval width across all methods. This indicates that recalibrating the nominal coverage of any method (e.g., increasing the nominal level \(1-\alpha \) of the subsets or two-level bootstrap until its intervals reach the target coverage \(1-\alpha _{\text {target}}\)) will not yield intervals that attain the target coverage with a smaller width.

1.2 Variance estimation accuracy versus sample size

One natural question is how large the sample should be to obtain accurate estimates of the variances of \(\texttt{FRR}\) and \(\texttt{FAR}\), and for Wilson intervals to achieve close-to-nominal coverage. As we have seen in Proposition 2, the reviewed variance estimators converge asymptotically to the true parameters. Their behavior when the sample contains few observations is less clear a priori. However, in Fig. 4 we have seen that Wilson intervals achieve coverage close to nominal for any number of identities. This observation suggests that the estimated variances of \(\texttt{FRR}\) and \(\texttt{FAR}\) are close to the true variances even when only a few identities are present in the data. This is easy to show in the case of \(\texttt{FRR}\), for which one can obtain finite-sample guarantees via standard arguments. For the estimator of the \(\texttt{FAR}\) variance, the derivation of the limiting distribution is more complex due to the data dependence. For this reason, we resort to simulation: Fig. 8 shows how the mean squared error (MSE) of the estimator of the \(\texttt{FAR}\) variance in Eq. 9 varies with the number of identities G present in synthetic data. The results show that the MSE of the variance estimator decreases rapidly with the number of identities in the data (at rate \(\sqrt{G}\), consistent with Proposition 2), and is small even when only a limited number of identities is available.

1.3 Pointwise intervals for the ROC

We evaluate the coverage of pointwise confidence intervals for the ROC on MORPH data. The experimental setup follows the description in Sect. 6. The vertex bootstrap performs similarly to the double-or-nothing procedure and is therefore omitted from the discussion to simplify the presentation of the results. Figure 9 shows estimated coverage as a function of nominal coverage for the reviewed methods at different levels of \(\texttt{FAR}\). Consistent with the discussion in Sect. 5.2, we observe that Wilson intervals achieve coverage higher than nominal across all \(\texttt{FAR}\) levels. While the overcoverage may be expected for low values of \(\texttt{FRR}\) (e.g., see the results in Fig. 3), it persists, albeit to a lesser extent, for larger values of \(\texttt{FRR}\). For low \(\texttt{FRR}\), we also observe that the versions of the double-or-nothing bootstrap that employ ROC curves smoothed by modeling the scores with log-normal distributions perform better than their counterparts. This is suggestive of the benefits of imposing smoothness assumptions. When \(\texttt{FRR}\) is large enough, the bootstraps perform similarly.

1.4 Experiments on diverse data types

Our theoretical analysis applies to any 1:1 matching task. Here we explore the behavior of the reviewed methods empirically on real data, with data types and tasks beyond 1:1 face verification.

In particular, we investigate the performance of our methods in the following tasks (a minimal scoring sketch is given after the list):

  • 1:1 object verification Matching images from a randomly sampled subset of the iNat2021 dataset (Van Horn et al., 2021). The iNat2021 dataset is an image collection specifically curated for species recognition, featuring over 10,000 different species. The matching task is to recognize whether the animals in two different images belong to the same species. For this purpose, we extract feature representations obtained via CLIP (Radford et al., 2021).

  • 1:1 speaker verification We use a large dataset of voice recordings corresponding to multiple speakers. For the speaker verification task, we extract the embeddings of the audio recordings using a pretrained ECAPA-TDNN model (Ravanelli et al., 2021; Desplanques et al., 2020).

  • 1:1 topic verification We aim to detect whether two text paragraphs are related to the same topic. For this purpose, we use a subset of the Amazon review dataset (Ni et al., 2019), comprising product information and corresponding Amazon reviews. Specifically, we focus on identifying whether two reviews pertain to the same product. Classification is performed using text embeddings generated by BAAI/bge-small-en-v1.5 (Xiao et al., 2023).
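
All three tasks share the same evaluation pipeline: embed each instance with the corresponding encoder, score pairs by cosine similarity, and threshold the score to decide a match. Below is a minimal sketch of the scoring and error-rate computation (our illustration; encoder loading is omitted and embeddings are assumed to be given as rows of NumPy arrays):

```python
import numpy as np

def cosine_scores(emb_a, emb_b):
    """Cosine similarity between corresponding rows of two embedding arrays."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def error_rates(scores, same_identity, threshold):
    """FRR: fraction of genuine pairs rejected; FAR: fraction of impostor pairs accepted."""
    accept = scores >= threshold
    frr = np.mean(~accept[same_identity])
    far = np.mean(accept[~same_identity])
    return frr, far

# Toy usage: scores for four pairs, first two genuine, last two impostor.
scores = np.array([0.9, 0.4, 0.3, 0.8])
genuine = np.array([True, True, False, False])
print(error_rates(scores, genuine, threshold=0.5))  # (FRR, FAR) = (0.5, 0.5)
```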

For all datasets, our experimental framework follows the setup of Sect. 6.2, using \(G=50\) identities. Figure 10 shows how the coverage of the confidence intervals for \(\texttt{FRR}=10^{-1}\) and \(\texttt{FAR}=10^{-2}\), constructed using the reviewed methods, varies with nominal coverage. The results are consistent with our empirical findings in Sect. 6: for \(\texttt{FRR}\), all methods other than naive Wilson tend to cover approximately at the right level. For \(\texttt{FAR}\), the Wilson intervals, as well as the vertex and double-or-nothing bootstrap intervals, achieve coverage close to nominal, while the other methods severely undercover.
