On the Privacy of Selection Mechanisms with Gaussian Noise
Jonathan Lebensold · Doina Precup · Borja Balle
McGill University, Mila · McGill University, Mila, Google DeepMind · Google DeepMind
Abstract
Report Noisy Max and Above Threshold are two classical differentially private (DP) selection mechanisms. Their output is obtained by adding noise to a sequence of low-sensitivity queries and reporting the identity of the query whose (noisy) answer satisfies a certain condition. Pure DP guarantees for these mechanisms are easy to obtain when Laplace noise is added to the queries. On the other hand, when instantiated using Gaussian noise, standard analyses only yield approximate DP guarantees despite the fact that the outputs of these mechanisms lie in a discrete space. In this work, we revisit the analysis of Report Noisy Max and Above Threshold with Gaussian noise and show that, under the additional assumption that the underlying queries are bounded, it is possible to provide pure ex-ante DP bounds for Report Noisy Max and pure ex-post DP bounds for Above Threshold. The resulting bounds are tight and depend on closed-form expressions that can be numerically evaluated using standard methods. Empirically we find these lead to tighter privacy accounting in the high privacy, low data regime. Further, we propose a simple privacy filter for composing pure ex-post DP guarantees, and use it to derive a fully adaptive Gaussian Sparse Vector Technique mechanism. Finally, we provide experiments on mobility and energy consumption datasets demonstrating that our Sparse Vector Technique is practically competitive with previous approaches and requires less hyper-parameter tuning.
1 INTRODUCTION
Differential Privacy (DP) (Dwork,, 2006) has become the standard framework used for the private release of sensitive statistics. In particular, DP has been embraced by industry and governments to guarantee that potentially sensitive statistics cannot be linked back to individual users. For example, during the COVID-19 pandemic, Google Maps mobility data was published with DP (Aktay et al.,, 2020) in order for public health authorities to better understand various curve-flattening measures (such as work-from-home, or shelter-in-place). Recently, the Wikimedia Foundation deployed DP for their page-level visit statistics (Desfontaines,, 2023).
Underpinning the design of differentially private mechanisms is a fundamental trade-off between privacy and utility characterized by the Fundamental Law of Information Recovery stating that “overly accurate answers to too many questions will destroy privacy in a spectacular way” (Dwork and Roth,, 2014). In practice this imposes a limit on how many statistics can be privately released to a desired accuracy within a pre-specified privacy budget. In applications where the space of possible statistics is too large to allow for a full private and accurate release, analysts can overcome the privacy-utility trade-off by identifying and releasing only those statistics that contain relevant information. This is necessary for example in periodic data collection where users contribute many times to a dataset and relevant statistics must be released repeatedly (e.g. to report temporal trends, change points, extreme events, etc.) (Hu et al.,, 2021; Wang et al.,, 2016; Xu et al.,, 2017). A notable example is the use of smart meters in energy grids, where statistics can help manage electrical demand and encourage smoother consumption but can also be used to infer information like income, occupancy, etc. (Ács and Castelluccia,, 2011; Bohli et al.,, 2010; Haji Mirzaee et al.,, 2022).
Private selection mechanisms aim to identify relevant statistics. This problem is usually framed as query selection: an analyst defines a collection of queries against the target dataset, and a private mechanism is used to identify queries returning “abnormally” large values. Two notable settings arise: the offline case, where all the queries can be specified in advance, and the online case, where the analyst can adaptively select queries based on the previous ones. In the offline setting, one of the best known private selection mechanisms is Report Noisy Max whereby the analyst submits a collection of low-sensitivity scalar queries. The mechanism then adds noise to the values of all the queries and returns the index of the query attaining the largest (noisy) value. The Exponential Mechanism (Dwork and Roth,, 2014) and Permute-and-flip are commonly used for offline selection and are generally considered best-in-class (McKenna and Sheldon,, 2020).
In the online setting, the cornerstone selection mechanism is Above Threshold, where the analyst submits a threshold and a sequence of (potentially adaptive) low-sensitivity scalar queries to which the mechanism iteratively computes answers, until a noisy value exceeds the target threshold. Running Above Threshold repeatedly is called the Sparse Vector Technique (Dwork et al.,, 2009), and is a common privacy primitive in change-point detection, empirical risk minimization, density estimation and other online learning algorithms (Zhang et al.,, 2021; Ligett et al.,, 2017; Dwork and Roth,, 2014; Cummings et al.,, 2018).
When Laplace noise is used, Above Threshold and Report Noisy Max offer pure privacy guarantees – the strongest type of DP guarantee which does not have to account for some small failure probability. The use of the Laplace distribution also has the advantage of making the mathematical analysis of the privacy guarantees relatively simple; however, since the noise distribution is less concentrated than the Gaussian distribution, it can lead to a less accurate mechanism. Replacing the Laplace distribution with a Gaussian distribution is advantageous in many DP mechanisms, including online query selection (Abadi et al.,, 2016; Zhu and Wang,, 2020; Papernot et al.,, 2018).
Contributions.
In this paper we revisit the privacy analysis of the Above Threshold mechanism instantiated with Gaussian noise. Our main observation is that under the mild assumption that the submitted queries are uniformly bounded (in addition to having low sensitivity), we can provide pure DP guarantees that can be computed using standard numerical tools. In particular, we provide pure ex-post DP guarantees for Gaussian Above Threshold (an ex-post guarantee depends on the output produced by the mechanism; see Section 3 for details). In the process, we also develop an ex-ante DP guarantee for Gaussian Report Noisy Max. Our analysis relies on identifying the worst-case values that the query answers can attain on a pair of neighboring datasets, a technique which might be of independent interest. Empirically, we find that these privacy bounds lead to tighter privacy accounting in the high-privacy, low-data regime, when compared to other Gaussian-based mechanisms.
Further, we define a meta-algorithm that composes Gaussian Above Threshold with ex-post DP guarantees. Thus, we derive a fully-adaptive Sparse Vector Technique (SVT), which we call Filtered Self-Reporting Composition (FSRC). Our method is particularly appealing since an analyst need not choose the number of releases or the maximum number of queries up front.
Finally, we provide experiments on mobility and energy consumption datasets demonstrating that our analyses yield mechanisms that in practice match or outperform previous approaches.
2 PRELIMINARIES
Differential Privacy.
Throughout we will be considering randomized mechanisms operating on some dataset $D \in \mathcal{D}$. Two datasets $D$ and $D'$ are said to be neighboring, denoted $D \simeq D'$, if they differ in the data of a single individual (e.g. one user is added, removed, or replaced by another).
Definition 1 (DP (Dwork et al.,, 2006)).
Let $\varepsilon \geq 0$ and $\delta \in [0,1]$. A randomized mechanism $M : \mathcal{D} \to \mathcal{O}$ is $(\varepsilon,\delta)$-DP if for every pair of neighboring datasets $D \simeq D'$ and every subset $E \subseteq \mathcal{O}$, we have $\Pr[M(D) \in E] \leq e^{\varepsilon} \Pr[M(D') \in E] + \delta$.
When $\delta = 0$, we write $\varepsilon$-DP and say the mechanism satisfies pure DP; otherwise we say that it satisfies approximate DP. If the output space $\mathcal{O}$ is discrete, then a mechanism is $\varepsilon$-DP if and only if $\Pr[M(D) = o] \leq e^{\varepsilon} \Pr[M(D') = o]$ for all $o \in \mathcal{O}$ and all neighboring $D \simeq D'$.
A key feature of DP is its resilience to post-processing: any function of the output of a DP mechanism still satisfies DP with the same (or better) parameters. A common way to establish that a mechanism satisfies differential privacy is through a high-probability bound on the privacy loss random variable.
Definition 2 (Privacy Loss (Dinur and Nissim,, 2003)).
Let $M : \mathcal{D} \to \mathcal{O}$ be a randomized mechanism and consider neighboring datasets $D \simeq D'$. The privacy loss of the pair $D$ and $D'$ under $M$ for a given output $o \in \mathcal{O}$ is defined as $\mathcal{L}_{M,D,D'}(o) = \log \frac{\Pr[M(D) = o]}{\Pr[M(D') = o]}$.
Definition 3 (pDP(Kasiviswanathan and Smith,, 2014)).
A mechanism $M$ satisfies $(\varepsilon,\delta)$-pDP if for any pair of neighboring datasets $D \simeq D'$, $\Pr_{o \sim M(D)}[\mathcal{L}_{M,D,D'}(o) > \varepsilon] \leq \delta$.
If a mechanism is $(\varepsilon,\delta)$-pDP, then it is also $(\varepsilon,\delta)$-DP (Kasiviswanathan and Smith, 2014). A simple way to obtain pDP guarantees is through bounds on the moment-generating function of the privacy loss random variable; this is one of the motivations for Rényi Differential Privacy (RDP), which relies on a bound on the Rényi divergence between the output distributions of a mechanism on neighboring datasets.
Definition 4 (Rényi divergence (Rényi,, 1961)).
Let $\alpha > 1$. The Rényi divergence of order $\alpha$ between two probability distributions $P$ and $Q$ on $\mathcal{O}$ is defined by $D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{o \sim Q}\left[\left(\frac{P(o)}{Q(o)}\right)^{\alpha}\right]$.
Definition 5 (Rényi DP (Mironov,, 2017)).
Let $\alpha > 1$ and $\varepsilon \geq 0$. A randomized mechanism $M$ is $(\alpha, \varepsilon)$-RDP if, for all pairs of neighboring datasets $D \simeq D'$, $D_\alpha(M(D) \,\|\, M(D')) \leq \varepsilon$.
It is possible to convert RDP guarantees into probabilistic and approximate DP guarantees (Mironov, 2017; Balle et al., 2020). In particular, any $(\alpha, \varepsilon)$-RDP mechanism satisfies $\big(\varepsilon + \frac{\log(1/\delta)}{\alpha - 1}, \delta\big)$-pDP for every $\delta \in (0,1)$ by Markov's inequality. RDP greatly simplifies the analysis of mechanisms based on Gaussian noise because the Rényi divergence between Gaussian distributions has a simple expression, as well as a simple analysis under composition.
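As an illustration of this conversion, the following minimal Python sketch (the function name is ours) computes the approximate DP guarantee implied by an RDP guarantee at a single order; in practice one minimizes over a grid of orders.

```python
import math

def rdp_to_approx_dp(alpha: float, eps_rdp: float, delta: float) -> float:
    """(alpha, eps_rdp)-RDP implies (eps_rdp + log(1/delta)/(alpha - 1), delta)-DP
    (Mironov, 2017); the same bound also holds as a pDP guarantee."""
    return eps_rdp + math.log(1.0 / delta) / (alpha - 1.0)

# Example: an RDP curve is typically minimized over several candidate orders.
orders = [1.5, 2, 4, 8, 16, 32]
eps = min(rdp_to_approx_dp(a, eps_rdp=0.05 * a, delta=1e-6) for a in orders)
```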
Gaussian Mechanism.
Adding Gaussian noise to the result of a low-sensitivity query is a staple of differentially private mechanism design. Let $q : \mathcal{D} \to \mathbb{R}$ be a query with global sensitivity $\Delta = \sup_{D \simeq D'} |q(D) - q(D')|$ and let $\sigma > 0$. The Gaussian Mechanism, defined as $M(D) = q(D) + \mathcal{N}(0, \sigma^2)$, satisfies $(\alpha, \frac{\alpha \Delta^2}{2\sigma^2})$-RDP for every $\alpha > 1$ (Mironov, 2017). It is possible to convert this RDP guarantee into an approximate DP guarantee, although tighter approximate DP bounds can be obtained directly (Balle and Wang, 2018, Theorem 8).
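A minimal sketch of the Gaussian Mechanism and its RDP curve follows; the names and example parameters are ours, and the RDP expression is the standard one recalled above.

```python
import numpy as np

def gaussian_mechanism(true_answer: float, sigma: float,
                       rng: np.random.Generator) -> float:
    # Release q(D) + N(0, sigma^2); sigma must be calibrated to the
    # global sensitivity of the query q.
    return true_answer + rng.normal(loc=0.0, scale=sigma)

def gaussian_rdp(alpha: float, sensitivity: float, sigma: float) -> float:
    # The Gaussian mechanism is (alpha, alpha * Delta^2 / (2 sigma^2))-RDP
    # for every alpha > 1.
    return alpha * sensitivity**2 / (2.0 * sigma**2)

rng = np.random.default_rng(0)
noisy_count = gaussian_mechanism(true_answer=42.0, sigma=2.0, rng=rng)
```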
Gaussian Report Noisy Max.
In some applications it is useful to privately select which among a collection of $m$ queries (approximately) attains the largest value on a given dataset. This leads to the Report Noisy Max mechanism; when instantiated using Gaussian noise, the mechanism is given by $M(D) = \arg\max_{i \in [m]} \{q_i(D) + Z_i\}$, where $Z_1, \dots, Z_m \sim \mathcal{N}(0, \sigma^2)$ are i.i.d. Since the $\arg\max$ operation is merely a post-processing of the Gaussian mechanism applied to the $m$-dimensional query $(q_1, \dots, q_m)$ with sensitivity $\Delta$, the Gaussian Report Noisy Max mechanism inherits the same privacy guarantees as the Gaussian mechanism above (e.g. the bounds provided by Balle and Wang (2018, Theorem 8) or the RDP-to-DP conversion). Note that if each of the individual queries has sensitivity bounded by $\Delta_\infty$, then $\Delta \leq \sqrt{m}\,\Delta_\infty$.
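The following sketch (names ours) shows the mechanism as just described: Gaussian noise is added to every query answer and only the index of the largest noisy answer is released.

```python
import numpy as np

def gaussian_report_noisy_max(query_values: np.ndarray, sigma: float,
                              rng: np.random.Generator) -> int:
    # Add i.i.d. N(0, sigma^2) noise to each query answer and release only
    # the arg-max index; the noisy values themselves are never published.
    noisy = query_values + rng.normal(loc=0.0, scale=sigma, size=query_values.shape)
    return int(np.argmax(noisy))
```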
Gaussian Above Threshold.
The Above Threshold mechanism receives a sequence of low-sensitivity queries and privately identifies the first query (approximately) exceeding a given threshold. The mechanism was introduced by Dwork et al., (2009) and forms the basis of the Sparse Vector Technique, a composition of Above Threshold algorithms to find a sparse set of relevant queries among a large set. The standard version of the Above Threshold algorithm uses Laplace noise, and its privacy analysis is notoriously subtle (Lyu et al.,, 2016). Zhu and Wang, (2020) recently proposed an RDP analysis which can be applied to the Gaussian version of Above Threshold (see Algorithm 3).
Theorem 6 (General RDP Bound on Gaussian Above Threshold (Zhu and Wang,, 2020)).
Suppose all queries given to the mechanism in Algorithm 3 have sensitivity bounded by $\Delta$. Then, for every $\alpha > 1$, Algorithm 3 satisfies $(\alpha, \varepsilon(\alpha))$-RDP, where $\varepsilon(\alpha)$ is the sum of three terms: one accounting for the threshold noise, one accounting for the query noise, and a third depending on the stopping time of the mechanism, a random variable which we denote by $K$.
Unlike in the Laplace-based Above Threshold, the privacy bound for the Gaussian case depends on how long it takes the mechanism to stop (the third term in the expression above). When the queries are known a priori to be non-negative, it is possible to bound the stopping time and obtain the RDP guarantee below.
Theorem 7 (Gaussian Above Threshold RDP (Zhu and Wang,, 2020)).
Gaussian Above Threshold with threshold $T$, noise standard deviations $\sigma_1$ and $\sigma_2$, and $\Delta$-sensitive, non-negative queries satisfies $(\alpha, \varepsilon(\alpha))$-RDP with
Alternative methods to bound the stopping time include having the analyst supply a cap on the number of queries, forcing the algorithm to stop after a certain number of steps even if the threshold has not been exceeded. Zhu and Wang (2020) also propose a number of composition results for the Sparse Vector Technique. One version allows the analyst to continue releasing queries without having to re-sample the threshold noise, but in each case the analyst must know ahead of time the number of times they wish to release a “$\top$”, and have an upper bound on the maximum number of queries they intend to run.
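Algorithm 3 is referenced but not reproduced in this section; the sketch below shows the Gaussian Above Threshold loop we have in mind, with sigma1 and sigma2 denoting the threshold and query noise scales (the function and argument names are ours).

```python
import numpy as np
from typing import Iterable, Optional

def gaussian_above_threshold(queries: Iterable[float], threshold: float,
                             sigma1: float, sigma2: float,
                             rng: np.random.Generator) -> Optional[int]:
    # A single noisy copy of the public threshold is shared by all comparisons.
    noisy_threshold = threshold + rng.normal(loc=0.0, scale=sigma1)
    for k, q in enumerate(queries, start=1):
        # Fresh Gaussian noise is added to every query answer.
        if q + rng.normal(loc=0.0, scale=sigma2) >= noisy_threshold:
            return k  # halting time: first noisy answer above the noisy threshold
    return None  # the stream was exhausted without exceeding the threshold
```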
3 PURE PRIVATE GAUSSIAN MECHANISMS
To introduce our main contribution, we begin with a warm-up: a pure DP bound for Gaussian Report Noisy Max. The proof techniques are the same as for Above Threshold, but the analysis is simpler since all the queries play symmetric roles.
3.1 Warm-Up: Gaussian Report Noisy Max
The following is a pure DP bound for Gaussian Report Noisy Max for queries with bounded range.
Theorem 8 (Pure DP for Gaussian Report Noisy Max).
Let $M$ be the Gaussian Report Noisy Max mechanism with standard deviation $\sigma$ applied to bounded queries $q_1, \dots, q_m$, each with sensitivity bounded by $\Delta$. Let $\Phi$ denote the standard Gaussian CDF. Then $M$ satisfies $\varepsilon$-DP with
Notably, with the standard analysis of Gaussian Report Noisy Max, the privacy guarantee depends only on the ratio $\Delta/\sigma$. In our case, however, the bound additionally depends on the range of the queries. In fact, it is easy to see that if the queries are unbounded then Gaussian Report Noisy Max cannot admit a pure DP bound.
The proof, deferred to the supplemental, relies on identifying the worst-case values that uniformly bounded queries can attain on a pair of neighboring datasets. In particular, we show that the worst case is obtained on a pair of neighboring datasets and such that , , , and .
Note that the classical approach sketched in the previous section, applied to our setting, would consider the privacy of a Gaussian mechanism with an $m$-dimensional query of sensitivity $\Delta$ and treat the $\arg\max$ operation as a post-processing step. While our approach gives a pure DP guarantee, the classical approach does not, because Gaussian noise by itself cannot provide such a guarantee. The classical approach does, however, give a bound that is easy to compute.
The RDP approach gives a simple closed form expression, while the direct approximate DP approach gives a bound that has a simple dependence on the CDF of a standard Gaussian (Balle and Wang,, 2018). In contrast, the bound provided by our result requires estimating Gaussian expectations of a complex function; unfortunately, this does not admit a simple closed form expression. In Section 3.4 we evaluate numerical methods for computing this quantity and compare the resulting values of with those provided by the classical approach.
3.2 Ex-Post Gaussian Above Threshold
So far we have discussed how to bound the privacy of Above Threshold before execution (Theorem 7); such bounds are often referred to as ex-ante privacy guarantees. Alternatively, we can measure the privacy loss after execution, leading to so-called ex-post privacy guarantees.
Definition 9 (Ex-Post DP (Ligett et al.,, 2017)).
Consider a function $\varepsilon : \mathcal{O} \to \mathbb{R}_{\geq 0}$. A randomized mechanism $M$ satisfies $\varepsilon(\cdot)$-ex-post-DP if for any possible output $o \in \mathcal{O}$ and any pair of neighboring datasets $D \simeq D'$ we have $\mathcal{L}_{M,D,D'}(o) \leq \varepsilon(o)$.
Since Above Threshold outputs the (approximate) halting time, we can exploit the gap between the ex-post and the ex-ante privacy guarantee. In Fig. 1, we illustrate when the ex-post analysis is most beneficial and how likely that regime is to occur. As the public threshold increases, so does the separation between the ex-ante and the ex-post analysis. With Above Threshold, we also need to consider how the mechanism might perform against a worst-case dataset maximizing the stopping time; since queries are non-negative and bounded, this occurs when every query takes its minimum value. We then compute the median stopping time (dashed blue line) and the 80th percentile (solid red line) for the worst-case dataset. For queries with range bounded between zero and one, the effect is greatest when the mechanism halts early and the threshold is calibrated to be close to one.
When the queries have bounded range, Gaussian Above Threshold (Algorithm 3) admits the following ex-post privacy bound.
Theorem 10 (Pure Ex-post Gaussian Above Threshold).
Given a stream of bounded queries with global sensitivity $\Delta$, the Gaussian Above Threshold mechanism (Algorithm 3), halting at time step $k$, satisfies $\varepsilon(k)$-ex-post-DP with
Note that this formulation is very similar to the ratio of expectations in our pure DP analysis of Report Noisy Max (Theorem 12). The proof, deferred to the supplemental material, follows from the pure DP analysis of Gaussian Report Noisy Max: the point at which the ratio is maximized is the same in all but the final time step, where the query bound is negated. A final remark is that the mechanism's greatest privacy loss occurs when each prior step would have been a better halting point than the step at which it actually halts.
3.3 Fully-Adaptive Composition with Ex-Post Privacy Guarantees
Above Threshold stops the first time the threshold is exceeded. The Sparse Vector Technique (SVT) applies a sequence of Above Threshold mechanisms to find a set of queries that (approximately) exceed a pre-defined threshold. In the case of Laplace noise, the privacy analysis of SVT can be performed either directly (if the noise applied to the threshold is not refreshed after each Above Threshold terminates) or via composition (Dwork and Roth, 2014; Lyu et al., 2016). Something similar holds for SVT with Gaussian noise, although in this case the analysis without noise re-sampling only applies to a restricted range of parameters (Zhu and Wang, 2020). Here we present a simple technique for fully adaptive composition of mechanisms that satisfy both pDP and ex-post-DP guarantees. Combined with our analysis of Gaussian Above Threshold, it yields a fully adaptive Gaussian-noise SVT without additional hyper-parameters such as the maximum number of queries per Above Threshold invocation or the maximum number of invocations of the Above Threshold mechanism.
Our method (Algorithm 4) sequentially composes a stream of adaptive mechanisms and applies a stopping rule that prevents over-spending a predefined privacy budget. The privacy expenditure of each mechanism is tracked based on its output, using the ex-post privacy guarantee. The stopping rule uses the probabilistic DP guarantee to halt when there is a high enough probability of exceeding the privacy budget.
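Algorithm 4 is not reproduced in this excerpt; the following sketch captures the composition loop just described under simplifying assumptions we make here (a single pDP bound `eps_pdp_next` shared by all mechanisms, and mechanisms that self-report their ex-post privacy loss alongside their output; all names are ours).

```python
from typing import Callable, Iterator, List, Tuple

def fsrc(mechanisms: Iterator[Callable[[], Tuple[object, float]]],
         eps_pdp_next: float, eps_budget: float) -> List[object]:
    # Sequentially compose mechanisms that return (output, ex-post epsilon).
    outputs: List[object] = []
    eps_spent = 0.0
    for run_mechanism in mechanisms:
        # Stopping rule: refuse to run the next mechanism if its (ex-ante) pDP
        # epsilon could push the cumulative ex-post spend over the budget.
        if eps_spent + eps_pdp_next > eps_budget:
            break
        output, eps_ex_post = run_mechanism()
        outputs.append(output)
        eps_spent += eps_ex_post  # charge the (typically smaller) ex-post loss
    return outputs
```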
Theorem 11.
Suppose the mechanisms provided to Algorithm 4 are such that each $M_i$ is $(\hat{\varepsilon}_i, \delta_i)$-pDP and $\varepsilon_i(\cdot)$-ex-post-DP for all $i$. Then Algorithm 4 satisfies $(\varepsilon, \delta)$-DP, where $(\varepsilon, \delta)$ is the budget supplied to the algorithm.
In Fig. 2 we plot the ex-post privacy loss bound as a function of the public threshold parameter $T$. Given a stream of queries bounded in $[0, 1]$, the mechanism reports less privacy loss as $T$ grows.
3.4 Numerical Computation of Bounds
We evaluate two methods for producing numerical estimates of the bounds: Monte Carlo estimation and numerical integration. Monte Carlo density estimation is a natural starting point for computing the bounds in Theorem 12 and Theorem 17. However, Monte Carlo methods require a tremendous number of samples to yield accurate results. Note that in both cases the Gaussian CDF is raised to a large power in the numerator and the denominator, causing numerical instabilities. To improve runtime performance and stability, we use a scientific library that can compute integrals with high precision: the Python mpmath library (mpmath, 2023) can compute each density and the outer integral to an arbitrary degree of precision.
As shown in Fig. 3, for low-sensitivity queries the variance between Monte Carlo trials becomes unwieldy, even when we rely on common open-source libraries that can precisely estimate the CDF of a univariate Gaussian.
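The exact expressions in Theorems 12 and 17 are not reproduced in this excerpt; the sketch below illustrates the kind of quantity involved, a Gaussian expectation of a power of the standard normal CDF, and how mpmath evaluates it to arbitrary precision (the function name and arguments are ours).

```python
import mpmath as mp

mp.mp.dps = 50  # work with 50 significant decimal digits

def expected_cdf_power(shift: float, sigma: float, power: int) -> mp.mpf:
    # E_{Z ~ N(0, sigma^2)} [ Phi((Z + shift) / sigma)^power ], evaluated by
    # high-precision quadrature over the Gaussian density of Z.
    def integrand(z):
        density = mp.exp(-z**2 / (2 * sigma**2)) / (sigma * mp.sqrt(2 * mp.pi))
        return mp.ncdf((z + shift) / sigma) ** power * density
    return mp.quad(integrand, [-mp.inf, mp.inf])

# A bound of the form log(E[...] / E[...]) can then be evaluated as, e.g.,
# mp.log(expected_cdf_power(1.0, 2.0, 9) / expected_cdf_power(0.0, 2.0, 9))
```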
4 EMPIRICAL RESULTS
We benchmark our SVT-like method (FSRC with Gaussian Above Threshold) on mobility and energy datasets. Additional experiments with the pure DP Gaussian Report Noisy Max bound are included in the appendix. Experiments were run on an Apple M1 processor (32 GB), except for the Monte Carlo numerical estimation, which was run on an NVIDIA V100 GPU with 32 GB of VRAM.
The UCI Bikes Dataset captures the utilization of shared bikes in a geographic area over the course of a year (Fanaee-T and Gama, 2013). Since we do not know the upper bound on registered customers, we take the maximum (6,946 users) and assume that this is a public value. Note that our analysis is still worst-case, in that a user is assumed to contribute to each daily (normalized) count query. We set Above Threshold to HALT when bike sharing exceeds a threshold $T$. In Fig. 17 we plot a range of calibrations, and in the supplemental material we show the privacy spend over time. The LCL London Energy Dataset (Greater London Authority, 2012) consists of daily energy usage for 5,564 customers. The larger number of queries increases the privacy cost; however, this is balanced by a decrease in the individual contribution to each query. In Fig. 5 we plot a range of calibrations. As the threshold decreases, we witness more queries released and therefore a greater privacy spend.
4.1 Online Selection with Filtered Self-Reporting Composition
Answering sequential questions in an interactive setting typically requires the up-front selection of privacy parameters. The most common method to answer a large number of queries is the Sparse Vector Technique; in this setting, Above Threshold serves as a privacy primitive in more complex algorithms (Hardt and Rothblum, 2010). First, the curator decides how many times they expect to release a query in order to allocate the privacy budget. The budget is further split across two noise-adding mechanisms: noise is added to each query, as well as to a public threshold parameter, which determines whether a point is flagged as a “$\top$” (interesting) or a “$\bot$” (not interesting). One solution, and our baseline for comparison, restricts the curator to fully-adaptive composition using a privacy filter (Rogers et al., 2023) and a novel result by Zhu and Wang (2020) which places no restriction on the number of queries. We are primarily concerned with satisfying the following two requirements: (1) the analyst should be able to modify queries after Above Threshold halts, and (2) they should be able to guarantee that an up-front privacy budget is respected. Therefore, we combine FSRC with the privacy analysis in Theorem 17. To calibrate the noise scale, we take an RDP bound of Gaussian Above Threshold (Theorem 7).
Experiment Parameters.
In each case, the datasets have a temporal axis, meaning that when the mechanism halts, we restart at the next timestep and continue accumulating privacy spend. To evaluate FSRC, we compare Algorithm 4 instantiated with Gaussian Above Threshold (Algorithm 3) to a Privacy Filter (Whitehouse et al., 2023) with sequential composition of the same mechanism. In FSRC, we apply Theorem 17 for privacy accounting. For the baseline, we compute an RDP bound using Zhu and Wang (2020). For each pair of noise multiplier and threshold, we plot each result in a scatter plot, along with the privacy spend over 50 runs. We use AutoDP (Wang, 2023), with support for global sensitivity calibration, to apply their bound. As in Zhu and Wang (2020), we fix the remaining noise parameters. We ran all our experiments over 50 runs with a range of values for the noise multiplier. Importantly, the privacy spend reported by a DP filter cannot be disclosed without leaking privacy, and must therefore be replaced with a privacy odometer if the data curator wishes to share the spent budget (Rogers et al., 2016).
Utility.
We compute the F1 score to measure utility for both algorithms. Since each algorithm outputs a binary vector, we can measure how often the mechanism matches a ground-truth algorithm in which no noise is added to the queries or the threshold.
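As a sketch of the utility metric (names ours), the F1 score compares the mechanism's binary releases with the flags a noiseless run would produce:

```python
import numpy as np
from sklearn.metrics import f1_score

def svt_utility(released_flags: np.ndarray, queries: np.ndarray,
                threshold: float) -> float:
    # Ground truth: which queries truly exceed the (noise-free) public threshold.
    true_flags = (queries >= threshold).astype(int)
    # F1 score between the noisy binary releases and the noiseless answers.
    return f1_score(true_flags, released_flags)
```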
5 RELATED WORK
Our work spans four areas of privacy research, (1) accounting methods for Gaussian mechanisms, (2) maximizing the number of interactive, user-level queries, (3) privacy filters and fully adaptive composition, and (4) ex-post privacy analysis.
The Gaussian Mechanism adds calibrated noise to the output of a query. In the learning setting, a number of privacy accounting techniques exist (Abadi et al., 2016); however, such approaches do not map directly to query selection. In the online setting, ex-post DP accounting already exists for the Laplace Mechanism (Ligett et al., 2017), yet many successful deployments of DP rely on Gaussian mechanisms, likely due to the greater noise concentration around the mean and the thinner tails of the Gaussian distribution (Dwork and Roth, 2014).
The Sparse Vector Technique (SVT) (Dwork et al.,, 2009) is a foundational differentially private algorithm. By splitting the privacy budget between the cost of returning a binary vector and the query of interest, utility is increased.
A Gaussian version offers better utility than the Laplace-based mechanism and can, surprisingly, support a potentially infinite number of queries (Zhu and Wang, 2020).
Hartmann et al. (2022) introduce Output DP as a means of analyzing SVT and Propose-Test-Release with Laplace noise. Their work generalizes results from Ligett et al. (2017), and they show how basic composition bounds can (for few queries) offer better utility than advanced composition results.
Ex-Post Privacy. Ligett et al. (2017) defined ex-post privacy in terms of pure DP. In common with this work, they studied SVT, but with a focus on the Laplace Mechanism. Redberg and Wang (2021) consider ex-post privacy for data-dependent privacy parameters with the Gaussian mechanism over real-valued queries.
Privacy Filters, first proposed by Rogers et al. (2016), are a key ingredient for adaptively composing privacy-preserving mechanisms while tracking the privacy spend. Feldman and Zrnic (2021) and Lécuyer (2021) extended these results to Rényi DP, with the caveat that the RDP orders need to be pre-defined. Recently, Whitehouse et al. (2023) tightened these results, and Rogers et al. (2023) then applied ex-post privacy to probabilistic DP mechanisms. We consider their efforts to be closest to ours; however, they do not consider online query selection.
6 CONCLUSION
We provided pure-DP bounds on Gaussian selection mechanisms with bounded queries. Additionally, we developed new composition tools for ex-ante and ex-post privacy analysis. We demonstrated increased query accuracy for an equivalent privacy budget in energy and mobility datasets in the online setting.
We consider ex-post privacy analysis a promising method for tightening privacy accounting in online algorithms, and our composition result, FSRC, could be applied to other sequential algorithms.
Accounting for user-level privacy over long time horizons is necessary in a number of sequential and interactive decision-making tasks. In particular, we foresee direct benefit in applying these methods to model selection.
Acknowledgements
J.L. is supported by the Google DeepMind Graduate Fund and the Fonds de Recherche du Québec. We wish to thank Thomas Steinke for his comments on an early draft. We also thank Guillaume Rabusseau, Jose Gallego-Posada, Maxime Wabartha and Vincent Luczkow for many fruitful discussions. Finally, we thank Iosif Pinelis for bringing l’Hôpital’s Monotone Rule to our attention.
References
- Abadi et al., (2016) Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318.
- Ács and Castelluccia, (2011) Ács, G. and Castelluccia, C. (2011). I have a dream! (Differentially private smart metering). In International Workshop on Information Hiding, pages 118–132. Springer.
- Aktay et al., (2020) Aktay, A., Bavadekar, S., Cossoul, G., Davis, J., Desfontaines, D., Fabrikant, A., Gabrilovich, E., Gadepalli, K., Gipson, B., Guevara, M., et al. (2020). Google covid-19 community mobility reports: anonymization process description (version 1.1). arXiv preprint arXiv:2004.04145.
- Anderson et al., (1993) Anderson, G., Vamanamurthy, M., and Vuorinen, M. (1993). Inequalities for quasiconformal mappings in space. Pacific J. Math., 160(1):1–18.
- Balle et al., (2020) Balle, B., Barthe, G., Gaboardi, M., Hsu, J., and Sato, T. (2020). Hypothesis testing interpretations and renyi differential privacy. In Chiappa, S. and Calandra, R., editors, The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pages 2496–2506. PMLR.
- Balle and Wang, (2018) Balle, B. and Wang, Y.-X. (2018). Improving the gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning, pages 394–403. PMLR.
- Bohli et al., (2010) Bohli, J.-M., Sorge, C., and Ugus, O. (2010). A privacy model for smart metering. In 2010 IEEE International Conference on Communications Workshops, pages 1–5.
- Cummings et al., (2018) Cummings, R., Krehbiel, S., Lai, K. A., and Tantipongpipat, U. (2018). Differential privacy for growing databases. Advances in Neural Information Processing Systems, 31.
- Desfontaines, (2023) Desfontaines, D. (2023). Publishing wikipedia usage data with strong privacy guarantees.
- Dinur and Nissim, (2003) Dinur, I. and Nissim, K. (2003). Revealing information while preserving privacy. In Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’03, page 202–210. Association for Computing Machinery.
- Dwork, (2006) Dwork, C. (2006). Differential privacy. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 4052 LNCS, pages 1–12. Springer Verlag.
- Dwork et al., (2006) Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 3876 LNCS, pages 265–284.
- Dwork et al., (2009) Dwork, C., Naor, M., Reingold, O., Rothblum, G. N., and Vadhan, S. (2009). On the complexity of differentially private data release. In Proceedings of the forty-first annual ACM symposium on Theory of computing, New York, NY, USA. ACM.
- Dwork and Roth, (2014) Dwork, C. and Roth, A. (2014). The Algorithmic Foundations of Differential Privacy. Foundations and trends in theoretical computer science. Now.
- Fanaee-T and Gama, (2013) Fanaee-T, H. and Gama, J. (2013). Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, pages 1–15.
- Feldman and Zrnic, (2021) Feldman, V. and Zrnic, T. (2021). Individual privacy accounting via a renyi filter. Advances in Neural Information Processing Systems, 34:28080–28091.
- Greater London Authority, (2012) Greater London Authority (2012). Smartmeter Energy Use Data in London Households.
- Haji Mirzaee et al., (2022) Haji Mirzaee, P., Shojafar, M., Cruickshank, H., and Tafazolli, R. (2022). Smart grid security and privacy: From conventional to machine learning issues (threats and countermeasures). IEEE Access, 10:52922–52954.
- Hardt and Rothblum, (2010) Hardt, M. and Rothblum, G. N. (2010). A multiplicative weights mechanism for privacy-preserving data analysis. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science. IEEE.
- Hartmann et al., (2022) Hartmann, V., Bindschaedler, V., Bentkamp, A., and West, R. (2022). Privacy accounting εconomics: Improving differential privacy composition via a posteriori bounds. arXiv preprint arXiv:2205.03470.
- Hu et al., (2021) Hu, T., Wang, S., She, B., Zhang, M., Huang, X., Cui, Y., Khuri, J., Hu, Y., Fu, X., Wang, X., et al. (2021). Human mobility data in the covid-19 pandemic: characteristics, applications, and challenges. International Journal of Digital Earth, 14(9):1126–1147.
- Kasiviswanathan and Smith, (2014) Kasiviswanathan, S. P. and Smith, A. (2014). On the ’semantics’ of differential privacy: A bayesian formulation. Journal of Privacy and Confidentiality, 6(1).
- Lécuyer, (2021) Lécuyer, M. (2021). Practical privacy filters and odometers with Rényi differential privacy and applications to differentially private deep learning. arXiv preprint arXiv:2103.01379.
- Ligett et al., (2017) Ligett, K., Neel, S., Roth, A., Waggoner, B., and Wu, S. Z. (2017). Accuracy first: Selecting a differential privacy level for accuracy constrained erm. Advances in Neural Information Processing Systems, 30.
- Lyu et al., (2016) Lyu, M., Su, D., and Li, N. (2016). Understanding the sparse vector technique for differential privacy. arXiv preprint arXiv:1603.01699.
- McKenna and Sheldon, (2020) McKenna, R. and Sheldon, D. R. (2020). Permute-and-flip: A new mechanism for differentially private selection. Advances in Neural Information Processing Systems, 33:193–203.
- Mironov, (2017) Mironov, I. (2017). Rényi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF), pages 263–275. IEEE.
- mpmath, (2023) mpmath (2023). mpmath: a Python library for arbitrary-precision floating-point arithmetic (version 1.3.0). http://mpmath.org/.
- Narayanan and Shmatikov, (2008) Narayanan, A. and Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (sp 2008), pages 111–125.
- Papernot et al., (2018) Papernot, N., Song, S., Mironov, I., Raghunathan, A., Talwar, K., and Erlingsson, Ú. (2018). Scalable private learning with pate. arXiv preprint arXiv:1802.08908.
- Pinelis, (2002) Pinelis, I. (2002). L’hospital type rules for monotonicity, with applications. J. Inequal. Pure Appl. Math, 3(1).
- Ratnam et al., (2017) Ratnam, E. L., Weller, S. R., Kellett, C. M., and Murray, A. T. (2017). Residential load and rooftop pv generation: an australian distribution network dataset. International Journal of Sustainable Energy, 36(8):787–806.
- Redberg and Wang, (2021) Redberg, R. and Wang, Y.-X. (2021). Privately publishable per-instance privacy. Advances in Neural Information Processing Systems, 34:17335–17346.
- Rényi, (1961) Rényi, A. (1961). On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, volume 4, pages 547–562.
- Rogers et al., (2023) Rogers, R., Samorodnitsky, G., Wu, Z. S., and Ramdas, A. (2023). Adaptive privacy composition for accuracy-first mechanisms. arXiv preprint arXiv:2306.13824.
- Rogers et al., (2016) Rogers, R. M., Roth, A., Ullman, J., and Vadhan, S. (2016). Privacy odometers and filters: Pay-as-you-go composition. Advances in Neural Information Processing Systems, 29.
- Wang et al., (2016) Wang, Q., Zhang, Y., Lu, X., Wang, Z., Qin, Z., and Ren, K. (2016). Real-time and spatio-temporal crowd-sourced social network data publishing with differential privacy. IEEE Trans. Dependable Secure Comput., pages 1–1.
- Wang, (2023) Wang, Y.-X. (2023). autodp: autodp: A flexible and easy-to-use package for differential privacy.
- Whitehouse et al., (2023) Whitehouse, J., Ramdas, A., Rogers, R., and Wu, S. (2023). Fully-adaptive composition in differential privacy. In International Conference on Machine Learning, pages 36990–37007. PMLR.
- Xu et al., (2017) Xu, F., Tu, Z., Li, Y., Zhang, P., Fu, X., and Jin, D. (2017). Trajectory recovery from ash: User privacy is not preserved in aggregated mobility data. In Proceedings of the 26th international conference on world wide web, pages 1241–1250.
- Zhang et al., (2021) Zhang, W., Krehbiel, S., Tuo, R., Mei, Y., and Cummings, R. (2021). Single and multiple Change-Point detection with differential privacy. J. Mach. Learn. Res., 22(29):1–36.
- Zhu and Wang, (2020) Zhu, Y. and Wang, Y.-X. (2020). Improving sparse vector technique with renyi differential privacy. Advances in Neural Information Processing Systems, 33:20249–20258.
Appendix A PROOFS FROM SECTION 3
Our main contribution relies on a number of proof techniques which are extended from the simpler Gaussian Report Noisy Max setting.
A.1 Pure DP Gaussian Report Noisy Max
Theorem 12 (Pure DP for Gaussian Report Noisy Max).
Let $M$ be the Gaussian Report Noisy Max mechanism with standard deviation $\sigma$ applied to bounded queries $q_1, \dots, q_m$, each with sensitivity bounded by $\Delta$. Then $M$ satisfies $\varepsilon$-DP with
To prove our pure DP Gaussian Report Noisy Max bound, we first analyze the effect of the sensitivity on the privacy loss of the Gaussian Report Noisy Max mechanism instantiated with uniformly bounded queries.
Lemma 13.
Let $M$ be the Gaussian Report Noisy Max mechanism with standard deviation $\sigma$ applied to bounded queries $q_1, \dots, q_m$, each with sensitivity bounded by $\Delta$. Then we have
Proof.
Fix a pair of neighbouring datasets $D$ and $D'$. With a slight abuse of notation we let $q_i = q_i(D)$ and $q_i' = q_i(D')$ for all $i$. Recall that $M(D) = \arg\max_i \{q_i + Z_i\}$ with $Z_i \sim \mathcal{N}(0, \sigma^2)$, and $|q_i - q_i'| \leq \Delta$ for all $i$. Without loss of generality fix the output to be the first query. Then we have:
Therefore, writing , we get
Note that a priori we have for all . However, for the above inequality to hold we set for and . These imply and , so for . ∎
Given the result above, all that remains is to identify where the supremum in Lemma 13 is attained. To that end we will show that the ratio is non-decreasing in each of its arguments. The following l'Hôpital-like monotonicity rule and a property of ratios of moment generating functions will be useful.
Lemma 14 (Pinelis, 2002; Anderson et al., 1993).
Let $-\infty \leq a < b \leq \infty$ and let $f, g$ be continuous, differentiable functions on $(a, b)$, with $f(a^+) = g(a^+) = 0$ or $f(b^-) = g(b^-) = 0$, and $g'(x) \neq 0$ for $x \in (a, b)$. If $f'/g'$ is non-decreasing on $(a, b)$, then so is $f/g$.
We say that a random variable has tails majorized by an exponential decay if the cumulative distribution function of is such that there exist positive constants satisfying for and for . This is a well-known sufficient condition for the cumulant generating function to exist.
Lemma 15.
Suppose is a random variable with tails majorized by an exponential decay. Then for any the function
is non-decreasing for all .
Proof.
Let be the moment generating function of . Taking the derivative of we get:
Thus, if and only if , which is equivalent to
Therefore is non-decreasing if the derivative of the cumulant generating function (which exists by assumption) is non-decreasing. But note that when exists it is known to be convex and infinitely differentiable, and therefore its derivative is non-decreasing. ∎
Note that by symmetry of the expression of interest in its arguments, it suffices to establish monotonicity with respect to a single variable. Thus we define the following functions:
where and are constants.
Lemma 16.
The function is non-decreasing for all .
Proof.
We are going to apply Lemma 14. First note that since we have . Next we observe that, writing for the density of a standard Gaussian random variable, we have
Thus, when computing we will obtain a term of the form , which can be simplified to
From this we can now show that is proportional to the moment generating function of a random variable with density where is a normalizing constant (independent of and ):
A similar derivation also yields the following expression for :
Putting these two derivations together we obtain the following expression for the test function in Lemma 14:
Noting that the distribution with density satisfies the condition of Lemma 15 we see that is non-decreasing and therefore is non-decreasing. ∎
With all these ingredients in place we see that Theorem 12 follows by using Lemma 16 to show that the supremum for each in Lemma 13 is attained at .
A.2 Pure DP Above Threshold
Theorem 17 (Pure Ex-post Gaussian Above Threshold).
Given a stream of bounded queries with global sensitivity $\Delta$, the Gaussian Above Threshold mechanism (Algorithm 3), halting at time step $k$, satisfies $\varepsilon(k)$-ex-post-DP with
Overall, our proof follows a similar strategy to the proof in the previous section.
Lemma 18.
Let $M$ be the Gaussian Above Threshold mechanism with public threshold $T$, threshold noise of standard deviation $\sigma_1$, and query noise of standard deviation $\sigma_2$, applied to a stream of bounded queries $q_1, q_2, \dots$. Then for any stopping time $k$ we have
Proof.
Fix a pair of neighbouring datasets $D \simeq D'$. Like before, we let $q_i = q_i(D)$ and $q_i' = q_i(D')$; they satisfy $|q_i - q_i'| \leq \Delta$. First, we bound the probability that the mechanism on input $D$ stops at time $k$ as follows:
A similar derivation also yields:
Thus, the ratio of probabilities of stopping at time on and can be bounded as:
Now note that in the bound we set for and . Since all the queries are bounded in we have that for and . Taking the supremum over these ranges completes the proof. ∎
A.3 Filtered Self-Reporting Composition
Theorem 19.
Suppose the mechanisms provided to Algorithm 4 are such that each $M_i$ is $(\hat{\varepsilon}_i, \delta_i)$-pDP and $\varepsilon_i(\cdot)$-ex-post-DP for all $i$. Then Algorithm 4 satisfies $(\varepsilon, \delta)$-DP, where $(\varepsilon, \delta)$ is the budget supplied to the algorithm.
Proof.
We will prove that the mechanism satisfies $(\varepsilon, \delta)$-pDP and therefore also $(\varepsilon, \delta)$-DP. Let $A$ denote the mechanism defined by Algorithm 4 and fix a pair of neighboring datasets $D$ and $D'$. Let $(o_1, \dots, o_K)$ be an output sampled from $A(D)$ (here $K$ is a random stopping time) and consider the privacy loss random variable
First of all we observe that given the outputs up to a certain time step , the stopping condition is deterministic and independent of the dataset. Thus, the last term above vanishes. Furthermore, since each of the mechanisms satisfies -ex-post-DP, for all we have
In addition, since is -pDP, we also have
Now, since the stopping rule guarantees that , we get
∎
Appendix B PURE DP OFFLINE SELECTION MECHANISMS
In Fig. 6 we illustrate the variance observed when taking a Monte-Carlo estimate over a range of values when computing our Pure DP Gaussian Report Noisy Max bound.
B.1 Numerical evaluation compared to post-processing bounds
In Fig. 7 we study, for Gaussian Report Noisy Max, how the privacy loss changes as the noise (and hence the privacy) is increased: we vary the standard deviation of the query noise, $\sigma$, and the number of queries (the dimension $m$) and examine their effect on the privacy loss estimate. We note a slow increase in the privacy loss as the number of queries increases, in particular when compared to the ex-ante analysis.
B.2 Where pure DP offers tighter accounting for Report Noisy Max
To assess whether there are any utility gains from performing a numerical evaluation of the privacy loss, we produce a heatmap of the difference between the standard post-processing bound under RDP and our method in Fig. 8. There exists a smooth region where the privacy accounting difference is significant, particularly in the high-privacy, high-query setting. As the queries become less sensitive, this region becomes more pronounced.
B.3 Offline Selection with Gaussian Report Noisy Max
Each accuracy/privacy loss plot reports the mean value over 1,000 trials, with the shaded region covering the standard error.
In our experiments, we normalize the dataset to values between zero and one. The shaded region represents the standard error across several runs, and the mean value is represented as a line. A classic example of this sort of problem is query selection over a long time horizon. Our experiments center on energy and mobility datasets, which have been known to leak privacy (Narayanan and Shmatikov, 2008). In both instances, user behavior, such as whether power consumption deviates from historical levels, can be joined with other datasets to infer changes to a person's socio-economic situation.
Dataset Pre-Processing
The datasets we consider are temporal event logs over a fixed period of time with a regular reporting interval (e.g. hour, day, month). In each instance, we re-scale the reported values to be between zero and one. These transformations do not affect the task result since we are interested in reporting a discrete value. In the offline selection task, we wish to find the time step whose reported value is largest, and in the online selection task, the goal is to produce a binary vector indicating which steps exceed a fixed threshold.
Utility
We measure the accuracy of Gaussian Report Noisy Max by comparing the true maximum query value with the value of the query selected by the noisy (private) mechanism. Let $q_{i^*}$ be the true maximum query value and $q_{\hat{i}}$ be the value of the query selected by Report Noisy Max. Then accuracy is defined as
where $q_{\max}$ is the maximum value of any query. In our experiments, $q_{\max} = 1$ after normalization. In this construction, perfect utility occurs when the selected query attains the true maximum value. This is a reasonable measure of utility since we wish to measure how close on average we are to the max query value.
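Since the displayed equation is missing from this extraction, the sketch below encodes one natural reading of the description above, namely one minus the normalized gap between the true maximum and the selected query's value; treat the exact form as an assumption.

```python
import numpy as np

def rnm_accuracy(query_values: np.ndarray, selected_index: int,
                 q_max: float = 1.0) -> float:
    # Accuracy = 1 - (true max - value of the selected query) / q_max.
    # Equals 1 when the selected index attains the true maximum (assumed form).
    gap = query_values.max() - query_values[selected_index]
    return 1.0 - gap / q_max
```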
NEAR Energy Dataset.
The NEAR Energy Dataset (Ratnam et al., 2017) includes the annual energy consumption of 300 customers. We formulate the following question: on which day was consumption highest?
London Energy Dataset
Here we follow the same analysis as in the case of the NEAR dataset, but on the London Energy Dataset (Greater London Authority, 2012), which comprises 5,564 customers. The larger number of queries increases the privacy cost; however, this is balanced by a decrease in the individual contribution to each query.
We remark that there are other offline selection mechanisms that guarantee pure DP without Gaussian noise. The exponential mechanism, which calibrates an exponential distribution with respect to a utility function, is a commonly used solution to the offline query selection problem (Dwork and Roth, 2014). Despite a relatively simple implementation, the exponential mechanism requires that a sum over all query values be computed to fix the probability distribution. Permute-and-flip is a drop-in replacement for the exponential mechanism which has been shown to provide some utility benefits, but it has the same calibration requirement (McKenna and Sheldon, 2020). In both instances the global sensitivity no longer depends on the number of queries, since the maximum contribution of each user is taken with respect to each query individually.
In Fig. 12 we compare our method to pure DP baselines with other noise distributions. The global sensitivity for these exponential-based mechanisms depends only on the number of users, whereas our bound depends on the number of queries. One could apply the same post-processing argument to Report Noisy Max with noise drawn from the Laplace distribution; however, this approach was omitted from our baselines since it did not appear competitive in our setting. Finally, we remark that our method may be easier to calibrate since each noisy query can be computed separately.
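For reference, a minimal sketch of the standard exponential mechanism baseline for selection follows (not the authors' exact implementation); the normalization step is where the sum over all query values enters.

```python
import numpy as np

def exponential_mechanism(query_values: np.ndarray, epsilon: float,
                          sensitivity: float, rng: np.random.Generator) -> int:
    # Sample index i with probability proportional to exp(eps * q_i / (2 * Delta)).
    scores = epsilon * query_values / (2.0 * sensitivity)
    scores -= scores.max()          # shift for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()            # normalization requires summing over all queries
    return int(rng.choice(len(query_values), p=probs))
```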
UCI Bikes Dataset
The UCI Bike Sharing Dataset captures the utilization of shared bikes in a geographic area over the course of a year (Fanaee-T and Gama, 2013). We apply the same definition of accuracy and report the day with peak utilization. Since we do not know the upper bound on registered customers, we take the maximum (6,946 users) and assume that this is a public value. We note that our analysis is still worst-case, in that a user is assumed to contribute to each daily (normalized) count query. We limit our Report Noisy Max query to 365 days.
As shown in Fig. 15, the Exponential Mechanism and Permute-and-flip offer a competitive baseline. Fig. 16 provides a comparison of our method with Laplace Report Noisy Max. We observe a clear improvement in the pure DP setting using Gaussian noise and our bounds.
Appendix C COMPARISON WITH PRIVACY FILTERS
Our baseline for comparison is the recent result of Rogers et al. (2023), who propose a privacy filter that introduces an additive bound in both $\varepsilon$ and $\delta$, thereby yielding a guarantee similar to advanced composition. Their construction does not rely on computing intermediate Rényi DP guarantees.
Theorem 20 (-DP Filters (Whitehouse et al.,, 2023)).
Suppose is a sequence of mechanisms such that, for any , is -differentially private conditioned on . Let , and be max privacy parameters s.t. . Let the stopping time function be given by,
Then is -DP.
To compare with Filtered Self-Reporting Composition, we apply the following privacy filter (Definition 21) with Algorithm 5. The bound we compute for Gaussian Above Threshold comes from Zhu and Wang (2020).
Definition 21 (DP Privacy Filter (Rogers et al.,, 2016)).
Let be a -DP Filter. Then the DP Composition Privacy Filter is given by
Appendix D PURE DP ONLINE SELECTION: ADDITIONAL EXPERIMENTS
UCI Bikes Dataset
We use the same UCI dataset as before; however, this time we configure Above Threshold to HALT when bike sharing exceeds a certain threshold between zero and one. In Fig. 17 we plot a range of calibrations, and in Fig. 18 we show the privacy spend over time.
London Energy Dataset.
We plot Gaussian Above Threshold under the same utility metric on the LCL London energy dataset, with 5,564 customers. In Fig. 19 we plot a range of calibrations. As the threshold decreases, we witness more queries released and therefore a greater privacy spend. We show the effect over time in Fig. 20.