
Beyond Internal Data:
Constructing Complete Datasets for Fairness Testing

Varsha Ramineni, Hossein A. Rahmani, Emine Yilmaz, David Barber
Centre of Artificial Intelligence
University College London
{varsha.ramineni.23, hossein.rahmani.22, emine.yilmaz, david.barber}
@ucl.ac.uk
Abstract

As AI becomes prevalent in high-risk domains and decision-making, it is essential to test for potential harms and biases. This urgency is reflected by the global emergence of AI regulations that emphasise fairness and adequate testing, with some mandating independent bias audits. However, procuring the necessary data for fairness testing remains a significant challenge. Particularly in industry settings, legal and privacy concerns restrict the collection of demographic data required to assess group disparities, and auditors face practical and cultural challenges in gaining access to data. Further, internal historical datasets are often insufficiently representative to identify real-world biases. This work focuses on evaluating classifier fairness when complete datasets including demographics are inaccessible. We propose leveraging separate overlapping datasets to construct complete synthetic data that includes demographic information and accurately reflects the underlying relationships between protected attributes and model features. We validate the fidelity of the synthetic data by comparing it to real data, and empirically demonstrate that fairness metrics derived from testing on such synthetic data are consistent with those obtained from real data. This work, therefore, offers a path to overcome real-world data scarcity for fairness testing, enabling independent, model-agnostic evaluation of fairness, and serving as a viable substitute where real data is limited.

1 Introduction

It is well established that Artificial Intelligence (AI) systems have the potential to perpetuate, amplify, and systemise harmful biases [10, 11]. Therefore, rigorous testing for bias is imperative to mitigate harms, especially given the increasing influence of AI in high-stakes domains such as lending, hiring, and healthcare. Such concerns have fuelled active research in bias detection and mitigation [32], and ensuring the fairness of AI systems has become an urgent policy priority for governments around the world [17, 47]. For instance, the EU AI Act imposes strict safety testing on high-risk systems [20], while New York City Local Law 144 mandates independent bias audits for AI used in employment decisions [23].

However, procuring the necessary data for fairness testing remains a significant challenge. Influential works in ethics and fairness of machine learning have highlighted the centrality of datasets [26, 3], emphasising how representative model testing and evaluation data is crucial [7, 40]. To effectively uncover biases, complete datasets that include demographic information and their relationship with model features are essential for controlling the impact of proxy variables. However, having access to such datasets that can reliably be used for evaluating fairness may not always be possible in practice.

As a motivating example, consider a bank that uses an AI system to assess loan applicants based on non-protected variables such as occupation and savings. The bank wants to perform an internal audit as to whether its AI system inadvertently discriminates against certain racial groups. For this, the bank requires data concerning protected attributes such as the race of the applicants alongside data of non-protected attributes required by the model to make a loan decision.

Whilst protected attributes such as race, sex, age etc. are crucial to assess bias, their collection and use in modelling are heavily restricted under regulations such as GDPR [1, 44]. Hence, most internal datasets collected by organisations that use AI systems for decision-making (such as the bank in our example) do not contain such protected attributes [29]. Similarly, procuring the necessary data is a major challenge for auditors, hindering the effective implementation of algorithmic auditing laws [23]. In an external audit of fairness, the auditing agency often has access only to the black box loan predictions and is not provided any data by the bank since existing regulations often do not allow data holders to release datasets that pose privacy concerns. For this external audit the agency needs a joint distribution of both the attributes needed by the black box loan classifier and the protected attributes. Therefore, the development of curated test sets capable of effectively uncovering biases is essential [29].

Recently, there has been a shift away from using limited real test data towards leveraging synthetic data, which has shown promise in a variety of applications ranging from privacy preservation [2] to emulating scenarios for which collecting data is challenging [27].

Figure 1: Creation of a synthetic dataset by using two separate datasets and learning their joint distribution. This produces a complete and representative synthetic dataset with essential demographic information necessary for fairness testing.

Our work focuses on the challenge of evaluating classifier fairness in scenarios where complete data including protected attributes is inaccessible. To overcome this challenge, we propose leveraging separate datasets containing overlapping variables, which are more accessible in real-world scenarios than complete datasets containing all variables [21]. Specifically, in addition to using an internal dataset that lacks protected attribute information, we propose utilising external data, such as census datasets which provide representative demographic information. For example, the UK Office for National Statistics [33] offers multivariate data from the 2021 Census, providing access to customisable combinations of census variables. Such external data could be utilised when the essential demographic information needed for fairness testing is not directly available.

In our motivating example above, even if the protected attribute ‘race’ is not directly available in the internal dataset, its connection to the features used by the model, such as occupation and savings, can be used to evaluate fairness with respect to race. For instance, the internal dataset used by the bank might include information about {loan outcome, savings, occupation}. By utilising an external dataset which contains an overlapping variable, such as {occupation, race}, that is representative of the population, we can learn the joint distribution of variables from these two datasets. This joint distribution can then be used to generate synthetic joint test data that contains all the variables, e.g. {loan outcome, savings, occupation, race}, as shown in Figure 1. This dataset can then be reliably used for evaluating the fairness of the model, as shown in Figure 2.

In this work, we conduct experiments on multiple real-world datasets commonly used in fairness research, simulating realistic scenarios involving separated datasets, such as isolated protected attributes and only a single overlapping variable. Our results show that the synthetic test data generated using our proposed approach exhibits high fidelity when compared to real test data. Crucially, we find that fairness metrics derived from testing classifier models on synthetic data closely align with those obtained from real data. These findings suggest that our approach provides a reliable method for fairness evaluation in scenarios where complete datasets are inaccessible, offering a viable alternative for testing in such contexts.

2 Related Work

Fairness Testing.

Significant work on fairness evaluation has centered on formalising definitions of fairness [32] and emphasising the critical role of data [3, 26, 22, 35]. Recent work has also explored fairness testing in response to regulatory requirements [23, 44] and in the context of industry [24, 29] and software development [14]. Additionally, there is growing interest in sample-efficient approaches to fairness testing [25, 43].

Synthetic Data Generation.

Generative models aim to learn the underlying distribution from real data and produce realistic synthetic data. In our work, we focus on tabular data, as it is the most common data type for real-world applications [41]. Various models have been developed for tabular data generation, from simple methods like SMOTE [13] to deep learning approaches such as CTGAN [49] and TVAE [49]. Significant previous work has focused on privacy-preserving synthetic data generation, employing marginal-based methods like the MST algorithm [30], with work showing that marginal-based algorithms and traditional methods, such as mixture models, are more effective at preserving attribute correlations than deep learning approaches [39, 36]. Recent innovative advancements also include using large language models [8] and offering customisable tabular data generation [45]. However, these methods typically assume access to full datasets to learn from, limiting their effectiveness in scenarios with restricted data access.

Synthetic Data for Bias.

Work on synthetic data for bias has predominantly focused on creating fair data for training [48, 9]; however, this offers no guarantee of unbiased models [19] and reliable testing methods are therefore crucial. Another approach is to simulate different scenarios to explore the interconnection between biases and their effect on performance and fairness evaluations [5, 12]. Recent work highlights the potential of synthetic data for evaluation, showing that, whilst testing on limited real data is unreliable, utilising synthetic test data allows for granular evaluation and testing on distributional shifts [43]. Emerging work also examines which synthetic data generation techniques are most effective for training and evaluating machine learning models, and the implications for model fairness [36].

Figure 2: Evaluation of a pre-trained black-box classifier (e.g. a classifier used by a bank for loan/no-loan decision) on the synthetic data which includes demographics not available during training, enabling the calculation of fairness metrics.

3 Methodology

Returning to our motivating example of a loan classifier, our assumption is that the classifier uses only non-protected attributes, such as savings $X$ and occupation $O$, in order to form a loan prediction $\hat{Y}$; in this case, the loan decision is some function of the non-protected attributes, e.g. $\hat{Y}=f(X,O)$. However, we would like to assess whether this prediction is fair against a protected attribute $A$ such as race. There are various statistical definitions of group fairness in classification, typically conditioned on protected attributes along which fairness should be ensured. We use the following notation: let $Y\in\{+,-\}$ represent the true outcome, $\hat{Y}\in\{+,-\}$ the predicted outcome, and $A\in\{\text{privileged},\text{unprivileged}\}$ the protected attribute. Here, ‘$+$’ denotes a positive classification outcome (e.g., loan approval), while ‘$-$’ denotes a negative outcome (e.g., loan rejection). For instance, the fairness metric Equal Opportunity Difference (EOD) is given by:

\text{EOD} = P(\hat{Y}=+ \mid Y=+, A=\text{unprivileged}) - P(\hat{Y}=+ \mid Y=+, A=\text{privileged})    (1)

To calculate this, one necessary term is $P(\hat{Y}=+ \mid Y=+, A)$, where

P(Y=+, A) = \sum_{O,X} P(Y=+, O, X, A).    (2)

This requires a model of the joint distribution (as shown in Figure 1), which can then be used to test the fairness of a pre-trained black-box classifier, as illustrated in Figure 2. In the following section, we explain how to construct joint distributions from a collection of overlapping marginal distributions.
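
For concreteness, the sketch below evaluates Equation (2) from a small joint probability table by summing over the non-protected attributes. The table values and column names are purely illustrative assumptions, not data from the paper.

```python
import pandas as pd

# Hypothetical joint probability table p(Y, O, X, A): one row per configuration,
# with a 'prob' column summing to 1. All values and column names are illustrative.
joint = pd.DataFrame({
    "Y": ["+", "+", "-", "-", "+", "-"],
    "O": ["clerk", "engineer", "clerk", "engineer", "clerk", "clerk"],
    "X": ["low", "high", "low", "high", "high", "high"],
    "A": ["unprivileged", "privileged", "unprivileged", "privileged",
          "privileged", "unprivileged"],
    "prob": [0.10, 0.25, 0.20, 0.15, 0.20, 0.10],
})

# Equation (2): P(Y=+, A) = sum over O, X of P(Y=+, O, X, A)
p_y_pos_a = joint[joint["Y"] == "+"].groupby("A")["prob"].sum()
print(p_y_pos_a)  # P(Y=+, A=privileged) and P(Y=+, A=unprivileged)
```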

3.1 Learning a Joint Distribution

Consider a fairness testing scenario that requires access to the distribution $p(\text{loan outcome}, \text{savings}, \text{occupation}, \text{race})$. Most real-world datasets, such as those provided by publicly available census data, often only provide sets of marginal distributions [21]. Suppose we have two separate datasets with empirical distributions $\hat{p}(\text{loan outcome}, \text{savings}, \text{occupation})$ and $\hat{p}(\text{occupation}, \text{race})$, where occupation is the overlapping variable. Our goal is to estimate the joint distribution $p(\text{loan outcome}, \text{savings}, \text{occupation}, \text{race})$. Theoretically, this problem is ill-posed and therefore requires additional assumptions.

Using marginal data observations and a structural independence assumption, the joint distribution can be estimated using maximum likelihood estimation. We consider below three simple structural independence assumptions, illustrated by graphical models, to fit a joint distribution on four variables $p(x_1,x_2,x_3,x_4)$, given two empirical marginal distributions $\hat{p}(x_1,x_2,x_3)$ and $\hat{p}(x_3,x_4)$. The estimated joint distribution is then used as a generative model to create synthetic data points through sampling [46]. Note that we assume marginal consistency, i.e. that all marginal distributions considered originate from a common underlying joint distribution.

3.1.1 Independence Given Overlap

[Graphical model: $(X_1,X_2)$ and $X_4$ are connected only through $X_3$.]

p(x_1,x_2,x_3,x_4) = p(x_3)\,\hat{p}(x_1,x_2 \mid x_3)\,\hat{p}(x_4 \mid x_3)    (3)

We model the joint distribution of $x_1,x_2,x_3$, and $x_4$ by treating the association between $(x_1,x_2)$ and $x_4$ as the product of their conditional distributions given $x_3$. To estimate $p(x_3)$, we take the average of the proportions from both marginal datasets and use this to sample $x_3$ (see Appendix (A.1) for proof of optimality). To sample from this model, we first sample from $p(x_3)$ and then draw conditional samples for $(x_1,x_2)$ and $x_4$ from the marginal datasets. Note that if the marginals are consistent, namely $\sum_{x_1,x_2}\hat{p}(x_1,x_2,x_3) = \sum_{x_4}\hat{p}(x_3,x_4) \equiv \hat{p}(x_3)$, then we simply set $p(x_3)=\hat{p}(x_3)$.
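
A minimal pandas sketch of this sampler is given below; the DataFrame names (internal_df, external_df), column names, and function name are illustrative assumptions rather than the paper's implementation, and it assumes every category of the overlapping variable appears in both datasets.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def sample_independence_given_overlap(internal_df, external_df, overlap="x3", n=1000):
    """Sample from p(x3) * p_hat(x1, x2 | x3) * p_hat(x4 | x3).

    Assumes every category of the overlapping variable occurs in both datasets.
    """
    # p(x3): average of the empirical proportions from the two marginal datasets
    p1 = internal_df[overlap].value_counts(normalize=True)
    p2 = external_df[overlap].value_counts(normalize=True)
    p_overlap = p1.add(p2, fill_value=0) / 2.0

    overlap_values = rng.choice(p_overlap.index, size=n, p=p_overlap.values)

    rows = []
    for v in overlap_values:
        # p_hat(x1, x2 | x3 = v): a random internal row with that overlap value
        left = internal_df[internal_df[overlap] == v].sample(1, replace=True).iloc[0]
        # p_hat(x4 | x3 = v): a random external row with that overlap value
        right = external_df[external_df[overlap] == v].sample(1, replace=True).iloc[0]
        row = left.to_dict()
        row.update({c: right[c] for c in external_df.columns if c != overlap})
        rows.append(row)
    return pd.DataFrame(rows)
```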

3.1.2 Marginal Preservation

[Graphical model: $(X_1,X_2,X_3)$ are modelled jointly, with $X_4$ connected only through $X_3$.]

p(x_1,x_2,x_3,x_4) = \hat{p}(x_1,x_2,x_3)\,\hat{p}(x_4 \mid x_3)    (4)

We directly use the proportions from the first marginal dataset to model the joint distribution of $x_1$, $x_2$ and $x_3$. A sample is then obtained by sampling from the marginal $\hat{p}(x_1,x_2,x_3)$ and then from the conditional marginal $\hat{p}(x_4 \mid x_3)$. Alternatively, we could preserve the second marginal by modelling the distribution as $p(x_1,x_2,x_3,x_4) = \hat{p}(x_1,x_2 \mid x_3)\,\hat{p}(x_3,x_4)$.
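
Under the same illustrative assumptions as the previous sketch (hypothetical internal_df and external_df sharing one categorical column), Marginal Preservation keeps whole rows of the first dataset and attaches the external attribute conditionally on the overlap; a short sketch:

```python
def sample_marginal_preservation(internal_df, external_df, overlap="x3", n=1000):
    """Sample from p_hat(x1, x2, x3) * p_hat(x4 | x3): the first marginal is kept exactly.

    Assumes external_df has a single column besides the overlapping variable.
    """
    base = internal_df.sample(n, replace=True).reset_index(drop=True)
    extra = [c for c in external_df.columns if c != overlap][0]
    base[extra] = [
        external_df.loc[external_df[overlap] == v, extra].sample(1, replace=True).iloc[0]
        for v in base[overlap]
    ]
    return base
```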

3.1.3 Latent Naïve Bayes

[Graphical model: latent $Z$ is a common parent of $X_1$, $X_2$, $X_3$, and $X_4$.]

p(x_1,x_2,x_3,x_4) = \sum_{z} p(z) \prod_{i=1}^{4} p(x_i \mid z)    (5)

We employ a latent variable model based on the Naïve Bayes assumption by introducing a latent variable $z$, which assumes that $x_1$, $x_2$, $x_3$, and $x_4$ are conditionally independent given $z$. We use the Expectation-Maximization (EM) algorithm [16] to train the model (see Appendix (A.2) for details).
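
The EM updates can be sketched in a few dozen lines of numpy. The following is a minimal illustration under our own assumptions, not the paper's exact implementation (details are in Appendix A.2): variables are assumed categorical and integer-encoded, each dataset observes only a subset of the variables, and the function and argument names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_latent_naive_bayes(datasets, n_states, n_latent=10, n_iter=100):
    """EM for p(x_1..x_d) = sum_z p(z) prod_i p(x_i | z), fitted from datasets that
    each observe only a subset of the variables.

    datasets : list of (obs_vars, data) pairs; obs_vars lists the observed variable
               indices and data is an int array of shape (n_records, len(obs_vars)).
    n_states : list giving the number of categories of each variable.
    """
    pz = np.full(n_latent, 1.0 / n_latent)
    px = [rng.dirichlet(np.ones(k), size=n_latent) for k in n_states]  # px[i][z, value]

    for _ in range(n_iter):
        pz_acc = np.zeros(n_latent)
        px_acc = [np.full((n_latent, k), 1e-12) for k in n_states]  # tiny prior avoids zeros

        for obs_vars, data in datasets:
            # E-step: posterior over z for each record; unobserved variables drop out
            log_post = np.tile(np.log(pz), (len(data), 1))
            for j, var in enumerate(obs_vars):
                log_post += np.log(px[var][:, data[:, j]]).T
            log_post -= log_post.max(axis=1, keepdims=True)
            post = np.exp(log_post)
            post /= post.sum(axis=1, keepdims=True)

            # M-step accumulators
            pz_acc += post.sum(axis=0)
            for j, var in enumerate(obs_vars):
                for k in range(n_states[var]):
                    px_acc[var][:, k] += post[data[:, j] == k].sum(axis=0)

        pz = pz_acc / pz_acc.sum()
        px = [t / t.sum(axis=1, keepdims=True) for t in px_acc]
    return pz, px

def sample_latent_naive_bayes(pz, px, n=1000):
    """Ancestral sampling: draw z, then each variable independently given z."""
    z = rng.choice(len(pz), size=n, p=pz)
    return np.stack(
        [np.array([rng.choice(t.shape[1], p=t[zi]) for zi in z]) for t in px], axis=1
    )
```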

3.1.4 Extension to More Complex Scenarios

We can extend the Latent Naïve Bayes method to include more variables by adding the term $p(x_k \mid z)$ for any new variable $x_k$. Similarly, other methods can be adapted to handle additional variables. For instance, if the second marginal distribution is $\hat{p}(x_3,x_4,x_5,x_6)$, we adjust the conditional distribution from $\hat{p}(x_4 \mid x_3)$ to $\hat{p}(x_4,x_5,x_6 \mid x_3)$. When multiple variables overlap between datasets, such as in the empirical distributions $\hat{p}(x_1,x_2,x_3,x_4)$ and $\hat{p}(x_3,x_4,x_5)$ where $(x_3,x_4)$ are overlapping, we extend the methods to preserve the joint structure. For the Independence Given Overlap method, we use: $p(x_1,x_2,x_3,x_4,x_5) = p(x_3,x_4)\,\hat{p}(x_1,x_2 \mid x_3,x_4)\,\hat{p}(x_5 \mid x_3,x_4)$.
For the Marginal Preservation method, we use: $p(x_1,x_2,x_3,x_4,x_5) = \hat{p}(x_1,x_2,x_3,x_4)\,\hat{p}(x_5 \mid x_3,x_4)$.

In this work, we focus on estimating the joint distribution from two datasets that overlap in a single variable. Real-world datasets may exhibit more complex structures involving multiple datasets. While Latent Naïve Bayes offers a straightforward extension to multiple datasets, there could be alternative approaches such as using Junction Trees [4]. Such work is left for future research, with this study serving as a preliminary exploration of our proposed approach.

4 Experimental Setup

Figure 3: Experimental Setup

We aim to generate synthetic datasets and evaluate their quality based on two criteria: 1) how well they can approximate a real ground-truth dataset, and 2) how accurately they can estimate the fairness of a black-box classifier in situations where complete data, including protected attributes, is inaccessible. We assume that, as in our example, we have access to two separate datasets, for example one containing {loan outcome, savings, occupation} and another containing {occupation, race}, used to estimate a joint distribution and generate a synthetic test dataset including all attributes. In this setup, one dataset includes the protected attribute, while the other contains model input features, with an overlapping variable between the two datasets.

4.1 Datasets

We conduct our experiments using three real-world datasets: Adult [6], COMPAS [38], and German Credit [18], detailed in Table 1, which are commonly used in the fairness literature. For all three datasets we follow the literature by removing instances with null values, and map all continuous variables into categorical variables (see Appendix (B) for details) [28]. These datasets represent complete real data with protected attributes. Our goal is to approximate such data using our synthetic data generation approach.

Table 1: Overview of real-world datasets used in experiments
Name          # Instances   # Attributes   Label         Protected Attributes
Adult [6]     45,222        13             Income        Sex (67.5% male, 32.5% female); Race (86% white, 14% non-white)
COMPAS [38]   5,278         9              Recidivism    Sex (80.5% male, 19.5% female); Race (60.2% white, 39.8% black)
German [18]   1,000         22             Credit Risk   Age (81% > 25, 19% ≤ 25); Sex (69% male, 31% female)
Table 2: Separation of complete real datasets, with each row illustrating how attributes are categorised into ‘external’ and ‘internal’ datasets. The ‘external’ dataset shown includes the protected attributes, while the ‘internal’ dataset comprises the remaining attributes. The overlapping variable shared between the two datasets is listed first in each row.
Dataset          Attributes in ‘External’ Dataset (overlapping variable first)
Adult            relationship, age, sex, race, marital-status, native-country
                 marital-status, age, sex, race, native-country
COMPAS           score, sex, age, race
                 violent score, sex, age, race
German Credit    property, sex, marital-status, age, foreign-worker
                 housing, sex, marital-status, age, foreign-worker

4.2 Simulating Data Scenarios

Our experimental setup is visualised in Figure 3. To assess our approach, we simulate having a known ground truth dataset to compare our generated synthetic data against.

Real Test Data.

Starting with a complete real dataset, we reserve a hold-out real test set $D_{\text{test}}$ (30% of the complete real dataset) that includes all relevant attributes. This is the dataset that we would like to approximate using the synthetic data we generate, and we use it to assess our approach.

Separated Data.

We wish to simulate the scenario where we do not have access to complete data but only have two separate datasets, as illustrated in Figure 1. We therefore separate the remaining complete real data by column into two overlapping datasets. We consider separations where protected attributes are isolated from other variables, and where there is one variable overlapping between datasets. We refer to these separate datasets as ‘internal’ and ‘external’, where the ‘external’ data includes protected attributes not available in the ‘internal’ data. Such separation simulates only having access to protected attributes separately, such as in publicly available census data, and assumes limited overlap of attributes.

Table 2 demonstrates the separation of our three complete real-world datasets. Notably, the ‘external’ datasets include data commonly found as census variables. As illustrated in Figure 1, we use the two separate datasets to estimate the joint distribution of all attributes and generate synthetic test data $D_{\text{synth}}$. We also wish to simulate having a trained classifier that we wish to test for fairness, as shown in Figure 2. This is done by training classifier models on one of the real separate datasets, the ‘internal’ dataset, which does not include protected attributes. The classifier models will then be tested on both synthetic and real test data.
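
To make the simulation concrete, the following sketch holds out a real test set, splits the remaining data by column into ‘internal’ and ‘external’ datasets sharing one overlapping variable, and trains a scikit-learn DecisionTreeClassifier on the internal data only. The file name and column names are illustrative assumptions (loosely following the first Adult row of Table 2), not the paper's released code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("adult_processed.csv")   # hypothetical file: processed Adult data, all categorical

label = "income"                           # outcome label
overlap = "relationship"                   # single overlapping variable
external_cols = [overlap, "age", "sex", "race", "marital-status", "native-country"]

# 30% hold-out real test set containing all attributes
train_df, test_df = train_test_split(df, test_size=0.3, random_state=0)

# 'external' dataset: protected/demographic attributes plus the overlapping variable
external_df = train_df[external_cols]

# 'internal' dataset: remaining attributes plus the overlapping variable (no protected attributes)
internal_cols = [c for c in train_df.columns if c not in external_cols or c == overlap]
internal_df = train_df[internal_cols]

# Black-box classifier trained on the internal data only
features = [c for c in internal_df.columns if c != label]
clf = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                    DecisionTreeClassifier(random_state=0))
clf.fit(internal_df[features], internal_df[label])
```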

4.3 Baselines

To our knowledge, no prior work on fairness testing has tackled the challenge of creating synthetic data from separate datasets that accurately captures the relationship between demographic and model features. We compare our approach with common methods for tabular synthetic data generation. The Independent Model assumes independence between any two variables to estimate the joint distribution [31]. Conditional Tabular GAN (CTGAN) [49] is a state-of-the-art method that learns from the full dataset, unlike our method, which works with separate datasets. Although CTGAN has an advantage due to its access to complete data, we include its performance using default hyperparameters for comparison.
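
As a reference point, the Independent Model baseline can be expressed in a few lines: each column is sampled from its own empirical marginal, discarding all correlations. A minimal sketch is shown below (the DataFrame name and function are hypothetical); CTGAN, by contrast, is trained on the complete dataset using its published implementation.

```python
import pandas as pd

def independent_baseline(real_df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Sample every column independently from its empirical marginal distribution."""
    return pd.DataFrame({
        # a different seed per column so rows are not simply copied from real_df
        col: real_df[col].sample(n, replace=True, random_state=seed + i)
                         .reset_index(drop=True)
        for i, col in enumerate(real_df.columns)
    })
```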

5 Evaluating the Quality of Synthetic Datasets

We use two criteria to evaluate the quality of our synthetic datasets: 1) How does the synthetic data compare with real data? and 2) How do the fairness metrics computed on the synthetic data compare with those computed on real data?

We present results for eighteen synthetic test datasets which were generated using the three joint distribution estimation methods, applied to six different pairs of separated data across real world datasets: Adult, COMPAS and German Credit. Table 2 shows an overview of how our datasets have been separated and which overlapping attributes have been used. In addition to the synthetic datasets generated using our proposed approach, we also generate synthetic datasets using the two baseline approaches and compare the quality of our synthetic datasets with the quality of synthetic datasets generated using the baseline methods.

Table 3: Fidelity metrics for synthetic datasets of the Adult dataset, generated from separate data (‘relationship’ overlapping) with different joint estimation methods. Metrics include total variation distance complement (1-TVD), contingency similarity (CS), discriminator measure (DM), and KL divergence of $p(A,Y)$ in synthetic vs real data (where $Y$ is the outcome label and $A$ is a protected attribute such as race and sex). Baseline methods include CTGAN and Indep.
                                Overall Fidelity              Joint Distribution for (A,Y)
Method                          1-TVD ↑   CS ↑     DM ↓      KL (Race) ↓   KL (Sex) ↓
Indep-Overlap (Relationship)    0.993     0.983    0.588     0.002         0.001
Marginal (Relationship)         0.993     0.983    0.588     0.002         0.001
Latent (Relationship)           0.986     0.968    0.658     0.002         0.002
CTGAN                           0.935     0.938    0.656     0.132         0.048
Independent                     0.935     0.895    0.808     0.005         0.026
(a) Adult Data   (b) COMPAS Data   (c) German Credit Data

Figure 4: Box-plots of fairness metrics for a Decision Tree Classifier across synthetic datasets. Each subplot represents a specific fairness metric for a protected attribute, showing the distribution of metrics from bootstrap samples. The top box-plot in each subplot displays the distribution of the metric from testing the classifier on real test data (blue), the middle boxplots (green) are for synthetic data generated using our approach differentiated by data separation methods (overlapping variables in brackets) and joint estimation methods, and the bottom box plots are for baseline methods (white).

5.1 Overall Fidelity of Synthetic Data Compared to Real Data

Fidelity evaluates how close the distribution of the synthetic data is to that of the real data with metrics often estimating the difference between marginal distributions [34, 37, 42]. To evaluate the fidelity of our synthetic datasets, we focus on the following metrics:

  • Total Variation Distance (TVD): Measures the difference between the empirical distributions of a variable in the synthetic data and the real data, defined as half the $L_1$ distance. We use the TVD Complement score, $1-\text{TVD}$, where higher scores close to one indicate better quality synthetic data (averaged across variables) [34].

  • Contingency Similarity (CS): Assesses the similarity between normalised contingency tables of two variables, one from the real data and one from the synthetic data. This metric is calculated by first normalising the contingency tables to show the proportion of each category combination, then computing the TVD between these tables. The complement, $1-\text{TVD}$, is used so that higher values close to one reflect greater similarity (averaged across variables) [34]. A minimal sketch of the TVD complement and contingency similarity computations follows this list.

  • Cramér’s V Correlation: Quantifies the strength of association between two categorical variables based on the Chi-square statistic [15]. We calculate the difference in Cramér’s V correlation between the synthetic and real data for each pair of variables.

  • Discriminator Measure (DM): Evaluates whether the synthetic data can be distinguished from the real data. We train a Random Forest Classifier on a balanced dataset, with synthetic data labeled as 1 and real data labeled as 0. The classifier’s average accuracy on a test set is reported across five trials with different random seeds [8].
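
A minimal sketch of the first two metrics, assuming categorical pandas DataFrames real_df and synth_df with identical columns (names are illustrative), is given below; the discriminator measure additionally requires training a classifier on the pooled data.

```python
import itertools
import pandas as pd

def tvd_complement(real: pd.Series, synth: pd.Series) -> float:
    """1 - TVD between the empirical distributions of one categorical column."""
    p = real.value_counts(normalize=True)
    q = synth.value_counts(normalize=True)
    p, q = p.align(q, fill_value=0)
    return 1 - 0.5 * (p - q).abs().sum()

def contingency_similarity(real_df, synth_df, col_a, col_b) -> float:
    """1 - TVD between the normalised two-way contingency tables of a column pair."""
    p = pd.crosstab(real_df[col_a], real_df[col_b], normalize=True).stack()
    q = pd.crosstab(synth_df[col_a], synth_df[col_b], normalize=True).stack()
    p, q = p.align(q, fill_value=0)
    return 1 - 0.5 * (p - q).abs().sum()

def average_fidelity(real_df, synth_df):
    """Average the two scores over all columns / column pairs."""
    cols = list(real_df.columns)
    tvd = sum(tvd_complement(real_df[c], synth_df[c]) for c in cols) / len(cols)
    pairs = list(itertools.combinations(cols, 2))
    cs = sum(contingency_similarity(real_df, synth_df, a, b) for a, b in pairs) / len(pairs)
    return tvd, cs
```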

The eighteen synthetic datasets generated using our approach demonstrate high fidelity to the real test data, with an example shown in Table 3 for the Adult dataset. The average $1-\text{TVD}$ values across synthetic data for the Adult, COMPAS, and German datasets are 0.991, 0.978, and 0.966 respectively, while the average CS values are 0.978, 0.953, and 0.926. These results demonstrate the effectiveness of our approach in generating data that closely mirrors the proportions of the real test dataset. The results also show competitive or superior performance compared to the CTGAN baseline method, which generates synthetic data from complete data rather than separate data. The DM scores reveal moderate accuracy in distinguishing synthetic from real data. Across the eighteen synthetic datasets there is on average a 12.9% reduction in discriminator performance compared to the Independent Baseline and an 8.2% reduction compared to the CTGAN Baseline, suggesting that the synthetic test data is more challenging to differentiate from real data. Additionally, the difference in Cramér’s V correlations between synthetic and real datasets suggests that the attribute correlations in our synthetic data closely match those in the real data, showing greater similarity than baseline methods. See Appendix (C.1) for correlation figures and full fidelity results.

5.2 Protected Attribute and Outcome Relationship in Synthetic Data Compared to Real Data

As illustrated in Section 3, understanding the relationship between the protected attribute $A$ and the outcome label $Y$ is essential for assessing group disparities. When $A$ and $Y$ are located in separate datasets, such as the simple case in our loan example, it is crucial that the relationship between these variables $(A,Y)$ is accurately reconstructed in the synthetic datasets. We therefore measure the Kullback-Leibler (KL) divergence, $D_{\text{KL}}(p_{\text{synth}}(A,Y) \parallel p_{\text{real}}(A,Y))$, between the joint distributions $p(A,Y)$ of synthetic and real data. KL divergence values close to zero indicate that the joint distribution of protected attribute and outcome label in the synthetic data is similar to the distribution in the real data.
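
A minimal sketch of this computation, assuming categorical DataFrames with hypothetical column names for the protected attribute and outcome label, and a small constant to guard against empty cells:

```python
import numpy as np
import pandas as pd

def kl_joint(synth_df, real_df, a_col="race", y_col="income", eps=1e-12):
    """KL( p_synth(A, Y) || p_real(A, Y) ) for two categorical columns."""
    p = pd.crosstab(synth_df[a_col], synth_df[y_col], normalize=True).stack()
    q = pd.crosstab(real_df[a_col], real_df[y_col], normalize=True).stack()
    p, q = p.align(q, fill_value=0)
    p, q = p.to_numpy() + eps, q.to_numpy() + eps   # small constant guards empty cells
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```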

Table 3 presents the divergence for the Adult dataset, focusing on synthetic data generated from separate data which had ‘relationship’ as the overlapping variable. Across all separations and joint distribution estimation methods for Adult Data, the average KL divergence is 0.002 for Race and 0.001 for Sex. Despite generating synthetic data from separate datasets with only one overlapping variable, the joint distribution of protected attributes and outcome values is accurately reconstructed, as evidenced by the low KL divergence values. In comparison, CTGAN shows higher KL divergence values of 0.132 for Race and 0.048 for Sex. Similar patterns are observed across the other datasets, with detailed results provided in Appendix (C.2).

5.3 Fairness Metrics from Synthetic Data Compared to Real Data

We next examine how fairness metrics computed on synthetic test datasets compare with those from real test datasets. Using the notation from Section 3, we focus on the Equal Opportunity Difference (EOD) (Equation 1) and two other common metrics: Disparate Impact (DI) and Average Odds Difference (AOD) [32]. The Disparate Impact (DI) metric compares the ratio of positive (favorable) outcomes between the unprivileged and the privileged groups and can be computed as:

\text{DI} = \dfrac{p(\hat{Y}=+ \mid A=\text{unprivileged})}{p(\hat{Y}=+ \mid A=\text{privileged})}    (6)

The Average Odds Difference (AOD) metric measures the disparity between the false positive rate and true positive rate for the unprivileged and privileged groups and can be written as follows:

\text{AOD} = \tfrac{1}{2}\big[\, p(\hat{Y}=+ \mid Y=-, A=\text{unprivileged}) - p(\hat{Y}=+ \mid Y=-, A=\text{privileged}) + p(\hat{Y}=+ \mid Y=+, A=\text{unprivileged}) - p(\hat{Y}=+ \mid Y=+, A=\text{privileged}) \,\big]    (7)

Figure 4 compares fairness metrics between synthetic and real test datasets for a Decision Tree classifier. For each dataset, we generate 1,000 bootstrap samples of the same size as the real test data to compute fairness metrics. Box-plots for DI and AOD illustrate the distribution of these metrics. EOD, which trends similarly to AOD, is omitted from the figure but included in Appendix (C.3) with detailed results on the absolute differences between bootstrap means of fairness metrics from synthetic and real data. The results show that the fairness metrics from our synthetic test data closely match those from real data, outperforming baseline methods on nearly all metrics and protected attributes, except for DI for race in the Adult dataset. Notably, the synthetic data for the COMPAS dataset performs best, with absolute differences of 0.000 in bootstrap means for AOD and DI values for race, achieved using the ‘Marginal’ joint estimation method on separate data with the ‘violent score’ variable overlapping. For the Adult dataset, we also see small absolute differences in bootstrap means, with values as low as 0.002 for DI related to sex, 0.003 for AOD related to race, and 0.010 for AOD related to sex. For the German dataset, we see similar results, showing small absolute differences of 0.005 for AOD and 0.015 for DI related to sex. Despite larger differences shown in fairness metrics for age, the synthetic data still outperforms baseline methods.
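
The bootstrap comparison in Figure 4 can be reproduced along these lines, reusing the fairness_metrics helper sketched above; the sample sizes, seeds, and variable names are illustrative.

```python
import pandas as pd

def bootstrap_metric(df, metric="AOD", n_boot=1000, seed=0, **metric_kwargs):
    """Distribution of one fairness metric over bootstrap resamples of a test set."""
    values = []
    for b in range(n_boot):
        resample = df.sample(len(df), replace=True, random_state=seed + b)
        values.append(fairness_metrics(resample, **metric_kwargs)[metric])
    return pd.Series(values)

# Example comparison of bootstrap means on real vs. synthetic test data:
# real_aod = bootstrap_metric(real_test_with_preds, "AOD")
# synth_aod = bootstrap_metric(synth_test_with_preds, "AOD")
# print(abs(real_aod.mean() - synth_aod.mean()))
```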

6 Conclusion and Future Work

In this study, we tackled the challenge of evaluating classifier fairness when complete datasets, including protected attributes, are inaccessible. We proposed an approach that utilises separate overlapping datasets to estimate a joint distribution and generate complete synthetic test data which includes demographic information and accurately captures the relationships between demographics and model features essential for fairness testing. Our empirical analysis demonstrated that the fairness metrics derived from this synthetic test data closely match those obtained from real data. Our results further show that even with the assumption of only a single overlapping variable between separate datasets, and simple joint distribution estimation methods, the synthetic data can closely mirror real data outcomes and exhibit high fidelity.

This work demonstrates a promising approach for fairness testing by leveraging marginally overlapping datasets to curate effective test datasets. However, we simulated separate datasets and data scenarios; future research could explore incorporating real public data and more complex data scenarios to validate the results obtained. We also employed three joint estimation methods using structural assumptions. Future research could instead explore all feasible joint distributions that meet the constraints of the available marginal distributions, and thus work towards defining bounds within which the true fairness metrics are likely to fall.

Acknowledgments and Disclosure of Funding

The authors declare no competing interests related to this paper. This research was supported by the UKRI Engineering and Physical Sciences Research Council (EPSRC) [grant numbers EP/S021566/1 and EP/P024289/1].

References

  • Andrus et al. [2021] McKane Andrus, Elena Spitzer, Jeffrey Brown, and Alice Xiang. What We Can’t Measure, We Can’t Understand: Challenges to Demographic Data Procurement in the Pursuit of Fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021.
  • Assefa et al. [2021] Samuel A. Assefa, Danial Dervovic, Mahmoud Mahfouz, Robert E. Tillman, Prashant Reddy, and Manuela Veloso. Generating Synthetic Data in Finance: Opportunities, Challenges and Pitfalls. In Proceedings of the 1st ACM International Conference on AI in Finance, 2021.
  • Bao et al. [2021] Michelle Bao, Angela Zhou, Samantha Zottola, Brian Brubach, Sarah Desmarais, Aaron Horowitz, Kristian Lum, and Suresh Venkatasubramanian. It’s COMPASlicated: The Messy Relationship between RAI datasets and Algorithmic Fairness Benchmarks. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
  • Barber [2012] David Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, USA, 2012.
  • Baumann et al. [2023] Joachim Baumann, Alessandro Castelnovo, Riccardo Crupi, Nicole Inverardi, and Daniele Regoli. Bias on Demand: A Modelling Framework That Generates Synthetic Data With Bias. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 2023.
  • Becker and Kohavi [1996] Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20.
  • Bergman et al. [2023] A Stevie Bergman, Lisa Anne Hendricks, Maribeth Rauh, Boxi Wu, William Agnew, Markus Kunesch, Isabella Duan, Iason Gabriel, and William Isaac. Representation in AI Evaluations. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 2023.
  • Borisov et al. [2023] Vadim Borisov, Kathrin Sessler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Language Models are Realistic Tabular Data Generators. In Proceedings of the 11th International Conference on Learning Representations, 2023.
  • Breugel et al. [2024] Boris van Breugel, Trent Kyono, Jeroen Berrevoets, and Mihaela van der Schaar. DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems, 2024.
  • Buolamwini and Gebru [2018] Joy Buolamwini and Timnit Gebru. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 2018 ACM Conference on Fairness, Accountability and Transparency, 2018.
  • Caliskan et al. [2017] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.
  • Castelnovo et al. [2022] Alessandro Castelnovo, Riccardo Crupi, Greta Greco, Daniele Regoli, Ilaria Giuseppina Penco, and Andrea Claudio Cosentini. A Clarification of the Nuances in the Fairness Metrics Landscape. Scientific Reports, 12(1):4209, 2022.
  • Chawla et al. [2002] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of artificial intelligence research, 16:321–357, 2002.
  • Chen et al. [2021] Richard J Chen, Ming Y Lu, Tiffany Y Chen, Drew FK Williamson, and Faisal Mahmood. Synthetic data in machine learning for medicine and healthcare. Nature Biomedical Engineering, 5(6):493–497, 2021.
  • Cramér [1946] H. Cramér. Mathematical Methods of Statistics. Princeton University Press, USA, 1946.
  • Dempster et al. [1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum Likelihood from Incomplete Data Via the EM Algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  • Department for Science, Innovation and Technology [2023] Department for Science, Innovation and Technology. A Pro-Innovation Approach to AI Regulation. Technical report, Government of the United Kingdom, 2023.
  • Dheeru and Karra Taniskidou [2017] D. Dheeru and E. Karra Taniskidou. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Eitan et al. [2022] Yam Eitan, Nathan Cavaglione, Michael Arbel, and Samuel Cohen. Fair Synthetic Data Does not Necessarily Lead to Fair Models. In Proceedings of the NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research, 2022.
  • European Commission [2023] European Commission. Regulation (EU) 2023/822 of the European Parliament and of the Council on Harmonized Rules on Artificial Intelligence (Artificial Intelligence Act), 2023.
  • Frogner and Poggio [2019] Charlie Frogner and Tomaso Poggio. Fast and Flexible Inference of Joint Distributions from their Marginals. In Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Gebru et al. [2021] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
  • Groves et al. [2024] Lara Groves, Jacob Metcalf, Alayna Kennedy, Briana Vecchione, and Andrew Strait. Auditing Work: Exploring the New York City algorithmic bias audit regime. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024.
  • Holstein et al. [2019] Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé, Miro Dudik, and Hanna Wallach. Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need? In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019.
  • Ji et al. [2020] Disi Ji, Padhraic Smyth, and Mark Steyvers. Can I Trust My Fairness Metric? Assessing Fairness with Unlabeled Data and Bayesian Inference. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, 2020.
  • Jo and Gebru [2020] Eun Seo Jo and Timnit Gebru. Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning. In Proceedings of the 2020 ACM Conference on Fairness, Accountability, and Transparency, 2020.
  • Johnson-Roberson et al. [2017] Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? In Proceedings of the 2017 IEEE International Conference on Robotics and Automation, 2017.
  • Le Quy et al. [2022] Tai Le Quy, Arjun Roy, Vasileios Iosifidis, Wenbin Zhang, and Eirini Ntoutsi. A survey on datasets for fairness-aware machine learning. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 12(3):e1452, 2022.
  • Madaio et al. [2022] Michael Madaio, Lisa Egede, Hariharan Subramonyam, Jennifer Wortman Vaughan, and Hanna Wallach. Assessing the Fairness of AI Systems: AI Practitioners’ Processes, Challenges, and Needs for Support. Proceedings of the ACM on Human-Computer Interaction, 6:1–26, 2022.
  • McKenna et al. [2021] Ryan McKenna, Gerome Miklau, and Daniel Sheldon. Winning the NIST Contest: A scalable and general approach to differentially private synthetic data. Journal of Privacy and Confidentiality, 11(3), 2021.
  • McKenna et al. [2022] Ryan McKenna, Brett Mullins, Daniel Sheldon, and Gerome Miklau. AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data. In Proceedings of the VLDB Endowment, 2022.
  • Mehrabi et al. [2021] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM Computing Surveys, 54(6):1–35, 2021.
  • Office for National Statistics [2021] Office for National Statistics. 2021 Census Data, 2021. URL https://www.ons.gov.uk/census/aboutcensus/censusproducts/multivariatedata.
  • Patki et al. [2016] Neha Patki, Roy Wedge, and Kalyan Veeramachaneni. The Synthetic Data Vault. In Proceedings of the 2016 IEEE International Conference on Data Science and Advanced Analytics, 2016.
  • Paullada et al. [2021] Amandalynne Paullada, Inioluwa Deborah Raji, Emily M Bender, Emily Denton, and Alex Hanna. Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns, 2(11), 2021.
  • Pereira et al. [2024] Mayana Pereira, Meghana Kshirsagar, Sumit Mukherjee, Rahul Dodhia, Juan Lavista Ferres, and Rafael de Sousa. Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data. PLOS ONE, 19(2), 2024.
  • Platzer and Reutterer [2021] Michael Platzer and Thomas Reutterer. Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data. Frontiers in Big Data, 4:679939, 2021.
  • ProPublica [2016] ProPublica. COMPAS Recidivism Risk Score Data and Analysis, 2016. URL https://github.com/propublica/compas-analysis.
  • Richardson and Weiss [2018] Eitan Richardson and Yair Weiss. On GANs and GMMs. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
  • Shome et al. [2024] Arumoy Shome, Luis Cruz, and Arie Van Deursen. Data vs. Model Machine Learning Fairness Testing: An Empirical Study. In Proceedings of the 5th IEEE/ACM International Workshop on Deep Learning for Testing and Testing for Deep Learning, 2024.
  • Shwartz-Ziv and Armon [2022] Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, 2022.
  • Tao et al. [2021] Yuchao Tao, Ryan McKenna, Michael Hay, Ashwin Machanavajjhala, and Gerome Miklau. Benchmarking Differentially Private Synthetic Data Generation Algorithms. In Proceedings of the 3rd AAAI Workshop on Privacy-Preserving Artificial Intelligence, 2021.
  • van Breugel et al. [2023] Boris van Breugel, Nabeel Seedat, Fergus Imrie, and Mihaela van der Schaar. Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data. In Proceedings of the 2023 International Conference on Neural Information Processing Systems, 2023.
  • Veale and Binns [2017] Michael Veale and Reuben Binns. Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data. Big Data & Society, 4(2), 2017.
  • Vero et al. [2024] Mark Vero, Mislav Balunovic, and Martin Vechev. CuTS: Customizable Tabular Synthetic Data Generation. In Proceedings of the 41st International Conference on Machine Learning, 2024.
  • Vora et al. [2021] Jian Vora, Karthik S Gurumoorthy, and Ajit Rajwade. Recovery of Joint Probability Distribution from One-Way Marginals: Low Rank Tensors and Random Projections. In Proceedings of the 2021 IEEE Statistical Signal Processing Workshop, 2021.
  • White House Office of Science and Technology Policy [2022] White House Office of Science and Technology Policy. Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People. https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf, 2022.
  • Xu et al. [2018] Depeng Xu, Shuhan Yuan, Lu Zhang, and Xintao Wu. FairGAN: Fairness-aware Generative Adversarial Networks. In Proceedings of the 2018 IEEE International Conference on Big Data, 2018.
  • Xu et al. [2019] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.

Appendix

Appendix to the paper Beyond Internal Data: Constructing Complete Datasets for Fairness Testing, presented at the Algorithmic Fairness through the Lens of Metrics and Evaluation (AFME) Workshop at NeurIPS 2024.

The appendix is structured as follows:

  • Appendix §A provides technical details on the joint distribution estimation methods, Marginal Preservation and Latent Naïve Bayes, as outlined in the main text.

  • Appendix §B describes each dataset used in the experiments, with tables specifying the variables and categories present after preprocessing.

  • Appendix §C presents detailed results of the metrics used to assess the quality of the generated synthetic data.

Appendix A Technical Details for Joint Distribution Estimation Methods

A.1 Proof of Optimality for the Independence Given Overlap Method

To find the optimal $p(x_3)$, we start by minimising the total Kullback-Leibler (KL) divergence:

$$
\mathcal{L}(p) = D_{\mathrm{KL}}\big(\hat{p}(x_1,x_2,x_3) \,\|\, p(x_1,x_2,x_3)\big) + D_{\mathrm{KL}}\big(\hat{p}(x_3,x_4) \,\|\, p(x_3,x_4)\big). \tag{8}
$$

Let $\hat{p}_1(x_3)$ and $\hat{p}_2(x_3)$ be the empirical marginals of $x_3$ from the first and second datasets, respectively:

$$
\hat{p}_1(x_3) = \sum_{x_1, x_2} \hat{p}(x_1, x_2, x_3), \qquad \hat{p}_2(x_3) = \sum_{x_4} \hat{p}(x_3, x_4). \tag{9}
$$

From our joint distribution assumption $p(x_1, x_2, x_3, x_4) = p(x_3)\,\hat{p}(x_1, x_2 \mid x_3)\,\hat{p}(x_4 \mid x_3)$, we obtain the marginals $p(x_1, x_2, x_3) = p(x_3)\,\hat{p}(x_1, x_2 \mid x_3)$ and $p(x_3, x_4) = p(x_3)\,\hat{p}(x_4 \mid x_3)$.

To minimise the KL divergence with respect to $p(x_3)$, we rewrite $\mathcal{L}(p)$, keeping only the terms that depend on the marginal $p(x_3)$:

$$
\begin{aligned}
\mathcal{L}(p) &= \sum_{x_1,x_2,x_3} \hat{p}(x_1,x_2,x_3)\, \log\frac{\hat{p}_1(x_3)\,\hat{p}(x_1,x_2 \mid x_3)}{p(x_3)\,\hat{p}(x_1,x_2 \mid x_3)} + \sum_{x_3,x_4} \hat{p}(x_3,x_4)\, \log\frac{\hat{p}_2(x_3)\,\hat{p}(x_4 \mid x_3)}{p(x_3)\,\hat{p}(x_4 \mid x_3)} \\
&= -\sum_{x_3}\sum_{x_1,x_2} \hat{p}(x_1,x_2,x_3)\log p(x_3) \;-\; \sum_{x_3}\sum_{x_4} \hat{p}(x_3,x_4)\log p(x_3) \;+\; \text{const.} \\
&= -\sum_{x_3}\big(\hat{p}_1(x_3) + \hat{p}_2(x_3)\big)\log p(x_3) \;+\; \text{const.},
\end{aligned}
\tag{10}
$$

where the constant collects all terms that do not depend on $p(x_3)$.

We find the optimal $p(x_3)$ by minimising $\mathcal{L}(p)$ subject to $\sum_{x_3} p(x_3) = 1$, which ensures that $p(x_3)$ is a valid probability distribution. Introducing a Lagrange multiplier for this constraint and setting the derivative with respect to each $p(x_3)$ to zero gives

$$
p(x_3) \propto \hat{p}_1(x_3) + \hat{p}_2(x_3). \tag{11}
$$

To normalise $p(x_3)$, we set

$$
p(x_3) = \frac{\hat{p}_1(x_3) + \hat{p}_2(x_3)}{\sum_{x_3'} \big(\hat{p}_1(x_3') + \hat{p}_2(x_3')\big)} = \frac{\hat{p}_1(x_3) + \hat{p}_2(x_3)}{2}, \tag{12}
$$

where the denominator equals 2 because $\hat{p}_1$ and $\hat{p}_2$ each sum to one; this ensures that $p(x_3)$ is a valid probability distribution.

Therefore the optimal p(x3)𝑝subscript𝑥3p(x_{3})italic_p ( italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) is the average of the empirical marginals from both datasets.
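To make the construction concrete, the following is a minimal numpy sketch (our own illustration, not code released with the paper) of the resulting procedure: estimate the two empirical joints, average the shared marginal as in (12), and sample complete rows from $p(x_3)\,\hat{p}(x_1, x_2 \mid x_3)\,\hat{p}(x_4 \mid x_3)$. All function and variable names are illustrative, and the sketch assumes every category of $x_3$ occurs in both datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_counts(rows, sizes):
    """Empirical joint distribution over integer-coded categorical columns."""
    table = np.zeros(sizes)
    for row in rows:
        table[tuple(row)] += 1
    return table / table.sum()

def independence_given_overlap(D1, D2, sizes, n_samples=10_000):
    """D1: int array (N1, 3) of (x1, x2, x3); D2: int array (N2, 2) of (x3, x4).
    sizes = (M1, M2, M3, M4). Returns synthetic rows (x1, x2, x3, x4)."""
    M1, M2, M3, M4 = sizes
    p123 = joint_counts(D1, (M1, M2, M3))            # empirical p_hat(x1, x2, x3)
    p34 = joint_counts(D2, (M3, M4))                 # empirical p_hat(x3, x4)
    # Shared marginal: average of the two empirical marginals, as in eq. (12).
    p3 = 0.5 * (p123.sum(axis=(0, 1)) + p34.sum(axis=1))
    # Conditionals p_hat(x1, x2 | x3) and p_hat(x4 | x3).
    p12_g3 = p123 / np.maximum(p123.sum(axis=(0, 1), keepdims=True), 1e-12)
    p4_g3 = p34 / np.maximum(p34.sum(axis=1, keepdims=True), 1e-12)
    samples = np.empty((n_samples, 4), dtype=int)
    for n in range(n_samples):
        x3 = rng.choice(M3, p=p3)
        flat = rng.choice(M1 * M2, p=p12_g3[:, :, x3].ravel())
        x1, x2 = divmod(flat, M2)
        x4 = rng.choice(M4, p=p4_g3[x3])
        samples[n] = (x1, x2, x3, x4)
    return samples
```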

A.2 Details of the Expectation-Maximisation Algorithm for the Latent Naïve Bayes Method

We assume categorical variables $X_1, X_2, X_3, X_4$ with $\mathrm{dom}(X_i) = \{1, 2, \ldots, M_i\}$, where $M_i \in \mathbb{N}$ and $M_i > 1$ for $i \in \{1, 2, 3, 4\}$. We want to sample from the full joint distribution $p(X_1, X_2, X_3, X_4)$. However, our observations take the following form, where $x_3$ and $x_3'$ are both observations of the same variable $X_3$:

$$
\mathbf{D_1} = \{(x_1^n, x_2^n, x_3^n)\}_{n=1}^{N_1} \tag{13}
$$
$$
\mathbf{D_2} = \{(x_3'^n, x_4^n)\}_{n=1}^{N_2} \tag{14}
$$

To model the complex dependencies between the variables while keeping the model simple, we intentionally introduce a latent variable $Z$ with $\mathrm{dom}(Z) = \{1, 2, \ldots, K\}$, $K \in \mathbb{N}$, $K > 1$, and the following probabilistic graphical model.

[Graphical model: the latent variable $Z$ is the sole parent of $X_1$, $X_2$, $X_3$, and $X_4$ (naïve Bayes structure).]

By treating $Z$ as a missing variable, the resulting mixture model can be trained using the EM algorithm.

The model defines the generative process for each data item $n$ as follows:

  1. Sample $Z$ from $p(Z = k) = \pi_k$, where $k = 1, 2, \ldots, K$, $\pi_k \ge 0$, and $\sum_{k=1}^{K} \pi_k = 1$.

  2. Given $Z = k$, the conditional distribution of $X_i$ for $i = 1, 2, 3, 4$ is

     $$
     p(X_i = m \mid Z = k) = p_i(m \mid k), \tag{15}
     $$

     where $m = 1, 2, \ldots, M_i$, $p_i(m \mid k) \ge 0$, and $\sum_{m=1}^{M_i} p_i(m \mid k) = 1$.

We aim to learn the parameters $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_3, \theta_4, \theta_Z)$, where:

$$
\theta_i = \{p_i(m \mid k) : m = 1, \ldots, M_i,\; k = 1, \ldots, K\} \quad \text{for } i = 1, 2, 3, 4,
$$
$$
\theta_Z = (\pi_1, \ldots, \pi_K).
$$

By learning $\boldsymbol{\theta}$, we can model the joint distribution:

$$
p_{\boldsymbol{\theta}}(X_1, X_2, X_3, X_4) = \sum_{Z=1}^{K} p_{\theta_Z}(Z) \prod_{i=1}^{4} p_{\theta_i}(X_i \mid Z) \tag{16}
$$
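As an illustration of how the joint in (16) is used generatively, the following minimal numpy sketch (our own, with an assumed table layout) draws complete rows $(x_1, x_2, x_3, x_4)$ given mixture weights $\pi$ and the per-variable conditional tables $p_i(\cdot \mid k)$.

```python
import numpy as np

def sample_latent_naive_bayes(pi, cond_tables, n_samples, seed=0):
    """Draw rows (x1, x2, x3, x4) from the latent naive Bayes model of eq. (16).

    pi          : array of shape (K,) with the mixture weights p(Z = k).
    cond_tables : list of four arrays; the i-th has shape (K, M_i) and holds
                  p_i(m | k) in row k.
    """
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n_samples, p=pi)        # sample Z ~ pi
    cols = []
    for table in cond_tables:                            # sample X_i given Z = z
        cols.append(np.array([rng.choice(table.shape[1], p=table[k]) for k in z]))
    return np.stack(cols, axis=1)                        # shape (n_samples, 4)
```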

A.2.1 Model Distributions

For dataset $\mathbf{D_1}$, the joint distribution is:

$$
p_{\boldsymbol{\theta}}(\mathbf{D_1}, \mathbf{z}) = \prod_{n=1}^{N_1} p_1(x_1^n \mid z^n)\, p_2(x_2^n \mid z^n)\, p_3(x_3^n \mid z^n)\, \pi_{z^n} \tag{17}
$$

Marginalising over the latent variables gives the marginal log-likelihood:

$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_1}) = \sum_{n=1}^{N_1} \log\!\left( \sum_{k=1}^{K} p_1(x_1^n \mid k)\, p_2(x_2^n \mid k)\, p_3(x_3^n \mid k)\, \pi_k \right) \tag{18}
$$

The posterior distribution is:

$$
p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{D_1}) = \prod_{n=1}^{N_1} \frac{p_1(x_1^n \mid z^n)\, p_2(x_2^n \mid z^n)\, p_3(x_3^n \mid z^n)\, \pi_{z^n}}{\sum_{k=1}^{K} p_1(x_1^n \mid k)\, p_2(x_2^n \mid k)\, p_3(x_3^n \mid k)\, \pi_k} \tag{19}
$$

Similarly, for dataset $\mathbf{D_2}$:

$$
p_{\boldsymbol{\theta}}(\mathbf{D_2}, \mathbf{z'}) = \prod_{n=1}^{N_2} p_3(x_3'^n \mid z'^n)\, p_4(x_4^n \mid z'^n)\, \pi_{z'^n} \tag{20}
$$
$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_2}) = \sum_{n=1}^{N_2} \log\!\left( \sum_{k=1}^{K} p_3(x_3'^n \mid k)\, p_4(x_4^n \mid k)\, \pi_k \right) \tag{21}
$$
$$
p_{\boldsymbol{\theta}}(\mathbf{z'} \mid \mathbf{D_2}) = \prod_{n=1}^{N_2} \frac{p_3(x_3'^n \mid z'^n)\, p_4(x_4^n \mid z'^n)\, \pi_{z'^n}}{\sum_{k=1}^{K} p_3(x_3'^n \mid k)\, p_4(x_4^n \mid k)\, \pi_k} \tag{22}
$$

A.2.2 Method Outline

For dataset $\mathbf{D_1}$ with latents $\mathbf{z} = \{z^n\}_{n=1}^{N_1}$, a Latent Variable Model (LVM) is defined as $p_{\boldsymbol{\theta}}(\mathbf{D_1}, \mathbf{z})$. Similarly, for $\mathbf{D_2}$ with latents $\mathbf{z'} = \{z'^n\}_{n=1}^{N_2}$, the LVM is $p_{\boldsymbol{\theta}}(\mathbf{D_2}, \mathbf{z'})$. Under the independence assumptions, the distributions factorise:

$$
p_{\boldsymbol{\theta}}(\mathbf{D_1}, \mathbf{D_2}, \mathbf{z}, \mathbf{z'}) = p_{\boldsymbol{\theta}}(\mathbf{D_1}, \mathbf{z})\, p_{\boldsymbol{\theta}}(\mathbf{D_2}, \mathbf{z'}) \tag{23}
$$
$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_1}, \mathbf{D_2}) = \log p_{\boldsymbol{\theta}}(\mathbf{D_1}) + \log p_{\boldsymbol{\theta}}(\mathbf{D_2}) \tag{24}
$$
$$
p_{\boldsymbol{\theta}}(\mathbf{z}, \mathbf{z'} \mid \mathbf{D_1}, \mathbf{D_2}) = p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{D_1})\, p_{\boldsymbol{\theta}}(\mathbf{z'} \mid \mathbf{D_2}) \tag{25}
$$

To estimate $\boldsymbol{\theta}$, we apply the EM algorithm to maximise the marginal log-likelihoods $\log p_{\boldsymbol{\theta}}(\mathbf{D_1})$ and $\log p_{\boldsymbol{\theta}}(\mathbf{D_2})$ under the latent variables. The lower bounds are given by:

$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_1}) \ge L_{D_1}(\theta, q_1), \qquad \log p_{\boldsymbol{\theta}}(\mathbf{D_2}) \ge L_{D_2}(\theta, q_2), \tag{26}
$$

where $q_1(z) = q(z \mid D_1)$ and $q_2(z) = q(z \mid D_2)$ are distributions over $Z$.

The EM algorithm steps are as follows, also detailed in Algorithm 1.

Algorithm 1 EM Algorithm
1: Initialise $t = 0$ and $\boldsymbol{\theta}^{(0)} = \{\theta_1^{(0)}, \theta_2^{(0)}, \theta_3^{(0)}, \theta_4^{(0)}, \theta_Z^{(0)}\}$
2: $t \leftarrow 1$
3: while $\boldsymbol{\theta}$ not converged do
4:     for $n = 1, \ldots, N_1$ and $k = 1, \ldots, K$ do
5:         Set $q_1^{(t)}(z^n = k)$ using (34)
6:     end for
7:     for $n = 1, \ldots, N_2$ and $k = 1, \ldots, K$ do
8:         Set $q_2^{(t)}(z'^n = k)$ using (35)
9:     end for
10:    Update $\boldsymbol{\theta}^{(t)} = \{\theta_1^{(t)}, \theta_2^{(t)}, \theta_3^{(t)}, \theta_4^{(t)}, \theta_Z^{(t)}\}$ using (43), (44), (49), (45), (39)
11:    $t \leftarrow t + 1$
12: end while
  • M-step: Maximise the lower bounds with respect to $\theta_1, \theta_2, \theta_3, \theta_4, \theta_Z$:

    • Maximise $L_{D_1}(\theta, q_1)$ with respect to $\theta_1, \theta_2$

    • Maximise $L_{D_2}(\theta, q_2)$ with respect to $\theta_4$

    • Maximise the sum of the terms containing $\theta_3$ and $\theta_Z$ across $L_{D_1}$ and $L_{D_2}$

  • E-step: Find $q$ to optimise $L_{D_1}(\theta, q_1) + L_{D_2}(\theta, q_2)$:

    • Set $q_1$ to optimise $L_{D_1}$ given fixed $\theta$

    • Set $q_2$ to optimise $L_{D_2}$ given fixed $\theta$
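To illustrate the outline above, here is a minimal numpy sketch of the combined E- and M-steps (an illustration under the stated model assumptions, not the authors' implementation): $\theta_1, \theta_2$ are updated from $\mathbf{D_1}$ only, $\theta_4$ from $\mathbf{D_2}$ only, while $\theta_3$ and $\pi$ pool the responsibilities from both datasets. The function name and data layout are assumptions.

```python
import numpy as np

def em_latent_naive_bayes(D1, D2, sizes, K, n_iter=200, seed=0):
    """EM for the shared-latent naive Bayes model over two overlapping datasets.

    D1    : int array (N1, 3) with columns (x1, x2, x3).
    D2    : int array (N2, 2) with columns (x3, x4).
    sizes : (M1, M2, M3, M4) category counts; K: number of latent states.
    Returns pi of shape (K,) and cond[i] of shape (K, M_i) for i = 0..3.
    """
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    cond = [rng.dirichlet(np.ones(M), size=K) for M in sizes]   # p_i(m | k)

    def table(resp, x, M):
        """Responsibility-weighted conditional table p(m | k), rows indexed by k."""
        counts = np.stack([resp[x == m].sum(axis=0) for m in range(M)], axis=1)
        return counts / counts.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: responsibilities q1(z^n = k) and q2(z'^n = k), eqs. (34)-(35).
        r1 = pi * cond[0][:, D1[:, 0]].T * cond[1][:, D1[:, 1]].T * cond[2][:, D1[:, 2]].T
        r1 /= r1.sum(axis=1, keepdims=True)                     # shape (N1, K)
        r2 = pi * cond[2][:, D2[:, 0]].T * cond[3][:, D2[:, 1]].T
        r2 /= r2.sum(axis=1, keepdims=True)                     # shape (N2, K)

        # M-step: theta_1, theta_2 from D1 only; theta_4 from D2 only;
        # theta_3 and pi pool statistics from both datasets.
        cond[0] = table(r1, D1[:, 0], sizes[0])
        cond[1] = table(r1, D1[:, 1], sizes[1])
        cond[3] = table(r2, D2[:, 1], sizes[3])
        counts3 = (np.stack([r1[D1[:, 2] == m].sum(axis=0) for m in range(sizes[2])], axis=1)
                   + np.stack([r2[D2[:, 0] == m].sum(axis=0) for m in range(sizes[2])], axis=1))
        cond[2] = counts3 / counts3.sum(axis=1, keepdims=True)
        pi = (r1.sum(axis=0) + r2.sum(axis=0)) / (len(D1) + len(D2))
    return pi, cond
```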

A.3 Deriving Algorithm Steps

A.3.1 Lower bound on the Likelihood

We lower-bound the log-likelihood of the observed variables:

$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_1}) + \log p_{\boldsymbol{\theta}}(\mathbf{D_2}) \tag{27}
$$

Using $q_1(z) = q(z \mid \mathbf{D_1})$ and $q_2(z) = q(z \mid \mathbf{D_2})$, the KL divergence for $\mathbf{D_1}$ is:

$$
D_{\mathrm{KL}}\big(q_1(Z) \,\|\, p_{\boldsymbol{\theta}}(Z \mid \mathbf{D_1})\big) = \mathbb{E}_{Z \sim q_1}\!\left[\log \frac{q_1(Z)}{p_{\boldsymbol{\theta}}(Z \mid \mathbf{D_1})}\right] \ge 0 \tag{28}
$$

Thus, we have:

$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_1}) \ge \mathbb{E}_{Z \sim q_1}\!\left[\log \frac{p_{\boldsymbol{\theta}}(\mathbf{D_1}, Z)}{q_1(Z)}\right] = L_{D_1}(\boldsymbol{\theta}, q_1) \tag{29}
$$

where

$$
L_{D_1}(\boldsymbol{\theta}, q_1) = \sum_{n=1}^{N_1} \sum_{k=1}^{K} q_1(z^n = k)\left[\sum_{i=1}^{3} \log p_i(x_i^n \mid k) + \log \pi_k\right] + H(q_1) \tag{30}
$$

and $H(q_1) = -\sum_{n=1}^{N_1}\sum_{k=1}^{K} q_1(z^n = k)\log q_1(z^n = k)$ denotes the entropy of $q_1$.

Similarly, for $\mathbf{D_2}$:

$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_2}) \ge L_{D_2}(\boldsymbol{\theta}, q_2) \tag{31}
$$

with

$$
L_{D_2}(\boldsymbol{\theta}, q_2) = \sum_{n=1}^{N_2} \sum_{k=1}^{K} q_2(z'^n = k)\left[\log p_3(x_3'^n \mid k) + \log p_4(x_4^n \mid k) + \log \pi_k\right] + H(q_2) \tag{32}
$$

Overall, the lower bound is:

$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_1}) + \log p_{\boldsymbol{\theta}}(\mathbf{D_2}) \ge L_{D_1}(\boldsymbol{\theta}, q_1) + L_{D_2}(\boldsymbol{\theta}, q_2) \tag{33}
$$
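In practice, convergence of these EM iterations can be monitored by evaluating the left-hand side of (33), i.e. the marginal log-likelihoods (18) and (21), which should increase monotonically. A minimal numpy sketch, assuming the same table layout as the EM sketch above:

```python
import numpy as np

def marginal_log_likelihood(D1, D2, pi, cond):
    """log p(D1) + log p(D2) for the shared-latent naive Bayes model, eqs. (18) and (21)."""
    lik1 = (pi * cond[0][:, D1[:, 0]].T * cond[1][:, D1[:, 1]].T
            * cond[2][:, D1[:, 2]].T).sum(axis=1)     # per-row likelihood for D1
    lik2 = (pi * cond[2][:, D2[:, 0]].T * cond[3][:, D2[:, 1]].T).sum(axis=1)
    return np.log(lik1).sum() + np.log(lik2).sum()
```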

A.3.2 E step

The E-step 1 updates $q_1^{(t)}(z^n = k)$ by maximising the lower bound $L_{D_1}(\theta, q_1)$ with respect to $q_1(\cdot)$, while keeping $\theta$ fixed:

$$
\begin{aligned}
q_1^{(t)}(z^n = k) &= p_{\boldsymbol{\theta}^{(t-1)}}(z^n = k \mid x_1^n, x_2^n, x_3^n) \\
&= \frac{p_1^{(t-1)}(x_1^n \mid k)\, p_2^{(t-1)}(x_2^n \mid k)\, p_3^{(t-1)}(x_3^n \mid k)\, \pi_k^{(t-1)}}{\sum_{j=1}^{K} p_1^{(t-1)}(x_1^n \mid j)\, p_2^{(t-1)}(x_2^n \mid j)\, p_3^{(t-1)}(x_3^n \mid j)\, \pi_j^{(t-1)}}
\end{aligned}
\tag{34}
$$

E-step 2 updates $q_2^{(t)}({z'}^n=k)$ by maximising $L_{D_2}(\boldsymbol{\theta},q_2)$ with respect to $q_2(\cdot)$, while keeping $\boldsymbol{\theta}$ fixed:

\[
\begin{aligned}
q_2^{(t)}({z'}^n=k) &= p_{\boldsymbol{\theta}^{(t-1)}}({z'}^n=k\mid {x'_3}^n,x_4^n) \\
&= \frac{p_3^{(t-1)}({x'_3}^n\mid k)\,p_4^{(t-1)}(x_4^n\mid k)\,\pi_k^{(t-1)}}{\sum_{j=1}^{K}p_3^{(t-1)}({x'_3}^n\mid j)\,p_4^{(t-1)}(x_4^n\mid j)\,\pi_j^{(t-1)}}
\end{aligned}
\tag{35}
\]
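For concreteness, a minimal NumPy sketch of the two E-step updates (Eqs. 34 and 35) is given below, assuming all categorical features are integer-encoded and the conditionals $p_1,\ldots,p_4$ are stored as arrays with one row per category and one column per mixture component; the function and variable names are illustrative rather than part of the method.

```python
import numpy as np

def e_step(pi, p1, p2, p3, p4, D1, D2):
    """One E-step: responsibilities q1 over D1 and q2 over D2.

    pi     : (K,) mixture weights.
    p1..p4 : (M_i, K) conditional tables, p_i[m, k] = p_i(x_i = m | z = k).
    D1     : (N1, 3) integer-encoded columns (x1, x2, x3).
    D2     : (N2, 2) integer-encoded columns (x3', x4).
    """
    # Unnormalised posteriors for D1: p1(x1|k) p2(x2|k) p3(x3|k) pi_k   (Eq. 34)
    r1 = p1[D1[:, 0]] * p2[D1[:, 1]] * p3[D1[:, 2]] * pi        # (N1, K)
    q1 = r1 / r1.sum(axis=1, keepdims=True)

    # Unnormalised posteriors for D2: p3(x3'|k) p4(x4|k) pi_k           (Eq. 35)
    r2 = p3[D2[:, 0]] * p4[D2[:, 1]] * pi                       # (N2, K)
    q2 = r2 / r2.sum(axis=1, keepdims=True)
    return q1, q2
```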

A.3.3 M step: Optimal $\theta_Z$

For the M-step, we maximise $L_{D_1}(\boldsymbol{\theta},q_1)+L_{D_2}(\boldsymbol{\theta},q_2)$ with respect to $\boldsymbol{\theta}$, while keeping $q(\cdot)$ fixed.

To account for the constraint $\sum_{k=1}^{K}\pi_k=1$, we use a Lagrange multiplier $\lambda$. For any $c\in\{1,\ldots,K\}$, we have:

\[
\nabla_{\pi_c}\left(L_{D_1}(\boldsymbol{\theta},q_1)+L_{D_2}(\boldsymbol{\theta},q_2)-\lambda\left(\sum_{k=1}^{K}\pi_k-1\right)\right)=0
\tag{36}
\]
\[
\implies \frac{\sum_{n=1}^{N_1}q_1(z^n=c)+\sum_{n=1}^{N_2}q_2({z'}^n=c)}{\pi_c}-\lambda=0
\tag{37}
\]
\[
\implies \pi_c \propto \sum_{n=1}^{N_1}q_1(z^n=c)+\sum_{n=1}^{N_2}q_2({z'}^n=c)
\tag{38}
\]

Since $\sum_{k=1}^{K}\left(\sum_{n=1}^{N_1}q_1(z^n=k)+\sum_{n=1}^{N_2}q_2({z'}^n=k)\right)=N_1+N_2$, we obtain:

\[
\pi_c^{(t)}=\frac{\sum_{n=1}^{N_1}q_1^{(t)}(z^n=c)+\sum_{n=1}^{N_2}q_2^{(t)}({z'}^n=c)}{N_1+N_2}
\tag{39}
\]

A.3.4 M step: Optimal $\theta_1$, $\theta_2$ and $\theta_4$

In the M-step, we use Lagrange multipliers $\lambda(c)$ to maximise $L_{D_1}(\boldsymbol{\theta},q_1)$ with respect to $p_1(m\mid c)$. For $c\in\{1,\ldots,K\}$ and $m\in\{1,\ldots,M_1\}$, we have:

\[
\nabla_{p_1(m\mid c)}\left(L_{D_1}(\boldsymbol{\theta},q_1)-\sum_{k=1}^{K}\lambda(k)\left(\sum_{j=1}^{M_1}p_1(j\mid k)-1\right)\right)=0
\tag{40}
\]
\[
\implies \sum_{n=1}^{N_1}\frac{\mathbbm{1}(x_1^n=m)\,q_1(z^n=c)}{p_1(m\mid c)}-\lambda(c)=0
\tag{41}
\]
\[
\implies p_1(m\mid c)\propto\sum_{n=1}^{N_1}\mathbbm{1}(x_1^n=m)\,q_1(z^n=c)
\tag{42}
\]

Normalising gives:

\[
p_1^{(t)}(m\mid c)=\frac{\sum_{n=1}^{N_1}\mathbbm{1}(x_1^n=m)\,q_1^{(t)}(z^n=c)}{\sum_{j=1}^{M_1}\sum_{n=1}^{N_1}\mathbbm{1}(x_1^n=j)\,q_1^{(t)}(z^n=c)}
\tag{43}
\]

For $p_2(m\mid c)$:

\[
p_2^{(t)}(m\mid c)=\frac{\sum_{n=1}^{N_1}\mathbbm{1}(x_2^n=m)\,q_1^{(t)}(z^n=c)}{\sum_{j=1}^{M_2}\sum_{n=1}^{N_1}\mathbbm{1}(x_2^n=j)\,q_1^{(t)}(z^n=c)}
\tag{44}
\]

Similarly, for $p_4(m\mid c)$, we maximise $L_{D_2}(\boldsymbol{\theta},q_2)$:

\[
p_4^{(t)}(m\mid c)=\frac{\sum_{n=1}^{N_2}\mathbbm{1}(x_4^n=m)\,q_2^{(t)}({z'}^n=c)}{\sum_{j=1}^{M_4}\sum_{n=1}^{N_2}\mathbbm{1}(x_4^n=j)\,q_2^{(t)}({z'}^n=c)}
\tag{45}
\]

A.3.5 M step: Optimal $\theta_3$

In the M-step, we use Lagrange multipliers $\lambda(c)$ to maximise $L_{D_1}(\boldsymbol{\theta},q_1)+L_{D_2}(\boldsymbol{\theta},q_2)$ with respect to $p_3(m\mid c)$. For $c\in\{1,\ldots,K\}$ and $m\in\{1,\ldots,M_3\}$, we have:

\[
\nabla_{p_3(m\mid c)}\left(L_{D_1}(\boldsymbol{\theta},q_1)+L_{D_2}(\boldsymbol{\theta},q_2)-\sum_{k=1}^{K}\lambda(k)\left(\sum_{j=1}^{M_3}p_3(j\mid k)-1\right)\right)=0
\tag{46}
\]
\[
\implies \frac{\sum_{n=1}^{N_1}\mathbbm{1}(x_3^n=m)\,q_1(z^n=c)+\sum_{n=1}^{N_2}\mathbbm{1}({x'_3}^n=m)\,q_2({z'}^n=c)}{p_3(m\mid c)}-\lambda(c)=0
\tag{47}
\]
\[
\implies p_3(m\mid c)\propto\sum_{n=1}^{N_1}\mathbbm{1}(x_3^n=m)\,q_1(z^n=c)+\sum_{n=1}^{N_2}\mathbbm{1}({x'_3}^n=m)\,q_2({z'}^n=c)
\tag{48}
\]

We therefore obtain the M-step update:

\[
p_3^{(t)}(m\mid c)=\frac{\sum_{n=1}^{N_1}\mathbbm{1}(x_3^n=m)\,q_1^{(t)}(z^n=c)+\sum_{n=1}^{N_2}\mathbbm{1}({x'_3}^n=m)\,q_2^{(t)}({z'}^n=c)}{\sum_{j=1}^{M_3}\left(\sum_{n=1}^{N_1}\mathbbm{1}(x_3^n=j)\,q_1^{(t)}(z^n=c)+\sum_{n=1}^{N_2}\mathbbm{1}({x'_3}^n=j)\,q_2^{(t)}({z'}^n=c)\right)}
\tag{49}
\]
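A matching NumPy sketch of the M-step updates (Eqs. 39, 43–45, and 49) is given below, under the same encoding assumptions as the E-step sketch above; adding a small smoothing constant before normalisation, which is not part of the derivation, may be useful in practice to avoid zero probabilities.

```python
import numpy as np

def m_step(q1, q2, D1, D2, M1, M2, M3, M4):
    """One M-step given responsibilities q1 (N1, K) and q2 (N2, K)."""
    N1, K = q1.shape
    N2 = q2.shape[0]

    # Mixture weights (Eq. 39)
    pi = (q1.sum(axis=0) + q2.sum(axis=0)) / (N1 + N2)

    def cond_table(x, q, M):
        # counts[m, k] = sum_n 1(x^n = m) q(z^n = k), then normalise over m
        counts = np.zeros((M, K))
        np.add.at(counts, x, q)
        return counts / counts.sum(axis=0, keepdims=True)

    p1 = cond_table(D1[:, 0], q1, M1)        # Eq. 43 (D1 only)
    p2 = cond_table(D1[:, 1], q1, M2)        # Eq. 44 (D1 only)
    p4 = cond_table(D2[:, 1], q2, M4)        # Eq. 45 (D2 only)

    # p3 pools evidence from the overlapping variable in both datasets (Eq. 49)
    counts3 = np.zeros((M3, K))
    np.add.at(counts3, D1[:, 2], q1)
    np.add.at(counts3, D2[:, 0], q2)
    p3 = counts3 / counts3.sum(axis=0, keepdims=True)

    return pi, p1, p2, p3, p4
```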

Appendix B Dataset Details

For the Adult Data, the ‘fnlwgt’ attribute is dropped as it is not relevant to the task, and the ‘education-num’ attribute is dropped as it duplicates the information in the ‘education’ attribute. The COMPAS Data is filtered to include only records where the ‘race’ column is either ‘African-American’ or ‘Caucasian’, recoded as $\{black, white\}$. We further combine the three columns containing juvenile crime counts into the total number of juvenile crimes. Details of the attributes and their values can be found in Tables 4, 5, and 6.
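A hedged pandas sketch of this preprocessing is given below; the raw column names (‘fnlwgt’ and ‘education-num’ for Adult, ‘race’ and the three ‘juv_*_count’ columns for COMPAS) follow the commonly distributed versions of these datasets and should be verified against the files actually used.

```python
import pandas as pd

def preprocess_adult(adult: pd.DataFrame) -> pd.DataFrame:
    # Drop the sampling weight and the duplicate of 'education'.
    return adult.drop(columns=["fnlwgt", "education-num"])

def preprocess_compas(compas: pd.DataFrame) -> pd.DataFrame:
    # Keep only African-American and Caucasian defendants and recode race.
    compas = compas[compas["race"].isin(["African-American", "Caucasian"])].copy()
    compas["race"] = compas["race"].map(
        {"African-American": "black", "Caucasian": "white"})
    # Collapse the three juvenile crime counts into a single total.
    juv_cols = ["juv_fel_count", "juv_misd_count", "juv_other_count"]
    compas["juvenile_crime"] = compas[juv_cols].sum(axis=1)
    return compas.drop(columns=juv_cols)
```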

Table 4: Adult Data: Attributes and Their Values
Attribute Values
Age {25–60, <25, >60}
Capital Gain {<=5000, >5000}
Capital Loss {<=40, >40}
Education {assoc-acdm, assoc-voc, bachelors, doctorate, HS-grad, masters, prof-school, some-college, high-school, primary/middle school}
Hours Per Week {<40, 40–60, >60}
Income {<=50K, >50K}
Marital Status {married, other}
Native Country {US, non-US}
Occupation {adm-clerical, armed-forces, craft-repair, exec-managerial, farming-fishing, handlers-cleaners, machine-op-inspct, other-service, priv-house-serv, prof-specialty, protective-serv, sales, tech-support, transport-moving}
Race {non-white, white}
Relationship {non-spouse, spouse}
Sex {male, female}
Workclass {private, non-private}
Table 5: COMPAS Data: Attributes and Their Values
Attribute Values
Age Category {25 - 45, >45, <25}
Charge Degree {F, M}
Juvenile Crime {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14}
Priors Count {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 36, 37, 38}
Race {Black, White}
Score Text {High, Low, Medium}
Sex {Female, Male}
Two-Year Recidivism {0, 1}
Violent Score Text {High, Low, Medium}
Table 6: German Credit Data: Attributes and Their Values
Attribute Values
Age {<= 25, >25}
Checking Account {0 <= <200 DM, <0 DM, >= 200 DM, no account}
Class Label {0, 1}
Credit Amount {<=2000, 2001-5000, >5000}
Credit History {all credits at this bank paid back duly, critical account, delay in paying off, existing credits paid back duly till now, no credits taken}
Duration {<=6, 7-12, >12}
Employment Since {1 <= < 4 years, 4 <= <7 years, <1 years, >=7 years, unemployed}
Existing Credits {1, 2, 3, 4}
Foreign Worker {no, yes}
Housing {for free, own, rent}
Installment Rate {1, 2, 3, 4}
Job {management/ highly qualified employee, skilled employee / official, unemployed/ unskilled - non-resident, unskilled - resident}
Marital Status {divorced/separated, married/widowed}
Number of People Provide Maintenance For {1, 2}
Other Debtors {co-applicant, guarantor, none}
Other Installment Plans {bank, none, store}
Property {car or other, real estate, savings agreement/life insurance, unknown / no property}
Purpose {business, car (new), car (used), domestic appliances, education, furniture/equipment, others, radio/television, repairs, retraining}
Residence Since {1, 2, 3, 4}
Savings Account {100 <= <500 DM, 500 <= < 1000 DM, <100 DM, >= 1000 DM, no savings account}
Sex {female, male}
Telephone {none, yes}

Appendix C Evaluating Quality of Synthetic Data

C.1 Overall Fidelity Metrics: Synthetic vs. Real Data

Full results for overall fidelity metrics, including Total Variation Distance Complement (1-TVD), Contingency Similarity (CS), and Discriminator Measure (DM) across various synthetic datasets, are presented in Table 7. This table provides a comprehensive comparison of the fidelity of different synthetic data generation methods to real-world data.
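To illustrate how the column-wise fidelity comparison can be computed, a sketch of the total variation distance complement for categorical columns is shown below; the exact definitions of CS and DM, and any per-column weighting, are assumptions rather than taken from the implementation.

```python
import pandas as pd

def tvd_complement(real: pd.Series, synth: pd.Series) -> float:
    """1 - TVD between the empirical category frequencies of one column."""
    p_real = real.value_counts(normalize=True)
    p_syn = synth.value_counts(normalize=True)
    cats = p_real.index.union(p_syn.index)
    tvd = 0.5 * (p_real.reindex(cats, fill_value=0.0)
                 - p_syn.reindex(cats, fill_value=0.0)).abs().sum()
    return 1.0 - tvd

def mean_tvd_complement(real_df: pd.DataFrame, synth_df: pd.DataFrame) -> float:
    # Average the per-column scores over all shared columns.
    return sum(tvd_complement(real_df[c], synth_df[c])
               for c in real_df.columns) / len(real_df.columns)
```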

Figure 5 shows the difference in Cramér’s V correlation (DCC) between synthetic and real test data for COMPAS. Similar patterns are observed across other synthetic datasets.
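A sketch of how the DCC entries could be computed is given below, assuming Cramér's V is derived from a chi-squared contingency test for each attribute pair; the bias-corrected variant of Cramér's V is not applied here.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association between two categorical columns."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

def dcc(real: pd.DataFrame, synth: pd.DataFrame, col_a: str, col_b: str) -> float:
    """Difference in Cramér's V correlation for one attribute pair."""
    return cramers_v(synth[col_a], synth[col_b]) - cramers_v(real[col_a], real[col_b])
```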

Table 7: Fidelity metrics for synthetic datasets generated from separate data (overlapping variable in brackets next to joint distribution estimation method). Metrics include total variation distance complement (1-TVD), contingency similarity (CS), and discriminator measure (DM). Baseline methods include CTGAN and Indep.
Dataset Method (Overlapping) 1-TVD \uparrow CS \uparrow DM \downarrow
Adult
Indep-Overlap (Relationship) 0.993 0.983 0.588
Marginal (Relationship) 0.993 0.983 0.588
Latent (Relationship) 0.986 0.968 0.658
Indep-Overlap (Marital Status) 0.994 0.983 0.587
Marginal (Marital Status) 0.993 0.982 0.594
Latent (Marital Status) 0.987 0.970 0.655
CTGAN 0.935 0.938 0.656
Indep 0.935 0.895 0.808
COMPAS
Indep-Overlap (Score) 0.978 0.952 0.596
Marginal (Score) 0.979 0.953 0.598
Latent (Score) 0.978 0.951 0.592
Indep-Overlap (Violent Score) 0.978 0.955 0.577
Marginal (Violent Score) 0.978 0.955 0.573
Latent (Violent Score) 0.976 0.950 0.598
CTGAN 0.910 0.839 0.699
Indep 0.979 0.913 0.689
German
Indep-Overlap (Property) 0.965 0.926 0.613
Marginal (Property) 0.966 0.926 0.628
Latent (Property) 0.965 0.924 0.586
Indep-Overlap (Housing) 0.966 0.926 0.618
Marginal (Housing) 0.966 0.927 0.621
Latent (Housing) 0.966 0.925 0.575
CTGAN 0.946 0.894 0.697
Indep 0.965 0.920 0.696
Figure 5: Difference in Cramér’s V Correlation (DCC) for pairs of attributes in synthetic test data and in real test data. Values close to zero (dark blue colour) indicate synthetic data is more similar to real data. Results shown for COMPAS Data, with synthetic data generated from separate data with overlapping variable ‘Score’. Subplots correspond to different joint estimation methods: (a) Independence given Overlap, (b) Marginal Preservation, (c) Latent Naïve Bayes, (d) CTGAN Baseline, (e) Independent Baseline.

C.2 Joint Distribution of Protected Attributes and Outcomes: Synthetic vs. Real Data

KL divergence values for the joint distribution of protected attributes and outcome labels in synthetic versus real data, evaluated across methods and data separations, are detailed in Table 8.
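A sketch of the KL computation for a single protected attribute is given below, assuming the joint distribution $p(A,Y)$ is estimated by relative frequencies and that the divergence is taken from the synthetic to the real distribution; the direction and any smoothing for empty cells are assumptions.

```python
import numpy as np
import pandas as pd

def kl_joint(synth: pd.DataFrame, real: pd.DataFrame, a: str, y: str) -> float:
    """KL( p_synth(A, Y) || p_real(A, Y) ), estimated by relative frequencies."""
    p_syn = synth.groupby([a, y]).size() / len(synth)
    p_real = real.groupby([a, y]).size() / len(real)
    cells = p_syn.index.union(p_real.index)
    p = p_syn.reindex(cells, fill_value=0.0).to_numpy()
    q = p_real.reindex(cells, fill_value=0.0).to_numpy()
    mask = p > 0  # cells absent in the synthetic data contribute zero
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```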

Table 8: KL divergence of p(A,Y)𝑝𝐴𝑌p(A,Y)italic_p ( italic_A , italic_Y ) in synthetic vs real data (where Y is the outcome label and A is a protected attribute such as race, sex, and age). Synthetic datasets generated from separate data (with the overlapping variable indicated in brackets next to the joint distribution estimation method). Baseline methods include CTGAN and Indep.
Dataset Method (Overlapping) KL for Race \downarrow KL for Sex \downarrow KL for Age \downarrow
Adult
Indep-Overlap (Relationship) 0.002 0.001
Marginal (Relationship) 0.002 0.001
Latent (Relationship) 0.002 0.002
Indep-Overlap (Marital Status) 0.002 0.001
Marginal (Marital Status) 0.002 0.002
Latent (Marital Status) 0.002 0.001
CTGAN 0.132 0.048
Indep 0.005 0.026
COMPAS
Indep-Overlap (Score) 0.006 0.044
Marginal (Score) 0.005 0.039
Latent (Score) 0.005 0.034
Indep-Overlap (Violent Score) 0.015 0.038
Marginal (Violent Score) 0.015 0.038
Latent (Violent Score) 0.005 0.026
CTGAN 0.498 0.506
Indep 0.058 0.062
German
Indep-Overlap (Property) 0.015 0.052
Marginal (Property) 0.013 0.055
Latent (Property) 0.003 0.023
Indep-Overlap (Housing) 0.003 0.034
Marginal (Housing) 0.005 0.035
Latent (Housing) 0.002 0.022
CTGAN 0.282 0.215
Indep 0.007 0.038

C.3 Detailed Fairness Metrics Comparison: Synthetic vs. Real Data

Table 9 provides a detailed comparison of absolute differences in fairness metrics for a Decision Tree classifier, as evaluated on various synthetic datasets compared to real test data. The metrics include Average Odds Difference (AOD), Disparate Impact (DI), and Equal Opportunity Difference (EOD). The analysis is based on 1000 bootstrapped samples. The table summarises these metrics across different synthetic datasets and baselines.
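A sketch of how these bootstrapped fairness gaps could be reproduced is given below, assuming the standard definitions of AOD, DI, and EOD over a binary protected attribute (group-wise TPR/FPR gaps and the ratio of positive prediction rates); the classifier interface, the ‘label’ column name, and the privileged/unprivileged coding are placeholders.

```python
import numpy as np
import pandas as pd

def fairness_metrics(y_true, y_pred, group):
    """AOD, DI, EOD for a binary protected attribute (1 = privileged)."""
    def rates(mask):
        yt, yp = y_true[mask], y_pred[mask]
        return yp[yt == 1].mean(), yp[yt == 0].mean(), yp.mean()  # TPR, FPR, P(pred=1)

    tpr_u, fpr_u, pos_u = rates(group == 0)
    tpr_p, fpr_p, pos_p = rates(group == 1)
    aod = 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p))
    di = pos_u / pos_p
    eod = tpr_u - tpr_p
    return aod, di, eod

def bootstrap_mean(df, model, protected, n_boot=1000, seed=0):
    """Mean AOD, DI, EOD over bootstrap resamples of the test data."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_boot):
        b = df.sample(n=len(df), replace=True,
                      random_state=int(rng.integers(0, 2**32 - 1)))
        y_pred = model.predict(b.drop(columns=["label", protected]))
        out.append(fairness_metrics(b["label"].to_numpy(), np.asarray(y_pred),
                                    b[protected].to_numpy()))
    return np.mean(out, axis=0)
```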

Table 9: Absolute differences between bootstrap means of fairness metrics from synthetic and real data. Metrics calculated for Decision Tree Classifier, using 1000 bootstrapped samples. Metrics include AOD (Average Odds Difference), DI (Disparate Impact), and EOD (Equal Opportunity Difference). Synthetic datasets generated from separate data (with the overlapping variable indicated in brackets next to the joint distribution estimation method). Baseline methods include CTGAN and Indep.
Dataset Method (Overlapping) Race Sex
AOD \downarrow DI \downarrow EOD \downarrow AOD \downarrow DI \downarrow EOD \downarrow
Adult
Indep-Overlap (Marital) 0.008 0.137 0.015 0.015 0.030 0.038
Indep-Overlap (Relationship) 0.013 0.112 0.022 0.037 0.006 0.070
Latent (Marital) 0.013 0.063 0.016 0.069 0.162 0.134
Latent (Relationship) 0.003 0.079 0.010 0.068 0.154 0.148
Marginal (Marital) 0.003 0.144 0.005 0.010 0.047 0.030
Marginal (Relationship) 0.009 0.112 0.014 0.036 0.002 0.071
CTGAN 0.027 0.003 0.053 0.013 0.048 0.055
Indep 0.021 0.390 0.023 0.070 0.669 0.071
COMPAS
Indep-Overlap (Score) 0.002 0.003 0.034 0.040 0.079 0.014
Indep-Overlap (Violent Score) 0.031 0.046 0.057 0.013 0.032 0.027
Latent (Score) 0.001 0.001 0.035 0.013 0.032 0.030
Latent (Violent Score) 0.005 0.010 0.016 0.009 0.016 0.054
Marginal (Score) 0.000 0.000 0.037 0.029 0.061 0.010
Marginal (Violent Score) 0.039 0.057 0.063 0.015 0.037 0.034
CTGAN 0.065 0.212 0.097 0.083 0.351 0.135
Indep 0.134 0.211 0.146 0.072 0.138 0.012
Age Sex
AOD \downarrow DI \downarrow EOD \downarrow AOD \downarrow DI \downarrow EOD \downarrow
German
Indep-Overlap (Housing) 0.106 0.178 0.084 0.048 0.068 0.071
Indep-Overlap (Property) 0.144 0.216 0.085 0.010 0.020 0.048
Latent (Housing) 0.141 0.190 0.049 0.025 0.049 0.069
Latent (Property) 0.119 0.170 0.048 0.014 0.034 0.062
Marginal (Housing) 0.111 0.178 0.076 0.044 0.058 0.055
Marginal (Property) 0.150 0.218 0.078 0.005 0.015 0.045
CTGAN 0.140 0.199 0.066 0.009 0.019 0.048
Indep 0.139 0.195 0.062 0.007 0.024 0.056