
Beyond Internal Data:
Constructing Complete Datasets for Fairness Testing

Varsha Ramineni, Hossein A. Rahmani, Emine Yilmaz, David Barber
Centre of Artificial Intelligence
University College London
{varsha.ramineni.23, hossein.rahmani.22, emine.yilmaz, david.barber}
@ucl.ac.uk
Abstract

As AI becomes prevalent in high-risk domains and decision-making, it is essential to test for potential harms and biases. This urgency is reflected by the global emergence of AI regulations that emphasise fairness and adequate testing, with some mandating independent bias audits. However, procuring the necessary data for fairness testing remains a significant challenge. Particularly in industry settings, legal and privacy concerns restrict the collection of demographic data required to assess group disparities, and auditors face practical and cultural challenges in gaining access to data. Further, internal historical datasets are often insufficiently representative to identify real-world biases. This work focuses on evaluating classifier fairness when complete datasets including demographics are inaccessible. We propose leveraging separate overlapping datasets to construct complete synthetic data that includes demographic information and accurately reflects the underlying relationships between protected attributes and model features. We validate the fidelity of the synthetic data by comparing it to real data, and empirically demonstrate that fairness metrics derived from testing on such synthetic data are consistent with those obtained from real data. This work, therefore, offers a path to overcome real-world data scarcity for fairness testing, enabling independent, model-agnostic evaluation of fairness, and serving as a viable substitute where real data is limited.

1 Introduction

It is well established that Artificial Intelligence (AI) systems have the potential to perpetuate, amplify, and systemise harmful biases [10, 11]. Therefore, rigorous testing for bias is imperative to mitigate harms, especially given the increasing influence of AI in high-stakes domains such as lending, hiring, and healthcare. Such concerns have fuelled active research in bias detection and mitigation [32], and ensuring the fairness of AI systems has become an urgent policy priority for governments around the world [17, 47]. For instance, the EU AI Act imposes strict safety testing on high-risk systems [20], while New York City Local Law 144 mandates independent bias audits for AI used in employment decisions [23].

However, procuring the necessary data for fairness testing remains a significant challenge. Influential works in ethics and fairness of machine learning have highlighted the centrality of datasets [26, 3], emphasising how representative model testing and evaluation data is crucial [7, 40]. To effectively uncover biases, complete datasets that include demographic information and their relationship with model features are essential for controlling the impact of proxy variables. However, having access to such datasets that can reliably be used for evaluating fairness may not always be possible in practice.

As a motivating example, consider a bank that uses an AI system to assess loan applicants based on non-protected variables such as occupation and savings. The bank wants to perform an internal audit as to whether its AI system inadvertently discriminates against certain racial groups. For this, the bank requires data concerning protected attributes such as the race of the applicants alongside data of non-protected attributes required by the model to make a loan decision.

Whilst protected attributes such as race, sex, age etc. are crucial to assess bias, their collection and use in modelling are heavily restricted under regulations such as GDPR [1, 44]. Hence, most internal datasets collected by organisations that use AI systems for decision-making (such as the bank in our example) do not contain such protected attributes [29]. Similarly, procuring the necessary data is a major challenge for auditors, hindering the effective implementation of algorithmic auditing laws [23]. In an external audit of fairness, the auditing agency often has access only to the black box loan predictions and is not provided any data by the bank since existing regulations often do not allow data holders to release datasets that pose privacy concerns. For this external audit the agency needs a joint distribution of both the attributes needed by the black box loan classifier and the protected attributes. Therefore, the development of curated test sets capable of effectively uncovering biases is essential [29].

Recently, there has been a shift away from using limited real test data towards leveraging synthetic data, which has shown promise in a variety of applications ranging from privacy preservation [2] to emulating scenarios for which collecting data is challenging [27].

Figure 1: Creation of a synthetic dataset by using two separate datasets and learning their joint distribution. This produces a complete and representative synthetic dataset with essential demographic information necessary for fairness testing.

Our work focuses on the challenge of evaluating classifier fairness in scenarios where complete data including protected attributes is inaccessible. To overcome this challenge, we propose leveraging separate datasets containing overlapping variables, which are more accessible in real-world scenarios than complete datasets containing all variables [21]. Specifically, in addition to using an internal dataset that lacks protected attribute information, we propose utilising external data, such as census datasets which provide representative demographic information. For example, the UK Office for National Statistics [33] offers multivariate data from the 2021 Census, providing access to customisable combinations of census variables. Such external data could be utilised when the essential demographic information needed for fairness testing is not directly available.

In our motivating example above, even if the protected attribute ‘race’ is not directly available in the internal dataset, its connection to the features used by the model, such as occupation and savings, can be used to evaluate fairness with respect to race. For instance, the internal dataset used by the bank might include information about {loan outcome, savings, occupation}. By utilising an external dataset which contains an overlapping variable, such as {occupation, race}, that is representative of the population, we can learn the joint distribution of variables from these two datasets. This joint distribution can then be used to generate synthetic joint test data that contains all the variables, e.g. {loan outcome, savings, occupation, race}, as shown in Figure 1. This dataset can then be reliably used for evaluating the fairness of the model, as shown in Figure 2.

In this work, we conduct experiments on multiple real-world datasets commonly used in fairness research, simulating realistic scenarios involving separated datasets, such as isolated protected attributes and only a single overlapping variable. Our results show that the synthetic test data generated using our proposed approach exhibits high fidelity when compared to real test data. Crucially, we find that fairness metrics derived from testing classifier models on synthetic data closely align with those obtained from real data. These findings suggest that our approach provides a reliable method for fairness evaluation in scenarios where complete datasets are inaccessible, offering a viable alternative for testing in such contexts.

2 Related Work

Fairness Testing.

Significant work on fairness evaluation has centered on formalising definitions of fairness [32] and emphasising the critical role of data [3, 26, 22, 35]. Recent work has also explored fairness testing in response to regulatory requirements [23, 44] and in the context of industry [24, 29] and software development [14]. Additionally, there is growing interest in sample-efficient approaches to fairness testing [25, 43].

Synthetic Data Generation.

Generative models aim to learn the underlying distribution from real data and produce realistic synthetic data. In our work, we focus on tabular data, as it is the most common data type for real-world applications [41]. Various models have been developed for tabular data generation, from simple methods like SMOTE [13] to deep learning approaches such as CTGAN [49] and TVAE [49]. Significant previous work has focused on privacy-preserving synthetic data generation, employing marginal-based methods like the MST algorithm [30], with work showing that marginal-based algorithms and traditional methods, such as mixture models, are more effective at preserving attribute correlations than deep learning approaches [39, 36]. Recent innovative advancements also include using large language models [8] and offering customisable tabular data generation [45]. However, these methods typically assume access to full datasets to learn from, limiting their effectiveness in scenarios with restricted data access.

Synthetic Data for Bias.

Work on synthetic data for bias has predominantly focused on creating fair data for training [48, 9]; however, this offers no guarantee of unbiased models [19] and reliable testing methods are therefore crucial. Another approach is to simulate different scenarios to explore the interconnection between biases and their effect on performance and fairness evaluations [5, 12]. Recent work highlights the potential of synthetic data for evaluation, showing that, whilst testing on limited real data is unreliable, utilising synthetic test data allows for granular evaluation and testing on distributional shifts [43]. Emerging work also examines which synthetic data generation techniques are most effective for training and evaluating machine learning models, and the implications for model fairness [36].

Figure 2: Evaluation of a pre-trained black-box classifier (e.g. a classifier used by a bank for loan/no-loan decision) on the synthetic data which includes demographics not available during training, enabling the calculation of fairness metrics.

3 Methodology

Returning to our motivating example of a loan classifier, our assumption is that the classifier uses only non-protected attributes, such as savings $X$ and occupation $O$, in order to form a loan prediction $\hat{Y}$; in this case, the loan decision is some function of the non-protected attributes, e.g. $\hat{Y}=f(X,O)$. However, we would like to assess whether this prediction is fair against a protected attribute $A$ such as race. There are various statistical definitions of group fairness in classification, typically conditioned on protected attributes along which fairness should be ensured. We use the following notation: let $Y\in\{+,-\}$ represent the true outcome, $\hat{Y}\in\{+,-\}$ the predicted outcome, and $A\in\{\text{privileged},\text{unprivileged}\}$ the protected attribute. Here, ‘$+$’ denotes a positive classification outcome (e.g., loan approval), while ‘$-$’ denotes a negative outcome (e.g., loan rejection). For instance, the fairness metric Equal Opportunity Difference (EOD) is given by:

\text{EOD} = P(\hat{Y}=+ \mid Y=+, A=\text{unprivileged}) - P(\hat{Y}=+ \mid Y=+, A=\text{privileged})    (1)

To calculate this, one necessary term is $P(\hat{Y}=+ \mid Y=+, A)$, where

P(Y=+, A) = \sum_{O,X} P(Y=+, O, X, A).    (2)

This requires a model of the joint distribution (as shown in Figure 1), which can then be used to test the fairness of a pre-trained black-box classifier, as illustrated in Figure 2. In the following section, we explain how to construct joint distributions from a collection of overlapping marginal distributions.
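
For concreteness, the sketch below evaluates Equation (2) from a small joint probability table by summing over the non-protected attributes. The table values and column names are purely illustrative assumptions, not data from the paper.

```python
import pandas as pd

# Hypothetical joint probability table p(Y, O, X, A): one row per configuration,
# with a 'prob' column summing to 1. All values and column names are illustrative.
joint = pd.DataFrame({
    "Y": ["+", "+", "-", "-", "+", "-"],
    "O": ["clerk", "engineer", "clerk", "engineer", "clerk", "clerk"],
    "X": ["low", "high", "low", "high", "high", "high"],
    "A": ["unprivileged", "privileged", "unprivileged", "privileged",
          "privileged", "unprivileged"],
    "prob": [0.10, 0.25, 0.20, 0.15, 0.20, 0.10],
})

# Equation (2): P(Y=+, A) = sum over O, X of P(Y=+, O, X, A)
p_y_pos_a = joint[joint["Y"] == "+"].groupby("A")["prob"].sum()
print(p_y_pos_a)  # P(Y=+, A=privileged) and P(Y=+, A=unprivileged)
```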

3.1 Learning a Joint Distribution

Consider a fairness testing scenario that requires access to the distribution $p(\text{loan outcome}, \text{savings}, \text{occupation}, \text{race})$. Most real-world datasets, such as those provided by publicly available census data, often only provide sets of marginal distributions [21]. Suppose we have two separate datasets with empirical distributions $\hat{p}(\text{loan outcome}, \text{savings}, \text{occupation})$ and $\hat{p}(\text{occupation}, \text{race})$, where occupation is the overlapping variable. Our goal is to estimate the joint distribution $p(\text{loan outcome}, \text{savings}, \text{occupation}, \text{race})$. Theoretically, this problem is ill-posed and therefore requires additional assumptions.

Using marginal data observations and a structural independence assumption, the joint distribution can be estimated using maximum likelihood estimation. We consider below three simple structural independence assumptions, illustrated by graphical models, to fit a joint distribution on four variables $p(x_1,x_2,x_3,x_4)$, given two empirical marginal distributions $\hat{p}(x_1,x_2,x_3)$ and $\hat{p}(x_3,x_4)$. The estimated joint distribution is then used as a generative model to create synthetic data points through sampling [46]. Note that we assume marginal consistency, i.e. that all marginal distributions considered originate from a common underlying joint distribution.

3.1.1 Independence Given Overlap

[Graphical model: $(X_1,X_2)$ and $X_4$ are connected only through $X_3$.]

p(x_1,x_2,x_3,x_4) = p(x_3)\,\hat{p}(x_1,x_2 \mid x_3)\,\hat{p}(x_4 \mid x_3)    (3)

We model the joint distribution of $x_1,x_2,x_3$, and $x_4$ by treating the association between $(x_1,x_2)$ and $x_4$ as the product of their conditional distributions given $x_3$. To estimate $p(x_3)$, we take the average of the proportions from both marginal datasets and use this to sample $x_3$ (see Appendix (A.1) for proof of optimality). To sample from this model, we first sample from $p(x_3)$ and then draw conditional samples for $(x_1,x_2)$ and $x_4$ from the marginal datasets. Note that if the marginals are consistent, namely $\sum_{x_1,x_2}\hat{p}(x_1,x_2,x_3) = \sum_{x_4}\hat{p}(x_3,x_4) \equiv \hat{p}(x_3)$, then we simply set $p(x_3)=\hat{p}(x_3)$.
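
A minimal pandas sketch of this sampler is given below; the DataFrame names (internal_df, external_df), column names, and function name are illustrative assumptions rather than the paper's implementation, and it assumes every category of the overlapping variable appears in both datasets.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def sample_independence_given_overlap(internal_df, external_df, overlap="x3", n=1000):
    """Sample from p(x3) * p_hat(x1, x2 | x3) * p_hat(x4 | x3).

    Assumes every category of the overlapping variable occurs in both datasets.
    """
    # p(x3): average of the empirical proportions from the two marginal datasets
    p1 = internal_df[overlap].value_counts(normalize=True)
    p2 = external_df[overlap].value_counts(normalize=True)
    p_overlap = p1.add(p2, fill_value=0) / 2.0

    overlap_values = rng.choice(p_overlap.index, size=n, p=p_overlap.values)

    rows = []
    for v in overlap_values:
        # p_hat(x1, x2 | x3 = v): a random internal row with that overlap value
        left = internal_df[internal_df[overlap] == v].sample(1, replace=True).iloc[0]
        # p_hat(x4 | x3 = v): a random external row with that overlap value
        right = external_df[external_df[overlap] == v].sample(1, replace=True).iloc[0]
        row = left.to_dict()
        row.update({c: right[c] for c in external_df.columns if c != overlap})
        rows.append(row)
    return pd.DataFrame(rows)
```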

3.1.2 Marginal Preservation

[Graphical model: $(X_1,X_2,X_3)$ are modelled jointly, with $X_4$ connected only through $X_3$.]

p(x_1,x_2,x_3,x_4) = \hat{p}(x_1,x_2,x_3)\,\hat{p}(x_4 \mid x_3)    (4)

We directly use the proportions from the first marginal dataset to model the joint distribution of $x_1$, $x_2$ and $x_3$. A sample is then obtained by sampling from the marginal $\hat{p}(x_1,x_2,x_3)$ and then from the conditional marginal $\hat{p}(x_4 \mid x_3)$. Alternatively, we could preserve the second marginal by modelling the distribution as $p(x_1,x_2,x_3,x_4) = \hat{p}(x_1,x_2 \mid x_3)\,\hat{p}(x_3,x_4)$.
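
Under the same illustrative assumptions as the previous sketch (hypothetical internal_df and external_df sharing one categorical column), Marginal Preservation keeps whole rows of the first dataset and attaches the external attribute conditionally on the overlap; a short sketch:

```python
def sample_marginal_preservation(internal_df, external_df, overlap="x3", n=1000):
    """Sample from p_hat(x1, x2, x3) * p_hat(x4 | x3): the first marginal is kept exactly.

    Assumes external_df has a single column besides the overlapping variable.
    """
    base = internal_df.sample(n, replace=True).reset_index(drop=True)
    extra = [c for c in external_df.columns if c != overlap][0]
    base[extra] = [
        external_df.loc[external_df[overlap] == v, extra].sample(1, replace=True).iloc[0]
        for v in base[overlap]
    ]
    return base
```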

3.1.3 Latent Naïve Bayes

[Graphical model: latent $Z$ is a common parent of $X_1$, $X_2$, $X_3$, and $X_4$.]

p(x_1,x_2,x_3,x_4) = \sum_{z} p(z) \prod_{i=1}^{4} p(x_i \mid z)    (5)

We employ a latent variable model based on the Naïve Bayes assumption by introducing a latent variable $z$, which assumes that $x_1$, $x_2$, $x_3$, and $x_4$ are conditionally independent given $z$. We use the Expectation-Maximization (EM) algorithm [16] to train the model (see Appendix (A.2) for details).
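
The EM updates can be sketched in a few dozen lines of numpy. The following is a minimal illustration under our own assumptions, not the paper's exact implementation (details are in Appendix A.2): variables are assumed categorical and integer-encoded, each dataset observes only a subset of the variables, and the function and argument names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_latent_naive_bayes(datasets, n_states, n_latent=10, n_iter=100):
    """EM for p(x_1..x_d) = sum_z p(z) prod_i p(x_i | z), fitted from datasets that
    each observe only a subset of the variables.

    datasets : list of (obs_vars, data) pairs; obs_vars lists the observed variable
               indices and data is an int array of shape (n_records, len(obs_vars)).
    n_states : list giving the number of categories of each variable.
    """
    pz = np.full(n_latent, 1.0 / n_latent)
    px = [rng.dirichlet(np.ones(k), size=n_latent) for k in n_states]  # px[i][z, value]

    for _ in range(n_iter):
        pz_acc = np.zeros(n_latent)
        px_acc = [np.full((n_latent, k), 1e-12) for k in n_states]  # tiny prior avoids zeros

        for obs_vars, data in datasets:
            # E-step: posterior over z for each record; unobserved variables drop out
            log_post = np.tile(np.log(pz), (len(data), 1))
            for j, var in enumerate(obs_vars):
                log_post += np.log(px[var][:, data[:, j]]).T
            log_post -= log_post.max(axis=1, keepdims=True)
            post = np.exp(log_post)
            post /= post.sum(axis=1, keepdims=True)

            # M-step accumulators
            pz_acc += post.sum(axis=0)
            for j, var in enumerate(obs_vars):
                for k in range(n_states[var]):
                    px_acc[var][:, k] += post[data[:, j] == k].sum(axis=0)

        pz = pz_acc / pz_acc.sum()
        px = [t / t.sum(axis=1, keepdims=True) for t in px_acc]
    return pz, px

def sample_latent_naive_bayes(pz, px, n=1000):
    """Ancestral sampling: draw z, then each variable independently given z."""
    z = rng.choice(len(pz), size=n, p=pz)
    return np.stack(
        [np.array([rng.choice(t.shape[1], p=t[zi]) for zi in z]) for t in px], axis=1
    )
```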

3.1.4 Extension to More Complex Scenarios

We can extend the Latent Naïve Bayes method to include more variables by adding the term $p(x_k \mid z)$ for any new variable $x_k$. Similarly, other methods can be adapted to handle additional variables. For instance, if the second marginal distribution is $\hat{p}(x_3,x_4,x_5,x_6)$, we adjust the conditional distribution from $\hat{p}(x_4 \mid x_3)$ to $\hat{p}(x_4,x_5,x_6 \mid x_3)$. When multiple variables overlap between datasets, such as in the empirical distributions $\hat{p}(x_1,x_2,x_3,x_4)$ and $\hat{p}(x_3,x_4,x_5)$ where $(x_3,x_4)$ are overlapping, we extend the methods to preserve the joint structure. For the Independence Given Overlap method, we use: $p(x_1,x_2,x_3,x_4,x_5) = p(x_3,x_4)\,\hat{p}(x_1,x_2 \mid x_3,x_4)\,\hat{p}(x_5 \mid x_3,x_4)$.
For the Marginal Preservation method, we use: $p(x_1,x_2,x_3,x_4,x_5) = \hat{p}(x_1,x_2,x_3,x_4)\,\hat{p}(x_5 \mid x_3,x_4)$.

In this work, we focus on estimating the joint distribution from two datasets that overlap in a single variable. Real-world datasets may exhibit more complex structures involving multiple datasets. While Latent Naïve Bayes offers a straightforward extension to multiple datasets, there could be alternative approaches such as using Junction Trees [4]. Such work is left for future research, with this study serving as a preliminary exploration of our proposed approach.

4 Experimental Setup

Figure 3: Experimental Setup

We aim to generate synthetic datasets and evaluate their quality based on two criteria: 1) how well they can approximate a real ground-truth dataset, and 2) how accurately they can estimate the fairness of a black-box classifier in situations where complete data, including protected attributes, is inaccessible. We assume that, as in our example, we have access to two separate datasets, for example one containing {loan outcome, savings, occupation} and another containing {occupation, race}, used to estimate a joint distribution and generate a synthetic test dataset including all attributes. In this setup, one dataset includes the protected attribute, while the other contains model input features, with an overlapping variable between the two datasets.

4.1 Datasets

We conduct our experiments using three real-world datasets: Adult [6], COMPAS [38], and German Credit [18], detailed in Table 1, which are commonly used in the fairness literature. For all three datasets we follow the literature by removing instances with null values, and map all continuous variables into categorical variables (see Appendix (B) for details) [28]. These datasets represent complete real data with protected attributes. Our goal is to approximate such data using our synthetic data generation approach.

Table 1: Overview of real-world datasets used in experiments
Name          # Instances   # Attributes   Label         Protected Attributes
Adult [6]     45,222        13             Income        Sex (67.5% male, 32.5% female); Race (86% white, 14% non-white)
COMPAS [38]   5,278         9              Recidivism    Sex (80.5% male, 19.5% female); Race (60.2% white, 39.8% black)
German [18]   1,000         22             Credit Risk   Age (81% > 25, 19% ≤ 25); Sex (69% male, 31% female)
Table 2: Separation of complete real datasets, with each row illustrating how attributes are categorised into ‘external’ and ‘internal’ datasets. The ‘external’ dataset shown includes the protected attributes, while the ‘internal’ dataset comprises the remaining attributes. The overlapping variable shared between the two datasets is listed first in each row.
Dataset          Attributes in ‘External’ Dataset (overlapping variable first)
Adult            relationship, age, sex, race, marital-status, native-country
                 marital-status, age, sex, race, native-country
COMPAS           score, sex, age, race
                 violent score, sex, age, race
German Credit    property, sex, marital-status, age, foreign-worker
                 housing, sex, marital-status, age, foreign-worker

4.2 Simulating Data Scenarios

Our experimental setup is visualised in Figure 3. To assess our approach, we simulate having a known ground truth dataset to compare our generated synthetic data against.

Real Test Data.

Starting with a complete real dataset, we reserve a hold-out real test set $D_{\text{test}}$ (30% of the complete real dataset) that includes all relevant attributes. This is the dataset that we would like to approximate using the synthetic data we generate, and we use it to assess our approach.

Separated Data.

We wish to simulate the scenario where we do not have access to complete data but only have two separate datasets, as illustrated in Figure 1. We therefore separate the remaining complete real data by column into two overlapping datasets. We consider separations where protected attributes are isolated from other variables, and where there is one variable overlapping between datasets. We refer to these separate datasets as ‘internal’ and ‘external’, where the ‘external’ data includes protected attributes not available in the ‘internal’ data. Such separation simulates only having access to protected attributes separately, such as in publicly available census data, and assumes limited overlap of attributes.

Table 2 demonstrates the separation of our three complete real-world datasets. Notably, the ‘external’ datasets include data commonly found as census variables. As illustrated in Figure 1, we use the two separate datasets to estimate the joint distribution of all attributes and generate synthetic test data $D_{\text{synth}}$. We also wish to simulate having a trained classifier that we wish to test for fairness, as shown in Figure 2. This is done by training classifier models on one of the real separate datasets, the ‘internal’ dataset, which does not include protected attributes. The classifier models will then be tested on both synthetic and real test data.
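
To make the simulation concrete, the following sketch holds out a real test set, splits the remaining data by column into ‘internal’ and ‘external’ datasets sharing one overlapping variable, and trains a scikit-learn DecisionTreeClassifier on the internal data only. The file name and column names are illustrative assumptions (loosely following the first Adult row of Table 2), not the paper's released code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("adult_processed.csv")   # hypothetical file: processed Adult data, all categorical

label = "income"                           # outcome label
overlap = "relationship"                   # single overlapping variable
external_cols = [overlap, "age", "sex", "race", "marital-status", "native-country"]

# 30% hold-out real test set containing all attributes
train_df, test_df = train_test_split(df, test_size=0.3, random_state=0)

# 'external' dataset: protected/demographic attributes plus the overlapping variable
external_df = train_df[external_cols]

# 'internal' dataset: remaining attributes plus the overlapping variable (no protected attributes)
internal_cols = [c for c in train_df.columns if c not in external_cols or c == overlap]
internal_df = train_df[internal_cols]

# Black-box classifier trained on the internal data only
features = [c for c in internal_df.columns if c != label]
clf = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                    DecisionTreeClassifier(random_state=0))
clf.fit(internal_df[features], internal_df[label])
```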

4.3 Baselines

To our knowledge, no prior work on fairness testing has tackled the challenge of creating synthetic data from separate datasets that accurately captures the relationship between demographic and model features. We compare our approach with common methods for tabular synthetic data generation. The Independent Model assumes independence between any two variables to estimate the joint distribution [31]. Conditional Tabular GAN (CTGAN) [49] is a state-of-the-art method that learns from the full dataset, unlike our method, which works with separate datasets. Although CTGAN has an advantage due to its access to complete data, we include its performance using default hyperparameters for comparison.
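
As a reference point, the Independent Model baseline can be expressed in a few lines: each column is sampled from its own empirical marginal, discarding all correlations. A minimal sketch is shown below (the DataFrame name and function are hypothetical); CTGAN, by contrast, is trained on the complete dataset using its published implementation.

```python
import pandas as pd

def independent_baseline(real_df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Sample every column independently from its empirical marginal distribution."""
    return pd.DataFrame({
        # a different seed per column so rows are not simply copied from real_df
        col: real_df[col].sample(n, replace=True, random_state=seed + i)
                         .reset_index(drop=True)
        for i, col in enumerate(real_df.columns)
    })
```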

5 Evaluating the Quality of Synthetic Datasets

We use two criteria to evaluate the quality of our synthetic datasets: 1) How does the synthetic data compare with real data? and 2) How do the fairness metrics computed on the synthetic data compare with those computed on real data?

We present results for eighteen synthetic test datasets which were generated using the three joint distribution estimation methods, applied to six different pairs of separated data across real world datasets: Adult, COMPAS and German Credit. Table 2 shows an overview of how our datasets have been separated and which overlapping attributes have been used. In addition to the synthetic datasets generated using our proposed approach, we also generate synthetic datasets using the two baseline approaches and compare the quality of our synthetic datasets with the quality of synthetic datasets generated using the baseline methods.

Table 3: Fidelity metrics for synthetic datasets of the Adult dataset, generated from separate data (‘relationship’ overlapping) with different joint estimation methods. Metrics include total variation distance complement (1-TVD), contingency similarity (CS), discriminator measure (DM), and KL divergence of $p(A,Y)$ in synthetic vs real data (where $Y$ is the outcome label and $A$ is a protected attribute such as race and sex). Baseline methods include CTGAN and Indep.
                                Overall Fidelity              Joint Distribution for (A,Y)
Method                          1-TVD ↑   CS ↑     DM ↓      KL (Race) ↓   KL (Sex) ↓
Indep-Overlap (Relationship)    0.993     0.983    0.588     0.002         0.001
Marginal (Relationship)         0.993     0.983    0.588     0.002         0.001
Latent (Relationship)           0.986     0.968    0.658     0.002         0.002
CTGAN                           0.935     0.938    0.656     0.132         0.048
Independent                     0.935     0.895    0.808     0.005         0.026
(a) Adult Data   (b) COMPAS Data   (c) German Credit Data

Figure 4: Box-plots of fairness metrics for a Decision Tree Classifier across synthetic datasets. Each subplot represents a specific fairness metric for a protected attribute, showing the distribution of metrics from bootstrap samples. The top box-plot in each subplot displays the distribution of the metric from testing the classifier on real test data (blue), the middle boxplots (green) are for synthetic data generated using our approach differentiated by data separation methods (overlapping variables in brackets) and joint estimation methods, and the bottom box plots are for baseline methods (white).

5.1 Overall Fidelity of Synthetic Data Compared to Real Data

Fidelity evaluates how close the distribution of the synthetic data is to that of the real data with metrics often estimating the difference between marginal distributions [34, 37, 42]. To evaluate the fidelity of our synthetic datasets, we focus on the following metrics:

  • Total Variation Distance (TVD): Measures the difference between the empirical distributions of a variable in the synthetic data and the real data, defined as half the $L_1$ distance. We use the TVD Complement score, $1-\text{TVD}$, where higher scores close to one indicate better quality synthetic data (averaged across variables) [34].

  • Contingency Similarity (CS): Assesses the similarity between normalised contingency tables of two variables, one from the real data and one from the synthetic data. This metric is calculated by first normalising the contingency tables to show the proportion of each category combination, then computing the TVD between these tables. The complement, $1-\text{TVD}$, is used so that higher values close to one reflect greater similarity (averaged across variables) [34]. A minimal sketch of the TVD complement and contingency similarity computations follows this list.

  • Cramér’s V Correlation: Quantifies the strength of association between two categorical variables based on the Chi-square statistic [15]. We calculate the difference in Cramér’s V correlation between the synthetic and real data for each pair of variables.

  • Discriminator Measure (DM): Evaluates whether the synthetic data can be distinguished from the real data. We train a Random Forest Classifier on a balanced dataset, with synthetic data labeled as 1 and real data labeled as 0. The classifier’s average accuracy on a test set is reported across five trials with different random seeds [8].
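
A minimal sketch of the first two metrics, assuming categorical pandas DataFrames real_df and synth_df with identical columns (names are illustrative), is given below; the discriminator measure additionally requires training a classifier on the pooled data.

```python
import itertools
import pandas as pd

def tvd_complement(real: pd.Series, synth: pd.Series) -> float:
    """1 - TVD between the empirical distributions of one categorical column."""
    p = real.value_counts(normalize=True)
    q = synth.value_counts(normalize=True)
    p, q = p.align(q, fill_value=0)
    return 1 - 0.5 * (p - q).abs().sum()

def contingency_similarity(real_df, synth_df, col_a, col_b) -> float:
    """1 - TVD between the normalised two-way contingency tables of a column pair."""
    p = pd.crosstab(real_df[col_a], real_df[col_b], normalize=True).stack()
    q = pd.crosstab(synth_df[col_a], synth_df[col_b], normalize=True).stack()
    p, q = p.align(q, fill_value=0)
    return 1 - 0.5 * (p - q).abs().sum()

def average_fidelity(real_df, synth_df):
    """Average the two scores over all columns / column pairs."""
    cols = list(real_df.columns)
    tvd = sum(tvd_complement(real_df[c], synth_df[c]) for c in cols) / len(cols)
    pairs = list(itertools.combinations(cols, 2))
    cs = sum(contingency_similarity(real_df, synth_df, a, b) for a, b in pairs) / len(pairs)
    return tvd, cs
```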

The eighteen synthetic datasets generated using our approach demonstrate high fidelity to the real test data, with an example shown in Table 3 for the Adult dataset. The average $1-\text{TVD}$ values across synthetic data for the Adult, COMPAS, and German datasets are 0.991, 0.978, and 0.966 respectively, while the average CS values are 0.978, 0.953, and 0.926. These results demonstrate the effectiveness of our approach in generating data that closely mirrors the proportions of the real test dataset. The results also show competitive or superior performance compared to the CTGAN baseline method, which generates synthetic data from complete data rather than separate data. The DM scores reveal moderate accuracy in distinguishing synthetic from real data. Across the eighteen synthetic datasets there is on average a 12.9% reduction in discriminator performance compared to the Independent Baseline and an 8.2% reduction compared to the CTGAN Baseline, suggesting that the synthetic test data is more challenging to differentiate from real data. Additionally, the difference in Cramér’s V correlations between synthetic and real datasets suggests that the attribute correlations in our synthetic data closely match those in the real data, showing greater similarity than baseline methods. See Appendix (C.1) for correlation figures and full fidelity results.

5.2 Protected Attribute and Outcome Relationship in Synthetic Data Compared to Real Data

As illustrated in Section 3, understanding the relationship between the protected attribute $A$ and the outcome label $Y$ is essential for assessing group disparities. When $A$ and $Y$ are located in separate datasets, such as the simple case in our loan example, it is crucial that the relationship between these variables $(A,Y)$ is accurately reconstructed in the synthetic datasets. We therefore measure the Kullback-Leibler (KL) divergence, $D_{\text{KL}}(p_{\text{synth}}(A,Y) \parallel p_{\text{real}}(A,Y))$, between the joint distributions $p(A,Y)$ of synthetic and real data. KL divergence values close to zero indicate that the joint distribution of protected attribute and outcome label in the synthetic data is similar to the distribution in the real data.
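
A minimal sketch of this computation, assuming categorical DataFrames with hypothetical column names for the protected attribute and outcome label, and a small constant to guard against empty cells:

```python
import numpy as np
import pandas as pd

def kl_joint(synth_df, real_df, a_col="race", y_col="income", eps=1e-12):
    """KL( p_synth(A, Y) || p_real(A, Y) ) for two categorical columns."""
    p = pd.crosstab(synth_df[a_col], synth_df[y_col], normalize=True).stack()
    q = pd.crosstab(real_df[a_col], real_df[y_col], normalize=True).stack()
    p, q = p.align(q, fill_value=0)
    p, q = p.to_numpy() + eps, q.to_numpy() + eps   # small constant guards empty cells
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```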

Table 3 presents the divergence for the Adult dataset, focusing on synthetic data generated from separate data which had ‘relationship’ as the overlapping variable. Across all separations and joint distribution estimation methods for Adult Data, the average KL divergence is 0.002 for Race and 0.001 for Sex. Despite generating synthetic data from separate datasets with only one overlapping variable, the joint distribution of protected attributes and outcome values is accurately reconstructed, as evidenced by the low KL divergence values. In comparison, CTGAN shows higher KL divergence values of 0.132 for Race and 0.048 for Sex. Similar patterns are observed across the other datasets, with detailed results provided in Appendix (C.2).

5.3 Fairness Metrics from Synthetic Data Compared to Real Data

We next examine how fairness metrics computed on synthetic test datasets compare with those from real test datasets. Using the notation from Section 3, we focus on the Equal Opportunity Difference (EOD) (Equation 1) and two other common metrics: Disparate Impact (DI) and Average Odds Difference (AOD) [32]. The Disparate Impact (DI) metric compares the ratio of positive (favorable) outcomes between the unprivileged and the privileged groups and can be computed as:

\text{DI} = \dfrac{p(\hat{Y}=+ \mid A=\text{unprivileged})}{p(\hat{Y}=+ \mid A=\text{privileged})}    (6)

The Average Odds Difference (AOD) metric measures the disparity between the false positive rate and true positive rate for the unprivileged and privileged groups and can be written as follows:

\text{AOD} = \tfrac{1}{2}\big[\, p(\hat{Y}=+ \mid Y=-, A=\text{unprivileged}) - p(\hat{Y}=+ \mid Y=-, A=\text{privileged}) + p(\hat{Y}=+ \mid Y=+, A=\text{unprivileged}) - p(\hat{Y}=+ \mid Y=+, A=\text{privileged}) \,\big]    (7)

Figure 4 compares fairness metrics between synthetic and real test datasets for a Decision Tree classifier. For each dataset, we generate 1,000 bootstrap samples of the same size as the real test data to compute fairness metrics. Box-plots for DI and AOD illustrate the distribution of these metrics. EOD, which trends similarly to AOD, is omitted from the figure but included in Appendix (C.3) with detailed results on the absolute differences between bootstrap means of fairness metrics from synthetic and real data. The results show that the fairness metrics from our synthetic test data closely match those from real data, outperforming baseline methods on nearly all metrics and protected attributes, except for DI for race in the Adult dataset. Notably, the synthetic data for the COMPAS dataset performs best, with absolute differences of 0.000 in bootstrap means for AOD and DI values for race, achieved using the ‘Marginal’ joint estimation method on separate data with the ‘violent score’ variable overlapping. For the Adult dataset, we also see small absolute differences in bootstrap means, with values as low as 0.002 for DI related to sex, 0.003 for AOD related to race, and 0.010 for AOD related to sex. For the German dataset, we see similar results, showing small absolute differences of 0.005 for AOD and 0.015 for DI related to sex. Despite larger differences shown in fairness metrics for age, the synthetic data still outperforms baseline methods.
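
The bootstrap comparison in Figure 4 can be reproduced along these lines, reusing the fairness_metrics helper sketched above; the sample sizes, seeds, and variable names are illustrative.

```python
import pandas as pd

def bootstrap_metric(df, metric="AOD", n_boot=1000, seed=0, **metric_kwargs):
    """Distribution of one fairness metric over bootstrap resamples of a test set."""
    values = []
    for b in range(n_boot):
        resample = df.sample(len(df), replace=True, random_state=seed + b)
        values.append(fairness_metrics(resample, **metric_kwargs)[metric])
    return pd.Series(values)

# Example comparison of bootstrap means on real vs. synthetic test data:
# real_aod = bootstrap_metric(real_test_with_preds, "AOD")
# synth_aod = bootstrap_metric(synth_test_with_preds, "AOD")
# print(abs(real_aod.mean() - synth_aod.mean()))
```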

6 Conclusion and Future Work

In this study, we tackled the challenge of evaluating classifier fairness when complete datasets, including protected attributes, are inaccessible. We proposed an approach that utilises separate overlapping datasets to estimate a joint distribution and generate complete synthetic test data which includes demographic information and accurately captures the relationships between demographics and model features essential for fairness testing. Our empirical analysis demonstrated that the fairness metrics derived from this synthetic test data closely match those obtained from real data. Our results further show that even with the assumption of only a single overlapping variable between separate datasets, and simple joint distribution estimation methods, the synthetic data can closely mirror real data outcomes and exhibit high fidelity.

This work demonstrates a promising approach for fairness testing by leveraging marginally overlapping datasets to curate effective test datasets. However, we simulated separate datasets and data scenarios; future research could explore incorporating real public data and more complex data scenarios to validate the results obtained. We also employed three joint estimation methods using structural assumptions. Future research could instead explore all feasible joint distributions that meet the constraints of the available marginal distributions, and thus work towards defining bounds within which the true fairness metrics are likely to fall.

Acknowledgments and Disclosure of Funding

The authors declare no competing interests related to this paper. This research was supported by the UKRI Engineering and Physical Sciences Research Council (EPSRC) [grant numbers EP/S021566/1 and EP/P024289/1].

References

  • Andrus et al. [2021] McKane Andrus, Elena Spitzer, Jeffrey Brown, and Alice Xiang. What We Can’t Measure, We Can’t Understand: Challenges to Demographic Data Procurement in the Pursuit of Fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021.
  • Assefa et al. [2021] Samuel A. Assefa, Danial Dervovic, Mahmoud Mahfouz, Robert E. Tillman, Prashant Reddy, and Manuela Veloso. Generating Synthetic Data in Finance: Opportunities, Challenges and Pitfalls. In Proceedings of the 1st ACM International Conference on AI in Finance, 2021.
  • Bao et al. [2021] Michelle Bao, Angela Zhou, Samantha Zottola, Brian Brubach, Sarah Desmarais, Aaron Horowitz, Kristian Lum, and Suresh Venkatasubramanian. It’s COMPASlicated: The Messy Relationship between RAI datasets and Algorithmic Fairness Benchmarks. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
  • Barber [2012] David Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, USA, 2012.
  • Baumann et al. [2023] Joachim Baumann, Alessandro Castelnovo, Riccardo Crupi, Nicole Inverardi, and Daniele Regoli. Bias on Demand: A Modelling Framework That Generates Synthetic Data With Bias. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 2023.
  • Becker and Kohavi [1996] Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20.
  • Bergman et al. [2023] A Stevie Bergman, Lisa Anne Hendricks, Maribeth Rauh, Boxi Wu, William Agnew, Markus Kunesch, Isabella Duan, Iason Gabriel, and William Isaac. Representation in AI Evaluations. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 2023.
  • Borisov et al. [2023] Vadim Borisov, Kathrin Sessler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Language Models are Realistic Tabular Data Generators. In Proceedings of the 11th International Conference on Learning Representations, 2023.
  • Breugel et al. [2024] Boris van Breugel, Trent Kyono, Jeroen Berrevoets, and Mihaela van der Schaar. DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems, 2024.
  • Buolamwini and Gebru [2018] Joy Buolamwini and Timnit Gebru. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 2018 ACM Conference on Fairness, Accountability and Transparency, 2018.
  • Caliskan et al. [2017] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.
  • Castelnovo et al. [2022] Alessandro Castelnovo, Riccardo Crupi, Greta Greco, Daniele Regoli, Ilaria Giuseppina Penco, and Andrea Claudio Cosentini. A Clarification of the Nuances in the Fairness Metrics Landscape. Scientific Reports, 12(1):4209, 2022.
  • Chawla et al. [2002] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of artificial intelligence research, 16:321–357, 2002.
  • Chen et al. [2021] Richard J Chen, Ming Y Lu, Tiffany Y Chen, Drew FK Williamson, and Faisal Mahmood. Synthetic data in machine learning for medicine and healthcare. Nature Biomedical Engineering, 5(6):493–497, 2021.
  • Cramér [1946] H. Cramér. Mathematical Methods of Statistics. Princeton University Press, USA, 1946.
  • Dempster et al. [1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum Likelihood from Incomplete Data Via the EM Algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  • Department for Science, Innovation and Technology [2023] Department for Science, Innovation and Technology. A Pro-Innovation Approach to AI Regulation. Technical report, Government of the United Kingdom, 2023.
  • Dheeru and Karra Taniskidou [2017] D. Dheeru and E. Karra Taniskidou. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Eitan et al. [2022] Yam Eitan, Nathan Cavaglione, Michael Arbel, and Samuel Cohen. Fair Synthetic Data Does not Necessarily Lead to Fair Models. In Proceedings of the NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research, 2022.
  • European Commission [2023] European Commission. Regulation (EU) 2023/822 of the European Parliament and of the Council on Harmonized Rules on Artificial Intelligence (Artificial Intelligence Act), 2023.
  • Frogner and Poggio [2019] Charlie Frogner and Tomaso Poggio. Fast and Flexible Inference of Joint Distributions from their Marginals. In Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Gebru et al. [2021] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
  • Groves et al. [2024] Lara Groves, Jacob Metcalf, Alayna Kennedy, Briana Vecchione, and Andrew Strait. Auditing Work: Exploring the New York City algorithmic bias audit regime. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024.
  • Holstein et al. [2019] Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé, Miro Dudik, and Hanna Wallach. Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need? In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019.
  • Ji et al. [2020] Disi Ji, Padhraic Smyth, and Mark Steyvers. Can I Trust My Fairness Metric? Assessing Fairness with Unlabeled Data and Bayesian Inference. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, 2020.
  • Jo and Gebru [2020] Eun Seo Jo and Timnit Gebru. Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning. In Proceedings of the 2020 ACM Conference on Fairness, Accountability, and Transparency, 2020.
  • Johnson-Roberson et al. [2017] Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? In Proceedings of the 2017 IEEE International Conference on Robotics and Automation, 2017.
  • Le Quy et al. [2022] Tai Le Quy, Arjun Roy, Vasileios Iosifidis, Wenbin Zhang, and Eirini Ntoutsi. A survey on datasets for fairness-aware machine learning. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 12(3):e1452, 2022.
  • Madaio et al. [2022] Michael Madaio, Lisa Egede, Hariharan Subramonyam, Jennifer Wortman Vaughan, and Hanna Wallach. Assessing the Fairness of AI Systems: AI Practitioners’ Processes, Challenges, and Needs for Support. Proceedings of the ACM on Human-Computer Interaction, 6:1–26, 2022.
  • McKenna et al. [2021] Ryan McKenna, Gerome Miklau, and Daniel Sheldon. Winning the NIST Contest: A scalable and general approach to differentially private synthetic data. Journal of Privacy and Confidentiality, 11(3), 2021.
  • McKenna et al. [2022] Ryan McKenna, Brett Mullins, Daniel Sheldon, and Gerome Miklau. AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data. In Proceedings of the VLDB Endowment, 2022.
  • Mehrabi et al. [2021] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM Computing Surveys, 54(6):1–35, 2021.
  • Office for National Statistics [2021] Office for National Statistics. 2021 Census Data, 2021. URL https://www.ons.gov.uk/census/aboutcensus/censusproducts/multivariatedata.
  • Patki et al. [2016] Neha Patki, Roy Wedge, and Kalyan Veeramachaneni. The Synthetic Data Vault. In Proceedings of the 2016 IEEE International Conference on Data Science and Advanced Analytics, 2016.
  • Paullada et al. [2021] Amandalynne Paullada, Inioluwa Deborah Raji, Emily M Bender, Emily Denton, and Alex Hanna. Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns, 2(11), 2021.
  • Pereira et al. [2024] Mayana Pereira, Meghana Kshirsagar, Sumit Mukherjee, Rahul Dodhia, Juan Lavista Ferres, and Rafael de Sousa. Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data. PLOS ONE, 19(2), 2024.
  • Platzer and Reutterer [2021] Michael Platzer and Thomas Reutterer. Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data. Frontiers in Big Data, 4:679939, 2021.
  • ProPublica [2016] ProPublica. COMPAS Recidivism Risk Score Data and Analysis, 2016. URL https://github.com/propublica/compas-analysis.
  • Richardson and Weiss [2018] Eitan Richardson and Yair Weiss. On GANs and GMMs. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
  • Shome et al. [2024] Arumoy Shome, Luis Cruz, and Arie Van Deursen. Data vs. Model Machine Learning Fairness Testing: An Empirical Study. In Proceedings of the 5th IEEE/ACM International Workshop on Deep Learning for Testing and Testing for Deep Learning, 2024.
  • Shwartz-Ziv and Armon [2022] Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, 2022.
  • Tao et al. [2021] Yuchao Tao, Ryan McKenna, Michael Hay, Ashwin Machanavajjhala, and Gerome Miklau. Benchmarking Differentially Private Synthetic Data Generation Algorithms. In Proceedings of the 3rd AAAI Workshop on Privacy-Preserving Artificial Intelligence, 2021.
  • van Breugel et al. [2023] Boris van Breugel, Nabeel Seedat, Fergus Imrie, and Mihaela van der Schaar. Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data. In Proceedings of the 2023 International Conference on Neural Information Processing Systems, 2023.
  • Veale and Binns [2017] Michael Veale and Reuben Binns. Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data. Big Data & Society, 4(2), 2017.
  • Vero et al. [2024] Mark Vero, Mislav Balunovic, and Martin Vechev. CuTS: Customizable Tabular Synthetic Data Generation. In Proceedings of the 41st International Conference on Machine Learning, 2024.
  • Vora et al. [2021] Jian Vora, Karthik S Gurumoorthy, and Ajit Rajwade. Recovery of Joint Probability Distribution from One-Way Marginals: Low Rank Tensors and Random Projections. In Proceedings of the 2021 IEEE Statistical Signal Processing Workshop, 2021.
  • White House Office of Science and Technology Policy [2022] White House Office of Science and Technology Policy. Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People. https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf, 2022.
  • Xu et al. [2018] Depeng Xu, Shuhan Yuan, Lu Zhang, and Xintao Wu. FairGAN: Fairness-aware Generative Adversarial Networks. In Proceedings of the 2018 IEEE International Conference on Big Data, 2018.
  • Xu et al. [2019] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.

Appendix

Appendix to the paper Beyond Internal Data: Constructing Complete Datasets for Fairness Testing, presented at the Algorithmic Fairness through the Lens of Metrics and Evaluation (AFME) Workshop at NeurIPS 2024.

The appendix is structured as follows:

  • Appendix §A provides technical details on the joint distribution estimation methods, Marginal Preservation and Latent Naïve Bayes, as outlined in the main text.

  • Appendix §B describes each dataset used in the experiments, with tables specifying the variables and categories present after preprocessing.

  • Appendix §C presents detailed results of the metrics used to assess the quality of the generated synthetic data.

Appendix A Technical Details for Joint Distribution Estimation Methods

A.1 Proof of Optimality for the Independence Given Overlap Method

To find the optimal $p(x_3)$, we start by minimising the total Kullback-Leibler (KL) divergence:

$$
\mathcal{L}(p) = D_{\mathrm{KL}}\big(\hat{p}(x_1,x_2,x_3) \,\|\, p(x_1,x_2,x_3)\big) + D_{\mathrm{KL}}\big(\hat{p}(x_3,x_4) \,\|\, p(x_3,x_4)\big). \tag{8}
$$

Let $\hat{p}_1(x_3)$ and $\hat{p}_2(x_3)$ be the empirical marginals of $x_3$ from the first and second datasets, respectively:

$$
\hat{p}_1(x_3) = \sum_{x_1, x_2} \hat{p}(x_1, x_2, x_3), \qquad \hat{p}_2(x_3) = \sum_{x_4} \hat{p}(x_3, x_4). \tag{9}
$$

From our joint distribution assumption $p(x_1, x_2, x_3, x_4) = p(x_3)\,\hat{p}(x_1, x_2 \mid x_3)\,\hat{p}(x_4 \mid x_3)$, we obtain the marginals $p(x_1, x_2, x_3) = p(x_3)\,\hat{p}(x_1, x_2 \mid x_3)$ and $p(x_3, x_4) = p(x_3)\,\hat{p}(x_4 \mid x_3)$.

To minimise the KL divergence with respect to $p(x_3)$, we rewrite $\mathcal{L}(p)$, keeping only the terms that depend on the marginal $p(x_3)$:

$$
\begin{aligned}
\mathcal{L}(p) &= \sum_{x_1,x_2,x_3} \hat{p}(x_1,x_2,x_3)\, \log\frac{\hat{p}_1(x_3)\,\hat{p}(x_1,x_2 \mid x_3)}{p(x_3)\,\hat{p}(x_1,x_2 \mid x_3)} + \sum_{x_3,x_4} \hat{p}(x_3,x_4)\, \log\frac{\hat{p}_2(x_3)\,\hat{p}(x_4 \mid x_3)}{p(x_3)\,\hat{p}(x_4 \mid x_3)} \\
&= -\sum_{x_3}\sum_{x_1,x_2} \hat{p}(x_1,x_2,x_3)\log p(x_3) \;-\; \sum_{x_3}\sum_{x_4} \hat{p}(x_3,x_4)\log p(x_3) \;+\; \text{const.} \\
&= -\sum_{x_3}\big(\hat{p}_1(x_3) + \hat{p}_2(x_3)\big)\log p(x_3) \;+\; \text{const.},
\end{aligned}
\tag{10}
$$

where the constant collects all terms that do not depend on $p(x_3)$.

We find the optimal $p(x_3)$ by minimising $\mathcal{L}(p)$ subject to $\sum_{x_3} p(x_3) = 1$, which ensures that $p(x_3)$ is a valid probability distribution. Introducing a Lagrange multiplier for this constraint and setting the derivative with respect to each $p(x_3)$ to zero gives

$$
p(x_3) \propto \hat{p}_1(x_3) + \hat{p}_2(x_3). \tag{11}
$$

To normalise $p(x_3)$, we set

$$
p(x_3) = \frac{\hat{p}_1(x_3) + \hat{p}_2(x_3)}{\sum_{x_3'} \big(\hat{p}_1(x_3') + \hat{p}_2(x_3')\big)} = \frac{\hat{p}_1(x_3) + \hat{p}_2(x_3)}{2}, \tag{12}
$$

where the denominator equals 2 because $\hat{p}_1$ and $\hat{p}_2$ each sum to one; this ensures that $p(x_3)$ is a valid probability distribution.

Therefore the optimal p(x3)𝑝subscript𝑥3p(x_{3})italic_p ( italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) is the average of the empirical marginals from both datasets.
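To make the construction concrete, the following is a minimal numpy sketch (our own illustration, not code released with the paper) of the resulting procedure: estimate the two empirical joints, average the shared marginal as in (12), and sample complete rows from $p(x_3)\,\hat{p}(x_1, x_2 \mid x_3)\,\hat{p}(x_4 \mid x_3)$. All function and variable names are illustrative, and the sketch assumes every category of $x_3$ occurs in both datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_counts(rows, sizes):
    """Empirical joint distribution over integer-coded categorical columns."""
    table = np.zeros(sizes)
    for row in rows:
        table[tuple(row)] += 1
    return table / table.sum()

def independence_given_overlap(D1, D2, sizes, n_samples=10_000):
    """D1: int array (N1, 3) of (x1, x2, x3); D2: int array (N2, 2) of (x3, x4).
    sizes = (M1, M2, M3, M4). Returns synthetic rows (x1, x2, x3, x4)."""
    M1, M2, M3, M4 = sizes
    p123 = joint_counts(D1, (M1, M2, M3))            # empirical p_hat(x1, x2, x3)
    p34 = joint_counts(D2, (M3, M4))                 # empirical p_hat(x3, x4)
    # Shared marginal: average of the two empirical marginals, as in eq. (12).
    p3 = 0.5 * (p123.sum(axis=(0, 1)) + p34.sum(axis=1))
    # Conditionals p_hat(x1, x2 | x3) and p_hat(x4 | x3).
    p12_g3 = p123 / np.maximum(p123.sum(axis=(0, 1), keepdims=True), 1e-12)
    p4_g3 = p34 / np.maximum(p34.sum(axis=1, keepdims=True), 1e-12)
    samples = np.empty((n_samples, 4), dtype=int)
    for n in range(n_samples):
        x3 = rng.choice(M3, p=p3)
        flat = rng.choice(M1 * M2, p=p12_g3[:, :, x3].ravel())
        x1, x2 = divmod(flat, M2)
        x4 = rng.choice(M4, p=p4_g3[x3])
        samples[n] = (x1, x2, x3, x4)
    return samples
```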

A.2 Details of the Expectation-Maximisation Algorithm for the Latent Naïve Bayes Method

We assume categorical variables $X_1, X_2, X_3, X_4$ with $\mathrm{dom}(X_i) = \{1, 2, \ldots, M_i\}$, where $M_i \in \mathbb{N}$ and $M_i > 1$ for $i \in \{1, 2, 3, 4\}$. We want to sample from the full joint distribution $p(X_1, X_2, X_3, X_4)$. However, our observations take the following form, where $x_3$ and $x_3'$ are both observations of the same variable $X_3$:

$$
\mathbf{D_1} = \{(x_1^n, x_2^n, x_3^n)\}_{n=1}^{N_1} \tag{13}
$$
$$
\mathbf{D_2} = \{(x_3'^n, x_4^n)\}_{n=1}^{N_2} \tag{14}
$$

To model the complex dependencies between the variables while keeping the model simple, we intentionally introduce a latent variable $Z$ with $\mathrm{dom}(Z) = \{1, 2, \ldots, K\}$, $K \in \mathbb{N}$, $K > 1$, and the following probabilistic graphical model.

[Graphical model: the latent variable $Z$ is the sole parent of $X_1$, $X_2$, $X_3$, and $X_4$ (naïve Bayes structure).]

By treating $Z$ as a missing variable, the resulting mixture model can be trained using the EM algorithm.

The model defines the generative process for each data item $n$ as follows:

  1. Sample $Z$ from $p(Z = k) = \pi_k$, where $k = 1, 2, \ldots, K$, $\pi_k \ge 0$, and $\sum_{k=1}^{K} \pi_k = 1$.

  2. Given $Z = k$, the conditional distribution of $X_i$ for $i = 1, 2, 3, 4$ is

     $$
     p(X_i = m \mid Z = k) = p_i(m \mid k), \tag{15}
     $$

     where $m = 1, 2, \ldots, M_i$, $p_i(m \mid k) \ge 0$, and $\sum_{m=1}^{M_i} p_i(m \mid k) = 1$.

We aim to learn the parameters $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_3, \theta_4, \theta_Z)$, where:

$$
\theta_i = \{p_i(m \mid k) : m = 1, \ldots, M_i,\; k = 1, \ldots, K\} \quad \text{for } i = 1, 2, 3, 4,
$$
$$
\theta_Z = (\pi_1, \ldots, \pi_K).
$$

By learning $\boldsymbol{\theta}$, we can model the joint distribution:

$$
p_{\boldsymbol{\theta}}(X_1, X_2, X_3, X_4) = \sum_{Z=1}^{K} p_{\theta_Z}(Z) \prod_{i=1}^{4} p_{\theta_i}(X_i \mid Z) \tag{16}
$$
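As an illustration of how the joint in (16) is used generatively, the following minimal numpy sketch (our own, with an assumed table layout) draws complete rows $(x_1, x_2, x_3, x_4)$ given mixture weights $\pi$ and the per-variable conditional tables $p_i(\cdot \mid k)$.

```python
import numpy as np

def sample_latent_naive_bayes(pi, cond_tables, n_samples, seed=0):
    """Draw rows (x1, x2, x3, x4) from the latent naive Bayes model of eq. (16).

    pi          : array of shape (K,) with the mixture weights p(Z = k).
    cond_tables : list of four arrays; the i-th has shape (K, M_i) and holds
                  p_i(m | k) in row k.
    """
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n_samples, p=pi)        # sample Z ~ pi
    cols = []
    for table in cond_tables:                            # sample X_i given Z = z
        cols.append(np.array([rng.choice(table.shape[1], p=table[k]) for k in z]))
    return np.stack(cols, axis=1)                        # shape (n_samples, 4)
```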

A.2.1 Model Distributions

For dataset $\mathbf{D_1}$, the joint distribution is:

$$
p_{\boldsymbol{\theta}}(\mathbf{D_1}, \mathbf{z}) = \prod_{n=1}^{N_1} p_1(x_1^n \mid z^n)\, p_2(x_2^n \mid z^n)\, p_3(x_3^n \mid z^n)\, \pi_{z^n} \tag{17}
$$

Marginalising over the latent variables gives the marginal log-likelihood:

$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_1}) = \sum_{n=1}^{N_1} \log\!\left( \sum_{k=1}^{K} p_1(x_1^n \mid k)\, p_2(x_2^n \mid k)\, p_3(x_3^n \mid k)\, \pi_k \right) \tag{18}
$$

The posterior distribution is:

$$
p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{D_1}) = \prod_{n=1}^{N_1} \frac{p_1(x_1^n \mid z^n)\, p_2(x_2^n \mid z^n)\, p_3(x_3^n \mid z^n)\, \pi_{z^n}}{\sum_{k=1}^{K} p_1(x_1^n \mid k)\, p_2(x_2^n \mid k)\, p_3(x_3^n \mid k)\, \pi_k} \tag{19}
$$

Similarly, for dataset $\mathbf{D_2}$:

$$
p_{\boldsymbol{\theta}}(\mathbf{D_2}, \mathbf{z'}) = \prod_{n=1}^{N_2} p_3(x_3'^n \mid z'^n)\, p_4(x_4^n \mid z'^n)\, \pi_{z'^n} \tag{20}
$$
$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_2}) = \sum_{n=1}^{N_2} \log\!\left( \sum_{k=1}^{K} p_3(x_3'^n \mid k)\, p_4(x_4^n \mid k)\, \pi_k \right) \tag{21}
$$
$$
p_{\boldsymbol{\theta}}(\mathbf{z'} \mid \mathbf{D_2}) = \prod_{n=1}^{N_2} \frac{p_3(x_3'^n \mid z'^n)\, p_4(x_4^n \mid z'^n)\, \pi_{z'^n}}{\sum_{k=1}^{K} p_3(x_3'^n \mid k)\, p_4(x_4^n \mid k)\, \pi_k} \tag{22}
$$

A.2.2 Method Outline

For dataset $\mathbf{D_1}$ with latents $\mathbf{z} = \{z^n\}_{n=1}^{N_1}$, a Latent Variable Model (LVM) is defined as $p_{\boldsymbol{\theta}}(\mathbf{D_1}, \mathbf{z})$. Similarly, for $\mathbf{D_2}$ with latents $\mathbf{z'} = \{z'^n\}_{n=1}^{N_2}$, the LVM is $p_{\boldsymbol{\theta}}(\mathbf{D_2}, \mathbf{z'})$. Under the independence assumptions, the distributions factorise:

$$
p_{\boldsymbol{\theta}}(\mathbf{D_1}, \mathbf{D_2}, \mathbf{z}, \mathbf{z'}) = p_{\boldsymbol{\theta}}(\mathbf{D_1}, \mathbf{z})\, p_{\boldsymbol{\theta}}(\mathbf{D_2}, \mathbf{z'}) \tag{23}
$$
$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_1}, \mathbf{D_2}) = \log p_{\boldsymbol{\theta}}(\mathbf{D_1}) + \log p_{\boldsymbol{\theta}}(\mathbf{D_2}) \tag{24}
$$
$$
p_{\boldsymbol{\theta}}(\mathbf{z}, \mathbf{z'} \mid \mathbf{D_1}, \mathbf{D_2}) = p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{D_1})\, p_{\boldsymbol{\theta}}(\mathbf{z'} \mid \mathbf{D_2}) \tag{25}
$$

To estimate $\boldsymbol{\theta}$, we apply the EM algorithm to maximise the marginal log-likelihoods $\log p_{\boldsymbol{\theta}}(\mathbf{D_1})$ and $\log p_{\boldsymbol{\theta}}(\mathbf{D_2})$ under the latent variables. The lower bounds are given by:

$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_1}) \ge L_{D_1}(\theta, q_1), \qquad \log p_{\boldsymbol{\theta}}(\mathbf{D_2}) \ge L_{D_2}(\theta, q_2), \tag{26}
$$

where $q_1(z) = q(z \mid D_1)$ and $q_2(z) = q(z \mid D_2)$ are distributions over $Z$.

The EM algorithm steps are as follows, also detailed in Algorithm 1.

Algorithm 1 EM Algorithm
1: Initialise $t = 0$ and $\boldsymbol{\theta}^{(0)} = \{\theta_1^{(0)}, \theta_2^{(0)}, \theta_3^{(0)}, \theta_4^{(0)}, \theta_Z^{(0)}\}$
2: $t \leftarrow 1$
3: while $\boldsymbol{\theta}$ not converged do
4:     for $n = 1, \ldots, N_1$ and $k = 1, \ldots, K$ do
5:         Set $q_1^{(t)}(z^n = k)$ using (34)
6:     end for
7:     for $n = 1, \ldots, N_2$ and $k = 1, \ldots, K$ do
8:         Set $q_2^{(t)}(z'^n = k)$ using (35)
9:     end for
10:    Update $\boldsymbol{\theta}^{(t)} = \{\theta_1^{(t)}, \theta_2^{(t)}, \theta_3^{(t)}, \theta_4^{(t)}, \theta_Z^{(t)}\}$ using (43), (44), (49), (45), (39)
11:    $t \leftarrow t + 1$
12: end while
  • M-step: Maximise the lower bounds with respect to $\theta_1, \theta_2, \theta_3, \theta_4, \theta_Z$:

    • Maximise $L_{D_1}(\theta, q_1)$ with respect to $\theta_1, \theta_2$

    • Maximise $L_{D_2}(\theta, q_2)$ with respect to $\theta_4$

    • Maximise the sum of the terms containing $\theta_3$ and $\theta_Z$ across $L_{D_1}$ and $L_{D_2}$

  • E-step: Find $q$ to optimise $L_{D_1}(\theta, q_1) + L_{D_2}(\theta, q_2)$:

    • Set $q_1$ to optimise $L_{D_1}$ given fixed $\theta$

    • Set $q_2$ to optimise $L_{D_2}$ given fixed $\theta$
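To illustrate the outline above, here is a minimal numpy sketch of the combined E- and M-steps (an illustration under the stated model assumptions, not the authors' implementation): $\theta_1, \theta_2$ are updated from $\mathbf{D_1}$ only, $\theta_4$ from $\mathbf{D_2}$ only, while $\theta_3$ and $\pi$ pool the responsibilities from both datasets. The function name and data layout are assumptions.

```python
import numpy as np

def em_latent_naive_bayes(D1, D2, sizes, K, n_iter=200, seed=0):
    """EM for the shared-latent naive Bayes model over two overlapping datasets.

    D1    : int array (N1, 3) with columns (x1, x2, x3).
    D2    : int array (N2, 2) with columns (x3, x4).
    sizes : (M1, M2, M3, M4) category counts; K: number of latent states.
    Returns pi of shape (K,) and cond[i] of shape (K, M_i) for i = 0..3.
    """
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    cond = [rng.dirichlet(np.ones(M), size=K) for M in sizes]   # p_i(m | k)

    def table(resp, x, M):
        """Responsibility-weighted conditional table p(m | k), rows indexed by k."""
        counts = np.stack([resp[x == m].sum(axis=0) for m in range(M)], axis=1)
        return counts / counts.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: responsibilities q1(z^n = k) and q2(z'^n = k), eqs. (34)-(35).
        r1 = pi * cond[0][:, D1[:, 0]].T * cond[1][:, D1[:, 1]].T * cond[2][:, D1[:, 2]].T
        r1 /= r1.sum(axis=1, keepdims=True)                     # shape (N1, K)
        r2 = pi * cond[2][:, D2[:, 0]].T * cond[3][:, D2[:, 1]].T
        r2 /= r2.sum(axis=1, keepdims=True)                     # shape (N2, K)

        # M-step: theta_1, theta_2 from D1 only; theta_4 from D2 only;
        # theta_3 and pi pool statistics from both datasets.
        cond[0] = table(r1, D1[:, 0], sizes[0])
        cond[1] = table(r1, D1[:, 1], sizes[1])
        cond[3] = table(r2, D2[:, 1], sizes[3])
        counts3 = (np.stack([r1[D1[:, 2] == m].sum(axis=0) for m in range(sizes[2])], axis=1)
                   + np.stack([r2[D2[:, 0] == m].sum(axis=0) for m in range(sizes[2])], axis=1))
        cond[2] = counts3 / counts3.sum(axis=1, keepdims=True)
        pi = (r1.sum(axis=0) + r2.sum(axis=0)) / (len(D1) + len(D2))
    return pi, cond
```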

A.3 Deriving Algorithm Steps

A.3.1 Lower bound on the Likelihood

We lower-bound the log-likelihood of the observed variables:

$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_1}) + \log p_{\boldsymbol{\theta}}(\mathbf{D_2}) \tag{27}
$$

Using $q_1(z) = q(z \mid \mathbf{D_1})$ and $q_2(z) = q(z \mid \mathbf{D_2})$, the KL divergence for $\mathbf{D_1}$ is:

$$
D_{\mathrm{KL}}\big(q_1(Z) \,\|\, p_{\boldsymbol{\theta}}(Z \mid \mathbf{D_1})\big) = \mathbb{E}_{Z \sim q_1}\!\left[\log \frac{q_1(Z)}{p_{\boldsymbol{\theta}}(Z \mid \mathbf{D_1})}\right] \ge 0 \tag{28}
$$

Thus, we have:

$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_1}) \ge \mathbb{E}_{Z \sim q_1}\!\left[\log \frac{p_{\boldsymbol{\theta}}(\mathbf{D_1}, Z)}{q_1(Z)}\right] = L_{D_1}(\boldsymbol{\theta}, q_1) \tag{29}
$$

where

$$
L_{D_1}(\boldsymbol{\theta}, q_1) = \sum_{n=1}^{N_1} \sum_{k=1}^{K} q_1(z^n = k)\left[\sum_{i=1}^{3} \log p_i(x_i^n \mid k) + \log \pi_k\right] + H(q_1) \tag{30}
$$

and $H(q_1) = -\sum_{n=1}^{N_1}\sum_{k=1}^{K} q_1(z^n = k)\log q_1(z^n = k)$ denotes the entropy of $q_1$.

Similarly, for $\mathbf{D_2}$:

$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_2}) \ge L_{D_2}(\boldsymbol{\theta}, q_2) \tag{31}
$$

with

$$
L_{D_2}(\boldsymbol{\theta}, q_2) = \sum_{n=1}^{N_2} \sum_{k=1}^{K} q_2(z'^n = k)\left[\log p_3(x_3'^n \mid k) + \log p_4(x_4^n \mid k) + \log \pi_k\right] + H(q_2) \tag{32}
$$

Overall, the lower bound is:

$$
\log p_{\boldsymbol{\theta}}(\mathbf{D_1}) + \log p_{\boldsymbol{\theta}}(\mathbf{D_2}) \ge L_{D_1}(\boldsymbol{\theta}, q_1) + L_{D_2}(\boldsymbol{\theta}, q_2) \tag{33}
$$
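In practice, convergence of these EM iterations can be monitored by evaluating the left-hand side of (33), i.e. the marginal log-likelihoods (18) and (21), which should increase monotonically. A minimal numpy sketch, assuming the same table layout as the EM sketch above:

```python
import numpy as np

def marginal_log_likelihood(D1, D2, pi, cond):
    """log p(D1) + log p(D2) for the shared-latent naive Bayes model, eqs. (18) and (21)."""
    lik1 = (pi * cond[0][:, D1[:, 0]].T * cond[1][:, D1[:, 1]].T
            * cond[2][:, D1[:, 2]].T).sum(axis=1)     # per-row likelihood for D1
    lik2 = (pi * cond[2][:, D2[:, 0]].T * cond[3][:, D2[:, 1]].T).sum(axis=1)
    return np.log(lik1).sum() + np.log(lik2).sum()
```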

A.3.2 E step

The E-step 1 updates $q_1^{(t)}(z^n = k)$ by maximising the lower bound $L_{D_1}(\theta, q_1)$ with respect to $q_1(\cdot)$, while keeping $\theta$ fixed:

$$
\begin{aligned}
q_1^{(t)}(z^n = k) &= p_{\boldsymbol{\theta}^{(t-1)}}(z^n = k \mid x_1^n, x_2^n, x_3^n) \\
&= \frac{p_1^{(t-1)}(x_1^n \mid k)\, p_2^{(t-1)}(x_2^n \mid k)\, p_3^{(t-1)}(x_3^n \mid k)\, \pi_k^{(t-1)}}{\sum_{j=1}^{K} p_1^{(t-1)}(x_1^n \mid j)\, p_2^{(t-1)}(x_2^n \mid j)\, p_3^{(t-1)}(x_3^n \mid j)\, \pi_j^{(t-1)}}
\end{aligned}
\tag{34}
$$

E-step 2 updates $q_2^{(t)}({z'}^n=k)$ by maximising $L_{D_2}(\boldsymbol{\theta},q_2)$ with respect to $q_2(\cdot)$, while keeping $\boldsymbol{\theta}$ fixed:

\[
\begin{aligned}
q_2^{(t)}({z'}^n=k) &= p_{\boldsymbol{\theta}^{(t-1)}}({z'}^n=k\mid {x'_3}^n,x_4^n) \\
&= \frac{p_3^{(t-1)}({x'_3}^n\mid k)\,p_4^{(t-1)}(x_4^n\mid k)\,\pi_k^{(t-1)}}{\sum_{j=1}^{K}p_3^{(t-1)}({x'_3}^n\mid j)\,p_4^{(t-1)}(x_4^n\mid j)\,\pi_j^{(t-1)}}
\end{aligned}
\tag{35}
\]
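For concreteness, a minimal NumPy sketch of the two E-step updates (Eqs. 34 and 35) is given below, assuming all categorical features are integer-encoded and the conditionals $p_1,\ldots,p_4$ are stored as arrays with one row per category and one column per mixture component; the function and variable names are illustrative rather than part of the method.

```python
import numpy as np

def e_step(pi, p1, p2, p3, p4, D1, D2):
    """One E-step: responsibilities q1 over D1 and q2 over D2.

    pi     : (K,) mixture weights.
    p1..p4 : (M_i, K) conditional tables, p_i[m, k] = p_i(x_i = m | z = k).
    D1     : (N1, 3) integer-encoded columns (x1, x2, x3).
    D2     : (N2, 2) integer-encoded columns (x3', x4).
    """
    # Unnormalised posteriors for D1: p1(x1|k) p2(x2|k) p3(x3|k) pi_k   (Eq. 34)
    r1 = p1[D1[:, 0]] * p2[D1[:, 1]] * p3[D1[:, 2]] * pi        # (N1, K)
    q1 = r1 / r1.sum(axis=1, keepdims=True)

    # Unnormalised posteriors for D2: p3(x3'|k) p4(x4|k) pi_k           (Eq. 35)
    r2 = p3[D2[:, 0]] * p4[D2[:, 1]] * pi                       # (N2, K)
    q2 = r2 / r2.sum(axis=1, keepdims=True)
    return q1, q2
```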

A.3.3 M step: Optimal $\theta_Z$

For the M-step, we maximise $L_{D_1}(\boldsymbol{\theta},q_1)+L_{D_2}(\boldsymbol{\theta},q_2)$ with respect to $\boldsymbol{\theta}$, while keeping $q(\cdot)$ fixed.

To account for the constraint $\sum_{k=1}^{K}\pi_k=1$, we use a Lagrange multiplier $\lambda$. For any $c\in\{1,\ldots,K\}$, we have:

\[
\nabla_{\pi_c}\left(L_{D_1}(\boldsymbol{\theta},q_1)+L_{D_2}(\boldsymbol{\theta},q_2)-\lambda\left(\sum_{k=1}^{K}\pi_k-1\right)\right)=0
\tag{36}
\]
\[
\implies \frac{\sum_{n=1}^{N_1}q_1(z^n=c)+\sum_{n=1}^{N_2}q_2({z'}^n=c)}{\pi_c}-\lambda=0
\tag{37}
\]
\[
\implies \pi_c \propto \sum_{n=1}^{N_1}q_1(z^n=c)+\sum_{n=1}^{N_2}q_2({z'}^n=c)
\tag{38}
\]

Since $\sum_{k=1}^{K}\left(\sum_{n=1}^{N_1}q_1(z^n=k)+\sum_{n=1}^{N_2}q_2({z'}^n=k)\right)=N_1+N_2$, we obtain:

\[
\pi_c^{(t)}=\frac{\sum_{n=1}^{N_1}q_1^{(t)}(z^n=c)+\sum_{n=1}^{N_2}q_2^{(t)}({z'}^n=c)}{N_1+N_2}
\tag{39}
\]

A.3.4 M step: Optimal $\theta_1$, $\theta_2$ and $\theta_4$

In the M-step, we use Lagrange multipliers $\lambda(c)$ to maximise $L_{D_1}(\boldsymbol{\theta},q_1)$ with respect to $p_1(m\mid c)$. For $c\in\{1,\ldots,K\}$ and $m\in\{1,\ldots,M_1\}$, we have:

\[
\nabla_{p_1(m\mid c)}\left(L_{D_1}(\boldsymbol{\theta},q_1)-\sum_{k=1}^{K}\lambda(k)\left(\sum_{j=1}^{M_1}p_1(j\mid k)-1\right)\right)=0
\tag{40}
\]
\[
\implies \sum_{n=1}^{N_1}\frac{\mathbbm{1}(x_1^n=m)\,q_1(z^n=c)}{p_1(m\mid c)}-\lambda(c)=0
\tag{41}
\]
\[
\implies p_1(m\mid c)\propto\sum_{n=1}^{N_1}\mathbbm{1}(x_1^n=m)\,q_1(z^n=c)
\tag{42}
\]

Normalising gives:

\[
p_1^{(t)}(m\mid c)=\frac{\sum_{n=1}^{N_1}\mathbbm{1}(x_1^n=m)\,q_1^{(t)}(z^n=c)}{\sum_{j=1}^{M_1}\sum_{n=1}^{N_1}\mathbbm{1}(x_1^n=j)\,q_1^{(t)}(z^n=c)}
\tag{43}
\]

For $p_2(m\mid c)$:

\[
p_2^{(t)}(m\mid c)=\frac{\sum_{n=1}^{N_1}\mathbbm{1}(x_2^n=m)\,q_1^{(t)}(z^n=c)}{\sum_{j=1}^{M_2}\sum_{n=1}^{N_1}\mathbbm{1}(x_2^n=j)\,q_1^{(t)}(z^n=c)}
\tag{44}
\]

Similarly, for $p_4(m\mid c)$, we maximise $L_{D_2}(\boldsymbol{\theta},q_2)$:

\[
p_4^{(t)}(m\mid c)=\frac{\sum_{n=1}^{N_2}\mathbbm{1}(x_4^n=m)\,q_2^{(t)}({z'}^n=c)}{\sum_{j=1}^{M_4}\sum_{n=1}^{N_2}\mathbbm{1}(x_4^n=j)\,q_2^{(t)}({z'}^n=c)}
\tag{45}
\]

A.3.5 M step: Optimal $\theta_3$

In the M-step, we use Lagrange multipliers $\lambda(c)$ to maximise $L_{D_1}(\boldsymbol{\theta},q_1)+L_{D_2}(\boldsymbol{\theta},q_2)$ with respect to $p_3(m\mid c)$. For $c\in\{1,\ldots,K\}$ and $m\in\{1,\ldots,M_3\}$, we have:

\[
\nabla_{p_3(m\mid c)}\left(L_{D_1}(\boldsymbol{\theta},q_1)+L_{D_2}(\boldsymbol{\theta},q_2)-\sum_{k=1}^{K}\lambda(k)\left(\sum_{j=1}^{M_3}p_3(j\mid k)-1\right)\right)=0
\tag{46}
\]
\[
\implies \frac{\sum_{n=1}^{N_1}\mathbbm{1}(x_3^n=m)\,q_1(z^n=c)+\sum_{n=1}^{N_2}\mathbbm{1}({x'_3}^n=m)\,q_2({z'}^n=c)}{p_3(m\mid c)}-\lambda(c)=0
\tag{47}
\]
\[
\implies p_3(m\mid c)\propto\sum_{n=1}^{N_1}\mathbbm{1}(x_3^n=m)\,q_1(z^n=c)+\sum_{n=1}^{N_2}\mathbbm{1}({x'_3}^n=m)\,q_2({z'}^n=c)
\tag{48}
\]

We therefore obtain the M-step update:

\[
p_3^{(t)}(m\mid c)=\frac{\sum_{n=1}^{N_1}\mathbbm{1}(x_3^n=m)\,q_1^{(t)}(z^n=c)+\sum_{n=1}^{N_2}\mathbbm{1}({x'_3}^n=m)\,q_2^{(t)}({z'}^n=c)}{\sum_{j=1}^{M_3}\left(\sum_{n=1}^{N_1}\mathbbm{1}(x_3^n=j)\,q_1^{(t)}(z^n=c)+\sum_{n=1}^{N_2}\mathbbm{1}({x'_3}^n=j)\,q_2^{(t)}({z'}^n=c)\right)}
\tag{49}
\]
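A matching NumPy sketch of the M-step updates (Eqs. 39, 43–45, and 49) is given below, under the same encoding assumptions as the E-step sketch above; adding a small smoothing constant before normalisation, which is not part of the derivation, may be useful in practice to avoid zero probabilities.

```python
import numpy as np

def m_step(q1, q2, D1, D2, M1, M2, M3, M4):
    """One M-step given responsibilities q1 (N1, K) and q2 (N2, K)."""
    N1, K = q1.shape
    N2 = q2.shape[0]

    # Mixture weights (Eq. 39)
    pi = (q1.sum(axis=0) + q2.sum(axis=0)) / (N1 + N2)

    def cond_table(x, q, M):
        # counts[m, k] = sum_n 1(x^n = m) q(z^n = k), then normalise over m
        counts = np.zeros((M, K))
        np.add.at(counts, x, q)
        return counts / counts.sum(axis=0, keepdims=True)

    p1 = cond_table(D1[:, 0], q1, M1)        # Eq. 43 (D1 only)
    p2 = cond_table(D1[:, 1], q1, M2)        # Eq. 44 (D1 only)
    p4 = cond_table(D2[:, 1], q2, M4)        # Eq. 45 (D2 only)

    # p3 pools evidence from the overlapping variable in both datasets (Eq. 49)
    counts3 = np.zeros((M3, K))
    np.add.at(counts3, D1[:, 2], q1)
    np.add.at(counts3, D2[:, 0], q2)
    p3 = counts3 / counts3.sum(axis=0, keepdims=True)

    return pi, p1, p2, p3, p4
```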

Appendix B Dataset Details

For the Adult Data, the ‘fnlwgt’ attribute is dropped as it is not relevant to the task, and the ‘education-num’ attribute is dropped as it duplicates the information in the ‘education’ attribute. The COMPAS Data is filtered to include only records where the ‘race’ column is either ‘African-American’ or ‘Caucasian’, recoded as $\{black, white\}$. We further combine the three columns containing juvenile crime counts into the total number of juvenile crimes. Details of the attributes and their values can be found in Tables 4, 5, and 6.
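A hedged pandas sketch of this preprocessing is given below; the raw column names (‘fnlwgt’ and ‘education-num’ for Adult, ‘race’ and the three ‘juv_*_count’ columns for COMPAS) follow the commonly distributed versions of these datasets and should be verified against the files actually used.

```python
import pandas as pd

def preprocess_adult(adult: pd.DataFrame) -> pd.DataFrame:
    # Drop the sampling weight and the duplicate of 'education'.
    return adult.drop(columns=["fnlwgt", "education-num"])

def preprocess_compas(compas: pd.DataFrame) -> pd.DataFrame:
    # Keep only African-American and Caucasian defendants and recode race.
    compas = compas[compas["race"].isin(["African-American", "Caucasian"])].copy()
    compas["race"] = compas["race"].map(
        {"African-American": "black", "Caucasian": "white"})
    # Collapse the three juvenile crime counts into a single total.
    juv_cols = ["juv_fel_count", "juv_misd_count", "juv_other_count"]
    compas["juvenile_crime"] = compas[juv_cols].sum(axis=1)
    return compas.drop(columns=juv_cols)
```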

Table 4: Adult Data: Attributes and Their Values
Attribute Values
Age {25–60, <25, >60}
Capital Gain {<=5000, >5000}
Capital Loss {<=40, >40}
Education {assoc-acdm, assoc-voc, bachelors, doctorate, HS-grad, masters, prof-school, some-college, high-school, primary/middle school}
Hours Per Week {<40, 40–60, >60}
Income {<=50K, >50K}
Marital Status {married, other}
Native Country {US, non-US}
Occupation {adm-clerical, armed-forces, craft-repair, exec-managerial, farming-fishing, handlers-cleaners, machine-op-inspct, other-service, priv-house-serv, prof-specialty, protective-serv, sales, tech-support, transport-moving}
Race {non-white, white}
Relationship {non-spouse, spouse}
Sex {male, female}
Workclass {private, non-private}
Table 5: COMPAS Data: Attributes and Their Values
Attribute Values
Age Category {25 - 45, >45, <25}
Charge Degree {F, M}
Juvenile Crime {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14}
Priors Count {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 36, 37, 38}
Race {Black, White}
Score Text {High, Low, Medium}
Sex {Female, Male}
Two-Year Recidivism {0, 1}
Violent Score Text {High, Low, Medium}
Table 6: German Credit Data: Attributes and Their Values
Attribute Values
Age {<= 25, >25}
Checking Account {0 <= <200 DM, <0 DM, >= 200 DM, no account}
Class Label {0, 1}
Credit Amount {<=2000, 2001-5000, >5000}
Credit History {all credits at this bank paid back duly, critical account, delay in paying off, existing credits paid back duly till now, no credits taken}
Duration {<=6, 7-12, >12}
Employment Since {1 <= < 4 years, 4 <= <7 years, <1 years, >=7 years, unemployed}
Existing Credits {1, 2, 3, 4}
Foreign Worker {no, yes}
Housing {for free, own, rent}
Installment Rate {1, 2, 3, 4}
Job {management/ highly qualified employee, skilled employee / official, unemployed/ unskilled - non-resident, unskilled - resident}
Marital Status {divorced/separated, married/widowed}
Number of People Provide Maintenance For {1, 2}
Other Debtors {co-applicant, guarantor, none}
Other Installment Plans {bank, none, store}
Property {car or other, real estate, savings agreement/life insurance, unknown / no property}
Purpose {business, car (new), car (used), domestic appliances, education, furniture/equipment, others, radio/television, repairs, retraining}
Residence Since {1, 2, 3, 4}
Savings Account {100 <= <500 DM, 500 <= < 1000 DM, <100 DM, >= 1000 DM, no savings account}
Sex {female, male}
Telephone {none, yes}

Appendix C Evaluating Quality of Synthetic Data

C.1 Overall Fidelity Metrics: Synthetic vs. Real Data

Full results for overall fidelity metrics, including Total Variation Distance Complement (1-TVD), Contingency Similarity (CS), and Discriminator Measure (DM) across various synthetic datasets, are presented in Table 7. This table provides a comprehensive comparison of the fidelity of different synthetic data generation methods to real-world data.
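To illustrate how the column-wise fidelity comparison can be computed, a sketch of the total variation distance complement for categorical columns is shown below; the exact definitions of CS and DM, and any per-column weighting, are assumptions rather than taken from the implementation.

```python
import pandas as pd

def tvd_complement(real: pd.Series, synth: pd.Series) -> float:
    """1 - TVD between the empirical category frequencies of one column."""
    p_real = real.value_counts(normalize=True)
    p_syn = synth.value_counts(normalize=True)
    cats = p_real.index.union(p_syn.index)
    tvd = 0.5 * (p_real.reindex(cats, fill_value=0.0)
                 - p_syn.reindex(cats, fill_value=0.0)).abs().sum()
    return 1.0 - tvd

def mean_tvd_complement(real_df: pd.DataFrame, synth_df: pd.DataFrame) -> float:
    # Average the per-column scores over all shared columns.
    return sum(tvd_complement(real_df[c], synth_df[c])
               for c in real_df.columns) / len(real_df.columns)
```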

Figure 5 shows the difference in Cramér’s V correlation (DCC) between synthetic and real test data for COMPAS. Similar patterns are observed across other synthetic datasets.
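A sketch of how the DCC entries could be computed is given below, assuming Cramér's V is derived from a chi-squared contingency test for each attribute pair; the bias-corrected variant of Cramér's V is not applied here.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association between two categorical columns."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

def dcc(real: pd.DataFrame, synth: pd.DataFrame, col_a: str, col_b: str) -> float:
    """Difference in Cramér's V correlation for one attribute pair."""
    return cramers_v(synth[col_a], synth[col_b]) - cramers_v(real[col_a], real[col_b])
```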

Table 7: Fidelity metrics for synthetic datasets generated from separate data (overlapping variable in brackets next to joint distribution estimation method). Metrics include total variation distance complement (1-TVD), contingency similarity (CS), and discriminator measure (DM). Baseline methods include CTGAN and Indep.
Dataset Method (Overlapping) 1-TVD \uparrow CS \uparrow DM \downarrow
Adult
Indep-Overlap (Relationship) 0.993 0.983 0.588
Marginal (Relationship) 0.993 0.983 0.588
Latent (Relationship) 0.986 0.968 0.658
Indep-Overlap (Marital Status) 0.994 0.983 0.587
Marginal (Marital Status) 0.993 0.982 0.594
Latent (Marital Status) 0.987 0.970 0.655
CTGAN 0.935 0.938 0.656
Indep 0.935 0.895 0.808
COMPAS
Indep-Overlap (Score) 0.978 0.952 0.596
Marginal (Score) 0.979 0.953 0.598
Latent (Score) 0.978 0.951 0.592
Indep-Overlap (Violent Score) 0.978 0.955 0.577
Marginal (Violent Score) 0.978 0.955 0.573
Latent (Violent Score) 0.976 0.950 0.598
CTGAN 0.910 0.839 0.699
Indep 0.979 0.913 0.689
German
Indep-Overlap (Property) 0.965 0.926 0.613
Marginal (Property) 0.966 0.926 0.628
Latent (Property) 0.965 0.924 0.586
Indep-Overlap (Housing) 0.966 0.926 0.618
Marginal (Housing) 0.966 0.927 0.621
Latent (Housing) 0.966 0.925 0.575
CTGAN 0.946 0.894 0.697
Indep 0.965 0.920 0.696
Figure 5: Difference in Cramér’s V Correlation (DCC) for pairs of attributes in synthetic test data and in real test data. Values close to zero (dark blue colour) indicate synthetic data is more similar to real data. Results shown for COMPAS Data, with synthetic data generated from separate data with overlapping variable ‘Score’. Subplots correspond to different joint estimation methods: (a) Independence given Overlap, (b) Marginal Preservation, (c) Latent Naïve Bayes, (d) CTGAN Baseline, (e) Independent Baseline.

C.2 Joint Distribution of Protected Attributes and Outcomes: Synthetic vs. Real Data

KL divergence values for the joint distribution of protected attributes and outcome labels in synthetic versus real data, evaluated across methods and data separations, are detailed in Table 8.
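A sketch of the KL computation for a single protected attribute is given below, assuming the joint distribution $p(A,Y)$ is estimated by relative frequencies and that the divergence is taken from the synthetic to the real distribution; the direction and any smoothing for empty cells are assumptions.

```python
import numpy as np
import pandas as pd

def kl_joint(synth: pd.DataFrame, real: pd.DataFrame, a: str, y: str) -> float:
    """KL( p_synth(A, Y) || p_real(A, Y) ), estimated by relative frequencies."""
    p_syn = synth.groupby([a, y]).size() / len(synth)
    p_real = real.groupby([a, y]).size() / len(real)
    cells = p_syn.index.union(p_real.index)
    p = p_syn.reindex(cells, fill_value=0.0).to_numpy()
    q = p_real.reindex(cells, fill_value=0.0).to_numpy()
    mask = p > 0  # cells absent in the synthetic data contribute zero
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```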

Table 8: KL divergence of p(A,Y)𝑝𝐴𝑌p(A,Y)italic_p ( italic_A , italic_Y ) in synthetic vs real data (where Y is the outcome label and A is a protected attribute such as race, sex, and age). Synthetic datasets generated from separate data (with the overlapping variable indicated in brackets next to the joint distribution estimation method). Baseline methods include CTGAN and Indep.
Dataset Method (Overlapping) KL for Race \downarrow KL for Sex \downarrow KL for Age \downarrow
Adult
Indep-Overlap (Relationship) 0.002 0.001
Marginal (Relationship) 0.002 0.001
Latent (Relationship) 0.002 0.002
Indep-Overlap (Marital Status) 0.002 0.001
Marginal (Marital Status) 0.002 0.002
Latent (Marital Status) 0.002 0.001
CTGAN 0.132 0.048
Indep 0.005 0.026
COMPAS
Indep-Overlap (Score) 0.006 0.044
Marginal (Score) 0.005 0.039
Latent (Score) 0.005 0.034
Indep-Overlap (Violent Score) 0.015 0.038
Marginal (Violent Score) 0.015 0.038
Latent (Violent Score) 0.005 0.026
CTGAN 0.498 0.506
Indep 0.058 0.062
German
Indep-Overlap (Property) 0.015 0.052
Marginal (Property) 0.013 0.055
Latent (Property) 0.003 0.023
Indep-Overlap (Housing) 0.003 0.034
Marginal (Housing) 0.005 0.035
Latent (Housing) 0.002 0.022
CTGAN 0.282 0.215
Indep 0.007 0.038

C.3 Detailed Fairness Metrics Comparison: Synthetic vs. Real Data

Table 9 provides a detailed comparison of absolute differences in fairness metrics for a Decision Tree classifier, as evaluated on various synthetic datasets compared to real test data. The metrics include Average Odds Difference (AOD), Disparate Impact (DI), and Equal Opportunity Difference (EOD). The analysis is based on 1000 bootstrapped samples. The table summarises these metrics across different synthetic datasets and baselines.
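A sketch of how these bootstrapped fairness gaps could be reproduced is given below, assuming the standard definitions of AOD, DI, and EOD over a binary protected attribute (group-wise TPR/FPR gaps and the ratio of positive prediction rates); the classifier interface, the ‘label’ column name, and the privileged/unprivileged coding are placeholders.

```python
import numpy as np
import pandas as pd

def fairness_metrics(y_true, y_pred, group):
    """AOD, DI, EOD for a binary protected attribute (1 = privileged)."""
    def rates(mask):
        yt, yp = y_true[mask], y_pred[mask]
        return yp[yt == 1].mean(), yp[yt == 0].mean(), yp.mean()  # TPR, FPR, P(pred=1)

    tpr_u, fpr_u, pos_u = rates(group == 0)
    tpr_p, fpr_p, pos_p = rates(group == 1)
    aod = 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p))
    di = pos_u / pos_p
    eod = tpr_u - tpr_p
    return aod, di, eod

def bootstrap_mean(df, model, protected, n_boot=1000, seed=0):
    """Mean AOD, DI, EOD over bootstrap resamples of the test data."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_boot):
        b = df.sample(n=len(df), replace=True,
                      random_state=int(rng.integers(0, 2**32 - 1)))
        y_pred = model.predict(b.drop(columns=["label", protected]))
        out.append(fairness_metrics(b["label"].to_numpy(), np.asarray(y_pred),
                                    b[protected].to_numpy()))
    return np.mean(out, axis=0)
```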

Table 9: Absolute differences between bootstrap means of fairness metrics from synthetic and real data. Metrics calculated for Decision Tree Classifier, using 1000 bootstrapped samples. Metrics include AOD (Average Odds Difference), DI (Disparate Impact), and EOD (Equal Opportunity Difference). Synthetic datasets generated from separate data (with the overlapping variable indicated in brackets next to the joint distribution estimation method). Baseline methods include CTGAN and Indep.
Dataset Method (Overlapping) Race Sex
AOD \downarrow DI \downarrow EOD \downarrow AOD \downarrow DI \downarrow EOD \downarrow
Adult
Indep-Overlap (Marital) 0.008 0.137 0.015 0.015 0.030 0.038
Indep-Overlap (Relationship) 0.013 0.112 0.022 0.037 0.006 0.070
Latent (Marital) 0.013 0.063 0.016 0.069 0.162 0.134
Latent (Relationship) 0.003 0.079 0.010 0.068 0.154 0.148
Marginal (Marital) 0.003 0.144 0.005 0.010 0.047 0.030
Marginal (Relationship) 0.009 0.112 0.014 0.036 0.002 0.071
CTGAN 0.027 0.003 0.053 0.013 0.048 0.055
Indep 0.021 0.390 0.023 0.070 0.669 0.071
COMPAS
Indep-Overlap (Score) 0.002 0.003 0.034 0.040 0.079 0.014
Indep-Overlap (Violent Score) 0.031 0.046 0.057 0.013 0.032 0.027
Latent (Score) 0.001 0.001 0.035 0.013 0.032 0.030
Latent (Violent Score) 0.005 0.010 0.016 0.009 0.016 0.054
Marginal (Score) 0.000 0.000 0.037 0.029 0.061 0.010
Marginal (Violent Score) 0.039 0.057 0.063 0.015 0.037 0.034
CTGAN 0.065 0.212 0.097 0.083 0.351 0.135
Indep 0.134 0.211 0.146 0.072 0.138 0.012
Age Sex
AOD \downarrow DI \downarrow EOD \downarrow AOD \downarrow DI \downarrow EOD \downarrow
German
Indep-Overlap (Housing) 0.106 0.178 0.084 0.048 0.068 0.071
Indep-Overlap (Property) 0.144 0.216 0.085 0.010 0.020 0.048
Latent (Housing) 0.141 0.190 0.049 0.025 0.049 0.069
Latent (Property) 0.119 0.170 0.048 0.014 0.034 0.062
Marginal (Housing) 0.111 0.178 0.076 0.044 0.058 0.055
Marginal (Property) 0.150 0.218 0.078 0.005 0.015 0.045
CTGAN 0.140 0.199 0.066 0.009 0.019 0.048
Indep 0.139 0.195 0.062 0.007 0.024 0.056