Abstract
Inference of directed biological networks is an important but notoriously challenging problem. The recent proliferation of large-scale CRISPR perturbation data provides a new opportunity to tackle this problem by leveraging the transcriptional response to the presence of a gene-targeting guide. Here, we introduce inverse sparse regression (inspre), an approach to learning causal networks that leverages large-scale intervention-response data. Applied to 788 genes from the genome-wide perturb-seq dataset, inspre discovers a network with small-world and scale-free properties. We integrate our network estimate with external data, finding relationships between gene eigencentrality and both measures of gene essentiality and gene expression heritability. Our analysis helps to elucidate the structure of networks that may underlie complex traits.
Introduction
Recent developments in the understanding of complex-trait genetics have led to a call for increased study of directed biological networks because they are crucial for dissecting the regulatory architecture of complex traits and finding pathways that can be targeted for treatment1,2,3. However, interrogating causal structure (i.e., causal discovery) is notoriously difficult owing to factors such as unmeasured confounding, reverse causation, and the presence of cycles4. Even assuming that all relevant variables are measured and that the underlying graph is acyclic, the exact network is still not identifiable using observational data alone5, and identifying the set of observationally equivalent graphs is computationally intractable6.
Interventional data improve the identifiability of causal models7 and can eliminate biases due to unobserved confounding8. Therefore, advances in the scope and scale of gene-targeting CRISPR experiments9 have created an ideal setting for the development of methods for large-scale causal graph inference. Recent advances include dotears10, which extends the convex optimization-based approach notears11 to interventional data, and igsp12, which learns an equivalence class of graphs using a permutation-based approach. However, existing methods can assume a strong intervention model10, return an unweighted graph12 or be intractable for large graphs10,11. They also generally assume that the graph is acyclic and unconfounded10,11,12.
Here, we introduce an approach to causal discovery from interventional data using a two-stage procedure (Fig. 1a). We treat guide RNA as instrumental variables and thus leverage standard procedures for estimating the marginal average causal effect (ACE) of every feature on every other, the matrix \(\hat{R}\). We show that given this, the causal graph G can be obtained via the relationship G = I − R−1D[1/R−1], where/indicates element-wise division, and the operator D[A] sets off-diagonal entries of the matrix to 013. Importantly, while G can be determined exactly given the true matrix R, we only have access to a noisy estimate \(\hat{R}\), which may not be well-conditioned or even invertible. Our primary contribution is a procedure for estimating a sparse approximate inverse of the ACE matrix via solving the constrained optimization problem
This approximate inverse is then used to estimate G via \(\hat{G}=I-VD[1/V]\). We call this approach inverse sparse regression (inspre). Here, U approximates \(\hat{R}\) while its left inverse V has sparsity controlled via the L1 optimization parameter λ. The weight matrix W allows us to place less emphasis on entries of \(\hat{R}\) with high standard error. For complete details, see “Methods”.
a Interventions (green squares) are used to estimate marginal average causal effects between all pairs of features (orange rectangles and blue diamonds). These pairwise estimates are used to infer a sparse network on the features. b Structural hamming distance (SHD) versus time for inspre compared to commonly-used methods in the setting of high density graphs with confounding, cycles, and large edge weights (lower is better). inspre_DAG represents inspre with an approximate acyclicity constriant, see “Methods” for details. c Averaged over density, intervention strength, graph type, edge effect size inspre is the most performant approach by several metrics, even in acyclic graphs without confounding. MAE: Mean absolute error. S: Seconds. Source data are provided in Supplementary Data 1.
Working with the bi-directional ACE matrix R rather than the full data matrix provides several advantages. First, interventional data can be used to estimate effects that are robust to unobserved confounding. Second, leveraging bi-directed ACE estimates that include both the effect of feature i on j and j on i allows us to accommodate graphs with cycles. Finally, the feature-by-feature ACE matrix is typically much smaller than the original samples by features data matrix, providing a dramatic speedup that enables inference in settings with hundreds or even thousands of features.
Results
Simulation studies
We evaluated the performance of inspre under a host of different simulation settings while comparing against other commonly-used methods for causal discovery from observational (LinGAM14, notears11, golem15) and interventional (GIES16, igsp12, dotears10) data. We tested 50-node cyclic and acyclic graphs with 100 interventional samples per node and 5000 total control samples. We simulated graphs with and without confounding while varying the graph type (Erdős-Réyni random17 vs scale-free18), density (high vs low), edge weights (large vs small), and intervention strength (strong vs weak). We conducted each of these 64 experiments 10 times and compared methods by structural Hamming distance (SHD), precision, recall, F1-score, mean absolute error and runtime (Fig. 1b, c and Supplementary Data 1).
inspre out-performs other tested methods in cyclic graphs with confounding by a large margin, even when interventions are weak (Fig. 1b). Perhaps more interestingly, inspre still obtains the highest precision, lowest SHD, and lowest MAE in acyclic graphs without confounding when averaged over graph type, density, edge weight and intervention strength (Fig. 1c). On top of this, inspre takes just seconds to run, while comparable optimization-based approaches can take up to 10 h. Of course, the performance of inspre is dependent on edge weight and intervention strength—when network effects are small and interventions are weak, inspre performs comparatively poorly. However, even in this setting, the weighting scheme biases our approach to high precision, low recall, and the resulting SHD is still comparable to other methods even in the absence of confounding (Supplementary Data 1).
K562 Perturb-seq analysis
We applied inspre to the K562 genome-wide Perturb-seq9 experiment targeting essential genes, choosing the essential-scale screen over the genome-wide screen for our initial analysis due to the larger average number of guides per gene and resulting lower noise floor. We selected 788 genes on the basis of guide effectiveness and number of cells receiving a guide targeting that gene (Supplementary Data 2). Specifically, we included only genes whose targeting guide RNA reduced the expression of it’s target by at least 0.75 standard deviations of the untargeted expression levels, and where there were at least 50 cells receiving that gene-targeting guide. After estimating the ACE of each gene on every other, we found 131,943 significant effects at FDR 5%. We then used inspre to construct a graph on these 788 nodes, which contained 10,423 edges (1.68% non-zero, Fig. 2a–c). We calculated the eigencentrality19 (Fig. 2d), in-degree (Fig. 2e), and out-degree (Fig. 2f) of every node in the network. We observed an exponential decay in both in-degree and out-degree distributions, indicating the grpah has scale-free-like properties. However, there is an interesting asymmetry in the degree distributions, with the out-degree distribution showing a strong mode at 0 but a much longer tail. That is, in our graph estimate most genes do not regulate other genes, but those that do often regulate many. These genes include DYNLL1 (dynein light chain 1, out-deg 422), HSPA9 (heat shock 70 kDa protein 9, out-deg 374), PHB (prohibitin, out-deg 355), MED10 (mediator complex subunit 10, out-deg 306), and NACA (Nascent-polypeptide-associated complex alpha polypeptide, out-deg 284). These are highly conserved genes that play important roles in key cellular processes, particularly transcriptional regulation20,21,22,23. These genes also have high eigencentrality in our graph, but the most central genes also include several ribosomal proteins: RPS3, RPS11, and RPS16. We note that while there are only a few genes with very high eigencentrality, there are many other genes with non-trivial centrality. Specifically, there are 125 genes with eigencentrality above 0.2 (Supplemetary Data 2).
a For each gene pair, we calculate pairwise marginal average causal effects. b Using inspre, we infer a shrunk-approximation to this matrix that reduces noise, and c is supported by an estimate of the underlying causal graph. We plot the distribution of eigencentrality (d), in-degree (e), and out-degree (f), revealing scale-free and small-world properties. Source data for (a–c) have been deposited to Zenodo, see Data Availability. Source data for (d–f) are available in Supplementary Data 2.
Next, we used our graph estimate to calculate shortest paths and the total effect induced by this path for all pairs of vertices. 47.5% of gene pairs are connected by at least one path, with a median path length of 2.67 (standard deviation sd = 0.78) for all pairs and 2.46 (sd = 0.77) for FDR significant pairs (Fig. 3a). Next, we calculated the percentage of the total effect explained by the shortest path for all FDR 5% significant gene pairs. If this number is low, there are many paths between a given pair of nodes with the shortest making up only a fraction of the total effect; if this number is close to 1, the shortest path explains most of the total effect; if this number is above 1, other paths in the network work to cancel out some of the effect of the shortest path. We find that the average effect explained by the shortest path is low (median = 11.14%), and that there are many pairs where the effect explained exceeds 100% (5448 pairs, Fig. 3b). This indicates there tends to be many important network paths when considering the effect of one gene on another.
a The shortest path between pairs of nodes that have an FDR 5% significant total effect. b The percentage of the total effect explained by the shortest path indicates that many paths are responsible for the total effect of one gene on another. c Eigencentrality is correlated with multiple measures of gene essentiality, including pLI in gnomAD. d For genes with high genome-wide heritability in blood, there is a correlation between heritability and variance explained in that gene in our K562 gene network. Lines represent 25%, 50%, and 75% quantiles. Source data for (a, b) have been deposited to Zenodo, see Data Availability. Source data for (c) are available in Supplementary Data 2. Source data for (d) are provided as a Source data file.
We also asked whether centrality was related to other measures of gene importance from sources such as gnomAD24, ExAC25. We fit a beta regression model of eigencentrality on 16 such annotations (Supplementary Data 3) while controlling the family-wise error rate using Holm’s method26. We found associations between eigencentrality and numerous measures of loss-of-function intolerance (Supplementary Data 4), indicating that loss-of-function intolerant genes tend to be more central. Specifically eigencentrality was associated with loss of function intolerance (gnomad_pLI, padj = 2.9 × 10−8, Fig. 3c and Supplementary Fig. 1b), selection coefficient on heterozygous loss-of-function mutations (sHet, padj = 4.9 × 10−8, Supplementary Fig. 1c), happloinsufficiency score (HI_index, padj = 4.1 × 10−7, Supplementary Fig. 1d), probability of being haploinsufficient (pHaplo, padj = 5.2 × 10−6, Supplementary Fig. 1e), missense constraint (gnomad_MisOEUF, padj = 4.5 × 10−4, Supplementary Fig. 1f), and loss of function constraint (gnomad_LOUEF, padj = 4.3 × 10−3, Supplementary Fig. 1g). Eigencentrality was also strongly associated with number of protein-protein interactions (n_ppis, padj = 1.3 × 10−12, Supplementary Fig. 1a).
Finally, we wondered whether there might be a relationship between our graph estimates and human gene expression data from whole blood. Specifically, we asked whether genes with high genome-wide (cis plus trans) expression heritability would have more of their variance explained by the network (see Methods). Using data from the INTERVAL study27 we calculated total heritability (h2) for all genes using GCTA28. We then calculated the correlation of the variance explained by the network with the genome-wide heritability. While the overall correlation is low, restricting our analysis to genes with an estimated h2 above a certain threshold revealed an increasing pattern of correlation as the threshold was raised (Fig. 3d). Specifically, we observed a weak but statistically significant correlation when included genes estimated heritability was at least 0.2 (ρ = 0.36, bootstrap p = 0.04), and a stronger correlation when and at least 0.25 (ρ = 0.54, bootstrap p = 0.002). Beyond this, the small number of genes resulted in bootstrap estimates with large variation.
Explained variance analysis and validation using the genome-wide screen
We sought to empirically analyze the goodness of fit of our estimated graph. We considered two approaches: randomizing the incoming edges of each node while controlling for in-degree, and randomizing the entire graph (see Methods). We used randomized incoming edges to compare the variance in gene expression explained by our inferred edges compared to the same number of randomly selected edges and calculate the proportion of genes with significant variance explained at a given false discovery rate (Fig. 4a). We found that 605 (of 788, 76.8%) genes had significant variance explained compared to random edges at 5% FDR, with a Q-Q plot of the permutation p-values showing enrichment over the null at every quantile (Fig. 4b). Next, we followed Ružičková et al.29, generating random graphs by shuffling the node labels on our estimated graph. This yields random graphs with the same degree distribution and edge weights while breaking the association between individual genes. We then calculated the variance explained in the total observed dataset by our graph compared to random graphs from the same distribution. We found that our estimated graph explained 5.84% of the variance in the data, while random graphs explained an average of 1.35% (permutation p < 1 × 10−16). Interestingly, we observed a strong correlation between the variance explained in a gene by random graphs and that gene’s eigencentrality in our estimated graph (ρ = 0.71, p < 1 × 10−16 Fig. 4c). Indeed, the more connected a gene is within the network, the fewer genes actually have 0 effect on it. Thus, even random sets of genes will explain a non-trivial amount of variance in highly central genes.
a Number of significant genes vs False Discovery Rate (FDR) for the test that our estimated edges expalain more variance than random edges. b Quantile-quantile (Q-Q) plot of the \(-{\log }_{10}(p)\)-value distribution indicates widespread signal. The measure of center is the line that observed equals expected. c There is a correlation between eigencentrality in our estimated network and the variance explained by random gene sets. The measure of center is the resultant linear regression fit of eigencentrality on variance explained. Using out of sample data to evaluate our graph still results in large variance explained in terms of both the FDR (d) and quantile distributions (e). f The eigencentralities of genes in graphs estimated on independent datasets are correlated. Error bars represent 95% confidence intervals. GW: genome-wide. Ess: Essential. Source data for (a–e) are provided as a Source data file. Source data for (f) are available in Supplementary Data 2 and Supplementary Data 5.
We performed a cross-validation analysis to assess the consistency of graph estimates across data from the same sampling distribution (i.e., different splits of the same underlying dataset). We performed 5-fold cross-validation (CV), holding out 20% of the cells in each fold and estimated the causal graph for each fold. We calculated a consensus graph consisting of edges that appeared in at least 3 of the 5 CV graphs. We compared our original graph estimate to these CV estimates in terms of F1 score and weighed accuracy (Aw, Table 1). We observed high concordance between our graph estimate and the consensus graph (F1 = 0.677, AW = 0.883), with slightly lower concordance between our estimated graph estimate and individual folds, as well as between the folds themselves (mean F1 = 0.582, mean Aw = 0.733).
Next, we undertook a validation analysis using an independent dataset. The genome-wide Perturb-seq study includes a second dataset with guides targeting all genes rather than just essential genes. We asked whether our network, inferred on the essential screen, could also explain variance over random graphs in this screen. Indeed, our essential screen network explains 3.3% of the variance in the genome-wide data compared to 0.95% for random graphs (permutation p < 1 × 10−16). Similarly to the previous analysis, we asked whether the incoming edges of our essential screen network explained more variance than the same number of random incoming edges in the genome-wide screen. In this setting, we found 442 genes at 5% false discovery rate (Fig. 4d). Again, we observed enrichment over the null at nearly every quantile in the Q-Q plot of the permutation p values (Fig. 4e). Together, these results indicate that our graph inferred on the essential screen data also fits the genome-wide screen data well.
Finally, we applied the same filtering strategy to select genes and constructed a genome-wide screen network, resulting in a graph containing 1428 nodes and 30,509 edges (Supplementary Data 5). Similarly to the 788 gene essential screen network, this graph also showed small-world and scale-free properties (Supplementary Fig. 2). Of these genes, 522 were included in both networks. When restricted to these genes, we noticed that this graph had limited overlap with our essential-screen network in terms of actual edge inclusions. Using the essential screen network as a reference, the F1-score of the genome-wide screen network was 0.17, while the weighted accuracy was 0.41 and the SHD was 19,815. Despite this, these graphs agree on their structural properties. We again calculated eigencentrality and observed a moderate correlation between the estimated eigencentrality of genes in common between the networks (ρ = 0.66, p < 1 × 10−16, Fig. 4f). We again regressed eigencentrality on gene essentiality scores and obtained 100% replication of our previous associations (Supplementary Fig. 3 and Supplementary Data 6).
Discussion
There is substantial interest in dissecting the causal regulatory networks in relevant cell-types underlying human complex disease. The K562 cell line is commonly studied in CRISPR experiments because it is easy to grow, differentiate, and perturb; and because it bears similarity to erythroid progenitor cells30,31. We constructed a causal network relating 788 genes in K562 cells with data from the essential-scale genome-wide Perturb-seq experiment9. Our estimated network has both “scale-free” and “small-world” properties, indicative of a system with several hub nodes but many short paths between connected nodes. We integrated our graphs with known measures of gene essentiality from genetic conservation and loss of function intolerance, finding correlations between these measures and eigencentrality. We also integrated our graphs with estimates of genome-wide gene expression heritability in human whole blood, finding a correlation between genes with high variance explained by the network and genes with high whole blood heritability. We replicated these properties in a larger 1429 gene network from an independent screen of all genes, thus concluding that our results likely reflect general properties of large causal gene networks.
These results may support the notion that gene networks are sufficiently connected that perturbations to most genes have broad downstream effects, rather than having effects that concentrate in particular pathways1,3. However, there are several caveats that limit the interpretability of these results. First, despite their widespread use, it is unclear what the relationships are between the regulatory networks in the K562 cancer cell line and diverse blood cell populations. Along these lines, it is unclear whether the network connections discovered via said experiments are the same as those that the molecular effects of GWAS variants propagate along. Instead, large-scale gene promoter inhibition may result in changes to cellular state that bring a concomitant change in network structure. In addition, we are limited to analyzing 788 genes that were well-captured in this screen. While other genes were targeted, many of them were not captured at sufficient levels in control samples to obtain robust estimates of the intervention effect and thus were removed from our analysis. Perhaps most importantly, our analysis says nothing about the locations or properties of trait-relevant genes within this network estimate. We explored several methods for pathway identification and disease gene enrichment analysis, but ultimately concluded that obtaining a well-calibrated null distribution for these kinds of tests is challenging. Specifically, in any method that produces network estimates, edge inclusions will vary with each other, and accounting for this covariance is necessary for obtaining well-calibrated estimates for enrichment analyses. We leave this important question for future work.
Our approach, as well as the others that we test against, aims to discover the causal graph relating the features in the data. Other approaches, such as GSFA32 and SAMS-VAE33 use factor modeling techniques to identify sets of genes impacted by perturbations. These kinds of methods aim to directly identify gene modules and pathways; however, they are unable to identify the structural relationships within or between these modules. We view these methods as providing complementary but fundamentally different ways of analyzing perturbation data.
inspre leverages the ACE matrix to rapidly produce causal graph estimates even with cyclic graph structures and the presence of unmeasured confounding, enabling causal discovery at unprecedented scales. The utility of inspre is two-fold: it provides an estimate of the causal graph while also providing a shrunken estimate of the ACE matrix that is supported by a graph structure, thereby substantially denoising the data. However, it does have some limitations. First, when intervention effects are weak, inspre is biased towards producing sparse, high-precision, low-recall estimates. Second, inspre requires an intervention on every feature and cannot make use of observed data for features without interventions. Finally, inspre produces an approximate inverse that may result in a graph that does not satisfy the spectral radius requirement (see Methods).
We find high consistency between our reported graph estimate and cross-validated graphs on the same dataset. However, when attempting to validate our results using an independent dataset from the same cell line, we find conflicting results. On the one hand, we replicate the high-level structural properties and insights from our main analysis. On the other hand, the network estimates themselves have limited overlap in their edge sets. There could be multiple reasons for this phenomenon. While our method controls for unmeasured confounding between genes, it does not account for potential treatment-exposure confounding. That is, it is possible that cellular properties such as cell size or cycle may bias guide transduction. Another intriguing possibility is the potential robustness of biological networks. In this model, the functional behavior of a biological system is relatively immune to perturbation and environmental heterogeneity, making interpretation of the results of perturbation-screen experiments challenging. Specifically, the network response to perturbation may be heterogeneous while maintaining similar high-level properties. Further research is required to determine whether this discrepancy between high-level and low-level network properties is a technical limitation of the computational method or dataset, or a fundamental property of biological systems. This work represents a first step towards discovering high-dimensional causal networks from CRISPR-inhibition experiments. In the future, we anticipate improvements in this methodology, combined with simultaneous increases in the scope and scale of diverse molecular intervention experiments, will enable new insights into the causal structures underlying complex human disease.
Methods
Data model
We utilize the common linear autoregressive graph model11 while adding linear intervention effects. Y be an N × D matrix of features (genes, phenotypes) and G a D × D graph acting on the features in the form of a weighted adjacency matrix. Let X be an N × Q binary indicator matrix of intervention targets with Q × D effect matrix β. The model is
where γ is a mean 0 random effect representing unmeasured factors and noise. Our goal is to estimate G given Y and X. Note that in general, the covariance matrix of γ need not be isotropic. This will occur, for example, in the case of unmeasured confounding. Note also that in order for this model to be well defined, we require that the modulus of the largest eigenvalue of G (the “spectral radius” r(G)) to be less than 1. In this case, I − G is invertible, and the model can be rewritten as
Our approach to estimating G is to use the interventional data to first estimate the D × D matrix of average causal effects \(\hat{R}\) (as in \({\mathbb{E}}[{Y}_{j}| do({Y}_{i})]={R}_{ij}{Y}_{i}\)). We then use this matrix to estimate G. In practice, any approach that is consistent given the assumptions of the data setting can be used to estimate \(\hat{R}\). For simplicity in this derivation, we assume that there is one intervention on each feature and that these interventions are uncorrelated with each other. In this case, we can use two-stage least squares. The entries on the diagonal of \(\hat{R}\) are \({\hat{R}}_{ii}=1\). The off-diagonal elements are given by
Since \({\mathbb{E}}[{\hat{R}}_{ij}]={R}_{ij}\), we have that R satisfies the recurrence R = RG off the diagonal, from which it follows that13,
where / indicates elementwise division and the diagonal operator \(D{[A]}_{i,j}=\left\{\begin{array}{ll}{A}_{i,j}&i=j\\ 0&i\, \ne \,j\end{array}\right.\) sets off-diagonal elements of a matrix to 0.
In this model, the variance explained in a gene by the network can be calculated by estimating the variance of the residual γ. Specifically, let H = (I−G)−1 and observe that
where (H−I)2 represents a Hadamard power, and \({\sigma }_{Y}^{2}={({\sigma }_{1}^{2},\ldots,{\sigma }_{D}^{2})}^{\top }\) is the total variance of each node. In practice, solving this exactly can result in negative estimates for \({\hat{\sigma }}_{\gamma }^{2}\), so we estimate \({\hat{\sigma }}_{\gamma }^{2}\) using non-negative least squares.
Inverse sparse regression
If we knew R exactly, we could simply invert it and plug the inverse into (10). However, we only have access to the noisy estimate \(\hat{R}\), which is not necessarily well-conditioned or even invertible. Instead, we assume that the underlying directed graph is sparse. We observe that in (10), G is sparse if and only if \({\hat{R}}^{-1}\) is sparse, and so we can view solving (10) as finding a sparse matrix inverse. We seek matrices U, V with VU = I that minimize the loss,
We minimize this loss using alternating direction method of multipliers (ADMM)34. Let Θk be a matrix of Lagrange multipliers. The updates for Uk, Vk, and Θk are
where ρ is the penalty parameter34. The update for V k+1 is a straightforward LASSO regression, while the update for U is a system of linear equations. In both cases, we use an iterative solver and fit 10 iterations rather than running until convergence. We always start from the initial condition U0 = V0 = I. For the derivation of these equations, including the specifics of how we tune the penalty parameter, see Supplementary note.
The approximate DAG constraint
The general inspre model does not assume that the underlying graph G is a DAG. However, in some cases, it may be preferable to fit a model that assumes a DAG, or you may obtain better convergence in practice by assuming a DAG. It is known that G is a DAG if and only if D[(I−G)−1] = I11. Using (10), we have that
Thus, we can implement an approximate DAG constraint by constraining D[V] = D[U] = I in the above regression (12).
Cross-validation and setting the LASSO penalty
We provide several cross-validation metrics to aid in selection of the LASSO penalty. We use 5-fold cross validation to calculate the loss
the loss
and the loss
where Y(−k), X(−k) are the test observations for cross-validation iteration k, and Gk, βk are the network and intervention effects estimated using the training observations Y(k), X(k). In the latter, we use the formulation from (3) with the approximation in (18).
In addition, we calculate stability estimates using the Stability Approach to Regularization Selection (StARS35). StARS leverages the intuition that larger values of λ yield graphs that are more stable under random re-samplings of the input data to construct an interpretable quantity representing the average probability that each edge is included in the graph for each value of λ. Let ϕλ be a D × D matrix where entry i, j is the probability that each edge i, j is included in the graph for regularization parameter λ. We estimate ϕλ by using the graph estimates from each cross-validation iteration,
The instability measure Dλ is estimated as35
Clearly, Dλ = 0 for very large values of λ, where \({V}_{\lambda }^{k}=I\) for every mask k. As λ becomes smaller, Dλ rises, but as λ approaches 0, D → 0 as \({V}_{\lambda }^{k}\to {A}^{+}\). Following35, we first normalize \({\hat{D}}_{\lambda }\) by setting it to \({\bar{D}}_{\lambda }=\mathop{\sup }_{l\le \lambda }{\hat{D}}_{l}\) and then choose the smallest value of λ with stability below a cut point b, \(\hat{\lambda }=\sup \{\lambda :{\bar{D}}_{\lambda }\le b\}\).
In our simulation studies, we use the first loss and select the λ that minimizes the cross-validation error. In our analysis of the GWPS-data, this leads to underfitting due to high variability in the standard errors of the estimates of individual effects. Thus, we choose λ based on analyzing each of these cross-validation metrics, aiming for a \(\hat{D}\) of around 0.01.
Simulation details
We conducted extensive simulations according to the model (2) with D = 50 nodes while varying the density of the network, the network edge weights, and intervention effect sizes. We simulated random and scale-free networks, with and without cycles, and with and without confounding. We simulated 10 independent graphs for each of these 64 settings. To avoid issues of varsortability36, aid interpretability in setting simulation parameters, and more accurately represent common practices in genetics, all phenotypes are simulated with mean 0 and variance 1.
To vary the network density, we varied the probability of edge inclusion (in random graphs) and the probability of adding an edge between existing nodes (in scale-free graphs) such that the average node degree in each graph was 2 (low-density) or 4 (high-density). To generate edge weights, we used a PERT distribution37. We used the parameter v to represent the median per-variance graph effect while setting the minimum to v/2 and the maximum to 2v. We used the setting v = 0.15 to represent weak effects and the setting v = 0.3 to represent strong effects. To set the intervention effect, we used a normalized effect setting equal to the number of standard deviations the mean of Yi shifted given the presence of the intervention Xi = 1. For weak effects we used β = −1 and for strong effects, we used β = −2. To generate confounding, we leveraged the observation that confounding effects are dense components in the graph structure38. We then simulated a specified number of additional unobserved full out-degree nodes. For edge weights, we again used a PERT distribution, but with the median effect set to vα, where α is the mean path length for that graph type: \(\alpha=\log (D)\) for random graphs39 and \(\alpha=\log (D)/\log (\log (D))\) for scale-free graphs40. In each setting, we set C such that the mean variance in each phenotype explained by confounding factors was 10%.
We compared inspre against observational methods LiNGAM14, notears11, and golem15; and interventional methods GIES16, dotears10, and igsp12. For GIES and LINGAM, we used the implementation in the pcalgR package. We used default parameter settings for all methods with two exceptions: (1) for all methods, we used edge thresholding set to v/4, (2) for methods that used L1 regularization, we used 5-fold cross-validation with a logarithmically decreasing sequence of 10 λ values from 1 to 10−6 and report results from the value of λ that minimized the cross-validation error for each method.
Genome-wide Perturb-seq analysis
We obtained normalized essential-scale Perturb-seq data for K562 cell lines generated in Replogle et al.9 from https://plus.figshare.com/ndownloader/files/35773075. We used two-stage least squares to estimate perturbation effects given the targeted gene intervention. We retained genes where the estimated per-variance effect-size of the intervention on the target gene was at least −0.75 standard deviations, and where there were at least 50 cells receiving a guide targeting that gene.
We used 5-fold cross-validation to select from an exponentially decreasing sequence of 10 λ values with λmax set to the absolute value of the maximum off-diagonal element of \(\hat{R}\) and λmin to 0.1× this value (see above). We thresholded all elements of \(\hat{G}\) with absolute value <0.015 to 0, corresponding to roughly half the minimum FDR-5% significant ACE. We selected the λ value that minimized the cross-validation error. Due to improved convergence behavior, we use the approximate DAG constraint given above.
We calculated eigencentrality using the igraph R package function eigen_centrality with default arguments. Note that this does not consider direction of edges in directed graphs. We conducted a beta regression of eigencentrality on 16 gene annotations using the betareg R package function betareg. We corrected for multiple testing using Holm’s method as implemented in the stats R package function p.adjust. For detailed descriptions and sources of each annotation see Supplementary Data 3.
We used GCTA28 to estimate genome-wide heritability for 659 genes included in the INTERVAL27 study and our estimated network. We estimated the genetic relatedness matrix on 4732 individuals while adjusting for age, sex, cell-type proportion, season, batch, and 35 PEER factors41. We compared genome-wide heritability estimates with the variance explained by the graph (\({\hat{\sigma }}_{\gamma }^{2}\)) in each gene calculated using eq. (11).
To calculate variance explained by random edges for each gene, we hold the gene in-degree fixed and sample that number of random genes from the set of genes excluding the inferred edges. We then estimate the variance explained by regressing the gene on these random genes using 10,000 random resamplings. To sample random graphs, we held the graph edge structure fixed while shuffling the node labels. We then estimate the variance explained in each gene by random graphs over 10,000 permutations by regressing the target node on it’s parent nodes and calculating the regression r2. Note that in order to achieve a fair comparison in this analysis, we report variance explained in the true graph using the same regression approach, rather than the estimated variance \({\hat{\sigma }}_{\gamma }^{2}\) given above.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Genome-wide Perturb-seq data were obtained from FigShare https://plus.figshare.com/ndownloader/files/35773075. The INTERVAL study data used in this paper are available to bona fide researchers from ceu-dataaccess@medschl.cam.ac.uk. The data access policy for the data is available at https://www.intervalstudy.org.uk/wp-content/uploads/2024/02/Data-Access-Policy-v1.0-14Apr2020.pdf. Full network results of our analysis are available on Zenodo at https://doi.org/10.5281/zenodo.14501665. Source data are provided with this paper.
Code availability
All code used in the production of this manuscript is available at https://github.com/brielin/inspre. v1 release documenting and archiving the code used in production of these results has been made available at https://doi.org/10.5281/zenodo.15858295.
References
Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017).
Liu, X., Li, Y. I. & Pritchard, J. K. Trans effects on gene expression can drive omnigenic inheritance. Cell 177, 1022–1034.e6 (2019).
Wray, N. R., Wijmenga, C., Sullivan, P. F., Yang, J. & Visscher, P. M. Common disease is more complex than implied by the core gene omnigenic model. Cell 173, 1573–1580 (2018).
Parsana, P. et al. Addressing confounding artifacts in reconstruction of gene co-expression networks. Genome Biol. 20, 4–9 (2019).
Verma, T. & Pearl, J. Equivalence and synthesis of causal models. In Proc. Sixth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-90), 220–227 (1990).
Chickering, D. M. Learning Bayesian Networks is NP-Complete 121–130 https://link.springer.com/chapter/10.1007/978-1-4612-2404-4_12 (Springer, 1996).
Hauser, A. & Ch, B. M. E. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs Peter Bühlmann. J. Mach. Learn. Res. 13, 2409–2464 (2012).
Angrist, J. D. & Imbens, G. W. Identification and estimation of local average treatment effects https://www.nber.org/papers/t0118 (National Bureau of Economic Research, 1995).
Replogle, J. M. et al. Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell 185, 2559–2575.e28 (2022).
Xue, A., Rao, J., Sankararaman, S. & Pimentel, H. Dotears: scalable, consistent DAG estimation using observational and interventional data https://arxiv.org/abs/2305.19215v1 (2023).
Zheng, X., Aragam, B., Ravikumar, P. & Xing, E. P. DAGs with NO TEARS: continuous optimization for structure learning https://github.com/xunzheng/notears (2018).
Yang, K. D., Katcoff, A. & Uhler, C. Characterizing and learning equivalence classes of causal DAGs under interventions. Int. Conf. Mach. Learn. ICML 12, 8823–8839 (2018).
Pachter, L. S. The network nonsense of Albert-László Barabási ∣ Bits of DNA. https://liorpachter.wordpress.com/2014/02/10/the-network-nonsense-of-albert-laszlo-barabasi/.
Shimizu, S., Hoyer, P. O., Hyvärinen, A. & Kerminen, A. A linear non-Gaussian acyclic model for causal discovery. J. Mach. Learn. Res. 7, 2003–2030 (2006).
Ng, I., Ghassami, A. E. & Zhang, K. On the role of sparsity and DAG constraints for learning linear DAGs. Adv. Neural Inform. Process. Syst. https://arxiv.org/abs/2006.10201v3 (Neural Information Processing Systems Foundation, 2020).
Hauser, A. & Bühlmann, P. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. J. Mach. Learn. Res. 13, 2409–2464 (2011).
Erdös, P. & Rényi, A. On the evolution of random graphs. Publ. Math. Inst. Hungarian Acad. Sci. 5, 17–60 (1960).
Bollobas, B., Borgs, C., Chayes, J. & Riordan, O. Directed scale-free graphs. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 132–139 https://www.microsoft.com/en-us/research/publication/directed-scale-free-graphs/ (2003).
Bonacich, P. Power and centrality: a family of measures. Am. J. Sociol. 92, 1170–1182 (1987).
Plaschka, C. et al. Architecture of the RNA polymerase II-Mediator core initiation complex. Nature 518, 376–380 (2015).
Gamble, S. C. et al. Prohibitin, a protein downregulated by androgens, represses androgen receptor activity. Oncogene 26, 1757–1768 (2006).
Aitken, C. E. & Lorsch, J. R. A mechanistic overview of translation initiation in eukaryotes. Nat. Struct. Mol. Biol. 19, 568–576 (2012).
Yotov, W. V. & St-Arnaud, R. Mapping of the human gene for the alpha-NAC/1.9.2 (NACA/1.9.2) transcriptional coactivator to Chromosome 12q23-24.1. Mamm. Genome 7, 163–164 (1996).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Cassa, C. A. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 49, 806–810 (2017).
Holm, S. A simple sequentially rejective multiple test procedure. Scandinavian J Stat 65–70 (JSTOR, 1979).
Tokolyi, A. et al. Genetic determinants of blood gene expression and splicing and their contribution to molecular phenotypes and health outcomes https://www.medrxiv.org/content/10.1101/2023.11.25.23299014v1. 2023.11.25.23299014 (2023).
Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).
Ružičková, N., Hledík, M. & Tkačik, G. Quantitative omnigenic model discovers interpretable genome-wide associations. Proc. Natl. Acad. Sci. USA 121, e2402340121 (2024).
Gasperini, M. et al. A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell 176, 377–390.e19 (2019).
Morris, J. A. et al. Discovery of target genes and pathways at GWAS loci by pooled single-cell CRISPR screens. Science 380, https://pubmed.ncbi.nlm.nih.gov/37141313/ (2023).
Zhou, Y., Luo, K., Liang, L., Chen, M. & He, X. A new Bayesian factor analysis method improves detection of genes and biological processes affected by perturbations in single-cell CRISPR screening. Nat. Methods 20, 1693–1703 (2023).
Bereket, M. & Karaletsos, T. Modelling cellular perturbations with the sparse additive mechanism shift variational autoencoder http://arxiv.org/abs/2311.02794 (2024).
Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3, 1–122 (2010).
Liu, H., Roeder, K. & Wasserman, L. Stability approach to regularization selection (StARS) for high dimensional graphical models. Adv. Neural Inform. Process. Syst. 23: 24th Annual Conference on Neural Information Processing Systems 2010, NIPS 2010 1–14 (2010).
Reisach, A. G., Seiler, C. & Weichwald, S. Beware of the simulated DAG! Causal discovery benchmarks may be easy to game. Adv. Neural Inf. Process. Syst. 33, 27772–27784 (2021).
Clark, C. E. Letter to the editor-the PERT model for the distribution of an activity time. https://doi.org/10.1287/opre.10.3.40510, 405–406 (1962).
Chandrasekaran, V., Parrilo, P. A. & Willsky, A. S. Latent variable graphical model selection via convex optimization. Ann. Stat. 40, 1935–1967 (2012).
Fronczak, A., Fronczak, P. & Holyst, J. A. Average path length in random networks. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 70, http://arxiv.org/abs/cond-mat/0212230 (2002).
Chen, F., Chen, Z., Wang, X. & Yuan, Z. The average path length of scale free networks www.elsevier.com/locate/cnsns (2006).
Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 7, 500–507 (2012).
Acknowledgements
The authors would like to thank Júlia Domingo-Espinós and Mariia Minaeva for sharing curated gene-annotation data. Funding for B.C.B. is provided by NHGRI K99HG012373 and the Columbia Data Science Institute. Funding for DAK and B.C.B. is provided by NIA U01AG068880 and the MacMillan Center for the Study of the Non-Coding Cancer Genome at the New York Genome Center. Funding for T.L. is provided by NIA R01AG057422 and NIMH R01MH106842. Funding for JM is provided by NHGRI K99HG012792.
Author information
Authors and Affiliations
Contributions
B.C.B. and D.A.K. conceived the project, data model, and algorithm. B.C.B. performed primary analyses, wrote the software, and wrote the manuscript draft. A.T. estimated heritability using the INTERVAL data and provided guidance on its analysis. J.M. and T.L. provided guidance for the analysis of the PERTURB-seq data. D.A.K. and T.L. supervised the project. All authors read and provided feedback on the manuscript prior to submission.
Corresponding author
Ethics declarations
Competing interests
T.L. is an advisor in Variant Bio with equity and has received speaker honoraria from AbbVie. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Hauke Busch and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Brown, B.C., Tokolyi, A., Morris, J.A. et al. Large-scale causal discovery using interventional data sheds light on gene network structure in k562 cells. Nat Commun 16, 9628 (2025). https://doi.org/10.1038/s41467-025-64353-7
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41467-025-64353-7