Main

Single-cell RNA sequencing (scRNA-seq) has greatly enhanced our understanding of intricate biological systems and has provided groundbreaking insights into human genetics and clinical research1,2. Despite its capability to uncover novel cell types and states within heterogeneous tissues, transcriptomics alone often falls short in distinguishing molecularly similar yet functionally distinct cell categories3. The concurrent assessment of additional modalities, which offer distinct and complementary information from the same cell, is crucial not only for enhancing our capacity to identify cell types and states but also for gaining new insights into the interaction of different components within the cell that define cellular function4,5. However, the majority of multimodal single-cell sequencing methods remain expensive and have low throughput. Therefore, advanced multimodal single-cell sequencing technology is urgently needed to facilitate fundamental discoveries in biology and medicine.

With the efforts of international consortiums such as the Human Cell Atlas, the data size of scRNA-seq has grown exponentially, reaching hundreds of millions of cells. Gene expression values, which can be considered as ‘tokens’, scale to the order of trillions, which is equivalent to the amount of natural language text used to train large language models such as the generative pretrained transformer. Pretrained large-scale foundation models for deciphering the ‘language’ of cells have been successfully constructed and are expected to promote biomedical research6. However, these models are limited to single-cell transcriptomes. If multimodal single-cell sequencing technology continues to advance and becomes as widely used as current scRNA-seq, an additional benefit is that a more robust foundational model, which takes large-scale multimodal single-cell data as input, could provide a better understanding of gene regulatory mechanisms and the ability to predict cellular responses to perturbations.

We have developed UDA-seq, a general workflow for single-cell multimodal sequencing that connects droplet microfluidics with combinatorial indexing. This approach is a versatile strategy that can be seamlessly integrated with current droplet-based multimodal methods to enhance throughput. To validate the capabilities of UDA-seq, we conducted experiments involving various single-cell co-assay tasks, such as RNA and chromatin, RNA and variable, diversity and joining (VDJ), and RNA and CRISPR interference across different tissues and cell types. UDA-seq achieves extremely high throughput with data quality similar to that of standard droplet microfluidic procedures. We have also established efficient sample preparation and corresponding computational pipelines to facilitate UDA-seq analysis of clinical liquid or needle biopsy specimens. Our results demonstrate that UDA-seq can identify age-related ITGB1 + PREX1 + CD4 naive T cells, proteinuria-related podocyte (POD) cells and kidney injury-related glomerular capillaries endothelial cells (EC-GCs). Additionally, we show that UDA-seq can perform hundreds of CRISPR interferences in genes containing bromodomains, revealing the different vulnerabilities of existing subpopulations in the SNU16 cancer cell line.

Results

Theory and design of UDA-seq

In droplet microfluidics systems, single cells are encapsulated using low-density cell suspensions to minimize collisions. Typically, only 1–10% of the total droplets contain cells or nuclei, following a Poisson distribution. To enhance cell droplet utilization and reduce costs, new methods involving two rounds of combinatorial indexing have been developed7,8,9,10. These methods combine droplet microfluidics and pre-indexing strategies to increase cell throughput and sample multiplexing. Although these approaches allow the overloading of droplets and effective use of data from multiplets, current pre-indexing methods require specific protocols for each modality and involve time-consuming operations, which may impact the stability of data quality owing to potential degradation and loss of target nucleic acid molecules.

In this Article, we propose a straightforward two-round combinatorial indexing experimental workflow called UDA-seq. It uses droplet microfluidics-based techniques for the first round of cell indexing, followed by a second round of cell indexing after they are released from droplets. While the microfluidic droplets do not isolate single cells, they function as physical compartments similar to ‘wells’ in classical combinatorial indexing approaches such as single-cell combinatorial indexing RNA sequencing (sci-RNA-seq)11 and split-pool ligation-based transcriptome sequencing (SPLiT-seq)12. As a result, most droplets contain multiple cells or nuclei.

To get started (Fig. 1a and Extended Data Fig. 1), the following steps are carried out: (1) Hundreds of thousands of cells or nuclei are fixed and permeabilized before being encapsulated using a standard microfluidics device. Different single-cell modalities are achieved by using the appropriate reagents and chips. In the case of single-cell assay for transposase-accessible chromatin (ATAC) and RNA sequencing Multiome analysis, the nuclei are subjected to in situ Tn5 tagmentation before loading to the microfluidics chip. (2) Inside these overloaded droplets, targeted transcripts (for example, 3′-RNA and 5′-RNA) and DNA (for example, open chromatin regions and specific histone modification region) are labeled with a droplet-specific barcode (round 1). (3) The droplet emulsion is then broken to release the cells or nuclei, which remain intact (Extended Data Fig. 2). (4) The nuclei or cells are randomly mixed and aliquoted into 96/384-well PCR plates for a second round of indexing by well-specific index PCR. (5) The PCR amplification products from the two rounds of barcoding are pooled together for purification and further library construction for sequencing. After sequencing, the combination of the droplet- and well-specific barcodes allows for the unique identification of targeted nucleic acid fragments from the same single cell.

Fig. 1: The workflow of UDA-seq and its performance across diverse modalities and samples.
figure 1

a, The UDA-seq workflow involves two rounds of barcoding. bd, Species-mixing experiments by using single-cell 3′-RNA UDA-seq with 96-well post-indexing (b), single-nucleus 3′-RNA UDA-seq with 16 well post-indexing (c) and single-nucleus scMultiome UDA-seq with 16 well post-indexing (d). e,f, Violin plots showing the distribution of genes (RNA) and peak reads (ATAC) detected by UDA-seq scMultiome and 10x Genomics (e) as well as UDA-seq sc5′-RNA (f). Box plots show the interquartile range with the median marked. The whiskers extend up to 1.5 times the interquartile range; the outliers are not displayed. The median number of genes detected is shown at the top of each violin plot. g, A comparison between UDA and 10x Genomics for scVDJ-seq. The T-cell receptor (TCR) and B-cell receptor (BCR) detection ratio for UDA with Next GEM (Kit_v2) and GEM-X (Kit_v3), and the 10x Next GEM. h, The ultrahigh-throughput 5′-scRNA UDA-seq analysis of over 150,000 cells from a single donor, identifying 44 distinct cell types. The UMAP plot is color-coded to represent these cell types. j, Rare cell types, identified with cell type labels.

Source data

With the post-indexing strategy, we can smoothly incorporate an additional round of barcoding through index PCR in the well-established droplet-based single-cell method.

UDA-seq validation

Theoretically, employing two rounds of combinatorial barcoding has the potential to generate between 9,600,000 and 38,000,000 barcode combinations. This entails one round of barcoding in droplets followed by a subsequent round involving 96/384 well-specific barcoding. Notably, even when loading 500,000 cells, the collision rate remains exceedingly low, and it can be reduced further to almost zero by increasing the number of post-index barcodes (Extended Data Fig. 3a).

To assess the capability of UDA-seq for generating uniquely barcoded cells, we conducted species-mixing experiments. We utilized the 10x Genomics Chromium droplet generator for droplet barcoding owing to its widespread availability. Initially, we verified the feasibility of 3′-RNA UDA-seq in fixed whole cells by mixing human Hela and mouse NIH 3T3 cell lines in a 1:1 ratio. We loaded a total of 10,000 cells on the 10x Genomics Chromium system, followed by 96-well post-indexing. We successfully obtained 6,245 high-quality single-cell transcriptomes, representing 62% of the loaded cells. The majority of cells, each with a unique two-round barcode, were identified as belonging to a single species. The collision rate was only 1.23% (Fig. 1b), notably lower than the expected rate of 6.29% when using the round 1 barcode only (Extended Data Fig. 3b).

We also conducted UDA-seq testing using formaldehyde-fixed nuclei with a loading nuclei number of 10,000. This included a combination of human Hela, mouse NIH 3T3 and locust brain nuclei for 3′-RNA in a ratio of 1:1:1, or an equal mix of human Hela and mouse NIH 3T3 nuclei for Multiome. The cell recovery rates were approximately 57% for 3′-RNA and 37% for Multiome. Despite using only 16 barcodes in the second round of labeling, the collision rates were very low, at 2.11% for 3′-RNA and 0.67% for Multiome (Fig. 1c,d). The collision rates are markedly lower than those observed when using the round 1 barcode alone (Extended Data Fig. 3c,d). For the 3′-RNA UDA-seq experiments involving three species, all three species were distinguished on the basis of their transcriptomes (Extended Data Fig. 3e). In species-mixture experiments for Multiome UDA-seq, human and mouse nuclei can be effectively defined using either transcriptome or chromatin profiles (Fig. 1d and Extended Data Fig. 3f).

UDA-seq performance and scalability

We evaluated UDA-seq to assess its practicality and ability to generate high-quality data in various primary tissues and clinical samples, as well as to understand how its performance varies across different modalities. Initially, we applied the Multiome (RNA and ATAC) UDA-seq workflow to frozen cell lines and human clinical samples, including cholangiocarcinoma, lung cancer, pancreatic cancer, coronary artery and kidney biopsy tissues (Extended Data Fig. 3g). When randomly taking 1/12 of cells as a sublibrary based on post-index and downsampling the read depth to 20,000 reads per cell, the data obtained from Multiome UDA-seq demonstrated quality comparable to that of the standard 10x Multiome procedure within the same cell line or tissue type (Fig. 1e).

We performed single-cell 5′-RNA and VDJ UDA-seq on fixed whole cells of peripheral blood mononuclear cells (PBMCs). Upon downsampling the depth to 50k reads per cell, we detected a median of 1,089 genes per cell in PBMCs frozen for 2 years and 1,588 genes per cell in PBMCs frozen for 6 months. These results are slightly lower and comparable to the standard procedure for the same kit version of 10x Genomics (Fig. 1f). Furthermore, we detected approximately 45% T cells with either T cell receptor (TCR)α or TCRβ chains and 80% B cells with either heavy or light chains. These percentages are slightly lower than the standard procedure but still highly informative (Fig. 1g). The relatively lower detection of the TCR in UDA-seq may be a result of its low expression level or specific RNA structural features. Further optimization of fixation conditions and reverse transcription (RT) reagents or selection of a longer sequencing read length may enhance the detection.

We evaluated scalability by upgrading our reagents to the 10x Genomics GEM-X Reagent Kits v3 and instrumentation to the Chromium X, doubling the droplet numbers compared with the previous system. We conducted a large-scale single-cell UDA-seq 5′-RNA and VDJ with GEM-X (v3) reagents by loading 300,000 methanol-fixed PBMCs frozen for 2 years into a single channel followed by 96-well post-indexing. A total of 170,000 cells were recovered, of which 150,018 cells met the quality control criteria. When downsampling the read depth to 50,000 reads per cell, we observed a median of 1,296 genes per cell (Extended Data Fig. 3h). A sublibrary, representing 1/12 cells of the entire library, was subjected to deep sequencing to achieve read coverage of 178,000 reads per cell. We identified a median of 3,589 unique molecular identifiers (UMIs) and 1,822 genes per cell (Extended Data Fig. 3i), a median of 60.3% T cells with TCRα and TCRβ chains, and 93.4% B cells with heavy and light chains (Fig. 1g). This indicates that optimizing the reagent and increasing the number of droplets can be directly integrated into the existing UDA-seq workflow without additional configuration. The enhanced cellular recovery enables a more thorough analysis of cell populations. Reference mapping identified 44 distinct cell states (Fig. 1h), including some rare populations accounting for less than 0.1% of the sample. These uncommon cell states, such as activated T-cell states, cycling cell states and progenitor cell types, may be overlooked when using methods that have lower cell throughput (Fig. 1i and Extended Data Fig. 3j).

These results demonstrate that plate-based post-indexing is compatible with droplet-based barcoding. They also show that UDA-seq is reliable when used with whole cells or nuclei, fresh cell lines or frozen complex tissue, and that it is compatible with co-assaying of different single-cell modalities.

Multiome UDA-seq analysis on kidney frozen biopsy

After successfully confirming the feasibility and scalability of UDA-seq, we proceeded to assess its potential for increasing sample throughput and utilization in a large-scale cohort study. We employed Multiome UDA-seq to study chromatin accessibility and gene expression in kidney diseases. We collected 35 frozen human samples, including 10 time-zero transplant biopsies from healthy donors and 25 biopsies from patients with over 10 different diagnoses (Supplementary Table 1). These diseases were categorized on the basis of their underlying pathophysiological mechanisms, covering immune-mediated glomerulopathies (12 donors), metabolic-related renal diseases (5 donors) and other types (8 donors). Additionally, we assessed the samples on the basis of clinical symptoms, such as the presence of massive proteinuria (≥3.5 g per 24 h) and impaired kidney function (Fig. 2a).

Fig. 2: Single-nucleus RNA and ATAC joint profiling from human kidney mini biopsies.
figure 2

a, A total of 25 renal biopsy samples from patients with kidney disease and 10 healthy control samples from kidney transplantation donors were utilized for Multiome UDA-seq analysis. b, A schematic of the joint snATAC-seq and snRNA-seq analysis for kidney biopsy samples. c, The data quality of the Multiome UDA-seq, showing the distributions of gene number per cell (top left) and FiRP per cell (top right) for patients with disease (n = 25) and healthy control donors (n = 10). Box plots show interquartile range with the median marked. The whiskers extend up to 1.5 times the interquartile range, and the outliers are not displayed. Bottom left: normalized insertion profile around transcriptional start sites for patients with disease and healthy donors. Bottom right: the length distribution of fragments from patients with disease and healthy donors. d, UMAP plots with cells colored by clusters, defined by snRNA-seq alone (top) or snATAC-seq alone (bottom). e, A dot plot showing differentially expressed genes across the major cell types. f, A track plot showing differentially accessible regions across the major cell types. g, UMAP plots with cells colored by clusters, defined by weighted nearest network joint analysis of snRNA-seq and snATAC-seq. h, Peak2Gene links (P2GLinks) in kidney snRNA-seq and snATAC-seq profiles shown in a heat map with 25 modules. Each row represented a single cell, grouped by its cell type. Panel b was created using BioRender.com. Full names for all abbreviations of cell types in panels dh can be found in the Methods.

We addressed the challenge of working with ultralow starting materials by pooling 35 samples to isolate and fix nuclei. Each disease sample, equivalent to around one-fifth of a 16G or 18G needle biopsy (approximately 2–3 mm in length), was sourced from the tissue remaining after diagnostic procedures. Subsequently, the nuclei suspension underwent two channels of Multiome UDA-seq with 96 barcodes in post-indexing in a single batch (Fig. 2b). Sample demultiplexing was carried out using natural genetic variation13. As a result, we obtained a total of 207,789 cells that met high-quality criteria for both single-nucleus RNA sequencing (snRNA-seq) and single-nucleus ATAC sequencing (snATAC-seq) readouts, with a median of 1,707 and 1,518 genes per cell, and 9,881 and 7,163 fragments in peak per cell for patients and healthy controls, respectively. These results are comparable to the data quality from the standard 10x Multiome procedure (Figs. 1e and 2c and Extended Data Fig. 4a).

The data from the two channels shows a high degree of consistency (Extended Data Fig. 4b–d). After unsupervised clustering, a total of 60 cell clusters for the snRNA-seq and 68 cell clusters for the snATAC-seq were identified (Fig. 2d). These clusters were then categorized into 54 subclasses by leveraging a well-annotated human kidney scRNA-seq reference dataset14 (detailed cell types provided in Methods ‘Kidney snRNA-seq part analysis’ section). The specificity of our snRNA-seq data was demonstrated using previously characterized cell markers (Fig. 2e; for detailed genes, see Supplementary Table 2), and specific peaks were identified in the snATAC-seq within these cells (Fig. 2f). Subsequent integrative analysis of both modalities using the weighted nearest neighbor3 method revealed a more refined pattern of cell clustering (a total of 253 cell clusters) as identified by the aforementioned single-modality analysis (Fig. 2g). Furthermore, employing an orthogonal approach, ArchR15, by combining co-accessibility and RNA integration, allowed the identification of 98,698 Peak2Gene links. These links show a strong correlation between chromatin accessibility and target gene expression, resulting in 25 unique modules across all cell types (Fig. 2h), which confirms the high quality of the data.

Key cells and regulators linked to massive proteinuria

Massive proteinuria corresponds to protein of more than 3.5 g in 24 h urine. Our data include measurements of the extent of proteinuria for each donor and matched single-cell profiling of primary tissue. With a sample size of 35, this cohort enables us to determine which specific cellular subpopulation is most relevant to the clinical signs observed in proteinuric kidney diseases. To address this question, we modified an algorithm called Scissor16, which is a method for identifying subpopulations of cells associated with specific phenotypes from single-cell data. Surprisingly, Scissor identified podocytes (POD) as the highest-ranking candidate, and the signal seems quite unique and specific (Fig. 3a). Notably, the number of PODs represented less than 0.6% of our data (Fig. 3b). Moreover, cells positively associated with proteinuria (proteinuria positive cells) identified by Scissor were notably enriched in donors with proteinuria greater than 3.5 g (Fig. 3c). Genes differentially expressed in proteinuria positive cells were employed as a proteinuria-associated signature (for detailed genes, see Supplementary Table 3), which was positively correlated with the magnitude of proteinuria in the donors (Fig. 3d). These results further demonstrate that UDA-seq is very reliable in depicting the population scale of a complex cellular system.

Fig. 3: Rare cell subpopulations associated with proteinuria identified using UDA-seq.
figure 3

a, A violin plot displaying the distribution of positive proteinuria signature scores across different cell types. Box plots show the interquartile range with the median marked. The whiskers extend up to 1.5 times the interquartile range, and the outliers are not displayed. b. Left: the anatomical structure of the glomerulus, with arrows indicating a podocyte and EC-GC. Right: UMAP embedding, colored by positive proteinuria signature score, highlighting the enrichment of the proteinuria signature in the podocyte. c, A heatmap providing a visual representation of the enrichment of cells related to proteinuria positivity or negativity across different donors. The colors in the heatmap correspond to the enrichment scores. Donors were categorized into three groups: healthy, those with 24 h urinary protein levels ≥3.5 g and those with levels <3.5 g. d, A box plot depicting donors’ positive proteinuria signature scores across the control (n = 10), low proteinuria (n = 19) and high proteinuria (n = 6) group. The box plots show the interquartile range with the median marked. The whiskers extend up to 1.5 times the interquartile range. P values were calculated using the two-sided t test. *Bonferroni-adjusted P < 0.05. **Bonferroni-adjusted P < 0.01. e. Podocyte reclustering, revealing six subclusters. f, UMAP embedding, colored to indicate cells positively related to proteinuria. g, A Volcano plot showing marker genes of subcluster 1 for podocytes. P values were calculated by the two-sided Wilcoxon rank sum test. The vertical dashed lines represented −1 or 1 log2FC value, the horizontal dashed line represented p-value = 0.05. h, The key eRegulon identified by SCENIC+. Bold and enlarged gene symbols represent transcription factors (TFs), squares represent peak regions and gene symbols at the graph’s edges represent regulated genes. i, A heatmap illustrating the differential cell–cell communication strength of proteinuria donors compared with healthy control donors. j, A bar plot illustrating that the podocyte-to-EC-GC communication changes between donors with proteinuria and healthy control donors in the fibroblast growth factor (FGF), vascular endothelial growth factor (VEGF), neuregulins (NRG) and platelet-derived growth factor (PDGF) signaling pathway. The FGF signaling pathway exhibits an increase in proteinuria donors. Panel b was created using BioRender.com. Full names for all abbreviations of cell types in panels a,b,i can be found in the Methods.

Source data

Podocytes (PODs) are specialized cells that are crucial for establishing and maintaining the glomerular filtration barrier17,18,19. Our research findings support previous studies on the important role of podocyte injury in massive proteinuria. We delved into the heterogeneous subpopulations of PODs involved in this process, as well as the underlying molecular mechanisms and regulatory networks. Through unsupervised reclustering, we identified six distinct subclusters of PODs (Fig. 3e). Notably, subcluster 1 emerged as a crucial contributor to proteinuria by Scissor analysis (Fig. 3f and Extended Data Fig. 5a). We found that the highly expressed genes in subcluster 1 were intricately linked to the activation of the transforming growth factor-beta signaling pathway (Fig. 3g). This observation aligns with earlier findings indicating that transforming growth factor-beta activation triggers a dedifferentiation process of podocytes resembling the epithelial-to-mesenchymal transition, which may contribute to glomerular fibrosis20,21,22.

A better understanding of the molecular regulatory networks of POD dysfunction could help in developing podocyte-targeted therapeutic approaches. We used SCENIC+23 to infer enhancer-driven gene regulatory networks for proteinuria-associated POD subpopulations. SCENIC+ identified 17 activator eRegulons and 7 repressor eRegulons, including transcription factors (TFs) known to be associated with kidney disease (Extended Data Fig. 5b). For example, LMX1B mutations may cause glomerulopathy17,24,25, and high expression of E2F3 has been linked to worse pathological characteristics in renal cancer17,26,27. SCENIC+ then constructed a putative regulatory network, demonstrating the cooperative relationship between E2F3, ZNF398 and PLAG1 (Fig. 3h). Upregulation of DAAM2, MYLK3 and SCGB2B2 by transcription factor E2F3 was observed in massive proteinuria-associated POD (Extended Data Fig. 5c).

We used CellChat28 to analyze cell-to-cell communication and found that patients with massive proteinuria had increased communication between podocytes (as a source of signals) and EC-GCs (as a receiver of signals) compared with healthy controls (Fig. 3i). Additionally, we observed enhanced communication of the FGF signaling pathway and reduced communication of the VEGF signaling pathway in these patients (Fig. 3j). These results are consistent with previous reports indicating the importance of podocyte-derived VEGF signaling for the survival and function of EC-GCs29, as well as clinical observations that VEGF inhibition leads to endothelial swelling and thrombotic microangiopathy30.

Key cells and regulators linked to renal function impairment

Chronic kidney disease (CKD) and acute kidney injury (AKI) are increasingly understood as closely linked syndromes31,32,33,34,35. Thus, we grouped patients with AKI and CKD together as having ‘renal impairment’ to study the types of cells most impacted. We employed the Scissor algorithm-based approach to determine the cell types associated with kidney function impairment. The analysis showed that the top four most affected cell types in both immune-associated renal injury and metabolism-associated renal injury were all different endothelial cells. The top-ranking cell type was EC-GCs (Fig. 4a,b), a critical part of the glomerular filtration barrier. Additionally, the ‘renal impairment’ signature (for detailed genes, see Supplementary Table 4) calculated by Scissors displayed a negative correlation with the patient’s glomerular filtration rate (eGFR), a crucial negative indicator of renal filtration capacity, confirming the accuracy of our calculation (Fig. 4c,d). Upon reclustering of EC-GCs, six subclusters with specific marker genes were identified (Fig. 4e,f). Interestingly, cells of metabolic-related injury were predominantly found in subcluster 0 (Fig. 4g), while those of immune-related injury were enriched in subcluster 3 (Fig. 4h).

Fig. 4: Rare cell subpopulations associated with impairment identified using UDA-seq.
figure 4

a,b, A violin plot displaying the distribution of single-cell level immune-related impairment (a) and metabolic-related impairment (b) signature scores across different cell types, with the y-axis representing the immune-related signature score and the x-axis representing cell types. Box plots show the interquartile range with the median marked. The whiskers extend up to 1.5 times the interquartile range, and the outliers are not displayed. c,d, The median score of the immune-related impairment (c) and metabolic-related impairment (d) signature across all cells from an individual donor, exhibiting a negative correlation with the eGFR (Pearson’s r = 0.908 and 0.99, respectively). The shaded area around the regression line represents the 95% confidence interval. e, Left: EC-GC reclustering, revealing six subclusters in UMAP. Middle: UMAP embedding, colored by metabolic impairment positive cells. Right: UMAP embedding, colored by immune impairment positive cells. f, Marker genes identified for the six subclusters. The dot size indicates the gene expression percentage of a cluster, while the color represents the scaled average gene expression within a cluster. g, A bar plot showing that subcluster 0 was enriched with cells that are positively related to metabolic impairment. h, A bar plot displaying that subcluster 3 was enriched with cells that are positively related to immune impairment. i, Important eRegulons identified by SCENIC+ in the six subclusters. The colors of the heatmap represent scaled TF gene expression, and the dot size indicates scaled target region enrichment. j, GRNs revealed the key eRegulons of subcluster 3, which consists of immune impairment-related cells. Transcription factors are denoted in bold with enlarged gene symbols, peak regions are represented by squares and regulated genes are shown at the edge of the graphs. k, The representative genes regulated by the transcription factor in j are highly expressed in subcluster 3.

Source data

In our analysis using SCENIC+, we effectively examined the eRegulon of EC-GC subpopulations associated with renal impairment (Fig. 4i) and successfully constructed gene regulatory networks (GRNs). Our investigation revealed a collaborative relationship between RARB and KLF9 in immune injury positive cells (Fig. 4j). Notably, abnormal levels of CCND336,37,38,39 and PKT2840, the target genes of KLF9, have been previously linked to immune regulation and immune-mediated diseases. These compelling findings (Fig. 4k) suggest that these genes hold potential therapeutic targets for immune-mediated glomerulopathy.

In summary, we have effectively generated top-quality pan-nephropathy scMultiome data using UDA-Seq, leading to a remarkable 8- to 9-fold reduction in the cost of library construction per cell or 15- to 20-fold per sample (aiming for approximately 5,000 cells per sample). See Extended Data Figs. 6 and 7 for a more detailed cost comparison. In addition, it is worth noting that the enhanced cellular and sample throughput make pooling of tiny samples feasible and multimodal data from rare cell types meaningful.

RNA and VDJ UDA-seq of frozen PBMCs from human aging cohort

We proceeded to apply UDA-seq to the study of paired gene expression and adaptive immune receptor repertoire from the same cell, also referred to as FIPRESCI-seq210. We gathered and applied single-cell RNA and VDJ UDA-seq to PBMCs from 38 female donors aged 20–65 years, constituting a female natural aging cohort. The frozen PBMCs were thawed and then divided into two pools, one comprising 10 donor PBMCs and the other comprising 28. Following fixation with methanol, each pool of cells was loaded with one channel and underwent 96-well post-indexing for library construction (Fig. 5a). A total of 152,357 cells were recovered from the two channels. After conducting quality control and sample demultiplexing using natural genetic variation similar to mux-seq13, 132,712 cells remained for downstream analysis. On average, 3,112 cells and 1,149 genes per cell were identified for each individual (Extended Data Fig. 8a,b). We integrated our data and annotated them using a high-quality reference dataset3. We retained 126,616 cells with high prediction scores, which led to the identification of 27 major cell types (Fig. 5b).

Fig. 5: Investigating the PBMC characteristics in an aging female cohort.
figure 5

a, A flowchart depicting the multimodality experimental design carried out on PBMCs obtained from a female aging cohort, consisting of two batches of donors totaling 10 and 28 individuals, respectively. b, UMAP embedding of 126,616 5′-scRNA-seq profiled PBMC cells, encompassing 27 distinct cell types. c, Aging-related cells detected using SCIPAC with a UMAP visualization with cells color-coded on the basis of the significance of the association, with a red P value indicating a positive association and a blue P value indicating a negative association. The P values were calculated by the SCIPAC nonparametric bootstrap strategy. d, Pie charts showing cell type proportions in age-associated cell groups. Top: age positively associated groups. Bottom: age-negatively associated groups. e, Box plots displaying age-related gene signature scores across different age ranges, indicating a positive association with aging. Box plots show interquartile range with the median marked. The whiskers extend up to 1.5 times the interquartile range. The number of cells for different age ranges is shown at the top of each box plot. f, GO enrichment analysis performed on the gene set forming the age-associated signature. The length of the bars indicates the gene counts for corresponding GO terms, while the colors represent the adjusted P value of enrichment. The P value for GO terms was calculated by the two-sided hypergeometric test and adjusted by the Benjamini–Hochberg method. g, Reclustering of CD4 naive cells, revealing nine subclusters. h, The proportion of CD4 naive subcluster 2 increases with age. i, A stacked violin plot depicting the positive marker genes of subcluster 2 within the CD4 naive population. j, Box plots depicting the TCR clonal diversity across age groups (n = 5, 2, 8, 13 and 10), with each dot representing a donor’s diversity calculated by the inverse Simpson index. The box plots show the interquartile range with the median marked. The whiskers extend up to 1.5 times the interquartile range.

Source data

In our investigation into the correlation between cell type and age, we employed the SCIPAC algorithm41, which offers a quantitative assessment of the link between single cells and a specific phenotype. Through the use of matched single-cell RNA-seq and phenotype data, cells were categorized into three distinct groups: phenotype+ cells, phenotype− cells and null cells. Our analysis revealed that natural killer (NK) cells, CD8+ T effector memory (TEM) and γδT cells are among the top three cell types positively correlated with aging. Furthermore, CD8+ naive, CD4+ naive and CD4+ T central nemory (TCM) cells were identified as the top three cell types negatively correlated with aging (Fig. 5c,d and Extended Data Fig. 8c). The findings described align with prior research on single-cell transcriptomic data of PBMC in independent aging human cohorts42,43. It is worth noting that γδT cells constitute only a small portion of the PBMCs (0.36%), yet they demonstrate a positive association with senescence. As anticipated, genes differentially expressed between age positively correlated cells and age-independent cells serve as aging-related signatures and display a positive correlation with age (Fig. 5e; for detailed genes, see Supplementary Table 5). These signatures are linked to NK cell sensitivity and activation (Fig. 5f).

Further unsupervised reclustering of CD4+ naive cells resulted in the identification of nine subclusters (Fig. 5g). Although a negative correlation between CD4 naive cells and aging was found through SCIPAC analysis, there was a cumulative increase in subcluster 2 of CD4 naive cells with aging (Fig. 5h). Previous studies have suggested that a specific group of CD4+ naive cells (CD4+, CD31) increase with age44. The subcluster 2 of CD4 naive cells that we identified is probably a subset of the previously reported CD4+ CD31 cells (Extended Data Fig. 8d,e). However, we can now redefine this aging-negatively correlated cell subpopulation as ITGB1 + PREX1 + CD4 naive cells, on the basis of the positive molecular markers (Fig. 5i).

We conducted an analysis to understand the adaptive immune properties associated with aging. Using nested PCR, we enriched TCR/BCR (B-cell receptor) molecules from the cDNA of 5′-RNA UDA-seq libraries. As a result, we obtained 34,784 cells with gene expression and TCR data and 32,958 cells with gene expression and BCR data (Extended Data Fig. 8f). Our analysis focused on examining the relationship of clonal expansion among individual T cells and the usage of V(D)J genes across different age groups. The findings showed that age notably affected TCR clonotype diversity. There was a noticeable decrease in richness, meaning fewer different TCR clonotypes, and an increase in clonality, indicating a higher abundance of specific clonotypes in older individuals (Fig. 5j).

This advancement enables cohort studies to pinpoint rare subpopulations relevant to various phenotypes, such as aging, and offers fresh insights into the clonal expansion dynamics of immune cells.

UDA-seq enables analysis of CRISPR-mediated perturbations

In addition to cohort studies, an important application scenario that necessitates ultrahigh throughput is pooled CRISPR screening with single-cell transcriptome as readout. We conducted an assessment of the knockout impact of bromodomain-containing genes on human gastric cancer SNU16 cell lines by employing the combined use of CRISPR droplet sequencing (CROP-seq)45 and UDA-seq. SNU16 cells expressing Cas9 were transduced with lentiviral constructs that encode 255 single guide RNAs (sgRNAs) targeting 46 genes within the bromodomain and extra-terminal domain (BET) family proteins. Each gene was targeted with at least five sgRNAs, and ten sgRNAs were used as negative controls (for detailed genes and sgRNA, see Supplementary Table 6). After 7 days of infection, cells were subjected to 3′-RNA UDA-seq. A total of 50,000 methanol-fixed cells were loaded in one channel, followed by 96-well post-indexing. Guide RNA with single-cell barcodes was enriched from UDA-seq cDNA products (Fig. 6a). Subsequent single-cell transcriptomic analysis revealed that 38,000 cells were recovered. After quality control, 31,487 cells with sgRNA were retained (Extended Data Fig. 9a,b).

Fig. 6: UDA-seq enables pooled CRISPR screening with single-cell transcriptome readout.
figure 6

a, A flowchart of the UDA CRISPR screening on the SNU16 cell line. Cells were transduced with 255 different gRNAs overall. b, A violin plot illustrating the absence of expression of the targeted genes BRD7 and EP300. c, The genes that exhibit differential expression in cells subjected to interference targeting BRD7 as compared with those targeted by the negative control sgRNA. The vertical dashed lines represented −1 or 1 log2FC value, the horizontal dashed line represented p-value = 0.05. d, The degree of perturbations of the genes of interest, estimated by the scMAGeCK-LR module. An LR_score >0 represents upregulation by the corresponding target genes, while an LR_score <0 represents downregulation. FDR means adjusted p-value calculated from ScMAGeCK. e, Unsupervised clustering of the single-cell transcriptome, revealing two distinct groups (group 1 with n = 3,880 and group 2 with n = 8,707) of SUN16 cells in the PCA space defined by PC1 and PC2. f, A violin plot indicating that the gene expression of FGFR2 was higher in group 2. The box plots show the interquartile range with the median marked. The whiskers extend up to 1.5 times the interquartile range, and the outliers are not displayed. g, The average expression (dots) of the MYC gene within cells grouped by perturbation. The calculations for the target gene versus control and group 1 versus group 2 were performed separately. h, The expression of the MYC gene was notably reduced in cells subjected to interference targeting BRD1 (n = 5), BRD4 (n = 5), BAZ2A (n = 5) and BRD8 (n = 5) compared with the control (n = 10). The box plots show the interquartile range with the median marked. The whiskers extend up to 1.5 times the interquartile range, and the outliers are not displayed. P values were calculated using the two-sided t test. *Hochberg-adjusted P value <0.05 (adjusted P values of 0.039, 0.039, 0.047 and 0.022, respectively). i. The downregulated genes identified for each target gene separately in group 1 and group 2 cells. The enrichment of these downregulated genes per target on MYC-related pathways from the KEGG database is illustrated by \({-\log }_{10}({P \,{\rm{value}}})\). The P values for the KEGG pathway were calculated by the two-sided hypergeometric test and adjusted by the Benjamini–Hochberg method.

UDA-seq’s ultrahigh throughput simplifies downstream analysis. Even with stringent thresholds requiring cells with a single sgRNA or cells with more than one sgRNA but with a dominant one, we were able to identify 12,644 cells. This results in a mean library coverage of 44 cells per sgRNA and 233 cells per target gene, meeting typical requirements. A median of 1,745 genes and 2,497 UMIs are detected per cell in the final dataset (Extended Data Fig. 9c). We verified that targeted genes no longer express in the corresponding sgRNA-perturbed cells, such as BRD7 and EP300 (Fig. 6b). To systematically uncover gene regulation relationships, we used both differential expression gene analysis and ScMAGeCK46 analysis tailored for single-cell perturbation data and found consistent results from both analyses. For example, both analyses indicated that BRD7 is a potential repressor of FOXP1 and an activator of VDAC1 (Fig. 6c,d). These observations (Fig. 6b–d and Extended Data Fig. 9d,e) validate the successful use of CRISPR interference and the high quality of UDA-seq.

The analysis of scRNA-seq data showed that there are two distinct groups of SNU16 cells (Fig. 6e). These groups exhibit different levels of FGFR2 gene expression, with group 2 showing higher levels (Fig. 6f). Both groups contain cells with negative sgRNA and other target sgRNA (Extended Data Fig. 9f), suggesting that the observed differences between the groups are not due to CRISPR perturbation. Previous research has indicated that SNU16 cells display variability, with differing levels of FGFR2 amplification leading to varying responses to drugs47,48,49.

The bromodomain family of proteins are crucial epigenetic ‘readers’50,51. BET inhibitors52,53 have proven to rapidly suppress key proto-oncogenes such as MYC52,53,54,55. Our data support that perturbing these target genes has resulted in a notable decrease in MYC expression in comparison with control groups (Fig. 6g,h and Extended Data Fig. 9g).

To better understand the influence of different Bromodomain family genes on SNU16, we compared the correlation of differentially expressed genes with MYC-related pathways in two groups of SNU16 cells after depleting each target gene. We found that various Bromodomain family genes have differing effects on the MYC pathway in the two groups of SNU16 cells (Fig. 6i). The depletion of BAZ2A, BRD4 and BRD8 adversely affected the differentiation, proliferation, apoptosis and migration-related signaling pathways in group 1 cells. BAZ2A also inhibits those pathways in group 2 cells, suggesting its crucial role in maintaining cell growth and proliferation in both groups. Additionally, the depletion of SP140L, NF1 and SMARCA4 inhibited infection and metabolic pathways specifically in group 2 cells.

We anticipate that the extensive use of UDA-seq will pave the way for a thorough exploration of regulatory processes and support the creation of foundational models in single-cell perturbation modeling.

Discussion

We introduce UDA-seq, a single-cell multimodal sequencing method offering versatility, affordability and reliability for analyzing fresh and fixed cells or nuclei. UDA-seq achieves a 10- to 20-fold increase in throughput while remarkably reducing doublets. Its workflow and data quality are on par with the commercial 10x Genomics kit. We validated UDA-seq on a variety of frozen clinical samples, including PBMCs, kidney biopsies and multiple cancer samples, demonstrating its effectiveness for large-scale single-cell population cohort analysis. The increased sample size and deeper cellular coverage enable the identification of rare cell types associated with specific phenotypes, such as age-associated γδT cells and proteinuria-associated POD cells, even when these cell types constitute fewer than 5 in 1,000 cells. Moreover, UDA-seq can minimize experimental batch bias and variation, especially with limited starting material, making it a valuable tool for researchers conducting high-throughput, large-scale studies.

Some early pioneering work introduced the concept of combinatorial indexing in droplet microfluidics7,8. However, most initial studies primarily supported single modalities, often requiring tailored additional steps to enable multimodality, thus rendering the approach nonuniversal. In contrast, our work provides a universal solution to greatly enhance the throughput of existing single-cell multimodal omics technologies. This method can be readily expanded to augment other multimodality assays, such as cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq)56, Droplet Pair-Tag57, single-nucleus m6A-CUT&Tag (sn-m6A-CT)58, and numerous others.

Importantly, updates from the manufacturer can be seamlessly and synergistically integrated into the UDA-seq workflow. For example, transitioning from the Chromium Controller to Chromium X or updating reagents from Next GEM to GEM-X requires no additional configuration, leading to improved recovery rates and other benefits (Fig. 1g,h). UDA-seq is specifically compatible with 10x Genomics and similar methods that rely on the conditional dissolution of gel beads or photolysis to release barcoded oligonucleotides. However, platforms that require cell lysis within droplets for barcoding are not compatible with UDA-seq.

UDA-seq demonstrates the feasibility of achieving high throughputs for cells and real-world clinical samples by combining combinatorial indexing with droplet microfluidics through a post-indexing approach. Unlike pre-indexing strategies such as single-cell combinatorial fluidic indexing (scifi)-RNA-seq8, the post-indexing method does not inherently provide built-in sample multiplexing capabilities. To address this, we utilized genetic variation for donor deconvolution in this study, minimizing the preprocessing handling time. It is important to note that UDA-seq is compatible with established sample multiplexing methods, such as DNA-labeled antibodies59 and lipids60. This compatibility is particularly valuable in scenarios where SNP-based deconvolution is not applicable, as seen in longitudinal studies of the same donor. We have noticed that an independent method known as Overloading And unpacKing (OAK)-seq61 employs a barcoding strategy similar to UDA-seq. However, OAK-seq primarily focuses on in vitro systems. It nicely complements our work by showcasing the potential of capturing lineage information and its compatibility with Cell Hashing, which uses barcoded antibodies for sample multiplexing.

In conclusion, UDA-seq’s ultrahigh throughput makes it ideal for large-scale single-cell, multisample studies such as large-scale disease cohorts or cellular atlas constructions, as well as high-throughput CRISPR and drug screening, driving advancements in single-cell technology.

Methods

Cell culture

All established cell lines NIH-3T3 (SCSP-515), HeLa (SCSP-504), and SNU16 (TCHu243) were purchased from the National Collection of Authenticated Cell Cultures (Shanghai, China). Cells were cultured at 37 °C in an atmosphere of 5% (v:v) carbon dioxide in DMEM (Gibco, catalog no. 11965092) or RPMI (Gibco, catalog no. 11875093) medium (SNU16) supplemented with 10% fetal bovine serum, and digested with 0.25% Trypsin (catalog no. 25200056; Gibco) for preparing single-cell suspension.

Human specimen sampling

The study was approved by the Ethics Committee of the Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation (2024H013 and 2023H001), Peking Union Medical College Hospital (I-23PJ739) and West China Hospital of Sichuan University (2024-170) and was conducted following the Declaration of Helsinki. All the patients in this study provided written informed consent. Participants were not compensated.

Human PBMCs

The peripheral blood samples of healthy women were from the Quzhou Affiliated Hospital of Wenzhou Medical University. All the patients in this study provided written informed consent. The fresh peripheral blood was collected in EDTA anticoagulant tubes and processed immediately. PBMCs were isolated according to the manufacturer’s instructions for Ficoll-Paque PLUS (17-1440-02; GE Healthcare). Isolated PBMCs were gently suspended in cryopreservation medium CELLBANKER2 (11891; AMSBIO), and stored in liquid nitrogen before further processing. After thawing, dead cells were removed using the Dead Cell Removal Kit (catalog no. 130-090-101; Miltenyi Biotec). For the human aging cohort study, equal numbers of live cells were pooled for 10 and 28 samples per pool. Cells were fixed, permeabilized and immediately subjected to sc5′-RNA and VDJ-seq UDA-seq. After fixation, 100,000 cells (for 10 donors) and 250,000 cells (for 28 donors) were subjected to two channels of UDA followed by 96-well post-indexing.

Human kidney biopsy

A total of 35 human kidney samples were obtained for this study: 25 from nephritic patients from the Peking Union Medical College Hospital (PUMCH) and 10 time-zero transplantation biopsy samples from the West China Hospital of Sichuan University, under institutional review board-approved protocols. All samples were dissected from excess tissue from needle biopsy samples used for diagnostic purposes. Each patient biopsy was approximately 2–3 mm (equivalent to around one-fifth of a 16G or 18G needle biopsy). Samples were from 18 male and 17 female participants. All specimens were snap frozen with liquid nitrogen immediately after sampling and stored in liquid nitrogen before processing. The metadata for the samples are given in Supplementary Table 1. On the day of the experiment, all 35 human kidney samples were combined to isolate nuclei, and two channels of Multiome (single-nuclear (sn)3′-RNA and ATAC-seq) UDA-seq were performed. Each channel was loaded with 300,000 nuclei and then subjected to 96-well post-indexing.

Human lung cancer, cholangiocarcinoma, pancreatic cancer and heart (coronary artery) samples

Human lung cancer samples were obtained from the Fudan University Shanghai Cancer Center Hospital. Cholangiocarcinoma samples were obtained from the Beijing Tsinghua Changgung Hospital. Human pancreatic cancer samples were obtained from the Beijing Cancer Hospital. Human heart samples were obtained from the Wuhan University Zhongnan Hospital. All samples were obtained from patients’ leftover surgical tumor tissue and stored at −80 °C before processing. These samples were combined in equal mass to isolate the nuclei. A total of 300,000 nuclei were then subjected to Multiome (sc3′-RNA and ATAC-seq) UDA-seq, followed by 96-well post-indexing.

CRISPR screening cell lines

We cloned 255 guide RNA cassettes into the CROP-seq-Guide-Puro plasmid (catalog no. 86,708; Addgene), targeting 46 genes of the human bromodomain family with at least five guide RNAs each and including ten nontargeting controls (sequences are provided in Supplementary Table 6). SNU16 cells were transduced with lentiCas9-Blast (catalog no. 52,962; Addgene) and selected with 25 µg ml−1 blasticidin (catalog no. ant-bl-5; InvivoGen) to achieve stable expression of Cas9. Then, SNU16 cells expressing Cas9 were transduced with lentiviral constructs encoding 255 guide RNAs. Following overnight incubation at 37 °C and 5% CO2, the medium was exchanged to select for the guide RNA construct (RPMI, 10% FCS, penicillin–streptomycin, 25 µg ml blasticidin, 2 µg ml−1 puromycin, catalog no. A1113803; Thermo Fisher Scientific). The selective medium was renewed every 2–3 days for a duration of 7 days to allow for efficient genome editing. Seven days after transfection, the cells after perturbation were collected to apply UDA-seq (3′-RNA), followed by 96-well post-indexing.

Preparation for UDA-seq

Cell fixation and permeabilization

A total of 1 million live cells (NIH 3T3, Hela, PBMCs or SNU16) were fixed and permeabilized by suspending in 100 μl of 1× PBS and 900 μl of ice-cold methanol (catalog no. M116121; Aladdin) at −20 °C for 10 min. After two additional washes (centrifugation at 300g for 5 min at 4 °C) with 1 ml of chilled PBS-BSA-RNase inhibitor (1× PBS supplemented with 1% w/v BSA, A1933; Sigma) and 1 U µl−1 RNase inhibitor (3335399001; Sigma), fixed and permeabilized cells were resuspended in 40 µl PBS-BSA-RNase inhibitor, placed on ice and then immediately sent for UDA-seq.

Isolation of nuclei from human tissues

For single nucleus isolation, all steps were completed on ice. Frozen tissues were placed into a 1.5 ml microcentrifuge tube with 0.5 ml of chilled 0.1× lysis buffer (10 mM Tris–HCl pH 7.4 (catalog no. T2194; Sigma-Aldrich), 10 mM NaCl (catalog no. 59222C; Sigma-Aldrich), 3 mM MgCl2 (catalog no. M1028; Sigma-Aldrich), 0.1% Tween-20 (catalog no. 28320; Thermo Fisher), 0.01% NP40 (catalog no. 59222C; Sigma-Aldrich), 0.001% digitonin (catalog no. BN2006; Thermo Fisher), 1 mM DTT (catalog no. 646563; Sigma-Aldrich), 1% BSA (catalog no. A1933; Sigma-Aldrich) and 1 U μl−1 RNase inhibitor (catalog no. N2615; Promega)), and large tissues (such as lung cancer, cholangiocarcinoma, coronary artery and pancreatic cancer) were quickly minced into small pieces (~0.1 mm) on ice using scissors, then immediately homogenized 15 times using a pellet pestle. After incubation for 5 min on ice, the mix was pipetted 10 times with a wide-bore pipette tip then incubated for 10 min on ice before adding 1 ml chilled wash buffer (10 mM Tris–HCl pH 7.4, 10 mM NaCl, 3 mM MgCl2, 0.1% Tween-20, 1 mM DTT, 1% BSA and 1 U μl−1 RNase inhibitor) to the lysed sample. The mix was pipetted five times, and the suspension was passed through a 70 µm Flowmi cell strainer before being filtered through a 40 µm Flowmi cell strainer into a new 1.5 ml tube, then centrifuged at 500g for 5 min at 4 °C. The supernatant was removed without disrupting the nuclei pellet, then the pellet was resuspended in chilled PBS-RNase inhibitor (1× PBS supplemented with 1 U µl−1 RNase inhibitor (3335399001; Sigma)). If cell debris and large clumps were observed, it was passed through a 40 µm cell strainer again.

Isolation of nuclei from cell lines

Cells (100,000–3,000,000) were washed with 2 ml ice-cold 1× DPBS (catalog no. 14190144; Gibco) and centrifuged at 300g for 5 min at 4 °C. The cell pellets were resuspended in 200 μl chilled 1× lysis buffer (10 mM Tris–HCl pH 7.4, catalog no. T2194; Sigma-Aldrich), 10 mM NaCl (catalog no. 59222C; Sigma-Aldrich), 3 mM MgCl2 (catalog no. M1028; Sigma-Aldrich), 0.1% Tween-20 (catalog no. 28320; Thermo Fisher), 0.1% NP40 (catalog no. 59222C; Sigma-Aldrich), 0.01% digitonin (catalog no. BN2006; Thermo Fisher), 1 mM DTT (catalog no. 646563; Sigma-Aldrich), 1% BSA (catalog no. A1933; Sigma-Aldrich) and 1 U μl−1 RNase inhibitor (catalog no. N2615; Promega)), and the mix was pipetted ten times. After incubation for 3–5 min on ice, 1 ml chilled wash buffer (10 mM Tris–HCl, pH 7.4, 10 mM NaCl, 3 mM MgCl2, 0.1% Tween-20 and 1% BSA) was added to the lysed cells. The mix was pipetted five times then centrifuged at 500g for 5 min at 4 °C and resuspended in ice-cold PBS-RNase inhibitor.

Fixation of nuclei

Nuclei were fixed by adding formaldehyde (catalog no. F8775; Sigma-Aldrich, at a final concentration of 1% for cell lines or 1.5% for primary tissues) and incubated at room temperature for 10 min, then centrifuged at 600g for 5 min at 4 °C to remove supernatant. All centrifugations were performed on a swing bucket centrifuge. Then, the cell pellet was washed twice with 1 ml PBS-BSA-RNase inhibitor and centrifuged at 500g for 5 min at 4 °C. The supernatant was removed without disrupting the nuclei pellet, after which the pellet was resuspended in chilled diluted nuclei buffer (PN-2000153, 10x Genomics) and placed on ice. Subsequently, we immediately proceeded with the UDA-seq.

i7-only Tn5 transposome assembly

i7-only Tn5 transposome was assembled for UDA-seq transcriptome library construction. In detail, the assembly of the i7-only Tn5 transposome was performed according to the manufacturer’s instructions for TruePrep Tagment Enzyme (catalog no. S601-01; Vazyme) reagent. Tn5-top_ME and TN5_R2_index (sequences are provided in Supplementary Table 7) were synthesized and purified by high-performance liquid chromatography (Sangon Biotech, Shanghai) and dissolved into the annealing buffer at a final stock concentration of 10 µM. For the annealing reaction, oligonucleotides were mixed at a 1:1 molar ratio at 10 μM and mixed thoroughly. The samples were annealed using the following thermocycling parameters: 75 °C for 15 min, 60 °C for 10 min, 50 °C for 10 min, 40 °C for 10 min and 25 °C for 30 min. After this step, the oligonucleotide cassette can be aliquoted and frozen at −20 °C for future transposome assemblies. To assemble the Tn5 transposase, we mixed 14 µl of oligonucleotide cassette from the previous step with 8 µl of TruePrep Tagment Enzyme and 78 µl coupling buffer, mixed well and then incubated for 1 h at 30 °C in a thermocycler. The resulting 100 µl of assembled transposome can be stored at −20 °C for at least 1 month.

sc3′-RNA-seq and sc5′-RNA-seq UDA-seq

For the step-by-step protocol for scRNA UDA-seq, see ref. 62.

Microfluidics droplet barcoding

Fixed whole cells or nuclei were counted by using a LUNA automated cell counter (Luna-FL). A certain number of cells were partitioned and barcoded using the 10x Genomics Chromium Controller (10x Genomics) and the Single Cell 3′ Library and Gel Bead Kit (10x Genomics; PN-PN-1000268) or Single Cell 5′ Library and Gel Bead Kit (10x Genomics; PN-1000263), or upgraded instrumentation (Chromium X, which can generate twice the number of droplets) and upgraded 10x Genomics reagents (Chromium GEM-X Single Cell 5′ Reagent Kits v3) according to the manufacturer’s recommended protocol. After the generation of single-cell gel bead-in-emulsions (GEMs), in situ reverse transcription reaction was performed at 53 °C for 45 min with a 4 °C hold.

Release of cells from droplets and splitting

Recovery agent (125 µl) was added to the GEMs immediately following the completion of the RT reaction, to release cells. The recovery agent/partitioning oil (pink) was slowly removed from the bottom of the tube and and discarded. PBS (120 μl) was added to the remaining aqueous phase (80 μl), mixed thoroughly and then the liquid was dispensed evenly into a 96-well plate, with each well receiving 2 μl. After brief centrifugation, the products were incubated in a thermomixer with 85 °C for 5 min and a 4 °C hold.

Cell purification

The subsequent purification step can be accomplished in three distinct ways. (1) The first method is direct PCR, as the amount of liquid added to each well after the preceding reaction step is minimal and has a limited impact on the PCR reaction system. (2) The second method entails enzymatic digestion to remove the excess oligos, followed by PCR. In brief, 2 μl of ExoSAP-IT (catalog no. 78201.1.ML; Applied Biosystems) was added to each well. After brief centrifugation, the products were incubated in a thermomixer at 37 °C for 30 min and 80 °C for 25 min. This method is capable of detecting 15% more genes than the first method. (3) The third method employs the Dynabeads SILANE genomic DNA kit (catalog no. 37012D; Invitrogen) to purify the products according to the manufacturer, after which PCR is performed. This method detects a comparable number of genes to the second method, but the procedure is more intricate. Consequently, the second approach is the optimal choice in terms of data quality and ease of use.

Post-indexing PCR and purification

Next, index PCR mix (1× KAPA HiFi HotStart Ready Mix (catalog no. KK2602; KAPA), 0.5 μM partial template switch oligo (TSO)/IS primer and 0.5 μM Truseq-i5 index primer (oligo sequences are provided in Supplementary Table 7)) was added to each well (note that the Truseq-i5 index primer is unique in each well) in a final volume of 20 μl. After brief centrifugation, the PCR mix was incubated in a thermomixer to perform enrichment index PCR with 98 °C for 3 min, then 10 or 11 cycles of 98 °C for 15 s, 63 °C for 20 s and 72 °C for 1 min, and finally 72 °C for 1 min. All indexed PCR products in one 96-well plate were pooled and purified with 0.6× XP beads (catalog no. A63881; Beckman Coulter), eluting in 40 µl of EB buffer (catalog no. 19,086; Qiagen).

sc-RNA-seq library construction

cDNA products (50 ng) were mixed with 15 µl i7-only TN5 Tagmentation Mix (5 µl self-assembly i7-only TN5, prepared as described above) and 10 μl 5× reaction buffer (catalog no. S601-01; Vazyme) in a final volume of 50 μl. Tagmentation reactions were incubated at 55 °C for 10 min, followed by purification with 0.8× XP beads, eluting in 40 µl of EB buffer. Then, fragmented cDNA was mixed with 60 µl of library enrichment mix (50 µl NEBNext Ultra II Q5 Master Mix (catalog no. M0544L; NEB), 5 µl 10 µM Partial P5 and 5 µl 10 µM Nextare-i7 index primer). The reaction was amplified in a thermocycler at 72 °C for 5 min, 98 °C for 45 s, eight or nine cycles of 98 °C for 20 s, 60 °C for 30 s and 72 °C for 1 min, and finally 72 °C for 5 min, with storage at 4 °C. Libraries were size-selected with 0.6–0.8× XP beads and eluted in 25.5 µl EB. The final libraries were quantified using a Bioanalyzer high-sensitivity DNA chip (50674626; Agilent) and Qubit HS assay (Q32854; Thermo Fisher Scientific) and then sequenced in MGISeq-T7 (MGI Tech Co., Ltd.) with a 150 bp paired-end read length, targeting a depth of 10,000–50,000 reads per cell.

Single‑cell VDJ enrichment from UDA-seq (sc5′-RNA-seq)

VDJ libraries were enriched from the PBMC cDNA product of UDA-seq (sc5′-RNA-seq) by using two consecutive nested PCRs and VDJ library preparation according to the manufacturer’s protocol (Chromium Single Cell V(D)J Reagent Kits, 10x Genomics). The final libraries were sequenced in MGISeq-T7 (MGI Tech Co., Ltd.) with a 150 bp paired-end read length, targeting a depth of 5,000 reads per cell.

Multiome (sn3′-RNA and ATAC-seq) UDA-seq

For a step-by-step protocol of Multiome UDA-seq, see ref. 63.

Chromatin tagmentation

A certain number of fixed nuclei (5 μl) were resuspended in transposition mix (7 μl ATAC buffer B, 3 μl ATAC enzyme B, PN-1000283; 10x Genomics) and incubated at 37 °C for 60 min.

Microfluidics droplet barcoding

After chromatin tagmentation, nuclei were partitioned and barcoded using the Chromium Controller (10x Genomics) and the Chromium Next GEM Single Cell Multiome ATAC + Gene Expression (PN-1000283; 10x Genomics) according to the manufacturer’s recommended protocol. After the generation of single-cell GEMs, in situ RT reaction and ligation reaction were performed at 37 °C for 45 min and 25 °C for 30 min with a 4 °C hold.

Release of cells from droplets and splitting

The operation of releasing nuclei and spilling to 96-well plates is the same as UDA-seq (sc3′-RNA-seq and sc5′-RNA-seq). Then, 1 μl of proteinase K (catalog no. W9527; Tiangen) and 5 μl of EB buffer were added to each well (with 2 μl of products), followed by thorough vortexing. After brief centrifugation, products were incubated in a thermomixer with 55 °C for 10 min and 85 °C for 5 min and a 4 °C hold.

Purification of nuclei

Dynabeads SILANE genomic DNA kit (catalog no. 37012D; Invitrogen) was employed to purify the products according to the manufacturer. A final 10 μl of EB buffer for each well was used to elute the beads.

Post-indexing PCR for both modality and purification

Next, 20 μl of index PCR mix (15 µl NEBNext Ultra II Q5 Master Mix (catalog no. M0544L; NEB), 0.5 μM Partial TSO/IS, 0.5 μM Truseq-i5 index primer, 0.5 μM P5 primer and 0.5 μM Nextare-i7 index primer) (Sangon Biotech; oligo sequences are provided in Supplementary Table 7) was added to each well (note that the Truseq-i5 index primer and Nextare-i7 index primer are unique in each well), each of which already contained the nuclei purification products. After brief centrifugation, index PCR reaction was performed at 72 °C for 5 min, 98 °C for 3 min, then seven cycles of 98 °C for 20 s, 63 °C for 30 s and 72 °C for 1 min, and finally 72 °C for 1 min. Then, half of the indexed PCR products in one 96-well plate were pooled and purified with 1.6× XP beads, eluting in 160 µl of EB buffer.

sc-ATAC-seq library construction

The post-indexing product (40 µl) was mixed with 60 µl of ATAC-seq enrichment mix (50 μlx KAPA HiFi HotStart Ready Mix, 5 μl of 10 μM Partial P5 and 5 μl of 10 μM P7 primer). The reaction was amplified in a thermocycler at 98 °C for 45 s, then seven to nine cycles of 98 °C for 20 s, 67 °C for 30 s and 72 °C for 20 s, and finally 72 °C for 1 min. Libraries were size-selected with 0.5–1.4× XP beads and eluted in 25.5 μl EB. The final libraries were sequenced in MGISeq-T7 (MGI Tech Co., Ltd.) with a 100 bp paired-end read length, targeting a depth of 10,000–50,000 reads per cell.

cDNA enrichment

The post-indexing product (40 µl) was mixed with 60 µl of cDNA enrichment mix (50 μlx KAPA HiFi HotStart Ready Mix, 5 μl of 10 μM Partial P5 and 5 μl of 10 μM Bio-Tso/ISPCR). The reaction was amplified in a thermocycler at 98 °C for 3 min, then four or five cycles of 98 °C for 15 s, 63 °C for 20 s and 72 °C for 1 min, and finally 72 °C for 1 min. The product was then purified with Dynabeads MyOne Streptavidin C1 (catalog no. 65,001; Invitrogen). In brief, 10 µl of Dynabeads MyOne Streptavidin C1 was washed twice with 1× B&W buffer (5 mM Tris pH 8.0, 1 M NaCl, 0.5 mM EDTA). After washing, the beads were resuspended in 100 µl of 2× B&W buffer (10 mM Tris pH 8.0, 2 M NaCl, 1 mM EDTA) and mixed with the sample. The mixture was rotated on an end-to-end rotator at 10 rpm for 60 min at room temperature. The lysate was put on a magnetic stand to separate the supernatant and beads, then the beads were washed with 100 μl 1× B&W buffer, and 100 μl EB buffer again. Finally, beads were resuspended in 40 μl nuclease-free water. Enriched cDNA products (40 μl) (with C1 beads) were then mixed with cDNA amplifaction mix (50 μlx KAPA HiFi HotStart Ready Mix, 5 μl of 10 μM Partial P5, 5 μl of 10 μM Tso/ISPCR). The reaction was amplified in a thermocycler at 98 °C for 3 min, then four or five cycles of 98 °C for 15 s, 63 °C for 20 s and 72 °C for 1 min, then finally 72 °C for 1 min. Amplified cDNA was then purified with 0.6× XP beads and eluted in 40 μl EB.

sc-RNA-seq library construction

cDNA products (50 ng) were taken for further library construction, the same as for UDA-seq (sc3′-RNA-seq and sc5′-RNA-seq). The final libraries were then sequenced in MGISeq-T7 with a 150 bp paired-end read length, targeting a depth of 10,000–50,000 reads per cell.

UDA-seq collision rate estimate

The collision rate estimation followed the single-cell combinatorial indexed cytometry sequencing (SCITO-seq) formula with the 10x Genomics Chromium parameters setting.

Species mixture analysis

For the two-species mixture (human and mouse cell line) scRNA-seq, sequencing data were mapped to a human and mouse mixture genome reference (hg38+mm10), and mapped and quantified by using the Cell Ranger Single Cell software suite (version 6.0.2). We defined a cell as a mouse cell if its mm10 UMI ratio was greater than 0.85. A cell was considered to be a human cell if its hg38 UMI ratio was greater than 0.85.

For the three-species mixture (human cell line, mouse cell line and locust tissue) scRNA-seq, sequencing data were mapped to a human, mouse and locust ref. 64 by using the Cell Ranger Single Cell software suite (version 6.0.2). Similarly, a cell was determined to be a locust cell if its locust UMI count number was two fold greater than other species’ UMI count number, and similarly for mouse and human cell species determination.

For the two-species (human and cell line) Multiome analysis, sequencing data were mapped to a human and mouse mixture genome reference by using Cell Ranger ARC (version 2.0.1). For the scRNA-seq part, the determination of a cell’s species was the same as the scRNA-seq species mixture analysis described above. For the scRNA-seq and scATAC-seq paired profiled part, a cell was considered to be a human cell if its hg38 UMI ratio was greater than 0.85 or if its hg38 ATAC-seq fragment ratio was greater than 0.85.

GEM-X PBMC scRNA-seq cell state annotation

To infer the detailed cell states within a single PBMC sample profiled using UDA-seq with GEM-X reagents, we conducted Seurat v4 reference mapping to align our query data to the highly detailed cell state annotation presented in the Human Severe Acute Respiratory Syndrome Coronavirus 2 challenge study65, in which 83 cell states from multiple PBMC samples collected under different coronavirus disease 2019 infection conditions were annotated, leveraging both RNA expression and surface protein (CITE-seq) measurements. We annotated our query cells on the basis of the predicted cell state assignments and manual curation. We further leveraged the reference data to infer the imputed surface protein levels for the query cells. Finally, we calculated a weighted nearest neighbor (WNN) graph to integrate a weighted combination of the RNA assay and predicted_ADT assay, and further used this graph to generate the uniform manifold approximation and projection (UMAP) visualization.

Basic quality comparison of the sequencing data

We extensively tested the single-cell Multiome to compare data quality. Our testing involved various cell lines, such as mouse NIH 3T3 and human Hela cells, as well as multiple human tissue samples, including kidney biopsy, lung cancer and pancreatic cancer. Different tissue types have an impact on the quality of single-cell data. To make comparisons, we downloaded raw data from publicly available datasets generated using the standard 10x Multiome kit for the NIH 3T3 cell line and the corresponding tissue samples. We applied consistent quality control criteria to all datasets, keeping only cells with more than 200 detected genes and less than 10% mitochondrial percentage, to ensure consistency. Initially, we processed all data using cellranger-arc 2.0.1 with a 10x-provided reference genome to obtain the exact number of high-quality cells in each dataset. Then, we uniformly downsampled the raw data to 20,000 read pairs per cell for ATAC data and 20,000 reads per cell for RNA data. After downsampling, we reran the cellranger-arc process to generate the final output matrices. These matrices were then used to evaluate the data quality across different samples. For samples with original sequencing depths higher than 20,000 reads per cell, we also conducted comparisons at multiple sequencing depths.

The public datasets used in this comparison were sourced as follows: the scMultiome data for NIH 3T3 cells were obtained from PRJEB50424, the kidney biopsy data from the GSE220251 dataset, the lung cancer data from the GSE241468 dataset and the pancreatic cancer data from the GSE241895 dataset.

For the comparison of 5′ scRNA-seq, we selected three datasets from the 10x Genomics website. These datasets utilized different chemistry versions and were compared with UDA-seq data. To ensure consistency, we downsampled datasets with more than 50,000 reads and recalculated matrices. The quality control followed uniform criteria, retaining all cells with mitochondrial counts below 5%.

For 5′ scRNA-seq, scVDJ-seq and surface protein data, the following sources were used for comparison:

  1. (1)

    For comparing scRNA-seq and scVDU-seq, ref. 66 (Kit_v1)

  2. (2)

    For comparing scRNA-seq, ref. 67 (Kit_v2)

  3. (3)

    For comparing scRNA-seq, ref. 68 (Kit_v3)

To calculate the TCR and BCR detection ratios, T cells were identified by the surface protein CD3-UCHT1, while B cells were identified by the surface protein CD19-HIB19. The TCR detection percentage was calculated as (number of cells with TCR)/(total T cell number), and the BCR detection percentage was calculated as (number of cells with BCR)/(total B cell number).

Female aging PBMC scRNA-seq and VDJ-seq data preprocessing

For sequencing data mapping and quantifying, in UDA-seq, 10x barcodes were round 1 barcodes, and post-index were round 2 barcodes. FASTQ files were firstly grouped by post-index (index 1–96 in the human kidney case). For each post-index, the Cell Ranger (version 7.0.0) ‘count’ command was applied to map and quantify the scRNA-seq part of the data. The Cell Ranger (version 7.0.0) ‘vdj’ command was applied to map and quantify paired scVDJ-seq data. Finally, a cell barcode was formed by the combination of the 10x barcode (round 1 barcode) and post-index (round 2 barcode). The quantified matrix (UMI count matrix) was aggregated by cell barcode. Demuxlet (version 2) was applied to scRNA-seq mapping results (bam files) and each donor’s imputed (by minimac369) Illumine Infinium Asian screening array SNP file SNP sequencing results (VCF file, processed by Plink2 v2.070 to assign the cell barcode to a specific donor). Droplets containing two or more cells were excluded using Demuxlet and Scrublet71 (version 0.2.3). scVDJ-seq mapping results were paired with scRNA-seq by corresponding paired cell barcodes using scRepertoire72.

Female aging PBMC cohort analysis

For quality control and annotation of scRNA-seq, firstly, cells with fewer than 200 genes and mitochondria percentage <5% were filtered out. Then, the batch effect was removed by using Harmony (version 1.0)73 and unsupervised clustering by Seurat, and cells from clusters 4, 9 and12 were filtered out since those clusters had low UMI content. For the remaining cells, batch effects were removed again by Harmony and reclustering by the Seurat pipeline. After that, label transfer was performed from the PBMC CITE-seq data3 by using Seurat (with the FindTransferAnchors and MapQuery functions). Visualization was done by using Scanpy74 (version 1.9.1). For scVDJ-seq data, quality control was done by scRepertoire with default parameters. Moreover, cells from scVDJ-seq that were not successfully annotated by scRNA-seq (using the annotation procedure described above) were filtered out.

For age positively related gene signature analysis, firstly, a pseudo-bulk RNA-seq profiled donor was formed by aggregating the expression values from a group of cells from the same donor. Secondly, (1) age as a continuous phenotype with (2) the pseudo-bulk RNA-seq profile and (3) the snRNA-seq profile were used as co-inputs, and SCIPAC (version 1.0.0) was applied to efficiently find cell types associated with age. Age positively related cells were defined by a SCIPAC association >0 and a corresponding SCIPAC significance P value below 0.05. The age positively related signature gene set was selected by differentially expressed gene between age positively-associated cells and non-associated cells with a log2 fold change of average expression >1 and Bonferroni-adjusted P value of <0.05. Gene Ontology (GO) analysis of the age positively associated genes was performed by using clusterProfiler75 (version 4.8.1).

For the analysis of the subgroup of CD4 naive cells, unsupervised reclustering was performed on CD4 naive scRNA-seq data by using the Seurat pipeline, as follows: (1) normalization of raw UMI counts matrix by using the NormalizeData function with default parameters, (2) finding the top 2,000 highly variable genes by using the FindVariableFeatures function with default parameters, (3) principle component analysis (PCA) performed by using the ScaleData and RunPCA functions with default parameters, (4) removing the batch effect by the Harmony73 method, (5) applying unsupervised clustering by using the FindNeighbors (parameters: dims = 1:20, reduction = “harmony”) and FindClusters (parameters: resolution = 0.5 and other default parameters), functions and (6) UMAP performed by the RunUMAP function, using the top 20 Harmony-corrected principle components (PCs). The protein expression level was imputed by using PBMC CITE-seq data with Seurat. Marker genes of cluster 2 in CD4 naive cells were found by comparison of cluster 2 with other CD4 naive cells.

Human kidney sequencing data preprocessing

For sequencing data mapping and quantifying, 10x barcodes were round 1 barcodes and post-index were round 2 barcodes. Multiome sequencing data FASTQ files were firstly grouped by post-index (in this case, index 1–96 from two channels). For each post-index, Cell Ranger ARC (version 2.0.1) was applied for mapping and quantification. A cell barcode was formed by the combination of the 10x barcode (round 1 barcode) and post-index (round 2 barcode). Demuxlet was applied on both the snRNA-seq and snATAC-seq mapping results (bam format files) and each donor’s imputed (by Michigan Imputation Server; https://imputationserver.sph.umich.edu/) Illumine Infinium Asian screening array SNP file (VCF format file, processed by Plink2 version 2.0) to assign the cell barcode to a specific donor. Doublets with multidonor assignments were removed.

For data quality control, for snRNA-seq data, cells whose gene number was greater than 200 and mitochondria gene percentage was <5% survived. For snATAC-seq data, cells whose transcriptional start side (TSS) enrichment score was >4 and fragment number was >1,000 survived. Cells that survived both snRNA-seq and snATAC-seq quality control procedures were used in downstream analysis.

Kidney snRNA-seq part analysis

After data quality control, unsupervised clustering and visualization were carried out by using the Seurat pipeline, as follows: (1) normalization of the raw UMI count matrix by using the NormalizeData function with default parameters, (2) finding the top 2,000 highly variable genes by using the FindVariableFeatures function with default parameters, (3) PCA by using the ScaleData and RunPCA functions with default parameters, (4) unsupervised clustering by using the FindNeighbors (with parameter dims = 1:20 and others at default) and FindClusters (with parameter resolution = 2.5 and others at default) functions and (5) UMAP performed by using the function RunUMAP with the top 20 PCs.

Annotation of kidney snRNA-seq results was achieved by using Seurat label transfer procedures. Labels of kidney cell type annotations (subclass.l2) were transferred from a kidney atlas76 to snRNA-seq data in this study with the Seurat FindTransferAnchors (parameter dums = 1:30) and TransferData (with default parameters) functions. Cells with prediction score ≤0.4 were filtered out. Abbreviations and full names for all cell types can be found on https://github.com/yyt1010/UDA_seq_Kindey_Cell_full_name/tree/main.

Kidney snATAC-seq part analysis

After mapping, we obtained 96 sublibrary fragment files for two channels, respectively. On the basis of the cell barcodes, more than 200,000 cells paired with high-quality snRNA-data were selected from 96 × 2 = 196 scATAC fragment files. New fragment files for 44 donors (one file per donor) were rebuilt with these cells for subsequent analysis. First, ArchR (version 1.0.2) was used to create arrow files (filtering on TSS ≥4 and filterFrags ≥1,000) with the addition of TileMatrix and GeneScoreMatrix. After cell annotation in snRNA-seq, cells that were both high quality in snATAC-seq and annotated on snRNA-seq data remained. In summary, we obtained 207,671 cells with a median TSS enrichment score of of 10.6 and 8,116 fragments. Quality control plots were visualized for TSS enrichment, fragments and fragments in peak (FRiP), grouped by states (health or disease) or donors, respectively. Secondly, dimensionality reduction was applied by using iterative latent semantic indexing based on TileMatrix (with default parameters plus iterations = 2 and dimsToUse = 1:30), followed by cell clustering with the addClusters function (with resolution = 2 and maxClusters = 50). In this step, 42 clusters were identified. The visualization by UMAP was thus finished (with default parameters of nNeighbors = 30 and minDist = 0.5). Thirdly, cell types with <50 cells were filtered out. Then, addGroupCoverages, addReproduciblePeakSet and addPeakMatrix with default parameters were used for peak calling. After peak calling, marker peaks and marker genes in each cell type were obtained by using getMarkerFeatures with PeakMatrix and GeneScoreMatrix, respectively. For the visualization of cell type-specific marker peaks located on the promoter of marker genes, the scaled matrix was obtained by using plotMarkerHeatmap(seMarker = markerGenes, cutOff = “FDR < = 0.1 & Log2FC ≥ 2”, transpose = T, returnMatrix = T). The marker genes with marker peaks for each cell type were selected from the matrix by using the following cutoff: a value of one cell type of 2, all values of other cell types of <−1 and a standard deviation of other cell types’ values of <0.3. Then, getGroupBW(groupby = “cell_type”, normMethod = “None”) was used to obtain bigwig files for each cell type, followed by IGV visualization of selected marker genes as mentioned above. MotifMatrix was obtained by using the addDeviationsMatrix function. Adding the gene expression matrix to the ArchR project by using addGeneExpressionMatrix, peak to gene links were calculated by using addPeak2GeneLinks.

Human kidney proteinuria positive-related cell type analysis

In this study, we regarded donors with 24 h urinary protein ≥3.5 g as having massive proteinuria and as nonproteinuria donors otherwise. A pseudo-bulk RNA-seq profiled donor was formed by aggregating the expression values (UMI counts) from a group of cells from the same donor. Cells annotated as altered states were not taken into consideration. Three types of data, viz. (1) binary phenotype data for each donor: proteinuria or nonproteinuria, (2) pseudo-bulk RNA-seq profiled data and (3) snRNA-seq profiled data, were used as inputs (using randomly selected half of the cells, for computational efficiency) for Scissor (version 2.0.0) with parameter α = 0.8 to identify the proteinuria positively related cell subpopulation (Scissor+ cells).

For proteinuria positive signature analysis, differential gene expression was performed between proteinuria positively related cells (Scissor+ cells) and other cells by using the R package Seurat (version 4.9) FindMarkers (with parameter max.cells.per.ident = 500 and others at default) function. Highly expressed genes in proteinuria positively related cells were selected for Bonferroni-adjusted P value <0.01, log2 fold change of average expression of >1 and non-gender-related genes. The single-cell-level proteinuria positive signature score was calculated with those highly expressed genes by using the Seurat function AddModuleScore with default parameters. The donor-level proteinuria signature score was calculated as the median value of single all cells’ signature scores from that donor. The differential significance between each group of the donor was calculated by using the R package rstatix (version 0.7.0).

To characterize cell–cell communication differences between immune cells, EC-type cells and podocytes between proteinuria greater than or equal to 3.5 g per 24 h and healthy human controls, the CellChat (version 1.6.1) package was used to identify major signal strength differences in a given signaling network and predict critical information flows for specific cell types.

Human kidney podocyte proteinuria positive-related cell subcluster analysis

Firstly, only podocytes from the Scissor outputs were selected with unsupervised reclustering on podocyte snRNA-seq data by using the Seurat pipeline, as follows: (1) normalization of the raw UMI count matrix by using the NormalizeData function with default parameters, (2) finding the top 2,000 highly variable genes by using the FindVariableFeatures function with default parameters, (3) PCA by using the ScaleData and RunPCA functions with default parameters, (4) unsupervised clustering by applying the FindNeighbors (with parameter dims = 1:20 and the others at default) and FindClusters (with parameters resolution = 0.8 and others at default) functions and (5) UMAP by using the RunUMAP function with the top 20 PCs.

For proteinuria phenotype enrichment analysis, the proteinuria phenotype enrichment score of cluster i was calculated by using the formula

$${\mathrm{Enrichment}}_{i}={\log }_{2}\left(\frac{{P}_{i}}{{{Ex}}_{i}}+{\mathrm{pseudo}}{{\_}}{\mathrm{value}}\right),$$
(1)

where \({P}_{i}\) is the proportion of cluster i in the proteinuria positive-related cells identified by Scissor (Scissor+ cells), \({{Ex}}_{i}\) is the expected proportion of cluster i in proteinuria positive-related cells, while setting pseudo_value to 0.000001, to avoid meaningless operations. The podocyte subcluster whose enrichment score was highest and greater than 0 was regarded as a proteinuria phenotype-enriched subcluster.

For eRegulon identification in podocytes, firstly, the ATAC-seq peak matrix (obtained from the ArchR version 1.0.2 peak caller) was subsetted according to the podocyte’s cell barcode. Peaks that were open in fewer than five cells were filtered. The top 100,000 highly variable open peaks identified by the Seurat function FindVariableFeatures (with parameters selection.method = “disp” and nfeatures = 1,000,000) were retained for downstream analysis. Secondly, a set of genes differentially expressed between cluster 1 and other clusters (except cluster 0, since it had a small value of positive enrichment of proteinuria phenotype) was selected based on having a log2 fold change of average expression greater than 0.5. Only these genes were retained for downstream analysis. Thirdly, an snATAC-seq topic-based model was built by using the Cis-topic77 algorithm with the Python package pycisTopic (version 1.0.3) run_cgs_models function (with parameters n_topics = 10, randon_state = 555, alpha = 50, alpha_by_topic = True, eta = 0.1 and eta_by_topic = False). The methods used for the binarization of each topic were Otsu and ntop (nTop = 3,000) with the binarize_topics function. The accessibility of regions was imputed and normalized by using the impute_accessibility (parameter scale_factor = 10**6) and normalize_scores (parameter scale_factor = 10**4) functions. Peaks from Otsu and the top 3,000 for each topic were selected as candidate enhancers for downstream analysis. Then, motif enrichment analysis was performed by using pycistarget.motif_enrichment_dem and pycistarget.motif_enrichment_cistarget. Utilizing Scenicplus (version 1.0.1), an RNA and ATAC paired scenicplus object was created. The gene regulatory network was built by using the scenicplus function build_grn (with parameters quantiles = c(0.85,0.90,0.95), order_regions_to_gene_by = “importance”, rho_threshold = 0.02 and keep_extended_motif_annot = True). After the gene regulatory network was built, we set rho to −0.05 and 0.05 for negative correlated and positive correlated, respectively. Important eRegulons were selected on the basis of the regulon specificity score (RSS).

Human kidney impairment positive related cell type analysis

AKI and CKD were diagnosed by nephrologists according to Kidney Disease: Improving Global Outcomes guidelines. Impairment was classified into three types: immune mediated, metabolic mediated and other. Similar to the proteinuria analysis, a pseudo-bulk RNA-seq profiled donor was formed by aggregating the expression values (UMI counts) from a group of cells from the same donor. Cells annotated as altered states were not included. Three types of data, viz. (1) a binary phenotype for each donor: impairment or non-impairment, (2) the pseudo-bulk RNA-seq profile and (3) the snRNA-seq profile, were used as inputs (using a randomly selected half of cells, for computational efficiency) for Scissor (version 2.0.0) with parameter α = 0.05 for immune-mediated impairment and α = 0.5 for metabolic-mediated impairment, to identify the impairment related cell subpopulation (Scissor+ cells).

For immune-mediated impairment signature and metabolic-mediated impairment analysis, the way to select the signature gene set and calculate the signature score was the same as for the proteinuria signature analysis described above. For human kidney EC-GC impairment positive related cell subcluster analysis, firstly, only EC-GCs were selected from the Scissor output (both immune mediated and metabolic mediated), then unsupervised reclustering was performed on the EC-GC snRNA-seq data by using the Seurat pipeline, as follows: (1) normalization of the raw UMI count matrix by using the NormalizeData function with default parameters, (2) finding the top 2,000 highly variable genes by using the FindVariableFeatures function with default parameters, (3) PCA by using the ScaleData and RunPCA functions with default parameters, (4) unsupervised clustering by using the FindNeighbors (with parameter dims = 1:20 and others as default) and FindClusters (with parameter resolution = 0.8 and others as default) functions and (5) UMAP performed by using the RunUMAP function with the top 20 PCs.

For proteinuria phenotype enrichment analysis, cluster 5 was filtered out since its cell number was less than 50. The impairment phenotype enrichment score for cluster i was calculated by using equation (1) above, but with \({P}_{i}\) being the the proportion of cluster i in the immune-mediated or metabolic-mediated phenotype positive related cells identified by Scissor (Scissor+ cells) and \({{Ex}}_{i}\) being the expected proportion of cluster i in the injury positive related cells, setting pseudo_value to 0.000001 to avoid meaningless operations. The EC-GC subcluster whose enrichment score was highest and greater than 0 was regarded as the immune-mediated or metabolic-mediated impairment phenotype enriched subcluster.

For eRegulon identification in EC-GCs, the methods used were the same as for eRegulon identification in podocytes, except the parameter nTop was set at 8,000 in the binarize_topics function.

Analysis of UDA CRISPER screen data

The preprocessing of the UDA CRISPER sequencing data was the same as for the PBMC cohort RNA-seq part. In quality control of the UDA CRISPER screen, unsupervised clustering based on the single-cell RNA profile showed that cluster 11 had an extremely low UMI content, while clusters 4 and 7 had relatively higher sgRNA numbers than other clusters. Thus, clusters 4, 7 and 11 were filtered out. For sgRNA assignment, we assigned the most dominant sgRNA (where the sgRNA UMI content exceeded the sum of all other sgRNA UMI contents within a cell) to the cell. Finally, only uniquely assigned cells survived.

For the analysis of two groups of cells, single-cell RNA profiled SNU16 cells were subject to unsupervised clustering by using the Seurat pipeline. In the PCA analysis, two groups of cells were clearly observed along two dimensions (PC1 and PC2) in the visualization. Cluster 9 was removed as it had the lowest cell number and was scattered. We grouped clusters 8, 0 and 7 into group 1 and the other clusters into group 2.

For MYC-related pathway analysis, the MYC-related pathways were downloaded from the Kyoto Encyclopedia of Genes and Genomes database (KEGG)78. Within each group of SNU16 cells, perturbed upregulated or downregulated genes for each target gene were defined as highly upregulated genes in comparison with cells with the target gene and negative control cells by using the Seurat differential expression gene pipeline, with the R package clusterProfiler (version 4.8.1) to do the KEGG pathway enrichment and restrict the enrichment results on MYC-related pathways. The enrichment score was \(-{\log }_{10}(\,{p\_value})\). Pathways specific to a single cancer in MYC-related pathways were not taken into consideration. The heatmap was plotted by using the R package ComplexHeatmap (version 2.10.0)79.

For genotype to multiple phenotypes analysis. We implemented the scMAGeCK (version 1.6.0) function scmageck_lr, setting the parameter NEGCTRL to negative control sgRNAs and PERMUTATION to 100.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.