Main

While the definition of what constitutes a rare disease is arbitrary, and thus varies by jurisdiction, the European Union has adopted a definition of a rare disease as being an ailment that affects <50 individuals per 100,000. More than 70% of the >6,000 unique rare diseases are genetic and, collectively, they constitute a major health issue, with 3.5–6.0% of individuals affected by a rare disease over their lifetime1.

Despite improvements in diagnostics and research options for rare diseases, many individuals remain without a molecularly proven genetic diagnosis. In healthcare systems, where exome or genome sequencing is becoming the standard of care, diagnostic yield varies between 20 and 70% depending on the type of rare disease, inclusion criteria, sequencing strategy and analysis standards, as highlighted by projects such as The 100,000 Genomes Project via Genomics England, and the Deciphering Developmental Disorders Study2,3,4.

As reviewed in Dai et al.5, it has been shown that reanalysis of existing genomic data can lead to novel diagnoses, both as a result of newly described disease genes and due to improvements in the identification, annotation and interpretation of genomic variants. However, reanalysis of such data is not routinely undertaken due to the time and multidisciplinary expertise required, and associated costs.

In 2017, the European Union brought together expertise on rare diseases into 24 thematic European Reference Networks (ERNs). Each ERN has multiple national centers across the 27 member states, all of which have been vetted for their clinical, diagnostic and research expertise. These collaborations provide a pan-European framework to improve care for individuals with rare diseases.

Solve-RD is a pan-European omics project that brings together (1) clinicians, geneticists and translational researchers from four ERNs, including rare neurological diseases (RND, https://www.ern-rnd.eu/), intellectual disability, telehealth and congenital anomalies (ITHACA, https://ern-ithaca.eu/), neuromuscular diseases (EURO-NMD, https://ern-euro-nmd.eu/) and genetic tumor risk syndromes (GENTURIS, https://www.genturis.eu), as well as the Spanish undiagnosed disease program6; (2) patient organizations represented by EURORDIS7 (https://www.eurordis.org/); (3) genomic data-sharing and -analysis resources, such as the RD-Connect Genome-Phenome Analysis Platform8 (RD-Connect GPAP, https://platform.rd-connect.eu/) and the European Genome-Phenome Archive9 (EGA, https://ega-archive.org/); (4) European networks aiming to improve and harmonize the quality of genetic testing services, such as EuroGentest (http://www.eurogentest.org/); and (5) experts in the field of omics technologies, bioinformatics, knowledge management and rare-disease ontology, such as Orphanet Rare Disease Ontology (ORDO, https://www.orphadata.com/ontologies/) and Human Phenotype Ontology (HPO)10.

One of the core aims of Solve-RD is to improve the rate of genetic diagnosis for individuals affected by a rare disease. A specific objective of Solve-RD is to systematically collate and reanalyze existing exome/genome datasets and corresponding structured ontology-based phenotype and pedigree information across the disease areas of its ERN partners (Fig. 1). Previous pilot studies analyzed only subcohorts and focussed on established pathogenic (ClinVar) variants, whereas the work presented here is the first large-scale and systematic reanalysis across all diseases of Solve-RD7,11,12,13. Here we report the results from the systematic reanalysis of data from 6,004 undiagnosed rare-disease families recruited from across Europe by Solve-RD. The entire dataset is available as a resource for the global rare-disease research community.

Fig. 1: Overview of the Solve-RD analysis and interpretation framework and community resource established.

a, Solve-RD brought together rare-disease data and expertise. Central to Solve-RD are four core ERNs relating to rare diseases; via these expert disease networks, patients with rare diseases were recruited from 43 research groups from 37 institutes in 12 European countries (Belgium, Czech Republic, Finland, France, Germany, Hungary, Italy, the Netherlands, Portugal, Slovenia, Spain and the United Kingdom) and Canada. The work involved >300 collaborators in the submission, analysis and interpretation of rare-disease data. The RD-REAL framework allows sharing of data and expertise on a continental scale, consisting of (1) expert curated data, (2) a comprehensive analysis suite and (3) a two-level (that is, molecular and clinical) expert review. The complete dataset comprises 9,645 individuals from 6,004 families and includes phenotypes in Phenopacket format (average of six HPO terms per affected individual), pedigrees and genomic data (genomes and exomes). b, Illustration of the utility of this resource to the global rare-disease community. In total, RD-REAL data of >23,000 individuals with >100 million unique genomic variants are available via RD-Connect GPAP and EGA. This represents a growing resource containing data that have been submitted since the start of Solve-RD. Interpretable data (genetic variants, phenotypes and pedigrees) are standardized and annotated, and are made available for querying, analysis and interpretation in RD-Connect GPAP for authorized users. In addition, all raw and processed data are available for download at EGA under a controlled-access model. All icons, except logos of services (GPAP; EGA) and consortia/networks (Solve-RD; European Reference Networks) that are contributors of this publication, created with Biorender.com.

Results

Pan-European rare-disease data collection

Solve-RD involves over 300 clinicians, laboratory geneticists and translational researchers from 43 research groups associated with 37 institutes located in 12 European countries and Canada. In total, we collected 10,276 genomic datasets, as well as phenotypic descriptions and pedigrees, from 10,039 individuals, all previously analyzed through local diagnostic or research efforts. The collection includes 554 genomes and 9,722 exomes enriched using 28 different exome-enrichment kits and generated on several short-read sequencing platforms. Following quality control (Methods), 9,874 datasets (523 genomes and 9,351 exomes) from 9,645 individuals remained. These represent 6,449 individuals affected by rare diseases, and 3,196 unaffected relatives, from 6,004 families (Fig. 1, Table 1 and Supplementary Table 1). Disease categories comprise rare neurological diseases (RND, n = 2,271 families), (multiple) malformation syndromes, intellectual disability and other neurodevelopmental disorders (ITHACA and SpainUDP, n = 1,857), rare neuromuscular diseases (EURO-NMD, n = 1,517) and suspected hereditary gastric and bowel cancer (GENTURIS, n = 359).

Table 1 Solve-RD reanalysis data

Phenotypic information was collected using standardized HPO terms, consistent with the GA4GH Phenopacket schema14, with a median of six terms (range 0–74) assigned per affected individual (Extended Data Fig. 1), varying from a median of four terms for GENTURIS to ten for ITHACA, reflecting the phenotypic complexity of probands affected by the respective rare disease. In addition, for 2,126 (35.4%) probands, a clinical diagnosis was encoded using an ORDO ORPHA code15, of which 338 were unique.
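As an illustration of how a per-individual term count and its median could be derived from Phenopacket-style records, the following is a minimal sketch. It assumes JSON-like dictionaries with a `phenotypicFeatures` list as in the GA4GH Phenopacket schema; the toy records, the helper name `count_observed_hpo_terms` and the choice to skip features flagged as `excluded` are assumptions made for this example, not a description of the Solve-RD pipeline.

```python
from statistics import median


def count_observed_hpo_terms(phenopacket: dict) -> int:
    """Count phenotypic features that are not flagged as excluded."""
    features = phenopacket.get("phenotypicFeatures", [])
    return sum(1 for feature in features if not feature.get("excluded", False))


# Toy records for illustration only; these are not Solve-RD data.
toy_cohort = [
    {"id": "toy-1", "phenotypicFeatures": [
        {"type": {"id": "HP:0001250", "label": "Seizure"}},
        {"type": {"id": "HP:0000252", "label": "Microcephaly"}},
    ]},
    {"id": "toy-2", "phenotypicFeatures": [
        {"type": {"id": "HP:0001263", "label": "Global developmental delay"}},
        # An explicitly excluded feature is not counted as an observed term.
        {"type": {"id": "HP:0001250", "label": "Seizure"}, "excluded": True},
    ]},
    {"id": "toy-3", "phenotypicFeatures": []},
]

term_counts = [count_observed_hpo_terms(record) for record in toy_cohort]
median_terms = median(term_counts)
print(f"Terms per individual: {term_counts}; median: {median_terms}")
```

An individual with no recorded terms contributes a count of zero, which matches the lower bound of the range reported above.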

New genetic diagnoses following systematic reanalysis

A two-level expert analysis strategy (data-expert and clinical-expert levels) was applied, as detailed in Methods. All datasets were reanalyzed for a broad range of genomic variants, including SNVs and short insertions–deletions (InDels), noncanonical splice variants predicted in silico, homoplasmic and heteroplasmic mitochondrial DNA variants, copy number variants (CNVs), structural variants (SVs), mobile element insertions (MEIs) and short tandem repeat expansions (STRs) (Extended Data Fig. 2). Each ERN generated a list of established disease genes for their respective conditions, resulting in gene lists ranging from 230 genes for GENTURIS to 1,820 for RND (Methods and Supplementary Table 2). Systematic reanalyses resulted in 506 genetic diagnoses, by (probable) pathogenic variants that explained the phenotype, representing 8.4% of probands. The amount of time that was invested in expert reanalysis was manageable at 4.8 min per variant, or 42.8 min on average per proband.
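The headline yield follows directly from the counts given above. A minimal sketch of the arithmetic, in which the helper name `percent` is an assumption introduced for this example:

```python
def percent(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 1)


# Counts quoted in the text.
FAMILIES_ANALYZED = 6004
SYSTEMATIC_DIAGNOSES = 506

systematic_yield = percent(SYSTEMATIC_DIAGNOSES, FAMILIES_ANALYZED)
print(f"Systematic reanalysis yield: {systematic_yield}% of probands")
```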

New molecular diagnoses

SNV/InDel reanalysis revealed 461 (probable) pathogenic variants, enabling a diagnosis in 419 families. To retrieve these 461 variants from the >50,000 prioritized variants, an average of nine variants per proband underwent molecular and clinical expert review (Supplementary Table 3).

The 461 SNV/InDel variants identified, in 419 probands, consisted of 282 heterozygous variants with dominant effect, 85 homozygous and 76 compound heterozygous variants with recessive effect and 18 hemizygous variants. Functionally, these represented 187 nonsense/frameshift variants, 249 missense variants, 11 in-frame deletions, ten splicing variants (eight intronic and two synonymous), two 5′ UTR variants, one promoter region variant and one complex InDel variant (Fig. 2 and Supplementary Table 4). Forty-one of the 461 (9.1%) variants could be confirmed as de novo mutations, due to the availability of proband–parent trios for 1,320 (22%) families, primarily from ERN ITHACA (1,081).

Fig. 2: Systematic reanalysis of genomic datasets for the genetic diagnosis of rare diseases.

a, Flowgram of systematic analysis of 6,004 families. Yield per analysis type (genetic diagnoses by SNV/InDel and other variant types; candidate genetic diagnoses and genetic diagnoses by ad hoc expert review) is shown. For SNV/InDels, we evaluated why the 461 variants identified in 419 families had not previously been classified as disease causing. b, Chart summarizing diagnostic yield across 6,004 families in Solve-RD. c, Chart summarizing yield per disease category (ERN); the denominator is 6,004 families. d, Chart summarizing the different variant types that led to a molecular diagnosis in 506 of 6,004 families as part of the systematic reanalysis effort of Solve-RD. ᵃDisease-causing SNVs or short insertions/deletions were identified in 419 families. ᵇDisease-causing non-SNV variants were identified in 87 families, including three cases of compound heterozygosity involving an SNV and a CNV/SV; these were identified through the ‘other variant type’ analyses and are counted only under ‘New genetic diagnosis other variant types’. ᶜIn 114 of 147 cases where we could confirm the variant identified in the ad hoc analysis, we established that it would also have been found by the standard analysis. RD, rare disease; splicing SNV/InDel, noncanonical splicing sites; WG, work group.

We evaluated why the 461 SNV/InDel variants had not been classified as disease causing in previous analyses. We found that 67 affect genes that were established as novel disease genes following data submission to Solve-RD (that is, appeared in Online Mendelian Inheritance in Man (OMIM) after 1 January 2018; Extended Data Fig. 3 and Supplementary Table 4), while the remaining 394 were in established disease genes at the time of data submission. Of these, 117 variants have been reclassified in the interim (that is, novel or modified ClinVar16 entry since 2018) and 70 had initially been deemed not to fully explain disease because of perceived insufficient clinical concordance at the time, despite the variant being classified as pathogenic in ClinVar. The remaining 207 variants were not included in ClinVar and were classified as (probable) pathogenic only by the experts involved in this project.
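The breakdown above can be checked and expressed as shares of the 461 variants. This is a minimal sketch of that arithmetic; the category labels are paraphrases chosen for this example.

```python
# Counts quoted in the text; labels are paraphrased for illustration.
reasons_not_previously_diagnostic = {
    "gene entered OMIM after 1 January 2018": 67,
    "ClinVar entry new or modified since 2018": 117,
    "pathogenic in ClinVar, clinical concordance judged insufficient": 70,
    "absent from ClinVar, classified by project experts": 207,
}

total_variants = sum(reasons_not_previously_diagnostic.values())

# Share of each category, as a percentage rounded to one decimal place.
shares = {
    reason: round(100 * count / total_variants, 1)
    for reason, count in reasons_not_previously_diagnostic.items()
}

for reason, share in shares.items():
    print(f"{reason}: {share}%")
print(f"Total variants: {total_variants}")
```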

We applied a suite of analysis tools for calling and annotating variants. These included queries for noncanonical splice variants, mtDNA variants, CNVs, SVs, MEIs and STRs. These additional analyses yielded a diagnosis in 87 rare-disease families, involving a total of 88 variants, with CNVs in 44 probands (45 variants) being the most prevalent variant type (Fig. 3). This included three cases where biallelic pairings of an SNV with a CNV/SV formed a compound heterozygous variant, and one case where two CNVs affecting different genes led to a digenic diagnosis (Extended Data Fig. 2 and Supplementary Table 4).

Fig. 3: Examples of ‘beyond standard’ variant types by Solve-RD.

a–d, Illustrative examples of previously unsolved rare-disease probands for which a new variant other than a coding SNV/InDel resulted in a new diagnosis. a, De novo CNV affecting BICRA (P0012861). b, MEI variant in COL6A2 (P0014682). c, SV in SCN11A (P0011371). d, STR expansion affecting AR (P0002409).

The diagnostic yield across disease groups (that is, ERNs) ranged from 2.8% (genetic tumor risk syndromes, GENTURIS) to 10.6% (rare neurological disorders, RND), correlating with the number of established disease genes provided by the ERNs (Fig. 2 and Supplementary Table 2). Overall, for the 506 newly diagnosed probands, the inheritance pattern was autosomal dominant for 306, autosomal recessive for 137, X-linked for 42, mitochondrial for 16, dual diagnoses in four individuals and digenic inheritance in one individual (Supplementary Table 4).

Beyond the overall yield across the cohort, the importance of new diagnoses can be illustrated by individual rare-disease case reports, each benefitting from technical and interpretational improvements, leading to the closure of diagnostic odysseys. For example, we highlight a 58-year-old male from the RND cohort who developed a rare neurological disorder at 42 years of age, including sensory neuronopathy or sensory polyneuropathy, which was later specified as spastic ataxic gait with confirmed signs of peripheral neuropathy. Our reanalysis revealed a large intragenic deletion in combination with a missense variant in B4GALNT1, both of which were proven to be pathogenic (P0015028; Extended Data Fig. 4). Functional confirmation was obtained via glycomics analysis of plasma glycolipids, indicating reduced levels of B4GALNT1 glycolipid products.

An example of a previously missed CNV was a small, single-exon deletion of APC identified in an individual (P0009136, Extended Data Fig. 5) from the GENTURIS cohort presenting with features suggestive of familial adenomatous polyposis. Although the clinical course, family history and haplotype analysis had already pointed to an underlying APC variant, the diagnostic deletion was not detected in routine diagnostics due to a lack of multiplex ligation-dependent probe amplification probes covering the specific region affected.

In the ITHACA cohort we highlight two individuals. The first (P0012716; Extended Data Fig. 6a) presented with complex partial seizures and asymmetry of the legs and face, and carried a mosaic de novo mutation in PIK3CA (Chr3(GRCh37):g.178916876G>A; NM_006218.4:c.263G>A; p.(Arg88Gln), present in 13% of the reads). This individual had been clinically suspected of having underdevelopment of the left side of the body, rather than overgrowth of the right side of the body, which meant that an overgrowth syndrome had not previously been considered. Furthermore, probably due to mosaicism, the proband presented with a relatively mild phenotype when considering the spectrum of PIK3CA-related overgrowth, which made accurate clinical diagnosis challenging.

The second ITHACA example involves an individual (P0013065; Extended Data Fig. 6b) with severe developmental delay and multiple syndromic features, including delayed motor, communicative and social milestones: crawling at 15 months, walking at 30 months, first words at 7 years of age and speech characterized by severe verbal dyspraxia. Additional medical problems comprised divergent strabismus, muscle tone dysregulation with contractures and inattentive and hyperactive behavior with aggressive tantrums. Physical examination revealed a slender body and microcephaly (height 184 cm (s.d. = 0); weight 51.5 kg; body mass index 15.2; head circumference 54.5 cm, s.d. −2). He had a small, asymmetric thorax of unusual shape (the midthoracic region being broader in the frontal plane and flattened in the sagittal plane compared with the high thoracic region), high thoracic kyphosis and scapular winging. His hands and feet were slender, with long fingers and toes and camptodactyly of the 2nd, 3rd and 4th fingers of the right hand, and he exhibited elbow and knee contractures. Facial dysmorphisms included a long and narrow facial shape, full eyebrows with synophrys, downslant of the palpebral fissures, prominent eyelids with ptosis, divergent strabismus, low-set ears with a square-shaped and flattened upper helix, and a short nose. Here, the identification of a de novo variant in MN1 ended a 20-year diagnostic odyssey. The disease–gene relationship for MN1 was established only after the initial routine analysis, and it now finally enables the diagnosis of CEBALID syndrome.

In the NMD cohort, we highlight a 14-year-old boy with an initial diagnosis of congenital myasthenic syndrome (CMS) and his mildly affected mother. Systematic reanalysis led to the identification of a mitochondrial variant, m.3243A>G in MT-TL1, with an observed heteroplasmy of 0.27 in the proband and 0.14 in his mother (Extended Data Fig. 7). The difference in heteroplasmy probably correlates with the mild phenotype observed in the proband, and with the absence of mitochondrial myopathy features in his mother. While the initial clinical suspicion in the proband was CMS due to the notable fatigability, the fact that mitochondrial disease can be highly variable in presentation means that mild forms of mitochondrial myopathy can be difficult to diagnose clinically.
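Heteroplasmy at a mitochondrial position is commonly estimated as the fraction of reads carrying the alternate allele. The following is a minimal sketch of that calculation; the read counts are hypothetical values chosen only so that the function reproduces the fractions quoted above, and the function name `heteroplasmy` is an assumption for this example rather than part of any Solve-RD tool.

```python
def heteroplasmy(alt_reads: int, ref_reads: int) -> float:
    """Fraction of reads at a position that carry the alternate allele."""
    if alt_reads < 0 or ref_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = alt_reads + ref_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return alt_reads / total


# Hypothetical read counts, not taken from the study data.
proband_level = heteroplasmy(alt_reads=270, ref_reads=730)
mother_level = heteroplasmy(alt_reads=140, ref_reads=860)
print(f"proband: {proband_level:.2f}, mother: {mother_level:.2f}")
```

In practice the estimate depends on coverage at the site and on the tissue sampled, so a low fraction supported by few reads should be interpreted with care.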

An example of how variant annotation pipelines can aid in variant interpretation is provided through the diagnostic path of a girl (P0012491) who was clinically suspected to have Rett syndrome (MIM#312750). Exome sequencing performed in 2014 did not yield a diagnosis, despite specific attention being applied to variants affecting MECP2, the gene associated with Rett syndrome. Almost 8 years later, the reanalysis presented here uncovered a pathogenic de novo MECP2 variant from the same data. Retrospective analysis of previous interpretation steps revealed that the variant was initially annotated to a less relevant isoform of MECP2 (MECP2-e2; ENST00000303391.11), in which the variant is located in an intron. However, reannotation here revealed that the variant truncates the brain-specific isoform of MECP2 (MECP2-e1; ENST00000453960.7), and hence is indeed explanatory for the Rett syndrome in this girl.

Cases diagnosed by ad hoc expert review

During the course of Solve-RD, many contributing partners continued to perform analysis on specific families of interest, both locally and using RD-Connect GPAP. This ad hoc expert review provided 249 additional diagnoses (4.1%), some of which have been included in individual reports13,17,18,19,20,21,22, and novel disease gene discovery efforts23,24 published previously (Supplementary Table 5). Cases solved through ad hoc expert review were reported to Solve-RD and not interpreted further as part of the systematic reanalysis. For 197 (79%) of these ad hoc diagnoses, the causative variants were SNVs. For 147 (75%) of these SNVs we could assess post hoc whether the variants would also have been identified by the systematic reanalyses performed. We found that in 114 of 147 (78%) cases the SNVs would have been identified, while the remaining cases were diagnosed due to the discovery of variants located in novel disease genes not included in ERN gene lists, or initially discounted for technical reasons (for example, having insufficient coverage (fewer than ten reads) or being deep intronic variants).

Candidate disease-causing variants

In addition to variants that were deemed causative for disease, we identified a further 378 variants (in 333 affected individuals) in established disease genes that have not yet been confirmed as causative, either because the variant does not fully explain the individual’s phenotype or because the variant’s pathogenicity cannot yet be conclusively determined (Fig. 2 and Supplementary Table 4).

Cross-ERN analysis, recurrences and clinical actionability

Cross-ERN de novo mutation analysis

Systematic reanalyses were performed by each of the four ERNs, thus maximizing disease-specific expertise. Because the clinical spectrum may occasionally cross ERN boundaries, we assessed all de novo mutations across all genes included in any of the ERN gene lists (2,512 unique genes), irrespective of which ERN originally submitted the case. This led to a molecular diagnosis in an additional three probands through the identification of (probable) pathogenic de novo variants in CSDE1 (ref. 25), EP300 and SYT1 in individuals P0012248, P0014714 and P0018474, respectively (Supplementary Table 6), which would have been missed without this cross-ERN analysis. This included a young girl (P0014714) presenting with microcephaly, face abnormality, muscle hypotonia and neurodevelopmental delay, leading to a clinical suspicion of Cornelia de Lange syndrome (MIM#122470; https://www.omim.org/entry/122470). Solve-RD’s efforts led to the identification of a de novo frameshift variant in the histone acetyltransferase p300 gene: EP300(NM_001429.4):c.1152_1153del; p.(Gly385GlnfsTer25), suggesting a clinical diagnosis of Rubinstein–Taybi syndrome (MIM#180849). This prompted clinical re-evaluation of the proband’s phenotype, at which point the clinical diagnosis was confirmed. Another example (P0018474) concerned a young male with severe neurodevelopmental delay, microcephaly, absent speech, generalized hypotonia, nystagmus and inability to walk. Here, the systematic reanalysis of the proband’s exome sequencing (ES) data within Solve-RD led to the identification of a de novo missense variant in synaptotagmin 1, SYT1 (NM_001135806.2):c.1103T>C; p.(Ile368Thr), leading to a molecular diagnosis of Baker–Gordon syndrome (MIM#618218). Retrospective analysis of the original ES data of both cases revealed that the variants had not been identified by the corresponding in-house pipeline.
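The cross-ERN step described above amounts to testing each de novo call against the union of all ERN gene lists rather than only the list of the submitting ERN. The following is a minimal sketch under stated assumptions: the `DeNovoCall` record, the function `cross_ern_hits`, the ERN labels and the placeholder gene names are all hypothetical, and the extra condition that the gene is absent from the submitting ERN's own list is added here only to isolate the calls that a single-ERN analysis would miss.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeNovoCall:
    proband_id: str
    gene: str
    submitting_ern: str


def cross_ern_hits(calls: list[DeNovoCall],
                   gene_lists: dict[str, set[str]]) -> list[DeNovoCall]:
    """Return calls in a gene listed by some ERN but not by the submitting ERN."""
    union_of_lists = set().union(*gene_lists.values())
    hits = []
    for call in calls:
        own_list = gene_lists.get(call.submitting_ern, set())
        if call.gene in union_of_lists and call.gene not in own_list:
            hits.append(call)
    return hits


# Placeholder gene lists and calls; none of these are real Solve-RD entries.
toy_gene_lists = {
    "ERN_A": {"GENE1", "GENE2"},
    "ERN_B": {"GENE2", "GENE3"},
}
toy_calls = [
    DeNovoCall("toy-P1", "GENE3", "ERN_A"),  # only on the ERN_B list
    DeNovoCall("toy-P2", "GENE1", "ERN_A"),  # already on the submitter's list
    DeNovoCall("toy-P3", "GENE9", "ERN_B"),  # on no list at all
]

hits = cross_ern_hits(toy_calls, toy_gene_lists)
print([call.proband_id for call in hits])
```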

Recurrent variants

We observed recurrence for 21 (probable) pathogenic variants, together accounting for 41 diagnoses (Supplementary Table 7). These 21 variants occurred in 18 genes, with three genes (SPG7, KCNA2 and SPAST) harboring two different recurring variants.

One of the recurring variants was identified across three ERNs: an identical MT-ATP6 missense variant (chrM:9185T>C; ENST00000361899:c.659T>C; p.(Leu220Pro)) was observed in five affected individuals (P0010243, P0009606, P0009608, P0004265 and P0004266) from three unrelated families submitted by ERNs EURO-NMD, RND and ITHACA. The variant was observed with a heteroplasmy of 77 and 90% in the EURO-NMD and RND probands, respectively, while it was homoplasmic in the ITHACA proband, in line with the variable phenotypic presentation (Supplementary Table 8).
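Recurrence across unrelated families can be detected by grouping observations by variant and counting distinct families, so that affected relatives within one family are not counted twice. The following is a minimal sketch; the function `recurrent_variants`, the variant key format and the family identifiers are assumptions for this example, and the second variant is a hypothetical placeholder.

```python
from collections import defaultdict


def recurrent_variants(observations, min_families=2):
    """Map each variant seen in at least `min_families` distinct families to those families."""
    families_by_variant = defaultdict(set)
    for family_id, variant_key in observations:
        families_by_variant[variant_key].add(family_id)
    return {
        variant: sorted(families)
        for variant, families in families_by_variant.items()
        if len(families) >= min_families
    }


# Toy observations as (family, variant) pairs; family labels are invented.
toy_observations = [
    ("FAM_A", "chrM:9185:T:C"),
    ("FAM_B", "chrM:9185:T:C"),
    ("FAM_B", "chrM:9185:T:C"),  # second affected individual, same family
    ("FAM_C", "chrM:9185:T:C"),
    ("FAM_C", "chrM:9185:T:C"),  # second affected individual, same family
    ("FAM_D", "chr1:1000:A:G"),  # hypothetical singleton, not recurrent
]

recurrent = recurrent_variants(toy_observations)
print(recurrent)
```

Keying on families rather than individuals is the design choice that matters here: five affected individuals carrying the same variant still count as three independent observations.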

Beyond diagnosis to clinical actionability

We investigated the number of diagnosed individuals that would potentially benefit from therapy or other actionability, by considering medications or interventions included in three databases: IEMbase26, Treatabolome27 and ClinGen28, and in international cancer guidelines.

We identified 73 affected individuals (14.4% of diagnosed individuals) that harbored variants in a potentially actionable gene (Extended Data Fig. 8).

Implementation, and feedback to referring clinicians and eventually to families and patients, follows local guidelines that differ between centers. Clinical action has already been taken in some cases, and implementation is ongoing. To date we have received feedback for a subset of the aforementioned cases, with details of 16 examples summarized in Supplementary Table 9.

An example from ERN EURO-NMD is provided by the case of two young-adult patients from different families who had presented with limb-girdle muscle weakness and fatigability from 2 years of age, and subsequently developed ptosis and difficulty in swallowing, leading to a suspected diagnosis of limb-girdle myasthenic syndrome (P0020778). While previous ES analyses were negative, reanalysis within Solve-RD using SpliceAI29 led to the identification of a homozygous intronic variant with a potential splice donor effect, c.1023+5G>A proximal to the exon 5–intron 5 junction of DES in both patients. In parallel, but outwith Solve-RD, a female with a similar phenotype, among a cohort of patients suspected of having CMS being treated in the same hospital, was also found to be homozygous for this mutation. Subsequent laboratory analyses indicated reduced production of normal desmin transcript and protein. Administration of the standard CMS treatment of pyridostigmine and salbutamol was initiated and, while one of the two patients showed no improvement after 3 months, the other exhibited 50% improvement in measures of fatigable weakness.

Discussion

Genomic data from rare-disease cases that have been extensively analyzed by experts in the past can still yield a large number of new diagnoses, with previous studies reporting success rates commonly in the range of 6–13% (ref. 5). We previously reported on preliminary ClinVar-focussed reanalyses undertaken within Solve-RD, which resulted in molecular diagnoses being provided for 111 families12,13. The value of an in-depth systematic reanalysis is supported by our success in diagnosing 8.4% of affected individuals through our systematic reanalysis, and the further 4.1% diagnosed in parallel by local reanalysis in individual centers through ad hoc expert review. In total, we have successfully diagnosed 12.6% of families to date. While a few recent studies have reported higher diagnostic rates following reanalysis, ranging from 15–21% (refs. 30,31,32,33), it should be noted that those datasets were more homogeneous in nature, usually originated from a single country and were of substantially smaller scale and breadth. Nevertheless, our diagnostic yield is at the top end of the typical range5.

The proposed framework, rare disease–reanalysis logistics (RD-REAL), with its two-level expert review (Methods), represents a practical blueprint for reanalysis efforts. Here we limited our analysis to four of 24 ERN rare-disease domains and, although it remains to be established whether similar results can be obtained in the other domains, the approach applied in Solve-RD is generic and can easily be implemented across the full gamut of rare diseases and at global scale.

Such collaborative reanalysis efforts can, for the present, exist in parallel with local or national reanalysis efforts, ideally embedded within the healthcare system and allowing for prompt return of results with immediate actionability in some individual cases. Ultimately, reanalysis efforts should be automated.

Further, the previously generated exome and genome sequencing data were highly heterogeneous because this is a pan-European project aiming to provide diagnoses for individuals across Europe. This heterogeneity, both in terms of the quality of the historic ES data and the breadth of phenotypic descriptions, impacted upon our ability to confidently identify potentially pathogenic variants. The limited number of genomes, and the focus on well-established disease genes used in this study, were not sufficient to support a systematic advantage of genome over exome sequencing in rare-disease studies (Supplementary Table 10). Another limitation was that, for two-thirds of the families analyzed (4,103 of 6,004), we had sequencing data only from the affected proband, thus limiting supporting segregation information during downstream variant interpretation, especially with respect to the identification of pathogenic de novo variants.

This study provides several key insights. After more than a decade of diagnostic exome sequencing34,35, our knowledge of the spectrum of genes and variants causing monogenic rare disease, and of the bioinformatic pipelines used to detect them, is still increasing. This is exemplified not only by the large number of SNV/InDel variants that can now be correctly interpreted, leading to 84.1% of all novel diagnoses (n = 419), based on the availability of new gene- or variant-level information, but also by the substantial proportion (15.9%, n = 87) of novel diagnoses that were a result of individually rare variant types not previously detectable by standard diagnostic bioinformatics pipelines.

With the growing size of rare-disease datasets, we shall identify an increasing number of identical variants in multiple individuals, improving the odds of arriving at the correct variant interpretation for multiple cases. This is evident here, because we identified 21 (probable) pathogenic variants that occurred two or three times across a total of 41 unrelated probands from the 6,004 families analyzed, sometimes straddling different clinical disease categories.

We examined clinical actionability for the diagnoses in the series, using a definition that considered only approved medication or (preventive) interventions. This is a more restrictive definition than that applied in a previous study3. Even without considering reproductive choice and surveillance of family members, there was potential for medical actionability in 14.4% of those receiving a diagnosis in our series, with ongoing implementation and the first concrete examples shown in Supplementary Table 9.

In Solve-RD, we developed several practical recommendations for large-scale distributed genomic reanalysis initiatives.

Because submitted data are likely to be heterogeneous, it is essential to standardize phenoclinical data and metadata, and to start genomic reanalysis from raw sequencing reads. Define strict inclusion criteria, including verification of biological relationships, and require a minimum on-target coverage of 80-fold for exome sequencing and 30-fold for genome sequencing. Multiple variant-calling pipelines should be used for each variant type, as highlighted by the results of our CNV analysis. Regular updates to bioinformatic workflows are essential for integration of new tools and the latest versions of databases such as gnomAD and ClinVar. When variants are found in genes linked to the individual’s phenotype, consider reducing stringency in alternative allele frequency and/or read-depth to detect mosaicism or true heterozygotes with poor allele balance.
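Two of these recommendations lend themselves to a simple rule-based check. The following is a minimal sketch: the coverage thresholds are those stated above, whereas the allele-fraction cut-offs (0.25 and 0.10) and all function and constant names are hypothetical values chosen purely for illustration and are not thresholds reported by Solve-RD.

```python
# Minimum on-target coverage per assay type, as recommended in the text.
MIN_ON_TARGET_COVERAGE = {"exome": 80.0, "genome": 30.0}

# Hypothetical alternate-allele fraction cut-offs, for illustration only.
DEFAULT_MIN_ALT_FRACTION = 0.25
RELAXED_MIN_ALT_FRACTION = 0.10


def passes_coverage(assay: str, on_target_coverage: float) -> bool:
    """Check a dataset against the minimum coverage for its assay type."""
    if assay not in MIN_ON_TARGET_COVERAGE:
        raise ValueError(f"unknown assay type: {assay!r}")
    return on_target_coverage >= MIN_ON_TARGET_COVERAGE[assay]


def keep_variant(alt_fraction: float, gene_matches_phenotype: bool) -> bool:
    """Apply a relaxed allele-fraction cut-off when the gene fits the phenotype."""
    if gene_matches_phenotype:
        threshold = RELAXED_MIN_ALT_FRACTION
    else:
        threshold = DEFAULT_MIN_ALT_FRACTION
    return alt_fraction >= threshold


print(passes_coverage("exome", 95.0))
print(keep_variant(0.13, gene_matches_phenotype=True))
```

Under these illustrative cut-offs, a variant seen in 13% of reads would be retained only when it lies in a gene linked to the individual's phenotype, which is the situation in which a mosaic variant is most likely to be diagnostically relevant.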

When prioritizing cases for reanalysis, focus on those analyzed further in the past, and prioritize variants based on their presence in clinical interpretation databases such as ClinVar, HGMD and similar resources. Favor specificity over sensitivity when sharing short lists of variants, and ensure they are shared only once per individual. Record feedback from variant interpretation—whether confirming disease-causing variants, identifying potential candidates or discarding them—in an accessible database to prevent duplicated efforts. Finally, reverse phenotyping is crucial for re-evaluation of clinical diagnoses, particularly in syndromic cases.
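The two prioritization rules above can be expressed as simple sort keys. The following is a minimal sketch in which the `Case` and `Variant` records, their field names and the example values are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Case:
    case_id: str
    last_analyzed: date


@dataclass(frozen=True)
class Variant:
    label: str
    in_clinical_database: bool


def order_cases(cases: list[Case]) -> list[Case]:
    """Cases analyzed furthest in the past come first."""
    return sorted(cases, key=lambda case: case.last_analyzed)


def order_variants(variants: list[Variant]) -> list[Variant]:
    """Variants present in a clinical interpretation database come first."""
    # sorted() is stable, so the original order is kept within each group.
    return sorted(variants, key=lambda variant: not variant.in_clinical_database)


# Invented example records.
toy_cases = [
    Case("toy-1", date(2019, 5, 1)),
    Case("toy-2", date(2014, 3, 15)),
    Case("toy-3", date(2016, 11, 30)),
]
toy_variants = [
    Variant("var-x", in_clinical_database=False),
    Variant("var-y", in_clinical_database=True),
    Variant("var-z", in_clinical_database=False),
]

case_order = [case.case_id for case in order_cases(toy_cases)]
variant_order = [variant.label for variant in order_variants(toy_variants)]
print(case_order, variant_order)
```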

We already have the first insights into the future value of the Solve-RD resource and infrastructure. Our current effort focussed on diagnoses in established rare-disease genes. However, this resource and the datasets in Solve-RD should be well suited for the generation of continued insights. Since the systematic analysis presented here was completed, we have already promoted two SVs and seven CNVs from candidate to disease causing36,37, and likewise for an additional ten SNV/InDel variants (Supplementary Table 11). This resource shall also allow the discovery of novel disease genes or loci, and the discovery of new disease mechanisms and causes is an ongoing part of Solve-RD7,11. The recent association of the noncoding RNA gene RNU4-2 with a complex NDD phenotype38,39 led to one further solved case in Solve-RD (P001996), in addition to the Solve-RD case (P0007197) that contributed to the original discovery (Extended Data Fig. 9c). As a further example we highlight RAB14, which had been suggested to play a role in neurodevelopmental disorders by a statistically significant enrichment of de novo variants in a developmental disorder cohort in 2020 (ref. 23). The Solve-RD dataset includes data from two male individuals with neurodevelopmental phenotypes harboring de novo variants in RAB14, now enabling genotype–phenotype characterization as a result of the comprehensive HPO description collected here (Fig. 4a,b). Similarly, many additional genotype–phenotype and/or mechanistic studies have been initiated from the Solve-RD datasets and are currently followed up within the Solve-RD RDMM-Europe initiative40.

Fig. 4: Example of a new discovery by Solve-RD.

a,b, An example of discoveries enabled by the Solve-RD resource. a, RAB14 de novo variants in two cases from this project contribute to the establishment of a new genotype–phenotype relationship. The first individual (P0012753) presents with mild global developmental delay in the absence of any facial dysmorphism or congenital anomalies, and carries a de novo variant in RAB14 (chr9:123952916G>A; NM_016322.3:c.200C>T; p.(Thr67Met)), which is rare (not observed in gnomAD v.2.1.1), likely to be deleterious (CADD score of 29) and has been observed de novo in at least four additional individuals with developmental disorders in the literature23. The second individual (P0012904) presents with mild ID, subtle facial dysmorphisms comprising a high, square-shaped forehead, downslant of palpebral fissures and a low-hanging columella, in the absence of congenital anomalies. The de novo variant found in this individual (chr9:123954475A>C; NM_016322.3:c.80T>G; p.(Leu27Trp)) is also absent from gnomAD, predicted to be deleterious (CADD score of 28) and has been observed de novo in at least one additional individual with a neurodevelopmental disorder in DECIPHER (https://www.deciphergenomics.org/patient/305550/phenotypes/person/62257). The female individual reported in DECIPHER presents with moderate ID, facial dysmorphism consisting of large earlobes, smooth philtrum, a wide mouth and protruding tongue, short feet with congenital talipes calcaneovalgus, thick hair and an umbilical hernia. b, Salient features of the two cases in a. aa, Amino acid.

Global data sharing is essential for discoveries in rare-disease diagnostic research41, and has been enabled here. Authorized users can use either RD-Connect GPAP to search and analyze integrated phenotype (HPO and ORPHA codes) and gene- and variant-level data, or EGA to download all data. The worldwide detection of gene-level recurrence in other individuals affected by a rare condition is further facilitated through connection to the MatchMaker Exchange network42. To benefit the rare-disease community, our framework will involve expansion to other types of rare diseases through their respective ERNs, the incorporation of novel omics datasets43,44,45—including those obtained from long-read technologies46,47,48,49,50,51—and the inclusion of artificial intelligence-based methodology52. The tools and infrastructure developed within Solve-RD have been adopted as the core framework for undiagnosed rare-disease case reanalysis within the ERDERA project, which aims to extend out to all 24 ERNs and reanalyze >100,000 datasets from rare-disease families across all disease types (https://erdera.org/).

Methods

Ethics oversight and enrollment

The ethics committee/IRB of University of Tübingen gave ethical approval for this work (ClinicalTrials.gov no. NCT03491280). Informed consent for data sharing, including indirect identifiers within Europe for the purpose of research, was obtained from all recruited individuals, and all data submitters signed the Adherence Agreement and Code of Conduct of RD-Connect GPAP. This covers the use of P-numbers that link to sample IDs only in an arbitrary fashion and have the function to allow traceability of results throughout the manuscript.

All individuals were recruited via four ERNs. Inclusion criteria were a clinical rare-disease diagnosis in at least one family member by one of the associated expert centers and an inconclusive exome or genome analysis at the time of submission. We did not exclude anyone based on sex, gender, ethnicity, race, age or any other socially relevant groupings.

Each patient entry was associated with its submitting investigator or clinician and linked to its corresponding ERN or UDP. The responsibility of checking that the data were suitable for submission to RD-Connect GPAP and Solve-RD lay with the data submitter, as required by their Code of Conduct (current institution: Consorcio para la Explotación del Centro Nacional de Análisis Genómico) and Data-sharing Policy (institution: Solve-RD general assembly), respectively. In some cases, individuals had to be reconsented before data submission. The individuals described in Extended Data Fig. 6 gave permission for their photographs to be used in this publication, for which we thank them and their families. This study adheres to the principles set out in the Declaration of Helsinki.

Family recruitment

Any undiagnosed individual with an apparent genetic rare disease that falls under the umbrella of conditions in which one of the four partner ERNs specialize, and for whom a previous ES analysis had been undertaken and proven inconclusive, was a candidate for inclusion in this study. The pan-European recruitment effort involved over 300 clinicians with expertise in rare diseases working in 43 research groups across 37 institutions located in 13 countries. To facilitate data submission and sharing, we implemented a pragmatic approach to collecting datasets to allow efficient reanalysis across centers. We refer to these datasets as RD-REAL, which must include genomic data, family information and phenotypic descriptions. The RD-REAL framework facilitates sharing of data and expertise at a continental scale, consisting of (1) expert curated data, (2) a comprehensive analysis suite and (3) two-level (that is, molecular and clinical) expert review (Fig. 1).

Data pertaining to 10,039 individuals from 6,246 undiagnosed families were initially assembled, which were then reduced to 9,645 individuals (6,447 affected) in 6,004 families following application of quality control measures, as described below. Of the 6,447 affected individuals, 3,592 (56%) were male and 2,855 (44%) female; 6,215 (96.4%) were alive at the start of the study, 84 (1.3%) were deceased and for 148 (2.3%) their vital status was unknown.

Pseudonymized phenotypic data collation for all individuals was facilitated using the PhenoStore module of RD-Connect GPAP. PhenoStore promotes deep phenotyping of affected individuals using HPO terms, and disease classification using Orphanet Rare Disease Ontology (ORDO) ORPHA codes (http://www.orphadata.org/cgi-bin/index.php) and/or OMIM identifiers (https://www.omim.org/) as appropriate, and can import/export this information using the GA4GH Phenopackets format14.

ERN cohort descriptions

For all families recruited to Solve-RD, local standard-of-care genetic diagnostic work-up and/or research-based analyses had failed to identify any molecular genetic cause underlying the proband’s rare condition.

ERN RND

The ERN RND cohort consists of 2,799 individuals from 2,271 families with previously unsolved rare neurological diseases. Genomic and phenotypic data for all affected individuals, and for family members where available (~20% of families), were submitted for reanalysis by nine ERN RND partner institutions located in eight European countries: Belgium, France, Germany, Hungary, the Netherlands, Slovenia, Spain and the UK. Individuals had been recruited and sequenced either as part of standard diagnostic care or through participation in large European rare-neurological disease research projects such as NeurOmics (https://rd-neuromics.eu/) and Treat-HSP (https://www.treathsp.net/). The 2,271 families comprised 1,924 singletons, 168 duos, 141 triples (103 of which were parent–child trios) and 38 families with four or more members, giving a total of 2,453 affected individuals. The HPO terms most frequently used to describe phenotypes were ataxia, gait disturbance, dysarthria and spastic paraplegia (Supplementary Table 12).

ERN ITHACA

The ERN ITHACA cohort consists of 4,405 individuals from 1,836 families, submitted for reanalysis by 12 partner institutions located in six countries: the Czech Republic, France, Germany, Italy, the Netherlands and the UK. A further 65 individuals from 21 families from the Spanish Undiagnosed Disease Program (SpainUDP)6 were included in this cohort for analysis, due to the similarity of the underlying phenotypes. The clinical spectrum of the ERN ITHACA cohort consisted of individuals with intellectual disability (ID) with or without additional phenotypic features, and individuals with (multiple) congenital anomalies without ID. Given the importance of de novo mutations underlying the rare conditions within ERN ITHACA34,53, unaffected parents and/or unaffected siblings were also included, wherever possible, to allow for direct segregation of variants. The 1,857 families comprised 632 singletons, 38 duos, 1,138 triples (1,081 parent–child trios) and 49 families with four or more members, giving a total of 1,933 affected individuals. The HPO terms most frequently used to describe affected individuals related to global developmental delay, intellectual disability and autism (Supplementary Table 12).

ERN EURO-NMD

The ERN EURO-NMD cohort consists of 2,125 individuals from 1,517 families, submitted for reanalysis by 16 partner institutions located in eight countries: Belgium, Canada, Finland, France, Germany, Italy, Spain and the UK. Previously unsolved datasets submitted to Solve-RD had either been recruited and sequenced as part of large international neuromuscular research projects, such as NeurOmics (https://rd-neuromics.eu/), SeqNMD, Myocapture (https://www.france-genomique.org/projet/myocapture-novel-for-genes-myopathies/?lang=en), MYO-SEQ54, UK10K (https://www.uk10k.org/), Unravel-CMS, BBMRI-LPC (https://cordis.europa.eu/project/id/313010), CMS CMG (https://cmg.broadinstitute.org/) or Consequitur55, or through participating centers’ own diagnostic or research pipelines. Samples incorporated from the MYO-SEQ project were recruited from 50 specialized neuromuscular disease centers across Europe and the Middle East, and some datasets incorporated from the Unravel-CMS, BBMRI-LPC and CMS CMG projects were from privately sequenced undiagnosed individuals followed at Nimhans, India (https://nimhans.ac.in/). The 1,517 families comprised 1,202 singletons, 90 duos, 156 triples (135 parent–child trios) and 69 families with four or more members, giving a total of 1,685 affected individuals. The HPO terms most frequently used to describe affected individuals related to muscle weakness, myopathy and abnormal muscle morphology (Supplementary Table 12).

ERN GENTURIS

The ERN GENTURIS cohort consists of 390 individuals, from 359 families, with a suspected genetic tumor risk syndrome, submitted for reanalysis by seven partner institutions located in four countries: Germany, the Netherlands, Portugal and Spain. All individuals were either recruited and sequenced as part of daily diagnostic care, or as part of research projects. The 359 families comprised 345 singletons, six duos, four triples (one parent–child trio) and four families with four or more members, giving a total of 378 affected individuals. The terms most frequently used to describe affected individuals related to colorectal cancer, followed by gastric cancer and pheochromocytoma (Supplementary Table 12).

Phenotype and clinical diagnosis

A median of six HPO terms (range 0–74) were used to describe each affected individual across this Solve-RD cohort. This drops to five HPO terms (range 0–45) following removal of HPO redundancies. To remove annotation redundancy, only the most specific HPO terms were considered by counting terms from leaf nodes, or nodes without selected parent or child entities. Overall quality of phenotypic descriptions was assessed using the Monarch Initiative annotation sufficiency score (maximum possible value of 5.0). The median annotation sufficiency value across the Solve-RD cohort was 3.61 (Extended Data Fig. 1). Clinical diagnosis was reported using ORDO codes for 2,126 affected individuals.
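The redundancy-removal step described above can be illustrated with a minimal Python sketch. It assumes the ontology is available as a child-to-parents mapping and keeps only those selected terms that are not an ancestor of another selected term. The term labels, the `TOY_PARENTS` mapping and both function names are hypothetical and chosen for illustration; they are not real HPO identifiers, and this is not necessarily how the counts reported here were computed.

```python
from typing import Dict, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every ancestor of `term` by walking the child -> parents map."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def most_specific_terms(selected: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Drop any selected term that is an ancestor of another selected term."""
    redundant: Set[str] = set()
    for term in selected:
        redundant |= ancestors(term, parents) & selected
    return selected - redundant


# Toy ontology with hypothetical labels (assumption: not real HPO IDs).
TOY_PARENTS = {
    "gait_ataxia": {"ataxia"},
    "ataxia": {"abnormal_movement"},
    "dysarthria": {"abnormal_speech"},
}

annotated = {"ataxia", "gait_ataxia", "dysarthria", "abnormal_movement"}
specific = most_specific_terms(annotated, TOY_PARENTS)
print(len(annotated), "terms before,", len(specific), "after redundancy removal")
```

In this toy case, "ataxia" and "abnormal_movement" are dropped because the more specific "gait_ataxia" is also selected, mirroring the fall in the median term count once redundancy is removed.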

Generation of ERN-specific candidate gene lists

To facilitate the potential for clinicians to confirm a diagnosis based on identified variants, findings returned to the ERN data interpretation task forces (DITFs) for interpretation were restricted to those in disease genes of interest to the specific ERN, apart from any potentially pathogenic variants encountered in the mitochondrial genome, all of which were returned. Each of the four ERNs generated a curated list of genes implicated in diseases studied, exploiting their pan-European disease expertise. The RND list was primarily based on genes associated with neurological disease with green review status in Genomics England PanelApp56, with the addition of a further 25 genes based on recommendations by clinical experts (n = 1,821 genes). For ITHACA, a consolidation of gene lists pertaining to ID from a variety of resources was undertaken, followed by evaluation based on occurrence in multiple resources and the quality of curation of said resources, resulting in a list of diagnostically relevant genes (n = 1,645). In the case of GENTURIS, the list included all genes routinely screened in the partners’ diagnostic laboratories (n = 230). For EURO-NMD, the manually curated and annually updated Gene Table of Muscular Disorders57 was used (n = 615 in 2021). These ERN gene lists were used as a primary filter in the identification of potentially pathogenic variants of any type in affected individuals submitted to Solve-RD by collaborators from the corresponding ERN, irrespective of the individual’s phenotype. This resulted in a list of 2,512 distinct genes implicated in rare diseases of interest to the four ERNs, many of which were identified by more than one ERN (Supplementary Table 2).

Identification of clinically actionable genes

Potentially clinically actionable genes in affected individuals were identified from three independent initiatives: ClinGen28 (n = 77), IEMbase58 (n = 214) and Treatabolome59 (n = 154; https://treatabolome.cnag.crg.eu). This provided a total of 392 unique genes, of which 311 (79%) were included in at least one of the curated gene lists from the ERNs. For the assessment of clinically actionable genes in individuals affected by a hereditary cancer predisposition, we searched GeneReviews and the National Comprehensive Cancer Network Clinical Practice Guidelines in Oncology (https://www.nccn.org/guidelines/category_1) for actionability based on cancer surveillance advice.

Data submission and analysis workflow

Raw sequencing data, associated metadata and phenotypic and pedigree descriptions were collated from 43 research groups across Europe using RD-Connect GPAP8. To ensure secure, rapid and robust transfer of the large quantity of raw genomic data (FASTQ, BAM or CRAM) for reanalysis (approximately 100 TB in total), each research group was provided with access to a dedicated private space in which to upload their sequencing data, on an Aspera server hosted by RedIRIS, the Spanish national research and education network (https://www.rediris.es/). From here the sequencing data were downloaded to the Centro Nacional de Análisis Genómico in Barcelona, which develops and hosts RD-Connect GPAP.

All genomic data submitted to Solve-RD were analyzed in identical fashion to minimize any batch effects, using the RD-Connect GPAP standard analysis pipeline60. Briefly, reads were aligned to the decoy version of GRCh37 (hs37d5) using BWA-MEM. Short variants (that is, SNVs) and insertions and deletions <50 nt in length (referred to here as InDels) were identified across the genome, independent of the target capture region of interest, using the GATK HaplotypeCaller in accord with the GATK Best Practices workflow. The output of the pipeline for each experiment is an aligned, base quality score recalibrated BAM, and a genomic variant call format (gVCF) file per chromosome and for the mitochondrion. All variant positions covered by at least eight reads, and a GATK-assigned genotype quality of at least 20, are uploaded to RD-Connect GPAP, as are any nonvariant positions for which at least one other experiment in the uploaded batch has a variant position at the same genomic location. SNVs, InDels and mitochondrial variants received detailed annotations provided by Ensembl Variant Effect Predictor61, gnomAD62 and ClinVar16, among other resources.

In addition to the above described annotations available through RD-Connect GPAP, all gVCFs derived from affected individuals were converted to VCFs and annotated by a custom annotation pipeline at RadboudUMC, as described previously63. This comprises variant-based annotations, including nucleotide conservation scores (phyloP and CADD), RadboudUMC in-house database allele frequencies and gene-based annotations including, for example, mouse knockout model phenotypes and pLI/LOEUF scores, among others. These annotated VCF files were made available to the Solve-RD consortium through the Solve-RD Sandbox, a cloud environment used by project partners to conduct bespoke analyses and thereby to securely share analysis and interpretation results, hosted by UMC Groningen, the Netherlands. A more detailed description of the Solve-RD data infrastructure has been published previously64.

Raw sequencing data (FASTQ), and newly generated alignment (BAM or CRAM) and variant call (gVCF) files for each experiment, accompanied by the corresponding phenotypic description in Phenopackets and pedigree descriptions in PLINK PED format, were submitted to EGA9 in Hinxton, UK for long-term archival and to allow controlled access by the wider human genomics community.

Quality control

A total of 10,276 ES and GS RD-REAL datasets from 10,039 individuals were initially submitted to Solve-RD for reanalysis. Preliminary quality control of sequencing data required a median coverage of at least ten reads over at least 70% of the defined target region of interest for the corresponding enrichment kit, or across the entire genome in the case of GS data. Furthermore, with respect to phenotypic data, each submitted family was required to have an affected proband with associated HPO terms. Misassigned relationships were identified, and subsequently corrected where possible, using KING (https://www.kingrelatedness.com/). Following application of these quality control measures, the final number of datasets taken forward for reanalysis comprised data from 9,645 individuals from 6,004 families, of which 6,447 (66.9%) were affected by a rare disease. Of these, ES data were available for 9,124 (94.6%) individuals, GS data for 333 (3.5%) and both ES and GS data for the remaining 190 (2.0%).

Variant identification and prioritization

RD-REAL data analysis and interpretation

We applied two-level expert analysis and interpretation to the RD-REAL datasets, comprising firstly bioinformatic and molecular genetics experts working together in dedicated working groups within a data analysis task force, and secondly, clinicians and rare-disease experts from each ERN who jointly prioritized and interpreted all variants returned by the data analysis task force, working in four distinct DITFs. To maximize the generalizability of this effort, the entire dataset of 6,004 families was included in a comprehensive analysis suite comprising an initial centralized analysis of each different variant type: SNVs and InDels, de novo mutations, mitochondrial variants, noncanonical splice variants, CNVs, SVs, STRs and MEIs. Subsequently, filters were applied with respect to variant quality, control population allele frequencies and predicted consequence, followed by further ERN- and disease-specific filters including the application of the ERN-specific gene lists described above. Details of all tools applied in these analyses are provided in Supplementary Table 13.

Because Solve-RD processed data in multiple data freezes, subsets of experiments continued to undergo analyses in parallel, some of which resulted in diagnoses before the results of the centralized systematic analyses were returned to submitters. This includes the preliminary analysis of a smaller dataset12,13. Furthermore, many datasets underwent parallel or additional analyses in the laboratories of the respective submitters, resulting in the identification of (probable) pathogenic, or candidate disease-causing, variants in established or novel genes. These results are labeled as ad hoc expert review (Fig. 2 and Supplementary Table 5), although the majority of these variants were also prioritized in the systematic analyses.

Taken together, this resulted in either diagnosed individuals (that is, those harboring (probable) pathogenic variants that fully explain the proband’s phenotype, unequivocally allowing a molecular diagnosis of a rare condition) or affected individuals with candidate variants worthy of further follow-up and/or functional studies, which may prove to be diagnostic in the future, as adjudged by the referring clinicians and/or expert ERN partners.

SNVs/InDels

Programmatic reanalysis was undertaken on annotated variants from RD-Connect GPAP using application programming interface endpoints, enabling complex queries with different combinations of filters across specific datasets13. Two different sets of parameters were used: first, a low-hanging fruit analysis to identify (probable) pathogenic variants already listed in ClinVar; second, identification of rare variants of high or moderate impact in ERN genes of interest, matching the expected mode(s) of inheritance.

  1. Low-hanging fruit analysis: depth of coverage (DP) >7; GATK genotype quality (GQ) >19; minor allele frequency (MAF) <0.01 in gnomAD; observed allele frequency <0.02 in the internal RD-Connect GPAP database; affecting a gene in the corresponding ERN gene list, and annotated as pathogenic (class 5) or probably pathogenic (class 4) for any disorder in ClinVar as of May 2021.

  2. High–moderate-impact variant analysis: DP >7; GQ >19; MAF <0.01 in gnomAD; observed allele frequency <0.02 in the internal RD-Connect GPAP database; affecting a gene in the corresponding ERN gene list, predicted to have a high or moderate consequence at the protein level according to Ensembl VEP and matching the expected inheritance pattern (that is, autosomal dominant, autosomal recessive or X-linked).

Variants passing the above filtering criteria were returned in a single table to the respective DITF for each ERN, to facilitate evaluation and provision of feedback. Across the Solve-RD cohort we identified a mean of eight SNVs/InDels per affected individual for interpretation, ranging from one to 13 across ERNs, this difference largely reflecting differences in the number of genes included in the corresponding ERN gene lists (Supplementary Tables 2 and 14).
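The two SNV/InDel parameter sets described above can be expressed as simple predicates. The following Python sketch is illustrative only: the `Variant` record, its field names and the precomputed `matches_inheritance` flag are assumptions made for this example and do not reflect the RD-Connect GPAP API or its data model. Only the thresholds are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Variant:
    """Hypothetical annotated-variant record (field names are assumptions)."""
    gene: str
    dp: int                               # depth of coverage
    gq: int                               # GATK genotype quality
    gnomad_maf: float                     # minor allele frequency in gnomAD
    internal_af: float                    # frequency in the internal database
    clinvar_class: Optional[int] = None   # 5 = pathogenic, 4 = probably pathogenic
    vep_impact: Optional[str] = None      # for example "HIGH" or "MODERATE"
    matches_inheritance: bool = False     # assumed to be computed upstream


def passes_shared_filters(v: Variant, ern_genes: Set[str]) -> bool:
    """Quality, frequency and gene-list criteria common to both analyses."""
    return (
        v.dp > 7
        and v.gq > 19
        and v.gnomad_maf < 0.01
        and v.internal_af < 0.02
        and v.gene in ern_genes
    )


def low_hanging_fruit(v: Variant, ern_genes: Set[str]) -> bool:
    """Analysis 1: shared filters plus a ClinVar class 4 or 5 annotation."""
    return passes_shared_filters(v, ern_genes) and v.clinvar_class in (4, 5)


def high_moderate_impact(v: Variant, ern_genes: Set[str]) -> bool:
    """Analysis 2: shared filters plus VEP impact and inheritance match."""
    return (
        passes_shared_filters(v, ern_genes)
        and v.vep_impact in ("HIGH", "MODERATE")
        and v.matches_inheritance
    )


# Toy example with a placeholder gene symbol (not a real gene list).
toy_genes = {"GENE_A"}
known = Variant("GENE_A", dp=30, gq=60, gnomad_maf=0.0001, internal_af=0.001,
                clinvar_class=5)
novel = Variant("GENE_A", dp=12, gq=45, gnomad_maf=0.0, internal_af=0.0,
                vep_impact="HIGH", matches_inheritance=True)
shallow = Variant("GENE_A", dp=7, gq=60, gnomad_maf=0.0, internal_af=0.0,
                  clinvar_class=5)
print(low_hanging_fruit(known, toy_genes), high_moderate_impact(novel, toy_genes))
```

Keeping the shared criteria in one function makes explicit that the two analyses differ only in their final annotation-based condition.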

De novo mutations

For all families for which parent–child trios were available (n = 1,320; 22% overall), de novo mutation calling was undertaken using both HaplotypeCaller and DeNovoCNN65. De novo mutation calls from DeNovoCNN with probability >0.85 of being a bona fide event, and any apparent de novo mutations identified by HaplotypeCaller which were located in a gene on the respective ERN gene list, were returned to DITFs for variant interpretation.

Mitochondrial genome variants

Mitochondrial DNA variants were identified using MToolBox. The workflow includes mapping reads to the revised Cambridge Reference Sequence mitochondrial genome and annotation using the MITOMAP database (https://www.mitomap.org/MITOMAP, accessed 28 June 2021). Both homoplasmic and heteroplasmic variants were identified (Supplementary Table 15).

Identification of noncanonical splice variants

For identification of variants potentially affecting splicing at sites other than canonical splice sites, two novel tools were applied, SpliceAI29 and SQUIRLS66. Rare variants receiving a strong splice-altering prediction from both tools (that is, both a delta-score >0.8 in SpliceAI and a pathogenic classification by SQUIRLS, which would potentially alter splicing of any gene in the corresponding ERN gene list) were returned to DITFs for interpretation.

Large CNVs and SVs

Three different tools were used to maximize the likelihood of identifying pathogenic CNVs, as described in Demidov et al.36: ClinCNV67, Conifer and ExomeDepth. Variants observed to have a frequency >0.01 across the cohort were discarded, and the remaining rare CNVs were intersected with the corresponding ERN gene list and annotated using AnnotSV68 before being returned to DITFs for interpretation. In parallel, Manta37 was run in exome mode to search for signatures of split reads, which might indicate the presence of balanced SVs such as inversions. To facilitate interpretation, Integrative Genomics Viewer (IGV) tracks were generated for all large variants, indicating the exons, the position and type of call produced by the tools and B-allele frequency. See Supplementary Table 13 for details regarding sources and exact versions of tools applied.

STR expansions

The identification of potentially pathogenic STR expansions was largely based on the work of van der Sanden et al.69. In brief, ExpansionHunter70 was used to screen 21 genomic loci previously described as harboring pathogenic repeat expansions in both ES and GS data (Supplementary Tables 16 and 17), from a total of 5,983 families. Following retrieval of predicted pathogenic genotypes across all samples, any frequently observed events were discarded and the remaining variants affecting genes on the corresponding ERN gene list were manually curated by visual inspection, before being returned to DITFs for interpretation.

MEIs

To identify any MEIs potentially affecting ERN genes of interest, the methods described by Wijngaard et al.71 were followed. In brief, MEI identification was undertaken using both MELT and SCRAMble. MEIs of potential interest were limited to those that fell within a window of ±50 base pairs (bp) of ES target areas. All MEIs observed in nonaffected cases were removed, followed by the exclusion of those present in the Database of Retrotransposon Insertion Polymorphisms in Humans. MEI frequency was calculated by counting all overlapping (±50 bp) MEIs in the cohort, and only rare events—defined as having a frequency <0.03% in their respective cohorts—were retained. We further filtered to MEIs found in clinically relevant genes based on the patient’s phenotype as defined by the ERN. The remaining MEIs were visually inspected in IGV to discard low-quality calls. Finally, MEIs were selected for confirmation by ERN members, taking into consideration the phenotype–genotype match, inheritance pattern and presence of a second variant in the case of an autosomal recessive disorder.
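The cohort-frequency step for MEIs described above can be sketched as follows. This Python example assumes that frequency is the number of distinct individuals carrying an MEI call within ±50 bp on the same chromosome, divided by the cohort size; this is one reading of the text, not a description of the MELT or SCRAMble output formats. The `MEICall` record and the function names are hypothetical.

```python
from typing import List, NamedTuple


class MEICall(NamedTuple):
    """Hypothetical minimal MEI call: carrier, chromosome and position."""
    sample: str
    chrom: str
    pos: int


def cohort_frequency(call: MEICall, calls: List[MEICall],
                     cohort_size: int, window: int = 50) -> float:
    """Fraction of the cohort carrying an MEI within +/- `window` bp of `call`."""
    carriers = {
        other.sample
        for other in calls
        if other.chrom == call.chrom and abs(other.pos - call.pos) <= window
    }
    return len(carriers) / cohort_size


def rare_meis(calls: List[MEICall], cohort_size: int,
              max_frequency: float = 0.0003, window: int = 50) -> List[MEICall]:
    """Keep calls whose cohort frequency is below `max_frequency` (0.03%)."""
    return [
        call for call in calls
        if cohort_frequency(call, calls, cohort_size, window) < max_frequency
    ]


# Toy cohort of 5,000 individuals (assumption): two carriers share a site
# on chr1 (30 bp apart), one individual carries a private event on chr2.
toy_calls = [
    MEICall("S1", "chr1", 1000),
    MEICall("S2", "chr1", 1030),
    MEICall("S3", "chr2", 500),
]
kept = rare_meis(toy_calls, cohort_size=5000)
print([call.sample for call in kept])
```

The pairwise comparison is quadratic in the number of calls, which is acceptable for a sketch; a production implementation would more likely sort calls by position or use an interval index.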

An overview of the exact number of families analyzed for each variant type is provided in Supplementary Table 18.

Statistics and reproducibility

This study includes only observational statistics, primarily counts. We report means and medians where appropriate, and applied two-tailed Fisher’s exact tests to compare differences between groups. Each family was analyzed independently, in order of submission, and no statistical method was required to predetermine sample size. No data were excluded, with the exception of cases that failed quality control as described above. Sex is not a relevant variable, because both sexes are essentially equally likely to be affected by a rare disease. The investigators were not blinded to allocation during outcome assessment.

Reproduction of results was not applicable. However, follow-up and validation of identified variants by orthogonal means and/or using other bioinformatic tools were undertaken in the vast majority of cases, to ensure that the variants identified were biologically real and relevant. As commonly found in the rare-disease field, replication of previous variant observations has happened, or will happen, via databases (for example, ClinVar) or the scientific literature.
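For readers unfamiliar with the two-tailed Fisher’s exact test referred to above, the following minimal Python sketch shows how a P value for a 2 × 2 table of counts can be obtained from the hypergeometric distribution. The software used for the tests in this study is not stated in this section, so the function below is a generic illustration, and the example counts are toy values rather than study data.

```python
from math import comb


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact test for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_probability(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the
    # observed one; the small tolerance guards against floating-point ties.
    p_value = sum(
        p
        for p in map(table_probability, range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Toy counts (assumption): solved/unsolved families in two hypothetical groups.
p = fisher_exact_two_tailed(3, 0, 0, 3)
print(round(p, 3))
```

Summing all tables whose probability does not exceed that of the observed table is one common definition of the two-tailed P value; other conventions exist and can give slightly different results for asymmetric tables.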

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.