Introduction

Lumbar disc herniation (LDH) is one of the most frequent structural findings in the lumbar spine to cause specific symptoms. Disc herniation is a general term that includes different types of disc displacements such as disc protrusion, extrusion, and sequestration1. Disc protrusions, for instance, are rarely symptomatic, and they are prevalent even among asymptomatic subjects. The overall prevalence of LDH in the whole population has been estimated to be 14% in a large cross-sectional study, and herniations are found most frequently at the L4/5 and L5/S1 segments of the spine2.

LDH has a clinical relevance if it causes radicular symptoms in the lower extremity. LDH is, indeed, the most frequent condition that causes lumbar radicular pain, i.e., sciatica, or radiculopathy3. The mechanisms underlying LDH-induced radicular pain are manifold. While mechanical compression of the nerve root is a contributing factor, inflammatory mediators and autoimmune responses also play a considerable role in eliciting radicular pain4. When the nerve root is perturbed, typical symptoms include radicular pain, paraesthesia, or numbness in the area of the nerve, and possible weakness in the muscles innervated by the affected nerve root5.

Typically, patients with symptomatic LDH are treated conservatively, but surgery is required if there is a sudden or progressive neurological deficit or unmanageable pain despite appropriate conservative treatment5. Overall, surgery has not been proven to have superior outcomes in the long term, even though surgery can have better pain relief in the short term6,7.

While most patients tend to recover from symptoms rather quickly, some studies have reported conflicting evidence. Psychosocial factors, such as the patient’s own beliefs of recovery, can also have a role in prognosis8.

Genetic influence on LDH and sciatica has been established through research that has shown certain loci to be associated with these conditions in genome-wide association studies (GWASs)9,10. Symptomatic LDH could be caused by factors affecting either disc-related structures, such as collagen11, or other factors, such as those related to nerves, inflammation, or autoimmunity4. Previous studies have associated several pathways with LDH pathogenesis, encompassing inflammation, chondroitin sulfation, collagen synthesis, and chondrogenic differentiation9,10. Some of these genes have also been associated with back pain and the regulation of pain sensations12,13,14. The aetiological factors behind LDH are quite well understood15. However, the genetic factors behind these distinct features warrant more research. The aim of this study was to explore different genetic features behind LDH by conducting a GWAS using data from three large biobanks: FinnGen, Estonian Biobank, and UK Biobank.

Results and discussion

64 loci associate with LDH risk

In this genome-wide meta-analysis of LDH, as defined by International Classification of Diseases (ICD)−10 code M51 (M51.1-M51.9), we utilised data from FinnGen, the Estonian Biobank, and the UK Biobank, comprising a total of 829 699 participants (80 724 cases, and 748 975 controls; Supplementary Table 1). In the meta-analysis, we identified 41 previously unreported (Supplementary Figs 141, Table 1) and replicated 23 known loci located more than 1MB apart (Supplementary Fig 42, Supplementary Table 2), each containing at least one variant associated with LDH at the genome level (M51.1-51.9, Fig. 1, Supplementary Data 1, Supplementary Data 2). Out of the previously unreported loci, 4 loci had previously been associated with other back-related diseases12,13,16; however, our research demonstrates that these loci are also associated with LDH. In the conditional analysis, we observed secondary signals at 5 of the loci (Supplementary Table 2). We estimated LD score regression-derived SNP-based heritability of LDH to be 0.081 (standard error [SE] = 0.003), suggesting that genetic factors account for 8.1% of the common variation in LDH risk. This estimate aligns with the one available at the UKB-Heritability Browser (0.077, SE = 0.026; https://nealelab.github.io/UKBB_ldsc/h2_summary_M51.html). The genomic inflation factor lambda (1.47) suggested inflation in the test statistics. Given the intercept value of 1.12, the inflation observed may be attributed to a polygenic signal and population stratification17. In FinnGen, the SNP-based heritability estimate was 14.3% (SE = 0.0083), and lambda 1.39 with an intercept of 1.13.

Table 1 The list of previously unreported lead variants at the 41 genome-wide significant (p < 5 × 10−8) loci that were associated with LDH in the meta-analysis (80 724 cases, 748 975 controls) and five previously unreported loci associated with LDH related surgical operations in sensitivity analysis that were performed in FinnGen (7347 cases, 270 964 controls)
Fig. 1: Manhattan plot and lead variant effect size comparison with sensitivity analyses.
figure 1

A The Manhattan plot of the genome-wide meta-analysis of LDH with 80 724 cases and 748 975 controls. Previously reported loci are indicated with orange colour, 37 previously unreported loci are highlighted in red, and 4 unreported loci for LDH, but previously associated with back pain are indicated with black. Candidate genes possibly explaining the LDH associations were used as loci identifiers. The red dashed line depicts the genome-wide significance limit (p < 5 × 10−8). B A comparison of the effect sizes of the lead variants discovered in the original meta-analysis (ICD-10:M51: 80 724 cases, 748 975 controls, [blue]), and in sensitivity analyses with more strict case definitions (ICD-10:M51.1: 18 857 cases, 270 964 controls, [green] or a surgery: 7347 cases, 270 964 controls, [red]). Dots indicate effect size and vertical lines are the corresponding 95% confidence intervals. For effect differences statistical comparison, we used a two-tailed test, using group-specific effect estimates of the variants and the corresponding standard errors diff = ((Effect_Meta-Effect_M51.1)/sqrt(standarderror_Meta2+standarderror_M51.12)), P-value = 2*(1-diff). P-value < 0.05 was considered the limit of a significant effect size difference. Significant effect size differences were observed for nine variants in surgical GWAS compared to the original meta-analysis. No statistically significant effect differences were found between the meta-analysis and M51.1. Other variants can be seen in Supplementary Fig 43. Source data are provided in the Source Data file.

Five previously unreported loci associated with LDH requiring surgical intervention

In two sets of sensitivity analyses conducted in FinnGen, more strict case definitions were used by limiting LDH cases to the ones with radiculopathy (M51.1) and to those who have undergone surgery (Supplementary Table 1). No statistically significant differences in the effect sizes of the lead variants were observed between the GWAS specific to the M51.1 endpoint and the original meta-analysis. When comparing the GWAS on LDH patients who underwent surgical treatment with the original meta-analysis, however, the effect sizes were significantly larger for nine variants (Fig. 1, Supplementary Fig 43, Supplementary Table 3). In this analysis, we also identified five previously unreported loci associated with LDH-related surgical operations and replicated three previously associated loci (Supplementary Fig 44, Table 1, Supplementary Table 4)10.

Description of the LDH-associated loci

As reported previously, numerous LDH-associated loci were in the vicinity of genes related to inflammation or disc-related structures (Supplementary Table 2, Supplementary Data 2), indicating that these pathways play a central role in LDH pathogenesis. In addition, we detected unreported loci near genes related to the Wnt/β-catenin pathway, such as PDZRN3 (PDZ domain containing ring finger 1, locus 3p13), activation of the Wnt/β-catenin pathway has been found to be associated with endplate degeneration, increased intervertebral disc (IVD) cell senescence, and extracellular matrix degradation18,19. LDH associations were also observed in loci near NGF (nerve growth factor), DCC (DCC netrin-1 receptor), and NCK1 (NCK adaptor protein 1). These genes are involved in the growth and regulation of nerve axons20,21, suggesting a potential connection between genes affecting nerves and the nervous system and LDH pathogenesis. The dysfunction of these genes could lead to IVD innervation20, and increase sensation of pain. Notably, some of these genes have already been associated with pain sensation in previous studies12,20.

In addition, we observed previously unreported LDH associations near CA10 (carbonic anhydrase 10) and DNM1 (dynamin 1) that are involved in synaptic transmission. CA10 blocks the binding of heparan sulphate to neurexin, which possibly affects the function of neurexins22. Neurexins are pre-synaptic cell adhesion molecules that play a role in connecting neurons at synapses. Heparan sulphate has been found to potentially expand the interactome of neurexins, and they also play a role in fine-tuning synaptic transmission23. CA10 is expressed especially in the central nervous system, and it has been associated with chronic pain in previous studies12,24. DNM1 plays a central role in the transmitting nociceptive messages within the nociceptive circuits in the dorsal horn of the spinal cord, where DNM1-mediated endocytosis of synaptic vesicles enables sustained neurotransmission25. While further studies are needed, the association of genes involved in synaptic transmission with LDH suggests that differences in synaptic transmission may influence differences in pain perception in patients with morphologically similar LDHs. Moreover, these genes could also contribute to the chronification of pain in patients with symptomatic LDH. The findings above underline the relevance of the physiological factors of the nervous system in addition to the disc-related structures behind symptomatic LDH.

Connective tissue related traits were enriched in functional analyses

The MAGMA gene-based test and GCTA-fastBAT analysis highlighted multiple genes that we had deemed as potential candidate genes at the previously unreported LDH-associated loci (Table 1, Fig. 2A, Supplementary Table 5), thus providing supportive evidence for our findings. In the MAGMA gene-set analysis (Fig. 2B), we observed the most significant enrichments for pathways responsible for chondrocyte differentiation. Significant gene set enrichments were also observed for cartilage and connective tissue development pathways, as well as for pathways related to chromatin organization and modification. Among the gene sets related to the nervous system, we also observed significant enrichments in gene sets related to the structure and active zone organization of presynapses. In the MAGMA tissue expression analysis (Supplementary Fig 45), we found no tissues showing a positive correlation between tissue-specific gene expression profiles and LDH associations. The likely reason for the null result is the absence of the relevant tissues, namely cartilage and bone, in the Genotype-Tissue Expression (GTEx) dataset used in these analyses. Colocalizations between LDH GWAS and eQTL signals were observed in each selected tissue, namely whole blood, cultured fibroblasts, tibial nerve, and cervical spinal cord. In one or more of the four investigated tissues, we discovered evidence of the colocalization of LDH signals and the expression of twenty genes (posterior probability [PP] for the shared variant ≥ 0.8, Fig. 2C, Supplementary Table 6). At 6p21.31, the LDH association signal colocalized with the expression of ILRUN (inflammation and lipid regulation with UBA-like and NBR1-like domains, formerly known as C6orf106) in three tissues (PPWhole Blood = 0.967, PPCultured fibroblasts = 0.966, PPTibial Nerve = 0.996). In addition to its potential involvement in the development of chronic pain in the brain, ILRUN, an inhibitor of proinflammatory and antiviral cytokine transcription, has also been linked to peripheral pain perception26,27. Colocalization of the LDH association signal and gene expression for ten additional genes was observed in nervous system-related tissue, supporting the notion that the nervous system has a role in the pathophysiology of LDH (Fig. 2C, Supplementary Table 6).

Fig. 2: Results of functional analyses.
figure 2

A MAGMA29 gene-based test in a Manhattan plot. X-axis chromosomes, y-axis -log10(p-value). B MAGMA gene-set enrichment analysis. The plot shows significantly (pFDR < 0.05) enriched pathways (pFDR < 0.05), curated gene sets, and GO-annotations ranked by p-value -log10(P). The size of the circles refers to the size of the gene set. Small grey < 15, blue 15–100, violet 100–200, and red > 200 genes. The analysis was done using FUMA47. The gene sets and GO annotations are from MSigDB49. C Colocalizations between LDH GWAS and eQTL signals were estimated with approximate Bayesian factor analyses, implemented with the ‘coloc.abf’ function available in the ‘coloc’ R-library51. Candidate genes and genes closest to the association signal were selected for the analysis if the closest gene was a different gene than the candidate gene (Supplementary Table 7). Variant-gene expression associations for selected tissues (whole blood, cultured fibroblasts, tibial nerve, and cervical spinal cord), were downloaded from GTEx v845. Colocalizations with posterior probabilities ≥ 0.8 for the variant were considered significant52. Source data are provided in Source Data file.

Cumulative incidence of LDH diagnoses uncovers early-onset variants

We investigated the cumulative incidence of LDH diagnoses and assessed the differences in the cumulative incidences between homozygotes for the lead variants at each LDH-associated locus. As observed previously28,29, LDH diagnoses accumulated greatly between the ages of 40 and 50 until circa 70 years old, after which the accumulation of diagnoses was very low (Fig. 3). Overall, the variants exhibited a similar pattern in terms of the accumulation of the LDH diagnoses, with a few exceptions. The cumulative incidence of LDH differed between GSDMC (gasdermin C), rs7814941 G/G and A/A homozygotes already at the age of 26 (p = 0.0005; Fig. 3A) and CHST3 (carbohydrate sulfotransferase 3), rs4284332 C/C and T/T homozygotes at the age of 25 (p = 2.22e−5; Fig. 3B). For the other variants, the differences became statistically significant at a later age, IGFBP3 (insulin like growth factor binding protein 3), rs788747 G/G and T/T at the 35 years of age (p = 0.001; Fig. 3C) and SOX6 (SRY-box transcription factor 6), rs9787942 C/C and T/T at the 36 years of age (p = 0.03; Fig. 3D). While these genes have already been associated with LDH in previous studies, we were also able to describe the age at which diagnoses begin to accumulate for the variants under investigation. Regarding the other variants, the difference in the cumulative LDH diagnoses between homozygotes manifested at a later age, or there was no significant difference (Supplementary Table 8). Other variants cumulative morbidities followed the sample prevalence values (12.2% for LDH diagnoses and 2.6% for surgical patients, Supplementary Table 8). We also repeated the analysis restricted to radiculopathy cases (M51.1), where all variants behaved identically in the analyses compared to the original analysis (Supplementary Fig 47). The results obtained in the Estonian Biobank were similar to those observed in FinnGen (Supplementary Fig 48).

Fig. 3: Cumulative LDH diagnoses.
figure 3

Cumulative LDH diagnoses for (AGSDMC, B CHST3, C IGFBP3, D SOX6 variants in the Kaplan-Meier plots. Age in years is on the x-axis, and the corresponding fraction of the cases is on the y-axis. The orange line depicts homozygotes for the effect allele, the grey homozygotes for the other allele, and the blue line corresponds to heterozygotes having one copy of each allele. Additionally, candidate genes, rsid, CHR:POS, and alleles were used to identify variants in the plots. Analyses were conducted in FinnGen (n = 308 600); the exact survival rates and ranges are listed in the Source data. Cumulative diagnoses for the (AGSDMC and (BCHST3 variants before the age of 30 can be seen in more detail in Supplementary Fig 46. Corresponding plots obtained using the subset of cases with radiculopathy (M51.1) and data from Estonian Biobank are given in the supplementary information (Supplementary Fig 47, Supplementary Fig 48). CHR:POS, chromosome and position (genome build hg38). Source data are provided in Source Data file.

Numerous strong genetic correlations were observed for LDH

We found specific significant genetic correlations between LDH and 437 traits. The most significant positive genetic correlation in terms of the smallest p-value was observed with the number of treatments/medications taken (Fig. 4: rg = 0.52, pFDR = 2.27−100), while the largest significant genetic correlation was observed with the diagnosis of dorsalgia (Fig. 4: rg = 0.83, pFDR = 1.23−45). LDH was also positively genetically correlated with other pain-related endpoints, such as neck and shoulder pain (Fig. 4: rg = 0.58, pFDR = 8.60−95) and knee pain (Fig. 4: rg = 0.46, pFDR = 2.07−51). The most significant negative genetic correlation of LDH was with a higher level of education (Fig. 4: rg = −0.41, pFDR = 2.10−88).

Fig. 4: Genetic correlations (rg) between LDH and other traits.
figure 4

Analyses were conducted using the LDSC-software, and analysis was done utilizing our meta-analysis summary statistics (N = 829 699). All traits were extracted from the GWAS database provided by the MRC-IEU database (n ranges from 7637 to 575 601; the trait-specific sample sizes are provided in Supplementary Data 3). Only the strongest observed (rg < −0.4 & rg> 0.45) correlations with a significant false discovery corrected p-value than(pFDR < 0.05) are shown in the figure. rg, genetic correlation coefficient value; pFDR, false discovery rate-corrected p-value. The forest plot visually represents observed rg (95% CI) values. Genetic correlations for all 437 phenotypes can be seen in Supplementary Data 3. Source data are provided in the Source Data file.

Evidence of a causal relationship between LDH and back pain

In Mendelian randomization, we uncovered a potential causal relationship between several factors and LDH (Fig. 5, Supplementary Table 9, Supplementary Table 10). Results suggested a causal relationship between being overweight and higher LDH risk (OR = 1.15, pFDR=0.002, Fig. 5, Supplementary Fig 49). Similarly, a potential causal relationship was observed between lumbar spine bone mineral density (LSBMD) and higher LDH risk (OR = 1.15, pFDR = 2.29−5, Fig. 5, Supplementary Fig 50). Both overweight and LSBMD are well-known risk factors for LDH, and both have been found to cause increased mechanical loading on the lumbar discs and vertebral endplates, affecting the pathogenesis of LDH30,31,32,33,34. A possible causal relationship was also observed between a higher level of education and lower LDH risk (OR = 0.34, pFDR = 4.35−23, Fig. 5, Supplementary Fig 51). In previous studies, patients with lower socioeconomic status tend to have more pronounced symptoms, experience more severe pain, and report higher rates of depression35. Also, individuals with lower socioeconomic status are more likely to work in physically demanding manual labour occupations. It has been previously determined that physically demanding work poses a risk, particularly for recurrent LDH36. Conversely, individuals with higher education usually have a better income level, they tend to seek treatment at an earlier stage after the onset of LDH, and they often have better pain management methods and generally healthier lifestyles35,37. Additionally, we noted a potential causal relationship between LDH and the frequency of tiredness in the last two weeks (beta = 0.02, pFDR = 0.046, Fig. 5, Supplementary Fig 52) and back pain (beta=0.05, pFDR = 6.20−13, Fig. 5, Supplementary Fig 53). The role of LDH to cause back pain is well known3, and the possible causal relationship we observed between LDH, and increased tiredness likely arises from sleeping problems, which are commonly reported by patients suffering from radicular pain8. These findings should be interpreted with caution; while we did not observe pleiotropy, some causal estimates exhibited heterogeneity. In the leave-out analyses, all causal estimates were consistently in the same direction, so individual variants do not seem to drive the observed causal relationships (Supplementary Figs 5458).

Fig. 5: Exposures potentially causal for LDH (above, Supplementary Table 9), and outcomes that LDH was potentially found to be causal for (below, Supplementary Table 10).
figure 5

The analysis was performed using the TwoSampleMR R library, and data from FinnGen LDH-GWAS results (N = 308 600) and MRC-IEU database (n ranges from 28 498 to 461 857; the trait-specific sample sizes are provided in Supplementary Table 11). Inverse variance weighted model was our primary analysis, for which statistical significance was considered at false discovery rate corrected p-value (pFDR) < 0.05. As a sensitivity analysis, we also performed analysis by using MR Egger. nsnp, number of SNP’s; OR (95% CI), odds ratio and its 95% confidence interval; Beta (95% CI), beta estimate and it’s 95% confidence interval; pFDR, false discovery rate-corrected p-value. The forest plot visually represents observed OR (95% CI) and Beta (95% CI). Source data are provided in Source Data file.

Discussion

Although the incorporation of data from three extensive biobanks enabled a large sample size and facilitated discoveries of multiple genome-wide significant associations with LDH, our sample is limited to European ancestry only. Variations in the relative prevalence of LDH cases across the sample populations included in the meta-analysis suggest potential discrepancies in how biobanks can identify LDH patients. The eQTL analyses would benefit from incorporating bone and cartilage data, which are crucial tissues in LDH pathogenesis. Given this paper solely employs computational techniques, our findings would be enhanced by further functional research in the future.

In conclusion, the previously unreported LDH risk loci that we found expand the understanding of the hereditary causes of LDH. While changes in disc-related structures and inflammation-related factors play a major role in the aetiology of LDH, our results from multiple downstream analyses suggest that nervous system-related mechanisms may also be implicated. Through sensitivity analyses, we also discovered genetic factors associated with a more severe form of LDH that required surgical interventions. We also recognized that certain variants exhibit a statistically significant increase in LDH risk even before the age of 30, as evidenced by the patterns of cumulative LDH incidence. All things considered, our study advances our knowledge of the genetic causes of LDH and has the potential to pave the way for the development of novel medications, preventative strategies, and therapies for LDH in the future.

Methods

Our research complies with all necessary ethical regulations (Supplementary Note 1). Patients and control subjects in FinnGen provided informed consent for biobank research, based on the Finnish Biobank Act. Alternatively, separate research cohorts, collected prior the Finnish Biobank Act coming into effect (in September 2013) and the start of FinnGen (August 2017), were collected based on study-specific consents and later transferred to the Finnish biobanks after approval by Fimea (Finnish Medicines Agency), the National Supervisory Authority for Welfare and Health. Recruitment protocols followed the biobank protocols approved by Fimea. The Coordinating Ethics Committee of the Hospital District of Helsinki and Uusimaa (HUS) statement number for the FinnGen study is Nr HUS/990/2017.

The FinnGen study is approved by Finnish Institute for Health and Welfare (permit numbers: THL/2031/6.02.00/2017, THL/1101/5.05.00/2017, THL/341/6.02.00/2018, THL/2222/6.02.00/2018, THL/283/6.02.00/2019, THL/1721/5.05.00/2019 and THL/1524/5.05.00/2020), Digital and population data service agency (permit numbers: VRK43431/2017-3, VRK/6909/2018-3, VRK/4415/2019-3), the Social Insurance Institution (permit numbers: KELA 58/522/2017, KELA 131/522/2018, KELA 70/522/2019, KELA 98/522/2019, KELA 134/522/2019, KELA 138/522/2019, KELA 2/522/2020, KELA 16/522/2020), Findata permit numbers THL/2364/14.02/2020, THL/4055/14.06.00/2020, THL/3433/14.06.00/2020, THL/4432/14.06/2020, THL/5189/14.06/2020, THL/5894/14.06.00/2020, THL/6619/14.06.00/2020, THL/209/14.06.00/2021, THL/688/14.06.00/2021, THL/1284/14.06.00/2021, THL/1965/14.06.00/2021, THL/5546/14.02.00/2020, THL/2658/14.06.00/2021, THL/4235/14.06.00/202, Statistics Finland (permit numbers: TK-53-1041-17 and TK/143/07.03.00/2020 (earlier TK-53-90-20) TK/1735/07.03.00/2021, TK/3112/07.03.00/2021) and Finnish Registry for Kidney Diseases permission/extract from the meeting minutes on 4th July 2019.

The Biobank Access Decisions for FinnGen samples and data utilized in FinnGen Data Freeze 9 include: THL Biobank BB2017_55, BB2017_111, BB2018_19, BB_2018_34, BB_2018_67, BB2018_71, BB2019_7, BB2019_8, BB2019_26, BB2020_1, Finnish Red Cross Blood Service Biobank 7.12.2017, Helsinki Biobank HUS/359/2017, HUS/248/2020, Auria Biobank AB17-5154 and amendment #1 (August 17 2020), AB20-5926 and amendment #1 (April 23 2020) and it´s modification (Sep 22 2021), Biobank Borealis of Northern Finland_2017_1013, Biobank of Eastern Finland 1186/2018 and amendment 22 § /2020, Finnish Clinical Biobank Tampere MH0004 and amendments (21.02.2020 & 06.10.2020), Central Finland Biobank 1-2017, and Terveystalo Biobank STB 2018001 and amendment 25th Aug 2020.

Study populations

FinnGen

The main goal of FinnGen (www.finngen.fi/en) is a better understanding of disease mechanisms by combining genomic and health data from up to over 500,000 Finns, with the aim of making healthcare and medical care more efficient. The aim of the studies is to find connections between individual genetic differences and diseases. Tenth Revision (ICD-10) codes were used to characterize phenotype M51 (M51.0-M51.9: Thoracic, thoracolumbar, and lumbosacral intervertebral disc disorders, Supplementary Table 1). Patients without a record of these ICD codes were categorized as controls, and from the controls participants that had been diagnosed with other dorsopathies were excluded (M40-M50, M52-M54, M76). The Hospital Discharge Registry and the Cause of Death Registry served as patient data sources; the analysis did not include patients whose registration data was only available in the primary care registry. The FinnGen (R9-version) data used in the study contains 37 636 LDH cases and 270 964 controls.

Sensitivity analyses

We also performed two sensitivity GWAS analyses in FinnGen using stricter case definitions. The first of the two additional case definitions included only LDH patients with the M51.1 code. There were 18 857 cases and 270 964 controls in the GWAS; LDH patients who also had other LDH codes were excluded from the analysis (Supplementary Table 1). The second additional case definition included LDH patients who had undergone an LDH-related operation (NOMESCO version 1.15, ABC07, ABC16 & ABC26)38; this analysis included 7347 cases and controls of 270 964. LDH cases that had not been operated were excluded from the analysis. The same was done for operated cases that didn’t have a LDH diagnosis, as these patients were probably operated as a result of acute injury.

Estonian biobank (EstBB)

The Estonian Biobank (www.genomics.ut.ee/en) cohort is a volunteer-based sample of the Estonian resident adult population (aged ≥ 18 years)39. Estonians represent 83%, Russians 14%, and other nationalities 3% of all participants. The current number of participants is > 205,000 and represents a large proportion, > 15 % of the Estonian adult population, making it ideally suited to population-based studies. General practitioners (GPs) and medical personnel in the special recruitment offices have recruited participants throughout the country. At baseline, the GPs performed a standardized health examination of the participants, who also donated blood samples for DNA, white blood cells and plasma tests and filled out a 16-module questionnaire on health-related topics such as lifestyle, diet and clinical diagnoses described in WHO ICD-10. A significant part of the cohort has whole genome sequencing (3000), whole exome sequencing (2500), genome-wide single nucleotide polymorphism (SNP) array data (200 000) and/or NMR metabolome data (200 000) available. In the meta-analysis, there were 34 035 LDH cases and 66 533 controls from the Estonian Biobank.

UK biobank (UKBB)

The UK biobank (https://pan.ukbb.broadinstitute.org/) material consists of samples collected during the years 2006–2010. Samples were collected from hundreds of thousands of people aged 40–69 from all over Great Britain. We utilized the summary statistics from the PanUKBB project and used subset of European ancestry in the analysis that contained 9053 LDH patients and 411 478 controls from the UK biobank.

Genotyping, imputation & quality control

FinnGen

Illumina and Affymetrix DNA microarrays were used to determine genotypes. Genotype data were quality controlled to exclude variants with a low Hardy-Weinberg equilibrium (HWE) p-value (< 1 × 10−6), minor allele count (MAC) below three, and high missingness (cut-off 2%), as well as individuals with high genotype missingness (cut-off 5%), high levels of heterozygosity (±4 SD), non-Finnish ancestry, and individuals whose sex did not match the genotype data. Samples were pre phased using Eagle 2.3.5, with the number of conditioning haplotypes set to 20 000. Beagle 4.1 (version 08Jun17.d8b) was used for genotype imputation. The reference panel was Finnish SISu v4, and the imputation protocol has been described at (https://www.protocols.io/view/genotype-imputation-workflow-v3-0-e6nvw78dlmkj/v2). Finally, post-imputation quality control was carried out to exclude variants with imputation information less than 0.6.

EstBB

For genotyping, Illumina Human CoreExome, OmniExpress, 370CNV BeadChip and GSA arrays were used. Individuals were excluded from the analysis if their call-rate was <95% or sex defined using X chromosome heterozygosity estimates didn’t match phenotypic data. Quality control included filtering on the basis of sample call rate ( < 98%), heterozygosity ( > mean ± 3 SD), genotype and phenotype sex discordance, cryptic relatedness (IBD > 20%) and outliers from the European descent based on the MDS plot in comparison with HapMap reference samples. SNP quality filtering included call rate ( < 99%), MAF ( < 1%) and extreme deviation from Hardy–Weinberg equilibrium (P < 1 × 10 − 4). Imputation was performed using SHAPEIT2 for prephasing, the Estonian-specific reference panel40 and IMPUTE241 with default parameters. Association testing was carried out with snptest-2.5.2, adjusting for 4 PCs, arrays, current age, and sex(when relevant). Variants with call-rate <95%, MAF < 1% or HWE p-value < 1e-4 (autosomal variants only) and indels were excluded.

GWAS

In FinnGen and in EstBB, GWAS using an additive genetic model was performed using the Regenie (version 2.2.4) programme42, adjusting each phenotype for age, sex, the first 10 genetic principal components, and genotyping arrays. The sensitivity analyses with stricter case definitions conducted in FinnGen were performed using Regenie and the same covariates as above. The goal of the sensitivity analyses was to evaluate whether general degeneration of IVD has an effect on the effect estimates of the lead variants observed in the meta-analysis, and if the effect estimates obtained in the original meta-analysis differ from the ones obtained using stricter case definitions. The aim was also to identify variants that could underlie LDH cases requiring surgery.

Meta-analysis

A Python-based software was used for inverse-variance weighted meta-analysis (https://github.com/FINNGEN/META_ANALYSIS/). Variant data from the Estonian and UK biobanks were converted with from hg19 to hg38 prior to meta-analysis using the Picard liftover (http://broadinstitute.io/picard). In case of no exact match, matching was tried by flipping strand and or switching (EA- > OA/OA- > EA) for the EstBB and UKBB variants. If there were multiple variants in the same position, the exact match was favoured. In total, there were 829 699 participants in the meta-analysis, of which there were 80 724 cases and 748 975 controls.

Candidate gene characterization

We defined a locus as a window of 2MB ( ± 1,000,000 bases) containing at least one variant associated with LDH at P < 5 × 10−8. For the loci that had not been reported in association with LDH in prior studies, we determined a potential candidate gene with a relevant biological function with the help of literature and databases (Genbank43, Uniprot44, GTEx-Portal45) and identified variants affecting gene regulation (eQTL).

Conditional analysis

We performed conditional analyses for the loci to identify possible secondary signals. The analyses were performed using the GCTA software (1.93.0 beta Linux)46 and the meta-analysed summary statistics. FinnGen was used as the reference sample to estimate linkage disequilibrium corrections. The associations were first conditioned on the most significant variant at each locus using the --cojo-cond option and default settings. Conditioning persisted until no variant achieved genome-wide significance (p < 5 × 10 − 8).

Heritability

The LDSC software (version 1.0.1)17 was used to calculate the SNP-based heritability estimate. Heritability estimation was performed using the liability scale, with a sample prevalence of 0.097 and a population prevalence of 0.14, as estimated by Zhang et al. (2016)2. HapMap3, a European-only reference panel provided by Alkes Price’s group (https://alkesgroup.broadinstitute.org/LDSCORE/) was used in heritability estimation. Summary statistics included 1,202,124 SNPs, and after merging with the LD reference panel 1,176,449 SNPs remained. In FinnGen’s data, the sample prevalence was 0.122. The reference panel remained the same, 1,166,621 SNPs were included from the summary statistics based on FinnGen, and after merging with the LD reference panel 1,157,245 SNPs remained.

Functional annotations

Functional annotations of the results of the meta-analysis were completed using FUMA47. Functional settings were selected for the analysis, in which case the programme uses functional information for mapping. Positional mapping was also performed, and for that, SNP markers were selected for the region of exons or introns affecting post-transcriptional modifications and involved in gene regulation. Filtering of SNP markers based on CADD results was also done, which provided additional information on the possible harmful effects of SNP markers. In addition, filtering of SNP markers was performed based on the RegulomeDB results, and in turn, information was obtained based on gene expression data and epigenomics about the possible functions of SNP markers affecting gene regulation. In the mapping, gene expression data were also utilized, and eQTL mapping was performed. For this, we used whole blood, cultured fibroblasts, tibial nerve, and cervical spinal cord cervical (GTEx v8) as tissues, and the focus was only on genes involved in protein-coding. A MAGMA analysis48, a functional association test, was also performed in the run, which focuses on gene-level information, unlike GWAS, where associations are reported at the variant level. MAGMA uses curated gene sets and GO annotations from MSigDB49 in the analyses. A 10 kb gene window and selected GTEx v8 tissue variants were put into the analysis; the HLA region was also left out of the annotations. For gene prioritization we also utilized fastBAT, from the GCTA software package (1.93.0 beta Linux)46. By utilizing LD correlations between SNPs from a reference sample with individual-level genotypes and summary data from GWAS, fastBAT computes the association p-value for a set of SNPs from an approximated distribution of the sum of χ2-statistics over the SNPs at gene regions50. We analysed the loci using a 2MB window and FinnGen as a reference for LD estimation. Default methods and a 0.9 r2 threshold for LD pruning were used in the analysis. Results with pfastBAT< 5 × 10−8 and pGWAS< 5 × 10−8 were considered significant50.

Colocalizations

We investigated colocalizations between LDH association signals and gene expression: approximate Bayesian factor analyses were performed with the ‘coloc.abf’ function found in the ‘coloc’ R-library (version 5.2.3)51. Colocalizations were investigated for a total of 78 genes (Supplementary Table 7). Candidate genes and genes closest to the association signal were selected for the analysis if the closest gene was a different gene than the candidate gene. For the analysis, we selected tissues that could be relevant for LDH, namely whole blood, cultured fibroblasts, tibial nerve, and cervical spinal cord. Variant-gene expression associations (GTEx v8), for tissues were downloaded (02/22/2024) from GTEx-Portal45. Colocalizations with posterior probabilities ≥ 0.8 for the variant were considered significant52.

Survival analysis

By exploring the cumulative events, our aim was to observe how the LDH diagnoses accumulate for different variants according to age, and to evaluate whether there are early onset variants. Of the variants where differences in the accumulation of diagnoses were observed on the basis of the plots, we determined the diagnosis age in years when the curve of the effect allele (EA) homozygotes statistically differed from the curve of other allele (OA) homozygotes of the same variant. The calculation was performed with the two-tailed test by using the survival rates and survival rate standard errors obtained with the ‘survfit’ function, which is part of the ‘survival’ (https://github.com/therneau/survival, version 3.2-7) R library. The equation is given below:

diff = (‘survival rate EA/EA’-‘survival rate OA/OA’)/sqrt(‘se_survival_rate_EA/EA2’+’se_survival_rate_OA/OA2’),

p = 2 × (1-diff). P < 0.05 was used as the statistical cut-off value for a significant difference. Additionally, we calculated cumulative morbidity for every variant to clarify further whether some variants accumulate more diagnoses.

Genetic correlations

Genetic correlations were calculated between our meta-analysed LDH summary statistics and 437 other phenotypes extracted from the GWAS database provided by the MRC Integrative Epidemiology Unit (IEU) (https://gwas.mrcieu.ac.uk/). The LDSC software (version 1.0.1)17,53 was used for these calculations. We used a false discovery rate (FDR)-corrected p-value (pFDR) < 0.05 as the limit for significant correlations. HapMap3 European-only reference panel provided by Alkes Price’s group (https://alkesgroup.broadinstitute.org/LDSCORE/) were used in genetic correlations. For genetic correlations, we used our LDH summary statistics and the GWAS summary statistics that were similar to those that have been previously available in LDhub (https://ldsc.broadinstitute.org).

Mendelian randomization

For Mendelian randomization, we used the Two-Sample MR (https://github.com/MRCIEU/TwoSampleMR, version 0.5.8) R library to conduct a bi-directional Mendelian randomization to examine the causal relationships between LDH and its associated risk factors. Risk factors included in the analysis (Supplementary Table 11). Due to a bi-directional study approach, we were able to evaluate whether risk factors are causal for LDH and, consequently, if LDH is causal for risk factors. We obtained the LDH instruments from the FinnGen GWAS results since many of the GWAS data provided by the MRC-IEU are UKBB-based. This ensured that there was no overlap between the study populations. Variants correlated with each other were removed from the data so that only independent variants would be included in the analysis. For this, we used the default clumping settings (clumbing window 10 kb, r2 = 0.001). We run our primary analysis using the Inverse Variance Weighted (IVW) model. In the sensitivity analyses, we obtained MR Egger estimates. We also performed Cochran’s Q-test and the MR Egger intercept test to evaluate the heterogeneity and pleiotropy of the instruments. A leave-one-out analysis was also performed to see if there is a specific SNP driving a potentially observable causal relationship.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.