Domesticated cannabinoid synthases amid a wild mosaic cannabis pangenome

Lynch, Ryan C.; Padgitt-Cobb, Lillian K.; Garfinkel, Andrea R.; Knaus, Brian J.; Hartwick, Nolan T.; Allsing, Nicholas; Aylward, Anthony; Bentz, Philip C.; Carey, Sarah B.; Mamerto, Allen; Kitony, Justine K.; Colt, Kelly; Murray, Emily R.; Duong, Tiffany; Chen, Heidi I.; Trippe, Aaron; Harkess, Alex; Crawford, Seth; Vining, Kelly; Michael, Todd P.

doi:10.1038/s41586-025-09065-0

Article
Open access
Published: 28 May 2025

Nature volume 643, pages 1001–1010 (2025)

Abstract

Cannabis sativa is a globally important seed oil, fibre and drug-producing plant species. However, a century of prohibition has severely restricted development of breeding and germplasm resources, leaving potential hemp-based nutritional and fibre applications unrealized. Here we present a cannabis pangenome, constructed with 181 new and 12 previously released genomes from a total of 144 biological samples including both male (XY) and female (XX) plants. We identified widespread regions of the cannabis pangenome that are surprisingly diverse for a single species, with high levels of genetic and structural variation, and propose a novel population structure and hybridization history. Across the ancient heteromorphic X and Y sex chromosomes, we observed a variable boundary at the sex-determining and pseudoautosomal regions as well as genes that exhibit male-biased expression, including genes encoding several key flowering regulators. Conversely, the cannabinoid synthase genes, which are responsible for producing cannabidiolic acid and delta-9-tetrahydrocannabinolic acid, contained very low levels of diversity, despite being embedded within a variable region with multiple pseudogenized paralogues, structural variation and distinct transposable element arrangements. Additionally, we identified variants of acyl-lipid thioesterase genes that were associated with fatty acid chain length variation and the production of the rare cannabinoids, tetrahydrocannabivarin and cannabidivarin. We conclude that the C. sativa gene pool remains only partially characterized, the existence of wild relatives in Asia is likely and its potential as a crop species remains largely unrealized.

Main

Cannabis (C. sativa L., cannabis) is an ancient domesticated plant with widespread archaeological evidence for seed (achene) and fibre utilization dating to 8,000 years ago in East Asia, and earlier occurrences found up to 12,000 years ago^1,2, rivalling that of important crops such as wheat, barley, maize and rice. Cannabis was originally a multipurpose crop in Asia, where the same plants were utilized as a source of fibre, food and drugs^2,3. Over time, cannabis spread globally and single or dual-use-type cultivars were developed, eventually giving rise to divergent hemp and drug-type populations of the twentieth century⁴. Prior to the early 1900s, cannabis was an important commodity across Asia, Europe and the New World, and was used to produce fibres for sails, ropes, clothing and paper. However, competition from other fibre crops, entanglement with drug laws, and the eventual development of synthetic fibres led to a decline in production. In recent decades, the use of cannabis has shifted to specialized applications, including niche seed oils and drug production, where it continues to hold significant economic and cultural importance today⁵.

Throughout history and around the world, cannabis has undergone cycles of “cultivation, consumption, and crackdown”⁶. Modern prohibition originated in the USA during the early twentieth century⁷, but by 1961 had spread to a majority of countries⁸. Prohibition eliminated the fibre and food uses of cannabis for decades, but gave rise to a high-value illegal market for phytocannabinoid-based drugs, which are derived from glandular trichomes. Although more than 100 phytocannabinoids have been identified, only a limited number are produced in significant quantities, which are used to classify plants by chemotype: delta-9-tetrahydrocannabinolic acid (THCA; type I), cannabidiolic acid (CBDA; type III), balanced CBDA and THCA (type II), cannabigerolic acid (CBGA; type IV) and cannabinoid-free (type V)⁹. Although tetrahydrocannabinol (THC), the primary intoxicant, remains a controlled substance, a majority of US states and many countries now allow medical or adult use of cannabis products. Separately, the 2014 and 2018 US Farm Bills facilitated hemp production and research in plants that produce less than 0.3% THC on US soil, generating opportunities for improved non-THC drug, grain and fibre applications.

The haploid cannabis genome is relatively small in size (around 750 Mb), yet its complexity is driven by a high proportion (approximately 79%) of transposable elements (TEs) and substantial heterozygosity (single nucleotide polymorphisms (SNPs): greater than 2%). The CBDRx (cs10) reference genome, derived from the high-cannabinoid (HC) cannabidiol (CBD) hemp lineage related to the well-known anti-epileptic ‘Charlotte’s web’ cultivar¹⁰, resolved the arrangement of cannabinoid synthase genes as a single full-length copy of CBDAS nested within conserved 70 to 80-kb tandem TE arrays. Furthermore, HC hemp lines such as CBDRx emerged through the introgression of the CBDAS locus into a predominantly marijuana (MJ) genetic background, thereby leveraging high-potency alleles to enhance CBD production¹¹. However, initial comparison of published cannabis genomes suggests substantial genomic dynamism across use types^{11,12,13,14,15,16}, raising key unresolved questions about the global extent of genetic diversity. Additionally, the role of hybridization in shaping genome architecture and allele transmission remains unclear, highlighting the need for further high-quality assemblies and population-scale genomic analyses. Here we have built a comprehensive framework for exploring genetic diversity in this multi-use crop by creating a cannabis pangenome using haplotype-resolved, chromosome-scale assemblies.

The cannabis pangenome

Cannabis is often classified as a monospecific genus¹⁷, although debate remains regarding the status of Cannabis indica Lam. and Cannabis ruderalis, the latter of which is thought to be the source of the day-neutral (DN; autoflowering) flowering type¹⁸. We addressed the diversity of cannabis by building the pangenome with samples selected from multiple sources to cover use types, history, sex expression and agronomic traits (Extended Data Fig. 1 and Supplementary Fig. 1). The cannabis pangenome comprises 181 new PacBio assemblies and 12 previously published genomes, representing 144 biological samples, including 78 haplotype-resolved, chromosome-scale assemblies and 103 contig-level assemblies. We highlight an F₁ hybrid (ERBxHO40_23; EH23) between two phenotypically and genetically divergent parents to clarify features of the genome that have been missed in previous studies (Fig. 1a, Extended Data Figs. 2 and 3 and Supplementary Note 1).

**Fig. 1: Cannabis pangenome architecture uncovers at least five populations.**

All genomes are of high quality, with an average N50 of 7.5 Mb, and BUSCO¹⁹ genome and proteome completeness scores of 97% and 95%, respectively (Extended Data Fig. 4). The average haploid genome length was 781 Mb with around 35,000 protein-coding genes per genome (Supplementary Tables 1, 2 and 3). Consistent with a predominantly outcrossing behaviour, the SNP-based heterozygosity ranged between 1% and 2.5% (Supplementary Fig. 2). The assemblies are also high quality structurally, resolving previous TE placement issues (Supplementary Fig. 3) and revealing centromere regions, telomere length, large structural variations (SVs), fine-scale genetic architecture of important genes such as the cannabinoid synthases, as well as the sex-determining region (SDR) and pseudoautosomal region (PAR) of the Y chromosome (Fig. 1a,b), the largest chromosome in the Cannabis genome (Extended Data Fig. 5).
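For readers unfamiliar with the N50 contiguity statistic quoted above, the following is a minimal illustrative sketch in Python. It is not the pipeline used in this study; the function name `n50` and the toy contig lengths are hypothetical and serve only to show the standard definition (the length of the contig at which the cumulative sum of length-sorted contigs first reaches half of the total assembly length).

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (lengths in Mb), not data from the pangenome.
toy_contigs_mb = [2, 3, 4, 5, 6]
print(n50(toy_contigs_mb))  # total = 20, half = 10; 6 + 5 = 11 >= 10, so N50 = 5
```

Because N50 depends only on the length distribution, it says nothing about correctness of the assembly, which is why it is reported here alongside BUSCO completeness.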

We constructed comprehensive Cannabis pangenomes using both reference-based and reference-free approaches. A reference-based pangenome graph was generated with Minigraph-Cactus (MGC)²⁰ using the 78 chromosome-scale, haplotype-resolved genomes. For a reference-free approach, we built a k-mer matrix with PanKmer²¹ using all 193 genomes and a graph-based representation with PanGenome Graph Builder (PGGB)²². Owing to the high memory demands of PGGB, we selected a subset of 16 genomes for graph generation (Extended Data Fig. 6 and Methods). SVs detected by MGC and PGGB closely matched those from pairwise whole-genome alignments. Mapping rates for a diverse short-read dataset² were similar between the MGC pangenome graph (95.09%) and the linear EH23a reference genome (95.0%), indicating that both approaches effectively captured variation.

The pangenome reveals five populations

The taxonomy, history and nomenclature of the cannabis genus have long been debated²³. Owing to its wide phenotypic and geographic diversity, it has been classified either as a multi-species interbreeding complex or as a single species with subspecies designations. We calculated the collector’s curve to evaluate the completeness and diversity of the pangenome using shared gene-based orthogroups as well as shared k-mers (Fig. 1c,d). The curve suggested that we captured the majority of cannabis orthogroup diversity at around 100–125 genomes (Fig. 1c), although significant global genomic variation remains uncharacterized (Fig. 1d), possibly owing to the recent TE activity. Collector’s curves for the 78 haplotype-resolved, chromosome-scale assemblies revealed similar but more attenuated diversity–sample relationships (Supplementary Fig. 4). Across all pangenome samples we found that 23% of genes were ‘core’ (present in all genomes), 55% were ‘nearly-core’ (95–99% of genomes), 21% were ‘shell’ (5–94% of genomes), and a small fraction were classified as ‘cloud’ (0.4%) or ‘unique’ (0.7%) (Fig. 1e and Supplementary Fig. 5). Gene Ontology (GO) terms related to terpene biosynthesis and defence response were some of the most frequently enriched among core genes (Extended Data Fig. 7, Supplementary Note 2 and Supplementary Table 4), although both showed substantial variation at the sequence level (Extended Data Figs. 7 and 8).
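The frequency categories above can be sketched as a simple classifier over orthogroup presence counts. This is an illustrative sketch only: the text defines 'core' (all genomes), 'nearly-core' (95–99%) and 'shell' (5–94%), whereas the cut-offs used below for 'cloud' (more than one genome but fewer than 5%) and 'unique' (exactly one genome) are assumptions, as are the function name and the toy counts.

```python
from collections import Counter


def classify_orthogroup(n_present, n_genomes):
    """Assign a pangenome frequency category to one orthogroup.

    n_present: number of genomes in which the orthogroup was found.
    n_genomes: total number of genomes in the pangenome.
    """
    if not 1 <= n_present <= n_genomes:
        raise ValueError("n_present must be between 1 and n_genomes")
    if n_present == n_genomes:
        return "core"
    if n_present == 1:
        return "unique"          # assumed definition: a single genome
    fraction = n_present / n_genomes
    if fraction >= 0.95:
        return "nearly-core"
    if fraction >= 0.05:
        return "shell"
    return "cloud"               # assumed definition: >1 genome, <5%


# Hypothetical presence counts for five orthogroups across 193 genomes.
toy_counts = [193, 190, 100, 5, 1]
summary = Counter(classify_orthogroup(n, 193) for n in toy_counts)
print(summary)
```

Using fractions rather than rounded percentages avoids leaving gaps between the published ranges (for example, between 94% and 95%).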

Cannabis has not undergone a whole-genome duplication since the ancient lambda event approximately 100 million years ago¹³. This suggests that its extensive genomic diversity arose not through recent whole-genome duplications or hybridization-driven allopolyploidy, but through tandem gene duplication and other local duplication mechanisms (Supplementary Fig. 6 and Supplementary Note 3). Comparisons between populations using pairwise average F_st (fixation index) values based on phased SNPs indicated that some cannabis populations exhibited levels of genetic differentiation that were similar to interspecies comparisons, such as in strawberry²⁴ (F_st = 0.20 for MJ versus hemp; Supplementary Table 5). Specific genes with high F_st SNPs were linked to environmental response, with circadian, light signalling and flowering time genes exhibiting an above-average F_st (0.42) (Supplementary Table 6). Notably, GIGANTEA (GI)²⁵, a highly conserved, typically single-copy gene that has a central role in the circadian clock that regulates daily period length, flowering time and cell elongation, contained a SNP with the fifth-highest F_st (0.77, MJ versus hemp). Separately, using a test for selective sweeps across 20-kb SNP windows (XP-CLR, MJ versus hemp), GI was again found within a significant region of the X chromosome. Finally, a broader analysis of gene family diversity revealed substantial variation at the GI locus between the HC hemp and hemp populations (Supplementary Fig. 7). These findings highlight the effect of selection on key agronomic genes²⁶ that may underlie differentiation of traits such as flowering and internode elongation (fibre length), which contrast markedly between hemp and MJ populations.
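To make the per-SNP F_st values above more concrete, the sketch below computes Wright's F_st for a single biallelic SNP in two equally weighted populations, as (H_T − H_S)/H_T. The estimator actually used in this study is not specified in this section, so this choice, the function name and the example allele frequencies are all assumptions for illustration.

```python
import math


def fst_biallelic(p1, p2):
    """Wright's F_st for one biallelic SNP and two equally weighted populations.

    p1, p2: frequency of the same allele in population 1 and population 2.
    Returns NaN when the SNP is monomorphic across both populations.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                      # expected heterozygosity, pooled
    if h_total == 0:
        return math.nan                                    # no variation: F_st undefined
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population value
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies, not values from the pangenome.
print(fst_biallelic(0.2, 0.8))
print(fst_biallelic(0.0, 1.0))  # alternative alleles fixed in each population
```

Under this definition a SNP reaches F_st = 1 only when the two populations are fixed for different alleles, so a value such as 0.77 implies strongly, but not completely, diverged allele frequencies.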

Drug-type populations from North America that produce high levels of cannabinoids are thought to have originated from regions of Southeast and Central Asia, and were brought to the western hemisphere via the Caribbean and South America; however, most of what is known about these ancestral populations is based on limited historical accounts and speculation⁵. A broad split of drug-type samples into two groups, one aligned with Asian hemp and one with European hemp, was suggested by the k-mer-based hierarchical clustering using the PanKmer pangenome (Fig. 1f,g and Extended Data Fig. 1). Both groups contained MJ and HC hemp samples, which were thought to have largely MJ ancestry with a recent history of introgression breeding for CBDAS genes, perhaps from European hemp origins¹¹. However, using a phased SNP-based structure with all MJ samples treated as a single population, the TreeMix model inferred a highest-likelihood phylogeny that included six gene flow (migration) events between Asian hemp, HC hemp and European hemp, as well as MJ and HC hemp samples (Supplementary Fig. 8). These results may partially explain the European and Asian groupings of drug-type samples found by our k-mer clustering analysis, and reflect the effects of historical hybridization breeding between Asian and European hemp that is documented in the breeding literature²⁷. In addition to the two drug-type populations and separate European and Asian hemp populations, the k-mer clustering showed significant divergence between the single available wild Tibetan assembly and all other domesticated and feral lines¹³, suggesting that wild Cannabis relatives still exist in remote regions of Asia². Indeed, k-mer based hierarchical clustering of the pangenome assemblies combined with short reads from samples collected across Europe and Asia recapitulated the original authors’ finding that samples from Asia described as ‘drug-type feral’ and ‘basal’ represent distinct populations² (Fig. 1f and Supplementary Fig. 9). Ultimately, refining hypotheses about domestication, biogeography and use-type history will require broader sampling of Asian and historical specimens, along with careful delineation of wild and feral populations.
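The idea behind reference-free, k-mer-based comparison of genomes can be sketched as follows. This is not the PanKmer implementation or its API: the helper names are hypothetical, the sequences are toy strings, and Jaccard similarity over canonical k-mer sets is used here as an assumed stand-in for whatever similarity measure feeds the hierarchical clustering.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Collect the set of canonical k-mers in a sequence.

    A canonical k-mer is the lexicographically smaller of a k-mer and its
    reverse complement, so both strands map to the same key. Windows that
    contain characters other than A, C, G or T are skipped.
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (defined as 0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy sequences; real analyses use much larger k and whole genomes.
genome_a = canonical_kmers("AAAA", 2)
genome_b = canonical_kmers("TTTT", 2)   # reverse complement of the first
print(jaccard(genome_a, genome_b))
```

A matrix of pairwise similarities of this kind can then be converted to distances and passed to any standard hierarchical clustering routine.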

Sex chromosome evolution

Sex expression in cannabis has long puzzled biologists²⁸. Although most populations are dioecious, with separate male (XY) and female (XX) plants, monoecious (XX) forms also exist, which exhibit variable ratios of male and female flowers. The Cannabaceae sex chromosomes originated in a common ancestor of Cannabis and Humulus more than 36 million years ago (Ma)²⁹—earlier than previous estimates³⁰—making them among the oldest known in flowering plants³¹. Despite their ancient origin, cannabis sex chromosomes have been shaped by human selection on sexually dimorphic traits³². In drug-type populations, males produce few glandular trichomes, and pollination reduces cannabinoid yield in female plants, leading to reduced use (or elimination) of males in breeding programmes (Methods). By contrast, hemp seed production requires pollen, and male plants enhance bast fibre yield and quality. Additionally, European monoecious fibre cultivars, such as Santhica (SAN) and KC Dora (KCDv1), were developed to improve mechanized harvesting efficiency of both fibre and seeds, adding another layer of artificial selection³¹.

Unlike most angiosperms, cannabis has a heteromorphic XY pair, with a Y chromosome that is approximately 30% larger than the X chromosome (Fig. 1b, Extended Data Figs. 4 and 5). Recombination occurs in the PAR but is suppressed throughout the SDR on the Y chromosome. The SDR spans 79–84 Mb out of the approximately 110 Mb Y chromosome, making it one of the largest SDRs in plants, with 840–1,160 genes (Supplementary Figs. 10 and 11 and Supplementary Tables 7 and 8). By contrast, the PAR covers only around 29 Mb, yet hosts 1,900–1,980 genes, including many important flowering genes, such as FLOWERING LOCUS T (FT), CONSTANS (CO) and GI. Theory predicts that after initial recombination suppression, the SDR expands in a stepwise manner owing to selection linking genes to the SDR that are beneficial to males but deleterious to females³³. Alternatively, neutral processes, reflected in synonymous substitution rates (K_s), can drive SDR expansions. K_s values along the SDR showed a continuous pattern of gene addition from the PAR boundary to the centromere²⁹, suggesting that recombination suppression near the centromere at least partially caused expansion. Using k-mers and X–Y orthologue phylogenies, we identified two distinct SDR haplotypes: Ya, shared by six samples, and Yb, found in two samples (Fig. 1b). These haplotypes differed at the SDR–PAR boundary, separated by 5 conserved gene models spanning approximately 51 kb (GVA-21-1003-002)³⁴ to 132 kb (Kompolti), with all others spanning 61–62 kb. The gene located nearest to the PAR–SDR boundary in the Ya haplotype (within the PAR in Yb; Fig. 1b) is TRANSCRIPTION ELONGATION FACTOR (SPT5), which is known to interact with FLOWERING LOCUS C (FLC) via FRIGIDA during cold-induced flowering in Arabidopsis³⁵. This suggests that selection on flowering time genes has facilitated a stepwise shift in recombination suppression and SDR expansion, which may explain why male flower development begins before female flowering onset in some varieties. Polymorphisms in the SDR–PAR boundary signal that the hemp gene pool hosts ancestral diversity of sexually antagonistic genes, which may underlie useful variation in flowering timing³⁶.

Furthermore, gene expression profiling of Ace High (AH3M) male and female tissue found biased expression of more than 7,000 genes in male flowers across all chromosomes, spanning many functions including pollen development. This contrasted with biased expression of genes in male leaf (approximately 1,400 genes), female leaf (approximately 3,700 genes) and female flower (approximately 3,900 genes) tissue (Extended Data Fig. 9). Whereas gene expression in the X chromosome was fairly uniform, gene density and expression in the Y chromosome were skewed toward the PAR. Of note, a substantial proportion of genes in the PAR (38%, around 750 genes) showed biased expression in male flowers, compared with only 6% (94 genes) in the SDR. Although the SDR encodes one or more unidentified sex-determining genes for male flower development, the majority of the required transcriptional network for male or female flower expression is broadly distributed across all chromosomes.

TEs shape the pangenome

TEs had a major role in shaping the cannabis genome, particularly in the proliferation of intronless cannabinoid synthase genes, which are embedded within 70–80-kb conserved TE cassettes¹¹. On average, TEs comprised 68% of each genome, with long terminal repeat retrotransposons (LTR-RTs) representing 50% of the total (Fig. 2a and Supplementary Tables 1 and 9). Genes on average were located near TEs (443–613 bp from TEs; Supplementary Table 10). Different TE types showed distinct insertion patterns: DNA transposons (for example, Mutator and Helitron) were inserted within 500 bp upstream of coding regions, whereas LTR-RTs were more evenly distributed flanking genes (Supplementary Fig. 12). Genes involved in transposition, transcription, recombination and DNA repair were frequently associated with Ty3-long terminal repeats (LTRs), whereas defence and metabolite biosynthesis genes were enriched near Ty1-LTRs (Supplementary Table 11). Many intact TEs were estimated to have inserted into the genome within the past 100,000 years, suggesting that ongoing diversification may be driven by hybridization and stress factors, particularly in F₁ and MJ populations (Fig. 2a–c). One such factor is clonal propagation, which is a common practice in modern MJ production but is rarely used in hemp field cultivation.

**Fig. 2: TEs shape the cannabis pangenome.**

Despite 4 million years of sustained activity and a recent burst of LTR proliferation (Fig. 2b,c), the cannabis genome has maintained a smaller haploid genome size (approximately 750 Mb) than that of its sister genus Humulus, which ranges from 1,700 Mb in Humulus japonicus to 2,700 Mb in Humulus lupulus³⁷. Solo LTRs reflect genome purging and can be formed by ectopic recombination, which occurs in the internal sequence of a complete LTR-RT³⁸. The high solo:intact ratio observed in cannabis (Fig. 2d–g) is likely to contribute to its compact genome size by mitigating TE accumulation. Ty1-LTRs displayed the highest solo:intact ratio within the SDR of the Y chromosome (Fig. 2d,f), suggesting the initial expansion of this region was driven by TE insertions that preceded deletion events by ectopic recombination (Fig. 2i,j). DNA methylation also prevents uncontrolled TE proliferation by silencing expression³⁹. We found that TE methylation levels were higher than genome-wide averages, although population-specific differences were detected (Supplementary Fig. 13 and Supplementary Table 12). We detected expressed TE transcripts in the EH23 F₁ hybrid, indicating ongoing TE activity (Supplementary Fig. 14). On the Y chromosome, the PAR and the SDR exhibited distinct patterns of gene expression and intact TE expression (Extended Data Fig. 9d,f), with the SDR showing increased methylation levels (Fig. 2h), consistent with its degenerate, gene-poor nature. Several TE families are actively transcribed, and many insertions are evolutionarily recent; however, TE frequency profiles varied distinctly among populations (Supplementary Figs. 15 and 16). The combination of recent divergence times for certain TE types (Fig. 2b,c), their enrichment near genes, and their population-specific distributions suggests that TEs contribute to both gene evolution and the regulation of adaptive responses in cannabis.

SVs drive innovation

Given the high abundance of young, active TEs in cannabis, we examined their role in shaping pangenome SVs (Fig. 3). SV counts varied most in translocations and duplications, mirroring population-specific TE abundance (Fig. 2a), whereas inversions showed the least variation (86 per genome on average) (Fig. 3a and Extended Data Fig. 10). However, inversion sizes ranged from 200 bp to 25 Mb (average 304 kb), forming a multi-modal distribution, suggesting that multiple evolutionary forces shaped inversions of different lengths. Whereas the SNP heterozygosity ranged between 1 and 2.5% in the pangenome, the heterozygosity (variable regions) when including SVs and non-alignable regions was on average 20.6% of total genome length (Supplementary Fig. 17), highlighting the extent of previously uncharacterized genomic variation in cannabis.
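The difference between SNP heterozygosity and heterozygosity expressed as a fraction of genome length can be illustrated with a short sketch that merges overlapping variable intervals before summing them. The interval coordinates, the half-open convention and the function names are assumptions for illustration, not the method used to produce Supplementary Fig. 17.

```python
def covered_length(intervals):
    """Total length covered by half-open (start, end) intervals.

    Overlapping or nested intervals are merged so that no base is counted twice.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if end <= start:
            raise ValueError("each interval needs end > start")
        if current_end is None or start > current_end:
            # Disjoint from the interval being built: close it and start anew.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def variable_fraction(intervals, genome_length):
    """Fraction of the genome that falls inside any variable interval."""
    return covered_length(intervals) / genome_length


# Hypothetical SV and non-alignable regions on a 100-bp toy genome.
toy_regions = [(0, 10), (5, 20), (30, 40)]
print(variable_fraction(toy_regions, 100))  # 20 + 10 covered bases out of 100
```

Merging first matters because SVs and non-alignable regions often overlap; summing raw interval lengths would overstate the variable fraction.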

**Fig. 3: Structural variants occur at different frequencies in populations and are non-randomly distributed across the genome.**

TEs frequently caused small-to-medium translocations, duplications and inversions, whereas larger inversions arose at breakpoints that were enriched with segmental duplications and inverted repeats⁴⁰ (Extended Data Fig. 10). SV hotspots on chromosome (chr.) 1, chr. 4 and chr. 7 overlapped common inversion breakpoints and TE-enriched regions (Fig. 3b). Analysis of TEs within SV breakpoints (500 bp upstream and downstream, 1 kb total) revealed population-specific TE enrichment patterns. In MJ genomes, duplications frequently contained three DNA TE families and Ty3-LTR-RTs (Supplementary Table 13; P < 0.05, Welch’s t-test). Only Harbinger and Mutator DNA TEs were enriched at duplication breakpoints in other populations, whereas feral hemp duplications showed no significant TE enrichment, suggesting recent TE activity or alternative SV formation mechanisms. Inversions covered up to 7% of the genome, surpassing values observed in multi-species comparisons, such as soybean and grapevine⁴¹. Given the population-specific interplay of TEs and SVs, as well as their frequent proximity to genes, our findings revealed a diverse set of mechanisms driving cannabis genome evolution, many of which were undetected in previous resequencing efforts.
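The breakpoint-window analysis described above (500 bp on each side of a breakpoint, 1 kb in total) can be sketched as a simple interval overlap test. The tuple layout, the half-open coordinates, the function name and the toy annotations are hypothetical; the sketch shows only the windowing logic, not the enrichment statistics.

```python
def tes_near_breakpoint(breakpoint, te_annotations, flank=500):
    """Return the names of TEs overlapping a window around an SV breakpoint.

    breakpoint:      0-based breakpoint coordinate on one chromosome.
    te_annotations:  iterable of (start, end, name) tuples with half-open
                     coordinates on the same chromosome.
    flank:           bases examined on each side (500 bp gives a 1-kb window).
    """
    window_start = max(0, breakpoint - flank)
    window_end = breakpoint + flank
    return [
        name
        for start, end, name in te_annotations
        if start < window_end and end > window_start   # any overlap with the window
    ]


# Hypothetical TE annotations around a breakpoint at position 10,000.
toy_tes = [
    (9_400, 9_600, "Mutator"),      # overlaps the upstream flank
    (10_400, 10_900, "Harbinger"),  # overlaps the downstream flank
    (12_000, 13_000, "Ty3-LTR-RT"), # outside the window
]
print(tes_near_breakpoint(10_000, toy_tes))
```

Counting the returned family names per population, and comparing against a background expectation, would be the next step towards an enrichment test such as the one summarized in Supplementary Table 13.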

Segregation distortion has been observed across multiple regions of the cannabis genome¹⁶, mirroring patterns detected in the F₁ EH23 hybrid (Extended Data Fig. 3), which suggests that SVs may contribute to allele transmission biases⁴². Long inversions, such as the one found on chr. 1 (19.5 Mb in length; Fig. 3b), may function as a supergene, perhaps maintained as a balanced polymorphism through associative overdominance⁴³. Indeed, the 17 instances of this inversion were found to be heterozygous in 15 samples and homozygous in 1. This inverted region contained around 1,203 genes, spanning many functions, including the core circadian and flowering time gene PSEUDO RESPONSE REGULATOR 3 (PRR3), which has been implicated in the ‘autoflower’ DN behaviour in cannabis⁴⁴ as well as in flowering time variation associated with range expansion in major crops (soybean and sorghum) and natural populations^45,46,47. PRR3 contained a high-F_st SNP (0.61) as well as biased expression in our F₁ EH23 hybrid that was recessive for the DN trait (Extended Data Fig. 3). We found that pairwise SNP r² values and local principal component analysis (PCA) plots of this area suggested some level of haplotype formation and increased linkage disequilibrium (LD; >10 kb) across this region, especially at the interior breakpoint (Fig. 3c,d and Supplementary Fig. 18). However, these were not obvious signals of complete differentiation or recombination suppression as has been shown in other species⁴⁸.
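The pairwise SNP r² statistic mentioned above is the squared correlation between alleles at two sites. A minimal sketch for phased, biallelic data follows; the 0/1 allele coding, the function name and the toy haplotypes are assumptions for illustration rather than the study's own code.

```python
import math


def ld_r2(site_a, site_b):
    """Squared allelic correlation (r^2) between two phased biallelic SNPs.

    site_a, site_b: equal-length sequences of 0/1 alleles, where position i
    in both sequences refers to the same haplotype.
    Returns NaN if either site is monomorphic.
    """
    if len(site_a) != len(site_b) or len(site_a) == 0:
        raise ValueError("sites must be non-empty and of equal length")
    n = len(site_a)
    p_a = sum(site_a) / n                                    # frequency of allele 1 at site A
    p_b = sum(site_b) / n                                    # frequency of allele 1 at site B
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        return math.nan
    d = p_ab - p_a * p_b                                     # linkage disequilibrium coefficient D
    return d * d / denominator


# Hypothetical haplotypes: perfectly linked sites, then independent sites.
print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))
print(ld_r2([0, 1, 0, 1], [0, 0, 1, 1]))
```

Within a recombination-suppressed inversion, r² between distant SNPs would be expected to stay high across the whole interval; the partial signal reported here is consistent with the authors' caution about incomplete differentiation.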

Domesticated cannabinoid pathway

Cannabis is the only prolific producer of cannabinoids, although other plants (such as liverworts) and fungi synthesize smaller quantities⁴⁹. Although key enzymes in the cannabinoid biosynthetic pathway have been identified (Fig. 4a and Supplementary Fig. 19), the genomic organization of the final step in this pathway remained unresolved owing to the complexities of the cannabis genome (Supplementary Fig. 20). This mystery was clarified with the discovery of full-length THCAS, CBDAS and CBCAS genes nested within conserved TE cassettes, arranged in arrays on chr. 7¹¹. However, it was unclear whether this TE-mediated arrangement of synthase genes was conserved across the cannabis pangenome.

**Fig. 4: The cannabinoid biosynthesis pathway is domesticated yet shows contrasting patterns of genetic diversity and synteny.**

Cannabinoid synthases duplicated and neofunctionalized from the ancestral Berberine bridge enzyme-like (BBE-like) family of genes on chr. 7, then were ultimately reduced to a limited set of functional THCAS and CBDAS alleles through the domestication process^11,50 (Fig. 4b,c and Supplementary Fig. 21). Across the pangenome, each haploid genome hosted a maximum of one full-length THCAS or CBDAS, which were arranged in similar arrays of TE cassettes, most of which contained synthase pseudogenes. These cannabinoid synthase cassettes were found in a limited number of arrangements with association to specific TEs (Fig. 4c,e, Supplementary Figs. 22 and 23 and Supplementary Table 14), which suggested that selection had linked a small range of functional alleles to pseudogene cassette haplotypes. As a result, most THCAS and CBDAS genes were non-syntenic, and associated with inversions between cannabis types, but were generally located within a region constrained to about 1.5 Mb on chr. 7 (Figs. 1a and 4d). Whereas the cannabis pangenome exhibits high genomic variation, the conserved structure of the THCAS and CBDAS loci suggests that these regions are under strong selective pressure.

Full-length CBCAS paralogues were typically 15–20 Mb from the chr. 7 centromere, but owing to a genomic inversion, sometimes appeared within about 1.2 Mb of THCAS (Fig. 4d). CBCAS occurred in 56% (110 out of 193) of genomes, in arrays of 1–15 copies (Supplementary Fig. 22). Although CBCAS is capable of producing cannabichromenic acid (CBCA) in yeast¹⁶, analysis of more than 59,000 cannabis samples detected almost no CBCA, probably owing to low natural levels⁵¹. In EH23, CBCAS expression was low across all tissues, suggesting that CBCA accumulation has not been under strong selection, potentially owing to human preference for THC and CBD (Supplementary Fig. 19).

Varin cannabinoids and fatty acid genes

In planta cannabinoid alkyl side-chain length can vary from one to at least seven carbons, with five carbons being the most common in modern gene pools⁵². Three-carbon side-chain cannabinoids (propyl; tetrahydrocannabivarin (THCV), cannabidivarin (CBDV) and cannabigerovarin (CBGV)) are much less common, but have attracted interest as novel therapeutic agents⁵³. Prior studies have characterized the polygenic nature of this trait, and associated the β-keto acyl carrier protein reductase (BKR) gene with varin cannabinoid production, but left open at least one step needed for a complete biosynthetic hypothesis⁵⁴. We extended the model for varin cannabinoid production by identifying a complex of acyl-lipid thioesterase (ALT3 and ALT4) genes located near the beginning of chr. 7 that were associated with varin production in our F₂ mapping population and were contained within a common haplotype in our k-mer-based crossover analysis of trios (Fig. 5, Supplementary Note 1, Supplementary Figs. 24–26 and Supplementary Tables 15 and 16). There was high ALT gene copy number variation in cannabis, ranging from 2–14 copies (considering both phased and unphased assemblies) across 4 chromosomes (Fig. 4a). Most plant genomes contain 4–5 ALT homologues, and some contain only a single homologue (for example, Brassica rapa and Glycine max)⁵⁵. Additionally, ALT protein sequence variation in cannabis was notable, with distinct orthogroup membership of each ALT4 in EH23a and EH23b genomes (Fig. 5b,c), despite these genes being located at similar positions (Fig. 5b). Since the shortest known fatty acid product of a plant fatty acyl-thioesterase is a 6:0 fatty acid generated by the Arabidopsis ALT4, the EH23a ALT4 allele is a lead candidate for further experimentation. However, given the crossover locations (Fig. 5a), potential for linkage disequilibrium and short-read mapping issues in this region, any of these ALT3 and ALT4 trans-duplicated genes (or splice variants) could be causal for varin cannabinoid production. Alternatively, they may have overlapping sub-functionalized substrate specificities, which would pose challenges for further mapping and improvement efforts⁵⁶.

**Fig. 5: *ALT* gene *trans*-duplication and diversification explains varin cannabinoid phenotype in cannabis.**

Although the BKR gene on chr. 4 was identified previously in a genome-wide association study, the pangenome showed that a 2-bp deletion produced a 6-exon loss-of-function gene model, which lacked catalytic active site residues (Fig. 5e). Thus, reduction or loss of function in this gene is probably required to increase the butyryl-acyl carrier protein pool, which one of the ALT3 or ALT4 gene products then hydrolyses to butyric acid, leading to varin cannabinoid biosynthesis (Fig. 4a). Since cannabis hosts BKR genes on chr. 3 and chr. 4, loss of catalytic function of one copy is unlikely to fully terminate iterative fatty acid chain synthesis, which could also explain why varin cannabinoids are only found in certain ratios with pentyl cannabinoids^52,54. Across the pangenome, the EH23a 6-exon BKR variant was exclusively found in HO40 pedigree samples (high varin); all other samples, except one 8-exon version of BKR in the seed oil cultivar Finola (low varin producer), were 11- or 12-exon models. The phylogenetic relationships of the predicted BKR proteins showed that the 6-exon gene may be closer to certain Asian hemp, European hemp and feral variants (Fig. 5f). However, one of the 11-exon gene clades contained the varin-producing AutoCBDV genome, and the potential varin producer Durban Poison, which could be reduced-function variants. Some reports suggest that there is no defined geographic origin associated with the varin chemical phenotype⁵⁷. However, other studies report plants that contain high levels of varin cannabinoids from the southern regions of Africa and certain regions of Asia^52,58. Collectively, the BKR gene phylogeny and whole-genome k-mer-based clustering analysis suggest an Asian origin for varin cannabinoid genes used in this breeding project (Fig. 1f,g). Deeper understanding of these biosynthetic pathways enhances our ability to select and optimize diverse cannabinoid production and suggests a path toward improvement of seed oil lipid profiles.

Conclusions

Our analysis of 193 cannabis genomes revealed that global diversity remains undersampled, with Asian germplasm notably underrepresented. Despite its phenotypic similarity to European hemp, Asian hemp carries highly divergent genomic regions, some of which align more closely with North American drug-type cannabis, suggesting undiscovered wild relatives and unresolved taxonomy. TE activity and hybridization, rather than whole-genome duplication, drive cannabis genome evolution. SVs uncover previously hidden diversity missed by short-read sequencing. Whereas cannabinoid synthase genes show limited variation, genes related to fatty acid metabolism, growth, defence and terpene biosynthesis exhibit extensive diversity and copy number variation. We assembled fully phased cannabis X and Y chromosomes, identifying a variable SDR–PAR boundary and unique male-specific homologues on the large Y chromosome that may influence flowering time and development, offering new targets for breeding.

Finally, the discovery of extensive variation in fatty acid biosynthesis genes (for example, ALT and BKR) suggested that cannabis has untapped potential for lipid metabolism. Given the overlap between cannabinoid biosynthesis and seed oil pathways, hybridizing diverse parental lines beyond the conventional Northern European hemp seed oil gene pool could yield novel lipid profiles and traits. The conservation and utilization of Asian hemp and wild cannabis will be critical for advancing cannabis breeding and the development of agronomic and pharmaceutical potential.

Methods

Plant material

C. sativa pangenome samples were selected from multiple sources to maximize the genetic diversity, history and agronomic value. A large portion of the pangenome comes from the Oregon CBD (OCBD) breeding programme that includes elite cultivars; foundational marijuana lines potentially originating from the 1970s to the present; and elite trios used for different aspects of the breeding programme (Extended Data Figs. 1 and 2, Supplementary Table 1 and Supplementary Fig. 1). The remaining cultivars come from the US Department of Agriculture (USDA) Germplasm Resource Information Network (GRIN) and German Federal Genebank (IPK Gatersleben) repositories, as well as collections made by the Salk Institute from various breeders. The pangenome includes European and Asian fibre and seed hemp, feral populations, North American marijuana (type I) and North American high cannabinoid yielding (CBD or CBG) hemp (type III and IV). Additional cannabinoid diversity is represented with chemotypes presenting high expression of pentyl or propyl (varin) homologues of CBD or THC, and cannabinoid-free (type V) plants. Flowering time variation is also captured with the inclusion of both regular short-day and day-neutral (autoflowering) phenotypes (Supplementary Table 1).

EH23 phased, haplotype-resolved, chromosome-scale anchor genome

EH23a (HO40) and EH23b (ERB) are haplotype-resolved assemblies for ERBxHO40_23, an F₁ from a cross between the parents ERB and HO40, both proprietary female inbred lines from OCBD. ERB is a DN (autoflower), type III (CBDA-dominant) plant that is part of the drug-type group more closely related to European HC hemp. HO40 is type I propyl cannabinoid (THCVA and THCA)-producing, short-day flowering responsive, and is part of the drug-type marijuana group (MJ) with a closer affinity to Asian hemp. The genetically female (XX) ERB plant was induced to produce male flowers by treatment with silver thiosulfate and used to pollinate HO40. One individual from the F₁ population (ERBxHO40_23) was selected for genome sequencing. Flow cytometry of ERBxHO40_23 gave an initial diploid genome size estimate of 1,445.6 Mb (722.8 Mb haploid genome size). High molecular weight (HMW) DNA was extracted from leaf tissue. Following DNA extraction and library preparation (see ‘HMW DNA isolation and genome sequencing’), HiFi reads were generated on the Pacific Biosciences (PacBio) Sequel II. Hifiasm v0.16.1⁵⁹ was then used in conjunction with Hi-C reads to produce initial assemblies. After assembly, Hi-C reads were aligned to the Hifiasm_HiC contigs using the Juicer v1.6.2 pipeline⁶⁰ followed by ordering and orientation utilizing version 180922 of the 3D-DNA pipeline⁶¹. The scaffolded assemblies were then manually corrected using Juicebox v1.11.08⁶².

EH23 F₂ population

In addition to the whole-genome sequencing described above, ERBxHO40_23 was self-pollinated, using silver thiosulfate-induced masculinization of select flowers, to create an F₂ mapping population. From this F₂ population, individuals were scored for autoflower and varin content, and sequenced using Illumina 100 bp reads by NRGene (Nrgene Technologies). Illumina WGS genotyping runs were performed on 288 plants from this population, plus the ERBxHO40_23 parent. Trim_galore was used to trim sequences using: --2colour 20, resulting in 271 individuals for analysis⁶³. On average, samples had 8.5× coverage. Minimap was used to align each sample to EH23b.softmasked.fasta. Freebayes was used to call variants: -g 4500 -0 -n 4 --trim-complex-tail --min-alternate-count 3⁶⁴. Bcftools was used to filter on QUAL > 20 scores (99% chance variant exists)⁶⁵. Finally, Vcftools⁶⁶ was used to further filter SNPs: --remove-indels --minGQ 20 --maf 0.25 --max-missing 1 --min-alleles 2 --max-alleles 2 --stdout --recode; only sites that were scored as heterozygous (0/1) in the ERBxHO40_23 sample were retained, resulting in 93,251 SNPs.
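
The "99% chance variant exists" annotation follows from the Phred scaling of the VCF QUAL field. The following minimal sketch shows that conversion, together with the heterozygous-parent check applied at the end of the filtering. The function names are illustrative and are not part of Bcftools or Vcftools, and accepting phased or reversed genotypes (for example 1|0) as heterozygous is an assumption.

```python
def phred_to_confidence(qual: float) -> float:
    """Probability that a variant call is real, given a Phred-scaled QUAL value."""
    return 1.0 - 10.0 ** (-qual / 10.0)


def is_het_in_parent(parent_genotype: str) -> bool:
    """True when the parent genotype is heterozygous ref/alt (0/1, 0|1 or reversed)."""
    # Assumption: phased and reversed encodings count as heterozygous.
    alleles = parent_genotype.replace("|", "/").split("/")
    return sorted(alleles) == ["0", "1"]


# QUAL = 20 corresponds to an error probability of 10^-2, i.e. 99% confidence.
print(round(phred_to_confidence(20), 4))
```

Under this scaling, QUAL = 30 would correspond to 99.9% confidence.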

EH23 F₂ cannabinoid HPLC methods

High-performance liquid chromatography (HPLC) was conducted according to the protocol thoroughly described previously⁶⁷ to determine relative propyl and pentyl cannabinoid content in all the plants used in this study, including F₂ progeny. In short, mature flower tissue was collected from each individual, frozen at −80 °C and homogenized, before cannabinoids were extracted in methanol.

EH23 RNA sequencing

ERBxH040-21 seedlings were grown under controlled environmental conditions. Various tissues were collected during the development of the plants, including early and late flowers, foliage, foliage under a 12-h inductive light regimen, roots and shoot tips. Total RNA extraction was done using the QIAGEN RNeasy Plus Kit following manufacturer protocols. Total RNA was quantified using Qubit RNA Assay and TapeStation 4200. Prior to library prep, we performed DNase treatment followed by AMPure bead clean up and QIAGEN FastSelect HMR rRNA depletion. Library preparation was done with the NEBNext Ultra II RNA Library Prep Kit following manufacturer protocols. Then these libraries were run on the NovaSeq6000 platform in 2× 150-bp configuration.

EH23 haplotype expression analysis

We measured gene expression levels using Salmon v1.6.0⁶⁸. In brief, the raw paired-end short reads from sequencing were mapped to the CDSs from both haplotypes (EH23a and EH23b) and the abundance was estimated in transcripts per million (TPM) for downstream analysis. Mapping rates were calculated with samtools flagstat⁶⁵. The minimum TPM threshold for a given gene was 0.1. Haplotype gene pairs were identified by reciprocal best hits and synteny using blastp and MCScanX⁶⁹, and only genes shared between both haplotypes were included. A minimum sequence similarity of 95% and a threshold of 5 TPM difference between haplotypes were imposed. Visualization was performed using a combination of Matplotlib⁷⁰, SciPy⁷¹ and NumPy⁷², and expression values are shown in heat maps as log₂TPM to represent log fold change. Enrichment of Biological Processes GO Terms was performed with topGO⁷³ with the following parameters: resultWeight <- runTest(topGOdata, algorithm = "weight01", statistic = "fisher"). A multiple test correction was performed with the following command: fullResults$p.adj <- p.adjust(as.numeric(fullResults$weightFisher), method = "fdr"). The background gene universe included all genes with a GO term from either EH23a or EH23b.
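
The expression thresholds above can be illustrated with a short sketch. It is not the analysis code used here: the way the two thresholds are combined, the pseudocount and the function name are all assumptions made for illustration.

```python
import math

MIN_TPM = 0.1        # minimum TPM for a gene to count as expressed (from the text)
MIN_TPM_DIFF = 5.0   # minimum TPM difference between haplotypes (from the text)
PSEUDOCOUNT = 0.1    # hypothetical pseudocount to avoid log2(0)


def haplotype_bias(tpm_a: float, tpm_b: float):
    """Return (more highly expressed haplotype, log2 ratio a/b), or None if filtered."""
    # Assumption: a pair is dropped when neither allele reaches the minimum TPM.
    if tpm_a < MIN_TPM and tpm_b < MIN_TPM:
        return None
    # Assumption: the 5 TPM threshold is inclusive.
    if abs(tpm_a - tpm_b) < MIN_TPM_DIFF:
        return None
    log2_ratio = math.log2((tpm_a + PSEUDOCOUNT) / (tpm_b + PSEUDOCOUNT))
    return ("EH23a" if tpm_a > tpm_b else "EH23b", log2_ratio)
```

A gene pair with 12 TPM on EH23a and 2 TPM on EH23b passes both filters and is reported as EH23a-biased, whereas a pair with 3 and 1 TPM is filtered out.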

Ace High sex-biased gene expression analysis

We collected flower and leaf tissue from four Ace High plants, two male and two female, at the same developmental time point, at 08:00 and 20:00, for a total of 16 samples. Since Ace High males flower several weeks before female plants under normal outdoor conditions, plants were germinated and grown under long days and transferred to inductive short-day conditions for flowering, which resulted in both male and female plants developing flowers at the same time. Samples were collected at two times of day to capture all transcripts regardless of their circadian or diurnal expression⁷⁴. RNA was extracted with the Qiagen Plant RNA kit. Library prep was performed with the Oxford Nanopore Technologies (ONT) full-length cDNA kit. We aligned full-length cDNA to the haplotype-resolved Ace High (AH3Ma/b) genomes with minimap2 (v2.24)⁷⁵ and gene expression was measured using Salmon v1.6.0⁶⁸. Sex-biased expression was assigned for all tissue-specific male and female samples (leaf and flower from two male plants (plants A and B, collected at 08:00 and 20:00) and two female plants (plants C and D, collected at 08:00 and 20:00)). Each sex-specific tissue had four replicates (for example, gene expression measurements from male flowers sampled from two male plants at two different time points were averaged). Two categories of biased expression were defined: first, average expression that was higher (at least 5.0 TPM greater) in male or female samples, relative to the other sex; and second, male or female-only expression, where genes were not expressed in one sex (0.0 TPM for all replicates), but had an average of at least 1.0 TPM expression in the other sex. For GO term analysis with topGO⁷³, both categories of biased gene expression were combined. Fully syntenic genes were identified in the set of four genomes with X and Y chromosomes (AH3Ma/b, BCMa/b, GRMa/b and KOMPa/b) using genespace, and were grouped according to location in the PAR, SDR or X-specific region.
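
The two categories of sex-biased expression defined above can be sketched as a simple classifier over replicate TPM values. The function name, the returned labels and the order in which the categories are tested are assumptions for illustration only.

```python
from statistics import mean


def classify_sex_bias(male_tpm, female_tpm):
    """Classify one gene in one tissue from replicate TPM values (four per sex in the text)."""
    m_avg, f_avg = mean(male_tpm), mean(female_tpm)
    # Second category: 0.0 TPM in every replicate of one sex,
    # and an average of at least 1.0 TPM in the other sex.
    if all(v == 0.0 for v in female_tpm) and m_avg >= 1.0:
        return "male-only"
    if all(v == 0.0 for v in male_tpm) and f_avg >= 1.0:
        return "female-only"
    # First category: average expression at least 5.0 TPM greater in one sex.
    if m_avg - f_avg >= 5.0:
        return "male-biased"
    if f_avg - m_avg >= 5.0:
        return "female-biased"
    return "unbiased"
```

For example, a gene with male replicates of 2, 3, 1 and 2 TPM and no female expression would be labelled male-only under these assumptions.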

Hi-C library preparation and sequencing

For the Dovetail Omni-C library, chromatin was fixed in place with formaldehyde in the nucleus and then extracted. Fixed chromatin was digested with DNase I, chromatin ends were repaired and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter-containing ends. After proximity ligation, crosslinks were reversed and the DNA purified. Purified DNA was treated to remove biotin that was not internal to ligated fragments. Sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The library was sequenced on an Illumina HiSeqX platform to produce ~30× sequence coverage. HiRise then used MQ > 50 reads (see read-pair above) for scaffolding. Additional Hi-C libraries were generated using Phase Genomics Proximo Hi-C Kit (Plant) version 4.

HMW DNA isolation and genome sequencing

All samples were sequenced on a PacBio Sequel II. For samples sourced from ‘Michael’ (Supplementary Table 1), HMW DNA was isolated using Carlson Lysis buffer and Qiagen Genomic tips as described in the ONT Protocol ‘Plant leaf gDNA’ Arabidopsis method. The DNA was further size-selected for fragments longer than 10–25 kb using the ONT Short Fragment Eliminator Kit (EXP-SFE001). HMW DNA was then confirmed by TapeStation Genomic DNA ScreenTape (Agilent 5067-5365) or Femto Pulse Genomic DNA 165 kb Kit (Agilent FP-1002-0275). For samples sourced from ‘OCBD’ (Supplementary Table 1), HMW DNA was isolated using a modified protocol⁷⁶. In brief, samples were ground in a mortar and pestle with liquid nitrogen, two chloroform:isoamyl wash cycles were performed, and Total Pure NGS beads (Omega Biotek) were used as a substitution in the original protocol. Genomic DNA (gDNA) quality and purity were then assessed using a NanoDrop One (ThermoFisher) prior to starting library preparation. Continuous long read (CLR) libraries were made using the PacBio protocol PN 101-693-800 V1. Size selections on gDNA were made using the Blue Pippin U1 High Pass 30–40 kb cassette with a 30–40 kb base pair starting threshold to produce fragment distributions of 60–90 kb. HiFi circular consensus sequencing (CCS) libraries were prepared according to the PacBio protocol (PN 101-853-100 V5). Sheared gDNA fragment distributions with a modal peak ~18 kb were produced using g-Tubes from Covaris and Blue Pippin S1 High Pass 6–10 kb cassettes to remove everything under 10 kb in size.

Pangenome assembly and scaffolding

All genomes labelled Hifiasm_HiC, Hifiasm_Trio_RagTag, Hifiasm_RagTag, and Hifiasm (Supplementary Table 1) were assembled using Hifiasm v0.16.1⁵⁹. When available, Hi-C data and HiFi parental trio data were also incorporated into the assembly process defining the Hifiasm_HiC and Hifiasm_Trio_RagTag types respectively. CLR assemblies were generated using FALCON Unzip from PacBio SMRT Tools 9.0 Suite⁷⁷ and CCS labelled genomes were assembled with HiCanu v2.2⁷⁸. After assembly, Hi-C reads were aligned to the Hifiasm_HiC contigs using the Juicer v1.6.2 pipeline⁶⁰ followed by ordering and orientation utilizing version 180922 of the 3D-DNA pipeline⁶¹. The scaffolded assemblies were then manually corrected using Juicebox v1.11.08⁶². Hifiasm_RagTag and Hifiasm_Trio_RagTag assemblies were scaffolded using the split chromosomes of the 24 Hi-C scaffolded genomes and error checked with yak-0.1 (github.com/lh3/yak). Sourmash v4.6.1⁷⁹ was used to generate a Jaccard similarity matrix between the chromosomes and each un-scaffolded assembly, and the most similar version of chromosome 1 through X was concatenated to generate a reference for scaffolding via RagTag v2.1.0⁸⁰. If the similarity matrix identified the Y chromosome as the best match, the assembly remained un-scaffolded. BUSCO v5.4.3⁷⁹ with the eudicots_odb10 dataset and assembly-stats v1.0.1 (https://github.com/sanger-pathogens/assembly-stats) were used on all assemblies to measure completeness and contiguity.

Reference-based graph construction with Minigraph-cactus

The graph pangenome of all 78 scaffolded and softmasked assemblies was generated with Minigraph-Cactus²⁰. We used the cactus-pangenome command within an Apptainer (v1.1.8) Image⁸¹ (https://quay.io/comparative-genomics-toolkit/cactus:v2.6.7-gpu) and the following parameter flags: --reference EH23a EH23b --vcf --vcfReference EH23a EH23b --giraffe --chrom-og --chrom-vg --viz --gfa --gbz. The seqFile input as well as the output graph in various formats (vg, paf, hal, etc.) can be found at https://resources.michael.salk.edu. We also compiled variants across the pangenome in terms of each assembly’s coordinates by using vg deconstruct -a -C (vg tools v1.61.0 “Plodio”) to derive vcf files from the Minigraph-Cactus gfa output and then using vcfbub --max-ref-length 100000 --max-level 0 to flatten nested variants and remove those >100 kb in length (see 78csatHaps_minigraphcactus_<assembly>.vcf.gz)^20,82,83.

Reference-free graph construction with PGGB

Input sequences and orientation

We generated two versions of each PGGB graph, one with the fasta files provided in the ‘Assembly files’ table and in the JBrowse instance at https://resources.michael.salk.edu (mixed-orientation) and one with fasta files in which the sequences have been consistently oriented to match the plus strand of the corresponding homologous chromosome in EH23a (consistent-orientation).

For PGGB graph 16csatAsms, we generated one graph per autosomal chromosome from the following 16 scaffolded and softmasked assemblies: AH3Ma, AH3Mb, BCMa, BCMb, EH23a, EH23b, GRMa, GRMb, KCDv1a, KCDv1b, KOMPa, KOMPb, MM3v1a, SAN2a, SAN2b and YMv2a. We generated one combined fasta file per chromosome as inputs for PGGB (see 16csatAsms_chr[1-9]_combined.fa.gz and 16csatAsms_chr[1-9]-oOrient_combined.fa.gz for the consistent- and mixed-orientation fasta inputs, respectively, at resources.michael.salk.edu). We constructed per chromosome graphs instead of a single graph for the entirety of all assemblies combined due to the computational requirements for analysing genomes of this size and repetitive content (Extended Data Fig. 6).

For PGGB graph 13csatSexChroms, the 13 scaffolded and softmasked sex chromosome sequences AH3Ma.chrX, AH3Mb.chrY, BCMa.chrX, BCMb.chrY, EH23a.chrX, GRMa.chrY, GRMb.chrX, KCDv1a.chrX, KCDv1b.chrX, KOMPa.chrX, KOMPb.chrY, SAN2a.chrX and SAN2b.chrX were combined into one fasta file (see 13csatSexChromsCombined_filtOrientation.fa.gz and 13csatSexChromsCombined_origOrientation.fa.gz for the consistent- and mixed-orientation fasta inputs, respectively, at https://resources.michael.salk.edu).

Graph generation

Nextflow v24.04.3.5916⁸⁴ was used to run the nf-core/pangenome v1.1.2 - canguro deployment^85,86 of PGGB²² within the nextflow singularity profile. All default PGGB settings were used for graph generation. For PGGB graph 13csatSexChroms, the flag --vcf_spec was used to compile sequence variation across the pangenome relative to each assembly’s coordinates, and each vcf was further processed with vcfbub --max-ref-length 100000 --max-level 0 to flatten nested variants and remove those >100 kb in length²⁰ (see 13csatSexChroms_pggb-fOrient_<assembly>.vcfbub.vcf.gz and 13csatSexChroms_pggb-oOrient_<assembly>.vcfbub.vcf.gz files for vcfs from graphs generated with consistent- and mixed-orientation input fastas, respectively, at https://resources.michael.salk.edu). For PGGB graph 16csatAsms, PGGB was run without the flag --vcf_spec and, instead, vg deconstruct -a was used to compile sequence variation across the pangenome from the final gfa file for each autosomal chromosome (vg tools v1.61.0 “Plodio”)^82,83. Per-autosome vcf files were concatenated into a single file for each assembly using bcftools⁶⁵ and then processed with vcfbub --max-ref-length 100000 --max-level 0 to flatten nested variants and remove those >100 kb in length²⁰ (see 16csatAsms_pggbByChrom_<assembly>.vcf.gz and 16csatAsms_pggbByOriginalChrom_<assembly>.vcf.gz for vcfs from graphs generated with consistent- and mixed-orientation input fastas, respectively, at resources.michael.salk.edu). Identical parameters were used for each pair of graphs generated with consistent- and mixed-orientation inputs.

Visualization

Visualizations of the graph pangenomes were generated from the FINAL_GFA files of the PGGB pipeline run on consistent-orientation input fastas. Vg files were derived from gfa files using vg convert^82,83. Then prepare_vg.sh and prepare_chunks.sh were used to visualize the pangenome variation at regions of interest in a local instance of the Sequence Tube Map server (https://github.com/vgteam/sequenceTubeMap.git, cloned on 4 September 2024).

Short-read mapping to graph pangenome

Short-read sequences from the EH23 F₂ population and Ren et al.² were aligned to the pangenome graph with vg giraffe (example command: vg giraffe -Z {input.inputGBZ} -d {input.inputDist} -m {input.inputMin} -f {input.inputR1} -f {input.inputR2} -t {threads} > {output.outputFile})⁸⁷. Summary statistics were collected with vg stats⁸² (example command: vg stats -a {input.inputGAM} {input.inputGBZ} > {output.outputFile}). Read support was calculated from the GAM file with vg pack⁸² (example command: vg pack -x {input.inputGBZ} -g {input.inputGAM} -Q 5 -t {threads} -o {output.outputFile}). Variants for the F₂ mapping population were called with vg call⁸⁸ (example command: vg call --gbz {input.inputGBZ} -k {input.inputPack} -S EH23b -t {threads} > {output.outputFile}). Downstream processing of VCF files was performed with BCFtools⁶⁵ (example commands: (1) bcftools view -a -f PASS merged.sorted.vcf.gz > merged.sorted.a.PASS.vcf.gz; (2) bcftools norm --fasta-ref EH23b.softmasked.fasta -m -any merged.sorted.a.PASS.vcf.gz > merged.sorted.a.PASS.normed.vcf.gz; (3) bcftools norm --fasta-ref EH23b.softmasked.fasta --rm-dup exact merged.sorted.a.PASS.normed.vcf.gz > merged.sorted.a.PASS.normed_no_dups.vcf.gz). Filtering of the pangenome graph-based VCF file to compare with the linear reference-based VCF file was performed with VCFtools⁶⁶ (example command: vcftools --remove-indels --minGQ 20 --maf 0.25 --max-missing 0.3 --min-alleles 2 --max-alleles 2 --stdout --recode --gzvcf merged.sorted.a.PASS.normed_no_dups.vcf.gz > merged.sorted.a.PASS.normed_no_dups.more_filter_missing0.3.vcf.gz).

Graph pangenome data availability

Input and output files for the graph pangenomes described above (78csatHaps generated by Minigraph-Cactus, and 16csatAsms and 13csatSexChroms generated by PGGB) are available at https://resources.michael.salk.edu. Vcf files have been added as tracks to the Cannabis genomes JBrowse instance at https://resources.michael.salk.edu.

Base-calling methylated cytosines

Genomic reads from the raw ONT FAST5 files generated from Cannabis sequencing samples were used for methylation calling. Genome assemblies generated for the same individuals were used as references for alignment. FAST5 data were converted to POD5 format using the pod5 software package (https://github.com/nanoporetech/pod5-file-format). Methylation calling was performed with ONT base-calling software Dorado version 0.3.4 (https://github.com/nanoporetech/dorado/). Dorado uses the raw POD5 data and a reference to identify methylated cytosines. This was performed with the super high accuracy (SUP) base-calling model trained for R9.4.1 or R10.4.1 pore type and 400 bps translocation speed, according to the sequencing conditions for each line. The assembled genomes generated from each sample were used as references to generate an aligned BAM file with MM/ML tags containing 5mC and 5hmC methylation calls. These were then piled up with modkit (https://github.com/nanoporetech/modkit), and the piled-up calls (aggregating 5mC with 5hmC) were used for calculating genome-wide methylation frequencies across all CG sites.

Gene and repeat prediction

Gene model prediction involved a multi-step pipeline and was applied to all assemblies.

(1) We first curated a repeat library using RepeatModeler⁸⁹ on a small number of high-quality Cannabis assemblies and pre-existing repeat libraries. We used OrthoFinder (v2.5.4)⁹⁰ to group repeats for deduplication. The final repeat library included 10% of the sequences from each repeat orthogroup (minimum 1 sequence) for a total of 6,262 sequences from 5,793 groups.
(a) Finola (GCA_003417725.2); (b) CBDRx (GCF_900626175.2); (c) Purple_Kush (GCA_000230575.5); (d) ERBxHO40_23; (e) ERBxHO40_23; (f) I3; (g) JL (GCA_013030365.1); (h) ERB_F3; (i) Cannbio-2 (GCA_016165845.1); (j) W103; (k) JL_Mother (GCA_012923435.1); (l) FB30; (m) TS1_3_v1; (n) HO40.
(2) For all 193 genomes, repeats were masked with RepeatMasker (v4.1.2)⁹¹ using the repeat library (above).
(3) We predicted gene models with the TSEBRA pipeline (using Braker v2.1.6)⁹². We developed a Snakemake workflow for running TSEBRA, available here: https://gitlab.com/salk-tm/snake_tsebra. We incorporated a variety of pre-existing protein libraries from cannabis and other organisms as evidence: (a) Arabidopsis thaliana; (b) Theobroma cacao; (c) G. max; (d) Rhamnella rubrinervis; (e) Ziziphus jujuba; (f) Trema orientale; (g) Vitis vinifera; (h) Prunus persica; (i) Morus notabilis; (j) C. sativa; (k) H. lupulus.
(4) RNA-seq libraries (Supplementary Table 2) were aligned with either hisat2 (v2.2.1)⁹³ for short-read mapping, or minimap2 (v2.24)⁷⁵ for full-length cDNA. Short-read Illumina data were trimmed with fastp⁹⁴. The expression data were incorporated into the TSEBRA pipeline as gene model evidence.
(5) Putative functional annotations of gene models were assigned using eggnog-mapper (v2.0.1)⁹⁵.
(6) Overall gene model quality and completeness were assessed by comparing genome BUSCO (v5.4.3)⁹⁶ scores to proteome BUSCO scores on the eudicots_odb10 dataset (Supplementary Table 1: https://doi.org/10.6084/m9.figshare.25869319.v2).
(7) EDTA v1.9.6⁹⁷ was also utilized to identify TEs in the cannabis pangenome with the following command: EDTA.pl --genome {inputFastaFile} --anno 1 --threads 32.
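
The deduplication rule in step 1 (10% of the sequences from each repeat orthogroup, with a minimum of one) can be sketched as follows. How the 10% was rounded and how sequences were chosen within an orthogroup are not stated here, so the floor rounding, the random selection and the function name are assumptions for illustration.

```python
import math
import random


def subsample_orthogroup(sequence_ids, fraction=0.10, seed=0):
    """Keep a fraction of one repeat orthogroup's sequences, never fewer than one."""
    # Assumption: round down, then enforce the stated minimum of one sequence.
    n_keep = max(1, math.floor(len(sequence_ids) * fraction))
    # Assumption: representatives are drawn at random (seeded for reproducibility).
    rng = random.Random(seed)
    return rng.sample(list(sequence_ids), n_keep)
```

Under this rule every orthogroup with fewer than 20 members contributes exactly one sequence, which would be consistent with a library of 6,262 sequences drawn from 5,793 groups.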

Ideogram methods

Ideograms for each pair of chromosomes for the 78 chromosome-level, haplotype-phased genomes were created using ggplot2 [https://ggplot2.tidyverse.org] in R (www.R-project.org) (Fig. 1 and Extended Data Fig. 5). The length of each chromosome was determined using ‘nuccomp.py’ (https://github.com/knausb/nuccomp) and used with ggplot2::geom_rect() to initialize the plot. Each chromosome was divided into windows of one million base pairs, and the number of CpG motifs in each window was counted with the program motif_counter.py (https://github.com/knausb/nuccomp). The CpG count was converted into a rate by dividing by the window size; this also accommodated the last window of each chromosome, which was less than one million base pairs in size. These rates were scaled by subtracting the minimum rate and then dividing by the maximum rate (the maximum rate after subtracting the minimum rate), on a per chromosome basis. To visually emphasize the enrichment of the CpG motif in the centromeric region, the inverse of the CpG rate was taken by subtracting the scaled CpG rate for each window from one. This scaled, inverse CpG rate was used for the width of each 1-Mb window, and windows were coloured based on gene density using the viridis magma palette (https://doi.org/10.5281/zenodo.4679424).
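
The windowed CpG calculation described above can be sketched as follows. This is an illustrative reimplementation rather than motif_counter.py itself; the function names are hypothetical, and the handling of a chromosome with no variation in CpG rate is an assumption.

```python
def cpg_window_rates(sequence: str, window: int = 1_000_000):
    """Count CpG motifs in consecutive windows and convert counts to per-base rates."""
    seq = sequence.upper()
    rates = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        # Dividing by the actual chunk length accommodates the shorter final window.
        # Note: a CpG that straddles a window boundary is not counted in this sketch.
        rates.append(chunk.count("CG") / len(chunk))
    return rates


def scaled_inverse(rates):
    """Min-max scale the rates for one chromosome, then take one minus the scaled rate."""
    lowest = min(rates)
    span = max(rates) - lowest
    if span == 0:
        # Assumption: with no variation, every window is drawn at full width.
        return [1.0 for _ in rates]
    return [1.0 - (rate - lowest) / span for rate in rates]
```

With a toy sequence and a window of 4 bp, cpg_window_rates("CGCGAAAATT", window=4) gives rates of 0.5, 0 and 0, and scaled_inverse maps these to widths of 0, 1 and 1.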

Structural variation among each pair of chromosomes was determined using minimap2⁷⁵ alignments. The minimap2 comparisons were annotated using SyRI⁹⁸. The syntenic and inverted regions were plotted using ggplot2::geom_polygon() in a manner inspired by plotsr⁹⁹ but implemented in R (github.com/ViningLab/CannabisPangenome).

The location of candidate loci within EH23 haplotypes A and B were determined using BLASTN¹⁰⁰. Query sequences were as follows: CBCA synthase (LY658671.1), CBDA synthase (AB292682, AB292683, AB292684), THCA synthase (AB212829, AB212830), and olivetolic acid cyclase (NC_044376.1:c4279947-4279296, NC_044376.1:c4272107-4271242). These sequences were combined with centromeric, telomeric and rRNA sequences in the file blastn_queries_rrna_cann.fasta (https://github.com/ViningLab/CannabisPangenome). BLASTN was called with the following options: -task megablast -evalue 0.001 -perc_identity 90 -qcov_hsp_perc 90. Tabular results (subject chromosome, subject start of alignment, subject end of alignment) from BLASTN were read into R and plotted on ideograms with ggplot2::geom_rect() (https://ggplot2.tidyverse.org).

Centromere and telomere analysis

Long-read genome assemblies based on ONT and PacBio sequencing enable the assembly of some of the highly repetitive centromere and telomere sequences¹⁰¹. Centromeres were identified by searching genomes with Tandem Repeats Finder (TRF; v4.09) using modified settings (1 1 2 80 5 200 2000 -d -h)¹⁰². Tandem repeats were reformatted, summed and plotted to find the highest copy number tandem repeat, following our previous methods for identifying centromeres¹⁰¹ (Extended Data Fig. 5c).

Telomeres were estimated using two different methods. First, the TRF output was queried for repeats with a period of 7 for the 14 different versions of the canonical telomere base repeat: AAACCCT, AACCCTA, ACCCTAA, CCCTAAA, CCTAAAC, CTAAACC, TAAACCC, TTTAGGG, TTAGGGT, TAGGGTT, AGGGTTT, GGGTTTA, GGTTTAG and GTTTAGG: (grep -a 'PeriodSize=7' *.genome.fasta.1.1.2.80.5.200.2000.dat.gff | grep -a 'Consensus=AAACCCT\|Consensus=AACCCTA\|Consensus=ACCCTAA\|Consensus=CCCTAAA\|Consensus=CCTAAAC\|Consensus=CTAAACC\|Consensus=TAAACCC\|Consensus=TTTAGGG\|Consensus=TTAGGGT\|Consensus=TAGGGTT\|Consensus=AGGGTTT\|Consensus=GGGTTTA\|Consensus=GGTTTAG\|Consensus=GTTTAGG' -). Second, we searched raw ONT and PacBio reads for telomere sequences using our TeloNum algorithm¹⁰³. Although the results were variable across the pangenome assemblies, in general, telomere sequence was found at the end of the chromosome with an average length of 16 kb for PacBio assemblies and 60 kb for ONT assemblies. The differences between ONT and PacBio telomere length most likely reflected the input read length of >100 kb and 15–20 kb, respectively. TeloNum analysis of the raw reads supported the distributions from the assemblies, consistent with most chromosomes having assembled telomere sequence that is shorter than its actual length. Cannabis telomeres are on the longer side for a eudicot, which could be explained by its predominantly clonal propagation for medicinal uses¹⁰⁴.
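
The 14 versions of the canonical telomere repeat listed above are the seven cyclic rotations of AAACCCT together with the seven rotations of its reverse complement. A minimal sketch that enumerates them (the function names are illustrative):

```python
def rotations(motif: str):
    """All cyclic rotations of a motif, starting with the motif itself."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def telomere_variants(motif: str = "AAACCCT"):
    """Rotations of the motif on both strands (14 strings for this 7-mer)."""
    return rotations(motif) + rotations(reverse_complement(motif))


print(telomere_variants())
```

The resulting strings could be used to build the alternation in the grep pattern rather than typing each rotation by hand.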

Centromere sequence was identified based on the hypothesis that it will be the most abundant repeat in the genomes that also has a higher-order repeat (HOR) structure^101,105. Two different repeats with HOR were identified in the PacBio HiFiasm assemblies, whereas only one was found in the ONT assemblies and the previous CBDRx assembly, which is based on ONT sequence¹¹. The highest copy number repeat was 370 bp that varied between 20–30 Mb (2–4% of the total genome) with HOR at 740 and 1,110 bp (Extended Data Fig. 5). The second highest, and the only one found in the ONT assemblies, was a 237 bp repeat that varied between 3–5 Mb (0.4–1.0% of the total genome) and had HOR at 474 and 711 bp (Extended Data Fig. 5). Mapping of the 370-bp repeat to the chromosome-resolved genomes revealed that this repeat was primarily located at the end of the chromosomes next to the telomere sequence, which suggested that it may be related to the CS-1 sub-telomeric repeat¹⁰⁶. Comparison of the putative 370-bp centromeric repeat and the CS-1 sub-telomeric repeat showed they are the same repeat element. By contrast, the putative 237-bp centromeric repeat predominantly was found on chr. 6 and chr. 8 in the predicted centromere region (Fig. 1a and Extended Data Fig. 5). However, smaller 237-bp arrays were found on all chromosomes across the assemblies in the predicted centromere region (based on CpG, methylation, gene content and TEs) with most assemblies having small arrays on chr. 6 and chr. 8.

Ribosomal DNA detection and quantification

Ribosomal DNA (rDNA) 45S (18S, 5.8S and 26S) and 5S sequences were identified in the CBDRx/CS10 assembly (LOC115701787 5.8S, LOC115701759 18S, LOC115701762 26S and LOC115721558 5S) and used to BLAST against the pangenome assemblies (Fig. 1a and Extended Data Fig. 5). Across the scaffolded genomes the 45S array was predominantly located on the acrocentric end of chr. 8, and the 5S was located exclusively on chr. 7 between the cannabinoid synthase cassette array, consistent with published results with fluorescence in situ hybridization¹⁰⁶. However, partial arrays were found in some assemblies on all of the chromosomes (Extended Data Fig. 5). The distribution of the partial arrays on different chromosomes could reflect variability across the genomes since some share similar locations across assemblies. Most arrays are found on the un-scaffolded contigs, suggesting that these variable arrays across different chromosomes could be the result of mis-assemblies. In general, there are on average 1,000 45S and 2,000 5S arrays in the cannabis genome; some assemblies have the 5S array completely assembled on chr. 7.

Allele frequency methods

Genotype data in the VCF format¹⁰⁷ was input into R using vcfR¹⁰⁸. Allele and heterozygous counts were made with vcfR. Wright’s F_IS was calculated¹⁰⁹ to provide the deviation in heterozygosity from our random, Hardy–Weinberg, expectation. Wright’s F_IS was calculated as (HS − HO)/HS, where HO is the observed number of heterozygotes divided by their number and HS is the number of heterozygotes we expect based on the allele frequencies, calculated as the frequency of the first allele multiplied by the frequency of the second multiplied by two and divided by their number. Scatter plots were generated using ggplot2. Graphical panels were assembled into a single graphic using ggpubr (https://cran.r-project.org/package=ggpubr).
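
As a worked illustration of the formula above, the following sketch computes Wright's F_IS for a single biallelic site from genotype counts. It is not the vcfR-based code used for the analysis; the function name is hypothetical, and it assumes that HO is the proportion of heterozygous samples and HS is the Hardy–Weinberg expectation 2pq.

```python
def wright_fis(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """F_IS = (HS - HO) / HS for one biallelic site, from genotype counts."""
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)  # frequency of the first allele
    q = 1.0 - p                            # frequency of the second allele
    h_obs = n_het / n                      # HO: observed proportion of heterozygotes
    h_exp = 2.0 * p * q                    # HS: expected proportion under Hardy-Weinberg
    if h_exp == 0:
        raise ValueError("F_IS is undefined for a monomorphic site")
    return (h_exp - h_obs) / h_exp
```

Genotype counts of 25, 50 and 25 match the Hardy–Weinberg expectation and give F_IS = 0, a site with no heterozygotes gives F_IS = 1, and a site where every sample is heterozygous gives F_IS = −1.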

PanKmer genome analysis

Using PanKmer, we constructed two 31-mer indexes: a ‘full’ index of 193 Cannabis assemblies and a ‘scaffolded-only’ index of 78 scaffolded assemblies, using the ‘pankmer index’ command with default parameters. We calculated and plotted pairwise Jaccard similarities for all assemblies in the full index using ‘pankmer adj-matrix’ followed by ‘pankmer clustermap --metric jaccard’. We calculated and plotted collector’s curves for both the full and scaffolded-only indexes using the ‘pankmer collect’ command with default parameters. All scripts used for this analysis can be found on GitHub.
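For readers unfamiliar with k-mer-based similarity, the Jaccard similarity between two assemblies is the number of shared k-mers divided by the number of k-mers found in either. The following minimal Python sketch illustrates the quantity only; it is not PanKmer's implementation, the function names are hypothetical, and for simplicity it does not merge a k-mer with its reverse complement.

```python
def kmer_set(sequence, k=31):
    """Return the set of k-mers in a sequence, skipping k-mers that contain N."""
    sequence = sequence.upper()
    kmers = set()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if "N" not in kmer:
            kmers.add(kmer)
    return kmers


def jaccard(kmers_a, kmers_b):
    """Jaccard similarity: |A intersect B| / |A union B|."""
    union = kmers_a | kmers_b
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(kmers_a & kmers_b) / len(union)
```

With k = 3, the toy sequences ACGTA and CGTAC share two of four distinct 3-mers, giving a similarity of 0.5.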

Analysis of gene-based pangenome

We define the gene-based pangenome as the set of all gene families (orthogroups) with a representative in at least one genome of the pangenome. For each of 193 (as well as the 78 chromosome-level, haplotype-phased genomes, as a separate set) C. sativa genomes, the primary transcript of each high-confidence gene prediction was chosen as a representative. The proteins corresponding to each primary transcript were clustered into orthogroups using Orthofinder (v.2.5.4, see Orthofinder and synteny analysis section below)⁹⁰. The set of primary transcript CDS were merged into a single FASTA file, and exact duplicates were removed with SeqKit (2.7.0)¹¹⁰. Among primary transcripts, likely contaminants were determined by identifying transcripts predicted on contigs where fewer than 90% of predictions were annotated as either ‘viridiplantae’ or ‘eukaryote’ according to eggNOG-mapper (v2.1.12)⁹⁵, and were removed. To mitigate the problem of unannotated genes, we aligned coding sequences of all primary transcripts to each of the 193 (78) cannabis genomes using minimap2 (v2.26)⁷⁵ with parameters ‘minimap2 -c -x splice’ to generate a PAF file with CIGAR strings for each genome. For each genome, if an aligned CDS sequence had a mapping quality of at least 60, had a number of CIGAR matches at least 80% of the query length, and did not overlap a directly annotated gene, it was considered an unannotated gene and its orthogroup was marked as present in the target genome. The set of orthogroups that had at least one representative present in all 193 (78) genomes were considered to be the core genome, the remaining orthogroups were considered to be the variable genome. The presence or absence of each orthogroup in each genome was recorded in a table (see Data availability). All scripts for this analysis are available from GitHub.
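The filter used to flag unannotated genes can be sketched as follows. This is an illustrative Python example, not the script available from GitHub; the function names and the layout of the annotation lookup are hypothetical, and the number of CIGAR matches is approximated here by the residue-match column (column 10) of the PAF record.

```python
def parse_paf(line):
    """Extract the PAF fields needed for filtering (standard 12 leading columns)."""
    f = line.rstrip("\n").split("\t")
    return {
        "query": f[0], "qlen": int(f[1]),
        "target": f[5], "tstart": int(f[7]), "tend": int(f[8]),
        "matches": int(f[9]), "mapq": int(f[11]),
    }


def overlaps_annotation(target, start, end, annotated):
    """True if [start, end) overlaps any annotated gene interval on the target."""
    return any(start < g_end and g_start < end
               for g_start, g_end in annotated.get(target, []))


def is_unannotated_gene(line, annotated, min_mapq=60, min_match_frac=0.8):
    """Apply the three criteria: MAPQ, match fraction, no overlap with a gene.

    annotated: dict mapping sequence name to a list of (start, end) gene
    intervals (hypothetical structure used for this sketch).
    """
    rec = parse_paf(line)
    if rec["mapq"] < min_mapq:
        return False
    if rec["matches"] < min_match_frac * rec["qlen"]:
        return False
    return not overlaps_annotation(rec["target"], rec["tstart"], rec["tend"], annotated)
```

An orthogroup would then be marked as present in a target genome if any of its CDS queries passes this filter.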

Haplotypes, orthogroups and scores

In pangenomics, collector’s curves (pangenome rarefication) show the relationship of the number of haplotypes (here H) to the number of gene families or orthogroups (here X).

Given the X orthogroups distributed across H haplotypes, let the score s_x ∈ [0, H] of an orthogroup x be the number of haplotypes in which x is present. For any score s let P(s) be the number of orthogroups with score equal to s.

$$P(s)=\sum _{x\in {x}_{0}...{x}_{X}}{I}_{{s}_{x}=s}(x)$$

Where I_{s_x = s}: {x₀…x_X} → {0,1} is the indicator function of the set {x ∈ x₀…x_X: s_x = s}.

The collector’s curves

The collector’s curve C(h): [1, H] → [0, X] is the expected number of orthogroups that will be present in a subset of h haplotypes randomly drawn from the total set of H. It can be calculated by:

$$C(h)=\sum _{s\in 1...H}P(s)\left(1-\prod _{i\in 0...h-1}\frac{H-s-i}{H-i}\right)$$

The expected number of core orthogroups ${C}^{\wedge }(h)$ can be estimated by

$${C}^{\wedge }(h)=\sum _{s\in 1...H}P(s)\prod _{i\in 0...h-1}\frac{s-i}{H-i}$$

Each of these is a special case of a general formula for the expected number of orthogroups present in at least n of the h sampled haplotypes, based on the hypergeometric survival function:

$${C}_{n}(h)=\sum _{s\in 1...H}P(s){S}_{{hyp}}(n,H,s,h)$$

Where S_hyp is the hypergeometric survival function, taken here as the probability of at least n successes, that is, the hypergeometric cumulative distribution function evaluated at n − 1 and subtracted from 1:

$${S}_{{\rm{hyp}}}(n,H,s,h)=1-{{\rm{CDF}}}_{{\rm{hyp}}}(n-1,H,s,h)$$

Where for clarity, the hypergeometric probability mass function (PMF) is:

$${{\rm{PMF}}}_{{\rm{hyp}}}(n,H,s,h)=\frac{\binom{s}{n}\,\binom{H-s}{h-n}}{\binom{H}{h}}$$

With binomial coefficients defined as:

$$\binom{s}{n}=\frac{s!}{n!\,(s-n)!}$$

And, conventionally, the cumulative distribution function (CDF_hyp) is:

$${{\rm{CDF}}}_{{\rm{hyp}}}(n,H,s,h)=\sum _{{n}_{i}\le n}{{\rm{PMF}}}_{{\rm{hyp}}}({n}_{i},H,s,h)$$

So defined, we can see that the pan-genome collector’s curve C(h) is equivalent to C₁(h), while the core genome collector’s curve ${C}^{\wedge }(h)$ is equivalent to C_h(h):

$$C(h)={C}_{1}(h)$$

$${C}^{\wedge }(h)={C}_{h}(h)$$
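The curves defined above can be evaluated directly from the per-orthogroup scores. The following is a minimal Python sketch, not the code used for this study; the function names are hypothetical, and the tail probability is taken as the probability that at least n of the h sampled haplotypes carry an orthogroup, so that n = 1 gives the pangenome curve and n = h gives the core curve.

```python
from collections import Counter
from math import comb


def prob_at_least(n, H, s, h):
    """P(at least n of h sampled haplotypes carry an orthogroup present in s of H)."""
    total = comb(H, h)
    favourable = sum(comb(s, k) * comb(H - s, h - k)
                     for k in range(n, min(s, h) + 1))
    return favourable / total


def collectors_curve(scores, H, h, n=1):
    """Expected number of orthogroups present in at least n of h haplotypes.

    scores: iterable of per-orthogroup scores s_x (number of haplotypes, out of
    H, in which each orthogroup is present).
    """
    p_of_s = Counter(scores)  # P(s): number of orthogroups with score s
    return sum(count * prob_at_least(n, H, s, h)
               for s, count in p_of_s.items() if s > 0)


def pan_curve(scores, H, h):
    return collectors_curve(scores, H, h, n=1)


def core_curve(scores, H, h):
    return collectors_curve(scores, H, h, n=h)
```

For example, with four orthogroups scored 3, 3, 1 and 2 across H = 3 haplotypes, sampling all three haplotypes recovers all four orthogroups in the pangenome curve and the two fully shared orthogroups in the core curve.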

k-mer based collector’s curves

The definition of the collector’s curve is agnostic to the unit of genomic sequence, so the calculation of a k-mer based curve is identical to the orthogroup based curve, excepting that X will be the number of k-mers and x will represent a k-mer, rather than an orthogroup.

k-mer analysis of pangenome assemblies and global diversity short-read libraries

Trim_galore was used to trim Illumina short-read sequences from Ren et al.² using the option --2colour 20⁶³. These reads were next filtered for low abundance reads (trim-low-abund.py -C 10 -M 5e9), and then used to make a k-mer sketch (sourmash sketch dna -p scaled=1000,k=31)⁷⁹. All pangenome assemblies were also analysed for 31-mer frequencies (sourmash sketch dna -p scaled=1000,k=31). Finally, all pairwise samples of Illumina read and pangenome assemblies were compared (sourmash compare -p 64 *.sig -k 31). The 31-mer distances were then plotted in R using hclust(dist(sourmash_comp_matrix), method = "average").

Identification of pangenome core and dispensable genes

We assigned core and dispensable (nearly-core, cloud, shell, private) genes based on orthogroup membership (https://github.com/padgittl/CannabisPangenomeAnalyses/tree/main/CoreDispensableGenes). Core genes were defined as being present in 100% of genomes (193 genomes), nearly-core genes were defined as being present in 95–99% of genomes (183–192 genomes), shell genes were found in 5–94% of genomes (10–182 genomes), cloud genes were found in 2–5% of genomes (3–9 genomes), and unique genes were found in 0.5–1% of genomes (1–2 genomes)¹¹¹. This analysis was performed on all 193 genomes (Fig. 1e) and also visualized according to population (Supplementary Fig. 5). For the contig-level assemblies (103 genomes), only contigs with similarity to the ten chromosomes of EH23a were included. Gene sets were filtered to include only genes that were present on the ten chromosomes and contigs homologous to the chromosomes. We performed an analysis of functional enrichment with topGO⁷³ for each of the core, shell, cloud, nearly-core, and unique gene groupings for each genome, where the background gene set was all genes with a GO term for a given genome. Among the core genes, the most common significant GO term in the pangenome was sesquiterpene biosynthetic process (GO:0051762), which was significant in all but one genome (PBBK), followed by GO:0045338 farnesyl diphosphate metabolic process, which was absent in three genomes (public genomes: CANN, FIN and PBBK) (Supplementary Table 4). This analysis was restricted to high-confidence gene models predicted with the TSEBRA pipeline. By contrast, the collector’s curve analysis of gene content also included unannotated genome regions lacking gene model predictions, but with similarity to known genes, as a way to capture unsampled diversity (Fig. 1c,d and Supplementary Fig. 4; see also ‘Analysis of gene-based pangenome’).
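The category assignment can be expressed as a simple lookup on the number of genomes in which an orthogroup is present. This is a minimal Python sketch using the genome-count boundaries stated above for the 193-genome set; the function name is hypothetical and the linked repository should be consulted for the actual implementation.

```python
def pangenome_category(n_present):
    """Assign a pangenome category from the number of genomes (out of 193)
    in which an orthogroup is present, using the count ranges given in the text."""
    if not 1 <= n_present <= 193:
        raise ValueError("n_present must be between 1 and 193")
    if n_present == 193:
        return "core"
    if n_present >= 183:
        return "nearly-core"
    if n_present >= 10:
        return "shell"
    if n_present >= 3:
        return "cloud"
    return "unique"
```

Applying this function to each row of an orthogroup presence or absence table yields the per-category counts.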

Repeat analysis

Calculation of divergence time in TEs

Estimates of divergence time shown (Fig. 2b,c) were calculated using the equation T = (1 − identity)/2µ, where identity was obtained from EDTA output GFF3 files described previously⁹⁷. We used a substitution rate (µ) of 6.1 × 10⁻⁹ from Arabidopsis^112,113. This analysis was performed on all genomes.
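As a worked illustration of the equation above, the following minimal Python sketch converts an LTR identity value into a divergence estimate. The function name is hypothetical, and the result is expressed in whatever time unit the substitution rate is defined over.

```python
def divergence_time(identity, mu=6.1e-9):
    """T = (1 - identity) / (2 * mu), with identity given as a fraction (0 to 1)."""
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be a fraction between 0 and 1")
    return (1.0 - identity) / (2.0 * mu)
```

With µ = 6.1 × 10⁻⁹, an LTR pair with 98% identity corresponds to T ≈ 1.64 × 10⁶ time units, and identical LTRs give T = 0.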

Identification of solo to intact LTR-RT ratio

To identify solo LTRs and intact LTR-RTs, we used the EDTA pipeline on 193 cannabis genomes⁹⁷. We identified solo LTRs by first collecting the set of LTRs that were not assigned as intact LTR-RTs, which are retrieved on the basis of ‘method=homology’ in the attribute column of the TEanno.gff3 file. We applied thresholds to isolate solo LTRs from truncated and intact LTRs, as well as internal sequences of LTR-RTs. These thresholds include a minimum sequence length of 100 bp, 0.8 identity relative to the reference LTR, and a minimum alignment score¹¹⁴ of 300. We also required that the four adjacent LTR-RT annotations did not have the same LTR-RT ID¹¹⁵. Further, we required a minimum distance of 5,000 bp to the nearest adjacent solo-LTR, intact LTR or internal sequence¹¹⁶. Last, we kept solo-LTR sequences that fell within the 95th percentile for LTR lengths¹¹⁷. Overall, this method represents a modified approach based on the solo_finder.pl script from LTR_retriever¹¹⁴ and the LTR_MINER script¹¹⁶ with guidance from the github page for LTR_retriever (https://github.com/oushujun/LTR_retriever/issues/41).

Enrichment of TEs flanking genomic features

The method presented as part of PlanTEnrichment¹¹⁸ was adapted for the cannabis pangenome to assess TE enrichment both upstream and downstream of different genomic features, including cannabinoid synthase genes. The goal of the analysis was to identify TEs that are significantly associated with a specific category of genomic feature. In brief, ‘X’ represents a specific type of TE and ‘Y’ encompasses all TEs. The total number of X located upstream or downstream of a specific genomic feature (for example, cannabinoid synthases) is denoted as a; the total number of X located upstream or downstream of all genomic features (for example, all genes) is b; the total number of Y located upstream or downstream of a specific genomic feature (cannabinoid synthases) is c; and the total number of Y located upstream or downstream of all genomic features (all genes) is d. An enrichment score (ES) is defined as ${\rm{ES}}=(a/b)/(c/d)$, and the P value is defined as $p=(a+b)!(c+d)!(a+c)!(b+d)!/(a!b!c!d!N!)$, where N is the sum of a, b, c and d. A multiple test correction¹¹⁹ was performed on the P values using the Python library statsmodels¹²⁰. Significance threshold cut-offs included a false discovery rate (FDR) < 0.05 and ES ≥ 2. We used bedtools intersect¹²¹ to collect and survey the set of TEs located 1 kb upstream or downstream of the genomic feature category of interest. An example command: bedtools intersect -a assemblyID_genomic_feature_coord_file.txt -b assemblyID.TE.gff3 -wo > assemblyID_intersect_results.txt.
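The enrichment score and P value formulas above can be evaluated as in the following minimal Python sketch. The function names are hypothetical; log-gamma is used in place of explicit factorials only to avoid overflow with large counts, and this sketch is not the adapted PlanTEnrichment code itself.

```python
from math import exp, lgamma


def enrichment_score(a, b, c, d):
    """ES = (a / b) / (c / d)."""
    return (a / b) / (c / d)


def table_probability(a, b, c, d):
    """p = (a+b)!(c+d)!(a+c)!(b+d)! / (a! b! c! d! N!), with N = a + b + c + d."""
    def log_factorial(n):
        return lgamma(n + 1)

    n_total = a + b + c + d
    log_p = (log_factorial(a + b) + log_factorial(c + d)
             + log_factorial(a + c) + log_factorial(b + d)
             - log_factorial(a) - log_factorial(b)
             - log_factorial(c) - log_factorial(d)
             - log_factorial(n_total))
    return exp(log_p)
```

A TE type would be reported as enriched when its corrected P value is below 0.05 and its ES is at least 2.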

Distance between genes and TEs

The median and mean distances between genes and each of the TE categories were calculated using bedtools sort (bedtools sort -i genome.TEs.bed > genome.sorted.TEs.bed) and bedops closest-features (command: closest-features --closest --header --dist genome.sorted.genes.bed genome.sorted.TEs.bed > genome.closest_features.bed)¹²². To obtain the initial pre-sorted BED file for genes, the following command was used: cat genes.gff3 | grep mRNA | grep '\.chr' | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$9}' > genome.genes.bed. For TEs, the following command was used: cat genome.EDTA.TEanno.gff3 | grep '\.chr' | awk '{print $1"\t"$4"\t"$5"\t"$7"\t"$3"\t"$9}' > genome.TEs.bed. To calculate mean and median values, the built-in Python statistics module was used.

Enrichment of genes associated with different categories of TEs

We performed a GO term enrichment analysis to identify genes that were statistically significantly located near different types of TEs on the full pangenome. To identify genes near TEs, we first created a concatenated, sorted bed file with both gene and TE coordinates to find the nearest TE for a given gene, while excluding cases where the closest genomic feature to a given gene was another gene. For scaffolded genomes, genes and TEs were restricted to the ten chromosomes. For contig-level assemblies, genes were included if they were on a contig with similarity to one of the ten EH23a chromosomes. Next, we identified gene/TE pairs using bedops closest-features¹²². We performed a GO enrichment test for each genome separately using topGO with parameters algorithm = ‘weight01’, statistic = ‘fisher’, and Benjamini–Hochberg multiple test correction with FDR < 0.05⁷³. The background gene universe for statistical comparison was the set of all genes with a GO term for a given genome. To assess broad patterns, only GO terms that were significant in at least five genomes were considered further. This analysis included the full set of genomes (Supplementary Table 11).
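The Benjamini–Hochberg correction applied to the enrichment P values can be illustrated with a short, self-contained Python sketch. It is written for exposition only and is not the routine called in this analysis; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A GO term would then be retained for a genome when its adjusted P value is below 0.05.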

Phylogeny of TEs surrounding cannabinoid synthases

The genomic coordinates for the 2 kb flanking distance surrounding copies of CBCAS, CBDAS and THCAS for the 78 scaffolded assemblies were retrieved with bedtools flank (bedtools flank -i assemblyID_synthase_coords.bed -g chromSizes.txt -l 2000 -r 2000 > assemblyID_flanking_2000.bed). Next, the TEs contained in this flanking region were retrieved using bedtools intersect (bedtools intersect -a assemblyID_flanking_2000.bed -b assemblyID.EDTA.TEanno.gff3 -wo > assemblyID_intersect_2000.bed)¹²¹. The genomic sequences for each of the TE types identified with bedtools intersect were collected in a fasta file and aligned with mafft (mafft --auto helitron.fasta > helitron_aln.fasta)¹⁰⁷. A maximum-likelihood tree was constructed with FastTree (FastTree -nt -gtr -gamma helitron_aln.fasta > helitron_aln.tree)¹²³. The tree was visualized with FigTree¹²⁴. To reduce redundancy in the full set of LTRs, CD-HIT was applied to the set of sequences, prior to multiple sequence alignment (cd-hit-est -i Ty1_LTRs.fasta -o Ty1_LTRs.cdhit.fasta -c 1)¹²⁵.

Expression analysis of active TEs in EH23

The non-redundant TE sequence library from EDTA was provided as the ‘transcriptome’ to salmon. Each of the EH23 RNA-seq samples was mapped to the TE transcriptome. Similar to the gene expression analysis, the minimum TPM threshold for a given TE was ≥0.1 TPM in ≥20% of samples¹²⁶. The top 50 expressed TEs were visualized as a heatmap, showing log₂TPM to represent log fold change.

Observed/expected CpG

‘CpG islands’ are defined as unmethylated regions spanning >200 bp, GC content >50% and observed/expected CpG ratio >0.6. Cytosine methylation over time results in a loss of CpG dinucleotides after cytosine is deaminated to thymine. With cytosine methylation, the expectation is that CpG dinucleotides (CG, CHG, CHH (where H is A, T, or C)) will have greater methylation activity. The observed/expected CpG ratio calculation^127,128 is: $({\rm{CpG}}\,{\rm{dinucleotide}}\,{\rm{count}}/L)/({\rm{C}}\,{\rm{count}}/L\times {\rm{G}}\,{\rm{count}}/L)$. Observed/expected CpG patterns were visualized in Fig. 2h,k.
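The observed/expected CpG ratio given above can be computed for a single sequence window as in this minimal Python sketch. The function name is hypothetical, and L is taken to be the window length.

```python
def cpg_observed_expected(sequence):
    """(CpG count / L) / ((C count / L) * (G count / L)) for one sequence window."""
    sequence = sequence.upper()
    length = len(sequence)
    c_count = sequence.count("C")
    g_count = sequence.count("G")
    cpg_count = sequence.count("CG")  # CpG dinucleotides on the given strand
    if length == 0 or c_count == 0 or g_count == 0:
        return None  # ratio undefined without both C and G
    observed = cpg_count / length
    expected = (c_count / length) * (g_count / length)
    return observed / expected
```

Values well below 1 indicate depletion of CpG dinucleotides relative to the base composition of the window.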

Analysis of TEs directly flanking SVs

For each of the SV subtypes (inversions (INVS), duplications (DUPS), translocations (TRANS) and inverted translocations (INVTR)), the flanking region 500 bp upstream and downstream of each breakpoint (1 kb total for each breakpoint) was surveyed for TE content, using both intact and fragmented annotations. The set of 78 scaffolded, chromosome-level genomes were included, grouped by population. To compare with the genome at large, a random window was retrieved from the same genome and chromosome, with the same length as each of the SVs with bedtools shuffle, and the flanking windows were retrieved for each of the simulated breakpoints. Only cases where a specific type of TE was associated with both breakpoints of a single SV were further assessed with bedtools intersect. Both fragmented and intact TEs were included in this analysis. Statistical significance was assessed using Welch’s two-sided t-test in SciPy⁷¹. TEs occur more frequently near SV breakpoints (500 bp upstream and downstream of the breakpoint; 1 kb total) than in randomly selected regions of the same length from the same chromosome and genome. To overcome differences in abundance, the randomly shuffled regions of the genome were bootstrapped (1,000 replicates), with the requirement that each of the simulated, shuffled TE datasets match the number of observed breakpoints in the population. The TE content of observed and simulated data was assessed for statistical significance with Welch’s two-sided t-test in scipy⁷¹ and Benjamini–Hochberg multiple test correction (alpha=0.5, method = ‘indep’, is_sorted=False)¹²⁰. A test statistic and P value was generated for each of the 1,000 bootstrap replicates. The average test statistic and P value were then calculated (Supplementary Table 13).

Orthofinder and synteny analysis

We ran Orthofinder version 2.5.4 to aid in analysis of the 193 cannabis proteomes. Two runs were completed. The first was focused on our highest quality cannabis assemblies and only included scaffolded assemblies along with dozens of other plant samples from Plaza and a few samples from NCBI. Another run, including all of our cannabis pangenome assemblies, along with close relatives sourced from Plaza, was also produced to allow for detailed protein level analysis of the remaining assemblies. In all cases, only the primary (longest isoform unless otherwise annotated) protein sequence was used. Orthofinder results were analysed using a variety of methods, including Orthobrowser¹²⁹, which is capable of generating static web pages that allow for simultaneous visualization of gene tree dendrograms, gene tree multiple sequence alignments, and synteny of the selected gene and surrounding genes across all of the genomes (https://resources.michael.salk.edu/root/home.html).

Non-cannabis genomes included in the scaffolded cannabis Orthofinder run: (1) Amborella trichopoda; (2) Aquilegia oxysepala; (3) A. thaliana; (4) C. sativa; (5) Carpinus fangiana; (6) Carya illinoinensis; (7) Ceratophyllum demersum; (8) Citrullus lanatus; (9) Corylus avellana; (10) Cucumis melo; (11) Cucumis sativus; (12) Fragaria vesca; (13) Fragaria X; (14) Lotus japonicus; (15) Magnolia biondii; (16) Malus domestica; (17) Manihot esculenta; (18) M. notabilis; (19) Nelumbo nucifera; (20) Oryza sativa; (21) Parasponia andersoni; (22) P. persica; (23) Quercus lobata; (24) Rosa chinensis; (25) Sechium edule; (26) T. orientale; (27) Trochodendron aralioides; (28) Vaccinium macrocarpon; (29) V. vinifera; (30) Z. jujuba; and (31) H. lupulus.

Non-cannabis genomes included in the full cannabis Orthofinder run: (1) F. vesca; (2) L. japonicus; (3) M. domestica; (4) P. persica; and (5) R. chinensis.

Calculation of sequence entropy for DNA and protein sequences

We calculated sequence entropy for protein and DNA-based orthogroups on 193 genomes. High entropy corresponds to more diversity and variation among sequences in an orthogroup, and low entropy indicates less diversity and more similarity among orthogroup sequences. A minimum entropy value of 0 corresponds to matching identity. The maximum entropy corresponds to a random sequence of amino acids and is derived from the equation: log₂(20) = 4.32, where 20 is the number of amino acids. For DNA, the maximum entropy¹³⁰ is log₂(4) = 2.0. We computed the entropy for each column of the orthofinder multiple sequence alignment using the entropy function from scipy.stats⁷¹ and then calculated the average entropy for the whole multiple sequence alignment. A minimum of five sequences per orthogroup were required for inclusion in the analysis. Pairwise comparisons were made for each orthogroup across populations, and the distribution of entropy values for each multiple sequence alignment was visualized as a joint histogram. This analysis was applied to both proteins (gene sequences) and DNA (TEs).
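The per-column entropy calculation can be sketched in plain Python as follows. This is an illustration of the Shannon entropy (base 2) averaged over alignment columns, not the scipy.stats-based code used here; the function names are hypothetical, and gap characters are treated as ordinary symbols, which is an assumption of this sketch.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (bits) of the symbols in one alignment column."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mean_alignment_entropy(alignment, min_sequences=5):
    """Average column entropy of an alignment given as equal-length strings.

    Returns None when fewer than min_sequences sequences are supplied.
    """
    if len(alignment) < min_sequences:
        return None
    columns = list(zip(*alignment))
    if not columns:
        return None
    return sum(column_entropy(col) for col in columns) / len(columns)
```

Identical sequences give an average entropy of 0, and a DNA column with all four bases at equal frequency gives the maximum of 2.0 bits.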

Visualization and analysis of synteny with genespace

To visually assess gene-level variation in the haplotype-resolved, chromosome-scale genomes with X and Y chromosomes (AH3M, BCM, GRM and KOMP), we used genespace version 0.9.3¹³¹ within R version 4.2.2 (2022-10-31)¹³². We initially ran OrthoFinder⁹⁰ outside of the genespace environment and imported the results. To run the analysis, we used the synteny function, followed by plot_riparianHits. We built a pangenome representation with the pangenome function. We used the output file gffWithOgs.txt as the primary file used for obtaining syntenic gene pairs across all genomes in the subset. Gene IDs with an identical integer value in the ‘og’ column (last column) were retrieved as syntenic orthologues.

SV analysis

The 78 fully scaffolded assembly haplotypes were each aligned to the EH23a assembly using minimap2⁷⁵. Syri was then used to call SVs on each alignment⁹⁸ and plotsr was used to visualize alignments and SVs⁹⁹. CDS and TE content were analysed using bedtools intersect¹²¹. Inversion breakpoint repeats were called using blastn alignments of inversions with a minimum size of 10 kb. Windows of 8 kb centred on the start and end breakpoints of each inversion were aligned self-to-self, as well as to the breakpoint window on the opposing side of the inversion (start to end). Only the top-scoring alignment (excluding the full-length self–self alignment) was counted per breakpoint. Inverted repeats were called as alignments in opposing orientations and segmental duplications were called for alignments in the same orientation.

Phased SNPs

SNPs were also called using Syri⁹⁸ on the same assemblies and alignments as described above. SNPs from each of the two haplotypes per sample were merged into single phased genotype calls per sample, and sites with an N as the ALT call were removed (github.com/RCLynch414/SYRI_vcf.sh). Finally, vcftools was used to quality filter and thin SNP sites to a minimum of 1000 bp spacing: --remove-indels --minGQ 20 --remove-indv EH23a --min-alleles 2 --max-alleles 2 --thin 1000 --stdout --recode.

LD calculations

Phased SNPs from the scaffolded assemblies were first assessed for r² correlations using plink¹³³: --double-id --allow-extra-chr --set-missing-var-ids @:# --maf 0.01 --geno 0.1 --mind 0.5 --chr 7 --thin 0.1 --r2 gz --ld-window 100 --ld-window-kb 1000 --ld-window-r2 0 --make-bed. Then ld_decay.py was used to make decay curves (GitHub - erikrfunk/genomics_tools), which were plotted with ggplot in R. Separately, LD heat maps were made from SNPs thinned with vcftools (--thin 50000 --recode) and plotted with LDheatmap in R (sfustatgen.github.io/LDheatmap/).

GO terms

GO term enrichment tests were performed with the topGO package in R, using all high-confidence gene annotations from EH23a as the null distribution and classic Fisher test of significance⁷³.

Selection scans with F_st and XP-CLR

F_st values were calculated using vcftools for each phased SNP and the scaffolded assembly MJ and hemp population assignments; significance was calculated using the top 5% of these values. The XP-CLR model for selective sweeps was applied to the same SNPs and 20-kb genome windows⁵⁹; significance was calculated using the top 5% of these values.

TreeMix

The TreeMix model was run using only SNPs outside of gene models: -seed 69696969 -o out_stem -m 5 -k 50 -noss -root asian_hemp. Scenarios with one to ten migration events were simulated and ranked based on their ln(likelihoods). Five migration events (-m 5) were selected as the most likely final number.

Local PCA

The local PCA method was applied to the phased SNPs, with 1,000-bp minimum spacing between SNPs, and genome windows of 100 SNPs¹³⁴.

Disease resistance gene analogue analysis

Plant disease resistance gene analogues are defined by the presence of one or more highly conserved amino acid motifs in their encoded proteins. These motifs encode functional protein domains that determine pathogen specificity and subcellular localization. Depending on the particular pathosystem, resistance gene analogue proteins can be entirely cytoplasmic, or can span the cell membrane with cytoplasmic functional domains, extracellular domains, or both.

Drago2¹³⁵ was used to identify motifs conserved among plant disease resistance gene analogues for the 78 chromosome-level, haplotype-resolved genomes. Input files were transcript annotation fasta files for each genome. Sets of genes containing both nucleotide binding site (NBS) and leucine-rich repeat (LRR) domains were used as input to MEME to assess and compare amino acid composition in motifs over gene sets.

To identify genes related to powdery mildew resistance, the sequence of a marker mapped to chr. 2 in CBDRx was used as a blastn query against the EH23a anchor genome¹³⁶. The resulting hit had 96% nucleotide identity on chr. 2 of EH23a at 77,292,037–77,291,397 bp. It was located in a cluster of 46 genes including 32 with kinase domains, six receptor-like kinases, two with nucleotide binding site plus transmembrane domains, one with coiled-coil and kinase domains, and one with coiled-coil, nucleotide binding site, and transmembrane domains. The blast hit itself was between two annotated kinase genes, EH23a.chr2.v1.g115480 and EH23a.chr2.v1.g115510.

The resulting top blast hits did not overlap with any gene annotations; however, 16 of the 38 genomes had blast hits on chr. 2 with >95% nucleotide identity to the CBDRx gene; nine of these had 99–100% nucleotide identity over all three exons (1,745 bp, 1,448 bp and 287 bp, respectively). Sequences from five of the 16 genomes (H3S7a, OFBa, SZFBa, TKFBa and WCFBa) clustered separately from the rest. These were distinguished by a 1-bp insertion in the first exon, ten small indels (2–8 bp) in exonic space, and a 1,280 bp longer second intron. These regions were extracted and aligned with the CBDRx gene sequence, and the alignment was used to produce a maximum-likelihood tree (Extended Data Fig. 8).

Coiled-coil NBS–LRR genes (CNLs) showed a distinct pattern on chr. 3 and chr. 6. There were one to two CNL genes between 400–600 kb; two to four between 1–1.4 Mb; one to two at 6–8 Mb; a single CNL gene near the centromeric region of the chromosome at 35–37 Mb, and one to five (COFBa) CNLs between 78–84 Mb. Exceptions to this pattern were OFBa, H3S1a, and MMv31a, which lacked a CNL in the centromeric region. In SDFBa and SN1v3a, the centromeric CNLs were located at 42.8 and 47.5 Mb, respectively. SN1v3a had a CNL at 12.2 Mb, another exception to the overall pattern. Chr. 3 in this genome was larger than the others, at 90 Mb, compared to the rest at 80–85 Mb. Finally, GERv1a lacked a CNL in the 78–84 Mb region of chr. 3.

Identification of terpene synthase genes

Each of the Cannabis proteomes was aligned to a set of 40,926 protein sequences from UniProt (search criteria ‘Embryophyta’ and ‘reviewed’; accessed on 20 September 2022) with blastp (version blast 2.6.0, build 7 December 2016)¹³⁷. Alignment thresholds included an E-value threshold of less than 10⁻³, at least 20% query coverage, and a per cent identity based on the length of the alignment¹³⁸. Terpene synthases were also identified based on the presence of Pfam domains, PF01397 and/or PF03936¹³⁹. To assess domain content, each of the Cannabis proteomes was aligned to the Pfam-A.hmm database (last modified 15 November 2021; accessed 20 September 2022)¹⁴⁰ with hmmscan (HMMER 3.3.2 November 2020)¹⁴¹ on default settings.

Identification of genes in the precursor pathways for terpene and cannabinoid biosynthesis

Terpene biosynthesis proceeds via two pathways: the chloroplastic methyl-d-erythritol phosphate pathway, which produces precursors for monoterpene and cannabinoid biosynthesis, and the cytosolic mevalonate pathway, which produces precursors for sesquiterpene biosynthesis. The protein sequences for these pathways^142,143,144 were aligned to each of the Cannabis proteomes with diamond version 2.1.4 on default settings¹⁴⁵.

Synthase cassette analysis

To identify full and partial length cannabinoid synthases in each of the 193 cannabis genomes, the reference cannabinoid synthase sequences were aligned to the genome with blastn. An enriched LTR sequence developed from CBDRx¹¹ was used as a reference to further aid in the identification of synthases. LTR08 is an LTR sequence from the CBDRx genome that is associated with the synthase cassettes. A Python script was written to take in cannabinoid synthase blast results and LTR08 blast results in table format. Synthase hits with length <500 bp were filtered out. LTR08 hits with bitscore <1,250 were filtered out. Synthase and LTR08 hits with mismatches <10 and zero gaps were labelled as ‘Full’ sequences. All other hits were labelled as ‘Partial’ sequences. Hits that shared the same starting position were then filtered to a single sequence and given one of the synthase labels according to the following. Full hits were retained and labelled as the corresponding functional synthase. Partial hits within 60 kb of an LTR08 hit upstream or downstream were labelled as CBDAS and retained. If there were no Full hits or hits with an LTR08 in proximity, the hit with the highest bitscore was labelled as the respective synthase and retained. The filtered and labelled synthases were then plotted onto a track to visualize cannabinoid synthase orientation for each region of a genome. A minimum of four synthase hits was required for visualization. Inkscape was used to visualize synthase cassette tracks. Manual edits were used to correct a few incorrect labels between CBDAS and CBCAS. Synthase cassettes are grouped by overall cassette shape.
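The filtering and labelling rules for individual blast hits can be sketched as below. This Python example is illustrative only and is not the script written for this analysis; the dictionary keys and function names are hypothetical, and only the length, bitscore, mismatch, gap and 60-kb proximity rules stated above are covered.

```python
def classify_hit(hit):
    """Return 'Full', 'Partial', or None if the hit is filtered out.

    hit: dict with keys 'kind' ('synthase' or 'LTR08'), 'length', 'bitscore',
    'mismatches' and 'gaps' (hypothetical layout for a parsed blast table row).
    """
    if hit["kind"] == "synthase" and hit["length"] < 500:
        return None  # synthase hits shorter than 500 bp are discarded
    if hit["kind"] == "LTR08" and hit["bitscore"] < 1250:
        return None  # LTR08 hits with bitscore below 1,250 are discarded
    if hit["mismatches"] < 10 and hit["gaps"] == 0:
        return "Full"
    return "Partial"


def near_ltr08(synthase_position, ltr08_positions, max_distance=60_000):
    """True if any retained LTR08 hit lies within max_distance bp, up or downstream."""
    return any(abs(synthase_position - pos) <= max_distance
               for pos in ltr08_positions)
```

Under the rules in the text, a Partial synthase hit for which near_ltr08 returns True would be labelled as CBDAS and retained.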

Cannabinoid synthase gene analysis

First, ORFfinder was used to remove pseudogenes from the initial list of potential genes described above (ftp.ncbi.nlm.nih.gov/genomes/TOOLS/ORFfinder/linux-i64/). Then we used usearch (v.11.0.667) to cluster synthase coding sequences: -cluster_fast -id 0.997 -sort length -strand both -centroids -clusters¹⁴⁶. TranslatorX was then used to produce protein-guided multiple sequence alignments¹⁴⁷. Synthase evolutionary history was inferred by using the maximum-likelihood method and General Time Reversible model in MEGA11¹⁴⁸.

k-mer crossover analysis

We used the anchoring function of PanKmer to locate crossover events in known trios of cannabis genotypes (Supplementary Table 15). Eleven trios included FB191 as a varin-donor parent and 6 trios included SSV as a varin-donor parent. The parents of FB191 are HO40 and FB30, while the parents of SSV are HO40 and SSLR; in both cases, HO40 was the varin donor. For each trio, the F₁ genome was haplotype-resolved and included one haplotype from a varin-donor parent and one from a non-varin donor parent. In each case, we used PanKmer anchoring to identify the ‘varin haplotype’. For FB191 trios, we generated a 31-mer index of the FB191 genome using ‘pankmer index’ with default parameters. Using a Python script importing PanKmer’s API functions pankmer.anchor_region() and pankmer.anchor_genome()²¹, we anchored the FB191 index in each haplotype of the cross, for example COFBa and COFBb. We identified the varin haplotype as the haplotype with higher 31-mer conservation in the FB191 index. We applied the same procedure to SSV trios using a PanKmer index of SSV. We then sought to trace potential varin alleles from HO40 to the varin haplotype of the cross. To represent HO40, we generated two single-genome 31-mer indexes: one for the HO40 genome and a second for the highly similar EH23a sequence. We also generated single-genome 31-mer indexes of FB30 and SSLR. For each FB191 cross, we anchored the HO40 index, EH23a index and FB30 index in the varin haplotype. We inferred crossover events at loci with a clear ‘haplotype switch’ indicated by k-mer conservation values. We repeated the same procedure for SSV trios, applying the SSLR index in place of the FB30 index. All scripts for this analysis are available on GitLab.

Varin SNP association tests and genetics

First, the BestNormalize package in R was used to select the ordered quantile (ORQ) method to transform the varin ratio data, which were initially deemed multi-modal. Then the model BLINK from the GAPIT package in R¹⁴⁹ was used with PCA.total=6 to test associations between SNPs in the F₂ population and transformed varin ratio data (Supplementary Table 16). This PCA.total parameter was selected based on visual evaluation of QQ plots for PCA.total values 1–10, where 6 was the smallest number that did not show systemic inflation of P values¹⁴⁹. Next, gene and TE models were manually assessed in the regions surrounding the four FDR-corrected significant SNPs (Supplementary Table 16), in conjunction with the k-mer based crossover results. Of the four significant SNPs, we focused further analyses on the genes associated with the top two highest phenotypic variance explained (Supplementary Fig. 25). Then, Orthofinder groups for BKR, ALT3 and ALT4 were extracted, and the three ALT3 and ALT4 orthogroups were pooled into a single set of ALT gene counts. Phylogenies of BKR and ALT protein sequences were constructed in MEGA with the neighbour-joining method from the orthogroups using 100 bootstrap replicates¹⁴⁸. The BKR alignment and translation displayed was made using the Geneious¹⁵⁰ alignment algorithm on default settings (Fig. 5).

Sex chromosome SDR–PAR boundary identification and comparisons

Y-based k-mers (Y-mers) were mapped to X/Y haplotypes using BWA (v.0.7.17) mem, requiring perfect alignments and allowing multimapping up to 10 times. To determine putative SDR–PAR boundaries, we focused on extracting conserved orthologues in regions with decreased Y-mer mapping density for subsequent gene tree analysis. Orthologues were defined using OrthoFinder (v.2.5.4) with the multiple sequence alignment option. OrthoFinder was executed using proteins from all available male (XY) assemblies from this study, including all male and several female contig-level assemblies, and additional haplotype-resolved assemblies from other studies: (1) BOAXa; (2) BOAXb; (3) AH3Ma; (4) AH3Mb; (5) BCMa; (6) BCMb; (7) GRMa; (8) GRMb; (9) Carmagnola_HAP1²⁹; (10) Carmagnola_HAP2²⁹; (11) Futura75_HAP1²⁹; (12) Futura75_HAP2²⁹; (13) OttoII_HAP1²⁹; (14) OttoII_HAP2²⁹; (15) Uso31_HAP1²⁹; (16) Uso31_HAP2²⁹; (17) FIMv1a; (18) FIMv1b; (19) GVA-H-22-1061-002_hap1³⁴; (20) GVA-H-22-1061-002_hap2³⁴; (21) GVA-H-21-1003-002_hap1³⁴; (22) GVA-H-21-1003-002_hap2³⁴; (23) SAN2a; (24) SAN2b; (25) TIBv1a; (26) TIBv1b; (27) WFv1a; (28) WFv1b; (29) WIv1a; (30) WIv1b; (31) YMMv1a; and (32) YMMv1b.
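
As an illustration of how regions with decreased Y-mer mapping density might be flagged, the sketch below bins mapped Y-mer start positions into fixed windows and reports windows whose count drops sharply relative to the preceding window. The window size, the `drop` threshold and the toy coordinates are assumptions made for this example and are not values from the study.

```python
from typing import List, Sequence, Tuple


def window_density(positions: Sequence[int], chrom_len: int, window: int) -> List[int]:
    """Count mapped k-mer start positions (0-based, < chrom_len) per fixed-size window."""
    n_windows = (chrom_len + window - 1) // window
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window] += 1
    return counts


def candidate_boundaries(counts: Sequence[int], window: int,
                         drop: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) of windows whose count is below drop * the previous count."""
    hits = []
    for w in range(1, len(counts)):
        if counts[w - 1] > 0 and counts[w] < drop * counts[w - 1]:
            hits.append((w * window, (w + 1) * window))
    return hits


# Toy example: dense Y-mer hits over the first 300 bp, sparse hits afterwards.
ymer_hits = list(range(0, 300, 10)) + [350, 450]
counts = window_density(ymer_hits, chrom_len=500, window=100)
boundaries = candidate_boundaries(counts, window=100)
```

Windows flagged this way would only be candidates; in the workflow described here, the boundary itself is resolved by the gene tree analysis of orthologues in those regions.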

Gene trees were estimated for ten conserved orthologues spanning putative SDR–PAR boundaries, to determine which orthologues were SDR- or PAR-linked in each assembly. For example, strong support for separate clades containing either X- or Y-linked orthologues is expected when the Y gametologue (1:1 orthologues on X and Y chromosomes) is tightly linked to the SDR¹⁵¹.
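
The expectation stated above can be expressed as a simple topological check: a gene is consistent with SDR linkage if a single branch of its tree separates all Y-linked copies from all X-linked copies. The sketch below assumes trees written as nested Python tuples and hypothetical tip names; it ignores branch support, which a real analysis would also need to consider.

```python
from typing import FrozenSet, List, Set, Tuple, Union

# A tree is either a tip name or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def clade_leaf_sets(tree: Tree, out: List[FrozenSet[str]]) -> FrozenSet[str]:
    """Append the tip set below every node (tips included) to `out`; return this node's set."""
    if isinstance(tree, str):
        leaves: FrozenSet[str] = frozenset([tree])
    else:
        leaves = frozenset()
        for child in tree:
            leaves |= clade_leaf_sets(child, out)
    out.append(leaves)
    return leaves


def splits_x_from_y(tree: Tree, y_tips: Set[str]) -> bool:
    """True if one branch separates all Y-linked tips from every other tip."""
    clades: List[FrozenSet[str]] = []
    all_tips = clade_leaf_sets(tree, clades)
    y = frozenset(y_tips)
    if not y or y == all_tips:
        return False
    # Treat the tree as unrooted: either side of the split may appear as a clade.
    return y in clades or (all_tips - y) in clades


# Hypothetical gametologue trees for three assemblies (asm1..asm3).
y_copies = {"asm1_Y", "asm2_Y", "asm3_Y"}
sdr_like = ((("asm1_X", "asm2_X"), "asm3_X"), (("asm1_Y", "asm2_Y"), "asm3_Y"))
par_like = (("asm1_X", "asm1_Y"), ("asm2_X", "asm2_Y"), ("asm3_X", "asm3_Y"))

sdr_result = splits_x_from_y(sdr_like, y_copies)
par_result = splits_x_from_y(par_like, y_copies)
```

In the first toy tree the X and Y copies form separate clades, as expected for an SDR-linked gene; in the second, X and Y copies pair up by assembly, the pattern expected when the gene still recombines.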

For all ten conserved orthologues or gametologues, we: (1) used blastn (BLAST+ v.2.14.1) and bedtools (v.2.31.0) getfasta, to find and extract nucleotide sequences for full-length genes (including introns); (2) aligned each gene matrix with MAFFT (v.7.505), using the options ‘--localpair --maxiterate 1000’; and (3) inferred maximum-likelihood trees with IQ-TREE (v.1.6.12) with the options ‘-m MFP -bb 1000’. Following our analysis of X–Y gametologue trees, we used gene coordinates corresponding to the first putative Y-specific, SDR-linked gene to define each SDR boundary, then padded starting coordinates by 10 bp. The start of X-specific regions (that is, the region on the X that does not recombine with the Y and is collinear to the Y-SDR) was defined based on X-gametologue coordinates corresponding to the first Y-specific gene.

The SDR–PAR boundary was defined using gene trees of XY gametologues from SDR bordering regions, which we identified by mapping male-specific k-mers to each haplotype. Our gene tree analysis revealed two major Y haplotype groups with distinct SDR boundaries (Ya and Yb). The ‘cloud boundary’ represents variation in the SDR–PAR boundary within cannabis, based on XY gametologue relationships. Ya was more common in our dataset (n = 6) and exhibits an ~132-kb extended SDR that spans the cloud boundary, whereas this region remains PAR-linked in the less frequent Yb haplotype (n = 2). The Ya haplotype reported in the main text was found in BCMb (feral), GRMa (HC hemp), AH3Mb (MJ) and Carmagnola, a fibre hemp landrace from Northern Italy. The Yb haplotype was found in Kompolti, a Hungarian fibre cultivar selected for superior fibre characteristics in the 1950s from an older Italian variety, and in GVA-H-21-1003-002 (isolated feral population from NY, USA).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The NCBI BioProject ID for the cannabis pangenome is PRJNA1140642. All of the pangenome sequencing data at the NCBI Sequence Read Archive (SRA) are under the BioProject accession PRJNA904266. The BioProject accession IDs for EH23a and EH23b are PRJNA1111955 and PRJNA1111956, respectively. Genomes and annotation files for all 193 assemblies (including links to corresponding US National Plant Germplasm System accessions), orthobrowser and Genome Jbrowse instances, and input and output files for graph pangenomes are available at https://resources.michael.salk.edu. Annotations for R-genes, terpene synthases, cannabinoid synthases and additional genome visualizations are available from https://figshare.com/projects/Cannabis_Pangenome/205555 (ref. ¹⁵²) and https://doi.org/10.25452/figshare.plus.c.7248427.v1 (ref. ¹⁵³). Links to specific genome datasets are provided in Supplementary Table 1 (https://doi.org/10.6084/m9.figshare.25869319.v1 (ref. ¹⁵⁴)). Source data are provided with this paper.

Code availability

Scripts and analysis pipelines are available at https://github.com/anthony-aylward/CannabisPangenomeShared (ref. ¹⁵⁵) and https://github.com/padgittl/CannabisPangenomeAnalyses (ref. ¹⁵⁶).

References

Long, T., Wagner, M., Demske, D., Leipe, C. & Tarasov, P. E. Cannabis in Eurasia: origin of human use and Bronze Age trans-continental connections. Veg. Hist. Archaeobot. 26, 245–258 (2017).
Ren, G. et al. Large-scale whole-genome resequencing unravels the domestication history of Cannabis sativa. Sci. Adv. 7, eabg2286 (2021).
Bai, Y. et al. Archaeobotanical evidence of the use of medicinal cannabis in a secular context unearthed from south China. J. Ethnopharmacol. 275, 114114 (2021).
Kovalchuk, I. et al. The genomics of cannabis and its close relatives. Annu. Rev. Plant Biol. 71, 713–739 (2020).
Clarke, R. & Merlin, M. Cannabis: Evolution and Ethnobotany (Univ. of California Press, 2016).
Stoa, R. Craft Weed: Family Farming and the Future of the Marijuana Industry (MIT Press, 2018).
Patton, D. V. A history of United States cannabis law. J. Law Health 34, 1–29 (2020).
Bewley-Taylor, D. & Jelsma, M. Regime change: re-visiting the 1961 Single Convention on Narcotic Drugs. Int. J. Drug Policy 23, 72–81 (2012).
Hanuš, L. O., Meyer, S. M., Muñoz, E., Taglialatela-Scafati, O. & Appendino, G. Phytocannabinoids: a unified critical inventory. Nat. Prod. Rep. 33, 1357–1392 (2016).
Devinsky, O. et al. Trial of cannabidiol for drug-resistant seizures in the Dravet syndrome. N. Engl. J. Med. 376, 2011–2020 (2017).
Grassa, C. J. et al. A new Cannabis genome assembly associates elevated cannabidiol (CBD) with hemp introgressed into marijuana. New Phytol. 230, 1665–1679 (2021).
McKernan, K. J. et al. Sequence and annotation of 42 cannabis genomes reveals extensive copy number variation in cannabinoid synthesis and pathogen resistance genes. Preprint at bioRxiv https://doi.org/10.1101/2020.01.03.894428 (2020).
Gao, S. et al. A high-quality reference genome of wild Cannabis sativa. Hortic. Res. 7, 73 (2020).
Braich, S., Baillie, R. C., Spangenberg, G. C. & Cogan, N. O. I. A new and improved genome sequence of Cannabis sativa. GigaByte https://doi.org/10.46471/gigabyte.10 (2020).
van Bakel, H. et al. The draft genome and transcriptome of Cannabis sativa. Genome Biol. 12, R102 (2011).
Laverty, K. U. et al. A physical and genetic map of Cannabis sativa identifies extensive rearrangements at the THC/CBD acid synthase loci. Genome Res. 29, 146–156 (2019).
Barcaccia, G. et al. Potentials and challenges of genomics for breeding cannabis cultivars. Front. Plant Sci. 11, 573299 (2020).
McPartland, J. M. & Small, E. A classification of endangered high-THC cannabis (Cannabis sativa subsp. indica) domesticates and their wild relatives. PhytoKeys 144, 81–112 (2020).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. 42, 663–673 (2024).
Aylward, A. J., Petrus, S., Mamerto, A., Hartwick, N. T. & Michael, T. P. PanKmer: k-mer-based and reference-free pangenome analysis. Bioinformatics 39, btad621 (2023).
Garrison, E. et al. Building pangenome graphs. Nat. Methods 21, 2008–2012 (2024).
McPartland, J. M. & Guy, G. W. Models of cannabis taxonomy, cultural bias, and conflicts between scientific and vernacular names. Bot. Rev. 83, 327–381 (2017).
Qiao, Q. et al. Evolutionary history and pan-genome dynamics of strawberry (Fragaria spp.). Proc. Natl Acad. Sci. USA 118, e2105431118 (2021).
Li, C., Lin, H., Debernardi, J. M., Zhang, C. & Dubcovsky, J. GIGANTEA accelerates wheat heading time through gene interactions converging on FLOWERING LOCUS T1. Plant J. 118, 519–533 (2024).
Steed, G., Ramirez, D. C., Hannah, M. A. & Webb, A. A. R. Chronoculture, harnessing the circadian clock to improve crop yield and sustainability. Science 372, eabc9141 (2021).
de Meijer, E. Fibre hemp cultivars: a survey of origin, ancestry, availability and brief agronomic characteristics. J. Int. Hemp Assoc. 2, 66–73 (1995).
Westergaard, M. in Advances in Genetics, Vol. 9 (ed. Demerec, M.) 217–281 (Academic Press, 1958).
Carey, S. B. et al. The evolution of heteromorphic sex chromosomes in plants. Preprint at bioRxiv https://doi.org/10.1101/2024.12.09.627636 (2024).
McPartland, J. M. Cannabis systematics at the levels of family, genus, and species. Cannabis Cannabinoid Res. 3, 203–212 (2018).
Prentout, D. et al. Plant genera Cannabis and Humulus share the same pair of well-differentiated sex chromosomes. New Phytol. 231, 1599–1611 (2021).
Petit, J., Salentijn, E. M. J., Paulo, M.-J., Denneboom, C. & Trindade, L. M. Genetic architecture of flowering time and sex determination in hemp (Cannabis sativa L.): a genome-wide association study. Front. Plant Sci. 11, 569958 (2020).
Charlesworth, D., Charlesworth, B. & Marais, G. Steps in the evolution of heteromorphic sex chromosomes. Heredity 95, 118–128 (2005).
Stack, G. M. et al. Comparison of recombination rate, reference bias, and unique pangenomic haplotypes in Cannabis sativa using seven de novo genome assemblies. Int. J. Mol. Sci. 26, 1165 (2025).
Lu, C. et al. Phosphorylation of SPT5 by CDKD;2 is required for VIP5 recruitment and normal flowering in Arabidopsis thaliana. Plant Cell 29, 277–291 (2017).
Lappin, F. M. et al. A polymorphic pseudoautosomal boundary in the Carica papaya sex chromosomes. Mol. Genet. Genomics 290, 1511–1522 (2015).
Grabowska-Joachimiak, A., Śliwińska, E., Piguła, M., Skomra, U. & Joachimiak, A. J. Genome size in Humulus lupulus L. and H. japonicus Siebold and Zucc. (Cannabaceae). Acta Soc. Bot. Pol. 75, 207–214 (2006).
Ma, J., Devos, K. M. & Bennetzen, J. L. Analyses of LTR-retrotransposon structures reveal recent and rapid genomic DNA loss in rice. Genome Res. 14, 860–869 (2004).
Choi, J., Lyons, D. B., Kim, M. Y., Moore, J. D. & Zilberman, D. DNA methylation and histone H1 jointly repress transposable elements and aberrant intragenic transcripts. Mol. Cell 77, 310–323.e7 (2020).
Harringmeyer, O. S. & Hoekstra, H. E. Chromosomal inversion polymorphisms shape the genomic landscape of deer mice. Nat. Ecol. Evol. 6, 1965–1979 (2022).
Hirabayashi, K. & Owens, G. L. The rate of chromosomal inversion fixation in plant genomes is highly variable. Evolution 77, 1117–1130 (2023).
Gabur, I., Chawla, H. S., Snowdon, R. J. & Parkin, I. A. P. Connecting genome structural variation with complex traits in crop plants. Züchter Genet. Breed. Res. 132, 733–750 (2019).
Jay, P. et al. Supergene evolution triggered by the introgression of a chromosomal inversion. Curr. Biol. 28, 1839–1845.e3 (2018).
Toth, J. A., Stack, G. M., Carlson, C. H. & Smart, L. B. Identification and mapping of major-effect flowering time loci Autoflower1 and Early1 in Cannabis sativa L. Front. Plant Sci. 13, 991680 (2022).
Murphy, R. L. et al. Coincident light and clock regulation of pseudoresponse regulator protein 37 (PRR37) controls photoperiodic flowering in sorghum. Proc. Natl Acad. Sci. USA 108, 16469–16474 (2011).
Li, M.-W., Liu, W., Lam, H.-M. & Gendron, J. M. Characterization of two growth period QTLs reveals modification of PRR3 genes during soybean domestication. Plant Cell Physiol. 60, 407–420 (2019).
Whiting, J. R. et al. The genetic architecture of repeated local adaptation to climate in distantly related plants. Nat. Ecol. Evol. 8, 1933–1947 (2024).
Todesco, M. et al. Massive haplotypes underlie ecotypic differentiation in sunflowers. Nature 584, 602–607 (2020).
Andre, C. M. et al. Unique bibenzyl cannabinoids in the liverwort Radula marginata: parallels with Cannabis chemistry. New Phytol. https://doi.org/10.1111/nph.20349 (2024).
van Velzen, R. & Schranz, M. E. Origin and evolution of the cannabinoid oxidocyclase gene family. Genome Biol. Evol. 13, evab130 (2021).
Smith, C. J., Vergara, D., Keegan, B. & Jikomes, N. The phytochemical diversity of commercial Cannabis in the United States. PLoS ONE 17, e0267498 (2022).
de Meijer, E. P. M. & Hammond, K. M. The inheritance of chemical phenotype in Cannabis sativa L. (V): regulation of the propyl-/pentyl cannabinoid ratio, completion of a genetic model. Euphytica 210, 291–307 (2016).
Vigli, D. et al. Chronic treatment with the phytocannabinoid Cannabidivarin (CBDV) rescues behavioural alterations and brain atrophy in a mouse model of Rett syndrome. Neuropharmacology 140, 121–129 (2018).
Welling, M. T. et al. An extreme-phenotype genome‐wide association study identifies candidate cannabinoid pathway genes in Cannabis. Sci. Rep. 10, 18643 (2020).
Pulsifer, I. P. et al. Acyl-lipid thioesterase1-4 from Arabidopsis thaliana form a novel family of fatty acyl-acyl carrier protein thioesterases with divergent expression patterns and substrate specificities. Plant Mol. Biol. 84, 549–563 (2014).
Kalinger, R. S., Pulsifer, I. P., Hepworth, S. R. & Rowland, O. Fatty acyl synthetases and thioesterases in plant lipid metabolism: diverse functions and biotechnological applications. Lipids 55, 435–455 (2020).
Turner, C. E. et al. Constituents of Cannabis sativa L. IV. Stability of cannabinoids in stored plant material. J. Pharm. Sci. 62, 1601–1605 (1973).
Welling, M. T., Liu, L., Shapter, T., Raymond, C. A. & King, G. J. Characterisation of cannabinoid composition in a diverse Cannabis sativa L. germplasm collection. Euphytica 208, 463–475 (2016).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).
Krueger, F. et al. FelixKrueger/TrimGalore: v0.6.10. Zenodo https://doi.org/10.5281/zenodo.7598955 (2023).
Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. Preprint at https://doi.org/10.48550/arXiv.1207.3907 (2012).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Garfinkel, A. R., Otten, M. & Crawford, S. SNP in potentially defunct tetrahydrocannabinolic acid synthase is a marker for cannabigerolic acid dominance in Cannabis sativa L. Genes 12, 228 (2021).
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49 (2012).
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
Alexa, A. & Rahnenführer, J. topGO: enrichment analysis for Gene Ontology. https://doi.org/10.18129/B9.bioc.topGO, R package version 2.59.0 (2024).
Denyer, T. et al. Streamlined spatial and environmental expression signatures characterize the minimalist duckweed Wolffia australiana. Genome Res. 34, 1106–1120 (2024).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Schalamun, M. High molecular weight gDNA extraction after Mayjonade et al. optimised for eucalyptus for nanopore sequencing. Protocols.io https://doi.org/10.17504/protocols.io.i6vche6 (2017).
Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
Titus Brown, C. & Irber, L. sourmash: a library for MinHash sketching of DNA. J. Open Source Softw. 1, 27 (2016).
Alonge, M. et al. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biol. 23, 258 (2022).
Kurtzer, G. M. et al. Hpcng/singularity: Singularity 3.7.1. Zenodo https://doi.org/10.5281/ZENODO.4435194 (2021).
Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).
Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).
Heumos, S. et al. Cluster-efficient pangenome graph construction with nf-core/pangenome. Bioinformatics 40, btae609 (2024).
Heumos, S. et al. Nf-core/pangenome: Pangenome 1.1.2 - canguro. Zenodo https://doi.org/10.5281/ZENODO.10869589 (2024).
Sirén, J. et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374, abg8871 (2021).
Hickey, G. et al. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol. 21, 35 (2020).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl Acad. Sci. USA 117, 9451–9457 (2020).
Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).
Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0. 2013−2015; https://www.repeatmasker.org/ (2015).
Gabriel, L., Hoff, K. J., Brůna, T., Borodovsky, M. & Stanke, M. TSEBRA: transcript selector for BRAKER. BMC Bioinformatics 22, 566 (2021).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).
Waterhouse, R. M. et al. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol. 35, 543–548 (2018).
Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 20, 275 (2019).
Goel, M., Sun, H., Jiao, W.-B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 20, 277 (2019).
Goel, M. & Schneeberger, K. plotsr: visualizing structural similarities and rearrangements between multiple genomes. Bioinformatics 38, 2922–2926 (2022).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
VanBuren, R. et al. Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum. Nature 527, 508–511 (2015).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Colt, K. et al. Telomere length in plants estimated with long read sequencing. Preprint at bioRxiv https://doi.org/10.1101/2024.03.27.586973 (2024).
Garcia-Cisneros, A. et al. Long telomeres are associated with clonality in wild populations of the fissiparous starfish Coscinasterias tenuispina. Heredity 115, 480 (2015).
Melters, D. P. et al. Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biol. 14, R10 (2013).
Divashuk, M. G., Alexandrov, O. S., Razumova, O. V., Kirov, I. V. & Karlov, G. I. Molecular cytogenetic characterization of the dioecious Cannabis sativa with an XY chromosome sex determination system. PLoS ONE 9, e85118 (2014).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Knaus, B. J. & Grünwald, N. J. vcfr: a package to manipulate and visualize variant call format data in R. Mol. Ecol. Resour. 17, 44–53 (2017).
Wright, S. The genetical structure of populations. Ann. Eugen. 15, 323–354 (1951).
Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11, e0163962 (2016).
Kaur, H., Shannon, L. M. & Samac, D. A. A stepwise guide for pangenome development in crop plants: an alfalfa (Medicago sativa) case study. BMC Genomics 25, 1022 (2024).
Koch, M. A., Haubold, B. & Mitchell-Olds, T. Comparative evolutionary analysis of chalcone synthase and alcohol dehydrogenase loci in Arabidopsis, Arabis, and related genera (Brassicaceae). Mol. Biol. Evol. 17, 1483–1498 (2000).
Lynch, M. & Conery, J. S. The evolutionary fate and consequences of duplicate genes. Science 290, 1151–1155 (2000).
Ou, S. & Jiang, N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 176, 1410–1422 (2018).
Ou, S., Chen, J. & Jiang, N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 46, e126 (2018).
Pereira, V. Insertion bias and purifying selection of retrotransposons in the Arabidopsis thaliana genome. Genome Biol. 5, R79 (2004).
VanBuren, R. et al. Extreme haplotype variation in the desiccation-tolerant clubmoss Selaginella lepidophylla. Nat. Commun. 9, 13 (2018).
Karakülah, G. & Suner, A. PlanTEnrichment: a tool for enrichment analysis of transposable elements in plants. Genomics 109, 336–340 (2017).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. 57, 289–300 (1995).
Seabold, S. & Perktold, J. Statsmodels: econometric and statistical modeling with Python. In Proc. 9th Python in Science Conference https://doi.org/10.25080/Majora-92bf1922-011 (SciPy, 2010).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Neph, S. et al. BEDOPS: high-performance genomic feature operations. Bioinformatics 28, 1919–1920 (2012).
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
Rambaut, A. FigTree, version 1.4; http://tree.bio.ed.ac.uk/software/figtree/ (2012).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
Gardiner-Garden, M. & Frommer, M. CpG islands in vertebrate genomes. J. Mol. Biol. 196, 261–282 (1987).
Zhou, W., Liang, G., Molloy, P. L. & Jones, P. A. DNA methylation enables transposable element-driven genome expansion. Proc. Natl Acad. Sci. USA 117, 19359–19366 (2020).
Hartwick, N. T. & Michael, T. P. OrthoBrowser: gene family analysis and visualization. Bioinformatics Adv. 5, vbaf009 (2025).
Adami, C. Information theory in molecular biology. Phys. Life Rev. 1, 3–22 (2004).
Lovell, J. T. et al. GENESPACE tracks regions of interest and gene copy number variation across multiple genomes. eLife 11, e78526 (2022).
R Core Team. R: A Language and Environment for Statistical Computing. http://www.R-project.org/ (R Foundation for Statistical Computing, 2013).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Li, H. & Ralph, P. Local PCA shows how the effect of population structure differs along the genome. Genetics 211, 289–304 (2019).
Calle García, J. et al. PRGdb 4.0: an updated database dedicated to genes involved in plant disease resistance process. Nucleic Acids Res. 50, D1483–D1490 (2022).
Mihalyov, P. D. & Garfinkel, A. R. Discovery and genetic mapping of PM1, a powdery mildew resistance gene in Cannabis sativa L. Front. Agron. https://doi.org/10.3389/fagro.2021.720215 (2021).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Rost, B. Twilight zone of protein sequence alignments. Protein Eng. 12, 85–94 (1999).
Zhou, H.-C., Shamala, L. F., Yi, X.-K., Yan, Z. & Wei, S. Analysis of terpene synthase family genes in Camellia sinensis with an emphasis on abiotic stress conditions. Sci. Rep. 10, 933 (2020).
Punta, M. et al. The Pfam protein families database. Nucleic Acids Res. 40, D290–D301 (2012).
Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
Zager, J. J., Lange, I., Srividya, N., Smith, A. & Lange, B. M. Gene networks underlying cannabinoid and terpenoid accumulation in cannabis. Plant Physiol. 180, 1877–1897 (2019).
Jin, H., Song, Z. & Nikolau, B. J. Reverse genetic characterization of two paralogous acetoacetyl CoA thiolase genes in Arabidopsis reveals their importance in plant growth and development. Plant J. 70, 1015–1032 (2012).
Booth, J. Terpene and isoprenoid biosynthesis in Cannabis sativa. PhD thesis, Univ. of British Columbia (2020).
Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).
Edgar, R. Usearch. OSTI.gov https://www.osti.gov/biblio/1137186 (2010).
Abascal, F., Zardoya, R. & Telford, M. J. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. 38, W7–W13 (2010).
Tamura, K., Stecher, G. & Kumar, S. MEGA11: Molecular evolutionary genetics analysis version 11. Mol. Biol. Evol. 38, 3022–3027 (2021).
Wang, J. & Zhang, Z. GAPIT Version 3: boosting power and accuracy for genomic association and prediction. Genomics Proteomics Bioinformatics 19, 629–640 (2021).
Kearse, M. et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28, 1647–1649 (2012).
Prentout, D. et al. An efficient RNA-seq-based segregation analysis identifies the sex chromosomes of Cannabis sativa. Genome Res. 30, 164–172 (2020).
Lynch, R. Cannabis_Pangenome. Figshare https://figshare.com/projects/Cannabis_Pangenome/205555 (2024).
Lynch, R. Cannabis pangenome. Figshare https://doi.org/10.25452/figshare.plus.c.7248427.v1 (2024).
Lynch, R. et al. Pangenome metadata and statistics. Figshare https://doi.org/10.6084/m9.figshare.25869319.v2 (2025).
CannabisPangenomeShared. GitHub https://github.com/anthony-aylward/CannabisPangenomeShared (2024).
CannabisPangenomeAnalyses. GitHub https://github.com/padgittl/CannabisPangenomeAnalyses (2024).
Woods, P., Price, N., Matthews, P. & McKay, J. K. Genome-wide polymorphism and genic selection in feral and domesticated lineages of Cannabis sativa. G3 13, jkac209 (2022).


Acknowledgements

The authors thank members of the Michael laboratory for discussion of this work; and T. Gordon and Z. Stansell for sending leaf material from lines in the GRIN collection. This work was funded in part by the Tang genomics fund (T.P.M.) and a National Science Foundation Plant Genome Postdoctoral Research Fellowship to L.K.P.-C. (NSF-IOS PRFB 2209290); the development of pangenome tools in the Michael laboratory was supported by the Bill and Melinda Gates Foundation (INV-040541) (T.P.M.). Support for this work was also provided by the US Department of Agriculture National Institute of Food and Agriculture Postdoctoral Fellowship (USDA NIFA) no. 2022-67012-38987 (S.B.C.), USDA NIFA no. 2023-67013-39620 (A.H.) and National Science Foundation (NSF) IOS-PGRP CAREER no. 2239530 (A.H.).

Author information

These authors contributed equally: Ryan C. Lynch, Lillian K. Padgitt-Cobb

Authors and Affiliations

The Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA
Ryan C. Lynch, Lillian K. Padgitt-Cobb, Nolan T. Hartwick, Nicholas Allsing, Anthony Aylward, Allen Mamerto, Justine K. Kitony, Kelly Colt, Emily R. Murray, Tiffany Duong, Heidi I. Chen & Todd P. Michael
Oregon CBD, Independence, OR, USA
Andrea R. Garfinkel, Aaron Trippe & Seth Crawford
Department of Horticulture, Oregon State University, Corvallis, OR, USA
Brian J. Knaus & Kelly Vining
HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
Philip C. Bentz, Sarah B. Carey & Alex Harkess
Department of Cell and Developmental Biology, School of Biological Sciences, University of California San Diego, La Jolla, CA, USA
Todd P. Michael
Science and Conservation, San Diego Botanical Garden, Encinitas, CA, USA
Todd P. Michael
Center for Marine Biotechnology and Biomedicine, University of California San Diego, La Jolla, CA, USA
Todd P. Michael

Contributions

T.P.M., R.C.L., S.C., A.R.G., K.V. and L.K.P.-C. conceived and organized research efforts. R.C.L., L.K.P.-C., T.P.M., B.J.K., N.T.H., N.A., A.A., A.M., J.K.K., H.I.C., A.R.G., A.T., P.C.B., S.B.C. and A.H. analysed pangenome data. R.C.L., L.K.P.-C., A.R.G., T.P.M., K.C., E.R.M., T.D. and S.C. conducted greenhouse, field and laboratory experiments. R.C.L., L.K.P.-C., T.P.M., B.J.K. and K.V. wrote and edited the manuscript. R.C.L., L.K.P.-C. and T.P.M. revised the manuscript. All authors read and approved the manuscript.

Corresponding authors

Correspondence to Ryan C. Lynch, Lillian K. Padgitt-Cobb or Todd P. Michael.

Ethics declarations

Competing interests

S.C. was a co-founder of Oregon CBD. A.R.G. and A.T. were employees of Oregon CBD. R.C.L. is a stakeholder in Saint Vrain Research LLC, which manufactures hemp-based products. T.P.M. is a founder of the carbon sequestration company CQuesta. A.H. is a co-founder of the genotyping company Veil Genomics. The other authors declare no competing interests.

Peer review

Peer review information

Nature thanks Shelby Ellison, Manuel Spannagl and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 PanKmer Jaccard similarity matrix of 193 Cannabis genomes.

PanKmer (PK) was used to estimate the relationships among the genomes in the cannabis pangenome. A large portion of the pangenome included elite cultivars, breeding trios and foundational Marijuana (MJ) lines originating from breeding programs spanning the 1970s to present (Supplementary Fig. 1; Supplementary Table 1). These samples represented chemotypes showing high expression of pentyl or propyl (varin) homologs of CBDA or THCA, and cannabinoid-free (type V) plants. Flowering time variation was also captured with the inclusion of both short-day (SD) and day-neutral (DN) phenotypes. The remaining cultivars came from the United States Department of Agriculture (USDA) Germplasm Resources Information Network (GRIN) and German federal genebank (IPK Gatersleben) repositories to ensure researchers will have access to plants for experimentation. These samples included European and Asian fiber and seed hemp, feral populations, North American marijuana (type I), high-cannabinoid-yielding (HC; CBDA or CBGA) hemp (type III and IV), male plants (XY; Fig. 1b) and monoecious plants (XX; Supplementary Table 1). Together, this comprehensive dataset provides a foundation for exploring cannabis genomic diversity, hybridization, and trait evolution. See Figshare for the full-resolution version.
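
The similarity measure named in this caption can be illustrated with a minimal Python sketch. It computes a plain Jaccard index, |A ∩ B| / |A ∪ B|, over the k-mer sets of toy sequences. This is not PanKmer's implementation: the function names, the tiny k value and the example sequences are assumptions made for illustration, and reverse-complement (canonical) k-mers are deliberately ignored.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def jaccard_matrix(genomes, k):
    """Pairwise Jaccard similarity for a dict of {name: sequence}.

    Hypothetical helper: returns a dict keyed by (name, name) pairs.
    """
    sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    matrix = {(name, name): 1.0 for name in sets}
    for x, y in combinations(sets, 2):
        # The measure is symmetric, so fill both orientations at once.
        matrix[(x, y)] = matrix[(y, x)] = jaccard(sets[x], sets[y])
    return matrix


# Toy input: two 4-bp "genomes" sharing two of their 2-mers.
toy = {"g1": "ACGT", "g2": "ACGA"}
similarity = jaccard_matrix(toy, k=2)
```

For real assemblies the k-mer sets would be far too large to hold as Python string sets, so the sketch only conveys what each cell of the matrix means.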

Extended Data Fig. 2 The EH23 anchor genome sequencing strategy and resulting populations.

A) The F1 hybrid EH23 (ERBxHO40#23) was generated by crossing the type III (high CBDA), day neutral (DN), Early Resin Berry (ERB) with the type I (high THC), day sensitive (DS), HO40. Both ERB and HO40 were sequenced with PacBio CLR, while EH23 was sequenced with PacBio HiFi (CCS) and scaffolded with High-throughput Chromatin Conformation Capture (Hi-C). The F2 mapping population (288 individuals) was sequenced with Illumina short reads. The remaining pangenome samples from OCBD are summarized in Supplementary Table 1, with a pedigree chart in Supplementary Fig. 1. B) Organization schematic for 193 genomes of the cannabis pangenome. Two methods were used to achieve haplotype-resolved, chromosome-scale genomes. The first, a streamlined method, employed Hi-C data for both phasing and scaffolding (Supplementary Table 1, Methods), generating 24 haploid genomes from 12 samples (Hifiasm_HiC). These served as scaffolding references for 42 genomes from 21 samples (Hifiasm_Trio_RagTag), resulting in trio-phased haploid assemblies. Together, these 78 genomes serve as the foundation for our pangenome analyses of transposable elements and structural variation. Additionally, we generated 20 haplotype-resolved contig-level assemblies (Hifiasm), along with 83 contig-level assemblies using older PacBio continuous long reads (CLR; 23 assemblies) and circular consensus sequencing (CCS; 60 assemblies) (Supplementary Table 1). C) Diagram of genomes used in different analyses for this study. For all assemblies we generated gene model annotations using both ab initio tools and RNA expression data, as well as TEs called using a RepeatModeler library (Supplementary Table 2, Methods).

Extended Data Fig. 3 F1 hybrid (ERBxHO40_23; EH23a and EH23b) between two phenotypically and genetically divergent parents clarifies features of the genome missed in other studies to date.

A) Inheritance of alleles across the genome from the F2 population. The upper panel presents the frequency of each allele and the lower panel shows F_IS, the deviation from our evolutionarily neutral expectation of heterozygosity. B) Haplotype-specific expression for all tissue types from EH23, grouped by chromosome. Haplotype gene pairs were either syntenic or reciprocal best hits. Balanced and biased gene expression was assigned according to TPM difference. A difference threshold of 5 TPM was required for gene pairs to be assigned as biased; otherwise gene pairs were assigned as balanced (see also Supplementary Table 2 for counts by tissue type). C) LATE ELONGATED HYPOCOTYL (LHY) showed biased gene expression in EH23b foliage under 12 h of light (12/12 h). D) The copy of LHY with biased expression also belonged to an orthogroup with high entropy in different populations, with the largest difference in entropy separating feral and MJ. E) GO term enrichment of biased gene expression for all tissues in EH23a; and F) GO term enrichment of biased gene expression for all tissues in EH23b. See also Supplementary Note 2.
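
As a rough illustration of the statistic plotted in the lower part of panel A, the following Python sketch computes F_IS at a single biallelic marker from genotype counts. It assumes the standard definition F_IS = 1 − H_obs / H_exp, with H_exp = 2pq taken from the observed allele frequencies; the caption does not state the exact formula the authors used, and the function name and the genotype counts are hypothetical.

```python
import math


def f_is(n_hom_a, n_het, n_hom_b):
    """F_IS = 1 - H_obs / H_exp for one biallelic marker.

    Assumes H_exp = 2pq from the observed allele frequencies.
    Returns NaN when the marker is monomorphic or no genotypes are given.
    """
    n = n_hom_a + n_het + n_hom_b
    if n == 0:
        return math.nan
    p = (2 * n_hom_a + n_het) / (2 * n)  # frequency of allele a
    h_exp = 2 * p * (1 - p)              # expected heterozygosity
    h_obs = n_het / n                    # observed heterozygosity
    if h_exp == 0:
        return math.nan
    return 1 - h_obs / h_exp


# Hypothetical counts for a 288-individual F2 population:
# a 1:2:1 segregation ratio gives no heterozygote deficit or excess.
neutral = f_is(72, 144, 72)
# A complete absence of heterozygotes gives the maximum value.
deficit = f_is(144, 0, 144)
```

Positive values indicate fewer heterozygotes than expected, and negative values indicate an excess.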

Extended Data Fig. 4 The cannabis pangenome and pangenes are high quality.

A) Benchmarking Universal Single-Copy Orthologs (BUSCO)¹⁹ scores for both the genomes and the gene predictions suggest that both are high quality and complete. Gene models were predicted with TSEBRA based on homology and expression data from different tissues, including flowers, leaves, and roots (Supplementary Table 2). We evaluated the quality of gene models with BUSCO¹⁹; they were around 95% complete on average for all assembly types. The scaffolded genomes contained 35,000 genes on average, and in the contig genomes, the number of genes scaled with the presence of duplications detected by BUSCO (Fig. 1e). B) The number of genes predicted contrasted with the number of BUSCO duplicate genes, suggesting that the CCS and CLR contig-based assemblies retained substantial duplicated sequence due to uncollapsed haplotypes. These haplotypes were not removed, to retain the level of variation for downstream analysis. C) Scatter plot of chromosome lengths on the x-axis compared with gene counts per chromosome on the y-axis across the nine autosomes and both sex chromosomes.

Extended Data Fig. 5 Cannabis centromere and telomere analysis shows higher order repeat structure.

A-B) The AceHigh3 (AH3M) chromosomal features of nine pairs of autosomes and one pair of sex chromosomes (X and Y). One million base pair rectangular windows extend outward from each pair of haplotypes at a width proportional to the absence of the CpG motif. Each rectangular window is colored by gene density with warm colors indicating high gene density and cool colors indicating low gene density. Each pair of haplotypes is connected by polygons indicating structural arrangement, with gray for syntenic regions and orange connecting inversions. Rectangles along each haplotype indicate select loci, including 45S (26S, 5.8S, 18S) rDNA arrays (firebrick red), 5S RNA arrays (black), 237 bp centromere repeat (blue), 370 bp CS-1 sub-telomeric repeat (pink) and cannabinoid synthases (forest green; CBCAS, CBDAS, THCAS, and OAC). Chromosomal plots for all 78 haplotype-resolved, chromosome-scale genomes show similar trends (see Ideos.pdf at https://doi.org/10.25452/figshare.plus.28405079.v1). C) The centromere arrays identified in the AH3M genome (as an exemplar for the pangenome) with Tandem Repeats Finder (TRF). Two high copy number arrays were identified with base repeats of 237 and 370 bp, along with their higher order repeats (HOR). The 237 bp array is sparsely found in the genome (blue, panel A), although usually proximal to the high “CpG” sites. The 370 bp repeat is the same sequence as the sub-telomeric repeat CS-1¹⁰⁶ and is found on the ends of the chromosomes (pink, panel A). D) A subset of the genomes was sequenced on Oxford Nanopore Technologies (ONT) to estimate the telomere length in cannabis genomes¹⁰³. The N50 ONT read length is plotted as a function of the maximum telomere repeat identified using the TeloNum software¹⁰³.

Extended Data Fig. 6 Comparison of SyRI, PanGenome Graph Builder, and Minigraph-Cactus structural variation (SV) calls.

A) Differences between SyRI SV, PanGenome Graph Builder (PGGB), and Minigraph-Cactus (MGC) variant lengths. Violin plot showing Gaussian kernel density estimates for PGGB variant lengths, MGC variant lengths, and SyRI SV lengths (all SV types combined, including duplications, inversions, inverted translocations, and translocations). Variant lengths were log-transformed because of the very large range between the smallest and largest lengths. The highest-probability region of each violin lies at approximately the same value for all three methods (~8). MGC shows a smoother distribution than SyRI and PGGB. PGGB appears to be the most granular method, with more distinct groupings than the other methods. PGGB discovers more short variants, while MGC and SyRI capture variants ≥ 50 bp. For comma-separated variants in the VCF file (“ALT” column), only the longest of the variants was counted. Plots showing average depth of EH23 F2 population short reads mapping to B) EH23b chromosome 7 as represented in the MGC pangenome graph; C) the linear reference sequence of EH23b chromosome 7; D) EH23b chromosome 8 as represented in the MGC pangenome graph; and E) the linear reference sequence of EH23b chromosome 8. F) Plot showing the maximum computational memory (RAM in units of gigabytes [GB]) required for analyzing pangenomes of varying sizes (in units of gigabases [Gb]) using PGGB and PanKmer.
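
The rule stated in panel A for multi-allelic records (count only the longest comma-separated ALT allele, then log-transform) can be sketched in Python as follows. Several details are assumptions rather than the authors' procedure: the allele string length is used as the variant length, the natural logarithm is used because the caption does not name a base, and symbolic alleles such as `<DEL>` are not handled. The function names are hypothetical.

```python
import math


def longest_alt_length(alt_field):
    """Length of the longest allele in a comma-separated VCF ALT field.

    Missing ('.') and overlapping-deletion ('*') placeholders are skipped.
    Returns 0 when no sequence allele is present.
    """
    alleles = [a for a in alt_field.split(",") if a not in ("", ".", "*")]
    return max((len(a) for a in alleles), default=0)


def log_variant_lengths(alt_fields):
    """Natural-log lengths of the longest ALT allele per record.

    Records with no usable allele are dropped, since log(0) is undefined.
    """
    lengths = (longest_alt_length(alt) for alt in alt_fields)
    return [math.log(n) for n in lengths if n > 0]


# Toy ALT column values: the second record has three alleles,
# of which only the 4-bp one is counted.
alts = ["A", "A,ACGT,AC", "."]
logged = log_variant_lengths(alts)
```

With real graph-derived VCFs a dedicated parser would be preferable; the sketch only shows the per-record selection rule.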

Extended Data Fig. 7 Terpene synthase genes across the cannabis pangenome.

A) Violin plot showing terpene synthase copy number in the cannabis pangenome. Chromosomes 5 and 6 are copy number “hotspots” in the cannabis pangenome. B) ODGI 2D visualization of EH23a.chr6.v1.g321150.t1, the highest expressed terpene synthase in all flower samples on EH23a.chr6, from the reference-free PGGB pangenome graph (PGGB graph of chromosome 6 including AH3Ma/b, BCMa/b, EH23a/b, GRMa/b, FCS1a/b, H3S1a/b, KCDv1a/b, KOMPa/b, MM3v1a, SAN2a/b, YMv2a). C) Pangenome variation graph visualization of EH23a.chr6.v1.g321150.t1, showing interspersed regions of variation across the gene sequence. D) Visualization of entropy values for the protein multiple sequence alignment, showing low variation at the beginning of the alignment and high variation towards the end of the alignment.
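
A per-column entropy profile of the kind shown in panel D can be approximated with the Python sketch below. It assumes Shannon entropy in bits and ignores gap characters; the caption does not specify which entropy measure or gap treatment was used, and the function names and the four toy sequences are invented for illustration.

```python
import math
from collections import Counter


def column_entropy(column, ignore_gaps=True):
    """Shannon entropy (bits) of the residues in one alignment column."""
    residues = [c for c in column if not (ignore_gaps and c == "-")]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    # sum of p * log2(1/p), written with n/c to avoid a negative zero
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def alignment_entropy(sequences):
    """Entropy per column for equal-length aligned sequences."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    return [column_entropy(col) for col in zip(*sequences)]


# Toy alignment: conserved first column, variable third column.
toy_msa = ["MKTA", "MKSA", "MRS-", "MRG-"]
profile = alignment_entropy(toy_msa)
```

A fully conserved column scores 0, and the score rises as more residue types share the column more evenly.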

Extended Data Fig. 8 Disease resistance genes across the cannabis pangenome.

A) Circos plot showing the EH23a genome as an example of the chromosomal distribution of disease resistance gene analogs (RGAs). Outer track (gold) = all categories of RGAs identified by drago2; middle track (blue) = receptor-like kinases; interior track = coiled-coil nucleotide-binding site leucine-rich repeat genes. B) Violin plot showing numbers of RGAs per chromosome in chromosome-level, haplotype-resolved genomes. C) Maximum likelihood tree of coiled-coil NBS-LRR (CNL) genes on chromosome 2 with similarity to a gene associated with powdery mildew resistance. D) Sequence tube map visualization of a gene near the PM1 marker (EH23a.chr2.v1.g115410; EH23a.chr2:77164374-77165978).

Extended Data Fig. 9 Expression patterns in the flowers and leaves of male and female AceHigh (AH3M) plants.

A) Stacked bar chart showing the number of genes with balanced, biased, or exclusive expression in male and female tissues. For a gene to be considered expressed, a minimum average TPM of 1.0 across tissue replicates, grouped by sex, was required. For balanced expression, genes were required to have an average TPM of at least 1.0 in both sexes, grouped by tissue type, and a difference of less than 5 TPM between the sexes. For biased expression, a difference of ≥ 5 TPM between the sexes was required for each tissue type. For exclusive expression, a gene was required to have an average TPM of at least 1.0 in one sex for a given tissue, with no expression in the other sex for that tissue type (TPM = 0). On average, approximately 90% of genes with balanced or biased expression were syntenic across tissues and sexes; in contrast, approximately 80% of genes with exclusive expression were syntenic. The main exception was exclusively expressed genes in female leaf tissue, in which approximately 90% of genes were syntenic. For this analysis, synteny was relative to the set of eight genomes with X and Y chromosomes, determined by GeneSpace. B) Chromosome-level counts of genes with biased expression in male flowers. C) and D) Scatter plots showing biased gene expression in male flowers across chromosomes X and Y, respectively. The x-axis shows gene start positions and the y-axis shows the difference of log₂ TPM between male and female flowers, specifically showing genes with biased or exclusive expression in male flowers. The blue markers correspond to genes in the PAR and red markers correspond to genes in the X-specific region. E) and F) Biased expression of intact TEs in male flowers across chromosomes X and Y, respectively. GO term enrichment among genes with biased and exclusive expression in male flowers included a variety of metabolic pathways, including pollen development.
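
The thresholds given for panel A can be read as a small decision rule, sketched here in Python for a single gene in a single tissue. The order in which the categories are tested (exclusive before biased before balanced) and the two extra labels "not_expressed" and "unclassified" are assumptions, since the caption does not say how overlapping or leftover cases were handled. The function name and the TPM values are hypothetical.

```python
def classify_expression(male_tpm, female_tpm, min_tpm=1.0, bias_delta=5.0):
    """Label one gene in one tissue from mean TPM values per sex.

    Thresholds follow the caption: expressed at >= 1.0 TPM, biased at a
    difference of >= 5 TPM, exclusive when the other sex has TPM == 0.
    The test order is an assumption made for this sketch.
    """
    male_on = male_tpm >= min_tpm
    female_on = female_tpm >= min_tpm
    if not (male_on or female_on):
        return "not_expressed"
    if (male_on and female_tpm == 0) or (female_on and male_tpm == 0):
        return "exclusive"
    if abs(male_tpm - female_tpm) >= bias_delta:
        return "biased"
    if male_on and female_on:
        return "balanced"
    # Expressed in one sex only, but the other is low rather than zero.
    return "unclassified"


# Hypothetical mean TPM values (male, female) for four genes.
examples = {
    "geneA": classify_expression(12.0, 0.0),
    "geneB": classify_expression(20.0, 3.0),
    "geneC": classify_expression(4.0, 2.0),
    "geneD": classify_expression(0.2, 0.5),
}
```

Testing for exclusive expression first keeps a gene with, for example, 12 TPM in one sex and 0 in the other from also being counted as biased.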

Extended Data Fig. 10 Cannabis pangenome reveals a wide range of structural variation (SV), on par with some of the values that have been reported for interspecies comparisons.

A) Distributions of three types of SVs across the 78 scaffolded assemblies of the cannabis pangenome. Each sample assembly was aligned to the EH23a haplotype assembly for SV calling. B) Multi-modal distribution of inversion lengths, for all inversions from all samples. C) Distribution of the total length of inversions in each assembly as a percent of total genome length. D) Distributions of inversion lengths, for all inversions from all samples. E) Distributions of coding sequences (CDS) and intact transposable elements (TEs) within all inversions and syntenic regions from each sample. Inversions are significantly depleted of CDSs compared to syntenic regions, while on average, TEs are present at nearly an equal level within inversions and syntenic regions. F) Inversion breakpoint (BP) pairs, defined as 8 kb windows centered at the start and end of each inversion larger than 10 kb, contain repetitive elements about 50% of the time. G) Inversion BPs show a higher rate of segmental duplications, but a lower rate of inverted repeats, within self-to-self alignments for each 8 kb BP window, compared to the start-to-end pair alignments. H) Example alignment and SVs of a European hemp sample haplotype (KC Dora). The two megabase-scale inversions are in a region of chromosome 4 that showed elevated F_ST values for SNPs in prior work comparing feral US hemp to marijuana populations¹⁵⁷.
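
The breakpoint-window definition in panel F (8 kb windows centered at the start and end of each inversion larger than 10 kb) can be written out as a short Python sketch. The coordinate convention (0-based, half-open), the clipping of windows at chromosome ends and the function name are assumptions for illustration, not details taken from the authors' pipeline.

```python
def breakpoint_windows(start, end, window=8_000, min_len=10_000, chrom_len=None):
    """Return ((s_lo, s_hi), (e_lo, e_hi)) windows around an inversion.

    Each window spans `window` bp centered on one breakpoint. Inversions
    of length <= min_len return None. Windows are clipped to
    [0, chrom_len] when they would run off the chromosome.
    """
    if end - start <= min_len:
        return None
    half = window // 2

    def centered(pos):
        lo = max(0, pos - half)
        hi = pos + half
        if chrom_len is not None:
            hi = min(hi, chrom_len)
        return (lo, hi)

    return centered(start), centered(end)


# Hypothetical 50 kb inversion well inside a chromosome.
pair = breakpoint_windows(100_000, 150_000)
# An inversion starting 2 kb from the chromosome end gets a clipped window.
clipped = breakpoint_windows(2_000, 50_000)
```

Each returned pair corresponds to one "BP pair" in the caption, and the two windows can then be compared with each other or with themselves for repeat content.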

Supplementary information

Supplementary Information

Supplementary Figures, Supplementary Tables, Supplementary Notes 1–3 and references

Reporting Summary

Peer Review file

Source data

Source Data Fig. 1

Source Data Fig. 2

Source Data Fig. 4

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Lynch, R.C., Padgitt-Cobb, L.K., Garfinkel, A.R. et al. Domesticated cannabinoid synthases amid a wild mosaic cannabis pangenome. Nature 643, 1001–1010 (2025). https://doi.org/10.1038/s41586-025-09065-0


Received: 21 May 2024
Accepted: 24 April 2025
Published: 28 May 2025
Version of record: 28 May 2025
Issue date: 24 July 2025
DOI: https://doi.org/10.1038/s41586-025-09065-0
