Main

Robertsonian chromosomes (ROBs) are structurally variant chromosomes created by the fusion of two telocentric or acrocentric chromosomes to make a single metacentric chromosome. First recognized in 1916 by Robertson in grasshopper karyotypes6, these fusions are a common occurrence in nature, having since been recognized in many branches of life including plants, vertebrates and invertebrates. Robertsonian fusions (or translocations) are the most common karyotypic change in mammals8. ROBs create challenges for meiosis, potentially leading to subfertility, reproductive isolation and speciation9,10. Human ROB carriers are often asymptomatic, but ROBs contribute to trisomies such as Down and Patau syndromes5 and are associated with increased rates of certain cancers4 and uniparental disomy11. Despite their frequent occurrence and important effects on fertility, speciation and human health, the underlying mechanisms and evolutionary origins that explain why these chromosomes form so frequently in nature remain unknown.

In humans, ROBs occur in 1 out of 800 births1,2,3, most commonly in female meiosis12,13. The most common ROBs are the fusion of acrocentric chromosome 14 with chromosome 13 or chromosome 21, which are suspected to form by a similar specific mechanism14. In these fusions, the long arms of the chromosomes are joined and parts of the short arms are lost. ROBs can occur de novo or be inherited15. A population-level analysis7 of the recent complete assembly of the human genome (CHM13)16, which includes the short arms of the acrocentric chromosomes, revealed megabase-sized homology blocks called pseudo-homologue regions (PHRs) that are shared between acrocentric chromosomes. The existence of PHRs implies frequent interhomologue recombination and a working model for how ROBs form17.

Here we fully assemble three common ROBs: two chromosome 13–chromosome 14 (13;14) fusions and one 14;21 fusion. In all three cases, the ROBs have two centromere arrays, have lost all ribosomal 45S DNA repeats and are fused near a macrosatellite array composed of SST1 repeats. The repeats are named for the restriction enzyme that cuts them, isolated from the eponymous Streptomyces stanford. We demonstrate that the SST1 arrays on ribosomal DNA (rDNA)-containing chromosomes (also known as NBL2)18 bear hallmarks of exchange between chromosomes, including enrichment for PRDM9 DNA binding motifs that are associated with recombination hotspot activity19. Further analysis of SST1 repeats and segmental duplications in human and nonhuman primate genomes suggests that SST1 may have a broader role in genome rearrangements and evolution beyond Robertsonian fusions. Additionally, analysis of the two centromeric arrays on each ROB indicates epigenetic adaptations to support stable mitotic transmission, consistent with previous observations20,21. In conclusion, we provide, to our knowledge, the first complete assemblies of common ROBs, precise mapping of their fusion sites, and insight into their formation and transmission mechanisms.

A working model for the formation of common ROBs postulated that they occur owing to: (1) sequence homology between nonhomologous chromosomes, provided by the PHRs; (2) proximity of the PHRs, provided by co-location of rDNA arrays from different chromosomes in the nucleolus; (3) recombination initiation in meiosis to form a crossover; and (4) an inversion on chromosome 14 (refs. 7,17,22) (Fig. 1a). The asymmetric structure and chromosome-unique features around the SST1 arrays suggested that we would be able to map the translocation site (Fig. 1b). To further test this working model, we sequenced and assembled the complete, telomere-to-telomere sequence of three ROBs.

Fig. 1: Complete assembly of ROBs.

a, Working model for the dependence of ROB formation on recombination between SST1 repeats (pink triangles) located in PHRs (coloured bands) on the short arms of human chromosomes 13, 14 and 21. The adjacent 45S rDNA arrays facilitate 3D proximity by co-locating in nucleoli. b, Schematic representation of the main SST1 arrays and flanking sequences in acrocentric chromosomes from the CHM13 genome. This region is similar on chromosomes 13 and 21 and is inverted on chromosome 14. c–e, Representative images of ROBs from GM03786 (c), GM04890 (d) and GM03417 (e) cell lines. Left image, chromosome labelled with an SST1 probe (magenta) and whole chromosome paints as indicated. Centre image, chromosome labelled with an SST1 probe (magenta) and centromeric satellite probes for cen14/22 (orange) and cen13/21 (green). DNA was counterstained with DAPI. Right image, magnified view showing SST1 localization between the two centromere arrays. Scale bar, 1 µm. Right, averaged intensity profiles of lines drawn through the centromeres of multiple ROBs (GM03786: n = 10, GM04890: n = 20, GM03417: n = 11). Intensity profiles were aligned to the peak of the Gaussian of the SST1 signal and normalized to the maximum intensity of each channel. Error bars denote s.d. Bottom, synteny plots comparing the assembled ROB to the CHM13 genome sequence. The structure of each fused region is shown in detail. a.u., arbitrary units.
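The profile averaging described in this legend can be illustrated with a minimal Python sketch. It assumes that each line scan is stored as a list of intensities per channel, it uses the brightest SST1 pixel as a stand-in for the fitted Gaussian peak, and it normalizes each channel before averaging. The function names, channel keys and the half_width parameter are hypothetical and are not taken from the Methods.

```python
from statistics import mean

def normalize(values):
    """Scale a profile so that its maximum intensity is 1."""
    peak = max(values)
    return [v / peak for v in values] if peak > 0 else list(values)

def align_and_average(profiles, ref="sst1", half_width=2):
    """Average line scans after centring each one on its reference peak.

    profiles: list of dicts mapping channel name -> list of intensities.
    Assumption: the reference peak is taken as the brightest pixel,
    a simplification of fitting a Gaussian to the SST1 signal.
    """
    channels = list(profiles[0].keys())
    windows = {ch: [] for ch in channels}
    for prof in profiles:
        signal = prof[ref]
        centre = max(range(len(signal)), key=signal.__getitem__)
        lo, hi = centre - half_width, centre + half_width + 1
        if lo < 0 or hi > len(signal):
            continue  # skip scans whose window runs off the line
        for ch in channels:
            windows[ch].append(normalize(prof[ch])[lo:hi])
    # Average position by position across all retained scans.
    return {ch: [mean(col) for col in zip(*rows)] for ch, rows in windows.items()}

scans = [
    {"sst1": [0, 1, 4, 1, 0, 0], "cen14": [2, 2, 1, 0, 0, 0]},
    {"sst1": [0, 0, 1, 2, 1, 0], "cen14": [0, 4, 4, 2, 0, 0]},
]
averaged = align_and_average(scans, half_width=1)
```

In this toy input the two scans peak at different pixels, so alignment shifts them onto a common SST1-centred axis before the per-position mean is taken.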

We selected three independent cell lines for sequencing and assembly, each harbouring a unique ROB: (1) GM03417 (45 or 46,XX,t(14;21); clinical features of Down syndrome); (2) GM04890 (45,XX,t(13;14); clinically normal with five miscarriages); and (3) GM03786 (45,XX,t(13;14); clinically normal). The fusions were initially confirmed in mitotic spreads using chromosome-specific paints (Fig. 1c–e). We generated Oxford Nanopore Technologies (ONT), Hi-C, Illumina and PacBio HiFi sequencing data (Supplementary Table 1). The Verkko assembler utilized ONT, PacBio and Hi-C data to generate complete de novo assemblies of the ROBs23. Each ROB was visible in the assembly graph by a node connecting regions representing two different acrocentric chromosomes (Extended Data Figs. 1–3). Notably, these connecting nodes skipped the rDNA arrays, consistent with the working model and supporting the loss of rDNA in the common ROBs.

We assessed the quality of the assemblies (see Methods), finding high Phred quality scores (quality values) ranging from 49.10 to 54.70, corresponding to approximately 99.999% accuracy, and high gene completeness, with all assemblies containing over 98% of benchmarking universal single-copy orthologues24 (BUSCOs) (Supplementary Fig. 1). Read coverage analysis across the breakpoint regions shows consistent coverage without significant drops or high-frequency secondary alleles, confirming the structural accuracy of the assembled fusion points (Supplementary Fig. 2). Hi-C data, which capture the three-dimensional organization of chromosomes, showed increased interactions between the q arms of each ROB, emphasizing the physical proximity of these regions due to the fusion (Supplementary Fig. 3), which was also observed by microscopy in nuclei (Extended Data Fig. 4).
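The relationship between the reported quality values and per-base accuracy follows from the standard Phred definition, QV = -10 log10(error probability). The short Python sketch below applies that definition to the two ends of the reported range; the function names are illustrative only.

```python
import math

def qv_to_accuracy(qv):
    """Convert a Phred-scaled quality value to per-base accuracy."""
    return 1.0 - 10 ** (-qv / 10.0)

def error_to_qv(p_error):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10.0 * math.log10(p_error)

for qv in (49.10, 54.70):
    print(f"QV {qv:.2f} -> accuracy {qv_to_accuracy(qv) * 100:.5f}%")
```

Under this definition, QV 50 corresponds to exactly one expected error per 100,000 bases, so the reported range brackets roughly 99.9988% to 99.9997% accuracy.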

SST1 macrosatellite arrays are present on several chromosomes25. Multiple subfamilies of SST1 arrays exist. One subfamily occurs in the PHRs of chromosomes 13, 14 and 21, whose population genetic and structural features have led to the hypothesis that they are the site of recombination in recurrent Robertsonian fusions7. Using cytogenetics, we probed centromeres of chromosomes 13 and 21 (cen13/21), cen14/22 and SST1 and 45S rDNA arrays. Cen13/21 and cen14/22 are highly similar at the sequence level and cannot be distinguished by conventional fluorescence in situ hybridization (FISH) probes. Our analysis confirmed that all three ROBs are missing 45S rDNA arrays and that the SST1 signal is present between centromeric arrays (Extended Data Fig. 4), consistent with the working model for the formation of common ROBs and the assembly graphs. Karyotypic analysis of GM03417 indicates cellular heterogeneity with regard to the number of copies of chromosome 21, with 3 copies of chromosome 21 being relatively rare (around 10%) (Extended Data Fig. 4). Notably, the normal copy of chromosome 14 in GM03786 has lost the SST1 array (Extended Data Figs. 2 and 4), suggesting that polymorphisms in this region exist. Analysis of 26 contigs from the human pangenome that extend beyond the SST1 array revealed a deletion in 9 of the 26 haplotypes (Extended Data Fig. 5).

To visualize the assemblies, we created synteny plots (Fig. 1c–e) using NGenomeSyn26. SST1 arrays are present on chromosomes 13, 14 and 21, but are inverted on chromosome 14 relative to chromosomes 13 and 21 in many genomes7. Owing to this inversion, when this region of chromosome 14 pairs with a 13 or 21 partner chromosome in meiosis, a crossover event is predicted to join the two long arms, forming a ROB. When we compared the assembled ROBs to their component chromosomes in the CHM13 genome, we observed that the sequence order and directionality were syntenic on either side of the SST1 region. This pattern is consistent with crossover within the SST1 array generating each of the assembled ROBs. Sequence asymmetry in PHRs on chromosomes 13 and 21 versus chromosome 14, which includes a partial SST1 monomer, allowed us to identify that the SST1 array was the breakpoint (Extended Data Fig. 6). We speculate that haplotypes of chromosome 14 that are missing this region would form ROBs less readily.
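The predicted outcome of a crossover between oppositely oriented SST1 arrays can be traced with a toy Python model. This is a sketch under simplifying assumptions: each chromosome is reduced to five oriented segments, the inversion is represented only by the strand of the SST1 segment, and the segment names and the crossover helper are hypothetical.

```python
def flip(chrom):
    """Reverse-complement a chromosome: reverse order, invert each strand."""
    return [(name, -strand) for name, strand in reversed(chrom)]

def crossover(a, b, site="SST1"):
    """Exchange arms at a shared site after aligning it in the same orientation."""
    i = next(k for k, (name, _) in enumerate(a) if name == site)
    j = next(k for k, (name, _) in enumerate(b) if name == site)
    if a[i][1] != b[j][1]:
        # Pairing of inverted repeats requires one partner to be flipped.
        b = flip(b)
        j = len(b) - 1 - j
    return a[:i + 1] + b[j + 1:], b[:j + 1] + a[i + 1:]

def names(chrom):
    return [name for name, _ in chrom]

# Segments listed from the short-arm end to the long-arm end.
chr13 = [("p13", 1), ("rDNA", 1), ("SST1", 1), ("cen13", 1), ("q13", 1)]
chr21 = [("p21", 1), ("rDNA", 1), ("SST1", 1), ("cen21", 1), ("q21", 1)]
chr14 = [("p14", 1), ("rDNA", 1), ("SST1", -1), ("cen14", 1), ("q14", 1)]

acentric, rob = crossover(chr13, chr14)   # inverted partner
swap_a, swap_b = crossover(chr13, chr21)  # co-oriented partner
print(names(rob))       # two long arms, two centromeres, no rDNA
print(names(acentric))  # both short arms, both rDNA arrays
```

In this model the inverted partner yields one product carrying both long arms, both centromeres and SST1 between them, with the rDNA segregating to an acentric fragment. The co-oriented pair instead exchanges short arms and leaves two monocentric chromosomes.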

SST1 and interchromosomal exchange

To better understand the evolutionary relationships and potential exchange between SST1 arrays, we conducted a phylogenetic analysis of SST1 monomers derived from the HG002 and CHM13 genomes. The analysis revealed three distinct subfamilies. Subfamily 1 (sf1), also known as NBL2, consists of monomers primarily derived from large arrays on the acrocentric chromosomes and forms a single branch in the phylogenetic tree. Subfamily 2 (sf2) comprises monomers mainly from autosomal arrays (for example, chromosomes 4, 17 and 19) and forms chromosome-specific branches. Subfamily 3 (sf3), also referred to as TTY2, is composed of monomers primarily originating from arrays on the Y chromosome. This classification of subfamilies and their distinct characteristics are illustrated in Fig. 2a,c and Extended Data Fig. 6. The Y-derived repeats co-cluster with a few SST1 monomers from autosomes (see more below). The co-clustering of sf1 monomers from the large arrays on the acrocentric chromosomes due to their high sequence identity (approximately 99% compared with around 90% for sf2 and 75% for sf3) is consistent with frequent recombinational exchange between these repeats, leading to concerted evolution of their constituent monomers.
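The within-subfamily identity figures quoted above are the kind of value obtained by averaging pairwise identity over aligned monomers. The Python sketch below shows one simple way to compute this, assuming the monomers have already been aligned to equal length and that columns containing a gap in either sequence are ignored; the paper's exact procedure may differ, and the function names are illustrative.

```python
from itertools import combinations

def identity(a, b):
    """Fraction of identical bases over ungapped columns of two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    return sum(x == y for x, y in columns) / len(columns)

def mean_pairwise_identity(seqs):
    """Average identity over all unordered pairs of aligned sequences."""
    scores = [identity(a, b) for a, b in combinations(seqs, 2)]
    return sum(scores) / len(scores)

toy_monomers = ["AAAA", "AAAT", "AATT"]
print(round(mean_pairwise_identity(toy_monomers), 3))
```

A subfamily undergoing frequent exchange would be expected to score close to 1 by this measure, whereas independently diverging arrays would score lower.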

Fig. 2: Evidence for SST1-mediated interchromosomal exchange in human genomes.

a, All SST1 monomers from previous analysis of CHM13 (ref. 7) and from HG002 were collected and phylogenetic analysis was performed using the maximum-likelihood method based on the best-fit substitution model (Kimura two-parameter + G, parameter = 5.5047) inferred by Jmodeltest2 with 1,000 bootstrap replicates. Bootstrap values higher than 75 are indicated at the base of each node. The colour indicates the source chromosome and the shape indicates the source genome. Three major subfamilies were identified: sf1, primarily on the acrocentrics (Acros); sf2, primarily on the remaining autosomes; and sf3, primarily on the Y chromosome. Black arrows indicate the location on the phylogenetic trees of sf2 monomers S and L from the acrocentric chromosomes (Fig. 1b). b, Predicted PRDM9 DNA binding site frequency (mean sites per kb, each dot indicates one haplotype) in SST1 arrays in n haploid genomes is plotted by chromosome. ANOVA with the two-sided Tukey–Kramer test for pairwise mean comparisons. c, Schematic representation of the three subfamilies of SST1. SST1-sf1 has a central gap and a predicted PRDM9 DNA binding site (red box). d, A segmental duplication of 27 kb or larger was identified on several autosomes in CHM13 that includes Y-like α-satellite DNA (α-sat) and SST1-sf3. Phylogenetic analysis was performed using the maximum-likelihood method and general time reversible (GTR)-plus-gamma substitution parameters. Bootstrap values are shown. e, Comparison of overlaps between segmental duplications (SDs; ≥10 kb) and random regions or SST1 monomers across 147 genomes. Distributions show the number of overlaps versus density. A permutation test with 10,000 iterations per genome was used to generate random region overlaps. The significant difference between distributions (Wilcoxon signed-rank test, paired, two-sided) indicates non-random association between segmental duplications and SST1 regions.

The recurrent involvement of the SST1 array in ROB formation shown here and the concerted evolution of SST1 arrays on chromosomes 13, 14 and 21 (ref. 7) suggest that this array may comprise a meiotic recombination hotspot. To investigate this, we focused on PRDM9, a key protein in meiosis that has a crucial role in defining recombination hotspots. PRDM9 contains a variable number of zinc-fingers that allow it to bind to specific DNA sequences. Additionally, it possesses a histone methyltransferase domain that trimethylates histone H3 at lysine 4 (H3K4) and lysine 36 (H3K36)27,28. This trimethylation of H3K4 and H3K36 by PRDM9 creates an open chromatin environment that favours recombination events during meiosis19. Previous work suggested that SST1 arrays on chromosomes 13, 14 and 21 in the CHM13 genome contain predicted PRDM9 binding sites7. To further examine PRDM9 binding sites within SST1 arrays, we searched for the 13-bp PRDM9 DNA binding motif29 of the common A allele30 in the SST1 arrays in 147 human genomes (Methods). The density of PRDM9 DNA binding motifs is significantly higher in the SST1 arrays on chromosomes 13, 14 and 21 relative to chromosomes 4, 17, 18 and Y (Fig. 2b). Whereas previous analysis used a collection of PRDM9 binding motifs to identify sites across acrocentric p-arms7, our current analysis uses only the high-confidence 13-bp PRDM9 DNA binding motif, which appears at a lower frequency in rDNA arrays.
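A motif-density calculation of this kind can be sketched in a few lines of Python. The sketch assumes that the 13-bp motif is the widely cited degenerate sequence CCNCCNTNNCCNC, which is not spelled out in the text above, and it counts overlapping matches on both strands; the function names and the sites-per-kb normalization are illustrative rather than a description of the pipeline in the Methods.

```python
import re

# Assumption: degenerate 13-mer CCNCCNTNNCCNC, with N = any base.
# The lookahead lets overlapping occurrences be counted separately.
MOTIF = re.compile(r"(?=CC[ACGT]CC[ACGT]T[ACGT]{2}CC[ACGT]C)")
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def motif_sites_per_kb(seq):
    """Count motif matches on both strands and scale to sites per kb."""
    seq = seq.upper()
    n_sites = len(MOTIF.findall(seq)) + len(MOTIF.findall(revcomp(seq)))
    return 1000.0 * n_sites / len(seq)

# Toy 1-kb sequence carrying a single forward-strand match.
toy_array = "CCACCATAACCAC" + "A" * 987
print(motif_sites_per_kb(toy_array))
```

Applying the same function to each SST1 array and to control regions would give per-haplotype densities that can then be compared between chromosomes.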

PRDM9 activity leads to the erosion of its own binding sites19. PRDM9 sites in SST1 arrays should erode on all three acrocentric chromosomes. We speculate that PRDM9 sites within the SST1 arrays can regenerate via intrachromosomal or interchromosomal exchange events. However, the inversion on chromosome 14 may create a barrier for interchromosomal gene conversion, tipping the relative balance of erosion and regeneration toward erosion and resulting in the observed lower density of sites on this chromosome. The significantly higher density of PRDM9 DNA binding motifs in SST1 arrays on acrocentric chromosomes aligns with our finding that the consensus sequence of their monomers contains a predicted PRDM9 DNA binding motif (Fig. 2c and Extended Data Fig. 7). These findings suggest that the predominant subfamily of SST1 repeats on acrocentric chromosomes (sf1) creates a sequence context that is permissive to meiotic recombination.

Nearly 100 PRDM9 alleles have been identified across human populations30. Some differ by a single nucleotide; others vary substantially by copy number and organization of their zinc-finger motifs. PRDM9 alleles can influence the activity of meiotic recombination hotspots. We reasoned that PRDM9 allele variation could influence the likelihood of ROB formation, so we examined the genotype of our three ROB cell lines. GM03417 is A/A, GM03786 is A/L24 and GM04890 is L24/L9. A is the most common allele, but L24 and L9 are rare, being present in only 1.66% of individuals in the previous study (12 out of 720), although these alleles are overrepresented in an ethnically southern European population30 (9 out of 109). No conclusions can be drawn regarding PRDM9 alleles and ROB formation at this time, as we lack information on vertical transmission and do not have enough samples to test for an association.

Further evidence for SST1-mediated interchromosomal recombination comes from two key observations. First, the phylogenetic analysis revealed the presence of Y-like SST1 monomers (sf3) on multiple autosomes (7, 9, 12, 17 and 20). These monomers are contained within larger sequence blocks ranging in size from 25–50 kb with 95–99% identity that also include Y-like α-satellite DNA (Fig. 2d). This pattern suggests that these blocks originated from the Y chromosome, although we could not identify syntenic blocks on the human Y chromosome. Second, we examined the association between SST1 arrays and segmentally duplicated regions, defined as regions longer than 10 kb with at least 90% identity at two or more locations in the genome. Segmental duplications arise and persist via recombination facilitated by repetitive DNA shared between chromosomes. To quantify the association between sf1, sf2 and segmental duplications, we conducted a permutation test with 10,000 iterations in 147 genomes (Methods). The results suggest that the observed association between SST1 and segmental duplications is unlikely to occur by random chance, and the association is highest with sf1 (Fig. 2e). Together, these findings suggest that, beyond its role in recurrent Robertsonian translocations, SST1 may be associated with interchromosomal recombination throughout the genome.
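The logic of the permutation test can be illustrated with a simplified Python sketch. It assumes a single chromosome, half-open intervals and uniform random placement of length-matched regions, and it reports an empirical one-sided P value for one genome. The analysis described above instead compares overlap distributions across 147 genomes, so this is a reduced illustration with hypothetical function names, not the procedure from the Methods.

```python
import random

def count_overlaps(regions, sds):
    """Number of regions that overlap at least one segmental duplication."""
    return sum(
        any(sd_start < end and start < sd_end for sd_start, sd_end in sds)
        for start, end in regions
    )

def permutation_p(regions, sds, chrom_len, n_iter=10_000, seed=0):
    """Empirical P value for observing at least this many overlaps by chance."""
    rng = random.Random(seed)
    observed = count_overlaps(regions, sds)
    hits = 0
    for _ in range(n_iter):
        shuffled = []
        for start, end in regions:
            length = end - start
            new_start = rng.randrange(0, chrom_len - length + 1)
            shuffled.append((new_start, new_start + length))
        if count_overlaps(shuffled, sds) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_iter + 1)

toy_sds = [(1000, 2000)]
toy_sst1 = [(1500, 1600), (1900, 2100)]
observed, p_value = permutation_p(toy_sst1, toy_sds, chrom_len=1_000_000, n_iter=1000)
```

Shuffling regions while preserving their lengths keeps the null model comparable to the observed set, so that any excess overlap reflects position rather than size.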

SST1 and rDNA in chimpanzee and bonobo genomes

Tandem arrays of the SST1 macrosatellite are not unique to humans; they also exist in nonhuman primate genomes, including several recently assembled telomere-to-telomere (T2T) genomes31. In gibbon, SST1 arrays have been suggested to be responsible for evolutionary patterns of genome instability. As chimpanzees (Pan troglodytes) and bonobos (Pan paniscus) are the most closely related species to humans, we examined the acrocentric chromosomes in their genomes32 for the position and orientation of rDNA and SST1 arrays to understand the potential for ROB formation, specifically, and chromosome evolution more generally. Two out of five rDNA array-positive acrocentric p arms in the bonobo genome (hsa14 and hsa22) and all five rDNA array-positive acrocentric p arms in the chimpanzee genome (hsa13, hsa14, hsa18, hsa21 and hsa22) have co-oriented SST1 arrays (Fig. 3a). The location of SST1 arrays on these chromosomes was validated by FISH (Supplementary Fig. 4).

Fig. 3: Evidence for exchange of SST1 on rDNA array-bearing chromosomes in chimpanzee and bonobo genomes.

a, Ideograms of all the rDNA array-bearing chromosomes in human, chimpanzee and bonobo, annotated with the human numbering system (indicated by Hsa prefix). The directionality of 45S rRNA gene arrays (grey) and SST1 arrays (coloured bars) are indicated with arrowheads. b, Predicted PRDM9 binding sites were identified in the chimpanzee genome, and the number of sites per kb is plotted for SST1 arrays for the indicated subfamily. Random regions of the genome (randBins and randGC (GC-matched random regions)) were used to determine background. c, All SST1 monomers from the chimpanzee genome were subjected to phylogenetic analysis using the maximum-likelihood method. The colour indicates the source chromosome. The SST1 monomers from Hsa13, Hsa14, Hsa18, Hsa21 and Hsa22 chromosomes form a single branch, indicating a high degree of similarity. d, All SST1 monomers from the bonobo genome were subjected to phylogenetic analysis using the maximum-likelihood method. The SST1 monomers from chromosomes 14 and 22 form a single branch, indicating a high degree of similarity. e, SST1 monomers from human (Hs), chimpanzee (Pt) and bonobo (Pp) were subjected to phylogenetic analysis using the maximum-likelihood method. The three subfamilies are apparent.

To investigate the potential role of PRDM9 in SST1-mediated recombination, we examined the frequency of predicted PRDM9 DNA binding sites33,34 in SST1 repeats in the chimpanzee genome. The density is high on the rDNA array-positive chromosomes but lower on the other chromosomes (Fig. 3b). Of note, although the SST1 arrays on hsa15 in the chimpanzee genome are composed of sf1 monomers, these monomers are not enriched for predicted PRDM9 DNA binding sites. This is consistent with the monophyletic state of SST1 monomers from hsa15 in chimpanzees and bonobos (Fig. 3c,d). Hsa15 does not have an rDNA array, and its resident SST1 appears not to recombine with SST1 monomers on other chromosomes with rDNA arrays. Together, our findings suggest that further analysis of SST1 and rDNA arrays will provide insight into the recombination and evolution of karyotypes in primate genomes.

To probe the evolutionary history of SST1 arrays and their potential role in chromosomal rearrangements across closely related species, we conducted a comparative phylogenetic analysis of SST1 monomers from rDNA array-positive chromosomes in chimpanzee and bonobo. The analysis revealed patterns similar to the human genome. In the chimpanzee genome, monomers from Hsa13, Hsa14, Hsa18, Hsa21 and Hsa22 co-cluster, whereas in the bonobo genome, monomers from Hsa14 and Hsa22 co-cluster (Fig. 3c,d). These findings provide evidence for interchromosomal exchange between SST1 monomers on rDNA array-positive chromosomes in both chimpanzee and bonobo genomes, suggesting that this phenomenon is not unique to humans but is a shared feature among great apes. When we performed a combined phylogenetic analysis of monomers from chimpanzee, human and bonobo, we observed that the acrocentric chromosomes from all three species contain SST1-sf1 monomers (Fig. 3e). The sf3 monomers from the Y chromosome also co-cluster. Monomers from other parts of the genome form a separate branch for sf2.

The inverted orientation of the SST1 array on the p arm of chromosome Hsa15 in both chimpanzee and bonobo genomes resembles the configuration of the SST1 array on chromosome 14 in the human genome (Fig. 3a). However, the SST1 monomers from Hsa15 form a chromosome-specific branch in the phylogenetic trees (Fig. 3c–e), indicating a lack of recombination between SST1 monomers on Hsa15 and other chromosomes in the ape genomes. This observation correlates with the absence of rDNA on Hsa15 in the chimpanzee and bonobo; we speculate that rDNA brings the SST1-containing regions into proximity at the nucleolus, facilitating interchromosomal exchange. Consequently, the chimpanzee and bonobo genomes lack the specific placement and orientation of rDNA and SST1 arrays that we hypothesize facilitate the formation of human ROBs, suggesting that this structural arrangement is unique to the human genome. Correspondingly, the presence of the rDNA and absence of SST1 on human chromosome 15 suggest that the ancestral Hsa15 of human, chimpanzee and bonobo had both rDNA (ref. 32) and SST1 arrays. The greater sequence divergence of chromosome 15 from other rDNA array-positive chromosomes in humans highlights the importance of SST1 for maintaining their sequence similarity7.

The association between SST1 and interchromosomal recombination events is further supported by the identification of regions that are syntenic with the segmental duplication shown in Fig. 2d across several nonhuman primate genomes (Supplementary Fig. 5a). This finding suggests that these duplication events occurred in a common ancestor. Moreover, we identified a second segmental duplication on chromosome 16 in primates (hsa16), which exhibits a different structure but also contains Y-like SST1 and α-satellite sequences derived from cenY of the common ancestor of apes and New World monkeys (Supplementary Fig. 5b). These findings provide compelling evidence that SST1 is associated with interchromosomal recombination in primate genomes and that past events have involved the Y chromosome as a donor. Notably, the ancestral version of the Y chromosome was likely to contain rDNA35.

Centromere activity on ROBs

All ROBs examined in this study have two centromeric DNA arrays, roughly 5 Mb apart, yet they are faithfully transmitted through mitosis. This observation, consistent with previous cytogenetic observations21, suggests that epigenetic alterations to the centromeres might occur to enable correct chromosome transmission. To investigate centromere activity in ROBs, we leveraged both microscopy and genomic data. To analyse ROBs by microscopy, we performed immunolabelling with FISH (immunoFISH) using FISH probes targeting cen13/21 and cen14/22, along with antibodies against CENP-C, an inner kinetochore protein that marks the active centromere, and CENP-B, which binds to the 17-bp CENP-B box present in α-satellite DNA except on cenY. CENP-B was present broadly across α-satellite DNA, consistent with previous reports for a dicentric chromosome36. Confocal fluorescence microscopy (Supplementary Fig. 6) revealed that the CENP-B signal was proportional to the size of the centromeric array determined by the assembly (Supplementary Fig. 7).

To gain more detailed insights into centromere activity in ROBs, we used structured illumination microscopy (SIM) and single-particle averaging (Methods) to evaluate the localization of CENP-C in the two t(13;14) fusion chromosomes, as it provides higher resolution than confocal microscopy. CENP-C signal was confined to cen14 (Fig. 4a,b), suggesting that cen14 is the active centromere, whereas cen13 is inactive. This observation is consistent with a previous study that found cen14 to be the active centromeric array in most 13;14 dicentrics21. The active array did not depend strictly on size, as in GM04890 cen14 was smaller than cen13 (Supplementary Fig. 7). By contrast, the CENP-C signal for the t(14;21) fusion was less binary. CENP-C signal overlapped with both cen14 and cen21 within a single chromosome (Fig. 4c), and the pattern exhibited heterogeneity between chromosomes, suggesting that both arrays may remain active. Similar to CENP-C, two CENP-A signals were sometimes detected on the t(14;21) ROB (Extended Data Fig. 8). Imaging of NDC80 revealed that these two signals were often encompassed within a single outer kinetochore signal (Extended Data Fig. 8), indicating only one attachment site for microtubules, which would help prevent erroneous attachments.

Fig. 4: Centromere activity in dicentric ROBs.

a–c, ImmunoFISH, DNA methylation and CENP-A CUT&Tag analyses were performed for GM03786 (a), GM04890 (b) and GM03417 (c). Left, representative SIM images of ROBs labelled by immunoFISH with centromeric satellite probes for cen14/22 (orange), cen13/21 (green) and CENP-C antibody (red). DNA was counterstained with DAPI. Bottom images, magnified views depicting single CENP-C foci on cen14 in GM03786 (a) and GM04890 (b), and double CENP-C foci on cen21 and cen14 in GM03417 (c). Scale bars, 1 µm. Bottom left, averaged intensity profiles of lines drawn through the individual kinetochore regions of sister chromatids of multiple ROBs. GM03786: n = 22 (11 chromosomes) (a); GM04890: n = 26 (13 chromosomes) (b); GM03417: n = 24 (12 chromosomes) (c). Intensity profiles were aligned to the peak of the Gaussian of the cen14 signal and normalized to the maximum intensity of each channel. Error bars denote s.d. Top centre and right, corresponding heat maps of sequence similarity calculated for 5-kb bins for each centromere. Below the heat maps, DNA methylation tracks show methylation calls from ONT (orange) or PacBio HiFi (turquoise) sequencing, with hypomethylated regions suggesting active centromere localization. Active centromere regions are indicated by CENP-A enrichment on CUT&RUN (blue) and CUT&Tag (black) tracks.

To further investigate centromere activity in ROBs, we examined epigenetic markers that were associated with active centromeres. Previous studies have identified a dip in CpG methylation in active centromere arrays that coincides with enrichment for CENP-A, the centromere histone H3 variant. This region is referred to as the centromere dip region (CDR) and probably marks the site of kinetochore assembly37,38. We inspected CpG methylation from both HiFi and ONT reads across the centromere arrays in the three assembled ROBs. As controls, we examined CDRs in chromosomes 13, 14 and 21 from the normal chromosomes in the same cell line. We visualized CDRs in conjunction with pairwise sequence similarity heat maps between 5-kb bins of the centromeric array39 and methylation, CUT&RUN and CUT&Tag data for CENP-A. In the t(13;14) ROBs, there is a CDR on cen14, coincident with CENP-A enrichment, and although there is low CENP-A enrichment in the adjacent cen13 array, there is no corresponding CDR (Fig. 4a,b and Extended Data Fig. 9). This finding is consistent with the CENP-C signal being confined to the region with the signal from the cen14 FISH probe, suggesting that cen14 is the active centromere and that this chromosome is functionally monocentric. By contrast, cen13 on the normal chromosome has a clear CDR and CENP-A enrichment (Extended Data Fig. 10). Sequence similarity between centromeres 13/21 and 14/22 is a challenge for mapping short-read data from CUT&RUN and CUT&Tag, but has less effect on CDR identification from long-read data or imaging.
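The idea of locating a CDR as a run of hypomethylated bins can be sketched in Python as follows. The sketch assumes that methylation has already been summarized as a mean CpG methylation fraction per fixed-size bin across the centromeric array. The threshold of 0.4 and the minimum run of two bins are arbitrary illustrative values, and the function name is hypothetical; they are not parameters from the Methods.

```python
def find_dips(meth, threshold=0.4, min_bins=2):
    """Return half-open (start, end) bin ranges where methylation stays below threshold.

    meth: per-bin mean CpG methylation fractions (0 to 1) along the array.
    Assumption: a candidate CDR is any run of at least min_bins consecutive
    bins under the threshold; shorter runs are treated as noise.
    """
    dips, start = [], None
    # A trailing sentinel above any valid threshold closes a run at the array end.
    for i, value in enumerate(list(meth) + [1.0]):
        if value < threshold and start is None:
            start = i
        elif value >= threshold and start is not None:
            if i - start >= min_bins:
                dips.append((start, i))
            start = None
    return dips

toy_track = [0.9, 0.8, 0.2, 0.1, 0.3, 0.85, 0.3, 0.9]
print(find_dips(toy_track))
```

In the toy track, the three-bin dip is reported while the isolated low bin is discarded. Intersecting such ranges with CENP-A enrichment would then separate an active array from one that carries only background signal.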

In the t(14;21) ROB, we observed a dip in methylation in both cen14 and cen21 arrays, located near the adjacent edges of the two arrays (Fig. 4c and Extended Data Fig. 9) towards the fusion breakpoint. The CENP-A enrichment is stronger on cen14, but CENP-A enrichment also exists on cen21, coincident with a CDR. In combination with imaging, the data suggest that inner kinetochore proteins are located at both centromeres. The CDR and CENP-A enrichment on cen21 on the normal chromosome are more pronounced (Extended Data Fig. 10). A previous study showed that isodicentric X chromosomes can be stably transmitted through mitosis when centromeres are less than 12 Mb apart (ref. 20). Our findings suggest that ROBs are stably propagated owing to epigenetic adaptation in centromere activity. In some cases, this is achieved by complete inactivation of one array, whereas in other cases, activity is shared between two arrays that are physically close enough to be encompassed by the outer kinetochore.

Discussion

Whereas 15% of human ROBs appear to form by idiosyncratic mechanisms, 85% involve chromosome 14 (ref. 14). These common ROBs are likely to arise owing to a unique combination of factors, including the inversion on chromosome 14, the presence of a rDNA array which draws the acrocentric short arms into physical proximity near nucleoli, and recombination initiation from a hotspot at or near the SST1 array. Tandem arrays of SST1-sf1 are found primarily on rDNA array-bearing chromosomes in chimpanzee, bonobo, gorilla and human genomes32. However, the assembled chimpanzee and bonobo genomes do not possess a chromosome containing both an inverted SST1 array and an rDNA array, suggesting that this particular structural arrangement is unique to the human lineage. Despite this, phylogenetic evidence suggests that interchromosomal exchange between SST1 arrays on rDNA-containing chromosomes is a common feature in human, bonobo and chimpanzee genomes. Furthermore, the SST1-sf1 arrays that show evidence of exchange in the human and chimpanzee genomes are also enriched for PRDM9 DNA binding sites. Together, the evidence suggests that the common ROBs in humans occur via a combination of four key factors: (1) large regions of homology on heterologous chromosomes; (2) an inversion on chromosome 14, which creates a unique structural arrangement; (3) the co-location of two of these regions in 3D space due to the presence of rDNA arrays that bring the regions into close proximity (within the nucleolus); and (4) meiotic recombination hotspots in SST1, potentially mediated by PRDM9, that lead to crossing over between the nonhomologous chromosomes17.

Our work consolidates many disparate observations regarding the SST1 macrosatellite family, efforts that have been limited by gaps in repetitive DNAs in reference genomes. Through careful analysis of multiple human, chimpanzee and bonobo genomes, we identified three distinct subfamilies of SST1. Several names for these subfamilies appear in the literature, including TTY2 for sf3 on the Y chromosome, MER22 for sf2 and NBL2 for sf1 on the acrocentric chromosomes. In several cases, SST1 sequence has been associated with genome instability and adverse health effects. For example, loss of SST1 repeats on the Y chromosome is associated with male infertility40, and hypomethylation is associated with cancer41,42, potentially contributing to its transcription and subsequent genomic instability43,44. Additionally, meiotic genes such as PRDM9 are often expressed in cancer45, and translocations in cancer often involve acrocentric chromosomes46. Our work is consistent with previous proposals that SST1 contributes to genome instability, and it may have a much broader role than previously appreciated.

SST1-sf1 sequences on the acrocentric chromosomes are highly enriched for PRDM9 DNA binding sites, further implicating this shared macrosatellite DNA as a candidate for meiotic recombination hotspots that would result in recombination between heterologous chromosomes and the formation of common ROBs. We speculate that PRDM9 DNA binding sites that are lost via erosion19 may be ‘regenerated’ by gene conversion or other exchange events, allowing SST1-hotspot associated activity to persist and self-sustain. It is possible that rare and functionally distinct PRDM9 alleles contribute to the incidence of ROB formation and segmental duplications, based on their binding properties, but more studies will be required. Together, the evidence suggests SST1 repeats have a substantial role in the stability and evolution of primate chromosomes, as first suggested for the gibbon genome31.

ROBs form more commonly in female meiosis12,13. Meiosis is sexually dimorphic in many ways that could contribute to the higher frequency of ROB formation in female meiosis compared with male meiosis. Female meiosis proceeds through the earliest stages of prophase with DNA in a more demethylated state than in male meiosis47, and open chromatin is more prone to recombination. Consistently, there is more recombination initiation and crossing over in female meiosis48,49,50. Furthermore, hotspots can be sexually dimorphic51. Chromosomes synapse later in female meiosis than in male meiosis49, which could make them less constrained and allow more interchromosomal exchange. Paradoxically, crossovers are suppressed at rDNA arrays in plants52 and yeast53, but the SST1 arrays implicated here may be sufficiently far from the rDNA to avoid any protective effect. Future efforts will be required to distinguish the transcriptional, epigenetic and hotspot status of the short arms of acrocentric chromosomes in developmental and disease states in males and females, and the factors that allow ROBs to occur more commonly in female meiosis.

ROBs are transmitted at a higher rate than Mendelian ratios in female meiosis, a phenomenon known as meiotic drive54. Drive can occur in females because of the segregation of material into ‘dead end’ polar bodies, potentially owing to weaker centromeres. The specific structure of each ROB, including centromere activity, may affect its transmission. In our study, the t(14;21) ROB appeared functionally dicentric for inner kinetochore proteins with a spanning outer kinetochore, which may create a stronger centromere, potentially influencing its transmission. Moreover, ROBs have two long q arms, so crossovers on the ROB are more likely relative to an individual acrocentric chromosome, potentially facilitating ROB segregation at M1. The structure of each ROB will differ based on variation in individual centromeres and other repeats, potentially conferring differential transmission. Analysis of many ROBs will be required to understand how individual features affect their propagation and carrier fertility. By integrating insights from genomic, cytogenetic and evolutionary studies, we can gain a more comprehensive understanding of the role of rDNA and SST1-based recombination in genome evolution and reproductive biology.

Methods

Cell culture

Human lymphoblastoid cell lines (LCLs) GM03786, GM04890 and GM03417 were obtained from Coriell. All LCLs were cultured in RPMI 1640 (Gibco) with l-glutamine supplemented with 15% fetal bovine serum (FBS) in a 37 °C incubator with 5% CO2.

ONT sequencing

Ultra-high molecular weight DNA was extracted from frozen cell pellets using the NEB Monarch HMW DNA Extraction Kit for Tissue and assessed for fragment size using a pulsed field gel. The fragments spanned 50 kb to 1,000 kb in size. Genomic DNA libraries were prepared using the NEB_5ml_Ultra-Long Sequencing Kit (SQK-ULK001)-promethion protocol from Oxford Nanopore. Each library was loaded onto a FLO-PRO002 flow cell and run for 72 h with two subsequent loadings at 24-h intervals. The libraries were sequenced on a PromethION (Oxford Nanopore) running MinKNOW software v.22.12.5. Basecalling and modified base detection (5mC) were performed on-instrument using Guppy 6.4.6 with the following model: dna_r9.4.1_450bps_modbases_5mc_cg_hac_prom.cfg.

PacBio HiFi sequencing

PacBio library preparation was conducted using the SMRTBELL Prep Kit 3.0. The prepared libraries were quantified and sequenced on a PacBio Revio system with Instrument Control Software v.12.0.0.183503 and chemistry v.12.0.0.172289. Sequencing was performed using two SMRT Cells, each with a movie length of 24 h.

Three libraries (one per sample) were prepared using the Pacific Biosciences SMRTbell Prep Kit 3.0 with binding kit 102-739-100 and sequencing kit 102-118-800. The Megaruptor (Diagenode) was used for shearing and SageELF (Sage Science) was used for size selection. Library size was assessed using a FemtoPulse (Agilent). Each library was run on v.25M SMRT Cells using the first generation polymerase and chemistry v.1 (P1-C1). Sequencing was performed on a PacBio Revio system running instrument control software v.12.0.0.183503 with a movie collection time of 24 h per SMRT Cell. Using PacBio SMRTLink v.12.0.0.172289, CCS/HiFi reads were generated on-instrument using ccs v.7.0.0, lima v.2.7.1 (demultiplexing) and primrose v.1.4.0 (5mC calling).

Hi-C sequencing

Hi-C libraries were generated according to manufacturer’s directions using the Arima High Coverage Hi-C User Guide for Mammalian Cell Lines (A160161 v.01) and Arima-HiC+ User Guide for Library Preparation Using the Arima Library Prep Module (A160432 v.01). Starting with 5 million cells per sample, the Standard Input Crosslinking protocol was followed, resulting in 1.49–1.86 μg of DNA available per sample to generate large proximally ligated DNA as assessed using the Qubit Fluorometer (Life Technologies). Library preparation was performed using the S220 Focused-ultrasonicator (Covaris) to shear samples to 550 bp followed by a DNA purification bead cleanup with no size selection, and 5 or 7 cycles of library PCR amplification per sample. Resulting short fragment libraries were checked for quality and quantity using the Bioanalyzer (Agilent) and Qubit Fluorometer (Life Technologies). Libraries were pooled, requantified and sequenced as 150 bp paired reads on both the Illumina NextSeq 2000 and NextSeq 500 instruments to obtain at least 600 M read pairs per sample, using real-time analysis and instrument software versions current at the time of processing. Demultiplexing was performed with bcl-convert v.3.10.5. The cut sites (^) for the enzymes used were ^GATC, G^ANTC, C^TNAG and T^TAA.

Library construction for GM04890 and GM03786

Libraries were generated from 100 ng genomic DNA using Covaris LE220 plus to shear the DNA and the 2S Plus DNA Library Kit (Integrated DNA Technologies 10009878) for library preparation. To minimize coverage bias, only four cycles of PCR amplification were used. The median insert sizes were approximately 300 bp. Libraries were tagged with unique dual index DNA barcodes to allow pooling of libraries and minimize the impact of barcode hopping. Libraries were pooled for sequencing on the NovaSeq X plus (Illumina) across 14 lanes to obtain at least 369 million 151-base read pairs per individual library.

Library construction for GM03417

PCR-free libraries were generated from 1 μg genomic DNA using a Covaris R230 to shear the DNA and the TruSeq DNA PCR-Free HT Sample Preparation Kit (Illumina) for library preparation. The median insert sizes were approximately 400 bp. Libraries were tagged with unique dual index DNA barcodes to allow pooling of libraries and minimize the impact of barcode hopping. Libraries were pooled for sequencing on the NovaSeq X plus (Illumina) across 7 lanes on 25B flowcells to obtain at least 388 million 151-base read pairs per individual library.

Assembly methods

Phased genome assemblies were generated using Verkko (v.1.4.1)23. The assembly process integrated PacBio HiFi reads and Oxford Nanopore (ONT) reads, with Hi-C reads used specifically for the phasing. The ONT reads included ultra-long reads, defined as reads that are at least 100 kb in length. Verkko was run with the command:

samples=("GM03417" "GM03786" "GM04890")
for sample in "${samples[@]}"; do
  verkko --slurm -d $sample \
    --screen human \
    --graphaligner conda/bin/GraphAligner \
    --mbg conda/bin/MBG \
    --hifi-coverage 30 \
    --hifi $sample/HiFi/*fa.gz \
    --nano $sample/ONT/fastq/*fq.gz \
    --hic1 $sample/HiC/*_1_[ACTG]*.fastq.gz \
    --hic2 $sample/GM03786/HiC/*_2_[ACGT]*.fastq.gz
done

Haplotype-consistent contigs and scaffolds were automatically extracted from the labelled Verkko graph, with unresolved gap sizes estimated directly from the graph structure. After the assembly was generated, we collapsed all nodes composed of only rDNA k-mers into a single node and added telomere nodes to the graph to indicate ends of chromosomes using the commands:

seqtk hpc rDNA.fasta > rDNA_compressed.fasta
seqtk telo assembly.fasta > assembly.telomere.bed
mash sketch -i 8-hicPipeline/unitigs.hpc.fasta -o compressed.sketch.msh
mash screen compressed.sketch.msh rDNA_compressed.fasta | awk '{if ($1 > 0.9 && $4 < 0.05) print $NF}' > target.screennodes.out
python remove_nodes_add_telomere.py -r target.screennodes.out -t assembly.telomere.bed

In this simplified graph, the Robertsonian translocation was apparent in all cases (Extended Data Figs. 1–3). We extracted the assembly path corresponding to the ROB and identified gaps in the assembly. There was one gap in GM03417, one gap in GM03786 and two gaps in GM04890. Manual interventions were used to complete the chromosomes.

Assembly quality evaluation

We evaluated the quality and gene completeness of the genome assemblies using two approaches: a k-mer-based, reference-free method and a gene content assessment. For the k-mer-based evaluation, we employed Merqury55, a tool that assesses assembly completeness and accuracy without relying on a reference genome. Merqury uses k-mer frequencies from sequencing reads to estimate the quality value of the assemblies, which represents the phred-scaled error rate. For our evaluation, we used PacBio HiFi reads for the quality value estimation.
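The phred-scaled quality value described above can be illustrated with a minimal sketch of the k-mer-based estimate that Merqury reports. The function name and the default k = 21 are assumptions made for illustration only; the k-mer size used in this study is not stated here.

```python
import math


def merqury_qv(asm_only_kmers: int, total_asm_kmers: int, k: int = 21) -> float:
    """Phred-scaled consensus quality from k-mer counts.

    asm_only_kmers  : k-mers found in the assembly but absent from the reads
    total_asm_kmers : all k-mers found in the assembly
    k               : k-mer size (21 is an assumed default, not from the paper)
    """
    if total_asm_kmers <= 0 or not 0 <= asm_only_kmers <= total_asm_kmers:
        raise ValueError("k-mer counts are inconsistent")
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers: error rate estimated as zero
    # Probability that a single base is correct, assuming a k-mer is
    # supported by the reads only when all k of its bases are correct.
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    base_error = 1 - p_base_correct
    return -10 * math.log10(base_error)
```

Under this estimate, fewer assembly-only k-mers translate directly into a higher quality value, which is why read sets with high base accuracy such as HiFi are a natural choice for the k-mer database.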

To assess gene completeness, we used compleasm56, a tool based on BUSCO. Compleasm evaluates the presence and integrity of a curated set of BUSCOs expected to be present in the genomes of the taxonomic group under study. We used the primate-specific BUSCO dataset, which includes 13,780 genes, to quantify the completeness, duplication and fragmentation of conserved genes in our assemblies.

PRDM9 site predictions and density

In 147 human haploid genomes (from 72 diploid individuals plus the haploid CHM13 and diploid HG002 genomes), predicted PRDM9 DNA binding sites were identified by using Motifence (v.0.1.1, commit fb1ebc0; https://github.com/AndreaGuarracino/motifence) to find DNA sequences matching the canonical 13-mer motif CCNCCNTNNCCNC57 or its reverse complement. To compute the density of PRDM9 DNA binding sites per kb in SST1 regions, SST1 arrays were first identified using TideHunter58. For a region to be defined as an SST1 array, the following criteria were applied: each monomeric unit within the array had to be at least 500 bp in length, there had to be at least two monomers, and the monomers had to overlap with RepeatMasker (v.4.1.5, http://repeatmasker.org/) SST1 annotations. The PRDM9 density was then calculated by dividing the number of PRDM9 binding sites in the SST1 regions by the total length of these SST1 regions. PRDM9 alleles were found by conducting a BLAST search (blast-plus/2.13.0) on GM03417, GM03786 and GM04890 with the A allele as the reference. To identify genotypes, these hits were aligned to the 69 alleles from Alleva et al.30 using MUSCLE and visualized in Geneious Prime 2024.0.7.
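The motif search and density calculation can be sketched as follows. This is an illustrative re-implementation and not the Motifence code; the function names are hypothetical, and matches are allowed to overlap, which is an assumption about how sites are counted.

```python
import re

# Map each motif character to a regular-expression fragment (N = any base).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(motif: str) -> str:
    return motif.translate(COMPLEMENT)[::-1]


def motif_regex(motif: str) -> re.Pattern:
    body = "".join(IUPAC[base] for base in motif)
    # A lookahead makes the search report overlapping occurrences too.
    return re.compile(f"(?=({body}))")


def find_sites(seq: str, motif: str = "CCNCCNTNNCCNC"):
    """Return sorted (start, strand) tuples for the motif on both strands."""
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        for match in motif_regex(pattern).finditer(seq):
            hits.append((match.start(), strand))
    return sorted(hits)


def sites_per_kb(seq: str, motif: str = "CCNCCNTNNCCNC") -> float:
    """Number of predicted sites divided by sequence length in kb."""
    if not seq:
        return 0.0
    return len(find_sites(seq, motif)) / (len(seq) / 1000)
```

For a set of SST1 arrays, the density described in the text would correspond to summing the site counts over all arrays and dividing by their summed length, rather than averaging per-array densities.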

In the chimpanzee genome, PRDM9 site density in sites per kb on SST1 regions was calculated using R and Bioconductor. The function vmatchPattern from the Biostrings library was used to map the occurrence of the chimpanzee PRDM9 motifs: prdm9_E CNNCCNAANAA, prdm9_W CNGNNAANANTT and prdm9_pt1 ANTTNNATCNTCC, or their reverse complements, on the genome. SST1-containing regions were then queried for overlap of PRDM9 sites using the countOverlaps function from the GenomicRanges library. Query width was used to calculate sites per kb. SST1 regions larger than 10 kb were broken into 3-kb tiles to approximate resolution near SST1 feature size. Background PRDM9 site density was assessed in two ways. Random background PRDM9 density for each chromosome was determined using 100 randomly chosen 3-kb segments. To account for GC bias, the genome was scored for GC content at 3 kb resolution, and fragments within one s.d. of the average GC content of the SST1-containing elements were chosen to calculate background PRDM9 site density.
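The tiling and GC-matched background selection can be sketched in a few lines. The analysis itself was done in R with Bioconductor; the Python below is only an illustration, the function names are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def tile_region(start: int, end: int, tile: int = 3000, max_len: int = 10000):
    """Split a region longer than max_len into consecutive tile-sized pieces."""
    if end - start <= max_len:
        return [(start, end)]
    return [(s, min(s + tile, end)) for s in range(start, end, tile)]


def gc_fraction(seq: str) -> float:
    """GC content over unambiguous bases only."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def gc_matched_tiles(background_gc, sst1_gc):
    """Indices of background tiles whose GC content lies within one s.d.
    of the mean GC content of the SST1-containing tiles.

    sst1_gc needs at least two values for the standard deviation.
    """
    mu, sd = mean(sst1_gc), stdev(sst1_gc)
    return [i for i, gc in enumerate(background_gc) if abs(gc - mu) <= sd]
```

The background density would then be the number of motif sites in the selected tiles divided by their combined width in kb, matching the sites-per-kb measure used for the SST1 regions.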

SST1–segmental duplication association

To examine associations between SST1 repeats and segmental duplications, we performed the following analysis in 147 human genomes (from 72 diploid individuals plus the haploid CHM13 and diploid HG002 genomes). First, repetitive regions in the genomic sequences were masked using RepeatMasker (v.4.1.5, http://repeatmasker.org/) and Tandem Repeats Finder (v.4.09.1)59. Segmental duplications were then identified using SEDEF (v.1.1)60 on each haploid masked genome. SST1 repeats were detected using RepeatMasker and refined with TideHunter, as described above. Finally, we used the R package regioneR (v.1.36.0)61 to perform permutation testing (n = 10,000) to assess the significance of spatial associations between SST1 repeats and segmental duplications. This analysis was conducted on 147 haplotype-resolved genomes to provide a comprehensive view of these genomic features across diverse human genomes.
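The logic of the permutation test can be illustrated with a simplified, single-chromosome sketch. It is not the regioneR implementation: the function names are hypothetical, randomized regions are allowed to overlap one another, and masked or unplaceable sequence is ignored, all of which are simplifying assumptions.

```python
import random


def count_overlaps(a, b):
    """Number of half-open intervals in a that overlap any interval in b."""
    return sum(any(s < e2 and s2 < e for s2, e2 in b) for s, e in a)


def permutation_test(a, b, chrom_len, n_perm=1000, seed=0):
    """Empirical one-sided p-value for 'a overlaps b more than expected'.

    Each permutation relocates every interval of a to a uniformly random
    position on the chromosome while preserving its length.
    """
    rng = random.Random(seed)
    observed = count_overlaps(a, b)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        shuffled = []
        for start, end in a:
            length = end - start
            new_start = rng.randrange(0, chrom_len - length + 1)
            shuffled.append((new_start, new_start + length))
        if count_overlaps(shuffled, b) >= observed:
            at_least_as_extreme += 1
    # Adding 1 to numerator and denominator avoids reporting p = 0.
    p_value = (at_least_as_extreme + 1) / (n_perm + 1)
    return observed, p_value
```

With n = 10,000 permutations, as used here, the smallest p-value such a test can report is 1/10,001, which sets the resolution of the significance estimate.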

SST1 monomer characterization

We used RepeatMasker to identify SST1 regions and retrieved FASTA sequences for all arrays together with 1 kb of flanking sequence. We then manually curated all clusters by visual inspection of dot plots generated with the Dotlet applet62 using a 15 bp word size and a 60% similarity cut-off. We iteratively refined the consensus sequences, which enabled us to describe the sequences accurately. Through manual curation, we identified the beginning and end of each array and of each monomer relative to the resulting consensus. To allow alignment, all monomeric sequences analysed were defined with the same start and end points relative to the consensus.

Maximum-likelihood phylogenetic analysis

We aligned all full-length SST1 monomeric sequences retrieved from assembled genomes using MUSCLE63. We conducted the phylogenetic analysis using the maximum-likelihood method based on the best-fit substitution model (Kimura two-parameter + G, parameter = 5.5047) inferred by jModelTest2, with 1,000 bootstrap replicates. Bootstrap values higher than 75 are indicated at the base of each node.
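As background on the substitution model, the Kimura two-parameter model treats transitions and transversions separately. The sketch below computes the corresponding pairwise distance between two aligned monomers. It is an illustration only: it omits the gamma rate correction (+G) selected by the model test, it is not part of the maximum-likelihood tree inference itself, and the function name is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned sequences.

    Columns containing gaps or ambiguous bases are skipped. A ValueError
    is raised by math.log if the sequences are too divergent (saturated).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```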

Chromosome spreads, FISH and immunoFISH

For the preparation of chromosome spreads, cells were blocked in mitosis by the addition of Karyomax colcemid solution (0.1 µg ml−1, Life Technologies) for 6–7 h. Adherent fibroblast cells were collected by trypsinization. Collected cells were incubated in hypotonic 0.4% KCl solution for 12 min and pre-fixed by addition of methanol:acetic acid (3:1) fixative solution (1% total volume). Pre-fixed cells were spun down and then fixed in methanol:acetic acid (3:1).

For SST1 and centromere FISH, spreads were dropped on a glass slide and incubated at 65 °C overnight. Before hybridization, slides were treated with 0.1 mg ml−1 RNAse A (Qiagen) in 2× SSC for 45 min at 37 °C and dehydrated in a 70%, 80% and 100% ethanol series for 2 min each. Slides were denatured in 70% deionized formamide/2× SSC solution pre-heated to 72 °C for 1.5 min. Denaturation was stopped by immersing slides in a 70%, 80% and 100% ethanol series chilled to −20 °C. Labelled DNA probes were denatured separately in a hybridization buffer by heating to 80 °C for 10 min before applying to denatured slides. Fluorescently labelled human centromere probes for D13Z1/D21Z1 and D14Z1/D22Z1 were from Cytocell. The biotin-labelled BAC probe for SST1 (RP11-614F17) was obtained from Empire Genomics. Specimens were hybridized to the probes under a glass coverslip or HybriSlip hybridization cover (GRACE Biolabs) sealed with rubber cement or Cytobond (SciGene) in a humidified chamber at 37 °C for 48–72 h. After hybridization, slides were washed in 50% formamide/2× SSC 3 times for 5 min per wash at 45 °C, then in 1× SSC solution at 45 °C for 5 min twice and at room temperature once. For biotin detection, slides were incubated with fluorescent streptavidin conjugated with Cy5 (ThermoFisher Scientific) for 2–3 h in PBS containing 0.1% Triton X-100 and 5% bovine serum albumin (BSA), and then washed 3 times for 5 min with PBS/0.1% Triton X-100. Slides were mounted in Vectashield containing DAPI (Vector Laboratories). Confocal z-stack images were acquired on the Nikon TiE microscope equipped with PlanApo 100× oil immersion objective NA 1.45, Yokogawa CSU-W1 spinning disk, Flash 4.0 sCMOS camera (Hamamatsu), and NIS Elements software.

For chimpanzee and bonobo cell lines, chromosome spread specimens were hybridized to the probes under a glass coverslip or HybriSlip hybridization cover (GRACE Biolabs) sealed with rubber cement or Cytobond (SciGene) in a humidified chamber at 37 °C for 48 h. After hybridization, slides were washed in 50% formamide/2× SSC 3 times for 5 min per wash at 45 °C, then in 1× SSC solution at 45 °C for 5 min twice and at room temperature once. For biotin detection, slides were incubated with fluorescent streptavidin conjugated with 488 (ThermoFisher Scientific) for 45 min in PBS containing 0.1% Triton X-100 and 5% BSA, and then washed 3 times for 5 min with PBS/0.1% Triton X-100. Slides were mounted in Vectashield containing DAPI (Vector Laboratories). Confocal z-stack images were acquired on the Zeiss LSM 800 microscope equipped with a Plan-Apochromat 63×/1.4 oil immersion objective and Zen Blue software.

For chimpanzee and bonobo, we used the following SST1-sf1 probe: (5′-AGGCCAAATATCAGCTGCAAATTCAATCATCCATCAGCCCTCTGCCTACCTCTTCCTTTGAAAGGGCAGTGGCCGGCCCGGCTTGTAAAAGCCCTGGGGTTCCAGAAAGCCGACCGCGCTTTACAGAACAACTGTAATGAGGAACACAGGCGAATCCGAGGGGGTGACCATGTGACCACGCGTGGTACTGGCCAATCCCACAGCAGCTGGTGTTAATGTGTGTCACCGGAGGCATACGGGGCGACGGCGAAACAAAGGGTGGTGTCCAGGAATGTGCCGGTGGATGGGGAAACGGGTGACCTTTCCATCAATGCCAACGAAAATCAAAGAACAACTGGGACCCGGGGGTTGGGGGTGCCGCCTGTGCCTGACCCAAGCCACGTTTTCAAATGCCTACCAGAGGAGCAGAGAGGTTTCTGCAAAATTCGCAGCATCCCCAATCCTCCACCGACCTGGTAGCCCTGACGAAACTTCGGCTGGCACAAACCCAGAGAGGGTGGGGAGTCATACAGCAGAGGAGAGCAGCCCAGGGGCACGCAGGCCGACCCGTCATCGAGATCACGGACGGCCGCACGACTTTTCGGGAGACTCACCCCAGCCAACACCGTCCGTGCAGGCCTGAGGCTGGTATCCCGTGCTGCTTCCCCCCGTCTCCGCCTGGGGTTTCCTCATCAAGGTCGGCCCTTTGCGACTCCTGGCATCCGGAGACGTTCCCTTCGACCCCGTGGAGAGGTGAGGCTTTAGCCTCAGAGCCTCGACACCCAAGCACTGCAACGGAGGGCTCCTGCTCTGCCAAGCCTCGGGGCCTGGTTTCTAAGAAAACCGTGGGAACCACTGTGACGGGAGATACCGCTCGCGCCTCGCGCATGCGCATTGGCCGAGCCGATTCGCGCTCCACTGCTGACAGATAGGCTGCGTCCGCTTTAAATATCGCCACCACCACGCGGCGGCCTTGGTGCTCCTGCTGCCGCTGCGGCGGCGGCTGGATCCTGGGTCCTGTTTGGGGCGGCATGCGAAAGGGGACCGCGGGTGTCTCGTCCTGTCCCAGGCCCACACCCCCAGGGGTCCTGTCCACAGGACCTGCTTCAGCCGACTTCCACCGAGGGAGGGGGAGCTTCAGGACGCCTGCTGTGTTCTCCGGACTCCCGTTGAGATCCGATTTTGGCCCTCTCCGAGTGAGATAGGACGAGCTCACCACACCCGGACAGGCCGGCAGGGCCTC GCTGCAGCACAGAATGATCCCGTAGGTCTGA-3′).

For CENP-B and CENP-C immunoFISH, freshly prepared chromosome spreads were dropped on a glass slide, washed with PBS/0.1% Triton X-100, and blocked with 5% BSA in PBS/0.1% Triton X-100. Primary antibody (rabbit polyclonal anti-CENP-B, Abcam, ab25734, rabbit polyclonal anti-CENP-C, Millipore, ABE1957) and secondary antibody (goat anti-rabbit Alexa Fluor 647, ThermoFisher Scientific) were diluted in 5% BSA/PBS/0.1% Triton X-100. Specimens were incubated with primary antibody overnight, washed 3 times for 5 min, incubated with secondary antibody for 2–4 h and washed again 3 times for 5 min. All washes were performed with PBS/0.1% Triton X-100. After antibody incubation, spreads were post-fixed in 2% paraformaldehyde diluted in PBS for 15 min, washed in PBS, and processed for FISH as described above, starting with an ethanol dehydration series. DNA was stained with 1.5 µg ml−1 DAPI. Confocal z-stack images of CENP-B immunoFISH were acquired on the Nikon TiE microscope as described above. For SIM performed on CENP-C immunoFISH, slides were rinsed in ddH2O, air-dried in the dark, mounted in ProLong Glass antifade mountant (ThermoFisher Scientific) and allowed to cure for at least 24 h before imaging. z-stack images were acquired on an Elyra 7 Lattice SIM2 microscope (Zeiss) equipped with two PCO.edge 4.2 sCMOS cameras, four high power continuous wave lasers (405, 488, 561 and 642 nm) and a Zeiss PlanApo 63× oil immersion objective NA 1.4. The illumination pattern was set to 15 phases, and the z-stack spacing was set at 100 nm. Raw SIM images were reconstructed using the ZEN Black software (Zeiss) with 10.5 manual adjustments for sharpness and best-fit settings for all channels except 405 nm (DAPI), which was processed in the widefield mode. Image pre-processing for SPA-SIM included channel alignment; this analysis randomizes any residual chromatic shifts by averaging randomly oriented chromosomes.

For CENP-A and NDC80 immunoFISH experiments, fibroblasts plated on 150-mm dishes were treated with 100 µM Monastrol (Tocris Bioscience) and 100 µM Apcin (Selleck Chemicals) for 5 h and collected by mitotic shake-off. Collected cells were further incubated with Karyomax colcemid solution (0.1 µg ml−1, Life Technologies) for 15 min. After that, cells were spun down and resuspended in 0.075 M KCl swelling buffer containing 10 mM HEPES, incubated at room temperature for 12 min, washed with ice-cold PBS and kept on ice. Cells (3–4 × 10⁵) were spun onto glass slides using a Shandon Cytospin 4 centrifuge (Thermo Scientific) at 700–1000 rpm for 3–5 min, washed in KCM buffer (120 mM KCl, 20 mM NaCl, 10 mM Tris-HCl, pH 8, 0.5 mM EDTA, 0.1% (v/v) Triton X-100) and blocked in 5% (w/v) BSA/KCM for 30 min. Slides were then incubated with primary antibodies (mouse anti-CENP-A (3-19) Enzo ADI-KAM-CC006, or mouse anti-NDC80 (9G3.23) ThermoFisher Scientific MA1-23308 with rabbit anti-CENP-A ProSci 30-143) used at 1:100 dilution for 1 h at room temperature, washed 3 times in KCM for 5 min, followed by incubation for 1 h with species-specific secondary antibodies conjugated with Alexa Fluor Plus dyes at 2 µg ml−1, washed again and post-fixed in 4% (v/v) paraformaldehyde/KCM for 10 min. Fixed slides were incubated in 50% glycerol/PBS at 4 °C for at least 1 h or overnight. Before hybridization, slides were subjected to a freeze-thaw treatment by dipping into liquid nitrogen, then treated with 0.1 N HCl for 5 min, washed twice in 2× SSC buffer, and pre-incubated in 50% formamide/2× SSC overnight. Fluorescently labelled probes were pre-denatured for 7 min at 80 °C, followed by incubation with the specimen for 3 min at 74 °C, and hybridized under HybriSlip hybridization cover (GRACE Biolabs) sealed with Cytobond (SciGene) in a humidified chamber at 37 °C for 24–48 h. After hybridization, slides were washed in 50% formamide/2× SSC 3 times for 5 min per wash at 45 °C, then in 1× SSC solution at 45 °C for 5 min twice and at room temperature once. DNA was stained with 1.5 µg ml−1 DAPI. After staining was completed, slides were rinsed in ddH2O, air-dried in the dark, mounted in ProLong Glass (ThermoFisher Scientific) and allowed to cure for at least 24 h before SIM imaging.

Centromere intensity profiling of centromere and SST1 FISH and CENP-B immunoFISH

Maximum intensity projections from spinning disk confocal z-stacks were generated, and chromosomes of interest were segmented manually on the basis of DNA and centromere labelling. Segmented chromosomes from each cell line were oriented vertically and assembled in a new stack consisting of identified specific chromosomes from multiple chromosome spreads. Intensity plot profiles were generated from 2 µm vertical lines with the width of 10 pixels drawn through centromeric regions of each chromosome. Intensity profiles were combined by channel, fit to single Gaussian functions, and aligned to the peak of the Gaussian of the indicated channel. These profiles were then averaged together and normalized to the maximum intensity of each peak. For each chromosome from each cell line, at least ten intensity profiles were averaged and plotted with the s.d. All image processing and analysis were performed using ImageJ/FIJI. A detailed description of this type of analysis and relevant plugins are available at https://research.stowers.org/imagejplugins/spasim.html.
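The align, average and normalize steps can be sketched for a single channel as follows. This is not the ImageJ/FIJI plugin code. In place of a full least-squares Gaussian fit, the sketch uses a three-point log-parabola estimate around the brightest pixel, which is exact for a noise-free Gaussian; that substitution, the function names and the half_width parameter are assumptions made for brevity.

```python
import math


def gaussian_peak(profile):
    """Sub-pixel peak position from a three-point Gaussian (log-parabola) fit."""
    i = max(range(len(profile)), key=profile.__getitem__)
    if i == 0 or i == len(profile) - 1 or min(profile[i - 1:i + 2]) <= 0:
        return float(i)  # cannot fit at the edge or with non-positive values
    left, centre, right = (math.log(v) for v in profile[i - 1:i + 2])
    curvature = left - 2 * centre + right
    if curvature == 0:
        return float(i)
    return i + 0.5 * (left - right) / curvature


def sample(profile, x):
    """Linearly interpolated intensity at position x (0 outside the profile)."""
    if x < 0 or x > len(profile) - 1:
        return 0.0
    lo = int(math.floor(x))
    hi = min(lo + 1, len(profile) - 1)
    frac = x - lo
    return profile[lo] * (1 - frac) + profile[hi] * frac


def align_and_average(profiles, half_width=5):
    """Centre each profile on its fitted peak, average, and scale to max = 1."""
    peaks = [gaussian_peak(p) for p in profiles]
    averaged = [
        sum(sample(p, peak + offset) for p, peak in zip(profiles, peaks)) / len(profiles)
        for offset in range(-half_width, half_width + 1)
    ]
    top = max(averaged)
    return [v / top for v in averaged] if top > 0 else averaged
```

For multi-channel data, the peak would be estimated in the chosen reference channel and the same shift applied to the other channels before averaging, so that relative offsets between signals are preserved.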

Semi-automated intensity profiling of CENP-C immunoFISH from SIM images

Reconstructed SIM images were mean projected, except for the DAPI channel, which had the slice of highest contrast selected. ROBs and corresponding normal acrocentric chromosomes were identified using centromere FISH signals and segmented manually or with a Cellpose model trained on a combination of the DAPI and centromere signals. Individual chromosomes were transferred to a new image and oriented vertically using a second Cellpose model trained to find a skeleton of the chromosome. Bent chromosomes were straightened in ImageJ/FIJI by manually drawing two annotation lines across centromeres, one through each kinetochore. The straightened images were then aligned to the peak of the specified centromere FISH signal used as the anchor point, and the line intensity profiles were aggregated over multiple images and split by cell line. At least ten chromosomes were analysed for each instance from each cell line. All analysis was performed in ImageJ/FIJI and Python with code at https://github.com/jouyun/Gerton_Robertsonian_2024.

Methylation calls

HiFi BAM and ONT FASTQ files with 5mC methylation calls as MM and ML tags were aligned against the generated assemblies using pbmm2 (v.1.13.0, https://github.com/PacificBiosciences/pbmm2) for HiFi reads and Winnowmap (v.2.03)64 for ONT reads. The alignments were then converted to sorted BAM files containing only primary mappings with samtools (v.1.17)65:

# HiFi reads
pbmm2 align {genome}.mmi {bam_with_meth_calls} -j 42 > {output.bam}
samtools view -@ 24 -Sb -F 2048 {output.bam} | samtools sort -@ 24 -T {temporary_directory} - > {output.bam}
samtools index {output.bam}
# ONT reads
winnowmap -t 48 -W {genome}_repetitive_k15.txt -ax map-ont -y {assembly_fasta} {fastq_with_meth_calls} > {output.sam}
samtools view -@ 24 -Sb -F 2048 {output.sam} | samtools sort -@ 24 -T {temporary_directory} - > {output.bam}
samtools index {output.bam}

Aggregated methylation percentages at all CpGs were obtained using modbam2bed (v.0.10.0, https://github.com/epi2me-labs/modbam2bed) with bases with >0.8 probability called “methylated” and bases with <0.2 probability called “unmethylated”:

modbam2bed -t 48 -e -m 5mC --cpg -a 0.20 -b 0.80 {assembly_fasta} {output.bam} > {output.bed}
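The effect of the two probability thresholds on a single CpG can be illustrated with a short sketch. It is not the modbam2bed implementation: the function names are hypothetical, and both the bin-midpoint decoding of ML values and the exclusion of filtered reads from the percentage are assumptions made for illustration.

```python
def ml_to_probability(ml_value: int) -> float:
    """Convert an ML tag value (0-255) to a probability.

    The ML value N encodes the interval N/256 to (N+1)/256; the midpoint
    is used here as a point estimate (an assumption for this sketch).
    """
    if not 0 <= ml_value <= 255:
        raise ValueError("ML values must be between 0 and 255")
    return (ml_value + 0.5) / 256


def aggregate_cpg(probabilities, low=0.20, high=0.80):
    """Summarize per-read 5mC probabilities at one CpG.

    Reads above `high` count as methylated, reads below `low` as
    unmethylated, and everything in between is treated as filtered.
    """
    methylated = sum(p > high for p in probabilities)
    unmethylated = sum(p < low for p in probabilities)
    filtered = len(probabilities) - methylated - unmethylated
    called = methylated + unmethylated
    percent = 100.0 * methylated / called if called else None
    return {
        "methylated": methylated,
        "unmethylated": unmethylated,
        "filtered": filtered,
        "percent": percent,
    }
```

For example, four reads with probabilities 0.95, 0.90, 0.50 and 0.10 give two methylated calls, one unmethylated call and one filtered read, so two of the three confident calls are methylated.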

CUT&RUN library preparation

The CUT&RUN assay was performed using the CUT&RUN Assay Kit (86652, Cell Signaling Technology) in accordance with the manufacturer’s protocol. For each condition, 250,000 cells were pelleted and washed in 1× wash buffer, prepared from 10× wash buffer (31415, Cell Signaling Technology), 100× spermidine (27287, Cell Signaling Technology) and 200× protease inhibitor cocktail (7012, Cell Signaling Technology). Cell suspensions were then incubated with concanavalin A-coated beads for 5 min at room temperature to facilitate binding, followed by resuspension in 1× binding buffer containing 100× spermidine, 200× protease inhibitor cocktail, 40× digitonin solution (Cell Signaling Technology, 16359) and antibody binding buffer (Cell Signaling Technology, 15338). For the detection of CENP-A–DNA interactions, a monoclonal antibody against CENP-A (Enzo, ADI-KAM-CC006-E) was employed at a 1:50 dilution. As controls, tri-methyl-histone H3 (Lys4) (Cell Signaling Technology, 9751, C42D8 rabbit monoclonal antibody) was used at 1:50 dilution as a positive control, while a rabbit IgG XP isotype control (Cell Signaling Technology, 66362, DA1E monoclonal antibody) was applied at 1:10 dilution as a negative control. Antibody incubation was conducted at 4 °C overnight (16 h). Later, the beads were subjected to magnetic separation and washed in digitonin buffer.

The beads were then resuspended in digitonin buffer containing pAG-MNase enzyme (40366) and incubated at 4 °C for 1 h. Following another wash in digitonin buffer, the beads were treated with calcium chloride in digitonin buffer and incubated at 4 °C for 30 min to facilitate MNase activation. The enzymatic digestion was terminated by adding 1× stop buffer (prepared from 4× stop buffer (Cell Signaling Technology, 48105), digitonin solution and 200× RNase A (Cell Signaling Technology, 7013)). For normalization, spike-in DNA (Cell Signaling Technology, 40366) was introduced at a final concentration of 10 pg μl−1 (1:100 dilution). Samples were then incubated at 37 °C for 10 min, and the supernatants were collected by centrifugation. DNA was liberated via incubation at 65 °C for 2 h before purification. Input chromatin samples were sheared to fragments ranging from 100–700 base pairs using a Covaris S2 sonicator prior to purification.

DNA purification was performed using a DNA purification with spin columns kit (Cell Signaling Technology, 14209). DNA concentration was assessed using the Qubit dsDNA HS kit for the Qubit Fluorometer.

CUT&Tag library preparation

For anti-CENP-A CUT&Tag, libraries were prepared using the CUT&Tag-IT kit from Active Motif (53160). Each experiment was performed with 500,000 fresh cells. Fresh cells were washed using 1× wash buffer and nuclei were isolated and incubated with activated concanavalin A-coated magnetic beads in 2 ml PCR tubes at room temperature for 10 min. A 1:100 dilution of primary antibody anti-CENP-A (human) monoclonal antibody (D115-3) in antibody buffer was added and nuclei were incubated overnight at 4 °C. The next day tubes were incubated on a magnetic tube holder and supernatants were discarded. Secondary antibody (rabbit anti-mouse) was diluted at 1:100 in Dig-Wash buffer and nuclei were incubated for 1 h on an orbital rotator at room temperature. Nuclei were washed three times in Dig-Wash buffer and then incubated with a 1:100 dilution of CUT&Tag-IT pA–Tn5 Transposomes for 1 h on an orbital rotator at room temperature. Afterwards, 125 μl of tagmentation buffer was added to each sample. To stop tagmentation, 4.2 μl 0.5 M EDTA, 1.25 μl 10% SDS and 1.1 μl 10 mg ml−1 proteinase K were added to each reaction, which was then incubated at 55 °C for 1 h. DNA was barcoded and amplified using the following conditions: a PCR mix of 25 μl NEBNext 2× mix, 2 μl each of barcoded forward and reverse 10 μM primers, and 21 μl of extracted DNA was amplified at: 58 °C for 5 min, 72 °C for 5 min, 98 °C for 45 s, 16× 98 °C for 15 s followed by 63 °C for 10 s, 72 °C for 1 min. Amplified DNA libraries were purified by adding a 1.1× volume of SPRI beads to each sample and incubating for 10 min at 23 °C. Samples were placed on a magnet and liquid was removed. Beads were rinsed twice with 80% ethanol, and DNA was eluted with 20 μl elution buffer. All individually i7-barcoded libraries were mixed at equimolar proportions for sequencing.

CUT&Tag and CUT&RUN libraries and sequencing

Libraries were quantified and individually converted to process on the Singular Genomics G4 with the SG Library Compatibility Kit (700141), following the Adapting Libraries for the G4–Retaining Original Indices protocol. The converted libraries were sequenced in individual lanes on an F3 flow cell (700125) on the G4 instrument, using Instrument Control Software 23.08.1-1 with 100 bp paired reads. Following sequencing, sgdemux 1.2.0 was run to generate FASTQ files.

CUT&Tag and CUT&RUN bioinformatic analysis

CUT&Tag and CUT&RUN sequencing reads were trimmed using the trim-galore tool (v.0.6.10, https://github.com/FelixKrueger/TrimGalore), which included adapter removal. The trimmed reads of each sample were then aligned to the corresponding de novo assembly using bowtie2 (v.2.5.3)66. After alignment, the reads were sorted and indexed using samtools (v.1.17)65, and depth information for primary alignments was extracted with mosdepth67.
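The order of these steps can be sketched as a small command builder. This is an illustrative outline only, not the exact commands used in this study: the file names, thread count, output directory and the helper names `build_pipeline` and `trimmed_name` are hypothetical, and the mosdepth exclusion flag 2308 (unmapped, secondary and supplementary alignments) is an assumed way of restricting depth to primary alignments.

```python
from pathlib import Path


def trimmed_name(fastq, mate):
    """Assumed Trim Galore paired-end output name, e.g. x_R1.fastq.gz -> x_R1_val_1.fq.gz."""
    name = Path(fastq).name
    for suffix in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return f"{name}_val_{mate}.fq.gz"


def build_pipeline(sample, r1, r2, assembly, outdir="out", threads=8):
    """Return the commands (as argument lists) in the order described in the text."""
    out = Path(outdir)
    index = str(out / f"{sample}_index")
    t1 = str(out / trimmed_name(r1, 1))
    t2 = str(out / trimmed_name(r2, 2))
    sam = str(out / f"{sample}.sam")
    bam = str(out / f"{sample}.sorted.bam")
    prefix = str(out / f"{sample}.depth")
    return [
        # 1. Adapter and quality trimming of paired reads
        ["trim_galore", "--paired", "--output_dir", str(out), r1, r2],
        # 2. Index the sample's own de novo assembly, then align to it
        ["bowtie2-build", assembly, index],
        ["bowtie2", "-p", str(threads), "-x", index, "-1", t1, "-2", t2, "-S", sam],
        # 3. Coordinate-sort and index the alignments
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        ["samtools", "index", bam],
        # 4. Depth, excluding reads with any of the bits 4 + 256 + 2048 set (assumption)
        ["mosdepth", "-F", "2308", prefix, bam],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; the paths are placeholders.
    for cmd in build_pipeline("sample1", "s1_R1.fastq.gz", "s1_R2.fastq.gz", "asm.fa"):
        print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the tools are installed; printing first makes it easy to inspect the commands before running them.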

Pairwise sequence identity heat maps

To generate pairwise sequence identity heat maps of each centromeric region, we used a modified version of StainedGlass (v.0.6)39 with the following parameters: window = 5000, mm_f = 30000 and mm_s = 1000. Our modifications added methylation and CENP-A CUT&Tag tracks at the bottom of each identity heat map.

Synteny plots

To visualize the alignment between the generated assemblies and the CHM13 genome, we used NGenomeSyn26 to generate the synteny plots, which were then manually curated.

Hi-C data analysis

We mapped Hi-C reads against the CHM13 genome and the phased genome assemblies of the three cell lines with the BWA aligner68, configured to handle the chimeric nature of Hi-C reads by allowing local mapping and tuning the parameters to minimize gaps. Following read mapping, we constructed three Hi-C contact matrices for each cell line (one against CHM13 and two against the two haplotypes of the respective assembly) using HiCExplorer tools69, specifying a bin size of 10,000 bp and incorporating restriction site information. The resulting matrices were then binned at coarser resolutions (100 kb, 200 kb and 500 kb) and corrected to normalize the contact frequencies across bins and remove GC and open chromatin biases. Finally, we visualized the corrected matrices using hicPlotMatrix, applying log transformation to handle the wide range of contact counts.
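To make the re-binning and log-transformation steps concrete, the following minimal sketch merges adjacent bins of a toy contact matrix and applies a log transform. It is a simplified illustration in plain Python and not the HiCExplorer implementation: the function names `merge_bins` and `log_transform` are hypothetical, the toy counts are invented, and the use of log(1 + x) to keep empty bins finite is an assumption.

```python
import math


def merge_bins(matrix, factor):
    """Sum contacts over factor x factor blocks (e.g. factor=10 turns 10 kb bins into 100 kb bins)."""
    n = len(matrix)
    m = math.ceil(n / factor)
    merged = [[0] * m for _ in range(m)]
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            merged[i // factor][j // factor] += count
    return merged


def log_transform(matrix):
    """Apply log(1 + x) so that empty bins stay at zero and large counts are compressed."""
    return [[math.log1p(count) for count in row] for row in matrix]


if __name__ == "__main__":
    # Toy symmetric 4 x 4 contact matrix (invented counts), merged in 2 x 2 blocks.
    toy = [
        [1, 2, 0, 0],
        [2, 1, 0, 1],
        [0, 0, 3, 1],
        [0, 1, 1, 3],
    ]
    coarse = merge_bins(toy, 2)
    print(coarse)
    print(log_transform(coarse))
```

Summing counts before normalization keeps the merged matrix in raw contact units, so bias correction can then be applied at each resolution separately.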

Genome versions used

We used multiple reference genomes and assemblies. The primary reference was T2T-CHM13v2.0. We also incorporated the recent diploid T2T-HG002v1.1 genome and 72 samples from the Human Pangenome Reference Consortium (HPRC). The HPRC samples were assembled using Verkko (v.2.1)23 from a combination of sequencing technologies for each sample: PacBio High-Fidelity (HiFi) reads and Oxford Nanopore Technology (ONT) long reads were used for assembly, and short Illumina reads from trios were primarily used for phasing. In cases where trio information was unavailable, Hi-C reads were used for phasing instead.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.