cuteFC: regenotyping structural variants through an accurate and efficient force-calling method
Genome Biology volume 26, Article number: 166 (2025)
Abstract
Long-read sequencing technologies have great potential for the comprehensive discovery of structural variations (SVs). However, accurate genotype assignment for SVs remains challenging due to unavoidable sequencing errors, limited coverage, and the complexity of SVs. Herein, we propose cuteFC, which employs self-adaptive clustering along with a multiallele-aware clustering to achieve accurate SV regenotyping through a force-calling approach. cuteFC also applies a Genome Position Scanner algorithm to improve its application efficiency. Benchmarking evaluations demonstrate that cuteFC outperforms state-of-the-art methods with 2–5% higher F1 scores and constructs a higher-quality genomic atlas with minimal computational resources. cuteFC is available at https://github.com/Meltpinkg/cuteFC and https://zenodo.org/records/14671406.
Background
Structural variations (SVs) are genomic mutations no less than 50 base pairs (bp) in length, encompassing insertions, deletions, inversions, duplications, translocations, and complex variants [1]. The broad range of alterations induced by SVs affects numerous nucleotides and is strongly associated with molecular and phenotypic consequences [2] and clinical diseases [3,4,5] that greatly influence evolution [6,7,8] and population structure [9, 10]. Previous population studies, including genome-wide association studies (GWASs), have focused predominantly on the exhaustive characterization of single-nucleotide variants (SNVs) and short insertions/deletions (indels), overlooking the significant heritability of SVs [11]. Recently, long-read sequencing technologies have led to new developments in SV detection and analysis in an unprecedented fashion, which has facilitated the discovery of full-spectrum SVs in individual genomes [12,13,14,15,16,17,18,19], thereby providing an avenue for discovering population-based SVs through large-scale cohort sequencing [20,21,22].
Currently, joint variant calling is widely used to construct population-based genetic maps, allowing for simultaneous consideration of all samples and generating high-quality population-scale variation callsets, which can provide the overall variation distribution among a population cohort. GATK [23] has developed a comprehensive procedure for the joint calling of SNVs and indels, but the joint calling of SVs is still in progress [24, 25]. One approach to joint SV calling involves combining individuals through initial calling and merging to produce a fully genotyped population callset [26]. However, this approach is restrictive, as it only accepts callsets generated by the same method. With the rapidly growing number of large-scale genetic studies, there is an urgent need to integrate callsets previously generated from diverse platforms. Therefore, a workflow consisting of SV calling, merging, regenotyping, and remerging is more universally applicable and has been widely utilized [20, 27], as it allows for the direct integration of various callsets without re-analysis. Moreover, considering the unavoidable errors that arise during data processing, it is beneficial to introduce an ensemble of different analysis tools and to capitalize on their respective strengths [28]. The workflow also facilitates the simultaneous and flexible use of various analysis tools, thereby enhancing the scalability and quality of the population callsets.
In this workflow of joint SV calling, one of the key steps is the regenotyping process through force-calling methods. Force calling aims at regenotyping all the target SV sites using individual sequencing data. Since population-scale SV callsets may be updated as population sizes increase or sequencing technologies develop, it is essential to regenerate genotypes for each sample at all population-scale SV sites. Regenotyping enables the refinement or determination of the original SV zygosities, which may have been distorted or lost due to insufficient coverage, thus achieving accurate allele frequency distributions in the population, which is critical for downstream analysis. Compared with short-read sequencing, long-read sequencing methods, such as Pacific Biosciences (PacBio) [29] and Oxford Nanopore Technologies (ONT) [30], can generate longer reads that cover most regions of variation [31, 32], which benefits the analysis of SVs. However, non-negligible sequencing errors and high costs restrict their wide application in SV genetic studies. The varied sizes and types of SVs, which are represented by different sequencing signatures, also make it difficult to obtain reliable genotypes for SVs across a population group.
To address these issues, several force-calling methods based on long-read sequencing have recently been developed for accurate SV regenotyping. SVJedi [33] selects informative alignments to quantify the presence of SV alleles. Sniffles1 [34] constructs a self-balancing tree to compute the read fraction and determine the genotype according to variant allele frequency. Its successor, Sniffles2 [26], adopts a three-phase clustering process for improved genotyping. cuteSV [15] trims its clustering-and-refinement algorithms and extends them into a force-calling module. However, SVJedi still lacks high sensitivity and accepts only the original reads as input, requiring a large amount of time to accomplish read alignment. Sniffles1 focuses on SV detection and makes mistakes in determining genotypes for SVs. Although its performance has been considerably enhanced in Sniffles2, the genotype determination of highly similar SVs, especially in multiallele genomic regions, is still limited. cuteSV employs a workflow similar to that used in its SV detection module, and its unoptimized implementation fails to effectively utilize prior knowledge. In general, these force-calling methods have drawbacks in determining the correct genotype for SVs. These inaccuracies and inconveniences remain bottlenecks in the accurate determination of SV genotypes; they hamper population SV joint calling and make it difficult to construct highly resolved population-based SV genetic maps.
Here, we introduce cuteFC, an accurate and efficient regenotyping method, which aims to determine the zygosities of all target SVs by precisely recognizing the reads with the target variation and those without the variation. According to the polymorphism, cuteFC automatically selects self-adaptive clustering or multiallele-aware clustering to identify the alternative SV signatures. The zygosities are obtained by calculating the likelihood of the read distribution for each candidate. Additionally, a Genome Position Scanner (GPS) algorithm was designed to enable cuteFC to achieve linear time complexity in read statistics and to speed up the overall force-calling process. The experiments were implemented at both the individual and cohort levels. The results indicate that our regenotyping method, cuteFC, has outstanding and reliable performance compared to comparable state-of-the-art methods, which suggests that it will be useful in obtaining accurate allele frequencies of homologous SVs for further population genetic measurement and estimation.
Results
Overview of cuteFC
cuteFC takes the individual read alignments (sorted BAM), reference genome, and initial callsets (VCF) as input to achieve the genotype force calling of all target SVs in the individual. The four major steps in cuteFC for accurate and efficient force calling are as follows:
- Step 1: Extract SV signature information and raw read coordinates from the individual sorted BAM file and extract target SVs from the initial callsets.
- Step 2: Identify the alternative alleles via signature spatial marking and clustering. Target SVs are defined as polymorphic SVs when they occur at the same locus. Depending on whether the SVs are polymorphic, self-adaptive clustering or multiallele-aware clustering is automatically selected to identify the signatures that correspond to the target SVs.
- Step 3: Identify the reference alleles by subtracting reads containing alternative alleles from the total reads. The GPS algorithm is used to calculate the total number of reads spanning or overlapping the target SVs (Additional file 1: Fig. S1). In this algorithm, an imaginary scanning line (termed the Genome Scanner) sweeps through the genome and records the reads and SVs that overlap with it in order to compile the read distribution.
- Step 4: Assign the genotype for each target SV using maximum likelihood estimation.
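As a rough illustration of Step 4, the sketch below picks the most likely genotype from the numbers of alternative-allele and reference-allele reads using a simple binomial-style likelihood. The function name `assign_genotype` and the per-read error rate `err` are assumptions made for illustration only; the exact likelihood model used by cuteFC is the one described in the Methods section.

```python
import math


def assign_genotype(n_alt: int, n_ref: int, err: float = 0.05) -> str:
    """Return the genotype with the highest log10-likelihood.

    Assumed model (illustrative, not taken from the paper):
      0/0 -> each read supports the alternative allele with probability err
      0/1 -> each read supports either allele with probability 0.5
      1/1 -> each read supports the reference allele with probability err
    """
    if n_alt + n_ref == 0:
        return "./."  # no reads cover the site, so no genotype can be called

    log_likelihood = {
        "0/0": n_ref * math.log10(1 - err) + n_alt * math.log10(err),
        "0/1": (n_ref + n_alt) * math.log10(0.5),
        "1/1": n_ref * math.log10(err) + n_alt * math.log10(1 - err),
    }
    # Maximum likelihood estimate: the genotype that best explains the counts.
    return max(log_likelihood, key=log_likelihood.get)


if __name__ == "__main__":
    for alt, ref in [(0, 20), (10, 10), (20, 0)]:
        print(alt, ref, assign_genotype(alt, ref))
```

Working in log space avoids numerical underflow when many reads cover a site.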
A schematic illustration and a representative example are shown in Fig. 1 and Additional file 1: Fig. S2, and additional details are provided in the “Methods” and “Supplementary Methods” sections.
Fig. 1. Schematic illustration of the cuteFC approach. Step 1: Various types of SV signatures, raw read coordinates, and target SVs were extracted from the sample BAM file and target VCF file. Step 2: The SV signatures were heuristically clustered to identify the signatures that corresponded to the target SV (termed alternative allele signatures). cuteFC first marks all the SV signatures around the target SV and subsequently clusters and refines the marked signatures to identify alternative allele signatures. Self-adaptive clustering and K-means-based clustering are utilized in various genomic regions to accurately identify alternative alleles. Step 3: All reads around the target SV were obtained using the Genome Position Scanner (GPS) algorithm. The reads containing alternative allele signatures were excluded from the total reads to identify the reference allele reads. Step 4: The genotype for each target SV was reassigned through maximum likelihood estimation. We used “0” to represent the reference allele and “1” to represent the alternative allele. Therefore, “1/1” represents homozygous alternative alleles, “0/1” represents heterozygous alternative alleles, and “0/0” represents homozygous reference alleles.
The key advantage of cuteFC is its precise statistics of alignments to distinguish the reads with the target variation from those without the variation. This approach focuses on three bottleneck issues in regenotyping and large-cohort joint variant calling.
First, considering the noise and errors in both individual sequencing and read mapping, the occurrence of SVs in reads may be irregularly altered, which introduces challenges in recognizing reads with true target SVs. In Step 2, cuteFC first integrates and refines the alignment signatures to comprehensively call back the potential SV signatures on alignments and then implements a self-adaptive clustering strategy to better distinguish between noisy alignments and variant-related alignments.
Second, there are many multiallele regions in the genome where many polymorphic SVs can be found. The polymorphic SVs in the same multiallele regions share the same coordinates but have varying lengths in different haplotypes. The high similarity of these SVs increases the difficulty of accurate genotyping. To solve this issue, in Step 2, cuteFC identifies these genomic regions and introduces a multiallele-aware clustering for these polymorphic SV signatures, to achieve precise signature clustering via a K-means-based strategy.
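To make the idea of K-means-based clustering at a multiallele locus concrete, here is a minimal sketch that splits SV signature lengths observed at one position into `k` groups, one per candidate allele. Everything in it (the function name `kmeans_1d`, clustering on length alone, and the evenly spread initial centers) is an assumption for illustration and is not taken from the cuteFC implementation.

```python
def kmeans_1d(values: list[float], k: int = 2, max_iter: int = 50):
    """Cluster 1-D values (e.g. insertion lengths at one locus) with Lloyd's algorithm."""
    if not values or k < 1:
        raise ValueError("need at least one value and k >= 1")

    lo, hi = min(values), max(values)
    # Deterministic initialisation: spread the centers evenly between min and max.
    if k == 1:
        centers = [sum(values) / len(values)]
    else:
        centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    groups: list[list[float]] = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: recompute each center; keep the old one for an empty group.
        updated = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if updated == centers:
            break  # converged
        centers = updated
    return centers, groups


if __name__ == "__main__":
    # Hypothetical insertion lengths from reads carrying two different alleles.
    lengths = [100, 102, 98, 101, 300, 305, 298]
    print(kmeans_1d(lengths, k=2))
```

Each resulting group can then be matched to the target SV of the closest length, so that reads supporting one allele are not counted toward the other.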
Third, with the development of long-read sequencing, the scale of large cohort studies has been increasing continuously. Since the cumulative number of population-based SVs increases with the scale of the cohort, massive amounts of data will need to be processed and analyzed, placing extreme pressure on computational and storage resources in population joint calling. To reduce the resources and analysis time required, in Step 3, cuteFC uses a GPS algorithm to speed up the read statistics operation, which accounts for the largest proportion of runtime in regenotyping. The GPS algorithm enables cuteFC to achieve linear time complexity in read statistics and makes it suitable for large-scale cohort studies.
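The scanning-line idea can be sketched as a classic sweep over sorted coordinate events. The code below is only an illustration of that general technique under simplifying assumptions (each target SV is reduced to a single position, reads are half-open intervals with start < end, and all identifiers are strings); the names `scan_genome`, `reads`, and `sv_positions` are hypothetical and do not come from cuteFC.

```python
END, START, QUERY = 0, 1, 2  # at equal positions: close reads, open reads, then query


def scan_genome(reads: dict[str, tuple[int, int]],
                sv_positions: dict[str, int]) -> dict[str, set[str]]:
    """For each SV position, return the IDs of reads covering it.

    reads maps read ID -> (start, end), half-open, with start < end.
    sv_positions maps SV ID -> genomic position on the same chromosome.
    """
    events = []
    for read_id, (start, end) in reads.items():
        events.append((start, START, read_id))
        events.append((end, END, read_id))
    for sv_id, pos in sv_positions.items():
        events.append((pos, QUERY, sv_id))
    events.sort()  # a coordinate-sorted BAM would provide reads in order already

    active: set[str] = set()          # reads currently crossed by the scanning line
    covering: dict[str, set[str]] = {}
    for _pos, kind, ident in events:
        if kind == END:
            active.discard(ident)
        elif kind == START:
            active.add(ident)
        else:
            covering[ident] = set(active)  # snapshot of reads over this SV
    return covering


if __name__ == "__main__":
    reads = {"r1": (0, 100), "r2": (50, 150), "r3": (120, 200)}
    svs = {"sv1": 60, "sv2": 130}
    total = scan_genome(reads, svs)
    alt_reads = {"sv1": {"r2"}, "sv2": set()}  # hypothetical output of Step 2
    ref_reads = {sv: total[sv] - alt_reads[sv] for sv in svs}
    print(ref_reads)
```

After the initial sort, each read and each SV is visited a constant number of times, so the scan itself is linear in the number of events plus the size of the reported read sets. Subtracting the alternative-allele reads from each set gives the reference-allele reads, as in Step 3.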
These designs enable the precise distinction of reads around each target SV, yielding the correct likelihood estimation of zygosities. This approach efficiently helps cuteFC to achieve outstanding regenotyping performance, which has been demonstrated on several datasets below.
Benchmarks of regenotyping performance on simulation datasets
To evaluate the regenotyping performance of cuteFC, we first produced a simulated donor genome and used VISOR (v1.1.2) [35] to generate three sets of 30 × simulated sequencing data for different sequencing technologies (i.e., PacBio HiFi, PacBio CLR, and ONT; Additional file 2: Table S1). Then, we applied cuteFC (v 1.0.1) and four other state-of-the-art methods (i.e., Sniffles1 (v 1.0.12), Sniffles2 (v 2.0.7), SVJedi (v 1.1.6), and cuteSV (v 1.0.13)) to regenotype the simulated individuals. The benchmarking results, shown in Fig. 2A, indicate that cuteFC achieved the best genotyping performance for data produced by each sequencing technology (F1 score: 92.81% for HiFi, 93.15% for CLR, 93.09% for ONT), followed by cuteSV (F1 score: 92.70% for HiFi, 90.41% for CLR, 89.65% for ONT), and Sniffles2 (F1 score: 90.37% for HiFi, 89.11% for CLR, 88.45% for ONT), while SVJedi (F1 score: 76.12% for HiFi, 80.48% for CLR, 80.70% for ONT) and Sniffles1 (F1 score: 81.33% for HiFi, 80.44% for CLR, 79.71% for ONT) performed worse. The outstanding performance of cuteFC can primarily be attributed to the dynamic self-adaptive clustering of SV signatures, which yields the highest scores, with over 97% precision and over 88% recall; these are on average 1.74% and 2.37% higher, respectively, than those of the second-best tool, cuteSV (Additional file 2: Table S2). Then, we randomly downsampled the sequencing datasets and observed that the performance of each method decreased with decreasing coverage (Fig. 2A, Additional file 2: Table S2). The decrease in performance became steeper at coverage below 10 ×, indicating that more than 10 × coverage is advisable for the SV regenotyping task. Despite this, cuteFC consistently outperformed the other methods in genotype inference, even when the coverage dropped to 5 × (F1 score: 83.41% for HiFi, 82.52% for CLR, 83.62% for ONT).
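For readers less familiar with the metrics reported here, the following sketch shows how precision, recall, and F1 follow from counts of true positives, false positives, and false negatives. How a call is counted as a true positive (for example, requiring both the SV and its genotype to match the truth) is an assumption in this sketch; the evaluation criteria actually used are given in the Methods section.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    print(precision_recall_f1(tp=90, fp=10, fn=30))
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a tool cannot reach a high F1 through precision alone while missing many target SVs.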
Fig. 2. Benchmarking results on simulation datasets. A The precision and recall of different tools on simulation datasets under different sequencing coverages (i.e., 30 ×, 20 ×, 10 ×, and 5 ×) and technologies (i.e., HiFi, CLR, and ONT). B Presence-F1 scores of different tools for various SV types (deletion (DEL), insertion (INS), inversion (INV), duplication (DUP), and translocation (TRA)) according to different sequencing technologies. C Genotyping-F1 scores of different tools for various SV types (deletion (DEL), insertion (INS), inversion (INV), duplication (DUP), and translocation (TRA)) according to different sequencing technologies. D The genotype discovery rates of different tools under different simulation proportions and different sequencing technologies. Darker red represents a better genotype discovery rate. The proportion “x:y” indicates the ratio of simulated reads for alternative alleles to simulated reads for reference alleles. Specifically, the proportions 0:10, 1:9, and 2:8 correspond to homozygous reference alleles, which should be genotyped as “0/0”; proportions 4:6, 5:5, and 6:4 correspond to heterozygous alternative alleles, which should be genotyped as “0/1”; and proportions 8:2, 9:1, and 10:0 correspond to homozygous alternative alleles, which should be genotyped as “1/1”.
Next, we classified SVs by their type, and the evaluations across various SV types demonstrated that cuteFC consistently achieved superior performance. In terms of SV presence (Fig. 2B and Additional file 2: Table S3), cuteFC and Sniffles2 outperformed the other methods. However, for SV genotyping (Fig. 2C and Additional file 2: Table S3), the performance of all tools declined across various SV types (by an average of 15%), with cuteFC exhibiting the slowest overall decline (approximately 7%). For insertions and deletions, which were the most abundant SV types, cuteFC achieved higher F1 scores (> 93% for insertions and > 92% for deletions) compared to the other four methods. For inversions, cuteFC, cuteSV, Sniffles1, and SVJedi demonstrated similarly strong performance (with F1 scores around 95%), while Sniffles2 performed relatively worse, achieving an F1 score of approximately 70%. For duplications, the F1 scores showed the most significant reduction after accounting for zygosity (17% for cuteFC and > 33% for other methods). This decline is likely attributable to the inherent challenges in both precisely identifying read distributions around genomic regions with extensive base and coverage changes and accurately assigning different allele copies to their correct haplotypes. Notably, cuteFC significantly outperformed the other methods, achieving F1 scores of approximately 70%, nearly twice as high as those of the second-best tool. For translocations, the F1 scores of all methods were lower compared to the SV types mentioned above, highlighting the challenges in genotyping these complicated SVs at the chromosome level. Furthermore, when focusing on large insertions and translocations (10 ~ 100 kbp), we observed that cuteFC consistently achieved the highest genotype F1 scores (approximately 95% for large insertions and 68% for large translocations; Additional file 2: Table S4).
We further evaluated the performance for alternative alleles and reference alleles at different proportions of simulation depth. cuteFC achieved the best overall genotype consistency under various sequencing technologies, followed by cuteSV and Sniffles2, while SVJedi almost failed to genotype most homozygous SVs. For different proportions, most regenotyping methods tended to perform better when the proportion was close to 0, 0.5, or 1 (representing perfect zygosities of “0/0”, “0/1”, and “1/1”, respectively), but cuteFC still possessed a greater ability to accurately determine the genotypes for SVs without perfect allele proportions (Fig. 2D, Additional file 2: Table S5). Overall, these findings suggest that cuteFC exhibits outstanding and stable regenotyping performance, regardless of the sequencing dataset and SV type.
Benchmarks of regenotyping performance on the HG002 datasets
Assessments of simulations still have limitations because of the relatively restricted complexity of the simulated sequencing data. To overcome this, we then evaluated the regenotyping performance on the well-known human dataset HG002 from the Ashkenazim trio-family group based on the Genome in a Bottle (GIAB) ground truth set (SV v0.6) from the National Institute of Standards and Technology (NIST) [36]. The target SVs were identified from the “High confidence variant” truth set, and alignments from four different sequencing technologies (i.e., PacBio HiFi, PacBio CLR, ONT, Ultra-Long ONT (ULONT); Additional file 2: Table S1) were used to regenotype the target SVs. The benchmarking results (Fig. 3A, Additional file 2: Table S6) show that cuteFC achieved the highest recall and precision among the five methods (with the exceptions of a precision one-thousandth lower than that of SVJedi on the CLR dataset and a recall two-thousandths lower than that of cuteSV on the HiFi dataset), thereby attaining the highest F1 scores across the diverse sequencing technologies (95.77% for HiFi, 92.85% for CLR, 95.31% for ONT, and 94.36% for ULONT); the values observed for the second-best methods, cuteSV and Sniffles2, were approximately 2% lower, and the values observed for the other two methods were more than 10% lower.
Fig. 3. Benchmarking results on the HG002 datasets. A The precision and recall of different tools on the HG002 datasets under different sequencing coverages and technologies (i.e., HiFi, CLR, ONT, and ULONT). The ground truth data were obtained from the Genome in a Bottle (GIAB) ground truth set (SV v0.6) from the National Institute of Standards and Technology (NIST). B The precision and recall of different tools on the HG002 datasets under different sequencing technologies in various genomic regions. C The precision and recall of different tools on the HG002 datasets under different sequencing coverages and technologies. The ground truth data were obtained from the GIAB Challenging Medically Relevant Gene Benchmark (CMRG) v1.00. D An example of two heterozygous insertions only being detected by cuteFC on 30 × PacBio HiFi HG002 data.
Next, we divided the predictions according to zygosity (0/0 for the homozygous reference allele, 0/1 for the heterozygous alternative allele, and 1/1 for the homozygous alternative allele) and constructed confusion matrices for these methods (Additional file 1: Fig. S3). In the matrices, the rows and columns represented the SV genotype given in the ground truth dataset and that determined by the regenotyping methods, respectively. Thus, the SVs located in the diagonal from the top-left to the bottom-right were considered to have a concordant genotype to the ground truth. The heatmap illustrated that all methods other than Sniffles1 had concentrated SVs distributed on the diagonal, while Sniffles1 was imbalanced for genotyping too many homozygous alternative alleles, especially in the PacBio datasets. A comparison of the methods revealed that in general, cuteFC had the largest number of SVs on the diagonal; that is, regardless of the zygosity, cuteFC achieved the best genotyping results.
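The confusion-matrix layout described above can be sketched as follows: rows are ground-truth genotypes, columns are genotypes assigned by a regenotyping method, and concordance is the fraction of SVs on the diagonal. The function names and the toy genotype lists are illustrative assumptions, not the evaluation scripts used in the study.

```python
from collections import Counter

GENOTYPES = ("0/0", "0/1", "1/1")


def confusion_matrix(truth: list[str], called: list[str]) -> list[list[int]]:
    """Rows follow the ground-truth genotype, columns the called genotype."""
    if len(truth) != len(called):
        raise ValueError("truth and called must describe the same SVs")
    counts = Counter(zip(truth, called))
    return [[counts[(t, c)] for c in GENOTYPES] for t in GENOTYPES]


def concordance(matrix: list[list[int]]) -> float:
    """Fraction of SVs on the diagonal, i.e. genotyped consistently with the truth."""
    total = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    return diagonal / total if total else 0.0


if __name__ == "__main__":
    truth = ["0/0", "0/1", "1/1", "0/1"]
    called = ["0/0", "0/1", "0/1", "0/1"]
    m = confusion_matrix(truth, called)
    print(m, concordance(m))
```

Off-diagonal cells show the direction of the errors; for example, a large count in the row for 0/1 and the column for 1/1 would indicate heterozygous SVs being overcalled as homozygous.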
We also downsampled the original 30 × HG002 datasets to 20 ×, 10 ×, and 5 × and performed regenotyping. As in the simulation experiments, cuteFC achieved the overall best performance under the various coverage conditions (Fig. 3A). Both the precision and recall rates decreased in low-coverage sequencing datasets; the recall rates decreased approximately twice as fast as the precision rates, probably because SV signatures are more likely to be missed at lower sequencing coverage, which reduces sensitivity. Notably, SVJedi achieved extremely high precision under low-coverage conditions; however, this high precision comes at the cost of the total number of predictions: only approximately one-third of the target SVs were assigned valid genotypes by SVJedi, indicating insufficient sensitivity. A comparison of the four sequencing technologies revealed that the performance on the HiFi and ONT datasets exhibited the slowest decrease (14.63% and 14.17%), followed by that on the ULONT dataset (16.26%); in the CLR dataset, the performance decreased substantially (19.93%).
We subsequently narrowed our view to include some typical genomic regions, such as repetitive regions and functional regions. Repetitive genomic regions, such as short tandem repeats (STRs), variable number tandem repeats (VNTRs), and segmental duplications, harbor large numbers of SVs. The high degree of repetition in these areas poses additional challenges for accurate regenotyping. In low-complexity regions, including homopolymers, STRs, and VNTRs, all the methods exhibited reduced performance; however, cuteFC consistently outperformed the other methods, with F1 scores decreasing by only 0.5 ~ 2.5% for each sequencing technology, whereas the second-best methods, Sniffles2 and cuteSV, both exhibited a 2 ~ 4% decrease (Fig. 3B, Additional file 2: Table S7). The decrease in performance was more pronounced for segmental duplications, highlighting the inherent difficulties in achieving comprehensive resolution within these complex regions. When comparing different sequencing technologies, the regenotyping performance on the PacBio datasets was generally superior to that on the ONT/ULONT datasets, indicating the greater accuracy of PacBio sequencing in these intricate genomic landscapes. In mapping challenging and functional genomic regions, cuteFC demonstrated the best overall regenotyping performance, followed closely by cuteSV, with F1 scores averaging 6% higher than those of Sniffles2, the third-best performer. These evaluations underscore the proficiency of cuteFC in regenotyping SVs not only in straightforward regions but also in complex and repetitive areas of the genome.
To further assess the performance of the regenotyping methods in recently reported complex and medically significant genomic regions, we employed a more challenging truth set, the GIAB Challenging Medically Relevant Genes (CMRG) Benchmark v1.00 [37]. This dataset encompasses more challenging regions and polymorphic SVs, which complicates genotype determination. The multiallele-aware clustering incorporated in cuteFC allows for the accurate identification of genotypes, especially within these complex genomic regions with polymorphic SVs. Consequently, cuteFC attained the highest F1 scores in various benchmark settings (Fig. 3C, Additional file 2: Table S8). To examine the effectiveness of the multiallele-aware clustering, we selected the regions with polymorphic SVs and benchmarked them separately. The benchmarks indicate that cuteFC achieved the highest F1 scores in these regions (Additional file 2: Table S9). Figure 3D illustrates an example that shows the advantage of the multiallele-aware clustering in cuteFC. Two allelic insertions occurred at chr1: 204,164,352, with a different insertion allele on each haplotype. cuteFC distinguished the insertion alleles by K-means-based clustering and was thus able to identify the two heterozygous polymorphic SVs. Notably, cuteFC significantly outperformed competing methods on the HiFi and ONT datasets, achieving F1 scores more than 5% higher than those of the second-best method, Sniffles2. For these two datasets, the F1 scores of cuteFC were 88.06% and 89.38%, respectively, which surpassed those on the other datasets by more than 8%. These enhancements can be attributed to superior sequencing quality, such as increased base identity in the HiFi sequencing, and the latest nanopore design and base-calling technologies used in the ONT sequencing. These advancements have facilitated more effective exploration of challenging genomic regions, enabling cuteFC to achieve exceptional performance under these conditions.
Since the above evaluations focused on regenotyping the ground truth SVs for the HG002 individual, we extended our analysis to population-scale SVs. Specifically, we regenotyped HG002 at all SV sites derived from a population. For this, we extracted all SVs from the Human Genome Structural Variation Consortium (HGSVC, including the HG002 sample)[21] and used the four sequencing datasets mentioned earlier to regenotype HG002. Benchmarks were then performed using both the NIST and CMRG ground truths. The results demonstrated that under these conditions, cuteFC still achieved the best performance, with F1 scores approximately 3%/5% and 5%/10% higher than Sniffles2 and cuteSV on the NIST and CMRG ground truth datasets, respectively (Additional file 2: Table S10).
Benchmarks of regenotyping performance on a large-scale Chinese cohort
To evaluate the regenotyping performance of these methods in the joint calling of a large-scale cohort, we applied cuteFC, cuteSV, Sniffles1, and Sniffles2 to a real population group consisting of 100 Chinese individuals. SVJedi was excluded because of its excessive running time for large numbers of samples. The individuals were sequenced using the ONT platform at approximately 15 × (Additional file 2: Table S11), and the target SVs were identified using a population SV calling pipeline that included SV discovery and integration (see Methods for more details). The four tools were subsequently applied to regenotype the target population-scale SVs, and the variant allele frequency (VAF) was generated for evaluation using BCFtools [38]. We first assessed the Hardy–Weinberg equilibrium (HWE) and excess heterozygosity (ExcHet) scores of each population-level SV, which indirectly validated the accuracy of the inferred individual genotypes. Following common population genetics conventions, a variant site was filtered when its test P-value was less than 10⁻⁶ or the percentage of missing alleles exceeded 5%. Under these criteria, cuteFC reported the overall lowest percentage of low-quality SVs requiring filtering (i.e., 5.07% in HWE and 2.81% in ExcHet; Fig. 4A, B; Additional file 2: Tables S12 to S15). The violin and box plots illustrate that cuteFC reported more SVs with high test scores overall, which confirms the reliable distribution of population-level SVs. Since cuteFC filtered the lowest number of low-quality SVs, most high-quality SVs were retained for downstream studies (Fig. 4C). For the various remaining SV types, cuteFC identified the greatest number of SVs, including the more frequent insertions and deletions and other infrequent SVs (e.g., inversions, duplications, and translocations; Fig. 4D). To ensure a robust benchmark, we also adjusted the threshold of the test P-value used to filter low-quality SVs and performed gradient benchmarks (Additional file 2: Table S16).
For higher P-value thresholds (e.g., 0.1 and 0.01), over 10% of the total SVs for each tool were filtered, resulting in the loss of many potentially valid SVs. Conversely, for lower P-value thresholds (e.g., 10⁻³ to 10⁻⁶), a larger number of SVs were retained. Among the tools tested, cuteFC discarded the smallest proportion of low-quality SVs, ranging from 9.90% to 5.13%. Next, we analyzed the remaining SVs through their heterozygosity distribution together with their VAFs (Fig. 4E), and the distributions all exhibited good HWE. The R² value of the fitting curve obtained by Sniffles1 reached the highest value (94.05%), and those obtained by cuteFC and Sniffles2 both exceeded 90% (92.26% and 91.77%, respectively), indicating that the distributions fit the fitted curve well. cuteFC achieved the highest SV discovery rate (DR) (91.69%), followed by cuteSV (89.91%), while Sniffles1 identified the fewest SVs, with a 56.51% discovery rate. Therefore, considering both of the above aspects, the highest harmonic mean score was achieved by cuteFC, indicating the best overall agreement with the theoretical HWE distribution.
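The relationship between heterozygosity and VAF under HWE, and the harmonic mean used to combine R² with DR, can be written down in a few lines. This is a minimal sketch assuming a biallelic site and simple genotype counts; the function names are hypothetical, and the HWE and ExcHet test statistics in the study were obtained with BCFtools rather than with code like this.

```python
def allele_stats(n_hom_ref: int, n_het: int, n_hom_alt: int) -> tuple[float, float, float]:
    """Return (VAF, observed heterozygosity, HWE-expected heterozygosity) for one SV."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotyped samples at this site")
    vaf = (n_het + 2 * n_hom_alt) / (2 * n)   # alternative alleles / all alleles
    observed_het = n_het / n
    expected_het = 2 * vaf * (1 - vaf)        # 2pq under Hardy-Weinberg equilibrium
    return vaf, observed_het, expected_het


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two rates, e.g. R^2 and discovery rate."""
    return 2 * a * b / (a + b) if a + b > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical genotype counts among 100 samples.
    print(allele_stats(n_hom_ref=25, n_het=50, n_hom_alt=25))
    # R^2 and DR reported for cuteFC in the text above.
    print(harmonic_mean(0.9226, 0.9169))
```

A site whose observed heterozygosity lies far from 2p(1 − p) at its VAF is a candidate for systematic genotyping error, which is why agreement with this curve serves as an indirect check on individual genotypes.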
Fig. 4. Benchmarking results of population statistics on the Chinese cohort. A The distribution of the Hardy–Weinberg equilibrium (HWE) test scores of different tools in the population. The SVs with HWE test scores < 10⁻⁶ or the percentage of missing alleles > 5% were considered low quality. B The distribution of the test scores of excess heterozygosity of different tools in the population. SVs with ExcHet test scores < 10⁻⁶ or the percentage of missing alleles > 5% were considered low quality. C The counts of low-quality SVs that different tools filtered and the counts of the remaining SVs in the population SV callsets. D The counts of different types of SVs in population SV callsets after filtering low-quality SVs. E The distribution of heterozygosity and variant allele frequency among 100 samples on chromosome 1. R² is the coefficient of determination, which represents the goodness of fit. DR is the discovery rate, which represents the recall ability of different tools. H.M is the harmonic mean of R² and DR.
We further evaluated the reliability of the VAF distribution in these 100 Chinese individuals by comparing it with that of a well-studied cohort of 32 international individuals [21]. Considering the substantial overlap of SVs across population groups, the SV distributions from different datasets can be horizontally compared [39, 40]. The selected worldwide cohort was generated from genome assemblies and was of high quality and reliability; thus, we used it as a reference for assessing the consistency of VAFs for those SVs sharing loci between our 100 Chinese individual dataset and the reference dataset (termed shared SVs). Figure 5A indicates that most shared SVs had close and consistent VAFs, and few SVs had a large gap in VAF, which is consistent with the original study showing the wide sharing of mutual SVs in the human population. After removing the low-quality SVs as described above, the greatest number of consistent SVs (within a 5% difference in VAF) was obtained using cuteFC (17,063 for cuteFC, 13,912 for cuteSV, 11,210 for Sniffles1, and 14,432 for Sniffles2; Additional file 2: Table S17). Considering both the consistency of the shared SVs (CR) and the ratios of shared SVs (SR), cuteFC achieved the highest harmonic mean, which indicates its consistency with the worldwide cohort (Fig. 5A). Next, we selected inconsistent SVs (those with a VAF deviation greater than 50%) and compared the VAFs that cuteFC obtained for these SVs with those obtained by other methods, in order to investigate whether the large VAF deviations arose from differences between the cohorts or from incorrect genotyping. Notably, Sniffles1 and Sniffles2 were included in this evaluation due to their distinct methodologies compared to cuteFC, which may provide additional insights into the observed discrepancies. The peak near 0.05 in Fig. 5B indicates that most inconsistent SVs reported by cuteFC had similar VAFs (deviation < 0.05) when compared with Sniffles1 and Sniffles2. 
This trend indirectly supports the reliability of regenotyping performed by cuteFC.
Benchmarking results of variant allele frequency concordance and validation on the Chinese cohort. A The variant allele frequency (VAF) discordance between 32 international individuals and our Chinese population SV callsets (that is, the deviation in the VAF of the SVs that are shared between the worldwide cohort and our Chinese callsets). CR is the consistency rate, which represents the proportion of SVs with a VAF deviation smaller than 0.2. SR is the sharing rate, which represents the proportion of SVs that were shared between the worldwide cohort and our Chinese callsets. H.M is the harmonic mean of CR and SR. B The VAF deviation between cuteFC and other methods of those inconsistent SVs reported by cuteFC (for which the VAF deviation with the worldwide cohort is greater than 0.5). C Validation of the discovery of the singletons and doubletons of two selected individuals (i.e., Sample D99 and D100) based on HiFi assembly. The lighter columns indicate SVs with consistent presence but false genotypes in the assembly, while the darker columns indicate SVs with both consistent presence and true genotypes. D An example of a homozygous deletion and a non-existent deletion on Sample D100 ONT data only being genotyped correctly by cuteFC
Furthermore, the analysis of singletons and doubletons in groups is highly valuable for determining the characteristics of individual SVs and analyzing rare diseases; however, these SVs are always mixed with false-positive SVs, and potential errors increase the difficulty of detecting these low-frequency SVs. To specifically confirm the false discovery rate (FDR) by distinguishing singletons and doubletons from false-positive predictions, we randomly selected two individuals (termed Samples D99 and D100) and performed PacBio HiFi sequencing and haplotype-resolved de novo assembly for validation. cuteFC identified a greater number of singletons and doubletons, which exhibited genotypes that were more consistent with the ground truth sets, thereby achieving higher sensitivities while maintaining a relatively low FDR (inconsistent singletons: 6/217, 1/171, 2/179, and 2/169; inconsistent doubletons: 16/116, 18/105, 7/97, and 17/100 for cuteFC, cuteSV, Sniffles1, and Sniffles2, respectively; Fig. 5C, Additional file 2: Table S18). We selected the singletons and doubletons that were identified correctly only by cuteFC and found that a large proportion of them (78.67%) are located in genomic repeat regions. It is worth noting that merged large-scale callsets often contain two or more similar SVs, yet an individual may actually possess only one of them. Force-calling approaches may therefore report incorrect genotypes for the SVs that are not present in the individual. As depicted in Fig. 5D, there are two population-level deletions around chr1: 964,631; however, only one of the deletions was carried by the given individual according to the de novo assembly, while the other was carried by other individuals. cuteFC applies self-adaptive clustering to distinguish the individual alleles and align them with the corresponding population sites. Thus, through accurate identification of the corresponding alleles, only cuteFC obtained correct genotypes for the target SVs, reporting “1/1” for the deletion carried by the individual and “0/0” for the other SV. These direct and indirect assessments demonstrated that cuteFC tended to yield more reasonable SV genotypes and thus to generate the most consistent population SV callsets.
Evaluation of the computational performance
Finally, we examined the computational performance on different datasets to assess the practicality of the above methods. When evaluating the elapsed time on the HG002 datasets (Fig. 6A, Additional file 2: Table S19), Sniffles2 was the fastest in a single thread regardless of the dataset, cuteFC was slightly slower, and Sniffles1 and cuteSV were much slower; SVJedi was the slowest tool, taking more than 3 h even with 16 threads, and was therefore discarded in the experiments. Notably, the speed of cuteFC, cuteSV, and Sniffles2 increased quasilinearly as more threads were used. Hence, there was no significant time discrepancy between cuteFC (i.e., HiFi: 2.95 min, CLR: 7.08 min, ONT: 2.23 min, ULONT: 5.95 min) and Sniffles2 (i.e., HiFi: 2.58 min, CLR: 8.63 min, ONT: 1.98 min, ULONT: 5.90 min) when additional CPU threads were applied (all under 16 threads). Regarding larger population cohorts, since the number of target sites increases along with the population scale, the regenotyping task requires additional computational resources. Under these conditions, the increase in speed of cuteFC becomes particularly pronounced. This is attributed to the ability of the GPS algorithm to optimize read statistics, which increases the suitability of cuteFC for regenotyping in larger cohorts. As illustrated in Fig. 6C, cuteFC was the fastest at regenotyping population SVs within the 100-individual Chinese cohort, taking only 3 min per sample (Additional file 2: Table S20).
Benchmarking results of the computational performance. A The elapsed times of different tools on the different HG002 datasets (i.e., HiFi, CLR, ONT, and ULONT) with various numbers of threads (i.e., 1, 2, 4, 8, 16). B The memory footprints of different tools on the different HG002 datasets (i.e., HiFi, CLR, ONT, and ULONT) with various numbers of threads (i.e., 1, 2, 4, 8, 16). C The elapsed times of different tools on regenotyping 100 Chinese individuals. D The memory footprints of different tools on regenotyping 100 Chinese individuals. E The elapsed times of SV detection of cuteSV (with/without the efficient designs of cuteFC) when disabling/enabling genotyping on the different HG002 datasets (i.e., HiFi, CLR, ONT, and ULONT) with 16 threads
In terms of the memory footprint, an increasing memory footprint was observed with multiple threads when using cuteFC, cuteSV, and Sniffles2. This increase was much slower for cuteFC than for cuteSV and Sniffles2, whereas a more stable memory footprint was observed using Sniffles1 (Fig. 6B). When 16 threads were used, cuteFC used about 5 GB with a stable memory footprint, which is within the capacity of most PCs. In contrast, Sniffles1 used about 10 GB on the CLR datasets, and Sniffles2 and cuteSV used 15–45 GB on various datasets. In larger population cohorts with many more target SVs, stable memory footprints were also observed at 4.84 GB on average for cuteFC (Fig. 6D, Additional file 2: Table S20). The small, stable memory footprints as well as the shorter running time demonstrate the utility and scalability of using cuteFC as a practical regenotyping method.
The efficient designs in cuteFC, particularly the GPS algorithm, have also been implemented in cuteSV, significantly accelerating its SV discovery process. We compared the elapsed time for SV detection in the HG002 datasets using cuteSV before and after integrating the GPS algorithm (Additional file 2: Table S21). The results show substantial acceleration, especially when the genotyping module is enabled (Fig. 6E). This improvement is largely attributed to the GPS algorithm, which reduces the time required for sequencing data scanning. With this enhancement, SV detection was accelerated by nearly 3 to 5 times across various sequencing technologies, enabling efficient SV calling in less than 10 min per sample.
For further details regarding the evaluations, please refer to the “Methods” section.
Discussion
With the advent of long-read sequencing technologies, constructing full-spectrum SVs at the population level has become possible, although accurate genotype assignment is challenging. Here, we introduce an accurate and efficient regenotyping method, cuteFC, which replaces the stepwise refinement clustering approach proposed by cuteSV with a novel SV signature clustering strategy and a sequencing data scanning algorithm. This innovation significantly enhances genomic assessments based on SV genotypes and variant allele frequencies, enabling applications such as population genetic measurements and further analyses. To better apply cuteFC in practice, several advantages and disadvantages still need to be explained in detail.
For regenotyping SVs via long-read sequencing, two main innovations enable cuteFC to achieve both outstanding accuracy and efficiency:
1) On the one hand, the self-adaptive clustering in cuteFC accurately distinguishes the real SV alleles from numerous potential alleles. Since the SV integration methods aim at avoiding overmerging to preserve distinct SVs [41], the population callsets will contain similar SVs existing in different individuals. When performing force calling on each individual, it is especially vital to assign correct genotypes for the SVs that are carried by the individual while ignoring non-existent SVs. cuteFC removes the unrelated alleles and retains the corresponding alleles to the target SVs, thus attaining the correct genotypes (Fig. 5D). On the other hand, the multiallele-aware clustering sensitively identifies various alleles for the same locus in each haplotype. The multiallele regions in the genome always harbor multiple SVs at the same or adjacent locus, which are difficult to distinguish due to their high similarity. cuteFC is aware of these regions and applies a K-means-based clustering to identify the corresponding alleles, which provides opportunities for regenotyping these polymorphic SVs (Fig. 3D).
2) cuteFC carefully designs its algorithms and architecture to enhance its running efficiency, especially for large-scale population groups. The GPS algorithm designed in cuteFC achieves linear time complexity in analyzing numerous sequencing reads, which significantly reduces the time spent on zygosity determination. Additionally, cuteFC has been optimized to utilize multiple CPU cores through the reconstruction of the multiprocessing pool and employs serialization and deserialization modules to accelerate disk operations, thus further minimizing time and memory costs for both SV calling and force calling.
When it comes to regenotyping large-scale population callsets, the accurate genotypes achieved by cuteFC prove to be the foundation of the population callsets construction. Also, the efficiency of cuteFC brings advantages to the regenotyping of large-scale callsets. Since the number of target sites increases with increasing population scale, the extent of acceleration brought by the linear time complexity characteristic of the GPS algorithm becomes more obvious with an increasing number of target sites. These results suggest that cuteFC is more suitable for larger-scale population studies. Under the SV joint-calling workflow, cuteFC, as a force-calling method, can be easily integrated into the workflow and achieve better joint-calling results, which contributes to the construction of high-quality, large-scale population SV callsets.
However, while cuteFC has demonstrated significant proficiency in regenotyping, there are several limitations that need to be addressed to improve its application. First, as an alignment-based SV analysis method, cuteFC lacks sequence resolution ability. Therefore, determining genotypes for nearly identical SVs (those with highly similar SV breakpoints and lengths but different sequences) is not yet possible. Second, the absence of sequence resolution also brings about challenges in regenotyping copy number variations (CNVs), which may have different copies of the duplicated sequences on different haplotypes. Although cuteFC can recognize redundant alleles, it is difficult to assign each allele to the correct haplotype (Additional file 1: Fig. S4). In general, the accurate recognition of these SVs requires haplotype assemblies, which constrains the utility of read-based methods. Third, along with other force-calling-based methods, cuteFC primarily focuses on regenotyping simple SVs in diploid genomes rather than addressing polyploid plant genomes or complex SV events. This is still a major bottleneck in the field of variant calling, and the development of new genotype estimation models is essential. We will further focus on these topics and try to solve them in the future.
Conclusions
In this article, we introduce cuteFC, an accurate and efficient regenotyping method that assigns genotypes for population-based SVs via long-read sequencing. cuteFC automatically selects between self-adaptive clustering and multiallele-aware clustering to identify alternative SV signatures accurately, and uses the GPS algorithm to collect read distribution statistics efficiently; together, these support a reliable likelihood estimation of SV genotypes. cuteFC exhibits outstanding regenotyping performance compared with that of state-of-the-art methods in various aspects, and its application in a large-scale population cohort also proves its ability to analyze high-quality population-based SVs. We anticipate that cuteFC could effectively assist in obtaining accurate allele frequencies of cohort SVs, facilitating further cutting-edge genomic studies, such as population genetic studies, estimations, and comprehensive analysis.
Methods
The force-calling regenotyping method assigns genotypes for the target SVs using aligned individual sequencing data. To achieve this, cuteFC implements self-adaptive clustering for the integration and refinement of signatures, along with a multiallele-aware module that incorporates K-means-based clustering. These designs enable cuteFC to identify alternative allele signatures in various genomic regions. Additionally, cuteFC uses a Genome Position Scanner (GPS) algorithm to efficiently accelerate the estimation of reads surrounding the target SVs. In detail, this approach contains four major steps, as described below (Fig. 1).
Extract signatures of various SV types from alignments
The signature extraction module in cuteSV was applied and further refined to comprehensively collect signature information of various types of SVs, including insertions, deletions, inversions, duplications, and translocations. The start and end coordinates of the raw reads and the target SVs are also extracted from the input and recorded as serialization files on the disk.
Mark spatially similar signatures and select the clustering strategy
For each target SV, cuteFC searches all previously extracted signatures to identify those alternative allele signatures that correspond to the target SV. First, a binary search is implemented on the ordered signatures to locate the signature, denoted as \({Sig}_{flag}\), with coordinates closest to the target SV. Then, cuteFC recursively expands its search to mark nearby signatures that exhibit high breakpoint similarity. For a marked signature \({Sig}_{NN}\), its nearest neighborhood signature \({Sig}_{NN}{\prime}\) is marked when it satisfies the following criterion:
where \({BP}_{i}\) represents the breakpoint coordinate of \({Sig}_{i}\) and \(max\_cluster\_bias\) represents a threshold that varies according to SV type and sequencing platform (more details in Supplementary Commands 3.1). Notably, for translocations, an additional condition is required to ensure the breakpoint similarity of the collected signatures; that is, the signatures should share the same transferred chromosome ID.
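The marking step can be illustrated with a minimal Python sketch. It assumes that the criterion is an absolute breakpoint difference of at most \(max\_cluster\_bias\) between neighboring signatures, and it uses iterative loops in place of the recursive expansion; the function name `mark_similar_signatures` and its arguments are hypothetical and are not taken from the cuteFC source code.

```python
from bisect import bisect_left


def mark_similar_signatures(breakpoints, target_bp, max_cluster_bias):
    """Return indices of signatures chained to the one nearest the target SV.

    breakpoints: breakpoint coordinates of the signatures, sorted ascending.
    Assumption: two neighboring signatures are chained when their breakpoints
    differ by at most max_cluster_bias.
    """
    if not breakpoints:
        return []
    # Binary search for the signature closest to the target SV (Sig_flag).
    pos = bisect_left(breakpoints, target_bp)
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(breakpoints)]
    flag = min(candidates, key=lambda i: abs(breakpoints[i] - target_bp))
    # Expand to the left while each neighbor stays within the threshold.
    left = flag
    while left > 0 and breakpoints[left] - breakpoints[left - 1] <= max_cluster_bias:
        left -= 1
    # Expand to the right in the same way.
    right = flag
    while (right + 1 < len(breakpoints)
           and breakpoints[right + 1] - breakpoints[right] <= max_cluster_bias):
        right += 1
    return list(range(left, right + 1))
```

For translocations, the same expansion would additionally need to compare the transferred chromosome ID of each neighbor, which this sketch omits.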
For insertions and deletions that are more likely influenced by sequencing errors, cuteFC implements deep identification of the marked signatures. To comprehensively recall the signatures, cuteFC merges adjacent signatures from the same read to regenerate potential novel signatures. The newly generated signatures are defined as (\(BP'\), \(LEN'\)) when merging from the \(i\) th signature to the \(j\) th signature:
where \({LEN}_{i}\) represents the signature length of \({Sig}_{i}\).
For the signatures marked above, only spatial breakpoint similarity was considered, while similarity in alleles was overlooked. Consequently, cuteFC further refines the marked signatures through clustering, retaining only those signatures related to the target SV as alternative allele signatures. During this step, cuteFC first determines whether polymorphic SVs occur at the same genomic locus (in this work, we define SVs that occupy an identical breakpoint as polymorphic SVs). Self-adaptive clustering is selected when the target SVs are not determined to be polymorphic events. Conversely, multiallele-aware clustering is employed for polymorphic target SVs.
Self-adaptive clustering via dynamic parameters fine-tuning
The extracted signatures are always contaminated with sequencing and alignment errors, and target SVs of varied sizes and different types are sometimes challenging to resolve, especially in large-scale callsets. Therefore, it is crucial to cluster the disordered alleles with fine-tuned parameters so as to adapt to the distributions presented by diverse real alleles. To solve this problem, we design a self-adaptive clustering strategy to dynamically adjust the clustering threshold to match the signatures being clustered; that is, the threshold \({thres}_{i}\) for the \(i\) th marked signature is defined as follows:
where \(diff\_ratio\_merging\) represents the variable merging ratio dependent on SV type and sequencing platform (more details in Supplementary Commands 3.1). The adjacent signatures are refined into a single cluster if their length difference falls below the clustering threshold; otherwise, they are assigned to separate clusters. After clustering, the metric similarity between each cluster and the target SV is calculated as follows:
where \(n\) represents the size of the cluster and \({LEN}_{target}\) and \({LEN}_{i}\) indicate the length of the target SV and the length of the \(i\) th signature in the cluster, respectively. The cluster with the highest similarity is selected, and the signatures in this cluster are regarded as alternative allele signatures.
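A minimal Python sketch of length-based self-adaptive clustering is given below. Two forms are assumptions made purely for illustration and may differ from the definitions used by cuteFC: the threshold is taken as \(diff\_ratio\_merging\) multiplied by the length of the preceding signature, and the cluster-to-target similarity is replaced by a stand-in score (the mean ratio of the shorter to the longer length). All function names are hypothetical.

```python
def self_adaptive_clusters(lengths, diff_ratio_merging):
    """Group signature lengths into clusters with a length-dependent threshold.

    Assumption: thres_i = diff_ratio_merging * LEN_i, so larger signatures
    tolerate larger absolute length differences. Lengths must be positive.
    """
    clusters = []
    for length in sorted(lengths):
        if clusters:
            previous = clusters[-1][-1]
            threshold = diff_ratio_merging * previous  # assumed threshold form
            if length - previous < threshold:
                clusters[-1].append(length)
                continue
        clusters.append([length])
    return clusters


def length_similarity(cluster, target_len):
    """Stand-in similarity: mean of min/max length ratios against the target."""
    ratios = [min(length, target_len) / max(length, target_len) for length in cluster]
    return sum(ratios) / len(ratios)


def pick_alt_cluster(lengths, target_len, diff_ratio_merging):
    """Return the cluster most similar to the target SV (lengths must be non-empty)."""
    clusters = self_adaptive_clusters(lengths, diff_ratio_merging)
    return max(clusters, key=lambda cluster: length_similarity(cluster, target_len))
```

With signature lengths of 98, 100, 105, 300, and 310 bp and a 300-bp target, this sketch separates the ~100-bp signatures from the ~300-bp signatures and keeps only the latter as alternative allele signatures, which mirrors the situation of two similar population-level SVs at one site.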
Multiallele-aware clustering to identify polymorphic alleles with high resolution
While self-adaptive clustering can accurately identify alternative allele signatures, it struggles with signatures in genomic regions containing polymorphic SVs due to the high similarity between adjacent SV alleles in these areas. To address this issue, cuteFC introduces a multiallele-aware clustering strategy specifically for identifying polymorphic target SVs. Here, the K-means-based clustering algorithm, configured with \(k=2\), is applied to the marked signatures, since at most two alleles can occur at the same coordinate in a diploid genome. The two resulting clusters are considered to be multiallele sets if they meet the following criteria:
where \({size}_{i}\) represents the size of the \(i\) th subset, \({max}_{i}\) represents the maximum length in the \(i\) th subset, \({delta}_{i}\) represents the length deviation in the \(i\) th subset, and \(n\) and \(c\) represent the similarity assessments of the subsets (by default, 5 and 3, respectively). If the subset is a multiallele set, the signatures in the subset that meet the following criteria are regarded as alternative allele signatures:
where \(r\) represents the ratio of the multiallele identification (by default, 0.9).
On the other hand, for inversions, duplications, and translocations, the signatures extracted from the alignments are less discrete or chaotic; therefore, the marked signatures are directly regarded as alternative allele signatures.
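The \(k=2\) split applied to polymorphic insertions and deletions can be sketched as a one-dimensional two-means clustering of signature lengths. This is a generic illustration rather than the cuteFC implementation: the initialization at the minimum and maximum lengths is an assumption, the function name `two_means_1d` is hypothetical, and the subsequent acceptance criteria based on \({size}_{i}\), \({max}_{i}\), \({delta}_{i}\), \(n\), \(c\), and \(r\) are not reproduced.

```python
def two_means_1d(lengths, max_iter=100):
    """Split signature lengths into two clusters with 1-D k-means (k = 2).

    Returns (low_cluster, high_cluster); high_cluster is empty when all
    lengths are identical.
    """
    values = sorted(lengths)
    if not values:
        raise ValueError("at least one signature length is required")
    # Assumed initialization: the smallest and largest lengths as centroids.
    c_low, c_high = float(values[0]), float(values[-1])
    low, high = values, []
    for _ in range(max_iter):
        # Assign each length to its nearest centroid (ties go to the low one).
        low = [v for v in values if abs(v - c_low) <= abs(v - c_high)]
        high = [v for v in values if abs(v - c_low) > abs(v - c_high)]
        if not high:
            break
        new_low = sum(low) / len(low)
        new_high = sum(high) / len(high)
        if new_low == c_low and new_high == c_high:
            break  # centroids no longer move
        c_low, c_high = new_low, new_high
    return low, high
```

For example, lengths of 50, 52, 55, 298, 300, and 305 bp at one breakpoint are separated into a ~50-bp allele set and a ~300-bp allele set, each of which can then be matched against the corresponding target SV.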
Efficient identification of reference allele reads using a GPS algorithm
To quantify the reads that align with the reference, cuteFC gathers all reads around the target SVs using a GPS algorithm. In the algorithm, a scanning line (termed Genome Scanner) traverses the genome to obtain the total reads spanning or overlapping target SVs. Subsequently, reads with alternative allele signatures are excluded from the analysis to isolate those reference allele reads.
First, all the target SVs and aligned reads are recorded as 2-tuples (\({s}_{SV}\), \({e}_{SV}\)) and (\({s}_{read}\), \({e}_{read}\)), where \({s}_{read}\) and \({s}_{SV}\) denote the start coordinates of the read/SV and \({e}_{read}\) and \({e}_{SV}\) denote the end coordinates of the read/SV. For each chromosome, all breakpoints in the tuples are sorted by their coordinates, and the designed scanner traverses the ordered breakpoints on the genome. Two collections, \(RC\) and \(SC\), are utilized to store the reads and SVs that intersect the scanner:
where \({bp}_{line}\) represents the coordinate of the scanner. In detail, a read/SV is added to the read/SV collection when the scanner reaches \({s}_{read}\)/\({s}_{SV}\) and is removed from the read/SV collection when the scanner reaches \({e}_{read}\)/\({e}_{SV}\).
Then, for SVs that influence additional sequences in the reference genome, such as insertions, cuteFC gathers all reads spanning the target SV. Specifically, the reads in the intersection of the \(RC\) collections recorded when the scanner reaches the start and the end of an SV are considered to span that SV. For SVs that affect a continuous interval on the reference genome, such as deletions, cuteFC gathers all reads overlapping the target SV. Specifically, the overlapping reads consist of two parts: when the scanner reaches the start of an SV, the reads in the current \(RC\) collection are considered to overlap that SV; when the scanner reaches the start of a read, this read is considered to overlap all SVs in the current \(SC\) collection. Further details on the GPS algorithm are provided in Additional file 1: Fig. S1 and “Supplementary Methods”.
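The scanning procedure described above can be illustrated with a small sweep-line sketch in Python. It follows the stated rules for maintaining \(RC\) and \(SC\) and for collecting spanning and overlapping reads, but the event ordering at identical coordinates (treating intervals as closed) and all names are assumptions; the sketch sorts events with a general-purpose sort and copies \(RC\) at each SV start, so it makes no attempt to reproduce the efficiency of the actual GPS implementation.

```python
# Event kinds, ordered so that intervals touching at one coordinate still count
# as intersecting (an assumption of this sketch).
READ_START, SV_START, SV_END, READ_END = range(4)


def genome_position_scan(reads, svs):
    """Sweep one chromosome and collect reads around each target SV.

    reads, svs: dicts mapping an ID to a (start, end) tuple.
    Returns (overlapping, spanning): dicts mapping SV ID to a set of read IDs.
    """
    events = []
    for read_id, (start, end) in reads.items():
        events.append((start, READ_START, read_id))
        events.append((end, READ_END, read_id))
    for sv_id, (start, end) in svs.items():
        events.append((start, SV_START, sv_id))
        events.append((end, SV_END, sv_id))
    events.sort()

    rc, sc = set(), set()           # reads / SVs currently intersecting the scanner
    rc_at_start = {}                # snapshot of RC when each SV starts
    overlapping = {sv_id: set() for sv_id in svs}
    spanning = {}
    for _, kind, ident in events:
        if kind == READ_START:
            rc.add(ident)
            for sv_id in sc:        # a new read overlaps every open SV
                overlapping[sv_id].add(ident)
        elif kind == SV_START:
            sc.add(ident)
            rc_at_start[ident] = set(rc)
            overlapping[ident].update(rc)   # open reads overlap the new SV
        elif kind == SV_END:
            sc.discard(ident)
            # Reads present at both the SV start and the SV end span the SV.
            spanning[ident] = rc_at_start[ident] & rc
        else:                       # READ_END
            rc.discard(ident)
    return overlapping, spanning
```

Under this sketch, the spanning sets would be used for insertion-like targets and the overlapping sets for deletion-like targets; reads carrying alternative allele signatures are then subtracted to leave the reference allele reads.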
Assign genotypes based on likelihood estimation
The aligned reads surrounding the target SV are categorized into two groups: those containing the alternative allele signatures are recorded as alternative supporting reads and are marked as \(DV\), while the remaining reads not containing the SV signatures are recorded as reference supporting reads and are marked as \(DR\). Utilizing the \(DV\) and \(DR\) sets, the genotyping module from cuteSV is employed to determine genotypes by applying a biallelic assumption to estimate the likelihood of different zygosities. The zygosity with the maximum likelihood is assigned to the target SV.
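As an illustration of maximum-likelihood zygosity assignment under a biallelic assumption, the following Python sketch uses a simple binomial read model. It is a generic textbook formulation rather than the genotyping module of cuteSV: the per-genotype probabilities of observing an alternative read (`err`, 0.5, and `1 - err`), the default `err=0.05`, and the function name `call_genotype` are all assumptions.

```python
import math


def call_genotype(dv, dr, err=0.05):
    """Pick the zygosity with the highest binomial log-likelihood.

    dv: number of alternative supporting reads (DV).
    dr: number of reference supporting reads (DR).
    err: assumed probability that a read supports the wrong allele.
    """
    if dv + dr == 0:
        return "./.", {}            # no reads: genotype cannot be estimated
    # Assumed probability that a read supports the alternative allele.
    p_alt = {"0/0": err, "0/1": 0.5, "1/1": 1.0 - err}
    log_likelihood = {
        genotype: dv * math.log(p) + dr * math.log(1.0 - p)
        for genotype, p in p_alt.items()
    }
    best = max(log_likelihood, key=log_likelihood.get)
    return best, log_likelihood
```

For instance, 20 alternative reads with no reference reads favor “1/1”, a 10:10 split favors “0/1”, and 20 reference reads with no alternative reads favor “0/0”.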
Process the benchmarking datasets
For the simulation datasets, we first simulated 3668 structural variations on chromosomes 1 and 2, including 2191 insertions, 1333 deletions, 450 inversions, 450 duplications, and 250 inter-chromosomal translocations, and reconstructed the reference genome GRCh38 with these SVs for the donor genome. We also simulated another haplotype with large SVs whose SV lengths ranged from 10 ~ 100 kbp, including 500 insertions and 500 inter-chromosomal translocations. We randomly defined homozygous regions and heterozygous regions in the genome to determine the zygosity of each SV. Then, we used VISOR (v1.1.2) [35] to generate three sets of 30 × simulated sequencing data via PacBio HiFi, PacBio CLR, and ONT sequencing technologies. The simulated data were randomly downsampled to 20 ×, 10 ×, and 5 ×. For the datasets under different proportions of simulation depth, we set 9 different proportions, including “10:0”, “9:1”, “8:2”, “6:4”, “5:5”, “4:6”, “2:8”, “1:9”, “0:10”, where “x:y” indicates the proportion of simulated reads for alternative alleles to simulated reads for reference alleles as x to y (more details in Additional file 2: Table S5). Then, we used VISOR to generate 30 × simulated sequencing data via the three sequencing technologies above. Five methods (cuteFC (v 1.0.1), cuteSV (v 1.0.13), Sniffles1 (v 1.0.12), Sniffles2 (v 2.0.7), and SVJedi (v 1.1.6)) were applied to these simulated sequencing datasets to regenotype the simulated SVs.
For the HG002 datasets, we obtained target SVs from (1) two high-confidence datasets, including SVs in the high-confidence callsets (SV v0.6, marked as NIST) from Genome in a Bottle (GIAB) and the GIAB Challenging Medically Relevant Gene Benchmark v1.00 (CMRG), and (2) a population-scale SV callset from the Human Genome Structural Variation Consortium (HGSVC). We selected four alignment datasets of HG002 from GIAB, including PacBio HiFi sequencing, PacBio CLR sequencing, ONT sequencing, and ULONT sequencing. The sequencing data were downsampled to 30 × as the baseline sequencing coverage, and these 30 × sequencing datasets were further downsampled to 20 ×, 10 ×, and 5 ×. Then, we applied the five methods above to these sequencing datasets to regenotype the target SVs.
For the 100 samples from the Chinese population group, we first detected individual SVs through approximately 15 × ONT sequencing and cuteSV (v 1.0.13). The 100 individual SV callsets were merged using Truvari (v 3.5.0) [41] to generate cohort-level population VCFs. With the merged cohort-level SVs as targets, we applied regenotyping methods to 100 samples and applied SURVIVOR (v 1.0.7) [42] to merge the regenotyping results and obtain the revised cohort-level SV callsets. For validation, we implemented PacBio HiFi sequencing on two of these samples (i.e., Samples D99 and D100) and applied HiFiasm (v 0.19.5-r590) [43] to assemble the donor genomes. Dipcall (v 0.3) [44] and SVIM-asm (v 1.0.3) [45] were applied to the assembled genomes to validate the SV callsets.
The detailed commands can be found in the Supplementary Commands.
Evaluate the regenotyping performance
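The regenotyping results were benchmarked using Truvari. The unknown SVs (with genotypes equal to 0/0 or ./.) in both callsets were filtered before evaluation. The precision, recall, and F1 score of genotyping are defined as follows:
\[Precision=\frac{{TP}_{gt}}{{TP}_{call}+FP}\]
\[Recall=\frac{{TP}_{gt}}{{TP}_{base}+FN}\]
\[F1=\frac{2\times Precision\times Recall}{Precision+Recall}\]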
where \({TP}_{gt}\) represents the number of SVs that have concordant genotypes with the ground truth, \({TP}_{call}\) and \(FP\) represent the numbers of SVs that are concordant or discordant with the ground truth, respectively, and \({TP}_{base}\) and \(FN\) represent the numbers of SVs in the ground truth that are detected or undetected, respectively. It is worth mentioning that, because regenotyping does not change the coordinates of the target SVs, the value of \({TP}_{call}\) equals \({TP}_{base}\).
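A short Python sketch shows how these quantities combine, assuming the ratios implied by the symbol descriptions above: genotype precision as \({TP}_{gt}\) over \({TP}_{call}+FP\), genotype recall as \({TP}_{gt}\) over \({TP}_{base}+FN\), and F1 as their harmonic mean. The function name and the example counts are hypothetical.

```python
def genotype_metrics(tp_gt, tp_call, fp, tp_base, fn):
    """Return (precision, recall, f1) for genotype concordance.

    Assumed definitions: precision = TP_gt / (TP_call + FP),
    recall = TP_gt / (TP_base + FN), F1 = harmonic mean of the two.
    """
    called = tp_call + fp
    truth = tp_base + fn
    precision = tp_gt / called if called else 0.0
    recall = tp_gt / truth if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 100 matched SVs, 90 with the correct genotype,
# no false positives, and 20 ground-truth SVs that were missed.
precision, recall, f1 = genotype_metrics(90, 100, 0, 100, 20)
```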
In the Chinese cohort experiments, HWE scores represent the confidence that an SV conforms to the Hardy–Weinberg equilibrium. ExcHet scores represent a measure of the genetic diversity of an SV in the population. The heterozygosity test checks each SV using the proportion of heterozygous individuals and the variant allele frequency (VAF). These statistics were determined using the BCFtools (v 1.9) [38] fill-tags module. In the heterozygosity test, the discovery rate (\(DR\)) is defined as follows:
\[DR=\frac{identified\ SVs}{population\_scale\ SVs}\]
where \(identified SVs\) represents the SVs that are regenotyped as heterozygous or homozygous alternative alleles by regenotyping methods and \(population\_scale SVs\) represents the total cohort-level SVs. In evaluating the variant allele frequency concordance with the worldwide reference dataset, the consistency rate (\(CR\)) and sharing rate (\(SR\)) are defined as follows:
\[CR=\frac{consistent\ SVs}{{shared}_{C}\ SVs}\]
\[SR=\frac{{shared}_{R}\ SVs}{reference\ SVs}\]
where \(consistent SVs\) represents the SVs whose VAF deviation between the reference dataset and the Chinese callsets is smaller than 0.2, \({shared}_{C} SVs\) and \({shared}_{R} SVs\) represent the SVs that are shared in the Chinese callsets and the reference dataset, respectively, and \(reference SVs\) represents the total SVs in the reference dataset.
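The cohort-level ratios and their harmonic mean can be sketched in Python as follows. The sketch assumes a one-to-one matching of shared SVs between the two callsets, so that the numbers of shared SVs on the cohort side and on the reference side coincide, and it uses the 0.2 VAF deviation cutoff stated above; the function names and the input layout are hypothetical.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two rates, defined as 0 when both are 0."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def discovery_rate(n_identified, n_population_scale):
    """DR: fraction of cohort-level SVs regenotyped as 0/1 or 1/1."""
    return n_identified / n_population_scale


def consistency_and_sharing(shared_vafs, n_reference, max_deviation=0.2):
    """Return (CR, SR, harmonic mean).

    shared_vafs: dict mapping a shared SV ID to (cohort VAF, reference VAF),
    assuming one-to-one matching between the two callsets.
    n_reference: total number of SVs in the reference dataset.
    """
    n_shared = len(shared_vafs)
    n_consistent = sum(
        1 for vaf_c, vaf_r in shared_vafs.values()
        if abs(vaf_c - vaf_r) < max_deviation
    )
    cr = n_consistent / n_shared if n_shared else 0.0
    sr = n_shared / n_reference if n_reference else 0.0
    return cr, sr, harmonic_mean(cr, sr)
```

The same `harmonic_mean` helper applies to the heterozygosity test, where the harmonic mean is taken over the coefficient of determination and the discovery rate.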
The benchmarks were implemented using a server with 2 Intel(R) Xeon(R) Gold 6240 CPUs @ 2.60 GHz (32 cores in total), 128 gigabytes of RAM, running on the CentOS Linux release 7.5.1804 operating system. The elapsed time and memory footprint were assessed by using the “seff” command of the Slurm Workload Manager.
Data availability
The cuteFC was implemented in Python and can be easily installed via bioconda and PyPI. Its source code is available at https://github.com/Meltpinkg/cuteFC under MIT open source license. The cuteFC release used in this article was deposited on Zenodo with doi: https://zenodo.org/records/14671406 [47]. The simulation data were generated by our in-house scripts, and the related variant files and scripts are available at https://github.com/Meltpinkg/Simulation-datasets-for-force-calling and https://zenodo.org/records/15038103 [48]. The hs37d5 human reference genome is available at https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/ [49]. The GRCh38 human reference genome is available at https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/ [50]. The HiFi alignment data of the HG002 individual are available at https://downloads.pacbcloud.com/public/revio/2022Q4/ [51]. The ONT alignment data of the HG002 individual are available at s3://ont-open-data/giab_lsk114_2022.12/[52]. The CLR and ULONT alignment data of HG002 individuals are available at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/ [53, 54]. The Genome in a Bottle (GIAB) ground truth set (SV v0.6) from the National Institute of Standards and Technology (NIST) is available at http://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/ [36]. The GIAB Challenging Medically Relevant Gene Benchmark (CMRG) v1.00 is available at http://ftp.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh37/StructuralVariant/ [37]. The genome stratification files are available at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.3/ [55]. The HGSVC callsets including HG002 are available at http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC2/release/v2.0/integrated_callset/ [21]. 
The well-studied cohort callsets, including 32 international individuals, are available at https://doi.org/10.5281/zenodo.4268828[56]. The Chinese population genetic variant resource used in our study is permitted by The Ministry of Science and Technology of the People’s Republic of China (permission number 2023BAT1252 and 2025BAT00595). The variant data have been deposited in the Genome Variation Map [57] in National Genomics Data Center [58] under accession number: GVM000420. The raw sequence data for the Chinese individuals have been deposited in the Genome Sequence Archive [59] in National Genomics Data Center under accession number: HRA010869. The raw data are available under restricted access, with authorization granted by the Data Access Committee (DAC). Researchers may request access by submitting an application form through the GSA platform. Comprehensive instructions for submitting data access requests are provided in the GSA-Human Request Guide for Users [https://ngdc.cncb.ac.cn/gsa-human/document/GSA-Human_Request_Guide_for_Users_us.pdf]. Typically, the DAC processes access requests within approximately 10 working days.
References
Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet. 2011;12(5):363–76.
Weischenfeldt J, Symmons O, Spitz F, Korbel JO. Phenotypic impact of genomic structural variation: insights from and for human disease. Nat Rev Genet. 2013;14(2):125–38.
Stankiewicz P, Lupski JR. Structural variation in the human genome and its role in disease. Annu Rev Med. 2010;61:437–55.
Stevanovski I, Chintalaphani SR, Gamaarachchi H, Ferguson JM, Pineda SS, Scriba CK, Tchan M, Fung V, Ng K, Cortese A. Comprehensive genetic diagnosis of tandem repeat expansion disorders with programmable targeted nanopore sequencing. Sci Adv. 2022;8(9):eabm5386.
Song T, Yao M, Yang Y, Liu Z, Zhang L, Li W. Integrative Identification by Hi-C Revealed Distinct Advanced Structural Variations in Lung Adenocarcinoma Tissue. Phenomics. 2023;3(4):390–407.
Hurles ME, Dermitzakis ET, Tyler-Smith C. The functional impact of structural variation in humans. Trends Genet. 2008;24(5):238–45.
Suzuki Y, Myers EW, Morishita S. Rapid and ongoing evolution of repetitive sequence structures in human centromeres. Sci Adv. 2020;6(50):eabd9230.
Hollox EJ, Zuccherato LW, Tucci S. Genome structural variation in human evolution. Trends Genet. 2022;38(1):45–58.
The Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet. 2014;46(8):818–25.
Audano PA, Sulovari A, Graves-Lindsay TA, Cantsilieris S, Sorensen M, Welch AE, Dougherty ML, Nelson BJ, Shah A, Dutcher SK, et al. Characterizing the Major Structural Variant Alleles of the Human Genome. Cell. 2019;176(3):663–675.e19.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.
Huddleston J, Chaisson MJ, Steinberg KM, Warren W, Hoekzema K, Gordon DS, Graves-Lindsay TA, Munson KM, Kronenberg ZN, Vives L. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 2017;27(5):677–85.
Jiang T, Liu B, Li J, Wang Y. rMETL: sensitive mobile element insertion detection with long read realignment. Bioinformatics. 2019;18:18.
Jiang T, Fu Y, Liu B, Wang Y. Long-Read based Novel Sequence Insertion Detection with rCANID. IEEE Trans Nanobioscience. 2019:1–1.
Jiang T, Liu Y, Jiang Y, Li J, Gao Y, Cui Z, Liu Y, Liu B, Wang Y. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 2020;21(1):189.
Liu Y, Jiang T, Su J, Liu B, Zang T, Wang Y. SKSV: ultrafast structural variation detection from circular consensus sequencing reads. Bioinformatics. 2021;37(20):3647–9.
Zhang Z, Jiang T, Li G, Cao S, Liu Y, Liu B, Wang Y. Kled: an ultra-fast and sensitive structural variant detection tool for long-read sequencing data. Brief Bioinform. 2024;25(2):bbae049.
Xia Y, Jin Z, Zhang C, Ouyang L, Dong Y, Li J, Guo L, Jing B, Shi Y, Miao S. TAGET: a toolkit for analyzing full-length transcripts from long-read sequencing. Nat Commun. 2023;14(1):5935.
Bilgrav Saether K, Eisfeldt J. Detecting transposable elements in long-read genomes using sTELLeR. Bioinformatics. 2024;40(11):btae686.
Beyter D, Ingimundardottir H, Oddsson A, Eggertsson HP, Bjornsson E, Jonsson H, Atlason BA, Kristmundsdottir S, Mehringer S, Hardarson MT, et al. Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits. Nat Genet. 2021;53(6):779–86.
Ebert P, Audano PA, Zhu Q, Rodriguez-Martin B, Porubsky D, Bonder MJ, Sulovari A, Ebler J, Zhou W, Serra Mari R. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science. 2021;372(6537):eabf7117.
Shi L, Guo Y, Dong C, Huddleston J, Yang H, Han X, Fu A, Li Q, Li N, Gong S. Long-read sequencing and de novo assembly of a Chinese genome. Nat Commun. 2016;7(1):12065.
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303.
Popic V, Rohlicek C, Cunial F, Hajirasouliha I, Meleshko D, Garimella K, Maheshwari A. Cue: a deep-learning framework for structural variant discovery and genotyping. Nat Methods. 2023;20(4):559–68.
Collins RL, Brand H, Karczewski KJ, Zhao X, Alföldi J, Francioli LC, Khera AV, Lowther C, Gauthier LD, Wang H. A structural variation reference for medical and population genetics. Nature. 2020;581(7809):444–51.
Smolka M, Paulin LF, Grochowski CM, Horner DW, Mahmoud M, Behera S, Kalef-Ezra E, Gandhi M, Hong K, Pehlivan D. Detection of mosaic and population-level structural variants with Sniffles2. Nat Biotechnol. 2024:1–10.
Quan C, Li Y, Liu X, Wang Y, Ping J, Lu Y, Zhou G. Characterization of structural variation in Tibetans reveals new evidence of high-altitude adaptation and introgression. Genome Biol. 2021;22(1):159.
Jiang T, Liu S, Cao S, Liu Y, Cui Z, Wang Y, Guo H. Long-read sequencing settings for efficient structural variation detection based on comprehensive evaluation. BMC Bioinformatics. 2021;22(1):1–17.
Wenger AM, Peluso P, Rowell WJ, Chang P-C, Hall RJ, Concepcion GT, Ebler J, Fungtammasan A, Kolesnikov A, Olson ND, et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol. 2019;37(10):1155–62.
Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol. 2021;39(11):1348–65.
De Coster W, Weissensteiner MH, Sedlazeck FJ. Towards population-scale long-read sequencing. Nat Rev Genet. 2021;22(9):572–87.
Logsdon GA, Vollger MR, Eichler EE. Long-read human genome sequencing and its applications. Nat Rev Genet. 2020;21(10):597–614.
Lecompte L, Peterlongo P, Lavenier D, Lemaitre C. SVJedi: genotyping structural variations with long reads. Bioinformatics. 2020;36(17):4568–75.
Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, Von Haeseler A, Schatz MC. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018.
Bolognini D, Sanders A, Korbel JO, Magi A, Benes V, Rausch T. VISOR: a versatile haplotype-aware structural variant simulator for short- and long-read sequencing. Bioinformatics. 2019;36(4):1267–9.
Zook JM, Hansen NF, Olson ND, Chapman L, Mullikin JC, Xiao C, Sherry S, Koren S, Phillippy AM, Boutros PC, et al. A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol. 2020;38(11):1347–55.
Wagner J, Olson ND, Harris L, McDaniel J, Cheng H, Fungtammasan A, Hwang Y-C, Gupta R, Wenger AM, Rowell WJ, et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat Biotechnol. 2022;40(5):672–80.
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10(2):giab008.
Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, Jun G, Hsi-Yang Fritz M, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526(7571):75–81.
Abel HJ, Larson DE, Regier AA, Chiang C, Das I, Kanchi KL, Layer RM, Neale BM, Salerno WJ, Reeves C. Mapping and characterization of structural variation in 17,795 human genomes. Nature. 2020;583(7814):83–9.
English AC, Menon VK, Gibbs R, Metcalf GA, Sedlazeck FJ. Truvari: Refined structural variant comparison preserves allelic diversity. bioRxiv. 2022.
Jeffares DC, Jolly C, Hoti M, Speed D, Shaw L, Rallis C, Balloux F, Dessimoz C, Bähler J, Sedlazeck FJ. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat Commun. 2017;8(1):14061.
Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18(2):170–5.
Li H, Bloom JM, Farjoun Y, Fleharty M, Gauthier L, Neale B, MacArthur D. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat Methods. 2018;15(8):595–7.
Heller D, Vingron M. SVIM-asm: structural variant detection from haploid and diploid genome assemblies. Bioinformatics. 2020;36(22–23):5519–21.
Jiang T, Guo H, Liu Y, Li G, Cui Z, Cui X, Liu Y, Li Y, Zhang A, Cao S, et al. A comprehensive genetic variant reference for the Chinese population. Sci Bull (Beijing). 2024;69(24):3820–5.
Jiang T, Cao S. cuteFC-v1.0.1. Datasets. 2025. Zenodo https://zenodo.org/records/14671406.
Cao S, Jiang T. Simulation datasets for force calling. Datasets. 2025. Zenodo https://zenodo.org/records/15038103.
The 1000 Genomes Project Consortium. The hs37d5 human reference genome. Datasets. 2011. 1000 Genomes Project https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/.
Genome Reference Consortium. Genome assembly GRCh38. Datasets. 2013. Genome Reference Consortium https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/.
Pacific Biosciences. Human whole genome sequencing datasets from the Revio system. Datasets. 2022. PacBio Cloud https://downloads.pacbcloud.com/public/revio/2022Q4/.
Oxford Nanopore Technologies. Genome in a Bottle Ashkenazi Trio with Ligation Sequencing Kit V14. Datasets. Amazon S3 bucket s3://ont-open-data/giab_lsk114_2022.12/.
Pacific Biosciences. Continuous Long Read sequencing for HG002. Datasets. 2018. https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/.
Zook J, Olson N, Jain M, Olsen HE, Miga K, Akeson M, Paten B. GIAB HG002 ONT Ultra-long UCSC. Datasets. 2020. NIH Hosted GIAB FTP https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/UCSC_Ultralong_OxfordNanopore_Promethion/.
Dwarshuis N, Kalra D, McDaniel J, Sanio P, Alvarez Jerez P, Jadhav B, Huang W, Mondal R, Busby B, Olson ND, et al. The GIAB genomic stratifications resource for human reference genomes. Nat Commun. 2024;15(1):9029.
Audano PA. HGSVC Key Callset Resources. Datasets. Zenodo https://doi.org/10.5281/zenodo.4268828.
Li C, Tian D, Tang B, Liu X, Teng X, Zhao W, Zhang Z, Song S. Genome Variation Map: a worldwide collection of genome variations across multiple species. Nucleic Acids Res. 2021;49(D1):D1186–91.
CNCB-NGDC Members and Partners. Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2024. Nucleic Acids Res. 2023;52(D1):D18–32.
Chen T, Chen X, Zhang S, Zhu J, Tang B, Wang A, Dong L, Zhang Z, Yu C, Sun Y, et al. The Genome Sequence Archive Family: Toward Explosive Data Growth and Diverse Data Types. Genomics Proteomics Bioinformatics. 2021;19(4):578–83.
Acknowledgements
We thank Dr. Fang Wang for her valuable suggestions. We also thank the Center for Bioinformatics at the Harbin Institute of Technology for providing the sequencing and data analysis platform that supported this work.
Funding
This work was supported by the National Key R&D Program of China (grant numbers 2022YFF1202101, 2024YFC3406303, and 2017YFC0907503), the National Natural Science Foundation of China (grant numbers 62472120 and 62331012), the China Postdoctoral Science Foundation (grant number 2022M720965), and the Heilongjiang Postdoctoral Foundation (grant number LBH-Z22174).
Author information
Contributions
T.J. and S.C. designed the method. T.J., S.C., and Z.Z. implemented the method. T.J. and S.C. performed the experiments and data analysis. T.J., S.C., and Y.L. wrote the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
The raw data for the 100 Chinese individuals were obtained from a previous study [46]. Because the experiments conducted in this manuscript are part of the same project, additional consent and approval were not required.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Jiang, T., Cao, S., Liu, Y. et al. cuteFC: regenotyping structural variants through an accurate and efficient force-calling method. Genome Biol 26, 166 (2025). https://doi.org/10.1186/s13059-025-03642-2
DOI: https://doi.org/10.1186/s13059-025-03642-2