- Methodology
- Open access
- Published:
CloseRead: a tool for assessing assembly errors in immunoglobulin loci applied to vertebrate long-read genome assemblies
Genome Biology volume 26, Article number: 131 (2025)
Abstract
Despite tremendous advances in long-read sequencing, some structurally complex and repeat-rich genomic regions remain challenging to assemble. Furthermore, we lack tools to assess local assembly quality, making it hard to identify problems and assess progress. Here we develop a new approach “CloseRead” for visualizing local assembly quality and diagnosing errors using multiple metrics. We apply CloseRead to evaluate how well immunoglobulin loci, paradigmatic cases of structurally complex regions, are assembled in 74 state-of-the-art vertebrate genomes. We then show that targeted, local re-assembly can correct the specific errors identified by CloseRead, highlighting the value of an iterative approach to genome assembly.
Background
Long-read sequencing technologies have been instrumental in overcoming the limitations of short reads, as they can span large, repetitive, and complex genomic regions that were previously difficult to assemble [1,2,3,4,5]. For example, highly abundant LINE repeats that represent a challenge for short read-based assemblies are easily spanned by long reads [5, 6]. Moreover, the increasing read accuracy of long-read platforms now allows even the most complex repeats to be resolved (like centromeric tandem repeats) as minor paralogous sequence variants between repeat copies can be detected at high resolution [7]. These advances have reduced gaps and ambiguities in genome assemblies, enabling a more complete and accurate depiction of genomic architecture [8,9,10]. And consequently, many research teams and consortia have been able to produce near-complete full-chromosome assemblies for an ever-growing number of organisms, including both model and non-model species; prominent examples include the Telomere-to-Telomere (T2T) consortia [7, 11, 12], the Vertebrate Genome Project (VGP) [13], and the California Conservation Genomics Project (CCGP) [14]. In addition to being of interest to evolutionary biologists and genome researchers, having complete assemblies for many species has the potential to be transformational for medical genetics, agriculture, conservation biology, and many other disciplines [15,16,17,18,19,20,21,22,23,24,25,26].
While modern long-read assemblies have made it possible to assemble repetitive and structurally complex regions of the genome, challenges remain in our ability to assess the accuracy and completeness of assemblies spanning such loci [2, 27, 28]. Ensuring that these regions are accurately represented within whole-genome assemblies is essential not just for the sake of completeness, but also because these regions often harbor critically important genes and functional sequences. Arguably the foremost example of these are the immunoglobulin (IG) loci, which harbor large and expanded families of antibody-encoding genes. These regions are known for their structural complexity [29,30,31,32,33,34,35,36], which has long thwarted efforts to reconstruct them at the genomic level. As a result, our understanding of intra- and inter-species IG genetic diversity remains limited.
Within mammalian and reptilian genomes, IG genes are most often localized to three primary loci: the IG heavy chain (IGH), and the kappa (IGK) and lambda (IGL) light-chain loci [37]. Each of these loci contain expanded families of variable genes (V), diversity genes (D; in IGH only), and joining genes (J) that represent templates of one of two antibody chains, the heavy chain or the light chain. During B cell development, the IGH loci and one of the light-chain loci undergo a somatic process called V(D)J recombination, in which one V gene, one D gene (in IGH only), and joining (J) gene are concatenated into a V(D)J sequence that forms the variable region of either a heavy or light-chain transcript [38]. Importantly, this combinatorial diversity allows for the generation of a diverse array of antibodies, enabling the immune system to recognize and neutralize a wide range of pathogens.
The organization and counts of germline IG genes vary within and across vertebrate species [26, 39]. For example, cows have only 11 reported IGH functional V genes, whereas murine IGH loci contain over 100 functional V genes, and the little brown bat is estimated to have 236 V genes in its IGH loci [40,41,42]. The number of genes observed within a locus can also vary among individuals within a species; for instance, humans have been characterized to have many structural variants between different individuals [43]. These variations are primarily due to gene duplication and mutation events [44].
Given their pivotal role in immunity, it is essential to characterize IG genes across species at the haplotype level to understand the evolution and functional diversity of immune responses. To ensure that these genes are effectively curated from whole-genome assemblies, the evaluation of assembly accuracy within the IG loci is crucial. Accurate assembly of these loci is vital not only for developing standardized nomenclature across different studies and databases but also serves as a critical test case for our ability to assemble complex genomic regions [43, 45, 46]. This is essential for achieving a complete diploid representation of any genome, which in turn is crucial for fully understanding the diversity and functional dynamics of the immune system [47].
Released assemblies invariably have been subjected to extensive quality control, and a number of methods exist for evaluating the support for a proposed assembly. Some common metrics such as N50, NG50, and L50 are used to assess assembly continuity and fragmentation [48]. QUAST is a tool used to calculate various genome assembly metrics, including these continuity metrics and misassemblies—structural differences between the assembly and the reference genome [49]. However, for mammalian genomes, the misassembly metric often loses significance due to numerous structural differences between individuals, making it necessary to use the genome of the same biological individual (i.e., not just the same species) as a reference for interpretable results. Thus, the effectiveness of QUAST is limited when there is no suitable reference genome available for comparison. Moreover, for many species, the lack of a reference genome necessitates the use of reference-free methods to evaluate the assembly. Furthermore, QUAST is not particularly suitable for evaluating diploid genomes. The Benchmarking Universal Single-Copy Orthologs (BUSCO) program is another key tool in this domain [50, 51]. BUSCO evaluates genome completeness using information inferred by highly conserved genes and is employed by both VGP and CCGP in their quality assessments. Other advanced techniques further refine the evaluation process, some of which are reference-free. The Long Terminal Repeat (LTR) Assembly Index (LAI) is one such reference-free method, assessing repeat space to refine base-level inaccuracies [52]. K-mer-based methods, including JASPER and Merqury, are also reference-free. Merqury refines base-level inaccuracies by assessing k-mer frequencies, while JASPER provides a range of statistical measures for assembly evaluation [53, 54]. Notably, Merqury is utilized by CCGP. Tools such as Pilon and ntEdit are also commonly used for assembly evaluation [55, 56]. With the advancement of long-read sequencing, correction methods for long-read assembly, such as Inspector, CONSENT, POLCA, Medaka (https://github.com/nanoporetech/medaka), GCpp (https://github.com/PacificBiosciences/gcpp), are also available [57, 58]. For detailed curation, CRAQ and Inspector are able to perform error detection at the single-nucleotide resolution and to correct large structural inaccuracies [48, 59]. Recently, GAEP and Klumpy have also emerged for genome assembly forensics [60, 61]. GAEP focuses on heterozygous redundancy and overall assembly quality, although it is geared toward single-haplotype representation and uses the map-pb flag optimized for CLR reads rather than HiFi data. Klumpy is similarly kmer-based, able to identify contigs that are falsely present in an assembly but does not address diploidy or missing haplotypes.
Even with these diverse methods, highly complex and polymorphic genomic regions—notably the IG loci—continue to pose great challenges. Given the biological importance, it is particularly important to determine whether an assembly accurately represents IG genes and their organization in the genome. Methods such as Inspector and CRAQ offer detailed analyses but are not well-suited for the unique requirements of IG loci where haplotypes are often incomplete or only one haplotype is present [48, 59]. Likewise, GAEP and Klumpy have limitations in handling the complexities of diploid or highly variable regions, as they focus on different types of assemblies such as single-haplotype representation or lack explicit support for missing haplotypes. These constraints underscore the need for a more specialized approach.
To allow us to gain a more fine scale understanding of the quality of the IG assemblies across vertebrate species, we developed our own targeted approach, which we call CloseRead. Our pipeline includes intuitive visualizations to highlight assembly quality in the IG loci, thereby enabling researchers to easily identify and rectify potential errors. These visualizations and corresponding metrics are designed to serve as a complement to the existing quality checks. We applied our pipeline to conduct a systematic evaluation of existing assemblies of 74 vertebrate species (61 mammals and 13 reptiles) to assess their accuracy and completeness. Finally, we conducted three case studies in which we demonstrated manual assessment of problematic regions, confirmed the presence of genuine errors and facilitated correction of the erroneous assembly.
Results
Overview of methods
The CloseRead pipeline takes a genome assembly and sequencing reads corresponding to it as an input and aligns the read to the assembly (Fig. 1A). CloseRead then runs IgDetective tool [62] to identify boundaries of IGH, IGK, and IGL loci in the assembly (Fig. 1B). Within the IG loci, CloseRead reviews general assembly statistics and identifies mismatches and coverage breaks (Fig. 1C). Finally, the identified errors are visualized and reported in a user-friendly format (Fig. 1D). The problematic regions revealed by CloseRead can be further manually inspected or reassembled (Fig. 1E).
Error identification and curation analysis overview. A Publicly available assemblies and the corresponding HiFi sequencing data is downloaded and used as input. B IGH, IGK, and IGL loci location is identified using IgDetective [62] for all species. C Alignment using minimap2 [63] is done to identify mismatches and breaks in coverage, which could be an indication of potential assembly error. (i) Low confidence positions are captured from a read view by collecting positions with ≥ 5 reads that have > 1% mismatch rate. (ii) Mismatch rate is calculated for every position, considering only mismatches occurred exactly at the given position (basepair-oriented). (iii) Coverage breaks are found by collecting positions with ≤ 2 reads mapped. D Visualizing assembly evaluation by coverage and basepair-oriented mismatch rate. Purple bars indicate break in coverage. Red highlights indicate positions with high read-oriented mismatch rate computed in C-ii. Shown example is of case with no substantial errors found. E Assembly errors and inaccuracies identified at step C can be further manually investigated and curated
Alignment to detect assembly error
The vast majority of near-complete vertebrate assemblies are based on PacBio HiFi reads [5]. HiFi reads combine high accuracy (< 0.5% error rate) with substantial length (15–25 kbp). Thus, alignments of HiFi reads to the resulting assembly represent an efficient tool for assembly validation: in case of an error-free assembly, we expect that most reads have almost perfect alignments. Reads with no alignments to the assembly may represent sample contaminations, but reads that have poor alignment to the assembly are very likely to indicate assembly errors or missing sequences.
Types of errors identified
In our comprehensive analysis of IG loci assemblies, we identified two primary types of errors that are possible indications of an inaccurate assembly: (i) mismatches and (ii) breaks in coverage [64]. (Preliminary analyses, which are not presented here, revealed that the impact of mis-alignment between reads and reference per se was negligible for these types of errors so we did not consider this potential source of error any further.) We also often observe reads that map to multiple locations in the assembly related to repetitive regions in the genome. Although it presents great challenges in read placement, as we discuss below, such observations are not necessarily an indication of assembly errors.
Beyond these errors, the quality of an assembly can also be largely influenced by the accuracy of its haplotype resolution. Depending on the input data and assembly pipeline, the result of genome assembly of a diploid organism may have different representations. The most complete representation includes a haplotype-resolved assembly capturing both parental haplotypes of a diploid genome and providing detailed genetic insights crucial for precise studies. However, in some assembly projects, available data is not sufficient to reconstruct chromosome-scale phased haplotypes. In such cases, genome assembly output represents a mixture of both haplotypes into one assembly (often referred to as a primary assembly), leaving alternate haplotypes fragmented and/or compressing regions with high heterozygosity, which can obscure important genetic variation [13]. We refer to such assemblies as haplotype-unresolved.
Mismatches
A prevalent type of assembly error in our results is missing sequences. This error is flagged when a substantial number of reads align to a specific region of the reference assembly—referred to as region \(i\)—with a high percentage of mismatches while having increased coverage. These mismatches arise due to failure of the alignment algorithm to find perfect placement of these reads, and it identifies region \(i\) as the best match for these reads, despite the high number of mismatches and often results in abnormally increased coverage. This often suggests that the reference assembly lacks the correct sequence to accurately represent these reads, indicating the complete absence of a large contig or sequence within the assembly. From our case studies, we often find these mismatch errors are resolved by reassembly and inclusion of the missing contigs.
To note, existing single-nucleotide resolution assembly evaluators, such as CRAQ and Inspector, assume that the assembly being evaluated is relatively complete, meaning all sequences are present but may not be in the correct order or position [48, 59]. Under this assumption, these tools interpret approximately 50% of reads with high errors or ratio of clipping as normal due to heterozygosity and report assembly errors only when there is a near 100% error rate (over 75% for CRAQ). This assumption fails in cases where the second haplotype is incomplete, leading to consistent error rates around 50%, as observed in the latter case study 1 section. Consequently, these tools likely would not detect the missing sequences, as they would not classify the ~ 50% error rate as an indicator of such errors and, at least for IG, miss important problems. Unlike CRAQ and Inspector, CloseRead does not assume a relatively complete assembly and uses two approaches to better locate these regions with many mismatches in alignment, one from a read-oriented view and one from basepair-oriented view.
First, we adopt a read-oriented view to analyze genome assembly quality (Fig. 1C-i). We consider \(R\) to be set of all reads associated with an assembly \(G\) with a nucleotide at every position 1 through \(l\) (i.e.,\(G = {G}_{1}, {G}_{2}, ..., {G}_{l}\)). For every position \(j\) in the genome, there is a set of reads \({J}_{j}\subseteq R\) that covers the position. Each individual read \(r\in {J}_{j}\) is of length \(n(r)\) and covers a set of positions \(P\) where \(j\in P\). We can compute the distance \(d\) between \(g = {G}_{P}\) and \(r\) by
As such, the mismatch percentage \(e\) for each read can be computed as:
We set \(\theta = 0.01\) to account for any sequencing errors, matching the error rate of HiFi reads. The overall quality of each genomic position from a read-oriented view is subsequently assessed using a threshold condition. For each position \(j\) in the genome assembly, if \({E}_{{J}_{j}}\), the count of reads in \({J}_{j}\) whose mismatch percentage exceeds the threshold \(\theta\), is greater than 5, then position \(j\) is deemed poorly supported. This condition can be expressed formally as:
If \({E}_{{J}_{j}}>5\), then position \(j\) is deemed poorly supported.
This process is repeated for every position in the reference genome to quantify the read-oriented mismatch error rate across the entire genome.
Second, we adopt a basepair-oriented view to evaluate the mismatch error rate at specific positions within a reference genome (Fig. 1C-ii). This approach focuses exclusively on each individual position, independent of adjacent base pairs. Again, for every position \(j\) in the genome, there is a set of reads \({J}_{j}\subseteq R\) that covers the position. We can compute \({\delta }_{j}\), the counts of errors occurred exactly at position \(j\), by
This process is repeated for every position in the reference genome to quantify the basepair-oriented mismatch error rate across the entire genome.
Breaks in coverage
Another critical indication of assembly error CloseRead identifies is a break in read coverage (Fig. 1C-iii). Continuity of coverage is essential for the accuracy of genomic assemblies; every base pair in the genome is expected to be uniformly (or nearly uniformly) covered by multiple reads [65]. Positions with low coverage, which we define as fewer than two reads aligned to a base pair, indicate minimal support and might represent erroneous sequences. In this paper, we use a threshold of two reads, as this is the minimum required to establish an overlap, thereby minimizing false positives. However, for broader applications, users are encouraged to select a threshold that aligns with their specific requirements and judgment. Notably, reassembling of the Mexican gray wolf's (Canis lupus baileyi, hereafter C. l. baileyi) IGH region, where we discovered a break in coverage, revealed that a large (1.5 Mb) inversion in the original genome was actually assembled in the reverse orientation. This finding highlights that breaks in coverage may correspond to substantial structural errors within the assembly, which can be corrected through reassembly.
Impact of repetitive sequences on read alignment
In addition to the aforementioned errors, we also encountered cases where repetitive sequences caused ambiguity in read placement. Specifically, when a read maps perfectly to two regions in the genome—regions Y and Z—the aligner is unable to determine the correct origin of the read due to the identical sequences in the reference. This issue is commonly seen in phased genomes, where similarities between haplotypes complicate accurate read placement, and it is even more prevalent in regions with repetitive sequences, such as IG loci. In these scenarios, the aligner assigns a mapping quality of 0 to these reads and arbitrarily places them in either region Y or region Z. It is important to note that this situation, characterized by perfect matches and no breaks in coverage, should not be automatically considered an assembly error. Instead, it may reflect the inherent complexity of repetitive sequences and the challenges they pose for accurate genome alignment and assembly [66]. On the other hand, ambiguous read alignments indicate very low divergence that is rarely observed in genomes. Thus, it could indicate erroneously collapsed divergence in the assembly and such cases require additional investigation.
Summary of IG assembly quality in vertebrate genomes
Since haplotype-resolved and haplotype-unresolved assemblies often show substantial quality differences in their second haplotype [67], we consider them separately in the following results. Overall, there were no large disparities in the overall error rates between haplotype-resolved and haplotype-unresolved assemblies. However, the prevalence of each error type was distinct in both assembly types.
We were able to identify the IGH loci for all 74 species using IgDetective [62]. The assembly quality results for these loci indicate substantial variability. Table 1 shows that, in the 26 haplotype-resolved assemblies, one species’ assembly (3.8%) exhibited mismatch errors, 15 species’ assemblies (57.7%) had breaks in coverage, and 11 species’ assemblies (42.3%) had no notable errors. In the 48 haplotype-unresolved assemblies, mismatch errors were found in 24 species’ assemblies (50%), 15 species’ assemblies (31.3%) had breaks in coverage, and 20 species’ assemblies (41.7%) had no notable errors. Overall, out of the 74 species, 43 species’ assembly (58.1%) showed some indication of assembly error, either through mismatches or breaks in coverage.
We identified IGK loci in 57 out of 74 species and IGL loci in 61 out of 74 species. In the remaining species, these loci may be undetected either due to high sequence divergence or their inherent absence. Some species naturally lack IGK loci, possessing only IGL loci, and vice versa [68, 69]. For IGK/IGL loci, there were 21/23 haplotype-resolved and 36/38 haplotype-unresolved assemblies. In the haplotype-resolved group, 1/2 assembly (4.8%/8.7%) displayed mismatch errors, 3/7 assemblies (14.3%/30.4%) had coverage breaks, and 18/16 assemblies (85.7%/69.6%) exhibited no substantial errors. In the haplotype-unresolved group, mismatch errors were observed in 9/17 assemblies (25%/44.7%), 4/12 assemblies (11.1%/31.6%) had coverage breaks, and 26/16 assemblies (72.2%/42.1%) showed no substantial errors. In total, 13 of the 57 species (22.8%) and 29 out of the 61 species (47.5%) presented some form of assembly error in IGK and IGL loci, respectively.
To further understand genome assembly challenges, we categorize breaks based on their coverage and the characteristics of the reference sequences as it could imply different things about the assembly [6, 70] (Additional file 1: Fig. S1). Overall, our analysis indicates that haplotype-unresolved assemblies have higher mismatch errors, suggesting a systematic issue with representing haplotypes that cannot be fully resolved despite sufficient data. Additionally, we observe a higher rate of coverage breaks in the IGH loci, highlighting a particular challenge in accurately assembling these regions.
Case studies
To verify the errors identified by CloseRead and better understand their causes, we present case studies that are emblematic of some of the broader features discovered across the datasets we examined. In the Additional file 2–5, we repeat the same detailed analyses for each of the other 71 genomes we evaluated in this study.
Case 1: Mismatches in alignment indicating missing IG sequences
The assemblies of the Greenland wolf (Canis lupus orion; hereafter C. l. orion; GCA_905319855.2) and the Philippine flying lemur (Cynocephalus volans; hereafter C. volans; GCA_027409185.1) were constructed as part of the Vertebrate Genome Project (VGP) and are haplotype-unresolved. Specifically, the primary IGH loci of the C. l. orion is 2.5 Mbp, while the alternate IGH loci is about 1.3 Mbp. Similarly, the primary IGH loci of C. volans is about 2.3 Mbp, and the alternate IGH loci sums up to 0.8 Mbp. In both case studies, the majority of the reads showed great mapping quality when aligned to the assembly, with 90.02% and 86.44% of reads, respectively, having a mapping quality of 60 (Additional file 1: Fig. S2A i, S3A i). However, there are a portion of the reads (6.4% and 3.7% respectively) that were soft-clipped in both case studies (Additional file 1: Fig. S2A ii, S3A ii). In addition, both species show substantial numbers of reads (7.9% and 14.5% respectively) aligned with high mismatch rates and indels (insertions and deletions greater than 2 consecutive base pairs) (Additional file 1: Fig. S2A iii-iv, S3A iii-iv). Given that the two haplotypes of the IGH loci in both species exhibit substantial differences in length, the mismatches identified likely suggest missing contigs in the alternate assemblies (Figs. 2A, B and 3A, B).
Detailed analysis of IGH locus assembly errors in Greenland wolf (C. l. orion). A Dotplot comparing gene locations and alignments is shown for primary vs alternate haplotypes. Lines on the top and right side of the dotplot indicate IG genes locations, colored by affected/unaffected defined in panel E. We generated dotplots using Gepard(v2.1) [71]. B IGH locus length in both the primary and alternate assemblies are compared using bar charts. C A detailed analysis of alignment mismatch in the primary IGH haplotype includes (i) read coverage across the entire IGH loci, color-coded by mapping quality, and (ii) basepair-oriented mismatch rate, a heatmap above indicating the frequency of high mismatch rate base pairs, with darker colors denoting more frequent occurrences. Light red highlights positions covered by ≥ 5 reads with an error rate > 1%, and purple bars indicate coverage breaks (coverage ≤ 2). D Similar analysis to panel C for the alternate IGH haplotype. E IG gene quality from (i) read-oriented view, affected genes are IGH genes covered by > 5 poorly aligned reads; and (ii) basepair-oriented view, affected genes are IGH genes without perfect support (> 80% confidence at every position)
Detailed analysis of IGH locus assembly errors in Philippine Flying Lemur (C. volans). A A dotplot comparing gene locations and alignments is shown for primary vs alternate haplotypes. Lines on the top and right side of the dotplot indicate IG genes locations, colored by affected/unaffected defined in panel E. We generated dotplots using Gepard(v2.1) [69]. B IGH locus length in both the primary and alternate assemblies are compared using bar charts. C A detailed analysis of alignment mismatch in the primary IGH haplotype includes (i) read coverage across the entire IGH loci, color-coded by mapping quality, and (ii) basepair-oriented mismatch rate, a heatmap above indicating the frequency of high mismatch rate base pairs, with darker colors denoting more frequent occurrences. Light red highlights positions covered by ≥ 5 reads with an error rate > 1%, and purple bars indicate coverage breaks (coverage ≤ 2). D Similar analysis to panel C for the alternate IGH haplotype. E IG gene quality from (i) read-oriented view, affected genes are IGH genes covered by > 5 poorly aligned reads and (ii) basepair-oriented view, affected genes are IGH genes without perfect support (> 80% confidence at every position)
The C. l. orion IGH loci exhibits robust coverage of approximately 30 × with high mapping quality reads (Fig. 2C i, D i). Similarly, the C. volans’ IGH loci show great coverage, around 20–40 × , with most reads having a mapping quality of 60 (Fig. 3C i, D i). In the C. l. orion assembly, there are no observed breaks in coverage, but both haplotypes show sudden increases in coverage, suggesting possible issues due to duplications or repetitive regions (Fig. 2C i, D i). In the C. volans, there is also a sudden increase in coverage in the primary haplotype (SUPER_3), suggesting similar issues (Fig. 3D i). Detailed analysis from both read-oriented and basepair-oriented perspectives reveals many regions in both species, which aligned with reads containing mismatches and substantial numbers of base pairs not well supported (2.8% and 2.2% respectively), overlap with these high-coverage regions (Figs. 2C, D and 3C, D). These mismatches, along with the indels, soft-clipped bases and increased coverage, highlight potential errors in the assemblies. Additional file 1: Fig. S2 and S3 also show additional evaluation statistics. Furthermore, looking more closely at the IGH gene sequences, where having accurate representations are the most critical, 13.5%/20.9% of the genes are not well supported from the read-oriented perspective, and 23.8%/23.3% of the genes lack perfect base support in C. l. orion and C. volans respectively (Figs. 2E and 3E).
To address these issues, we further analyzed the IGH loci for both the C. l. orion and C. volans using de Bruijn graphs.
Case 1: de Bruijn graph analysis and reassembly
Our observations indicated the existence of missing contigs in both the C. l. orion and the C. volans assemblies. To reconstruct missing sequences, we had to make additional manual analysis based on de Bruijn graphs. First we generated draft assemblies for these species using a different genome assembly tool. We chose to use La Jolla Assembler (LJA) as it was shown to generate the most reliable assemblies from HiFi reads [72, 73]. We discovered that in both species contigs reported by LJA contained regions of IG loci which were not present in the VGP assemblies. Nearly all of the reads (99.8%) with low-quality alignments to the VGP assembly had nearly perfect alignments to these new regions.
To provide additional confirmation of errors in VGP assemblies and confirm LJA assembly results, we analyzed de Bruijn graphs of IG loci. As a part of its output, LJA provides the error-corrected de Bruijn graph, where most erroneous nodes originating from errors in reads are eliminated. The resulting graph provides valuable information about the repeat structure of the genome. Figures 4A and 5A show subgraphs and contigs representing IGH loci in C. l. orion and C. volans, correspondingly. Below we discuss each of these two cases in more detail.
Reassembly analysis of IGH locus assembly errors in C. l. orion. A Primary and alternate assemblies of C. l. orion IGH locus produced by VGP and LJA are represented by blue (primary) and tan (alternate) rectangles. Telomere sequences (tandem repeats with repeating sequence CCCTAA) are shown in green. B A dotplot comparing gene locations and alignments is shown for the corrected alternate IGH assembly vs primary IGH assembly. Lines on the top and right side of the dotplot indicate IG genes locations, colored by affected/unaffected defined in panel E. C Re-analysis of alignment mismatch in the primary IGH haplotype, (i) read coverage across the entire IGH loci, color-coded by mapping quality, and (ii) delta of the basepair-oriented mismatch rate between VGP and LJA primary IGH locus (LJA—VGP basepair-oriented mismatch rate). Light red highlights positions covered by ≥ 5 reads with an error rate > 1%, and purple bars indicate coverage breaks (coverage ≤ 2). Contigs left unchanged in VGP are with names formatted as “VGP name/ LJA name.” D Similar analyses are performed for the corrected alternate haplotype. E IG gene quality from (i) read-oriented view, affected genes are IGH genes covered by > 5 poorly aligned reads; and (ii) basepair-oriented view, affected genes are IGH genes without perfect support (> 80% confidence at every position)
Reassembly analysis of IGH locus assembly errors in C. volans. A Primary and alternate sequences of IGH locus in VGP and LJA assemblies of C. volans are represented by blue (primary) and tan (alternate) rectangles. Telomere sequences (tandem repeats with repeating sequence CCCTAA) are shown in green. Dark blue edge and contig segment represent a 12-kb-long conserved fragment. Part of the IGH locus after the blue repeat is very diverged and its length varies from 0.17 Mbp for alternate haplotype to 0.7 Mbp for primary haplotype. B A dotplot comparing gene locations and alignments is shown for the corrected alternate IGH assembly vs primary IGH assembly. Lines on the top and right side of the dotplot indicate IG genes locations, colored by affected/unaffected defined in panel E. C Re-analysis of alignment mismatch in the primary IGH haplotype, (i) read coverage across the entire IGH loci, color-coded by mapping quality, and (i) delta of the basepair-oriented mismatch rate between VGP and LJA primary IGH locus (LJA—VGP basepair-oriented mismatch rate). Light red highlights positions covered by ≥ 5 reads with an error rate > 1%, and purple bars indicate coverage breaks (coverage ≤ 2). Contigs left unchanged in VGP are with names formatted as “VGP name/ LJA name.” D Similar analyses are performed for the corrected alternate haplotype. E IG gene quality from (i) read-oriented view, affected genes are IGH genes covered by > 5 poorly aligned reads; and (ii) basepair-oriented view, affected genes are IGH genes without perfect support (> 80% confidence at every position)
Figure 4A shows that LJA and VGP produced very similar primary assemblies of the C. l. orion IGH locus both consisting of a single contig covering the whole IGH locus and ending with a telomere. Both alternate assemblies consist of two contigs each but VGP assembly is missing a 1.4-Mbp fragment of contig “74247” generated by LJA (Fig. 4A, B). The de Bruijn graph generated by LJA shows a chain of “bulges,” representing alternating conserved and diverged homologous regions. This chain breaks approximately 1.4 Mbp before the end of the chromosome when the divergence between haplotypes becomes too high to be detected by the de Bruijn graph as it only collapses divergence-free sequences of length at least k (the value of k in the graph generated by LJA is 5001). Reads with low-quality alignment to VGP assembly chained together by perfectly overlapping each other and forming a series of edges in the de Bruijn graph, which eventually formed contig “74247”.
For correction, note that some regions of the IGH loci generated by VGP and LJA are nearly identical. To maintain better continuity with the rest of the genome, we retained the original primary IGH sequence from the VGP assembly. For example, SUPER_8 (from VGP) corresponds to contig “74267” (from LJA), and we opted to continue using SUPER_8. Similarly, atg000251l (from VGP) matches contig “17636” (from LJA), and we retained atg000251l. A key difference is the absence of contig “74247,” which was identified in the LJA assembly. To address this, we incorporated contig “74247” into the VGP assembly and removed any overlapping sequences, such as atg000600l. Incorporating the missing contig from the alternate assembly allowed us to bring both haplotype to length around 2.5 Mbp, and reveal 69 more IG genes than before (Fig. 4B). Furthermore, re-performing post-correction read alignment resolved the previously found erroneous reads, despite one region with low coverage in contig “74247” near the end (Fig. 4C, D; Additional file 1: Fig. S4). The break in the alternate LJA assembly is explained by a gap in alternate haplotype coverage. However, it is unclear how to explain 1.4 Mbp of missing sequence in the VGP alternate assembly. Looking more closely at the IGH gene sequences, now all IGH genes found are well supported from the read-oriented perspective, and only 1 gene lack perfect base support from basepair-oriented perspective, with 76.9% base support at 1 position (Fig. 4E).
Figure 5A shows that the primary LJA assembly of the C. volans IGH locus is split into two contigs while VGP assembly has a single contig covering the whole IGH locus. The de Bruijn graph shows that the two contigs generated by LJA are separated by a 12-kbp-long region that is perfectly conserved between chromosomes (shown in red on Fig. 5A). It can be assumed that LJA could not resolve this repeat and phase the haplotypes, thus making a decision to report multiple contigs rather than potentially report a longer chimeric contig that combines sequences from different haplotypes. The alternate VGP assembly is also broken by this repeat. It is unclear why this repeat is resolved in the primary assembly, but not in the alternate assembly. Even though VGP reported a less fragmented primary IGH assembly, the alternative assembly again is missing a 0.7-Mbp region from LJA contig “41809.” The De Bruijn graph indicates that the missing sequence is a part of a long fragment of the alternative haplotype that is greatly diverged from (does not share k-mers with) the primary haplotype.
For correction, again note that some regions of the IGH loci generated by VGP and LJA are highly identical and we apply a similar approach as in the previous case to retain continuity with the rest of the assembly. Here, SUPER_3 (from VGP) corresponds to contig “35074” and contig “35923” (from LJA), and we opted to continue using SUPER_3. Similarly, hap_ptg000193l_1 (from VGP) matches contig “40981” (from LJA), and we retained hap_ptg000193l_1. A key difference is the absence of contig “41809,” which was identified in the LJA assembly. To address this, we incorporated contig “41809” into the VGP assembly and removed any overlapping sequences, such as atg000034l and atg001440l. By incorporating this missing contig, it brings the two haplotype to more comparable lengths (2.3 and 1.5 Mbp) and identifies 77 more IG genes than before (Fig. 5B). Further analysis by re-aligning the reads resolved previously identified erroneous reads (Fig. 5C, D; Additional file 1: Fig. S5). Specifically examining the IGH gene sequences, all identified IGH genes are well supported from a read-oriented perspective. Notably, only 3 genes lack perfect base support from a basepair-oriented perspective: one gene has a deletion, and the other two have 66.7% and 76.9% base support at one position each (Fig. 5E).
Case 2: Break in coverage revealing false inversions
The assembly of the Mexican gray wolf (C. l. baileyi) was constructed as part of the Vertebrate Genome Project (VGP). Different from the previous cases, this individual has a haplotype-resolved assembly. Due to homology between haplotypes, not all reads were aligned with mapping quality 60 (Additional file 1: Fig. S6A i). We do not observe many reads aligned with mismatches, indels, soft-clipped, or hard-clipped bases in either haplotype, suggesting good confidence in base pair correctness overall (Additional file 1: Fig. S6A ii-iv). In addition, the two haplotypes are of similar length, suggesting little likelihood of a missing sequence (Fig. 6A, B). Looking at this assembly from a read-oriented view further confirmed that, despite both haplotypes exhibiting great coverage around 30 × , over half of the positions are covered by reads with a mapping quality less than 60, indicating that some aligned reads have lower confidence (Fig. 6C i, D i). From a basepair-oriented view, there are few poorly supported positions (Fig. 6C ii, D ii). Additional file 1: Fig. S6 also shows additional evaluation statistics.
Detailed analysis of IGH locus assembly errors in Mexican gray wolf (C. l. baileyi). A A dotplot comparing gene locations and alignments is shown for primary vs alternate haplotypes. Lines on the top and right side of the dotplot indicate IG genes locations, colored by affected/unaffected defined in panel E. We generated dotplots using Gepard(v2.1) [69]. B IGH locus length in both the primary and alternate assemblies are compared using bar charts. C A detailed analysis of alignment mismatch in the primary IGH haplotype includes (i) read coverage across the entire IGH loci, color-coded by mapping quality, and (ii) basepair-oriented mismatch rate, a heatmap above indicating the frequency of high mismatch rate base pairs, with darker colors denoting more frequent occurrences. Light red highlights positions covered by ≥ 5 reads with an error rate > 1%, and purple bars indicate coverage breaks (coverage ≤ 2). D Similar analysis to panel C for the alternate IGH haplotype
However, we observe an 8-kb-long break in coverage in the middle of the primary haplotype (scaffold_8:75,432,695–75,440,919) (Fig. 6C). Closer examination reveals that the reference assembly contains consecutive sequences of Ns at this break, indicating missing sequence information, and there are 0 reads spanning this break, confirming a lack of coverage (Additional file 1: Fig. S6B). Although it is possible that this break cannot be spanned by HiFi reads and that the Ns were introduced during the scaffolding step, the short length of the break in our case makes this less likely. By comparing the sequences, we observe a 1.5-Mbp inversion starting after this break in coverage in the C. l. baileyi’s haplotype 1 IGH assembly when compared to the corresponding haplotype 2 assembly (Fig. 6A). Considering that from case 1 we have two assemblies for two individuals from subspecies of the same species, it is reasonable to compare the two assemblies as they should have similar germline IG sequences. When comparing the assemblies, we confirmed again there is a 1.5-Mbp inversion located at the end of the haplotype 1 IGH contig (Additional file 1: Fig. S6D). While this inversion may be real, it is concerning that it coincides with the break in coverage we identified earlier. Therefore, we performed a reassembly of the IGH loci.
Case 2: de Bruijn graph analysis and reassembly
Figure 7A shows that both LJA and VGP produced primary and alternative assemblies fully covering the IGH locus of C. l. baileyi. However, the alternative assembly produced by LJA does not have the 1.5-Mbp inversion observed in VGP assembly. This inversion is bounded by the low covered region detected by read alignment analysis on one side and by telomeric sequence on the other side. Analysis of the homologous sequence in the LJA primary assembly revealed a telomere-like sequence (tandem repeat with units similar to CCCTAA) [74]. We speculate that a spurious overlap between reads from this region and the real telomeric region was detected, resulting in an erroneous link and introducing an erroneous inversion in the VGP assembly (highlighted in red on Fig. 7A). We then corrected this inversion in the primary haplotype of the VGP assembly while leaving the alternate haplotype unmodified. After correcting, the break in coverage was resolved, confirming again that the inversion was indeed the source of the coverage issue (Fig. 7B–D). Additional file 1: Fig. S7 shows additional re-assembly evaluation statistics.
Reassembly analysis of IGH locus assembly errors in C. l. baileyi. A Primary and alternate sequences of IGH locus in VGP and LJA assemblies of C. l. baileyi are represented by blue (primary) and tan (alternate) rectangles. Telomere sequences (tandem repeats with repeating sequence CCCTAA) are shown in green. Contigs produced by VGP for this region differ by a 1.5-Mbp inversion. This inversion puts the telomeric sequence highlighted in green into the middle of contig scaffold_8. Red link corresponds to the inversion breakpoint where it has poor read support. B A dotplot comparing gene locations and alignments is shown for the corrected haplotype 1 IGH assembly vs haplotype 2 IGH assembly. Lines on the top and right side of the dotplot indicate IG genes locations, colored by affected/unaffected defined in panel E. C Re-analysis of alignment mismatch in the corrected haplotype 1 IGH assembly, (i) read coverage across the entire IGH loci, color-coded by mapping quality, and (ii) basepair-oriented mismatch rate, a heatmap above indicating the frequency of high mismatch rate base pairs, with darker colors denoting more frequent occurrences. Light red highlights positions covered by ≥ 5 reads with an error rate > 1%, and purple bars indicate coverage breaks (coverage ≤ 2). Contigs left unchanged in VGP are with names formatted as “VGP name/ LJA name”. D Similar analyses are performed for the corrected alternate haplotype
Discussion
In our study, we focus on publicly available genome assemblies which are often utilized by the research community without additional validation. Here we are particularly interested in assessing how well these assemblies are reconstructed for the IG loci. These loci both serve as a test case for the broader problem of reconstructing complex, repetitive regions and are of great interest in their own right, given their great biomedical importance. Accurate assemblies are critical for annotating and naming the genes in the IG loci, characterizing diversity among and within species, as well as understanding the evolutionary causes and immunological consequences of this diversity—all of which may be important for developing new vaccines and therapeutics [25].
Existing single-nucleotide resolution assembly evaluation tools like CRAQ and Inspector have limitations in detecting errors specific to the IG loci. These tools fail to identify and correctly classify errors in complex, repetitive regions like our case studies (Additional file 1: Note S1). To address these shortcomings, we developed CloseRead. This pipeline is specifically designed for complex, highly heterozygous regions like IG. CloseRead offers intuitive visualizations and metrics to highlight assembly quality, enabling easy identification and rectification of errors. Again, we emphasize that our primary motivation for developing CloseRead was to address our own empirical question; we have not rigorously benchmarked this against alternative measures across a range of scenarios—this is beyond the scope of the current paper—although we suspect that it may have wide applicability and assessing this is an obvious point for future research.
Detecting errors is only the first step; further development of scalable methods for assembly process analysis is necessary to reconstruct correct sequences or at least investigate possible alternatives. This requires transforming genome assembly from a one-button black-box solution into a transparent process with clearly outlined alternatives and decisions. Such approaches are essential for identifying and correcting errors, so researchers can make reliable inferences about genomic variation and its consequences for phenotypes. Future work should prioritize automating these approaches and integrating them into existing assembly pipelines to enhance scalability and efficiency.
Conclusion
In conclusion, we conducted a systematic evaluation of 74 species (most of which were mammals) revealing large deficiencies in the representation of these loci in current assemblies. This finding is crucial as it shows that despite quality checks prior to their release, these assemblies still contain substantial inaccuracies, particularly in complex genomic regions like IG loci. Our results strongly suggest that, at least given current technologies and methodologies, that standard measures of genome-wide assembly support are not reliable indicators of the quality of complex genomic regions like the IG loci. In addition to hopefully spurring the development of more sophisticated tools for assessing and correcting, our results also serve as a useful reminder as to how poorly understood these loci are. Given that it is becoming apparent that germline variation in IG is a major determinant of variation in the adaptive immune response among individuals, it is critical that this knowledge gap is closed [26].
Methods
Data description
The dataset comprises genomic assemblies from three prominent projects: the Vertebrate Genome Project (VGP) [13], the T2T Primates project [12], and the California Conservation Genomics Project (CCGP) [14].
The Vertebrate Genome Project aims to generate high-quality, near-complete reference genomes for all vertebrate species. Here we utilize the assembled genomes of 59 species from VGP, with an average coverage of approximately 30 × . The sequencing methods employed in VGP include high-fidelity (HiFi) reads, Illumina, and HiC. Out of the 59 assemblies, 17 are haplotype-resolved, which means they provide more detailed genetic information by distinguishing between the two sets of chromosomes inherited from each parent.
The T2T Primates project focuses on creating complete, telomere-to-telomere (T2T) assemblies for primate species. Here we utilize the assembled genomes of 4 species, with an average coverage of around 160 × , using HiFi, Illumina, HiC, and Oxford Nanopore Technologies (ONT) sequencing methods. All four assemblies produced by the T2T Primates project are haplotype-resolved.
The California Conservation Genomics Project (CCGP) is dedicated to applying genomic technologies to the conservation of California’s biodiversity. Here we utilize the assembled genomes of 11 species using HiFi and HiC sequencing methods, with average coverage around 21 × . Among these assemblies, 5 are haplotype-resolved.
In total, the dataset includes genomic assemblies for 74 species, contributed by these three projects. Together, these efforts have resulted in 26 haplotype-resolved assemblies. Majority of these assemblies are for mammalian species, detailed descriptions are available in Additional file 1: Fig. S8. The specific datasets used can also be found in the Additional file 5. Here we specifically focus on the HiFi data, thus all available PacBio HiFi reads—regardless of whether they are in BAM or FASTQ format—were included in our analysis.
Alignment
For aligning the HiFi raw reads to the genome assemblies, we used Minimap2 (version 2.26-r1175) [63]. The specific arguments used for the alignment were minimap2 -ax map-hifi. This command ensures optimal alignment settings for PacBio HiFi reads, producing.bam necessary for subsequent analyses.
IG gene annotation
IG gene annotation was performed using IgDetective (version 1.1.0) [62]. IgDetective is a specialized tool designed to identify and annotate IG loci in genomic assemblies. The algorithm begins by using minimap2 to align input genomic sequences against a database of known immunoglobulin gene sequences, identifying contigs likely containing IG genes. It then searches these contigs for recombination signal sequences (RSS), which mark V(D)J gene segments. Once potential V, D, and J segments are identified, the tool refines these candidate genes through an iterative process, involving re-alignment and re-evaluation to enhance accuracy. The final output provides detailed annotations, specifying the locations and sequences of these genes within the genome.
Error quantification
We conducted a detailed analysis of genomic assemblies by evaluating aligned read data, with a special focus on spotting and quantifying errors at the IG loci. Initially, we processed the aligned raw data from the.bam and computed the.mpileup files using SAMtools [75]. Then we focus on the IG loci using the coordinates found by IgDetective.
To ensure the genomic assemblies were of high quality, we undertook a rigorous statistical examination to catalog the frequency and types of errors. Our analysis involved counting the mismatches between the reads and the reference genome to pinpoint potential assembly errors, and then calculating mismatch rates from both a read-oriented and a basepair-oriented view. Additionally, we focused on detecting breaks in coverage, specifically areas with fewer than two reads, which could indicate potential issues with the assembly process.
We then summarized this data into comprehensive statistics and insights, which are detailed in the “Results” section of our paper. Our summary included a detailed breakdown analysis of errors by gene type within the IG loci, with a specific focus on the IGH, IGK, and IGL loci. We also assessed how these errors impact the reliability of genomic assemblies across various species.
This thorough approach has allowed us to assess the quality of the assembly, providing valuable insights into the accuracy and completeness of the IG loci assemblies for the species we studied.
De Bruijn graph analysis
De Bruijn graph is a powerful instrument for visualization and analysis of sequencing reads. Observing low quality alignments of reads to assembly may just indicate an elevated error rate in reads. But if the de Bruijn graph shows that these reads perfectly overlap each other and form a path, we can conclude that these reads originate from a genome fragment that is missing from the assembly.
We analyzed the corrected de Bruijn graph printed by LJA as a part of its output (LJA was run with HiFI reads as an input and default parameters, which include default value of k = 5001). In addition, we used jumboDBG tool [72] that allowed us to specify the value of k and visualize paths of contigs through the graph (jumboDBG was run using contigs as an input with additional parameter -k 3500 to change the value of k and –paths < contigs_file > to visualize the paths of contigs through the graph.). Below we describe basic principles that we used for analysis of the de Bruijn graph.
De Bruijn graph collapses together perfectly matching fragments of the genome of length at least k. In particular homologous heterozygosity-free regions of length at least k would be collapsed, while divergent homologous regions would be represented as pairs of paths that share start and end vertex. Thus homology between haplotypes can be seen in de Bruijn graphs as a “bulge path” representing alternating diverged and non-diverged regions of the genome. In our case studies, we identified edges in the graph, corresponding to contig regions with poor read alignments and identified bulge paths containing these edges. Further analyses of edges in these bulge paths and their alignments to primary and alternate assemblies revealed the abnormalities described in case studies.
Data availability
All datasets used in this manuscript are publicly available, as described under Methods. Direct links to all datasets are available in Additional file 5. CloseRead is publicly available on GitHub under the GPLv3 license: https://github.com/applied-phylo-lab/CloseRead_ig-assembly-eval [76], and is also available on zenodo at https://doi.org/10.5281/zenodo.15242406 [77]. Curated assembly sequences are on github within the “curated_IGH” folder, and is also available on zenodo at https://doi.org/10.5281/zenodo.13125505 [78].
References
Dolzhenko E, English A, Dashnow H, De Sena BG, Mokveld T, Rowell WJ, et al. Characterization and visualization of tandem repeats at genome scale. Nat Biotechnol. 2024;2:1–9.
Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 2020;21(1):30.
Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol. 2021;39(11):1348–65.
Hayden KE. Human centromere genomics: now it’s personal. Chromosome Res. 2012;20(5):621–33.
Wenger AM, Peluso P, Rowell WJ, Chang PC, Hall RJ, Concepcion GT, et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol. 2019;37(10):1155–62.
Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol. 2018;36(4):338–45.
Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44–53.
Wang T, Antonacci-Fulton L, Howe K, Lawson HA, Lucas JK, Phillippy AM, et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature. 2022;604(7906):437–46.
Jarvis ED, Formenti G, Rhie A, Guarracino A, Yang C, Wood J, et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature. 2022;611(7936):519–31.
De Coster W, Weissensteiner MH, Sedlazeck FJ. Towards population-scale long-read sequencing. Nat Rev Genet. 2021;22(9):572–87.
Kalbfleisch TS, McKay SD, Murdoch BM, et al. The Ruminant Telomere-to-Telomere (RT2T) Consortium. Nat Genet. 2024;56(8):1566–73. https://doi.org/10.1038/s41588-024-01835-2.
Yoo D, Rhie A, Hebbar P, Antonacci F, Logsdon GA, Solar SJ, et al. Complete sequencing of ape genomes. Nature. 2025;9:1–18.
Rhie A, McCarthy SA, Fedrigo O, Damas J, Formenti G, Koren S, et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature. 2021;592(7856):737–46.
Shaffer HB, Toffelmier E, Corbett-Detig RB, Escalona M, Erickson B, Fiedler P, et al. Landscape Genomics to Enable Conservation Actions: The California Conservation Genomics Project. Murphy W, editor. J Hered. 2022;113(6):577–88.
Moon KL, Huson HJ, Morrill K, Wang MS, Li X, Srikanth K, et al. Comparative genomics of Balto, a famous historic dog, captures lost diversity of 1920s sled dogs. Science. 2023;380(6643):eabn5887.
Sullivan PF, Meadows JRS, Gazal S, Phan BN, Li X, Genereux DP, et al. Leveraging base-pair mammalian constraint to understand genetic variation and human disease. Science. 2023;380(6643):eabn2937.
Christmas MJ, Kaplow IM, Genereux DP, Dong MX, Hughes GM, Li X, et al. Evolutionary constraint and innovation across hundreds of placental mammals. Science. 2023;380(6643):eabn3943.
Foley NM, Mason VC, Harris AJ, Bredemeyer KR, Damas J, Lewin HA, et al. A genomic timescale for placental mammal evolution. Science. 2023;380(6643):eabl8189.
Andrews G, Fan K, Pratt HE, Phalke N, Zoonomia Consortium, Karlsson EK, et al. Mammalian evolution of human cis-regulatory elements and transcription factor binding sites. Science. 2023;380(6643):eabn7930.
Kaplow IM, Lawler AJ, Schäffer DE, Srinivasan C, Sestili HH, Wirthlin ME, et al. Relating enhancer genetic variation across mammals to complex phenotypes using machine learning. Science. 2023;380(6643):eabm7993.
Wilder AP, Supple MA, Subramanian A, Mudide A, Swofford R, Serres-Armero A, et al. The contribution of historical processes to contemporary extinction risk in placental mammals. Science. 2023;380(6643):eabn5856.
Osmanski AB, Paulat NS, Korstian J, Grimshaw JR, Halsey M, Sullivan KAM, et al. Insights into mammalian TE diversity through the curation of 248 genome assemblies. Science. 2023;380(6643):eabn1430.
Keough KC, Whalen S, Inoue F, Przytycki PF, Fair T, Deng C, et al. Three-dimensional genome rewiring in loci with human accelerated regions. Science. 2023;380(6643):eabm1696.
Xue JR, Mackay-Smith A, Mouri K, Garcia MF, Dong MX, Akers JF, et al. The functional and evolutionary impacts of human-specific deletions in conserved elements. Science. 2023;380(6643):eabn2253.
Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, et al. Integrating gene annotation with orthology inference at scale. Science. 2023;380(6643):eabn3107.
Pennell M, Rodriguez OL, Watson CT, Greiff V. The evolutionary and functional significance of germline immunoglobulin gene variation. Trends Immunol. 2023;44(1):7–21.
Tørresen OK, Star B, Mier P, Andrade-Navarro MA, Bateman A, Jarnot P, et al. Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Res. 2019;47(21):10994–1006.
Weissensteiner MH, Pang AWC, Bunikis I, Höijer I, Vinnere-Petterson O, Suh A, et al. Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications. Genome Res. 2017;27(5):697–708.
Townsend CL, Laffy JMJ, Wu YCB, Silva O’Hare J, Martin V, Kipling D, et al. Significant Differences in Physicochemical Properties of Human Immunoglobulin Kappa and Lambda CDR3 Regions. Front Immunol. 2016;7:388 https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2016.00388/full.
Gibson WS, Rodriguez OL, Shields K, Silver CA, Dorgham A, Emery M, et al. Characterization of the immunoglobulin lambda chain locus from diverse populations reveals extensive genetic variation. Genes Immun. 2022;24(1):21–31.
Matsuda F. CHAPTER 1 - Human Immunoglobulin Heavy Chain Locus. In: Honjo T, Alt FW, Neuberger MS, editors. Molecular Biology of B Cells. Burlington: Academic Press; 2004. p. 1–17. Available from: https://www.sciencedirect.com/science/article/pii/B9780120536412500022. Cited 2024 Jun 10.
Mikocziova I, Peres A, Gidoni M, Greiff V, Yaari G, Sollid LM. Germline polymorphisms and alternative splicing of human immunoglobulin light chain genes. iScience. 2021;24(10):103192.
Rodriguez OL, Gibson WS, Parks T, Emery M, Powell J, Strahl M, et al. A novel framework for characterizing genomic haplotype diversity in the human immunoglobulin heavy chain locus. Front Immunol. 2020;23(11):2136.
Corcoran MM, Phad GE, Vázquez Bernat N, Stahl-Hennig C, Sumida N, Persson MAA, et al. Production of individualized V gene databases reveals high levels of immunoglobulin genetic diversity. Nat Commun. 2016;20(7):13642.
Watson CT, Breden F. The immunoglobulin heavy chain locus: genetic variation, missing data, and implications for human disease. Genes Immun. 2012;13(5):363–73.
Watson CT, Steinberg KM, Huddleston J, Warren RL, Malig M, Schein J, et al. Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation. Am J Hum Genet. 2013;92(4):530–46.
Abbas AK, Lichtman AH, Pillai S. Cellular and Molecular Immunology. 10th ed. Elsevier; 2021.
Schatz DG. V(D)J recombination. Immunol Rev. 2004;200(1):5–11.
Das S, Nozawa M, Klein J, Nei M. Evolutionary dynamics of the immunoglobulin heavy chain variable region genes in vertebrates. Immunogenetics. 2008;60(1):47–55.
Choi NM, Loguercio S, Verma-Gaur J, Degner SC, Torkamani A, Su AI, et al. Deep sequencing of the murine IgH repertoire reveals complex regulation of nonrandom V gene rearrangement frequencies. J Immunol Baltim Md 1950. 2013;191(5):2393–402.
Bratsch S, Wertz N, Chaloner K, Kunz TH, Butler JE. The little brown bat, M. lucifugus, displays a highly diverse VH, DH and JH repertoire but little evidence of somatic hypermutation. Dev Comp Immunol. 2011;35(4):421–30.
Li TT, Xia T, Wu JQ, Hong H, Sun ZL, Wang M, et al. De novo genome assembly depicts the immune genomic characteristics of cattle. Nat Commun. 2023;14(1):6601.
Rodriguez OL, Safonova Y, Silver CA, Shields K, Gibson WS, Kos JT, et al. Genetic variation in the immunoglobulin heavy chain locus shapes the human antibody repertoire. Nat Commun. 2023;14(1):4419.
Ota Nei. Divergent evolution and evolution by the birth-and-death process in the immunoglobulin VH gene family. Mol Biol Evol. 1994;11(3):469–82 https://academic.oup.com/mbe/article/11/3/469/1104327/Divergent-evolution-and-evolution-by-the.
Ohlin M, Scheepers C, Corcoran M, Lees WD, Busse CE, Bagnara D, et al. Inferred allelic variants of immunoglobulin receptor genes: a system for their evaluation, documentation, and naming. Front Immunol. 2019;10:435.
AIRR community curation and standardised representation for immunoglobulin and T cell receptor germline sets - PMC. Available from: https://www-ncbi-nlm-nih-gov.libproxy2.usc.edu/pmc/articles/PMC10310305/. Cited 2024 Jul 19.
Mikocziova I, Greiff V, Sollid LM. Immunoglobulin germline gene variation and its impact on human disease. Genes Immun. 2021;22(4):205–17.
Li K, Xu P, Wang J, Yi X, Jiao Y. Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement. Nat Commun. 2023;14(1):6556.
Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–5.
Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol. 2021;38(10):4647–54.
Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–2.
Ou S, Chen J, Jiang N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 2018;46(21):e126.
Guo A, Salzberg SL, Zimin AV. JASPER: A fast genome polishing tool that improves accuracy of genome assemblies. PLOS Comput Biol. 2023;19(3):e1011032.
Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020;21(1):245.
Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE. 2014;9(11):e112963.
Warren RL, Coombe L, Mohamadi H, Zhang J, Jaquish B, Isabel N, et al. ntEdit: scalable genome sequence polishing. Bioinformatics. 2019;35(21):4430–2.
Morisse P, Marchet C, Limasset A, Lecroq T, Lefebvre A. Scalable long read self-correction and assembly polishing with multiple sequence alignment. Sci Rep. 2021;11(1):761.
Zimin AV, Salzberg SL. The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. PLOS Comput Biol. 2020;16(6):e1007981.
Chen Y, Zhang Y, Wang AY, Gao M, Chong Z. Accurate long-read de novo assembly evaluation with inspector. Genome Biol. 2021;22(1):312.
Zhang Y, Lu HW, Ruan J. GAEP: A comprehensive genome assembly evaluating pipeline. J Genet Genomics. 2023;50(10):747–54.
Madrigal G, Minhas BF, Catchen J. Klumpy: A tool to evaluate the integrity of long-read genome assemblies and illusive sequence motifs. Mol Ecol Resour. 2025;25(1):e13982.
Sirupurapu V, Safonova Y, Pevzner PA. Gene prediction in the immunoglobulin loci. Genome Res. 2022;32(6):1152–69 https://genome.cshlp.org/content/early/2022/06/13/gr.276676.122.
Li H. Minimap2: pairwise alignment for nucleotide sequences. Birol I, editor. Bioinformatics. 2018;34(18):3094–100.
Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet. 2013;14(3):157–67.
Mochizuki T, Sakamoto M, Tanizawa Y, Nakayama T, Tanifuji G, Kamikawa R, et al. A practical assembly guideline for genomes with various levels of heterozygosity. Brief Bioinform. 2023;24(6):bbad337.
Deschamps-Francoeur G, Simoneau J, Scott MS. Handling multi-mapped reads in RNA-seq. Comput Struct Biotechnol J. 2020;1(18):1569–76.
Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18(2):170–5.
Olivieri DN, von Haeften B, Sánchez-Espinel C, Faro J, Gambón-Deza F. Genomic V exons from whole genome shotgun data in reptiles. Immunogenetics. 2014;66(7):479–92.
Olivieri DN, Garet E, Estevez O, Sánchez-Espinel C, Gambón-Deza F. Genomic structure and expression of immunoglobulins in Squamata. Mol Immunol. 2016;1(72):81–91.
Ross MT, Grafham DV, Coffey AJ, Scherer S, McLay K, Muzny D, et al. The DNA sequence of the human X chromosome. Nature. 2005;434(7031):325–37.
Krumsiek J, Arnold R, Rattei T. Gepard: a rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics. 2007;23(8):1026–8.
Bankevich A, Bzikadze AV, Kolmogorov M, Antipov D, Pevzner PA. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nat Biotechnol. 2022;40(7):1075–81.
Rautiainen M, Nurk S, Walenz BP, Logsdon GA, Porubsky D, Rhie A, et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat Biotechnol. 2023;41(10):1474–82.
Chen JL, Blasco MA, Greider CW. Secondary Structure of Vertebrate Telomerase RNA. Cell. 2000;100(5):503–14.
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10(2):giab008.
Zhu Y, Watson C, Safonova Y, Pennell M, and Bankevich A. CloseRead. Github. 2025. https://github.com/applied-phylo-lab/CloseRead_ig-assembly-eval.
Zhu Y, Watson C, Safonova Y, Pennell M, and Bankevich A. applied‑phylo‑lab/CloseRead_ig‑assembly‑eval: CloseRead v1.0 – Initial Release. Zenodo. 2025. https://doi.org/10.5281/zenodo.15242406.
Zhu Y, Watson C, Safonova Y, Pennell M, and Bankevich A. Assemblies of two Greenland wolf individuals and one flying lemur performed by LJA. Zenodo. 2024. https://zenodo.org/records/13125505.
Acknowledgements
We thank Katalin Voss, Eric Engelbrecht, and the rest of the Pennell, Edge, and Mooney lab groups for helpful discussion of this work. We thank two anonymous reviewers for critical feedback on this work.
Peer review information
Wenjing She was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.
Funding
This work was supported by a Viterbi Fellowship from the Department of Quantitative and Computational Biology at the University of Southern California to YZ and NIH grant R35GM151348 to MP.
Author information
Authors and Affiliations
Contributions
All authors contributed to the design of the study and to writing the paper. YZ developed the bioinformatics tool, with help from other authors. YZ, YS, and AB analyzed the data. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Ethical approval is not applicable for this article.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
13059_2025_3594_MOESM5_ESM.xlsx
Additional file 5: Excel file containing additional data too large to fit in a PDF, related to summary of the 74 species’ IG assembly evaluation.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhu, Y., Watson, C., Safonova, Y. et al. CloseRead: a tool for assessing assembly errors in immunoglobulin loci applied to vertebrate long-read genome assemblies. Genome Biol 26, 131 (2025). https://doi.org/10.1186/s13059-025-03594-7
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13059-025-03594-7