这是indexloc提供的服务,不要输入任何密码
Skip to main content

Anchorage accurately assembles anchor-flanked synthetic long reads

Abstract

Modern sequencing technologies allow for the addition of short-sequence tags, known as anchors, to both ends of a captured molecule. Anchors are useful in assembling the full-length sequence of a captured molecule as they can be used to accurately determine the endpoints. One representative of such anchor-enabled technology is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol. LoopSeq Solo also achieves ultra-high sequencing depth and high purity of short reads covering the entire captured molecule. Despite the availability of many assembly methods, constructing full-length sequence from these anchor-enabled, ultra-high coverage sequencing data remains challenging due to the complexity of the underlying assembly graphs and the lack of specific algorithms leveraging anchors. We present Anchorage, a novel assembler that performs anchor-guided assembly for ultra-high-depth sequencing data. Anchorage starts with a kmer-based approach for precise estimation of molecule lengths. It then formulates the assembly problem as finding an optimal path that connects the two nodes determined by anchors in the underlying compact de Bruijn graph. The optimality is defined as maximizing the weight of the smallest node while matching the estimated sequence length. Anchorage uses a modified dynamic programming algorithm to efficiently find the optimal path. Through both simulations and real data, we show that Anchorage outperforms existing assembly methods, particularly in the presence of sequencing artifacts. Anchorage fills the gap in assembling anchor-enabled data. We anticipate its broad use as anchor-enabled sequencing technologies become prevalent. Anchorage is freely available at https://github.com/Shao-Group/anchorage; the scripts and documents that can reproduce all experiments in this manuscript are available at https://github.com/Shao-Group/anchorage-test.

Background

Sequence assembly has long been a critical task in computational biology, serving as a foundational step in understanding genomic structures and functions. Assembly algorithms have been driven by the rapid evolution of sequencing technologies. Early approaches focused on Sanger sequencing data, requiring algorithms that could handle relatively low-throughput, high-accuracy reads. The advent of next-generation sequencing (NGS) technologies introduced a new era of high-throughput sequencing, producing short reads with low error rates that necessitated the development of efficient and scalable assembly algorithms. Methods based on de Bruijn graphs (dBGs) were developed, such as SPAdes series [1], Velvet [2], ABySS [3], where assembling full-length sequences is formulated as finding a Eulerian path in the dBG. Recently, third-generation sequencing technologies, such as those from PacBio and Oxford Nanopore, have enabled the production of long reads, which, despite higher error rates, provide crucial information for resolving complex genomic regions and structural variants. This has led to the creation of overlap-layout-consensus assemblers, such as Flye series [4], Canu [5], and hifiasm [6]. As sequencing technologies continue to advance, assembly algorithms must continuously adapt, incorporating new strategies and algorithms to take advantage of the new features in the data and to keep pace with the increasing data complexity.

Recent advancements of synthetic long read (SLR) sequencing technologies are able to label reads from the same molecule with the same barcode/index [7,8,9]. One representative of such technology is LoopSeq Solo, where exactly one molecule is captured in each plate well and attached with a synthetic oligonucleotide, a unique molecular index that is also known as the unique molecular identifier (UMI) in many sequencing technologies (Figure 1A). LoopSeq Solo distributes the molecular index to every read that evenly covers the entire (long) molecule, whereafter accurate paired-end short reads can be sequenced by any standard NGS technology [9, 10] (Figure 1B). LoopSeq Solo exhibits high purity of their read clouds, meaning that almost all reads with the same index originated from the same molecule [11]. Taking advantage of its high purity, we are able to assemble each molecule separately instead of assembling a read cloud of multiple molecules. This practice is especially beneficial when the sequenced molecules are similar to each other, such as transcript isoforms and 16S sequences of a microbiome [10].

Modern sequencing technologies can also ligate short adapters with known sequences to both ends of a captured molecule [10, 12, 13]. Those adapters often play functional roles in sequencing, for example, the capture of target molecules, template switch, and amplification in PCR [9, 10, 13]. Part of LoopSeq’s adapters are called anchors, which are short synthetic sequences of 12 base pairs and are dissimilar from the sequenced target. Anchors, and likewise, adapters in other sequencing technologies, are extremely useful for assembly as they mark the two endpoints of molecules [10, 13, 14]. We argue that, having the ends of molecules accurately determined transforms the assembly problem into finding a path that connects the ends in an assembly graph. This scheme is computationally easier than finding either the Euler path or Hamiltonian path. Previous studies, such as ref [15,16,17,18], have leveraged the edge of a similar idea in connecting both ends of a read pair in an assembly graph and ref [14] showcased end-guided assembly of transcripts with genome reference, but to our best knowledge, such formulation has not been explored in de novo assembly of a full-length single molecule with high sequencing coverage. Sequencing depths are crucial for accurate assembly with high-coverage generally preferred. LoopSeq Solo, for example, can produce ultra-high sequencing depth, e.g., more than 1 million reads per molecule. The whole target molecule is therefore sequenced in full coverage without a gap. However, high depth may cause sequencing artifacts, which increases the complexity of the assembly graphs, requiring robust assembly algorithms to fully utilize the high sequencing depth.

Even though assembly has been extensively studied and many assemblers have been developed, none of them is specifically designed for anchor-equipped, ultra-high-depth sequencing data. In this paper we present a new assembler, Anchorage, to fill the gap. Anchorage features a new formulation for assembling anchor-enabled data, that to seek a path in the underlying assembly graph that connects the start/end anchor nodes identified by mapping anchor sequences. Leveraging the high sequencing depth, we seek the path whose minimized node weight is maximized. We design an efficient dynamic programming to find the optimal connecting path. Additionally, Anchorage includes a novel method to estimate the length of the captured sequence. The dynamic programming is adapted to incorporate the estimated length as a selection criterion. Through both simulations and real biological data, we demonstrate that Anchorage outperforms existing assembly methods on anchor-enabled, ultra-high-depth sequencing data. Notably, when sequencing artifacts are present, Anchorage exhibits a significant performance advantage.

Methods

Anchorage takes reads with the same index, which are known to originate from the target molecule, and the associated anchor sequences, which mark the ends of the target molecule, as input, and assembles the full-length sequence of the target molecule. In its first module, it estimates the length of the target molecule based on frequencies of kmers. Its second module first constructs a compact de Bruijn graph (cdBG) from the raw reads and then searches for candidate paths guided by the anchor sequences. Lastly, the candidate paths will be examined against the range of the estimated target length so as to pick one path, forming the assembled molecule.

Estimating target length

An accurate estimation of the target length is critical for determining the correct target sequence. The LoopSeq data exhibits desirable properties, including high purity (i.e., nearly all reads of a run come from the target molecule), high and evenly distributed coverage, and a low error rate. Many previous studies proposed kmer-frequency-based methods to accurately estimate the size of a genome [19, 20]. In this work, we proposed a new method to estimate the target length accurately by leveraging these properties of LoopSeq.

Let M be the (unknown) length of the target molecule, N be the number of reads, and R be the length of each read. If we assume that reads are uniformly sampled from the target molecule, the target molecule does not contain repetitive kmers, and all reads are error-free, then the frequency of each kmer can be calculated as

$$F_{kmer} = N\times (R - k + 1) / M,$$

leading to

$$M = N\times (R - k + 1) / F_{kmer}.$$

In practice, \(N\) and \(R\) are known statistics. The choice of \(k\) should ensure that most kmers in the target sequence are unique, i.e., \(k\) cannot be too small, while also making most kmers error-free, i.e., \(k\) cannot be too large. In our method, we choose \(k = 33\) as a default value, which balances well these two considerations for SLR sequencing data. The accurate estimation of \(M\) now depends on a “good” estimation of kmer frequency \(F_{\text {kmer}}\). The distribution of kmer frequency can be calculated from the sequencing reads. We found that using the average, mode, or median of the frequencies is ineffective, as these statistics are prone to disturbances from sequencing errors and repetitive kmers.

We propose using the N50 kmer frequency, defined as the frequency for which the collection of all kmers of that frequency or higher accounts for at least 50% of the occurrences of all kmers. This concept is similar to the N50 of contig lengths, which is known to be more robust against long-tail distributions or local maxima in the frequencies. Although N50 is commonly used as a measure in evaluating genome assembly methods, using the N50 kmer frequency to estimate the target length is a novel approach. By substituting the unknown variable \(F_{kmer}\) with its estimator, N50 kmer frequency, denoted as \(F_{N50k}\), the length of the target molecule M can be easily computed by \(M = N\times (R - k + 1) / F_{N50k}\). Anchorage also sets upper and lower bounds for the target length (default: 50% and 200% of the estimated target length \(M\)). These bounds are later used to choose the best full-length sequence.

Anchor-guided assembly

In addition to cell barcodes and unique molecular index (UMI), modern sequencing technologies can ligate additional known short sequences to both ends of a captured molecule. These short sequences play important roles, such as in template switching and preamplification [13], but also serve as indicators of the endpoints of the target molecule [10, 14]. LoopSeq employs similar short sequences, known as “anchors” [10]. The sequences of the start/end anchors can be mapped to the underlying assembly graph (i.e., a compacted de Bruijn graph in Anchorage) to locate the start/end anchor nodes. The task of assembling the full-length molecule now becomes finding a path from the start anchor node to the end anchor node in the assembly graph. We refer to this task as anchor-guided assembly. Note that the search space of anchor-guided assembly is much smaller than searching for the best Eulerian path or Hamiltonian path in the classic assembly formulation, thanks to the critical information provided by the anchors.

We use compacted de Bruijn graph (cdBG) as the assembly graph to organize reads originating from a target molecule (i.e., reads with the same index). In the implementation of Anchorage, SPAdes [1] is called to construct the de Bruijn graph (dBG). In a node-centric dBG, a node represents a distinct kmer and its weight is equal to the number of appearances in the reads. The cdBG is constructed by concatenating each simple path of the dBG as a single node (called a unitig). Each node \(v \in V\) has a weight w(v) calculated as the coverage of the unitig v.

Given a weighted cdBG \(G=(V, E, w)\) and the anchor sequences, Anchorage starts with identifying start/end anchor nodes by aligning the anchor sequences to the nodes of G. Note that neither the anchor sequences nor the unitigs may be error-free. To be able to tolerate such errors, Anchorage employs an iterative approach: it first locates nodes that can exactly match the anchor sequences (i.e., assuming no errors); if such start/end anchor nodes can be identified and “appropriate” path (i.e., full-length molecule) can be assembled, then the algorithm terminates; otherwise Anchorage increments the tolerance of edit distance by 1 and repeats the procedure, until a user-defined maximum edit distance is reached (default: 2). Since anchor sequences are usually much shorter than the kmer size of a dBG (for instance, anchors of LoopSeq Solo have 12 base pairs), tolerating a maximal edit distance of 2 should be sufficient to locate anchors. In rare cases when this threshold is insufficient, users can change the parameter –max_nm_anchors to a desired value.

Note also that multiple start/end anchor nodes might be identified in each iteration, due to repeats or sequencing errors/artifacts. Anchorage will consider each pair of start/end anchor nodes and seek an optimal path that connects them in the cdBG. The optimal connecting paths will also be scored, and the one with the maximum score (across all pairs) will be selected and the corresponding full-length sequence will be reported. In case of multiple connecting paths having the same maximum score, the estimated sequence length will be used to break the tie by picking the one whose length is the closest to the estimation. The framework of Anchorage is given as the pseudo-code in Algorithm 1; the algorithm for finding the optimal connecting path together with its score is described in the next section.

Algorithm 1
figure a

Anchor-guided Assembly

Finding optimal connecting path

Let G, s, and t be the given cdBG and the start/end anchor nodes. We aim to find the “optimal” path in G from s to t. We define the optimal path first. Note that the “true” path corresponds to the target sequence, which must be covered by most reads. We therefore define the optimal connecting path to be the one whose smallest node weight is maximized. When there exist two paths whose smallest (node) weight is equally maximized, we compare their second smallest weight, and so on. Formally, let \(p_1\) and \(p_2\) be two paths in G from s to t. Let \(w_1^i\) and \(w_2^i\) be the ith smallest weight in path \(p_1\) and \(p_2\), respectively. We then define \(p_1\) to be better than \(p_2\), if there exists an integer k such that \(w_1^i = w_2^i\) for all \(1\le i < k\), and \(w_1^k > w_2^k\). We argue that this definition is suitable for anchor-guided assembly, as it selects the path with the strongest support from reads, while also automatically ruling out false paths due to sequencing errors which often have low coverage. We note that a similar formulation has been used in the context of reconstructing the entire fragment (or its alignment) of paired-end RNA-seq reads and in transcript assembly [15, 16, 21].

The optimal connecting path can be calculated efficiently using a dynamic programming algorithm, as this definition satisfies the optimal substructure property. Specifically, we define d(lv) as the maximized smallest weight from s to node v using up to l edges, \(v\in V\). We have this recurrence: \(d(l,v) = \max \{d(l-1,v), \max _{(u,v)\in E}(\min (d(l-1,u),w(v) )\}\). However, the length of this single optimal sequence may not fall in the reasonable range \([M_l, M_u]\) (default: \(M_l=0.5M, M_u=2M]\)). To take into account the estimated sequence length, Anchorage calculates the best c optimal paths, where c is a user-defined parameter (default: \(c=30\)). These top c optimal paths can be calculated by extending the above dynamic programming algorithm. Specifically, we replace d(lv) with a priority queue pq(lv) of size up to c, storing the maximized smallest weight of the best c paths from s to v using at most l edges. To update, we consider each in-edge \((u,v)\in E\) of v, and examine each element z stored in \(pq(l-1, u)\). Let \(z' = \min \{z, w(v)\}\). If pq(lv) is full and \(z' > pq(l,v).\text {smallest-key}()\), which means the examined path to u expanded by (uv) leads to a better path than the worst one stored in pq(lv), we update it by doing pq(lv).pop() and \(pq(l,v).\text {insert}(z')\); if pq(lv) is not full, we simply do \(pq(l,v).\text {insert}(z')\). This operation takes \(\Theta (c\cdot \log c)\) time, Hence, updating all in-edges of all vertices takes \(\Theta (c\cdot \log c \cdot |E|)\) time.

Note that if \(c = 1\) then l can be limited to \(|V|-1\), as the single optimal path must not contain cycles. However, when \(c > 1\), paths might contain cycles. Hence, we have to consider l from 1 all the way to |E|, which slows down the algorithm. We can leverage the estimated upper bound \(M_u\) to speed up. Once a path reaches the upper bound, we can exclude it from expanding. Specifically, an element in a priority queue is now a pair (zL) where z remains the smallest weight and L stores the corresponding sequence length. The above updating procedure will be executed only if \(L + L(u,v) \le M_u\), where L(uv) denotes the length increased by expanding edge (uv). Doing so will accelerate the termination, as after certain rounds, which is likely much smaller than |E|, the optimal c paths will not get better, and then the algorithm will (safely) terminate (lines 20–22 in Algorithm 2). The runtime of this algorithm is \(\Theta (c\cdot \log c \cdot |E|\cdot l^*)\), where \(l^*\) is the number of rounds executed, \(l^* \le |E|\). The space taken by this algorithm is \(O(c \cdot l^* \cdot |V|)\) which is the size of the dynamic programming table. The dynamic programming algorithm is given as the pseudo-code in Algorithm 2.

Algorithm 2
figure b

Connect a pair of start/end anchor nodes

Results

The anchor-enabled, ultra-high coverage sequencing technology represented by LoopSeq Solo offers an unprecedented opportunity for accurately detecting full-length captured molecules. The high purity of reads allows for assembling each molecule separately, the anchors enable precise determination of endpoints and the high coverage reveals the true molecule as the most abundant path in the assembly graph. All of these superior properties have been leveraged by and modeled in Anchorage. However, these advantages come with some costs. Sequencing artifacts may occur, resulting in more complicated assembly graphs. In Section Investigation of sequencing artifacts and depths, we investigate the sequencing artifacts on 7 real datasets produced by LoopSeq Solo, proving their presence; on the same dataset, we show that the N50 kmer frequency calculated in Section Estimating target length gives a more accurate estimation than other methods. We then compare the assembly accuracy of Anchorage with the state-of-the-art assemblers on these real data in Section Investigation of sequencing artifacts and depths, on simulated data without artifacts in Section Assembly of simulated SLR data without artifacts, on simulated data with two types of artifacts in Section Assembly of simulated SLR data with read-throughs only and Assembly of simulated SLR data with both read-throughs and repetitive regions.

Investigation of sequencing artifacts and depths

To investigate the presence of artifacts, we retrieved LoopSeq Solo sequencing reads of seven controlled 16S molecules whose ground truth nucleotide sequences are known (Table 1). Bacterial genomic DNA materials were retrieved from ATCC (catalog number 19718D-5, 47085D-5, BAA-3050, 27774D-5). Ground truth 16S DNA sequences were downloaded from the ATCC genome portal [22]. We first performed quality control using Trimmomatic [23]. Then, reads with the same molecular index were aggregated and their index sequences were trimmed. Afterward, we aligned all the reads to the ground truth sequences using STAR [24], enabling chimera detection with –chimSegmentMin 20. Chimeric reads were analyzed to reveal sequencing artifacts. See the results in Table 1.

We observed 235 to 569 uniquely mapped read pairs supporting back-splicing junctions (BSJs) of 100bp or longer (#BSJs in Table 1). A BSJ is a special type of junction where the donor site is downstream of its acceptor site, opposite to ordinary forward splicing junctions. The BSJs indicate read-through events in the circular amplicon sequencing used by LoopSeq [9], resulting in read-through reads that span the end and the start of the captured molecule. It is noteworthy that the reported reads with BSJs may greatly underestimate the number of read-through reads because Trimmomatic was applied, which trimmed away low-quality regions and read-through reads. Additionally, only properly aligned read pairs, that were reported by the STAR aligner in the SAM format, are included. STAR reported 7.63% to 25.28% unaligned reads, which we believe includes a significant portion of read-through reads as well.

We also observed that approximately 1% of the read pairs from each control formed chimeras between two 16S molecule species (#IMJs in Table 1). Furthermore, six of the seven controls have at least one forward splice junction supported by at least 1,000 uniquely mapped reads. These splice junctions indicate potential artifact molecules. One possible reason for the high IMJ rate is that recombination is more likely to happen between highly similar molecules, e.g. 16S sequences, and the recombinant was formed and amplified in the PCR cycles.

The observed sequencing artifacts significantly increase the complexity of the underlying assembly graph. For example, artifact forward splice junctions result in erroneous edges, inter-molecular chimeras can create both erroneous edges and additional anchor nodes, and intra-molecular back-splice junctions may introduce cycles in the graph. These factors must be considered for a comprehensive comparison of different methods. Therefore, we performed the evaluation on simulated data both with and without the introduction of sequencing artifacts (Sections Assembly of simulated SLR data without artifacts, Assembly of simulated SLR data with read-throughs only, and Assembly of simulated SLR data with both read-throughs and repetitive regions).

As illustrated in Section Estimating target length, an accurate estimation of target length is reliant on the accurate estimation of a “good” kmer frequency \(F_{kmer}\). We compared N50 kmer frequency with other estimators of \(F_{kmer}\), such as the average, median, and mode of kmer frequencies (Table 2). Notably, some estimators were computed based on the frequencies of those kmers whose frequency is higher than 10 or 100, respectively, to rule out fortuitous kmers due to random sequencing errors. Actually, the median of all kmer frequencies is always 1 for all seven controls. Also, estimators based on kmers with a frequency higher than 10 are extremely unreliable and always much worse than those based on kmers with a frequency higher than 100. The N50 kmer frequency is the best estimator of \(F_{kmer}\) in six controls and its difference from the ground truth is less than or equal to 10% in five controls, marking its superiority and reliability in application. Even though the mode of frequencies of kmers whose frequency is higher than 100 is the best in the fifth control, the N50 kmer frequency has a very close performance. Nevertheless, it is intriguing and arbitrary to determine under which frequency (in this example, 100) kmers are unreliable. The differences between estimations and ground truths are relatively large in the second and fourth control, this might indicate those two samples have more sequencing errors and artifacts, while N50 kmer frequency is a significantly better and more robust estimator than the other options.

Assembly of real biological samples by LoopSeq Solo

We evaluate the assembly accuracy on the same dataset with seven LoopSeq Solo samples studied in Section Investigation of sequencing artifacts and depths. Reads with the same index were grouped together and piped to each algorithm. Each control has 0.7M–1.9M paired-end reads (Table 1). However, assembling a \(\sim\)1500bp sequence (length of the known ground-truth molecule) with more than half a million reads is unnecessary, as the tremendously excessive reads (approximately 112,000-304,000\(\times\) depth) do not contribute new information but errors and artifacts. To fairly reduce time and computational complexity, we sampled 10,000 reads for each control and performed the experiments. This is approximately equal to 1,600\(\times\) depth which is already ultra-high compared to an ordinary sequence assembly and fits the purpose of this study.

We compared Anchorage with SPAdes [1] and MEGAHIT [25]. Both tools are very popular in assembling single-cell sequencing and metagenomic sequencing reads. Besides, SPAdes is the only assembler that was previously used by and integrated in the LoopSeq analysis pipeline [10, 26]. Since the ultra-high sequencing depth often brings detrimental artifacts to other assembly algorithms (discussed below), we also ran SPAdes and MEGAHIT with randomly down-sampled 500 reads. It is often possible that a state-of-the-art algorithm outputs multiple contigs for one instance. We selected the best contig with the following preference: contigs with both start and end anchors present, contigs with either anchor present, contigs without anchors present, and then trimmed extra sequences outside anchors. If multiple contigs have the same maximal number of anchors identified, the longest contig is selected as the output. Those selection criteria for SPAdes and MEGAHIT follow the current published LoopSeq pipeline [10]. We used QUAST [27] to evaluate the seven assemblies of each algorithm against their respective known ground-truth sequence. The minimal contig length was set to 200 bp. Four metrics are reported (see below). Experimental details such as the parameters used can be found in online code repositories (Section 5). The comparison is given in Figure 2.

QUAST aligns the assembled sequence to the reference (i.e., ground-truth sequence). The first metric is genome fraction percentage (GFP), defined as the percentage of bases in the reference that are aligned by the assembly. This metric reflects the sensitivity of assembly methods. Anchorage achieved a very high GFP for all assemblies, averaging 97.5%. This is 21% and 39% higher than that of the second and third-best methods, MEGAHIT with down-sampled 500 reads and SPAdes with down-sampled 500 reads. Notably, the GFP of SPAdes with all reads is less than half of that of its down-sampled counterpart and the same for MEGAHIT. The second metric is largest alignment ratio (LAR), defined as the ratio of the largest continuous alignment in the assembly to the total number of bases in the assembly, reflecting the precision of different assembly methods. Anchorage achieved a near-perfect LAR for all assembled sequences, 17.8% and 39.4% higher than that of MEGAHIT with down-sampling and SPAdes with down-sampling. Similar to the GFP value, the LARs of SPAdes/MEGAHIT with all reads are less than half of that of their 500-read counterpart. Nevertheless, SPAdes with all reads only assembled two contigs that passed QUAST’s quality filter and properly aligned to the reference. Whilst MEGAHIT with all reads assembled three contigs passing the filter, only one of them has both high GFP and LAR. This might indicate that sequencing artifacts are detrimental to Spades and MEGAHIT with all reads, but down-sampling could help in removing those damages. It is noteworthy that the 2 (resp. 5) assembled sequences by SPAdes with all reads (resp. SPAdes with down-sampling) are very accurate while the other 5 (resp. 2) assembled sequences almost did not align to the reference at all (blue dots in Figure 2).

QUAST also reported the average number of mismatches per 1000bp and the average number of indels per 100kbp. The rate of mismatches of Anchorage and SPAdes with all reads are very low and no indels were reported in them. On the other hand, MEGAHIT with all reads has the highest mismatch rate and indel rate, at least 10 times higher than the others. QUAST did not report misassemblies on any of the assemblies.

The presence of anchors in the assembled sequence is a positive signal suggesting an assembly contig is accurate and/or full-length. Supplementary Note S1 and Figure S1 show that many false positive contigs will be output if not screened by anchors. LoopSeq pipeline considers an assembled sequence as full-length if it contains both anchors [10]. In this experiment, Anchorage identified both anchors for all 7 sequences (not a surprise, as Anchorage directly models anchors), while the second best method, MEGAHIT with 500 reads, missed the end anchor for one control and missed both anchors for another control. However, the presence of anchors does not always indicate completeness. For example, SPAdes with all reads reported both anchors for 6 of its assemblies but only two of them aligned to their respective ground truths. This observation indicates that even though anchors provide many benefits in an accurate assembly, a sole reliance on anchors to determine the completeness of an assembly is inaccurate.

Anchorage had a reasonable time and memory usage on those real biological datasets, although its running time is the longest among all tools. Anchorage took 77.94s/205.8MB to assemble all seven contigs. SPAdes took 54.67s/205.7MB to assemble with all reads and 42.98s/187Mb to down-sample and assemble with 500 reads. MEGAHIT took 2.00s/208MB to assemble with all reads and 3.37s/205MB to down-sample and assemble with 500 reads. Random down-sampling of reads was performed using Python scripts from the LoopSeq pipeline and it was counted in the total time. All experiments were performed on a 2020-model iMac with 8 Intel i7 Cores and 64 GB Memory.

Assembly of simulated SLR data without artifacts

We then evaluated Anchorage and compared it with other methods on anchor-enabled SLR data with simulations. The ground-truth, full-length sequences were retrieved from NCBI, comprising of 23 16S genes of length ranging from 547bp to 3600bp (see detailed description in Table 3). Subsequently, two 12bp anchors (start: CGCAGAGTACAT, end: TTGGAGTTAAAG), which are the same as used in real LoopSeq Solo sequencing, are concatenated to respectively the start and end of each sequence. Afterward, Polyester [28] was used to simulate 110bp-long paired-end reads with a 0.5% error rate. We simulated a series of sequencing depths of 50\(\times\), 100\(\times\), 500\(\times\), 3000\(\times\) from ordinary depth to ultra-high depth. Reads from the same 16S gene are grouped together and piped to downstream assemblers, so that the grouped reads can be considered as index-aggregated LoopSeq Solo reads after quality control and trimming. Each method assembles each of the 23 samples into a full-length sequence using the method described in Section Assembly of real biological samples by LoopSeq Solo. The assemblies were then evaluated using QUAST [27] against their respective ground truths.

Overall, when no sequencing artifacts are present and the sequencing depths are high, all methods produce accurate assemblies. When depth is higher or equal to 500\(\times\), all methods achieved 95% or higher averaged GFP (over the 23 instances). SPAdes and MEGAHIT performed the best when using all reads, but they were closely followed by Anchorage. Anchorage has a lower GFP when the sequencing depth is lower than 100\(\times\). This could be because Anchorage models the minimal weight of nodes in a path that is less robust under low depths. When coverage is low, node weights will have a higher variation rate, but our dynamic programming algorithm may not handle large variances well. Additionally, there might be a lower chance of finding the proper anchor sequences under low coverage because NGS reads often have lower coverage at the termini of the molecule than in the middle of the molecule. The average GFP gradually increases as the sequencing depth increases for both Anchorage and SPAdes with all reads. However, this trend is not observed for methods with down-sampled 500 reads, likely due to missing or decreased coverage in some regions caused by down-sampling. As for LAR metrics, We can see that all methods achieved nearly perfect precision. Anchorage and two MEGAHIT methods reported zero mismatches and indels across all sequencing depths.

Assembly of simulated SLR data with read-throughs only

The aforementioned read-through scenarios are more prevalent in high sequencing depth. To evaluate the impact of such a scenario on the assembly methods, we simulated reads with read-throughs. The simulation was done by concatenating an “anchored” 16S gene to itself 5 times so that reads may span from the end of the sequence to its start, stimulating the “read-though” events. Using the 5\(\times\) self-concatenated 16S genes as references, simulations were performed with the same settings as in Section Assembly of simulated SLR data without artifacts.

The results were demonstrated in Figure 4. Compared to the simulation results without read-throughs (Figure 3), the accuracy of Anchorage was minimally impacted: it achieved a greater than 95% GFP which is better than all other methods on all sequencing depths, and nearly perfect LAR for all sequencing depths. On the contrary, the GFPs of the other methods were much reduced in the presence of read-throughs. The numbers of two MEGAHIT methods and SPAdes with 500 reads dropped to approximately 77-89% from >96% under various depths. Furthermore, SPAdes with all reads were impacted the most. Its GFP dropped to 41% from 99% under 3000\(\times\) depths. Whilst the LARs of all algorithms are almost equally satisfying with being around 99%, SPAdes using all reads had their PAR dropped from 100% to 73.5% under 3000\(\times\) depths. Those observations confirmed that Anchorage is more robust to sequencing artifacts such as read-throughs, thanks to its design that leverages the anchors to accurately determine the sequence ends. Anchorage and MEGAHIT reported no mismatches in this experiment, but both of the two MEGAHIT methods’ indel rate is very high around 18-63 indels per 100kbp. QUAST reported zero misassemblies for all assemblies.

Assembly of simulated SLR data with both read-throughs and repetitive regions

Repeats in the molecule pose a major challenge to assembly methods, as they cause tangled assembly graphs while making the length of the target molecule much harder to estimate. We simulated anchor-enabled SLR data with both repetitive regions and read-throughs to test its impact on the assembly methods. For each of the 16S gene used in Section Assembly of simulated SLR data without artifacts, we randomly copied 10% of each sequence and inserted it back into themselves at a random location. The other simulation parameters are again the same as previously described in Section Assembly of simulated SLR data without artifacts. All simulated data were tested with all methods.

The comparison was given in Figure 5. The assembly of sequences with repeats appears harder for all methods, evident by the drop in GFP of all methods under all depths. Anchorage achieved the highest GFPs, topping at 96.8%, for all sequencing depths, which is approximately 24%–29.5% higher than the second-best method under various sequencing depths. Unlike previous experiments where GFP of Anchorage increases as the depth increases, assemblies with repetitive regions in different sequencing depths exhibit roughly the same level of accuracy, indicating the source of error is mainly from the complicated structures of the assembly graphs caused by repetitives instead of insufficient coverage. The LARs of most algorithms are near-perfect, with the exception of SPAdes with all reads under 3000\(\times\). SPAdes with all reads have a higher mismatch rate while the other methods reported almost zero mismatches. Both two MEGAHIT methods report high indels rates in this experiment. Both Anchorage and MEGAHIT have misassemblies in several contigs under various sequencing depths, most of which are duplications of themselves. This indicates that the read-throughs with repetitive sequences impact both algorithms and more careful algorithm curation is needed. On the other hand, SPAdes with all reads take a more conservative strategy to assemble shorted contigs, as reflected in its low GFP.

Conclusions and discussions

We introduce Anchorage, a novel sequence assembler designed for anchor-enabled, high sequencing depth synthetic long reads data. Anchorage incorporates several algorithmic innovations, including a robust k-mer-based method for estimating the length of the target molecule, an innovative approach that efficiently models anchors and high sequencing depth while being resilient to sequencing errors and artifacts, and an efficient dynamic programming algorithm that identifies optimal paths while integrating the estimated sequence length. We evaluated Anchorage against state-of-the-art methods using both simulated and real datasets. Anchorage demonstrated significantly improved accuracy in the presence of sequencing artifacts. Moreover, unlike other methods that experienced decreased accuracy with larger input sizes, Anchorage maintains robust and consistent performance, particularly with high sequencing depth.

We would like to note that Anchorage is highly accommodated to assemble anchor-labeled single molecules, where the targeted molecules often have lengths between several hundred to dozens of thousand base pairs, such as RNA transcripts and 16S genes. One major advantage of SLRs, exemplified by LoopSeq, is that reads from a relatively small region are labeled and aggregated prior to assembly. Hence, assemblies of each SLR are separated in a pure read cloud. The end users of SLR often expect to receive “long reads” rather than a collection of contigs that are potentially partial or misassembled. Interpreting biological insights from such contigs becomes much harder than from a continuous synthetic long read. We emphasize that anchors are critical for a clear end-to-end definition of an SLR. The SLR assembly task differs from assembling a whole human-sized genome. Consequently, one single continuous assembly (i.e. a “synthetic” long read) is strongly preferred rather than scaffolds of a large genome. The two state-of-the-art assembly methods, SPAdes, and MEGAHIT, are not specifically designed for this task. While they may assemble partial scaffolds, stitching partial assemblies together for a continuous contig requires adequate manual curation and prior knowledge of the sequenced target. We admit that in the case of assembling large genomes, the information provided by anchors and coverage will be diluted and the admixed read clouds increase the problem complexity very much, so Anchorage requires considerable modifications to perform general-purpose genome assembly.

To the best of our knowledge, LoopSeq and LoopSeq Solo are the only sequencing technologies that produce anchor-equipped, high-coverage data. Consequently, these were the only real datasets we tested. However, Anchorage is applicable to any data that possesses these two properties. For example, in principle, adding short synthetic anchors to sequence adapters is practical and increasing sequencing depths requires only increasing PCR amplification cycles. As such sequencing technologies become more prevalent, we anticipate that Anchorage will see broad adoption. As a future direction, we plan to extend Anchorage to assemble multiple target molecules, enabling applications in transcript assembly, metagenome assembly, and synthetic long-read (SLR) assembly with lower purity.

We acknowledge that the dynamic programming algorithm can be further optimized by incorporating ideas from existing algorithms for the k shortest paths problem, such as Eppstein’s algorithm [29]. Anchorage currently employs a straightforward algorithm for ease of implementation. To scale for large-scale data in case of need, these advanced optimization techniques can be integrated in the future. We also acknowledge that when a large region of short tandem repeats is present, the nature of Anchorage’s algorithm may result in assembling a less accurate number of repeats. As a solution, our algorithm outputs \(c\) optimal paths instead of only 1 output and then selects the best contig by examining the contig length. Providing the SLR target molecules are often only a few thousand nucleotides long and supposedly do not carry complex tandem repeats, the tunable parameter \(c\) (default: 30) should be enough for most of the cases. In the future, we can improve the algorithm to better estimate the number and the unit size of tandem repeats by examining kmer counts in the repetitive region against the non-repetitive region and by incorporating the distribution of insert size.

Fig. 1
figure 1

Overview of SLR assembly. A Top: One molecule is targeted in the sequencing experiment. Middle: Known synthetic sequences, such as anchors, and a random unique molecular identifier (UMI) are ligated to the ends of the molecule. Bottom: During the SLR sequencing, standard NGS short reads are sequenced from the molecule and each of them carries one UMI that is the same as the original molecule. B Top: During the SLR sequencing, the molecule amplifies and circularizes with random sizes. Middle: Standard NGS short read pairs are sequenced from the circularized molecules. For each read pair, R1 is sequenced starting from the UMI and R2 is sequenced inside the fragment. Hence, all NGS read pairs from the same original molecule carry the same UMI. Bottom: Since the sizes of circular molecules are random, the short NGS reads are randomly and evenly distributed from the original molecule.

Fig. 2
figure 2

Comparison of assembly accuracy on real LoopSeq Solo sequencing datasets. Anchorage, SPAdes, and MEGAHIT used all reads; SPAdes500 and MEGAHIT500 used 500 reads via random downsampling. The height of each bar represents the average value of each metric and the average value is labeled on each bar. Each dot represents the value of one assembly.

Fig. 3
figure 3

Comparison of assembly accuracy on simulated reads without artifacts. Anchorage, SPAdes, and MEGAHIT used all reads; SPAdes500 and MEGAHIT500 used 500 reads via random downsampling. The height of each bar represents the average value of each metric and the average value is labeled on each bar. The whiskers in the GFP and LAR panels extend from the 25th to 75th percentile of values in each metric.

Fig. 4
figure 4

Comparison of assembly accuracy on simulated reads with read-throughs. Anchorage, SPAdes, and MEGAHIT used all reads; SPAdes500 and MEGAHIT500 used 500 reads via random downsampling. The height of each bar represents the average value of each metric and the average value is labeled on each bar. The whiskers in the GFP and LAR panels extend from the 25th to 75th percentile of values in each metric.

Fig. 5
figure 5

Comparison of assembly accuracy on simulated reads with repetitive sequences. Anchorage, SPAdes, and MEGAHIT used all reads; SPAdes500 and MEGAHIT500 used 500 reads via random downsampling. The height of each bar represents the average value of each metric and the average value is labeled on each bar. The whiskers in the GFP and LAR panels extend from the 25th to 75th percentile of values in each metric.

Table 1 Information of controlled LoopSeq Solo sequencing of seven 16S molecules
Table 2 Comparison of different estimators for a “good" kmer frequency
Table 3 Information of 16S sequences used in simulation

Data availability

The Anchorage tool is available at https://github.com/Shao-Group/anchorage. The scripts that reproduce the experimental results of this manuscript are available at https://github.com/Shao-Group/anchorage-test. The code of Anchorage is also archived on Software Heritage at https://archive.softwareheritage.org/swh:1:snp:79d1ae296aaa93c6ecd2fde89b245d05bfcb0eb3;origin=https://github.com/Shao-Group/anchorage.

References

  1. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Birol I, Jackman SD, Nielsen CB, Qian JQ, Varhol R, Stazyk G, Morin RD, Zhao Y, Hirst M, Schein JE, Horsman DE, Connors JM, Gascoyne RD, Marra MA, Jones SJM. De novo transcriptome assembly with ABySS. Bioinformatics. 2009;25(21):2872–7.

    Article  CAS  PubMed  Google Scholar 

  4. Kolmogorov M, Yuan J, Lin Y, Pevzner P. Assembly of long error-prone reads using repeat graphs. Nat Biotechnol. 2019;37:540–6.

    Article  CAS  PubMed  Google Scholar 

  5. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Gen Res. 2017;27(5):722–36.

    Article  CAS  Google Scholar 

  6. Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18(2):170–5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Hiatt JB, Patwardhan RP, Turner EH, Lee C, Shendure J. Parallel, tag-directed assembly of locally derived short sequence reads. Nat Methods. 2010;7(2):119–22.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Hong LZ, Hong S, Wong HT, Aw PP, Cheng Y, Wilm A, de Sessions PF, Lim SG, Nagarajan N, Hibberd ML, Quake SR, Burkholder WF. BAsE-Seq: A method for obtaining long viral haplotypes from short sequence reads. Gen Biol. 2014;15(11):517.

    Article  Google Scholar 

  9. Stapleton JA, Kim J, Hamilton JP, Wu M, Irber LC, Maddamsetti R, Briney B, Newton L, Burton DR, Brown CT, Chan C, Buell CR, Whitehead TA. Haplotype-Phased Synthetic Long Reads from Short-Read Sequencing. PLoS ONE. 2016;11(1):0147229.

    Article  Google Scholar 

  10. Liu S, Wu I, Yu Y-P, Balamotis M, Ren B, Ben Yehezkel T, Luo J-H. Targeted transcriptome analysis using synthetic long read sequencing uncovers isoform reprograming in the progression of colon cancer. Commun Biol. 2021;4(1):1–11.

    Article  Google Scholar 

  11. Mak L, Meleshko D, Danko DC, Barakzai WN, Maharjan S, Belchikov N, Hajirasouliha I. Ariadne: Synthetic long read deconvolution using assembly graphs. Genome Biol. 2023;24(1):197.

    Article  PubMed  PubMed Central  Google Scholar 

  12. Hagemann-Jensen M, Ziegenhain C, Chen P, Ramsk ldD, Hendriks G-J, Larsson AJM, Faridani OR, Sandberg R. Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat Biotechnol. 2020;38:708–14.

    Article  CAS  PubMed  Google Scholar 

  13. Picelli S, Faridani OR, Bj K, rklund, Winberg G, Sagasser S, Sandberg R. Full-length RNA-seq from single cells using Smart-seq2. Nat Protoc. 2014;9:171–81.

    Article  CAS  PubMed  Google Scholar 

  14. Schon MA, Lutzmayer S, Hofmann F, Nodine MD. Bookend: Precise transcript reconstruction with end-guided assembly. Genome Biol. 2022;23(1):143.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Zhang Q, Shi Q, Shao M. Accurate assembly of multi-end RNA-seq data with scallop2. Nat Comput Sci. 2022;2(3):148–52.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Li X, Shao M. On de novo Bridging Paired-end RNA-seq Data. In: Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. BCB ’23, pp. 1–5, New York, NY, USA 2023.

  17. Kallenborn F, Schmidt B. CAREx: context-aware read extension of paired-end sequencing data. BMC Bioinf. 2024;25(1):186.

    Article  Google Scholar 

  18. Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, Berlin AM, Aird D, Costello M, Daza R, Williams L, Nicol R, Gnirke A, Nusbaum C, Lander ES, Jaffe DB. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci. 2011;108(4):1513–8.

    Article  CAS  PubMed  Google Scholar 

  19. Hozza M, Vinař T, Brejová B. How Big is that Genome? Estimating Genome Size and Coverage from k-mer Abundance Spectra. In: Iliopoulos C, Puglisi S, Yilmaz E, editors. String Processing and Information Retrieval. Cham: Springer International Publishing; 2015. p. 199–209.

    Chapter  Google Scholar 

  20. Vurture GW, Sedlazeck FJ, Nattestad M, Underwood CJ, Fang H, Gurtowski J, Schatz MC. GenomeScope: Fast reference-free genome profiling from short reads. Bioinformatics. 2017;33(14):2202–4.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Zahin T, Shi Q, Zang XC, Shao M. Accurate assembly of circular rnas with terrace. In: Ma J, editor. Research in Computational Molecular Biology. Cham: Springer; 2024. p. 444–7.

    Chapter  Google Scholar 

  22. Benton B, King S, Greenfield SR, Puthuveetil N, Reese AL, Duncan J, Marlow R, Tabron C, Pierola AE, Yarmosh DA, Combs PF, Riojas MA, Bagnoli J, Jacobs JL. The ATCC genome portal: microbial genome reference standards with data provenance. Microbiol Res Announc. 2023;10(47):00818–21.

    Google Scholar 

  23. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–20.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21.

    Article  CAS  PubMed  Google Scholar 

  25. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31(10):1674–6.

    Article  CAS  PubMed  Google Scholar 

  26. Callahan BJ, Grinevich D, Thakur S, Balamotis MA, Yehezkel TB. Ultra-accurate microbial amplicon sequencing with synthetic long reads. Microbiome. 2021;9(1):130.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Frazee AC, Jaffe AE, Langmead B, Leek JT. Polyester: simulating RNA-seq datasets with differential transcript expression. Bioinformatics. 2015;31(17):2778–84.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Eppstein D. Finding the k shortest paths. SIAM J Comput. 1998;28(2):652–73.

    Article  Google Scholar 

Download references

Acknowledgements

We thank Qimin Zhang and Qian Shi for constructive discussions and suggestions on this work. This work is supported by the US National Science Foundation (2019797 and 2145171 to M.S.) and by the US National Institutes of Health (R01HG011065 to M.S.).

Funding

This work is supported by the US National Science Foundation (2019797 and 2145171 to M.S.) and by the US National Institutes of Health (R01HG011065 to M.S.).

Author information

Authors and Affiliations

Authors

Contributions

X.C.Z., T.B.-Y., R.K., and M.S. conceived of the project. X.C.Z., X.L., and M.S. designed the algorithm. X.C.Z. implemented the software. K.M. generated the sequencing data. X.C.Z. and X.L. performed the experiments. X.C.Z. and M.S. wrote the manuscript with the help of all authors. R.K. and M.S. supervised the project. All authors reviewed and approved the manuscript.

Corresponding authors

Correspondence to Ryan Kelley or Mingfu Shao.

Ethics declarations

Competing interests

K.M., T.B.-Y., and R.K. are current or former employees of Element Biosciences and may hold stock options in the company.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zang, X.C., Li, X., Metcalfe, K. et al. Anchorage accurately assembles anchor-flanked synthetic long reads. Algorithms Mol Biol 20, 12 (2025). https://doi.org/10.1186/s13015-025-00288-4

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13015-025-00288-4

Keywords