![]() ![]() Although the tandem repeat unit is quite short in this illustrating example, such erroneous patterns are typical of long tandem units > 50 nt in size, and our greedy algorithm is effective in reducing such errors Our algorithm revises the consensus by scanning it forward and backward and fixes the sequencing error u colored red. The algorithm moves from the most frequent node rst to stu with an error of frequency 3 rather than to the error-free stu of frequency 2, thereby going through the three consecutive nodes with errors. In the middle, the de Bruijn graph of 3-mers in the input read shows the heaviest path (bold line) that our greedy algorithm outputs. ![]() Red underlined characters represent sequencing errors that are either substitutions, deletions or insertions. In the top, a read with tandem repeats is shown. ( D) This example illustrates how the consensus sequence with a k-mer of frequency one is fixed by our algorithm. The frequency of a k-mer becomes much lower than the average frequency when it has an error. ( C) Frequency of each k-mer of u in the focal range. Our greedy algorithm selects the next node of the maximum frequency until reaching the initial node and outputs the best path (denoted by u). l = 6) in raw reads to assemble k-mers into the original tandem repeat unit. ( B) Searching the de Bruijn graph of k-mers (e.g. ( A) Select a value for k such that many short k-mers have no sequencing errors in an approximate range of a hidden tandem repeat, allowing us to reconstruct the unit string of the tandem repeat based on the de Bruijn graph approach. Published by Oxford University Press.Įstimating the unit of a hidden tandem repeat in a noisy long read. Experimental results indicated that the proposed algorithm largely outperformed Tandem Repeats Finder, a widely used program for finding tandem repeats, in terms of sensitivity. We exploited this characteristic to develop a method for first estimating regions that could contain a tandem repeat, by analyzing the k-mer frequency distributions of fixed-size windows across the target read, followed by an algorithm that assembles the k-mers of a putative region into the consensus repeat unit by greedily traversing a de Bruijn graph. Namely, a long tandem repeat has hundreds or thousands of approximate copies of the repeated unit, so despite the error rate, many short k-mers will be error-free in many copies of the unit. Here, we report an efficient algorithm for solving this problem that takes advantage of the length of the repeat. ![]() Moreover, most traditional algorithms for finding tandem repeats are designed to find short tandem repeats (<1000 nt) and cannot effectively handle the high error rate of long reads in a reasonable amount of time. However, new long-read sequencing technologies can produce single reads of 10 000 nt or more that can span such repeat expansions, although these long reads have high error rates, of 10-20%, which complicates the detection of repetitive elements. Long tandem repeat expansions of more than 1000 nt have been suggested to be associated with diseases, but remain largely unexplored in individual human genomes because read lengths have been too short.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |