Efficient Approximate String Matching Techniques for Sequence Alignment

Efficient Approximate String Matching Techniques for Sequence Alignment PDF Author: Santiago Marco-Sola
Publisher:
ISBN:
Category :
Languages : en
Pages : 213

Book Description
One of the outstanding milestones achieved in recent years in the field of biotechnology research has been the development of high-throughput sequencing (HTS). Because it is currently technically impossible to decode the genome as a whole, HTS technologies read billions of relatively short chunks of a genome at random locations. Such reads then need to be located within a reference for the species being studied (that is, aligned or mapped to the genome): for each read one identifies regions of the reference that share a large sequence similarity with it, thereby indicating what the read's point or points of origin may be. HTS technologies are able to re-sequence a human individual (i.e. to establish the differences between his/her individual genome and the reference genome for the human species) in a very short period of time. They have also paved the way for the development of a number of new protocols and methods, leading to novel insights in genomics and biology in general. However, HTS technologies also pose a challenge to traditional data analysis methods; this is due to the sheer amount of data to be processed and the need for improved alignment algorithms that can generate accurate results quickly. This thesis tackles the problem of sequence alignment as a step within the analysis of HTS data. Its contributions focus on both the methodological aspects and the algorithmic challenges towards efficient, scalable, and accurate HTS mapping. From a methodological standpoint, this thesis strives to establish a comprehensive framework able to assess the quality of HTS mapping results. To do so one has to understand the source and nature of mapping conflicts, and explore the accuracy limits inherent in how sequence alignment is performed for current HTS technologies. From an algorithmic standpoint, this work introduces state-of-the-art index structures and approximate string matching algorithms.
They contribute novel insights that can be used in practical applications towards efficient and accurate read mapping. In more detail, we first present methods able to reduce the storage space taken by indexes for genome-scale references, while still providing fast query access in order to support effective search algorithms. Second, we describe novel filtering techniques that vastly reduce the computational requirements of sequence mapping, but are nonetheless capable of giving strict algorithmic guarantees on the completeness of the results. Finally, this thesis presents new incremental algorithmic techniques able to combine several approximate string matching algorithms; this leads to efficient and flexible search algorithms allowing the user to reach arbitrary search depths. All algorithms and methodological contributions of this thesis have been implemented as components of a production aligner, the GEM-mapper, which is publicly available, widely used worldwide and cited by a sizeable body of literature. It offers flexible and accurate sequence mapping while outperforming other HTS mappers both in running time and in the quality of the results it produces.

Approximate String Alignment and Its Application to ESTs, mRNAs and Genome Mapping

Approximate String Alignment and Its Application to ESTs, mRNAs and Genome Mapping PDF Author: Cheuk-Hon Terence Yim
Publisher:
ISBN: 9781361205648
Category :
Languages : en
Pages :

Book Description
This dissertation, "Approximate String Alignment and Its Application to ESTs, mRNAs and Genome Mapping" by Cheuk-hon, Terence, Yim, 嚴卓漢, was obtained from The University of Hong Kong (Pokfulam, Hong Kong) and is being sold pursuant to Creative Commons: Attribution 3.0 Hong Kong License. The content of this dissertation has not been altered in any way. We have altered the formatting in order to facilitate the ease of printing and reading of the dissertation. All rights not granted by the above license are retained by the author. Abstract: Abstract of thesis entitled "Approximate String Alignment and Its Application to ESTs, mRNAs and Genome Mapping", submitted by Yim Cheuk Hon Terence for the degree of Master of Philosophy at The University of Hong Kong in August 2004. Locating and annotating genes in the genome are critical steps towards a better understanding of how the genes function. Different techniques have been used for finding the location of genes, including mapping coding sequences such as cDNAs or ESTs to the genome, whole genome alignment between different species, or mapping known gene sequences to the genome. All the techniques mentioned involve sequence comparisons. Hence, practical sequence comparison algorithms are needed. Sequence comparisons in the genome sequence are presently performed by approximate string matching together with sequence alignment algorithms developed some time ago. However, due to the exceptional magnitude of the genome sequence, the high error ratio between sequences, and the complicated internal structure of the genes, new algorithms are now needed to overcome these challenges. This study proposes a new approximate string matching algorithm which can search large genome text efficiently by employing a new indexing method that combines the strengths of the suffix tree and the suffix array.
To maintain performance in a high error ratio situation, we also develop a new filtering scheme by exploring the relationship between the genome text and the query. Experiments show that the overall running time of our new algorithm is 8 to 10 times faster than that of existing algorithms. Based on our new approximate string matching algorithm, we also develop an mRNA alignment tool that can align mRNA or EST sequences to the genome and efficiently identify the correct internal structure of the sequence. Our alignment algorithm performs better than existing tools, especially in a high error situation. DOI: 10.5353/th_b3145573 Subjects: Gene mapping - Data processing; Nucleotide sequence - Data processing; Molecular biology - Data processing; Algorithms

Efficient String Algorithms with Applications in Bioinformatics

Efficient String Algorithms with Applications in Bioinformatics PDF Author: Sahar Hooshmand
Publisher:
ISBN:
Category :
Languages : en
Pages : 73

Book Description
The work presented in this dissertation deals with establishing efficient methods for solving some algorithmic problems that have applications in bioinformatics. After a short introduction in Chapter 1, an algorithm for the genome mappability problem is presented in Chapter 2. Genome mappability is a measure of the approximate repeat structure of the genome with respect to substrings of a specific length and a tolerance that defines the number of mismatches. The similarity between reads is measured by using the Hamming distance function. Genome mappability is computed for each position in the string and has several applications in designing high-throughput short-read sequencing experiments. Chapter 3 presents an algorithm to compute the Average Common Substring of two input sequences in their run-length encoded format. The distance between them based on the Average Common Substring measure can be computed in linearithmic time and linear space proportional to the total length of the sequences after run-length encoding. Chapter 4 presents a method that produces a better approximation for Average Common Substring calculations where mismatches are allowed. This method is applicable to the alignment-free comparison of biological sequences at highly competitive speed. Finally, in Chapter 5, we present two algorithms to efficiently decode the Suffix Array/Inverse Suffix Array of the reverse text by using the FM-index of the forward text. Additionally, our experimental results are competitive when compared to the standard approach of maintaining the FM-index for both the forward and the reverse text in approximate string-matching applications.
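To make the mappability definition above concrete, here is a minimal sketch in Python. It is a naive quadratic scan, not the dissertation's algorithm, and it assumes one common convention: for each start position, count the other positions whose length-m substring lies within Hamming distance k. The function names `hamming` and `mappability` are illustrative only.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mappability(text: str, m: int, k: int) -> List[int]:
    """For each start i, count the OTHER starts j whose length-m substring
    is within Hamming distance k of text[i:i+m].

    Naive O(n^2 * m) scan, for illustration only (assumed convention:
    a position is not counted as a match of itself).
    """
    n = max(len(text) - m + 1, 0)  # number of length-m substrings
    counts = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if hamming(text[i:i + m], text[j:j + m]) <= k:
                # The relation is symmetric, so credit both positions.
                counts[i] += 1
                counts[j] += 1
    return counts
```

For example, `mappability("ACGTACGA", 3, 0)` marks positions 0 and 4, since both start the substring `ACG`; raising k to 1 additionally pairs `CGT` at position 1 with `CGA` at position 5.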

Approximate String Matching in DNA Sequences

Approximate String Matching in DNA Sequences PDF Author: Cheng Lok Lam
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description
Abstract of thesis entitled "Approximate String Matching in DNA Sequences", submitted by Cheng Lok Lam for the degree of Master of Philosophy at The University of Hong Kong in August 2003. DNA (deoxyribonucleic acid) sequences hold the code of life of living organisms. Approximate string matching on DNA sequences is very important in research in the fields of medicine and health. Although the approximate string matching problem has been studied extensively in the field of computer science, many solutions cannot be applied directly to DNA data. This is because DNA data is a very large data set (GenBank had recorded more than 20 Gbp of DNA sequences as of June 2002). A method based on a suffix tree (or suffix array) and a partitioning of the query string into sub-queries was proposed recently. The method has been shown to be efficient for a long query with a large error bound in the main memory model. Due to the large data volume of DNA, in many instances the suffix tree (or suffix array) is larger than the main memory and must be stored in external memory. In our study, the technique is extended to the external memory model. It is shown that the method using a suffix array performs better than the one using a suffix tree in the external memory model, and an algorithm is proposed for building the suffix array efficiently in external memory. A novel auxiliary data structure is also proposed. The data structure greatly improves the efficiency of the suffix array in the approximate string matching problem in the external memory model. The parallel approximate matching problem in DNA sequences is also investigated. Two novel parallel algorithms for PC clusters are proposed. Experimental results show that when the error rate is small, partitioning the suffix array over the machines in the cluster is the more efficient approach. Conversely, partitioning the data over the machines is the better approach if the error rate is large.
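The general idea of combining a suffix array with a partition of the query into sub-queries can be sketched as follows. This is an in-memory toy under several assumptions, not the thesis's external-memory method: errors are restricted to mismatches (Hamming distance), the suffix array is built by naive sorting, and the names `build_suffix_array`, `exact_occurrences` and `search_with_mismatches` are illustrative. It relies on the pigeonhole principle: if a pattern matches with at most k mismatches, at least one of its k+1 pieces must match exactly.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Naive suffix array: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def exact_occurrences(text: str, sa: List[int], piece: str) -> List[int]:
    """Start positions of `piece` in `text`, via binary search on `sa`."""
    m = len(piece)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= piece
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < piece:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > piece
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= piece:
            lo = mid + 1
        else:
            hi = mid
    return sa[first:lo]


def search_with_mismatches(text: str, pattern: str, k: int) -> List[int]:
    """All starts where `pattern` matches `text` with at most k mismatches."""
    m = len(pattern)
    if m <= k:
        raise ValueError("pattern must be longer than k for this filter")
    sa = build_suffix_array(text)
    # Split the pattern into k+1 non-empty contiguous pieces.
    bounds = [i * m // (k + 1) for i in range(k + 2)]
    hits = set()
    for a, b in zip(bounds, bounds[1:]):
        for pos in exact_occurrences(text, sa, pattern[a:b]):
            start = pos - a  # candidate alignment implied by this piece
            if 0 <= start <= len(text) - m:
                window = text[start:start + m]
                # Verify the whole candidate window.
                if sum(x != y for x, y in zip(window, pattern)) <= k:
                    hits.add(start)
    return sorted(hits)
```

With k = 1, the pattern `GATTACA` is split into `GAT` and `TACA`; exact hits of either piece in `GATTACAGATTTCA` yield candidate windows that are then verified in full, which finds the exact occurrence at position 0 and the one-mismatch occurrence at position 7.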

Handbook of Exact String Matching Algorithms

Handbook of Exact String Matching Algorithms PDF Author: Christian Charras
Publisher: College PressPub Company
ISBN: 9780954300647
Category : Computers
Languages : en
Pages : 238

Book Description
String matching is a very important subject in the wider domain of text processing. It consists of finding one, or more generally, all the occurrences of a string (more generally called a pattern) in a text. The Handbook of Exact String Matching Algorithms presents 38 methods for solving this problem. For each, it gives the main features, a description, its C code, an example and references.
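As an illustration of the problem statement above (finding all occurrences of a pattern in a text), here is a short sketch of the Knuth-Morris-Pratt algorithm, one classic exact matcher. It is written in Python for brevity and is not taken from the handbook's C code; the function name `kmp_search` is illustrative.

```python
from typing import List


def kmp_search(text: str, pattern: str) -> List[int]:
    """Return all start positions of `pattern` in `text` (overlaps included)."""
    if not pattern:
        return []  # assumption: an empty pattern reports no occurrences
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    fail = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j and pattern[i] != pattern[j]:
            j = fail[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j
    hits = []
    j = 0  # number of pattern characters currently matched
    for i, c in enumerate(text):
        while j and c != pattern[j]:
            j = fail[j - 1]  # fall back without re-reading the text
        if c == pattern[j]:
            j += 1
        if j == len(pattern):
            hits.append(i - j + 1)
            j = fail[j - 1]  # keep going to find overlapping matches
    return hits
```

The failure table lets the scan move through the text without backing up, so the search runs in time linear in the combined length of text and pattern.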

Experimental and Efficient Algorithms

Experimental and Efficient Algorithms PDF Author: Sotiris E. Nikoletseas
Publisher: Springer
ISBN: 3540320784
Category : Computers
Languages : en
Pages : 637

Book Description
This book constitutes the refereed proceedings of the 4th International Workshop on Experimental and Efficient Algorithms, WEA 2005, held in Santorini Island, Greece in May 2005. The 47 revised full papers and 7 revised short papers presented together with extended abstracts of 3 invited talks were carefully reviewed and selected from 176 submissions. The book is devoted to the design, analysis, implementation, experimental evaluation, and engineering of efficient algorithms. Among the application areas addressed are most fields applying advanced algorithmic techniques, such as combinatorial optimization, approximation, graph theory, discrete mathematics, scheduling, searching, sorting, string matching, coding, networking, data mining, data analysis, etc.

Combinatorial Pattern Matching

Combinatorial Pattern Matching PDF Author:
Publisher:
ISBN:
Category : Combinatorial analysis
Languages : en
Pages : 344

Book Description


Fast Walking Tree Method Via Recurrence Reduction for Biological String Alignment

Fast Walking Tree Method Via Recurrence Reduction for Biological String Alignment PDF Author: Tai C. Hsu
Publisher:
ISBN:
Category : Genetics
Languages : en
Pages : 14

Book Description
The meaning of biological sequences is a central problem of modern biology. Although string matching is well understood in the edit-distance model, biological strings with transpositions and inversions violate this model's assumptions. To align biologically reasonable strings, we proposed the Walking Tree Method, an approximate string alignment method that can handle insertions, deletions, substitutions, translocations, and more than one level of inversions. Our earlier versions were able to align whole bacterial genomes (approx. 1 Mbps) and discover and verify genes. As extremely long sequences can now be deciphered rapidly and accurately without amplification, speeding up the method becomes necessary. Via a technique that we call "recurrence reduction", in which the same computations can be looked up rather than recomputed, we are able to significantly improve the performance, e.g., 400% for 1-million base pair alignments.

Experimental Algorithms

Experimental Algorithms PDF Author: Evripidis Bampis
Publisher: Springer
ISBN: 3319200860
Category : Computers
Languages : en
Pages : 401

Book Description
This book constitutes the refereed proceedings of the 14th International Symposium on Experimental Algorithms, SEA 2015, held in Paris, France, in June/July 2015. The 30 revised full papers presented were carefully reviewed and selected from 76 submissions. The main theme of the symposium is the role of experimentation and of algorithm engineering techniques in the design and evaluation of algorithms and data structures. The papers are grouped in topical sections on data structures, graph problems, combinatorial optimization, scheduling and allocation, and transportation networks.

Multiple Biological Sequence Alignment

Multiple Biological Sequence Alignment PDF Author: Ken Nguyen
Publisher: John Wiley & Sons
ISBN: 1118229045
Category : Computers
Languages : en
Pages : 256

Book Description
Covers the fundamentals and techniques of multiple biological sequence alignment and analysis, and shows readers how to choose the appropriate sequence analysis tools for their tasks. This book describes the traditional and modern approaches in biological sequence alignment and homology search. This book contains 11 chapters, with Chapter 1 providing basic information on biological sequences. Next, Chapter 2 contains fundamentals in pair-wise sequence alignment, while Chapters 3 and 4 examine popular existing quantitative models and practical clustering techniques that have been used in multiple sequence alignment. Chapter 5 describes, characterizes and relates many multiple sequence alignment models. Chapter 6 describes how phylogenetic trees have traditionally been constructed, and how available sequence knowledge bases can be used to improve the accuracy of reconstructing phylogenetic trees. Chapter 7 covers the latest methods developed to improve the run-time efficiency of multiple sequence alignment. Next, Chapter 8 covers several popular existing multiple sequence alignment servers and services, and Chapter 9 examines several multiple sequence alignment techniques that have been developed to handle short sequences (reads) produced by the Next Generation Sequencing (NGS) technique. Chapter 10 describes a bioinformatics application using multiple sequence alignment of short reads or whole genomes as input. Lastly, Chapter 11 provides a review of RNA and protein secondary structure prediction using the evolution information inferred from multiple sequence alignments.
• Covers the full spectrum of the field, from alignment algorithms to scoring methods, practical techniques, and alignment tools and their evaluations
• Describes theories and developments of scoring functions and scoring matrices
• Examines phylogeny estimation and large-scale homology search
Multiple Biological Sequence Alignment: Scoring Functions, Algorithms and Applications is a reference for researchers, engineers, graduate and postgraduate students in bioinformatics and systems biology, and molecular biologists.
Ken Nguyen, PhD, is an associate professor at Clayton State University, GA, USA. He received his PhD, MSc and BSc degrees in computer science all from Georgia State University. His research interests are in databases, parallel and distributed computing and bioinformatics. He was a Molecular Basis of Disease fellow at Georgia State and is the recipient of the highest graduate honor at Georgia State, the William M. Suttles Graduate Fellowship.
Xuan Guo, PhD, is a postdoctoral associate at Oak Ridge National Lab, USA. He received his PhD degree in computer science from Georgia State University in 2015. His research interests are in bioinformatics, machine learning, and cloud computing. He is an editorial assistant of International Journal of Bioinformatics Research and Applications.
Yi Pan, PhD, is a Regents' Professor of Computer Science and an Interim Associate Dean and Chair of Biology at Georgia State University. He received his BE and ME in computer engineering from Tsinghua University in China and his PhD in computer science from the University of Pittsburgh. Dr. Pan's research interests include parallel and distributed computing, optical networks, wireless networks and bioinformatics. He has published more than 180 journal papers with about 60 papers published in various IEEE/ACM journals. He is co-editor along with Albert Y. Zomaya of the Wiley Series in Bioinformatics.