Scalable Methods for Genome Assembly

Author: Priyanka Ghosh
Languages: en
Pages: 155

Book Description
De novo genome assembly is a fundamental problem in computational biology. The goal is to reconstruct an unknown genome from short DNA fragments (called "reads") obtained from it. Over the last decade, with the advent of numerous next-generation sequencing (NGS) platforms (e.g., Illumina, 454 Roche), billions of reads can be generated in a matter of hours, leading to vast amounts of data accumulating per day. This has necessitated efficient parallelization of the assembly process to meet the growing data demands. While multiple parallel solutions to the problem have been proposed in the past, a gap remains between the throughput of massively parallel NGS technologies and the ability of current state-of-the-art assemblers to analyze and assemble large, complex genomes. Conducting genome assembly at scale remains a challenge owing to the intense computational and memory requirements of the problem, coupled with inherent complexities in existing parallel tools associated with data movement, complex data structures, unstructured memory accesses, and repeated I/O operations. In this dissertation, we address these challenges and develop new methods for extreme-scale genome assembly of microbial and complex eukaryotic genomes. Our approach is two-fold, making the following contributions: i) FastEtch, a new method targeting fast and space-efficient assemblies that uses a probabilistic data structure (the Count-Min sketch) and executes efficiently on shared-memory platforms with a minimal computational footprint (both memory and time); ii) PaKman, a fully distributed method that tackles assembly of large genomes through the combination of a novel data structure (PaK-Graph) and algorithmic strategies that simplify the communication and I/O footprint of the assembly process. We present an extensive performance and qualitative evaluation of both algorithms, including comparisons to other state-of-the-art methods. Our results demonstrate that FastEtch yields one of the best time-memory-quality trade-offs among state-of-the-art genome assemblers. PaKman achieves near-linear speedups on up to 8K cores, outperforms state-of-the-art distributed- and shared-memory tools in performance while delivering comparable (if not better) quality, and significantly reduces time to solution.
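For readers unfamiliar with the Count-Min sketch that FastEtch builds on, the following is a minimal, illustrative Python sketch of how such a structure can approximate k-mer counts in bounded memory. The width, depth, hashing scheme, and k value are assumptions chosen for illustration, not FastEtch's actual parameters or implementation.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: may overestimate, never underestimates."""
    def __init__(self, width=1 << 20, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        # One independent hash per row, derived by salting blake2b.
        for seed in range(self.depth):
            h = hashlib.blake2b(item.encode(), digest_size=8,
                                salt=seed.to_bytes(16, "little")).digest()
            yield int.from_bytes(h, "little") % self.width

    def add(self, item):
        for row, idx in enumerate(self._hashes(item)):
            self.table[row][idx] += 1

    def query(self, item):
        # Minimum over rows gives the tightest overestimate.
        return min(self.table[row][idx]
                   for row, idx in enumerate(self._hashes(item)))

def count_kmers(reads, k, sketch):
    """Slide a window of length k over each read and record every k-mer."""
    for read in reads:
        for i in range(len(read) - k + 1):
            sketch.add(read[i:i + k])

sketch = CountMinSketch()
count_kmers(["ACGTACGTGG", "CGTACGTGGA"], k=5, sketch=sketch)
print(sketch.query("ACGTA"))  # approximate count; always >= the true count
```

Because updates only increment cells, a query can overestimate (due to hash collisions) but never underestimate a k-mer's count, which is what makes this kind of structure suitable for frequency-based filtering of erroneous k-mers at a fraction of the memory of an exact table.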

Scalable Parallel Algorithms for Genome Analysis

Author: Evangelos Georganas
Languages: en
Pages: 129

Book Description
A critical problem for computational genomics is de novo genome assembly: the development of robust, scalable methods for transforming short, randomly sampled "shotgun" sequences, known as reads, into a contiguous and accurate reconstruction of complex genomes. These reads are significantly shorter (e.g., hundreds of bases long) than chromosomes and also include errors. While advanced methods exist for assembling the small, haploid genomes of prokaryotes, the genomes of eukaryotes are more complex. Moreover, de novo assembly has been unable to keep pace with the flood of data, due to dramatic increases in genome sequencer capabilities combined with the computational requirements and algorithmic complexity of assembling large-scale genomes and metagenomes. In this dissertation, we address this challenge head on by developing parallel algorithms for de novo genome assembly designed to scale to massive concurrencies. Our work is based on the Meraculous assembler, a state-of-the-art de novo assembler for short reads developed at JGI. Meraculous identifies non-erroneous overlapping substrings of length k (k-mers) with high-quality extensions and uniquely assembles genome regions into uncontested sequences called contigs by constructing and traversing a de Bruijn graph of k-mers, a special graph used to represent overlaps among k-mers. The original reads are subsequently aligned onto the contigs to obtain information about the relative orientation of the contigs. Contigs are then linked together to create scaffolds, sequences of contigs that may contain gaps between them. Finally, gaps are filled using localized assemblies based on the original reads. First, we design efficient, scalable algorithms for k-mer analysis and contig generation. K-mer analysis is characterized by intensive communication and I/O requirements, and our parallel algorithms reduce its memory requirements by 7×. Contig generation relies on efficient parallelization of de Bruijn graph construction and traversal, which requires a distributed hash table and is a key component of most de novo assemblers. We present a novel algorithm that leverages the one-sided communication capabilities of UPC (Unified Parallel C) to provide the requisite fine-grained, irregular parallelism while avoiding data hazards. Sequence alignment is characterized by intensive I/O and heavy computation; we introduce merAligner, a highly parallel sequence aligner that employs parallelism in all of its components. Finally, this thesis details the parallelization of the scaffolding modules, enabling the first massively scalable, high-quality, complete end-to-end de novo assembly pipeline. Large-scale experimental results using human and wheat genomes demonstrate efficient performance and scalability on thousands of cores. Compared to the original Meraculous code, which requires approximately 48 hours to assemble the human genome, our pipeline, called HipMer, computes the assembly in only 4 minutes using 23,040 cores of Edison, an overall speedup of approximately 720×. In the last part of the dissertation we tackle the problem of metagenome assembly. Metagenomics is currently the leading technology for studying uncultured microbial diversity. While accessing an unprecedented number of environmental samples comprising thousands of individual microbial genomes is now possible, the bottleneck is becoming computational, since improvements in sequencing cost outpace Moore's Law. Metagenome assembly is further complicated by repeated sequences across genomes, polymorphisms within a species, and the variable frequency of genomes within a sample. In our work, we repurpose HipMer components for metagenome assembly and design a versatile, high-performance metagenome assembly pipeline that outperforms state-of-the-art tools in both quality and performance.
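To make the contig-generation step concrete, here is a toy, single-process Python sketch of building a de Bruijn graph from reads and walking its unambiguous paths into contigs. It illustrates the general technique only: HipMer's actual implementation distributes the graph over a hash table via UPC one-sided communication, filters k-mers by quality, and handles reverse complements, none of which is reproduced here.

```python
from collections import defaultdict

def build_graph(reads, k):
    """de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            a, b = read[i:i + k - 1], read[i + 1:i + k]
            succ[a].add(b)
            pred[b].add(a)
    return succ, pred

def contigs(succ, pred):
    """Walk maximal unambiguous paths; branching junctions end a contig.
    Junction nodes themselves and isolated cycles are skipped for brevity."""
    nodes = set(succ) | set(pred)
    starts = [n for n in nodes
              if len(pred[n]) != 1 or len(succ[next(iter(pred[n]))]) != 1]
    out = []
    for node in starts:
        path = node
        while len(succ[node]) == 1:
            nxt = next(iter(succ[node]))
            if len(pred[nxt]) != 1:   # converging junction: stop extending
                break
            path += nxt[-1]           # append the single new base
            node = nxt
        out.append(path)
    return out

succ, pred = build_graph(["ACGTACGA", "CGTACGAT"], k=4)
print(contigs(succ, pred))  # e.g. ['CGTACG', 'CGAT']
```

The "uncontested" contigs in the text correspond to these uniquely extending paths: traversal stops wherever the graph branches, which is exactly what makes the traversal safe to parallelize across a distributed hash table.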

High Performance and Scalable Matching and Assembly of Biological Sequences

Author: Anas Abu Doleh
Languages: en
Pages: 139

Book Description
Next-generation sequencing (NGS), a massively parallel and low-cost sequencing technology, generates enormous volumes of sequencing data. This facilitates the discovery of new genomic sequences and expands biological and medical research. However, these advances also bring major computational challenges. In almost all NGS analysis pipelines, the most crucial and computationally intensive tasks are sequence similarity search and de novo genome assembly. In this work, we therefore introduce novel and efficient techniques that exploit advances in high-performance computing hardware and data-processing platforms to accelerate these tasks while producing high-quality results. For sequence similarity search, we study the use of massively multithreaded architectures, such as the graphics processing unit (GPU), to accelerate and solve two important problems: read mapping and maximal exact matching. First, we introduce a new mapping tool, Masher, which processes long (and short) reads efficiently and accurately. Masher employs a novel indexing technique that produces an index for a huge genome, such as the human genome, with a memory footprint small enough that it can be stored and efficiently accessed on a memory-restricted device such as a GPU. The results show that Masher is faster than state-of-the-art tools and achieves good accuracy and sensitivity on sequencing data with various characteristics. Second, we study the maximal exact matching problem because of its importance in detecting and evaluating similarity between sequences. We introduce a novel tool, GPUMEM, which efficiently utilizes the GPU to build a lightweight index and find maximal exact matches between two genome sequences. Index construction is so fast that, even including its time, GPUMEM is faster in practice than state-of-the-art tools that use a pre-built index. De novo genome assembly is a crucial step in NGS analysis because of the novelty of the discovered sequences. First, we study parallelizing de Bruijn graph-based de novo genome assembly on distributed-memory systems using the Spark framework and the GraphX API. We propose a new tool, Spaler, which assembles short reads efficiently and accurately. Spaler starts with de Bruijn graph construction, then applies iterative graph reduction and simplification techniques to generate contigs, and finally uses read-mapping information to produce scaffolds. Spaler employs a parallelism-tuning technique to improve the performance of each of these steps independently. The experiments show promising results in terms of scalability, execution time, and quality. Second, we address de novo metagenome assembly. Spaler may not properly assemble reads extracted from environmental samples because of the complexity and diversity of living microbial communities. We therefore introduce meta-Spaler, an extension of Spaler, to handle metagenomic datasets. meta-Spaler partitions reads based on their expected coverage and applies an iterative assembly. The results show improved assembly quality for meta-Spaler compared to Spaler.
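As a rough illustration of the first Spaler stage described above, distributed de Bruijn graph construction over reads, here is a minimal PySpark sketch. It is an assumption-laden toy: Spaler itself is built on the GraphX API (Scala), the k value and app name are arbitrary, and the subsequent graph reduction, contig, and scaffolding stages are only indicated in a comment.

```python
from pyspark import SparkContext

K = 5  # illustrative k-mer length, not Spaler's actual setting

def kmer_edges(read):
    """Emit ((prefix, suffix), 1) for each k-mer: one de Bruijn edge."""
    for i in range(len(read) - K + 1):
        kmer = read[i:i + K]
        yield ((kmer[:-1], kmer[1:]), 1)

sc = SparkContext(appName="debruijn-edges-sketch")
reads = sc.parallelize(["ACGTACGTGG", "CGTACGTGGA"])  # stand-in for read files

# Distributed edge list with multiplicities; Spaler-style iterative graph
# reduction and simplification would then operate on this edge set.
edges = reads.flatMap(kmer_edges).reduceByKey(lambda a, b: a + b)
print(edges.take(5))
sc.stop()
```

The appeal of this formulation is that edge extraction is embarrassingly parallel per read, and the shuffle performed by reduceByKey naturally aggregates duplicate edges across the cluster before any graph algorithms run.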

Forest Genomics and Biotechnology

Author: Isabel Allona
Publisher: Frontiers Media SA
ISBN: 2889631788
Languages: en
Pages: 185

Book Description
This Research Topic addresses research in genomics and biotechnology to improve the growth and quality of forest trees for wood, pulp, biorefineries, and carbon capture. Forests are the world's greatest repository of terrestrial biomass and biodiversity. They provide critical ecological services, supporting the preservation of fauna, flora, and water resources. Planted forests also offer a renewable source of timber for pulp and paper production and for the biorefinery. Despite their fundamental role for society, thousands of hectares of forest are lost annually to deforestation, pests, pathogens, and urban development. Consequently, there is an increasing need to develop trees that are more productive under lower inputs, while understanding how they adapt to the environment and respond to biotic and abiotic stress. Forest genomics and biotechnology, disciplines that study the genetic composition of trees and the methods required to modify them, began over a quarter of a century ago with the development of the first genetic maps and the establishment of early methods of genetic transformation. Since then, genomics and biotechnology have influenced every research area of forestry. Genome analyses of tree populations have uncovered genes involved in adaptation and in response to biotic and abiotic stress. Genes that regulate growth and development have been identified, and in many cases their mechanisms of action have been described. Genetic transformation is now widely used to understand the roles of genes and to develop germplasm better suited to commercial tree plantations. However, in contrast to many annual crops that have benefited from centuries of domestication and extensive genomic and biotechnology research, forestry is still in its infancy in this respect, and tremendous opportunities remain unexplored. This Research Topic aims to briefly summarize recent findings, discuss long-term goals, and look ahead to future developments and how they can be applied to improve the growth and quality of forest trees.

Scalable Parallel Processing of Multi-objective Optimized DNA Sequence Assembly

Author: Munib Ahmed
Category: Bioinformatics
Languages: en

Book Description
Bioinformatics is an emerging branch of science in which problems from molecular biology are evaluated and resolved by leveraging techniques and algorithms devised in the field of computer science. Most of these problems involve enormous amounts of data and considerable computational complexity in generating fast, qualitatively viable solutions. This challenges algorithm developers, who must strive to meet the conflicting objectives of processing very large datasets with the highest possible accuracy while keeping execution time to a minimum. Genome assembly is one such problem in bioinformatics, in which a DNA sequence is reconstructed from millions of small DNA fragments produced in the laboratory by the sequencing process. When examined purely as data, these fragments are small in size (

The Barley Genome

Author: Nils Stein
Publisher: Springer
ISBN: 3319925288
Category: Science
Languages: en
Pages: 400

Book Description
This book presents an overview of the state-of-the-art in barley genome analysis, covering all aspects of sequencing the genome and translating this important information into new knowledge in basic and applied crop plant biology and new tools for research and crop improvement. Unlimited access to a high-quality reference sequence is removing one of the major constraints in basic and applied research. This book summarizes the advanced knowledge of the composition of the barley genome, its genes and the much larger non-coding part of the genome, and how this information facilitates studying the specific characteristics of barley. One of the oldest domesticated crops, barley is the small grain cereal species that is best adapted to the highest altitudes and latitudes, and it exhibits the greatest tolerance to most abiotic stresses. With comprehensive access to the genome sequence, barley’s importance as a genetic model in comparative studies on crop species like wheat, rye, oats and even rice is likely to increase.

Next Generation Sequencing and Sequence Assembly

Author: Ali Masoudi-Nejad
Publisher: Springer Science & Business Media
ISBN: 1461477263
Category: Medical
Languages: en
Pages: 92

Book Description
The goal of this book is to introduce the biological and technical aspects of next-generation sequencing methods, as well as algorithms for assembling these sequences into whole genomes. The book is organized into two parts: Part 1 introduces NGS methods, and Part 2 reviews assembly algorithms, giving readers new to the field good insight into these methods. Gathering information about sequencing and assembly methods in one place helps both biologists and computer scientists form a clear picture of the field. Chapters include information about new sequencing technologies such as ChIP-seq and ChIP-chip, as well as de novo sequence assembly.

Distance-aware Algorithms for Scalable Evolutionary and Ecological Analyses

Author: Metin Balaban
Languages: en

Book Description
Thanks to advances in sequencing technologies over the last two decades, the set of available whole-genome sequences has been expanding rapidly. One of the challenges in phylogenetics is accurate large-scale phylogenetic inference based on whole-genome sequences. A related challenge is using incomplete genome-wide data in an assembly-free manner for accurate sample identification with reference to a phylogeny. This dissertation proposes new scalable and accurate algorithms to address these two challenges. First, I present a family of scalable methods called TreeCluster for breaking a large set of sequences into evolutionarily homogeneous clusters. Second, I present two algorithms for accurate phylogenetic placement of genomic sequences on ultra-large single-gene and whole-genome trees. The first version, APPLES, scales linearly with the reference size, while APPLES-2 scales sub-linearly thanks to a divide-and-conquer strategy based on TreeCluster. Third, I develop a solution for assembly-free phylogenetic placement of samples in the particularly challenging case in which the specimen is a mixture of two cohabiting species or a hybrid of two species. Fourth, I address one limitation of assembly-free methods, their reliance on simple models of sequence evolution, by developing a technique to compute evolutionary distances under a complex four-parameter model called TK4. Finally, I introduce a divide-and-conquer workflow for incrementally growing and updating ultra-large phylogenies using many of the ingredients developed in the other chapters. This workflow, uDance, is accurate in simulations and can build a 200,000-genome microbial tree of life based on 388 marker genes.
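To give a flavor of the clustering step, here is a minimal Python sketch of threshold-based single-linkage clustering using union-find. It is only a rough stand-in for the idea behind TreeCluster, which operates on a phylogenetic tree under several linkage criteria rather than on a raw pairwise distance matrix; the names, distances, and threshold below are illustrative assumptions.

```python
def cluster(names, dist, threshold):
    """Single-linkage clustering: merge any pair closer than the threshold."""
    parent = {n: n for n in names}

    def find(x):
        # Find the cluster representative, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist[a][b] <= threshold:
                parent[find(a)] = find(b)  # union the two clusters

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())

names = ["s1", "s2", "s3"]
dist = {"s1": {"s2": 0.02, "s3": 0.30},
        "s2": {"s1": 0.02, "s3": 0.28},
        "s3": {"s1": 0.30, "s2": 0.28}}
print(cluster(names, dist, threshold=0.05))  # [['s1', 's2'], ['s3']]
```

Partitioning the reference into such homogeneous clusters is what lets a divide-and-conquer placement method like APPLES-2 consider only a small, relevant subtree per query, which is the source of its sub-linear scaling.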

Toward a More Accurate Genome

Author: William Jacob Benhardt Biesinger
ISBN: 9781321093667
Languages: en
Pages: 124

Book Description
High-throughput sequencing enables basic and translational biology to query the mechanics of both life and disease at single-nucleotide resolution and with breadth that spans the genome. This revolutionary technology is a major tool in biomedical research, impacting our understanding of life's most basic mechanics and affecting human health and medicine. Unfortunately, it produces very large, error-prone datasets that require substantial computational processing before experimental conclusions can be made. Since errors and hidden biases in the data may influence empirically derived conclusions, accurate algorithms and models of the data are critical. This thesis focuses on the development of statistical models for high-throughput sequencing data that are capable of handling errors and that are built to reflect biological realities. First, we increase the fraction of the genome that can be reliably queried in biological experiments using high-throughput sequencing by expanding analysis into repeat regions of the genome. The method allows partial observation of gene regulatory network topology through identification of transcription factor binding sites using chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq). Binding-site clustering, or "peak-calling", can be frustrated by the complex, repetitive nature of genomes. Traditionally, these regions are censored from any interpretation, but we re-enable their interpretation using a probabilistic method for realigning problematic DNA reads. Second, we leverage high-throughput sequencing data for empirical discovery of underlying epigenetic cell state, enabled through analysis of combinations of histone marks. We use a novel probabilistic model to perform spatial and temporal clustering of histone marks and to capture mark combinations that correlate well with cell activity. In a first for epigenetic modeling with high-throughput sequencing data, we not only pool information across cell types but directly model the relationships between them, improving predictive power across several datasets. Third, we develop a scalable approach to genome assembly from high-throughput sequencing reads. While several assembly solutions exist, most do not scale well to large datasets, requiring computers with copious memory to assemble large genomes. Throughput continues to increase, and the large datasets available today and in the near future will require truly scalable methods. We present a promising distributed method for genome assembly that distributes the de Bruijn graph across many computers and seamlessly spills to disk when main memory is insufficient. We also present novel graph-cleaning algorithms that should handle the increased errors of large datasets better than traditional structure-based cleaning. High-throughput sequencing plays an important role in biomedical research and has already affected human health and medicine. Future experimental procedures will continue to rely on statistical methods for crucial error and bias correction, in addition to modeling expected outcomes. Thus, further development of robust statistical models is critical to the future of high-throughput sequencing, ensuring a strong foundation for correct biological conclusions.
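The two assembly ideas named above, hash-partitioning the de Bruijn graph across machines and spilling to disk when memory runs out, can be sketched in a few lines of Python. This is a toy under stated assumptions (the partition count, memory budget, and k-mer counting use case are illustrative, and no actual networking is shown), not the thesis's system.

```python
import os
import shelve
import tempfile
from hashlib import blake2b

NUM_WORKERS = 4        # illustrative partition count
MEM_BUDGET = 100_000   # illustrative max in-memory entries per partition

def owner(kmer, workers=NUM_WORKERS):
    """Hash-partition: decide which worker's table stores this k-mer."""
    h = blake2b(kmer.encode(), digest_size=4).digest()
    return int.from_bytes(h, "little") % workers

class SpillingTable:
    """In-memory dict that spills new entries to an on-disk store
    once the memory budget is exhausted."""
    def __init__(self, budget=MEM_BUDGET):
        self.mem, self.budget = {}, budget
        self.disk = shelve.open(os.path.join(tempfile.mkdtemp(), "spill"))

    def incr(self, kmer):
        if kmer in self.mem or len(self.mem) < self.budget:
            self.mem[kmer] = self.mem.get(kmer, 0) + 1
        else:
            self.disk[kmer] = self.disk.get(kmer, 0) + 1  # spill path

    def count(self, kmer):
        return self.mem.get(kmer, 0) + self.disk.get(kmer, 0)

tables = [SpillingTable() for _ in range(NUM_WORKERS)]
for read in ["ACGTACGTGG", "CGTACGTGGA"]:
    for i in range(len(read) - 4):          # k = 5 here, for illustration
        kmer = read[i:i + 5]
        tables[owner(kmer)].incr(kmer)      # route to the owning partition
print(tables[owner("ACGTA")].count("ACGTA"))
```

Because ownership is a pure function of the k-mer, every machine can route graph nodes without coordination, and each partition degrades gracefully from RAM to disk instead of failing when the dataset outgrows memory.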

Biological Sequence Analysis

Author: Richard Durbin
Publisher: Cambridge University Press
ISBN: 113945739X
Category: Science
Languages: en
Pages: 372

Book Description
Probabilistic models are becoming increasingly important in analysing the huge amounts of data produced by large-scale DNA-sequencing efforts such as the Human Genome Project. For example, hidden Markov models are used for analysing biological sequences, linguistic-grammar-based probabilistic models for identifying RNA secondary structure, and probabilistic evolutionary models for inferring phylogenies of sequences from different organisms. This book gives a unified, up-to-date, and self-contained account, with a Bayesian slant, of such methods and, more generally, of probabilistic methods of sequence analysis. Written by an interdisciplinary team of authors, it aims to be accessible to molecular biologists, computer scientists, and mathematicians with no formal knowledge of the other fields, while presenting the state of the art in this new and highly important field.
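As a small taste of the hidden Markov model machinery such books cover, here is a minimal Python implementation of the Viterbi algorithm for a toy two-state model loosely inspired by the classic CpG-island example. All states, probabilities, and the input sequence are illustrative assumptions, not taken from the book.

```python
import math

# Toy two-state HMM: GC-rich "island" vs. uniform "background".
states = ("island", "background")
start = {"island": 0.5, "background": 0.5}
trans = {"island": {"island": 0.9, "background": 0.1},
         "background": {"island": 0.1, "background": 0.9}}
emit = {"island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def viterbi(seq):
    """Most probable state path, computed in log space for stability."""
    v = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    back = []
    for x in seq[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for s at this position.
            prev, score = max(((p, v[-1][p] + math.log(trans[p][s]))
                               for p in states), key=lambda t: t[1])
            row[s] = score + math.log(emit[s][x])
            ptr[s] = prev
        v.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi("ACGCGCGTAATAT"))  # island labels over the GC-rich prefix
```

Dynamic programming over the state lattice is the common thread of the methods listed above: the same recurrence structure underlies pairwise alignment, profile HMMs, and grammar-based RNA structure prediction.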