Scalable Parallel Algorithms for Genome Analysis

Author: Evangelos Georganas
Publisher:
ISBN:
Category :
Languages : en
Pages : 129

Book Description
A critical problem for computational genomics is de novo genome assembly: the development of robust, scalable methods for transforming short, randomly sampled "shotgun" sequences (reads) into a contiguous and accurate reconstruction of complex genomes. These reads are significantly shorter than chromosomes (e.g., hundreds of bases long) and include errors. While advanced methods exist for assembling the small, haploid genomes of prokaryotes, the genomes of eukaryotes are far more complex. Moreover, de novo assembly has been unable to keep pace with the flood of data, owing to dramatic increases in sequencer capabilities combined with the computational requirements and algorithmic complexity of assembling large-scale genomes and metagenomes. In this dissertation, we address this challenge head-on by developing parallel algorithms for de novo genome assembly that scale to massive concurrencies.

Our work is based on the Meraculous assembler, a state-of-the-art de novo assembler for short reads developed at JGI. Meraculous identifies non-erroneous overlapping substrings of length k (k-mers) with high-quality extensions and uniquely assembles genome regions into uncontested sequences called contigs by constructing and traversing a de Bruijn graph of k-mers, a graph that represents overlaps among k-mers. The original reads are subsequently aligned onto the contigs to obtain information about the relative orientation of the contigs. Contigs are then linked together to create scaffolds, sequences of contigs that may contain gaps between them. Finally, gaps are filled using localized assemblies based on the original reads.

First, we design efficient, scalable algorithms for k-mer analysis and contig generation. K-mer analysis is characterized by intensive communication and I/O requirements, and our parallel algorithms reduce the memory requirements by 7×. Contig generation relies on efficient parallelization of de Bruijn graph construction and traversal, which necessitates a distributed hash table and is a key component of most de novo assemblers. We present a novel algorithm that leverages the one-sided communication capabilities of Unified Parallel C (UPC) to facilitate the requisite fine-grained, irregular parallelism and to avoid data hazards. Sequence alignment is characterized by intensive I/O and heavy computation; we introduce merAligner, a highly parallel sequence aligner that employs parallelism in all of its components. Finally, this thesis details the parallelization of the scaffolding modules, enabling the first massively scalable, high-quality, complete end-to-end de novo assembly pipeline. Large-scale experiments with human and wheat genomes demonstrate efficient performance and scalability on thousands of cores. Compared to the original Meraculous code, which requires approximately 48 hours to assemble the human genome, our pipeline, called HipMer, computes the assembly in only 4 minutes using 23,040 cores of Edison, an overall speedup of approximately 720×.

In the last part of the dissertation we tackle the problem of metagenome assembly. Metagenomics is currently the leading technology for studying uncultured microbial diversity. While it is now possible to sequence an unprecedented number of environmental samples comprising thousands of individual microbial genomes, the bottleneck is becoming computational, since improvements in sequencing cost outpace Moore's Law. Metagenome assembly is further complicated by sequences repeated across genomes, polymorphisms within a species, and the variable abundance of genomes within a sample. We repurpose HipMer components for metagenome assembly and design a versatile, high-performance metagenome assembly pipeline that outperforms state-of-the-art tools in both quality and performance.
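To make the contig-generation stage concrete, the following is a minimal single-node sketch of building a de Bruijn graph of k-mers as a hash table and walking its uncontested paths into contigs. It illustrates the idea only, not HipMer's distributed UPC implementation; the reads and the value of k are toy assumptions.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Hash table mapping each (k-1)-mer prefix to the k-mers that start with it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer)
    return graph

def walk_contigs(graph, k):
    """Extend each unused k-mer while exactly one unused k-mer continues it,
    mirroring the 'uncontested' contig paths described above."""
    contigs, used = [], set()
    for kmers in list(graph.values()):
        for kmer in kmers:
            if kmer in used:
                continue
            used.add(kmer)
            contig = kmer
            while True:
                nexts = graph[contig[-(k - 1):]] - used
                if len(nexts) != 1:  # dead end or fork: stop the contig here
                    break
                (nxt,) = nexts
                used.add(nxt)
                contig += nxt[-1]
            contigs.append(contig)
    return contigs

reads = ["ACGTACGTGACG", "GTACGTGACGTT"]
print(walk_contigs(build_graph(reads, k=5), k=5))
```

In the real pipeline this hash table is distributed across nodes and accessed with one-sided remote reads and writes, which is what makes the fine-grained, irregular traversal scale.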

Scalable Graph Algorithms with Applications in Genetics

Author:
Publisher:
ISBN:
Category :
Languages : en
Pages : 92

Book Description
Graph-theoretical approaches have been widely used to solve problems arising in bioinformatics and genomic analysis. In particular, enumeration problems such as maximal clique and maximal biclique finding are at the core of many biological problems, such as the integration of genome mapping data. However, these enumeration problems can become computational and memory bottlenecks for genome-scale elucidation of biological networks, owing to their NP-hard nature and huge memory requirements. This research therefore develops exact, scalable, and efficient algorithms for these biological problems.

The Clique Enumerator, comprising a maximal clique enumeration algorithm and its parallel implementation, is developed to provide scalable and exact solutions to the maximal clique problem by taking advantage of (a) ultra-large globally addressable memory architectures, (b) the theory of fixed-parameter tractability, and (c) a novel bitmap data representation scheme for memory and search-space reduction. An implementation of the parallel algorithm scales linearly to at least 64 processors. Maximal biclique finding on bipartite graphs is the other focus of this research because of its key role in the study of gene-phenotype association graphs built from heterogeneous data types. A general-purpose algorithm is developed for enumerating maximal bicliques in a bipartite graph, without any problem restriction. Experimental results using both gene expression data and synthetic data indicate that the new algorithm can be two to three orders of magnitude faster than the best previous alternatives.

The clique and biclique enumeration algorithms are applied to two problems in genetics: linkage disequilibrium analysis in population genetics with clique algorithms, and ontology discovery for systems genetics with biclique algorithms. The clique model is applied to extract linkage disequilibrium networks from all loci across the whole genome; the structures of these networks are analyzed to compare the genetic structure of populations. The biclique model is applied to enumerate bicliques in gene-phenotype association graphs constructed from empirical gene sets; the bicliques are then integrated into a directed acyclic graph to provide an empirical discovery of phenotype ontology.
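For concreteness, here is a minimal serial sketch of maximal clique enumeration using the classical Bron-Kerbosch recursion, the kind of exhaustive search that a parallel enumerator like the one described above builds on; the toy graph is an assumption, and the dissertation's bitmap representation, fixed-parameter techniques, and parallelization are deliberately omitted.

```python
def bron_kerbosch(R, P, X, adj, out):
    """Emit every maximal clique extending R, with P the remaining
    candidates and X the already-processed vertices (to avoid duplicates)."""
    if not P and not X:
        out.append(R)
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P.remove(v)
        X.add(v)

# Toy graph: two triangles sharing the edge (1, 2).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
adj = {v: set() for v in range(4)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
print(cliques)  # the two maximal cliques: {0, 1, 2} and {1, 2, 3}
```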

Analysis and Design of Scalable Parallel Algorithms for Scientific Computing

Author: Anshul Gupta
Publisher:
ISBN:
Category :
Languages : en
Pages : 386

Book Description


Euro-Par 2017: Parallel Processing

Author: Francisco F. Rivera
Publisher: Springer
ISBN: 3319642030
Category : Computers
Languages : en
Pages : 740

Book Description
This book constitutes the proceedings of the 23rd International Conference on Parallel and Distributed Computing, Euro-Par 2017, held in Santiago de Compostela, Spain, in August/September 2017. The 50 revised full papers, presented together with 2 abstracts of invited talks and 1 invited paper, were carefully reviewed and selected from 176 submissions. The papers are organized in the following topical sections: support tools and environments; performance and power modeling, prediction and evaluation; scheduling and load balancing; high performance architectures and compilers; parallel and distributed data management and analytics; cluster and cloud computing; distributed systems and algorithms; parallel and distributed programming, interfaces and languages; multicore and manycore parallelism; theory and algorithms for parallel computation and networking; parallel numerical methods and applications; and accelerator computing.

Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale

Author:
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description
The field of bioinformatics and computational biology is currently experiencing a data revolution. The exciting prospect of making fundamental biological discoveries is fueling the rapid development and deployment of numerous cost-effective, high-throughput next-generation sequencing technologies. As a result, DNA and protein sequence repositories are being bombarded with new sequence information, and databases continue to report a Moore's-Law-like growth trajectory, roughly doubling in size every 18 months. In what seems to be a paradigm shift, individual projects are now capable of generating billions of raw sequences that need to be analyzed in the presence of already annotated sequence information. While it is clear that data-driven methods, such as sequence homology detection, are becoming the mainstay of computational life sciences, the algorithmic advancements essential for implementing complex data analytics at scale have mostly lagged behind.

Sequence homology detection is central to a number of bioinformatics applications, including genome sequencing and protein family characterization. Given millions of sequences, the goal is to identify all pairs of sequences that are highly similar (or "homologous") on the basis of alignment criteria. While optimal alignment algorithms exist for computing pairwise homology, deploying them at large scale is currently not feasible; instead, heuristic methods are used at the expense of quality. In this dissertation, we present the design and evaluation of a parallel implementation for conducting optimal homology detection on distributed-memory supercomputers. Our approach uses a combination of techniques from asynchronous load balancing (viz. work stealing and dynamic task counters), data replication, and exact-matching filters to achieve homology detection at scale. Results for a collection of 2.56M sequences show parallel efficiencies of ~75-100% on up to 8K cores, representing a time-to-solution of 33 seconds. We extend this work with a detailed analysis of single-node sequence alignment performance using the latest CPU vector instruction set extensions. Preliminary results reveal that current sequence alignment algorithms are unable to fully utilize widening vector registers.
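The optimal pairwise alignment underlying such homology detection is classically computed with Smith-Waterman dynamic programming. Below is a minimal scoring-only sketch; the scoring parameters are illustrative assumptions, and production aligners vectorize the inner loop, which is exactly the single-node performance question the dissertation analyzes.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal local alignment score via dynamic programming.
    H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never goes below zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))  # score of the best local match
```

Because every pair of n sequences would require n(n-1)/2 such quadratic-time alignments, the exact-matching filters and asynchronous load balancing described above are what make optimal all-pairs detection tractable at scale.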

Scalable Systems and Algorithms for Genomic Variant Analysis

Author: Frank Austin Nothaft
Publisher:
ISBN:
Category :
Languages : en
Pages : 140

Book Description
With the cost of sequencing a human genome dropping below $1,000, population-scale sequencing has become feasible. Now that projects sequencing more than 10,000 genomes are commonplace, there is a strong need for genome analysis tools that scale across distributed computing resources while reducing analysis cost. Simultaneously, these tools must provide programming interfaces and deployment models that are easily usable by biologists. In this dissertation, we describe the ADAM system for processing large genomic datasets using distributed computing. ADAM provides a decoupled stack-based architecture that can accommodate many data formats, deployment models, and data access patterns. Additionally, ADAM defines schemas that describe common genomic datatypes. ADAM’s schemas and programming models enable the easy integration of disparate genomic datatypes and datasets into a single analysis. To validate the ADAM architecture, we implemented an end-to-end variant calling pipeline using ADAM’s APIs. To perform parallel alignment, we developed the Cannoli tool, which uses ADAM’s APIs to automatically parallelize single-node aligners. We then implemented GATK-style alignment refinement as part of ADAM. Finally, we implemented a biallelic genotyping model and novel reassembly algorithms in the Avocado variant caller. This pipeline provides state-of-the-art SNV calling accuracy, along with high (97%) INDEL calling accuracy. To further validate this pipeline, we reanalyzed 270 samples from the Simons Genome Diversity Dataset.
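As a flavour of the biallelic genotyping model mentioned above, here is a toy likelihood computation for a single site. The independent-error model, the 0/1 read encoding, and the error rate are simplifying assumptions for illustration, not Avocado's actual implementation.

```python
import math

def genotype_log_likelihoods(reads, eps=0.01):
    """Log-likelihood of each biallelic genotype (g = number of ALT alleles,
    0, 1, or 2) given reads coded as 0 (REF base observed) or 1 (ALT base
    observed), assuming independent sequencing errors with probability eps."""
    lls = {}
    for g in (0, 1, 2):
        p_alt = g / 2  # chance a read drawn from this genotype carries ALT
        # Probability a read *shows* ALT, accounting for miscalls either way.
        p_obs_alt = p_alt * (1 - eps) + (1 - p_alt) * eps
        lls[g] = sum(
            math.log(p_obs_alt if r == 1 else 1 - p_obs_alt) for r in reads
        )
    return lls

# 10 reads at one site: 6 show ALT, 4 show REF -> heterozygote (g=1) favored.
print(genotype_log_likelihoods([1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))
```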

A Hybrid Parallel Approach to High-performance Compression of Big Genomic Files and in Compresso Data Processing

Author: Sandino N. Vargas Pérez
Publisher:
ISBN:
Category : Bioinformatics
Languages : en
Pages : 121

Book Description
Due to the rapid development of high-throughput, low-cost next-generation sequencing, genomic file transmission and storage is now one of the many Big Data challenges in computer science. Highly specialized compression techniques have been devised to tackle this issue, but sequential data compression has become increasingly inefficient, and existing parallel algorithms suffer from poor scalability. Even the best available solutions can take hours to compress gigabytes of data, making these techniques prohibitively expensive in time and space for large-scale genomics.

This dissertation responds to that problem with a novel hybrid parallel approach that speeds up the compression of big genomic datasets by combining the features of distributed- and shared-memory architectures and parallel programming models. The underlying algorithm was developed with several goals in mind: to balance the workload among processes and threads, to mitigate memory latency by exploiting locality, and to accelerate I/O by reducing excessive read/write operations and inter-node message exchange. To make the algorithm scalable, an innovative timestamp-based file structure was designed. It allows the compressed data to be written in a distributed and non-deterministic fashion while retaining the capability of decompressing the dataset back to its original state. To lessen the dependency on full decompression, the proposed file structure facilitates handling DNA sequences in their compressed state by using fine-grained decompression, a technique identified as in compresso data manipulation.

Theoretical analysis and experimental results from this research suggest strong scalability, with many datasets yielding super-linear speedups and constant efficiency. Performance measurements were executed to test the limitations imposed by Amdahl's law when doubling the number of processing units. Compressing a 1-terabyte FASTQ file took less than 3.5 minutes, a 90% time decrease relative to compression algorithms with multithreaded parallelism and more than a 98% time decrease relative to sequential ones (i.e., 10 to 50 times faster, respectively), with execution time improving proportionally to the added system resources. In addition, the proposed approach achieved better compression ratios by employing an entropy-based encoder optimized to work close to the data's Shannon entropy and a dictionary-based encoder with variable-length codes. Consequently, in compresso data manipulation was accelerated for FASTQ-to-FASTA format conversion, basic FASTQ file statistics, and DNA sequence pattern finding, extraction, and trimming. Findings of this research provide evidence that hybrid parallelism can also be implemented using CPU+GPU models, potentially increasing the computational power and portability of the approach.
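As a toy illustration of two of these ideas (compressing independent blocks in parallel, and keeping a per-block index so a region can be decompressed without touching the rest of the file), here is a minimal sketch. It uses Python's process pool and zlib rather than the dissertation's MPI+threads design and timestamp-based format, and the block size is an arbitrary choice.

```python
import zlib
from concurrent.futures import ProcessPoolExecutor

BLOCK = 1 << 16  # 64 KiB blocks; an illustrative choice, not the thesis's

def compress_blocks(data):
    """Compress fixed-size blocks independently (in parallel across
    processes) and return (blob, index), where index[i] locates block i."""
    chunks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    with ProcessPoolExecutor() as pool:
        blocks = list(pool.map(zlib.compress, chunks))
    index, offset = [], 0
    for b in blocks:
        index.append((offset, len(b)))
        offset += len(b)
    return b"".join(blocks), index

def read_block(blob, index, i):
    """Fine-grained decompression: inflate only block i, not the whole file."""
    off, length = index[i]
    return zlib.decompress(blob[off:off + length])

if __name__ == "__main__":
    data = b"@read1\nACGTACGT\n+\nIIIIIIII\n" * 50000
    blob, index = compress_blocks(data)
    assert read_block(blob, index, 3) == data[3 * BLOCK:4 * BLOCK]
```

The per-block index is what enables in compresso operations such as pattern finding or format conversion: each worker inflates only the blocks it needs and never reconstructs the whole dataset.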

Parallel Processing for Scientific Computing

Author: Michael A. Heroux
Publisher: SIAM
ISBN: 0898716195
Category : Computers
Languages : en
Pages : 407

Book Description
Scientific computing has often been called the third approach to scientific discovery, emerging as a peer to experimentation and theory. Historically, the synergy between experimentation and theory has been well understood: experiments give insight into possible theories, theories inspire experiments, experiments reinforce or invalidate theories, and so on. As scientific computing has evolved to produce results that meet or exceed the quality of experimental and theoretical results, it has become indispensable. Parallel processing has been an enabling technology in scientific computing for more than 20 years. This book is the first in-depth discussion of parallel computing in 10 years; it reflects the mix of topics that mathematicians, computer scientists, and computational scientists focus on to make parallel processing effective for scientific problems. Presently, the impact of parallel processing on scientific computing varies greatly across disciplines, but it plays a vital role in most problem domains and is absolutely essential in many of them. Parallel Processing for Scientific Computing is divided into four parts: The first concerns performance modeling, analysis, and optimization; the second focuses on parallel algorithms and software for an array of problems common to many modeling and simulation applications; the third emphasizes tools and environments that can ease and enhance the process of application development; and the fourth provides a sampling of applications that require parallel computing for scaling to solve larger and realistic models that can advance science and engineering. This edited volume serves as an up-to-date reference for researchers and application developers on the state of the art in scientific computing. It also serves as an excellent overview and introduction, especially for graduate and senior-level undergraduate students interested in computational modeling and simulation and related computer science and applied mathematics aspects.

Contents: List of Figures; List of Tables; Preface; Chapter 1: Frontiers of Scientific Computing: An Overview; Part I: Performance Modeling, Analysis and Optimization. Chapter 2: Performance Analysis: From Art to Science; Chapter 3: Approaches to Architecture-Aware Parallel Scientific Computation; Chapter 4: Achieving High Performance on the BlueGene/L Supercomputer; Chapter 5: Performance Evaluation and Modeling of Ultra-Scale Systems; Part II: Parallel Algorithms and Enabling Technologies. Chapter 6: Partitioning and Load Balancing; Chapter 7: Combinatorial Parallel and Scientific Computing; Chapter 8: Parallel Adaptive Mesh Refinement; Chapter 9: Parallel Sparse Solvers, Preconditioners, and Their Applications; Chapter 10: A Survey of Parallelization Techniques for Multigrid Solvers; Chapter 11: Fault Tolerance in Large-Scale Scientific Computing; Part III: Tools and Frameworks for Parallel Applications. Chapter 12: Parallel Tools and Environments: A Survey; Chapter 13: Parallel Linear Algebra Software; Chapter 14: High-Performance Component Software Systems; Chapter 15: Integrating Component-Based Scientific Computing Software; Part IV: Applications of Parallel Computing. Chapter 16: Parallel Algorithms for PDE-Constrained Optimization; Chapter 17: Massively Parallel Mixed-Integer Programming; Chapter 18: Parallel Methods and Software for Multicomponent Simulations; Chapter 19: Parallel Computational Biology; Chapter 20: Opportunities and Challenges for Parallel Computing in Science and Engineering; Index.

Genome-Scale Algorithm Design

Author: Veli Mäkinen
Publisher: Cambridge University Press
ISBN: 1316342948
Category : Science
Languages : en
Pages : 415

Book Description
High-throughput sequencing has revolutionised the field of biological sequence analysis. Its application has enabled researchers to address important biological questions, often for the first time. This book provides an integrated presentation of the fundamental algorithms and data structures that power modern sequence analysis workflows. The topics covered range from the foundations of biological sequence analysis (alignments and hidden Markov models), to classical index structures (k-mer indexes, suffix arrays and suffix trees), Burrows–Wheeler indexes, graph algorithms and a number of advanced omics applications. The chapters feature numerous examples, algorithm visualisations, exercises and problems, each chosen to reflect the steps of large-scale sequencing projects, including read alignment, variant calling, haplotyping, fragment assembly, alignment-free genome comparison, transcript prediction and analysis of metagenomic samples. Each biological problem is accompanied by precise formulations, providing graduate students and researchers in bioinformatics and computer science with a powerful toolkit for the emerging applications of high-throughput sequencing.
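As a taste of the classical index structures the book covers (suffix arrays and the Burrows-Wheeler transform), here is a deliberately naive construction; genome-scale indexes use linear-time algorithms and compressed representations that this sketch does not attempt.

```python
def suffix_array(text):
    """Indices of all suffixes of text in lexicographic order (naive
    O(n^2 log n) build via full suffix comparison)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform: last column of the sorted rotations,
    obtained here from the suffix array of text + sentinel."""
    t = text + sentinel
    return "".join(t[i - 1] for i in suffix_array(t))

print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
print(bwt("banana"))           # annb$aa
```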

Scalable Algorithms for Analysis of Genomic Diversity Data

Author: Bogdan Pașaniuc
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description