High-performance Processing of Next-generation Sequencing Data on CUDA-enabled GPUs

High-performance Processing of Next-generation Sequencing Data on CUDA-enabled GPUs PDF Author: Felix Kallenborn
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Get Book Here

Book Description
With the technological advances in the field of genomics and sequencing, the processing of vast amounts of generated data becomes more and more challenging. Nowadays, software for processing large-scale datasets of sequencing reads may take hours to days to complete, even on high-end workstations. This explains the need for new approaches to achieve faster, high-performance applications. In contrast to traditional CPU-based software, algorithms utilizing the massively-parallel many-core architecture and fast memory of GPUs are potentially able to deliver the desired performance in many fields. In this thesis, we introduce two novel GPU-accelerated applications, CARE and CAREx, for common steps in sequence processing pipelines, error correction and read extension of Next Generation Sequencing (NGS) Illumina data, to improve the results of down-stream data analysis. To the best of our knowledge, CARE and CAREx are the first modern GPU-accelerated solutions for the respective problems. A key component of our algorithm is the identification of similar DNA sequences within a dataset. For this purpose, we developed a minhashing-based index data structure for large-scale read datasets. In conjunction with our fast bit-parallel shifted hamming distance computations, this allows for the efficient identification of similar reads. The resulting set of similar sequences is subsequently arranged into a gap-free multiple-sequence alignment to solve the problem at hand. Sequencing machines introduce both systematic errors and random errors. CARE, Context-Aware Read Error corrector, accurately removes errors introduced by NGS sequencing machines during the initial sequencing of a biological sample. With the help of a pre-trained Random Forest, CARE generates two orders-of-magnitude fewer false positives than its competitors. At the same time, it shows similar numbers of true positives. Read extension describes the process of elongating DNA sequences. The presence of longer sequences improves the resolution of more, larger structures within a genome. CAREx, Context-Aware Read Extender, produces longer sequences, so called pseudo-long reads, by connecting the two reads of read pairs which were sequenced in close proximity. Evaluation shows that CAREx produces significantly more highly accurate pseudo-long reads than the state-of-the-art. With algorithms tailored towards high-performance GPU computations, both CARE and CAREx run significantly faster than the CPU-based competitors, while, at the same time, produce more accurate results. The processing of a large Human dataset with 30x coverage with CARE requires less than 30 minutes using a single A100 GPU. This time can be further reduced down to 10 minutes on multi-GPU systems. In contrast, CPU-based tools like Musket or BFC take 3 hours and 1.5 hours, respectively. Read extension of a Human dataset with CAREx takes 3.3 hours to complete on a single GPU, whereas Konnector2 requires over a day to complete. This shows that large-scale sequence processing can greatly benefit from the usage of GPUs, and that multiple-sequence alignment-based algorithms should be considered despite their increased complexity because they provide great accuracy. While our general building blocks have been tailored towards our needs for error correction and read extension, they could also prove useful in other GPU-accelerated applications that process sequence data.

High-performance Processing of Next-generation Sequencing Data on CUDA-enabled GPUs

High-performance Processing of Next-generation Sequencing Data on CUDA-enabled GPUs PDF Author: Felix Kallenborn
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Get Book Here

Book Description
With the technological advances in the field of genomics and sequencing, the processing of vast amounts of generated data becomes more and more challenging. Nowadays, software for processing large-scale datasets of sequencing reads may take hours to days to complete, even on high-end workstations. This explains the need for new approaches to achieve faster, high-performance applications. In contrast to traditional CPU-based software, algorithms utilizing the massively-parallel many-core architecture and fast memory of GPUs are potentially able to deliver the desired performance in many fields. In this thesis, we introduce two novel GPU-accelerated applications, CARE and CAREx, for common steps in sequence processing pipelines, error correction and read extension of Next Generation Sequencing (NGS) Illumina data, to improve the results of down-stream data analysis. To the best of our knowledge, CARE and CAREx are the first modern GPU-accelerated solutions for the respective problems. A key component of our algorithm is the identification of similar DNA sequences within a dataset. For this purpose, we developed a minhashing-based index data structure for large-scale read datasets. In conjunction with our fast bit-parallel shifted hamming distance computations, this allows for the efficient identification of similar reads. The resulting set of similar sequences is subsequently arranged into a gap-free multiple-sequence alignment to solve the problem at hand. Sequencing machines introduce both systematic errors and random errors. CARE, Context-Aware Read Error corrector, accurately removes errors introduced by NGS sequencing machines during the initial sequencing of a biological sample. With the help of a pre-trained Random Forest, CARE generates two orders-of-magnitude fewer false positives than its competitors. At the same time, it shows similar numbers of true positives. Read extension describes the process of elongating DNA sequences. The presence of longer sequences improves the resolution of more, larger structures within a genome. CAREx, Context-Aware Read Extender, produces longer sequences, so called pseudo-long reads, by connecting the two reads of read pairs which were sequenced in close proximity. Evaluation shows that CAREx produces significantly more highly accurate pseudo-long reads than the state-of-the-art. With algorithms tailored towards high-performance GPU computations, both CARE and CAREx run significantly faster than the CPU-based competitors, while, at the same time, produce more accurate results. The processing of a large Human dataset with 30x coverage with CARE requires less than 30 minutes using a single A100 GPU. This time can be further reduced down to 10 minutes on multi-GPU systems. In contrast, CPU-based tools like Musket or BFC take 3 hours and 1.5 hours, respectively. Read extension of a Human dataset with CAREx takes 3.3 hours to complete on a single GPU, whereas Konnector2 requires over a day to complete. This shows that large-scale sequence processing can greatly benefit from the usage of GPUs, and that multiple-sequence alignment-based algorithms should be considered despite their increased complexity because they provide great accuracy. While our general building blocks have been tailored towards our needs for error correction and read extension, they could also prove useful in other GPU-accelerated applications that process sequence data.

CUDA for Engineers

CUDA for Engineers PDF Author: Duane Storti
Publisher: Addison-Wesley Professional
ISBN: 013417755X
Category : Computers
Languages : en
Pages : 739

Get Book Here

Book Description
CUDA for Engineers gives you direct, hands-on engagement with personal, high-performance parallel computing, enabling you to do computations on a gaming-level PC that would have required a supercomputer just a few years ago. The authors introduce the essentials of CUDA C programming clearly and concisely, quickly guiding you from running sample programs to building your own code. Throughout, you’ll learn from complete examples you can build, run, and modify, complemented by additional projects that deepen your understanding. All projects are fully developed, with detailed building instructions for all major platforms. Ideal for any scientist, engineer, or student with at least introductory programming experience, this guide assumes no specialized background in GPU-based or parallel computing. In an appendix, the authors also present a refresher on C programming for those who need it. Coverage includes Preparing your computer to run CUDA programs Understanding CUDA’s parallelism model and C extensions Transferring data between CPU and GPU Managing timing, profiling, error handling, and debugging Creating 2D grids Interoperating with OpenGL to provide real-time user interactivity Performing basic simulations with differential equations Using stencils to manage related computations across threads Exploiting CUDA’s shared memory capability to enhance performance Interacting with 3D data: slicing, volume rendering, and ray casting Using CUDA libraries Finding more CUDA resources and code Realistic example applications include Visualizing functions in 2D and 3D Solving differential equations while changing initial or boundary conditions Viewing/processing images or image stacks Computing inner products and centroids Solving systems of linear algebraic equations Monte-Carlo computations

Encyclopedia of Bioinformatics and Computational Biology

Encyclopedia of Bioinformatics and Computational Biology PDF Author:
Publisher: Elsevier
ISBN: 0128114320
Category : Medical
Languages : en
Pages : 3421

Get Book Here

Book Description
Encyclopedia of Bioinformatics and Computational Biology: ABC of Bioinformatics, Three Volume Set combines elements of computer science, information technology, mathematics, statistics and biotechnology, providing the methodology and in silico solutions to mine biological data and processes. The book covers Theory, Topics and Applications, with a special focus on Integrative –omics and Systems Biology. The theoretical, methodological underpinnings of BCB, including phylogeny are covered, as are more current areas of focus, such as translational bioinformatics, cheminformatics, and environmental informatics. Finally, Applications provide guidance for commonly asked questions. This major reference work spans basic and cutting-edge methodologies authored by leaders in the field, providing an invaluable resource for students, scientists, professionals in research institutes, and a broad swath of researchers in biotechnology and the biomedical and pharmaceutical industries. Brings together information from computer science, information technology, mathematics, statistics and biotechnology Written and reviewed by leading experts in the field, providing a unique and authoritative resource Focuses on the main theoretical and methodological concepts before expanding on specific topics and applications Includes interactive images, multimedia tools and crosslinking to further resources and databases

Accelerating Bioinformatics Applications on CUDA-enabled Multi-GPU Systems

Accelerating Bioinformatics Applications on CUDA-enabled Multi-GPU Systems PDF Author: Robin Kobus
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Get Book Here

Book Description
A wide range of bioinformatics applications have to deal with a continuously growing amount of data generated by high-throughput sequencing techniques. Exclusively CPU-based workstations fail to keep up with the task. Instead of employing dozens of CPU cluster nodes to increase the computational power, massively parallel accelerators like modern CUDA-enabled GPUs can be used to achieve higher throughput and reduce execution times. However, memory capacity of such devices is often limited. Efficient parallelization and data distribution are essential to accelerate performance critical components of bionformatics pipelines like read classification and read mapping. In this thesis we analyze and optimize tasks common to many GPU-based applications in the context of bioinformatics. We study sequence processing, construction and querying of k-mer-based hash tables, segmented sort as well as multi-GPU communication. With these methods we accelerate suffix array construction and metagenomic read classification on CUDA-enabled GPUs by overcoming the aforementioned challenges. By leveraging multiple GPUs, we extend the limited memory available from a single GPU to allow for the construction of larger indices. Our communication library, called Gossip, introduces optimized scatter, gather and all-to-all patterns for multi-GPU systems. Gossip's all-to-all communication pattern is successfully applied to suffix array construction, accelerating it to run in 3.44 s for a full-length human genome on an 8-GPU server, which is faster than previously reported 4.8 seconds achieved by employing 1600 cores on 100 nodes on a CPU-based HPC cluster. Furthermore, we introduce MetaCache-GPU -- an ultra-fast metagenomic short read classifier specifically tailored to fit the characteristics of CUDA-enabled accelerators. Our approach employs a novel hash table variant featuring efficient minhash fingerprinting of reads for locality-sensitive hashing and their rapid insertion using warp-aggregated operations. Our performance evaluation shows that MetaCache-GPU is able to build large reference databases in a matter of seconds, enabling instantaneous operability, while popular CPU-based tools such as Kraken2 require over an hour for index construction on the same data. In the light of an ever-growing number of reference genomes, MetaCache-GPU is the first metagenomic classifier that makes analysis pipelines with on-demand composition of large-scale reference genome sets practical. Although many sub-problems in this thesis are optimized in a specific application context, they also apply to other bioinformatics problems like k-mer counting, sequence alignment and assembly, which would benefit from GPU acceleration. In addition to the insights from this work, we make our source code publicly available to allow for easier adaptation of our methods to related problems.

Algorithms for Next-Generation Sequencing Data

Algorithms for Next-Generation Sequencing Data PDF Author: Mourad Elloumi
Publisher: Springer
ISBN: 3319598260
Category : Computers
Languages : en
Pages : 356

Get Book Here

Book Description
The 14 contributed chapters in this book survey the most recent developments in high-performance algorithms for NGS data, offering fundamental insights and technical information specifically on indexing, compression and storage; error correction; alignment; and assembly. The book will be of value to researchers, practitioners and students engaged with bioinformatics, computer science, mathematics, statistics and life sciences.

Computational Methods for Next Generation Sequencing Data Analysis

Computational Methods for Next Generation Sequencing Data Analysis PDF Author: Ion Mandoiu
Publisher: John Wiley & Sons
ISBN: 1119272165
Category : Computers
Languages : en
Pages : 464

Get Book Here

Book Description
Introduces readers to core algorithmic techniques for next-generation sequencing (NGS) data analysis and discusses a wide range of computational techniques and applications This book provides an in-depth survey of some of the recent developments in NGS and discusses mathematical and computational challenges in various application areas of NGS technologies. The 18 chapters featured in this book have been authored by bioinformatics experts and represent the latest work in leading labs actively contributing to the fast-growing field of NGS. The book is divided into four parts: Part I focuses on computing and experimental infrastructure for NGS analysis, including chapters on cloud computing, modular pipelines for metabolic pathway reconstruction, pooling strategies for massive viral sequencing, and high-fidelity sequencing protocols. Part II concentrates on analysis of DNA sequencing data, covering the classic scaffolding problem, detection of genomic variants, including insertions and deletions, and analysis of DNA methylation sequencing data. Part III is devoted to analysis of RNA-seq data. This part discusses algorithms and compares software tools for transcriptome assembly along with methods for detection of alternative splicing and tools for transcriptome quantification and differential expression analysis. Part IV explores computational tools for NGS applications in microbiomics, including a discussion on error correction of NGS reads from viral populations, methods for viral quasispecies reconstruction, and a survey of state-of-the-art methods and future trends in microbiome analysis. Computational Methods for Next Generation Sequencing Data Analysis: Reviews computational techniques such as new combinatorial optimization methods, data structures, high performance computing, machine learning, and inference algorithms Discusses the mathematical and computational challenges in NGS technologies Covers NGS error correction, de novo genome transcriptome assembly, variant detection from NGS reads, and more This text is a reference for biomedical professionals interested in expanding their knowledge of computational techniques for NGS data analysis. The book is also useful for graduate and post-graduate students in bioinformatics.

Intelligent Distributed Computing XIV

Intelligent Distributed Computing XIV PDF Author: David Camacho
Publisher: Springer Nature
ISBN: 3030966275
Category : Technology & Engineering
Languages : en
Pages : 464

Get Book Here

Book Description
This book collects 43 regular papers received from 18 countries that present innovative advances in intelligent and distributed computing, encompassing both architectural and algorithmic results related to these fields. Significant attention is given to new models, techniques, and applications for distributed intelligent architectures and high-performance architectures, machine learning techniques, Internet of Things, blockchain, intelligent transport systems, data analytics, trust and reputation systems, and many others. The book includes the peer-reviewed proceedings of the 14th International Symposium on Intelligent Distributed Computing (IDC 2021), which was held in online mode due to the COVSARS2 pandemic situation, during September 16–18, 2021. The IDC 2021 event included sessions on Internet of Things, data analytics, machine learning, multi-agent systems, algorithms, future intelligent transport solutions, blockchain, intelligent distributed computing for cyber-physical security, and security and trust and reputation in intelligent environments.

Domain-Specific Computer Architectures for Emerging Applications

Domain-Specific Computer Architectures for Emerging Applications PDF Author: Chao Wang
Publisher: CRC Press
ISBN: 1040031986
Category : Computers
Languages : en
Pages : 417

Get Book Here

Book Description
With the end of Moore’s Law, domain-specific architecture (DSA) has become a crucial mode of implementing future computing architectures. This book discusses the system-level design methodology of DSAs and their applications, providing a unified design process that guarantees functionality, performance, energy efficiency, and real-time responsiveness for the target application. DSAs often start from domain-specific algorithms or applications, analyzing the characteristics of algorithmic applications, such as computation, memory access, and communication, and proposing the heterogeneous accelerator architecture suitable for that particular application. This book places particular focus on accelerator hardware platforms and distributed systems for various novel applications, such as machine learning, data mining, neural networks, and graph algorithms, and also covers RISC-V open-source instruction sets. It briefly describes the system design methodology based on DSAs and presents the latest research results in academia around domain-specific acceleration architectures. Providing cutting-edge discussion of big data and artificial intelligence scenarios in contemporary industry and typical DSA applications, this book appeals to industry professionals as well as academicians researching the future of computing in these areas.

Parallelization on Graphic Hardware

Parallelization on Graphic Hardware PDF Author: Guillaume Rizk
Publisher:
ISBN:
Category :
Languages : en
Pages : 112

Get Book Here

Book Description
Bioinformatics require the analysis of large amounts of data. With the recent advent of next generation sequencing technologies generating data at a cheap cost, the computational power needed has increased dramatically. Graphic Processing Units (GPU) are now programmable beyond simple graphic computations, providing cheap high performance for general purpose applications. This thesis explores the usage of GPUs for bioinformatics applications. First, this work focuses on the computation of secondary structures of RNA sequences. It is traditionally conducted with a dynamic programming algorithm, which poses significant challenges for a GPU implementation. We introduce a new tiled implementation providing good data locality and therefore very efficient GPU code. We note that our algorithmic modification also enables tiling and subsequent vectorization of the CPU program, allowing us to conduct a fair CPU-GPU comparison. Secondly, this works addresses the short sequence alignment problem. We present an attempt at GPU parallelization using the seed-and-extend paradigm. Since this attempt is unsuccessful, we then focus on the development of a program running on CPU. Our main contribution is the development of a new algorithm filtering candidate alignment locations quickly, based on the pre computation of tiles of the dynamic programming matrix. This new algorithm proved to be in fact more effective on a sequential CPU program and lead to an efficient new CPU aligner. Our work provides the example of both successful an unsuccessful attempts at GPU parallelization. These two points of view allow us to evaluate GPUs efficiency and the role they can play in bioinformatics.

Algorithms in Computational Molecular Biology

Algorithms in Computational Molecular Biology PDF Author: Mourad Elloumi
Publisher: John Wiley & Sons
ISBN: 1118101987
Category : Science
Languages : en
Pages : 1027

Get Book Here

Book Description
This book represents the most comprehensive and up-to-date collection of information on the topic of computational molecular biology. Bringing the most recent research into the forefront of discussion, Algorithms in Computational Molecular Biology studies the most important and useful algorithms currently being used in the field, and provides related problems. It also succeeds where other titles have failed, in offering a wide range of information from the introductory fundamentals right up to the latest, most advanced levels of study.