Complex Genome Analysis with High-throughput Sequencing Data

Author: Xin Li
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
The genomes of most eukaryotes are large and complex, and large amounts of non-coding sequence are a general property of these genomes. High-throughput sequencing is increasingly important for the study of complex genomes. In this dissertation, we focus on two computational problems in high-throughput sequence data analysis: detecting circular RNA and calling structural variations (especially deletions). Circular RNA (circRNA) is a kind of non-coding RNA that forms a circular configuration through a typical 5' to 3' phosphodiester bond created by non-canonical splicing. CircRNA was originally thought to be a byproduct of mis-splicing and was considered to be of low abundance. Recently, however, circRNA has come to be regarded as a new class of functional molecule, and its importance in gene regulation and its biological functions in some human diseases have started to be recognized. In this research work, we propose two algorithms to detect potential circRNA. To reduce running time, we design an algorithm called CircMarker that finds circRNA by building a k-mer table rather than by conventional read mapping. Furthermore, we develop an algorithm named CircDBG that uses information from both the reads and the annotated genome to build a de Bruijn graph for circRNA detection, which improves accuracy and sensitivity. Structural variation (SV), which ranges from 50 bp to ~3 Mb in size, is an important type of genetic variation. A deletion is a type of SV in which a part of a chromosome or a sequence of DNA is lost during DNA replication. In this research work, we develop a new method called EigenDel for detecting genomic deletions. EigenDel first uses discordant read pairs and clipped reads to obtain initial deletion candidates. It then clusters similar deletion candidates together and calls true deletions from each cluster using an unsupervised learning method. EigenDel outperforms other major methods in balancing accuracy and sensitivity and in reducing bias. Our results in this dissertation show that, with effective computational approaches, sequencing data can be used to study complex genomes.
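
To make the k-mer table idea in the description above more concrete, here is a minimal Python sketch. It is not the CircMarker implementation: the function names `build_kmer_table` and `exon_hits`, the toy exon sequences, and the choice of k = 4 are all assumptions made for illustration. The sketch indexes annotated exon sequences by k-mer and then looks up a read's k-mers directly, without aligning the read. A read whose k-mers hit a downstream exon before an upstream one is the kind of out-of-order signal that may indicate a back-splice junction.

```python
from collections import defaultdict


def build_kmer_table(exons, k):
    """Map each k-mer to the set of (exon_name, offset) positions where it occurs."""
    table = defaultdict(set)
    for name, seq in exons.items():
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add((name, i))
    return table


def exon_hits(read, table, k):
    """List the exons hit by the read's k-mers, in read order, collapsing repeats."""
    order = []
    for i in range(len(read) - k + 1):
        # .get avoids inserting empty entries into the defaultdict on a miss
        for name, _offset in sorted(table.get(read[i:i + k], ())):
            if not order or order[-1] != name:
                order.append(name)
    return order


# Hypothetical toy data: exon1 lies upstream of exon2 in the annotation.
exons = {"exon1": "ACGTACGGTC", "exon2": "TTGACCAGTA"}
table = build_kmer_table(exons, k=4)

# This read joins the end of exon2 to the start of exon1.
print(exon_hits("CAGTAACGTA", table, k=4))  # ['exon2', 'exon1']
```

A table lookup of this kind costs time proportional to the read length, which is one plausible reason a k-mer approach can run faster than full read mapping; a real tool would also need to handle sequencing errors, strand, and repeated k-mers.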

Statistical Analysis of Next Generation Sequencing Data

Author: Somnath Datta
Publisher: Springer
ISBN: 3319072129
Category : Medical
Languages : en
Pages : 438

Book Description
Next Generation Sequencing (NGS) is the latest high-throughput technology to revolutionize genomic research. NGS generates massive genomic datasets that play a key role in the big data phenomenon that surrounds us today. To extract signals from high-dimensional NGS data and make valid statistical inferences and predictions, novel data analytic and statistical techniques are needed. This book contains 20 chapters written by prominent statisticians working with NGS data. The topics range from basic preprocessing and analysis of NGS data to more complex genomic applications such as copy number variation and isoform expression detection. Research statisticians who want to learn about this growing and exciting area will find this book useful. In addition, many chapters from this book could be included in graduate-level classes in statistical bioinformatics for training future biostatisticians who will be expected to deal with genomic data in basic biomedical research, genomic clinical trials and personalized medicine. About the editors: Somnath Datta is Professor and Vice Chair of Bioinformatics and Biostatistics at the University of Louisville. He is a Fellow of the American Statistical Association, a Fellow of the Institute of Mathematical Statistics and an Elected Member of the International Statistical Institute. He has contributed to numerous research areas in Statistics, Biostatistics and Bioinformatics. Dan Nettleton is Professor and Laurence H. Baker Endowed Chair of Biological Statistics in the Department of Statistics at Iowa State University. He is a Fellow of the American Statistical Association and has published research on a variety of topics in statistics, biology and bioinformatics.

Genome Analysis and Human Health

Author: Leena Rawal
Publisher: Springer
ISBN: 9811042985
Category : Medical
Languages : en
Pages : 170

Book Description
This book highlights selected current data and its relevance in the human health care system, offering a fundamental primer on genetics and human health. With the advent of new high-throughput technologies (for the whole genome including exome sequencing), the conventional focus on genetics and individual genes is now shifting toward the analysis of complex genes, gene-gene interactions and the association between genes and environment, including epigenetics. The rapidly changing scientific research landscape, with the ever-growing influx of data on one hand and emergence of newer and more complicated diseases on the other, has created a dilemma for researchers and caregivers, who are still hopeful that advances in genetics and genomics will provide avenues for the understanding, prevention and possible cure of human diseases. The book focuses on the interactions between genes and proteins at both the transcriptome and proteome levels, which in turn affect the human genome and health. Additionally, it covers the domain that must be explored in order to understand the gene-gene and protein-protein interactions that contribute to human health. The book offers a valuable guide for all students and researchers working in the area of molecular genetics and genomics.

Genetic Analysis of Complex Disease

Author: William K. Scott
Publisher: John Wiley & Sons
ISBN: 1119104076
Category : Science
Languages : en
Pages : 340

Book Description
An up-to-date and complete treatment of the strategies, designs and analysis methods for studying complex genetic disease in human beings. In the newly revised Third Edition of Genetic Analysis of Complex Diseases, a team of distinguished geneticists delivers a comprehensive introduction to the most relevant strategies, designs and methods of analysis for the study of complex genetic disease in humans. The book focuses on concepts and designs, thereby offering readers a broad understanding of common problems and solutions in the field based on successful applications in the design and execution of genetic studies. This edited volume contains contributions from some of the leading voices in the area and presents new chapters on high-throughput genomic sequencing, copy-number variant analysis and epigenetic studies. It provides clear and easily referenced overviews of the considerations involved in genetic analysis of complex human genetic disease, including sampling, design, data collection, linkage and association studies and social, legal and ethical issues. Genetic Analysis of Complex Diseases also provides: a thorough introduction to study design for the identification of genes in complex traits; comprehensive explorations of basic concepts in genetics, disease phenotype definition and the determination of the genetic components of disease; practical discussions of modern bioinformatics tools for analysis of genetic data; reflections on responsible conduct of research in genetic studies, as well as linkage analysis and data management; and a new, expanded chapter on complex genetic interactions. This latest edition of Genetic Analysis of Complex Diseases is a must-read resource for molecular biologists, human geneticists, genetic epidemiologists and pharmaceutical researchers. It is also invaluable for graduate students taking courses in statistical genetics or genetic epidemiology.

Beginners Guide To Bioinformatics For High Throughput Sequencing

Author: Eric Cheng-yu Lee
Publisher: World Scientific
ISBN: 9813230533
Category : Science
Languages : en
Pages : 277

Book Description
Biologists find computing bewildering, yet they are expected to process the voluminous data produced by the machines they buy and the datasets that have accumulated in genomic databanks worldwide. It is now increasingly difficult for them to avoid dealing with large volumes of data, work that goes beyond just doing manual programming. Most books in this realm are full of equations and complex code, but this book gives a much gentler entry point, particularly for biologists, with code snippets that users can cut and paste and run on their Linux or MacOSX operating system or cloud instance. It also provides step-by-step installation instructions that they can easily follow. Those who work in genome sequencing and are already familiar with the analysis procedures may also find this book useful for closing some knowledge gaps. High-throughput sequencing requires high-throughput and high-performance computing. This book provides a gentle entry to high-throughput sequencing by dealing with simple skills that the average biologist is increasingly required to master. You will find this book a breeze to read, and some suggestions in it may be new to you, something you might want to try out.

Toward a More Accurate Genome

Author: William Jacob Benhardt Biesinger
Publisher:
ISBN: 9781321093667
Category :
Languages : en
Pages : 124

Book Description
High-throughput sequencing enables basic and translational biology to query the mechanics of both life and disease at single-nucleotide resolution and with breadth that spans the genome. This revolutionary technology is a major tool in biomedical research, impacting our understanding of life's most basic mechanics and affecting human health and medicine. Unfortunately, this important technology produces very large, error-prone datasets that require substantial computational processing before experimental conclusions can be made. Since errors and hidden biases in the data may influence empirically derived conclusions, accurate algorithms and models of the data are critical. This thesis focuses on the development of statistical models for high-throughput sequencing data which are capable of handling errors and which are built to reflect biological realities. First, we focus on increasing the fraction of the genome that can be reliably queried in biological experiments using high-throughput sequencing methods by expanding analysis into repeat regions of the genome. The method allows partial observation of the gene regulatory network topology through identification of transcription factor binding sites using Chromatin Immunoprecipitation followed by high-throughput sequencing (ChIP-seq). Binding site clustering, or "peak-calling", can be frustrated by the complex, repetitive nature of genomes. Traditionally, these regions are censored from any interpretation, but we re-enable their interpretation using a probabilistic method for realigning problematic DNA reads. Second, we leverage high-throughput sequencing data for the empirical discovery of underlying epigenetic cell state, enabled through analysis of combinations of histone marks. We use a novel probabilistic model to perform spatial and temporal clustering of histone marks and capture mark combinations that correlate well with cell activity. In a first for epigenetic modeling with high-throughput sequencing data, we not only pool information across cell types, but directly model the relationship between them, improving predictive power across several datasets. Third, we develop a scalable approach to genome assembly using high-throughput sequencing reads. While several assembly solutions exist, most don't scale well to large datasets, requiring computers with copious memory to assemble large genomes. Throughput continues to increase, and the large datasets available today and in the near future will require truly scalable methods. We present a promising distributed method for genome assembly which distributes the de Bruijn graph across many computers and seamlessly spills to disk when main memory is insufficient. We also show novel graph cleaning algorithms which should handle increased errors from large datasets better than traditional graph structure-based cleaning. High-throughput sequencing plays an important role in biomedical research, and has already affected human health and medicine. Future experimental procedures will continue to rely on statistical methods to provide crucial error and bias correction, in addition to modeling expected outcomes. Thus, further development of robust statistical models is critical to the future of high-throughput sequencing, ensuring a strong foundation for correct biological conclusions.
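
For readers unfamiliar with the de Bruijn graph mentioned in the description above, the following Python sketch shows the basic construction on a single machine. It is only an illustration under stated assumptions: the function name `de_bruijn_graph`, the toy reads and k = 3 are hypothetical, and the sketch does not attempt the distributed storage, disk spilling, or graph cleaning that the thesis describes. Each k-mer in a read contributes one edge from its (k-1)-length prefix to its (k-1)-length suffix.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph as {prefix (k-1)-mer: set of suffix (k-1)-mers}."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # One edge per k-mer: first k-1 bases -> last k-1 bases
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical toy reads that overlap on "GTC".
graph = de_bruijn_graph(["ACGTC", "GTCA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because the graph is keyed by (k-1)-mers, overlapping reads merge into shared nodes without any pairwise read comparison. That keying is also what would make it natural to partition nodes across machines, for example by hashing each key, although that step is not shown here.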

Bioinformatics, Supercomputing And Complex Genome Analysis - Proceedings Of The 2nd International Conference

Author: Hwa A Lim
Publisher: World Scientific
ISBN: 9814602558
Category :
Languages : en
Pages : 682

Book Description
Since the beginning of the genome project, the necessary involvement of scientists of widely divergent backgrounds has been evident. The proper handling, analysis and dissemination of information, and the control of and data gathering from automated processes, are areas where computers are directly involved. Thus computers are intimately tied into the production and analysis of biological data. However, many challenges lie ahead. This volume is a collection of selected oral and poster presentations given at The Second International Conference on Bioinformatics, Supercomputing and Complex Genome Analysis, organized to address some of these challenges. The topics include the current status and future prospects of genome map, mapping and sequencing, complex genome analysis, linguistic and neural network approaches, database issues, and computer tools in the genome project. The volume will be ideal for students, newcomers, young researchers and experts alike, who are computationally or experimentally oriented. Keynote Speakers: C L Smith, D Grothues, T Ito, T Sano, D Wang, Y-W Zhu, C R Canton & R J Rohins.

Statistical Methods for the Analysis of Genomic Data

Author: Hui Jiang
Publisher: MDPI
ISBN: 3039361406
Category : Science
Languages : en
Pages : 136

Book Description
In recent years, technological breakthroughs have greatly enhanced our ability to understand the complex world of molecular biology. Rapid developments in genomic profiling techniques, such as high-throughput sequencing, have brought new opportunities and challenges to the fields of computational biology and bioinformatics. Furthermore, by combining genomic profiling techniques with other experimental techniques, many powerful approaches (e.g., RNA-Seq, ChIP-Seq, single-cell assays, and Hi-C) have been developed in order to help explore complex biological systems. As a result of the increasing availability of genomic datasets, in terms of both volume and variety, the analysis of such data has become a critical challenge as well as a topic of great interest. Therefore, statistical methods that address the problems associated with these newly developed techniques are in high demand. This book includes a number of studies that highlight the state-of-the-art statistical methods for the analysis of genomic data and explore future directions for improvement.

Evolution of Translational Omics

Author: Institute of Medicine
Publisher: National Academies Press
ISBN: 0309224187
Category : Science
Languages : en
Pages : 354

Book Description
Technologies collectively called omics enable simultaneous measurement of an enormous number of biomolecules; for example, genomics investigates thousands of DNA sequences, and proteomics examines large numbers of proteins. Scientists are using these technologies to develop innovative tests to detect disease and to predict a patient's likelihood of responding to specific drugs. Following a recent case involving premature use of omics-based tests in cancer clinical trials at Duke University, the NCI requested that the IOM establish a committee to recommend ways to strengthen omics-based test development and evaluation. This report identifies best practices to enhance development, evaluation, and translation of omics-based tests while simultaneously reinforcing steps to ensure that these tests are appropriately assessed for scientific validity before they are used to guide patient treatment in clinical trials.

New High Throughput Technologies for DNA Sequencing and Genomics

Author: Keith R. Mitchelson
Publisher: Elsevier
ISBN: 0080471285
Category : Science
Languages : en
Pages : 399

Book Description
Since the independent invention of DNA sequencing by Sanger and by Gilbert 30 years ago, it has grown from a small-scale technique capable of reading several kilobase-pairs of sequence per day into today's multibillion-dollar industry. This growth has spurred the development of new sequencing technologies that do not involve either electrophoresis or Sanger sequencing chemistries. Sequencing by Synthesis (SBS) involves multiple parallel micro-sequencing addition events occurring on a surface, where data from each round is detected by imaging. New High Throughput Technologies for DNA Sequencing and Genomics is the second volume in the Perspectives in Bioanalysis series, which looks at the electroanalytical chemistry of nucleic acids and proteins, development of electrochemical sensors and their application in biomedicine and in the new fields of genomics and proteomics. The authors have expertly formatted the information for a wide variety of readers, including new developments that will inspire students and young scientists to create new tools for science and medicine in the 21st century. Reviews of complementary developments in Sanger and SBS sequencing chemistries, capillary electrophoresis and microdevice integration, MS sequencing and applications set the framework for the book.
* 'Hot Topic' with DNA sequencing continuing as a major research activity in many areas of life science and medicine.
* Bringing together new developments in DNA sequencing technology
* Reviewing issues relevant to the new applications used