High-Dimensional Methods to Model Biological Signal in Genome-Wide Studies

Author: Andrew J. Bass
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Recent advancements in sequencing technology have substantially increased the quality and quantity of data in genomics, presenting novel analytical challenges for biological discovery. In particular, foundational ideas developed in statistics over the past century are not easily extended to these high-dimensional datasets. Therefore, creating novel methodologies to analyze this data is a key challenge faced in statistics, and more generally, biology and computational science.
Here I focus on building statistical methods for genome-wide analysis that are statistically rigorous, computationally fast, and easy to implement. In particular, I develop four methods that improve statistical inference of high-dimensional biological data. The first focuses on differential expression analysis where I extend the optimal discovery procedure (ODP) to complex study designs and RNA-seq studies. I find that the extended ODP leverages shared biological signal to substantially improve the statistical power compared to other commonly used testing procedures. The second aims to model the functional relationship between sequencing depth and statistical power in RNA-seq differential expression studies. The resulting model, superSeq, accurately predicts the improvement in statistical power when sequencing additional reads in a completed study. Thus, superSeq can guide researchers in choosing a sufficient sequencing depth to maximize statistical power while avoiding unnecessary sequencing costs.
The third method estimates the posterior distribution of false discovery rate (FDR) quantities, such as local FDRs and q-values, using a Bayesian nonparametric approach. Specifically, I implement an approximation to these posterior distributions that is scalable to genome-wide datasets using variational inference. These estimated posterior distributions are informative in a significance analysis as they capture the uncertainty of FDR quantities in reported results.
Finally, I develop a likelihood-based approach to estimating unobserved population structure on the canonical parameter scale. I demonstrate that this framework can flexibly capture arbitrary structure and provide accurate allele frequency estimates while being computationally fast for large population genetic studies. Therefore, this framework is useful for many applications in population genetics, such as accounting for structure in the genome-wide association testing procedure GCATest.
Collectively, these four methods address problems typically encountered in a biological analysis and can thus help improve downstream inferences in high-dimensional settings.

Model Selection Methods for High-dimensional Data and Their Applications to Genome-wide Association Studies

Author: Zheyang Wu
Publisher:
ISBN:
Category :
Languages : en
Pages : 418

Book Description


Capturing Hidden Signals From High-Dimensional Data and Applications to Genomics

Author: Elior Rahmani
Publisher:
ISBN:
Category :
Languages : en
Pages : 223

Book Description
The analysis of high-dimensional data, albeit challenging owing to various computational and statistical aspects, often provides opportunities to uncover hidden signals by leveraging inherent structure in the data. In the context of genomics, where molecular markers are probed at ever-increasing resolution and throughput, large sets of features that follow specific patterns, in conjunction with large sample sizes, allow us to implement richer and more sophisticated models than before in an attempt to extract signal that is not immediately evident from the data. In particular, genomic markers are often affected by multiple genetic and environmental factors; they may differ in their regulation and presentation across tissues, cell types, and conditions, or over time; and some markers may affect multiple biological processes. Unveiling those signals is likely to be pivotal in advancing our understanding of complex biology and disease. This dissertation introduces novel computational methodologies and theory that address several key challenges faced in the analysis of high-dimensional genomic data coming from heterogeneous sources ("bulk" genomics), with a particular focus on DNA methylation data. Through a range of simulations and the analysis of multiple data sets, we demonstrate that our proposed methods provide opportunities to conduct powerful and statistically sound population-level studies at an unprecedented resolution and scale.

Statistical Analysis of Next Generation Sequencing Data

Author: Somnath Datta
Publisher: Springer
ISBN: 3319072129
Category : Medical
Languages : en
Pages : 438

Book Description
Next Generation Sequencing (NGS) is the latest high throughput technology to revolutionize genomic research. NGS generates massive genomic datasets that play a key role in the big data phenomenon that surrounds us today. To extract signals from high-dimensional NGS data and make valid statistical inferences and predictions, novel data analytic and statistical techniques are needed. This book contains 20 chapters written by prominent statisticians working with NGS data. The topics range from basic preprocessing and analysis with NGS data to more complex genomic applications such as copy number variation and isoform expression detection. Research statisticians who want to learn about this growing and exciting area will find this book useful. In addition, many chapters from this book could be included in graduate-level classes in statistical bioinformatics for training future biostatisticians who will be expected to deal with genomic data in basic biomedical research, genomic clinical trials and personalized medicine.
About the editors: Somnath Datta is Professor and Vice Chair of Bioinformatics and Biostatistics at the University of Louisville. He is a Fellow of the American Statistical Association, a Fellow of the Institute of Mathematical Statistics, and an Elected Member of the International Statistical Institute. He has contributed to numerous research areas in Statistics, Biostatistics and Bioinformatics. Dan Nettleton is Professor and Laurence H. Baker Endowed Chair of Biological Statistics in the Department of Statistics at Iowa State University. He is a Fellow of the American Statistical Association and has published research on a variety of topics in statistics, biology and bioinformatics.

Genomic Signal Processing and Statistics

Author: Edward R. Dougherty
Publisher: Hindawi Publishing Corporation
ISBN: 9775945070
Category : DNA microarrays
Languages : en
Pages : 456

Book Description
Recent advances in genomic studies have stimulated synergetic research and development in many cross-disciplinary areas. Processing the vast genomic data, especially the recent large-scale microarray gene expression data, to reveal complex biological functionality poses enormous challenges to signal processing and statistics. This perspective naturally leads to a new field, genomic signal processing (GSP), which studies the processing of genomic signals by integrating the theory of signal processing and statistics. Written by an international, interdisciplinary team of authors, this invaluable edited volume is accessible to students just entering this emergent field, and to researchers, both in academia and in industry, in the fields of molecular biology, engineering, statistics, and signal processing. The book provides tutorial-level overviews and addresses the specific needs of genomic signal processing students and researchers as a reference book. The book aims to address current genomic challenges by exploiting potential synergies between genomics, signal processing, and statistics, with special emphasis on signal processing and statistical tools for structural and functional understanding of genomic data. The first part of this book provides a brief history of genomic research and a background introduction from both biological and signal-processing/statistical perspectives, so that readers can easily follow the material presented in the rest of the book. In what follows, overviews of state-of-the-art techniques are provided. We start with a chapter on sequence analysis, and follow with chapters on feature selection, classification, and clustering of microarray data. We then discuss the modeling, analysis, and simulation of biological regulatory networks, especially gene regulatory networks based on Boolean and Bayesian approaches. Visualization and compression of gene data, and supercomputer implementation of genomic signal processing systems, are also treated. Finally, we discuss systems biology and medical applications of genomic research as well as the future trends in genomic signal processing and statistics research.

Genomic Signal Processing

Author: Ilya Shmulevich
Publisher: Princeton University Press
ISBN: 1400865263
Category : Science
Languages : en
Pages : 314

Book Description
Genomic signal processing (GSP) can be defined as the analysis, processing, and use of genomic signals to gain biological knowledge, and the translation of that knowledge into systems-based applications that can be used to diagnose and treat genetic diseases. Situated at the crossroads of engineering, biology, mathematics, statistics, and computer science, GSP requires the development of both nonlinear dynamical models that adequately represent genomic regulation, and diagnostic and therapeutic tools based on these models. This book facilitates these developments by providing rigorous mathematical definitions and propositions for the main elements of GSP and by paying attention to the validity of models relative to the data. Ilya Shmulevich and Edward Dougherty cover real-world situations and explain their mathematical modeling in relation to systems biology and systems medicine. Genomic Signal Processing makes a major contribution to computational biology, systems biology, and translational genomics by providing a self-contained explanation of the fundamental mathematical issues facing researchers in four areas: classification, clustering, network modeling, and network intervention.

Modern Molecular Biology:

Author: Srinivasan Yegnasubramanian
Publisher: Springer Science & Business Media
ISBN: 0387697454
Category : Medical
Languages : en
Pages : 196

Book Description
Molecular biology has rapidly advanced since the discovery of the basic flow of information in life, from DNA to RNA to proteins. While there are several important and interesting exceptions to this general flow of information, the importance of these biological macromolecules in dictating the phenotypic nature of living creatures in health and disease is paramount. In the last one and a half decades, and particularly after the completion of the Human Genome Project, there has been an explosion of technologies that allow the broad characterization of these macromolecules in physiology, and the perturbations to these macromolecules that occur in diseases such as cancer. In this volume, we will explore the modern approaches used to characterize these macromolecules in an unbiased, systematic way. Such technologies are rapidly advancing our knowledge of the coordinated and complicated changes that occur during carcinogenesis, and are providing vital information that, when correctly interpreted by biostatistical/bioinformatics analyses, can be exploited for the prevention, diagnosis, and treatment of human cancers. The purpose of this volume is to provide an overview of modern molecular biological approaches to unbiased discovery in cancer research. Advances in molecular biology allowing unbiased analysis of changes in cancer initiation and progression will be overviewed. These include the strategies employed in modern genomics, gene expression analysis, and proteomics.

Design of Efficient and Accurate Statistical Approaches to Correct for Confounding Effects and Identify True Signals in Genetic Association Studies

Author: Jong Wha Joanne Joo
Publisher:
ISBN:
Category :
Languages : en
Pages : 144

Book Description
Over the past decades, genome-wide association studies have dramatically improved, especially with the advent of high-throughput technologies such as microarrays and next-generation sequencing. Although genome-wide association studies have been extremely successful in identifying tens of thousands of variants associated with various diseases or traits, many studies have reported that some of the associations are spurious, induced by confounding factors such as population structure or technical artifacts. In this dissertation, I focus on effectively and accurately identifying true signals in genome-wide association studies in the presence of confounding effects. First, I introduce a method that effectively identifies regulatory hotspots while correcting for false signals induced by technical confounding effects in expression quantitative trait loci (eQTL) studies. Technical confounding factors such as batch effects complicate eQTL analysis by inducing heterogeneity in gene expression. This creates correlations between the samples and may cause spurious associations, leading to spurious regulatory hotspots. By formulating the problem of identifying genetic signals in a linear mixed model framework, I show how we can identify regulatory hotspots while capturing heterogeneity in eQTL studies. Second, I introduce an efficient and accurate multiple-phenotype analysis method for high-dimensional data in the presence of population structure. Recently, large amounts of genomic data such as expression data have been collected from genome-wide association study cohorts, and in many cases it is preferable to analyze thousands of phenotypes simultaneously rather than one phenotype at a time. However, when confounding factors such as population structure exist in the data, even a small bias induced by the confounding effects accumulates across phenotypes and may cause serious problems in multiple-phenotype analysis. By incorporating a linear mixed model into the statistics of multivariate regression, I show that we can dramatically increase the accuracy of multiple-phenotype analysis in high-dimensional data. Lastly, I introduce an efficient multiple testing correction method for linear mixed models. The significance threshold differs as a function of species, marker density, genetic relatedness, and trait heritability. However, none of the previous multiple testing correction methods can comprehensively account for these factors. I show that the significance threshold changes with the dosage of genetic relatedness and introduce a novel multiple testing correction approach that uses a linear mixed model to account for the confounding effects in the data.

Multivariate Methods for Heterogeneous High-Dimensional Data in Genome Biology

Author: Britta Velten
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description


Statistical Inference from High Dimensional Data

Author: Carlos Fernandez-Lozano
Publisher: MDPI
ISBN: 3036509445
Category : Science
Languages : en
Pages : 314

Book Description
• Real-world problems can be high-dimensional, complex, and noisy
• More data does not imply more information
• Different approaches deal with the so-called curse of dimensionality to reduce irrelevant information
• A process with multidimensional information is not necessarily easy to interpret or process
• In some real-world applications, the number of elements in one class is clearly lower than in the other. Models tend to assume that the importance of the analysis lies with the majority class, and this is not usually the case
• The analysis of complex diseases such as cancer is focused on omic data with more than one dimension
• The increasing amount of data, thanks to the falling cost of high-throughput experiments, opens up a new era for integrative data-driven approaches
• Entropy-based approaches are of interest to reduce the dimensionality of high-dimensional data