High-Dimensional Methods to Model Biological Signal in Genome-Wide Studies

Author: Andrew J. Bass
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Recent advancements in sequencing technology have substantially increased the quality and quantity of data in genomics, presenting novel analytical challenges for biological discovery. In particular, foundational ideas developed in statistics over the past century are not easily extended to these high-dimensional datasets. Therefore, creating novel methodologies to analyze this data is a key challenge faced in statistics, and more generally, biology and computational science.

Here I focus on building statistical methods for genome-wide analysis that are statistically rigorous, computationally fast, and easy to implement. In particular, I develop four methods that improve statistical inference of high-dimensional biological data. The first focuses on differential expression analysis where I extend the optimal discovery procedure (ODP) to complex study designs and RNA-seq studies. I find that the extended ODP leverages shared biological signal to substantially improve the statistical power compared to other commonly used testing procedures. The second aims to model the functional relationship between sequencing depth and statistical power in RNA-seq differential expression studies. The resulting model, superSeq, accurately predicts the improvement in statistical power when sequencing additional reads in a completed study. Thus superSeq can guide researchers in choosing a sufficient sequencing depth to maximize statistical power while avoiding unnecessary sequencing costs.

The third method estimates the posterior distribution of false discovery rate (FDR) quantities, such as local FDRs and q-values, using a Bayesian nonparametric approach. Specifically, I implement an approximation to these posterior distributions that is scalable to genome-wide datasets using variational inference. These estimated posterior distributions are informative in a significance analysis as they capture the uncertainty of FDR quantities in reported results.

Finally, I develop a likelihood-based approach to estimating unobserved population structure on the canonical parameter scale. I demonstrate that this framework can flexibly capture arbitrary structure and provide accurate allele frequency estimates while being computationally fast for large population genetic studies. Therefore, this framework is useful for many applications in population genetics, such as accounting for structure in the genome-wide association testing procedure GCATest.

Collectively, these four methods address problems typically encountered in a biological analysis and can thus help improve downstream inferences in high-dimensional settings.

Model Selection Methods for High-dimensional Data and Their Applications to Genome-wide Association Studies

Author: Zheyang Wu
Publisher:
ISBN:
Category :
Languages : en
Pages : 418

Capturing Hidden Signals From High-Dimensional Data and Applications to Genomics

Author: Elior Rahmani
Publisher:
ISBN:
Category :
Languages : en
Pages : 223

Book Description
The analysis of high-dimensional data, albeit challenging owing to various computational and statistical aspects, often provides opportunities to uncover hidden signals by leveraging inherent structure in the data. In the context of genomics, where molecular markers are probed at ever-increasing resolution and throughput, large sets of features that follow specific patterns, in conjunction with large sample sizes, allow us to implement richer and more sophisticated models than before in an attempt to extract signal that is not immediately evident from the data. In particular, genomic markers are often affected by multiple genetic and environmental factors; they may differ in their regulation and presentation in different tissues, cell types, conditions, or over time; and some markers may affect multiple biological processes. Unveiling those signals is likely to be pivotal in advancing our understanding of complex biology and disease. This dissertation introduces novel computational methodologies and theory that address several key challenges faced in the analysis of high-dimensional genomic data coming from heterogeneous sources ("bulk" genomics), with a particular focus on DNA methylation data. Through a range of simulations and the analysis of multiple data sets, we demonstrate that our proposed methods provide opportunities to conduct powerful and statistically sound population-level studies at an unprecedented resolution and scale.

Modern Molecular Biology:

Author: Srinivasan Yegnasubramanian
Publisher: Springer Science & Business Media
ISBN: 0387697454
Category : Medical
Languages : en
Pages : 196

Book Description
Molecular biology has rapidly advanced since the discovery of the basic flow of information in life, from DNA to RNA to proteins. While there are several important and interesting exceptions to this general flow of information, the importance of these biological macromolecules in dictating the phenotypic nature of living creatures in health and disease is paramount. In the last one and a half decades, and particularly after the completion of the Human Genome Project, there has been an explosion of technologies that allow the broad characterization of these macromolecules in physiology, and the perturbations to these macromolecules that occur in diseases such as cancer. In this volume, we will explore the modern approaches used to characterize these macromolecules in an unbiased, systematic way. Such technologies are rapidly advancing our knowledge of the coordinated and complicated changes that occur during carcinogenesis, and are providing vital information that, when correctly interpreted by biostatistical/bioinformatics analyses, can be exploited for the prevention, diagnosis, and treatment of human cancers. The purpose of this volume is to provide an overview of modern molecular biological approaches to unbiased discovery in cancer research. Advances in molecular biology allowing unbiased analysis of changes in cancer initiation and progression will be overviewed. These include the strategies employed in modern genomics, gene expression analysis, and proteomics.

Design of Efficient and Accurate Statistical Approaches to Correct for Confounding Effects and Identify True Signals in Genetic Association Studies

Author: Jong Wha Joanne Joo
Publisher:
ISBN:
Category :
Languages : en
Pages : 144

Book Description
Over the past decades, genome-wide association studies have dramatically improved, especially with the advent of high-throughput technologies such as microarrays and next-generation sequencing. Although genome-wide association studies have been extremely successful in identifying tens of thousands of variants associated with various diseases or traits, many studies have reported that some of the associations are spurious, induced by confounding factors such as population structure or technical artifacts. In this dissertation, I focus on effectively and accurately identifying true signals in genome-wide association studies in the presence of confounding effects. First, I introduce a method that effectively identifies regulatory hotspots while correcting for false signals induced by technical confounding effects in expression quantitative trait loci studies. Technical confounding factors such as batch effects complicate expression quantitative trait loci analysis by inducing heterogeneity in gene expression. This creates correlations between the samples and may cause spurious associations, leading to spurious regulatory hotspots. By formulating the problem of identifying genetic signals in a linear mixed model framework, I show how we can identify regulatory hotspots while capturing heterogeneity in expression quantitative trait loci studies. Second, I introduce an efficient and accurate multiple-phenotype analysis method for high-dimensional data in the presence of population structure. Recently, large amounts of genomic data such as expression data have been collected from genome-wide association study cohorts, and in many cases it is preferable to analyze thousands of phenotypes simultaneously rather than one phenotype at a time. However, when confounding factors such as population structure exist in the data, even a small bias induced by the confounding effects accumulates across phenotypes and may cause serious problems in multiple-phenotype analysis. By incorporating a linear mixed model in the statistics of multivariate regression, I show that we can dramatically increase the accuracy of multiple-phenotype analysis in high-dimensional data. Lastly, I introduce an efficient multiple testing correction method for linear mixed models. The significance threshold differs as a function of species, marker densities, genetic relatedness, and trait heritability. However, none of the previous multiple testing correction methods can comprehensively account for these factors. I show that the significance threshold changes with the degree of genetic relatedness and introduce a novel multiple testing correction approach that utilizes a linear mixed model to account for the confounding effects in the data.

Multivariate Methods for Heterogeneous High-Dimensional Data in Genome Biology

Author: Britta Velten
Publisher:
ISBN:
Category :
Languages : en
Pages :

Statistical Inference from High Dimensional Data

Author: Carlos Fernandez-Lozano
Publisher: MDPI
ISBN: 3036509445
Category : Science
Languages : en
Pages : 314

Book Description
• Real-world problems can be high-dimensional, complex, and noisy
• More data does not imply more information
• Different approaches deal with the so-called curse of dimensionality to reduce irrelevant information
• A process with multidimensional information is not necessarily easy to interpret or process
• In some real-world applications, one class has clearly fewer elements than the other. Models tend to assume that the majority class is the important one, which is usually not true
• The analysis of complex diseases such as cancer is focused on multi-dimensional omic data
• The increasing amount of data, thanks to the falling cost of high-throughput experiments, opens up a new era for integrative data-driven approaches
• Entropy-based approaches are of interest for reducing the dimensionality of high-dimensional data

Machine Learning in Radiation Oncology

Author: Issam El Naqa
Publisher: Springer
ISBN: 3319183052
Category : Medical
Languages : en
Pages : 336

Book Description
This book provides a complete overview of the role of machine learning in radiation oncology and medical physics, covering basic theory, methods, and a variety of applications in medical physics and radiotherapy. An introductory section explains machine learning, reviews supervised and unsupervised learning methods, discusses performance evaluation, and summarizes potential applications in radiation oncology. Detailed individual sections are then devoted to the use of machine learning in quality assurance; computer-aided detection, including treatment planning and contouring; image-guided radiotherapy; respiratory motion management; and treatment response modeling and outcome prediction. The book will be invaluable for students and residents in medical physics and radiation oncology and will also appeal to more experienced practitioners and researchers and members of applied machine learning communities.

Statistical Methods Development and Benchmarking for the Analysis of Three-dimensional Chromatin Organization

Author: Ye Zheng
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Recently developed chromatin conformation capture-based assays have enabled the study of three-dimensional chromosomal architecture in a high-throughput fashion. Hi-C, in particular, has elucidated genome-wide long-range interactions among loci. The ability to simulate realistic high-throughput chromatin conformation (Hi-C) data is foundational for developing and benchmarking statistical and computational methods for Hi-C data analysis. We propose FreeHi-C, a data-driven Hi-C simulator for simulating and augmenting Hi-C datasets. FreeHi-C employs a non-parametric strategy for estimating the interaction distribution of genome fragments from a given sample and simulates Hi-C reads from interacting fragments. Data from FreeHi-C exhibit higher fidelity to the biological Hi-C data compared with other tools in its class. FreeHi-C not only enables benchmarking a wide range of Hi-C analysis methods but also boosts the precision and power of differential chromatin interaction detection methods while preserving false discovery rate control through data augmentation. Although the number of statistical analysis methods for Hi-C data is growing rapidly, a key impediment is their inability to accommodate reads aligning to multiple locations, i.e., multi-mapping reads. Hence, current Hi-C processing approaches underestimate the biological signals from repetitive regions of genomes. We developed and validated mHi-C, a multi-read mapping strategy to probabilistically allocate Hi-C multi-reads. On benchmarks, mHi-C exhibited superior performance over utilizing only uni-reads and over heuristic approaches aimed at rescuing multi-reads. Specifically, mHi-C increased the sequencing depth by an average of 20%, resulting in higher reproducibility of contact matrices and detected interactions across biological replicates. The impact of the multi-reads on the detection of significant interactions is influenced marginally by the relative contribution of multi-reads to the sequencing depth compared to uni-reads, the cis-to-trans ratio of contacts, and the broad data quality as reflected by the proportion of mappable reads of datasets. Computational experiments highlighted that in Hi-C studies with short read lengths, mHi-C-rescued multi-reads can emulate the effect of longer reads. mHi-C also revealed biologically supported bona fide promoter-enhancer interactions and topologically associating domains involving repetitive genomic regions, thereby unlocking a previously masked portion of the genome for conformation capture studies.

Leveraging Structure in the Analysis of High-dimensional Biological Data

Author: Jason Zhu
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description
With the advancement of technologies to collect data automatically, researchers now have access to large datasets that were unimaginable until recently. In biology and medicine, multiple types of data have been generated in massive quantities with the promise to explain complex biological phenomena. While data acquisition was previously the most costly process, the main bottleneck today is data analysis, because many datasets also contain hyper-informative details that result in high dimensionality. Detecting signals in such data is like searching for a needle in a haystack. Fortunately, certain structures underlying the data can provide opportunities to make the analysis statistically powerful and computationally efficient. Borrowing insights from different structures in the data, I have developed efficient, principled, and interpretable methods to analyze high-dimensional data, with a focus on modern biological applications. In this thesis, I first introduce a method to analyze high-throughput single-cell data, combining gene expression and immuno-sequencing data. Next, I propose GLISS, a novel framework that integrates Spatial Gene Expression data with single-cell RNA-sequencing data to simultaneously select spatial gene features and identify hidden spatial cellular structures. GLISS utilizes a graph-based feature selection method that is sensitive to non-monotonic associations to determine spatial genes. Finally, I present AEGIS, an exploratory data analysis method for Gene Ontology applications. AEGIS entails new visualization strategies that leverage the Directed Acyclic Graph structure of the Gene Ontology to facilitate information retrieval and power calculations for research study design.