Statistical Inference from High Dimensional Data

Author: Carlos Fernandez-Lozano
Publisher: MDPI
ISBN: 3036509445
Category : Science
Languages : en
Pages : 314

Book Description
• Real-world problems can be high-dimensional, complex, and noisy
• More data does not imply more information
• Different approaches deal with the so-called curse of dimensionality to reduce irrelevant information
• A process with multidimensional information is not necessarily easy to interpret or process
• In some real-world applications, one class has clearly fewer elements than the other. Models tend to assume that the majority class is the important one for the analysis, which is usually not true
• The analysis of complex diseases such as cancer is focused on omic data with more than one dimension
• The increasing amount of data, thanks to the falling cost of high-throughput experiments, opens up a new era for integrative data-driven approaches
• Entropy-based approaches are of interest for reducing the dimensionality of high-dimensional data

Statistical Inference for High-Dimensional Genetic Data

Author: Xuan Li
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
This dissertation focuses on three types of high-dimensional genetic data: protein sequences, DNA methylation data, and microRNA expression data. The four major parts are presented in Chapters 2-5, respectively. In Chapter 2, we develop a new clustering method for protein sequences. First, we reduce the dimensionality based on entropy. Second, the sequences are clustered using the Hamming distance vectors of chosen sites. We apply this new method to an influenza A H3N2 HA data set, which consists of 1960 viral sequences. Our method aggregates these sequences into 23 clusters. Based on the temporal evolution pattern of these clusters, we find that the dominant clusters change over time and are often different from the clusters housing vaccine strains. In Chapter 3, we conduct systematic simulation studies and real data analysis to compare the performance of seven statistical tests for the equal-variance hypothesis. Our results show that the Brown-Forsythe test and the trimmed-mean-based Levene's test perform better on DNA methylation data than the other tests. Detection of differential DNA methylation and differential variability has received a lot of attention in the literature. In Chapter 4, we derive the asymptotic distribution of a joint score test (AW), proposed by Ahn and Wang (2013). Furthermore, we propose three improved joint score tests, namely iAW.Lev, iAW.BF, and iAW.TM. Systematic simulation studies show that at least one of the proposed tests performs better than the existing tests for data with outliers or from non-normal distributions. The real data analyses demonstrate that the three proposed tests have higher true validation rates than the existing tests. Besides DNA methylation, microRNA regulation is another important epigenetic mechanism. In Chapter 5, we propose a novel model-based clustering method to detect differentially variable (DV) miRNAs. We impose biologically meaningful structures on the covariance matrices for each cluster of miRNAs. Simulation studies show that the proposed method performs better than other model-based methods when miRNA expression levels are from a multivariate normal distribution. In real data analysis, the proposed method has a higher validation rate than other methods.
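
The two steps described for Chapter 2 (entropy-based site selection, then Hamming distances on the chosen sites) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the "keep the k highest-entropy sites" rule, and the toy sequences are all assumptions made for illustration, and the final clustering step is omitted.

```python
from collections import Counter
from math import log2


def site_entropy(sequences, site):
    """Shannon entropy (in bits) of the residues observed at one alignment site."""
    counts = Counter(seq[site] for seq in sequences)
    total = len(sequences)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def select_sites(sequences, k):
    """Return the indices of the k highest-entropy sites, in ascending order.

    Assumption: a simple top-k rule; the source does not state its selection rule.
    """
    length = len(sequences[0])
    ranked = sorted(range(length), key=lambda s: site_entropy(sequences, s), reverse=True)
    return sorted(ranked[:k])


def hamming(a, b, sites):
    """Number of chosen sites at which sequences a and b differ."""
    return sum(a[s] != b[s] for s in sites)


# Toy aligned sequences (hypothetical, not the H3N2 HA data).
seqs = ["ACDEF", "ACDEG", "AKDEF", "AKDQG"]
sites = select_sites(seqs, 2)
distances = [[hamming(a, b, sites) for b in seqs] for a in seqs]
```

Conserved sites have zero entropy and carry no information for separating sequences, so dropping them shrinks the problem before any distance is computed. The resulting distance matrix could then be passed to a clustering routine of the reader's choice.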

Two Sample Inference for High Dimensional Mean with Application to Gene Expression Data

Author: Dongjun You
Publisher:
ISBN:
Category :
Languages : en
Pages : 24

Book Description
Motivated by gene expression analysis in the biological sciences, we consider the two-sample problem where the number of variables is much larger than the sample size. Due to advances in technology, high-dimensional data have been increasingly encountered in many applications of statistics over the past few decades. Classical inference procedures from multivariate statistical analysis, such as Hotelling's T-squared test, cannot be directly applied to such high-dimensional data sets. To tackle the challenges arising in high dimensions, several testing procedures have recently been proposed in the literature. We briefly review the sum-of-squares type tests and maximum type tests as powerful alternatives to Hotelling's T-squared test in the high-dimensional setting. We then provide an extension which aims to combine the strength of the maximum type test with that of the sum-of-squares test. Furthermore, we propose a bootstrap calibration for the maximum type test which allows the data vectors to be temporally dependent. Simulation studies are conducted to compare and contrast the finite-sample performance of these tests. We apply these methods to test the significance of sets of genes in a real data example.
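
To make the contrast between the two families of tests concrete, the sketch below computes a simple version of each core statistic: a sum of squared mean differences across all coordinates, and the largest squared per-coordinate t-type statistic. It is an illustration under stated assumptions, not the procedure from the thesis: the pooled-variance standardisation, the function names, and the toy data are hypothetical, and the centring terms and null calibration (asymptotic or bootstrap) that real tests require are left out.

```python
from statistics import mean, variance


def column(data, j):
    """Values of coordinate j across all observations."""
    return [row[j] for row in data]


def two_sample_statistics(x, y):
    """Return (sum-of-squares statistic, maximum-type statistic) for samples x and y.

    x and y are lists of observations; each observation is a list of p values.
    """
    n, m, p = len(x), len(y), len(x[0])
    sum_sq = 0.0
    max_stat = 0.0
    for j in range(p):
        xj, yj = column(x, j), column(y, j)
        diff = mean(xj) - mean(yj)
        # Pooled variance of coordinate j (assumes equal variances in both groups).
        pooled = ((n - 1) * variance(xj) + (m - 1) * variance(yj)) / (n + m - 2)
        sum_sq += diff ** 2  # aggregates signal over all coordinates
        max_stat = max(max_stat, diff ** 2 / (pooled * (1 / n + 1 / m)))
    return sum_sq, max_stat


# Toy data: the two groups differ only in the second coordinate.
x = [[1.0, 0.0], [3.0, 2.0]]
y = [[1.0, 4.0], [3.0, 6.0]]
sum_sq, max_stat = two_sample_statistics(x, y)
```

Neither statistic needs the inverse of a p-by-p sample covariance matrix, which is singular when the number of variables exceeds the sample size. Sum-type statistics tend to favour many small differences, while maximum-type statistics favour a few large ones, which is the motivation for combining them.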

Statistical Inference from Genetic Data on Pedigrees

Author: Elizabeth Alison Thompson
Publisher: IMS
ISBN: 9780940600492
Category : Reference
Languages : en
Pages : 194

Book Description
Annotation While this monograph is not about show dogs or cats, its statistical methods could be applied to tracing the pedigree of these species as well as humans. Thompson (U. of Washington) covers such topics as genetic models, population allele frequencies, kinship/inbreeding coefficients, and Monte Carlo estimation. Includes supporting tables and figures. Suitable as a supplementary text or primary text for advanced students. Lacks an index. c. Book News Inc.

Advanced Statistical Methods in Data Science

Author: Ding-Geng Chen
Publisher: Springer
ISBN: 9811025940
Category : Mathematics
Languages : en
Pages : 229

Book Description
This book gathers invited presentations from the 2nd Symposium of the ICSA-CANADA Chapter held at the University of Calgary from August 4-6, 2015. The aim of this Symposium was to promote advanced statistical methods in big-data sciences, to allow researchers to exchange ideas on statistics and data science, and to embrace the challenges and opportunities of statistics and data science in the modern world. It addresses diverse themes in advanced statistical analysis in big-data sciences, including methods for administrative data analysis, survival data analysis, missing data analysis, high-dimensional and genetic data analysis, longitudinal and functional data analysis, the design and analysis of studies with response-dependent and multi-phase designs, time series and robust statistics, and statistical inference based on likelihood, empirical likelihood and estimating functions. The editorial group selected 14 high-quality presentations from this successful symposium and invited the presenters to prepare a full chapter for this book in order to disseminate the findings and promote further research collaborations in this area. This timely book offers new methods that impact advanced statistical model development in big-data sciences.

Statistical Inference from Large-scale Genomic Data

Author:
Publisher:
ISBN:
Category :
Languages : en
Pages : 372

Book Description


High Dimensional Data Analysis and Biomedical Genomics

Author:
Publisher:
ISBN:
Category :
Languages : en
Pages : 72

Book Description
This thesis considers statistical inference for experiments where the number of predictors $d$ that are potentially associated with a response is in the hundreds of thousands, while the sample size $n$ is substantially smaller. One approach to this large $d$ small $n$ problem is based on Principal Component Analysis where, when investigating the association between a predictor $X$ and the response $Y$, the other predictors are replaced by the first $q$ eigenvectors of the $n\times n$ dual covariance matrix $d^{-1}(\boldsymbol{X}-\boldsymbol{\bar X})(\boldsymbol{X}-\boldsymbol{\bar X})^T$, where typically $q\le 10$. This approach has been used to deal with unobserved population stratification that may generate spurious association, as well as with the $d\gg n$ issue. In this approach, a statistical association test of the hypothesis that $X$ and $Y$ are independent is applied to the new low-dimensional vector of predictors, with the hope that it will produce $p$-values that are approximately uniformly distributed on $(0,1)$ under the hypothesis of no association. This thesis develops this approach in general, and for the special case of genome-wide association studies (GWAS) that examine potential relationships between genetic markers and disease for case-control data. It examines, conducts and compares methods for testing correlation between predictors and a response using linear and logistic models from population genetics. The methods are illustrated using data from the Wellcome Trust Case Control Consortium.
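
The dual covariance matrix in the formula above is only $n\times n$, so it stays small even when $d$ is huge. The sketch below forms that matrix and approximates its leading eigenvector by power iteration. It is a rough illustration, not the thesis code: it assumes $\boldsymbol{\bar X}$ holds the per-predictor (column) means, the function names are hypothetical, only one eigenvector is computed rather than the first $q$, and in practice a linear algebra library would be used instead.

```python
def dual_covariance(X):
    """n x n matrix d^{-1} (X - Xbar)(X - Xbar)^T for an n x d data matrix X.

    Assumption: Xbar holds the per-predictor (column) means.
    """
    n, d = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(d)]
    centered = [[row[j] - col_means[j] for j in range(d)] for row in X]
    # Entry (i, k) is the inner product of centered samples i and k, divided by d.
    return [[sum(a * b for a, b in zip(ri, rk)) / d for rk in centered] for ri in centered]


def leading_eigenvector(S, iterations=200):
    """Approximate the top eigenvector of a symmetric PSD matrix by power iteration."""
    n = len(S)
    v = [float(i + 1) for i in range(n)]  # deterministic, non-constant starting vector
    for _ in range(iterations):
        w = [sum(S[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v


# Toy data: n = 3 samples, d = 4 predictors; the third sample stands apart.
X = [[1, 2, 3, 4], [1, 2, 3, 4], [4, 5, 6, 7]]
S = dual_covariance(X)
v = leading_eigenvector(S)
```

In the toy example the leading eigenvector separates the third sample from the first two, which is the kind of sample-level structure (for example population stratification) that the first few eigenvectors are meant to capture.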

Pre-processing and Statistical Inference Methods for High-throughput Genomic Data with Application to Biomarker Detection and Regenerative Medicine

Author: Jeea Choi
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Genome research advances of the last two decades allow us to obtain various forms of data, such as next-generation sequencing, genotyping, phenotyping, as well as clinical information. However, our ability to derive useful information from these data remains to be improved. This motivated me to develop a pipeline with new computational methods. In this dissertation, I develop, implement, evaluate, and apply statistical and computational methods for high-dimensional data analysis to facilitate efforts in regenerative medicine and to uncover novel insights in cancer genomics. The first method is an integrative pathway-index (IPI) model to identify a clinically actionable biomarker of high-risk advanced ovarian cancer patients. Despite improvements in operative management and therapies, overall survival rates in advanced ovarian cancer have remained largely unchanged over the past three decades. The IPI model is applied to messenger RNA expression and survival data collected on ovarian cancer patients as part of the Cancer Genome Atlas project. The approach identifies signatures that are strongly associated with overall and progression-free survival, and also identifies a group of patients who may benefit from enhanced adjuvant therapy. The second method, called SCDC, removes the increased variability due to oscillating genes in a snapshot scRNA-seq experiment. Single-cell RNA sequencing provides a new avenue for studying oscillatory gene expression. However, in many studies, oscillations (e.g., cell cycle) are not of interest, and the increased variability imposed by them masks the effects of interest. In bulk RNA-seq, the increase in variability caused by oscillatory genes is mitigated by averaging over thousands of cells. However, in typical unsynchronized scRNA-seq, this variability remains. Simulation and case studies demonstrate that by removing increased variability due to oscillations, both the power and accuracy of downstream analysis are increased. Finally, in this thesis, we have extended a data analysis pipeline for both single-cell and bulk RNA-seq data. In this pipeline, we review current standards and resources for (sc)RNA-seq data analysis and provide an extended pipeline that incorporates a quality control scheme and user-friendly advanced statistical analysis software for visualization and projected principal component analysis (PCA).

Simultaneous Statistical Inference

Author: Thorsten Dickhaus
Publisher: Springer Science & Business Media
ISBN: 3642451829
Category : Science
Languages : en
Pages : 182

Book Description
This monograph will provide an in-depth mathematical treatment of modern multiple test procedures controlling the false discovery rate (FDR) and related error measures, particularly addressing applications to fields such as genetics, proteomics, neuroscience and general biology. The book will also include a detailed description of how to implement these methods in practice. Moreover, new developments focusing on non-standard assumptions are also included, especially multiple tests for discrete data. The book primarily addresses researchers and practitioners but will also be beneficial for graduate students.
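
As a concrete example of an FDR-controlling multiple test procedure, the sketch below implements the classical Benjamini-Hochberg step-up rule. It is offered as a standard textbook illustration and is not taken from the monograph; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank
    # Reject every hypothesis whose p-value is among the cutoff smallest.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from five tests.
p_values = [0.039, 0.001, 0.035, 0.5, 0.012]
decisions = benjamini_hochberg(p_values, alpha=0.05)
```

In this example the third-smallest p-value (0.035) exceeds its own threshold of 0.03, yet it is still rejected because the fourth-smallest (0.039) falls below its threshold of 0.04. That is the defining feature of a step-up procedure.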

Statistical Inference for Large and Complex Data

Author: Xinkai Zhou
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Statistical inference aims to quantify the amount of uncertainty in parameters or functions estimated from a statistical procedure and lies at the heart of modern decision-making. However, when data sets become large and high-dimensional, which is the case for many modern health-related applications (electronic health records, multiomics, imaging data, etc.), classical statistical inference tools fail due to computational and methodological issues. The problem is further exacerbated when data sets also exhibit dependency structures or nonignorable missingness due to censoring. This dissertation summarizes our effort in addressing some of these challenges. Specifically, chapter 1 provides a bag of little bootstraps (BLB) based method for conducting statistical inference of linear mixed models on massive and distributed longitudinal data sets such as electronic health records. For the statistical inference of variance component parameters, our software package MixedModelsBLB.jl achieves a 200-fold speedup on the scale of 1 million subjects (20 million total observations), and is the only currently available tool that can handle more than 10 million subjects (200 million total observations) using a desktop computer. Chapter 2 provides an extremely flexible and general framework called proximal Markov Chain Monte Carlo (ProxMCMC) for conducting statistical inference on constrained or regularized estimation procedures, which are indispensable for analyzing high-dimensional data and the inference of which has been considered difficult. Many frequently encountered statistical learning tasks such as constrained lasso, graphical lasso, matrix completion, and sparse low-rank matrix regression fall into this category. Chapter 3 provides tools for the estimation and inference of heteroscedastic linear models for analyzing censored data using synthetic variables. Our motivating applications are adjusting for treatment effects in studies of quantitative traits and variance quantitative trait loci (vQTL) analysis, which arise frequently in genetic and epidemiological studies, but our method is general and computationally scalable, so it can be applied to other fields where censored data arise, for example from measurements that fall outside the limits of detection.
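
The bag of little bootstraps idea mentioned for chapter 1 can be shown on a much simpler problem than linear mixed models. The sketch below estimates the standard error of a sample mean: it draws a few small subsets, resamples each one up to the full sample size using counts, and averages the per-subset standard errors. It is not related to the MixedModelsBLB.jl code; the function name, the default tuning values, and the simulated data are assumptions for illustration.

```python
import random
from statistics import mean, stdev


def blb_standard_error(data, n_subsets=5, subset_size=None, n_resamples=50, seed=0):
    """Bag-of-little-bootstraps estimate of the standard error of the sample mean."""
    rng = random.Random(seed)
    n = len(data)
    b = subset_size or int(n ** 0.6)  # assumed subset size b = n^0.6
    subset_estimates = []
    for _ in range(n_subsets):
        subset = rng.sample(data, b)  # small subset drawn without replacement
        resample_means = []
        for _ in range(n_resamples):
            # Draw n points from the b-point subset, stored as one count per
            # distinct point, so a resample never holds more than b values.
            counts = [0] * b
            for _ in range(n):
                counts[rng.randrange(b)] += 1
            resample_means.append(sum(c * x for c, x in zip(counts, subset)) / n)
        # Spread of the resampled means is this subset's standard-error estimate.
        subset_estimates.append(stdev(resample_means))
    return mean(subset_estimates)


# Simulated standard-normal data (hypothetical); the true standard error is 1/sqrt(1000).
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
se = blb_standard_error(sample, n_resamples=30)
```

Because each resample touches only the b distinct points of its subset, the subsets can be processed independently, which is what makes the approach attractive for massive or distributed data.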