Statistical Methods to Enhance Clinical Prediction with High-dimensional Data and Ordinal Response

Statistical Methods to Enhance Clinical Prediction with High-dimensional Data and Ordinal Response PDF Author:
Publisher:
ISBN:
Category :
Languages : en
Pages : 118

Book Description
Advancing technology has enabled us to study the molecular configuration of single cells or whole tissue samples. Molecular biology produces vast amounts of high-dimensional omics data at continually decreasing costs, so that molecular screens are increasingly often used in clinical applications. Personalized diagnosis or prediction of clinical treatment outcome based on high-throughput omics data are modern applications of machine learning techniques to clinical problems. In practice, clinical parameters, such as patient health status or toxic reaction to therapy, are often measured on an ...

Regularization Methods for Predicting an Ordinal Response Using Longitudinal High-dimensional Genomic Data

Regularization Methods for Predicting an Ordinal Response Using Longitudinal High-dimensional Genomic Data PDF Author: Jiayi Hou
Publisher:
ISBN:
Category :
Languages : en
Pages : 642

Book Description
Ordinal scales are commonly used to measure health status and disease-related outcomes in hospital settings as well as in translational medical research. Notable examples include cancer staging, which is a five-category ordinal scale indicating tumor size, node involvement, and likelihood of metastasizing. The Glasgow Coma Scale (GCS), which gives a reliable and objective assessment of a patient's conscious state, is also an ordinal measure. In addition, repeated measurements are common in clinical practice for tracking and monitoring the progression of complex diseases. Classical ordinal modeling methods based on the likelihood approach have contributed to the analysis of data in which the response categories are ordered and the number of covariates (p) is smaller than the sample size (n). As genomic technologies are increasingly applied to obtain more accurate diagnoses and prognoses, a new type of data is being generated: high-dimensional data, in which the number of covariates (p) is much larger than the number of samples (n). However, corresponding statistical methodologies and computational software are lacking for analyzing high-dimensional data with an ordinal or a longitudinal ordinal response. In this thesis, we develop a regularization algorithm to build a parsimonious model for predicting an ordinal response. In addition, we utilize the classical ordinal model with longitudinal measurements to incorporate the cutting-edge data mining tool for a comprehensive understanding of the causes of complex disease on both the molecular level and environmental level. Moreover, we develop the corresponding R package for general use. The algorithm was applied to several real datasets as well as to simulated data to demonstrate its efficiency in variable selection and its precision in prediction and classification. 
The four real datasets are from: 1) the National Institute of Mental Health Schizophrenia Collaborative Study; 2) the San Diego Health Services Research Example; 3) a gene expression experiment to understand "Decreased Expression of Intelectin 1 in The Human Airway Epithelium of Smokers Compared to Nonsmokers" by Weill Cornell Medical College; and 4) the National Institute of General Medical Sciences Inflammation and the Host Response to Burn Injury Collaborative Study.
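
As a rough illustration of what a regularized ordinal model involves, the sketch below computes category probabilities under a generic cumulative logit (proportional odds) model and an L1 penalty of the kind used to encourage sparse coefficients. This is a textbook form, not the thesis's algorithm or its R package; the function names, thresholds, and coefficient values are hypothetical.

```python
import math


def cumulative_logit_probs(x, beta, thresholds):
    """Return P(Y = k | x) for k = 0..K-1 under a cumulative logit model.

    Assumed model: P(Y <= k | x) = 1 / (1 + exp(-(alpha_k - x . beta))),
    with increasing thresholds alpha_0 < ... < alpha_{K-2}.
    """
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    # Cumulative probabilities for the first K-1 categories, then 1 for the last.
    cumulative = [1.0 / (1.0 + math.exp(-(a - eta))) for a in thresholds]
    cumulative.append(1.0)
    # Category probabilities are successive differences of the cumulative ones.
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs


def l1_penalty(beta, lam):
    """Lasso-style penalty lam * sum(|beta_j|); it would be added to the
    negative log-likelihood when fitting."""
    return lam * sum(abs(b) for b in beta)


# Hypothetical inputs: three covariates, four ordered categories.
probs = cumulative_logit_probs(
    x=[1.0, 0.0, 2.0], beta=[0.5, 0.0, -0.25], thresholds=[-1.0, 0.0, 1.0]
)
print(probs)
print(l1_penalty([0.5, 0.0, -0.25], lam=2.0))
```

When p is much larger than n, a penalty of this kind shrinks most coefficients to exactly zero, which is one common route to the parsimonious model the description refers to.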

Advanced Medical Statistics

Advanced Medical Statistics PDF Author: Ying Lu
Publisher: World Scientific
ISBN: 9789810248000
Category : Medical
Languages : en
Pages : 1118

Book Description
This book presents new and powerful advanced statistical methods that have been used in modern medicine, drug development, and epidemiology. Some of these methods were initially developed for tackling medical problems. All 29 chapters are self-contained. Each chapter represents the new development and future research topics for a medical or statistical branch. For the benefit of readers with different statistical backgrounds, each chapter follows a similar style: the explanation of medical challenges, statistical ideas and strategies, statistical methods and techniques, mathematical remarks and background, and references. All chapters are written by experts on the respective topics.

Modern Statistical Methods for Health Research

Modern Statistical Methods for Health Research PDF Author: Yichuan Zhao
Publisher: Springer Nature
ISBN: 3030724379
Category : Medical
Languages : en
Pages : 506

Book Description
This book brings together the voices of leading experts in the frontiers of biostatistics, biomedicine, and the health sciences to discuss the statistical procedures, useful methods, and novel applications in biostatistics research. It also includes discussions of potential future directions of biomedicine and new statistical developments for health research, with the intent of stimulating research and fostering the interactions of scholars across health research related disciplines. Topics covered include: health data analysis and applications to EHR data; clinical trials, FDR, and applications in health science; big network analytics and its applications in GWAS; survival analysis and functional data analysis; and graphical modelling in genomic studies. The book will be valuable to data scientists and statisticians who are working in biomedicine and health, other practitioners in the health sciences, and graduate students and researchers in biostatistics and health.

Statistical Methods for High-dimensional Data Analysis

Statistical Methods for High-dimensional Data Analysis PDF Author:
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description
Statistical methods for high-dimensional data analysis.

Clinical Prediction Models

Clinical Prediction Models PDF Author: Ewout Steyerberg
Publisher: Springer
ISBN: 9781441926487
Category : Medical
Languages : en
Pages : 500

Book Description
Prediction models are important in various fields, including medicine, physics, meteorology, and finance. Prediction models will become more relevant in the medical field with the increase in knowledge on potential predictors of outcome, e.g. from genetics. Also, the number of applications will increase, e.g. with targeted early detection of disease, and individualized approaches to diagnostic testing and treatment. The current era of evidence-based medicine asks for an individualized approach to medical decision-making. Evidence-based medicine has a central place for meta-analysis to summarize results from randomized controlled trials; similarly prediction models may summarize the effects of predictors to provide individualized predictions of a diagnostic or prognostic outcome. Why Read This Book? My motivation for working on this book stems primarily from the fact that the development and applications of prediction models are often suboptimal in medical publications. With this book I hope to contribute to better understanding of relevant issues and give practical advice on better modelling strategies than are nowadays widely used. Issues include: (a) Better predictive modelling is sometimes easily possible; e.g. a large data set with high quality data is available, but all continuous predictors are dichotomized, which is known to have several disadvantages.

Statistical Methods for Data with Complex Dependence Structure

Statistical Methods for Data with Complex Dependence Structure PDF Author: Ting Fung Ma (Ph.D.)
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
In this dissertation, I develop new statistical methodology for potentially high-dimensional data with complex dependence structures. Spatial ordinal data observed separately for multiple subjects are common in biomedical research, yet statistical methodology for such ordinal data analysis is limited. The existing methodology often assumes a single realization of spatial ordinal data without replication, which is commonplace in disease mapping studies, and thus is not directly applicable. Motivated by a dataset evaluating periodontal disease (PD) status, we propose a multi-subject spatial ordinal model that assumes a geostatistical spatial structure within a regression framework through an elegant latent variable representation. For achieving computational scalability within a classical inferential framework, we develop a maximum composite likelihood method for parameter estimation, and establish the asymptotic properties of the parameter estimates. Another major contribution is the development of model diagnostic measures for our dependent data scenario using generalized surrogate residuals. A simulation study suggests sound finite sample properties of the proposed methods. We also illustrate our proposed methodology via application to the motivating PD dataset. A companion R package clordr is available for easy implementation. Next, we explore a hierarchical generalized latent factor model for discrete and bounded response variables and in particular, binomial responses. Specifically, we develop a novel two-step estimation procedure and corresponding statistical inference that is computationally efficient and scalable for the high dimension in terms of both the number of subjects and the number of features per subject. We also establish the validity of the estimation procedure, particularly the asymptotic properties of the estimated effect size and the latent structure, as well as the estimated number of latent factors. 
The results are corroborated by a simulation study and, for illustration, the proposed methodology is applied to analyze a dataset in a gene-environment association study. With the rapid advances of data acquisition techniques, spatio-temporal data are becoming increasingly abundant in a diverse array of disciplines. Here we develop spatio-temporal regression methodology for analyzing large amounts of spatially referenced data collected over time, motivated by environmental studies utilizing remotely sensed satellite data. In particular, we specify a semiparametric autoregressive model without the usual Gaussian assumption and devise a computationally scalable procedure that enables the regression analysis of large datasets. We estimate the model parameters by quasi maximum likelihood and show that the computational complexity can be reduced from cubic to linear in the sample size. Asymptotic properties under suitable regularity conditions are further established that inform the computational procedure to be efficient and scalable. A simulation study is conducted to evaluate the finite-sample properties of the parameter estimation and statistical inference. We illustrate our methodology with a dataset of over 2.96 million observations of annual land surface temperature, and a comparison with an existing state-of-the-art approach to spatio-temporal regression highlights the advantages of our method.

Statistical Methods for the Analysis of High Dimension Heterogeneous Data

Statistical Methods for the Analysis of High Dimension Heterogeneous Data PDF Author: Linlin Sha
Publisher:
ISBN:
Category : Statistics
Languages : en
Pages : 0

Book Description
In genetic studies, advances in high-throughput technology have allowed scientists to collect data on a larger scale and with more complexity. Thus, it is common for the collected data to be heterogeneous and high-dimensional, that is, the number of features is greater than the number of observations. For example, in the study of liver cancer, the number of genes is more than 25,000 while the sample size is only in the hundreds. However, not all variables are useful in solving a problem; only a small proportion should be used. Hence, selecting a useful subset of variables based on clinical information receives a lot of attention. Supervised learning and unsupervised learning are two major problems of statistical learning. In supervised learning, we are given a labeled training set with variables and responses, and fit a model to predict the response for new test data. When the response is continuous, this is often known as regression. If the response is categorical, it is a classification problem. When responses are not available, the task becomes an unsupervised learning problem. For unsupervised learning, we first need to recover the responses from the input variables. This is often referred to as a clustering problem. When the data are high-dimensional, both supervised and unsupervised problems confront statistical and computational challenges. Therefore, we mainly focus on three problems. Firstly, many gene expression datasets do not come with a response. Thus, we need to cluster patients based on input variables to turn the unsupervised learning problem into a supervised one. Secondly, although many methods for high-dimensional data have been proposed, they are not suitable for heterogeneous gene expression data. Moreover, efficient methods for selecting a subset of variables to discern different groups are attracting more and more attention. 
Thirdly, it is important to find a suitable model to predict new data labels based on the selected variables and recovered labels. In real data analysis, we mainly studied identifying biomarkers and personalized treatments based on patients' gene expression data. For the supervised learning problem, in which the label information is available, we proposed a framework for identifying biomarkers containing three steps: differential gene expression analysis based on the labels, pathway analysis, and random forest with 10-fold cross-validation. This framework provides the subset of useful genes and identifies the biomarkers based on the votes of the random forest. For the unsupervised learning problem, we proposed a framework for clustering cancer patients for treatment, which sequentially biclusters patients and assigns the different treatments. Sequential biclustering is a novel biclustering method that only allows overlapping genes for different groups. This framework returns labels for patients, which leads to the next step of identifying biomarkers and assigning suitable treatments to the different clusters of patients. Moreover, based on the studies of real data, we consider the clustering problem for high-dimensional and heterogeneous data and propose a more efficient procedure based on marginal screening for a mixture regression model. This algorithm takes advantage of the heterogeneity of the data to filter out variables, which leads to lower storage costs and higher computation speed. The performance of our method is more stable and more accurate than that of the existing method. In the end, we discuss some future work, including real data applications and extension to generalized linear models.

Statistical Methods for High Dimensional Data Arising from Large Epidemiological Studies

Statistical Methods for High Dimensional Data Arising from Large Epidemiological Studies PDF Author: Hui Xu
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description
In this thesis, we propose statistical models for addressing commonly encountered data types and study designs in large epidemiologic investigations aimed at understanding the molecular basis of complex disorders. The motivating applications come from diverse disease areas in Women's Health, including the study of type II diabetes in the Women's Health Initiative (WHI), invasive breast cancer in the Nurses' Health Study and the study of the metabolomic underpinnings of cardiovascular disease in the WHI. We have also put significant effort into making the implementation of the proposed methods accessible through freely available, user-friendly software packages in R. The first chapter is motivated by the self-reported outcomes of incident diabetes that were collected periodically for approximately 160,000 women enrolled in the Women's Health Initiative (WHI). While self-reported outcomes are cost efficient, they are also subject to error. With a goal of variable selection in a high dimensional data setting, we adapt the Random Survival Forests algorithm to accommodate the characteristics of error-prone self-reports. We propose a novel likelihood-based splitting rule and associated variable selection algorithm to select the subset of relevant biomarkers that are associated with the time to event of interest. We compare the proposed methods to existing approaches in simulation studies. We apply the proposed algorithm to discover single nucleotide polymorphisms associated with incident type II diabetes risk in a dataset of 909,622 SNPs on 10,832 African American and Hispanic women. We implement the proposed algorithm in an R package icRSF. The second chapter is aimed at estimating and evaluating prediction rules in data generated in matched case-control studies that are nested within large prospective cohorts. 
This work is motivated by a matched case-control study nested within the Nurses' Health Study, where the goal was to determine if the inclusion of a set of seven endogenous hormone measurements will enhance the predictive ability of breast cancer risk when compared to the previously published Gail Score. For this setting, we propose an algorithm for estimating the summary index, area under the curve (AUC) corresponding to the Receiver Operating Characteristic (ROC) curve associated with a set of pre-defined covariates for predicting a binary outcome. By combining data from the parent cohort with that generated in a matched case-control study, we describe methods for estimation of the population parameters of interest and the corresponding AUC. We evaluate the bias associated with the proposed methods in simulations by considering a range of parameter settings. We illustrate the methods in the motivating study of endogenous hormones and breast cancer risk, nested within the Nurses' Health Study. The third chapter is aimed at estimating and evaluating prediction rules in high dimensional datasets generated in matched case control studies nested within large prospective cohorts. In this setting, the goals include simultaneous variable selection, estimation of a prediction rule and its corresponding summary index such as the AUC for quantifying the strength of prediction. This work is motivated by an ongoing study of metabolomics of cardiovascular disease in the WHI. Through extensive simulations, we compare three disparate variable selection procedures in conjunction with the parameter estimation and inverse probability weighted estimation of the AUC proposed in Chapter 2. We also evaluate the extent of overfitting observed when the multi-step procedure is carried out within one, two and three independent datasets. 
The common thread underlying all three chapters of this thesis is the development and application of statistical models useful in the study of complex disorders, with illustrative applications drawn from diverse areas of women's health.
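
To make the AUC summary index concrete, the sketch below computes an empirical AUC as the (optionally weighted) proportion of case-control pairs in which the case receives the higher score, counting ties as one half. The weights stand in for inverse sampling probabilities. This is a generic pairwise estimator under assumed inputs, not the estimator developed in the thesis; the function name and example values are hypothetical.

```python
def weighted_auc(case_scores, control_scores, case_w=None, control_w=None):
    """Empirical AUC: weighted share of (case, control) pairs ranked correctly.

    Weights default to 1 for every subject, which gives the usual
    Mann-Whitney form of the AUC. Ties contribute one half.
    """
    if case_w is None:
        case_w = [1.0] * len(case_scores)
    if control_w is None:
        control_w = [1.0] * len(control_scores)

    numerator = 0.0
    denominator = 0.0
    for s_case, w_case in zip(case_scores, case_w):
        for s_ctrl, w_ctrl in zip(control_scores, control_w):
            pair_w = w_case * w_ctrl
            denominator += pair_w
            if s_case > s_ctrl:
                numerator += pair_w
            elif s_case == s_ctrl:
                numerator += 0.5 * pair_w
    if denominator == 0.0:
        raise ValueError("need at least one case and one control with positive weight")
    return numerator / denominator


# Hypothetical risk scores for two cases and one control.
unweighted = weighted_auc([0.9, 0.3], [0.5])
# Up-weighting the first case, as if it were sampled with probability 1/3.
reweighted = weighted_auc([0.9, 0.3], [0.5], case_w=[3.0, 1.0], control_w=[1.0])
print(unweighted, reweighted)
```

The double loop is quadratic in the number of subjects, which is acceptable for a sketch; a rank-based computation would be the usual choice for large cohorts.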