Valid Causal Inference in High-dimensional and Complex Settings

Author: Niloofar Moosavi
Publisher:
ISBN: 9789178558827
Category :
Languages : en
Pages : 0

Book Description

Elements of Causal Inference

Author: Jonas Peters
Publisher: MIT Press
ISBN: 0262037319
Category : Computers
Languages : en
Pages : 289

Book Description
A concise and self-contained introduction to causal inference, increasingly important in data science and machine learning. The mathematization of causality is a relatively recent development, and has become increasingly important in data science and machine learning. This book offers a self-contained and concise introduction to causal models and how to learn them from data. After explaining the need for causal models and discussing some of the principles underlying causal inference, the book teaches readers how to use causal models: how to compute intervention distributions, how to infer causal models from observational and interventional data, and how causal ideas could be exploited for classical machine learning problems. All of these topics are discussed first in terms of two variables and then in the more general multivariate case. The bivariate case turns out to be a particularly hard problem for causal learning because there are no conditional independences as used by classical methods for solving multivariate cases. The authors consider analyzing statistical asymmetries between cause and effect to be highly instructive, and they report on their decade of intensive research into this problem. The book is accessible to readers with a background in machine learning or statistics, and can be used in graduate courses or as a reference for researchers. The text includes code snippets that can be copied and pasted, exercises, and an appendix with a summary of the most important technical concepts.

Semi-Parametric Estimation in Network Data and Tools for Conducting Complex Simulation Studies in Causal Inference

Author: Oleg Sofrygin
Publisher:
ISBN:
Category :
Languages : en
Pages : 146

Book Description
This dissertation is concerned with application of robust semi-parametric methods to problems of estimation in network-dependent data and the conduct of large-scale simulation studies for causal inference research in epidemiological and medical data. Specifically, Chapter 1 presents a modern semi-parametric approach to estimation of causal effects in a population connected by a single social network. The connectivity of the population units will typically imply that the observed data on these units is no longer independent and identically distributed. Moreover, such social settings typically result in highly dimensional data. This chapter contributes to current statistical methodology by presenting an approach that allows valid estimation and inference and addresses the statistical issues specific to such networked population datasets. The framework of semi-parametric estimation, called the targeted maximum likelihood estimation (TMLE), is presented. This framework improves upon the existing methods by offering robustness, weakened sensitivity to near positivity violations, as well as the ability to deal with high-dimensionality issues of social network data. In particular, this approach relies on the accurate reflection of the background knowledge available for a given scientific problem, allowing estimation and inference without having to make unrealistic assumptions about the structure of the data. In addition, this chapter generalizes previous work describing estimation of complex causal parameters, such as the direct treatment effects under interference and the causal effects of interventions on social network structure. Although the past decade has produced many contributions towards estimation of causal effects in social network settings, there has been considerably less research on the topic of variance estimation for such highly-dependent data. 
This chapter presents an approach to constructing valid inference, providing a variance estimator that is scalable to very large datasets with highly-connected observations. The efficient open-source software implementation of these methods also accompanies this chapter. Chapter 2 presents open-source software tools for conduct of reproducible simulation studies for complex parameters that emerge from application of causal inference methods in epidemiological and medical research. This simulation software is built on the framework of non-parametric structural equation modeling. This chapter also studies simulation-based testing of statistical methods in causal inference for longitudinal data with time-varying exposure and confounding. It contributes to existing literature by presenting a unified syntax for non-parametrically defining complex causal parameters, which can be used as the model-free and agnostic gold standard for comparison of different statistical methods for causal inference. For instance, this chapter provides various examples of specification and evaluation of causal parameters that arise naturally in longitudinal causal effect analyses when using marginal structural models (MSMs). The application of these newly developed software tools to replication of several previously published simulation studies in causal inference is also described. Chapter 3 builds on the work described in Chapter 2 and addresses the issue of dependent data simulation for causal inference research in social network data. In particular, it provides a model-free approach to test the validity of various estimation procedures in simulated network-settings. This chapter first outlines a non-parametric causal model for units connected by a network and provides various applied examples of simulations with social network data.
This chapter also showcases a possible application of the highly scalable open-source software implementation of the semi-parametric estimation methods described in Chapter 1. In particular, a large scale social network simulation study is described, and the performance of three dependent-data estimators from Chapter 1 is examined. This simulation study also examines the problem of inference for network-dependent data, specifically, by comparing the performance of the dependent-data TMLE variance estimator from Chapter 1 to the true TMLE variance derived from simulations. Finally, Chapter 3 concludes with a simulation study of an HIV epidemic described in terms of a longitudinal process which evolves over a static network in discrete time-steps among several highly inter-connected communities. The abstracts of the three works which make up this dissertation are reproduced below. Chapter 1: This chapter describes the robust semi-parametric approach towards estimation and inference for the sample average treatment-specific mean in observational settings where data are collected on a single network of connected units (e.g., in the presence of interference or spillover). Despite recent advances, many of the currently used statistical methods rely on assumption of a specific parametric model for the outcome, even though some of the most important statistical assumptions required by these models are most likely violated in the observational network data settings, resulting in invalid and anti-conservative statistical inference. In this chapter, we rely on the recent methodological advances for the targeted maximum likelihood estimation (TMLE) for data collected on a single population of causally connected units, to describe an estimation approach that permits for more realistic classes of data-generative models and provides valid statistical inference in the context of such network-dependent data. 
The approach is applied to an observational setting with a single time point stochastic intervention. We start by assuming that the true observed data-generating distribution belongs to a large class of semi-parametric statistical models. We then impose some restrictions on the possible set of the data-generative distributions that may belong to our statistical model. For example, we assume that the dependence among units can be fully described by the known network, and that the dependence on other units can be summarized via some known (but otherwise arbitrary) summary measures. We show that under our modeling assumptions, our estimand is equivalent to an estimand in a hypothetical IID data distribution, where the latter distribution is a function of the observed network data-generating distribution. With this key insight in mind, we show that the TMLE for our estimand in dependent network data can be described as a certain IID data TMLE algorithm, also resulting in a new simplified approach to conducting statistical inference. We demonstrate the validity of our approach in a network simulation study. We also extend prior work on dependent-data TMLE towards estimation of novel causal parameters, e.g., the unit-specific direct treatment effects under interference and the effects of interventions that modify the initial network structure. Chapter 2: This chapter introduces the simcausal R package, an open-source software tool for specification and simulation of complex longitudinal data structures that are based on non-parametric structural equation models. The package aims to provide a flexible tool for simplifying the conduct of transparent and reproducible simulation studies, with a particular emphasis on the types of data and interventions frequently encountered in real-world causal inference problems, such as observational data with time-dependent confounding, selection bias, and random monitoring processes.
The package interface allows for concise expression of complex functional dependencies between a large number of nodes, where each node may represent a measurement at a specific time point. The package allows for specification and simulation of counterfactual data under various user-specified interventions (e.g., static, dynamic, deterministic, or stochastic). In particular, the interventions may represent exposures to treatment regimens, the occurrence or non-occurrence of right-censoring events, or of clinical monitoring events. Finally, the package enables the computation of a selected set of user-specified features of the distribution of the counterfactual data that represent common causal quantities of interest, such as treatment-specific means, the average treatment effects and coefficients from working marginal structural models. The applicability of simcausal is demonstrated by replicating the results of two published simulation studies. Chapter 3: The past decade has seen an increasing body of literature devoted to the estimation of causal effects in network-dependent data. However, the validity of many classical statistical methods in such data is often questioned. There is an emerging need for objective and practical ways to assess which causal methodologies might be applicable and valid in such novel network-based datasets. In this chapter we describe a set of tools implemented as part of the simcausal R package that allow simulating data based on the non-parametric structural equation model for connected units. We also provide examples of how these simulations may be applied to evaluation of different statistical methods for estimation of causal effects in such data. In particular, these simulation tools are targeted to the types of data and interventions frequently encountered in real-world causal inference research in social networks, such as observational studies with spill-over or interference.
We developed a novel R language interface which simplifies the specification of network-based functional relationships between connected units. Moreover, this network-based syntax can be combined with.
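
To make the idea of simulating counterfactual data from a structural equation model concrete, here is a minimal plain-NumPy sketch. It is not the simcausal interface (that package is written in R), and the three-variable model, its coefficients, and the function names `simulate` and `treatment_specific_mean` are all assumptions chosen for illustration. The sketch draws data once under the natural treatment mechanism and once under a static intervention, then compares the treatment-specific means with a naive observed contrast.

```python
import numpy as np


def simulate(n, seed, set_a=None):
    """Draw n units from a toy structural equation model (W -> A, W -> Y, A -> Y).

    If set_a is given, the treatment equation is replaced by the constant
    set_a (a static intervention); otherwise A follows its natural mechanism.
    """
    rng = np.random.default_rng(seed)
    w = rng.binomial(1, 0.5, size=n)              # baseline covariate
    if set_a is None:
        a = rng.binomial(1, 0.3 + 0.4 * w)        # observed treatment depends on W
    else:
        a = np.full(n, set_a)                     # intervened treatment
    y = 1.0 + 2.0 * a + w + rng.normal(size=n)    # outcome equation (assumed)
    return w, a, y


def treatment_specific_mean(n, seed, a_value):
    """Monte Carlo estimate of E[Y_a] under the static intervention A = a_value."""
    _, _, y = simulate(n, seed, set_a=a_value)
    return float(y.mean())


n = 100_000
ate_truth = treatment_specific_mean(n, 1, 1) - treatment_specific_mean(n, 2, 0)

# Naive contrast in observational data, confounded by W
_, a_obs, y_obs = simulate(n, 3)
naive = float(y_obs[a_obs == 1].mean() - y_obs[a_obs == 0].mean())

print(f"simulated counterfactual contrast: {ate_truth:.2f}")
print(f"naive observed contrast:           {naive:.2f}")
```

In this toy model the counterfactual contrast is 2 by construction, while the naive contrast is inflated because W raises both the chance of treatment and the outcome. Simulated counterfactuals of this kind serve as the model-free benchmark against which estimators can be compared.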

Semiparametric and Robust Methods for Complex Parameters in Causal Inference

Author: Wenjing Zheng
Publisher:
ISBN:
Category :
Languages : en
Pages : 169

Book Description
This dissertation focuses on developing robust semiparametric methods for complex parameters that emerge at the interface of causal inference and biostatistics, with applications to epidemiological and medical research. Specifically, it addresses three important topics: Part I (chapter 1) presents a framework to construct and analyze group sequential covariate-adjusted response-adaptive (CARA) randomized controlled trials (RCTs) that admits the use of data-adaptive approaches in constructing the randomization schemes and in estimating the conditional response model. This framework adds to the existing literature on CARA RCTs by allowing flexible options in both their design and analysis. Part II (chapters 2 and 3) concerns two parameters that arise in longitudinal causal effect analysis using marginal structural models (MSMs). Chapter 2 presents a targeted maximum likelihood estimator (TMLE) for the dynamic MSM for the hazard function. This estimator improves upon the existing inverse probability weighted (IPW) estimators by providing efficiency gain and robustness protection against model misspecification. Chapter 3 addresses the issue of effect modification (in a MSM) by an effect modifier that is post exposure. This parameter is particularly relevant if an effect modifier of interest is missing at random; or if one wishes to evaluate the effect modification of a second-line-treatment by a post first-line-treatment variable, where assignment of the first-line-treatment shares common determinants with the outcome of interest. We also present a TMLE for this parameter. Part III (chapters 4 and 5) addresses semiparametric inference for mediation analysis. Chapter 4 presents a TMLE estimator for the natural direct and indirect effects in a one-time point setting; it improves upon existing estimators by offering robustness, weakened sensitivity to near positivity violations, and potential applications to situations with high-dimensional mediators.
Chapter 5 studies longitudinal mediation analysis with time-varying exposure and mediators. In it, we propose a reformulation of the mediation problem in terms of stochastic interventions, establish an identification formula for the mediation functional, and present a TMLE for this parameter. This chapter contributes to existing literature by presenting a nonparametrically defined parameter of interest in longitudinal mediation and a multiply robust and efficient estimator for it. Chapter 1: An adaptive trial design allows pre-specified modifications to some aspects of the on-going trial based on analysis of the accruing data, while preserving the validity and integrity of the trial. This flexibility potentially translates into more efficient studies (e.g. shorter duration, fewer subjects) or greater chance of answering clinical questions of interest (e.g. detecting a treatment effect if one exists, broader dose-response information, etc). In an adaptive CARA RCT, the treatment randomization schemes are allowed to depend on the patient's pre-treatment covariates, and the investigators have the opportunity to adjust these schemes during the course of the trial based on accruing information, including previous responses, in order to meet some pre-specified objectives. In a group-sequential CARA RCT, such adjustments take place at interim time points given by sequential inclusion of blocks of c patients, where c ≥ 1 is a pre-specified integer. In this chapter, we present a novel group-sequential CARA RCT design and corresponding analytical procedure that admits the use of flexible approaches in constructing randomization schemes and a wide range of data-adaptive techniques in estimating the conditional response model. Under the proposed framework, the sequence of randomization schemes is group-sequentially determined, using the accruing data, by targeting a formal, user-specified optimal randomization design.
The parameter of interest is nonparametrically defined and is estimated using the paradigm of targeted minimum loss estimation. We establish that under appropriate empirical process conditions, the resulting sequence of randomization schemes converges to a fixed design, and the proposed estimator is consistent and asymptotically Gaussian, with an asymptotic variance that is estimable from data, thus giving rise to valid confidence intervals of given asymptotic levels. To illustrate the proposed framework, we consider LASSO regression in estimating the conditional outcome given treatment and baseline covariates. The asymptotic results ensue under minimal conditions on the growth of the dimension of the regression coefficients and mild conditions on the complexity of the classes of randomization schemes. Chapter 2: In many applications, one is often interested in the effect of a longitudinal exposure on a time-to-event process. In particular, consider a study where subjects are followed over time; in addition to their baseline covariates, at various time points we also record their time-varying exposure of interest, time-varying covariates, and indicators for the event of interest (say death). Time varying confounding is ubiquitous in these situations: the exposure of interest depends on past covariates that confound the effect of the exposure on the outcome of interest, in turn exposure affects future confounders; right censoring may also be present in a study of this nature, often in response to past covariates and exposure. One way to assess the comparative effect of different regimens of interest is to study the hazard as a function of such regimens. The features of this hazard are often encoded in a marginal structural model.
This chapter builds upon the work of Petersen, Schwab, Gruber, Blaser, Schomaker, and van der Laan (2014) to present a targeted maximum likelihood estimator for the marginal structural model for the hazard function under longitudinal dynamic interventions. The proposed estimator is efficient and doubly robust, hence offers an improvement over the incumbent IPW estimator. Chapter 3: A crucial component of comparative effectiveness research is evaluating the modification of an exposure's effect by a given set of baseline covariates (effect modifiers). In complex longitudinal settings where time-varying confounding exists, this effect modification analysis is often performed using a marginal structural model. Generally, the conditioning effect modifiers in a MSM are cast as variables of the observed past. Yet, in some applications the effect modifiers of interest are in fact counterfactual. For instance, for a specific value of the first-line treatment, one may wish to evaluate the effect modification of a second-line-treatment by a post first-line-treatment variable, wherein the first-line-treatment assignment shares common determinants with the outcome of interest. In this case a simple stratification on the first-line treatment will only yield effect modification over a subpopulation given by said determinants. Hence, the desired parameter of interest should be formulated in terms of randomization on first-line treatment as well. In another example, the effect modifiers may be subject to missingness, which may depend on other baseline confounders; a simple complete-case analysis may introduce selection bias due to the high correlation of these confounders with the missingness of the effect modifier. In this case, one would formulate the desired parameter of interest in terms of an intervention on missingness. We call these counterfactual effect modifiers. In such situations, analysis by stratification alone may harbor selection bias.
In this chapter, we investigate MSM defined by counterfactual effect modifiers. Firstly, we determine the identification of the causal dose-response curve and MSM parameters in this setting. Secondly, we establish the semiparametric efficiency theory for these statistical parameters, and present a substitution-based, semiparametric efficient and doubly robust estimator using the targeted maximum likelihood estimation methodology. However, as we shall see, due to the form of the efficient influence curve, the implementation of this estimator may prove arduous in applications where the effect modifier is high dimensional. To address this problem, our third contribution is a projected influence curve (and the corresponding TMLE estimator), which retains most of the robustness of its efficient peer and can be easily implemented in applications where the use of the efficient influence curve becomes taxing. In addition to these two robust estimators, we also present an IPW estimator, and a non-targeted G-computation estimator. Chapter 4: In many causal inference problems, one is interested in the direct causal effect of an exposure on an outcome of interest that is not mediated by certain intermediate variables. Robins and Greenland (1992) and Pearl (2001) formalized the definition of two types of direct effects (natural and controlled) under the counterfactual framework. The efficient influence curves (under a nonparametric model) for the various natural effect parameters and their general robustness conditions, as well as an estimating equation based estimator using the efficient influence curve, are provided in Tchetgen Tchetgen and Shpitser (2011a). In this chapter, we apply the targeted maximum likelihood framework to construct a semiparametric efficient, multiply robust, substitution estimator for the natural direct effect which satisfies the efficient influence curve equation derived in Tchetgen Tchetgen and Shpitser (2011a).
We note that the robustness conditions in Tchetgen Tchetgen and Shpitser (2011a) may be weakened, thereby placing less reliance on the estimation of the mediator density. More.

Causal Inference in Statistics

Author: Judea Pearl
Publisher: John Wiley & Sons
ISBN: 1119186862
Category : Mathematics
Languages : en
Pages : 162

Book Description
Causal Inference in Statistics: A Primer. Causality is central to the understanding and use of data. Without an understanding of cause–effect relationships, we cannot use data to answer questions as basic as "Does this treatment harm or help patients?" But though hundreds of introductory texts are available on statistical methods of data analysis, until now, no beginner-level book has been written about the exploding arsenal of methods that can tease causal information from data. Causal Inference in Statistics fills that gap. Using simple examples and plain language, the book lays out how to define causal parameters; the assumptions necessary to estimate causal parameters in a variety of situations; how to express those assumptions mathematically; whether those assumptions have testable implications; how to predict the effects of interventions; and how to reason counterfactually. These are the foundational tools that any student of statistics needs to acquire in order to use statistical methods to answer causal questions of interest. This book is accessible to anyone with an interest in interpreting data, from undergraduates, professors, and researchers to the interested layperson. Examples are drawn from a wide variety of fields, including medicine, public policy, and law; a brief introduction to probability and statistics is provided for the uninitiated; and each chapter comes with study questions to reinforce the reader's understanding.

Doubly Robust Causal Inference with Complex Parameters

Author: Edward H. Kennedy
Publisher:
ISBN:
Category :
Languages : en
Pages : 248

Book Description
Semiparametric doubly robust methods for causal inference help protect against bias due to model misspecification, while also reducing sensitivity to the curse of dimensionality (e.g., when high-dimensional covariate adjustment is necessary). However, doubly robust methods have not yet been developed in numerous important settings. In particular, standard semiparametric theory mostly only considers independent and identically distributed samples and smooth parameters that can be estimated at classical root-n rates. In this dissertation we extend this theory and develop novel methodology for three settings outside these bounds: (1) matched cohort studies, (2) nonparametric dose-response estimation, and (3) complex high-dimensional effects with continuous instrumental variables. After giving an introduction in Chapter 1, we show in Chapter 2 that, for matched cohort studies, efficient and doubly robust estimators of effects on the treated are computationally equivalent to standard estimators that ignore the non-standard sampling. We also show that matched cohort studies are often more efficient than random sampling for estimating effects on the treated, and derive the optimal number of matches for given matching variables. We apply our methods in a study of the effect of hysterectomy on the risk of cardiovascular disease. In Chapter 3 we develop a novel approach for causal dose-response curve estimation that is doubly robust without requiring any parametric assumptions, and which naturally incorporates general off-the-shelf machine learning. We derive asymptotic properties for a kernel-based version of our approach and propose a data-driven method for bandwidth selection. The methods are used to study the effect of hospital nurse staffing on excess readmissions penalties. 
In Chapter 4 we develop novel estimators of the local instrumental variable curve, which represents the treatment effect among compliers who would take treatment when the instrument passes some threshold. Our methods do not require parametric assumptions, allow for flexible data-adaptive estimation of effect modification, and are doubly robust. We derive asymptotic properties under weak conditions, and use the methods to study infant mortality effects of neonatal intensive care units with high versus low technical capacity, using travel time as an instrument.
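
As a rough illustration of what "doubly robust" means in the description above, the following sketch implements the standard augmented inverse probability weighted (AIPW) estimator of an average treatment effect with NumPy. It is a textbook construction, not the dissertation's own methodology. The simulated data, the true effect of 2, and the function name `aipw_ate` are assumptions made for the example, and the nuisance functions are plugged in as known quantities rather than estimated.

```python
import numpy as np


def aipw_ate(y, a, propensity, mu1, mu0):
    """Augmented IPW estimate of E[Y(1)] - E[Y(0)].

    propensity: P(A = 1 | X); mu1, mu0: outcome regressions E[Y | A = 1 or 0, X].
    """
    correction = a * (y - mu1) / propensity - (1 - a) * (y - mu0) / (1 - propensity)
    return float(np.mean(mu1 - mu0 + correction))


rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)                             # single confounder
true_propensity = 1.0 / (1.0 + np.exp(-0.5 * x))   # assumed treatment mechanism
a = rng.binomial(1, true_propensity)
y = 2.0 * a + x + rng.normal(size=n)               # true average effect is 2

# Case 1: correct outcome regressions, deliberately wrong (constant) propensity
est_outcome_ok = aipw_ate(y, a, np.full(n, 0.5), 2.0 + x, x)

# Case 2: correct propensity, deliberately wrong (all-zero) outcome regressions
est_propensity_ok = aipw_ate(y, a, true_propensity, np.zeros(n), np.zeros(n))

print(f"outcome model correct:    {est_outcome_ok:.2f}")
print(f"propensity model correct: {est_propensity_ok:.2f}")
```

Both estimates land near the true effect even though one nuisance model is wrong in each case. That property is what protects against bias from misspecifying a single model.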

Plug-in Estimation Approaches to Causal Inference and Discovery

Author: Gabriel Ruiz
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
This dissertation covers techniques for the estimation of parameters related to making causal inferences and discoveries. Both for its generality and its simplicity, the focus is on the plug-in estimation of these parameters, whereby the statistical estimator(s) of a parameter(s) is plugged in to obtain an estimator for another, possibly more difficult to estimate, parameter. In particular, the following is addressed. In Chapter 2, we focus on causal discovery, the learning of causality in a data mining scenario. Causal discovery has been of strong scientific and theoretical interest as a starting point to identify "what causes what?" Contingent on assumptions and a proper learning algorithm, it is sometimes possible to identify and accurately estimate a causal directed acyclic graph (DAG), as opposed to a Markov equivalence class of graphs that gives ambiguity of causal directions. The focus of this chapter is on highlighting the identifiability and estimation of DAGs with general error distributions through a general sequential sorting procedure that orders variables one at a time, starting at root nodes, followed by children of the root nodes, and so on until completion. We demonstrate a novel application of this general approach to estimate the topological ordering of a DAG. At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering. The computational complexity of our algorithm on a p-node problem is O(pd), where d is the maximum neighborhood size. Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying DAG. We provide extensive numerical evidence to demonstrate that this sequential procedure scales to possibly thousands of nodes and works well for high-dimensional data. We accompany these numerical experiments with an application to a single-cell gene expression dataset.
The focus of Chapter 3 is the Linear Non-Gaussian Acyclic Model (LiNGAM). Compared to what has been done, we present a novel estimation approach which involves specifying a parametric objective function and arguing when our sequential optimization approach will be statistically consistent, including if the dimension of the underlying graph diverges, and when we can provide finite sample guarantees on its accuracy. This involves (1) defining well our target parameter: an ordering of the vertices of the directed acyclic graph (DAG) such that parents always precede children; and (2) translating deviation bounds on the parameters for the corresponding structural equation model (SEM) into a statement about our topological order estimate's deviation from a true topological ordering. We also incorporate the use of a priori known neighborhood sets to our theoretical results. In Chapter 4, we assume that the underlying causal structure is known, for example, due to the successful application of a causal discovery algorithm similar to those in the previous two chapters. This grants us the identifiability of parameters on the distribution of so-called potential outcomes, the key random variables we would like to make causal claims about. The premise of this chapter, in a vein similar to predictive inference with quantile regression, is that observations may lie far away from their conditional expectation. In the context of causal inference, due to the missing-ness of one outcome, it is difficult to check whether an individual's treatment effect lies close to its prediction given by the estimated Average Treatment Effect (ATE) or Conditional Average Treatment Effect (CATE).
With the aim of augmenting the inference with these estimands in practice, we further study an existing distribution-free framework for the plug-in estimation of bounds on the probability an individual benefits from treatment (PIBT), a generally inestimable quantity that would concisely summarize an intervention's efficacy if it could be known. Given the innate uncertainty in the target population-level bounds on PIBT, we seek to better understand the margin of error for the estimation of these target parameters in order to help discern whether estimated bounds on treatment efficacy are tight (or wide) due to random chance or not. In particular, we present non-asymptotic guarantees to the estimation of bounds on marginal PIBT for a randomized experiment setting. We also derive new non-asymptotic results for the case where we would like to understand heterogeneity in PIBT across strata of pre-treatment covariates, with one of our main results in this setting making strategic use of regression residuals. These results, especially those in the randomized experiment case, can be used to help with formal statistical power analyses and frequentist confidence statements for settings where we are interested in inferring PIBT through the target bounds under minimal parametric assumptions. Our results extend to both real-valued and binary-valued outcomes, and these results can also instead be applied to reason about whether an individual is likely to be harmed by an intervention.
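
The target parameter named in the description above, an ordering of the DAG's vertices in which parents always precede children, can be illustrated with a short sketch. It assumes the graph is already known and uses Kahn's algorithm, a standard textbook method. It does not reproduce the dissertation's data-driven procedure, which estimates the ordering from regression residuals. The example graph and the function names `topological_order` and `respects_dag` are hypothetical.

```python
from collections import deque


def topological_order(parents):
    """Kahn's algorithm: start at root nodes, then append nodes whose parents are all placed.

    `parents` maps every node to the set of its parents; each parent must
    itself appear as a key.
    """
    remaining = {node: set(ps) for node, ps in parents.items()}
    children = {node: [] for node in parents}
    for node, ps in parents.items():
        for p in ps:
            children[p].append(node)

    ready = deque(sorted(node for node, ps in remaining.items() if not ps))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in sorted(children[node]):
            remaining[child].discard(node)
            if not remaining[child]:
                ready.append(child)

    if len(order) != len(parents):
        raise ValueError("graph contains a cycle")
    return order


def respects_dag(order, parents):
    """Check that every parent appears before each of its children in `order`."""
    position = {node: i for i, node in enumerate(order)}
    return all(position[p] < position[node]
               for node, ps in parents.items() for p in ps)


# Hypothetical four-node DAG: A -> B, A -> C, B -> C, C -> D
dag = {"A": set(), "B": {"A"}, "C": {"A", "B"}, "D": {"C"}}
order = topological_order(dag)
print(order, respects_dag(order, dag))
```

A check like `respects_dag` is one simple way to score an estimated ordering against a known ground-truth graph in a simulation.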

Handbook of Matching and Weighting Adjustments for Causal Inference

Author: José R. Zubizarreta
Publisher: CRC Press
ISBN: 1000850811
Category : Mathematics
Languages : en
Pages : 634

Book Description
An observational study infers the effects caused by a treatment, policy, program, intervention, or exposure in a context in which randomized experimentation is unethical or impractical. One task in an observational study is to adjust for visible pretreatment differences between the treated and control groups. Multivariate matching and weighting are two modern forms of adjustment. This handbook provides a comprehensive survey of the most recent methods of adjustment by matching, weighting, machine learning and their combinations. Three additional chapters introduce the steps from association to causation that follow after adjustments are complete. When used alone, matching and weighting do not use outcome information, so they are part of the design of an observational study. When used in conjunction with models for the outcome, matching and weighting may enhance the robustness of model-based adjustments. The book is for researchers in medicine, economics, public health, psychology, epidemiology, public program evaluation, and statistics who examine evidence of the effects on human beings of treatments, policies or exposures.

Targeted Learning

Author: Mark J. van der Laan
Publisher: Springer Science & Business Media
ISBN: 1441997822
Category : Mathematics
Languages : en
Pages : 628

Book Description
The statistics profession is at a unique point in history. The need for valid statistical tools is greater than ever; data sets are massive, often comprising hundreds of thousands of measurements for a single subject. The field is ready to move towards clear objective benchmarks under which tools can be evaluated. Targeted learning allows (1) the full generalization and utilization of cross-validation as an estimator selection tool so that the subjective choices made by humans are now made by the machine, and (2) targeting the fitting of the probability distribution of the data toward the target parameter representing the scientific question of interest. This book is aimed at both statisticians and applied researchers interested in causal inference and general effect estimation for observational and experimental data. Part I is an accessible introduction to super learning and the targeted maximum likelihood estimator, including related concepts necessary to understand and apply these methods. Parts II-IX handle complex data structures and topics applied researchers will immediately recognize from their own research, including time-to-event outcomes, direct and indirect effects, positivity violations, case-control studies, censored data, longitudinal data, and genomic studies.

Causal Inference and Model Selection in Complex Settings

Author: Shandong Zhao
Publisher:
ISBN: 9781321301366
Category :
Languages : en
Pages : 145

Book Description
Propensity score methods have become a part of the standard toolkit for applied researchers who wish to ascertain causal effects from observational data. While they were originally developed for binary treatments, several researchers have proposed generalizations of the propensity score methodology for non-binary treatment regimes. In this article, we first review three main methods that generalize propensity scores in this direction, namely, inverse propensity weighting (IPW), the propensity function (P-FUNCTION), and the generalized propensity score (GPS), along with recent extensions of the GPS that aim to improve its robustness. We compare the assumptions, theoretical properties, and empirical performance of these methods. We propose three new methods that provide robust causal estimation based on the P-FUNCTION and GPS. While our proposed P-FUNCTION-based estimator performs well, we generally advise caution in that all available methods can be biased by model misspecification and extrapolation. In a related line of research, we consider adjustment for posttreatment covariates in causal inference. Even in a randomized experiment, observations might have different compliance behavior under treatment and control assignment. This posttreatment covariate cannot be adjusted for using standard statistical methods. We review the principal stratification framework, which allows for modeling this effect as part of its Bayesian hierarchical models. We generalize the current model to add the possibility of adjusting for pretreatment covariates. We also propose a new estimator of the average treatment effect over the entire population. In a third line of research, we discuss the spectral line detection problem in high energy astrophysics.
We carefully review how this problem can be statistically formulated as a precise hypothesis test with a point null hypothesis, why the usual likelihood ratio test does not apply to problems of this nature, and a workable fix that correctly quantifies the p-value using the likelihood ratio test statistic via posterior predictive p-values. However, as p-values (including posterior predictive p-values) tend to overstate the evidence for the alternative hypothesis in precise hypothesis testing, we review a Bayesian alternative for the line detection problem that uses the Bayes factor. Although Bayes factors are generally criticized for being sensitive to the choice of prior distributions, we show that such prior dependence can reflect different scientific questions and thus be sensible. In fact, p-values have a similar "subjective influence" in that testing for the existence of a line at a fixed location or in an area with a broad range can lead to very different conclusions. This is usually known as the look-elsewhere effect in astrophysics.