Causal Inference with Random Forests

Author: Stefan Wager
Languages : en

Book Description
Random forests, introduced by Breiman [2001], have become one of the most popular machine learning algorithms among practitioners, and reliably achieve good predictive performance across several application areas. This has led to considerable interest in using random forests for doing science, or drawing statistical inferences in problems that do not reduce immediately to prediction. As a step in this direction, this thesis studies how random forests can be used for understanding treatment effect heterogeneity as it may arise in, e.g., personalized medicine. Our main contributions are as follows:
- We develop a causal forest algorithm for heterogeneous treatment effect estimation, and find our method to be substantially more powerful at identifying treatment heterogeneity than traditional methods based on nearest-neighbor matching, especially when the number of considered covariates is large.
- We provide an asymptotic statistical analysis of causal forests, and prove a Gaussian limit result. We then propose a practical method for estimating the noise scale of causal forests, thus allowing for valid statistical inference with causal forests.
- In a high-dimensional regime where the problem complexity and the number of observations jointly approach infinity, we identify the signal strength at which tree-based methods become able to accurately detect treatment heterogeneity. Perhaps strikingly, we find that the required signal strength only scales logarithmically in the dimension of the problem.
Taken together, these results show that random forests -- despite often being understood as a mere black box predictive algorithm -- provide a powerful toolbox for heterogeneous treatment effect estimation in modern large-scale problems.

Achieving Reliable Causal Inference with Data-Mined Variables

Author: Mochen Yang
Languages : en
Pages : 53

Book Description
Combining machine learning with econometric analysis is becoming increasingly prevalent in both research and practice. A common empirical strategy involves the application of predictive modeling techniques to "mine" variables of interest from available data, followed by the inclusion of those variables into an econometric framework, with the objective of estimating causal effects. Recent work highlights that, because the predictions from machine learning models are inevitably imperfect, econometric analyses based on the predicted variables are likely to suffer from bias due to measurement error. We propose a novel approach to mitigate these biases, leveraging the ensemble learning technique known as the random forest. We propose employing random forest not just for prediction, but also for generating instrumental variables to address the measurement error embedded in the prediction. The random forest algorithm performs best when comprised of a set of trees that are individually accurate in their predictions, yet which also make "different" mistakes, i.e., have weakly correlated prediction errors. A key observation is that these properties are closely related to the relevance and exclusion requirements of valid instrumental variables. We design a data-driven procedure to select tuples of individual trees from a random forest, in which one tree serves as the endogenous covariate and the other trees serve as its instruments. Simulation experiments demonstrate the efficacy of the proposed approach in mitigating estimation biases, and its superior performance over an alternative method (simulation-extrapolation), which has been suggested by prior work as a reasonable method of addressing the measurement error problem.
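
The core idea above, using one noisy prediction as an instrument for another, can be illustrated with a minimal simulation. This is only a sketch under simplifying assumptions, not the authors' tree-selection procedure: `tree_a` and `tree_b` are hypothetical stand-ins for two trees' predictions of the same variable, and their prediction errors are simulated as fully independent. Regressing the outcome on one noisy prediction attenuates the slope, while instrumenting it with the other prediction recovers a value close to the true coefficient.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sample covariance (population normalization is enough for ratios)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def simulate(n=20000, beta=2.0, seed=0):
    """Simulated data: y depends on x_true, which is only seen via predictions."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    # Hypothetical "tree" predictions: accurate on average, independent errors.
    tree_a = [x + rng.gauss(0, 1) for x in x_true]
    tree_b = [x + rng.gauss(0, 1) for x in x_true]
    y = [beta * x + rng.gauss(0, 1) for x in x_true]
    return tree_a, tree_b, y


def ols_slope(x, y):
    """Naive regression of y on the mismeasured covariate."""
    return cov(x, y) / cov(x, x)


def iv_slope(x, z, y):
    """Simple IV (Wald) estimator: z instruments the mismeasured covariate x."""
    return cov(z, y) / cov(z, x)


tree_a, tree_b, y = simulate()
naive = ols_slope(tree_a, y)            # biased toward zero by measurement error
corrected = iv_slope(tree_a, tree_b, y)  # tree_b serves as the instrument
print(f"naive OLS slope: {naive:.2f}, IV slope: {corrected:.2f} (true beta = 2.0)")
```

In this simulation the error variance equals the signal variance, so the naive slope is roughly halved. In a real forest the trees' errors are only weakly correlated rather than independent, which is why the description emphasizes a data-driven selection of trees.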

Topics in Machine Learning for Causal Inference with Applications in Social Science

Author: Rina Siller Friedberg
Languages : en

Book Description
Random forests are a powerful method for non-parametric regression, but are limited in their ability to fit smooth signals. Taking the perspective of random forests as an adaptive kernel method, we pair the forest kernel with a local linear regression adjustment to better capture smoothness. The resulting procedure, local linear forests, enables us to improve on asymptotic rates of convergence for random forests with smooth signals, and provides substantial gains in accuracy on both real and simulated data. We prove a central limit theorem valid under regularity conditions on the forest and smoothness constraints, propose a computationally efficient construction for confidence intervals, and discuss an extension to local linear causal forests for learning heterogeneous treatment effects. Following this deep dive into local linear forests, we discuss two applications of machine learning for causal inference. The first is a retirement reform in Denmark, in which shifting eligibility ages for an early retirement program provide an opportunity to analyze heterogeneous treatment effects of the age of retirement eligibility. The second is a randomized controlled trial in Nairobi, Kenya, aiming to lower rates of gender-based violence against adolescent students living in informal settlements. In the latter example, we explore how local linear causal forests help to uncover and emphasize trends in marginalized student responses to the intervention. In both cases, we address how machine learning and causal inference are powerful tools to discover patterns in individual treatment effects, and to advocate for marginalized groups when estimates reveal troubling patterns in the data.
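
The pairing of a kernel with a local linear adjustment can be sketched in one dimension. This is a rough illustration, not the dissertation's implementation: a Gaussian kernel (the hypothetical `kernel_weights` below) stands in for the forest-derived weights, and the data are a noiseless linear signal chosen to make the effect visible. A plain weighted average is biased at the boundary of the data, while the intercept of a weighted linear fit centered at the query point is not.

```python
import math


def kernel_weights(xs, x0, bandwidth=0.2):
    """Stand-in for forest weights (assumption): Gaussian kernel around x0."""
    return [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]


def weighted_mean_fit(ys, weights):
    """Prediction as a weighted average of outcomes (no local adjustment)."""
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


def local_linear_fit(xs, ys, x0, weights):
    """Weighted least squares of y on (1, x - x0); the intercept predicts at x0."""
    s0 = sum(weights)
    s1 = sum(w * (x - x0) for w, x in zip(weights, xs))
    s2 = sum(w * (x - x0) ** 2 for w, x in zip(weights, xs))
    t0 = sum(w * y for w, y in zip(weights, ys))
    t1 = sum(w * (x - x0) * y for w, x, y in zip(weights, xs, ys))
    det = s0 * s2 - s1 * s1
    # Solve the 2x2 normal equations for the intercept.
    return (s2 * t0 - s1 * t1) / det


# Smooth (linear) signal observed on [0, 1]; query at the left boundary.
xs = [i / 50 for i in range(51)]
ys = [2.0 * x + 1.0 for x in xs]
x0 = 0.0
weights = kernel_weights(xs, x0)

plain = weighted_mean_fit(ys, weights)
adjusted = local_linear_fit(xs, ys, x0, weights)
print(f"true value: 1.00, weighted mean: {plain:.2f}, local linear: {adjusted:.2f}")
```

All neighbors of the boundary point lie to its right, so the weighted mean overshoots; the local linear fit accounts for the slope and removes that bias.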

Mean-weighted Case Specific Random Forests for Estimating Causal Effects

Author: Linus Addae
Languages : en

Book Description
Causal inference is a branch of statistics that deals with determining how responses are affected by treatments. In this dissertation, we examine two problems in causal inference under the Neyman-Rubin causal model (NRCM): estimation of counterfactuals (hypothetical unobserved responses of units under different treatment conditions) and treatment effect estimation under treatment spillover (when the treatment status of one unit affects the response of another). First, we extend the case specific random forest (CSRF) methodology to develop mean-weighted case specific random forests (MWCSRF) to estimate the average treatment effect for the treated (ATT). We consider a setting in which the data contain many control units and very few treated units, and the covariate space for the treated units is a small subspace of that for the control units. For example, treated units may be those that underwent an experimental procedure and control units may be the set of units in a national database. Our approach is as follows. First, we compute bootstrap sample weights for each treated unit to oversample control units near the treated unit. Then, we average these weights together to construct one set of "treated" sample weights. Next, we use random forests to estimate the prognostic score (the expected control outcome given a set of covariates) for each treated unit. Finally, we estimate the ATT by taking the average difference of the responses and the estimated prognostic scores across all treated units. We show via a simulation study that MWCSRF performs favorably compared to the standard random forest, causal forests, and genetic matching under both homogeneous and heterogeneous treatment effect settings, especially when the number of treated units is small. Additionally, we demonstrate that, when parallelization is not available, MWCSRF requires significantly less runtime than CSRF. We confirm our findings on a study of the efficacy of the National Supported Work Demonstration. Additionally, we develop an R package for MWCSRF.
Secondly, we discuss the problem of treatment spillover in the context of Fisher's Lady Tasting Tea experiment. We show that, by design, Lady Tasting Tea can violate the stable unit treatment value assumption (SUTVA), which requires the response of a unit to be only affected by the treatment status of that unit. We show that SUTVA may be violated under this model even when, for a given cup, the Lady's milk-first likelihood is always higher when that cup actually receives milk first. Moreover, we show that SUTVA holds under two conditions: one in which the Lady's likelihood for a cup is the same regardless of whether that cup was given milk first or tea first, and one in which the Lady always makes perfect guesses. These results further emphasize that SUTVA violations cannot be classified solely as treatment spillover problems, but can be inherent in the design of an experiment. Additionally, this result may have implications for teaching causal inference, as it may be preferable to introduce randomized experiments using examples that do not inherently violate SUTVA.
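
The four MWCSRF steps described above can be sketched loosely in plain Python. This is not the dissertation's R package, and several pieces are assumptions made for illustration: the per-unit weights use a Gaussian kernel on a single covariate (the description does not state the weighting formula), the "forest" is a bagged ensemble of depth-1 regression trees, and all data are simulated with a true effect of 2.0. The function names are hypothetical.

```python
import math
import random


def proximity_weights(x_treated, x_control, bandwidth=0.1):
    """Steps 1-2: per-treated-unit weights on controls, averaged into one set."""
    averaged = [0.0] * len(x_control)
    for xt in x_treated:
        w = [math.exp(-0.5 * ((xc - xt) / bandwidth) ** 2) for xc in x_control]
        total = sum(w)
        for j, wj in enumerate(w):
            averaged[j] += wj / total / len(x_treated)
    return averaged


def fit_stump(xs, ys):
    """Depth-1 regression tree: returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    best_gain, best = -1.0, (math.inf, total / n, total / n)
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical covariate values
        right_sum = total - left_sum
        # Maximizing this quantity is equivalent to minimizing squared error.
        gain = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if gain > best_gain:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain, best = gain, (threshold, left_sum / i, right_sum / (n - i))
    return best


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def weighted_forest_att(x_treated, y_treated, x_control, y_control,
                        n_trees=50, seed=0):
    """Steps 3-4: weighted-bootstrap ensemble on controls, then the ATT."""
    rng = random.Random(seed)
    weights = proximity_weights(x_treated, x_control)
    indices = list(range(len(x_control)))
    stumps = []
    for _ in range(n_trees):
        sample = rng.choices(indices, weights=weights, k=len(indices))
        stumps.append(fit_stump([x_control[i] for i in sample],
                                [y_control[i] for i in sample]))
    differences = []
    for xt, yt in zip(x_treated, y_treated):
        prognostic = sum(predict_stump(s, xt) for s in stumps) / n_trees
        differences.append(yt - prognostic)
    return sum(differences) / len(differences)


# Simulated data: many controls on [0, 1], few treated units near 0.5.
rng = random.Random(1)
x_control = [rng.random() for _ in range(300)]
y_control = [x + rng.gauss(0, 0.1) for x in x_control]
x_treated = [0.45 + 0.1 * rng.random() for _ in range(10)]
y_treated = [x + 2.0 + rng.gauss(0, 0.1) for x in x_treated]

att = weighted_forest_att(x_treated, y_treated, x_control, y_control)
print(f"estimated ATT: {att:.2f} (simulated effect: 2.0)")
```

Drawing each bootstrap sample with the averaged weights concentrates the ensemble on controls that resemble the treated units, which is the intuition the description gives for settings with few treated units.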

Machine Learning for Experiments in the Social Sciences

Author: Jon Green
Publisher: Cambridge University Press
ISBN: 1009197843
Category : Political Science
Languages : en
Pages : 127

Book Description
Causal inference and machine learning are typically introduced in the social sciences separately as theoretically distinct methodological traditions. However, applications of machine learning in causal inference are increasingly prevalent. This Element provides theoretical and practical introductions to machine learning for social scientists interested in applying such methods to experimental data. We show how machine learning can be useful for conducting robust causal inference and provide a theoretical foundation researchers can use to understand and apply new methods in this rapidly developing field. We then demonstrate two specific methods – the prediction rule ensemble and the causal random forest – for characterizing treatment effect heterogeneity in survey experiments and testing the extent to which such heterogeneity is robust to out-of-sample prediction. We conclude by discussing limitations and tradeoffs of such methods, while directing readers to additional related methods available on the Comprehensive R Archive Network (CRAN).

Machine Learning and Causality: The Impact of Financial Crises on Growth

Author: Mr. Andrew J Tiffin
Publisher: International Monetary Fund
ISBN: 1513518305
Category : Computers
Languages : en
Pages : 30

Book Description
Machine learning tools are well known for their success in prediction. But prediction is not causation, and causal discovery is at the core of most questions concerning economic policy. Recently, however, the literature has focused more on issues of causality. This paper gently introduces some leading work in this area, using a concrete example—assessing the impact of a hypothetical banking crisis on a country’s growth. By enabling consideration of a rich set of potential nonlinearities, and by allowing individually-tailored policy assessments, machine learning can provide an invaluable complement to the skill set of economists within the Fund and beyond.

Causality in Time Series: Challenges in Machine Learning

Author: Florin Popescu
ISBN: 9780971977754
Category : Computers
Languages : en
Pages : 152

Book Description
This volume in the Challenges in Machine Learning series gathers papers from the Mini Symposium on Causality in Time Series, which was part of the Neural Information Processing Systems (NIPS) conference in 2009 in Vancouver, Canada. These papers present state-of-the-art research in time-series causality to the machine learning community, unifying methodological interests in the various communities that require such inference.

Targeted Learning

Author: Mark J. van der Laan
Publisher: Springer Science & Business Media
ISBN: 1441997822
Category : Mathematics
Languages : en
Pages : 628

Book Description
The statistics profession is at a unique point in history. The need for valid statistical tools is greater than ever; data sets are massive, often measuring hundreds of thousands of measurements for a single subject. The field is ready to move towards clear objective benchmarks under which tools can be evaluated. Targeted learning allows (1) the full generalization and utilization of cross-validation as an estimator selection tool so that the subjective choices made by humans are now made by the machine, and (2) targeting the fitting of the probability distribution of the data toward the target parameter representing the scientific question of interest. This book is aimed at both statisticians and applied researchers interested in causal inference and general effect estimation for observational and experimental data. Part I is an accessible introduction to super learning and the targeted maximum likelihood estimator, including related concepts necessary to understand and apply these methods. Parts II-IX handle complex data structures and topics applied researchers will immediately recognize from their own research, including time-to-event outcomes, direct and indirect effects, positivity violations, case-control studies, censored data, longitudinal data, and genomic studies.

An Introduction to Causal Inference

Author: Judea Pearl
Publisher: Createspace Independent Publishing Platform
ISBN: 9781507894293
Category : Causation
Languages : en
Pages : 0

Book Description
This paper summarizes recent advances in causal inference and underscores the paradigmatic shifts that must be undertaken in moving from traditional statistical analysis to causal analysis of multivariate data. Special emphasis is placed on the assumptions that underlie all causal inferences, the languages used in formulating those assumptions, the conditional nature of all causal and counterfactual claims, and the methods that have been developed for the assessment of such claims. These advances are illustrated using a general theory of causation based on the Structural Causal Model (SCM) described in Pearl (2000a), which subsumes and unifies other approaches to causation, and provides a coherent mathematical foundation for the analysis of causes and counterfactuals. In particular, the paper surveys the development of mathematical tools for inferring (from a combination of data and assumptions) answers to three types of causal queries: (1) queries about the effects of potential interventions (also called "causal effects" or "policy evaluation"), (2) queries about probabilities of counterfactuals (including assessment of "regret," "attribution" or "causes of effects"), and (3) queries about direct and indirect effects (also known as "mediation"). Finally, the paper defines the formal and conceptual relationships between the structural and potential-outcome frameworks and presents tools for a symbiotic analysis that uses the strong features of both. The tools are demonstrated in the analyses of mediation, causes of effects, and probabilities of causation. -- p. 1.

Statistical Causal Inferences and Their Applications in Public Health Research

Author: Hua He
Publisher: Springer
ISBN: 3319412590
Category : Medical
Languages : en
Pages : 324

Book Description
This book compiles and presents new developments in statistical causal inference. The accompanying data and computer programs are publicly available so readers may replicate the model development and data analysis presented in each chapter. In this way, methodology is taught so that readers may implement it directly. The book brings together experts engaged in causal inference research to present and discuss recent issues in causal inference methodological development. This is also a timely look at causal inference applied to scenarios that range from clinical trials to mediation and public health research more broadly. In an academic setting, this book will serve as a reference and guide to a course in causal inference at the graduate level (Master's or Doctorate). It is particularly relevant for students pursuing degrees in statistics, biostatistics, and computational biology. Researchers and data analysts in public health and biomedical research will also find this book to be an important reference.