2025-11-20
Consider a treatment
Interested in the population average treatment effect (PATE) of
Observed data regression of
We try to condition on as many observed confounders as possible to mitigate potential confounding bias
It is commonly assumed that there are “no unobserved confounders” (NUC), but this assumption cannot be verified from the observed data
When there are unmeasured confounders, additional assumptions are needed to identify causal effects
Sensitivity analysis: how strong would unmeasured confounding have to be to explain away the observed association? (Cinelli and Hazlett 2020)
Null controls: use negative control exposures or outcomes to detect and adjust for unmeasured confounding (Shi, Miao, and Tchetgen Tchetgen 2020)
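For a single regression coefficient, the sensitivity question above has a closed-form summary in Cinelli and Hazlett (2020): the robustness value, the share of residual variance in both treatment and outcome that an unmeasured confounder would need to explain in order to reduce the estimate to zero. A minimal Python sketch follows. The function name `robustness_value` is illustrative, and the inputs (a t-statistic of 3.744 on 1434 residual degrees of freedom) are example values only.

```python
import math


def robustness_value(t_stat: float, dof: int, q: float = 1.0) -> float:
    """Robustness value RV_q from Cinelli and Hazlett (2020).

    t_stat: t-statistic of the treatment coefficient.
    dof:    residual degrees of freedom of the regression.
    q:      fraction of the estimate to explain away (q = 1 means "down to zero").
    """
    # Partial Cohen's f of the treatment with the outcome, scaled by q.
    f_q = q * abs(t_stat) / math.sqrt(dof)
    f2 = f_q ** 2
    return 0.5 * (math.sqrt(f2 ** 2 + 4.0 * f2) - f2)


# Example inputs (illustrative values, not a claim about any specific analysis).
rv = robustness_value(t_stat=3.744, dof=1434)
print(f"RV = {rv:.3f}")
```

With these inputs the robustness value is roughly 0.094: a confounder explaining about 9.4% of the residual variance of both treatment and outcome would be enough to remove the association.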
Observational data from the National Health and Nutrition Examination Survey (NHANES) on alcohol consumption.
Light alcohol consumption is positively correlated with blood levels of HDL (“good cholesterol”)
Define “light alcohol consumption” as 1-2 alcoholic beverages per day
Non-drinkers: self-reported drinking of one drink a week or less
Control for age, gender, and an indicator for educational attainment
Call:
lm(formula = Y[, "Methylmercury"] ~ drinking + X)

Residuals:
    Min      1Q  Median      3Q     Max
-2.3570 -0.7363 -0.0728  0.6242  4.1127

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.442044   0.096385   4.586 4.91e-06 ***
drinking     0.364096   0.097244   3.744 0.000188 ***
Xage         0.008186   0.001536   5.330 1.14e-07 ***
Xgender     -0.062664   0.052290  -1.198 0.230966
Xeduc        0.269815   0.054126   4.985 6.95e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.975 on 1434 degrees of freedom
Multiple R-squared:  0.05209,  Adjusted R-squared:  0.04945
F-statistic:  19.7 on 4 and 1434 DF,  p-value: 8.41e-16
Pearson's product-moment correlation
data: hdl_fit$residuals and mercury_fit$residuals
t = 3.7569, df = 1437, p-value = 0.0001789
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.04718758 0.14953581
sample estimates:
cor
0.0986225
Correlation between the HDL and methylmercury residuals may indicate confounding bias from a shared unmeasured confounder
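The logic of this diagnostic can be checked in a small simulation. The Python sketch below is an illustration under assumed settings, not the NHANES analysis: it uses a continuous treatment, one outcome affected by the treatment, one null control outcome, and a single unmeasured confounder whose strength is controlled by the hypothetical parameter `confounder_strength`. All function and variable names are illustrative.

```python
import math
import random


def ols_residuals(x, y):
    """Residuals from a least-squares regression of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [(yi - my) - slope * (xi - mx) for xi, yi in zip(x, y)]


def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def residual_correlation(confounder_strength, n=5000, seed=0):
    """Correlation of residuals from two outcome regressions on the treatment."""
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(n)]  # unmeasured confounder
    d = [confounder_strength * ui + rng.gauss(0, 1) for ui in u]  # treatment
    # Outcome causally affected by the treatment (assumed effect 0.5).
    y1 = [0.5 * di + confounder_strength * ui + rng.gauss(0, 1)
          for di, ui in zip(d, u)]
    # Null control outcome: no causal effect of the treatment.
    y2 = [confounder_strength * ui + rng.gauss(0, 1) for ui in u]
    return corr(ols_residuals(d, y1), ols_residuals(d, y2))


r_none = residual_correlation(0.0)  # no unmeasured confounding
r_conf = residual_correlation(1.0)  # shared unmeasured confounder
print(f"no confounder: {r_none:.3f}, with confounder: {r_conf:.3f}")
```

Without the confounder the residuals are close to uncorrelated; with it, the part of the confounder not explained by the treatment remains in both sets of residuals and produces a clearly positive correlation.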
Multiple outcomes (JASA, 2023)
Multiple exposures
Multiple outcomes and exposures (preprint)
This talk: spatial confounding in environmental epidemiology (preprint)
Call:
lm(formula = bw_mean ~ pm25, data = mutate(pm25_data, pm25 = pm25/10))

Residuals:
    Min      1Q  Median      3Q     Max
-501.56  -38.76   -1.35   37.71  330.60

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 3337.181      2.784 1198.52   <2e-16 ***
pm25         -47.688      2.857  -16.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 64.38 on 8461 degrees of freedom
Multiple R-squared:  0.03187,  Adjusted R-squared:  0.03175
F-statistic: 278.5 on 1 and 8461 DF,  p-value: < 2.2e-16
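A 10 μg/m³ increase in PM2.5 is associated with a 47.7 g lower mean birth weight in this unadjusted regression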
Outcome: 2018–2024 ZIP code–level birth counts by weight category (California Vital Data, Cal-ViDa Query Tool)
Exposure: Annual ZIP code–level PM2.5
Observed covariates: Maternal age, race, education, nativity, prenatal care timing, household income, geographic coordinates, and calendar year
Potential unmeasured confounders: neighborhood deprivation, other socioeconomic factors, cultural and lifestyle factors
Meta-analyses: Birth weight decreases by 16–28 g per 10 μg/m³ increase in PM2.5
High between-study heterogeneity (range −79.3 g to +24.9 g)
None of these studies handles unmeasured confounders that are not captured by spatial location
Motivation: challenges from confounding and measurement variability call for robust causal inference
To avoid overfitting (double-dipping), nuisance models are fit on one part of the data and evaluated on another, a technique analogous to cross-validation.
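The sample-splitting idea can be sketched as cross-fitting in a partially linear model, in the spirit of double machine learning (Chernozhukov et al. 2018). The Python example below is a simplified illustration with assumed ingredients: a single observed covariate `x` in [0, 1), a piecewise-constant regression standing in for a flexible machine learning method, and simulated data with a known effect `theta`. The function names are hypothetical.

```python
import math
import random


def bin_index(x, n_bins):
    """Index of the equal-width bin on [0, 1) that contains x."""
    return min(int(x * n_bins), n_bins - 1)


def fit_bin_means(xs, ys, n_bins):
    """Piecewise-constant regression of ys on xs: one fitted mean per bin."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = bin_index(x, n_bins)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)  # fallback for empty bins
    return [s / c if c > 0 else overall for s, c in zip(sums, counts)]


def cross_fit_plm(x, d, y, n_folds=5, n_bins=20):
    """Cross-fitted residual-on-residual estimate of the effect of d on y."""
    n = len(x)
    res_d, res_y = [0.0] * n, [0.0] * n
    for k in range(n_folds):
        train = [i for i in range(n) if i % n_folds != k]
        held_out = [i for i in range(n) if i % n_folds == k]
        # Nuisance regressions are fit on the training folds only.
        m_hat = fit_bin_means([x[i] for i in train], [d[i] for i in train], n_bins)
        l_hat = fit_bin_means([x[i] for i in train], [y[i] for i in train], n_bins)
        # Residuals are computed on the held-out fold only.
        for i in held_out:
            b = bin_index(x[i], n_bins)
            res_d[i] = d[i] - m_hat[b]
            res_y[i] = y[i] - l_hat[b]
    num = sum(rd * ry for rd, ry in zip(res_d, res_y))
    den = sum(rd * rd for rd in res_d)
    return num / den


# Simulated data (assumed for illustration): x confounds d and y, true effect 1.5.
rng = random.Random(1)
n, theta = 4000, 1.5
x = [rng.random() for _ in range(n)]
d = [xi ** 2 + rng.gauss(0, 1) for xi in x]
y = [theta * di + math.sin(2 * math.pi * xi) + rng.gauss(0, 1)
     for di, xi in zip(d, x)]

theta_hat = cross_fit_plm(x, d, y)
print(f"cross-fitted estimate: {theta_hat:.3f} (true effect {theta})")
```

Because every residual comes from a model that never saw that observation, overfitting in the nuisance regressions does not leak into the final estimate. This sketch only adjusts for an observed covariate, so it inherits the NUC assumption.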
In practice, confounders vary over space with idiosyncratic differences, mixing spatial and non-spatial variations
Interference: outcomes may depend on exposures at neighboring locations and past time points
DML does not explicitly address causal identifiability or omitted variable bias
We will further leverage variation over time and space to help identify causal effects in the presence of unmeasured confounding
Setup: panel with
Latent variable modeling: residual correlations in space and time reflect unmeasured confounders
Panel at
Population average treatment effect (PATE) of
In general, PATE is not the same as
Assumption (Limited interference)
For every
Assumption (Latent positivity)
Assumption (Latent unconfoundedness)
Assume the exposures and outcomes are linear in a latent m-dimensional Gaussian confounder:
Bias is
The proposed model implies a factor structure:
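For intuition, a generic single-exposure instance of a linear model with a latent Gaussian confounder can be worked through by hand. The symbols below ($\beta$, $\gamma$, $\tau$, $\sigma_D^2$, $\sigma_Y^2$) are assumed for illustration and are not necessarily the notation of the talk's model.

```latex
\begin{aligned}
U &\sim N(0, I_m), \\
D &= \beta^\top U + \varepsilon_D, \qquad \varepsilon_D \sim N(0, \sigma_D^2), \\
Y &= \tau D + \gamma^\top U + \varepsilon_Y, \qquad \varepsilon_Y \sim N(0, \sigma_Y^2), \\
E[U \mid D] &= \frac{\beta}{\beta^\top \beta + \sigma_D^2}\, D, \\
E[Y \mid D] &= \left( \tau + \frac{\gamma^\top \beta}{\beta^\top \beta + \sigma_D^2} \right) D .
\end{aligned}
```

In this simplified setting the observed regression slope differs from the causal effect $\tau$ by $\gamma^\top \beta / (\beta^\top \beta + \sigma_D^2)$, which vanishes only when the confounder loadings on exposure and outcome are orthogonal. With several exposures or outcomes, stacking the loadings into a matrix $\Lambda$ gives a covariance of the form $\Lambda \Lambda^\top + \Psi$ with diagonal $\Psi$, which is the sense in which such models have a factor structure.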
Proposition
Under the proposed model and assumptions on factor identifiability, the causal effect
- The interval on the right is identifiable for all
-
Assumption (Off-Neighborhood Rank — informal)
In practice,
Theorem
Under the structural model and identification assumptions, the causal effect functions
Intuition:
For
The rank condition ensures enough entries of
Closely related approach from econometrics (Bai 2009)
Treats all parameters as fixed effects, with identification in an asymptotic framework (N, P → ∞)
Does not explicitly address causal identifiability or omitted variable bias
Assumes outcomes are linear in the exposures
Estimators compared
DML (NUC): Adjusts for observed covariates, treating unmeasured confounding as a smooth function of space and time (Chernozhukov et al. 2018)
IFE: Interactive fixed effects estimator (Bai 2009)
FC (Proposed): Factor confounding approach, explicitly modeling latent confounders
Meta-analyses: Birth weight decreases by 16–28 g per 10 μg/m³ increase in PM2.5
Outcome: 2018–2024 ZIP code–level birth counts by weight category (California Vital Data, Cal-ViDa Query Tool)
Exposure: Annual ZIP code–level PM2.5
Observed confounders: Maternal age, race, education, nativity, prenatal care timing, household income, geographic coordinates, and calendar year
In general, we do not know the true causal effect in observational data
Predictive accuracy is not a good substitute for causal accuracy
Negative control / placebo checks: PM
Robustness to covariate adjustment: if our method handles unmeasured confounding, estimates should not change much when adding or removing observed covariates
Double machine learning with spatial location as a proxy for unmeasured confounders may not fully account for confounding bias
Latent variable models can help account for unmeasured confounding
Under the proposed model, with mild rank and partial interference assumptions, causal effects are identifiable and can be estimated without bias from spatiotemporal data
More general forms of confounding, non-linear latent variable models, etc.
Analytic results are much more complicated in non-linear latent variable models
Need to identify E[U | D, X]
Causal inference with tensor data (multiple outcomes / exposures across space and time)
Mixtures of multiple pollutants, multiple health outcomes, etc.
Thank You!