2025-11-20
Consider a treatment \(D\) and outcome \(Y\)
Interested in the population average treatment effect (PATE) of \(D\) on \(Y\): \[E[Y | do(D=d)] - E[Y | do(D=d')]\]
Regressing \(Y\) on \(D\) in observed data fails because the distribution of the unmeasured confounders \(U\) differs between the two treatment arms
We try to condition on as many observed confounders as possible to mitigate potential confounding bias
Commonly assumed that there are “no unobserved confounders” (NUC) but this is unverifiable
When there are unmeasured confounders, additional assumptions are needed to identify causal effects
Sensitivity analysis: how strong would unmeasured confounding have to be to explain away the observed association? (Cinelli and Hazlett 2020)
Null controls: use negative control exposures or outcomes to detect and adjust for unmeasured confounding (Shi, Miao, and Tchetgen 2020)
Observational data from the National Health and Nutrition Examination Survey (NHANES) on alcohol consumption.
Light alcohol consumption is positively correlated with blood levels of HDL (“good cholesterol”)
Define “light alcohol consumption” as 1-2 alcoholic beverages per day
Non-drinkers: self-reported drinking of one drink a week or less
Control for age, gender, and an indicator for educational attainment
Call:
lm(formula = Y[, "Methylmercury"] ~ drinking + X)
Residuals:
Min 1Q Median 3Q Max
-2.3570 -0.7363 -0.0728 0.6242 4.1127
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.442044 0.096385 4.586 4.91e-06 ***
drinking 0.364096 0.097244 3.744 0.000188 ***
Xage 0.008186 0.001536 5.330 1.14e-07 ***
Xgender -0.062664 0.052290 -1.198 0.230966
Xeduc 0.269815 0.054126 4.985 6.95e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.975 on 1434 degrees of freedom
Multiple R-squared: 0.05209, Adjusted R-squared: 0.04945
F-statistic: 19.7 on 4 and 1434 DF, p-value: 8.41e-16
Pearson's product-moment correlation
data: hdl_fit$residuals and mercury_fit$residuals
t = 3.7569, df = 1437, p-value = 0.0001789
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.04718758 0.14953581
sample estimates:
cor
0.0986225
Residual correlation might be indicative of confounding bias
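A minimal simulation sketch of why residual correlation between a primary outcome and a null control outcome can signal confounding. This is not the NHANES analysis: the data are synthetic, and all names (`residual_correlation`, `confounding`, and so on) are hypothetical. The assumption is a single unmeasured confounder `u` that drives the exposure and both outcomes, with the exposure having no effect on the null control outcome.

```python
import random


def ols_residuals(x, y):
    """Residuals from a least-squares fit of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [yi - my - slope * (xi - mx) for xi, yi in zip(x, y)]


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5


def residual_correlation(confounding, n=5000, seed=0):
    """Correlate residuals of a primary and a null control outcome.

    `confounding` scales how strongly the unmeasured u enters the
    exposure and both outcomes (an assumption of this toy model).
    """
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(n)]
    d = [confounding * ui + rng.gauss(0, 1) for ui in u]
    # Primary outcome: true effect of d is 0.5, plus confounding through u.
    y_primary = [0.5 * di + confounding * ui + rng.gauss(0, 1)
                 for di, ui in zip(d, u)]
    # Null control outcome: no effect of d, but the same confounder u.
    y_control = [confounding * ui + rng.gauss(0, 1) for ui in u]
    return correlation(ols_residuals(d, y_primary),
                       ols_residuals(d, y_control))


with_confounder = residual_correlation(confounding=1.0)
without_confounder = residual_correlation(confounding=0.0)
print(with_confounder, without_confounder)
```

In this toy setup the residual correlation is clearly positive when `u` is present and near zero when it is absent, which mirrors the logic of comparing `hdl_fit$residuals` with `mercury_fit$residuals`.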
Multiple outcomes (JASA, 2023)
Multiple exposures
Multiple outcomes and exposures (preprint)
This talk: spatial confounding in environmental epidemiology (preprint)
Call:
lm(formula = bw_mean ~ pm25, data = mutate(pm25_data, pm25 = pm25/10))
Residuals:
Min 1Q Median 3Q Max
-501.56 -38.76 -1.35 37.71 330.60
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3337.181 2.784 1198.52 <2e-16 ***
pm25 -47.688 2.857 -16.69 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 64.38 on 8461 degrees of freedom
Multiple R-squared: 0.03187, Adjusted R-squared: 0.03175
F-statistic: 278.5 on 1 and 8461 DF, p-value: < 2.2e-16
A 10 μg/m\(^3\) increase in PM\(_{2.5}\) is associated with a 48 g decrease in birth weight
Outcome: 2018–2024 ZIP code–level birth counts by weight category (California Vital Data, Cal-ViDa Query Tool)
Exposure: Annual ZIP code–level PM\(_{2.5}\) concentrations, derived from high-resolution estimates (Shen et al. 2024)
Observed covariates: Maternal age, race, education, nativity, prenatal care timing, household income, geographic coordinates, and calendar year
Potential unmeasured confounders: neighborhood deprivation, other socioeconomic factors, cultural and lifestyle factors
Meta-analyses: Birth weight decreases by 16–28 g per 10 μg/m\(^3\) increase in PM\(_{2.5}\) exposure during pregnancy (Gong et al. 2022).
High between-study heterogeneity (range −79.3 g to +24.9 g)
None of the existing analyses accounts for unmeasured confounders that spatial location does not capture
Motivation: challenges from confounding and measurement variability call for robust causal inference
To avoid overfitting (double-dipping), nuisance functions are fit with cross-fitting, a sample-splitting technique analogous to cross-validation.
In practice, confounders vary over space with idiosyncratic differences, mixing spatial and non-spatial variations
Interference: outcomes may depend on exposures at neighboring locations and past time points
DML does not explicitly address causal identifiability or omitted variable bias
We will further leverage variation over time and space to help identify causal effects in the presence of unmeasured confounding
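The sample-splitting idea behind DML can be sketched as follows. This is an illustrative toy, not the estimator used in the talk: it assumes a partially linear model with one observed covariate, uses binned means as a stand-in for the machine-learning nuisance learners, and all function names are hypothetical. Nuisance fits are trained on the other folds and used only to residualize the held-out fold.

```python
import math
import random


def binned_mean_fit(x, y, bins=20):
    """Piecewise-constant regression of y on x in [0, 1); returns a predictor."""
    sums, counts = [0.0] * bins, [0] * bins
    for xi, yi in zip(x, y):
        b = min(int(xi * bins), bins - 1)
        sums[b] += yi
        counts[b] += 1
    overall = sum(y) / len(y)  # fallback for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda xi: means[min(int(xi * bins), bins - 1)]


def cross_fit_effect(x, d, y, folds=5):
    """Residual-on-residual slope with out-of-fold nuisance predictions."""
    n = len(x)
    res_d, res_y = [0.0] * n, [0.0] * n
    for k in range(folds):
        train = [i for i in range(n) if i % folds != k]
        held_out = [i for i in range(n) if i % folds == k]
        # Nuisance models are fit WITHOUT the held-out fold.
        d_hat = binned_mean_fit([x[i] for i in train], [d[i] for i in train])
        y_hat = binned_mean_fit([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            res_d[i] = d[i] - d_hat(x[i])
            res_y[i] = y[i] - y_hat(x[i])
    return sum(a * b for a, b in zip(res_d, res_y)) / sum(a * a for a in res_d)


def naive_slope(d, y):
    """Slope from regressing y on d with no covariate adjustment."""
    n = len(d)
    md, my = sum(d) / n, sum(y) / n
    return (sum((di - md) * (yi - my) for di, yi in zip(d, y))
            / sum((di - md) ** 2 for di in d))


def simulate(n=4000, seed=1, effect=1.0):
    """Synthetic data: x confounds d and y; the true effect of d is `effect`."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    d = [math.sin(2 * math.pi * xi) + rng.gauss(0, 1) for xi in x]
    y = [effect * di + 2 * math.sin(2 * math.pi * xi) + rng.gauss(0, 1)
         for xi, di in zip(x, d)]
    return x, d, y


x, d, y = simulate()
theta_cross_fit = cross_fit_effect(x, d, y)
theta_naive = naive_slope(d, y)
print(theta_cross_fit, theta_naive)
```

Here the covariate is observed, so adjustment removes the bias. The sketch says nothing about confounders that are not captured by the covariates, which is the gap the rest of the talk addresses.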
Setup: panel with \(N\) locations over \(T\) time points; exposure \(D_{it}\) and outcome \(Y_{it}\)
Latent variable modeling: residual correlations in space and time reflect unmeasured confounders
Panel at \(N\) locations over \(T\) time points: exposure \(D_{it}\), outcome \(Y_{it}\)
Population average treatment effect (PATE) of \(D\) on \(Y\): \[\mathbb{E}\left[ Y_{it}\bigl(\mathbf d^{(1)}_{\mathcal N_{it}}\bigr) -Y_{it}\bigl(\mathbf d^{(2)}_{\mathcal N_{it}}\bigr)\right]\]
In general, PATE is not the same as \[\mathbb{E}\left[Y_{it}\mid \mathbf D_{\mathcal N_{it}} =\mathbf d^{(1)}_{\mathcal N_{it}}\right] - \mathbb{E}\left[Y_{it}\mid\mathbf D_{\mathcal N_{it}} =\mathbf d^{(2)}_{\mathcal N_{it}}\right]\]
Assumption (Limited interference)
For every \((i,t)\) and exposure \(\mathbf d\), \(Y_{it}(\mathbf d)=Y_{it}(\mathbf d_{\mathcal N_{it}})\), where \(\mathbf d_{\mathcal N_{it}} :=\{d_{jk}:(j,k)\in\mathcal N_{it}\}\).
Assumption (Latent positivity)
\(f_{\mathbf D_{\mathcal N_{it}}\mid \mathbf X_{it},\mathbf S_i,\mathbf U_{it}} (\mathbf d\mid \mathbf x, \mathbf s, \mathbf u) > 0\) for every \((\mathbf d, \mathbf x, \mathbf s, \mathbf u)\) in the support.
Assumption (Latent unconfoundedness)
\(Y_{it}(\mathbf d_{\mathcal N_{it}}) \perp \!\!\! \perp\mathbf D_{\mathcal N_{it}} \mid (\mathbf X_{it},\,\mathbf S_i,\,\mathbf U_{it})\) for all \(\mathbf d_{\mathcal N_{it}}\).
Assume the exposures and outcomes are linear in a latent \(M\)-dimensional Gaussian confounder:
\[\begin{aligned} \mathbf U_{t} &\sim \mathcal{N}_M(0, I_M)\\ \mathbf{D}_t &= \nu(\mathbf X) + B \mathbf U_{t} + \mathbf{\xi}_t \\ \mathbf{Y}_t &= g(\mathbf{D}_t, \mathbf{X}) + \Gamma\Sigma_{U\mid D}^{-1/2} \mathbf U_{t} + \mathbf{\epsilon}_t \end{aligned}\] where \(\Sigma_{U\mid D}\) is the conditional covariance of the unmeasured confounders given the exposures
Bias is \(\Gamma\Sigma_{U\mid D}^{-1/2}E[U \mid D]\)
The proposed model implies a factor structure: \[\begin{aligned} \operatorname{Cov}(\mathbf D_t \mid \mathbf{X}) &= BB^{\top} + \Lambda_D\\ \operatorname{Cov}(\mathbf Y_t \mid \mathbf D_t) &= \Gamma\Gamma^{\top} + \Lambda_Y \end{aligned} \]
\(\Gamma\) and \(B\) are the outcome and exposure factor loadings, respectively. Identifying assumptions are well established (anderson1965statistical?).
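A short worked derivation may help connect the model, the bias expression, and the factor structure. It assumes, beyond what is stated above, that \(\mathbf{\xi}_t\) and \(\mathbf{\epsilon}_t\) are Gaussian, independent of \(\mathbf U_t\), with covariances \(\Lambda_D\) and \(\Lambda_Y\), and it writes \(\Sigma_D\) for \(\operatorname{Cov}(\mathbf D_t \mid \mathbf X)\). Standard Gaussian conditioning then gives
\[\begin{aligned}
\Sigma_D &= BB^{\top} + \Lambda_D,\\
E[\mathbf U_t \mid \mathbf D_t, \mathbf X] &= B^{\top}\Sigma_D^{-1}\bigl(\mathbf D_t - \nu(\mathbf X)\bigr),\\
\Sigma_{U\mid D} &= I_M - B^{\top}\Sigma_D^{-1}B,\\
\operatorname{Cov}\bigl(\Gamma\Sigma_{U\mid D}^{-1/2}\mathbf U_t \mid \mathbf D_t\bigr) &= \Gamma\Sigma_{U\mid D}^{-1/2}\,\Sigma_{U\mid D}\,\Sigma_{U\mid D}^{-1/2}\Gamma^{\top} = \Gamma\Gamma^{\top}.
\end{aligned}\]
Under these assumptions the bias term \(\Gamma\Sigma_{U\mid D}^{-1/2}E[U \mid D]\) is linear in the centered exposure, \(\Gamma\Sigma_{U\mid D}^{-1/2}B^{\top}\Sigma_D^{-1}\bigl(\mathbf D_t - \nu(\mathbf X)\bigr)\), and the \(\Sigma_{U\mid D}^{-1/2}\) scaling is what makes the residual outcome covariance take the form \(\Gamma\Gamma^{\top} + \Lambda_Y\).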
Proposition
Under the proposed model and assumptions on factor identifiability, the causal effect \(g(\cdot)\) is partially identified. Let \(\check \gamma_i\) be the \(i\)th row of \(\check \Gamma\). For site \(i\), the omitted variable bias for exposure vector \(\mathbf d\) is
\[
\text{Bias}(\mathbf d)_i
= \check \gamma_i \Theta \check \Sigma_{U \mid D}^{-1/2} \check B^{\top} \Sigma_{D}^{-1} \mathbf d
\in
\pm \|\check \gamma_i\|_2 \,
\bigl\|\check \Sigma_{U \mid D}^{-1/2}\check B^{\top}\Sigma_{D}^{-1}\mathbf d\bigr\|_2.
\]
- \(\Theta \in \mathcal O_M\) is an orthogonal matrix.
- The interval on the right is identifiable for all \(i\).
- \(\Theta\), and hence \(g(\cdot)\), remain unidentified without further assumptions.
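The bound in the proposition follows from the Cauchy-Schwarz inequality, because an orthogonal \(\Theta\) preserves norms. A small numeric check for \(M = 2\), scanning rotations as one family of orthogonal matrices, is sketched below. The vectors are made-up numbers: `gamma_i` stands for a loading row and `v` for the vector \(\check \Sigma_{U \mid D}^{-1/2}\check B^{\top}\Sigma_{D}^{-1}\mathbf d\). Both are hypothetical and not estimates from any data.

```python
import math


def norm(vec):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def rotate(vec, angle):
    """Apply the 2x2 rotation matrix Theta(angle) to a length-2 vector."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]


def bias(gamma_row, vec, angle):
    """gamma_row @ Theta(angle) @ vec for a 2x2 rotation Theta."""
    rotated = rotate(vec, angle)
    return gamma_row[0] * rotated[0] + gamma_row[1] * rotated[1]


gamma_i = [0.6, -0.3]  # hypothetical outcome loading row
v = [0.2, 0.5]         # hypothetical stand-in for Sigma^{-1/2} B^T Sigma_D^{-1} d

bound = norm(gamma_i) * norm(v)
# Scan rotation angles on a 0.1-degree grid.
biases = [bias(gamma_i, v, 2 * math.pi * k / 3600) for k in range(3600)]

print(min(biases), max(biases), bound)
```

Every rotation gives a bias inside \(\pm\) `bound`, and some rotation comes arbitrarily close to each endpoint, so without information about \(\Theta\) the interval cannot be narrowed.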
Assumption (Off-Neighborhood Rank — informal)
In practice, \(M \ll N\) and neighborhoods are small, so this condition is mild and typically satisfied.
Theorem
Under the structural model and identification assumptions, the causal effect functions \(g_i\bigl((D_{jt})_{j \in \mathcal N_i}\bigr)\) are identified for all units \(i\).
Intuition:
The \(N \times N\) bias matrix \(C = \Gamma \Sigma_{U \mid D}^{-1/2} B^\top \Sigma_D^{-1}\) has rank at most \(M\)
For \(j \notin \mathcal N_i\), any association between \(Y_{it}\) and \(D_{jt}\) reflects unmeasured confounding and identifies \(C_{ij}\).
The rank condition ensures enough entries of \(C\) are known to recover the whole matrix.
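The completion idea can be illustrated in the simplest case, a rank-one matrix whose only unknown entries are on the diagonal (each site's neighborhood is taken to be the site itself). This is a toy sketch under those assumptions, not the estimator from the talk, and the function name and numbers are hypothetical. For a rank-one \(C = ab^{\top}\), any \(2 \times 2\) minor vanishes, so \(C_{ii} = C_{ij} C_{ki} / C_{kj}\) for distinct \(i, j, k\).

```python
def complete_rank_one_diagonal(c_obs):
    """Fill in missing diagonal entries of a rank-one matrix.

    c_obs is a square list of lists with None on the diagonal and
    nonzero observed off-diagonal entries; needs at least 3 rows.
    """
    n = len(c_obs)
    completed = [row[:] for row in c_obs]
    for i in range(n):
        # Pick two other sites j and k, distinct from i and from each other.
        j, k = [idx for idx in range(n) if idx != i][:2]
        # Vanishing 2x2 minor: C[i][i] * C[k][j] == C[i][j] * C[k][i].
        completed[i][i] = c_obs[i][j] * c_obs[k][i] / c_obs[k][j]
    return completed


# Hypothetical rank-one "bias matrix" C = a b^T.
a = [1.0, -2.0, 0.5, 3.0]
b = [0.4, 1.5, -1.0, 2.0]
c_true = [[ai * bj for bj in b] for ai in a]

# Only off-neighborhood (here: off-diagonal) entries are observed.
c_obs = [[c_true[i][j] if i != j else None for j in range(4)]
         for i in range(4)]

c_hat = complete_rank_one_diagonal(c_obs)
print([c_hat[i][i] for i in range(4)])
```

With rank \(M > 1\) and larger neighborhoods the same logic requires enough observed entries outside each neighborhood, which is the role of the rank condition.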
Closely related approach from econometrics (Bai 2009)
Treats all parameters as fixed effects, with identification in an asymptotic framework (\(N, T \to \infty\))
Does not explicitly address causal identifiability or omitted variable bias
Assumes outcomes are linear in the exposures
Estimators compared
DML (NUC): Adjusts for observed covariates, treating unmeasured confounding as a smooth function of space and time (Chernozhukov et al. 2018)
IFE: Interactive fixed effects estimator (Bai 2009)
FC (Proposed): Factor confounding approach, explicitly modeling latent confounders
In general, we do not know the true causal effect in observational data
Predictive accuracy is not a good substitute for causal accuracy
Negative control / placebo checks: PM\(_{2.5}\) in the year after birth should not affect birth weight (structural assumption in our model)
Robustness to covariate adjustment: if our method handles unmeasured confounding, estimates should not change much when adding or removing observed covariates
Double machine learning with spatial location as a proxy for unmeasured confounders may not fully account for confounding bias
Latent variable models can help account for unmeasured confounding
For the proposed model, under mild rank and limited interference assumptions, causal effects are identifiable and unbiased estimates can be obtained from spatiotemporal data
More general forms of confounding, non-linear latent variable models, etc.
Analytic results much more complicated in non-linear latent variable models
Need to identify \(E[U \mid D, X]\)
Causal inference with tensor data (multiple outcomes / exposures across space and time)
Mixtures of multiple pollutants, multiple health outcomes, etc.
Thank You!