Latent Variable Models for Spatial Confounding

Causal Inference From Observational Data

Consider a treatment $D$ and outcome $Y$
Interested in the population average treatment effect (PATE) of $D$ on $Y$: $E [Y ∣ \operatorname{do} (D = d)] - E [Y ∣ \operatorname{do} (D = d^{'})]$

In general, the PATE is not the same as $E [Y | D = d] - E [Y | D = d^{'}]$

Confounders

Need to control for $U$ to consistently estimate the causal effect

Confounding bias

Regressing $Y$ on $D$ in the observed data fails because the distribution of $U$ differs between the treatment arms
We try to condition on as many observed confounders as possible to mitigate potential confounding bias
Commonly assumed that there are “no unobserved confounders” (NUC) but this is unverifiable
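
A minimal simulated sketch of confounding bias. The data-generating process, coefficients, and variable names are all hypothetical and chosen only for illustration: a binary treatment `d` and an outcome `y` both depend on a confounder `u`, and the true effect of `d` on `y` is set to 1.

```r
set.seed(1)
n <- 5000
u <- rnorm(n)                       # confounder (hypothetical)
d <- rbinom(n, 1, plogis(1.5 * u))  # treatment is more likely when u is large
y <- 1.0 * d + 2.0 * u + rnorm(n)   # true causal effect of d on y is 1

naive    <- coef(lm(y ~ d))["d"]      # ignores u, so the two arms differ in u
adjusted <- coef(lm(y ~ d + u))["d"]  # conditions on u

c(naive = unname(naive), adjusted = unname(adjusted))
```

In this toy setup the naive coefficient absorbs the difference in `u` between arms, while the adjusted fit should land near 1. With real data `u` is not available, which is the problem the rest of the talk addresses.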

Unmeasured Confounding

When there are unmeasured confounders, additional assumptions are needed to identify causal effects
Sensitivity analysis: how strong would unmeasured confounding have to be to explain away the observed association? (Cinelli and Hazlett 2020)
Null controls: use negative control exposures or outcomes to detect and adjust for unmeasured confounding (Shi, Miao, and Tchetgen 2020)

A Simple Example

Observational data from the National Health and Nutrition Examination Survey (NHANES) on alcohol consumption.
Light alcohol consumption is positively correlated with blood levels of HDL (“good cholesterol”)

Define “light alcohol consumption” as 1–2 alcoholic beverages per day
Non-drinkers: self-reported drinking of one drink a week or less
Control for age, gender, and an indicator for educational attainment

HDL and alcohol consumption

What must be true for this correlation to be non-causal?

Blood mercury and alcohol consumption

summary(lm(Y[, "Methylmercury"] ~ drinking + X))


Call:
lm(formula = Y[, "Methylmercury"] ~ drinking + X)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3570 -0.7363 -0.0728  0.6242  4.1127 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.442044   0.096385   4.586 4.91e-06 ***
drinking     0.364096   0.097244   3.744 0.000188 ***
Xage         0.008186   0.001536   5.330 1.14e-07 ***
Xgender     -0.062664   0.052290  -1.198 0.230966    
Xeduc        0.269815   0.054126   4.985 6.95e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.975 on 1434 degrees of freedom
Multiple R-squared:  0.05209,   Adjusted R-squared:  0.04945 
F-statistic:  19.7 on 4 and 1434 DF,  p-value: 8.41e-16


But… no plausible causal mechanism in this case

Residual Correlation

hdl_fit <- lm(Y[, "HDL"] ~ drinking + X)
mercury_fit <- lm(Y[, "Methylmercury"] ~ drinking + X)

cor.test(hdl_fit$residuals, mercury_fit$residuals)


    Pearson's product-moment correlation

data:  hdl_fit$residuals and mercury_fit$residuals
t = 3.7569, df = 1437, p-value = 0.0001789
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.04718758 0.14953581
sample estimates:
      cor 
0.0986225

Residual correlation might be indicative of confounding bias
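
To see why residual correlation can signal confounding, here is a small simulated sketch. It is not the NHANES analysis; all variables and coefficients are hypothetical. Two outcomes share an unmeasured confounder `u`, and the exposure has no effect at all on the second outcome.

```r
set.seed(2)
n <- 2000
u  <- rnorm(n)                                # shared unmeasured confounder
x  <- rnorm(n)                                # observed covariate
d  <- 0.8 * u + 0.5 * x + rnorm(n)            # exposure depends on u
y1 <- 0.5 * d + 1.0 * u + 0.3 * x + rnorm(n)  # first outcome
y2 <- 0.0 * d + 1.0 * u - 0.2 * x + rnorm(n)  # second outcome, no effect of d

fit1 <- lm(y1 ~ d + x)
fit2 <- lm(y2 ~ d + x)

# The part of u not explained by d and x is left in both sets of residuals
res_cor <- cor(residuals(fit1), residuals(fit2))
spurious_effect <- unname(coef(fit2)["d"])

c(res_cor = res_cor, spurious_effect = spurious_effect)
```

In this sketch the same `u` that biases the coefficient on `d` also leaves correlated residuals across outcomes, which is the pattern the latent variable approach exploits.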

Multivariate Causal Inference and Unmeasured Confounding

Multiple outcomes: JASA (2023)
Multiple exposures: JMLR (2024) and AISTATS (2022)
Multiple outcomes and exposures (preprint)
This talk: spatial confounding in environmental epidemiology (preprint)

The effect of Pollution on Birth Weight

How does prenatal exposure to PM $_{2.5}$ affect birth weight?

PM $_{2.5}$ and Birth Weight

The effect of Pollution on Birth Weight


Call:
lm(formula = bw_mean ~ pm25, data = mutate(pm25_data, pm25 = pm25/10))

Residuals:
    Min      1Q  Median      3Q     Max 
-501.56  -38.76   -1.35   37.71  330.60 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3337.181      2.784 1198.52   <2e-16 ***
pm25         -47.688      2.857  -16.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 64.38 on 8461 degrees of freedom
Multiple R-squared:  0.03187,   Adjusted R-squared:  0.03175 
F-statistic: 278.5 on 1 and 8461 DF,  p-value: < 2.2e-16

A 10 μg/m $^{3}$ increase in PM $_{2.5}$ is associated with a 48 g decrease in birth weight

The effect of Pollution on Birth Weight

Outcome: 2018–2024 ZIP code–level birth counts by weight category (California Vital Data, Cal-ViDa Query Tool)
Exposure: Annual ZIP code–level PM $_{2.5}$ concentrations, derived from high-resolution estimates (Shen et al. 2024)
Observed covariates: Maternal age, race, education, nativity, prenatal care timing, household income, geographic coordinates, and calendar year
Potential unmeasured confounders: neighborhood deprivation, other socioeconomic factors, cultural and lifestyle factors

The effect of Pollution on Birth Weight

Meta-analyses: Birth weight decreases by 16–28 g per 10 μg/m $^{3}$ increase in PM $_{2.5}$ exposure during pregnancy (Gong et al. 2022).
High between-study heterogeneity (range −79.3 g to +24.9 g)
None of these studies handles unmeasured confounders that are not a measurable function of space
Motivation: challenges from confounding and measurement variability call for robust causal inference

Spatial Confounding

Spatial confounding: unmeasured confounders vary over space and are correlated with exposure

Common to assume unmeasured confounders are a measurable function of space (Gilbert, Datta, and Ogburn 2021)
- Spatial location then can serve as a proxy for unmeasured confounders
- Nonspatial component in the exposure required

Use Double Machine Learning (DML) to adjust for spatial location and other observed covariates (Chernozhukov et al. 2018)

Double Machine Learning

Partially linear model: $Y = β D + g (X) + ϵ$ .

Estimate $μ (X) = E [Y ∣ X]$ and $e (X) = E [D ∣ X]$ using nonparametric / ML methods

Let $\tilde{Y} = Y - \hat{μ} (X)$ and $\tilde{D} = D - \hat{e} (X)$ and estimate the causal effect by regressing $\tilde{Y}$ on $\tilde{D}$ to get $\hat{β}$.

This approach is doubly robust
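
As a sketch of why the residual-on-residual regression targets $β$, assume $E [ϵ ∣ D, X] = 0$ and write $e (X)$ for the conditional mean of the exposure $D$ given $X$ (both are assumptions of this sketch). Taking conditional expectations of the partially linear model given $X$ and subtracting:

$\begin{aligned} μ (X) & = β e (X) + g (X) \\ Y - μ (X) & = β (D - e (X)) + ϵ \\ β & = \frac{E [(D - e (X)) (Y - μ (X))]}{E [(D - e (X))^{2}]} \end{aligned}$

The unknown $g (X)$ cancels, so only the two conditional means need to be estimated.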

Cross-fitting

We need to avoid overfitting / double-dipping using a technique analogous to cross-validation.

Divide the data into $K$ folds.

For each fold $k$:
- Train the outcome and treatment models on the other $K - 1$ folds and predict $\hat{μ} (X)$ and $\hat{e} (X)$ on the $k$th held-out fold.
- Form the held-out residuals ${\tilde{Y}}^{(k)} = Y - \hat{μ} (X)$ and ${\tilde{D}}^{(k)} = D - \hat{e} (X)$, then regress ${\tilde{Y}}^{(k)}$ on ${\tilde{D}}^{(k)}$ to estimate ${\hat{β}}^{(k)}$

Estimate $\hat{β}$ as $\frac{1}{K} \sum_{k = 1}^{K} {\hat{β}}^{(k)}$
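
The cross-fitting steps above can be sketched in base R. This is a toy illustration under stated assumptions, not the analysis used in the talk: the data are simulated with a known $β = 0.5$, and a degree-7 polynomial regression stands in for the "ML" nuisance learners.

```r
set.seed(3)
n <- 2000; K <- 5; beta <- 0.5
x <- runif(n, -2, 2)
d <- cos(x) + rnorm(n)                  # exposure depends nonlinearly on x
y <- beta * d + sin(2 * x) + rnorm(n)   # partially linear outcome model
dat <- data.frame(x, d, y)

fold <- sample(rep(seq_len(K), length.out = n))  # random fold assignment
beta_k <- numeric(K)

for (k in seq_len(K)) {
  train <- dat[fold != k, ]
  test  <- dat[fold == k, ]

  # Nuisance models are fit on the other K - 1 folds only
  mu_fit <- lm(y ~ poly(x, 7), data = train)
  e_fit  <- lm(d ~ poly(x, 7), data = train)

  # Residualize the held-out fold with out-of-fold predictions
  y_res <- test$y - predict(mu_fit, newdata = test)
  d_res <- test$d - predict(e_fit,  newdata = test)

  # No-intercept regression of outcome residual on exposure residual
  beta_k[k] <- sum(d_res * y_res) / sum(d_res^2)
}

beta_hat <- mean(beta_k)
beta_hat
```

Swapping the polynomial fits for a random forest or boosted model changes only the two `lm` calls; the fold logic stays the same.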

Why DML alone may not be enough

In practice, confounders vary over space with idiosyncratic differences, mixing spatial and non-spatial variations
Interference: outcomes may depend on exposures at neighboring locations and past time points
DML does not explicitly address causal identifiability or omitted variable bias

We will further leverage variation over time and space to help identify causal effects in the presence of unmeasured confounding

A panel data approach

Setup: panel with $N$ locations over $T$ time points; exposure $D_{i t}$ and outcome $Y_{i t}$
Latent variable modeling: residual correlations in space and time reflect unmeasured confounders
- capture long-range/global correlations
- borrow strength from sparse or irregular data
- robust to nonstationarity

A panel data approach

Negative controls: exposures or outcomes from other locations/times serve as negative controls.
- Allow for some spillover / interference

Model-agnostic: fit any spatiotemporal model for observed data and apply our method to the residuals

Spatial Causal Inference

Panel at $N$ locations over $T$ time points: exposure $D_{i t}$ , outcome $Y_{i t}$
Population average treatment effect (PATE) of $D$ on $Y$ : $E [Y_{i t} (d_{N_{i t}}^{(1)}) - Y_{i t} (d_{N_{i t}}^{(2)})]$
In general, PATE is not the same as $E [Y_{i t} ∣ D_{N_{i t}} = d_{N_{i t}}^{(1)}] - E [Y_{i t} ∣ D_{N_{i t}} = d_{N_{i t}}^{(2)}]$

Assumptions

Assumption (Limited interference)

For every $(i, t)$ and exposure $d$, $Y_{i t} (d) = Y_{i t} (d_{N_{i t}})$, where $d_{N_{i t}} := \{d_{j k} : (j, k) \in N_{i t}\}$.

Assumption (Latent positivity)

$f_{D_{N_{i t}} ∣ X_{i t}, S_{i}, U_{i t}} (d ∣ x, s, u) > 0$ for every $(d, x, s, u)$ in the support.

Assumption (Latent unconfoundedness)

$Y_{i t} (d_{N_{i t}}) \perp\!\!\!\perp D_{N_{i t}} ∣ (X_{i t}, S_{i}, U_{i t})$ for all $d_{N_{i t}}$.

Structural Equation Model

Assume the exposures and outcomes are linear in a latent $M$-dimensional Gaussian confounder:

$\begin{aligned} U_{t} & \sim N_{M} (0, I_{M}) \\ D_{t} & = ν (X) + B U_{t} + ξ_{t} \\ Y_{t} & = g (D, X) + Γ Σ_{U ∣ D}^{- 1 / 2} U_{t} + ϵ_{t} \end{aligned}$

where $E [U ∣ D]$ and $Σ_{U ∣ D}$ are the conditional mean and covariance of the unmeasured confounders given the exposures

Bias is $Γ Σ_{U ∣ D}^{- 1 / 2} E [U ∣ D]$
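
A sketch of where the conditional moments come from, assuming (beyond what is stated above) that $ξ_{t}$ is Gaussian with covariance $Λ_{D}$ and independent of $U_{t}$. Then $(U_{t}, D_{t})$ are jointly Gaussian given $X$, and standard Gaussian conditioning gives

$\begin{aligned} Σ_{D} & = B B^{⊤} + Λ_{D} \\ E [U_{t} ∣ D_{t}, X] & = B^{⊤} Σ_{D}^{- 1} (D_{t} - ν (X)) \\ Σ_{U ∣ D} & = I_{M} - B^{⊤} Σ_{D}^{- 1} B \end{aligned}$

so the bias term is linear in the centered exposure: $Γ Σ_{U ∣ D}^{- 1 / 2} B^{⊤} Σ_{D}^{- 1} (D_{t} - ν (X))$.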

When is the bias (partially) identifiable?

Structural Equation Model

The proposed model implies a factor structure: $\begin{aligned} Cov (D_{t} ∣ X) & = B B^{⊤} + Λ_{D} \\ Cov (Y_{t} ∣ D_{t}) & = Γ Γ^{⊤} + Λ_{Y} \end{aligned}$

$Γ$ and $B$ are the outcome and exposure factor loadings, respectively. Identifying assumptions are well established (anderson1965statistical?).
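
A quick simulated check of the exposure factor structure, as a sketch with made-up dimensions and loadings and with $ν (X) = 0$: the sample covariance of simulated exposures should be close to $B B^{⊤} + Λ_{D}$.

```r
set.seed(4)
N <- 6; M <- 2; n_t <- 20000              # sites, latent factors, time points
B <- matrix(runif(N * M, -1, 1), N, M)    # exposure loadings (hypothetical)
lambda_d <- runif(N, 0.5, 1.5)            # diagonal of Lambda_D (hypothetical)

U  <- matrix(rnorm(n_t * M), n_t, M)      # rows are U_t ~ N(0, I_M)
xi <- matrix(rnorm(n_t * N), n_t, N) %*% diag(sqrt(lambda_d))  # idiosyncratic noise
D  <- U %*% t(B) + xi                     # rows are D_t with nu(X) = 0

implied <- B %*% t(B) + diag(lambda_d)    # low-rank plus diagonal covariance
max_abs_err <- max(abs(cov(D) - implied))
max_abs_err
```

The same low-rank-plus-diagonal pattern is what a factor model fit to the exposure residuals would try to recover, up to rotation of the loadings.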

Bounding the Bias

Proposition

Under the proposed model and assumptions on factor identifiability, the causal effect $g (\cdot)$ is partially identified. Let $\check{γ}_{i}$ be the $i$th row of $\check{Γ}$. For site $i$, the omitted variable bias for exposure vector $d$ is
$\operatorname{Bias} (d)_{i} = \check{γ}_{i} Θ \check{Σ}_{U ∣ D}^{- 1 / 2} \check{B}^{⊤} Σ_{D}^{- 1} d \in \pm \lVert \check{γ}_{i} \rVert_{2} \lVert \check{Σ}_{U ∣ D}^{- 1 / 2} \check{B}^{⊤} Σ_{D}^{- 1} d \rVert_{2} .$
- $Θ \in O_{M}$ is an orthogonal matrix.
- The interval on the right is identifiable for all $i$ .
- $Θ$ , and hence $g (\cdot)$ , remain unidentified without further assumptions.
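
The bound follows from the Cauchy-Schwarz inequality together with the fact that an orthogonal $Θ$ preserves Euclidean norms. A small numerical sketch, using random stand-in vectors rather than fitted quantities:

```r
set.seed(5)
M <- 3
gamma_i <- rnorm(M)  # stand-in for the i-th row of the identified outcome loadings
v <- rnorm(M)        # stand-in for Sigma_{U|D}^{-1/2} B^T Sigma_D^{-1} d

bound <- sqrt(sum(gamma_i^2)) * sqrt(sum(v^2))

# Draw random orthogonal matrices via QR and record the implied bias
bias_draws <- replicate(2000, {
  Theta <- qr.Q(qr(matrix(rnorm(M * M), M, M)))
  sum(gamma_i * drop(Theta %*% v))
})

range(bias_draws)  # every draw lies inside [-bound, bound]
bound
```

No choice of rotation can push the bias outside the interval, which is why the interval is identified even though $Θ$ is not.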

Limits on Interference

Assumption (Off-Neighborhood Rank — informal)

Let $N_{i}$ be the interference neighborhood of unit $i$ (units $j$ such that $\partial g_{i} (D_{t}) / \partial D_{j t} \neq 0$ ).
There exist $M$ “informative” units $i_{1}, \dots, i_{M}$ such that:
1. Their outcome loadings $\{γ_{i_{ℓ}}\}_{ℓ = 1}^{M}$ are linearly independent (span $\mathbb{R}^{M}$ );
2. Considering only indices outside each neighborhood ( $j \notin N_{i_{ℓ}}$ ), the exposure–confounder directions have full row rank $M$ .

In practice, $M ≪ N$ and neighborhoods are small, so this condition is mild and typically satisfied.

Identification Result

Theorem

Under the structural model and identification assumptions, the causal effect functions $g_{i} ((D_{j t})_{j \in N_{i}})$ are identified for all units $i$ .

Intuition:

$N \times N$ bias matrix $C = Γ Σ_{U ∣ D}^{- 1 / 2} B^{⊤} Σ_{D}^{- 1}$ has rank $M$
For $j \notin N_{i}$ , any association between $Y_{i t}$ and $D_{j t}$ reflects unmeasured confounding and identifies $C_{i j}$ .
The rank condition ensures enough entries of $C$ are known to recover the whole matrix.
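
A toy sketch of the low-rank intuition, not the estimator from the paper: for a rank-$M$ matrix, an unobserved entry can be recovered from observed entries whenever a suitable $M \times M$ block is known and invertible. The matrix, the hidden entry, and the index sets below are all hypothetical.

```r
set.seed(6)
N <- 8; M <- 2
C <- matrix(rnorm(N * M), N, M) %*% matrix(rnorm(M * N), M, N)  # rank-M matrix

i <- 1; j <- 1     # entry treated as unobserved (inside the neighborhood)
rows <- c(3, 4)    # M other units whose entries are observed
cols <- c(5, 6)    # M exposure sites outside the relevant neighborhoods

# Rank-M identity: C[i, j] = C[i, cols] %*% solve(C[rows, cols]) %*% C[rows, j]
recovered <- drop(C[i, cols] %*% solve(C[rows, cols]) %*% C[rows, j])

c(truth = C[i, j], recovered = recovered)
```

Only off-neighborhood entries appear on the right-hand side, which mirrors how the rank condition lets known entries of $C$ pin down the rest.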

Interactive Fixed Effects Model

Closely related approach from econometrics (Bai 2009)
Treats all parameters as fixed effects, with identification in an asymptotic framework ($N, T \to \infty$)
Does not explicitly address causal identifiability or omitted variable bias
Assumes outcomes are linear in the exposures

Simulation Study

Estimators compared

DML (NUC): Adjusts for observed covariates, treating unmeasured confounding as a smooth function of space and time (Chernozhukov et al. 2018)
IFE: Interactive fixed effects estimator (Bai 2009)
FC (Proposed): Factor confounding approach, explicitly modeling latent confounders

Simulation: Comparing Estimators

Simulation: Spatial Interference

Effect of PM $_{2.5}$ on Birth Weight

Meta-analyses: Birth weight decreases by 16–28 g per 10 μg/m $^{3}$ increase in PM $_{2.5}$ exposure during pregnancy.
Outcome: 2018–2024 ZIP code–level birth counts by weight category (California Vital Data, Cal-ViDa Query Tool)
Exposure: Annual ZIP code–level PM $_{2.5}$ concentrations, derived from high-resolution estimates (Shen et al. 2024)
Observed confounders: Maternal age, race, education, nativity, prenatal care timing, household income, geographic coordinates, and calendar year

Effect of PM $_{2.5}$ on Birth Weight

Validation in Causal Settings

In general, we do not know the true causal effect in observational data
Predictive accuracy is not a good substitute for causal accuracy
Negative control / placebo checks: PM $_{2.5}$ in the year after birth should not affect birth weight (structural assumption in our model)
Robustness to covariate adjustment: if our method handles unmeasured confounding, estimates should not change much when adding or removing observed covariates

Robustness to Covariate Adjustment

Results

Takeaways

Double machine learning with spatial location as a proxy for unmeasured confounders may not fully account for confounding bias
Latent variable models can help account for unmeasured confounding
For the proposed model, with mild rank and partial interference assumptions, causal effects are identifiable and unbiased estimates can be achieved from spatiotemporal data

Future Directions

More general forms of confounding, non-linear latent variable models, etc.
Analytic results much more complicated in non-linear latent variable models
Need to identify $E [U ∣ D, X]$
Causal inference with tensor data (multiple outcomes / exposures across space and time)
Mixtures of multiple pollutants, multiple health outcomes, etc.

Acknowledgements

Lead author: Jiaxi Wu (former PhD student, now at Amazon)

Paper: A Latent Factor Panel Approach to Spatiotemporal Causal Inference, Wu and Franks

References

Bai, Jushan. 2009. “Panel Data Models with Interactive Fixed Effects.” Econometrica 77 (4): 1229–79.

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–C68.

Cinelli, Carlos, and Chad Hazlett. 2020. “Making Sense of Sensitivity: Extending Omitted Variable Bias.” Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (1): 39–67.

Gilbert, Brian, Abhirup Datta, and Elizabeth Ogburn. 2021. “A Causal Inference Framework for Spatial Confounding.” arXiv Preprint arXiv:2112.14946.

Gong, Chen, Jianmei Wang, Zhipeng Bai, David Q Rich, and Yujuan Zhang. 2022. “Maternal Exposure to Ambient PM2.5 and Term Birth Weight: A Systematic Review and Meta-Analysis of Effect Estimates.” Science of The Total Environment 807: 150744.

Shen, Siyuan, Chi Li, Aaron Van Donkelaar, Nathan Jacobs, Chenguang Wang, and Randall V Martin. 2024. “Enhancing Global Estimation of Fine Particulate Matter Concentrations by Including Geophysical a Priori Information in Deep Learning.” ACS ES&T Air 1 (5): 332–45.

Shi, Xu, Wang Miao, and Eric Tchetgen Tchetgen. 2020. “A Selective Review of Negative Control Methods in Epidemiology.” Current Epidemiology Reports 7 (4): 190–202.

Thank You!