Theory and Foundations of Statistics in the Era of Big Data
Conference Honoring Bahadur and Basu
Duke University
Circus Owner plans to ship his 50 elephants
Circus Statistician is shocked!
Everyone was happy when Sambo was selected!
Statistical Modelling Agency has ad out for help sampling models
Circus statistician applies
Agency: we want to implement Bayesian Model Averaging and need your help in sampling models and estimation
Former Circus Statistician: I know all about sampling! Tell me more!
and being an instructor of statistics does not pay enough
Observe response vector \(\mathbf{Y}\) with predictor variables \(X_1 \dots X_p\).
Model for data under a specific model \({\cal M}_\gamma\): \[\begin{equation*} \mathbf{Y}\mid \alpha, \boldsymbol{\beta_\gamma}, \phi, {\cal M}_\gamma\sim \textsf{N}(\mathbf{1}_n\alpha + \mathbf{X}_{\boldsymbol{\gamma}}\boldsymbol{\beta_\gamma}, \mathbf{I}_n/\phi) \label{eq:linear.model} \end{equation*}\]
Models \({\cal M}_\gamma\) encoded by \(\boldsymbol{\gamma}= (\gamma_1, \ldots \gamma_p)^T\) binary vector with \(\gamma_j = 1\) indicating that \(X_j\) is included in model \({\cal M}_\gamma\) where \[\begin{align*} \gamma_j = 0 & \Leftrightarrow \beta_j = 0 \\ \gamma_j = 1 & \Leftrightarrow \beta_j \neq 0 \end{align*}\]
\(\mathbf{X}_{\boldsymbol{\gamma}}\): the \(n \times p_{\boldsymbol{\gamma}}\) design matrix for model \({\cal M}_\gamma\)
\(\boldsymbol{\beta_\gamma}\): the \(p_{\boldsymbol{\gamma}}\) vector of non-zero regression coefficients under \({\cal M}_\gamma\)
intercept \(\alpha\), precision \(\phi\) common to all models
prior distributions on all unknowns \((\boldsymbol{\beta_\gamma}, \boldsymbol{\gamma}, \alpha, \phi)\) and turn the Bayesian crank to get posterior distributions!
key component is the posterior distribution over models \[ p({\cal M}_\gamma\mid \mathbf{Y}) = \frac {p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)} {\sum_{\boldsymbol{\gamma}\in \Gamma} p(\mathbf{Y}\mid {\cal M}_\gamma)p({\cal M}_\gamma)} \]
for nice priors, we can integrate out the parameters \(\boldsymbol{\theta}_{\boldsymbol{\gamma}}= (\boldsymbol{\beta_\gamma}, \alpha, \phi)\) to obtain the marginal likelihood of \({\cal M}_\gamma\) \[ p(\mathbf{Y}\mid {\cal M}_\gamma) = \int p(\mathbf{Y}\mid \boldsymbol{\theta}_{\boldsymbol{\gamma}}, {\cal M}_\gamma)p(\boldsymbol{\theta}_{\boldsymbol{\gamma}}\mid {\cal M}_\gamma) d\boldsymbol{\theta}_{\boldsymbol{\gamma}} \]
posterior distribution of quantities \(\Delta\) of interest under BMA \[ \sum_{\boldsymbol{\gamma}\in \Gamma} p({\cal M}_\gamma\mid \mathbf{Y}) p(\Delta \mid \mathbf{Y}, {\cal M}_\gamma) \]
examples of \(\Delta\): \(\textsf{E}[\mathbf{Y}]\), the predictive distribution of \(\mathbf{Y}^*\), or \(\gamma_j\) (yielding marginal inclusion probabilities)
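As a concrete toy illustration of the machinery above, here is a minimal R sketch (not the BAS implementation): all \(2^p\) models are enumerated for a small \(p\), \(-\mathrm{BIC}/2\) stands in for \(\log p(\mathbf{Y}\mid {\cal M}_\gamma)\) (an assumption; the priors in the talk give a different closed form), and the model prior is uniform.

```r
## Toy data and full enumeration of the 2^p models; variable names are
## illustrative, and -BIC/2 is only a stand-in for the log marginal likelihood.
set.seed(1)
n <- 100; p <- 4
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
Y <- 1 + X[, 1] - 0.5 * X[, 3] + rnorm(n)

models  <- as.matrix(expand.grid(rep(list(0:1), p)))    # rows are gamma vectors
logmarg <- apply(models, 1, function(g) {
  vars <- colnames(X)[g == 1]
  f    <- reformulate(if (length(vars)) vars else "1", response = "Y")
  -BIC(lm(f, data = data.frame(Y = Y, X))) / 2          # stand-in for log p(Y | M)
})
logpost <- logmarg + log(1 / nrow(models))              # uniform prior p(M) = 2^-p
post    <- exp(logpost - max(logpost))
post    <- post / sum(post)                             # p(M | Y) by enumeration

post %*% models                                         # BMA marginal inclusion probabilities
```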
The Former Circus Statistician asked: “What if you can’t enumerate all of the models in \(\Gamma\)?”
With \(88\) predictor variables there are \(2^{88} > 3 \times 10^{26}\) models!
more than the number of stars (\(10^{24}\)) in the universe!
Well, then we just use a sample, said the agency statistician
simple random sampling with or without replacement doesn’t work very well (inefficient)
Box used importance sampling in the 80’s, but that didn’t seem to scale
non-random samples using Branch & Bound, but those are biased (and also do not scale)
but then Bayesians discovered MCMC so Markov Chain Monte Carlo is now pretty standard for implementation of BMA
design a Markov Chain to transition through \(\Gamma\) so that ultimately models are sampled proportional to their posterior probabilities \[ p({\cal M}_\gamma\mid \mathbf{Y}) \propto p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma) \]
propose a new model from \(q(\boldsymbol{\gamma}^* \mid \boldsymbol{\gamma})\)
accept moving to \(\boldsymbol{\gamma}^*\) with probability \[ \textsf{MH} = \min\left(1, \frac{p({\cal M}_{\boldsymbol{\gamma}^*} \mid \mathbf{Y})/q(\boldsymbol{\gamma}^* \mid \boldsymbol{\gamma})} {p({\cal M}_\gamma\mid \mathbf{Y})/q(\boldsymbol{\gamma}\mid \boldsymbol{\gamma}^*)}\right) \]
otherwise stay at model \({\cal M}_\gamma\)
the normalizing constant is the same in the numerator and denominator so we just need \(p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)\)!
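Continuing the toy sketch above, here is one MH pass over model space with a single-flip (add/delete) proposal; that proposal is symmetric, so the \(q\) terms cancel in the ratio, and \(-\mathrm{BIC}/2\) again stands in for the log marginal likelihood.

```r
## Random single-flip MH over model space (toy sketch, reusing X, Y, p above).
log_post_unnorm <- function(g) {            # log p(Y | M) + log p(M), uniform prior
  vars <- colnames(X)[g == 1]
  f    <- reformulate(if (length(vars)) vars else "1", response = "Y")
  -BIC(lm(f, data = data.frame(Y = Y, X))) / 2
}

T_mcmc <- 5000
g  <- rep(0, p)
lp <- log_post_unnorm(g)
draws <- matrix(NA, T_mcmc, p)
for (t in 1:T_mcmc) {
  j       <- sample(p, 1)                   # propose flipping one coordinate of gamma
  g_star  <- g; g_star[j] <- 1 - g_star[j]
  lp_star <- log_post_unnorm(g_star)
  if (log(runif(1)) < lp_star - lp) {       # symmetric proposal: the q's cancel
    g <- g_star; lp <- lp_star
  }
  draws[t, ] <- g
}
```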
The Former Circus Statistician asks: “So how do you estimate the probabilities of models?”
we just use the Monte Carlo frequencies of models or ergodic averages \[\begin{align*} \widehat{p({\cal M}_\gamma\mid \mathbf{Y})} & = \frac{\sum_{t = 1}^T I({{\cal M}}_t = {\cal M}_\gamma)} {T} = \frac{n_{\boldsymbol{\gamma}}\, I({\cal M}_\gamma\in S)} {\sum_{\boldsymbol{\gamma}' \in S} n_{\boldsymbol{\gamma}'}} \end{align*}\]
\(T\) = # MCMC samples
\(S\) is the collection of unique sampled models
\(n_{\boldsymbol{\gamma}}\) is the number of times \({\cal M}_\gamma\) appears among the \(T\) MCMC samples
\(n = |S|\), the number of unique sampled models, with \(\sum_{\boldsymbol{\gamma}\in S} n_{\boldsymbol{\gamma}} = T\) . . .
asymptotically unbiased as \(T \to \infty\)
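From the draws in the sketch above, the Monte Carlo frequency estimates are just tabulated visit frequencies:

```r
## Monte Carlo frequency estimates \hat p(M | Y) = n_gamma / T from the MCMC draws.
ids  <- apply(draws, 1, paste, collapse = "")
freq <- table(ids) / nrow(draws)
head(sort(freq, decreasing = TRUE))          # most frequently visited models
```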
What! You just use the Monte Carlo frequencies! exclaimed the Former Circus Statistician
what happened to the (observed) marginal likelihoods \(\times\) prior probabilities?
since you are sampling from a finite population, can you use survey sampling estimates of \[C = \sum_{\boldsymbol{\gamma}\in \Gamma } p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)\] with the observed \(p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)\)?
Agency Statistician: We tried using \[ \widehat{p({\cal M}_\gamma\mid \mathbf{Y})} = \frac {p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)} {\sum_{\boldsymbol{\gamma}' \in S} p(\mathbf{Y}\mid {\cal M}_{\boldsymbol{\gamma}'})p({\cal M}_{\boldsymbol{\gamma}'})} \]
it’s Fisher Consistent, but biased in finite samples
that’s why we need you!
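For reference, the renormalized estimator over the unique sampled models \(S\), continuing the earlier sketch:

```r
## Renormalize p(Y | M) p(M) over the set S of unique sampled models.
uniq   <- unique(draws)                      # the unique models S
logm   <- apply(uniq, 1, log_post_unnorm)
renorm <- exp(logm - max(logm))
renorm <- renorm / sum(renorm)               # \hat p(M | Y) for M in S
```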
Former Circus Statistician: Let’s try the Hansen-Hurwitz estimator in PPS sampling
Since \(C\) is unknown (under PPS the draw probabilities are \(\rho_i = p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)/C\)), we can apply the ratio of Hansen-Hurwitz estimators of \(C\) and of \(\sum_{\boldsymbol{\gamma}\in \Gamma} p({\cal M}_\gamma) = 1\) \[ \hat{C} = \frac{\frac1 n \sum_{i=1}^n \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\rho_i}}{ \frac1 n \sum_{i=1}^n \frac{p({{\cal M}}_i)}{\rho_i}} = \left[ \frac{1}{n}\sum_{i=1}^n \frac{1}{p(\mathbf{Y}\mid {{\cal M}}_i)} \right]^{-1} \]
But this is the “infamous” harmonic mean estimator, said the Agency Statistician! While consistent, it is highly unstable!
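A toy illustration of why the harmonic mean estimate is “infamous,” with made-up marginal likelihoods and a uniform model prior (so \(C\) is just the average likelihood): repeating the experiment shows how heavy-tailed the estimate can be, since it is dominated by occasional draws with tiny likelihoods.

```r
## Harmonic mean estimate of C from PPS (posterior) draws over a toy model space.
set.seed(2)
K   <- 5000
lik <- exp(rnorm(K, sd = 4))                 # toy marginal likelihoods p(Y | M_i)
C   <- mean(lik)                             # C = sum_i lik_i * (1/K), uniform prior
hat_C <- replicate(200, {
  idx <- sample(K, 200, replace = TRUE, prob = lik)  # draws proportional to p(M | Y)
  1 / mean(1 / lik[idx])                             # harmonic mean estimate
})
c(truth = C, median = median(hat_C), sd = sd(hat_C)) # large spread = instability
```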
The Former Circus Statistician (hoping to keep this gig):
Wait! - I know that the Horvitz-Thompson (HT) estimator uses only the unique sampled values and not the frequencies
HT dominates the Hansen-Hurwitz in terms of variance!
but we can’t use MCMC…
Former Circus Statistician: So tell me more about \(q\) in MCMC
ratio \(\frac{p(\mathbf{Y}\mid {\cal M}_\gamma^*) p({\cal M}_\gamma^*)}{q({\cal M}_\gamma^* \mid {\cal M}_\gamma)}\) looks like importance sampling
what are choices for \(q({\cal M}_\gamma^* \mid {\cal M}_\gamma)\)?
what about independent proposals \(q({\cal M}_\gamma^*)\)?
Griffin et al (2021):
Independent Metropolis Hastings \[q(\boldsymbol{\gamma}) = \prod \pi_i^{\gamma_i} (1 - \pi_i)^{1-\gamma_i}\]
Product of Independent Bernoullis
Use adaptive Independent Metropolis Hastings to learn \(\pi_i\) from past samples plus Rao-Blackwellization
optimal for target where \(\gamma_j\) are independent a posteriori
still uses Metropolis-Hastings Ratio to accept
And they use Monte Carlo frequencies!
but the independent proposal does not work well for sampling with replacement / importance sampling in general
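A sketch of an independence MH step with the product-of-independent-Bernoulli proposal, continuing the toy example; here the \(\pi_i\) are crudely set from earlier draws rather than by the adaptive Rao-Blackwellized scheme of Griffin et al.

```r
## Independence MH with q(gamma) = prod pi_i^gamma_i (1 - pi_i)^(1 - gamma_i).
pi_hat <- pmin(pmax(colMeans(draws), 0.05), 0.95)    # crude estimate of Pr(gamma_j = 1)
log_q  <- function(g) sum(dbinom(g, 1, pi_hat, log = TRUE))

g   <- rbinom(p, 1, pi_hat)
lp  <- log_post_unnorm(g)
acc <- 0
for (t in 1:1000) {
  g_star  <- rbinom(p, 1, pi_hat)                    # independent proposal q(gamma*)
  lp_star <- log_post_unnorm(g_star)
  ## MH ratio for an independence sampler: (target / proposal) at gamma* vs gamma
  if (log(runif(1)) < (lp_star - log_q(g_star)) - (lp - log_q(g))) {
    g <- g_star; lp <- lp_star; acc <- acc + 1
  }
}
acc / 1000                                           # acceptance rate
```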
The joint posterior distribution of \(\boldsymbol{\gamma}\) (dropping \(\mathbf{Y}\)) may be factored: \[p({\cal M}_\gamma\mid \mathbf{Y}) \equiv p(\boldsymbol{\gamma}) = \prod_{j = 1}^p p(\gamma_j \mid \boldsymbol{\gamma}_{<j}) \] where \(\boldsymbol{\gamma}_{< j}\equiv \{\gamma_k\}\) for \(k < j\) and \(p(\gamma_1 \mid \boldsymbol{\gamma}_{<1}) \equiv p(\gamma_1)\).
As \(\gamma_j\) are binary, re-express as \[\begin{equation*} p(\boldsymbol{\gamma}) = \prod_{j=1}^p(\rho_{j \mid <j})^{\gamma_j}{(1-\rho_{j \mid <j})}^{1-\gamma_j} \end{equation*}\] where \(\rho_{j \mid <j} \equiv \Pr(\gamma_j = 1 \mid \boldsymbol{\gamma}_{<j})\) and \(\rho_{1 \mid < 1} = \rho_1\), the marginal probability.
Product of Dependent Bernoullis
Factor proposal \[q(\boldsymbol{\gamma}) = \prod_{j = 1}^p q(\gamma_j \mid \boldsymbol{\gamma}_{<j}) = \prod_j \textsf{Ber}(\hat{\rho}_{j \mid <j}) \]
Note: \(\Pr(\gamma_j = 1 \mid \boldsymbol{\gamma}_{<j}) = \textsf{E}[\gamma_j \mid \boldsymbol{\gamma}_{<j}]\)
Fit a sequence of \(p\) regressions of \(\gamma_j\) on \(\gamma_{<j}\) \[\begin{align*} \gamma_1 & = \mu_1 + \epsilon_1 \\ \gamma_2 & = \mu_2 + \beta_{2 1} (\gamma_1 - \mu_1) + \epsilon_2 \\ \gamma_3 & = \mu_3 + \beta_{3 1} (\gamma_1 - \mu_1) + \beta_{3 2} (\gamma_2 - \mu_2) + \epsilon_3 \\ & \vdots \\ \gamma_p & = \mu_p + \beta_{p 1} (\gamma_1 - \mu_1) + \ldots + \beta_{p \, p-1} (\gamma_{p-1} - \mu_{p-1})+ \epsilon_p \end{align*}\]
Approximate model \[\boldsymbol{\gamma}\sim \textsf{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}_{\boldsymbol{\gamma}})\]
Wermuth (1980) compositional regression \[ \mathbf{G}= \mathbf{1}_{T} \boldsymbol{\mu}^T + (\mathbf{G}- \mathbf{1}_T \boldsymbol{\mu}^T) \mathbf{B}+ \boldsymbol{\epsilon} \]
\(\mathbf{G}\) is \(T \times p\) matrix where row \(t\) is \(\boldsymbol{\gamma}_t\)
\(\boldsymbol{\mu}\) is the \(p\) dimensional vector of \(\textsf{E}[\boldsymbol{\gamma}]\)
\(\boldsymbol{\Sigma}_{\boldsymbol{\gamma}} = \mathbf{U}^T \mathbf{U}\) where \(\mathbf{U}\) is upper triangular Cholesky decomposition of covariance matrix of \(\boldsymbol{\gamma}\) (\(p \times p\))
\(\mathbf{B}^T = \mathbf{I}_p - diag(\mathbf{U})\, \mathbf{U}^{-T}\) (strictly lower triangular)
\(\mathbf{B}\) is a \(p \times p\) upper triangular matrix with zeros on the diagonal; the coefficients of the \(j\)th regression appear in column \(j\) of \(\mathbf{B}\) (row \(j\) of \(\mathbf{B}^T\))
OLS is BLUE and consistent, but \(\mathbf{G}\) may not be full rank
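A quick numerical check of the Cholesky relationship above, using a generic positive definite \(\boldsymbol{\Sigma}\): row \(j\) of \(\mathbf{B}^T = \mathbf{I}_p - diag(\mathbf{U})\,\mathbf{U}^{-T}\) matches the usual regression coefficients of \(\gamma_j\) on \(\gamma_{<j}\).

```r
## Check that B' = I - diag(diag(U)) %*% solve(t(U)) recovers the regressions.
set.seed(3)
A     <- matrix(rnorm(16), 4, 4)
Sigma <- crossprod(A)                            # a positive definite covariance
U     <- chol(Sigma)                             # upper triangular, Sigma = t(U) %*% U
Bt    <- diag(4) - diag(diag(U)) %*% solve(t(U)) # strictly lower triangular

## Row 3 of B' versus the regression coefficients of gamma_3 on (gamma_1, gamma_2).
rbind(Bt[3, 1:2], Sigma[3, 1:2] %*% solve(Sigma[1:2, 1:2]))
```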
apply Bayesian Shrinkage with “priors” on \(\boldsymbol{\mu}\) (non-informative or Normal) and \(\Sigma\) (inverse-Wishart)
the pseudo-posterior mean of \(\boldsymbol{\mu}\) is the current estimate of the marginal inclusion probabilities, \(\hat{\boldsymbol{\mu}} = \bar{\boldsymbol{\gamma}}\)
use pseudo-posterior mean for \(\boldsymbol{\Sigma}\)
one Cholesky decomposition provides all coefficients for the \(p\) predictions for proposing \(\boldsymbol{\gamma}^*\)
constrain predicted values \(\hat{\rho}_{j \mid <j} \in (\delta, 1-\delta)\)
generate \(\boldsymbol{\gamma}^*_j \mid \boldsymbol{\gamma}^*_{< j} \sim \textsf{Ber}(\hat{\rho}_{j \mid <j})\)
use as proposal for Adaptive Independent Metropolis-Hastings or Importance Sampling (Accept all)
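A sketch of one proposal draw under the dependent-Bernoulli scheme, continuing the toy example; simple ridge shrinkage of the sample covariance stands in for the Normal / inverse-Wishart pseudo-posterior used in the actual implementation.

```r
## Propose gamma* from the product of dependent Bernoullis using past MCMC draws.
delta <- 0.025
mu    <- colMeans(draws)                               # current inclusion probabilities
Sigma <- cov(draws) + 0.01 * diag(p)                   # crude shrinkage: keep full rank
U     <- chol(Sigma)
Bt    <- diag(p) - diag(diag(U)) %*% solve(t(U))       # sequential regression coefficients

g_star <- numeric(p)
for (j in 1:p) {
  k     <- seq_len(j - 1)
  rho_j <- mu[j] + sum(Bt[j, k] * (g_star[k] - mu[k])) # predicted Pr(gamma_j = 1 | gamma_<j)
  rho_j <- min(max(rho_j, delta), 1 - delta)           # constrain to (delta, 1 - delta)
  g_star[j] <- rbinom(1, 1, rho_j)
}
```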
use only the \(n\) unique models
inclusion probability that model \(\boldsymbol{\gamma}_i\) appears in \(S\) under IS (PPS with replacement, \(T\) draws) \[\pi_i = 1 - (1 - q(\boldsymbol{\gamma}_i))^T\]
HT estimate of the normalizing constant \[\hat{C} = \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)} {\pi_i}\]
Ratio HT estimate of posterior probabilities \[\widehat{p({\cal M}_\gamma\mid \mathbf{Y})} = \frac{\sum_{i \in S} I({{\cal M}}_i = {\cal M}_\gamma)\, p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)/\pi_{i}} {\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)/\pi_i} = \frac { p(\mathbf{Y}\mid {\cal M}_\gamma)p({\cal M}_\gamma)/\pi_{\boldsymbol{\gamma}}} {\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)/\pi_i}\]
Estimate of Marginal Inclusion Probabilities
\[\widehat{p(\gamma_j = 1 \mid \mathbf{Y})} = \frac{\sum_{i \in S} \gamma_{ij}\, p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)/\pi_{i}} {\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)/\pi_i} \] where \(\gamma_{ij} = 1\) if \(X_j\) is included in \({{\cal M}}_i\)
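Continuing the sketch: draw \(T\) independent proposals (accept all), keep the \(n\) unique models, and form the HT ratio estimates with \(\pi_i = 1-(1-q(\boldsymbol{\gamma}_i))^T\).

```r
## Importance sampling (accept all) with the dependent-Bernoulli proposal above,
## then Horvitz-Thompson ratio estimates over the unique sampled models.
propose_q <- function() {                    # one draw from q and its probability
  g <- numeric(p); logq <- 0
  for (j in 1:p) {
    k   <- seq_len(j - 1)
    rho <- mu[j] + sum(Bt[j, k] * (g[k] - mu[k]))
    rho <- min(max(rho, delta), 1 - delta)
    g[j] <- rbinom(1, 1, rho)
    logq <- logq + dbinom(g[j], 1, rho, log = TRUE)
  }
  list(g = g, q = exp(logq))
}

T_is    <- 2000
samples <- replicate(T_is, propose_q(), simplify = FALSE)
G_all   <- do.call(rbind, lapply(samples, `[[`, "g"))
q_all   <- sapply(samples, `[[`, "q")

keep <- !duplicated(G_all)                   # the n unique sampled models S
G_S  <- G_all[keep, , drop = FALSE]
pi_i <- 1 - (1 - q_all[keep])^T_is           # inclusion probabilities
logm <- apply(G_S, 1, log_post_unnorm)
m_i  <- exp(logm - max(logm))                # common constant cancels in the ratios

w      <- (m_i / pi_i) / sum(m_i / pi_i)     # ratio HT model probabilities
p_incl <- colSums(w * G_S)                   # HT marginal inclusion probabilities
```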
Example: tecator data (Griffin et al (2021)), using a sample of \(p = 20\) variables; compare methods with the same settings for burnin.it, MCMC.it, and thin
Basu (1971), Meeden and Ghosh (1983): \[\hat{C} = \sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)}{\pi_i} \times \sum_{i \notin S} \pi_i\]
Model-based: Let \(m_i = p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\) \[\begin{align} m_i \mid \pi_i &\mathrel{\mathop{\sim}\limits^{\rm ind}}N(\pi_i \beta, \sigma^2 \pi_i^2) \\ p(\beta, \sigma^2) & \propto 1/\sigma^2 \end{align}\]
posterior mean of \(\beta\) is \(\frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i}\) (HT)
using the posterior predictive \[\begin{align*} \hat{C} & = \sum_{i \in S} m_i + \sum_{i \notin S} \hat{\beta} \pi_i \approx \sum_{i \in S} m_i + \frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i}\sum_{i \in S} (1 - \pi_i) \end{align*}\] (approximating \(\sum_{i \notin S} \pi_i\) by \(n - \sum_{i \in S}\pi_i = \sum_{i \in S}(1 - \pi_i)\), since \(\sum_{\boldsymbol{\gamma}\in \Gamma}\pi_i = \textsf{E}[|S|] \approx n\))
accounts for probability on models not yet observed!
estimate of the posterior probability of \({\cal M}_\gamma\) \[ \frac{p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)}{\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)} \]
estimate of the total posterior probability of all models in \(\Gamma \setminus S\) \[ \frac{\frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)}{\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)} \]
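And the corresponding correction for unsampled mass, continuing the sketch above (using \(\sum_{i \in S}(1-\pi_i)\) in place of \(\sum_{i \notin S}\pi_i\)); the rescaling of the \(m_i\) cancels in all of these ratios.

```r
## Basu / Meeden-Ghosh style estimate: add predicted mass for unsampled models.
beta_hat <- mean(m_i / pi_i)                     # posterior mean of beta (HT-type)
C_hat    <- sum(m_i) + beta_hat * sum(1 - pi_i)  # observed + predicted unseen mass
p_model  <- m_i / C_hat                          # \hat p(M | Y) for sampled models
p_unseen <- 1 - sum(p_model)                     # estimated mass on models not in S
```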
Uses renormalized marginal likelihoods of sampled models
Computing marginal inclusion probabilities
What about \(\textsf{E}[\boldsymbol{\beta}\mid \mathbf{Y}]\), \(\textsf{E}[\mathbf{X}\boldsymbol{\beta}\mid \mathbf{Y}]\), \(\textsf{E}[\mathbf{Y}^* \mid \mathbf{Y}]\) or \(p(\Delta \mid \mathbf{Y})\)?
Adaptive Independent Metropolis proposal for models (use in more complex IS)
Use observed values of unique marginal likelihoods of models for estimating posterior distribution
Bayes estimates of MC output
Basu’s Elephants seem to be well, holding up the Bayesian World
and the Circus Statistician is still employed!
Code in development version of BAS
on https://github.com/merliseclyde/BAS
Talk available at https://merliseclyde.github.io/AMC-talk/
Meeden and Ghosh (1983) https://www.jstor.org/stable/2240483