In Search of Basu’s Elephants

Theory and Foundations of Statistics in the Era of Big Data
Conference Honoring Bahadur and Basu

Merlise Clyde

Duke University

Basu’s Elephants

Circus Owner plans to ship his 50 elephants

  • needs an approximate total weight of the 50 elephants
  • Sambo is a typical elephant of average weight
  • decides to take 50 \(\times\) Sambo’s current weight as an estimate of the total weight of his herd

Sampling Design

Circus Statistician is shocked!

  • we need a sampling design to create an unbiased estimator!
  • they compromised and came up with the following plan:
    • sample Sambo with probability \(99/100\)
    • the rest of the elephants with probability \(1/4900\)

Unbiased Estimation & Horvitz-Thompson

Everyone was happy when Sambo was selected!

  • Circus Owner proposes using \(50 \times\) Sambo’s current weight
  • Circus Statistician: we should use the Horvitz-Thompson Estimator (HT)
    • unique hyper-admissible estimator in the class of all generalized polynomial unbiased estimators!
  • HT estimate is Sambo’s weight \(\div\) probability Sambo was selected -or- W \(\times 100/99\)
  • But what if the largest elephant Jumbo had been selected? asked the Circus Owner
  • Well, said the sheepish statistician, HT would lead to Jumbo’s weight \(W \times 4900/1\)!
  • and thus the Circus statistician lost their job (and became an instructor of statistics)!

Several years later…

Statistical Modelling Agency has an ad out for help sampling models

  • Circus statistician applies

  • Agency: we want to implement Bayesian Model Averaging and need your help in sampling models and estimation

  • Former Circus Statistician: I know all about sampling! Tell me more!

  • and being an instructor of statistics does not pay enough

Canonical Regression Model

  • Observe response vector \(\mathbf{Y}\) with predictor variables \(X_1 \dots X_p\).

  • Model for data under a specific model \({\cal M}_\gamma\): \[\begin{equation*} \mathbf{Y}\mid \alpha, \boldsymbol{\beta_\gamma}, \phi, {\cal M}_\gamma\sim \textsf{N}(\mathbf{1}_n\alpha + \mathbf{X}_{\boldsymbol{\gamma}}\boldsymbol{\beta_\gamma}, \mathbf{I}_n/\phi) \label{eq:linear.model} \end{equation*}\]

  • Models \({\cal M}_\gamma\) encoded by \(\boldsymbol{\gamma}= (\gamma_1, \ldots \gamma_p)^T\) binary vector with \(\gamma_j = 1\) indicating that \(X_j\) is included in model \({\cal M}_\gamma\) where \[\begin{align*} \gamma_j = 0 & \Leftrightarrow \beta_j = 0 \\ \gamma_j = 1 & \Leftrightarrow \beta_j \neq 0 \end{align*}\]

  • \(\mathbf{X}_{\boldsymbol{\gamma}}\): the \(n \times p_{\boldsymbol{\gamma}}\) design matrix for model \({\cal M}_\gamma\)

  • \(\boldsymbol{\beta_\gamma}\): the \(p_{\boldsymbol{\gamma}}\) vector of non-zero regression coefficients under \({\cal M}_\gamma\)

  • intercept \(\alpha\), precision \(\phi\) common to all models
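
To make the \(\boldsymbol{\gamma}\) encoding concrete, here is a minimal base-R sketch (simulated data; all object names are illustrative):

```r
# Sketch: encode a model by a binary vector gamma and pull out its design matrix
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))

gamma   <- c(1, 0, 1, 0, 0)                 # M_gamma includes X1 and X3
X_gamma <- X[, gamma == 1, drop = FALSE]    # n x p_gamma design matrix for M_gamma
p_gamma <- sum(gamma)                       # number of non-zero coefficients in beta_gamma
```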

Bayesian Model Averaging (BMA)

  • prior distributions on all unknowns \((\boldsymbol{\beta_\gamma}, \boldsymbol{\gamma}, \alpha, \phi)\) and turn the Bayesian crank to get posterior distributions!

  • key component is posterior distribution over models \[ p({\cal M}_\gamma\mid \mathbf{Y}) = \frac {p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)} {\sum_{\boldsymbol{\gamma}\in \Gamma} p(\mathbf{Y}\mid {\cal M}_\gamma)p({\cal M}_\gamma)} \]

  • for nice priors, we can integrate out the parameters \(\boldsymbol{\theta}_{\boldsymbol{\gamma}}= (\boldsymbol{\beta_\gamma}, \alpha, \phi)\) to obtain the marginal likelihood of \({\cal M}_\gamma\): \[ p(\mathbf{Y}\mid {\cal M}_\gamma) = \int p(\mathbf{Y}\mid \boldsymbol{\theta}_{\boldsymbol{\gamma}}, {\cal M}_\gamma)p(\boldsymbol{\theta}_{\boldsymbol{\gamma}}\mid {\cal M}_\gamma) d\boldsymbol{\theta}_{\boldsymbol{\gamma}} \]

  • posterior distribution of quantities \(\Delta\) of interest under BMA \[ \sum_{\boldsymbol{\gamma}\in \Gamma} p({\cal M}_\gamma\mid \mathbf{Y}) p(\Delta \mid \mathbf{Y}, {\cal M}_\gamma) \]

  • examples of \(\Delta\): \(\textsf{E}[\mathbf{Y}]\), the predictive distribution of \(\mathbf{Y}^*\), or \(\gamma_j\) (marginal inclusion probabilities)
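
For small \(p\) the whole posterior over models can be enumerated. A minimal sketch, using \(-\textsf{BIC}/2\) as a stand-in for the log marginal likelihood and a uniform model prior (illustrative choices, not the priors used in BAS):

```r
set.seed(2)
n <- 100; p <- 4
X <- matrix(rnorm(n * p), n, p)
y <- 1 + X[, 1] - 0.5 * X[, 3] + rnorm(n)

Gamma <- as.matrix(expand.grid(rep(list(0:1), p)))   # all 2^p models, one per row

# surrogate log marginal likelihood: -BIC/2 from an ordinary lm fit
log_marg <- apply(Gamma, 1, function(g) {
  fit <- if (sum(g) == 0) lm(y ~ 1) else lm(y ~ X[, g == 1])
  -BIC(fit) / 2
})

logw      <- log_marg + (-p * log(2))                # uniform prior p(M_gamma) = 2^-p
post_prob <- exp(logw - max(logw))
post_prob <- post_prob / sum(post_prob)              # p(M_gamma | Y) over all models

incl_prob <- colSums(post_prob * Gamma)              # BMA marginal inclusion probabilities
```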

Implementation

The Former Circus Statistician asked: “What if you can’t enumerate all of the models in \(\Gamma\)?”

  • With \(88\) predictor variables there are \(2^{88} > 3 \times 10^{26}\) models!

  • more than the number of stars (\(10^{24}\)) in the universe!

Sampling

Well, then we just use a sample, said the agency statistician

  • simple random sampling with or without replacement doesn’t work very well (inefficient)

  • Box used importance sampling in the 80’s, but that didn’t seem to scale

  • non-random samples using Branch & Bound, but those are biased (and also do not scale)

  • but then Bayesians discovered MCMC, so Markov chain Monte Carlo is now pretty standard for implementing BMA

MCMC

  • design a Markov Chain to transition through \(\Gamma\) so that ultimately models are sampled proportional to their posterior probabilities \[ p({\cal M}_\gamma\mid \mathbf{Y}) \propto p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma) \]

  • propose a new model from \(q(\boldsymbol{\gamma}^* \mid \boldsymbol{\gamma})\)

  • accept moving to \(\boldsymbol{\gamma}^*\) with probability \[ \textsf{MH} = \min\left(1, \frac{p(\mathbf{Y}\mid {\cal M}_{\boldsymbol{\gamma}^*})\, p({\cal M}_{\boldsymbol{\gamma}^*})/q(\boldsymbol{\gamma}^* \mid \boldsymbol{\gamma})} {p(\mathbf{Y}\mid {\cal M}_\gamma)\, p({\cal M}_\gamma)/q(\boldsymbol{\gamma}\mid \boldsymbol{\gamma}^*)}\right) \]

  • otherwise stay at model \({\cal M}_\gamma\)

  • the normalizing constant is the same in the numerator and denominator so we just need \(p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)\)!
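
A minimal sketch of such a chain with a symmetric one-variable add/delete proposal, reusing the \(-\textsf{BIC}/2\) surrogate and a uniform model prior so the prior and proposal cancel in the ratio (the function name log_post is illustrative, not the BAS implementation):

```r
set.seed(3)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- 1 + X[, 1] - 0.5 * X[, 3] + rnorm(n)

# surrogate for log p(Y | M_gamma) + log p(M_gamma) (uniform prior drops out)
log_post <- function(g) {
  fit <- if (sum(g) == 0) lm(y ~ 1) else lm(y ~ X[, g == 1])
  -BIC(fit) / 2
}

T_mcmc <- 5000
gamma  <- rep(0, p); lp <- log_post(gamma)
draws  <- matrix(NA_integer_, T_mcmc, p)
for (t in 1:T_mcmc) {
  prop <- gamma
  j <- sample(p, 1); prop[j] <- 1 - prop[j]          # flip one gamma_j (symmetric q)
  lp_prop <- log_post(prop)
  if (log(runif(1)) < lp_prop - lp) { gamma <- prop; lp <- lp_prop }   # MH accept
  draws[t, ] <- gamma
}
```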

Estimation in BMA

The Former Circus Statistician asks: “So how do you estimate the probabilities of models?”

  • we just use the Monte Carlo frequencies of models or ergodic averages \[\begin{align*} \widehat{p({\cal M}_\gamma\mid \mathbf{Y})} & = \frac{\sum_{t = 1}^T I({{\cal M}}_t = {\cal M}_\gamma)} {T} \\ & = \frac{n_{\boldsymbol{\gamma}}\, I({\cal M}_\gamma\in S)} {\sum_{\boldsymbol{\gamma}' \in S} n_{\boldsymbol{\gamma}'}} \\ \end{align*}\]

  • \(T\) = # MCMC samples

  • \(S\) is the collection of unique sampled models

  • \(n_{\boldsymbol{\gamma}}\) is the frequency of model \({\cal M}_\gamma\) among the \(T\) samples

  • \(n = \sum_{\boldsymbol{\gamma}\in S} n_{\boldsymbol{\gamma}} = T\), the total number of sampled models

  • asymptotically unbiased as \(T \to \infty\)
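
Continuing the MCMC sketch above, the Monte Carlo frequency estimates are just relative frequencies of the visited models:

```r
keys <- apply(draws, 1, paste, collapse = "")        # each sampled model as a string key
freq <- table(keys)                                  # n_gamma for each unique model in S
p_mc <- sort(freq / sum(freq), decreasing = TRUE)    # \hat p(M_gamma | Y) = n_gamma / T
head(p_mc)

colMeans(draws)                                      # ergodic marginal inclusion probabilities
```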

Shocked (II)

What! You just use the Monte Carlo frequencies! exclaimed the Former Circus Statistician

  • I don’t know much about Bayes, but that doesn’t seem particularly Bayesian!
  • what happened to the (observed) marginal likelihoods \(\times\) prior probabilities?

  • since you are sampling from a finite population, can you use survey sampling estimates of \[C = \sum_{\boldsymbol{\gamma}\in \Gamma } p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)\] with the observed \(p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)\)?

  • Agency Statistician: We tried using \[ \widehat{p({\cal M}_\gamma\mid \mathbf{Y})} = \frac {p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)} {\sum_{\boldsymbol{\gamma}\in S} p(\mathbf{Y}\mid {\cal M}_\gamma)p({\cal M}_\gamma)} \]

  • it’s Fisher Consistent, but biased in finite samples

  • that’s why we need you!
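
A sketch of that renormalized estimator, continuing the example above: keep only the unique sampled models and renormalize their observed (surrogate) values of \(p(\mathbf{Y}\mid {\cal M}_\gamma)p({\cal M}_\gamma)\):

```r
S_mat  <- unique(draws)                       # unique sampled models (the set S)
logw_S <- apply(S_mat, 1, log_post)           # observed log p(Y | M) p(M) (surrogate) for models in S
w      <- exp(logw_S - max(logw_S))
p_renorm <- w / sum(w)                        # renormalized estimate: Fisher consistent, finite-sample biased
```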

Inverse Probability Weighting

Former Circus Statistician: Let’s try the Hansen-Hurwitz estimator in PPS sampling

  • Goal is to estimate \(C= \sum_i^N p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\)
  • Let \(\rho_i\) be the probability of selecting \({{\cal M}}_i\)
  • Hansen-Hurwitz or importance sampling estimate is \[\hat{C} = \frac{1}{n}\sum_{i \in S} \frac{ n_i\, p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\rho_i} \]
  • If we have “perfect” samples from the posterior then \[\rho_i = \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)}{C}\] and recover \(C\)!
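
A toy numerical sketch of the Hansen-Hurwitz estimate of \(C\), with made-up values of \(p(\mathbf{Y}\mid{\cal M}_i)p({\cal M}_i)\) and draw probabilities that only roughly track the target:

```r
set.seed(4)
N   <- 1000
m   <- rexp(N)^3                              # toy values of p(Y | M_i) p(M_i); truth C = sum(m)
rho <- (m + mean(m)) / sum(m + mean(m))       # imperfect PPS draw probabilities

n_draw <- 5000
idx    <- sample(N, n_draw, replace = TRUE, prob = rho)   # PPS sampling with replacement
C_hh   <- mean(m[idx] / rho[idx])                         # Hansen-Hurwitz estimate of C
c(truth = sum(m), HH = C_hh)                              # with rho_i = m_i / C every term equals C exactly
```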

Self-Normalized Importance Sampling

Since \(C\) is unknown, we can apply the ratio estimator \[ \hat{C} = \frac{\frac1 n \sum_i^n \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\rho_i}}{ \frac1 n \sum_i^n \frac{1}{\rho_i}} = \left[ \frac{1}{n}\sum_i^n \frac{1}{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)} \right]^{-1} \]

But this is the “infamous” harmonic mean estimator, said the Agency Statistician! While unbiased, it is highly unstable!

The Former Circus Statistician (hoping to keep this gig):

  • Wait! - I know that the Horvitz-Thompson (HT) estimator uses only the unique sampled values and not the frequencies

  • HT dominates the Hansen-Hurwitz in terms of variance!

  • but we can’t use MCMC…

Proposal Distribution

Former Circus Statistician: So tell me more about \(q\) in MCMC

  • ratio \(\frac{p(\mathbf{Y}\mid {\cal M}_\gamma^*) p({\cal M}_\gamma^*)}{q({\cal M}_\gamma^* \mid {\cal M}_\gamma)}\) looks like importance sampling

  • what are choices for \(q({\cal M}_\gamma^* \mid {\cal M}_\gamma)\)?

  • what about independent proposals \(q({\cal M}_\gamma^*)\)?

Adaptive Independent Metropolis

Griffin et al (2021):

  • Independent Metropolis Hastings \[q(\boldsymbol{\gamma}) = \prod \pi_i^{\gamma_i} (1 - \pi_i)^{1-\gamma_i}\]

  • Product of Independent Bernoullis

  • Use adaptive Independent Metropolis Hastings to learn \(\pi_i\) from past samples plus Rao-Blackwellization

  • optimal for target where \(\gamma_j\) are independent a posteriori

  • still uses Metropolis-Hastings Ratio to accept

  • And they use Monte Carlo frequencies!

  • but the independent Bernoulli proposal does not work well for sampling with replacement or for importance sampling in general
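
A minimal sketch of an independent product-of-Bernoullis proposal inside Metropolis-Hastings, with \(\pi_i\) adapted from past samples via a crude running mean (a stand-in for the Rao-Blackwellized updates of Griffin et al.; in practice the adaptation must diminish for the chain to remain valid). It reuses log_post and the data from the MCMC sketch above:

```r
T_ais  <- 5000; eps <- 0.025
pi_hat <- rep(0.5, p)                                        # proposal inclusion probabilities
lq     <- function(g) sum(dbinom(g, 1, pi_hat, log = TRUE))  # log q(gamma) under current pi_hat
gamma  <- rbinom(p, 1, pi_hat); lp <- log_post(gamma)
draws_ais <- matrix(NA_integer_, T_ais, p)
for (t in 1:T_ais) {
  prop    <- rbinom(p, 1, pi_hat)                         # independent proposal q(gamma*)
  lp_prop <- log_post(prop)
  log_mh  <- (lp_prop - lq(prop)) - (lp - lq(gamma))      # independence-sampler MH ratio
  if (log(runif(1)) < log_mh) { gamma <- prop; lp <- lp_prop }
  draws_ais[t, ] <- gamma
  # crude adaptation: running mean of past samples, kept away from 0 and 1
  pi_hat <- pmin(pmax(colMeans(draws_ais[1:t, , drop = FALSE]), eps), 1 - eps)
}
```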

Factor Target

  • The joint posterior distribution of \(\boldsymbol{\gamma}\) (dropping \(\mathbf{Y}\)) may be factored: \[p({\cal M}_\gamma\mid \mathbf{Y}) \equiv p(\boldsymbol{\gamma}) = \prod_{j = 1}^p p(\gamma_j \mid \boldsymbol{\gamma}_{<j}) \] where \(\boldsymbol{\gamma}_{< j}\equiv \{\gamma_k\}\) for \(k < j\) and \(p(\gamma_1 \mid \boldsymbol{\gamma}_{<1}) \equiv p(\gamma_1)\).

  • As \(\gamma_j\) are binary, re-express as \[\begin{equation*} p(\boldsymbol{\gamma}) = \prod_{j=1}^p(\rho_{j \mid <j})^{\gamma_j}{(1-\rho_{j \mid <j})}^{1-\gamma_j} \end{equation*}\] where \(\rho_{j \mid <j} \equiv \Pr(\gamma_j = 1 \mid \boldsymbol{\gamma}_{<j})\) and \(\rho_{1 \mid < 1} = \rho_1\), the marginal probability.

  • Product of Dependent Bernoullis

Global Adaptive MCMC Proposal

Factor proposal \[q(\boldsymbol{\gamma}) = \prod_{j = 1}^p q(\gamma_j \mid \boldsymbol{\gamma}_{<j}) = \prod_j \textsf{Ber}(\hat{\rho}_{j \mid <j}) \]

  • Note: \(\Pr(\gamma_j = 1 \mid \boldsymbol{\gamma}_{<j}) = \textsf{E}[\gamma_j \mid \boldsymbol{\gamma}_{<j}]\)

  • Fit a sequence of \(p\) regressions \(\gamma_j\) on \(\gamma_{<j}\) \[\begin{align*} \gamma_1 & = \mu_1 + \epsilon_1 \\ \gamma_2 & = \mu_2 + \beta_{2 1} (\gamma_1 - \mu_1) + \epsilon_2 \\ \gamma_3 & = \mu_3 + \beta_{3 1} (\gamma_1 - \mu_1) + \beta_{3 2} (\gamma_2 - \mu_2) + \epsilon_3 \\ & \vdots \\ \gamma_p & = \mu_p + \beta_{p 1} (\gamma_1 - \mu_1) + \cdots + \beta_{p \, p-1} (\gamma_{p-1} - \mu_{p-1})+ \epsilon_p \end{align*}\]

Compositional Regression

Approximate model \[\boldsymbol{\gamma}\sim \textsf{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}_{\boldsymbol{\gamma}})\]

  • Wermuth (1980) compositional regression \[ \mathbf{G}= \mathbf{1}_{T} \boldsymbol{\mu}^T + (\mathbf{G}- \mathbf{1}_T \boldsymbol{\mu}^T) \mathbf{B}+ \boldsymbol{\epsilon} \]

  • \(\mathbf{G}\) is \(T \times p\) matrix where row \(t\) is \(\boldsymbol{\gamma}_t\)

  • \(\boldsymbol{\mu}\) is the \(p\) dimensional vector of \(\textsf{E}[\boldsymbol{\gamma}]\)

  • \(\boldsymbol{\Sigma}_{\boldsymbol{\gamma}} = \mathbf{U}^T \mathbf{U}\) where \(\mathbf{U}\) is upper triangular Cholesky decomposition of covariance matrix of \(\boldsymbol{\gamma}\) (\(p \times p\))

  • \(\mathbf{B}^T = \mathbf{I}_p - diag(\mathbf{U})\, \mathbf{U}^{-T}\) (strictly lower triangular)

  • \(\mathbf{B}\) is a \(p \times p\) upper triangular matrix with zeros on the diagonal and the coefficients of the \(j\)th regression in column \(j\)
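
A quick numerical check (a sketch on an arbitrary positive-definite \(\boldsymbol{\Sigma}\)) that \(\mathbf{B}^T = \mathbf{I}_p - diag(\mathbf{U})\,\mathbf{U}^{-T}\) reproduces the usual regression coefficients:

```r
set.seed(5)
d     <- 4
A     <- matrix(rnorm(d * d), d)
Sigma <- crossprod(A) + diag(d)                   # an arbitrary positive-definite covariance
U     <- chol(Sigma)                              # upper triangular, Sigma = U' U
Bt    <- diag(d) - diag(diag(U)) %*% solve(t(U))  # B^T = I - diag(U) U^{-T}: strictly lower triangular

# row 3 of B^T should match the regression of component 3 on components 1:2
b3 <- solve(Sigma[1:2, 1:2], Sigma[1:2, 3])
all.equal(as.numeric(Bt[3, 1:2]), as.numeric(b3))
```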

Estimators of \(\mathbf{B}\) and \(\boldsymbol{\mu}\)

  • OLS is BLUE and consistent, but \(\mathbf{G}\) may not be full rank

  • apply Bayesian Shrinkage with “priors” on \(\boldsymbol{\mu}\) (non-informative or Normal) and \(\Sigma\) (inverse-Wishart)

  • the pseudo-posterior mean of \(\boldsymbol{\mu}\) is the current estimate of the marginal inclusion probabilities: \(\hat{\boldsymbol{\mu}} = \bar{\boldsymbol{\gamma}}\)

  • use pseudo-posterior mean for \(\boldsymbol{\Sigma}\)

  • one Cholesky decomposition provides all coefficients for the \(p\) predictions for proposing \(\boldsymbol{\gamma}^*\)

  • constrain predicted values \(\hat{\rho}_{j \mid <j} \in (\delta, 1-\delta)\)

  • generate \(\boldsymbol{\gamma}^*_j \mid \boldsymbol{\gamma}^*_{< j} \sim \textsf{Ber}(\hat{\rho}_{j \mid <j})\)

  • use as proposal for Adaptive Independent Metropolis-Hastings or Importance Sampling (Accept all)
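
Putting the pieces together, a sketch of one proposal draw: estimate \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) from previously sampled models, take one Cholesky factorization, predict the \(\hat{\rho}_{j\mid <j}\) sequentially with truncation to \((\delta, 1-\delta)\), and sample \(\boldsymbol{\gamma}^*\). Plain moment estimates with a small ridge stand in for the pseudo-posterior means described above; propose_gamma and its arguments are illustrative names, not the BAS implementation:

```r
propose_gamma <- function(G, delta = 0.025, ridge = 1e-3) {
  p   <- ncol(G)
  mu  <- colMeans(G)                              # current estimate of inclusion probabilities
  Sig <- cov(G) + ridge * diag(p)                 # crude shrinkage so the Cholesky exists
  U   <- chol(Sig)                                # Sig = U' U
  Bt  <- diag(p) - diag(diag(U)) %*% solve(t(U))  # all p regressions from one factorization

  gstar <- numeric(p); rho <- numeric(p)
  for (j in 1:p) {
    pred <- mu[j]
    if (j > 1) pred <- pred + sum(Bt[j, 1:(j - 1)] * (gstar[1:(j - 1)] - mu[1:(j - 1)]))
    rho[j]   <- min(max(pred, delta), 1 - delta)  # constrain rho_{j | <j} to (delta, 1 - delta)
    gstar[j] <- rbinom(1, 1, rho[j])
  }
  list(gamma = gstar, log_q = sum(dbinom(gstar, 1, rho, log = TRUE)))
}

propose_gamma(draws)   # e.g., learn the proposal from the earlier MCMC draws
```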

Horvitz-Thompson

  • use only the \(n\) unique models

  • the probability that \(\boldsymbol{\gamma}_i\) is included in \(S\) under IS (PPSWR) is \[\pi_i = 1 - (1 - q(\boldsymbol{\gamma}_i))^T\]

  • HT estimate of normalizing constant \[\hat{C} = \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)} {\pi_i}\]

  • Ratio HT estimate of posterior probabilities \[p({\cal M}_\gamma\mid \mathbf{Y}) = \frac{\sum_{i \in S} I({{\cal M}}_i = {\cal M}_\gamma)\, p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)/\pi_i} {\sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)} {\pi_i}} = \frac { p(\mathbf{Y}\mid {\cal M}_\gamma)p({\cal M}_\gamma)/\pi_{\boldsymbol{\gamma}}} {\sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)} {\pi_i}}\]

  • Estimate of Marginal Inclusion Probabilities

\[p(\gamma_j = 1 \mid \mathbf{Y}) = \frac{\sum_{i \in S} I(X_j \in {{\cal M}}_i)\, p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)/\pi_i} {\sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)} {\pi_i}} \]
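
A sketch of the HT calculations on the toy population from the Hansen-Hurwitz example: draw \(T\) models with replacement, keep the unique ones, form \(\pi_i = 1-(1-q_i)^T\), and compute the ratio estimates:

```r
set.seed(6)
N <- 1000
m <- rexp(N)^3                                  # toy p(Y | M_i) p(M_i); truth C = sum(m)
q <- (m + mean(m)) / sum(m + mean(m))           # independent proposal probability of each model

T_is  <- 2000
idx   <- sample(N, T_is, replace = TRUE, prob = q)
S_idx <- unique(idx)                            # indices of the unique sampled models
pi_S  <- 1 - (1 - q[S_idx])^T_is                # inclusion probabilities under PPSWR

C_ht    <- sum(m[S_idx] / pi_S)                 # HT estimate of the normalizing constant
post_ht <- (m[S_idx] / pi_S) / C_ht             # ratio HT estimates of p(M | Y) for models in S
c(truth = sum(m), HT = C_ht)
# marginal inclusion probabilities follow by summing post_ht over the models containing X_j
```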

Simulation

  • tecator data (Griffin et al (2021))

  • a sample of \(p = 20\) variables

  • compare

    • enumeration
    • MCMC with add, delete, and swap moves
    • Adaptive MCMC
    • Importance Sampling with HT
  • same settings burnin.it, MCMC.it, thin

MSE Comparison

Basu’s Estimator of Total

Basu (1971), Meeden and Ghosh (1983): \[\hat{C} = \sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)}{\pi_i} \times \sum_{i \in S}(1 - \pi_i)\]

  • Model-based: Let \(m_i = p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\) \[\begin{align} m_i \mid \pi_i &\mathrel{\mathop{\sim}\limits^{\rm ind}}N(\pi_i \beta, \sigma^2 \pi_i^2) \\ p(\beta, \sigma^2) & \propto 1/\sigma^2 \end{align}\]

  • posterior mean of \(\beta\) is \(\frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i}\) (the HT estimate of \(C\) scaled by \(1/n\))

  • using posterior predictive \[\begin{align*} \hat{C} & = \sum_{i \in S} m_i + \sum_{i \notin S} \hat{\beta} \pi_i = \sum_{i \in S} m_i + \frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i}\sum_{i \in S} (1 - \pi_i) \end{align*}\]

  • accounts for probability on models not yet observed!
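
Continuing the HT sketch above, Basu's estimator adds the predicted mass for models not yet sampled (here \(\hat\beta\) is the \(1/n\)-scaled HT term from the model-based derivation):

```r
n_uniq   <- length(S_idx)                              # n = number of unique sampled models
beta_hat <- sum(m[S_idx] / pi_S) / n_uniq              # (1/n) sum_{i in S} m_i / pi_i
C_basu   <- sum(m[S_idx]) + beta_hat * sum(1 - pi_S)   # observed mass + predicted unobserved mass
c(truth = sum(m), HT = C_ht, Basu = C_basu)
```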

Final Posterior Estimates

  • estimate of posterior probability \({\cal M}_\gamma\) \[ \frac{p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)}{\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)} \]

  • estimate of the total posterior probability of all models in \(\Gamma \setminus S\) \[ \frac{\frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)}{\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)} \]

  • Uses renormalized marginal likelihoods of sampled models

  • Computing marginal inclusion probabilities

  • What about \(\textsf{E}[\boldsymbol{\beta}\mid \mathbf{Y}]\), \(\textsf{E}[\mathbf{X}\boldsymbol{\beta}\mid \mathbf{Y}]\), \(\textsf{E}[\mathbf{Y}^* \mid \mathbf{Y}]\) or \(p(\Delta \mid \mathbf{Y})\)?
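
Continuing the sketch, the final estimates renormalize the observed \(p(\mathbf{Y}\mid{\cal M}_i)p({\cal M}_i)\) by the Basu estimate of \(C\) and assign the remaining mass to the unsampled models:

```r
post_basu   <- m[S_idx] / C_basu                       # estimated p(M | Y) for each sampled model
unseen_mass <- beta_hat * sum(1 - pi_S) / C_basu       # estimated total probability of models outside S
sum(post_basu) + unseen_mass                           # equals 1 by construction
```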

Continued Adaptation ?

  • can update Cholesky with rank 1 updates with new models
  • how to combine IS with MH samples (weighting) ?
  • HT/Hájek: computational complexity is involved if we need to compute inclusion probabilities for all models under continued adaptation (previous and future models)
  • Basu (1971) suggests approach for PPSWOR (adaptation?)

Refinements

  • Need to avoid MCMC for pseudo Bayesian posteriors for
    • learning proposal distribution in sample design for models
    • estimation of posterior model probabilities in model-based approaches (ie sampling from predictive distribution)
    • estimation of general quantities under BMA?
  • avoid infinite regress

Summary

  • Adaptive Independent Metropolis proposal for models (use in more complex IS)

  • Use observed values of unique marginal likelihoods of models for estimating posterior distribution

  • Bayes estimates of MC output

  • Basu’s Elephants seem to be, well, holding up the Bayesian World

  • and the Circus Statistician is still employed!

Thank You!

Code in development version of BAS on https://github.com/merliseclyde/BAS

Talk available at https://merliseclyde.github.io/AMC-talk/

References

Meeden and Ghosh (1983) https://www.jstor.org/stable/2240483