Performs Bayesian nonparametric bootstrap and calculates posterior statistics
for the mean(s) or regression coefficients of univariate or multivariate data.
-- Function File: bootbayes (Y)
-- Function File: bootbayes (Y, X)
-- Function File: bootbayes (Y, X, CLUSTID)
-- Function File: bootbayes (Y, X, BLOCKSZ)
-- Function File: bootbayes (Y, X, ..., NBOOT)
-- Function File: bootbayes (Y, X, ..., NBOOT, PROB)
-- Function File: bootbayes (Y, X, ..., NBOOT, PROB, PRIOR)
-- Function File: bootbayes (Y, X, ..., NBOOT, PROB, PRIOR, SEED)
-- Function File: bootbayes (Y, X, ..., NBOOT, PROB, PRIOR, SEED, L)
-- Function File: STATS = bootbayes (Y, ...)
-- Function File: [STATS, BOOTSTAT] = bootbayes (Y, ...)
'bootbayes (Y)' performs Bayesian nonparametric bootstrap [1] to create
1999 bootstrap statistics, each representing the weighted mean(s) of the
column vector (or column-major matrix), Y, using a vector of weights
randomly generated from a symmetric Dirichlet distribution. The resulting
bootstrap (or posterior [1,2]) distribution(s) is/are summarised for each
column of Y (i.e. each outcome) with the following statistics printed to
the standard output:
- original: the mean(s) of the data column(s) of Y
- bias: bootstrap bias estimate(s)
- median: the median(s) of the posterior distribution(s)
- stdev: the standard deviation(s) of the posterior distribution(s)
- CI_lower: lower bound(s) of the 95% credible interval(s)
- CI_upper: upper bound(s) of the 95% credible interval(s)
By default, the credible intervals are shortest probability
intervals, which represent a more computationally stable version
of the highest posterior density interval [3].
'bootbayes (Y, X)' also specifies the design matrix (X) for least squares
regression of Y on X. X should be a column vector or matrix the same
number of rows as Y. If the X input argument is empty, the default for X
is a column of ones (i.e. intercept only) and thus the statistic computed
reduces to the mean (as above). The statistics calculated and returned in
the output then relate to the coefficients from the regression of Y on X.
Y can be a column vector (for univariate regression) or a matrix (for
multivariate regression).
'bootbayes (Y, X, CLUSTID)' specifies a vector or cell array of numbers
or strings respectively to be used as cluster labels or identifiers.
Rows in Y (and X) with the same CLUSTID value are treated as clusters with
dependent errors. Rows of Y (and X) assigned to a particular cluster
will have identical weights during Bayesian bootstrap. If empty (default),
no clustered resampling is performed and all errors are treated as
independent.
'bootbayes (Y, X, BLOCKSZ)' specifies a scalar, which sets the block size
for bootstrapping when the residuals have serial dependence. Identical
weights are assigned within each (consecutive) block of length BLOCKSZ
during Bayesian bootstrap. Rows of Y (and X) within the same block are
treated as having dependent errors. If empty (default), no block
resampling is performed and all errors are treated as independent.
'bootbayes (Y, X, ..., NBOOT)' specifies the number of bootstrap resamples,
where NBOOT must be a positive integer. If empty, the default value of
NBOOT is 1999.
'bootbayes (Y, X, ..., NBOOT, PROB)' where PROB is numeric and sets the
lower and upper bounds of the credible interval(s). The value(s) of PROB
must be between 0 and 1. PROB can either be:
<> scalar: To set the central mass of shortest probability intervals
(SPI) to 100*(1-PROB)%
<> vector: A pair of probabilities defining the lower and upper
percentiles of the credible interval(s) as 100*(PROB(1))%
and 100*(PROB(2))% respectively.
Credible intervals are not calculated when the value(s) of PROB
is/are NaN. The default value of PROB is 0.95.
'bootbayes (Y, X, ..., NBOOT, PROB, PRIOR)' accepts a positive real
numeric scalar to parametrize the form of the symmetric Dirichlet
distribution. The Dirichlet distribution is the conjugate PRIOR used to
randomly generate weights on the unit simplex for linear least-squares
fitting of the observed data, and subsequently to estimate the posterior
for the regression coefficients by Bayesian bootstrap and any derived
linear estimates or contrasts.
If PRIOR is not provided or is empty, the default value of PRIOR is
'auto'. The behaviour of 'auto' depends on whether X is provided and
whether the model contains slope coefficients.
If no X is provided, or in intercept-only models, the value 'auto'
sets PRIOR so that the Bayesian-bootstrap posterior standard deviation
of the mean equals the usual frequentist standard error, i.e.
std (Y, 0) / sqrt(N). Here N denotes the number of independent sampling
units (e.g., observations, clusters, or blocks). Thus:
PRIOR (i.e. alpha) = 1 - 2 / N
With this setting, std (BOOTSTAT, 0, 2) = std (Y, 0) / sqrt (N) and
var (BOOTSTAT, 0, 2) = std (Y, 0)^2 / N (up to Monte Carlo error).
When N = 2 (Haldane prior, PRIOR = 0) and the statistic is the mean, the
posterior standard deviation equals the frequentist standard error exactly
(up to Monte Carlo error):
std (BOOTSTAT, 1, 2) = std (Y, 1) = std (Y, 0) / sqrt (N)
If X is a design matrix including slope predictor terms, the value
'auto' generalizes the above by providing a global Bessel‑style
correction matching the overall variance scale on average across
coefficients. Thus:
PRIOR (i.e. alpha) = 1 - (tr (H) + 1) / N = 1 - (rank (X) + 1) / N
Here tr(H) (and equivalently rank(X)) is the effective model degrees of
freedom, and N is the number of independent sampling units. Equivalently:
PRIOR (i.e. alpha) = 1 - (N - dfe + 1) / N
where dfe = N - rank(X) is the effective error degrees of freedom.
Alternative standard prior choices include: 1 for Bayes’ rule (uniform on
the simplex), 0.5 for the transformation-invariant Jeffreys prior (for the
Dirichlet weights), and 0 for the Haldane prior. Priors lower than 1
produce a more conservative (wider) posterior, whereas priors greater
than 1 are more liberal, shrinking the posterior bootstrap statistics
toward the maximum-likelihood estimates.
(For the Haldane prior, normal-quantile CIs use std(BOOTSTAT,1,2) to
match the population normalization used for the interval formula.)
'bootbayes (Y, X, ..., NBOOT, PROB, PRIOR, SEED)' initialises the
Mersenne Twister random number generator using an integer SEED value so
that 'bootbayes' results are reproducible.
'bootbayes (Y, X, ..., NBOOT, PROB, PRIOR, SEED, L)' multiplies the
regression coefficients by the hypothesis matrix L. If L is not provided
or is empty, it will assume the default value of 1 (i.e. no change to
the design). Otherwise, L must have the same number of rows as the number
of columns in X.
'STATS = bootbayes (...)' returns a structure where each field contains
a matrix of size (p x q), where p is the number of predictors and q is
the number of outcomes (columns in Y). Fields include: original, bias,
median, stdev, CI_lower, CI_upper and prior.
'[STATS, BOOTSTAT] = bootbayes (Y, ...)' also returns the bootstrap
statistics. If Y is a column vector, BOOTSTAT is a (p x nboot) matrix.
If Y is a matrix (q > 1), BOOTSTAT is a (1 x q) cell array where each
cell contains the (p x nboot) matrix for that outcome.
Bibliography:
[1] Rubin (1981) The Bayesian Bootstrap. Ann. Statist. 9(1):130-134
[2] Weng (1989) On a Second-Order Asymptotic property of the Bayesian
Bootstrap Mean. Ann. Statist. 17(2):705-710
[3] Liu, Gelman & Zheng (2015). Simulation-efficient shortest probability
intervals. Statistics and Computing, 25(4), 809–819.
bootbayes (version 2026.02.02)
Author: Andrew Charles Penn
https://www.researchgate.net/profile/Andrew_Penn/
Copyright 2019 Andrew Charles Penn
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/
The following code
% Input univariate dataset heights = [183, 192, 182, 183, 177, 185, 188, 188, 182, 185].'; % 95% credible interval for the mean bootbayes (heights); % Please be patient, the calculations will be completed soon...
Produces the following output
Summary of Bayesian bootstrap estimates of bias and precision for linear models ******************************************************************************* Bootstrap settings: Function: sum (W * Y) Resampling method: Bayesian bootstrap Prior: Symmetric Dirichlet distribution (alpha = 0.8) Number of resamples: 1999 Credible interval (CI) type: Shortest probability interval Credible interval: 95% Posterior statistics for outcome 1: original bias median stdev CI_lower CI_upper +184.5 -0.003551 +184.5 1.290 +182.0 +187.3
The following code
% Input bivariate dataset
X = [ones(43,1),...
[01,02,03,04,05,06,07,08,09,10,11,...
12,13,14,15,16,17,18,19,20,21,22,...
23,25,26,27,28,29,30,31,32,33,34,...
35,36,37,38,39,40,41,42,43,44]'];
y = [188.0,170.0,189.0,163.0,183.0,171.0,185.0,168.0,173.0,183.0,173.0,...
173.0,175.0,178.0,183.0,192.4,178.0,173.0,174.0,183.0,188.0,180.0,...
168.0,170.0,178.0,182.0,180.0,183.0,178.0,182.0,188.0,175.0,179.0,...
183.0,192.0,182.0,183.0,177.0,185.0,188.0,188.0,182.0,185.0]';
% 95% credible interval for the regression coefficents
bootbayes (y, X);
% Please be patient, the calculations will be completed soon...
Produces the following output
Summary of Bayesian bootstrap estimates of bias and precision for linear models ******************************************************************************* Bootstrap settings: Function: pinv (X' * W * X) * (X' * W * y) Resampling method: Bayesian bootstrap Prior: Symmetric Dirichlet distribution (alpha = 0.93) Number of resamples: 1999 Credible interval (CI) type: Shortest probability interval Credible interval: 95% Posterior statistics for outcome 1: original bias median stdev CI_lower CI_upper +175.5 -0.002966 +175.5 2.553 +170.9 +180.8 +0.1904 +0.001350 +0.1954 0.08563 +0.03125 +0.3551
The following code
%% --- Stress-test: Simulated Large-Scale Patch-seq Project (bootbayes) ---
%% N = 7500 cells (observations), p = 15 features, q = 2000 genes (outcomes).
%% This tests multivariate un-weaving logic and HPD interval scaling.
N = 7500;
p = 15;
q = 2000;
nboot = 500; % Sufficient for stable HPD intervals
fprintf('Simulating a Large-Scale Patch-seq Dataset (%d x %d)...\n', N, q);
% Generate design matrix X (e.g., E-phys features)
X = [ones(N,1), randn(N, p-1)];
% Generate multivariate outcome Y (Gene expression)
% Approx 120MB of data
true_beta = randn(p, q) .* (rand(p, q) > 0.9);
Y = X * true_beta + randn(N, q) * 0.5;
fprintf('Running bootbayes with %d bootstrap resamples...\n', nboot);
tic;
% Running with default alpha=0.05 (95% Credible Intervals)
stats = bootbayes(Y, X, [], nboot);
runtime = toc;
fprintf('\n--- Performance Results ---\n');
fprintf('Runtime: %.2f seconds\n', runtime);
fprintf('Total parameters estimated: %d\n', p * q);
% Accuracy Check on a random gene
target_gene = randi(q);
% Note: checking if original is returned as p x q matrix
if all(size(stats.original) == [p, q])
estimated = stats.original(:, target_gene);
else
% Fallback if it is returned as the interleaved vector
estimated = stats.original((target_gene-1)*p + (1:p));
endif
actual = true_beta(:, target_gene);
correlation = corr(estimated, actual);
fprintf('Correlation of estimates for Gene %d: %.4f\n', target_gene, correlation);
% Verify Credible Interval structure
fprintf('Number of outcome cells in stats.stdev: %d\n', length(stats.stdev));
if correlation > 0.98
fprintf('Result: PASSED (High accuracy OLS recovery)\n');
else
fprintf('Result: WARNING (Low correlation - check for colinearity)\n');
end
Produces the following output
Simulating a Large-Scale Patch-seq Dataset (7500 x 2000)... Running bootbayes with 500 bootstrap resamples... --- Performance Results --- Runtime: 57.95 seconds Total parameters estimated: 30000 Correlation of estimates for Gene 339: 0.9999 Number of outcome cells in stats.stdev: 2000 Result: PASSED (High accuracy OLS recovery)
Package: statistics-resampling