bootbayes

 Performs Bayesian nonparametric bootstrap and calculates posterior statistics 
 for the mean(s) or regression coefficients of univariate or multivariate data.


 -- Function File: bootbayes (Y)
 -- Function File: bootbayes (Y, X)
 -- Function File: bootbayes (Y, X, CLUSTID)
 -- Function File: bootbayes (Y, X, BLOCKSZ)
 -- Function File: bootbayes (Y, X, ..., NBOOT)
 -- Function File: bootbayes (Y, X, ..., NBOOT, PROB)
 -- Function File: bootbayes (Y, X, ..., NBOOT, PROB, PRIOR)
 -- Function File: bootbayes (Y, X, ..., NBOOT, PROB, PRIOR, SEED)
 -- Function File: bootbayes (Y, X, ..., NBOOT, PROB, PRIOR, SEED, L)
 -- Function File: STATS = bootbayes (Y, ...)
 -- Function File: [STATS, BOOTSTAT] = bootbayes (Y, ...)

     'bootbayes (Y)' performs Bayesian nonparametric bootstrap [1] to create
     1999 bootstrap statistics, each representing the weighted mean(s) of the
     column vector (or column-major matrix), Y, using a vector of weights
     randomly generated from a symmetric Dirichlet distribution. The resulting
     bootstrap (or posterior [1,2]) distribution(s) is/are summarised for each
     column of Y (i.e. each outcome) with the following statistics printed to
     the standard output:
        - original: the mean(s) of the data column(s) of Y
        - bias: bootstrap bias estimate(s)
        - median: the median(s) of the posterior distribution(s)
        - stdev: the standard deviation(s) of the posterior distribution(s)
        - CI_lower: lower bound(s) of the 95% credible interval(s)
        - CI_upper: upper bound(s) of the 95% credible interval(s)
          By default, the credible intervals are shortest probability
          intervals, which represent a more computationally stable version
          of the highest posterior density interval [3].

     'bootbayes (Y, X)' also specifies the design matrix (X) for least squares
     regression of Y on X. X should be a column vector or matrix the same
     number of rows as Y. If the X input argument is empty, the default for X
     is a column of ones (i.e. intercept only) and thus the statistic computed
     reduces to the mean (as above). The statistics calculated and returned in
     the output then relate to the coefficients from the regression of Y on X.
     Y can be a column vector (for univariate regression) or a matrix (for
     multivariate regression).

     'bootbayes (Y, X, CLUSTID)' specifies a vector or cell array of numbers
     or strings respectively to be used as cluster labels or identifiers.
     Rows in Y (and X) with the same CLUSTID value are treated as clusters with
     dependent errors. Rows of Y (and X) assigned to a particular cluster
     will have identical weights during Bayesian bootstrap. If empty (default),
     no clustered resampling is performed and all errors are treated as
     independent.

     'bootbayes (Y, X, BLOCKSZ)' specifies a scalar, which sets the block size
     for bootstrapping when the residuals have serial dependence. Identical
     weights are assigned within each (consecutive) block of length BLOCKSZ
     during Bayesian bootstrap. Rows of Y (and X) within the same block are
     treated as having dependent errors. If empty (default), no block
     resampling is performed and all errors are treated as independent.

     'bootbayes (Y, X, ..., NBOOT)' specifies the number of bootstrap resamples,
     where NBOOT must be a positive integer. If empty, the default value of
     NBOOT is 1999.

     'bootbayes (Y, X, ..., NBOOT, PROB)' where PROB is numeric and sets the
     lower and upper bounds of the credible interval(s). The value(s) of PROB
     must be between 0 and 1. PROB can either be:
       <> scalar: To set the central mass of shortest probability intervals
                  (SPI) to 100*(1-PROB)%
       <> vector: A pair of probabilities defining the lower and upper
                  percentiles of the credible interval(s) as 100*(PROB(1))%
                  and 100*(PROB(2))% respectively. 
          Credible intervals are not calculated when the value(s) of PROB
          is/are NaN. The default value of PROB is 0.95.

     'bootbayes (Y, X, ..., NBOOT, PROB, PRIOR)' accepts a positive real
     numeric scalar to parametrize the form of the symmetric Dirichlet
     distribution. The Dirichlet distribution is the conjugate PRIOR used to
     randomly generate weights on the unit simplex for linear least-squares
     fitting of the observed data, and subsequently to estimate the posterior
     for the regression coefficients by Bayesian bootstrap and any derived 
     linear estimates or contrasts.

     If PRIOR is not provided or is empty, the default value of PRIOR is
     'auto'. The behaviour of 'auto' depends on whether X is provided and
     whether the model contains slope coefficients.

     If no X is provided, or in intercept-only models, the value 'auto'
     sets PRIOR so that the Bayesian-bootstrap posterior standard deviation
     of the mean equals the usual frequentist standard error, i.e. 
     std (Y, 0) / sqrt(N). Here N denotes the number of independent sampling 
     units (e.g., observations, clusters, or blocks). Thus:

          PRIOR (i.e. alpha) = 1 - 2 / N

     With this setting, std (BOOTSTAT, 0, 2) = std (Y, 0) / sqrt (N) and
     var (BOOTSTAT, 0, 2) = std (Y, 0)^2 / N (up to Monte Carlo error).
     When N = 2 (Haldane prior, PRIOR = 0) and the statistic is the mean, the
     posterior standard deviation equals the frequentist standard error exactly
     (up to Monte Carlo error):

         std (BOOTSTAT, 1, 2) = std (Y, 1) = std (Y, 0) / sqrt (N)

     If X is a design matrix including slope predictor terms, the value
     'auto' generalizes the above by providing a global Bessel‑style
     correction matching the overall variance scale on average across
     coefficients. Thus:

          PRIOR (i.e. alpha) = 1 - (tr (H) + 1) / N = 1 - (rank (X) + 1) / N

     Here tr(H) (and equivalently rank(X)) is the effective model degrees of
     freedom, and N is the number of independent sampling units. Equivalently:

          PRIOR (i.e. alpha) = 1 - (N - dfe + 1) / N

     where dfe = N - rank(X) is the effective error degrees of freedom.

     Alternative standard prior choices include: 1 for Bayes’ rule (uniform on
     the simplex), 0.5 for the transformation-invariant Jeffreys prior (for the
     Dirichlet weights), and 0 for the Haldane prior. Priors lower than 1 
     produce a more conservative (wider) posterior, whereas priors greater 
     than 1 are more liberal, shrinking the posterior bootstrap statistics 
     toward the maximum-likelihood estimates.

     (For the Haldane prior, normal-quantile CIs use std(BOOTSTAT,1,2) to
     match the population normalization used for the interval formula.)

     'bootbayes (Y, X, ..., NBOOT, PROB, PRIOR, SEED)' initialises the
     Mersenne Twister random number generator using an integer SEED value so
     that 'bootbayes' results are reproducible.

     'bootbayes (Y, X, ..., NBOOT, PROB, PRIOR, SEED, L)' multiplies the
     regression coefficients by the hypothesis matrix L. If L is not provided
     or is empty, it will assume the default value of 1 (i.e. no change to
     the design). Otherwise, L must have the same number of rows as the number
     of columns in X.

     'STATS = bootbayes (...)' returns a structure where each field contains
     a matrix of size (p x q), where p is the number of predictors and q is
     the number of outcomes (columns in Y). Fields include: original, bias, 
     median, stdev, CI_lower, CI_upper and prior.

     '[STATS, BOOTSTAT] = bootbayes (Y, ...)' also returns the bootstrap
     statistics. If Y is a column vector, BOOTSTAT is a (p x nboot) matrix. 
     If Y is a matrix (q > 1), BOOTSTAT is a (1 x q) cell array where each
     cell contains the (p x nboot) matrix for that outcome.

  Bibliography:
  [1] Rubin (1981) The Bayesian Bootstrap. Ann. Statist. 9(1):130-134
  [2] Weng (1989) On a Second-Order Asymptotic property of the Bayesian
        Bootstrap Mean. Ann. Statist. 17(2):705-710
  [3] Liu, Gelman & Zheng (2015). Simulation-efficient shortest probability
        intervals. Statistics and Computing, 25(4), 809–819. 

  bootbayes (version 2026.02.02)
  Author: Andrew Charles Penn
  https://www.researchgate.net/profile/Andrew_Penn/

  Copyright 2019 Andrew Charles Penn
  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program.  If not, see http://www.gnu.org/licenses/

Demonstration 1

The following code


 % Input univariate dataset
 heights = [183, 192, 182, 183, 177, 185, 188, 188, 182, 185].';

 % 95% credible interval for the mean 
 bootbayes (heights);

 % Please be patient, the calculations will be completed soon...

Produces the following output

Summary of Bayesian bootstrap estimates of bias and precision for linear models
*******************************************************************************

Bootstrap settings: 
 Function: sum (W * Y)
 Resampling method: Bayesian bootstrap
 Prior: Symmetric Dirichlet distribution (alpha = 0.8)
 Number of resamples: 1999 
 Credible interval (CI) type: Shortest probability interval
 Credible interval: 95%

Posterior statistics for outcome 1: 
 original     bias         median       stdev       CI_lower      CI_upper
 +184.5       -0.003551    +184.5       1.290       +182.0        +187.3

Demonstration 2

The following code


 % Input bivariate dataset
 X = [ones(43,1),...
     [01,02,03,04,05,06,07,08,09,10,11,...
      12,13,14,15,16,17,18,19,20,21,22,...
      23,25,26,27,28,29,30,31,32,33,34,...
      35,36,37,38,39,40,41,42,43,44]'];
 y = [188.0,170.0,189.0,163.0,183.0,171.0,185.0,168.0,173.0,183.0,173.0,...
     173.0,175.0,178.0,183.0,192.4,178.0,173.0,174.0,183.0,188.0,180.0,...
     168.0,170.0,178.0,182.0,180.0,183.0,178.0,182.0,188.0,175.0,179.0,...
     183.0,192.0,182.0,183.0,177.0,185.0,188.0,188.0,182.0,185.0]';

 % 95% credible interval for the regression coefficents
 bootbayes (y, X);

 % Please be patient, the calculations will be completed soon...

Produces the following output

Summary of Bayesian bootstrap estimates of bias and precision for linear models
*******************************************************************************

Bootstrap settings: 
 Function: pinv (X' * W * X) * (X' * W * y)
 Resampling method: Bayesian bootstrap
 Prior: Symmetric Dirichlet distribution (alpha = 0.93)
 Number of resamples: 1999 
 Credible interval (CI) type: Shortest probability interval
 Credible interval: 95%

Posterior statistics for outcome 1: 
 original     bias         median       stdev       CI_lower      CI_upper
 +175.5       -0.002966    +175.5       2.553       +170.9        +180.8    
 +0.1904      +0.001350    +0.1954      0.08563     +0.03125      +0.3551

Demonstration 3

The following code

 %% --- Stress-test: Simulated Large-Scale Patch-seq Project (bootbayes) ---
 %% N = 7500 cells (observations), p = 15 features, q = 2000 genes (outcomes).
 %% This tests multivariate un-weaving logic and HPD interval scaling.

 N = 7500;       
 p = 15;         
 q = 2000;       
 nboot = 500;    % Sufficient for stable HPD intervals

 fprintf('Simulating a Large-Scale Patch-seq Dataset (%d x %d)...\n', N, q);

 % Generate design matrix X (e.g., E-phys features)
 X = [ones(N,1), randn(N, p-1)];

 % Generate multivariate outcome Y (Gene expression)
 % Approx 120MB of data
 true_beta = randn(p, q) .* (rand(p, q) > 0.9); 
 Y = X * true_beta + randn(N, q) * 0.5;

 fprintf('Running bootbayes with %d bootstrap resamples...\n', nboot);
 tic;
 % Running with default alpha=0.05 (95% Credible Intervals)
 stats = bootbayes(Y, X, [], nboot);
 runtime = toc;

 fprintf('\n--- Performance Results ---\n');
 fprintf('Runtime: %.2f seconds\n', runtime);
 fprintf('Total parameters estimated: %d\n', p * q);

 % Accuracy Check on a random gene
 target_gene = randi(q);
 % Note: checking if original is returned as p x q matrix
 if all(size(stats.original) == [p, q])
    estimated = stats.original(:, target_gene);
 else
    % Fallback if it is returned as the interleaved vector
    estimated = stats.original((target_gene-1)*p + (1:p));
 endif

 actual = true_beta(:, target_gene);
 correlation = corr(estimated, actual);

 fprintf('Correlation of estimates for Gene %d: %.4f\n', target_gene, correlation);

 % Verify Credible Interval structure
 fprintf('Number of outcome cells in stats.stdev: %d\n', length(stats.stdev));

 if correlation > 0.98
   fprintf('Result: PASSED (High accuracy OLS recovery)\n');
 else
   fprintf('Result: WARNING (Low correlation - check for colinearity)\n');
 end

Produces the following output

Simulating a Large-Scale Patch-seq Dataset (7500 x 2000)...
Running bootbayes with 500 bootstrap resamples...

--- Performance Results ---
Runtime: 57.95 seconds
Total parameters estimated: 30000
Correlation of estimates for Gene 339: 0.9999
Number of outcome cells in stats.stdev: 2000
Result: PASSED (High accuracy OLS recovery)

Package: statistics-resampling