parseWilkinsonFormula
statistics: terms = parseWilkinsonFormula (formula)
statistics: result = parseWilkinsonFormula (formula, mode)
statistics: [X, y, names] = parseWilkinsonFormula (formula, "model_matrix", data)
Parse and expand statistical model formulae using the Wilkinson notation.
This function implements the recursive-descent parser and expansion logic described by Wilkinson & Rogers (1973) for factorial models. It allows the symbolic specification of analysis of variance and regression models, converting strings into computational schemas or design matrices. It also supports multi-variable response specification on the Left-Hand Side (LHS) using lists or ranges.
parseWilkinsonFormula accepts as its first input argument formula, a
Wilkinson notation string given either as a character vector or a
string scalar. The following symbols are valid:
, : List separator for selecting multiple responses (on the LHS).
- : Range operator for selecting multiple responses (on the LHS).
+ : Term addition (Union of terms).
- : Term deletion (Difference of terms).
* : Crossing (Expands to Main Effects + Interaction).
/ : Nesting (Hierarchical relationship).
: : Direct interaction.
^ : Crossing expansion limit (crossing is expanded only up to interactions of the given order).
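The crossing rule in the list above can be sketched in a few lines of Octave. This is a minimal illustration under stated assumptions, not the parser used by parseWilkinsonFormula: the helper cross_terms is hypothetical, and each term is represented as a cell array of variable names.

```octave
1;  # mark this file as a script so a function can be defined inline

## Hypothetical helper (not part of the statistics package).
## S and T are cell arrays of terms; each term is a cellstr of variable names.
## Crossing keeps every term of S and T and adds all pairwise interactions.
function out = cross_terms (S, T)
  out = [S, T];
  for i = 1:numel (S)
    for j = 1:numel (T)
      t = unique ([S{i}, T{j}]);   # merge variable names, sorted, no repeats
      out{end+1} = t(:).';         # force row orientation
    endfor
  endfor
endfunction

A = {{"A"}};  B = {{"B"}};  C = {{"C"}};

## A * B * C is evaluated left to right: (A * B) * C
terms = cross_terms (cross_terms (A, B), C);

## Join the variable names of each term with ":" for display
labels = cellfun (@(t) strjoin (t, ":"), terms, "UniformOutput", false);
disp (labels);
```

The sketch yields seven terms (three main effects, three two-way interactions, one three-way interaction), the same count as the 1x7 result of the "A * B * C" demo on this page. It does not remove duplicate terms, which a full implementation would need to do.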
parseWilkinsonFormula (formula, mode) further specifies
how to process the Wilkinson notation given in formula. mode
must be a character vector or a string scalar with one of the following
values:
"expand" (default) : Returns a cell array of character vectors
containing the expanded model terms (e.g., {"A", "B", "A:B"}).
"matrix" : Returns a schema structure containing a binary
matrix defining term membership.
"model_matrix" : Constructs the full Design Matrix (X)
and Response Matrix (y) based on the provided data. Uses
corner-point (reference) coding for categorical variables.
"parse" : Returns the raw Abstract Syntax Tree (AST).
"tokenize" : Returns the list of tokens generated by the lexer
(useful only for debugging).
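The corner-point (reference) coding mentioned for "model_matrix" can be illustrated by hand. The sketch below is an assumption about how such coding is usually built, not the function's own code: levels are sorted with unique, the first level is taken as the reference, and column names follow the Variable_Level pattern seen in the demos on this page.

```octave
## Categorical variable from the "Yield ~ Variety" demo
variety = {"A"; "A"; "B"; "B"; "C"; "C"};

## Sorted levels and, for each observation, the index of its level
[levels, ~, idx] = unique (variety);
idx = idx(:);                      # make sure the index is a column vector

n = numel (variety);
X = ones (n, 1);                   # intercept column
names = {"(Intercept)"};

## One indicator column per non-reference level (the first level is the reference)
for k = 2:numel (levels)
  X = [X, double(idx == k)];
  names{end+1} = ["Variety_", levels{k}];
endfor

disp (names);
disp (X);
```

With the first level as reference, the intercept represents level "A" and each remaining column marks membership in "B" or "C".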
[X, y, names] = parseWilkinsonFormula (formula, "model_matrix", data) additionally takes data, a structure or
a table containing the data variables. data is required only when mode is
"model_matrix".
NaN values are automatically removed.
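One common way to remove missing values from a design matrix is to drop every observation (row) that contains a NaN in either the predictors or the response. The sketch below shows that approach on made-up numbers; whether parseWilkinsonFormula removes NaN values exactly this way is an assumption, since this page does not describe the mechanism.

```octave
## Made-up data: row 2 has a missing predictor, row 3 a missing response
X = [1 25; 1 NaN; 1 35; 1 40];
y = [120; 122; NaN; 130];

## Keep only the rows that are complete in both X and y
keep = ! any (isnan ([X, y]), 2);
X_clean = X(keep, :);
y_clean = y(keep, :);

disp (X_clean);
disp (y_clean);
```

Filtering X and y with the same logical index keeps the two matrices aligned row by row.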
Outputs
result : The processed model structure, depending on the selected mode.
X : The numeric design matrix (observations x parameters).
y : The response matrix (observations x K).
names : A cell array of column names corresponding to X.
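To show how the three outputs fit together, the sketch below fits ordinary least squares by hand. To stay self-contained it hard-codes X, y and names with the values from the categorical "Yield ~ Variety" demo on this page instead of calling parseWilkinsonFormula. Using the backslash operator for the fit is only an illustration of a downstream step, not something the function does itself.

```octave
## Values as produced in the "Yield ~ Variety" demo
X = [1 0 0; 1 0 0; 1 1 0; 1 1 0; 1 0 1; 1 0 1];
y = [10; 12; 15; 14; 11; 13];
names = {"(Intercept)"; "Variety_B"; "Variety_C"};

## Least-squares coefficients, one per column of X
beta = X \ y;

## Pair each coefficient with its column name
for k = 1:numel (names)
  printf ("%-12s %6.2f\n", names{k}, beta(k));
endfor
```

With reference coding the intercept equals the mean of the reference group, and the other coefficients are differences between each group mean and the reference mean.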
References
Wilkinson, G. N. and Rogers, C. E. (1973). Symbolic Description of Factorial Models for Analysis of Variance. Applied Statistics, 22, 392-399.
## Demo : Tokenizer Mode
## Inspects the raw tokens generated from a formula string.
formula = "y ~ A * (B + c)";
tokens = parseWilkinsonFormula (formula, "tokenize");
display (tokens);
tokens =
1x10 struct array containing the fields:
type
value
pos
## Demo : Parser Mode (AST generation)
## Returns the Abstract Syntax Tree (AST) structure.
formula = "A / B";
tree = parseWilkinsonFormula (formula, "parse");
display (tree);
tree =
scalar structure containing the fields:
type = OPERATOR
value = /
left =
scalar structure containing the fields:
type = IDENTIFIER
value = A
left = [](0x0)
right = [](0x0)
right =
scalar structure containing the fields:
type = IDENTIFIER
value = B
left = [](0x0)
right = [](0x0)
## Demo : Expansion Mode (Crossings)
## Demonstrates standard Wilkinson expansion for interactions.
formula = "A * B * C";
terms = parseWilkinsonFormula (formula, "expand");
disp (terms);
1x7 cell array
{1x1 cell} {1x1 cell} {1x2 cell} {1x1 cell} {1x2 cell} {1x2 cell} {1x3 cell}
## Demo : Expansion Mode (Nesting)
## Demonstrates hierarchical nesting logic.
formula = "Block / Plot / Subplot";
terms = parseWilkinsonFormula (formula, "expand");
disp (terms);
1x3 cell array
{1x1 cell} {1x2 cell} {1x3 cell}
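The three terms in this output have one, two and three variables. Under the usual Wilkinson nesting rule, A / B = A + A:B applied from left to right, this corresponds to the sketch below. The term labels are an assumption drawn from that rule, since the demo output only shows the term sizes.

```octave
## Variables of the nested chain, outermost first
chain = {"Block", "Plot", "Subplot"};

## Each level contributes one term: itself combined with all enclosing levels
terms = {};
for k = 1:numel (chain)
  terms{end+1} = strjoin (chain(1:k), ":");
endfor

disp (terms);
```

Unlike crossing, nesting never produces a main effect for an inner factor: Plot and Subplot appear only in combination with the levels that contain them.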
## Demo : Matrix Schema Mode
## Generates the binary terms matrix (Row = Term, Col = Variable).
formula = "y ~ Age + Height + Age:Height";
schema = parseWilkinsonFormula (formula, "matrix");
disp (schema.VariableNames);
disp (schema.Terms);
1x3 cell array
{'Age'} {'Height'} {'y'}
0 0 0
0 1 0
1 0 0
1 1 0
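The binary matrix printed above can be reproduced by hand from a list of terms. The sketch below is an illustration only: the term list, including the empty term for the intercept and the row order, is copied from the demo output rather than derived by parsing the formula.

```octave
## Variable names in the column order shown by the demo
vars = {"Age", "Height", "y"};

## Terms in the row order shown by the demo; the empty term is the intercept
terms = {{}, {"Height"}, {"Age"}, {"Age", "Height"}};

## T(i, j) is 1 when variable j takes part in term i
T = zeros (numel (terms), numel (vars));
for i = 1:numel (terms)
  if (! isempty (terms{i}))
    T(i, :) = ismember (vars, terms{i});
  endif
endfor

disp (T);
```

The response y has an all-zero column because it does not take part in any term on the right-hand side.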
## Demo : Model Matrix (Regression / Continuous)
## Builds the Design Matrix (X) and Response (y) for numeric data.
d_reg.BP = [120; 122; 128; 130; 125];
d_reg.Age = [25; 30; 35; 40; 32];
d_reg.Weight = [70; 75; 80; 85; 78];
[X, y, names] = parseWilkinsonFormula ("BP ~ Age * Weight", "model_matrix", d_reg);
disp (names);
disp (X);
4x1 cell array
{'(Intercept)'}
{'Weight' }
{'Age' }
{'Age:Weight' }
1 70 25 1750
1 75 30 2250
1 80 35 2800
1 85 40 3400
1 78 32 2496
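For continuous variables, the interaction column in this output is the element-wise product of the two main-effect columns. The sketch below rebuilds the matrix by hand to make that visible. It is an illustration and not the function's own code, and the column order is copied from the demo output.

```octave
Age    = [25; 30; 35; 40; 32];
Weight = [70; 75; 80; 85; 78];

## Columns: (Intercept), Weight, Age, Age:Weight
X = [ones(numel (Age), 1), Weight, Age, Age .* Weight];

disp (X);
```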
## Demo : Model Matrix (ANOVA / Categorical)
## Automatically handles categorical variables (dummy coding).
d_cat.Yield = [10; 12; 15; 14; 11; 13];
d_cat.Variety = {"A"; "A"; "B"; "B"; "C"; "C"};
[X, y, names] = parseWilkinsonFormula ("Yield ~ Variety", "model_matrix", d_cat);
disp (names);
disp (X);
3x1 cell array
{'(Intercept)'}
{'Variety_B' }
{'Variety_C' }
1 0 0
1 0 0
1 1 0
1 1 0
1 0 1
1 0 1
## Demo : Model Matrix (Mixed Numeric & Categorical)
## Demonstrates Analysis of Covariance (ANCOVA) structures.
d_mix.Growth = [1.2; 1.4; 1.1; 1.8];
d_mix.Fertilizer = {"Old"; "Old"; "New"; "New"};
d_mix.Dose = [10; 20; 10; 20];
[X, ~, names] = parseWilkinsonFormula ("Growth ~ Fertilizer * Dose", "model_matrix", d_mix);
disp (names);
disp (X);
4x1 cell array
{'(Intercept)' }
{'Fertilizer_Old' }
{'Dose' }
{'Dose:Fertilizer_Old'}
1 1 10 10
1 1 20 20
1 0 10 0
1 0 20 0
## Demo : Multi-Response (List)
## Selects specific response variables using comma.
d_list = struct ();
d_list.Yield_A = [10; 12; 11; 14];
d_list.Yield_B = [20; 22; 21; 24];
d_list.Rain = [100; 110; 105; 120];
formula = "Yield_A, Yield_B ~ Rain";
[X, y, names] = parseWilkinsonFormula (formula, "model_matrix", d_list);
disp (names);
disp (y);
disp (X);
2x1 cell array
{'(Intercept)'}
{'Rain' }
10 20
12 22
11 21
14 24
1 100
1 110
1 105
1 120
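With a multi-column response, one least-squares solve fits every response at once, giving one column of coefficients per response. The sketch below hard-codes the X and y shown above so that it runs on its own. The fit with the backslash operator is an illustrative next step and is not performed by parseWilkinsonFormula.

```octave
## Design matrix and two-column response from the demo output
X = [1 100; 1 110; 1 105; 1 120];
Y = [10 20; 12 22; 11 21; 14 24];

## Column k of B holds the coefficients for response k
B = X \ Y;

disp (B);
```

Because Yield_B equals Yield_A plus 10 in this data, both responses get the same slope for Rain and differ only in the intercept.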
## Demo : Multi-Response (Range)
## Selects a contiguous range of variables using the hyphen.
d_rng.Y_Jan = rand (4, 1);
d_rng.Y_Feb = rand (4, 1);
d_rng.Y_Mar = rand (4, 1);
d_rng.Trt = {"A"; "B"; "A"; "B"};
formula = "Y_Jan - Y_Mar ~ Trt";
[X, y, names] = parseWilkinsonFormula (formula, "model_matrix", d_rng);
disp (names);
disp (y);
disp (X);
2x1 cell array
{'(Intercept)'}
{'Trt_B' }
0.7084 0.7574 0.1893
0.5557 0.8113 0.8787
0.2801 0.8206 0.8471
0.9481 0.3679 0.4640
1 0
1 1
1 0
1 1