parseWilkinsonFormula
statistics: terms = parseWilkinsonFormula (formula)
statistics: result = parseWilkinsonFormula (formula, mode)
statistics: [X, y, names] = parseWilkinsonFormula (formula, "model_matrix", data)
Parse and expand statistical model formulae using the Wilkinson notation.
This function implements the recursive-descent parser and expansion logic described by Wilkinson & Rogers (1973) for factorial models. It allows the symbolic specification of analysis of variance and regression models, converting strings into computational schemas or design matrices. It also supports multi-variable response specification on the Left-Hand Side (LHS) using lists or ranges.
The first input argument, formula, is a Wilkinson notation string given
either as a character vector or as a string scalar. The following symbols
are valid:
Right-Hand Side (Model) Operators
The RHS specifies the independent variables (predictors) and the structural
relationships between them, such as interactions and nesting. The parser
expands these expressions into fundamental model terms following the standard
statistical rules of marginality. Additionally, explicit nesting notation
(e.g., B(A)) is supported to denote that factor B is nested within A.
| Operator | Description | Expansion Example |
|---|---|---|
| + | Addition (Union) | A + B expands to A, B |
| * | Crossing | A * B expands to A, B, A:B |
| - | Deletion | A*B - A:B expands to A, B |
| / | Nesting | A / B expands to A, A:B |
| : | Interaction | A : B expands to A:B |
| ^ | Power (Limit) | (A+B)^2 expands to A, B, A:B |
| 1 | Intercept | y ~ A - 1 removes intercept |
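The expansion rules in the table can be sketched with plain cell arrays. The following is a minimal illustration in base Octave, not the parser's actual implementation. The helper names interact, dot_product, and union_terms are hypothetical, terms are represented as strings such as 'A' or 'A:B', and the power case is shown only for exponent 2.

```octave
1;  # script marker, so the helper functions below can be defined inline

function t = interact (a, b)
  ## Interaction of two terms: merge factor names, drop duplicates, sort.
  t = strjoin (unique ([strsplit(a, ':'), strsplit(b, ':')]), ':');
endfunction

function out = dot_product (L, R)
  ## All pairwise interactions between the terms of L and the terms of R.
  out = {};
  for i = 1:numel (L)
    for j = 1:numel (R)
      out{end+1} = interact (L{i}, R{j});
    endfor
  endfor
endfunction

function out = union_terms (L, R)
  ## Append the terms of R to L, skipping terms that are already present.
  out = L;
  for k = 1:numel (R)
    if (! any (strcmp (out, R{k})))
      out{end+1} = R{k};
    endif
  endfor
endfunction

A = {'A'};
B = {'B'};
sum_AB   = union_terms (A, B);                                 # A + B
crossing = union_terms (sum_AB, dot_product (A, B))            # A * B
nesting  = union_terms (A, dot_product (A, B))                 # A / B
deletion = crossing(! strcmp (crossing, 'A:B'))                # A*B - A:B
power2   = union_terms (sum_AB, dot_product (sum_AB, sum_AB))  # (A+B)^2
```

Because interact removes repeated factor names, A:A collapses to A, which is why (A+B)^2 yields no terms beyond A, B, and A:B.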
Left-Hand Side (Response) Operators
The LHS, separated by the ~ operator, defines the dependent variables.
It natively supports multi-response syntaxes.
| Operator | Description | Usage Example |
|---|---|---|
| ~ | Formula separator | y ~ x |
| , | List separator | y1, y2 ~ x |
| - | Range operator | T1 - T3 ~ x |
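One way to picture the range operator is as an enumeration of numeric suffixes between the two endpoints. The sketch below is an assumption about the behaviour, written in base Octave, and is not taken from the parser's source. It presumes both endpoints share the same name prefix and end in an integer.

```octave
lhs = 'T1 - T3';

## Split the range into its two endpoints and trim surrounding blanks.
ends = strtrim (strsplit (lhs, '-'));

## Separate each endpoint into a name prefix and a numeric suffix.
lo = regexp (ends{1}, '^(\D+)(\d+)$', 'tokens', 'once');
hi = regexp (ends{2}, '^(\D+)(\d+)$', 'tokens', 'once');

## Enumerate every suffix between the two endpoints, inclusive.
idx = str2double (lo{2}) : str2double (hi{2});
responses = arrayfun (@(k) sprintf ('%s%d', lo{1}, k), idx, ...
                      'UniformOutput', false)
```

For this input the sketch lists T1, T2, and T3 as the response variables.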
Processing Modes
parseWilkinsonFormula (formula, mode) evaluates the
formula string based on the selected mode:
'expand' (default) - Returns a structure containing
response and model fields. Each field contains cell arrays
of the expanded, fundamental terms.
'equation' - Generates a string representing the mathematical
equation of the fitted model. Coefficients are represented generically as
c1, c2, .... If multiple responses are specified, it returns a
string array of equations.
| Formula String | Equation Output |
|---|---|
| y ~ x | "y = c1 + c2*x" |
| y ~ A * B | "y = c1 + c2*A + c3*B + c4*A*B" |
| y ~ School / Class | "y = c1 + c2*School + c3*Class*School" |
| y ~ x^2 | "y = c1 + c2*x + c3*x^2" |
| y1 - y2 ~ Trt | ["y1 = c1 + c2*Trt", "y2 = ..."] |
'matrix' - Returns a schema structure containing a binary
matrix defining term membership, useful for internal algorithmic processing.
'model_matrix' - Constructs the numeric Design Matrix (X)
and Response Matrix (y) directly from a provided data table.
'parse' - Returns the raw Abstract Syntax Tree (AST) structure.
'tokenize' - Returns the array of tokens generated by the lexer.
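The relationship between the expanded terms and the 'equation' output can be sketched as follows. This is an illustrative reconstruction in base Octave, assuming the intercept always takes c1 and each remaining term receives the next coefficient in order. It is not the function's own code.

```octave
response = 'y';
terms = {'A', 'B', 'A:B'};        # expanded model terms, intercept excluded

rhs = {'c1'};                     # the intercept takes the first coefficient
for k = 1:numel (terms)
  ## Interactions are written as products, so 'A:B' becomes 'A*B'.
  rhs{end+1} = sprintf ('c%d*%s', k + 1, strrep (terms{k}, ':', '*'));
endfor
equation = [response, ' = ', strjoin(rhs, ' + ')]
```

The resulting character vector matches the y ~ A * B row of the equation table.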
Data Handling ('model_matrix' mode)
When using the 'model_matrix' mode, a data argument must be
provided as an Octave table.
Rows with NaN values in any of the
active variables are automatically omitted from the final matrices.
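The effect of this omission can be reproduced by hand. The sketch below uses base Octave only and assumes that a missing value in either the response or a predictor drops the entire observation. The variable keep is a name introduced here for illustration.

```octave
Age    = [10; 12; NaN; 16];
Height = [140; NaN; 155; 162];

## Keep only the rows where every active variable is observed.
keep = ! any (isnan ([Age, Height]), 2);

X = [ones(nnz(keep), 1), Age(keep)]   # intercept column plus the predictor
y = Height(keep)
```

Only the first and last observations survive, so X and y each have two rows.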
Outputs
result (or terms) - The processed model structure, string array, or cell array, depending on the selected mode.
X - The generated numeric design matrix (observations x parameters). It includes
a column of ones for the intercept unless - 1 appears in the formula.
y - The numeric response matrix (observations x K responses).
names - A cell array of character vectors containing the column names of the generated design matrix X.
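As a hedged usage sketch, the three 'model_matrix' outputs can be passed directly to a least-squares solve. The matrices below are typed in by hand, using the values reported for the formula 'Height ~ Age', so that the snippet runs in base Octave without the package. The fitting step is an illustration and is not part of parseWilkinsonFormula.

```octave
## Design and response matrices as reported for 'Height ~ Age'.
X = [1 10; 1 12; 1 14; 1 16; 1 18];
y = [140; 148; 155; 162; 170];
names = {'(Intercept)'; 'Age'};

beta = X \ y;     # least-squares coefficients, one per column of X
for k = 1:numel (names)
  printf ('%-12s %8.3f\n', names{k}, beta(k));
endfor
```

Each entry of names labels the coefficient in the same position of beta.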
References
Wilkinson, G. N. and Rogers, C. E. (1973). Symbolic Description of Factorial Models for Analysis of Variance. Applied Statistics, 22, 392-399.
## Simple Linear Regression :
## This example models a continuous response (Height) as a linear function
## of a single continuous predictor (Age). The 'equation' mode returns the
## symbolic representation, while 'model_matrix' generates the design matrix.
Age = [10; 12; 14; 16; 18];
Height = [140; 148; 155; 162; 170];
t = table (Height, Age);
formula = 'Height ~ Age';
disp (['Formula: ', formula]);
equation = parseWilkinsonFormula (formula, 'equation')
[X, y, names] = parseWilkinsonFormula (formula, 'model_matrix', t)
Formula: Height ~ Age
equation =
string
"Height = c1 + c2*Age"
X =
1 10
1 12
1 14
1 16
1 18
y =
140
148
155
162
170
names =
2x1 cell array
{'(Intercept)'}
{'Age' }
## Multiple Regression :
## Here we model House Price based on two independent predictors: Area and
## number of Rooms. The '+' operator adds terms to the model without assuming
## any interaction between them.
Price = [300; 350; 400; 450];
Area = [1500; 1800; 2200; 2500];
Rooms = [3; 3; 4; 5];
t = table (Price, Area, Rooms);
formula = 'Price ~ Area + Rooms';
disp (['Formula: ', formula]);
equation = parseWilkinsonFormula (formula, 'equation')
[X, y, names] = parseWilkinsonFormula (formula, 'model_matrix', t)
Formula: Price ~ Area + Rooms
equation =
string
"Price = c1 + c2*Area + c3*Rooms"
X =
1 3 1500
1 3 1800
1 4 2200
1 5 2500
y =
300
350
400
450
names =
3x1 cell array
{'(Intercept)'}
{'Rooms' }
{'Area' }
## Interaction Effects :
## We analyze Relief Score based on Drug Type and Dosage Level.
## The '*' operator expands to the main effects PLUS the interaction term.
## Dummy (indicator) variables for the categorical predictors are
## created automatically.
Relief = [5; 7; 6; 8];
Drug = {'Placebo'; 'Placebo'; 'Active'; 'Active'};
Dose = {'Low'; 'High'; 'Low'; 'High'};
t = table (Relief, Drug, Dose);
formula = 'Relief ~ Drug * Dose';
disp (['Formula: ', formula]);
equation = parseWilkinsonFormula (formula, 'equation')
[X, y, names] = parseWilkinsonFormula (formula, 'model_matrix', t)
Formula: Relief ~ Drug * Dose
equation =
string
"Relief = c1 + c2*Drug + c3*Dose + c4*Dose*Drug"
X =
1 1 1 1
1 1 0 0
1 0 1 0
1 0 0 0
y =
5
7
6
8
names =
4x1 cell array
{'(Intercept)' }
{'Drug_Placebo' }
{'Dose_Low' }
{'Dose_Low:Drug_Placebo'}
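The design matrix shown here is consistent with treatment (dummy) coding. The base Octave sketch below rebuilds it by hand, assuming 'Active' and 'High' are the reference levels, as the column names suggest. The variable names are introduced for illustration only.

```octave
Drug = {'Placebo'; 'Placebo'; 'Active'; 'Active'};
Dose = {'Low'; 'High'; 'Low'; 'High'};

## Indicator columns; 'Active' and 'High' act as the reference levels.
drug_placebo = double (strcmp (Drug, 'Placebo'));
dose_low     = double (strcmp (Dose, 'Low'));

## The interaction column is the element-wise product of the indicators.
interaction = dose_low .* drug_placebo;

## Intercept, two main effects, and their interaction.
X_manual = [ones(4, 1), drug_placebo, dose_low, interaction]
```

Only the first observation (Placebo, Low) has a nonzero interaction entry, matching the last column of X.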
## Polynomial Regression :
## Uses the power operator (^) to model non-linear relationships.
Distance = [20; 45; 80; 125];
Speed = [30; 50; 70; 90];
Speed_2 = Speed .^ 2;
t = table (Distance, Speed, Speed_2, 'VariableNames', {'Distance', 'Speed', 'Speed^2'});
formula = 'Distance ~ Speed^2';
disp (['Formula: ', formula]);
equation = parseWilkinsonFormula (formula, 'equation')
[X, y, names] = parseWilkinsonFormula (formula, 'model_matrix', t)
Formula: Distance ~ Speed^2
equation =
string
"Distance = c1 + c2*Speed + c3*Speed^2"
X =
1 900 30
1 2500 50
1 4900 70
1 8100 90
y =
20
45
80
125
names =
3x1 cell array
{'(Intercept)'}
{'Speed^2' }
{'Speed' }
## Hierarchical Design :
## Common in psychometrics. Here, 'Class' is nested within 'School'.
## The '/' operator implies School + School:Class.
Score = [88; 92; 75; 80];
School = {'North'; 'North'; 'South'; 'South'};
Class = {'Rm101'; 'Rm102'; 'Rm201'; 'Rm202'};
t = table (Score, School, Class);
formula = 'Score ~ School / Class';
disp (['Formula: ', formula]);
equation = parseWilkinsonFormula (formula, 'equation')
terms = parseWilkinsonFormula (formula, 'expand')
Formula: Score ~ School / Class
equation =
string
"Score = c1 + c2*School + c3*Class*School"
terms =
scalar structure containing the fields:
response =
{
[1,1] =
{
[1,1] = Score
}
}
model =
{
[1,1] = {}(0x0)
[1,2] =
{
[1,1] = School
}
[1,3] =
{
[1,1] = Class
[1,2] = School
}
}
## Explicit Nesting :
## The parser also supports the explicit 'B(A)' syntax, which means
## 'B is nested within A'. This is equivalent to the interaction 'A:B'
## but often used to denote random effects or specific hierarchy.
formula = 'y ~ Class(School)';
disp (['Formula: ', formula]);
equation = parseWilkinsonFormula (formula, 'equation')
terms = parseWilkinsonFormula (formula, 'expand')
Formula: y ~ Class(School)
equation =
string
"y = c1 + c2*Class(School)"
terms =
scalar structure containing the fields:
response =
{
[1,1] =
{
[1,1] = y
}
}
model =
{
[1,1] = {}(0x0)
[1,2] =
{
[1,1] = Class
[1,2] = School
}
}
## Excluding Terms :
## Demonstrates building a complex model and then simplifying it.
## We define a full three-way factorial model, (A + B + C)^3, which is
## equivalent to A*B*C, and explicitly remove the three-way term (A:B:C)
## using the minus operator.
formula = 'y ~ (A + B + C)^3 - A:B:C';
disp (['Formula: ', formula]);
equation = parseWilkinsonFormula (formula, 'equation')
terms = parseWilkinsonFormula (formula, 'expand')
Formula: y ~ (A + B + C)^3 - A:B:C
equation =
string
"y = c1 + c2*A + c3*B + c4*C + c5*A*B + c6*A*C + c7*B*C"
terms =
scalar structure containing the fields:
response =
{
[1,1] =
{
[1,1] = y
}
}
model =
{
[1,1] = {}(0x0)
[1,2] =
{
[1,1] = A
}
[1,3] =
{
[1,1] = B
}
[1,4] =
{
[1,1] = C
}
[1,5] =
{
[1,1] = A
[1,2] = B
}
[1,6] =
{
[1,1] = A
[1,2] = C
}
[1,7] =
{
[1,1] = B
[1,2] = C
}
}
## Repeated Measures :
## A multi-response LHS models several outcomes simultaneously.
## The range operator '-' selects all variables between 'T1' and 'T3'
## as the response matrix Y.
T1 = [10; 11];
T2 = [12; 13];
T3 = [14; 15];
Treatment = {'Control'; 'Treated'};
t = table (T1, T2, T3, Treatment);
formula = 'T1 - T3 ~ Treatment';
disp (['Formula: ', formula]);
equations = parseWilkinsonFormula (formula, 'equation')
[X, Y, names] = parseWilkinsonFormula (formula, 'model_matrix', t)
Formula: T1 - T3 ~ Treatment
equations =
3x1 string array
"T1 = c1 + c2*Treatment"
"T2 = c3 + c4*Treatment"
"T3 = c5 + c6*Treatment"
X =
1 0
1 1
Y =
10 12 14
11 13 15
names =
2x1 cell array
{'(Intercept)' }
{'Treatment_Treated'}