Function Reference: parseWilkinsonFormula

statistics: terms = parseWilkinsonFormula (formula)
statistics: result = parseWilkinsonFormula (formula, mode)
statistics: [X, y, names] = parseWilkinsonFormula (formula, "model_matrix", data)

Parse and expand statistical model formulae using the Wilkinson notation.

This function implements the recursive-descent parser and expansion logic described by Wilkinson & Rogers (1973) for factorial models. It allows the symbolic specification of analysis of variance and regression models, converting strings into computational schemas or design matrices. It also supports multi-variable response specification on the Left-Hand Side (LHS) using lists or ranges.

parseWilkinsonFormula accepts as its first input argument a Wilkinson notation string, formula, specified as either a character vector or a string scalar. The valid symbols are listed below.

Right-Hand Side (Model) Operators

The RHS specifies the independent variables (predictors) and the structural relationships between them, such as interactions and nesting. The parser expands these expressions into fundamental model terms following the standard statistical rules of marginality. Additionally, explicit nesting notation (e.g., B(A)) is supported to denote that factor B is nested within A.

Operator  Description       Expansion Example
+         Addition (Union)  A + B expands to A, B
*         Crossing          A * B expands to A, B, A:B
-         Deletion          A*B - A:B expands to A, B
/         Nesting           A / B expands to A, A:B
:         Interaction       A : B expands to A:B
^         Power (Limit)     (A+B)^2 expands to A, B, A:B
1         Intercept         y ~ A - 1 removes intercept
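
The expansion rules in the table can be mimicked with plain set operations on term labels. The sketch below is only an illustration under stated assumptions, not the parser's actual implementation: the helper `interact` and the variables `crossed`, `nested`, and `reduced` are hypothetical names, and a term is represented as a ':'-joined string with its factor names sorted alphabetically.

```octave
## Illustrative sketch only: hypothetical helpers, not part of the package.
## A term is a string such as 'A' or 'A:B'; a model is a cell array of terms.

## Interaction of two terms: merge their factor names and drop duplicates.
interact = @(a, b) strjoin (unique ([strsplit(a, ':'), strsplit(b, ':')]), ':');

L = {'A'};
R = {'B'};

## Crossing L * R: terms of L, terms of R, plus every pairwise interaction.
crossed = [L, R];
for i = 1:numel (L)
  for j = 1:numel (R)
    crossed{end+1} = interact (L{i}, R{j});
  endfor
endfor
disp (strjoin (crossed, ', '));   # terms of A * B

## Nesting A / B: A itself, plus B within A (written A:B).
nested = [L, {interact(L{1}, R{1})}];
disp (strjoin (nested, ', '));    # terms of A / B

## Deletion A*B - A:B: remove the listed term (setdiff also sorts the result).
reduced = setdiff (crossed, {'A:B'});
disp (strjoin (reduced, ', '));   # terms of A*B - A:B
```

Representing each term as a sorted label makes A:B and B:A compare equal, which is why duplicates collapse naturally in this sketch.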

Left-Hand Side (Response) Operators

The LHS, separated by the ~ operator, defines the dependent variables. It natively supports multi-response syntaxes.

Operator  Description        Usage Example
~         Formula separator  y ~ x
,         List separator     y1, y2 ~ x
-         Range operator     T1 - T3 ~ x
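
How a range such as T1 - T3 is resolved is not spelled out here. The sketch below assumes the endpoints share a prefix and differ only in a trailing integer, which is consistent with the 'equation' output of Example 8 but is not confirmed by this reference. The variable names are hypothetical and the code is not taken from the package.

```octave
## Illustrative sketch only: ASSUMES a shared prefix and a numeric suffix.
lo = 'T1';
hi = 'T3';

## Split each endpoint into a non-digit prefix and a trailing integer.
tok_lo = regexp (lo, '^(\D+)(\d+)$', 'tokens', 'once');
tok_hi = regexp (hi, '^(\D+)(\d+)$', 'tokens', 'once');

prefix = tok_lo{1};
idx = str2double (tok_lo{2}):str2double (tok_hi{2});

## Rebuild every response name in the range.
responses = arrayfun (@(k) sprintf ('%s%d', prefix, k), idx, ...
                      'UniformOutput', false);
disp (strjoin (responses, ', '));
```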

Processing Modes

parseWilkinsonFormula (formula, mode) evaluates the formula string based on the selected mode:

  • 'expand' (default) - Returns a structure containing response and model fields. Each field contains cell arrays of the expanded, fundamental terms.
  • 'equation' - Generates a string representing the mathematical equation of the fitted model. Coefficients are represented generically as c1, c2, .... If multiple responses are specified, it returns a string array of equations.
    Formula String      Equation Output
    y ~ x               "y = c1 + c2*x"
    y ~ A * B           "y = c1 + c2*A + c3*B + c4*A*B"
    y ~ School / Class  "y = c1 + c2*School + c3*Class*School"
    y ~ x^2             "y = c1 + c2*x + c3*x^2"
    y1 - y2 ~ Trt       ["y1 = c1 + c2*Trt", "y2 = ..."]
  • 'matrix' - Returns a schema structure containing a binary matrix defining term membership, useful for internal algorithmic processing.
  • 'model_matrix' - Constructs the numeric Design Matrix (X) and Response Matrix (y) directly from a provided data table.
  • 'parse' - Returns the raw Abstract Syntax Tree (AST) structure.
  • 'tokenize' - Returns the array of tokens generated by the lexer.
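
To make the 'equation' mode more concrete, the sketch below assembles an equation string from an already expanded term list, reproducing the y ~ A * B row of the table above. It is only an illustration: the variables `response`, `terms`, and `eqn` are hypothetical, and the package may build its strings differently.

```octave
## Illustrative sketch only; not the package's internal code.
response = 'y';
terms = {'A', 'B', 'A:B'};      # expanded terms of y ~ A * B

## c1 is the intercept; each further term gets the next coefficient.
eqn = sprintf ('%s = c1', response);
for k = 1:numel (terms)
  ## Interactions are written as products in the equation string.
  eqn = [eqn, sprintf(' + c%d*%s', k + 1, strrep (terms{k}, ':', '*'))];
endfor
disp (eqn);
```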

Data Handling ('model_matrix' mode)

When using the 'model_matrix' mode, a data argument must be provided as an Octave table.

  • Categorical Variables: Cell arrays of strings in the table are automatically detected as categorical factors and undergo corner-point (reference) dummy coding.
  • Numeric Variables: Standard numeric vectors are treated as continuous predictors or responses.
  • Missing Data: Rows containing NaN values in any of the active variables are automatically omitted from the final matrices.
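
The data-handling rules above can be sketched in plain Octave. This is a minimal illustration, not the package's implementation. It assumes the reference level is the first level in sorted order, which matches the column names in Examples 3 and 8 but is not stated explicitly. All variable names are hypothetical.

```octave
## Illustrative sketch only; not the package's internal code.
Relief = [5; 7; NaN; 8];
Drug   = {'Placebo'; 'Placebo'; 'Active'; 'Active'};

## Reference (corner-point) coding: one indicator column per non-baseline
## level. ASSUMPTION: the baseline is the first level in sorted order.
[levels, ~, idx] = unique (Drug);        # levels = {'Active'; 'Placebo'}
D = zeros (numel (Drug), numel (levels) - 1);
for k = 2:numel (levels)
  D(:, k-1) = (idx(:) == k);
endfor

## Omit rows with NaN in any active variable (only Relief has one here).
keep = ! isnan (Relief);
X = [ones(sum (keep), 1), D(keep, :)];   # intercept column plus dummies
y = Relief(keep);
```

With these inputs the third row is dropped, so X and y each keep three observations.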

Outputs

terms / result

The processed model structure, string array, or cell array depending on the selected mode.

X

The generated numeric design matrix (Observations x Parameters). Includes a column of ones for the intercept unless - 1 is in the formula.

y

The numeric response matrix (Observations x K responses).

names

A cell array of character vectors containing the column names corresponding to the generated design matrix X.
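
One common use of these outputs is an ordinary least-squares fit. The sketch below is not part of the function itself: it hard-codes the X, y, and names values shown in Example 1 so that it runs without the package, and solves the system with Octave's backslash operator.

```octave
## X, y, and names as shown in Example 1 for 'Height ~ Age'.
X = [1 10; 1 12; 1 14; 1 16; 1 18];
y = [140; 148; 155; 162; 170];
names = {'(Intercept)'; 'Age'};

## Least-squares coefficients, one per column of X.
beta = X \ y;

## Pair each coefficient with its column name.
for k = 1:numel (names)
  printf ('%-12s %8.4f\n', names{k}, beta(k));
endfor
```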

References

Wilkinson, G. N. and Rogers, C. E. (1973). Symbolic Description of Factorial Models for Analysis of Variance. Applied Statistics, 22, 392-399.


Example: 1

 


 ## Simple Linear Regression :
 ## This example models a continuous response (Height) as a linear function
 ## of a single continuous predictor (Age). The 'equation' mode returns the
 ## symbolic representation, while 'model_matrix' generates the design matrix.
 Age = [10; 12; 14; 16; 18];
 Height = [140; 148; 155; 162; 170];
 t = table (Height, Age);

 formula = 'Height ~ Age';
 disp (['Formula: ', formula]);
 equation = parseWilkinsonFormula (formula, 'equation')
 [X, y, names] = parseWilkinsonFormula (formula, 'model_matrix', t)

Formula: Height ~ Age
equation =
  string

   "Height = c1 + c2*Age"
X =

    1   10
    1   12
    1   14
    1   16
    1   18

y =

   140
   148
   155
   162
   170

names =
  2x1 cell array

    {'(Intercept)'}    
    {'Age'        }    

                    

Example: 2

 


 ## Multiple Regression :
 ## Here we model House Price based on two independent predictors: Area and
 ## number of Rooms. The '+' operator adds terms to the model without assuming
 ## any interaction between them.
 Price = [300; 350; 400; 450];
 Area  = [1500; 1800; 2200; 2500];
 Rooms = [3; 3; 4; 5];
 t = table (Price, Area, Rooms);

 formula = 'Price ~ Area + Rooms';
 disp (['Formula: ', formula]);
 equation = parseWilkinsonFormula (formula, 'equation')
 [X, y, names] = parseWilkinsonFormula (formula, 'model_matrix', t)

Formula: Price ~ Area + Rooms
equation =
  string

   "Price = c1 + c2*Area + c3*Rooms"
X =

      1      3   1500
      1      3   1800
      1      4   2200
      1      5   2500

y =

   300
   350
   400
   450

names =
  3x1 cell array

    {'(Intercept)'}    
    {'Rooms'      }    
    {'Area'       }    

                    

Example: 3

 


 ## Interaction Effects :
 ## We analyze Relief Score based on Drug Type and Dosage Level.
 ## The '*' operator expands to the main effects plus the interaction term.
 ## Drug and Dose are cell arrays of strings, so they are automatically
 ## treated as categorical factors and dummy coded.
 Relief = [5; 7; 6; 8];
 Drug   = {'Placebo'; 'Placebo'; 'Active'; 'Active'};
 Dose   = {'Low'; 'High'; 'Low'; 'High'};
 t = table (Relief, Drug, Dose);

 formula = 'Relief ~ Drug * Dose';
 disp (['Formula: ', formula]);
 equation = parseWilkinsonFormula (formula, 'equation')
 [X, y, names] = parseWilkinsonFormula (formula, 'model_matrix', t)

Formula: Relief ~ Drug * Dose
equation =
  string

   "Relief = c1 + c2*Drug + c3*Dose + c4*Dose*Drug"
X =

   1   1   1   1
   1   1   0   0
   1   0   1   0
   1   0   0   0

y =

   5
   7
   6
   8

names =
  4x1 cell array

    {'(Intercept)'          }    
    {'Drug_Placebo'         }    
    {'Dose_Low'             }    
    {'Dose_Low:Drug_Placebo'}    

                    

Example: 4

 


 ## Polynomial Regression : 
 ## Uses the power operator (^) to model non-linear relationships.
 Distance = [20; 45; 80; 125];
 Speed    = [30; 50; 70; 90];
 Speed_2  = Speed .^ 2; 
 t = table (Distance, Speed, Speed_2, 'VariableNames', {'Distance', 'Speed', 'Speed^2'});

 formula = 'Distance ~ Speed^2';
 disp (['Formula: ', formula]);
 equation = parseWilkinsonFormula (formula, 'equation')
 [X, y, names] = parseWilkinsonFormula (formula, 'model_matrix', t)

Formula: Distance ~ Speed^2
equation =
  string

   "Distance = c1 + c2*Speed + c3*Speed^2"
X =

      1    900     30
      1   2500     50
      1   4900     70
      1   8100     90

y =

    20
    45
    80
   125

names =
  3x1 cell array

    {'(Intercept)'}    
    {'Speed^2'    }    
    {'Speed'      }    

                    

Example: 5

 


 ## Hierarchical Design.
 ## Common in psychometrics. Here, 'Class' is nested within 'School'.
 ## The '/' operator implies School + School:Class.
 Score  = [88; 92; 75; 80];
 School = {'North'; 'North'; 'South'; 'South'};
 Class  = {'Rm101'; 'Rm102'; 'Rm201'; 'Rm202'};
 t = table (Score, School, Class);

 formula = 'Score ~ School / Class';
 disp (['Formula: ', formula]);
 equation = parseWilkinsonFormula (formula, 'equation')
 terms = parseWilkinsonFormula (formula, 'expand')

Formula: Score ~ School / Class
equation =
  string

   "Score = c1 + c2*School + c3*Class*School"
terms =

  scalar structure containing the fields:

    response =
    {
      [1,1] =
      {
        [1,1] = Score
      }

    }

    model =
    {
      [1,1] = {}(0x0)
      [1,2] =
      {
        [1,1] = School
      }

      [1,3] =
      {
        [1,1] = Class
        [1,2] = School
      }

    }


                    

Example: 6

 


 ## Explicit Nesting : 
 ## The parser also supports the explicit 'B(A)' syntax, which means
 ## 'B is nested within A'. It expands to the same term as the interaction
 ## 'A:B' but is often used to denote random effects or a specific hierarchy.
 formula = 'y ~ Class(School)';
 disp (['Formula: ', formula]);
 equation = parseWilkinsonFormula (formula, 'equation')
 terms = parseWilkinsonFormula (formula, 'expand')

Formula: y ~ Class(School)
equation =
  string

   "y = c1 + c2*Class(School)"
terms =

  scalar structure containing the fields:

    response =
    {
      [1,1] =
      {
        [1,1] = y
      }

    }

    model =
    {
      [1,1] = {}(0x0)
      [1,2] =
      {
        [1,1] = Class
        [1,2] = School
      }

    }


                    

Example: 7

 


 ## Excluding Terms :
 ## Demonstrates building a complex model and then simplifying it.
 ## (A + B + C)^3 expands to the full three-way factorial model, the same
 ## terms as A*B*C. The minus operator then removes the three-way
 ## interaction (A:B:C).
 formula = 'y ~ (A + B + C)^3 - A:B:C';
 disp (['Formula: ', formula]);
 equation = parseWilkinsonFormula (formula, 'equation')
 terms = parseWilkinsonFormula (formula, 'expand')

Formula: y ~ (A + B + C)^3 - A:B:C
equation =
  string

   "y = c1 + c2*A + c3*B + c4*C + c5*A*B + c6*A*C + c7*B*C"
terms =

  scalar structure containing the fields:

    response =
    {
      [1,1] =
      {
        [1,1] = y
      }

    }

    model =
    {
      [1,1] = {}(0x0)
      [1,2] =
      {
        [1,1] = A
      }

      [1,3] =
      {
        [1,1] = B
      }

      [1,4] =
      {
        [1,1] = C
      }

      [1,5] =
      {
        [1,1] = A
        [1,2] = B
      }

      [1,6] =
      {
        [1,1] = A
        [1,2] = C
      }

      [1,7] =
      {
        [1,1] = B
        [1,2] = C
      }

    }


                    

Example: 8

 


 ## Repeated Measures : 
 ## A multi-response LHS models several outcomes simultaneously.
 ## The range operator '-' selects all variables between 'T1' and 'T3'
 ## as the response matrix Y.
 T1 = [10; 11];
 T2 = [12; 13];
 T3 = [14; 15];
 Treatment = {'Control'; 'Treated'};
 t = table (T1, T2, T3, Treatment);

 formula = 'T1 - T3 ~ Treatment';
 disp (['Formula: ', formula]);
 equations = parseWilkinsonFormula (formula, 'equation')
 [X, Y, names] = parseWilkinsonFormula (formula, 'model_matrix', t)

Formula: T1 - T3 ~ Treatment
equations =
  3x1 string array

    "T1 = c1 + c2*Treatment"    
    "T2 = c3 + c4*Treatment"    
    "T3 = c5 + c6*Treatment"    

X =

   1   0
   1   1

Y =

   10   12   14
   11   13   15

names =
  2x1 cell array

    {'(Intercept)'      }    
    {'Treatment_Treated'}