Function Reference: parseWilkinsonFormula

statistics: terms = parseWilkinsonFormula (formula)
statistics: result = parseWilkinsonFormula (formula, mode)
statistics: [X, y, names] = parseWilkinsonFormula (formula, "model_matrix", data)

Parse and expand statistical model formulae using the Wilkinson notation.

This function implements the recursive-descent parser and expansion logic described by Wilkinson & Rogers (1973) for factorial models. It allows analysis of variance and regression models to be specified symbolically, converting formula strings into computational schemas or design matrices. Multiple response variables can be specified on the Left-Hand Side (LHS) as a list or as a range.

The first input argument, formula, is a Wilkinson notation string given either as a character vector or as a string scalar. The following operators are valid:

  • LHS (Response) Operators:
    • , : List separator for selecting multiple responses.
    • - : Range operator for selecting multiple responses.
  • RHS (Model) Operators:
    • + : Term addition (Union of terms).
    • - : Term deletion (Difference of terms).
    • * : Crossing (Expands to Main Effects + Interaction).
    • / : Nesting (Hierarchical relationship).
    • : : Direct interaction.
    • ^ : Crossing expansion limit.
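The crossing and nesting rules can be traced by hand on small term sets. The following is a minimal sketch in plain Octave, not the function's actual implementation: terms are stored as strings, interact is a hypothetical helper introduced only for illustration, and the nesting step relies on the last term already containing every outer factor.

```octave
## Sketch only: terms are plain strings, interactions are written "A:B".
## interact is a hypothetical helper, not part of the statistics package.
interact = @(l, r) [l, ":", r];

## Crossing: A * B -> A + B + A:B
ab = {"A", "B", interact("A", "B")};

## Crossing again: (A * B) * C adds C plus every existing term crossed with C.
abc = [ab, {"C"}, cellfun(@(t) interact (t, "C"), ab, "UniformOutput", false)];
disp (strjoin (abc, " + "));

## Nesting: Block / Plot -> Block + Block:Plot
bp = {"Block", interact("Block", "Plot")};

## Nesting again: Subplot is nested within the deepest term, Block:Plot.
bps = [bp, {interact(bp{end}, "Subplot")}];
disp (strjoin (bps, " + "));
```

The seven terms in abc have 1, 1, 2, 1, 2, 2 and 3 factors, and the three terms in bps have 1, 2 and 3 factors, which agrees with the cell sizes printed by the crossing and nesting demos on this page.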

parseWilkinsonFormula (formula, mode) further specifies how to process the Wilkinson notation in formula. mode must be a character vector or a string scalar with one of the following values:

  • 'expand' (default) : Returns a cell array of character vectors containing the expanded model terms (e.g., {"A", "B", "A:B"}).
  • 'matrix' : Returns a schema structure containing a binary matrix defining term membership.
  • 'model_matrix' : Constructs the full Design Matrix (X) and Response Matrix (y) based on the provided data. Uses corner-point (reference) coding for categorical variables.
  • 'parse' : Returns the raw Abstract Syntax Tree (AST).
  • 'tokenize' : Returns the list of tokens generated by the lexer (useful only for debugging).
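Corner-point (reference) coding, as used by the 'model_matrix' mode, can be illustrated without calling the function. This is a minimal sketch under stated assumptions: the first level in sorted order is taken as the reference, and the "Variety_" column prefix mirrors the names printed by the categorical demo on this page. The function's internal procedure may differ.

```octave
## Factor values taken from the categorical demo on this page.
variety = {"A"; "A"; "B"; "B"; "C"; "C"};

## Sorted unique levels; idx maps each observation to its level.
[levels, ~, idx] = unique (variety);
levels = levels(:);
idx = idx(:);

## Assumption: the first sorted level is the reference and gets no column.
D = double (idx == (2:numel (levels)));

## Prepend the intercept column and build matching column names.
X = [ones(numel (variety), 1), D];
names = [{"(Intercept)"}; strcat("Variety_", levels(2:end))];
disp (names);
disp (X);
```

Each non-reference level gets one indicator column, so a factor with three levels contributes two columns next to the intercept.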

[X, y, names] = parseWilkinsonFormula (formula, "model_matrix", data) additionally accepts data, a structure or a table containing the data variables. data is required only when mode is "model_matrix".

  • Field names must match variables in the formula.
  • Response variables (LHS) must be numeric.
  • Rows containing NaN are automatically removed.
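The row removal described in the last item is commonly done by listwise deletion. The sketch below shows that technique on made-up values. It assumes every column of the response and design matrices is checked, which this page does not state explicitly.

```octave
## Made-up illustration data: one NaN in the response, one in a predictor.
y = [1.2; NaN; 1.1; 1.8];
X = [1 10; 1 20; 1 NaN; 1 20];

## Keep only the rows where no value is NaN (listwise deletion).
keep = ! any (isnan ([y, X]), 2);
y_clean = y(keep, :);
X_clean = X(keep, :);
disp (keep');
```

Here rows 2 and 3 are dropped, so y_clean and X_clean stay aligned row by row.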

Outputs

terms/result

The processed model structure depending on the selected mode.

X

The numeric design matrix (observations x parameters).

y

The response matrix (observations x K), where K is the number of response variables selected on the LHS.

names

A cell array of column names corresponding to X.
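One typical use of X and y is an ordinary least-squares fit. The sketch below is illustrative only: the matrices are copied from the comma-separated multi-response demo on this page so that the block runs without the function, and the backslash solve is a generic Octave idiom rather than something the function does itself.

```octave
## X and y copied from the comma-separated multi-response demo on this page.
X = [1 100; 1 110; 1 105; 1 120];
y = [10 20; 12 22; 11 21; 14 24];

## Least-squares solve; column k of beta holds the coefficients of response k.
## Row 1 is the intercept, row 2 the slope for Rain.
beta = X \ y;

## Fitted values and residuals, one column per response.
fitted = X * beta;
resid = y - fitted;
disp (beta);
```

For this data the fit is exact: both responses have a slope of 0.2, with intercepts of -10 and 0, so the residuals are zero up to rounding.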

References

Wilkinson, G. N. and Rogers, C. E. (1973). Symbolic Description of Factorial Models for Analysis of Variance. Applied Statistics, 22, 392-399.

Source Code: parseWilkinsonFormula

Example: 1

 

 ## Demo : Tokenizer Mode
 ## Inspects the raw tokens generated from a formula string.
 formula = "y ~ A * (B + c)";
 tokens = parseWilkinsonFormula (formula, "tokenize");
 display (tokens);

tokens =

  1x10 struct array containing the fields:

    type
    value
    pos

                    

Example: 2

 

 ## Demo : Parser Mode (AST generation)
 ## Returns the Abstract Syntax Tree (AST) structure.
 formula = "A / B";
 tree = parseWilkinsonFormula (formula, "parse");
 display (tree);

tree =

  scalar structure containing the fields:

    type = OPERATOR
    value = /
    left =

      scalar structure containing the fields:

        type = IDENTIFIER
        value = A
        left = [](0x0)
        right = [](0x0)

    right =

      scalar structure containing the fields:

        type = IDENTIFIER
        value = B
        left = [](0x0)
        right = [](0x0)


                    

Example: 3

 

 ## Demo : Expansion Mode (Crossings)
 ## Demonstrates standard Wilkinson expansion for interactions.
 formula = "A * B * C";
 terms = parseWilkinsonFormula (formula, "expand");
 disp (terms);

  1x7 cell array

    {1x1 cell}    {1x1 cell}    {1x2 cell}    {1x1 cell}    {1x2 cell}    {1x2 cell}    {1x3 cell}    

                    

Example: 4

 

 ## Demo : Expansion Mode (Nesting)
 ## Demonstrates hierarchical nesting logic.
 formula = "Block / Plot / Subplot";
 terms = parseWilkinsonFormula (formula, "expand");
 disp (terms);

  1x3 cell array

    {1x1 cell}    {1x2 cell}    {1x3 cell}    

                    

Example: 5

 

 ## Demo : Matrix Schema Mode
 ## Generates the binary terms matrix (Row = Term, Col = Variable).
 formula = "y ~ Age + Height + Age:Height";
 schema = parseWilkinsonFormula (formula, "matrix");
 disp (schema.VariableNames);
 disp (schema.Terms);

  1x3 cell array

    {'Age'}    {'Height'}    {'y'}    

   0   0   0
   0   1   0
   1   0   0
   1   1   0
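The membership matrix printed above can be rebuilt by hand, which may help when reading a schema. This is a sketch only: the term order is copied from the printed output, and the all-zero first row is assumed to stand for the intercept, represented here by an empty string.

```octave
## Variable order and term order copied from the printed output above.
vars  = {"Age", "Height", "y"};
terms = {"", "Height", "Age", "Age:Height"};   # "" = assumed intercept row

## One row per term, one column per variable.
T = zeros (numel (terms), numel (vars));
for i = 1:numel (terms)
  if (! isempty (terms{i}))
    ## Split "Age:Height" into its factors and mark the matching columns.
    factors = strsplit (terms{i}, ":");
    T(i, :) = ismember (vars, factors);
  endif
endfor
disp (T);
```

The response y never appears in a model term, so its column stays zero in every row.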
                    

Example: 6

 

 ## Demo : Model Matrix (Regression / Continuous)
 ## Builds the Design Matrix (X) and Response (y) for numeric data.
 d_reg.BP = [120; 122; 128; 130; 125];
 d_reg.Age = [25; 30; 35; 40; 32];
 d_reg.Weight = [70; 75; 80; 85; 78];
 [X, y, names] = parseWilkinsonFormula ("BP ~ Age * Weight", "model_matrix", d_reg);
 disp (names);
 disp (X);

  4x1 cell array

    {'(Intercept)'}    
    {'Weight'     }    
    {'Age'        }    
    {'Age:Weight' }    

      1     70     25   1750
      1     75     30   2250
      1     80     35   2800
      1     85     40   3400
      1     78     32   2496
                    

Example: 7

 

 ## Demo : Model Matrix (ANOVA / Categorical)
 ## Automatically handles categorical variables (dummy coding).
 d_cat.Yield = [10; 12; 15; 14; 11; 13];
 d_cat.Variety = {"A"; "A"; "B"; "B"; "C"; "C"};
 [X, y, names] = parseWilkinsonFormula ("Yield ~ Variety", "model_matrix", d_cat);
 disp (names);
 disp (X);

  3x1 cell array

    {'(Intercept)'}    
    {'Variety_B'  }    
    {'Variety_C'  }    

   1   0   0
   1   0   0
   1   1   0
   1   1   0
   1   0   1
   1   0   1
                    

Example: 8

 

 ## Demo : Model Matrix (Mixed Numeric & Categorical)
 ## Demonstrates Analysis of Covariance (ANCOVA) structures.
 d_mix.Growth = [1.2; 1.4; 1.1; 1.8];
 d_mix.Fertilizer = {"Old"; "Old"; "New"; "New"};
 d_mix.Dose = [10; 20; 10; 20];
 [X, ~, names] = parseWilkinsonFormula ("Growth ~ Fertilizer * Dose", "model_matrix", d_mix);
 disp (names);
 disp (X);

  4x1 cell array

    {'(Intercept)'        }    
    {'Fertilizer_Old'     }    
    {'Dose'               }    
    {'Dose:Fertilizer_Old'}    

    1    1   10   10
    1    1   20   20
    1    0   10    0
    1    0   20    0
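The columns printed above can be reproduced by hand, which shows what the interaction column holds. The sketch assumes that "New" is the reference level (it has no column in the output) and that the interaction is the element-wise product of the dummy column and Dose. The numbers agree with the output, but the function's internal steps are not documented here.

```octave
## Data copied from the mixed numeric and categorical demo above.
fertilizer = {"Old"; "Old"; "New"; "New"};
dose = [10; 20; 10; 20];

## Dummy column for the non-reference level "Old" (assumed reference: "New").
is_old = double (strcmp (fertilizer, "Old"));

## Intercept, dummy, covariate, and their element-wise product.
X = [ones(4, 1), is_old, dose, dose .* is_old];
disp (X);
```

The last column equals Dose for "Old" rows and zero otherwise, so its coefficient measures how the Dose slope differs between the two fertilizers.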
                    

Example: 9

 

 ## Demo : Multi-Response (List)
 ## Selects specific response variables using a comma-separated list.
 d_list = struct ();
 d_list.Yield_A = [10; 12; 11; 14];
 d_list.Yield_B = [20; 22; 21; 24];
 d_list.Rain    = [100; 110; 105; 120];
 formula = "Yield_A, Yield_B ~ Rain";
 [X, y, names] = parseWilkinsonFormula (formula, "model_matrix", d_list);
 disp (names);
 disp (y);
 disp (X);

  2x1 cell array

    {'(Intercept)'}    
    {'Rain'       }    

   10   20
   12   22
   11   21
   14   24
     1   100
     1   110
     1   105
     1   120
                    

Example: 10

 

 ## Demo : Multi-Response (Range)
 ## Selects a contiguous range of response variables using the hyphen.
 d_rng.Y_Jan = rand (4, 1);
 d_rng.Y_Feb = rand (4, 1);
 d_rng.Y_Mar = rand (4, 1);
 d_rng.Trt   = {"A"; "B"; "A"; "B"};
 formula = "Y_Jan - Y_Mar ~ Trt";
 [X, y, names] = parseWilkinsonFormula (formula, "model_matrix", d_rng);
 disp (names);
 disp (y);
 disp (X);

  2x1 cell array

    {'(Intercept)'}    
    {'Trt_B'      }    

   0.7084   0.7574   0.1893
   0.5557   0.8113   0.8787
   0.2801   0.8206   0.8471
   0.9481   0.3679   0.4640
   1   0
   1   1
   1   0
   1   1