Class Definition: cvpartition

statistics: cvpartition

Partition data for cross-validation

The cvpartition class generates a partitioning scheme on a dataset to facilitate cross-validation of statistical models utilizing training and testing subsets of the dataset.

See also: crossval

Properties

NumObservations

A positive integer scalar specifying the number of observations in the dataset (including any missing data, where applicable). This property is read-only.

NumTestSets

A positive integer scalar specifying the number of folds for partition types 'kfold' and 'leaveout'. For partition types 'holdout' and 'resubstitution', NumTestSets is 1. This property is read-only.

TrainSize

A positive integer scalar specifying the size of the training set for partition types 'holdout' and 'resubstitution', or a vector of positive integers specifying the size of each training set for partition types 'kfold' and 'leaveout'. This property is read-only.

TestSize

A positive integer scalar specifying the size of the test set for partition types 'holdout' and 'resubstitution', or a vector of positive integers specifying the size of each test set for partition types 'kfold' and 'leaveout'. This property is read-only.

Type

A character vector specifying the type of the cvpartition object. It can be 'kfold', 'holdout', 'leaveout', or 'resubstitution'. This property is read-only.

IsCustom

A logical scalar specifying whether the cvpartition object was created using custom partitioning (true) or not (false). This property is read-only.

IsGrouped

A logical scalar specifying whether the cvpartition object was created using grouping variables (true) or not (false). This property is read-only.

IsStratified

A logical scalar specifying whether the cvpartition object was created with a 'Stratify' value of true (true) or not (false). This property is read-only.
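
As a minimal sketch of how these read-only properties might be inspected, the snippet below builds a default k-fold partition and reads the properties that this page refers to by name (NumTestSets, Type, IsStratified, IsGrouped). It assumes the statistics package is installed; the variable names are illustrative only.

```octave
pkg load statistics

## Nonstratified k-fold partition on 25 observations (default fold count)
C = cvpartition (25, 'KFold');

## Read-only properties, accessed with dot notation
numSets = C.NumTestSets    # number of folds (10 by default for n >= 10)
ptype = C.Type             # partition type as a character vector
isStrat = C.IsStratified   # false: built from a count, not from class labels
isGrp = C.IsGrouped        # false: no grouping variables were supplied
```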

Methods

cvpartition: C = cvpartition (n, 'KFold')
cvpartition: C = cvpartition (n, 'KFold', k)
cvpartition: C = cvpartition (n, 'KFold', k, 'GroupingVariables', grpvars)
cvpartition: C = cvpartition (n, 'Holdout')
cvpartition: C = cvpartition (n, 'Holdout', p)
cvpartition: C = cvpartition (n, 'Leaveout')
cvpartition: C = cvpartition (n, 'Resubstitution')
cvpartition: C = cvpartition (X, 'KFold')
cvpartition: C = cvpartition (X, 'KFold', k)
cvpartition: C = cvpartition (X, 'KFold', k, 'Stratify', opt)
cvpartition: C = cvpartition (X, 'Holdout')
cvpartition: C = cvpartition (X, 'Holdout', p)
cvpartition: C = cvpartition (X, 'Holdout', p, 'Stratify', opt)
cvpartition: C = cvpartition ('CustomPartition', testSets)

C = cvpartition (n, 'KFold') creates a cvpartition object C, which defines a random nonstratified partition for k-fold cross-validation on n observations with each fold (subsample) having approximately the same number of observations. The default number of folds is 10 for n >= 10 or equal to n otherwise.

C = cvpartition (n, 'KFold', k) also creates a nonstratified random partition for k-fold cross-validation with the number of folds defined by k, which must be a positive integer scalar smaller than the number of observations n.
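
The defining property of a k-fold partition can be checked directly. The sketch below, which assumes the statistics package is installed and uses illustrative variable names, builds a 4-fold partition on 20 observations and confirms that every observation lands in exactly one test set.

```octave
pkg load statistics

n = 20;
k = 4;
C = cvpartition (n, 'KFold', k);

## One column per fold; true marks the observations tested in that fold
T = test (C, "all");

## Every observation should appear in exactly one test set
onceEach = all (sum (T, 2) == 1)

## Number of test observations per fold (approximately equal by design)
foldSizes = sum (T, 1)
```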

C = cvpartition (n, 'KFold', k, 'GroupingVariables', grpvars) creates a cvpartition object C that defines a random partition for k-fold cross-validation in which observations sharing the same combination of group labels, as defined by grpvars, are kept together in the same fold. The grouping variables specified in grpvars can be one of the following:

  • A numeric vector, logical vector, categorical vector, character array, string array, or cell array of character vectors containing one grouping variable.
  • A numeric matrix or cell array containing two or more grouping variables. Each column in the matrix or array must correspond to one grouping variable.
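
A short sketch of a grouped partition with a single numeric grouping variable follows. The six groups of five observations are hypothetical data, and the loop simply checks that no group is split across test folds; it assumes the statistics package is installed.

```octave
pkg load statistics

## Six hypothetical groups of five observations each (30 observations)
g = repelem ((1:6)', 5);
C = cvpartition (numel (g), 'KFold', 3, 'GroupingVariables', g);
T = test (C, "all");

## For each group, find the fold(s) in which its members are tested
intact = true;
for j = 1:6
  foldsUsed = find (any (T(g == j, :), 1));
  intact = intact && (numel (foldsUsed) == 1);
endfor
intact
```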

C = cvpartition (n, 'Holdout') creates a cvpartition object C, which defines a random nonstratified partition for holdout validation on n observations. 90% of the observations are assigned to the training set and the remaining 10% to the test set.

C = cvpartition (n, 'Holdout', p) also creates a nonstratified random partition for holdout validation with the split between training and test sets defined by p, which can be a scalar value in the range (0,1), giving a proportion of the observations, or a positive integer scalar in the range [1,n), giving a number of observations.
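
The following sketch illustrates holdout partitions on 100 observations, assuming the statistics package is installed. The default split is checked against the 90%/10% rule stated above; for the explicit p the resulting sizes are only displayed, not assumed.

```octave
pkg load statistics

n = 100;

## Default holdout: 90% of observations for training, 10% for testing
C = cvpartition (n, 'Holdout');
nTrain = sum (training (C))
nTest = sum (test (C))

## Holdout with an explicit fraction p; display the resulting set sizes
Cp = cvpartition (n, 'Holdout', 0.25);
sizes = [sum(training (Cp)), sum(test (Cp))]

## Each observation belongs to exactly one of the two sets
complementary = all (xor (training (Cp), test (Cp)))
```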

C = cvpartition (n, 'Leaveout') creates a cvpartition object C, which defines a random partition for leave-one-out cross-validation on n observations. This is a special case of k-fold cross-validation with the number of folds equal to the number of observations.
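
As a small illustration, the sketch below (assuming the statistics package is installed) creates a leave-one-out partition on five observations and checks that each fold tests exactly one observation.

```octave
pkg load statistics

n = 5;
C = cvpartition (n, 'Leaveout');
T = test (C, "all");    # n-by-n logical matrix, one column per fold

numFolds = C.NumTestSets

## Each fold tests one observation, and each observation is tested once
oneEach = all (sum (T, 1) == 1) && all (sum (T, 2) == 1)
```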

C = cvpartition (n, 'Resubstitution') creates a cvpartition object C that does not partition the data: both the training set and the test set contain all n observations.
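
A minimal check of the resubstitution behavior, assuming the statistics package is installed:

```octave
pkg load statistics

C = cvpartition (8, 'Resubstitution');

## Both index vectors should select every observation
allTrain = all (training (C))
allTest = all (test (C))
numObs = numel (test (C))
```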

C = cvpartition (X, 'KFold') creates a cvpartition object C, which defines a stratified random partition for k-fold cross-validation according to the class proportions in X. X can be a numeric, logical, categorical, or string vector, or a character array or a cell array of character vectors. Missing values in X are discarded. The default number of folds is 10 for numel (X) >= 10 or equal to numel (X) otherwise.

C = cvpartition (X, 'KFold', k) also creates a stratified random partition for k-fold cross-validation with the number of folds defined by k, which must be a positive integer scalar smaller than the number of observations in X.

C = cvpartition (X, 'KFold', k, 'Stratify', opt) creates a random partition for k-fold cross-validation, which is stratified if opt is true, or nonstratified if opt is false.
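
To make the effect of the 'Stratify' option concrete, the sketch below builds both variants from the same imbalanced labels and counts the minority class in each test set. The label vector is hypothetical and the statistics package is assumed to be installed. With stratification, each count should stay close to 10 / 5 = 2; without it, the counts may vary more from fold to fold.

```octave
pkg load statistics

## Imbalanced labels: 30 observations of class 1, 10 of class 2
labels = [ones(30, 1); 2 * ones(10, 1)];

Cs = cvpartition (labels, 'KFold', 5);                     # stratified by default
Cn = cvpartition (labels, 'KFold', 5, 'Stratify', false);  # nonstratified

## Number of class-2 observations in each test set
Ts = test (Cs, "all");
Tn = test (Cn, "all");
minorityStrat = sum (Ts(labels == 2, :), 1)
minorityPlain = sum (Tn(labels == 2, :), 1)
```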

C = cvpartition (X, 'Holdout') creates a cvpartition object C, which defines a stratified random partition for holdout validation while maintaining the class proportions in X. 90% of the observations are assigned to the training set and the remaining 10% to the test set.

C = cvpartition (X, 'Holdout', p) also creates a stratified random partition for holdout validation with the split between training and test sets defined by p, which can be a scalar value in the range (0,1) or a positive integer scalar in the range [1,n), where n is the number of observations in X.

C = cvpartition (X, 'Holdout', p, 'Stratify', opt) creates a random partition for holdout validation, which is stratified if opt is true, or nonstratified if opt is false.

C = cvpartition ('CustomPartition', testSets) creates a custom partition according to testSets, which can be a positive integer vector, a logical vector, or a logical matrix according to the following options:

  • A positive integer vector of length N with values in the range [1,K], where K < N, will specify a K-fold cross-validation partition, in which each value indicates the test set of each observation. Alternatively, the same vector with values in the range [1,N] will specify a leave-one-out cross-validation.
  • A logical vector will specify a holdout validation, in which the true elements correspond to the test set and the false elements correspond to the training set.
  • A logical matrix with K columns will specify a K-fold cross-validation partition, in which each column corresponds to a fold and each row to an observation. Alternatively, an N×N logical matrix will specify a leave-one-out cross-validation, where N is the number of observations. In both cases, true elements correspond to the test set and false elements to the training set.
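
The first two options above can be sketched as follows, assuming the statistics package is installed. The fold assignments and the holdout mask are made-up inputs; the checks confirm that the partition reproduces them.

```octave
pkg load statistics

## Integer vector: observation i is tested in fold foldIdx(i), so K = 3 here
foldIdx = [1; 2; 3; 1; 2; 3; 1; 2; 3];
Ck = cvpartition ('CustomPartition', foldIdx);
t2 = test (Ck, 2);
sameFold2 = isequal (t2(:), foldIdx == 2)

## Logical vector: true marks the held-out test observations
mask = logical ([0; 0; 1; 0; 1; 0; 0; 1]);
Ch = cvpartition ('CustomPartition', mask);
th = test (Ch);
sameMask = isequal (th(:), mask)
typeH = Ch.Type
```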

See also: cvpartition, summary, test, training

cvpartition: Cnew = repartition (C)
cvpartition: Cnew = repartition (C, sval)
cvpartition: Cnew = repartition (C, 'legacy')

Cnew = repartition (C) creates a cvpartition object Cnew that defines a new random partition of the same type as the cvpartition C.

Cnew = repartition (C, sval) also uses the value of sval to set the state of the random generator used in repartitioning C. If sval is a vector, then the random generator is set using the "state" keyword as in rand ("state", sval). If sval is a scalar, then the "seed" keyword is used as in rand ("seed", sval) to specify that old generators should be used.

Cnew = repartition (C, 'legacy') only applies to cvpartition objects C that use k-fold partitioning and it will repartition C in the same non-random manner that was previously used by the old-style cvpartition class of the statistics package. The 'legacy' option does not apply to stratified or grouped partitions.
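
A brief sketch of repartitioning, assuming the statistics package is installed. It checks only what the text above states (the new object has the same type and fold count); the number of observations that change folds is random and is displayed without any expected value.

```octave
pkg load statistics

C = cvpartition (50, 'KFold', 5);

## Draw a fresh random partition of the same type
Cnew = repartition (C);
sameType = strcmp (Cnew.Type, C.Type)
sameFolds = (Cnew.NumTestSets == C.NumTestSets)

## Count observations whose test fold changed (random; usually nonzero)
moved = sum (any (test (C, "all") != test (Cnew, "all"), 2))

## A scalar sval takes the "seed" keyword path described above
Cseed = repartition (C, 42);
```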

See also: cvpartition, summary, test, training

cvpartition: tbl = summary (c)

tbl = summary (c) returns a summary table tbl of the validation partition contained in the cvpartition object c.

This method calculates the distribution of classes (if stratified) or groups (if grouped) across the entire dataset, as well as within every training and test set generated by the partition.

Inputs

  • c A cvpartition object. The object must satisfy two conditions:
    1. The partition type (c.Type) must be "kfold" or "holdout".
    2. The partition must be created with a stratification or grouping variable (i.e., c.IsStratified or c.IsGrouped must be true).

Outputs

  • tbl A table object containing the summary statistics. The table contains one row for every unique label/group in every set (all, train, test). The columns are:
    • Set: The specific subset being described. Values include "all" (the full dataset), "train1", "test1", etc.
    • SetSize: The total number of observations in that specific set.
    • Label: The class or group identifier. If c.IsStratified is true, this column is named StratificationLabel. If c.IsGrouped is true, it is named GroupLabel.
    • Count: The number of observations of that label within the set. If stratified, this column is named StratificationCount; otherwise, GroupCount.
    • PercentInSet: The percentage of the set composed of that specific label.

See also: cvpartition, repartition, test, training

Example: 1

 

 ## 1. Basic Usage
 rng (1, "twister");

 ## Create simple numeric labels
 labels = [ones(10, 1); 2 * ones(10, 1)];
 c = cvpartition (labels, 'KFold', 2);
 summary (c)

ans =
  10x5 table

      Set       SetSize    StratificationLabel    StratificationCount    PercentInSet    
    ________    _______    ___________________    ___________________    ____________    

    "all"            20                      1                     10              50    
    "all"            20                      2                     10              50    
    "train1"         10                      1                      5              50    
    "train1"         10                      2                      5              50    
    "test1"          10                      1                      5              50    
    "test1"          10                      2                      5              50    
    "train2"         10                      1                      5              50    
    "train2"         10                      2                      5              50    
    "test2"          10                      1                      5              50    
    "test2"          10                      2                      5              50    

                    

Example: 2

 

 ## 2. Grouped K-Fold Partition
 rng (101, "twister");

 Region = repelem ({'North'; 'South'; 'East'; 'West'}, [20; 15; 15; 20]);
 Success = repelem ({'Success'; 'Fail'}, [49;  21]);
 Success = Success(randperm (70));
 Tbl = table (Region, Success);

 ## Create Grouped Partition using the 'Region' variable
 c_group = cvpartition (height (Tbl), 'KFold', 4, 'GroupingVariables', Tbl.Region);

 ## Generate Summary
 summary (c_group)

ans =
  36x5 table

      Set       SetSize    GroupLabel    GroupCount    PercentInSet    
    ________    _______    __________    __________    ____________    

    "all"            70    "North"               20         28.5714    
    "all"            70    "South"               15         21.4286    
    "all"            70    "East"                15         21.4286    
    "all"            70    "West"                20         28.5714    
    "train1"         50    "North"                0               0    
    "train1"         50    "South"               15              30    
    "train1"         50    "East"                15              30    
    "train1"         50    "West"                20              40    
    "test1"          20    "North"               20             100    
    "test1"          20    "South"                0               0    
    "test1"          20    "East"                 0               0    
    "test1"          20    "West"                 0               0    
    "train2"         55    "North"               20         36.3636    
    "train2"         55    "South"                0               0    
    "train2"         55    "East"                15         27.2727    
    "train2"         55    "West"                20         36.3636    
    "test2"          15    "North"                0               0    
    "test2"          15    "South"               15             100    
    "test2"          15    "East"                 0               0    
    "test2"          15    "West"                 0               0    
    "train3"         50    "North"               20              40    
    "train3"         50    "South"               15              30    
    "train3"         50    "East"                15              30    
    "train3"         50    "West"                 0               0    
    "test3"          20    "North"                0               0    
    "test3"          20    "South"                0               0    
    "test3"          20    "East"                 0               0    
    "test3"          20    "West"                20             100    
    "train4"         55    "North"               20         36.3636    
    "train4"         55    "South"               15         27.2727    
    "train4"         55    "East"                 0               0    
    "train4"         55    "West"                20         36.3636    
    "test4"          15    "North"                0               0    
    "test4"          15    "South"                0               0    
    "test4"          15    "East"                15             100    
    "test4"          15    "West"                 0               0    

                    

Example: 3

 

 ## 3. Stratified K-Fold Partition
 rng (202, "twister");

 Success = repelem ({'ClassA'; 'ClassB'}, [25; 25]);
 Success = Success(randperm (50));
 Tbl = table (Success);

 ## Create Stratified Partition
 c_strat = cvpartition (Tbl.Success, 'KFold', 5);

 ## Generate Summary
 summary (c_strat)

ans =
  22x5 table

      Set       SetSize    StratificationLabel    StratificationCount    PercentInSet    
    ________    _______    ___________________    ___________________    ____________    

    "all"            50    "ClassA"                                25              50    
    "all"            50    "ClassB"                                25              50    
    "train1"         40    "ClassA"                                20              50    
    "train1"         40    "ClassB"                                20              50    
    "test1"          10    "ClassA"                                 5              50    
    "test1"          10    "ClassB"                                 5              50    
    "train2"         40    "ClassA"                                20              50    
    "train2"         40    "ClassB"                                20              50    
    "test2"          10    "ClassA"                                 5              50    
    "test2"          10    "ClassB"                                 5              50    
    "train3"         40    "ClassA"                                20              50    
    "train3"         40    "ClassB"                                20              50    
    "test3"          10    "ClassA"                                 5              50    
    "test3"          10    "ClassB"                                 5              50    
    "train4"         40    "ClassA"                                20              50    
    "train4"         40    "ClassB"                                20              50    
    "test4"          10    "ClassA"                                 5              50    
    "test4"          10    "ClassB"                                 5              50    
    "train5"         40    "ClassA"                                20              50    
    "train5"         40    "ClassB"                                20              50    
    "test5"          10    "ClassA"                                 5              50    
    "test5"          10    "ClassB"                                 5              50    

                    

Example: 4

 

 ## 4. Handling Missing Values (NaN) in Stratification
 rng (404, "twister");

 ## Create data with missing values (NaN)
 labels = [ones(10, 1); 2 * ones(10, 1); NaN(5, 1)];

 ## Create partition
 c_missing = cvpartition (labels, 'KFold', 2);

 ## Generate Summary
 summary (c_missing)

ans =
  10x5 table

      Set       SetSize    StratificationLabel    StratificationCount    PercentInSet    
    ________    _______    ___________________    ___________________    ____________    

    "all"            20                      1                     10              50    
    "all"            20                      2                     10              50    
    "train1"         10                      1                      5              50    
    "train1"         10                      2                      5              50    
    "test1"          10                      1                      5              50    
    "test1"          10                      2                      5              50    
    "train2"         10                      1                      5              50    
    "train2"         10                      2                      5              50    
    "test2"          10                      1                      5              50    
    "test2"          10                      2                      5              50    

                    

Example: 5

 

 ## 5. Filtering Summary
 rng (303, "twister");

 Success = repelem ({'Yes'; 'No'}, [20; 10]);
 Success = Success(randperm (30));

 ## Create Partition and Summary
 c_strat = cvpartition (Success, 'KFold', 3);
 summaryStrat = summary (c_strat);

 ## A. Filtering by Exact Set Name
 summaryTest1 = summaryStrat (strcmp (summaryStrat.Set, 'test1'), :)

 ## B. Filtering by Partial Match
 is_test = ! cellfun ('isempty', strfind (cellstr (summaryStrat.Set), 'test'));
 testSummaryTbl = summaryStrat (is_test, :)

summaryTest1 =
  2x5 table

      Set      SetSize    StratificationLabel    StratificationCount    PercentInSet    
    _______    _______    ___________________    ___________________    ____________    

    "test1"         10    "No"                                     4              40    
    "test1"         10    "Yes"                                    6              60    

testSummaryTbl =
  6x5 table

      Set      SetSize    StratificationLabel    StratificationCount    PercentInSet    
    _______    _______    ___________________    ___________________    ____________    

    "test1"         10    "No"                                     4              40    
    "test1"         10    "Yes"                                    6              60    
    "test2"         10    "No"                                     3              30    
    "test2"         10    "Yes"                                    7              70    
    "test3"         10    "No"                                     3              30    
    "test3"         10    "Yes"                                    7              70    

                    

Example: 6

 

 ## 6. Unstacking
 rng (505, "twister");

 Success = repelem ({'Yes'; 'No'}, [20; 10]);
 Success = Success(randperm (30));

 ## Create Partition and Summary
 c_strat = cvpartition (Success, 'KFold', 3);
 summaryStrat = summary (c_strat);

 ## Pivot 'StratificationLabel' (Yes/No) into new columns
 speciesSummaryTbl = unstack (summaryStrat(:,1:4), 'StratificationCount', 'StratificationLabel')

speciesSummaryTbl =
  7x4 table

      Set       SetSize    No    Yes    
    ________    _______    __    ___    

    "all"            30    10     20    
    "train1"         20     6     14    
    "test1"          10     4      6    
    "train2"         20     7     13    
    "test2"          10     3      7    
    "train3"         20     7     13    
    "test3"          10     3      7    

                    
cvpartition: idx = test (C)
cvpartition: idx = test (C, i)
cvpartition: idx = test (C, "all")

idx = test (C) returns a logical vector idx with true values indicating the elements corresponding to the test set defined in the cvpartition object C. For K-fold and leave-one-out partitions, the indices corresponding to the first test set are returned.

idx = test (C, i) returns a logical vector or matrix with the indices of the test set indicated by i. If i is a scalar, then idx is a logical vector with the indices of the i-th set. If i is a vector, then idx is a logical matrix in which idx(:,j) specifies the observations in the test set i(j). The value(s) in i must not exceed the number of test sets in the cvpartition object C.

idx = test (C, "all") returns a logical vector or matrix for all test sets defined in the cvpartition object C. For holdout and resubstitution partition types, a vector is returned. For K-fold and leave-one-out, a matrix is returned.
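
The three calling forms can be compared side by side. The sketch below assumes the statistics package is installed and uses illustrative variable names.

```octave
pkg load statistics

C = cvpartition (12, 'KFold', 3);

firstSet = test (C);        # first test set of a K-fold partition
pair = test (C, [1 3]);     # 12-by-2 logical matrix
every = test (C, "all");    # 12-by-3 logical matrix

## Column j of pair corresponds to test set i(j)
matches = isequal (pair(:,1), every(:,1)) && isequal (pair(:,2), every(:,3))

## A holdout partition has a single test set, so "all" yields a vector
H = cvpartition (12, 'Holdout', 0.25);
isVec = isvector (test (H, "all"))
```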

See also: cvpartition, repartition, summary, training

cvpartition: idx = training (C)
cvpartition: idx = training (C, i)
cvpartition: idx = training (C, "all")

idx = training (C) returns a logical vector idx with true values indicating the elements corresponding to the training set defined in the cvpartition object C. For K-fold and leave-one-out partitions, the indices corresponding to the first training set are returned.

idx = training (C, i) returns a logical vector or matrix with the indices of the training set indicated by i. If i is a scalar, then idx is a logical vector with the indices of the i-th set. If i is a vector, then idx is a logical matrix in which idx(:,j) specifies the observations in the training set i(j). The value(s) in i must not exceed the number of test sets in the cvpartition object C.

idx = training (C, "all") returns a logical vector or matrix for all training sets defined in the cvpartition object C. For holdout and resubstitution partition types, a vector is returned. For K-fold and leave-one-out, a matrix is returned.
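
To show how training and test are typically combined, here is a minimal cross-validation loop. The data (a noisy line) and the choice of polyfit as the model are assumptions made purely for illustration; the statistics package is assumed to be installed.

```octave
pkg load statistics

## Hypothetical data: a noisy line y = 2x + 1
x = linspace (0, 1, 40)';
y = 2 * x + 1 + 0.1 * randn (40, 1);

C = cvpartition (numel (x), 'KFold', 5);
mse = zeros (C.NumTestSets, 1);
for i = 1:C.NumTestSets
  trIdx = training (C, i);
  teIdx = test (C, i);
  coef = polyfit (x(trIdx), y(trIdx), 1);        # fit on the training fold
  resid = y(teIdx) - polyval (coef, x(teIdx));   # evaluate on the test fold
  mse(i) = mean (resid .^ 2);
endfor

## Average test error across folds
cvError = mean (mse)
```

Each fold's model sees only the observations selected by training (C, i), so the averaged error estimates performance on data that was held out during fitting.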

See also: cvpartition, repartition, summary, test