Class Definition: ClassificationKNN

statistics: ClassificationKNN

K-nearest neighbors classification

The ClassificationKNN class implements a K-nearest neighbor classifier object, which can predict responses for new data using the predict method. The implemented algorithm allows you to choose among a range of distance metrics, the number of nearest neighbors, and the nearest neighbor search algorithm.

The K-nearest neighbors (k-NN) classifier is a simple, non-parametric machine learning algorithm used for classification tasks. It classifies a data point based on the majority class of its k closest neighbors in the feature space.

Create a ClassificationKNN object by using the fitcknn function or the class constructor.

See also: fitcknn

Source Code: ClassificationKNN

Properties

X

A numeric matrix containing the unstandardized predictor data. Each column of X represents one predictor (variable), and each row represents one observation. This property is read-only.

Y

A logical or numeric column vector, or a character array or cell array of character vectors, with the same number of rows as the predictor data. Each row in Y is the observed class label for the corresponding row in X. This property is read-only.

NumObservations

A positive integer value specifying the number of observations in the training dataset used for training the ClassificationKNN model. This property is read-only.

RowsUsed

A logical column vector with the same length as the observations in the original predictor data X, specifying which rows have been used for fitting the ClassificationKNN model. This property is read-only.

NumPredictors

A positive integer value specifying the number of predictors in the training dataset used for training the ClassificationKNN model. This property is read-only.

PredictorNames

A cell array of character vectors specifying the names of the predictor variables. The names are in the order in which they appear in the training dataset. This property is read-only.

ResponseName

A character vector specifying the name of the response variable Y. This property is read-only.

ClassNames

An array of unique values of the response variable Y, which has the same data type as the data in Y. This property is read-only. ClassNames can have any of the following data types:

  • Cell array of character vectors
  • Character array
  • Logical vector
  • Numeric vector

Cost

A square matrix specifying the cost of misclassification of a point. Cost(i,j) is the cost of classifying a point into class j if its true class is i (that is, the rows correspond to the true class and the columns correspond to the predicted class). The order of the rows and columns in Cost corresponds to the order of the classes in ClassNames. The number of rows and columns in Cost is the number of unique classes in the response. By default, Cost(i,j) = 1 if i != j, and Cost(i,j) = 0 if i = j. In other words, the cost is 0 for correct classification and 1 for incorrect classification.

Add or change the Cost property using dot notation as in:

  • obj.Cost = costMatrix
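
For example, a minimal sketch that penalizes one type of error more heavily (assuming obj is a trained two-class model):

 ## Rows are true classes, columns are predicted classes:
 ## misclassifying class 1 as class 2 costs 3, the reverse costs 1.
 obj.Cost = [0, 3; 1, 0];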

Prior

A numeric vector specifying the prior probabilities for each class. The order of the elements in Prior corresponds to the order of the classes in ClassNames.

Add or change the Prior property using dot notation as in:

  • obj.Prior = priorVector
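
For example, a sketch for a hypothetical three-class model where the first class is considered twice as likely a priori:

 obj.Prior = [0.5, 0.25, 0.25];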

ScoreTransform

A function handle for transforming the classification scores. Add or change the ScoreTransform property using dot notation as in:

  • obj.ScoreTransform = 'function_name'
  • obj.ScoreTransform = @function_handle

When specified as a character vector, it can be any of the following built-in functions. Nevertheless, the ScoreTransform property always stores the equivalent function handle.

Value              Description
"doublelogit"      1 ./ (1 + exp (-2 * x))
"invlogit"         log (x ./ (1 - x))
"ismax"            Sets the score for the class with the largest score to 1, and for all other classes to 0
"logit"            1 ./ (1 + exp (-x))
"none"             x (no transformation)
"identity"         x (no transformation)
"sign"             -1 for x < 0, 0 for x = 0, 1 for x > 0
"symmetric"        2 * x - 1
"symmetricismax"   Sets the score for the class with the largest score to 1, and for all other classes to -1
"symmetriclogit"   2 ./ (1 + exp (-x)) - 1

Standardize

A logical scalar specifying whether the predictor data in X have been standardized prior to training. This property is read-only.

Sigma

A numeric vector of the same length as the columns in X with the standard deviations corresponding to each predictor. If the predictor variables have not been standardized, then "obj.Sigma" is empty. This property is read-only.

Mu

A numeric vector of the same length as the columns in X with the mean values corresponding to each predictor. If the predictor variables have not been standardized, then "obj.Mu" is empty. This property is read-only.

BreakTies

A character vector specifying the tie-breaking algorithm used by the predict method when multiple classes have the same smallest cost. It can be one of the following:

  • "smallest" (default), which favors the class with the smallest index among the tied groups, i.e. the one that appears first in the training labelled data.
  • "nearest", which favors the class with the nearest neighbor among the tied groups, i.e. the class with the closest member point according to the distance metric used.
  • "random", which randomly picks one class among the tied groups.

The tie-breaking algorithm is only used when IncludeTies is false. Change the BreakTies property using dot notation as in:

  • obj.BreakTies = algorithm

NumNeighbors

A positive integer value specifying the number of nearest neighbors in X used to classify each point during prediction. Change the NumNeighbors property using dot notation as in:

  • obj.NumNeighbors = newNumNeighbors

Distance

A character vector specifying the distance metric used by the neighbor-searcher method. See the available distance metrics in knnsearch for more info. Change the Distance property using dot notation as in:

  • obj.Distance = newDistance

DistanceWeight

A character vector or a function handle specifying the distance weighting function, which can be any of the following values:

  • "equal", which corresponds to @(d) d.
  • "inverse", which corresponds to @(d) 1/d.
  • "squaredinverse", which corresponds to @(d) 1/d.^2.
  • @fcn, which is a function handle that accepts a matrix of nonnegative distances, and returns a matrix the same size containing nonnegative distance weights.

Change the DistanceWeight property using dot notation as in:

  • obj.DistanceWeight = newDistanceWeight
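
For example, a minimal sketch using a hypothetical custom weighting function that decays exponentially with distance:

 obj.DistanceWeight = @(d) exp (-d);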

DistParameter

A positive definite covariance matrix, a positive scalar, or a vector of positive scale values, specifying the parameter for the corresponding distance metric as shown below:

  • "mahalanobis" accepts a positive definite covariance matrix.
  • "minkowski" accepts a positive scalar as the Minkowski distance exponent.
  • "seuclidean" accepts a vector of positive scale values of equal length as the number of predictors in X.

For any other distance metric, DistParameter is empty ([]). Change the DistParameter property using dot notation as in:

  • obj.DistParameter = distParam
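
For example, a sketch that switches a trained model to the Minkowski metric with a cubic exponent:

 obj.Distance = "minkowski";
 obj.DistParameter = 3;    ## Minkowski distance exponent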

NSMethod

A character vector specifying the nearest neighbor search method, either "kdtree", which creates and uses a Kd-tree to find nearest neighbors, or "exhaustive", which uses the exhaustive search algorithm by computing the distances from all points in X to find nearest neighbors.

Change the NSMethod property using dot notation as in:

  • obj.NSMethod = newNSMethod

IncludeTies

A logical scalar specifying whether prediction includes all the neighbors whose distance values are equal to the k-th smallest distance. If IncludeTies is true, prediction includes all of these neighbors. Otherwise, prediction uses exactly k neighbors.

Change the IncludeTies property using dot notation as in:

  • obj.IncludeTies = flag

BucketSize

A positive integer scalar specifying the maximum number of data points in the leaf node of the Kd-tree. BucketSize only applies when the NSMethod property is "kdtree".

Change the BucketSize property using dot notation as in:

  • obj.BucketSize = maxnum

Methods

statistics: obj = ClassificationKNN (X, Y)
statistics: obj = ClassificationKNN (…, name, value)

obj = ClassificationKNN (X, Y) returns a ClassificationKNN object, with X as the predictor data and Y containing the class labels of observations in X.

  • X must be an N×P numeric matrix of predictor data, where rows correspond to observations and columns correspond to features or variables. X is used to train the kNN model.
  • Y is an N×1 array containing the class labels of the corresponding predictor data in X. Y can contain any supported type of categorical data, and it must have the same number of rows as X.
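
For example, a minimal sketch of calling the class constructor directly on a toy dataset:

 X = [1, 2; 2, 3; 3, 4; 4, 5];
 Y = {"a"; "a"; "b"; "b"};
 obj = ClassificationKNN (X, Y, "NumNeighbors", 3);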

obj = ClassificationKNN (…, name, value) returns a ClassificationKNN object with additional parameters specified by the following name-value pair arguments:

  • "PredictorNames": A cell array of character vectors specifying the names of the predictors. The length of this array must match the number of columns in X.
  • "ResponseName": A character vector specifying the name of the response variable.
  • "ClassNames": Names of the classes in the class labels, Y, used for fitting the kNN model. ClassNames are of the same type as the class labels in Y.
  • "Cost": An R×R numeric matrix containing the misclassification cost, where R is the number of unique classes in Y. Cost(i,j) is the cost of classifying a point into class j if its true class is i. By default, Cost(i,j) = 1 if i != j, and Cost(i,j) = 0 if i = j, so that the cost is 0 for correct classification and 1 for incorrect classification. The cost matrix can also be altered after training by using obj.Cost = someCost.
  • "Prior": A numeric vector specifying the prior probabilities for each class. The order of the elements in Prior corresponds to the order of the classes in ClassNames. Alternatively, you can specify "empirical" to use the empirical class probabilities or "uniform" to assume equal class probabilities.
  • "ScoreTransform": A user-defined function handle or a character vector specifying the transformation applied to predicted classification scores. Supported character vector values are "doublelogit", "invlogit", "ismax", "logit", "none", "identity", "sign", "symmetric", "symmetricismax", and "symmetriclogit".
  • "BreakTies": A character vector specifying the tie-breaking algorithm used by the predict method when multiple classes have the same smallest cost. Available options are "smallest" (default), which uses the smallest index among the tied groups, "nearest", which uses the class with the nearest neighbor among the tied groups, and "random", which randomly selects one of the tied groups.
  • "NumNeighbors": A positive integer value specifying the number of nearest neighbors to be found in the kNN search algorithm for classifying each point during prediction. By default, it is 1.
  • "Distance": Any valid distance metric supported by the pdist2 function. Note that the allowable distance metrics depend on the selected nearest neighbor search method.
  • "DistanceWeight": A distance weighting function, specified either as a function handle, which accepts a matrix of nonnegative distances and returns a matrix of the same size containing nonnegative distance weights, or as a character vector with one of the following values: "equal", which corresponds to no weighting; "inverse", which corresponds to a weight equal to 1/distance; "squaredinverse", which corresponds to a weight equal to 1/distance^2.
  • "Cov": A square positive definite matrix with the same number of columns as X, specifying the covariance matrix for computing the Mahalanobis distance. This argument is only valid when the selected distance metric is "mahalanobis".
  • "Exponent": A positive scalar (usually an integer) specifying the Minkowski distance exponent. This argument is only valid when the selected distance metric is "minkowski". By default, it is 2.
  • "Scale": A nonnegative numeric vector specifying the scale parameters for the standardized Euclidean distance. The vector length must be equal to the number of columns in X. This argument is only valid when the selected distance metric is "seuclidean", in which case each coordinate of X is scaled by the corresponding element of "Scale", as is each query point during prediction. By default, the scale parameter is the standard deviation of each coordinate in X. If a variable in X is constant, i.e. has zero variance, its scale value is forced to 1 to avoid division by zero, which is equivalent to not standardizing that variable.
  • "NSMethod": A character vector specifying the nearest neighbor search method used by knnsearch, which can be "kdtree" or "exhaustive". See knnsearch for more information about default values and allowable distance metrics for each search method.
  • "BucketSize": A positive integer value specifying the maximum number of data points in the leaf node of the Kd-tree. This argument is meaningful only when the selected nearest neighbor search method is "kdtree". By default, it is 50.

See also: fitcknn, knnsearch, rangesearch, pdist2

ClassificationKNN: labels = predict (obj, XC)
ClassificationKNN: [labels, scores, cost] = predict (obj, XC)

labels = predict (obj, XC) returns the matrix of labels predicted for the corresponding instances in XC, using the predictor data in obj.X and corresponding labels, obj.Y, stored in the k-Nearest Neighbor classification model, obj.

  • obj must be a ClassificationKNN class object.
  • XC must be an M×P numeric matrix with the same number of predictors P as the corresponding predictors of the kNN model in obj.

[labels, scores, cost] = predict (obj, XC) also returns scores, which contains the predicted class scores or posterior probabilities for each instance of the corresponding unique classes, and cost, which is a matrix containing the expected cost of the classifications. By default, scores returns the posterior probabilities for KNN models, unless a specific ScoreTransform function has been specified. See fitcknn for more info.

Note: predict explicitly uses the "exhaustive" nearest neighbor search method, due to the very slow implementation of "kdtree" in the knnsearch function.

See also: fitcknn, ClassificationKNN, knnsearch

ClassificationKNN: L = loss (obj, X, Y)
ClassificationKNN: L = loss (…, name, value)

L = loss (obj, X, Y) computes the loss, L, using the default loss function 'mincost'.

  • obj is a ClassificationKNN object trained on X and Y.
  • X must be an N×P numeric matrix of predictor data, where rows correspond to observations and columns correspond to features or variables.
  • Y is an N×1 array containing the class labels of the corresponding predictor data in X. Y must have the same number of rows as X.

L = loss (…, name, value) allows additional options specified by name-value pairs:

  • "LossFun": Specifies the loss function to use. Can be a function handle with four input arguments (C, S, W, Cost), which returns a scalar value, or one of: "binodeviance", "classifcost", "classiferror", "exponential", "hinge", "logit", "mincost", "quadratic".
      • C is a logical matrix of size N×K, where N is the number of observations and K is the number of classes. The element C(i,j) is true if the class label of the i-th observation is equal to the j-th class.
      • S is a numeric matrix of size N×K, where each element represents the classification score for the corresponding class.
      • W is a numeric vector of length N, representing the observation weights.
      • Cost is a K×K matrix representing the misclassification costs.
  • "Weights": Specifies observation weights; it must be a numeric vector of length equal to the number of rows in X. Default is ones (size (X, 1)). loss normalizes the weights so that the observation weights in each class sum to the prior probability of that class. When you supply Weights, loss computes the weighted classification loss.

See also: fitcknn, ClassificationKNN

ClassificationKNN: m = margin (obj, X, Y)

m = margin (obj, X, Y) returns the classification margins, m, for the trained kNN model, obj, using the predictor data X and the corresponding class labels Y.

  • obj is a ClassificationKNN object trained on X and Y.
  • X must be an N×P numeric matrix of predictor data, where rows correspond to observations and columns correspond to features or variables.
  • Y is an N×1 array containing the class labels of the corresponding predictor data in X. Y must have the same number of rows as X.

The classification margin for each observation is the difference between the classification score for the true class and the maximal classification score for the false classes.

See also: fitcknn, ClassificationKNN

ClassificationKNN: [pd, x, y] = partialDependence (obj, Vars, Labels)
ClassificationKNN: [pd, x, y] = partialDependence (…, Data)
ClassificationKNN: [pd, x, y] = partialDependence (…, name, value)

[pd, x, y] = partialDependence (obj, Vars, Labels) computes the partial dependence of the classification scores on the variables Vars for the specified class Labels.

  • obj is a trained ClassificationKNN object.
  • Vars is a vector of positive integers, character vector, string array, or cell array of character vectors representing predictor variables (it can be indices of predictor variables in obj.X).
  • Labels is a column vector of class labels, specified as a character vector, logical vector, numeric vector, or cell array of character vectors.

[pd, x, y] = partialDependence (…, Data) specifies new predictor data to use for computing the partial dependence.

[pd, x, y] = partialDependence (…, name, value) allows additional options specified by name-value pairs:

  • "NumObservationsToSample": Number of observations to sample. Must be a positive integer. Defaults to the number of observations in the training data.
  • "QueryPoints": Points at which to evaluate the partial dependence. Must be a numeric column vector, a numeric two-column matrix, or a cell array of character column vectors.
  • "UseParallel": Logical value indicating whether to perform computations in parallel. Defaults to false.

Return Values

  • pd: Partial dependence values.
  • x: Query points for the first predictor variable in Vars.
  • y: Query points for the second predictor variable in Vars (if applicable).

See also: fitcknn, ClassificationKNN

ClassificationKNN: CVMdl = crossval (obj)
ClassificationKNN: CVMdl = crossval (…, Name, Value)

CVMdl = crossval (obj) returns a cross-validated model object, CVMdl, from a trained model, obj, using 10-fold cross-validation by default.

CVMdl = crossval (obj, name, value) specifies additional name-value pair arguments to customize the cross-validation process.

  • "KFold": Specify the number of folds to use in k-fold cross-validation, as "KFold", k, where k is an integer greater than 1.
  • "Holdout": Specify the fraction of the data to hold out for testing, as "Holdout", p, where p is a scalar in the range (0,1).
  • "Leaveout": Specify whether to perform leave-one-out cross-validation, as "Leaveout", Value, where Value is "on" or "off".
  • "CVPartition": Specify a cvpartition object used for cross-validation, as "CVPartition", cv, where cv is a cvpartition object.

See also: fitcknn, ClassificationKNN, cvpartition, ClassificationPartitionedModel

ClassificationKNN: savemodel (obj, filename)

savemodel (obj, filename) saves each property of a ClassificationKNN object into an Octave binary file, the name of which is specified in filename, along with an extra variable, which defines the type of classification object these variables constitute. Use loadmodel in order to load a classification object into Octave's workspace.
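
For example, a minimal sketch, assuming obj is a trained ClassificationKNN object (the filename is arbitrary):

 ## save the trained model to a binary file
 savemodel (obj, "knnmodel.mdl");
 ## later, restore the model object into the workspace
 mdl = loadmodel ("knnmodel.mdl");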

See also: loadmodel, fitcknn, ClassificationKNN

Examples

 
 load fisheriris
 x = meas;
 y = species;
 xc = [min(x); mean(x); max(x)];
 obj = fitcknn (x, y, "NumNeighbors", 5, "Standardize", 1);
 [label, score, cost] = predict (obj, xc)
 
label =
  3x1 cell array

    {'versicolor'}    
    {'versicolor'}    
    {'virginica' }    

score =

   0.4000   0.6000        0
        0   1.0000        0
        0        0   1.0000

cost =

   0.6000   0.4000   1.0000
   1.0000        0   1.0000
   1.0000   1.0000        0
 
 load fisheriris
 x = meas;
 y = species;
 obj = fitcknn (x, y, "NumNeighbors", 5, "Standardize", 1);

 ## Create a cross-validated model
 CVMdl = crossval (obj)
 
CVMdl =

  ClassificationPartitionedModel object with properties:

                   BinEdges: []
      CategoricalPredictors: []
                          X: [5.1000, 3.5000, 1.4000, 0.2000; 4.9000, 3, 1.4000, 0.2000; 4.7000, 3.2000, ...]
                          Y: [150x1 cell]
                 ClassNames: [3x1 cell]
                       Cost: [0, 1, 1; 1, 0, 1; 1, 1, 0]
        CrossValidatedModel: 'ClassificationKNN'
                      KFold: 10
            ModelParameters: [1x1 struct]
            NumObservations: 150
                  Partition: [1x1 cvpartition]
             PredictorNames: [1x4 cell]
                      Prior: [0.3333; 0.3333; 0.3333]
               ResponseName: "Y"
             ScoreTransform: [1x1 function_handle]
                Standardize: 1
                    Trained: [10x1 cell]
 
 load fisheriris
 x = meas;
 y = species;
 covMatrix = cov (x);

 ## Fit the k-NN model using the 'mahalanobis' distance
 ## and the custom covariance matrix
 obj = fitcknn (x, y, 'NumNeighbors', 5, 'Distance', 'mahalanobis', ...
                'Cov', covMatrix);

 ## Create a partition model using cvpartition
 Partition = cvpartition (size (x, 1), 'kfold', 12);

 ## Create cross-validated model using 'cvPartition' name-value argument
 CVMdl = crossval (obj, 'cvPartition', Partition)

 ## Access the trained model from first fold of cross-validation
 CVMdl.Trained{1}
 
CVMdl =

  ClassificationPartitionedModel object with properties:

                   BinEdges: []
      CategoricalPredictors: []
                          X: [5.1000, 3.5000, 1.4000, 0.2000; 4.9000, 3, 1.4000, 0.2000; 4.7000, 3.2000, ...]
                          Y: [150x1 cell]
                 ClassNames: [3x1 cell]
                       Cost: [0, 1, 1; 1, 0, 1; 1, 1, 0]
        CrossValidatedModel: 'ClassificationKNN'
                      KFold: 12
            ModelParameters: [1x1 struct]
            NumObservations: 150
                  Partition: [1x1 cvpartition]
             PredictorNames: [1x4 cell]
                      Prior: [0.3333; 0.3333; 0.3333]
               ResponseName: "Y"
             ScoreTransform: [1x1 function_handle]
                Standardize: 0
                    Trained: [12x1 cell]

ans =

  ClassificationKNN

             ResponseName: 'Y'
               ClassNames: {'setosa' 'versicolor' 'virginica'}
           ScoreTransform: 'custom function handle'
          NumObservations: 137
            NumPredictors: 4
                 Distance: 'mahalanobis'
                 NSMethod: 'exhaustive'
             NumNeighbors: 5
 
 X = [1, 2; 3, 4; 5, 6];
 Y = {'A'; 'B'; 'A'};
 model = fitcknn (X, Y);
 customLossFun = @(C, S, W, Cost) sum (W .* sum (abs (C - S), 2));
 ## Calculate loss using custom loss function
 L = loss (model, X, Y, 'LossFun', customLossFun)
 
L = 0
 
 X = [1, 2; 3, 4; 5, 6];
 Y = {'A'; 'B'; 'A'};
 model = fitcknn (X, Y);
 ## Calculate loss using 'mincost' loss function
 L = loss (model, X, Y, 'LossFun', 'mincost')
 
L = 0
 
 X = [1, 2; 3, 4; 5, 6];
 Y = ['1'; '2'; '3'];
 model = fitcknn (X, Y);
 X_test = [3, 3; 5, 7];
 Y_test = ['1'; '2'];
 ## Specify custom Weights
 W = [1; 2];
 L = loss (model, X_test, Y_test, 'LossFun', 'logit', 'Weights', W);
 
 
 load fisheriris
 mdl = fitcknn (meas, species);
 X = mean (meas);
 Y = {'versicolor'};
 m = margin (mdl, X, Y)
 
m = 1
 
 X = [1, 2; 4, 5; 7, 8; 3, 2];
 Y = [2; 1; 3; 2];
 ## Train the model
 mdl = fitcknn (X, Y);
 ## Specify Vars and Labels
 Vars = 1;
 Labels = 2;
 ## Calculate partialDependence
 [pd, x, y] = partialDependence (mdl, Vars, Labels);
 
 
 X = [1, 2; 4, 5; 7, 8; 3, 2];
 Y = [2; 1; 3; 2];
 ## Train the model
 mdl = fitcknn (X, Y);
 ## Specify Vars and Labels
 Vars = 1;
 Labels = 1;
 queryPoints = [linspace(0, 1, 3)', linspace(0, 1, 3)'];
 ## Calculate partialDependence using queryPoints
 [pd, x, y] = partialDependence (mdl, Vars, Labels, 'QueryPoints', ...
 queryPoints)
 
pd =

   0.2500   0.2500   0.2500

x =

        0        0
   0.5000   0.5000
   1.0000   1.0000

y = [](0x0)