ClassificationKNN
statistics: ClassificationKNN
K-nearest neighbors classification
The ClassificationKNN class implements a K-nearest neighbor classifier
object, which can predict responses for new data using the predict
method. The implemented algorithm allows you to choose among a range of
distance metrics, the number of nearest neighbors, as well as the search
algorithm.
The K-nearest neighbors (k-NN) classifier is a simple, non-parametric machine learning algorithm used for classification tasks. It classifies a data point based on the majority class of its k closest neighbors in the feature space.
Create a ClassificationKNN object by using the fitcknn
function or the class constructor.
See also: fitcknn
Source Code: ClassificationKNN
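As a quick illustration, here is a minimal sketch of training and querying a k-NN classifier on the Fisher iris dataset (assuming the statistics package is loaded):
## Train a 5-nearest-neighbor classifier and predict the class
## label of the mean observation
load fisheriris
obj = fitcknn (meas, species, "NumNeighbors", 5);
label = predict (obj, mean (meas))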
A numeric matrix containing the unstandardized predictor data. Each column of X represents one predictor (variable), and each row represents one observation. This property is read-only.
A logical or numeric column vector, or a character array or cell array of character vectors, with the same number of rows as the predictor data. Each row in Y is the observed class label for the corresponding row in X. This property is read-only.
A positive integer value specifying the number of observations in the training dataset used for training the ClassificationKNN model. This property is read-only.
A logical column vector with the same length as the number of observations in the original predictor data X, specifying which rows have been used for fitting the ClassificationKNN model. This property is read-only.
A positive integer value specifying the number of predictors in the training dataset used for training the ClassificationKNN model. This property is read-only.
A cell array of character vectors specifying the names of the predictor variables. The names are in the order in which they appear in the training dataset. This property is read-only.
A character vector specifying the name of the response variable Y. This property is read-only.
An array of unique values of the response variable Y, which has the
same data type as the data in Y. This property is read-only.
ClassNames can be numeric, logical, a character array, or a cell array of character vectors, matching the type of the class labels in Y.
A square matrix specifying the cost of misclassification of a point.
Cost(i,j) is the cost of classifying a point into class j
if its true class is i (that is, the rows correspond to the true
class and the columns correspond to the predicted class). The order of
the rows and columns in Cost corresponds to the order of the
classes in ClassNames. The number of rows and columns in
Cost is the number of unique classes in the response. By
default, Cost(i,j) = 1 if i != j, and
Cost(i,j) = 0 if i = j. In other words, the cost is 0
for correct classification and 1 for incorrect classification.
Add or change the Cost property using dot notation as in:
obj.Cost = costMatrix
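For example, the following hypothetical sketch assumes a trained 3-class model obj and makes errors on true class 1 five times costlier than the default:
costMatrix = ones (3) - eye (3);  ## default: 0 on diagonal, 1 elsewhere
costMatrix(1, 2:3) = 5;           ## penalize misclassifying true class 1
obj.Cost = costMatrix;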
A numeric vector specifying the prior probabilities for each class. The
order of the elements in Prior corresponds to the order of the
classes in ClassNames.
Add or change the Prior property using dot notation as in:
obj.Prior = priorVector
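For instance, assuming a trained 3-class model obj where the first class is believed to be twice as frequent as the others:
obj.Prior = [0.5; 0.25; 0.25];  ## ordered as in obj.ClassNames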
Specified as a function handle for transforming the classification
scores. Add or change the ScoreTransform property using dot
notation as in:
obj.ScoreTransform = 'function_name'
obj.ScoreTransform = @function_handle
When specified as a character vector, it can be any of the following
built-in functions. Nevertheless, the ScoreTransform property
always stores their function handle equivalent.
| Value | Description | |
|---|---|---|
"doublelogit" | 1 ./ (1 + exp (-2 * x)) | |
"invlogit" | log (x ./ (1 - x)) | |
"ismax" | Sets the score for the class with the largest score to 1, and the scores for all other classes to 0 | |
"logit" | 1 ./ (1 + exp (-x)) | |
"none" | x (no transformation) | |
"identity" | x (no transformation) | |
"sign" | -1 for x < 0, 0 for x = 0, 1 for x > 0 | |
"symmetric" | 2 * x - 1 | |
"symmetricismax" | Sets the score for the class with the largest score to 1, and the scores for all other classes to -1 | |
"symmetriclogit" | 2 ./ (1 + exp (-x)) - 1 |
A logical scalar specifying whether the predictor data in X have been standardized prior to training. This property is read-only.
A numeric vector of the same length as the columns in X with the
standard deviations corresponding to each predictor. If the predictor
variables have not been standardized, then "obj.Sigma" is empty.
This property is read-only.
A numeric vector of the same length as the columns in X with the
mean values corresponding to each predictor. If the predictor variables
have not been standardized, then "obj.Mu" is empty. This
property is read-only.
A character vector specifying the tie-breaking algorithm used by the
predict method, when multiple classes have the same smallest cost.
It can be one of the following:
"smallest" (default), which favors the class with the
smallest index among the tied groups, i.e. the one that appears first in
the training labelled data.
"nearest", which favors the class with the nearest neighbor
among the tied groups, i.e. the class with the closest member point
according to the distance metric used.
"random", which randomly picks one class among the tied
groups.
The tie-breaking algorithm is only used when IncludeTies is
false. Change the BreakTies property using dot notation
as in:
obj.BreakTies = algorithm
A positive integer value specifying the number of nearest neighbors in X
used to classify each point during prediction. Change the
NumNeighbors property using dot notation as in:
obj.NumNeighbors = newNumNeighbors
A character vector specifying the distance metric used by the
neighbor-searcher method. See the available distance metrics in
knnsearch for more info. Change the Distance property
using dot notation as in:
obj.Distance = newDistance
A character vector or a function handle specifying the distance weighting function, which can be any of the following values:
"equal", which corresponds to @(d) d.
"inverse", which corresponds to @(d) 1/d.
"squaredinverse", which corresponds to @(d) 1/d.^2.
@fcn, which is a function handle that accepts a matrix of
nonnegative distances, and returns a matrix the same size containing
nonnegative distance weights.
Change the DistanceWeight property
using dot notation as in:
obj.DistanceWeight = newDistanceWeight
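For example, the Gaussian kernel below is an arbitrary illustrative choice of a custom weighting function, not a built-in option:
obj.DistanceWeight = "squaredinverse";     ## built-in weighting scheme
obj.DistanceWeight = @(d) exp (-d .^ 2);   ## custom kernel weighting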
A positive definite covariance matrix, a positive scalar, or a vector of positive scale values specifying the parameter for the corresponding distance metric as shown below:
"mahalanobis" accepts a positive definite covariance matrix.
"minkowski" accepts a positive scalar as the Minkowski
distance exponent.
"seuclidean" accepts a vector of positive scale values of
equal length as the number of predictors in X.
For any other distance metric, DistParameter is empty
([]). Change the DistParameter property using dot
notation as in:
obj.DistParameter = distParam
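For instance, a sketch of switching to the Minkowski metric with a cubic exponent:
obj.Distance = "minkowski";  ## this metric reads DistParameter as its exponent
obj.DistParameter = 3;       ## Minkowski exponent (the default is 2)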
A character vector specified as either "kdtree", which creates
and uses a Kd-tree to find nearest neighbors, or "exhaustive",
which uses the exhaustive search algorithm by computing the distance
values from all points in X to find nearest neighbors.
Change the NSMethod property using dot notation as in:
obj.NSMethod = newNSMethod
A logical scalar specifying whether prediction includes all the neighbors
whose distance values are equal to the smallest distance. If
IncludeTies is true, prediction includes all of these
neighbors. Otherwise, prediction uses exactly NumNeighbors neighbors.
Change the IncludeTies property using dot notation as in:
obj.IncludeTies = flag
A positive integer scalar specifying the maximum number of data points in
the leaf node of the Kd-tree. BucketSize only applies when the
NSMethod property is "kdtree".
Change the BucketSize property using dot notation as in:
obj.BucketSize = maxnum
statistics: obj = ClassificationKNN (X, Y)
statistics: obj = ClassificationKNN (…, name, value)
obj = ClassificationKNN (X, Y) returns a
ClassificationKNN object, with X as the predictor data and Y
containing the class labels of observations in X.
X must be a numeric matrix of input data where rows
correspond to observations and columns correspond to features or
variables. X will be used to train the kNN model.
Y is a matrix or cell array containing the class labels
corresponding to the predictor data in X. Y can contain any type
of categorical data. Y must have the same number of rows as X.
obj = ClassificationKNN (…, name, value)
returns a ClassificationKNN object with parameters specified by the
following name, value paired input arguments:
| Name | Value | |
|---|---|---|
'PredictorNames' | A cell array of character vectors specifying the names of the predictors. The length of this array must match the number of columns in X. | |
'ResponseName' | A character vector specifying the name of the response variable. | |
'ClassNames' | Names of the classes in the class
labels, Y, used for fitting the kNN model.
ClassNames are of the same type as the class labels in Y. | |
'Cost' | A numeric square matrix containing the
misclassification costs, with as many rows and columns as there are unique
classes in Y. Cost(i,j) is the cost of classifying a point
into class j when its true class is i. By default,
Cost(i,j) = 1 if i != j, and Cost(i,j) = 0 if
i = j, that is, the cost is 0 for correct classification and 1 for
misclassification. The cost matrix can be altered after training by using
obj.Cost = someCost. | |
'Prior' | A numeric vector specifying the prior
probabilities for each class. The order of the elements in Prior
corresponds to the order of the classes in ClassNames.
Alternatively, you can specify "empirical" to use the empirical
class probabilities or "uniform" to assume equal class
probabilities. | |
'ScoreTransform' | A user-defined function handle
or a character vector specifying one of the following built-in functions
for transforming the predicted classification scores.
Supported values include 'doublelogit', 'invlogit',
'ismax', 'logit', 'none', 'identity',
'sign', 'symmetric', 'symmetricismax', and
'symmetriclogit'. | |
"BreakTies" | A character vector specifying the
tie-breaking algorithm used by predict method, when multiple
classes have the same smallest cost. Available options are
"smallest" (default), which uses the smallest index among tied
groups, "nearest", which uses the class with the nearest neighbor
among tied groups, and "random", which randomly selects one of
the tied groups. | |
"NumNeighbors" | A positive integer value that specifies the number of nearest neighbors to be found in the kNN search algorithm for classifying each point during prediction. By default, it is 1. | |
"Distance" | Any valid distance metric supported by
the pdist2 function. Note that the allowable distrance metrics
depend on the selected nearest neighbor search method. | |
"DistanceWeight" | Either a distance weighting
function, specified either as a function handle, which accepts a matrix
of nonnegative distances and returns a matrix the same size containing
nonnegative distance weights, or a character vector with one of the
following values: "equal", which corresponds to no weighting;
"inverse", which corresponds to a weight equal to
; "squaredinverse", which corresponds to a
weight equal to . | |
"Cov" | A square matrix with the same number of
columns X specifying the covariance matrix for computing the
mahalanobis distance. This must be a positive definite matrix matching.
This argument is only valid when the selected distance metric is
"mahalanobis". | |
"Exponent" | A positive scalar (usually an integer)
specifying the Minkowski distance exponent. This argument is only valid
when the selected distance metric is "minkowski". By default,
it is 2. | |
"Scale" | A nonnegative numeric vector specifying
the scale parameters for the standardized Euclidean distance. The vector
length must be equal to the number of columns in X. This argument
is only valid when the selected distance metric is "seuclidean",
in which case each coordinate of X is scaled by the corresponding
element of "scale", as is each query point in Y. By
default, the scale parameter is the standard deviation of each coordinate
in X. If a variable in X is constant, i.e. zero variance,
this value is forced to 1 to avoid division by zero. This is the
equivalent of this variable not being standardized. | |
"NSMethod" | A character vector specifying the
nearest neighbor search method used by knnsearch, which can be
"kdtree" or "exhaustive". See knnsearch for more
information about default values and allowable distance metrics for each
search method. | |
"BucketSize" | A positive integer value specifying
the maximum number of data points in the leaf node of the Kd-tree. This
argument is meaningful only when the selected nearest neighbor search
method is "kdtree". By default, it is 50. |
See also: fitcknn, knnsearch, rangesearch, pdist2
ClassificationKNN: labels = predict (obj, XC)
ClassificationKNN: [labels, scores, cost] = predict (obj, XC)
labels = predict (obj, XC) returns the matrix of
labels predicted for the corresponding instances in XC, using the
predictor data in obj.X and the corresponding labels, obj.Y,
stored in the k-Nearest Neighbor classification model, obj.
[labels, scores, cost] = predict (obj,
XC) also returns scores, which contains the predicted class
scores or posterior probabilities for each instance of the corresponding
unique classes, and cost, which is a matrix containing the expected
cost of the classifications. By default, scores returns the
posterior probabilities for KNN models, unless a specific ScoreTransform
function has been specified. See fitcknn for more info.
Note: predict explicitly uses "exhaustive" as the
nearest neighbor search method, due to the very slow implementation of
"kdtree" in the knnsearch function.
See also: fitcknn, ClassificationKNN, knnsearch
ClassificationKNN: L = loss (obj, X, Y)
ClassificationKNN: L = loss (…, name, value)
L = loss (obj, X, Y) computes the loss,
L, using the default loss function 'mincost'.
obj must be a trained ClassificationKNN object.
X must be a numeric matrix of input data where rows
correspond to observations and columns correspond to features or
variables.
Y is a matrix or cell array containing the class labels
corresponding to the predictor data in X. Y must have the
same number of rows as X.
L = loss (…, name, value) allows
additional options specified by name-value pairs:
| Name | Value | |
|---|---|---|
"LossFun" | Specifies the loss function to use.
Can be a function handle with four input arguments (C, S, W, Cost)
which returns a scalar value or one of:
’binodeviance’, ’classifcost’, ’classiferror’, ’exponential’,
’hinge’, ’logit’,’mincost’, ’quadratic’.
| |
"Weights" | Specifies observation weights, must be
a numeric vector of length equal to the number of rows in X.
Default is ones (size (X, 1)). loss normalizes the weights so that
observation weights in each class sum to the prior probability of that
class. When you supply Weights, loss computes the weighted
classification loss. |
See also: fitcknn, ClassificationKNN
ClassificationKNN: m = margin (obj, X, Y)
m = margin (obj, X, Y) returns the classification margins
for obj with predictor data X and class labels Y.
obj must be a trained ClassificationKNN object.
X must be a numeric matrix of input data where rows
correspond to observations and columns correspond to features or
variables.
Y is a matrix or cell array containing the class labels
corresponding to the predictor data in X. Y must have the
same number of rows as X.
The classification margin for each observation is the difference between the classification score for the true class and the maximal classification score for the false classes.
See also: fitcknn, ClassificationKNN
ClassificationKNN: [pd, x, y] = partialDependence (obj, Vars, Labels)
ClassificationKNN: [pd, x, y] = partialDependence (…, Data)
ClassificationKNN: [pd, x, y] = partialDependence (…, name, value)
[pd, x, y] = partialDependence (obj, Vars,
Labels)
computes the partial dependence of the classification scores on the
variables Vars for the specified class Labels.
obj is a trained ClassificationKNN object.
Vars is a vector of positive integers, character vector,
string array, or cell array of character
vectors representing predictor variables (it can be indices of
predictor variables in obj.X).
Labels is a character vector, logical vector, numeric vector,
or cell array of character vectors representing class labels,
specified as a column vector.
[pd, x, y] = partialDependence (…, Data)
specifies new predictor data to use for computing the partial dependence.
[pd, x, y] = partialDependence (…, name,
value) allows additional options specified by name-value pairs:
| Name | Value | |
|---|---|---|
"NumObservationsToSample" | Number of observations to sample. Must be a positive integer. Defaults to the number of observations in the training data. | |
"QueryPoints" | Points at which to evaluate the partial dependence. Must be a numeric column vector, numeric two-column matrix, or cell array of character column vectors. | |
"UseParallel" | Logical value indicating
whether to perform computations in parallel.
Defaults to false. |
pd: Partial dependence values.
x: Query points for the first predictor variable in Vars.
y: Query points for the second predictor variable in
Vars (if applicable).
See also: fitcknn, ClassificationKNN
ClassificationKNN: CVMdl = crossval (obj)
ClassificationKNN: CVMdl = crossval (…, Name, Value)
CVMdl = crossval (obj) returns a cross-validated model
object, CVMdl, from a trained model, obj, using 10-fold
cross-validation by default.
CVMdl = crossval (obj, name, value)
specifies additional name-value pair arguments to customize the
cross-validation process.
| Name | Value | |
|---|---|---|
"KFold" | Specify the number of folds to use in
k-fold cross-validation. "KFold", k, where k is an
integer greater than 1. | |
"Holdout" | Specify the fraction of the data to
hold out for testing. "Holdout", p, where p is a
scalar in the range . | |
"Leaveout" | Specify whether to perform
leave-one-out cross-validation. "Leaveout", Value, where
Value is ’on’ or ’off’. | |
"CVPartition" | Specify a cvpartition
object used for cross-validation. "CVPartition", cv, where
isa (cv, "cvpartition") = 1. |
See also: fitcknn, ClassificationKNN, cvpartition, ClassificationPartitionedModel
ClassificationKNN: savemodel (obj, filename)
savemodel (obj, filename) saves each property of a
ClassificationKNN object into an Octave binary file, the name of which is
specified in filename, along with an extra variable, which defines
the type of classification object these variables constitute. Use
loadmodel in order to load a classification object into Octave's
workspace.
See also: loadmodel, fitcknn, ClassificationKNN
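A minimal sketch of saving and reloading a trained model; the filename "knn_iris.mdl" is an arbitrary choice for illustration:
load fisheriris
obj = fitcknn (meas, species, "NumNeighbors", 5);
savemodel (obj, "knn_iris.mdl")       ## write all properties to a binary file
obj2 = loadmodel ("knn_iris.mdl");    ## restore the object (see loadmodel)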
load fisheriris
x = meas;
y = species;
xc = [min(x); mean(x); max(x)];
obj = fitcknn (x, y, "NumNeighbors", 5, "Standardize", 1);
[label, score, cost] = predict (obj, xc)
label =
3x1 cell array
{'versicolor'}
{'versicolor'}
{'virginica' }
score =
0.4000 0.6000 0
0 1.0000 0
0 0 1.0000
cost =
0.6000 0.4000 1.0000
1.0000 0 1.0000
1.0000 1.0000 0
load fisheriris
x = meas;
y = species;
obj = fitcknn (x, y, "NumNeighbors", 5, "Standardize", 1);
## Create a cross-validated model
CVMdl = crossval (obj)
CVMdl =
ClassificationPartitionedModel object with properties:
BinEdges: []
CategoricalPredictors: []
X: [5.1000, 3.5000, 1.4000, 0.2000; 4.9000, 3, 1.4000, 0.2000; 4.7000, 3.2000, ...]
Y: [150x1 cell]
ClassNames: [3x1 cell]
Cost: [0, 1, 1; 1, 0, 1; 1, 1, 0]
CrossValidatedModel: 'ClassificationKNN'
KFold: 10
ModelParameters: [1x1 struct]
NumObservations: 150
Partition: [1x1 cvpartition]
PredictorNames: [1x4 cell]
Prior: [0.3333; 0.3333; 0.3333]
ResponseName: "Y"
ScoreTransform: [1x1 function_handle]
Standardize: 1
Trained: [10x1 cell]
load fisheriris
x = meas;
y = species;
covMatrix = cov (x);
## Fit the k-NN model using the 'mahalanobis' distance
## and the custom covariance matrix
obj = fitcknn(x, y, 'NumNeighbors', 5, 'Distance','mahalanobis', ...
'Cov', covMatrix);
## Create a partition model using cvpartition
Partition = cvpartition (size (x, 1), 'kfold', 12);
## Create cross-validated model using 'cvPartition' name-value argument
CVMdl = crossval (obj, 'cvPartition', Partition)
## Access the trained model from first fold of cross-validation
CVMdl.Trained{1}
CVMdl =
ClassificationPartitionedModel object with properties:
BinEdges: []
CategoricalPredictors: []
X: [5.1000, 3.5000, 1.4000, 0.2000; 4.9000, 3, 1.4000, 0.2000; 4.7000, 3.2000, ...]
Y: [150x1 cell]
ClassNames: [3x1 cell]
Cost: [0, 1, 1; 1, 0, 1; 1, 1, 0]
CrossValidatedModel: 'ClassificationKNN'
KFold: 12
ModelParameters: [1x1 struct]
NumObservations: 150
Partition: [1x1 cvpartition]
PredictorNames: [1x4 cell]
Prior: [0.3333; 0.3333; 0.3333]
ResponseName: "Y"
ScoreTransform: [1x1 function_handle]
Standardize: 0
Trained: [12x1 cell]
ans =
ClassificationKNN
ResponseName: 'Y'
ClassNames: {'setosa' 'versicolor' 'virginica'}
ScoreTransform: 'custom function handle'
NumObservations: 137
NumPredictors: 4
Distance: 'mahalanobis'
NSMethod: 'exhaustive'
NumNeighbors: 5
X = [1, 2; 3, 4; 5, 6];
Y = {'A'; 'B'; 'A'};
model = fitcknn (X, Y);
customLossFun = @(C, S, W, Cost) sum (W .* sum (abs (C - S), 2));
## Calculate loss using custom loss function
L = loss (model, X, Y, 'LossFun', customLossFun)
L = 0
X = [1, 2; 3, 4; 5, 6];
Y = {'A'; 'B'; 'A'};
model = fitcknn (X, Y);
## Calculate loss using 'mincost' loss function
L = loss (model, X, Y, 'LossFun', 'mincost')
L = 0
X = [1, 2; 3, 4; 5, 6];
Y = ['1'; '2'; '3'];
model = fitcknn (X, Y);
X_test = [3, 3; 5, 7];
Y_test = ['1'; '2'];
## Specify custom Weights
W = [1; 2];
L = loss (model, X_test, Y_test, 'LossFun', 'logit', 'Weights', W);
load fisheriris
mdl = fitcknn (meas, species);
X = mean (meas);
Y = {'versicolor'};
m = margin (mdl, X, Y)
m = 1
X = [1, 2; 4, 5; 7, 8; 3, 2];
Y = [2; 1; 3; 2];
## Train the model
mdl = fitcknn (X, Y);
## Specify Vars and Labels
Vars = 1;
Labels = 2;
## Calculate partialDependence
[pd, x, y] = partialDependence (mdl, Vars, Labels);
X = [1, 2; 4, 5; 7, 8; 3, 2];
Y = [2; 1; 3; 2];
## Train the model
mdl = fitcknn (X, Y);
## Specify Vars and Labels
Vars = 1;
Labels = 1;
queryPoints = [linspace(0, 1, 3)', linspace(0, 1, 3)'];
## Calculate partialDependence using queryPoints
[pd, x, y] = partialDependence (mdl, Vars, Labels, 'QueryPoints', ...
                                queryPoints)
pd =
0.2500 0.2500 0.2500
x =
0 0
0.5000 0.5000
1.0000 1.0000
y = [](0x0)