Function Reference: kmeans

statistics: idx = kmeans (data, k)
statistics: [idx, centers] = kmeans (data, k)
statistics: [idx, centers, sumd] = kmeans (data, k)
statistics: [idx, centers, sumd, dist] = kmeans (data, k)
statistics: […] = kmeans (data, k, param1, value1, …)
statistics: […] = kmeans (data, [], "start", start, …)

Perform a k-means clustering of the N×D matrix data.

If parameter "start" is specified, then k may be empty in which case k is set to the number of rows of start.
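For instance, k can be inferred from explicit seeds (a minimal sketch; assumes the statistics package is loaded and uses made-up 2-D data):

```
pkg load statistics
data = [randn(50, 2) + 1; randn(50, 2) - 1];
seeds = [1, 1; -1, -1];                    # two 2-D seed points
idx = kmeans (data, [], "start", seeds);   # k is inferred as 2
```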

The outputs are:

idx: An N×1 vector whose i-th element is the index of the cluster to which row i of data is assigned.
centers: A k×D matrix whose i-th row is the centroid of cluster i.
sumd: A k×1 vector whose i-th entry is the sum of the distances from the samples in cluster i to centroid i.
dist: An N×k matrix whose (i,j)-th element is the distance from sample i to centroid j.

The following parameters may be given in any order as Name-Value pairs; each name must be followed by its value.

"Start": The initialization method for the centroids. Supported values:

    "plus": (Default) The k-means++ algorithm.
    "sample": A subset of k rows from data, sampled uniformly without replacement.
    "cluster": Perform a pilot clustering on 10% of the rows of data.
    "uniform": Each component of each centroid is drawn uniformly from the interval between the maximum and minimum values of that component within data. This performs poorly and is implemented only for Matlab compatibility.
    numeric matrix: A k×D matrix of centroid starting locations. The rows correspond to seeds.
    numeric array: A k×D×r array of centroid starting locations. The third dimension invokes replication of the clustering routine: page r contains the set of seeds for replicate r. kmeans infers the number of replicates (otherwise given by the "Replicates" Name-Value pair argument) from the size of the third dimension.
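A numeric array start might be built as follows (a minimal sketch with made-up 2-D data):

```
pkg load statistics
data = [randn(50, 2) + 1; randn(50, 2) - 1];
## 2×2×3 array of seeds: 2 centroids, 2 dimensions, 3 replicates
start = cat (3, [1, 1; -1, -1], [2, 0; -2, 0], [0, 2; 0, -2]);
[idx, centers] = kmeans (data, 2, "Start", start);  # Replicates inferred as 3
```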
"Distance": The distance measure used for partitioning and calculating centroids. Supported values:

    "sqeuclidean": The squared Euclidean distance, i.e., the sum of the squares of the differences between corresponding components. In this case the centroid is the arithmetic mean of all samples in its cluster. This is the only distance for which the algorithm is truly "k-means".
    "cityblock": The sum metric, or L1 distance, i.e., the sum of the absolute differences between corresponding components. In this case the centroid is the component-wise median of all samples in its cluster. This gives the k-medians algorithm.
    "cosine": One minus the cosine of the included angle between points (treated as vectors). Each centroid is the mean of the points in that cluster, after normalizing those points to unit Euclidean length.
    "correlation": One minus the sample correlation between points (treated as sequences of values). Each centroid is the component-wise mean of the points in that cluster, after centering and normalizing those points to zero mean and unit standard deviation.
    "hamming": The number of components in which the sample and the centroid differ. In this case the centroid is the component-wise median of all samples in its cluster. Unlike Matlab, Octave allows non-logical data.
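For example, selecting "cityblock" turns the procedure into k-medians (a minimal sketch with made-up data):

```
pkg load statistics
data = [randn(50, 2) + 1; randn(50, 2) - 1];
## Each centroid is the component-wise median of its cluster
[idx, centers] = kmeans (data, 2, "Distance", "cityblock");
```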
"EmptyAction": What to do when a centroid is not the closest to any data sample. Supported values:

    "error": Throw an error.
    "singleton": (Default) Select the row of data that has the highest error and use that as the new centroid.
    "drop": Remove the centroid and continue the computation with one fewer centroid. The dimensions of the outputs centers and dist are unchanged, with values for omitted centroids replaced by NaN.
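With "drop", any cluster that empties leaves a NaN row in centers, which can be detected afterwards (a minimal sketch with made-up data; whether a drop actually occurs depends on the data and seeds):

```
pkg load statistics
data = [randn(50, 2) + 1; randn(50, 2) - 1];
[idx, centers] = kmeans (data, 5, "EmptyAction", "drop");
dropped = find (any (isnan (centers), 2));  # rows of dropped centroids, if any
```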
"Display": Display a text summary. Supported values:

    "off": (Default) Display no summary.
    "final": Display a summary for each clustering operation.
    "iter": Display a summary for each iteration of a clustering operation.
"Replicates": A positive integer specifying the number of independent clusterings to perform. The output values are those of the best clustering, i.e., the one with the smallest value of sumd. If Start is numeric, Replicates defaults to (and must equal) the size of the third dimension of Start; otherwise it defaults to 1.

"MaxIter": The maximum number of iterations to perform for each replicate. If the maximum change of any centroid is less than 0.001, the replicate terminates even if MaxIter iterations have not occurred. The default is 100.
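Both parameters can be combined when convergence is slow (a minimal sketch with made-up data):

```
pkg load statistics
data = [randn(50, 2) + 1; randn(50, 2) - 1];
[idx, centers, sumd] = kmeans (data, 3, "MaxIter", 500, "Replicates", 4);
total = sum (sumd);  # objective value of the best of the 4 replicates
```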

Example:

[~,c] = kmeans (rand(10, 3), 2, "emptyaction", "singleton");

See also: linkage

Source Code: kmeans

Example: 1

 

 ## Generate a two-cluster problem
 randn ("seed", 31)  # for reproducibility
 C1 = randn (100, 2) + 1;
 randn ("seed", 32)  # for reproducibility
 C2 = randn (100, 2) - 1;
 data = [C1; C2];

 ## Perform clustering
 rand ("seed", 1)  # for reproducibility
 [idx, centers] = kmeans (data, 2);

 ## Plot the result
 figure;
 plot (data (idx==1, 1), data (idx==1, 2), "ro");
 hold on;
 plot (data (idx==2, 1), data (idx==2, 2), "bs");
 plot (centers (:, 1), centers (:, 2), "kv", "markersize", 10);
 hold off;

                    
plotted figure

Example: 2

 

 ## Cluster data using k-means clustering, then plot the cluster regions
 ## Load Fisher's iris data set and use the petal lengths and widths as
 ## predictors

 load fisheriris
 X = meas(:,3:4);

 figure;
 plot (X(:,1), X(:,2), "k*", "MarkerSize", 5);
 title ("Fisher's Iris Data");
 xlabel ("Petal Lengths (cm)");
 ylabel ("Petal Widths (cm)");

 ## Cluster the data. Specify k = 3 clusters
 rand ("seed", 1)  # for reproducibility
 [idx, C] = kmeans (X, 3);
 x1 = min (X(:,1)):0.01:max (X(:,1));
 x2 = min (X(:,2)):0.01:max (X(:,2));
 [x1G, x2G] = meshgrid (x1, x2);
 XGrid = [x1G(:), x2G(:)];

 idx2Region = kmeans (XGrid, 3, "MaxIter", 1, "Start", C);
 figure;
 gscatter (XGrid(:,1), XGrid(:,2), idx2Region, ...
           [0, 0.75, 0.75; 0.75, 0, 0.75; 0.75, 0.75, 0], "..");
 hold on;
 plot (X(:,1), X(:,2), "k*", "MarkerSize", 5);
 title ("Fisher's Iris Data");
 xlabel ("Petal Lengths (cm)");
 ylabel ("Petal Widths (cm)");
 legend ("Region 1", "Region 2", "Region 3", "Data", "Location", "SouthEast");
 hold off

warning: kmeans: failed to converge in 1 iterations

                    
plotted figure

plotted figure

Example: 3

 

 ## Partition Data into Two Clusters

 randn ("seed", 1)  # for reproducibility
 r1 = randn (100, 2) * 0.75 + ones (100, 2);
 randn ("seed", 2)  # for reproducibility
 r2 = randn (100, 2) * 0.5 - ones (100, 2);
 X = [r1; r2];

 figure;
 plot (X(:,1), X(:,2), ".");
 title ("Randomly Generated Data");
 rand ("seed", 1)  # for reproducibility
 [idx, C] = kmeans (X, 2, "Distance", "cityblock", ...
                          "Replicates", 5, "Display", "final");
 figure;
 plot (X(idx==1,1), X(idx==1,2), "r.", "MarkerSize", 12);
 hold on
 plot(X(idx==2,1), X(idx==2,2), "b.", "MarkerSize", 12);
 plot (C(:,1), C(:,2), "kx", "MarkerSize", 15, "LineWidth", 3);
 legend ("Cluster 1", "Cluster 2", "Centroids", "Location", "NorthWest");
 title ("Cluster Assignments and Centroids");
 hold off

Replicate 1, 5 iterations, total sum of distances = 197.416.
Replicate 2, 1 iterations, total sum of distances = 253.651.
Replicate 3, 1 iterations, total sum of distances = 401.899.
Replicate 4, 1 iterations, total sum of distances = 426.406.
Replicate 5, 1 iterations, total sum of distances = 663.780.
Best total sum of distances = 197.416
                    
plotted figure

plotted figure

Example: 4

 

 ## Assign New Data to Existing Clusters

 ## Generate a training data set using three distributions
 randn ("seed", 5)  # for reproducibility
 r1 = randn (100, 2) * 0.75 + ones (100, 2);
 randn ("seed", 7)  # for reproducibility
 r2 = randn (100, 2) * 0.5 - ones (100, 2);
 randn ("seed", 9)  # for reproducibility
 r3 = randn (100, 2) * 0.75;
 X = [r1; r2; r3];

 ## Partition the training data into three clusters by using kmeans

 rand ("seed", 1)  # for reproducibility
 [idx, C] = kmeans (X, 3);

 ## Plot the clusters and the cluster centroids

 figure
 gscatter (X(:,1), X(:,2), idx, "bgm", "***");
 hold on
 plot (C(:,1), C(:,2), "kx");
 legend ("Cluster 1", "Cluster 2", "Cluster 3", "Cluster Centroid")

 ## Generate a test data set
 randn ("seed", 25)  # for reproducibility
 r1 = randn (100, 2) * 0.75 + ones (100, 2);
 randn ("seed", 27)  # for reproducibility
 r2 = randn (100, 2) * 0.5 - ones (100, 2);
 randn ("seed", 29)  # for reproducibility
 r3 = randn (100, 2) * 0.75;
 Xtest = [r1; r2; r3];

 ## Classify the test data set using the existing clusters
 ## Find the nearest centroid from each test data point by using pdist2

 D = pdist2 (C, Xtest, "euclidean");
 [group, ~] = find (D == min (D));

 ## Plot the test data and label the test data using idx_test with gscatter

 gscatter (Xtest(:,1), Xtest(:,2), group, "bgm", "ooo");
 legend ("Cluster 1", "Cluster 2", "Cluster 3", "Cluster Centroid", ...
         "Data classified to Cluster 1", "Data classified to Cluster 2", ...
         "Data classified to Cluster 3", "Location", "NorthWest");
 title ("Assign New Data to Existing Clusters");

                    
plotted figure