silhouette
Compute the silhouette values of clustered data and show them on a plot.
X is an n-by-p matrix of n data points in a p-dimensional space. Each data point is assigned to a cluster using clust, a vector of n elements, one cluster assignment for each data point.
si is a vector of n silhouette values, one for each data point. Each value measures how likely it is that the data point has been assigned to the right cluster. Defining a_i as the mean distance between the i-th point and the other points from its cluster, and b_i as the mean distance between that point and the points from other clusters, the silhouette value of the i-th point is:
$$ S_i = \frac{b_i - a_i}{\max(a_i, b_i)} $$
Each element of si ranges from -1, minimum likelihood of a correct classification, to 1, maximum likelihood.
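To make the definition concrete, the following sketch computes silhouette values by hand for four points on a line, without calling silhouette itself. It is an illustration under stated assumptions: plain Euclidean distances are used rather than the function's default metric, the data and the variable names (D, a, b) are made up for the example, and there are only two clusters, so "the other clusters" is a single cluster.

```octave
## Toy data: four 1-D points in two clusters (illustrative values).
X = [0; 1; 5; 6];
clust = [1; 1; 2; 2];
n = size (X, 1);

## Pairwise Euclidean distances, built with broadcasting so that no
## package is needed. D(i,j) is the distance between points i and j.
diffs = permute (X, [1 3 2]) - permute (X, [3 1 2]);
D = sqrt (sum (diffs .^ 2, 3));

si = zeros (n, 1);
for i = 1:n
  same = (clust == clust(i));
  same(i) = false;                 # leave out the point itself
  other = (clust != clust(i));
  a = mean (D(i, same));           # mean distance within its own cluster
  b = mean (D(i, other));          # mean distance to the other cluster
  si(i) = (b - a) / max (a, b);
endfor

disp (si');
```

For this data the values are approximately 0.8182, 0.7778, 0.7778 and 0.8182. All are close to 1 because the two clusters are well separated, and the two outer points score slightly higher because they lie farther from the opposite cluster.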
Optional input value Metric is the metric used to compute the distances between data points. Since silhouette uses pdist to compute these distances, Metric is similar to the Distance input argument of pdist and it can be one of: euclidean, squaredeuclidean (default), seuclidean, mahalanobis, cityblock, minkowski, chebychev, cosine, correlation, hamming, jaccard, or spearman.
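Metric can also be a vector of distances such as the one returned by pdist. In this case X is ignored. Finally, Metric can be a function handle, which is passed to pdist with MetricArg as optional inputs.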
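As a usage sketch, the calls below compute silhouette values for the same data with the default metric and with cityblock. This assumes that the statistics package is installed and that Metric is given as the third positional argument; the synthetic data and the variable names are illustrative only.

```octave
pkg load statistics

## Two well-separated synthetic 2-D blobs (illustrative data).
randn ("state", 1);
X = [randn(20, 2); randn(20, 2) + 5];
clust = [ones(20, 1); 2 * ones(20, 1)];

## Default metric (squaredeuclidean, according to the text above).
si_default = silhouette (X, clust);

## Same data, with distances measured by the cityblock metric.
## A new figure is opened first so the earlier plot is kept.
figure ();
si_cityblock = silhouette (X, clust, "cityblock");

## Both results hold one value per data point.
printf ("default:   %d values, min %.3f\n", numel (si_default), min (si_default));
printf ("cityblock: %d values, min %.3f\n", numel (si_cityblock), min (si_cityblock));
```

The individual values generally differ between metrics, so silhouette values are best compared only when they were computed with the same Metric.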
Optional return value h is a handle to the silhouette plot.
Reference Peter J. Rousseeuw, Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. 1987. doi:10.1016/0377-0427(87)90125-7
See also: dendrogram, evalclusters, kmeans, linkage, pdist
load fisheriris;
X = meas(:,3:4);
cidcs = kmeans (X, 3, "Replicates", 5);
silhouette (X, cidcs);
y_labels(cidcs([1 51 101])) = unique (species);
set (gca, "yticklabel", y_labels);
title ("Fisher's iris data");
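The per-point values returned in si can also be summarized numerically. The following sketch is not part of the original example: it averages the silhouette values per cluster with accumarray. The variable names si and mean_si are illustrative, and the code assumes that the statistics package and its fisheriris data set are available.

```octave
pkg load statistics

load fisheriris;
X = meas(:,3:4);
cidcs = kmeans (X, 3, "Replicates", 5);

## Capture the silhouette values instead of only plotting them.
si = silhouette (X, cidcs);

## Average silhouette value of each cluster; entry k belongs to cluster k.
mean_si = accumarray (cidcs(:), si(:), [], @mean);
disp (mean_si');
```

A cluster whose average is clearly lower than the others contains points that lie close to a neighbouring cluster. Because kmeans starts from random centroids, the cluster numbering may change between runs.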