Statistics: editDistance

Function Reference: `editDistance`

statistics: d = editDistance (str)
statistics: d = editDistance (doc)
statistics: C = editDistance (…, minDist)
statistics: [C, IA, IC] = editDistance (…, minDist)
statistics: [C, IA, IC] = editDistance (…, minDist, "OutputAllIndices", value)
statistics: d = editDistance (str1, str2)
statistics: d = editDistance (doc1, doc2)

Compute the edit (Levenshtein) distance between strings or documents.

d = editDistance (str) takes a cell array of character vectors and computes the Levenshtein distance between each pair of strings in str, defined as the lowest number of grapheme insertions, deletions, and substitutions required to convert one string of the pair into the other (for example, str{1} into str{2}). If str is a cellstr vector with $N$ elements, the returned distance d is a column vector of doubles with $N (N-1) / 2$ elements. editDistance expects str to be a column vector: a row vector or an array (that is, all (size (str) > 1) is true) is converted to a column vector as in str = str(:).
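The pairwise computation described above can be sketched in Python. This is an illustrative sketch only, not the package's implementation: the names `levenshtein` and `pairwise_distances` are hypothetical, the comparison is per Unicode code point rather than per grapheme, and the pair ordering (1-2, 1-3, 2-3, ...) is an assumption.

```python
from itertools import combinations


def levenshtein(a, b):
    """Lowest number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x by y, or match
        prev = curr
    return prev[-1]


def pairwise_distances(strings):
    """Distances for every unordered pair, N * (N - 1) / 2 values in total."""
    return [levenshtein(a, b) for a, b in combinations(strings, 2)]


words = ["AS", "SD", "AD"]
d = pairwise_distances(words)
print(d)  # [2, 1, 1]
```

With three input strings the result has 3 × 2 / 2 = 3 entries, one per unordered pair.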

d = editDistance (doc) can also take a cell array containing cell arrays of character vectors, in which case each element of doc is regarded as a document, and each character vector within a document is regarded as a token. editDistance computes the Levenshtein distance between each pair of cell elements in doc, defined as the lowest number of token insertions, deletions, and substitutions required to convert one document of the pair into the other (for example, doc{1} into doc{2}). If doc is a cell vector with $N$ elements, the distance d is a column vector of doubles with $N (N-1) / 2$ elements. If doc is an array (that is, all (size (doc) > 1) is true), it is converted to a column vector as in doc = doc(:).
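For documents, the same recurrence applies with whole tokens as the unit of comparison instead of characters. The following Python sketch illustrates this under that reading; the helper name `levenshtein` and the sample documents are hypothetical and are not taken from the package.

```python
def levenshtein(a, b):
    """Edit distance between two sequences; each element is one edit unit."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete token x
                            curr[j - 1] + 1,      # insert token y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Each document is a list of tokens, so one edit replaces a whole word.
doc1 = ["the", "quick", "brown", "fox"]
doc2 = ["the", "slow", "brown", "fox", "jumps"]

d_tokens = levenshtein(doc1, doc2)
print(d_tokens)  # 2: substitute "quick" by "slow", insert "jumps"
```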

C = editDistance (…, minDist) specifies a minimum distance, minDist, which is regarded as a similarity threshold between each pair of strings or documents, defined in the previous syntaxes. In this case, editDistance resembles the functionality of the uniquetol function and returns the unique strings or documents that are similar up to minDist distance. C is either a cellstring array or a cell array of cellstrings, depending on the first input argument.

[C, IA, IC] = editDistance (…, minDist) also returns index vectors IA and IC. Assuming A contains either strings str or documents doc as defined above, IA is a column vector of indices to the first occurrence of similar elements such that C = A(IA), and IC is a column vector of indices such that A ~ C(IC) where ~ means that the strings or documents are within the specified distance minDist of each other.

[C, IA, IC] = editDistance (…, minDist, "OutputAllIndices", value) specifies the type of the second output index IA. value must be a logical scalar. When set to true, IA is a cell array containing the vectors of indices for ALL elements in A that are within the specified distance minDist of each other. Each cell in IA corresponds to a value in C and the values in each cell correspond to locations in A. If value is set to false, then IA is returned as an index vector described in the previous syntax.
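One way to picture the C, IA, and IC outputs is a greedy, first-occurrence grouping, sketched below in Python. This is an assumption about the behaviour made for illustration, not the package's algorithm; `group_similar` and its parameter names are hypothetical, and the indices are 0-based here whereas Octave indices are 1-based.

```python
def levenshtein(a, b):
    """Lowest number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def group_similar(items, min_dist, output_all_indices=False):
    """Greedy grouping: each item joins the first representative within min_dist."""
    reps = []     # index of the first occurrence of each group (plays the role of IA)
    members = []  # all item indices belonging to each group
    ic = []       # for each item, the position of its group in c (plays the role of IC)
    for i, item in enumerate(items):
        for k, r in enumerate(reps):
            if levenshtein(items[r], item) <= min_dist:
                ic.append(k)
                members[k].append(i)
                break
        else:
            # No representative is close enough: the item starts a new group.
            reps.append(i)
            members.append([i])
            ic.append(len(reps) - 1)
    c = [items[r] for r in reps]
    ia = members if output_all_indices else reps
    return c, ia, ic


a = ["AS", "SD", "AD"]
c, ia, ic = group_similar(a, 1)
print(c, ia, ic)  # ['AS', 'SD'] [0, 1] [0, 1, 0]

_, ia_all, _ = group_similar(a, 1, output_all_indices=True)
print(ia_all)     # [[0, 2], [1]]
```

In this sketch "AD" is folded into the group of "AS" because the two are one substitution apart, while "SD" is two edits from "AS" and therefore starts its own group.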

d = editDistance (str1, str2) can also take two character vectors, str1 and str2, and compute the Levenshtein distance d as the lowest number of grapheme insertions, deletions, and substitutions required to convert str1 to str2. str1 and str2 may also be cellstr arrays, in which case the pairwise distance is computed between str1{n} and str2{n}. The cellstr arrays must be of the same size, or one of them must be a scalar, in which case the scalar is expanded to the size of the other cellstr input. The returned distance d is a column vector with the same number of elements as the cellstr arrays. editDistance expects both str1 and str2 to be column vectors; if they are row vectors or arrays, they are converted to column vectors.
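The element-by-element pairing with scalar expansion can be sketched in Python as follows. This is a sketch under the description above, not the package's code; `elementwise_distances` is a hypothetical helper name and the sample strings are made up for illustration.

```python
def levenshtein(a, b):
    """Lowest number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def elementwise_distances(s1, s2):
    """Distance between s1[n] and s2[n]; a single string is expanded to match."""
    a = [s1] if isinstance(s1, str) else list(s1)
    b = [s2] if isinstance(s2, str) else list(s2)
    if len(a) == 1:
        a = a * len(b)  # expand the scalar first input
    if len(b) == 1:
        b = b * len(a)  # expand the scalar second input
    if len(a) != len(b):
        raise ValueError("inputs must have the same length or be scalar")
    return [levenshtein(x, y) for x, y in zip(a, b)]


print(elementwise_distances("kitten", "sitting"))                # [3]
print(elementwise_distances(["AS", "SD", "AD"], "AD"))           # [1, 1, 0]
print(elementwise_distances(["flaw", "gum"], ["lawn", "gym"]))   # [2, 1]
```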

d = editDistance (doc1, doc2) can also take two cell arrays containing cell arrays of character vectors, in which case each element of doc1 and doc2 is regarded as a document, and each character vector within a document is regarded as a token. editDistance computes the pairwise Levenshtein distance between the cell elements in doc1 and doc2 as the lowest number of token insertions, deletions, and substitutions required to convert document doc1{n} to document doc2{n}.

Source Code: editDistance
