Function Reference: fitgmdist

statistics: GMdist = fitgmdist (data, k, param1, value1, …)

Fit a Gaussian mixture model with k components to data. Each row of data is a data sample. Each column is a variable.

Optional parameters are:

  • "start": Initialization conditions. Possible values are:
    • "randSample" (default) Draws the initial means uniformly at random from the rows of data.
    • "plus" Uses k-means++ to initialize the means.
    • "cluster" Performs an initial clustering with 10% of the data.
    • vector A vector whose length equals the number of rows in data and whose values, from 1 to k, specify the component to which each row is initially allocated. The mean, variance, and weight of each component are calculated from that allocation.
    • structure A structure with fields mu, Sigma and ComponentProportion.

    For "randSample", "plus", and "cluster", the initial variance of each component is the variance of the entire data sample.

  • "Replicates": Number of random restarts to perform.
  • "RegularizationValue" or "Regularize": A small number added to the diagonal entries of the covariance to prevent singular covariances.
  • "SharedCovariance" or "SharedCov" (logical). True if all components must share the same covariance, which reduces the number of free parameters.
  • "CovarianceType" or "CovType" (string). Possible values are:
    • "full" (default) Allow arbitrary covariance matrices.
    • "diagonal" Force covariances to be diagonal, to reduce the number of free parameters.
  • "Options": A structure with all of the following fields:
    • MaxIter Maximum number of EM iterations (default 100).
    • TolFun Minimum increase in likelihood between iterations; EM terminates once the increase falls below this threshold (default 1e-6).
    • Display Possible values are:
      • "off" (default): Display nothing.
      • "final": Display the total number of iterations and likelihood once the execution completes.
      • "iter": Display the iteration number and likelihood after each iteration.
  • "Weight": A column vector or N×2 matrix. The first column consists of non-negative weights given to the samples. If these are all integers, this is equivalent to specifying weight(i) copies of row i of data, but potentially faster. If a row of data is used to represent samples that are similar but not identical, then the second column of weight indicates the variance of those original samples. Specifically, in the EM algorithm, the contribution of row i towards the variance is set to at least weight(i,2), to prevent spurious components with zero variance.
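
The options above can be combined in a single call. The following is a minimal sketch rather than a reference usage: the synthetic data, the seeds, and the variable names (opts, idx0, w, gm_diag, gm_shared, gm_weighted) are illustrative assumptions, and the option values are arbitrary. It assumes the statistics package is installed.

```octave
pkg load statistics

## Illustrative data: two well-separated clusters in two dimensions
rand ("state", 1);
randn ("state", 1);
data = [randn(100, 2) + 2; randn(100, 2) - 2];
k = 2;

## "Options" must carry all three fields described above
opts = struct ("MaxIter", 200, "TolFun", 1e-8, "Display", "off");

## k-means++ initialization, several random restarts, diagonal covariances
gm_diag = fitgmdist (data, k, "start", "plus", "Replicates", 5, ...
                     "CovarianceType", "diagonal", ...
                     "RegularizationValue", 1e-6, "Options", opts);

## Explicit initial allocation: one component index (1 to k) per row of data
idx0 = [ones(100, 1); 2 * ones(100, 1)];
gm_shared = fitgmdist (data, k, "start", idx0, "SharedCovariance", true);

## Integer sample weights: row i counts as w(i) copies of itself
w = [2 * ones(100, 1); ones(100, 1)];
gm_weighted = fitgmdist (data, k, "start", idx0, "Weight", w);
```

Passing a start vector makes the initialization deterministic, which is convenient when comparing a weighted fit against an unweighted one on the same data.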

See also: gmdistribution, kmeans

Source Code: fitgmdist

Example: 1

 

 ## Generate a two-cluster problem
 C1 = randn (100, 2) + 2;
 C2 = randn (100, 2) - 2;
 data = [C1; C2];

 ## Perform clustering
 GMModel = fitgmdist (data, 2);

 ## Plot the result
 figure
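 ## Evaluate the fitted density at the histogram bin centers
 [heights, bins] = hist3 (data);
 [xx, yy] = meshgrid (bins{1}, bins{2});
 bbins = [xx(:), yy(:)];
 ## Pass the grid to contour so the axes are in data units
 contour (xx, yy, reshape (GMModel.pdf (bbins), size (xx)));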

                    

Example: 2

 

 Angle_Theta = [ 30 + 10 * randn(1, 10),  60 + 10 * randn(1, 10) ]';
 nbOrientations = 2;
 initial_orientations = [38.0; 18.0];
 initial_weights = ones (1, nbOrientations) / nbOrientations;
 initial_Sigma = 10 * ones (1, 1, nbOrientations);
 start = struct ("mu", initial_orientations, "Sigma", initial_Sigma, ...
                 "ComponentProportion", initial_weights);
 GMModel_Theta = fitgmdist (Angle_Theta, nbOrientations, "Start", start , ...
                            "RegularizationValue", 0.0001)

Gaussian mixture distribution with 2 components in 1 dimension(s)
Clust 1: weight 0.701113
	Mean: 50.5551 
	Variance:135.42
Clust 2: weight 0.298887
	Mean: 19.3242 
	Variance:23.764
AIC=175.832 BIC=180.811 NLogL=82.9162 Iter=10 Cged=1 Reg=0.0001
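
The values in the printed summary can also be read programmatically. The sketch below repeats the setup of Example 2 with a fixed seed and assumes that the returned object exposes the properties mu, Sigma, and ComponentProportion (the same names as the fields of the "start" structure) and the methods cluster and posterior of the gmdistribution class. The variable names are illustrative, and the numbers depend on the random data.

```octave
pkg load statistics

## Same setup as Example 2, with a fixed seed for repeatability
randn ("state", 3);
theta = [30 + 10 * randn(1, 10), 60 + 10 * randn(1, 10)]';
start = struct ("mu", [38.0; 18.0], "Sigma", 10 * ones (1, 1, 2), ...
                "ComponentProportion", ones (1, 2) / 2);
gm = fitgmdist (theta, 2, "Start", start, "RegularizationValue", 0.0001);

## Parameters echoed in the printed summary (assumed property names)
means   = gm.mu;                    # one mean per component
vars    = squeeze (gm.Sigma);       # one variance per component (1-D data)
weights = gm.ComponentProportion;   # mixing weights

## Hard assignments and per-component posterior probabilities
labels = cluster (gm, theta);
post   = posterior (gm, theta);

for j = 1:2
  printf ("component %d: weight %.3f, mean %.2f, variance %.2f\n", ...
          j, weights(j), means(j), vars(j));
endfor
```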