KMeans (Java Statistical Analysis Tool 0.0.8 API)

java.lang.Object
- jsat.clustering.ClustererBase
  - jsat.clustering.KClustererBase
    - jsat.clustering.kmeans.KMeans

All Implemented Interfaces:

Serializable, Clusterer, KClusterer, Parameterized

Direct Known Subclasses:

ElkanKMeans, GMeans, HamerlyKMeans, KMeansPDN, NaiveKMeans, XMeans
```
public abstract class KMeans
extends KClustererBase
implements Parameterized
```
Base class for the numerous implementations of k-means that exist. This base class provides a slow heuristic approach to the selection of k.

Author:

Edward Raff

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`static SeedSelectionMethods.SeedSelection`	`DEFAULT_SEED_SELECTION` This is the default seed selection method used in ElkanKMeans.
`protected DistanceMetric`	`dm`
`protected int`	`MaxIterLimit` Control the maximum number of iterations to perform.
`protected List<Vec>`	`means` The list of means
`protected double[]`	`nearestCentroidDist` Distance from a datapoint to its nearest centroid.
`protected Random`	`rand`
`protected boolean`	`saveCentroidDistance` Indicates whether or not the distance between a datapoint and its nearest centroid should be saved after clustering.
`protected SeedSelectionMethods.SeedSelection`	`seedSelection`
`protected boolean`	`storeMeans` Indicates whether or not the means from the clustering should be saved

Constructor Summary

Constructors
Constructor and Description
`KMeans(DistanceMetric dm, SeedSelectionMethods.SeedSelection seedSelection, Random rand)`
`KMeans(KMeans toCopy)` Copy constructor

Method Summary

Modifier and Type	Method and Description
`abstract KMeans`	`clone()`
`int[]`	`cluster(DataSet dataSet, ExecutorService threadpool, int[] designations)` Performs clustering on the given data set.
`int[]`	`cluster(DataSet dataSet, int[] designations)` Performs clustering on the given data set.
`int[]`	`cluster(DataSet dataSet, int clusters, ExecutorService threadpool, int[] designations)`
`int[]`	`cluster(DataSet dataSet, int clusters, int[] designations)`
`int[]`	`cluster(DataSet dataSet, int lowK, int highK, ExecutorService threadpool, int[] designations)`
`int[]`	`cluster(DataSet dataSet, int lowK, int highK, int[] designations)`
`protected abstract double`	`cluster(DataSet dataSet, List<Double> accelCache, int k, List<Vec> means, int[] assignment, boolean exactTotal, ExecutorService threadpool, boolean returnError, Vec dataPointWeights)` This is a helper method where the actual clustering is performed.
`DistanceMetric`	`getDistanceMetric()` Returns the distance metric in use
`int`	`getIterationLimit()` Returns the maximum number of iterations of the ElkanKMeans algorithm that will be performed.
`protected static List<List<DataPoint>>`	`getListOfLists(int k)`
`List<Vec>`	`getMeans()` Returns the raw list of means that were used for each class.
`Parameter`	`getParameter(String paramName)` Returns the parameter with the given name.
`List<Parameter>`	`getParameters()` Returns the list of parameters that can be altered for this learner.
`SeedSelectionMethods.SeedSelection`	`getSeedSelection()`
`void`	`setIterationLimit(int iterLimit)` Sets the maximum number of iterations allowed
`void`	`setSeedSelection(SeedSelectionMethods.SeedSelection seedSelection)` Sets the method of seed selection to use for this algorithm.
`void`	`setStoreMeans(boolean storeMeans)` If set to `true` the computed means will be stored after clustering is completed, and can then be retrieved using `getMeans()`.
`boolean`	`supportsWeightedData()` Indicates whether the model knows how to cluster using weighted data points.

Methods inherited from class jsat.clustering.KClustererBase
cluster, cluster, cluster, cluster

Methods inherited from class jsat.clustering.ClustererBase
cluster, cluster, createClusterListFromAssignmentArray, getDatapointsFromCluster

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface jsat.clustering.Clusterer
cluster, cluster

- Field Detail
  - DEFAULT_SEED_SELECTION
```
public static final SeedSelectionMethods.SeedSelection DEFAULT_SEED_SELECTION
```
    This is the default seed selection method used in ElkanKMeans. When used with the EuclideanDistance, it selects seeds that are log optimal with a high probability.
  - dm
```
protected DistanceMetric dm
```
  - seedSelection
```
protected SeedSelectionMethods.SeedSelection seedSelection
```
  - rand
```
protected Random rand
```
  - storeMeans
```
protected boolean storeMeans
```
    Indicates whether or not the means from the clustering should be saved
  - saveCentroidDistance
```
protected boolean saveCentroidDistance
```
    Indicates whether or not the distance between a datapoint and its nearest centroid should be saved after clustering. This only applies when the error of the model is requested
  - nearestCentroidDist
```
protected double[] nearestCentroidDist
```
    Distance from a datapoint to its nearest centroid. May be an approximate distance
  - means
```
protected List<Vec> means
```
    The list of means
  - MaxIterLimit
```
protected int MaxIterLimit
```
    Control the maximum number of iterations to perform.
- Constructor Detail
  - KMeans
```
public KMeans(DistanceMetric dm,
              SeedSelectionMethods.SeedSelection seedSelection,
              Random rand)
```
  - KMeans
```
public KMeans(KMeans toCopy)
```
    Copy constructor
    
    Parameters:
    
    toCopy - the KMeans instance to copy
- Method Detail
  - setIterationLimit
```
public void setIterationLimit(int iterLimit)
```
    Sets the maximum number of iterations allowed
    
    Parameters:
    
    iterLimit - the maximum number of iterations of the ElkanKMeans algorithm
  - getIterationLimit
```
public int getIterationLimit()
```
    Returns the maximum number of iterations of the ElkanKMeans algorithm that will be performed.
    
    Returns:
    
    the maximum number of iterations of the ElkanKMeans algorithm that will be performed.
  - setStoreMeans
```
public void setStoreMeans(boolean storeMeans)
```
    If set to true the computed means will be stored after clustering is completed, and can then be retrieved using getMeans().
    
    Parameters:
    
    storeMeans - true if the means should be stored for later, false to discard them once clustering is complete.
  - getMeans
```
public List<Vec> getMeans()
```
    Returns the raw list of means that were used for each class.
    
    Returns:
    
    the list of means for each class
  - setSeedSelection
```
public void setSeedSelection(SeedSelectionMethods.SeedSelection seedSelection)
```
    Sets the method of seed selection to use for this algorithm. SeedSelectionMethods.SeedSelection.KPP is recommended for this algorithm in particular.
    
    Parameters:
    
    seedSelection - the method of seed selection to use
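    
    As a rough illustration of what k-means++ (KPP) style seeding does, the sketch below picks the first seed uniformly at random and each later seed with probability proportional to its squared distance from the nearest seed chosen so far. It works on plain arrays and is not JSAT's implementation: the class `KppSeedingSketch` and its methods are hypothetical names, and squared Euclidean distance is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical standalone sketch; not part of JSAT.
class KppSeedingSketch {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    /** Returns up to k distinct row indices of data to use as initial means. */
    static List<Integer> selectSeeds(double[][] data, int k, Random rand) {
        List<Integer> seeds = new ArrayList<>();
        seeds.add(rand.nextInt(data.length)); // first seed: uniform choice
        double[] minSq = new double[data.length];
        Arrays.fill(minSq, Double.POSITIVE_INFINITY);
        while (seeds.size() < k) {
            // Update each point's squared distance to its nearest seed so far.
            double[] last = data[seeds.get(seeds.size() - 1)];
            double total = 0;
            for (int i = 0; i < data.length; i++) {
                minSq[i] = Math.min(minSq[i], sqDist(data[i], last));
                total += minSq[i];
            }
            // Sample an index with probability minSq[i] / total.
            double target = rand.nextDouble() * total;
            int chosen = -1;
            for (int i = 0; i < data.length; i++) {
                if (minSq[i] <= 0) continue; // coincides with an existing seed
                chosen = i;
                target -= minSq[i];
                if (target < 0) break;
            }
            if (chosen < 0) break; // every point coincides with a seed
            seeds.add(chosen);
        }
        return seeds;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 0.1}, {5, 5}, {9, 9}};
        System.out.println(selectSeeds(data, 3, new Random(7)));
    }
}
```

    Points far from every existing seed are more likely to be chosen, which tends to spread the initial means across the data.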
  - getSeedSelection
```
public SeedSelectionMethods.SeedSelection getSeedSelection()
```
    Returns:
    
    the method of seed selection used
  - getDistanceMetric
```
public DistanceMetric getDistanceMetric()
```
    Returns the distance metric in use
    
    Returns:
    
    the distance metric in use
  - cluster
```
protected abstract double cluster(DataSet dataSet,
                                  List<Double> accelCache,
                                  int k,
                                  List<Vec> means,
                                  int[] assignment,
                                  boolean exactTotal,
                                  ExecutorService threadpool,
                                  boolean returnError,
                                  Vec dataPointWeights)
```
    This is a helper method where the actual clustering is performed. It exists because there are multiple strategies for modifying k-means, but all of them require this step.
    The distance metric is trained first if needed.
    
    Parameters:
    
    dataSet - The set of data points to perform clustering on
    
    accelCache - acceleration cache to use, or null. If null, the kmeans code will attempt to create one
    
    k - the number of clusters
    
    means - the initial points to use as the means. Its length is the number of means that will be searched for. These means will be altered, and should contain deep copies of the points they were drawn from. May be empty, in which case the list will be filled with some selected means
    
    assignment - an empty temp space to store the clustering classifications. Should be the same length as the number of data points
    
    exactTotal - determines how the objective function (return value) will be computed. If true, extra work will be done to compute the exact distance from each data point to its cluster. If false, an upper bound approximation will be used. This also impacts the value stored in nearestCentroidDist
    
    threadpool - the source of threads for parallel computation. If null, single threaded execution will occur
    
    returnError - true if the sum of squared distances should be returned. false means any value can be returned. saveCentroidDistance only applies if this is true
    
    dataPointWeights - the weight value to use for each data point. If null, assume each point has equal weight.
    
    Returns:
    
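    the sum of squared distances (exact or an upper bound, depending on exactTotal) if returnError is true; otherwise an unspecified value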
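    
    To make the contract above concrete, here is a minimal sketch of plain Lloyd-style k-means with the same shape: the caller supplies initial means that are modified in place, an assignment array that is filled in, and the method returns the sum of squared distances. It is not the code of any JSAT subclass. The class `LloydSketch` is a hypothetical name, and the sketch assumes unweighted points, squared Euclidean distance, single-threaded execution and no acceleration cache.

```java
import java.util.Arrays;

// Hypothetical standalone sketch; not part of JSAT.
class LloydSketch {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    /**
     * Alternates assignment and mean-update steps until no assignment changes
     * or maxIter is reached. The means array is altered in place.
     * Returns the sum of squared distances to the assigned means.
     */
    static double cluster(double[][] data, double[][] means, int[] assignment, int maxIter) {
        if (maxIter < 1) throw new IllegalArgumentException("maxIter must be at least 1");
        int k = means.length, dim = data[0].length;
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: each point goes to its nearest mean.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (sqDist(data[i], means[c]) < sqDist(data[i], means[best])) best = c;
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed) break;
            // Update step: each mean becomes the average of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < data.length; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < dim; j++) sums[assignment[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // leave an empty cluster's mean unchanged
                for (int j = 0; j < dim; j++) means[c][j] = sums[c][j] / counts[c];
            }
        }
        double total = 0;
        for (int i = 0; i < data.length; i++) total += sqDist(data[i], means[assignment[i]]);
        return total;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] means = {{0, 0}, {10, 10}}; // copies, since they are modified
        int[] assignment = new int[data.length];
        System.out.println(cluster(data, means, assignment, 100));
        System.out.println(Arrays.toString(assignment));
    }
}
```

    Because the means are overwritten, callers should pass copies of any data points used as seeds, which matches the "deep copies" requirement stated for the means parameter.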
  - getListOfLists
```
protected static List<List<DataPoint>> getListOfLists(int k)
```
  - cluster
```
public int[] cluster(DataSet dataSet,
                     int[] designations)
```
    Description copied from interface: Clusterer
    
    Performs clustering on the given data set. Parameters may be estimated by the method, or other heuristics performed.
    
    Specified by:
    
    cluster in interface Clusterer
    
    Parameters:
    
    dataSet - the data set to perform clustering on
    
    designations - the array which will contain the designated values. The array will be altered and returned by the function. If null is given, a new array will be created and returned.
    
    Returns:
    
    an array indicating the cluster designation of each data point. This is the same array as designations, or a new one if the input array was null
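    
    The reuse-or-allocate behavior of the designations argument can be pictured with the small sketch below. It only illustrates the calling convention described above; `DesignationsSketch` and `nearestOf` are hypothetical names, the data is one-dimensional for brevity, and a non-null array is assumed to have one slot per data point.

```java
// Hypothetical standalone sketch; not part of JSAT.
class DesignationsSketch {
    /**
     * Writes the index of the nearest center for each value into designations.
     * If designations is null a new array is allocated; either way the array
     * that was filled is returned.
     */
    static int[] nearestOf(double[] data, double[] centers, int[] designations) {
        if (designations == null) designations = new int[data.length];
        for (int i = 0; i < data.length; i++) {
            int best = 0;
            for (int c = 1; c < centers.length; c++)
                if (Math.abs(data[i] - centers[c]) < Math.abs(data[i] - centers[best])) best = c;
            designations[i] = best;
        }
        return designations;
    }

    public static void main(String[] args) {
        double[] data = {0.0, 1.0, 9.0};
        double[] centers = {0.0, 10.0};
        int[] buffer = new int[data.length];
        int[] reused = nearestOf(data, centers, buffer); // same object as buffer
        int[] fresh = nearestOf(data, centers, null);    // newly allocated
        System.out.println(reused == buffer);
        System.out.println(fresh.length);
    }
}
```

    Passing an existing array avoids a fresh allocation when clustering is run repeatedly on data sets of the same size.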
  - cluster
```
public int[] cluster(DataSet dataSet,
                     ExecutorService threadpool,
                     int[] designations)
```
    Description copied from interface: Clusterer
    
    Performs clustering on the given data set. Parameters may be estimated by the method, or other heuristics performed.
    
    Specified by:
    
    cluster in interface Clusterer
    
    Parameters:
    
    dataSet - the data set to perform clustering on
    
    threadpool - a source of threads to run tasks
    
    designations - the array which will contain the designated values. The array will be altered and returned by the function. If null is given, a new array will be created and returned.
    
    Returns:
    
    an array indicating the cluster designation of each data point. This is the same array as designations, or a new one if the input array was null
  - cluster
```
public int[] cluster(DataSet dataSet,
                     int clusters,
                     ExecutorService threadpool,
                     int[] designations)
```
    Specified by:
    
    cluster in interface KClusterer
  - cluster
```
public int[] cluster(DataSet dataSet,
                     int clusters,
                     int[] designations)
```
    Specified by:
    
    cluster in interface KClusterer
  - cluster
```
public int[] cluster(DataSet dataSet,
                     int lowK,
                     int highK,
                     ExecutorService threadpool,
                     int[] designations)
```
    Specified by:
    
    cluster in interface KClusterer
  - cluster
```
public int[] cluster(DataSet dataSet,
                     int lowK,
                     int highK,
                     int[] designations)
```
    Specified by:
    
    cluster in interface KClusterer
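    
    This page does not describe how k is chosen between lowK and highK. Purely to illustrate the general idea of trying each k and comparing clustering error, the sketch below runs one-dimensional k-means for successive values of k and stops when one more cluster removes less than a chosen fraction of the remaining error. The stopping rule, the quantile seeding, the `minGain` parameter and the class `ChooseKSketch` are all assumptions made for this example, not JSAT's heuristic.

```java
import java.util.Arrays;

// Hypothetical standalone sketch; the selection rule is NOT JSAT's heuristic.
class ChooseKSketch {
    /** Runs 1-D Lloyd iterations from quantile seeds; returns the sum of squared errors. */
    static double sse(double[] sorted, int k, int maxIter) {
        int n = sorted.length;
        double[] means = new double[k];
        for (int j = 0; j < k; j++)
            means[j] = sorted[(2 * j + 1) * n / (2 * k)]; // evenly spaced quantiles
        int[] assign = new int[n];
        for (int iter = 0; iter < maxIter; iter++) {
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (Math.abs(sorted[i] - means[c]) < Math.abs(sorted[i] - means[best])) best = c;
                assign[i] = best;
                sum[best] += sorted[i];
                count[best]++;
            }
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (count[c] == 0) continue; // keep the mean of an empty cluster
                double m = sum[c] / count[c];
                if (m != means[c]) { means[c] = m; moved = true; }
            }
            if (!moved) break;
        }
        double total = 0;
        for (int i = 0; i < n; i++) {
            double d = sorted[i] - means[assign[i]];
            total += d * d;
        }
        return total;
    }

    /** Returns the first k in [lowK, highK] after which the relative error drop falls below minGain. */
    static int chooseK(double[] data, int lowK, int highK, double minGain) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double current = sse(sorted, lowK, 100);
        for (int k = lowK; k < highK; k++) {
            double next = sse(sorted, k + 1, 100);
            if (current == 0 || (current - next) / current < minGain) return k;
            current = next;
        }
        return highK;
    }

    public static void main(String[] args) {
        double[] data = {0, 0.1, 0.2, 4, 4.1, 4.2, 10, 10.1, 10.2};
        System.out.println(chooseK(data, 1, 5, 0.5)); // three well-separated groups
    }
}
```

    Each candidate k requires a full clustering run, which is why range-based selection of this kind is slow compared with clustering for a single fixed k.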
  - clone
```
public abstract KMeans clone()
```
    Specified by:
    
    clone in interface Clusterer
    
    Specified by:
    
    clone in interface KClusterer
    
    Specified by:
    
    clone in class KClustererBase
  - supportsWeightedData
```
public boolean supportsWeightedData()
```
    Description copied from interface: Clusterer
    
    Indicates whether the model knows how to cluster using weighted data points. If it does, the model will train assuming the weights. The values returned by this method may change depending on the parameters set for the model.
    
    Specified by:
    
    supportsWeightedData in interface Clusterer
    
    Overrides:
    
    supportsWeightedData in class ClustererBase
    
    Returns:
    
    true if the model supports weighted data, false otherwise
  - getParameters
```
public List<Parameter> getParameters()
```
    Description copied from interface: Parameterized
    
    Returns the list of parameters that can be altered for this learner.
    
    Specified by:
    
    getParameters in interface Parameterized
    
    Returns:
    
    the list of parameters that can be altered for this learner.
  - getParameter
```
public Parameter getParameter(String paramName)
```
    Description copied from interface: Parameterized
    
    Returns the parameter with the given name. Two different strings may map to a single Parameter object: an ASCII-only string and a Unicode-style string.
    
    Specified by:
    
    getParameter in interface Parameterized
    
    Parameters:
    
    paramName - the name of the parameter to obtain
    
    Returns:
    
    the Parameter in question, or null if no such named Parameter exists.
