public abstract class KMeans extends KClustererBase implements Parameterized
Modifier and Type | Field and Description |
---|---|
static SeedSelectionMethods.SeedSelection |
DEFAULT_SEED_SELECTION
This is the default seed selection method used in ElkanKMeans.
|
protected DistanceMetric |
dm |
protected int |
MaxIterLimit
Control the maximum number of iterations to perform.
|
protected List<Vec> |
means
The list of means
|
protected double[] |
nearestCentroidDist
Distance from a datapoint to its nearest centroid.
|
protected Random |
rand |
protected boolean |
saveCentroidDistance
Indicates whether or not the distance between a datapoint and its nearest
centroid should be saved after clustering.
|
protected SeedSelectionMethods.SeedSelection |
seedSelection |
protected boolean |
storeMeans
Indicates whether or not the means from the clustering should be saved
|
Constructor and Description |
---|
KMeans(DistanceMetric dm,
SeedSelectionMethods.SeedSelection seedSelection,
Random rand) |
KMeans(KMeans toCopy)
Copy constructor
|
Modifier and Type | Method and Description |
---|---|
abstract KMeans |
clone() |
int[] |
cluster(DataSet dataSet,
ExecutorService threadpool,
int[] designations)
Performs clustering on the given data set.
|
int[] |
cluster(DataSet dataSet,
int[] designations)
Performs clustering on the given data set.
|
int[] |
cluster(DataSet dataSet,
int clusters,
ExecutorService threadpool,
int[] designations) |
int[] |
cluster(DataSet dataSet,
int clusters,
int[] designations) |
int[] |
cluster(DataSet dataSet,
int lowK,
int highK,
ExecutorService threadpool,
int[] designations) |
int[] |
cluster(DataSet dataSet,
int lowK,
int highK,
int[] designations) |
protected abstract double |
cluster(DataSet dataSet,
List<Double> accelCache,
int k,
List<Vec> means,
int[] assignment,
boolean exactTotal,
ExecutorService threadpool,
boolean returnError,
Vec dataPointWeights)
This is a helper method where the actual clustering is performed.
|
DistanceMetric |
getDistanceMetric()
Returns the distance metric in use
|
int |
getIterationLimit()
Returns the maximum number of iterations of the ElkanKMeans algorithm that will be performed.
|
protected static List<List<DataPoint>> |
getListOfLists(int k) |
List<Vec> |
getMeans()
Returns the raw list of means that were used for each class.
|
Parameter |
getParameter(String paramName)
Returns the parameter with the given name.
|
List<Parameter> |
getParameters()
Returns the list of parameters that can be altered for this learner.
|
SeedSelectionMethods.SeedSelection |
getSeedSelection() |
void |
setIterationLimit(int iterLimit)
Sets the maximum number of iterations allowed
|
void |
setSeedSelection(SeedSelectionMethods.SeedSelection seedSelection)
Sets the method of seed selection to use for this algorithm.
|
void |
setStoreMeans(boolean storeMeans)
If set to true, the computed means will be stored after clustering
is completed, and can then be retrieved using getMeans().
|
boolean |
supportsWeightedData()
Indicates whether the model knows how to cluster using weighted data
points.
|
Methods inherited from class KClustererBase: cluster, cluster, cluster, cluster
Methods inherited from class ClustererBase: cluster, cluster, createClusterListFromAssignmentArray, getDatapointsFromCluster
public static final SeedSelectionMethods.SeedSelection DEFAULT_SEED_SELECTION
This is the default seed selection method used in ElkanKMeans. With EuclideanDistance, it selects seeds that are log optimal with a high probability.
protected DistanceMetric dm
protected SeedSelectionMethods.SeedSelection seedSelection
protected Random rand
protected boolean storeMeans
protected boolean saveCentroidDistance
protected double[] nearestCentroidDist
protected int MaxIterLimit
public KMeans(DistanceMetric dm, SeedSelectionMethods.SeedSelection seedSelection, Random rand)
public KMeans(KMeans toCopy)
toCopy - the KMeans object to copy
public void setIterationLimit(int iterLimit)
iterLimit - the maximum number of iterations of the ElkanKMeans algorithm
public int getIterationLimit()
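To illustrate what an iteration limit bounds, the following is a minimal sketch of a Lloyd-style k-means loop on one-dimensional data. It is not JSAT code: the class `IterationLimitSketch`, the method `lloyd`, and its parameters are hypothetical names chosen for this example. The loop stops when no assignment changes or when the cap is reached, whichever comes first.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not part of JSAT. */
final class IterationLimitSketch {

    /**
     * Runs Lloyd-style k-means on 1-D data, updating means in place.
     * Returns the number of iterations actually performed, which never
     * exceeds maxIterLimit.
     */
    static int lloyd(double[] data, double[] means, int maxIterLimit) {
        int[] assignment = new int[data.length];
        Arrays.fill(assignment, -1); // -1 marks "not yet assigned"
        int iterations = 0;
        boolean changed = true;
        while (changed && iterations < maxIterLimit) {
            changed = false;
            // Assignment step: move each point to its nearest mean.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < means.length; c++) {
                    if (Math.abs(data[i] - means[c]) < Math.abs(data[i] - means[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            // Update step: recompute each mean from its assigned points.
            for (int c = 0; c < means.length; c++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < data.length; i++) {
                    if (assignment[i] == c) {
                        sum += data[i];
                        count++;
                    }
                }
                if (count > 0) {
                    means[c] = sum / count; // an empty cluster keeps its old mean
                }
            }
            iterations++;
        }
        return iterations;
    }
}
```

With a small cap the means may not have converged yet, so the limit trades accuracy for a bounded running time.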
public void setStoreMeans(boolean storeMeans)
If set to true, the computed means will be stored after clustering is completed, and can then be retrieved using getMeans().
storeMeans - true if the means should be stored for later, false to discard them once clustering is complete.
public List<Vec> getMeans()
public void setSeedSelection(SeedSelectionMethods.SeedSelection seedSelection)
Sets the method of seed selection to use for this algorithm. SeedSelectionMethods.SeedSelection.KPP is recommended for this algorithm in particular.
seedSelection - the method of seed selection to use
public SeedSelectionMethods.SeedSelection getSeedSelection()
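For readers unfamiliar with the recommended KPP option, the sketch below assumes that KPP refers to k-means++ style seeding and shows the general technique on plain arrays. It is not JSAT's implementation: `KppSeedingSketch`, `selectSeeds`, and `sqDist` are hypothetical names. The first seed is drawn uniformly; each later seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far.

```java
import java.util.Random;

/** Hypothetical illustration of k-means++ style seeding; not JSAT's implementation. */
final class KppSeedingSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double sqDist(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns the indices of k seeds chosen from data (requires 1 <= k <= data.length). */
    static int[] selectSeeds(double[][] data, int k, Random rand) {
        int n = data.length;
        int[] seeds = new int[k];
        double[] minSqDist = new double[n];

        // First seed: chosen uniformly at random.
        seeds[0] = rand.nextInt(n);
        for (int i = 0; i < n; i++) {
            minSqDist[i] = sqDist(data[i], data[seeds[0]]);
        }

        for (int s = 1; s < k; s++) {
            double total = 0;
            for (double d : minSqDist) {
                total += d;
            }
            // Sample an index with probability proportional to minSqDist[i].
            double target = rand.nextDouble() * total;
            int chosen = n - 1; // fallback in case rounding leaves no index selected
            double running = 0;
            for (int i = 0; i < n; i++) {
                running += minSqDist[i];
                if (running > target) {
                    chosen = i;
                    break;
                }
            }
            seeds[s] = chosen;
            // Refresh each point's distance to its nearest seed.
            for (int i = 0; i < n; i++) {
                minSqDist[i] = Math.min(minSqDist[i], sqDist(data[i], data[chosen]));
            }
        }
        return seeds;
    }
}
```

Because an already chosen point has distance zero to itself, it contributes nothing to the sampling weight, so the seeds tend to spread across the data.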
public DistanceMetric getDistanceMetric()
protected abstract double cluster(DataSet dataSet, List<Double> accelCache, int k, List<Vec> means, int[] assignment, boolean exactTotal, ExecutorService threadpool, boolean returnError, Vec dataPointWeights)
dataSet - the set of data points to perform clustering on
accelCache - acceleration cache to use, or null. If null, the kmeans code will attempt to create one
k - the number of clusters
means - the initial points to use as the means. Its length is the number of means that will be searched for. These means will be altered, and should contain deep copies of the points they were drawn from. May be empty, in which case the list will be filled with some selected means
assignment - an empty temp space to store the clustering classifications. Should be the same length as the number of data points
exactTotal - determines how the objective function (return value) will be computed. If true, extra work will be done to compute the exact distance from each data point to its cluster. If false, an upper bound approximation will be used. This also impacts the value stored in nearestCentroidDist
threadpool - the source of threads for parallel computation. If null, single threaded execution will occur
returnError - true if the sum of squared distances should be returned. false means any value can be returned. saveCentroidDistance only applies if this is true
dataPointWeights - the weight value to use for each data point. If null, assume each point has equal weight.
public int[] cluster(DataSet dataSet, int[] designations)
Performs clustering on the given data set.
cluster in interface Clusterer
dataSet - the data set to perform clustering on
designations - the array which will contain the designated values. The array will be altered and returned by the function. If null is given, a new array will be created and returned.
public int[] cluster(DataSet dataSet, ExecutorService threadpool, int[] designations)
Performs clustering on the given data set.
cluster in interface Clusterer
dataSet - the data set to perform clustering on
threadpool - a source of threads to run tasks
designations - the array which will contain the designated values. The array will be altered and returned by the function. If null is given, a new array will be created and returned.
public int[] cluster(DataSet dataSet, int clusters, ExecutorService threadpool, int[] designations)
cluster in interface KClusterer
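The designations contract described for the cluster methods above (the passed array is altered and returned, and a new array is created when null is given) can be sketched as follows. This is a hypothetical stand-in rather than JSAT code: `DesignationsSketch` and `assign` are invented names, and the nearest-center rule on one-dimensional data is only there to give the array something to hold.

```java
/** Hypothetical illustration of the reuse-or-allocate contract; not part of JSAT. */
final class DesignationsSketch {

    /**
     * Writes the index of the nearest center for each 1-D point into
     * designations and returns that array. If designations is null,
     * a new array is allocated and returned instead.
     */
    static int[] assign(double[] data, double[] centers, int[] designations) {
        if (designations == null) {
            designations = new int[data.length];
        }
        for (int i = 0; i < data.length; i++) {
            int best = 0;
            for (int c = 1; c < centers.length; c++) {
                if (Math.abs(data[i] - centers[c]) < Math.abs(data[i] - centers[best])) {
                    best = c;
                }
            }
            designations[i] = best;
        }
        return designations;
    }
}
```

Passing an existing array lets a caller avoid a fresh allocation on repeated runs, while passing null is the convenient form for one-off calls.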
public int[] cluster(DataSet dataSet, int clusters, int[] designations)
cluster in interface KClusterer
public int[] cluster(DataSet dataSet, int lowK, int highK, ExecutorService threadpool, int[] designations)
cluster in interface KClusterer
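The threadpool parameter of the protected cluster helper is documented as optional: if it is null, single threaded execution occurs. One way such a convention can be honored is sketched below. This is an assumption about the general pattern, not a description of JSAT internals; `OptionalThreadpoolSketch`, `assignNearest`, `assignRange`, and the fixed split into four chunks are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Hypothetical illustration of an optional thread pool; not part of JSAT. */
final class OptionalThreadpoolSketch {

    /** Assigns points in [from, to) to their nearest mean (1-D data). */
    static void assignRange(double[] data, double[] means, int[] assignment, int from, int to) {
        for (int i = from; i < to; i++) {
            int best = 0;
            for (int c = 1; c < means.length; c++) {
                if (Math.abs(data[i] - means[c]) < Math.abs(data[i] - means[best])) {
                    best = c;
                }
            }
            assignment[i] = best;
        }
    }

    /** Runs the assignment step on the calling thread when threadpool is null. */
    static void assignNearest(double[] data, double[] means, int[] assignment,
                              ExecutorService threadpool) {
        if (threadpool == null) {
            assignRange(data, means, assignment, 0, data.length);
            return;
        }
        // Split the points into at most four disjoint chunks, one task each.
        int chunk = (data.length + 3) / 4;
        List<Future<?>> pending = new ArrayList<>();
        for (int start = 0; start < data.length; start += chunk) {
            final int from = start;
            final int to = Math.min(data.length, start + chunk);
            pending.add(threadpool.submit(() -> assignRange(data, means, assignment, from, to)));
        }
        // Wait for every chunk so the assignment array is complete on return.
        for (Future<?> task : pending) {
            try {
                task.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        }
    }
}
```

Each task writes to a disjoint index range, so no locking is needed, and both paths produce the same assignments.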
public int[] cluster(DataSet dataSet, int lowK, int highK, int[] designations)
cluster in interface KClusterer
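The returnError and dataPointWeights parameters of the protected cluster helper refer to a sum of squared distances and to per-point weights, with null meaning equal weight. The sketch below shows one plausible form of that objective. How the weights enter the sum is an assumption made here (each squared distance is multiplied by its point's weight); the source does not spell it out. `WeightedObjectiveSketch` and `weightedSse` are hypothetical names.

```java
/** Hypothetical illustration of a weighted k-means objective; not part of JSAT. */
final class WeightedObjectiveSketch {

    /**
     * Sums the squared Euclidean distance from every point to its assigned
     * mean. ASSUMPTION: each term is scaled by the point's weight; a null
     * weights array is treated as weight 1.0 for every point.
     */
    static double weightedSse(double[][] data, double[][] means,
                              int[] assignment, double[] weights) {
        double total = 0;
        for (int i = 0; i < data.length; i++) {
            double[] mean = means[assignment[i]];
            double sq = 0;
            for (int d = 0; d < mean.length; d++) {
                double diff = data[i][d] - mean[d];
                sq += diff * diff;
            }
            double w = (weights == null) ? 1.0 : weights[i];
            total += w * sq;
        }
        return total;
    }
}
```

This computes the exact value point by point, which corresponds to the extra work described for exactTotal; an upper bound approximation would skip some of these distance computations.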
public abstract KMeans clone()
clone in interface Clusterer
clone in interface KClusterer
clone in class KClustererBase
public boolean supportsWeightedData()
Indicates whether the model knows how to cluster using weighted data points.
supportsWeightedData in interface Clusterer
supportsWeightedData in class ClustererBase
public List<Parameter> getParameters()
Returns the list of parameters that can be altered for this learner.
getParameters in interface Parameterized
public Parameter getParameter(String paramName)
Returns the parameter with the given name.
getParameter in interface Parameterized
paramName - the name of the parameter to obtain
Copyright © 2017. All rights reserved.