public class XMeans extends KMeans
X-Means performs KMeans clustering when the value of K is not known. It works by recursively splitting means up to some specified maximum. If a single value of K is specified, the implementation will simply call the regular KMeans object it was constructed with. A minimum of K=1 has a tendency to not be split by the algorithm, returning the naive result of 1 cluster. It is better to use at least K=2 as the default minimum, which is what the implementation will start from when no range of K is given.
Fields inherited from class KMeans: DEFAULT_SEED_SELECTION, dm, MaxIterLimit, means, nearestCentroidDist, rand, saveCentroidDistance, seedSelection, storeMeans
Constructor and Description |
---|
XMeans() |
XMeans(KMeans kmeans) |
XMeans(XMeans toCopy): Copy constructor |
Modifier and Type | Method and Description |
---|---|
XMeans | clone() |
int[] | cluster(DataSet dataSet, ExecutorService threadpool, int[] designations): Performs clustering on the given data set. |
int[] | cluster(DataSet dataSet, int[] designations): Performs clustering on the given data set. |
int[] | cluster(DataSet dataSet, int lowK, int highK, ExecutorService threadpool, int[] designations) |
int[] | cluster(DataSet dataSet, int lowK, int highK, int[] designations) |
protected double | cluster(DataSet dataSet, List<Double> accelCache, int k, List<Vec> means, int[] assignment, boolean exactTotal, ExecutorService threadpool, boolean returnError, Vec dataPointWeights): This is a helper method where the actual clustering is performed. |
int | getIterationLimit(): Returns the maximum number of iterations of the ElkanKMeans algorithm that will be performed. |
boolean | getIterativeRefine() |
int | getMinClusterSize() |
SeedSelectionMethods.SeedSelection | getSeedSelection() |
boolean | isStopAfterFail() |
void | setIterationLimit(int iterLimit): Sets the maximum number of iterations allowed. |
void | setIterativeRefine(boolean refineCenters): Sets whether or not the set of all cluster centers should be refined at every iteration. |
void | setMinClusterSize(int minClusterSize): Sets the minimum size for splitting a cluster. |
void | setSeedSelection(SeedSelectionMethods.SeedSelection seedSelection): Sets the method of seed selection to use for this algorithm. |
void | setStopAfterFail(boolean stopAfterFail): Each new cluster will be tested for improvement according to the BIC metric. |
Methods inherited from class KMeans: cluster, cluster, getDistanceMetric, getListOfLists, getMeans, getParameter, getParameters, setStoreMeans, supportsWeightedData
Methods inherited from other supertypes:
cluster, cluster, cluster, cluster
cluster, cluster, createClusterListFromAssignmentArray, getDatapointsFromCluster
public XMeans()
public XMeans(KMeans kmeans)
public XMeans(XMeans toCopy)
Copy constructor
Parameters:
toCopy - the object to copy
public void setStopAfterFail(boolean stopAfterFail)
Each new cluster will be tested for improvement according to the BIC metric. If set to true, an optimization is applied: once a center fails to be improved by splitting, it will never be tested again. This is a safe assumption when setIterativeRefine(boolean) is set to false, but otherwise may not quite hold. When stopAfterFail is true, X-Means will make at most O(k) runs of k-means for the final value of k chosen. When false (the default option), at most O(k^2) runs of k-means will occur.
Parameters:
stopAfterFail - true if a centroid shouldn't be re-tested once it fails to split.
public boolean isStopAfterFail()
Returns:
true if clusters that fail to split won't be re-tested, false if they will.
public void setMinClusterSize(int minClusterSize)
Sets the minimum size for splitting a cluster.
Parameters:
minClusterSize - the minimum number of data points that must be present in a cluster to consider splitting it
public int getMinClusterSize()
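The split decision described for setStopAfterFail and setMinClusterSize can be illustrated with a minimal one-dimensional sketch. This is not the library's implementation. It assumes one common spherical-Gaussian form of BIC (larger is better), treats "minimum size" as "at least minClusterSize points", and expects more points than centroids. The class `SplitTestSketch` and its methods are hypothetical names.

```java
/** Hypothetical 1-D sketch of a BIC-gated split test; not JSAT code. */
final class SplitTestSketch {

    /** Sum of squared distances of the points to their mean. */
    static double sse(double[] pts) {
        double mean = 0;
        for (double p : pts) mean += p;
        mean /= pts.length;
        double sum = 0;
        for (double p : pts) sum += (p - mean) * (p - mean);
        return sum;
    }

    /**
     * BIC (larger is better) of a model with one centroid per part,
     * assuming a shared spherical variance in one dimension.
     * Assumes the total number of points exceeds the number of parts.
     */
    static double bic(double[]... parts) {
        int k = parts.length;
        int n = 0;
        double totalSse = 0;
        for (double[] part : parts) {
            n += part.length;
            totalSse += sse(part);
        }
        // Pooled variance estimate, floored so the logarithm stays finite.
        double variance = Math.max(totalSse / (n - k), 1e-12);
        double logLik = -0.5 * n * Math.log(2 * Math.PI * variance) - 0.5 * (n - k);
        for (double[] part : parts) {
            logLik += part.length * Math.log((double) part.length / n);
        }
        int freeParams = (k - 1) + k + 1; // mixing weights, centroids, variance
        return logLik - 0.5 * freeParams * Math.log(n);
    }

    /** True if the parent is large enough and the two-way split has a higher BIC. */
    static boolean shouldSplit(double[] left, double[] right, int minClusterSize) {
        double[] parent = new double[left.length + right.length];
        System.arraycopy(left, 0, parent, 0, left.length);
        System.arraycopy(right, 0, parent, left.length, right.length);
        if (parent.length < minClusterSize) {
            return false; // too small to be considered for splitting
        }
        return bic(left, right) > bic(parent);
    }
}
```

In this sketch, two well-separated groups such as {0, 1, 2} and {100, 101, 102} pass the test, while evenly spaced points such as {0, 1} and {2, 3} do not, because the extra parameters of the two-centroid model outweigh the gain in likelihood.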
public void setIterativeRefine(boolean refineCenters)
Sets whether or not the set of all cluster centers should be refined at every iteration. By default this is true, and it is part of how the X-Means algorithm is described. Setting this to false can result in large speedups at the potential cost of quality.
Parameters:
refineCenters - true to refine the cluster centers at every step, false to skip this step of the algorithm.
public boolean getIterativeRefine()
Returns:
true if the cluster centers are refined at every step, false if skipping this step of the algorithm.
public int[] cluster(DataSet dataSet, int[] designations)
Description copied from interface: Clusterer
Performs clustering on the given data set.
cluster in interface Clusterer
cluster in class KMeans
Parameters:
dataSet - the data set to perform clustering on
designations - the array which will contain the designated values. The array will be altered and returned by the function. If null is given, a new array will be created and returned.
public int[] cluster(DataSet dataSet, ExecutorService threadpool, int[] designations)
Description copied from interface: Clusterer
Performs clustering on the given data set.
cluster in interface Clusterer
cluster in class KMeans
Parameters:
dataSet - the data set to perform clustering on
threadpool - a source of threads to run tasks
designations - the array which will contain the designated values. The array will be altered and returned by the function. If null is given, a new array will be created and returned.
public int[] cluster(DataSet dataSet, int lowK, int highK, ExecutorService threadpool, int[] designations)
cluster in interface KClusterer
cluster in class KMeans
public int[] cluster(DataSet dataSet, int lowK, int highK, int[] designations)
cluster in interface KClusterer
cluster in class KMeans
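The designations contract shared by the cluster overloads (the given array is altered and returned, or a new one is created when null is passed) can be sketched on one-dimensional data. This is an illustration only: `DesignationsSketch` and `assignNearest` are hypothetical names, and the sketch assumes any non-null array is at least as long as the number of points.

```java
/** Hypothetical illustration of the designations contract; not JSAT code. */
final class DesignationsSketch {

    /**
     * Labels each point with the index of its nearest center.
     * If designations is null a new array is allocated; otherwise the given
     * array is overwritten. Either way, the filled array is returned.
     */
    static int[] assignNearest(double[] points, double[] centers, int[] designations) {
        if (designations == null) {
            designations = new int[points.length];
        }
        for (int i = 0; i < points.length; i++) {
            int best = 0;
            for (int c = 1; c < centers.length; c++) {
                if (Math.abs(points[i] - centers[c]) < Math.abs(points[i] - centers[best])) {
                    best = c;
                }
            }
            designations[i] = best;
        }
        return designations;
    }
}
```

Passing a preallocated array lets a caller reuse the same buffer across repeated runs, while passing null is the convenient option for one-off calls.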
public int getIterationLimit()
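Description copied from class: KMeans
Returns the maximum number of iterations of the ElkanKMeans algorithm that will be performed.
getIterationLimit in class KMeans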
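To illustrate what an iteration limit does, the following sketch caps the number of passes of a k-means loop on one-dimensional data. The documentation refers to the ElkanKMeans algorithm; this sketch swaps in the simpler Lloyd iteration, and `IterationLimitSketch` and `run` are hypothetical names.

```java
/** Hypothetical 1-D Lloyd iteration with an iteration cap; not JSAT code. */
final class IterationLimitSketch {

    /**
     * Refines the centers in place and returns the number of passes performed.
     * Stops early once no point changes its assignment.
     */
    static int run(double[] points, double[] centers, int iterLimit) {
        int[] assignment = new int[points.length];
        java.util.Arrays.fill(assignment, -1);
        int passes = 0;
        while (passes < iterLimit) {
            passes++;
            boolean changed = false;
            // Assignment step: nearest center for every point.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (Math.abs(points[i] - centers[c]) < Math.abs(points[i] - centers[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged before reaching the limit
            }
            // Update step: move each center to the mean of its points.
            double[] sum = new double[centers.length];
            int[] count = new int[centers.length];
            for (int i = 0; i < points.length; i++) {
                sum[assignment[i]] += points[i];
                count[assignment[i]]++;
            }
            for (int c = 0; c < centers.length; c++) {
                if (count[c] > 0) {
                    centers[c] = sum[c] / count[c];
                }
            }
        }
        return passes;
    }
}
```

With points {0, 1, 2, 100, 101, 102} and starting centers {0, 1}, a limit of 1 stops after a single update with the second center still far from either group, whereas a generous limit lets the loop converge on its own.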
public void setIterationLimit(int iterLimit)
Description copied from class: KMeans
Sets the maximum number of iterations allowed
setIterationLimit in class KMeans
Parameters:
iterLimit - the maximum number of iterations of the ElkanKMeans algorithm
public void setSeedSelection(SeedSelectionMethods.SeedSelection seedSelection)
Description copied from class: KMeans
Sets the method of seed selection to use for this algorithm. SeedSelectionMethods.SeedSelection.KPP is recommended for this algorithm in particular.
setSeedSelection in class KMeans
Parameters:
seedSelection - the method of seed selection to use
public SeedSelectionMethods.SeedSelection getSeedSelection()
getSeedSelection in class KMeans
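Assuming the recommended KPP option refers to k-means++ seeding, the idea can be sketched on one-dimensional data: the first seed is drawn uniformly, and each later seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far. This is a minimal sketch rather than the library's SeedSelectionMethods code; `KppSeedSketch` and `selectSeeds` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical 1-D k-means++ seeding sketch; not JSAT's SeedSelectionMethods. */
final class KppSeedSketch {

    /** Picks k seeds from points, weighting each later draw by squared distance. */
    static double[] selectSeeds(double[] points, int k, Random rand) {
        double[] seeds = new double[k];
        // First seed: uniformly at random.
        seeds[0] = points[rand.nextInt(points.length)];
        double[] minSqDist = new double[points.length];
        Arrays.fill(minSqDist, Double.POSITIVE_INFINITY);
        for (int s = 1; s < k; s++) {
            // Update each point's squared distance to its nearest chosen seed.
            double total = 0;
            for (int i = 0; i < points.length; i++) {
                double d = points[i] - seeds[s - 1];
                minSqDist[i] = Math.min(minSqDist[i], d * d);
                total += minSqDist[i];
            }
            // Sample an index with probability minSqDist[i] / total.
            double target = rand.nextDouble() * total;
            int chosen = points.length - 1; // fallback if every distance is zero
            double cumulative = 0;
            for (int i = 0; i < points.length; i++) {
                cumulative += minSqDist[i];
                if (cumulative > target) {
                    chosen = i;
                    break;
                }
            }
            seeds[s] = points[chosen];
        }
        return seeds;
    }
}
```

Because an already chosen point has squared distance zero, it cannot be drawn again, so distinct input points yield distinct seeds. Far-apart regions are favored, which tends to give the subsequent k-means runs a better starting point than uniform sampling.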
protected double cluster(DataSet dataSet, List<Double> accelCache, int k, List<Vec> means, int[] assignment, boolean exactTotal, ExecutorService threadpool, boolean returnError, Vec dataPointWeights)
Description copied from class: KMeans
This is a helper method where the actual clustering is performed.
cluster in class KMeans
Parameters:
dataSet - the set of data points to perform clustering on
accelCache - acceleration cache to use, or null. If null, the kmeans code will attempt to create one
k - the number of clusters
means - the initial points to use as the means. Its length is the number of means that will be searched for. These means will be altered, and should contain deep copies of the points they were drawn from. May be empty, in which case the list will be filled with some selected means
assignment - an empty temp space to store the clustering classifications. Should be the same length as the number of data points
exactTotal - determines how the objective function (return value) will be computed. If true, extra work will be done to compute the exact distance from each data point to its cluster. If false, an upper bound approximation will be used. This also impacts the value stored in KMeans.nearestCentroidDist
threadpool - the source of threads for parallel computation. If null, single threaded execution will occur
returnError - true if the sum of squared distances should be returned. false means any value can be returned. KMeans.saveCentroidDistance only applies if this is true
dataPointWeights - the weight value to use for each data point. If null, assume each point has equal weight.
Copyright © 2017. All rights reserved.