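public class GapStatistic extends KClustererBase implements Parameterized

Estimates the number of clusters in a data set using the Gap statistic. The statistic is measured by sampling from a reference distribution and comparing with the given data set. The number of sampled data sets is controlled via setSamples(int).

The default distance metric is the EuclideanDistance, and the method was developed with the KMeans algorithm. Thus that combination is the default when using the no argument constructor.

The usual Gap statistic rule takes the smallest K satisfying Gap(K) ≥ Gap(K+1) - sd(K+1) as the value of K to use. Instead, the condition used here is the smallest K such that Gap(K) ≥ Gap(K+1) - sd(K+1) and Gap(K) > 0. If no K satisfies the condition, the K with the largest value of Gap(K) will be used.

The maximum value of K that is tried is capped at 100 when using the ClustererBase.cluster(jsat.DataSet) type methods.

Constructor and Description |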
---|
GapStatistic(): Creates a new Gap clusterer using k-means as the base clustering algorithm. |
GapStatistic(GapStatistic toCopy): Copy constructor. |
GapStatistic(KClusterer base): Creates a new Gap clusterer using the base clustering algorithm given. |
GapStatistic(KClusterer base, boolean PCSampling): Creates a new Gap clusterer using the base clustering algorithm given. |
GapStatistic(KClusterer base, boolean PCSampling, int B, DistanceMetric dm): Creates a new Gap clusterer using the base clustering algorithm given. |
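
The selection rule from the class description can be sketched in plain Java as follows. This is an illustrative sketch, not JSAT's source: the GapRule class and selectK method are hypothetical names, the input arrays are assumed to follow the layout of getGap() and getElogWkStndDev() (index i holds the value for K = i+1, Double.NaN where nothing was computed), and skipping NaN entries is an assumption.

```java
/**
 * Sketch of the K-selection rule described for GapStatistic.
 * Hypothetical helper, not part of JSAT.
 */
final class GapRule {
    /**
     * @param gap gap[i] is Gap(K) for K = i + 1, Double.NaN if not computed
     * @param sd  sd[i] is sd(K) for K = i + 1, Double.NaN if not computed
     * @return the selected K (1-based), or -1 if no gap value was computed
     */
    static int selectK(double[] gap, double[] sd) {
        // Smallest K with Gap(K) >= Gap(K+1) - sd(K+1) and Gap(K) > 0.
        for (int i = 0; i + 1 < gap.length; i++) {
            if (Double.isNaN(gap[i]) || Double.isNaN(gap[i + 1]) || Double.isNaN(sd[i + 1])) {
                continue; // assumption: skip any K whose scores are missing
            }
            if (gap[i] > 0 && gap[i] >= gap[i + 1] - sd[i + 1]) {
                return i + 1;
            }
        }
        // No K satisfied the condition: fall back to the K with the largest Gap(K).
        int best = -1;
        for (int i = 0; i < gap.length; i++) {
            if (!Double.isNaN(gap[i]) && (best < 0 || gap[i] > gap[best])) {
                best = i;
            }
        }
        return best < 0 ? -1 : best + 1;
    }

    public static void main(String[] args) {
        double[] gap = {0.10, 0.50, 0.40, 0.30};
        double[] sd = {0.05, 0.05, 0.05, 0.05};
        System.out.println("Selected K = " + selectK(gap, sd));
    }
}
```

The extra Gap(K) > 0 test is what distinguishes this rule from the usual one: a K whose gap is negative is never selected in the first pass, even if it satisfies the inequality.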
Modifier and Type | Method and Description |
---|---|
GapStatistic | clone() |
int[] | cluster(DataSet dataSet, ExecutorService threadpool, int[] designations): Performs clustering on the given data set. |
int[] | cluster(DataSet dataSet, int[] designations): Performs clustering on the given data set. |
int[] | cluster(DataSet dataSet, int clusters, ExecutorService threadpool, int[] designations) |
int[] | cluster(DataSet dataSet, int clusters, int[] designations) |
int[] | cluster(DataSet dataSet, int lowK, int highK, ExecutorService threadpool, int[] designations) |
int[] | cluster(DataSet dataSet, int lowK, int highK, int[] designations) |
DistanceMetric | getDistanceMetric() |
double[] | getElogW(): Returns the array of expected E[log(Wk)] scores computed from sampling new data sets. |
double[] | getElogWkStndDev(): Returns the array of standard deviations from the samplings used to compute getElogW(), multiplied by sqrt(1+1/B). |
double[] | getGap(): Returns the array of gap statistic values. |
double[] | getLogW(): Returns the array of empirical log(Wk) scores computed from the data set last clustered. |
Parameter | getParameter(String paramName): Returns the parameter with the given name. |
List<Parameter> | getParameters(): Returns the list of parameters that can be altered for this learner. |
int | getSamples() |
boolean | isPCSampling() |
void | setDistanceMetric(DistanceMetric dm): Sets the distance metric to use when evaluating a clustering algorithm. |
void | setPCSampling(boolean PCSampling): By default the null distribution is sampled from the bounding hyper-cube of the dataset. |
void | setSamples(int B): The Gap statistic is measured by sampling from a reference distribution and comparing with the given data set. |
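
The relationship between the four score getters listed in the table can be illustrated with a short sketch. It assumes the standard definition Gap(K) = E[log(Wk)] - log(Wk) and a standard deviation that divides by B (the population form). Whether JSAT divides by B or by B-1 is not stated here, so treat that choice as an assumption. GapScores and its methods are hypothetical names, not JSAT API.

```java
/**
 * Sketch of how the per-K scores relate to each other for a single K.
 * Hypothetical helper, not part of JSAT.
 */
final class GapScores {
    /** Mean of the B reference log(W) values: the E[log(Wk)] estimate. */
    static double expectedLogW(double[] refLogW) {
        double sum = 0;
        for (double v : refLogW) {
            sum += v;
        }
        return sum / refLogW.length;
    }

    /** Gap(K) = E[log(Wk)] - log(Wk), with log(Wk) measured on the real data. */
    static double gap(double logW, double[] refLogW) {
        return expectedLogW(refLogW) - logW;
    }

    /** Standard deviation of the reference values, multiplied by sqrt(1 + 1/B). */
    static double scaledStdDev(double[] refLogW) {
        int b = refLogW.length;
        double mean = expectedLogW(refLogW);
        double squares = 0;
        for (double v : refLogW) {
            squares += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(squares / b); // assumption: divide by B, not B - 1
        return sd * Math.sqrt(1.0 + 1.0 / b);
    }

    public static void main(String[] args) {
        double[] refLogW = {2.0, 4.0}; // log(W) of B = 2 sampled reference sets
        double logW = 1.0;             // log(W) of the real data
        System.out.println("gap = " + gap(logW, refLogW));
        System.out.println("sd  = " + scaledStdDev(refLogW));
    }
}
```

The sqrt(1+1/B) factor widens the standard deviation to account for the error of estimating E[log(Wk)] from only B samples, so a small B makes the selection rule more tolerant.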
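Methods inherited from class KClustererBase: cluster, cluster, cluster, cluster
Methods inherited from class ClustererBase: cluster, cluster, createClusterListFromAssignmentArray, getDatapointsFromCluster, supportsWeightedData
Methods inherited from class java.lang.Object: equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Clusterer: cluster, cluster, supportsWeightedData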

public GapStatistic()

public GapStatistic(KClusterer base)
Parameters:
base - the base clustering method to use for any individual number of clusters

public GapStatistic(KClusterer base, boolean PCSampling)
Parameters:
base - the base clustering method to use for any individual number of clusters
PCSampling - true if the Gap statistic should be computed from a PCA transformed space, or false to go with the uniform bounding hyper-cube.

public GapStatistic(KClusterer base, boolean PCSampling, int B, DistanceMetric dm)
Parameters:
base - the base clustering method to use for any individual number of clusters
PCSampling - true if the Gap statistic should be computed from a PCA transformed space, or false to go with the uniform bounding hyper-cube.
B - the number of datasets to sample
dm - the distance metric to evaluate with

public GapStatistic(GapStatistic toCopy)
Parameters:
toCopy - the object to copy

public void setDistanceMetric(DistanceMetric dm)
Parameters:
dm - the distance metric to use

public DistanceMetric getDistanceMetric()

public void setPCSampling(boolean PCSampling)
Parameters:
PCSampling - true to sample from the projected data, false to do the default and sample from the bounding hyper-cube.

public boolean isPCSampling()
Returns:
true to sample from the projected data, false to do the default and sample from the bounding hyper-cube.

public void setSamples(int B)
Parameters:
B - the number of data sets to sample

public int getSamples()
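
To make the default sampling mode concrete, the following sketch draws B reference data sets uniformly from the bounding hyper-cube of a small data set, which corresponds to setPCSampling(false) with setSamples(B). It is a minimal illustration on plain arrays, not JSAT's implementation: BoundingBoxSampler, sample and sampleAll are hypothetical names, and the PCA-based mode is not shown.

```java
import java.util.Arrays;
import java.util.Random;

/**
 * Sketch of uniform sampling from the bounding hyper-cube of a data set.
 * Hypothetical helper, not part of JSAT.
 */
final class BoundingBoxSampler {
    /** Draws one reference set with the same shape as data, uniform over its bounding box. */
    static double[][] sample(double[][] data, Random rng) {
        int n = data.length;
        int d = data[0].length;
        double[] min = new double[d];
        double[] max = new double[d];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        // Per-dimension minimum and maximum define the bounding hyper-cube.
        for (double[] row : data) {
            for (int j = 0; j < d; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }
        double[][] out = new double[n][d];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < d; j++) {
                out[i][j] = min[j] + rng.nextDouble() * (max[j] - min[j]);
            }
        }
        return out;
    }

    /** Draws B independent reference sets from one seeded generator. */
    static double[][][] sampleAll(double[][] data, int B, long seed) {
        Random rng = new Random(seed);
        double[][][] refs = new double[B][][];
        for (int b = 0; b < B; b++) {
            refs[b] = sample(data, rng);
        }
        return refs;
    }

    public static void main(String[] args) {
        double[][] data = {{0.0, 10.0}, {2.0, 20.0}, {1.0, 15.0}};
        double[][][] refs = sampleAll(data, 5, 42L);
        System.out.println("Drew " + refs.length + " reference sets of " + refs[0].length + " points");
    }
}
```

Each reference set is then clustered for every candidate K, so the total work grows with B times the number of K values tried.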

public double[] getGap()
Returns the array of gap statistic values. Index i of the returned array indicates the gap score for using i+1 clusters. A value of Double.NaN is used if the score was not computed for that value of K.
Returns:
the array of gap statistic values, or null if the algorithm hasn't been run yet.

public double[] getLogW()
Returns the array of empirical log(Wk) scores computed from the data set last clustered. Index i of the returned array indicates the log(Wk) score for using i+1 clusters. A value of Double.NaN is used if the score was not computed for that value of K.
Returns:
the array of empirical log(Wk) scores, or null if the algorithm hasn't been run yet.

public double[] getElogW()
Returns the array of expected E[log(Wk)] scores computed from sampling new data sets. Index i of the returned array indicates the E[log(Wk)] score for using i+1 clusters. A value of Double.NaN is used if the score was not computed for that value of K.
Returns:
the array of expected E[log(Wk)] scores, or null if the algorithm hasn't been run yet.

public double[] getElogWkStndDev()
Returns the array of standard deviations from the samplings used to compute getElogW(), multiplied by sqrt(1+1/B). Index i of the returned array indicates the standard deviation for using i+1 clusters. A value of Double.NaN is used if the score was not computed for that value of K.
Returns:
the array of scaled standard deviations, or null if the algorithm hasn't been run yet.

public int[] cluster(DataSet dataSet, int[] designations)
Description copied from interface: Clusterer
Performs clustering on the given data set.
Specified by:
cluster in interface Clusterer
Parameters:
dataSet - the data set to perform clustering on
designations - the array which will contain the designated values. The array will be altered and returned by the function. If null is given, a new array will be created and returned.

public int[] cluster(DataSet dataSet, ExecutorService threadpool, int[] designations)
Description copied from interface: Clusterer
Performs clustering on the given data set.
Specified by:
cluster in interface Clusterer
Parameters:
dataSet - the data set to perform clustering on
threadpool - a source of threads to run tasks
designations - the array which will contain the designated values. The array will be altered and returned by the function. If null is given, a new array will be created and returned.

public int[] cluster(DataSet dataSet, int clusters, ExecutorService threadpool, int[] designations)
Specified by:
cluster in interface KClusterer

public int[] cluster(DataSet dataSet, int clusters, int[] designations)
Specified by:
cluster in interface KClusterer

public int[] cluster(DataSet dataSet, int lowK, int highK, ExecutorService threadpool, int[] designations)
Specified by:
cluster in interface KClusterer

public int[] cluster(DataSet dataSet, int lowK, int highK, int[] designations)
Specified by:
cluster in interface KClusterer
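
The contract of the designations argument shared by the cluster methods (altered in place and returned, or freshly allocated when null is given) can be sketched with a toy stand-in. Nothing here calls JSAT: DesignationsContract and assign are hypothetical, the threshold "clustering" is only a placeholder, and a non-null array is assumed to be at least as long as the input.

```java
/**
 * Toy illustration of the designations array contract.
 * Hypothetical helper, not part of JSAT.
 */
final class DesignationsContract {
    /**
     * Assigns each value to cluster 0 or 1 by comparing it with a threshold.
     * Only the handling of the designations array mirrors the documented contract.
     */
    static int[] assign(double[] values, double threshold, int[] designations) {
        if (designations == null) {
            designations = new int[values.length]; // null given: create a new array
        }
        for (int i = 0; i < values.length; i++) {
            designations[i] = values[i] < threshold ? 0 : 1; // altered in place
        }
        return designations; // the same array that was written to
    }

    public static void main(String[] args) {
        double[] values = {1.0, 5.0, 9.0};
        int[] reused = new int[values.length];
        int[] result = assign(values, 4.0, reused);
        System.out.println("Same array returned: " + (result == reused));
    }
}
```

Passing an existing array avoids a fresh allocation on repeated calls, while passing null is convenient for one-off use.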
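
public List<Parameter> getParameters()
Description copied from interface: Parameterized
Returns the list of parameters that can be altered for this learner.
Specified by:
getParameters in interface Parameterized

public Parameter getParameter(String paramName)
Description copied from interface: Parameterized
Returns the parameter with the given name.
Specified by:
getParameter in interface Parameterized
Parameters:
paramName - the name of the parameter to obtain

public GapStatistic clone()
Specified by:
clone in interface Clusterer
Specified by:
clone in interface KClusterer
Overrides:
clone in class KClustererBase
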
Copyright © 2017. All rights reserved.