public abstract class DataSet<Type extends DataSet> extends Object
Each DataPoint represents a row in the data set, and the attributes form the columns.

Modifier and Type | Field and Description |
---|---|
protected CategoricalData[] | categories: Contains the categories for each of the categorical variables. |
protected Map<Integer,SoftReference<Vec>> | columnVecCache: This cache is used to hold a reference to the column vectors that are returned. |
protected List<String> | numericalVariableNames: The list, in order, of the names of the numeric variables. |
protected int | numNumerVals: The number of numerical values each data point must have. |
Constructor and Description |
---|
DataSet() |
Modifier and Type | Method and Description |
---|---|
void | applyTransform(DataTransform dt): Applies the given transformation to all points in this data set, replacing each data point with the new value. |
void | applyTransform(DataTransform dt, boolean mutate): Applies the given transformation to all points in this data set. |
void | applyTransform(DataTransform dt, boolean mutate, ExecutorService ex): Applies the given transformation to all points in this data set in parallel. |
void | applyTransform(DataTransform dt, ExecutorService ex): Applies the given transformation to all points in this data set in parallel, replacing each data point with the new value. |
long | countMissingValues() |
List<Type> | cvSet(int folds): Creates folds data sets that contain data from this data set. |
List<Type> | cvSet(int folds, Random rand): Creates folds data sets that contain data from this data set. |
CategoricalData[] | getCategories(): Returns the array containing the categorical data information for this data set. |
String | getCategoryName(int i): Returns the name used for the i'th categorical attribute. |
Vec[] | getColumnMeanVariance(): Computes the weighted mean and variance for each column of feature values. |
Matrix | getDataMatrix(): Creates a matrix from the data set, where each row represents a data point and each column is one of the numeric variables from the data set. |
Matrix | getDataMatrixView(): Creates a matrix backed by the data set, where each row is a data point from the data set and each column is one of the numeric variables from the data set. |
abstract DataPoint | getDataPoint(int i): Returns the i'th data point in this set. |
Iterator<DataPoint> | getDataPointIterator(): Returns an iterator that will iterate over all data points in the set. |
List<DataPoint> | getDataPoints(): Creates a list containing the same DataPoints in this set. |
List<Vec> | getDataVectors(): Creates a list of the vector values for each data point, in the correct order. |
Vec | getDataWeights(): Returns the weight of each data point in a single vector. |
Type | getMissingDropped(): Returns a dataset that is a subset of this dataset, where only the rows that have no missing values are kept. |
int | getNumCategoricalVars(): Returns the number of categorical variables for each data point in the set. |
Vec | getNumericColumn(int i): The data set can be seen as an NxM matrix, where each row is a data point and each column holds the values for a particular variable. |
Vec[] | getNumericColumns(): Creates an array of column vectors for every numeric variable in this data set. |
Vec[] | getNumericColumns(Set<Integer> skipColumns): Creates an array of column vectors for every numeric variable in this data set. |
String | getNumericName(int i): Returns the name used for the i'th numeric attribute. |
int | getNumFeatures(): Returns the number of features in this data set, which is the sum of getNumCategoricalVars() and getNumNumericalVars(). |
int | getNumNumericalVars(): Returns the number of numerical variables for each data point in the set. |
OnLineStatistics[] | getOnlineColumnStats(boolean useWeights): Returns summary statistics computed in an online fashion for each numeric variable. |
OnLineStatistics | getOnlineDenseStats(): Returns an OnLineStatistics object that is built by observing what proportion of each data point contains nonzero numerical values. |
abstract int | getSampleSize(): Returns the number of data points in this data set. |
OnLineStatistics | getSparsityStats(): Returns statistics on the sparsity of the vectors in this data set. |
protected abstract Type | getSubset(List<Integer> indicies): Creates a new dataset that is a subset of this dataset. |
DataSet | getTwiceShallowClone(): Returns a new version of this data set that is of the same type, and contains a different listing pointing to shallow data point copies. |
List<Type> | randomSplit(double... splits): Splits the dataset randomly into proportionally sized partitions. |
List<Type> | randomSplit(Random rand, double... splits): Splits the dataset randomly into proportionally sized partitions. |
void | replaceNumericFeatures(List<Vec> newNumericFeatures): Replaces every numeric feature in this dataset with a Vec object from the given list. |
abstract void | setDataPoint(int i, DataPoint dp): Replaces an already existing data point with the one given. |
boolean | setNumericName(String name, int i): Sets the unique name associated with the i'th numeric attribute. |
abstract DataSet<Type> | shallowClone(): Returns a new version of this data set that is of the same type, and contains a different list pointing to the same data points. |
protected int numNumerVals
protected CategoricalData[] categories
protected List<String> numericalVariableNames
protected Map<Integer,SoftReference<Vec>> columnVecCache
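The columnVecCache field maps a column index to a softly referenced column vector. As a rough illustration of that caching pattern only (this is not the library's code), the sketch below uses a plain double[] in place of Vec. The class ColumnCacheSketch, its methods, and the builder callback are hypothetical names introduced for the example.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical stand-in for a column cache; double[] plays the role of Vec. */
final class ColumnCacheSketch {
    private final Map<Integer, SoftReference<double[]>> cache = new HashMap<>();
    private int builds = 0; // how many times a column had to be (re)built

    /** Returns the cached column, rebuilding it if it was never built or was collected. */
    double[] getColumn(int i, IntFunction<double[]> builder) {
        SoftReference<double[]> ref = cache.get(i);
        double[] column = (ref == null) ? null : ref.get();
        if (column == null) {
            // Cache miss, or the garbage collector cleared the soft reference.
            column = builder.apply(i);
            builds++;
            cache.put(i, new SoftReference<>(column));
        }
        return column;
    }

    int buildCount() {
        return builds;
    }
}
```

A soft reference lets the garbage collector reclaim a cached column under memory pressure, so a cache like this should never be the only place a caller expects the column to live.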
public boolean setNumericName(String name, int i)
name - the name to use
i - the ith attribute.
public String getNumericName(int i)
i - the ith attribute.
public String getCategoryName(int i)
i - the ith attribute.
public void applyTransform(DataTransform dt)
dt - the transformation to apply
public void applyTransform(DataTransform dt, ExecutorService ex)
dt - the transformation to apply
ex - the thread pool to provide threads from. May be null to perform operations in serial
public void applyTransform(DataTransform dt, boolean mutate)
mutableTransform is set to true
dt - the transformation to apply
mutate - true to mutableTransform the original data points, false to ignore the ability to mutableTransform and replace the original data points.
public void applyTransform(DataTransform dt, boolean mutate, ExecutorService ex)
mutableTransform is set to true
dt - the transformation to apply
mutate - true to mutableTransform the original data points, false to ignore the ability to mutableTransform and replace the original data points.
ex - the thread pool to provide threads from. May be null to perform operations in serial
public void replaceNumericFeatures(List<Vec> newNumericFeatures)
newNumericFeatures - the list of new numeric features to use
public abstract DataPoint getDataPoint(int i)
i - the i'th data point in this set
public abstract void setDataPoint(int i, DataPoint dp)
i - the i'th data point to set.
dp - the data point to set at the specified index
public OnLineStatistics[] getOnlineColumnStats(boolean useWeights)
useWeights - true to return the weighted statistics, unweighted otherwise.
public OnLineStatistics getOnlineDenseStats()
Returns an OnLineStatistics object that is built by observing what proportion of each data point contains nonzero numerical values. A mean of 1 indicates all values were fully dense, and a mean of 0 indicates all values were completely sparse (all zeros).
public Vec[] getColumnMeanVariance()
getOnlineColumnStats(boolean) but returns less information.
public Iterator<DataPoint> getDataPointIterator()
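The density statistic described for getOnlineDenseStats() can be pictured with a small, library-free sketch. It assumes each data point's numeric values are available as a double[]. DensitySketch and its methods are hypothetical names, and the code computes only the mean, not the full set of statistics an OnLineStatistics object would carry.

```java
/** Hypothetical helper; each double[] stands in for the numeric values of one data point. */
final class DensitySketch {
    /** Fraction of entries in one row that are nonzero (0 for an empty row). */
    static double nonZeroFraction(double[] row) {
        if (row.length == 0) {
            return 0.0;
        }
        int nonZero = 0;
        for (double v : row) {
            if (v != 0.0) {
                nonZero++;
            }
        }
        return (double) nonZero / row.length;
    }

    /** Mean of the per-row nonzero fractions: 1 means fully dense, 0 means all zeros. */
    static double meanDensity(double[][] rows) {
        if (rows.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double[] row : rows) {
            sum += nonZeroFraction(row);
        }
        return sum / rows.length;
    }
}
```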
public abstract int getSampleSize()
public int getNumCategoricalVars()
public int getNumNumericalVars()
public CategoricalData[] getCategories()
CategoricalData
protected abstract Type getSubset(List<Integer> indicies)
indicies - the indices of data points to insert into the new dataset, placed in the order listed.
public Type getMissingDropped()
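To illustrate how getSubset(List<Integer>) and getMissingDropped() relate, the sketch below selects rows by index in the order listed, then uses that to keep only complete rows. It is a stand-alone approximation, not the library's implementation: rows are plain double[] arrays, SubsetSketch and its methods are hypothetical, and it assumes a missing numeric value is encoded as NaN.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; each double[] stands in for the numeric values of one data point. */
final class SubsetSketch {
    /** Builds a new list holding the rows at the given indices, in the order listed. */
    static List<double[]> getSubset(List<double[]> rows, List<Integer> indices) {
        List<double[]> subset = new ArrayList<>(indices.size());
        for (int idx : indices) {
            subset.add(rows.get(idx));
        }
        return subset;
    }

    /** Indices of rows without a missing value; NaN is ASSUMED to mark a missing value. */
    static List<Integer> completeRows(List<double[]> rows) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            boolean complete = true;
            for (double v : rows.get(i)) {
                if (Double.isNaN(v)) {
                    complete = false;
                    break;
                }
            }
            if (complete) {
                keep.add(i);
            }
        }
        return keep;
    }

    /** Keeps only the rows that have no missing values. */
    static List<double[]> missingDropped(List<double[]> rows) {
        return getSubset(rows, completeRows(rows));
    }
}
```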
public List<Type> randomSplit(Random rand, double... splits)
rand - the source of randomness for moving data around
splits - an array whose length is the number of datasets to create and whose value at each index is the fraction of samples that should be placed into that dataset. The sum of values must be less than or equal to 1.0
public List<Type> randomSplit(double... splits)
splits - an array whose length is the number of datasets to create and whose value at each index is the fraction of samples that should be placed into that dataset. The sum of values must be less than or equal to 1.0
public List<Type> cvSet(int folds, Random rand)
folds - the number of cross validation sets to create. Should be greater than 1
rand - the source of randomness
public List<Type> cvSet(int folds)
folds - the number of cross validation sets to create. Should be greater than 1
public List<DataPoint> getDataPoints()
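The parameter rules for randomSplit and cvSet can be made concrete with a sketch that partitions row indices rather than a real data set. This is one plausible way to do it, not necessarily how the library does: PartitionSketch and its methods are hypothetical, and rounding partition sizes down, dealing folds round-robin, and rejecting invalid arguments with exceptions are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper that partitions the row indices 0..n-1. */
final class PartitionSketch {
    /** Returns the indices 0..n-1 in random order. */
    private static List<Integer> shuffledIndices(int n, Random rand) {
        List<Integer> idx = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            idx.add(i);
        }
        Collections.shuffle(idx, rand);
        return idx;
    }

    /** One partition per fraction; the fractions must sum to at most 1.0. */
    static List<List<Integer>> randomSplit(int n, Random rand, double... splits) {
        double total = 0.0;
        for (double s : splits) {
            if (s < 0.0) {
                throw new IllegalArgumentException("fractions must be non-negative");
            }
            total += s;
        }
        if (total > 1.0 + 1e-12) { // small tolerance for floating-point error
            throw new IllegalArgumentException("fractions sum to more than 1.0");
        }
        List<Integer> idx = shuffledIndices(n, rand);
        List<List<Integer>> parts = new ArrayList<>();
        int start = 0;
        for (double s : splits) {
            // Rounding down is a choice of this sketch; leftover rows go unused.
            int size = Math.min((int) Math.floor(s * n), n - start);
            parts.add(new ArrayList<>(idx.subList(start, start + size)));
            start += size;
        }
        return parts;
    }

    /** Deals shuffled indices round-robin into the requested number of folds. */
    static List<List<Integer>> cvFolds(int n, int folds, Random rand) {
        if (folds < 2) {
            throw new IllegalArgumentException("folds should be greater than 1");
        }
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < folds; f++) {
            result.add(new ArrayList<>());
        }
        List<Integer> idx = shuffledIndices(n, rand);
        for (int i = 0; i < n; i++) {
            result.get(i % folds).add(idx.get(i));
        }
        return result;
    }
}
```

Because the fractions may sum to less than 1.0, some rows can be left out of every partition, whereas the fold sketch always assigns every row to exactly one fold.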
public List<Vec> getDataVectors()
public Vec getNumericColumn(int i)
i - the i'th numerical variable to obtain all values of
getSampleSize()
public long countMissingValues()
public Vec[] getNumericColumns()
getNumericColumn(int) when multiple columns are needed.
public Vec[] getNumericColumns(Set<Integer> skipColumns)
getNumericColumn(int) when multiple columns are needed.
skipColumns - if a column's index is in this set, a null will be returned in the array at the column's index instead of a vector
public Matrix getDataMatrix()
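The skipColumns behaviour of getNumericColumns(Set<Integer>) can be sketched on a plain row-major table, as below. This is an illustration under assumptions rather than the library's code: ColumnSketch and its methods are hypothetical, double[] stands in for Vec, and every row is assumed to have at least numCols entries.

```java
import java.util.Set;

/** Hypothetical helper operating on an N x M row-major table of numeric values. */
final class ColumnSketch {
    /** Copies column j into a new array whose length is the number of rows. */
    static double[] column(double[][] rows, int j) {
        double[] col = new double[rows.length];
        for (int r = 0; r < rows.length; r++) {
            col[r] = rows[r][j];
        }
        return col;
    }

    /** Extracts all columns in one pass over the rows, leaving null at skipped indices. */
    static double[][] columns(double[][] rows, int numCols, Set<Integer> skipColumns) {
        double[][] cols = new double[numCols][]; // every slot starts out null
        for (int j = 0; j < numCols; j++) {
            if (!skipColumns.contains(j)) {
                cols[j] = new double[rows.length];
            }
        }
        for (int r = 0; r < rows.length; r++) {
            for (int j = 0; j < numCols; j++) {
                if (cols[j] != null) {
                    cols[j][r] = rows[r][j];
                }
            }
        }
        return cols;
    }
}
```

Filling every requested column during a single pass over the rows is one reason a bulk call can be cheaper than repeated single-column calls.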
public Matrix getDataMatrixView()
Differs from getDataMatrix() in that it does not use any additional memory and it maintains any sparsity information.
public int getNumFeatures()
Returns the number of features in this data set, which is the sum of getNumCategoricalVars() and getNumNumericalVars().
public abstract DataSet<Type> shallowClone()
public DataSet getTwiceShallowClone()
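The difference between shallowClone() (a different list pointing to the same data points) and getTwiceShallowClone() (a different list pointing to shallow data point copies) can be shown with a toy model. Everything here is hypothetical: CloneSketch, Point, and the assumption that a shallow data point copy gets its own weight but shares the underlying values array are illustrative guesses, not statements about the library's DataPoint.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the two cloning depths. */
final class CloneSketch {
    /** Minimal data point: its own weight plus a reference to numeric values. */
    static final class Point {
        double weight;
        final double[] values;

        Point(double weight, double[] values) {
            this.weight = weight;
            this.values = values;
        }

        /** Shallow copy: a new Point object that shares the same values array. */
        Point shallowCopy() {
            return new Point(weight, values);
        }
    }

    /** New list, same Point objects. */
    static List<Point> shallowClone(List<Point> points) {
        return new ArrayList<>(points);
    }

    /** New list holding a shallow copy of each Point. */
    static List<Point> twiceShallowClone(List<Point> points) {
        List<Point> copy = new ArrayList<>(points.size());
        for (Point p : points) {
            copy.add(p.shallowCopy());
        }
        return copy;
    }
}
```

In this model, changing a weight through the twice-shallow clone leaves the original point untouched, while a change made through the shallow clone is visible in both lists.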
public OnLineStatistics getSparsityStats()
public Vec getDataWeights()
Copyright © 2017. All rights reserved.