DataSet (Java Statistical Analysis Tool 0.0.8 API)

java.lang.Object
- jsat.DataSet<Type>

Direct Known Subclasses:

ClassificationDataSet, RegressionDataSet, SimpleDataSet
```
public abstract class DataSet<Type extends DataSet>
extends Object
```
This is the base class for representing a data set. A data set contains multiple samples, each of which should have the same number of attributes. Conceptually, each DataPoint represents a row in the data set, and the attributes form the columns.

Author:

Edward Raff

Field Summary

Fields
Modifier and Type	Field and Description
`protected CategoricalData[]`	`categories` Contains the categories for each of the categorical variables
`protected Map<Integer,SoftReference<Vec>>`	`columnVecCache` This cache is used to hold a reference to the column vectors that are returned.
`protected List<String>`	`numericalVariableNames` The list, in order, of the names of the numeric variables.
`protected int`	`numNumerVals` The number of numerical values each data point must have

Constructor Summary

Constructors
Constructor and Description
`DataSet()`

Method Summary

Modifier and Type	Method and Description
`void`	`applyTransform(DataTransform dt)` Applies the given transformation to all points in this data set, replacing each data point with the new value.
`void`	`applyTransform(DataTransform dt, boolean mutate)` Applies the given transformation to all points in this data set.
`void`	`applyTransform(DataTransform dt, boolean mutate, ExecutorService ex)` Applies the given transformation to all points in this data set in parallel.
`void`	`applyTransform(DataTransform dt, ExecutorService ex)` Applies the given transformation to all points in this data set in parallel, replacing each data point with the new value.
`long`	`countMissingValues()` Returns the number of missing values in both numeric and categorical features.
`List<Type>`	`cvSet(int folds)` Creates `folds` data sets that contain data from this data set.
`List<Type>`	`cvSet(int folds, Random rand)` Creates `folds` data sets that contain data from this data set.
`CategoricalData[]`	`getCategories()` Returns the array containing the categorical data information for this data set.
`String`	`getCategoryName(int i)` Returns the name used for the `i`'th categorical attribute.
`Vec[]`	`getColumnMeanVariance()` Computes the weighted mean and variance for each column of feature values.
`Matrix`	`getDataMatrix()` Creates a matrix from the data set, where each row represents a data point and each column is one of the numeric variables from the data set.
`Matrix`	`getDataMatrixView()` Creates a matrix backed by the data set, where each row is a data point from the data set and each column is one of the numeric variables from the data set.
`abstract DataPoint`	`getDataPoint(int i)` Returns the `i`'th data point in this set.
`Iterator<DataPoint>`	`getDataPointIterator()` Returns an iterator that will iterate over all data points in the set.
`List<DataPoint>`	`getDataPoints()` Creates a list containing the same DataPoints in this set.
`List<Vec>`	`getDataVectors()` Creates a list of the vector values for each data point in the correct order.
`Vec`	`getDataWeights()` This method returns the weight of each data point in a single Vector.
`Type`	`getMissingDropped()` This method returns a dataset that is a subset of this dataset, where only the rows that have no missing values are kept.
`int`	`getNumCategoricalVars()` Returns the number of categorical variables for each data point in the set
`Vec`	`getNumericColumn(int i)` The data set can be seen as an NxM matrix, where each row is a data point and each column holds the values for a particular variable.
`Vec[]`	`getNumericColumns()` Creates an array of column vectors for every numeric variable in this data set.
`Vec[]`	`getNumericColumns(Set<Integer> skipColumns)` Creates an array of column vectors for every numeric variable in this data set.
`String`	`getNumericName(int i)` Returns the name used for the `i`'th numeric attribute.
`int`	`getNumFeatures()` Returns the number of features in this data set, which is the sum of `getNumCategoricalVars()` and `getNumNumericalVars()`
`int`	`getNumNumericalVars()` Returns the number of numerical variables for each data point in the set
`OnLineStatistics[]`	`getOnlineColumnStats(boolean useWeights)` Returns summary statistics computed in an online fashion for each numeric variable.
`OnLineStatistics`	`getOnlineDenseStats()` Returns an `OnLineStatistics` object that is built by observing what proportion of each data point contains non zero numerical values.
`abstract int`	`getSampleSize()` Returns the number of data points in this data set
`OnLineStatistics`	`getSparsityStats()` Returns statistics on the sparsity of the vectors in this data set.
`protected abstract Type`	`getSubset(List<Integer> indicies)` Creates a new dataset that is a subset of this dataset.
`DataSet`	`getTwiceShallowClone()` Returns a new version of this data set that is of the same type, and contains a different list pointing to shallow data point copies.
`List<Type>`	`randomSplit(double... splits)` Splits the dataset randomly into proportionally sized partitions.
`List<Type>`	`randomSplit(Random rand, double... splits)` Splits the dataset randomly into proportionally sized partitions.
`void`	`replaceNumericFeatures(List<Vec> newNumericFeatures)` This method will replace every numeric feature in this dataset with a Vec object from the given list.
`abstract void`	`setDataPoint(int i, DataPoint dp)` Replaces an already existing data point with the one given.
`boolean`	`setNumericName(String name, int i)` Sets the unique name associated with the `i`'th numeric attribute.
`abstract DataSet<Type>`	`shallowClone()` Returns a new version of this data set that is of the same type, and contains a different list pointing to the same data points.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - numNumerVals
```
protected int numNumerVals
```
    The number of numerical values each data point must have
  - categories
```
protected CategoricalData[] categories
```
    Contains the categories for each of the categorical variables
  - numericalVariableNames
```
protected List<String> numericalVariableNames
```
    The list, in order, of the names of the numeric variables. This should be filled with default values on construction, which can then be changed later.
  - columnVecCache
```
protected Map<Integer,SoftReference<Vec>> columnVecCache
```
    This cache is used to hold a reference to the column vectors that are returned. A column is often requested multiple times, especially during a grid search, and there is no need to redo the work. Because soft references are used, the GC can still collect the cache if memory runs low.
    
    This map should be cleared whenever the data set as a whole is mutated.
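    
    The caching pattern described above can be sketched as follows. This is a minimal illustration on plain `double[]` rows, not JSAT's actual code; the class `ColumnCacheSketch` and its members are hypothetical names chosen for the example.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a data set that caches its column vectors. */
class ColumnCacheSketch {
    private final double[][] rows; // rows[r][c], one row per data point
    // Soft references let the GC reclaim cached columns under memory pressure.
    private final Map<Integer, SoftReference<double[]>> columnCache = new HashMap<>();
    int rebuilds = 0; // counts how often a column had to be (re)computed

    ColumnCacheSketch(double[][] rows) {
        this.rows = rows;
    }

    /** Returns column c, reusing the cached array while it is still reachable. */
    double[] getColumn(int c) {
        SoftReference<double[]> ref = columnCache.get(c);
        double[] col = (ref == null) ? null : ref.get();
        if (col == null) { // never cached, or already collected by the GC
            col = new double[rows.length];
            for (int r = 0; r < rows.length; r++) {
                col[r] = rows[r][c];
            }
            columnCache.put(c, new SoftReference<>(col));
            rebuilds++;
        }
        return col;
    }

    /** Mutating the data invalidates every cached column. */
    void setValue(int r, int c, double v) {
        rows[r][c] = v;
        columnCache.clear();
    }
}
```

    Clearing the whole map on mutation is the simplest safe policy; a finer-grained scheme could drop only the affected column.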
- Constructor Detail
  - DataSet
```
public DataSet()
```
- Method Detail
  - setNumericName
```
public boolean setNumericName(String name,
                              int i)
```
    Sets the unique name associated with the i'th numeric attribute. All strings will be converted to lower case first.
    
    Parameters:
    
    name - the name to use
    
    i - the ith attribute.
    
    Returns:
    
    true if the value was set, false if it was not set because an invalid index was given.
  - getNumericName
```
public String getNumericName(int i)
```
    Returns the name used for the i'th numeric attribute.
    
    Parameters:
    
    i - the ith attribute.
    
    Returns:
    
    the name used for the i'th numeric attribute.
  - getCategoryName
```
public String getCategoryName(int i)
```
    Returns the name used for the i'th categorical attribute.
    
    Parameters:
    
    i - the ith attribute.
    
    Returns:
    
    the name used for the i'th categorical attribute.
  - applyTransform
```
public void applyTransform(DataTransform dt)
```
    Applies the given transformation to all points in this data set, replacing each data point with the new value. No mutation of the data points will occur.
    
    Parameters:
    
    dt - the transformation to apply
  - applyTransform
```
public void applyTransform(DataTransform dt,
                           ExecutorService ex)
```
    Applies the given transformation to all points in this data set in parallel, replacing each data point with the new value. No mutation of the data points will occur.
    
    Parameters:
    
    dt - the transformation to apply
    
    ex - the threadpool to provide threads from. May be null to perform operations in serial
  - applyTransform
```
public void applyTransform(DataTransform dt,
                           boolean mutate)
```
    Applies the given transformation to all points in this data set. If the transform supports mutating the original data points, the mutation is applied in place when mutate is true.
    
    Parameters:
    
    dt - the transformation to apply
    
    mutate - true to mutate the original data points in place, false to ignore the transform's ability to mutate and replace the original data points instead.
  - applyTransform
```
public void applyTransform(DataTransform dt,
                           boolean mutate,
                           ExecutorService ex)
```
    Applies the given transformation to all points in this data set in parallel. If the transform supports mutating the original data points, the mutation is applied in place when mutate is true.
    
    Parameters:
    
    dt - the transformation to apply
    
    mutate - true to mutate the original data points in place, false to ignore the transform's ability to mutate and replace the original data points instead.
    
    ex - the threadpool to provide threads from. May be null to perform operations in serial
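    
    The difference between replacing and mutating can be sketched as below. This is an illustration only, assuming a hypothetical `PointTransform` interface over plain `double[]` points; it is not the JSAT `DataTransform` API and it ignores the parallel variants.

```java
import java.util.List;

class ApplyTransformSketch {
    /** Hypothetical transform: can build a new point or edit one in place. */
    interface PointTransform {
        double[] transform(double[] point);    // returns a new array
        void mutableTransform(double[] point); // edits the array in place
    }

    /** Example transform that doubles every value. */
    static final PointTransform DOUBLER = new PointTransform() {
        @Override
        public double[] transform(double[] point) {
            double[] out = point.clone();
            mutableTransform(out);
            return out;
        }

        @Override
        public void mutableTransform(double[] point) {
            for (int i = 0; i < point.length; i++) {
                point[i] *= 2;
            }
        }
    };

    /** Applies t to every point, either in place or by replacement. */
    static void applyTransform(List<double[]> points, PointTransform t, boolean mutate) {
        for (int i = 0; i < points.size(); i++) {
            if (mutate) {
                t.mutableTransform(points.get(i)); // same object, new values
            } else {
                points.set(i, t.transform(points.get(i))); // new object replaces the old one
            }
        }
    }
}
```

    With `mutate` set to false, any other holder of the original arrays still sees the old values; with true, every holder sees the change.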
  - replaceNumericFeatures
```
public void replaceNumericFeatures(List<Vec> newNumericFeatures)
```
    This method will replace every numeric feature in this dataset with a Vec object from the given list. All vecs in the given list must be of the same size.
    
    Parameters:
    
    newNumericFeatures - the list of new numeric features to use
  - getDataPoint
```
public abstract DataPoint getDataPoint(int i)
```
    Returns the i'th data point in this set. The order will never change so long as no data points are added or removed from the set.
    
    Parameters:
    
    i - the i'th data point in this set
    
    Returns:
    
    the i'th data point in this set
  - setDataPoint
```
public abstract void setDataPoint(int i,
                                  DataPoint dp)
```
    Replaces an already existing data point with the one given. Any values associated with the data point, but not a part of it, will remain intact.
    
    Parameters:
    
    i - the i'th dataPoint to set.
    
    dp - the data point to set at the specified index
  - getOnlineColumnStats
```
public OnLineStatistics[] getOnlineColumnStats(boolean useWeights)
```
    Returns summary statistics computed in an online fashion for each numeric variable. This returns all summary statistics, but can be less numerically stable and uses more memory.
    NaNs / missing values will be ignored in the statistics for each column.
    
    Parameters:
    
    useWeights - true to return the weighted statistics, unweighted otherwise.
    
    Returns:
    
    an array of summary statistics
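    
    For readers unfamiliar with online statistics, the sketch below shows one common single-pass approach (Welford's update) for one column, skipping NaN values as described above. It is an assumption for illustration; the source does not say which update rule `OnLineStatistics` uses, and `OnlineStatsSketch` is a hypothetical class.

```java
/** Hypothetical single-column accumulator using Welford's online update. */
class OnlineStatsSketch {
    private long n;      // number of non-missing values seen
    private double mean; // running mean
    private double m2;   // running sum of squared deviations from the mean

    /** Observes one value; NaN is treated as missing and ignored. */
    void add(double x) {
        if (Double.isNaN(x)) {
            return;
        }
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long getCount() {
        return n;
    }

    double getMean() {
        return mean;
    }

    /** Unbiased sample variance; 0 when fewer than two values were seen. */
    double getVariance() {
        return n > 1 ? m2 / (n - 1) : 0.0;
    }
}
```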
  - getOnlineDenseStats
```
public OnLineStatistics getOnlineDenseStats()
```
    Returns an OnLineStatistics object that is built by observing what proportion of each data point contains non zero numerical values. A mean of 1 indicates all values were fully dense, and a mean of 0 indicates all values were completely sparse (all zeros).
    
    Returns:
    
    statistics on the percent sparseness of each data point
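    
    The quantity being summarized can be sketched as the per-point fraction of non-zero values, averaged over all points. The helper names below are hypothetical and the code works on plain arrays rather than JSAT vectors.

```java
class DenseStatsSketch {
    /** Fraction of entries in one data point that are non-zero. */
    static double denseFraction(double[] point) {
        if (point.length == 0) {
            return 0.0;
        }
        int nonZero = 0;
        for (double v : point) {
            if (v != 0.0) {
                nonZero++;
            }
        }
        return (double) nonZero / point.length;
    }

    /** Mean dense fraction over all points: 1 is fully dense, 0 is all zeros. */
    static double meanDenseFraction(double[][] points) {
        if (points.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double[] p : points) {
            sum += denseFraction(p);
        }
        return sum / points.length;
    }
}
```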
  - getColumnMeanVariance
```
public Vec[] getColumnMeanVariance()
```
    Computes the weighted mean and variance for each column of feature values. This has less overhead than getOnlineColumnStats(boolean) but returns less information.
    
    Returns:
    
    an array of the vectors containing the mean and variance for each column.
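    
    A weighted mean and variance for a single column might be computed as in the sketch below. It assumes non-negative weights with a positive sum and normalizes the variance by the sum of weights; the source does not state which normalization JSAT uses, so treat that choice as an assumption.

```java
class WeightedColumnStatsSketch {
    /** Returns {mean, variance} of values under the given weights. */
    static double[] weightedMeanVariance(double[] values, double[] weights) {
        double weightSum = 0.0;
        double mean = 0.0;
        for (int i = 0; i < values.length; i++) {
            weightSum += weights[i];
            mean += weights[i] * values[i];
        }
        mean /= weightSum;

        // Second pass: weighted squared deviations from the mean.
        double variance = 0.0;
        for (int i = 0; i < values.length; i++) {
            double d = values[i] - mean;
            variance += weights[i] * d * d;
        }
        variance /= weightSum;
        return new double[] {mean, variance};
    }
}
```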
  - getDataPointIterator
```
public Iterator<DataPoint> getDataPointIterator()
```
    Returns an iterator that will iterate over all data points in the set. The behavior is undefined if the data set is modified while it is being iterated over.
    
    Returns:
    
    an iterator for the data points
  - getSampleSize
```
public abstract int getSampleSize()
```
    Returns the number of data points in this data set
    
    Returns:
    
    the number of data points in this data set
  - getNumCategoricalVars
```
public int getNumCategoricalVars()
```
    Returns the number of categorical variables for each data point in the set
    
    Returns:
    
    the number of categorical variables for each data point in the set
  - getNumNumericalVars
```
public int getNumNumericalVars()
```
    Returns the number of numerical variables for each data point in the set
    
    Returns:
    
    the number of numerical variables for each data point in the set
  - getCategories
```
public CategoricalData[] getCategories()
```
    Returns the array containing the categorical data information for this data set. Changes to this will be reflected in the data set.
    
    Returns:
    
    the array of CategoricalData
  - getSubset
```
protected abstract Type getSubset(List<Integer> indicies)
```
    Creates a new dataset that is a subset of this dataset.
    
    Parameters:
    
    indicies - the indices of the data points to insert into the new dataset; they will be placed in the order listed.
    
    Returns:
    
    a new dataset that is a specified subset of this dataset, and backed by the same values
  - getMissingDropped
```
public Type getMissingDropped()
```
    This method returns a dataset that is a subset of this dataset, where only the rows that have no missing values are kept. The new dataset is backed by this dataset.
    
    Returns:
    
    a subset of this dataset that has all data points with missing features dropped
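    
    One way to picture this is as an index filter whose result is handed to getSubset. The sketch below handles only numeric values and assumes they are marked missing with NaN; how missing categorical values are encoded is not stated here, so they are left out. `MissingDroppedSketch` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;

class MissingDroppedSketch {
    /** Returns the indices of rows that contain no NaN, in their original order. */
    static List<Integer> completeRowIndices(double[][] rows) {
        List<Integer> keep = new ArrayList<>();
        for (int r = 0; r < rows.length; r++) {
            boolean missing = false;
            for (double v : rows[r]) {
                if (Double.isNaN(v)) {
                    missing = true;
                    break;
                }
            }
            if (!missing) {
                keep.add(r);
            }
        }
        return keep;
    }
}
```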
  - randomSplit
```
public List<Type> randomSplit(Random rand,
                              double... splits)
```
    Splits the dataset randomly into proportionally sized partitions.
    
    Parameters:
    
    rand - the source of randomness for moving data around
    
    splits - an array whose length is the number of datasets to create and whose value at each index is the fraction of samples that should be placed into that dataset. The sum of the values must be less than or equal to 1.0.
    
    Returns:
    
    a list of new datasets
  - randomSplit
```
public List<Type> randomSplit(double... splits)
```
    Splits the dataset randomly into proportionally sized partitions.
    
    Parameters:
    
    splits - an array whose length is the number of datasets to create and whose value at each index is the fraction of samples that should be placed into that dataset. The sum of the values must be less than or equal to 1.0.
    
    Returns:
    
    a list of new datasets
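    
    The splitting logic can be sketched on indices alone: shuffle, then cut consecutive blocks sized by each fraction. This is an illustration under assumptions, not JSAT's implementation; in particular, rounding each block size and leaving unassigned any points beyond the requested fractions are choices made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class RandomSplitSketch {
    /** Partitions the indices 0..n-1 into randomly chosen, proportionally sized groups. */
    static List<List<Integer>> randomSplit(int n, Random rand, double... splits) {
        double sum = 0.0;
        for (double s : splits) {
            if (s < 0) {
                throw new IllegalArgumentException("split fractions must be non-negative");
            }
            sum += s;
        }
        if (sum > 1.0 + 1e-9) {
            throw new IllegalArgumentException("split fractions must sum to at most 1.0");
        }

        List<Integer> order = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            order.add(i);
        }
        Collections.shuffle(order, rand);

        List<List<Integer>> parts = new ArrayList<>();
        int start = 0;
        for (double s : splits) {
            int size = (int) Math.round(s * n);
            int end = Math.min(n, start + size); // never run past the data
            parts.add(new ArrayList<>(order.subList(start, end)));
            start = end;
        }
        return parts;
    }
}
```

    Passing a seeded `Random` makes the partition reproducible, which is the usual reason to prefer the overload that takes one.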
  - cvSet
```
public List<Type> cvSet(int folds,
                        Random rand)
```
    Creates folds data sets that contain data from this data set. The data points in each set will be random. These are meant for cross validation.
    
    Parameters:
    
    folds - the number of cross validation sets to create. Should be greater than 1
    
    rand - the source of randomness
    
    Returns:
    
    the list of data sets.
  - cvSet
```
public List<Type> cvSet(int folds)
```
    Creates folds data sets that contain data from this data set. The data points in each set will be random. These are meant for cross validation.
    
    Parameters:
    
    folds - the number of cross validation sets to create. Should be greater than 1
    
    Returns:
    
    the list of data sets.
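    
    A fold assignment of this kind can be sketched by shuffling the indices and dealing them out in turn. The round-robin dealing is an assumption of this example; the source does not say how JSAT distributes points among folds, and `CvFoldsSketch` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class CvFoldsSketch {
    /** Randomly assigns the indices 0..n-1 to the requested number of folds. */
    static List<List<Integer>> cvFolds(int n, int folds, Random rand) {
        if (folds < 2) {
            throw new IllegalArgumentException("folds should be greater than 1");
        }
        List<Integer> order = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            order.add(i);
        }
        Collections.shuffle(order, rand);

        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < folds; f++) {
            result.add(new ArrayList<>());
        }
        // Deal shuffled indices round-robin so fold sizes differ by at most one.
        for (int i = 0; i < n; i++) {
            result.get(i % folds).add(order.get(i));
        }
        return result;
    }
}
```

    In a cross-validation loop, each fold serves once as the held-out set while the remaining folds are combined for training.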
  - getDataPoints
```
public List<DataPoint> getDataPoints()
```
    Creates a list containing the same DataPoints as this set, in the same order. The list holds references to the original DataPoint objects rather than copies, so altering the list itself will have no effect on this DataSet, but altering the DataPoints in the list will affect the DataPoints in this DataSet.
    
    Returns:
    
    a list of the DataPoints in this DataSet.
  - getDataVectors
```
public List<Vec> getDataVectors()
```
    Creates a list of the vector values for each data point in the correct order.
    
    Returns:
    
    a list of the vectors for the data points
  - getNumericColumn
```
public Vec getNumericColumn(int i)
```
    The data set can be seen as an NxM matrix, where each row is a data point and each column holds the values for a particular variable. This method grabs all the numerical values for a 'column' and returns them as one vector.
    This vector can be altered and will not affect any of the values in the data set.
    
    Parameters:
    
    i - the i'th numerical variable to obtain all values of
    
    Returns:
    
    a Vector of length getSampleSize()
  - countMissingValues
```
public long countMissingValues()
```
    Returns:
    
    the number of missing values in both numeric and categorical features
  - getNumericColumns
```
public Vec[] getNumericColumns()
```
    Creates an array of column vectors for every numeric variable in this data set. The index of the array corresponds to the numeric feature index. This method is faster and more efficient than calling getNumericColumn(int) when multiple columns are needed.
    
    Note that the columns returned by this method may be cached and reused by the DataSet itself. If you need to alter the columns, you should create your own copy of these vectors. If you know that you will be the only caller getting a column vector from this data set, then you may safely alter the columns without mutating the data points themselves. However, future callers may or may not receive the same vector objects.
    
    Returns:
    
    an array of the column vectors
  - getNumericColumns
```
public Vec[] getNumericColumns(Set<Integer> skipColumns)
```
    Creates an array of column vectors for every numeric variable in this data set. The index of the array corresponds to the numeric feature index. This method is faster and more efficient than calling getNumericColumn(int) when multiple columns are needed.
    
    A set of columns to skip can be provided in order to save memory if one does not need all the columns.
    
    Note that the columns returned by this method may be cached and reused by the DataSet itself. If you need to alter the columns, you should create your own copy of these vectors. If you know that you will be the only caller getting a column vector from this data set, then you may safely alter the columns without mutating the data points themselves. However, future callers may or may not receive the same vector objects.
    
    Parameters:
    
    skipColumns - if a column's index is in this set, a null will be returned in the array at the column's index instead of a vector
    
    Returns:
    
    an array of the column vectors
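    
    The skip-set behavior, and the reason one bulk call can beat repeated single-column calls, can be sketched as a single pass over the rows that fills every requested column at once. This works on plain arrays and uses the hypothetical helper `NumericColumnsSketch`; it is not JSAT's implementation and omits caching.

```java
import java.util.Set;

class NumericColumnsSketch {
    /** Builds all non-skipped columns in one pass; skipped columns stay null. */
    static double[][] numericColumns(double[][] rows, int numCols, Set<Integer> skipColumns) {
        double[][] cols = new double[numCols][];
        for (int c = 0; c < numCols; c++) {
            if (!skipColumns.contains(c)) {
                cols[c] = new double[rows.length]; // allocate only what is needed
            }
        }
        for (int r = 0; r < rows.length; r++) {
            for (int c = 0; c < numCols; c++) {
                if (cols[c] != null) {
                    cols[c][r] = rows[r][c];
                }
            }
        }
        return cols;
    }
}
```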
  - getDataMatrix
```
public Matrix getDataMatrix()
```
    Creates a matrix from the data set, where each row represents a data point and each column is one of the numeric variables from the data set.
    This matrix can be altered and will not affect any of the values in the data set.
    
    Returns:
    
    a matrix of the data points.
  - getDataMatrixView
```
public Matrix getDataMatrixView()
```
    Creates a matrix backed by the data set, where each row is a data point from the data set and each column is one of the numeric variables from the data set.
    Any modifications to this matrix will be reflected in the dataset.
    This method has the advantage over getDataMatrix() in that it does not use any additional memory and it maintains any sparsity information.
    
    Returns:
    
    a matrix representation of the data points
  - getNumFeatures
```
public int getNumFeatures()
```
    Returns the number of features in this data set, which is the sum of getNumCategoricalVars() and getNumNumericalVars()
    
    Returns:
    
    the total number of features in this data set
  - shallowClone
```
public abstract DataSet<Type> shallowClone()
```
    Returns a new version of this data set that is of the same type, and contains a different list pointing to the same data points.
    
    Returns:
    
    a shallow copy of this data set
  - getTwiceShallowClone
```
public DataSet getTwiceShallowClone()
```
    Returns a new version of this data set that is of the same type, and contains a different list pointing to shallow data point copies. Because the data point object contains the weight itself, the weight is not shared, while the vector and array information is. This allows altering the weights of the data points while preserving the original weights.
    Altering the list or weights of the returned data set will not be reflected in the original. Altering the feature values will.
    
    Returns:
    
    a shallow copy of shallow data point copies for this data set.
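    
    The two clone depths can be contrasted with a small sketch. The `Point` class below is a hypothetical stand-in for a data point that stores its own weight and a reference to its feature values; it is not the JSAT `DataPoint` class.

```java
import java.util.ArrayList;
import java.util.List;

class CloneDepthSketch {
    /** Hypothetical data point: its own weight plus a reference to feature values. */
    static final class Point {
        double weight;
        final double[] features;

        Point(double weight, double[] features) {
            this.weight = weight;
            this.features = features;
        }
    }

    /** New list, same Point objects: weights and features are both shared. */
    static List<Point> shallowClone(List<Point> data) {
        return new ArrayList<>(data);
    }

    /** New list of new Point objects that still share the feature arrays. */
    static List<Point> twiceShallowClone(List<Point> data) {
        List<Point> copy = new ArrayList<>(data.size());
        for (Point p : data) {
            copy.add(new Point(p.weight, p.features));
        }
        return copy;
    }
}
```

    Reweighting the twice-shallow copy leaves the original weights alone, while a write to a feature array is visible through both.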
  - getSparsityStats
```
public OnLineStatistics getSparsityStats()
```
    Returns statistics on the sparsity of the vectors in this data set. Vectors that are not considered sparse will be treated as completely dense, even if zero values exist in the data.
    
    Returns:
    
    an object containing the statistics of the vector sparsity
  - getDataWeights
```
public Vec getDataWeights()
```
    This method returns the weight of each data point in a single Vector. When all data points have the same weight, this will return a vector that uses fixed memory instead of allocating a full double backed array.
    
    Returns:
    
    a vector that will return the weight for each data point with the same corresponding index.
