HashedTextDataLoader (Java Statistical Analysis Tool 0.0.8 API)

java.lang.Object
- jsat.text.HashedTextDataLoader

All Implemented Interfaces:

Serializable, TextVectorCreator

Direct Known Subclasses:

ClassificationHashedTextDataLoader
```
public abstract class HashedTextDataLoader
extends Object
implements TextVectorCreator
```
This class provides a framework for loading datasets made of text documents as hashed feature vectors. Text is broken up into a sequence of tokens using a Tokenizer, which must be provided. The weights used are determined by the supplied word weighting scheme.
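
To make the idea of a hashed feature vector concrete, the following is a minimal, self-contained sketch of the hashing trick. It is not JSAT's implementation: the class name `HashedCountsSketch`, the whitespace tokenizer, the use of `String.hashCode()`, and the raw counts used as weights are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration of the hashing trick; not part of JSAT.
class HashedCountsSketch {
    // Maps a token to a bucket in [0, dimensionSize) without storing a vocabulary.
    static int bucket(String token, int dimensionSize) {
        return Math.floorMod(token.hashCode(), dimensionSize);
    }

    // Splits on whitespace (a stand-in for a real Tokenizer) and counts tokens per bucket.
    static double[] vectorize(String text, int dimensionSize) {
        double[] vec = new double[dimensionSize];
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return vec; // no tokens, so the vector stays all zeros
        }
        for (String token : trimmed.split("\\s+")) {
            vec[bucket(token, dimensionSize)] += 1.0;
        }
        return vec;
    }

    public static void main(String[] args) {
        // Six tokens are spread over 16 buckets; distinct tokens may collide.
        System.out.println(Arrays.toString(vectorize("the cat saw the other cat", 16)));
    }
}
```

Because the bucket is computed directly from the token, the vector length is fixed up front and no dictionary has to be built, at the cost of occasional collisions between unrelated tokens.
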
The user adds documents to the initial dataset using the addOriginalDocument(java.lang.String) method. finishAdding() must be called when no more documents are left to add, at which point the class takes care of calling the WordWeighting.setWeight(java.util.List, java.util.List) method to configure the word weighting used with the original data added.

After the initial dataset is loaded, new strings can be converted to vectors using the newText(java.lang.String) method. This should only be called after finishAdding().
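
The required ordering (add documents, finish, then convert new text) can be sketched with a small stand-in class. This is an illustration under stated assumptions, not JSAT code: `LifecycleSketch` is a hypothetical name, the IDF-style bucket weight is an assumed weighting scheme, and throwing `IllegalStateException` on out-of-order calls is a choice made for this sketch rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the add -> finish -> convert lifecycle; not JSAT code.
class LifecycleSketch {
    private final int dim;
    private final List<double[]> originals = new ArrayList<>();
    private double[] bucketWeights;        // fitted once in finishAdding()
    private boolean noMoreAdding = false;

    LifecycleSketch(int dim) { this.dim = dim; }

    // Hashed token counts for one text, using whitespace tokenization.
    private double[] counts(String text) {
        double[] v = new double[dim];
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) v[Math.floorMod(tok.hashCode(), dim)] += 1.0;
        }
        return v;
    }

    int addOriginalDocument(String text) {
        if (noMoreAdding) throw new IllegalStateException("finishAdding() was already called");
        originals.add(counts(text));
        return originals.size() - 1;
    }

    // Fits a simple IDF-style weight per bucket from the original documents (assumed scheme).
    void finishAdding() {
        bucketWeights = new double[dim];
        for (int j = 0; j < dim; j++) {
            int docFreq = 0;
            for (double[] doc : originals) if (doc[j] > 0) docFreq++;
            bucketWeights[j] = Math.log((1.0 + originals.size()) / (1.0 + docFreq)) + 1.0;
        }
        noMoreAdding = true;
    }

    // Only valid after finishAdding(), because the weights do not exist before then.
    double[] newText(String input) {
        if (!noMoreAdding) throw new IllegalStateException("call finishAdding() first");
        double[] v = counts(input);
        for (int j = 0; j < dim; j++) v[j] *= bucketWeights[j];
        return v;
    }
}
```

The sketch shows why the order matters: the weights applied in `newText` are derived from the complete set of original documents, so they cannot be computed until adding has finished.
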

An instance of this class keeps a reference to all originally added vectors. To transform new texts into vectors without keeping references to all of the original vectors, getTextVectorCreator() returns an object that performs the transformation.

Author:

Edward Raff

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`protected boolean`	`noMoreAdding`
`protected ThreadLocal<List<String>>`	`storageSpace` Temporary storage space to use for tokenization
`protected List<SparseVector>`	`vectors` List of original vectors
`protected ThreadLocal<Map<String,Integer>>`	`wordCounts` Temporary space to use when creating vectors
`protected ThreadLocal<StringBuilder>`	`workSpace` Temporary work space to use for tokenization

Constructor Summary

Constructors
Constructor and Description
`HashedTextDataLoader(int dimensionSize, Tokenizer tokenizer, WordWeighting weighting)`
`HashedTextDataLoader(Tokenizer tokenizer, WordWeighting weighting)`

Method Summary

Modifier and Type	Method and Description
`protected int`	`addOriginalDocument(String text)` To be called by the `initialLoad()` method.
`protected void`	`finishAdding()` Once all original documents have been added, this method is called so that post processing steps can be applied.
`DataSet`	`getDataSet()` Returns a new data set containing the original data points that were loaded with this loader.
`TextVectorCreator`	`getTextVectorCreator()` Returns the `TextVectorCreator` used by this data loader to convert documents into vectors.
`protected abstract void`	`initialLoad()` This method will load all the text documents that make up the original data set from their source.
`Vec`	`newText(String input)` Converts the given input text into a vector representation.
`Vec`	`newText(String input, StringBuilder workSpace, List<String> storageSpace)` Converts the given input text into a vector representation

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - vectors
```
protected List<SparseVector> vectors
```
    List of original vectors
  - noMoreAdding
```
protected boolean noMoreAdding
```
  - workSpace
```
protected ThreadLocal<StringBuilder> workSpace
```
    Temporary work space to use for tokenization
  - storageSpace
```
protected ThreadLocal<List<String>> storageSpace
```
    Temporary storage space to use for tokenization
  - wordCounts
```
protected ThreadLocal<Map<String,Integer>> wordCounts
```
    Temporary space to use when creating vectors
- Constructor Detail
  - HashedTextDataLoader
```
public HashedTextDataLoader(Tokenizer tokenizer,
                            WordWeighting weighting)
```
  - HashedTextDataLoader
```
public HashedTextDataLoader(int dimensionSize,
                            Tokenizer tokenizer,
                            WordWeighting weighting)
```
- Method Detail
  - initialLoad
```
protected abstract void initialLoad()
```
    This method will load all the text documents that make up the original data set from their source. For each document, addOriginalDocument(java.lang.String) should be called with the text of the document.
    This method will be called when getDataSet() is called for the first time.
    New document vectors can be obtained after loading by calling newText(java.lang.String).
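    The relationship between initialLoad() and getDataSet() described above follows the template-method pattern. The sketch below illustrates that pattern only; `LazyLoaderSketch` and `InMemoryLoaderSketch` are hypothetical classes, documents are kept as plain strings, and nothing here is taken from the JSAT source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical template-method sketch of lazy loading; not JSAT code.
abstract class LazyLoaderSketch {
    private final List<String> documents = new ArrayList<>();
    private boolean loaded = false;

    // Subclasses read documents from their source and call addOriginalDocument for each.
    protected abstract void initialLoad();

    protected int addOriginalDocument(String text) {
        documents.add(text);
        return documents.size() - 1;
    }

    // Triggers initialLoad() on the first call only; later calls reuse the loaded data.
    public List<String> getDataSet() {
        if (!loaded) {
            initialLoad();
            loaded = true;
        }
        return Collections.unmodifiableList(new ArrayList<>(documents));
    }
}

// Example subclass that "loads" from an in-memory array instead of files.
class InMemoryLoaderSketch extends LazyLoaderSketch {
    int loadCalls = 0;               // exposed so the lazy behavior can be observed
    private final String[] source;

    InMemoryLoaderSketch(String... source) { this.source = source; }

    @Override
    protected void initialLoad() {
        loadCalls++;
        for (String doc : source) addOriginalDocument(doc);
    }
}
```

    With this structure a subclass only decides where the text comes from; when loading happens is controlled by the base class.
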
  - addOriginalDocument
```
protected int addOriginalDocument(String text)
```
    To be called by the initialLoad() method. It will take in the text and add a new document vector to the data set. Once all text documents have been loaded, this method should never be called again.
    This method is thread safe.
    
    Parameters:
    
    text - the text of the document to add
    
    Returns:
    
    the index of the created document for the given text. Starts from zero and counts up.
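    One simple way to get a thread-safe "add and return the index" operation is to perform the append and the index computation under a single lock. The sketch below shows that approach; it is an assumption for illustration (the hypothetical `IndexedAddSketch` stores raw strings) and does not describe how JSAT achieves thread safety.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a thread-safe "add and return index" operation; not JSAT code.
class IndexedAddSketch {
    private final List<String> documents = new ArrayList<>();

    // The append and the index computation happen under one lock, so no two
    // callers can be handed the same index. Indices start at zero and count up.
    synchronized int addOriginalDocument(String text) {
        documents.add(text);
        return documents.size() - 1;
    }

    synchronized int size() {
        return documents.size();
    }
}
```

    Returning the index from inside the locked section is the important detail: reading the list size in a separate step after the add would let two threads observe the same value.
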
  - finishAdding
```
protected void finishAdding()
```
    Once all original documents have been added, this method is called so that post processing steps can be applied.
  - getDataSet
```
public DataSet getDataSet()
```
    Returns a new data set containing the original data points that were loaded with this loader.
    
    Returns:
    
    an appropriate data set for this loader
  - newText
```
public Vec newText(String input)
```
    Description copied from interface: TextVectorCreator
    
    Converts the given input text into a vector representation.
    
    Specified by:
    
    newText in interface TextVectorCreator
    
    Parameters:
    
    input - the input string
    
    Returns:
    
    a vector representation
  - newText
```
public Vec newText(String input,
                   StringBuilder workSpace,
                   List<String> storageSpace)
```
    Description copied from interface: TextVectorCreator
    
    Converts the given input text into a vector representation
    
    Specified by:
    
    newText in interface TextVectorCreator
    
    Parameters:
    
    input - the input string
    
    workSpace - an already allocated (but empty) string builder that can be used as a temporary work space.
    
    storageSpace - an already allocated (but empty) list to place the tokens into
    
    Returns:
    
    a vector representation
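    The purpose of passing in workSpace and storageSpace is to avoid allocating fresh buffers for every document. The sketch below shows one way such reuse can work, with per-thread buffers behind a convenience overload. All names (`BufferReuseSketch`, `WORK_SPACE`, `STORAGE_SPACE`), the extra `dim` parameter, and the letter-or-digit tokenization rule are assumptions for illustration, not the JSAT implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of buffer reuse during tokenization; not JSAT code.
class BufferReuseSketch {
    // One reusable buffer pair per thread, so concurrent callers never share them.
    static final ThreadLocal<StringBuilder> WORK_SPACE = ThreadLocal.withInitial(StringBuilder::new);
    static final ThreadLocal<List<String>> STORAGE_SPACE = ThreadLocal.withInitial(ArrayList::new);

    // Convenience overload: fetches this thread's buffers and delegates.
    static double[] newText(String input, int dim) {
        return newText(input, dim, WORK_SPACE.get(), STORAGE_SPACE.get());
    }

    // Tokenizes into the supplied buffers (runs of letters/digits form tokens),
    // then accumulates hashed token counts into a vector of length dim.
    static double[] newText(String input, int dim, StringBuilder workSpace, List<String> storageSpace) {
        workSpace.setLength(0);   // clear without releasing the backing array
        storageSpace.clear();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                workSpace.append(Character.toLowerCase(c));
            } else if (workSpace.length() > 0) {
                storageSpace.add(workSpace.toString());
                workSpace.setLength(0);
            }
        }
        if (workSpace.length() > 0) { // flush the final token
            storageSpace.add(workSpace.toString());
            workSpace.setLength(0);
        }
        double[] vec = new double[dim];
        for (String token : storageSpace) {
            vec[Math.floorMod(token.hashCode(), dim)] += 1.0;
        }
        return vec;
    }
}
```

    Callers that already manage their own buffers, for example inside a worker loop, can use the four-argument form directly and skip the thread-local lookup.
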
  - getTextVectorCreator
```
public TextVectorCreator getTextVectorCreator()
```
    Returns the TextVectorCreator used by this data loader to convert documents into vectors.
    
    Returns:
    
    the text vector creator used by this class
