TextDataLoader (Java Statistical Analysis Tool 0.0.8 API)

java.lang.Object
- jsat.text.TextDataLoader

All Implemented Interfaces:

Serializable, TextVectorCreator

Direct Known Subclasses:

ClassificationTextDataLoader
```
public abstract class TextDataLoader
extends Object
implements TextVectorCreator
```
This class provides a framework for loading datasets of text documents as vectors. Text is broken into a sequence of tokens by a Tokenizer, which must be provided. The weights used are determined by a word weighting scheme.
The user adds documents to the initial dataset using the addOriginalDocument(java.lang.String) method. finishAdding() must be called when no more documents are left to add, at which point the class calls the WordWeighting.setWeight(java.util.List, java.util.List) method to configure the word weighting from the original data added.

After the initial dataset is loaded, new strings can be converted to vectors using the newText(java.lang.String) method. This should only be called after finishAdding().

An instance of this class keeps a reference to all originally added vectors. To transform new texts into vectors without keeping references to all of the original vectors, getTextVectorCreator() returns an object that performs the transformation.
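The two-phase workflow described above can be sketched in plain Java. The class below is a simplified stand-in, not JSAT's implementation: `MiniTextLoader`, its whitespace tokenizer, the raw-count weighting, and the choice to ignore unknown tokens are all assumptions made for illustration. Only the method names `addOriginalDocument`, `finishAdding`, and `newText` mirror this class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the loader lifecycle; not JSAT code.
class MiniTextLoader {
    private final Map<String, Integer> wordIndex = new HashMap<>(); // token -> feature index
    private final List<String> allWords = new ArrayList<>();        // tokens in order of first observation
    private final List<Map<Integer, Integer>> documents = new ArrayList<>();
    private boolean noMoreAdding = false;

    // Phase 1: each added document may grow the vocabulary.
    int addOriginalDocument(String text) {
        if (noMoreAdding) {
            throw new IllegalStateException("finishAdding() was already called");
        }
        Map<Integer, Integer> counts = new HashMap<>();
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            Integer idx = wordIndex.get(token);
            if (idx == null) {
                idx = allWords.size();
                wordIndex.put(token, idx);
                allWords.add(token);
            }
            counts.merge(idx, 1, Integer::sum);
        }
        documents.add(counts);
        return documents.size() - 1; // document index, counting up from zero
    }

    // Freezes the vocabulary; a real loader would also configure its weighting here.
    void finishAdding() {
        noMoreAdding = true;
    }

    // Phase 2: convert new text using the frozen vocabulary (unknown tokens are skipped by assumption).
    double[] newText(String text) {
        if (!noMoreAdding) {
            throw new IllegalStateException("call finishAdding() first");
        }
        double[] vec = new double[allWords.size()];
        for (String token : tokenize(text)) {
            Integer idx = wordIndex.get(token);
            if (idx != null) vec[idx] += 1.0;
        }
        return vec;
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    public static void main(String[] args) {
        MiniTextLoader loader = new MiniTextLoader();
        loader.addOriginalDocument("the cat sat");
        loader.addOriginalDocument("the dog sat down");
        loader.finishAdding();
        // Vocabulary order: the, cat, sat, dog, down
        System.out.println(Arrays.toString(loader.newText("the cat and the dog")));
        // prints [2.0, 1.0, 0.0, 1.0, 0.0]
    }
}
```

The point of the split is that feature indices are only stable once the vocabulary stops growing, which is why conversion of new text is deferred until after the finishing step.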

Author:

Edward Raff

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`protected List<String>`	`allWords` List of all word tokens encountered, in order of first observation
`protected boolean`	`noMoreAdding` true once `finishAdding()` has been called, after which no new original documents can be inserted
`protected ThreadLocal<List<String>>`	`storageSpace` Temporary storage space to use for tokenization
`protected ConcurrentHashMap<Integer,AtomicInteger>`	`termDocumentFrequencys` The map of integer counts of how many times each word token was seen.
`protected Tokenizer`	`tokenizer` Tokenizer to apply to input strings
`protected List<SparseVector>`	`vectors` List of original vectors
`protected ThreadLocal<Map<String,Integer>>`	`wordCounts` Temporary space to use when creating vectors
`protected ConcurrentHashMap<String,Integer>`	`wordIndex` Maps words to their associated index in an array
`protected ThreadLocal<StringBuilder>`	`workSpace` Temporary work space to use for tokenization

Constructor Summary

Constructors
Constructor and Description
`TextDataLoader(Tokenizer tokenizer, WordWeighting weighting)` Creates a new loader for text datasets

Method Summary

Modifier and Type	Method and Description
`protected int`	`addOriginalDocument(String text)` To be called by the `initialLoad()` method.
`protected void`	`finishAdding()` Once all original documents have been added, this method is called so that post processing steps can be applied.
`DataSet`	`getDataSet()` Returns a new data set containing the original data points that were loaded with this loader.
`RemoveAttributeTransform`	`getMinimumOccurrenceDTF(int minCount)` Creates a new transform factory to remove all features for tokens that did not occur at least a given number of times
`int`	`getTermFrequency(int index)` Returns the number of times the token at the given index has been seen in the loaded documents
`TextVectorCreator`	`getTextVectorCreator()` Returns the `TextVectorCreator` used by this data loader to convert documents into vectors.
`String`	`getWordForIndex(int index)` Returns the original token for the given index in the data set
`abstract void`	`initialLoad()` This method will load all the text documents that make up the original data set from their source.
`Vec`	`newText(String text)` To be called after all original texts have been loaded.
`Vec`	`newText(String input, StringBuilder workSpace, List<String> storageSpace)` Converts the given input text into a vector representation

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - vectors
```
protected final List<SparseVector> vectors
```
    List of original vectors
  - tokenizer
```
protected Tokenizer tokenizer
```
    Tokenizer to apply to input strings
  - wordIndex
```
protected ConcurrentHashMap<String,Integer> wordIndex
```
    Maps words to their associated index in an array
  - allWords
```
protected List<String> allWords
```
    List of all word tokens encountered, in order of first observation
  - termDocumentFrequencys
```
protected ConcurrentHashMap<Integer,AtomicInteger> termDocumentFrequencys
```
    The map of integer counts of how many times each word token was seen. The key is the index of the word, and the value is the number of times it was seen. A map is used instead of a list so that it can be updated in an efficient, thread-safe way.
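    The reason a `ConcurrentHashMap<Integer,AtomicInteger>` supports cheap thread-safe updates can be shown with a small standalone sketch. `TermCounterSketch` and its methods are hypothetical names; how JSAT itself updates this field is not shown on this page, so the `computeIfAbsent` idiom below is an assumption about one reasonable approach.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of lock-free per-term counting; not JSAT code.
class TermCounterSketch {
    private final ConcurrentHashMap<Integer, AtomicInteger> counts = new ConcurrentHashMap<>();

    // Creates the counter on first use, then increments it atomically.
    void increment(int wordIndex) {
        counts.computeIfAbsent(wordIndex, k -> new AtomicInteger()).incrementAndGet();
    }

    // Returns 0 for an index that has never been counted.
    int get(int wordIndex) {
        AtomicInteger c = counts.get(wordIndex);
        return c == null ? 0 : c.get();
    }

    public static void main(String[] args) throws InterruptedException {
        TermCounterSketch counter = new TermCounterSketch();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    counter.increment(i % 3);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        // No increments are lost even though four threads update the same keys.
        System.out.println(counter.get(0) + " " + counter.get(1) + " " + counter.get(2));
        // prints 1336 1332 1332
    }
}
```

    With a plain list of `int` counts, every update would need a shared lock; here threads only contend on the individual counter they touch.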
  - workSpace
```
protected ThreadLocal<StringBuilder> workSpace
```
    Temporary work space to use for tokenization
  - storageSpace
```
protected ThreadLocal<List<String>> storageSpace
```
    Temporary storage space to use for tokenization
  - wordCounts
```
protected ThreadLocal<Map<String,Integer>> wordCounts
```
    Temporary space to use when creating vectors
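    The three `ThreadLocal` fields above give each thread its own reusable scratch buffers, so concurrent callers never share a `StringBuilder` or token list. A minimal sketch of that pattern follows, assuming a trivial letter-or-digit tokenizer; `ScratchSpaceSketch`, `tokenize`, and `countTokens` are hypothetical names, and the use of `ThreadLocal.withInitial` is an assumption rather than a description of how JSAT initializes these fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of per-thread scratch buffers; not JSAT code.
class ScratchSpaceSketch {
    private static final ThreadLocal<StringBuilder> WORK_SPACE =
            ThreadLocal.withInitial(StringBuilder::new);
    private static final ThreadLocal<List<String>> STORAGE_SPACE =
            ThreadLocal.withInitial(ArrayList::new);

    // Tokenizes into caller-supplied buffers, which must be empty on entry.
    static void tokenize(String input, StringBuilder workSpace, List<String> storageSpace) {
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (Character.isLetterOrDigit(ch)) {
                workSpace.append(Character.toLowerCase(ch));
            } else if (workSpace.length() > 0) {
                storageSpace.add(workSpace.toString());
                workSpace.setLength(0);
            }
        }
        if (workSpace.length() > 0) { // flush the trailing token
            storageSpace.add(workSpace.toString());
            workSpace.setLength(0);
        }
    }

    // Fetches this thread's buffers, resets them, and reuses them for the call.
    static int countTokens(String input) {
        StringBuilder workSpace = WORK_SPACE.get();
        List<String> storageSpace = STORAGE_SPACE.get();
        workSpace.setLength(0);
        storageSpace.clear();
        tokenize(input, workSpace, storageSpace);
        return storageSpace.size();
    }

    public static void main(String[] args) {
        System.out.println(countTokens("Hello, world! Hello")); // prints 3
    }
}
```

    Resetting the buffers at the start of each call, rather than allocating new ones, is what makes the reuse safe across repeated calls on the same thread.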
  - noMoreAdding
```
protected boolean noMoreAdding
```
    true once finishAdding() has been called, after which no new original documents can be inserted
- Constructor Detail
  - TextDataLoader
```
public TextDataLoader(Tokenizer tokenizer,
                      WordWeighting weighting)
```
    Creates a new loader for text datasets
    
    Parameters:
    
    tokenizer - the tokenization method to break up strings with
    
    weighting - the scheme to set the weights for feature vectors.
- Method Detail
  - initialLoad
```
public abstract void initialLoad()
```
    This method will load all the text documents that make up the original data set from their source. For each document, addOriginalDocument(java.lang.String) should be called with the text of the document.
    This method will be called when getDataSet() is called for the first time.
    New document vectors can be obtained after loading by calling newText(java.lang.String).
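    The contract described here, where a subclass supplies initialLoad() and the base class invokes it on the first getDataSet() call, is the template method pattern. A rough plain-Java sketch follows. `LazyLoaderSketch` and `InMemoryLoader` are hypothetical stand-ins, and a `List<String>` takes the place of the real DataSet; the actual loader also tokenizes and weights the text, which is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the lazy-loading contract; not JSAT code.
abstract class LazyLoaderSketch {
    private final List<String> documents = new ArrayList<>();
    private boolean loaded = false;

    // Subclasses read their source and call addOriginalDocument once per document.
    protected abstract void initialLoad();

    protected int addOriginalDocument(String text) {
        documents.add(text);
        return documents.size() - 1;
    }

    // The first call triggers loading; later calls reuse what was loaded.
    public List<String> getDataSet() {
        if (!loaded) {
            initialLoad();
            loaded = true;
        }
        return new ArrayList<>(documents);
    }

    public static void main(String[] args) {
        InMemoryLoader loader = new InMemoryLoader();
        System.out.println(loader.getDataSet().size()); // prints 2
        System.out.println(loader.loadCalls);           // prints 1
    }
}

// A subclass whose "source" is a fixed in-memory array.
class InMemoryLoader extends LazyLoaderSketch {
    int loadCalls = 0;

    @Override
    protected void initialLoad() {
        loadCalls++;
        for (String doc : new String[] {"first document", "second document"}) {
            addOriginalDocument(doc);
        }
    }
}
```

    A subclass therefore only needs to know where the documents come from; when loading happens is decided by the base class.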
  - addOriginalDocument
```
protected int addOriginalDocument(String text)
```
    To be called by the initialLoad() method. It will take in the text and add a new document vector to the data set. Once all text documents have been loaded, this method should never be called again.
    
    This method is thread safe.
    
    Parameters:
    
    text - the text of the document to add
    
    Returns:
    
    the index of the created document for the given text. Starts from zero and counts up.
  - finishAdding
```
protected void finishAdding()
```
    Once all original documents have been added, this method is called so that post processing steps can be applied.
  - getDataSet
```
public DataSet getDataSet()
```
    Returns a new data set containing the original data points that were loaded with this loader.
    
    Returns:
    
    an appropriate data set for this loader
  - newText
```
public Vec newText(String text)
```
    To be called after all original texts have been loaded.
    
    Specified by:
    
    newText in interface TextVectorCreator
    
    Parameters:
    
    text - the text of the document to create a document vector from
    
    Returns:
    
    the sparse vector representing this document
  - newText
```
public Vec newText(String input,
                   StringBuilder workSpace,
                   List<String> storageSpace)
```
    Description copied from interface: TextVectorCreator
    
    Converts the given input text into a vector representation
    
    Specified by:
    
    newText in interface TextVectorCreator
    
    Parameters:
    
    input - the input string
    
    workSpace - an already allocated (but empty) string builder that can be used as a temporary work space.
    
    storageSpace - an already allocated (but empty) list to place the tokens into
    
    Returns:
    
    a vector representation
  - getTextVectorCreator
```
public TextVectorCreator getTextVectorCreator()
```
    Returns the TextVectorCreator used by this data loader to convert documents into vectors.
    
    Returns:
    
    the text vector creator used by this class
  - getWordForIndex
```
public String getWordForIndex(int index)
```
    Returns the original token for the given index in the data set
    
    Parameters:
    
    index - the numeric feature index
    
    Returns:
    
    the word token associated with the index
  - getTermFrequency
```
public int getTermFrequency(int index)
```
    Returns the number of times the token at the given index has been seen in the loaded documents
    
    Parameters:
    
    index - the numeric feature index
    
    Returns:
    
    the total occurrence count for the feature
  - getMinimumOccurrenceDTF
```
public RemoveAttributeTransform getMinimumOccurrenceDTF(int minCount)
```
    Creates a new transform factory to remove all features for tokens that did not occur at least a given number of times
    
    Parameters:
    
    minCount - the minimum number of occurrences to be kept as a feature
    
    Returns:
    
    a transform factory for removing features that did not occur often enough
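    The selection rule behind a minimum-occurrence filter can be sketched independently of the transform machinery. The code below only computes which feature indices would be dropped; `MinOccurrenceSketch` and `featuresToRemove` are hypothetical names, and treating a count equal to `minCount` as "kept" is an assumption based on the parameter description above. The real method returns a RemoveAttributeTransform rather than a set of indices.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of minimum-occurrence feature selection; not JSAT code.
class MinOccurrenceSketch {
    // Returns the feature indices whose count is below minCount.
    // Assumption: a feature seen exactly minCount times is kept.
    static Set<Integer> featuresToRemove(Map<Integer, Integer> termCounts, int minCount) {
        Set<Integer> remove = new TreeSet<>();
        for (Map.Entry<Integer, Integer> entry : termCounts.entrySet()) {
            if (entry.getValue() < minCount) {
                remove.add(entry.getKey());
            }
        }
        return remove;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = new HashMap<>();
        counts.put(0, 5); // frequent token
        counts.put(1, 1); // seen once
        counts.put(2, 2); // seen exactly minCount times
        System.out.println(featuresToRemove(counts, 2)); // prints [1]
    }
}
```

    Dropping rare tokens this way shrinks the feature space, which is often useful because tokens seen only once or twice carry little signal for most models.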
