public abstract class HashedTextDataLoader extends Object implements TextVectorCreator
Text is split into tokens by a Tokenizer, which must be provided. The weights used are determined by a word weighting scheme. Original documents are added through the addOriginalDocument(java.lang.String) method. The finishAdding() method must be called when no more documents are left to add, at which point the class takes care of calling the WordWeighting.setWeight(java.util.List, java.util.List) method to configure the word weighting used with the original data added. New text is converted to a vector with the newText(java.lang.String) method. This should only be called after finishAdding(). getTextVectorCreator() will return an object that performs the transformation.
Modifier and Type | Field and Description |
---|---|
protected boolean | noMoreAdding |
protected ThreadLocal<List<String>> | storageSpace - Temporary storage space to use for tokenization |
protected List<SparseVector> | vectors - List of original vectors |
protected ThreadLocal<Map<String,Integer>> | wordCounts - Temporary space to use when creating vectors |
protected ThreadLocal<StringBuilder> | workSpace - Temporary work space to use for tokenization |
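
The workSpace and storageSpace fields above are per-thread scratch buffers. The following is a minimal sketch of that pattern, not the library's code: the class name ScratchTokenizer and its tokenization rule (split on anything that is not a letter or digit, lower-case the rest) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not part of the documented library.
class ScratchTokenizer {
    // One reusable StringBuilder per thread, created lazily on first use.
    private static final ThreadLocal<StringBuilder> WORK_SPACE =
            ThreadLocal.withInitial(() -> new StringBuilder());
    // One reusable token list per thread.
    private static final ThreadLocal<List<String>> STORAGE_SPACE =
            ThreadLocal.withInitial(() -> new ArrayList<String>());

    // Assumed rule: tokens are maximal runs of letters or digits, lower-cased.
    static List<String> tokenize(String text) {
        StringBuilder sb = WORK_SPACE.get();
        List<String> tokens = STORAGE_SPACE.get();
        // Clear whatever a previous call on this thread left behind.
        sb.setLength(0);
        tokens.clear();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toLowerCase(c));
            } else if (sb.length() > 0) {
                tokens.add(sb.toString());
                sb.setLength(0);
            }
        }
        if (sb.length() > 0) {
            tokens.add(sb.toString());
            sb.setLength(0);
        }
        // Return a copy so callers never hold a reference to the shared buffer.
        return new ArrayList<>(tokens);
    }
}
```

Because each thread gets its own buffers, concurrent calls do not interfere with each other and no allocation of scratch space is needed per call.
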
Constructor and Description |
---|
HashedTextDataLoader(int dimensionSize, Tokenizer tokenizer, WordWeighting weighting) |
HashedTextDataLoader(Tokenizer tokenizer, WordWeighting weighting) |
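
The class name and the dimensionSize parameter suggest that tokens are mapped into a fixed-length vector by hashing. This page does not say how that mapping is done, so the sketch below is only one common way to do it: the class HashedBagOfWords is hypothetical, and the use of String.hashCode() with raw counts (rather than weights from a WordWeighting) is an assumption.

```java
import java.util.List;

// Hypothetical illustration of feature hashing; not the library's implementation.
class HashedBagOfWords {
    // Maps each token to an index in [0, dimensionSize) and counts occurrences.
    static double[] toVector(List<String> tokens, int dimensionSize) {
        double[] vec = new double[dimensionSize];
        for (String token : tokens) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(token.hashCode(), dimensionSize);
            vec[index] += 1.0;
        }
        return vec;
    }
}
```

The vector length stays fixed no matter how many distinct words appear, at the cost of occasional collisions where two different tokens share an index.
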
Modifier and Type | Method and Description |
---|---|
protected int | addOriginalDocument(String text) - To be called by the initialLoad() method. |
protected void | finishAdding() - Once all original documents have been added, this method is called so that post processing steps can be applied. |
DataSet | getDataSet() - Returns a new data set containing the original data points that were loaded with this loader. |
TextVectorCreator | getTextVectorCreator() - Returns the TextVectorCreator used by this data loader to convert documents into vectors. |
protected abstract void | initialLoad() - This method will load all the text documents that make up the original data set from their source. |
Vec | newText(String input) - Converts the given input text into a vector representation. |
Vec | newText(String input, StringBuilder workSpace, List<String> storageSpace) - Converts the given input text into a vector representation. |
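
The methods above imply a call order: initialLoad() feeds documents to addOriginalDocument(String), finishAdding() closes the original data set, and only then is newText(String) used. The following self-contained sketch mirrors that order with stand-in types. SketchLoader and InMemoryLoader are hypothetical names, plain double[] arrays replace the library's vector types, and the choices to throw IllegalStateException on misuse and to call finishAdding() from getDataSet() are assumptions rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors the documented call order; not the library class.
abstract class SketchLoader {
    protected final List<double[]> vectors = new ArrayList<>();
    protected boolean noMoreAdding = false;
    private final int dimensionSize;

    SketchLoader(int dimensionSize) {
        this.dimensionSize = dimensionSize;
    }

    // Subclasses read their documents here and pass each one to addOriginalDocument.
    protected abstract void initialLoad();

    // Returns the position of the stored vector (an assumed meaning for the int result).
    protected int addOriginalDocument(String text) {
        if (noMoreAdding) {
            throw new IllegalStateException("finishAdding() was already called");
        }
        vectors.add(hash(text));
        return vectors.size() - 1;
    }

    protected void finishAdding() {
        noMoreAdding = true;
    }

    // The first call triggers loading; later calls reuse the stored vectors.
    public List<double[]> getDataSet() {
        if (!noMoreAdding) {
            initialLoad();
            finishAdding(); // assumption: the loader closes the data set itself
        }
        return new ArrayList<>(vectors);
    }

    public double[] newText(String input) {
        if (!noMoreAdding) {
            throw new IllegalStateException("call getDataSet() or finishAdding() first");
        }
        return hash(input);
    }

    // Whitespace tokenization plus hashed counts; both are simplifying assumptions.
    private double[] hash(String text) {
        double[] vec = new double[dimensionSize];
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                vec[Math.floorMod(token.hashCode(), dimensionSize)] += 1.0;
            }
        }
        return vec;
    }
}

// Example subclass that "loads" documents from an in-memory list.
class InMemoryLoader extends SketchLoader {
    private final List<String> docs;

    InMemoryLoader(int dimensionSize, List<String> docs) {
        super(dimensionSize);
        this.docs = docs;
    }

    @Override
    protected void initialLoad() {
        for (String doc : docs) {
            addOriginalDocument(doc);
        }
    }
}
```

A caller would construct an InMemoryLoader, call getDataSet() once to load the original documents, and then call newText(String) for any later input.
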
protected List<SparseVector> vectors
protected boolean noMoreAdding
protected ThreadLocal<StringBuilder> workSpace
protected ThreadLocal<List<String>> storageSpace
protected ThreadLocal<Map<String,Integer>> wordCounts
public HashedTextDataLoader(Tokenizer tokenizer, WordWeighting weighting)
public HashedTextDataLoader(int dimensionSize, Tokenizer tokenizer, WordWeighting weighting)
protected abstract void initialLoad()
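This method will load all the text documents that make up the original data set from their source. For each document, addOriginalDocument(java.lang.String) should be called with the text of the document. This method is invoked when getDataSet() is called for the first time. Afterwards, new document vectors can be obtained by calling newText(java.lang.String).
protected int addOriginalDocument(String text)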
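To be called by the initialLoad() method. It will take in the text and add a new document vector to the data set. Once all text documents have been loaded, this method should never be called again.
text - the text of the document to add
protected void finishAdding()
Once all original documents have been added, this method is called so that post processing steps can be applied.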
public DataSet getDataSet()
Returns a new data set containing the original data points that were loaded with this loader.
public Vec newText(String input)
Converts the given input text into a vector representation.
Specified by: newText in interface TextVectorCreator
input - the input string
public Vec newText(String input, StringBuilder workSpace, List<String> storageSpace)
Converts the given input text into a vector representation.
Specified by: newText in interface TextVectorCreator
input - the input string
workSpace - an already allocated (but empty) string builder that can be used as a temporary work space
storageSpace - an already allocated (but empty) list to place the tokens into
public TextVectorCreator getTextVectorCreator()
Returns the TextVectorCreator used by this data loader to convert documents into vectors.
Copyright © 2017. All rights reserved.