public abstract class TextDataLoader extends Object implements TextVectorCreator

Documents are broken into tokens by a Tokenizer, which must be provided. The weights used are determined by a word weighting scheme. Original documents are added with the addOriginalDocument(java.lang.String) method. finishAdding() must be called when no more documents are left to add, at which point the class takes care of calling the WordWeighting.setWeight(java.util.List, java.util.List) method to configure the word weighting used with the original data added.

New text is converted to a vector with the newText(java.lang.String) method. This should only be called after finishAdding(). getTextVectorCreator() will return an object that performs the transformation.

Modifier and Type | Field and Description |
---|---|
protected List<String> | allWords: list of all word tokens encountered in order of first observation |
protected boolean | noMoreAdding: true when finishAdding() is called, and no new original documents can be inserted |
protected ThreadLocal<List<String>> | storageSpace: temporary storage space to use for tokenization |
protected ConcurrentHashMap<Integer,AtomicInteger> | termDocumentFrequencys: the map of integer counts of how many times each word token was seen |
protected Tokenizer | tokenizer: tokenizer to apply to input strings |
protected List<SparseVector> | vectors: list of original vectors |
protected ThreadLocal<Map<String,Integer>> | wordCounts: temporary space to use when creating vectors |
protected ConcurrentHashMap<String,Integer> | wordIndex: maps words to their associated index in an array |
protected ThreadLocal<StringBuilder> | workSpace: temporary work space to use for tokenization |
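
The bookkeeping fields in the table above can be pictured with a small stand-alone sketch. It is not the library's implementation: VocabularySketch and observeDocument are hypothetical names, and counting each token at most once per document is an assumption suggested by the field name termDocumentFrequencys.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the loader's bookkeeping fields; not the library's code. */
final class VocabularySketch {
    // Maps each token to its feature index (the role of wordIndex).
    final ConcurrentHashMap<String, Integer> wordIndex = new ConcurrentHashMap<>();
    // Tokens in order of first observation (the role of allWords).
    final List<String> allWords = Collections.synchronizedList(new ArrayList<>());
    // Per-index counters (the role of termDocumentFrequencys).
    // Assumption: a token is counted at most once per document.
    final ConcurrentHashMap<Integer, AtomicInteger> termDocumentFrequencys = new ConcurrentHashMap<>();

    /** Records one already tokenized document. */
    synchronized void observeDocument(List<String> tokens) {
        Set<String> seenInDoc = new HashSet<>();
        for (String token : tokens) {
            Integer index = wordIndex.get(token);
            if (index == null) {
                // First time this token is seen: assign the next free index.
                index = allWords.size();
                wordIndex.put(token, index);
                allWords.add(token);
            }
            if (seenInDoc.add(token)) {
                // First occurrence of the token within this document.
                termDocumentFrequencys
                        .computeIfAbsent(index, k -> new AtomicInteger())
                        .incrementAndGet();
            }
        }
    }
}
```

In this sketch the index of a token is also its position in allWords, so a lookup in either direction (token to index, index to token) stays cheap.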
Constructor and Description |
---|
TextDataLoader(Tokenizer tokenizer, WordWeighting weighting): Creates a new loader for text datasets |
Modifier and Type | Method and Description |
---|---|
protected int | addOriginalDocument(String text): To be called by the initialLoad() method. |
protected void | finishAdding(): Once all original documents have been added, this method is called so that post processing steps can be applied. |
DataSet | getDataSet(): Returns a new data set containing the original data points that were loaded with this loader. |
RemoveAttributeTransform | getMinimumOccurrenceDTF(int minCount): Creates a new transform factory to remove all features for tokens that did not occur a certain number of times. |
int | getTermFrequency(int index): Returns the number of times a token has been seen in the document. |
TextVectorCreator | getTextVectorCreator(): Returns the TextVectorCreator used by this data loader to convert documents into vectors. |
String | getWordForIndex(int index): Returns the original token for the given index in the data set. |
abstract void | initialLoad(): This method will load all the text documents that make up the original data set from their source. |
Vec | newText(String text): To be called after all original texts have been loaded. |
Vec | newText(String input, StringBuilder workSpace, List<String> storageSpace): Converts the given input text into a vector representation. |
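
The calling order implied by the methods above (add original documents, call finishAdding(), then convert new text) can be illustrated with a miniature stand-alone sketch. Everything in it is an assumption made for illustration: LoaderLifecycleSketch is a hypothetical class, tokenization is a lower-cased whitespace split, the weights are raw counts instead of a WordWeighting, unknown tokens are ignored, and the int returned by addOriginalDocument is taken to be the document's position. It does not use or reproduce the library's code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical miniature of the add / finish / convert lifecycle. */
final class LoaderLifecycleSketch {
    private final Map<String, Integer> wordIndex = new HashMap<>();
    private final List<Map<Integer, Integer>> documents = new ArrayList<>();
    private boolean noMoreAdding = false;

    /** Adds one original document; assumed here to return its position. */
    int addOriginalDocument(String text) {
        if (noMoreAdding) {
            throw new IllegalStateException("finishAdding() was already called");
        }
        Map<Integer, Integer> counts = new HashMap<>();
        for (String token : tokenize(text)) {
            Integer index = wordIndex.get(token);
            if (index == null) {
                index = wordIndex.size();
                wordIndex.put(token, index);
            }
            counts.merge(index, 1, Integer::sum);
        }
        documents.add(counts);
        return documents.size() - 1;
    }

    /** Freezes the vocabulary; no more original documents may be added. */
    void finishAdding() {
        noMoreAdding = true;
    }

    /** Converts new text into a dense count vector over the frozen vocabulary. */
    double[] newText(String text) {
        if (!noMoreAdding) {
            throw new IllegalStateException("call finishAdding() first");
        }
        double[] vec = new double[wordIndex.size()];
        for (String token : tokenize(text)) {
            Integer index = wordIndex.get(token);
            if (index != null) {
                vec[index] += 1.0; // tokens outside the vocabulary are skipped
            }
        }
        return vec;
    }

    // Assumed tokenizer: lower-case the text, then split on whitespace.
    private static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

Freezing the vocabulary before converting new text keeps every vector the same length, which is one plausible reason for the rule that newText should only be called after finishAdding().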
protected final List<SparseVector> vectors
protected Tokenizer tokenizer
protected ConcurrentHashMap<String,Integer> wordIndex
protected List<String> allWords
protected ConcurrentHashMap<Integer,AtomicInteger> termDocumentFrequencys
protected ThreadLocal<StringBuilder> workSpace
protected ThreadLocal<List<String>> storageSpace
protected ThreadLocal<Map<String,Integer>> wordCounts
protected boolean noMoreAdding
true when finishAdding() is called, and no new original documents can be inserted

public TextDataLoader(Tokenizer tokenizer, WordWeighting weighting)
tokenizer - the tokenization method to break up strings with
weighting - the scheme to set the weights for feature vectors.

public abstract void initialLoad()
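This method loads all the text documents that make up the original data set from their source. For each document, addOriginalDocument(java.lang.String) should be called with the text of the document. This method is called when getDataSet() is called for the first time. After loading, new document vectors can be obtained by calling newText(java.lang.String).

protected int addOriginalDocument(String text)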
To be called by the initialLoad() method. It will take in the text and add a new document vector to the data set. Once all text documents have been loaded, this method should never be called again.
text - the text of the document to add

protected void finishAdding()
public DataSet getDataSet()
public Vec newText(String text)
newText in interface TextVectorCreator
text - the text of the document to create a document vector from

public Vec newText(String input, StringBuilder workSpace, List<String> storageSpace)
Converts the given input text into a vector representation.
newText in interface TextVectorCreator
input - the input string
workSpace - an already allocated (but empty) string builder that can be used as a temporary work space.
storageSpace - an already allocated (but empty) list to place the tokens into

public TextVectorCreator getTextVectorCreator()
Returns the TextVectorCreator used by this data loader to convert documents into vectors.

public String getWordForIndex(int index)
index - the numeric feature index

public int getTermFrequency(int index)
index - the numeric feature index

public RemoveAttributeTransform getMinimumOccurrenceDTF(int minCount)
minCount - the minimum number of occurrences to be kept as a feature

Copyright © 2017. All rights reserved.