public class StopWordTokenizer extends Object implements Tokenizer
Modifier and Type | Field and Description |
---|---|
static Set<String> | ENGLISH_STOP_SMALL_BASE: This unmodifiable set contains a very small and simple stop word list for English based on the 100 most common English words and includes all characters. |
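Because the field is described as unmodifiable, adding words to it directly is expected to fail; the usual pattern is to copy it into a mutable set and extend the copy. The sketch below illustrates this with a stand-in set. `SMALL_BASE`, its placeholder words, and the `extend` helper are assumptions made for illustration and are not part of the documented API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class StopSetDemo {
    // Stand-in for ENGLISH_STOP_SMALL_BASE; these words are placeholders,
    // not the library's actual list.
    static final Set<String> SMALL_BASE = Collections.unmodifiableSet(
            new HashSet<String>(Arrays.asList("the", "of", "and", "a")));

    // Hypothetical helper: returns a new mutable set holding the base
    // words plus the extra ones, leaving the base set untouched.
    static Set<String> extend(Set<String> base, String... extra) {
        Set<String> copy = new HashSet<String>(base);
        copy.addAll(Arrays.asList(extra));
        return copy;
    }

    public static void main(String[] args) {
        try {
            SMALL_BASE.add("foo"); // unmodifiable views reject mutation
        } catch (UnsupportedOperationException e) {
            System.out.println("base set cannot be modified");
        }
        Set<String> custom = extend(SMALL_BASE, "foo", "bar");
        System.out.println(custom.contains("foo"));     // true
        System.out.println(SMALL_BASE.contains("foo")); // false
    }
}
```

A set built this way could then be passed to the constructor that accepts a `Collection<String>`.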
Constructor and Description |
---|
StopWordTokenizer(Tokenizer base, Collection<String> stopWords): Creates a new stop word tokenizer. |
StopWordTokenizer(Tokenizer base, String... stopWords): Creates a new stop word tokenizer. |
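The two constructors differ only in how the stop words are supplied: as a collection or as varargs. A minimal sketch of that pairing follows, assuming the varargs form simply delegates to the collection form and that the words are copied on construction. `StopWordHolder` and `isStopWord` are hypothetical names, and the `base` tokenizer parameter is omitted to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical holder that mimics the two constructor shapes listed above.
class StopWordHolder {
    private final Set<String> stopWords;

    StopWordHolder(Collection<String> stopWords) {
        // Copy so later changes to the caller's collection are not seen here.
        this.stopWords = new HashSet<String>(stopWords);
    }

    StopWordHolder(String... stopWords) {
        // Assumed delegation: wrap the array and reuse the collection form.
        this(Arrays.asList(stopWords));
    }

    boolean isStopWord(String token) {
        return stopWords.contains(token);
    }
}

class StopWordHolderDemo {
    public static void main(String[] args) {
        List<String> words = new ArrayList<String>(Arrays.asList("the", "of"));
        StopWordHolder fromCollection = new StopWordHolder(words);
        words.add("and"); // mutate the original after construction
        System.out.println(fromCollection.isStopWord("and")); // false

        StopWordHolder fromVarargs = new StopWordHolder("the", "of");
        System.out.println(fromVarargs.isStopWord("of")); // true
    }
}
```

Copying into a `HashSet` also gives constant-time membership checks on average, which matters when every token is tested against the stop word list.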
Modifier and Type | Method and Description |
---|---|
List<String> | tokenize(String input): Breaks the input string into a series of tokens that may be used as features for a classifier. |
void | tokenize(String input, StringBuilder workSpace, List<String> storageSpace): Breaks the input string into a series of tokens that may be used as features for a classifier. |
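The two `tokenize` overloads can be pictured as a thin filter around a base tokenizer: the base produces tokens and the wrapper drops any that appear in the stop word set. The sketch below is one way this might look, not the library's implementation. `SimpleTokenizer`, `WhitespaceTokenizer`, and `StopWordFilter` are hypothetical stand-ins, and it assumes that `workSpace` is a reusable scratch buffer and that tokens are appended to `storageSpace`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical minimal interface mirroring the two methods listed above.
interface SimpleTokenizer {
    List<String> tokenize(String input);
    void tokenize(String input, StringBuilder workSpace, List<String> storageSpace);
}

// Stand-in base tokenizer: lower-cases the input and splits on whitespace.
class WhitespaceTokenizer implements SimpleTokenizer {
    public List<String> tokenize(String input) {
        List<String> out = new ArrayList<String>();
        tokenize(input, new StringBuilder(), out);
        return out;
    }

    public void tokenize(String input, StringBuilder workSpace, List<String> storageSpace) {
        workSpace.setLength(0); // reuse the caller's buffer
        for (int i = 0; i <= input.length(); i++) {
            boolean end = (i == input.length());
            if (end || Character.isWhitespace(input.charAt(i))) {
                if (workSpace.length() > 0) {
                    storageSpace.add(workSpace.toString());
                    workSpace.setLength(0);
                }
            } else {
                workSpace.append(Character.toLowerCase(input.charAt(i)));
            }
        }
    }
}

// Sketch of a stop word filter that wraps a base tokenizer.
class StopWordFilter implements SimpleTokenizer {
    private final SimpleTokenizer base;
    private final Set<String> stopWords;

    StopWordFilter(SimpleTokenizer base, Collection<String> stopWords) {
        this.base = base;
        this.stopWords = new HashSet<String>(stopWords); // own copy of the words
    }

    public List<String> tokenize(String input) {
        List<String> tokens = base.tokenize(input);
        tokens.removeAll(stopWords);
        return tokens;
    }

    public void tokenize(String input, StringBuilder workSpace, List<String> storageSpace) {
        int start = storageSpace.size();
        base.tokenize(input, workSpace, storageSpace);
        // Filter only the tokens added by this call.
        storageSpace.subList(start, storageSpace.size()).removeAll(stopWords);
    }
}

class StopWordFilterDemo {
    public static void main(String[] args) {
        SimpleTokenizer t = new StopWordFilter(
                new WhitespaceTokenizer(), Arrays.asList("the", "of"));
        System.out.println(t.tokenize("The history of the tokenizer"));
        // prints [history, tokenizer]
    }
}
```

The three-argument form lets a caller reuse one `StringBuilder` and one `List` across many documents, which avoids allocating new buffers for every input.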
public static final Set<String> ENGLISH_STOP_SMALL_BASE
public StopWordTokenizer(Tokenizer base, Collection<String> stopWords)
Parameters:
base - the base tokenizer to use
stopWords - the collection of stop words to remove from tokenizations. A copy of the collection will be made.
public List<String> tokenize(String input)
Specified by: tokenize in interface Tokenizer
public void tokenize(String input, StringBuilder workSpace, List<String> storageSpace)
Specified by: tokenize in interface Tokenizer
Copyright © 2017. All rights reserved.