public class NaiveTokenizer extends Object implements Tokenizer
Constructor and Description |
---|
NaiveTokenizer() - Creates a new naive tokenizer that converts words to lower case. |
NaiveTokenizer(boolean useLowerCase) - Creates a new naive tokenizer. |
Modifier and Type | Method and Description |
---|---|
int | getMaxTokenLength() - Returns the maximum allowed token length. |
int | getMinTokenLength() - Returns the minimum allowed token length. |
boolean | isNoDigits() - Returns true if digits are not allowed in tokens, false otherwise. |
boolean | isOtherToWhiteSpace() - Returns whether all other illegal characters are treated as whitespace or ignored completely. |
boolean | isUseLowerCase() - Returns true if letters are converted to lower case, false if tokens are case sensitive. |
void | setMaxTokenLength(int maxTokenLength) - Sets the maximum allowed length for any token. |
void | setMinTokenLength(int minTokenLength) - Sets the minimum allowed token length. |
void | setNoDigits(boolean noDigits) - Sets whether digits will be accepted in tokens or treated as "other" (neither white space nor a letter). |
void | setOtherToWhiteSpace(boolean otherToWhiteSpace) - Sets whether all characters that are not letters or digits are treated as white space or ignored completely. |
void | setUseLowerCase(boolean useLowerCase) - Sets whether characters are converted to lower case. |
List<String> | tokenize(String input) - Breaks the input string into a series of tokens that may be used as features for a classifier. |
void | tokenize(String input, StringBuilder workSpace, List<String> storageSpace) - Breaks the input string into a series of tokens that may be used as features for a classifier. |
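To make the interaction of the character options concrete, here is a minimal, self-contained sketch of how the lower case, "other to white space", and "no digits" settings described in this summary might combine. It is an illustration written from the descriptions only, not this class's implementation: the class `NaiveTokenizerSketch`, its static `tokenize` method, and the choice to drop empty tokens are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in that mimics the documented options; not the library's code. */
final class NaiveTokenizerSketch {

    static List<String> tokenize(String input, boolean useLowerCase,
                                 boolean otherToWhiteSpace, boolean noDigits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetter(c)) {
                // Letters are always kept, optionally folded to lower case.
                current.append(useLowerCase ? Character.toLowerCase(c) : c);
            } else if (!noDigits && Character.isDigit(c)) {
                // Digits are kept only when they are allowed.
                current.append(c);
            } else if (Character.isWhitespace(c) || otherToWhiteSpace) {
                // White space, or an "other" character treated as white space, ends the token.
                flush(current, tokens);
            }
            // Otherwise the "other" character is ignored completely.
        }
        flush(current, tokens);
        return tokens;
    }

    /** Assumption of this sketch: empty tokens are never emitted. */
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String text = "Well-known A1 sauce";
        System.out.println(tokenize(text, true, true, false));  // [well, known, a1, sauce]
        System.out.println(tokenize(text, true, false, false)); // [wellknown, a1, sauce]
        System.out.println(tokenize(text, false, true, true));  // [Well, known, A, sauce]
    }
}
```

In this sketch, ignoring "other" characters merges `Well-known` into a single token, while treating them as white space splits it in two; with digits disallowed, the `1` in `A1` is handled like any other "other" character.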
public NaiveTokenizer()
public NaiveTokenizer(boolean useLowerCase)
useLowerCase - true to convert everything to lower case, false to leave the case as is
public void setUseLowerCase(boolean useLowerCase)
useLowerCase - true to convert characters to lower case, false to leave the case as is
public boolean isUseLowerCase()
Returns true if letters are converted to lower case, false if tokens are case sensitive
public void setOtherToWhiteSpace(boolean otherToWhiteSpace)
Setting this to false can result in a lower feature count, especially for noisy documents.
otherToWhiteSpace - true to treat all "other" characters as white space, false to ignore them
public boolean isOtherToWhiteSpace()
Returns true if all other characters are treated as whitespace
public List<String> tokenize(String input)
Specified by: tokenize in interface Tokenizer
public void tokenize(String input, StringBuilder workSpace, List<String> storageSpace)
Specified by: tokenize in interface Tokenizer
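The three-argument form takes a caller-supplied StringBuilder and List<String>, which suggests that both can be reused across many documents to avoid repeated allocation. The sketch below illustrates that calling pattern with a hypothetical whitespace-only splitter standing in for a real Tokenizer. The class `WorkSpaceReuseSketch`, the helper `countTokens`, and the assumption that the caller clears `storageSpace` between documents are not taken from this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of reusing the work space and storage list. */
final class WorkSpaceReuseSketch {

    /** Stand-in tokenizer: splits on white space only and appends tokens to storageSpace. */
    static void tokenize(String input, StringBuilder workSpace, List<String> storageSpace) {
        workSpace.setLength(0); // start from an empty buffer
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                emit(workSpace, storageSpace);
            } else {
                workSpace.append(c);
            }
        }
        emit(workSpace, storageSpace);
    }

    private static void emit(StringBuilder workSpace, List<String> storageSpace) {
        if (workSpace.length() > 0) {
            storageSpace.add(workSpace.toString()); // toString() copies, so reuse is safe
            workSpace.setLength(0);
        }
    }

    /** Tokenizes every document with one shared buffer and one shared list. */
    static int countTokens(List<String> documents) {
        StringBuilder workSpace = new StringBuilder();
        List<String> storageSpace = new ArrayList<>();
        int total = 0;
        for (String document : documents) {
            storageSpace.clear(); // assumption: the caller clears the list between documents
            tokenize(document, workSpace, storageSpace);
            total += storageSpace.size();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countTokens(Arrays.asList("a b c", "d e", ""))); // 5
    }
}
```

Only two objects are allocated up front regardless of how many documents are processed; each token is still copied into its own String before the buffer is reset.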
public void setMaxTokenLength(int maxTokenLength)
maxTokenLength - the maximum token length to accept as a valid token
public int getMaxTokenLength()
public void setMinTokenLength(int minTokenLength)
minTokenLength - the minimum length for a token to be used
public int getMinTokenLength()
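As a rough illustration of what the minimum and maximum token lengths control, the following sketch filters a list of candidate tokens by length. The class `TokenLengthFilterSketch` and the method `keepWithinBounds` are hypothetical, and treating both bounds as inclusive is an assumption of the example rather than something this page states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of token length bounds; not the library's code. */
final class TokenLengthFilterSketch {

    /** Keeps tokens whose length lies in [minTokenLength, maxTokenLength] (inclusive, by assumption). */
    static List<String> keepWithinBounds(List<String> tokens, int minTokenLength, int maxTokenLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            int length = token.length();
            if (length >= minTokenLength && length <= maxTokenLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("a", "to", "token", "supercalifragilistic");
        System.out.println(keepWithinBounds(candidates, 2, 10)); // [to, token]
    }
}
```

A lower bound of this kind discards very short fragments, and an upper bound discards unusually long strings that are unlikely to be useful features.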
public void setNoDigits(boolean noDigits)
noDigits - true to disallow numeric digits, false to allow digits.
public boolean isNoDigits()
Returns true if digits are not allowed in tokens, false otherwise.
Copyright © 2017. All rights reserved.