NaiveTokenizer (Java Statistical Analysis Tool 0.0.8 API)

java.lang.Object
- jsat.text.tokenizer.NaiveTokenizer

All Implemented Interfaces:

Serializable, Tokenizer
```
public class NaiveTokenizer
extends Object
implements Tokenizer
```
A simple tokenizer. By default it converts everything to lower case and splits on white space; any character that is not a letter, digit, or white space is treated as white space. These defaults can be adjusted, and a minimum and maximum allowed token length can be set. The length limits are useful for removing small words and for dealing with noisy documents.
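
The default behavior described above can be approximated in plain Java. This is a minimal sketch, not JSAT's source: the class `DefaultTokenizeSketch` is hypothetical, and it assumes that empty tokens (for example between two consecutive separators) are dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of JSAT.
final class DefaultTokenizeSketch {

    // Lower-cases letters, keeps digits, and ends the current token at any other character.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // White space or any "other" character closes the token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the final token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, world, it, s, 2024]
        System.out.println(tokenize("Hello, World! It's 2024."));
    }
}
```

Note how the apostrophe in "It's" splits the word in two, because it is neither a letter, a digit, nor white space.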

Author:

Edward Raff

See Also:

Serialized Form

Constructor Summary

Constructors
Constructor and Description
`NaiveTokenizer()` Creates a new naive tokenizer that converts words to lower case
`NaiveTokenizer(boolean useLowerCase)` Creates a new naive tokenizer

Method Summary

Modifier and Type	Method and Description
`int`	`getMaxTokenLength()` Returns the maximum allowed token length
`int`	`getMinTokenLength()` Returns the minimum allowed token length
`boolean`	`isNoDigits()` Returns `true` if digits are not allowed in tokens, `false` otherwise.
`boolean`	`isOtherToWhiteSpace()` Returns whether or not all other illegal characters are treated as whitespace, or ignored completely.
`boolean`	`isUseLowerCase()` Returns `true` if letters are converted to lower case, `false` for case sensitive
`void`	`setMaxTokenLength(int maxTokenLength)` Sets the maximum allowed length for any token.
`void`	`setMinTokenLength(int minTokenLength)` Sets the minimum allowed token length.
`void`	`setNoDigits(boolean noDigits)` Sets whether digits will be accepted in tokens or treated as "other" (neither white space nor a letter).
`void`	`setOtherToWhiteSpace(boolean otherToWhiteSpace)` Sets whether characters that are neither letters nor digits are treated as white space or ignored completely.
`void`	`setUseLowerCase(boolean useLowerCase)` Sets whether characters are converted to lower case.
`List<String>`	`tokenize(String input)` Breaks the input string into a series of tokens that may be used as features for a classifier.
`void`	`tokenize(String input, StringBuilder workSpace, List<String> storageSpace)` Breaks the input string into a series of tokens that may be used as features for a classifier.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - NaiveTokenizer
```
public NaiveTokenizer()
```
    Creates a new naive tokenizer that converts words to lower case
  - NaiveTokenizer
```
public NaiveTokenizer(boolean useLowerCase)
```
    Creates a new naive tokenizer
    
    Parameters:
    
    useLowerCase - true to convert everything to lower, false to leave the case as is
- Method Detail
  - setUseLowerCase
```
public void setUseLowerCase(boolean useLowerCase)
```
    Sets whether characters are converted to lower case.
    
    Parameters:
    
    useLowerCase - true to convert everything to lower case, false to leave the case as is
  - isUseLowerCase
```
public boolean isUseLowerCase()
```
    Returns true if letters are converted to lower case, false for case sensitive
    
    Returns:
    
    true if letters are converted to lower case, false if case is preserved
  - setOtherToWhiteSpace
```
public void setOtherToWhiteSpace(boolean otherToWhiteSpace)
```
    Sets whether characters that are neither letters nor digits are treated as white space or ignored completely. If ignored, the tokenizer parses the string as if every character other than a letter, digit, or white space did not exist in the original string.
    
    Setting this to false can result in a lower feature count, especially for noisy documents.
    
    Parameters:
    
    otherToWhiteSpace - true to treat all "other" characters as white space, false to ignore them
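    
    The difference between the two settings can be sketched in plain Java. The class `OtherCharsSketch` below is hypothetical and is not JSAT's implementation; it only mirrors the documented rule, and it assumes empty tokens are dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of JSAT.
final class OtherCharsSketch {

    static List<String> tokenize(String input, boolean otherToWhiteSpace) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (!otherToWhiteSpace && !Character.isWhitespace(c)) {
                continue; // "other" character is ignored, the token keeps growing
            } else if (current.length() > 0) {
                // White space, or an "other" character treated as white space.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("don't stop-me", true));  // prints [don, t, stop, me]
        System.out.println(tokenize("don't stop-me", false)); // prints [dont, stopme]
    }
}
```

    In this sketch, ignoring the "other" characters merges fragments such as "don" and "t" into a single token, which is one way the feature count can drop.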
  - isOtherToWhiteSpace
```
public boolean isOtherToWhiteSpace()
```
    Returns whether or not all other illegal characters are treated as whitespace, or ignored completely.
    
    Returns:
    
    true if all other characters are treated as whitespace
  - tokenize
```
public List<String> tokenize(String input)
```
    Description copied from interface: Tokenizer
    
    Breaks the input string into a series of tokens that may be used as features for a classifier. The returned tokens must be either new string objects or interned strings. If a token is returned that is backed by the original document, memory may get leaked by processes consuming the token.
    This method should be thread safe.
    
    Specified by:
    
    tokenize in interface Tokenizer
    
    Parameters:
    
    input - the string to tokenize
    
    Returns:
    
    the list of tokens extracted from the input
  - tokenize
```
public void tokenize(String input,
                     StringBuilder workSpace,
                     List<String> storageSpace)
```
    Description copied from interface: Tokenizer
    
    Breaks the input string into a series of tokens that may be used as features for a classifier. The returned tokens must be either new string objects or interned strings. If a token is returned that is backed by the original document, memory may get leaked by processes consuming the token.
    This method should be thread safe.
    
    Specified by:
    
    tokenize in interface Tokenizer
    
    Parameters:
    
    input - the string to tokenize
    
    workSpace - an already allocated (but empty) string builder that can be used as a temporary work space.
    
    storageSpace - an already allocated (but empty) list to place the tokens into
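    
    The three-argument form lets a caller reuse the same buffers across many documents. The sketch below shows that calling pattern only: `WorkspaceReuseSketch` and its `tokenize` stand-in are hypothetical, share just the parameter shape of the documented method, and are not JSAT code. Because both arguments are expected to be empty, the caller clears them before each call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of JSAT.
final class WorkspaceReuseSketch {

    // Stand-in with the same three-argument shape as the documented method.
    static void tokenize(String input, StringBuilder workSpace, List<String> storageSpace) {
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                workSpace.append(Character.toLowerCase(c));
            } else {
                flush(workSpace, storageSpace);
            }
        }
        flush(workSpace, storageSpace);
    }

    private static void flush(StringBuilder workSpace, List<String> storageSpace) {
        if (workSpace.length() > 0) {
            storageSpace.add(workSpace.toString()); // toString() creates a new String
            workSpace.setLength(0);
        }
    }

    // Counts tokens over many documents while allocating the buffers only once.
    static int countTokens(List<String> documents) {
        StringBuilder workSpace = new StringBuilder();
        List<String> storageSpace = new ArrayList<>();
        int total = 0;
        for (String doc : documents) {
            workSpace.setLength(0); // both buffers must be empty on entry
            storageSpace.clear();
            tokenize(doc, workSpace, storageSpace);
            total += storageSpace.size();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countTokens(Arrays.asList("a b c", "d e"))); // prints 5
    }
}
```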
  - setMaxTokenLength
```
public void setMaxTokenLength(int maxTokenLength)
```
    Sets the maximum allowed length for any token. Any token longer than this length is rejected and skipped. The default is unbounded.
    
    Parameters:
    
    maxTokenLength - the maximum token length to accept as a valid token
  - getMaxTokenLength
```
public int getMaxTokenLength()
```
    Returns the maximum allowed token length
    
    Returns:
    
    the maximum allowed token length
  - setMinTokenLength
```
public void setMinTokenLength(int minTokenLength)
```
    Sets the minimum allowed token length. Any token shorter than the minimum length is rejected and skipped. The default is 0.
    
    Parameters:
    
    minTokenLength - the minimum length for a token to be used
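    
    The effect of the minimum and maximum length limits can be illustrated with a small filter. This is a sketch under the reading that both bounds are inclusive, as the descriptions suggest; `LengthFilterSketch` and `filterByLength` are hypothetical names and not part of JSAT.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of JSAT.
final class LengthFilterSketch {

    // Keeps tokens whose length lies in [minTokenLength, maxTokenLength].
    static List<String> filterByLength(List<String> tokens, int minTokenLength, int maxTokenLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            int length = token.length();
            if (length >= minTokenLength && length <= maxTokenLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "of", "token", "supercalifragilistic");
        // prints [of, token]
        System.out.println(filterByLength(tokens, 2, 10));
    }
}
```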
  - getMinTokenLength
```
public int getMinTokenLength()
```
    Returns the minimum allowed token length
    
    Returns:
    
    the minimum allowed token length
  - setNoDigits
```
public void setNoDigits(boolean noDigits)
```
    Sets whether digits will be accepted in tokens or treated as "other" (neither white space nor a letter).
    The default is to allow digits.
    
    Parameters:
    
    noDigits - true to disallow numeric digits, false to allow digits.
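    
    Since disallowed digits fall into the "other" category, their handling depends on the `otherToWhiteSpace` setting as well. The plain-Java sketch below illustrates that interaction as read from the descriptions on this page; `NoDigitsSketch` is a hypothetical class, not JSAT's implementation, and it assumes empty tokens are dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of JSAT.
final class NoDigitsSketch {

    static List<String> tokenize(String input, boolean noDigits, boolean otherToWhiteSpace) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean accepted = Character.isLetter(c) || (!noDigits && Character.isDigit(c));
            if (accepted) {
                current.append(Character.toLowerCase(c));
            } else if (!otherToWhiteSpace && !Character.isWhitespace(c)) {
                continue; // "other" characters, including disallowed digits, are ignored
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abc123def", false, true)); // prints [abc123def]
        System.out.println(tokenize("abc123def", true, true));  // prints [abc, def]
        System.out.println(tokenize("abc123def", true, false)); // prints [abcdef]
    }
}
```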
  - isNoDigits
```
public boolean isNoDigits()
```
    Returns true if digits are not allowed in tokens, false otherwise.
    
    Returns:
    
    true if digits are not allowed in tokens, false otherwise.
