StopWordTokenizer (Java Statistical Analysis Tool 0.0.8 API)

java.lang.Object
- jsat.text.tokenizer.StopWordTokenizer

All Implemented Interfaces:

Serializable, Tokenizer
```
public class StopWordTokenizer
extends Object
implements Tokenizer
```
This tokenizer wraps another such that any stop words that would have been returned by the base tokenizer are removed. The stop list is case sensitive.
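
As a minimal sketch of the wrapping behavior described above, and not JSAT's actual implementation, the following self-contained example uses a hypothetical `SimpleTokenizer` interface as a stand-in for `Tokenizer` and a hypothetical `StopFilterSketch` class as a stand-in for `StopWordTokenizer`. The two whitespace-based base tokenizers are also assumptions made for illustration. It shows what case-sensitive matching implies: a lowercase stop word does not remove a capitalized token unless the base tokenizer lowercases its output first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for jsat.text.tokenizer.Tokenizer (single-method form).
interface SimpleTokenizer {
    List<String> tokenize(String input);
}

// Hypothetical stand-in for StopWordTokenizer: filters the base tokenizer's output.
class StopFilterSketch implements SimpleTokenizer {
    private final SimpleTokenizer base;
    private final Set<String> stopWords;

    StopFilterSketch(SimpleTokenizer base, String... stopWords) {
        this.base = base;
        this.stopWords = new HashSet<>(Arrays.asList(stopWords));
    }

    @Override
    public List<String> tokenize(String input) {
        List<String> kept = new ArrayList<>();
        for (String token : base.tokenize(input)) {
            // Exact, case-sensitive comparison against the stop list.
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Whitespace splitter used as the base tokenizer in this sketch.
    static SimpleTokenizer whitespace() {
        return input -> {
            List<String> tokens = new ArrayList<>();
            for (String t : input.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);
                }
            }
            return tokens;
        };
    }

    // Same splitter, but lowercases the input first so a lowercase stop list matches.
    static SimpleTokenizer lowercasingWhitespace() {
        SimpleTokenizer ws = whitespace();
        return input -> ws.tokenize(input.toLowerCase(Locale.ROOT));
    }
}
```

With the stop words "the" and "a", the plain whitespace base keeps "The" in "The cat saw a bird", while the lowercasing base removes it.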

Author:

Edward Raff

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`static Set<String>`	`ENGLISH_STOP_SMALL_BASE` This unmodifiable set contains a very small and simple stop word list for English based on the 100 most common English words and includes all characters.

Constructor Summary

Constructors
Constructor and Description
`StopWordTokenizer(Tokenizer base, Collection<String> stopWords)` Creates a new Stop Word tokenizer
`StopWordTokenizer(Tokenizer base, String... stopWords)` Creates a new Stop Word tokenizer

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`List<String>`	`tokenize(String input)` Breaks the input string into a series of tokens that may be used as features for a classifier.
`void`	`tokenize(String input, StringBuilder workSpace, List<String> storageSpace)` Breaks the input string into a series of tokens that may be used as features for a classifier.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - ENGLISH_STOP_SMALL_BASE
```
public static final Set<String> ENGLISH_STOP_SMALL_BASE
```
    This unmodifiable set contains a very small and simple stop word list for English based on the 100 most common English words and includes all characters. All tokens in the set are lowercase.
    This stop list is not meant to be authoritative or complete, but only a reasonable starting point that shouldn't degrade any common tasks.
    
    Significant gains can be realized by deriving a stop list better suited to your individual needs.
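    
    One possible way to derive such a list, sketched here under stated assumptions, is to treat tokens that occur in a large fraction of your own documents as stop words. The class `StopListBuilderSketch`, the method `frequentTokens`, the whitespace splitting, and the document-fraction threshold are all hypothetical choices for illustration and are not part of JSAT.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of JSAT.
class StopListBuilderSketch {
    /**
     * Returns the lowercase tokens that occur in at least
     * minDocFraction of the given documents.
     */
    static Set<String> frequentTokens(List<String> documents, double minDocFraction) {
        Map<String, Integer> docCounts = new HashMap<>();
        for (String doc : documents) {
            // Use a set so each token is counted at most once per document.
            Set<String> seen = new HashSet<>(
                    Arrays.asList(doc.toLowerCase(Locale.ROOT).trim().split("\\s+")));
            seen.remove("");
            for (String token : seen) {
                docCounts.merge(token, 1, Integer::sum);
            }
        }
        // TreeSet gives a sorted, deterministic result.
        Set<String> stopWords = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            if (e.getValue() >= minDocFraction * documents.size()) {
                stopWords.add(e.getKey());
            }
        }
        return stopWords;
    }
}
```

    Because `ENGLISH_STOP_SMALL_BASE` is unmodifiable, a derived set like this would need to be merged with it in a new set before being passed to the `StopWordTokenizer(Tokenizer, Collection<String>)` constructor.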
- Constructor Detail
  - StopWordTokenizer
```
public StopWordTokenizer(Tokenizer base,
                         Collection<String> stopWords)
```
    Creates a new stop word tokenizer that removes the given stop words from the output of the base tokenizer.
    
    Parameters:
    
    base - the base tokenizer to use
    
    stopWords - the collection of stop words to remove from tokenizations. A copy of the collection will be made
  - StopWordTokenizer
```
public StopWordTokenizer(Tokenizer base,
                         String... stopWords)
```
    Creates a new stop word tokenizer that removes the given stop words from the output of the base tokenizer.
    
    Parameters:
    
    base - the base tokenizer to use
    
    stopWords - the array of strings to use as stop words
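    
    The `Collection<String>` constructor is documented as copying its argument. The sketch below illustrates what that implies for callers, using a hypothetical class `CopiedStopSetSketch` that is not JSAT code: changes made to the original collection after construction are not seen by the object that copied it.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical class illustrating copy-on-construction; not part of JSAT.
class CopiedStopSetSketch {
    private final Set<String> stopWords;

    CopiedStopSetSketch(Collection<String> stopWords) {
        // Copy the input so later changes to the caller's collection are not observed.
        this.stopWords = new HashSet<>(stopWords);
    }

    boolean isStopWord(String token) {
        return stopWords.contains(token);
    }
}
```

    Under this pattern, changing the stop list after construction means building a new instance from the updated collection.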
- Method Detail
  - tokenize
```
public List<String> tokenize(String input)
```
    Description copied from interface: Tokenizer
    
    Breaks the input string into a series of tokens that may be used as features for a classifier. The returned tokens must be either new string objects or interned strings. If a token is returned that is backed by the original document, memory may get leaked by processes consuming the token.
    This method should be thread safe.
    
    Specified by:
    
    tokenize in interface Tokenizer
    
    Parameters:
    
    input - the string to tokenize
    
    Returns:
    
    a list of the tokens produced from the input, with stop words removed
  - tokenize
```
public void tokenize(String input,
                     StringBuilder workSpace,
                     List<String> storageSpace)
```
    Description copied from interface: Tokenizer
    
    Breaks the input string into a series of tokens that may be used as features for a classifier. The returned tokens must be either new string objects or interned strings. If a token is returned that is backed by the original document, memory may get leaked by processes consuming the token.
    This method should be thread safe.
    
    Specified by:
    
    tokenize in interface Tokenizer
    
    Parameters:
    
    input - the string to tokenize
    
    workSpace - an already allocated (but empty) string builder that can be used as a temporary work space.
    
    storageSpace - an already allocated (but empty) list to place the tokens into
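    
    The three-argument form lets a caller reuse one work space and one storage list across many inputs. The sketch below shows that calling pattern with a hypothetical class `WorkspaceTokenizeSketch` whose letter-based splitting is an assumption made for illustration, not JSAT's behavior. Tokens are built with `StringBuilder.toString()`, which allocates a new string that is not backed by the input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the workSpace/storageSpace calling pattern; not JSAT code.
class WorkspaceTokenizeSketch {
    /**
     * Splits input on non-letter characters and appends the tokens to storageSpace.
     * Both workSpace and storageSpace are expected to be empty on entry.
     */
    static void tokenize(String input, StringBuilder workSpace, List<String> storageSpace) {
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetter(c)) {
                workSpace.append(c);
            } else if (workSpace.length() > 0) {
                // toString() creates a new String independent of the input.
                storageSpace.add(workSpace.toString());
                workSpace.setLength(0);
            }
        }
        // Flush the final token, if any, and leave the work space empty.
        if (workSpace.length() > 0) {
            storageSpace.add(workSpace.toString());
            workSpace.setLength(0);
        }
    }

    /** Counts tokens over many documents while reusing one builder and one list. */
    static int countTokens(List<String> documents) {
        StringBuilder workSpace = new StringBuilder();
        List<String> storageSpace = new ArrayList<>();
        int total = 0;
        for (String doc : documents) {
            // Reset both buffers so each call starts from the documented empty state.
            workSpace.setLength(0);
            storageSpace.clear();
            tokenize(doc, workSpace, storageSpace);
            total += storageSpace.size();
        }
        return total;
    }
}
```

    In this sketch the buffers are local to `countTokens`, so each calling thread works with its own work space and storage list; sharing one pair of buffers between threads would not be safe.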
