com.wcohen.secondstring.tokens
Class NGramTokenizer

java.lang.Object
  |
  +--com.wcohen.secondstring.tokens.NGramTokenizer
All Implemented Interfaces:
Tokenizer

public class NGramTokenizer
extends java.lang.Object
implements Tokenizer

Wraps another tokenizer and computes all character n-grams of each token produced by the inner tokenizer.
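
To make the phrase "all ngrams of characters from a single token" concrete, here is a minimal, self-contained sketch of character n-gram extraction. It is an illustration only and not this class's implementation: the class name NGramSketch and the method charNGrams are hypothetical. This page does not say how the library orders n-grams or handles token boundaries, so the sketch simply takes plain substrings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of com.wcohen.secondstring.
class NGramSketch {

    // Collect every substring of `token` whose length lies in [minN, maxN],
    // scanning start positions from left to right.
    static List<String> charNGrams(String token, int minN, int maxN) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            // Stop growing n once the substring would run past the end.
            for (int n = minN; n <= maxN && start + n <= token.length(); n++) {
                grams.add(token.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [to, tok, ok, oke, ke, ken, en]
        System.out.println(charNGrams("token", 2, 3));
    }
}
```

A token shorter than the minimum size yields no n-grams in this sketch.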


Field Summary
static NGramTokenizer DEFAULT_TOKENIZER
           
 
Constructor Summary
NGramTokenizer(int minNGramSize, int maxNGramSize, boolean keepOldTokens, Tokenizer innerTokenizer)
           
 
Method Summary
 Token intern(java.lang.String s)
          Convert a given string into a token.
static void main(java.lang.String[] argv)
          Test routine
 Token[] tokenize(java.lang.String input)
          Return tokenized version of a string.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_TOKENIZER

public static NGramTokenizer DEFAULT_TOKENIZER

Constructor Detail

NGramTokenizer

public NGramTokenizer(int minNGramSize,
                      int maxNGramSize,
                      boolean keepOldTokens,
                      Tokenizer innerTokenizer)
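
The constructor parameters are not described on this page. The sketch below shows one plausible reading of how they could fit together, and every part of it is an assumption: minNGramSize and maxNGramSize are taken to bound the n-gram length, keepOldTokens is taken to mean that each inner token is also kept in the output, and a java.util.function.Function stands in for the Tokenizer interface. The class NGramWrapperSketch is hypothetical and does not use the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the wrapping pattern; not the library's code.
class NGramWrapperSketch {
    private final int minN;
    private final int maxN;
    private final boolean keepOldTokens;
    // Assumption: a plain function plays the role of the inner Tokenizer.
    private final Function<String, List<String>> inner;

    NGramWrapperSketch(int minN, int maxN, boolean keepOldTokens,
                       Function<String, List<String>> inner) {
        this.minN = minN;
        this.maxN = maxN;
        this.keepOldTokens = keepOldTokens;
        this.inner = inner;
    }

    List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        for (String tok : inner.apply(input)) {
            // Assumed meaning of keepOldTokens: emit the inner token itself.
            if (keepOldTokens) {
                out.add(tok);
            }
            // Then emit the token's character n-grams of length minN..maxN.
            for (int start = 0; start < tok.length(); start++) {
                for (int n = minN; n <= maxN && start + n <= tok.length(); n++) {
                    out.add(tok.substring(start, start + n));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A whitespace splitter as the inner tokenizer (illustrative choice).
        Function<String, List<String>> whitespace =
                s -> Arrays.asList(s.trim().split("\\s+"));
        // prints [abc, ab, bc, xyz, xy, yz]
        System.out.println(
                new NGramWrapperSketch(2, 2, true, whitespace).tokenize("abc xyz"));
    }
}
```

With keepOldTokens set to false, the same call in this sketch would return only the n-grams: ab, bc, xy, yz.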
Method Detail

tokenize

public Token[] tokenize(java.lang.String input)
Return tokenized version of a string. Tokens are sequences of alphanumerics, or any single punctuation character.

Specified by:
tokenize in interface Tokenizer

intern

public Token intern(java.lang.String s)
Description copied from interface: Tokenizer
Convert a given string into a token.

Specified by:
intern in interface Tokenizer

main

public static void main(java.lang.String[] argv)
Test routine