com.wcohen.ss.tokens
Class NGramTokenizer

java.lang.Object
  extended by com.wcohen.ss.tokens.NGramTokenizer
All Implemented Interfaces:
Tokenizer

public class NGramTokenizer
extends java.lang.Object
implements Tokenizer

Wraps another tokenizer and computes all character n-grams that lie within a single token produced by the inner tokenizer.


Field Summary
static NGramTokenizer DEFAULT_TOKENIZER
           
 
Constructor Summary
NGramTokenizer(int minNGramSize, int maxNGramSize, boolean keepOldTokens, Tokenizer innerTokenizer)
           
 
Method Summary
 Token intern(java.lang.String s)
          Convert a given string into a token.
static void main(java.lang.String[] argv)
          Test routine
 int maxTokenIndex()
          Return the highest index of any interned token
 java.util.Iterator<Token> tokenIterator()
          Return an iterator over interned tokens
 Token[] tokenize(java.lang.String input)
          Return tokenized version of a string.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_TOKENIZER

public static NGramTokenizer DEFAULT_TOKENIZER
Constructor Detail

NGramTokenizer

public NGramTokenizer(int minNGramSize,
                      int maxNGramSize,
                      boolean keepOldTokens,
                      Tokenizer innerTokenizer)
Method Detail

tokenize

public Token[] tokenize(java.lang.String input)
Return tokenized version of a string. Tokens are all character n-grams that are part of a token produced by the inner tokenizer.

Specified by:
tokenize in interface Tokenizer
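
As an illustration of what "all character n-grams that are part of a token" can mean, the following is a minimal standalone sketch. It does not use this library: the class NGramSketch and the method charNGrams are hypothetical names, and the sketch assumes that n-grams are plain substrings with no boundary padding, ordered by length and then by position. The actual class may differ on any of these points.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of com.wcohen.ss.tokens.
final class NGramSketch {

    // Collect every substring of `token` whose length is between minN and maxN (inclusive).
    // Assumption: no padding characters are added at the token boundaries.
    static List<String> charNGrams(String token, int minN, int maxN) {
        if (minN < 1 || maxN < minN) {
            throw new IllegalArgumentException("require 1 <= minN <= maxN");
        }
        List<String> grams = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            // Slide a window of width n across the token.
            for (int start = 0; start + n <= token.length(); start++) {
                grams.add(token.substring(start, start + n));
            }
        }
        return grams;
    }
}
```

Under these assumptions, charNGrams("cat", 2, 3) yields the list [ca, at, cat]. A token shorter than minN yields no n-grams at all.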

intern

public Token intern(java.lang.String s)
Description copied from interface: Tokenizer
Convert a given string into a token. The intern function should have these properties: (1) If s1.equals(s2), then intern(s1)==intern(s2). (2) If no string equal to s1 has ever been interned before, then intern(s1).getIndex() will be larger than every previously-assigned index--i.e., token 'indexes' are assigned in increasing order.

Specified by:
intern in interface Tokenizer
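
The two properties of intern described above can be sketched with a map from strings to token objects plus a counter. This is a standalone illustration, not the library's implementation: the class names InternSketch and Tok are hypothetical, and starting the indexes at 1 is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the interning contract; not part of com.wcohen.ss.tokens.
final class InternSketch {

    // Minimal stand-in for a token: a string value plus its assigned index.
    static final class Tok {
        final String value;
        final int index;

        Tok(String value, int index) {
            this.value = value;
            this.index = index;
        }

        int getIndex() {
            return index;
        }
    }

    private final Map<String, Tok> table = new HashMap<>();
    private int lastIndex = 0; // assumption: the first assigned index is 1

    // Property (1): equal strings map to the same Tok object.
    // Property (2): a string not seen before gets an index larger than all earlier ones.
    Tok intern(String s) {
        Tok t = table.get(s);
        if (t == null) {
            t = new Tok(s, ++lastIndex);
            table.put(s, t);
        }
        return t;
    }

    // Highest index handed out so far (0 if nothing has been interned).
    int maxTokenIndex() {
        return lastIndex;
    }
}
```

Because the table is keyed on string equality, two distinct String objects with the same contents return the identical Tok, so callers can compare interned tokens with ==.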

tokenIterator

public java.util.Iterator<Token> tokenIterator()
Description copied from interface: Tokenizer
Return an iterator over interned tokens

Specified by:
tokenIterator in interface Tokenizer

maxTokenIndex

public int maxTokenIndex()
Description copied from interface: Tokenizer
Return the highest index of any interned token

Specified by:
maxTokenIndex in interface Tokenizer

main

public static void main(java.lang.String[] argv)
Test routine