com.wcohen.ss.tokens
Class SimpleTokenizer

java.lang.Object
  extended by com.wcohen.ss.tokens.SimpleTokenizer
All Implemented Interfaces:
Tokenizer

public class SimpleTokenizer
extends java.lang.Object
implements Tokenizer

Simple implementation of a Tokenizer. A token is either a sequence of alphanumeric characters or, optionally, a single punctuation character.


Field Summary
static SimpleTokenizer DEFAULT_TOKENIZER
           
 
Constructor Summary
SimpleTokenizer(boolean ignorePunctuation, boolean ignoreCase)
           
 
Method Summary
 Token intern(java.lang.String s)
          Convert a given string into a token.
static void main(java.lang.String[] argv)
          Test routine
 int maxTokenIndex()
          Return the highest index of any interned token
 void setIgnoreCase(boolean flag)
           
 void setIgnorePunctuation(boolean flag)
           
 java.util.Iterator<Token> tokenIterator()
          Return an iterator over interned tokens
 Token[] tokenize(java.lang.String input)
          Return tokenized version of a string.
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

DEFAULT_TOKENIZER

public static final SimpleTokenizer DEFAULT_TOKENIZER

Constructor Detail

SimpleTokenizer

public SimpleTokenizer(boolean ignorePunctuation,
                       boolean ignoreCase)

Method Detail

setIgnorePunctuation

public void setIgnorePunctuation(boolean flag)

setIgnoreCase

public void setIgnoreCase(boolean flag)

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

tokenize

public Token[] tokenize(java.lang.String input)
Return tokenized version of a string. Each token is either a sequence of alphanumerics or a single punctuation character.

Specified by:
tokenize in interface Tokenizer
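
The tokenization rule described above can be illustrated with a small stand-alone sketch. This is not the SecondString source: the class name TokenizeSketch is hypothetical, it returns plain strings rather than Token objects, and it assumes that whitespace only separates tokens, that ignorePunctuation drops punctuation tokens, and that ignoreCase lowercases alphanumeric tokens. These behaviors are inferred from the parameter names and may differ from the real class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the documented rule; not the library implementation.
final class TokenizeSketch {

    // Splits input into alphanumeric runs and single punctuation characters.
    static List<String> tokenize(String input, boolean ignorePunctuation, boolean ignoreCase) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char ch = input.charAt(i);
            if (Character.isWhitespace(ch)) {
                // Assumption: whitespace separates tokens and is never emitted.
                i++;
            } else if (Character.isLetterOrDigit(ch)) {
                // Consume the whole run of letters and digits as one token.
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;
                }
                String run = input.substring(start, i);
                tokens.add(ignoreCase ? run.toLowerCase(Locale.ROOT) : run);
            } else {
                // Any other character is a one-character token, unless ignored.
                if (!ignorePunctuation) {
                    tokens.add(String.valueOf(ch));
                }
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, World!", false, true));
        System.out.println(tokenize("Hello, World!", true, false));
    }
}
```

Under these assumptions, "Hello, World!" with punctuation kept and case ignored yields the four tokens hello , world ! and with punctuation ignored and case kept yields the two tokens Hello and World.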

intern

public Token intern(java.lang.String s)
Description copied from interface: Tokenizer
Convert a given string into a token. The intern function should have these properties: (1) If s1.equals(s2), then intern(s1)==intern(s2). (2) If no string equal to s1 has ever been interned before, then intern(s1).getIndex() will be larger than every previously assigned index; that is, token indexes are assigned in increasing order.

Specified by:
intern in interface Tokenizer
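
The two properties of intern can be illustrated with a minimal stand-alone sketch built on a map from strings to token objects. Everything here is hypothetical and is not taken from the SecondString source: the class InternSketch, the nested Tok type standing in for Token, and the choice of 1 as the first index are assumptions made for illustration only.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the intern contract; not the library implementation.
final class InternSketch {

    // Minimal stand-in for the library's Token type.
    static final class Tok {
        private final int index;
        private final String value;

        Tok(int index, String value) {
            this.index = index;
            this.value = value;
        }

        int getIndex() { return index; }
        String getValue() { return value; }
    }

    // Insertion-ordered map, so iteration follows index order.
    private final Map<String, Tok> table = new LinkedHashMap<>();
    private int lastIndex = 0; // assumption: the first assigned index is 1

    // Property (1): equal strings map to the same Tok object.
    // Property (2): a string never seen before gets a strictly larger index.
    Tok intern(String s) {
        Tok t = table.get(s);
        if (t == null) {
            t = new Tok(++lastIndex, s);
            table.put(s, t);
        }
        return t;
    }

    // Highest index handed out so far.
    int maxTokenIndex() { return lastIndex; }

    // Iterates over every token interned so far.
    Iterator<Tok> tokenIterator() { return table.values().iterator(); }

    public static void main(String[] args) {
        InternSketch sketch = new InternSketch();
        Tok a = sketch.intern("apple");
        Tok b = sketch.intern("banana");
        // A distinct but equal String object still yields the same token.
        System.out.println(a == sketch.intern(new String("apple")));
        System.out.println(b.getIndex() > a.getIndex());
        System.out.println(sketch.maxTokenIndex() == b.getIndex());
    }
}
```

Because indexes only grow, a caller can size an array by maxTokenIndex() + 1 and use getIndex() as a direct array subscript, whichever starting index the real implementation uses.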

tokenIterator

public java.util.Iterator<Token> tokenIterator()
Description copied from interface: Tokenizer
Return an iterator over interned tokens

Specified by:
tokenIterator in interface Tokenizer

maxTokenIndex

public int maxTokenIndex()
Description copied from interface: Tokenizer
Return the highest index of any interned token

Specified by:
maxTokenIndex in interface Tokenizer

main

public static void main(java.lang.String[] argv)
Test routine