com.wcohen.secondstring.tokens
Class NGramTokenizer
java.lang.Object
|
+--com.wcohen.secondstring.tokens.NGramTokenizer
All Implemented Interfaces: Tokenizer
public class NGramTokenizer
extends java.lang.Object
implements Tokenizer
Wraps another tokenizer and computes all n-grams of
characters from each token produced by the inner tokenizer.
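To make "all n-grams of characters from a token" concrete, the following is a minimal standalone sketch, not the library's code. The class name CharNGramSketch and the method charNGrams are hypothetical, and the sketch assumes an n-gram is any contiguous substring whose length lies between a minimum and a maximum size. The real class may differ, for example in how it treats token boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of com.wcohen.secondstring.
class CharNGramSketch {

    // Collect every contiguous substring of `token` whose length
    // lies in the inclusive range [minSize, maxSize].
    static List<String> charNGrams(String token, int minSize, int maxSize) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = minSize; len <= maxSize; len++) {
                // Skip lengths that would run past the end of the token.
                if (start + len <= token.length()) {
                    grams.add(token.substring(start, start + len));
                }
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("cat", 2, 3)); // prints [ca, cat, at]
    }
}
```

A token shorter than the minimum size yields no n-grams under this assumption.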
Constructor Summary
NGramTokenizer(int minNGramSize,
int maxNGramSize,
boolean keepOldTokens,
Tokenizer innerTokenizer)
Method Summary
Token intern(java.lang.String s)
Convert a given string into a token.
static void main(java.lang.String[] argv)
Test routine.
Token[] tokenize(java.lang.String input)
Return tokenized version of a string.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
DEFAULT_TOKENIZER
public static NGramTokenizer DEFAULT_TOKENIZER
NGramTokenizer
public NGramTokenizer(int minNGramSize,
int maxNGramSize,
boolean keepOldTokens,
Tokenizer innerTokenizer)
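The page does not describe the four constructor parameters, so the sketch below shows only one plausible reading of them. It is an assumption, not the documented behavior: minNGramSize and maxNGramSize bound the n-gram length, keepOldTokens decides whether each inner token is also emitted unchanged, and innerTokenizer supplies the tokens to split. The class NGramWrapperSketch is hypothetical, and a plain Function from String to a list of strings stands in for the Tokenizer interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the wrapping behavior; not the library class.
class NGramWrapperSketch {
    private final int minNGramSize;
    private final int maxNGramSize;
    private final boolean keepOldTokens;
    private final Function<String, List<String>> innerTokenizer;

    NGramWrapperSketch(int minNGramSize, int maxNGramSize, boolean keepOldTokens,
                       Function<String, List<String>> innerTokenizer) {
        this.minNGramSize = minNGramSize;
        this.maxNGramSize = maxNGramSize;
        this.keepOldTokens = keepOldTokens;
        this.innerTokenizer = innerTokenizer;
    }

    List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        for (String token : innerTokenizer.apply(input)) {
            // Assumed meaning of keepOldTokens: also emit the inner token itself.
            if (keepOldTokens) {
                out.add(token);
            }
            for (int start = 0; start < token.length(); start++) {
                for (int len = minNGramSize;
                     len <= maxNGramSize && start + len <= token.length();
                     len++) {
                    out.add(token.substring(start, start + len));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A whitespace splitter plays the role of the inner tokenizer.
        Function<String, List<String>> whitespace =
                s -> Arrays.asList(s.trim().split("\\s+"));
        NGramWrapperSketch t = new NGramWrapperSketch(2, 2, true, whitespace);
        System.out.println(t.tokenize("abc xyz")); // prints [abc, ab, bc, xyz, xy, yz]
    }
}
```

Passing false for keepOldTokens in this sketch leaves only the n-grams in the output.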
tokenize
public Token[] tokenize(java.lang.String input)
- Return tokenized version of a string. Tokens are sequences
of alphanumerics, or any single punctuation character.
- Specified by: tokenize in interface Tokenizer
intern
public Token intern(java.lang.String s)
- Description copied from interface: Tokenizer
- Convert a given string into a token.
- Specified by: intern in interface Tokenizer
main
public static void main(java.lang.String[] argv)
- Test routine