com.wcohen.ss.lookup
Class SoftTFIDFDictionary

java.lang.Object
  extended by com.wcohen.ss.lookup.SoftTFIDFDictionary
All Implemented Interfaces:
FastLookup

public class SoftTFIDFDictionary
extends java.lang.Object
implements FastLookup

Looks up nearly-matching strings in a dictionary, using SoftTFIDF distance. To use the dictionary, first load string/value pairs using 'put', then 'freeze' the dictionary. Once the dictionary is frozen, you can look up values with 'lookup' and read the results with getResult(i), getValue(i), etc.

For example:

 SoftTFIDFDictionary dict = new SoftTFIDFDictionary();
 dict.put("william cohen", "wcohen@cs.cmu.edu");
 dict.put("vitor del rocha carvalho", "vitor@cs.cmu.edu");
 ...
 dict.freeze();
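 double minScore = 0.5;  // illustrative threshold only; lookup takes (minScore, toFind)
 int n = dict.lookup(minScore, "victor carvalho");
 for (int i = 0; i < n; i++) {
     // print each matching string and its associated value
     System.out.println(dict.getResult(i) + " => " + dict.getValue(i));
 }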


Field Summary
protected  double lookupTime
           
 
Constructor Summary
SoftTFIDFDictionary()
           
SoftTFIDFDictionary(Tokenizer tokenizer)
           
SoftTFIDFDictionary(Tokenizer tokenizer, double minTokenSimilarity)
           
SoftTFIDFDictionary(Tokenizer tokenizer, double minTokenSimilarity, int windowSize, int maxInvertedIndexSize)
          Create a new SoftTFIDFDictionary.
 
Method Summary
 void freeze()
          Make it impossible to add new values, but possible to perform lookups.
 double getLookupTime()
          Get the time used in performing the lookup
 int getMaxInvertedIndexSize()
           
 java.lang.String getResult(int i)
          Get the i'th string found by the last lookup
 double getScore(int i)
          Get the score of the i'th string found by the last lookup
 java.lang.Object getValue(int i)
          Get the value of the i'th string found by the last lookup
 int getWindowSize(int w)
           
 void loadAliases(java.io.File file)
          Load a file of identifiers, each of which has multiple aliases.
 int lookup(double minScore, java.lang.String toFind)
          Look up items SoftTFIDF-similar to the 'toFind' argument, and return the number of items found.
static void main(java.lang.String[] argv)
          Simple main for testing and experimentation
 void put(java.lang.String string, java.lang.Object value)
          Insert a string into the dictionary, and associate it with the given value.
 void refreeze()
           
static SoftTFIDFDictionary restore(java.io.File file)
           
 void saveAs(java.io.File file)
           
 void setMaxInvertedIndexSize(int m)
          Set the maximum size of an inverted index that will be followed.
 void setWindowSize(int w)
          Set the 'windowSize' used for finding similar tokens.
 int slowLookup(double minScore, java.lang.String toFind)
          Exactly like lookup, but works by exhaustively checking every stored string.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

lookupTime

protected double lookupTime

Constructor Detail

SoftTFIDFDictionary

public SoftTFIDFDictionary()

SoftTFIDFDictionary

public SoftTFIDFDictionary(Tokenizer tokenizer)

SoftTFIDFDictionary

public SoftTFIDFDictionary(Tokenizer tokenizer,
                           double minTokenSimilarity)

SoftTFIDFDictionary

public SoftTFIDFDictionary(Tokenizer tokenizer,
                           double minTokenSimilarity,
                           int windowSize,
                           int maxInvertedIndexSize)
Create a new SoftTFIDFDictionary. Strings are compared with a SoftTFIDF distance function, where minTokenSimilarity is the minimum Jaro-Winkler similarity at which two tokens are treated as similar, and the tokenizer defines the tokens considered.
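
For orientation, SoftTFIDF is commonly formulated as below. This page does not state that the class uses exactly this weighting, so treat it as background rather than a specification. Here \theta plays the role of minTokenSimilarity, sim' is the token-level similarity (Jaro-Winkler in this class), and S and T are the token sets produced by the tokenizer.

```latex
\mathrm{SoftTFIDF}(S,T) = \sum_{w \in \mathrm{CLOSE}(\theta,S,T)} V(w,S)\, V(w,T)\, D(w,T)

\mathrm{CLOSE}(\theta,S,T) = \{\, w \in S : \exists\, v \in T,\ \mathrm{sim}'(w,v) > \theta \,\}
\qquad
D(w,T) = \max_{v \in T} \mathrm{sim}'(w,v)

V(w,S) = \frac{V'(w,S)}{\sqrt{\sum_{w'} V'(w',S)^2}}
\qquad
V'(w,S) = \log(\mathrm{TF}_{w,S} + 1)\,\log(\mathrm{IDF}_w)
```

With an exact-match token similarity the sum reduces to ordinary TF-IDF cosine similarity; a lower \theta lets more near-miss tokens such as "victor" and "vitor" contribute.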

Method Detail

saveAs

public void saveAs(java.io.File file)
            throws java.io.IOException,
                   java.io.FileNotFoundException
Throws:
java.io.IOException
java.io.FileNotFoundException

restore

public static SoftTFIDFDictionary restore(java.io.File file)
                                   throws java.io.IOException,
                                          java.io.FileNotFoundException
Throws:
java.io.IOException
java.io.FileNotFoundException

setWindowSize

public void setWindowSize(int w)
Set the 'windowSize' used for finding similar tokens. When finding tokens t2 that are similar to a given t1, the dictionary limits itself to tokens t2 that are within distance 'windowSize' of t1 on a sorted list of all tokens in the dictionary.
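
The windowing idea can be sketched as follows. This is a minimal illustration, not the class's actual implementation: TokenWindowSketch and candidatesNear are hypothetical names, and "distance" is assumed to mean the number of positions on the sorted token list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch; not part of com.wcohen.ss. */
final class TokenWindowSketch {

    /**
     * Returns the tokens within windowSize positions of t1 on sortedTokens,
     * excluding t1 itself. sortedTokens must be in natural String order.
     */
    static List<String> candidatesNear(List<String> sortedTokens, String t1, int windowSize) {
        int pos = Collections.binarySearch(sortedTokens, t1);
        boolean present = pos >= 0;
        if (!present) {
            pos = -(pos + 1); // insertion point when t1 is not in the list
        }
        int from = Math.max(0, pos - windowSize);
        // When t1 is absent, the element at pos is already its first right neighbor.
        int last = present ? pos + windowSize : pos + windowSize - 1;
        int to = Math.min(sortedTokens.size() - 1, last);

        List<String> candidates = new ArrayList<>();
        for (int i = from; i <= to; i++) {
            if (present && i == pos) {
                continue; // skip t1 itself
            }
            candidates.add(sortedTokens.get(i));
        }
        return candidates;
    }
}
```

For the sorted tokens [carvalho, cohen, del, rocha, vitor, william], candidatesNear(tokens, "victor", 1) yields [rocha, vitor]. Only those candidates would then need a token-level similarity check, which is why a small window keeps lookups cheap at the risk of missing similar tokens that sort far apart.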


getWindowSize

public int getWindowSize(int w)

setMaxInvertedIndexSize

public void setMaxInvertedIndexSize(int m)
Set the maximum size of an inverted index that will be followed. If this is zero (the default) then any inverted index will be followed, even for very frequent tokens, if following it is justified by the upper bound algorithms.
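
A minimal sketch of how such a cap might be applied is shown below. InvertedIndexCapSketch and shouldFollow are hypothetical names, the inclusive comparison (size <= max) is an assumption, and the upper-bound pruning mentioned above is not modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not part of com.wcohen.ss. */
final class InvertedIndexCapSketch {
    // token -> ids of the stored strings that contain it
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int maxInvertedIndexSize = 0; // 0 means "no limit", matching the documented default

    void add(String token, int stringId) {
        postings.computeIfAbsent(token, k -> new ArrayList<>()).add(stringId);
    }

    void setMaxInvertedIndexSize(int m) {
        maxInvertedIndexSize = m;
    }

    /** True if the posting list for token exists and is small enough to follow. */
    boolean shouldFollow(String token) {
        List<Integer> list = postings.get(token);
        if (list == null) {
            return false; // nothing indexed under this token
        }
        // Assumption: a list exactly at the cap is still followed.
        return maxInvertedIndexSize == 0 || list.size() <= maxInvertedIndexSize;
    }
}
```

Under this sketch a very frequent token is skipped once a nonzero cap is set, which trades some recall for faster lookups.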


getMaxInvertedIndexSize

public int getMaxInvertedIndexSize()

loadAliases

public void loadAliases(java.io.File file)
                 throws java.io.IOException,
                        java.io.FileNotFoundException
Load a file of identifiers, each of which has multiple aliases. The dictionary constructed will map aliases to identifiers. Each line in the file is a list of tab-separated strings, the first of which is the identifier, the remainder of which are aliases.

Throws:
java.io.IOException
java.io.FileNotFoundException
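
The alias file format described for loadAliases can be illustrated with the sketch below. AliasFileSketch and parseAliases are hypothetical names. Skipping blank lines and letting a later line overwrite a repeated alias are assumptions, not documented behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not part of com.wcohen.ss. */
final class AliasFileSketch {

    /**
     * Parses lines of the form identifier TAB alias1 TAB alias2 ...
     * and returns a map from each alias to its identifier.
     */
    static Map<String, String> parseAliases(Iterable<String> lines) {
        Map<String, String> aliasToId = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            String[] fields = line.split("\t");
            String identifier = fields[0];
            for (int i = 1; i < fields.length; i++) {
                aliasToId.put(fields[i], identifier); // alias -> identifier
            }
        }
        return aliasToId;
    }
}
```

The lines could come from java.nio.file.Files.readAllLines(path). Each resulting alias/identifier pair corresponds to one put(alias, identifier) call on the dictionary.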

put

public void put(java.lang.String string,
                java.lang.Object value)
Insert a string into the dictionary, and associate it with the given value.


refreeze

public void refreeze()

freeze

public void freeze()
Make it impossible to add new values, but possible to perform lookups.
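
The load-then-freeze contract can be sketched as a two-phase object. FreezableMapSketch is a hypothetical class. Throwing IllegalStateException on misuse is an assumption, since this page does not say how the real class reacts to a put after freeze or a lookup before it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not part of com.wcohen.ss. */
final class FreezableMapSketch {
    private final Map<String, Object> entries = new HashMap<>();
    private boolean frozen = false;

    void put(String string, Object value) {
        if (frozen) {
            // Assumption: additions after freeze() are rejected with an exception.
            throw new IllegalStateException("dictionary is frozen");
        }
        entries.put(string, value);
    }

    void freeze() {
        // A real implementation would build its lookup indices at this point.
        frozen = true;
    }

    Object get(String string) {
        if (!frozen) {
            // Assumption: lookups before freeze() are rejected.
            throw new IllegalStateException("call freeze() before looking up");
        }
        return entries.get(string);
    }
}
```

Separating the two phases lets any expensive index construction happen once, after all strings are known.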


slowLookup

public int slowLookup(double minScore,
                      java.lang.String toFind)
Exactly like lookup, but works by exhaustively checking every stored string.
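
The exhaustive strategy can be sketched as a linear scan that scores every stored string. ExhaustiveLookupSketch is a hypothetical class, and its scorer is a plain Jaccard overlap of whitespace tokens used as a stand-in. It is not SoftTFIDF and will not reproduce the library's scores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not part of com.wcohen.ss. */
final class ExhaustiveLookupSketch {
    private final List<String> stored = new ArrayList<>();
    private final List<String> results = new ArrayList<>();
    private final List<Double> scores = new ArrayList<>();

    void put(String s) {
        stored.add(s);
    }

    /** Stand-in scorer: Jaccard overlap of whitespace-separated tokens. */
    static double score(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.trim().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.trim().split("\\s+")));
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    /** Scores every stored string and keeps those scoring strictly above minScore. */
    int slowLookup(double minScore, String toFind) {
        results.clear();
        scores.clear();
        for (String s : stored) {
            double sc = score(s, toFind);
            if (sc > minScore) {
                results.add(s);
                scores.add(sc);
            }
        }
        return results.size();
    }

    String getResult(int i) {
        return results.get(i);
    }

    double getScore(int i) {
        return scores.get(i);
    }
}
```

A scan like this costs one score computation per stored string on every query, so it is mainly useful as a correctness baseline against which a faster indexed lookup can be checked.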


lookup

public int lookup(double minScore,
                  java.lang.String toFind)
Look up items that are SoftTFIDF-similar to the 'toFind' argument, and return the number of items found. Only items whose similarity score is greater than minScore are returned.

Specified by:
lookup in interface FastLookup

getResult

public java.lang.String getResult(int i)
Get the i'th string found by the last lookup

Specified by:
getResult in interface FastLookup

getValue

public java.lang.Object getValue(int i)
Get the value of the i'th string found by the last lookup

Specified by:
getValue in interface FastLookup

getScore

public double getScore(int i)
Get the score of the i'th string found by the last lookup

Specified by:
getScore in interface FastLookup

getLookupTime

public double getLookupTime()
Get the time used in performing the lookup


main

public static void main(java.lang.String[] argv)
                 throws java.io.IOException,
                        java.io.FileNotFoundException,
                        java.lang.NumberFormatException,
                        java.lang.ClassNotFoundException
Simple main for testing and experimentation

Throws:
java.io.IOException
java.io.FileNotFoundException
java.lang.NumberFormatException
java.lang.ClassNotFoundException