com.wcohen.ss.lookup
Class SoftDictionary

java.lang.Object
  extended by com.wcohen.ss.lookup.SoftDictionary

public class SoftDictionary
extends java.lang.Object

Looks up nearly-matching strings in a dictionary, using a string distance. A typical use:

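 SoftDictionary softDict = new SoftDictionary(new SimpleTokenizer(true,true));
 String alias[] = new String[]{"william cohen", "wwcohen", "einat minkov", "eminkov", .... };
 // The loop was truncated in the source; storing each alias as its own value is an assumption.
 for (int i=0; i<alias.length; i++) {
     softDict.put(alias[i], alias[i]);
 }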


Constructor Summary
SoftDictionary()
           
SoftDictionary(StringDistanceLearner distanceLearner)
           
SoftDictionary(StringDistanceLearner distanceLearner, Tokenizer tokenizer)
           
SoftDictionary(Tokenizer tokenizer)
           
 
Method Summary
 StringDistanceTeacher getTeacher()
          Return a teacher that can 'train' a distance metric from the information in the dictionary.
 void load(java.io.File file)
          Insert all lines in a file as items mapping to themselves.
 void load(java.io.File file, boolean ids)
          Insert all lines in a file as items mapping to themselves.
 void loadAliases(java.io.File file)
          Load a file of identifiers, each of which has multiple aliases.
 java.lang.Object lookup(java.lang.String toFind)
          Look up a string in the dictionary.
 java.lang.Object lookup(java.lang.String id, java.lang.String toFind)
          Look up a string in the dictionary.
 java.lang.Object lookup(java.lang.String id, StringWrapper toFind)
          Look up a prepared string in the dictionary.
 java.lang.Object lookup(StringWrapper toFind)
          Look up a prepared string in the dictionary.
 double lookupDistance(java.lang.String toFind)
          Return the distance to the best match.
 double lookupDistance(java.lang.String id, java.lang.String toFind)
          Return the distance to the best match.
 double lookupDistance(java.lang.String id, StringWrapper toFind)
          Return the distance to the best match.
 double lookupDistance(StringWrapper toFind)
          Return the distance to the best match.
static void main(java.lang.String[] argv)
          Simple main for testing.
 StringWrapper prepare(java.lang.String s)
          Prepare a string for quicker lookup.
 void put(java.lang.String string, java.lang.Object value)
          Insert a string into the dictionary.
 void put(java.lang.String id, java.lang.String string, java.lang.Object value)
          Insert a string into the dictionary.
 void put(java.lang.String id, StringWrapper toInsert, java.lang.Object value)
          Insert a prepared string into the dictionary.
 int size()
          Return the number of entries in the dictionary.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SoftDictionary

public SoftDictionary()

SoftDictionary

public SoftDictionary(StringDistanceLearner distanceLearner)

SoftDictionary

public SoftDictionary(Tokenizer tokenizer)

SoftDictionary

public SoftDictionary(StringDistanceLearner distanceLearner,
                      Tokenizer tokenizer)

Method Detail

size

public int size()
Return the number of entries in the dictionary.


prepare

public StringWrapper prepare(java.lang.String s)
Prepare a string for quicker lookup.


load

public void load(java.io.File file)
          throws java.io.IOException,
                 java.io.FileNotFoundException
Insert all lines in a file as items mapping to themselves.

Throws:
java.io.IOException
java.io.FileNotFoundException

load

public void load(java.io.File file,
                 boolean ids)
          throws java.io.IOException,
                 java.io.FileNotFoundException
Insert all lines in a file as items mapping to themselves. If 'ids' is true, each item's line number is used as its id.

This is mostly for testing the id feature.

Throws:
java.io.IOException
java.io.FileNotFoundException

loadAliases

public void loadAliases(java.io.File file)
                 throws java.io.IOException,
                        java.io.FileNotFoundException
Load a file of identifiers, each of which has multiple aliases. Each line is a list of tab-separated strings, the first of which is the identifier, the remainder of which are aliases.

Throws:
java.io.IOException
java.io.FileNotFoundException
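
The alias file format described for loadAliases can be sketched without the library. The following is a minimal, hypothetical parser: AliasFileSketch and parse are illustrative names, not part of SoftDictionary, and the handling of blank lines and repeated identifiers is an assumption, since the documentation above does not specify it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the alias-file format; not the library's code.
class AliasFileSketch {

    // Parse lines of the form "id<TAB>alias1<TAB>alias2..." into id -> aliases.
    // Assumptions: whitespace-only lines are skipped, and a repeated id
    // accumulates aliases instead of replacing the earlier ones.
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> aliasesById = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t");
            String id = fields[0];
            List<String> aliases = Arrays.asList(fields).subList(1, fields.length);
            aliasesById.computeIfAbsent(id, k -> new ArrayList<>()).addAll(aliases);
        }
        return aliasesById;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "wcohen\twilliam cohen\twwcohen",
            "eminkov\teinat minkov");
        // prints {wcohen=[william cohen, wwcohen], eminkov=[einat minkov]}
        System.out.println(parse(lines));
    }
}
```

With the real class, each alias would presumably be inserted under its identifier so that lookups can use the id for 'leave one out' matching; that connection is inferred from the put documentation rather than stated for loadAliases.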

put

public void put(java.lang.String id,
                java.lang.String string,
                java.lang.Object value)
Insert a string into the dictionary.

Id is a special tag used to handle 'leave one out' lookups. If you do a lookup on a string with a non-null id, you get the closest matches that do not have the same id.
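
The 'leave one out' behavior can be illustrated with a small stand-alone sketch. LeaveOneOutSketch below is a hypothetical class, not SoftDictionary's implementation: it scans all entries linearly and uses plain Levenshtein distance as a stand-in metric, both of which are assumptions made only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of 'leave one out' lookup; not the library's code.
class LeaveOneOutSketch {

    // One stored entry: an optional id tag, the string, and its associated value.
    private static final class Entry {
        final String id;
        final String string;
        final Object value;

        Entry(String id, String string, Object value) {
            this.id = id;
            this.string = string;
            this.value = value;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void put(String id, String string, Object value) {
        entries.add(new Entry(id, string, value));
    }

    // Return the value of the closest entry, or null if none qualifies.
    // If id is non-null, entries carrying that same id are skipped;
    // entries with a different id or a null id always qualify.
    Object lookup(String id, String toFind) {
        Entry best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Entry e : entries) {
            if (id != null && id.equals(e.id)) {
                continue;
            }
            int d = editDistance(toFind, e.string);
            if (d < bestDistance) {
                bestDistance = d;
                best = e;
            }
        }
        return best == null ? null : best.value;
    }

    // Plain Levenshtein distance, used here only as a stand-in metric.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        LeaveOneOutSketch dict = new LeaveOneOutSketch();
        dict.put("p1", "william cohen", "william cohen");
        dict.put("p1", "wwcohen", "wwcohen");
        dict.put("p2", "einat minkov", "einat minkov");
        System.out.println(dict.lookup(null, "wcohen")); // prints wwcohen
        System.out.println(dict.lookup("p1", "wcohen")); // prints einat minkov
    }
}
```

The second call shows why the tag is useful for evaluation: a query that belongs to group "p1" cannot be matched against other members of its own group.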


put

public void put(java.lang.String string,
                java.lang.Object value)
Insert a string into the dictionary.


put

public void put(java.lang.String id,
                StringWrapper toInsert,
                java.lang.Object value)
Insert a prepared string into the dictionary.

Id is a special tag used to handle 'leave one out' lookups. If you do a lookup on a string with a non-null id, you get the closest matches that do not have the same id.


lookup

public java.lang.Object lookup(java.lang.String id,
                               java.lang.String toFind)
Look up a string in the dictionary.

If id is non-null, then consider only strings with different ids (or null ids).


lookup

public java.lang.Object lookup(java.lang.String id,
                               StringWrapper toFind)
Look up a prepared string in the dictionary.

If id is non-null, then consider only strings with different ids (or null ids).


lookupDistance

public double lookupDistance(java.lang.String id,
                             java.lang.String toFind)
Return the distance to the best match.

If id is non-null, then consider only strings with different ids (or null ids).


lookupDistance

public double lookupDistance(java.lang.String id,
                             StringWrapper toFind)
Return the distance to the best match.

If id is non-null, then consider only strings with different ids (or null ids).


lookup

public java.lang.Object lookup(java.lang.String toFind)
Look up a string in the dictionary.


lookup

public java.lang.Object lookup(StringWrapper toFind)
Look up a prepared string in the dictionary.


lookupDistance

public double lookupDistance(java.lang.String toFind)
Return the distance to the best match.


lookupDistance

public double lookupDistance(StringWrapper toFind)
Return the distance to the best match.


getTeacher

public StringDistanceTeacher getTeacher()
Return a teacher that can 'train' a distance metric from the information in the dictionary. Since there are no known distances, this means unsupervised training, for example accumulating TFIDF weights.
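
To make "accumulating TFIDF weights" concrete, the sketch below counts document frequencies over a list of dictionary strings and derives an inverse document frequency from them. IdfSketch is a hypothetical class, not the teacher returned by getTeacher; whitespace tokenization, lowercasing, the formula log(N / df), and the value 0.0 for unseen tokens are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of unsupervised weight accumulation; not the library's code.
class IdfSketch {

    // Count, for each token, how many entries contain it (document frequency).
    static Map<String, Integer> documentFrequencies(List<String> entries) {
        Map<String, Integer> df = new HashMap<>();
        for (String entry : entries) {
            // A set ensures a token repeated within one entry is counted once.
            Set<String> tokens = new HashSet<>(Arrays.asList(entry.toLowerCase().trim().split("\\s+")));
            for (String token : tokens) {
                if (!token.isEmpty()) {
                    df.merge(token, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    // Inverse document frequency as log(N / df); unseen tokens get 0.0 (an assumption).
    static double idf(Map<String, Integer> df, int totalEntries, String token) {
        Integer count = df.get(token);
        return count == null ? 0.0 : Math.log((double) totalEntries / count);
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList("william cohen", "william w cohen", "einat minkov");
        Map<String, Integer> df = documentFrequencies(entries);
        // "einat" appears in one entry and "cohen" in two, so "einat" gets the larger weight.
        System.out.println(idf(df, entries.size(), "einat") > idf(df, entries.size(), "cohen"));
    }
}
```

No labeled string pairs are needed for this kind of training: the weights depend only on how often each token occurs across the dictionary's own entries.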


main

public static void main(java.lang.String[] argv)
                 throws java.io.IOException,
                        java.io.FileNotFoundException
Simple main for testing.

Throws:
java.io.IOException
java.io.FileNotFoundException