com.wcohen.ss
Class SoftTFIDF

java.lang.Object
  extended by com.wcohen.ss.AbstractStringDistance
      extended by com.wcohen.ss.AbstractTokenizedStringDistance
          extended by com.wcohen.ss.AbstractStatisticalTokenDistance
              extended by com.wcohen.ss.TFIDF
                  extended by com.wcohen.ss.SoftTFIDF
All Implemented Interfaces:
StringDistance, StringDistanceLearner
Direct Known Subclasses:
JaroWinklerTFIDF

public class SoftTFIDF
extends TFIDF

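TFIDF-based distance metric, extended to use "soft" token matching. Specifically, two tokens are treated as a partial match when the inner string comparator (the tokenDistance passed to the constructor) scores them highly enough to pass the token match threshold (tokenMatchThreshold).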

On the WHIRL datasets, using JaroWinkler as the inner comparator with a token match threshold of 0.9 or 0.95 seems to be about right.
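
The soft-matching idea described above can be illustrated with a small standalone sketch. It is not this class's implementation and does not use the com.wcohen.ss API. The class name SoftTfidfSketch and all of its methods are hypothetical, every token is assumed to have an IDF of 1 (so no training corpus is needed), and a normalized edit-distance similarity is swapped in for JaroWinkler to keep the example short. Counting a token as matched when its similarity is at or above the threshold is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of soft token matching; not part of com.wcohen.ss.
class SoftTfidfSketch {

    // Stand-in inner comparator: 1 - editDistance / maxLength, in [0, 1].
    static double tokenSimilarity(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return 1.0 - (double) prev[b.length()] / Math.max(a.length(), b.length());
    }

    // Unit-length term-frequency vector over whitespace tokens.
    // Assumption: IDF is 1 for every token, so weights depend on counts only.
    static Map<String, Double> unitVector(String text) {
        Map<String, Double> v = new HashMap<>();
        for (String tok : text.toLowerCase().trim().split("\\s+")) {
            if (!tok.isEmpty()) v.merge(tok, 1.0, Double::sum);
        }
        double sumSquares = 0.0;
        for (double w : v.values()) sumSquares += w * w;
        double norm = Math.sqrt(sumSquares);
        for (Map.Entry<String, Double> e : v.entrySet()) e.setValue(e.getValue() / norm);
        return v;
    }

    // For each token of s, find its closest token in t; if the similarity
    // reaches the threshold, add weight(s token) * weight(t token) * similarity.
    static double score(String s, String t, double threshold) {
        Map<String, Double> sv = unitVector(s);
        Map<String, Double> tv = unitVector(t);
        double total = 0.0;
        for (Map.Entry<String, Double> se : sv.entrySet()) {
            String best = null;
            double bestSim = -1.0;
            for (String candidate : tv.keySet()) {
                double sim = tokenSimilarity(se.getKey(), candidate);
                if (sim > bestSim) {
                    bestSim = sim;
                    best = candidate;
                }
            }
            if (best != null && bestSim >= threshold) {
                total += se.getValue() * tv.get(best) * bestSim;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // A misspelled token still contributes when the threshold is loose enough.
        System.out.println(score("william cohen", "wiliam cohen", 0.8));
        // With a stricter threshold only the exact token "cohen" contributes.
        System.out.println(score("william cohen", "wiliam cohen", 0.9));
    }
}
```

Raising the threshold toward 1.0 makes the sketch behave like plain exact-token matching, while lowering it lets near-miss spellings contribute in proportion to their inner similarity.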


Nested Class Summary
 
Nested classes/interfaces inherited from class com.wcohen.ss.TFIDF
TFIDF.UnitVector
 
Field Summary
 
Fields inherited from class com.wcohen.ss.AbstractStatisticalTokenDistance
collectionSize, documentFrequency, totalTokenCount
 
Fields inherited from class com.wcohen.ss.AbstractTokenizedStringDistance
tokenizer
 
Constructor Summary
SoftTFIDF(StringDistance tokenDistance)
           
SoftTFIDF(StringDistance tokenDistance, double tokenMatchThreshold)
           
SoftTFIDF(Tokenizer tokenizer, StringDistance tokenDistance, double tokenMatchThreshold)
           
 
Method Summary
 java.lang.String explainScore(StringWrapper s, StringWrapper t)
          Explain how the distance was computed.
 double getTokenMatchThreshold()
           
 double score(StringWrapper s, StringWrapper t)
          This method needs to be implemented by subclasses.
 void setTokenMatchThreshold(double d)
           
 void setTokenMatchThreshold(java.lang.Double d)
           
 java.lang.String toString()
           
 
Methods inherited from class com.wcohen.ss.TFIDF
asUnitVector, getCollectionSize, getDocumentFrequency, getTokens, getWeight, main, prepare, setCollectionSize, setDocumentFrequency
 
Methods inherited from class com.wcohen.ss.AbstractStatisticalTokenDistance
checkTrainingHasHappened, train
 
Methods inherited from class com.wcohen.ss.AbstractTokenizedStringDistance
asBagOfTokens, prepare, setStringWrapperPool
 
Methods inherited from class com.wcohen.ss.AbstractStringDistance
addExample, doMain, explainScore, getDistance, hasNextQuery, nextQuery, prepare, score, setDistanceInstancePool
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

SoftTFIDF

public SoftTFIDF(Tokenizer tokenizer,
                 StringDistance tokenDistance,
                 double tokenMatchThreshold)

SoftTFIDF

public SoftTFIDF(StringDistance tokenDistance,
                 double tokenMatchThreshold)

SoftTFIDF

public SoftTFIDF(StringDistance tokenDistance)

Method Detail

setTokenMatchThreshold

public void setTokenMatchThreshold(double d)

setTokenMatchThreshold

public void setTokenMatchThreshold(java.lang.Double d)

getTokenMatchThreshold

public double getTokenMatchThreshold()

score

public double score(StringWrapper s,
                    StringWrapper t)
Description copied from class: AbstractStringDistance
This method needs to be implemented by subclasses.

Specified by:
score in interface StringDistance
Overrides:
score in class TFIDF

explainScore

public java.lang.String explainScore(StringWrapper s,
                                     StringWrapper t)
Explain how the distance was computed. In the output, the tokens in S and T are listed, and the common tokens are marked with an asterisk.

Specified by:
explainScore in interface StringDistance
Overrides:
explainScore in class TFIDF

toString

public java.lang.String toString()
Overrides:
toString in class TFIDF