com.wcohen.ss
Class TFIDF

java.lang.Object
  extended by com.wcohen.ss.AbstractStringDistance
      extended by com.wcohen.ss.AbstractTokenizedStringDistance
          extended by com.wcohen.ss.AbstractStatisticalTokenDistance
              extended by com.wcohen.ss.TFIDF
All Implemented Interfaces:
StringDistance, StringDistanceLearner
Direct Known Subclasses:
SoftTFIDF

public class TFIDF
extends AbstractStatisticalTokenDistance

TFIDF-based distance metric.


Nested Class Summary
protected  class TFIDF.UnitVector
          Marker class extending BagOfTokens
 
Field Summary
 
Fields inherited from class com.wcohen.ss.AbstractStatisticalTokenDistance
collectionSize, documentFrequency, totalTokenCount
 
Fields inherited from class com.wcohen.ss.AbstractTokenizedStringDistance
tokenizer
 
Constructor Summary
TFIDF()
           
TFIDF(Tokenizer tokenizer)
           
 
Method Summary
protected  TFIDF.UnitVector asUnitVector(StringWrapper w)
           
 java.lang.String explainScore(StringWrapper s, StringWrapper t)
          Explain how the distance was computed.
 int getCollectionSize()
           
 int getDocumentFrequency(Token token)
          Get the document frequency of the token.
 Token[] getTokens()
          Access the tokens of the last prepare()-ed string.
 double getWeight(Token token)
          Access the weight of a token in the vector created for the last prepare()-ed string.
static void main(java.lang.String[] argv)
           
 StringWrapper prepare(java.lang.String s)
          Preprocess a string by finding tokens and giving them TFIDF weights.
 double score(StringWrapper s, StringWrapper t)
          This method needs to be implemented by subclasses.
 void setCollectionSize(int n)
          Set the size of the collection that this TFIDF measure was trained on to some value.
 void setDocumentFrequency(Token token, int df)
          Set the document frequency of the token to some value.
 java.lang.String toString()
           
 
Methods inherited from class com.wcohen.ss.AbstractStatisticalTokenDistance
checkTrainingHasHappened, train
 
Methods inherited from class com.wcohen.ss.AbstractTokenizedStringDistance
asBagOfTokens, prepare, setStringWrapperPool
 
Methods inherited from class com.wcohen.ss.AbstractStringDistance
addExample, doMain, explainScore, getDistance, hasNextQuery, nextQuery, prepare, score, setDistanceInstancePool
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TFIDF

public TFIDF(Tokenizer tokenizer)

TFIDF

public TFIDF()
Method Detail

score

public double score(StringWrapper s,
                    StringWrapper t)
Description copied from class: AbstractStringDistance
This method needs to be implemented by subclasses.

Specified by:
score in interface StringDistance
Specified by:
score in class AbstractStringDistance

asUnitVector

protected TFIDF.UnitVector asUnitVector(StringWrapper w)

prepare

public StringWrapper prepare(java.lang.String s)
Preprocess a string by finding tokens and giving them TFIDF weights.

Specified by:
prepare in interface StringDistance
Overrides:
prepare in class AbstractStringDistance
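
As a rough illustration of what "finding tokens and giving them TFIDF weights" can involve, the self-contained sketch below tokenizes on whitespace, weights each token by log(tf + 1) * log(N / df), and scales the result to unit length so that a dot product of two vectors is their cosine similarity. The class name TfidfWeightSketch, the whitespace tokenizer, the treatment of unseen tokens, and the weighting formula itself are assumptions made for illustration; this page does not state the formula the library uses.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; not part of com.wcohen.ss. */
final class TfidfWeightSketch {

    /** Count how often each lower-cased, whitespace-separated token occurs. */
    static Map<String, Integer> termFrequencies(String s) {
        Map<String, Integer> tf = new HashMap<>();
        for (String tok : s.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                tf.merge(tok, 1, Integer::sum);
            }
        }
        return tf;
    }

    /**
     * Weight each token by log(tf + 1) * log(N / df), then scale to unit length.
     * Tokens missing from documentFrequency are treated as df = 1 (an assumption).
     */
    static Map<String, Double> unitVector(String s,
                                          Map<String, Integer> documentFrequency,
                                          int collectionSize) {
        Map<String, Double> weights = new HashMap<>();
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> e : termFrequencies(s).entrySet()) {
            int df = Math.max(1, documentFrequency.getOrDefault(e.getKey(), 1));
            double w = Math.log(e.getValue() + 1.0)
                     * Math.log((double) collectionSize / df);
            weights.put(e.getKey(), w);
            sumOfSquares += w * w;
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm > 0.0) {
            // Divide every weight by the Euclidean norm of the vector.
            for (Map.Entry<String, Double> e : weights.entrySet()) {
                e.setValue(e.getValue() / norm);
            }
        }
        return weights;
    }

    /** Dot product of two unit vectors, which equals their cosine similarity. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        return dot;
    }
}
```

Under this weighting, a token that appears in every document of the collection gets weight zero, so rare tokens dominate the comparison.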

getTokens

public Token[] getTokens()
Access the tokens of the last prepare()-ed string.


getWeight

public double getWeight(Token token)
Access the weight of a token in the vector created for the last prepare()-ed string.


getDocumentFrequency

public int getDocumentFrequency(Token token)
Get the document frequency of the token.

Overrides:
getDocumentFrequency in class AbstractStatisticalTokenDistance

setDocumentFrequency

public void setDocumentFrequency(Token token,
                                 int df)
Set the document frequency of the token to some value. Setting the collectionSize and also setting the document frequency of every token is an alternative to explicit training.
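
To make the "alternative to explicit training" concrete, here is a minimal self-contained sketch of a statistics holder whose collection size and per-token document frequencies are filled in by hand rather than counted from a corpus. The class ManualStatisticsSketch, its isReady check, the log(N / df) formula, and the handling of unseen tokens are all hypothetical and only mirror the setter names documented on this page; the library's own behavior may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not part of com.wcohen.ss. */
final class ManualStatisticsSketch {
    private final Map<String, Integer> documentFrequency = new HashMap<>();
    private int collectionSize = 0;

    /** Record how many documents the hand-set statistics describe. */
    void setCollectionSize(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("collection size must be non-negative");
        }
        collectionSize = n;
    }

    /** Record in how many documents the token occurs. */
    void setDocumentFrequency(String token, int df) {
        if (df < 0) {
            throw new IllegalArgumentException("document frequency must be non-negative");
        }
        documentFrequency.put(token, df);
    }

    /** Returns 0 for tokens that were never registered. */
    int getDocumentFrequency(String token) {
        return documentFrequency.getOrDefault(token, 0);
    }

    /** True once both a collection size and at least one frequency are set. */
    boolean isReady() {
        return collectionSize > 0 && !documentFrequency.isEmpty();
    }

    /** log(N / df); unregistered tokens are treated as df = 1 (an assumption). */
    double inverseDocumentFrequency(String token) {
        if (!isReady()) {
            throw new IllegalStateException("set collection size and document frequencies first");
        }
        int df = Math.max(1, getDocumentFrequency(token));
        return Math.log((double) collectionSize / df);
    }
}
```

A caller would invoke setCollectionSize once and setDocumentFrequency for each token of interest before asking for any weights, which plays the role that training would otherwise play.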


getCollectionSize

public int getCollectionSize()

setCollectionSize

public void setCollectionSize(int n)
Set the size of the collection that this TFIDF measure was trained on to some value. Setting the collectionSize and also setting the document frequency of every token is an alternative to explicit training.


explainScore

public java.lang.String explainScore(StringWrapper s,
                                     StringWrapper t)
Explain how the distance was computed. In the output, the tokens in S and T are listed, and the common tokens are marked with an asterisk.

Specified by:
explainScore in interface StringDistance
Specified by:
explainScore in class AbstractStringDistance
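
The kind of listing described for explainScore, with the tokens of S and T printed and shared tokens starred, can be sketched as follows. The class ExplainSketch, the whitespace tokenizer, and the exact output layout are assumptions for illustration; the real method's output format is not shown on this page and presumably also reports weights.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch; not part of com.wcohen.ss. */
final class ExplainSketch {

    /** Distinct lower-cased tokens in order of first appearance. */
    static Set<String> tokens(String s) {
        Set<String> result = new LinkedHashSet<>();
        for (String tok : s.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                result.add(tok);
            }
        }
        return result;
    }

    /** List the tokens of one string, prefixing those also found in the other with '*'. */
    static String mark(Set<String> own, Set<String> other) {
        StringBuilder sb = new StringBuilder();
        for (String tok : own) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            if (other.contains(tok)) {
                sb.append('*');
            }
            sb.append(tok);
        }
        return sb.toString();
    }

    /** Two-line explanation: tokens of S, then tokens of T. */
    static String explain(String s, String t) {
        Set<String> sTokens = tokens(s);
        Set<String> tTokens = tokens(t);
        return "S: " + mark(sTokens, tTokens) + "\n"
             + "T: " + mark(tTokens, sTokens);
    }
}
```

With this sketch, ExplainSketch.explain("William W Cohen", "Will Cohen") yields the two lines "S: william w *cohen" and "T: will *cohen".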

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

main

public static void main(java.lang.String[] argv)