Interface Summary | |
StringDistance | Compute the difference between pairs of strings. |
Class Summary | |
AbstractStatisticalTokenDistance | Abstract token distance metric that uses frequency statistics. |
AbstractStringDistance | Abstract StringDistance implementation, implementing a few useful defaults. |
AffineGap | Affine-gap string distance, following Durbin et al. |
CharJaccard | Character-based Jaccard distance: the distance between two strings is the Jaccard distance of the letters in them. |
CharMatchScore | Abstract distance between characters. |
DirichletJS | Jensen-Shannon distance of two unigram language models, smoothed using a Dirichlet prior. |
DistanceFactory | Creates distance metrics from string descriptions. |
Jaccard | Jaccard distance implementation. |
Jaro | Jaro distance metric. |
JaroWinkler | Jaro distance metric, as extended by Winkler. |
JaroWinklerTFIDF | Soft TFIDF-based distance metric, extended to use "soft" token-matching with the JaroWinkler distance metric. |
JelinekMercerJS | Jensen-Shannon distance of two unigram language models, smoothed using a Jelinek-Mercer mixture model. |
JensenShannonDistance | Distance metrics based on Jensen-Shannon distance of two smoothed unigram language models. |
Level2 | Generic version of Monge & Elkan's "level 2" recursive field matching. |
Level2Jaro | "Level 2" recursive field matching algorithm, based on Jaro distance. |
Level2JaroWinkler | "Level 2" recursive field matching algorithm, based on Jaro-Winkler distance. |
Level2Levenstein | "Level 2" recursive field matching algorithm using Levenshtein distance. |
Level2MongeElkan | Monge & Elkan's "level 2" recursive field matching algorithm. |
Level2SLIM | "Level 2" recursive field matching algorithm, based on SLIM distance. |
Level2SLIMWinkler | "Level 2" recursive field matching algorithm, based on SLIMWinkler distance. |
Levenstein | Levenshtein string distance. |
MemoMatrix | A matrix of doubles, defined recursively by the compute(i,j) method, that will not be recomputed more than necessary. |
Mixture | Mixture-based distance metric. |
MongeElkan | The match method proposed by Monge and Elkan. |
NeedlemanWunsch | Needleman-Wunsch string distance, following Durbin et al. |
PrintfFormat | PrintfFormat allows the formatting of an array of objects embedded within a string. |
SLIM | The same-letter index mixture distance. |
SlimTFIDF | Soft TFIDF-based distance metric, extended to use "soft" token-matching with the SLIM distance metric. |
SLIMWinkler | SLIM distance metric, with extensions proposed by Winkler for the Jaro metric. |
SmithWaterman | Smith-Waterman string distance, following Durbin et al. |
SoftTFIDF | TFIDF-based distance metric, extended to use "soft" token-matching. |
SoftTokenFelligiSunter | Highly simplified model of Fellegi-Sunter's method 1, applied to tokens. |
StringWrapper | An extensible (non-final) class that implements some of the functionality of a string. |
TestPackage | |
TestPackage.MyFixture | |
TFIDF | TFIDF-based distance metric. |
TokenFelligiSunter | Highly simplified model of Fellegi-Sunter's method 1, applied to tokens. |
UnsmoothedJS | Jensen-Shannon distance of two unsmoothed unigram language models. |
WinklerRescorer | Winkler's reweighting scheme for distance metrics. |
This package contains a collection of approximate string comparators, plus code for performing controlled experiments with them.
StringDistance is the basic interface for computing distances. Its
score() method outputs a distance measure between its two arguments.
The other methods are there for efficiency, so that preprocessing steps
(like tokenization) can be amortized over multiple comparisons with the
same string.
Preprocessing results are saved by creating a StringWrapper object,
which contains the preprocessed string plus whatever else needs to be
cached. To do this, extend the default implementation of StringWrapper.
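The prepare-once, compare-many pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the package's own code: the names TokenWrapper, TokenJaccard and prepare(), the whitespace tokenizer, and the convention that a higher score means a closer match are all assumptions made for the example, and the real StringWrapper and StringDistance signatures may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper: keeps the raw string plus a cached token set.
class TokenWrapper {
    final String raw;
    final Set<String> tokens;

    TokenWrapper(String raw) {
        this.raw = raw;
        // Tokenize once, here, so later comparisons can reuse the result.
        String trimmed = raw.trim().toLowerCase();
        this.tokens = trimmed.isEmpty()
                ? new HashSet<>()
                : new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }
}

// Hypothetical token-level Jaccard score built on the cached tokens.
// Assumption for this sketch: 1.0 means identical token sets, 0.0 means disjoint.
class TokenJaccard {
    TokenWrapper prepare(String s) {
        return new TokenWrapper(s);
    }

    // Scores two already-prepared strings; no tokenization happens here.
    double score(TokenWrapper s, TokenWrapper t) {
        if (s.tokens.isEmpty() && t.tokens.isEmpty()) {
            return 1.0;
        }
        Set<String> common = new HashSet<>(s.tokens);
        common.retainAll(t.tokens);
        int unionSize = s.tokens.size() + t.tokens.size() - common.size();
        return (double) common.size() / unionSize;
    }

    // Convenience overload: prepares both arguments on every call.
    double score(String s, String t) {
        return score(prepare(s), prepare(t));
    }
}

class TokenJaccardDemo {
    public static void main(String[] args) {
        TokenJaccard dist = new TokenJaccard();
        // The query is tokenized once and reused for every candidate.
        TokenWrapper query = dist.prepare("john q smith");
        for (String candidate : new String[] {"john smith", "j smith", "mary jones"}) {
            System.out.println(candidate + " -> " + dist.score(query, dist.prepare(candidate)));
        }
    }
}
```

When one string is compared against many candidates, only the wrapper-based score() avoids re-tokenizing the query; the String-based overload is convenient for one-off comparisons.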
Almost everything in this package implements StringDistance. The only
(public) exceptions are StringWrapper; PrintfFormat, pilfered from Sun
to make the explanations easier; CharMatchScore, which is a
character-based distance metric; and MemoMatrix, a utility for defining
edit-distance-based methods.
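To illustrate how a memoized matrix can be used to define an edit-distance method, here is a small self-contained sketch. It is not the package's MemoMatrix: the class names LazyMatrix and EditDistanceSketch, the get() accessor, and the unit insert/delete/substitute costs are assumptions chosen for the example. Each cell is defined recursively by compute(i, j) and evaluated at most once.

```java
// Hypothetical memoized matrix: each cell is defined by compute(i, j)
// and is evaluated at most once, on first access.
abstract class LazyMatrix {
    private final double[][] value;
    private final boolean[][] known;

    LazyMatrix(int rows, int cols) {
        value = new double[rows][cols];
        known = new boolean[rows][cols];
    }

    double get(int i, int j) {
        if (!known[i][j]) {
            value[i][j] = compute(i, j);
            known[i][j] = true;
        }
        return value[i][j];
    }

    // Subclasses define a cell, typically in terms of get() on neighbouring cells.
    abstract double compute(int i, int j);
}

class EditDistanceSketch {
    // Unit-cost edit distance between s and t (assumed costs: 1 per edit).
    static double distance(String s, String t) {
        LazyMatrix m = new LazyMatrix(s.length() + 1, t.length() + 1) {
            @Override
            double compute(int i, int j) {
                if (i == 0) return j;   // insert the first j characters of t
                if (j == 0) return i;   // delete the first i characters of s
                double subst = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                return Math.min(get(i - 1, j - 1) + subst,
                        Math.min(get(i - 1, j) + 1, get(i, j - 1) + 1));
            }
        };
        return m.get(s.length(), t.length());
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3.0
    }
}
```

Because cells are filled by recursion, the call depth grows with the combined length of the two strings, so this sketch suits short strings only.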