Package com.wcohen.ss

This package contains a bunch of approximate string comparators, plus code for performing controlled experiments with them.


Class Summary
AbstractStatisticalTokenDistance Abstract token distance metric that uses frequency statistics.
AbstractStringDistance Abstract class which implements StringDistanceLearner as well as StringDistance.
AbstractTokenizedStringDistance Abstract distance metric for tokenized strings.
AdaptiveStringDistanceLearner Abstract StringDistanceLearner class which averages results of a number of inner distance metrics, learned by a number of inner distance learners.
AffineGap Affine-gap string distance, following Durbin et al.
ApproxMemoMatrix Variant of MemoMatrix that only stores values near the diagonal, for better efficiency.
ApproxNeedlemanWunsch Needleman-Wunsch string distance, following Durbin et al.
AveragedStringDistanceLearner Abstract StringDistanceLearner class which averages results of a number of inner distance metrics, learned by a number of inner distance learners.
BasicDistanceInstanceIterator A simple DistanceInstanceIterator implementation.
BasicStringWrapper An extendible (non-final) class that implements some of the functionality of a string.
BasicStringWrapperIterator A simple StringWrapperIterator implementation.
CharMatchScore Abstract distance between characters.
CombinedStringDistanceLearner Abstract StringDistanceLearner class which combines results of a number of inner distance metrics, learned by a number of inner distance learners.
DirichletJS Jensen-Shannon distance of two unigram language models, smoothed using Dirichlet prior.
DistanceLearnerFactory Creates distance metric learners from string descriptions.
Jaccard Jaccard distance implementation.
Jaro Jaro distance metric.
JaroWinkler Jaro distance metric, as extended by Winkler.
JaroWinklerTFIDF Soft TFIDF-based distance metric, extended to use "soft" token-matching with the JaroWinkler distance metric.
JelinekMercerJS Jensen-Shannon distance of two unigram language models, smoothed using Jelinek-Mercer mixture model.
JensenShannonDistance Distance metrics based on Jensen-Shannon distance of two smoothed unigram language models.
Level2 Generic version of Monge & Elkan's "level 2" recursive field matching.
Level2Jaro "Level 2" recursive field matching algorithm, based on Jaro distance.
Level2JaroWinkler "Level 2" recursive field matching algorithm, based on Jaro-Winkler distance.
Level2Levenstein "Level 2" recursive field matching algorithm using Levenshtein distance.
Level2MongeElkan Monge & Elkan's "level 2" recursive field matching algorithm.
Levenstein Levenshtein string distance.
MemoMatrix A matrix of doubles, defined recursively by the compute(i,j) method, that will not be recomputed more than necessary.
Mixture Mixture-based distance metric.
MongeElkan The match method proposed by Monge and Elkan.
MultiStringAvgDistance StringDistance defined over Strings that are broken into fields, with distance defined as the average distance between any field.
MultiStringDistance Abstract class StringDistance defined over Strings that are broken into fields.
MultiStringWrapper A StringWrapper that stores a version of the string that has been either (a) split into a number of distinct fields, (b) duplicated k times, so that k different StringDistance objects can preprocess it, or (c) both of the above.
NeedlemanWunsch Needleman-Wunsch string distance, following Durbin et al.
PrintfFormat PrintfFormat allows the formatting of an array of objects embedded within a string.
ScaledLevenstein Levenshtein string distance.
SmithWaterman Smith-Waterman string distance, following Durbin et al.
SoftTFIDF TFIDF-based distance metric, extended to use "soft" token-matching.
SoftTokenFelligiSunter Highly simplified model of Fellegi-Sunter's method 1, applied to tokens.
TagLink  
TagLink.Candidates  
TFIDF TFIDF-based distance metric.
TokenFelligiSunter Highly simplified model of Fellegi-Sunter's method 1, applied to tokens.
UnsmoothedJS Jensen-Shannon distance of two unsmoothed unigram language models.
WinklerRescorer Winkler's reweighting scheme for distance metrics.
WizardUI Top-level GUI interface.
 

Package com.wcohen.ss Description

This package contains a bunch of approximate string comparators, plus code for performing controlled experiments with them.

StringDistance is the basic interface for computing distances. Its score() method returns a distance measure between its two arguments. The other methods exist for efficiency: they let preprocessing steps (such as tokenization) be amortized over multiple comparisons with the same string.

Preprocessing results are saved by creating a StringWrapper object, which contains the preprocessed string plus whatever else needs to be cached. To do this, extend the default implementation of StringWrapper.
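
The prepare-once, score-many pattern described above can be sketched as follows. This is a minimal, self-contained illustration and does not use the package's own classes: TokenWrapper, prepare() and both score() overloads are hypothetical names chosen for the example, and the scoring rule (Jaccard overlap of token sets, where higher means more similar) is an assumption made only to keep the sketch short.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: these names are illustrative and are NOT the com.wcohen.ss API.
class PrepareOnceSketch {

    /** Holds the raw string plus a cached token set, built once in the constructor. */
    static class TokenWrapper {
        final String raw;
        final Set<String> tokens;

        TokenWrapper(String raw) {
            this.raw = raw;
            // Preprocessing step: lowercase, split on whitespace, cache the result.
            this.tokens = new HashSet<>(Arrays.asList(raw.toLowerCase().trim().split("\\s+")));
        }
    }

    /** Wraps a string once so that later comparisons can reuse the cached tokens. */
    static TokenWrapper prepare(String s) {
        return new TokenWrapper(s);
    }

    /** Jaccard overlap of the cached token sets: |intersection| / |union|. */
    static double score(TokenWrapper s, TokenWrapper t) {
        Set<String> intersection = new HashSet<>(s.tokens);
        intersection.retainAll(t.tokens);
        Set<String> union = new HashSet<>(s.tokens);
        union.addAll(t.tokens);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    /** Convenience overload that pays the preprocessing cost on every call. */
    static double score(String s, String t) {
        return score(prepare(s), prepare(t));
    }

    public static void main(String[] args) {
        TokenWrapper query = prepare("William W Cohen");   // tokenized once
        String[] candidates = {"Cohen William", "W Cohen", "Pradeep Ravikumar"};
        for (String c : candidates) {
            // The query's token set is reused for every candidate.
            System.out.println(c + " -> " + score(query, prepare(c)));
        }
    }
}
```

When one string is compared against many others, calling prepare() once and passing the wrapper to score() avoids re-tokenizing that string for each comparison; the String overload is simpler but repeats the work on every call.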

Almost everything in this package implements StringDistance. The only (public) exceptions are StringWrapper; PrintfFormat, pilfered from Sun to make the explanations easier; CharMatchScore, which is a character-based distance metric; and MemoMatrix, a utility for defining edit-distance-based methods.
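
To illustrate how a memoized matrix can serve as a utility for edit-distance methods, here is a small sketch of the idea: cells are defined recursively by compute(i,j) and each cell is evaluated at most once. LazyMatrix, get() and editDistance() are hypothetical names for this example and should not be read as the actual MemoMatrix API; the cost model (unit cost for insert, delete and substitute) is likewise an assumption.

```java
// Hypothetical sketch: LazyMatrix and its methods are illustrative names,
// not the actual MemoMatrix API.
class MemoSketch {

    /** A matrix of doubles whose cells are filled on demand and cached. */
    static abstract class LazyMatrix {
        private final double[][] value;
        private final boolean[][] computed;

        LazyMatrix(int rows, int cols) {
            value = new double[rows][cols];
            computed = new boolean[rows][cols];
        }

        /** Recursive definition of cell (i,j); implementations call get() for neighbours. */
        abstract double compute(int i, int j);

        /** Returns cell (i,j), computing it at most once. */
        double get(int i, int j) {
            if (!computed[i][j]) {
                value[i][j] = compute(i, j);
                computed[i][j] = true;
            }
            return value[i][j];
        }
    }

    /** Unit-cost edit distance between s and t, defined through the lazy matrix. */
    static double editDistance(final String s, final String t) {
        LazyMatrix m = new LazyMatrix(s.length() + 1, t.length() + 1) {
            @Override
            double compute(int i, int j) {
                if (i == 0) return j;   // insert the first j characters of t
                if (j == 0) return i;   // delete the first i characters of s
                double sub = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                return Math.min(get(i - 1, j - 1) + sub,
                       Math.min(get(i - 1, j) + 1, get(i, j - 1) + 1));
            }
        };
        return m.get(s.length(), t.length());
    }

    public static void main(String[] args) {
        System.out.println(editDistance("kitten", "sitting"));
    }
}
```

Writing the recurrence in compute() and reading neighbours through get() keeps each distance definition close to its textbook form, while the cache prevents the exponential blow-up of naive recursion. The recursion depth grows with the string lengths, so this sketch is meant for short strings.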