Package com.wcohen.secondstring

This package contains a collection of approximate string comparators, plus code for performing controlled experiments with them.

Interface Summary
StringDistance Compute the difference between pairs of strings.
 

Class Summary
AbstractStatisticalTokenDistance Abstract token distance metric that uses frequency statistics.
AbstractStringDistance Abstract StringDistance implementation, implementing a few useful defaults.
AffineGap Affine-gap string distance, following Durbin et al.
CharJaccard Character-based Jaccard distance: the distance between two strings is the Jaccard distance of the letters in them.
CharMatchScore Abstract distance between characters.
DirichletJS Jensen-Shannon distance of two unigram language models, smoothed using Dirichlet prior.
DistanceFactory Creates distance metrics from string descriptions.
Jaccard Jaccard distance implementation.
Jaro Jaro distance metric.
JaroWinkler Jaro distance metric, as extended by Winkler.
JaroWinklerTFIDF Soft TFIDF-based distance metric, extended to use "soft" token-matching with the JaroWinkler distance metric.
JelinekMercerJS Jensen-Shannon distance of two unigram language models, smoothed using Jelinek-Mercer mixture model.
JensenShannonDistance Distance metrics based on Jensen-Shannon distance of two smoothed unigram language models.
Level2 Generic version of Monge & Elkan's "level 2" recursive field matching.
Level2Jaro "Level 2" recursive field matching algorithm, based on Jaro distance.
Level2JaroWinkler "Level 2" recursive field matching algorithm, based on Jaro-Winkler distance.
Level2Levenstein "Level 2" recursive field matching algorithm using Levenshtein distance.
Level2MongeElkan Monge & Elkan's "level 2" recursive field matching algorithm.
Level2SLIM "Level 2" recursive field matching algorithm, based on SLIM distance.
Level2SLIMWinkler "Level 2" recursive field matching algorithm, based on SLIMWinkler distance.
Levenstein Levenshtein string distance.
MemoMatrix A matrix of doubles, defined recursively by the compute(i,j) method, that will not be recomputed more than necessary.
Mixture Mixture-based distance metric.
MongeElkan The match method proposed by Monge and Elkan.
NeedlemanWunsch Needleman-Wunsch string distance, following Durbin et al.
PrintfFormat PrintfFormat allows the formatting of an array of objects embedded within a string.
SLIM The same-letter index mixture distance.
SlimTFIDF Soft TFIDF-based distance metric, extended to use "soft" token-matching with the SLIM distance metric.
SLIMWinkler SLIM distance metric, with extensions proposed by Winkler for the Jaro metric.
SmithWaterman Smith-Waterman string distance, following Durbin et al.
SoftTFIDF TFIDF-based distance metric, extended to use "soft" token-matching.
SoftTokenFelligiSunter Highly simplified model of Fellegi-Sunter's method 1, applied to tokens.
StringWrapper An extendible (non-final) class that implements some of the functionality of a string.
TestPackage  
TestPackage.MyFixture  
TFIDF TFIDF-based distance metric.
TokenFelligiSunter Highly simplified model of Fellegi-Sunter's method 1, applied to tokens.
UnsmoothedJS Jensen-Shannon distance of two unsmoothed unigram language models.
WinklerRescorer Winkler's reweighting scheme for distance metrics.
 

Package com.wcohen.secondstring Description

This package contains a collection of approximate string comparators, plus code for performing controlled experiments with them.

StringDistance is the basic interface for computing distances. Its score() method returns a distance measure between its two arguments. The other methods exist for efficiency: they allow preprocessing steps (such as tokenization) to be amortized over multiple comparisons with the same string.
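
The prepare-once, score-many calling pattern can be sketched as follows. This is a minimal, self-contained illustration and not the package's actual code: the interface PreparedDistance, the class CharOverlap, and the prepare() and two-argument score() signatures are hypothetical stand-ins, and the metric is a simple character-set Jaccard overlap (higher means more similar) chosen only for brevity.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a distance interface with a preprocessing step.
interface PreparedDistance<P> {
    // One-time preprocessing of a string.
    P prepare(String s);

    // Compare two already-preprocessed strings.
    double score(P s, P t);

    // Convenience form: preprocess both arguments, then compare.
    default double score(String s, String t) {
        return score(prepare(s), prepare(t));
    }
}

// Character-set Jaccard overlap: |intersection| / |union| of the characters used.
class CharOverlap implements PreparedDistance<Set<Character>> {
    @Override
    public Set<Character> prepare(String s) {
        Set<Character> chars = new HashSet<>();
        for (char c : s.toCharArray()) {
            chars.add(c);
        }
        return chars;
    }

    @Override
    public double score(Set<Character> s, Set<Character> t) {
        if (s.isEmpty() && t.isEmpty()) {
            return 1.0; // two empty strings are treated as identical
        }
        Set<Character> common = new HashSet<>(s);
        common.retainAll(t);
        int union = s.size() + t.size() - common.size();
        return (double) common.size() / union;
    }
}

class PreparedDistanceDemo {
    public static void main(String[] args) {
        CharOverlap d = new CharOverlap();
        // The query is preprocessed once and reused for every candidate.
        Set<Character> query = d.prepare("william");
        for (String cand : new String[] {"willam", "bill", "cohen"}) {
            System.out.println(cand + " " + d.score(query, d.prepare(cand)));
        }
    }
}
```

When one string is compared against many candidates, only the candidates need preprocessing inside the loop; the query's work is done once.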

Preprocessing results are saved by creating a StringWrapper object, which holds the preprocessed string plus whatever else needs to be cached. To do this, extend the default implementation of StringWrapper.
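
As a rough illustration of caching preprocessed data in a wrapper subclass, consider the sketch below. It does not use the package's StringWrapper: BasicWrapper, TokenBagWrapper, and TokenOverlap are hypothetical names, and the tokenization rule (lower-casing and splitting on non-word characters) is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical base wrapper: just holds the original string.
class BasicWrapper {
    private final String s;

    BasicWrapper(String s) {
        this.s = s;
    }

    String unwrap() {
        return s;
    }

    int length() {
        return s.length();
    }
}

// Extension that caches a bag of lower-cased tokens, built once at construction.
class TokenBagWrapper extends BasicWrapper {
    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;

    TokenBagWrapper(String s) {
        super(s);
        for (String tok : s.toLowerCase().split("\\W+")) {
            if (tok.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            counts.merge(tok, 1, Integer::sum);
            total++;
        }
    }

    int count(String tok) {
        return counts.getOrDefault(tok, 0);
    }

    int totalTokens() {
        return total;
    }

    Set<String> tokens() {
        return counts.keySet();
    }
}

// A metric that reads only the cached token bag and never re-tokenizes.
class TokenOverlap {
    // Shared tokens (multiset intersection) divided by the size of the larger bag.
    static double score(TokenBagWrapper s, TokenBagWrapper t) {
        int shared = 0;
        for (String tok : s.tokens()) {
            shared += Math.min(s.count(tok), t.count(tok));
        }
        int denom = Math.max(s.totalTokens(), t.totalTokens());
        return denom == 0 ? 1.0 : (double) shared / denom;
    }
}

class TokenBagDemo {
    public static void main(String[] args) {
        TokenBagWrapper a = new TokenBagWrapper("William W. Cohen");
        TokenBagWrapper b = new TokenBagWrapper("Cohen, William");
        System.out.println(TokenOverlap.score(a, b));
    }
}
```

Because the token bag lives in the wrapper, any number of comparisons against the same wrapped string reuse it.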

Almost everything in this package implements StringDistance. The only (public) exceptions are StringWrapper; PrintfFormat, pilfered from Sun to make the explanations easier; CharMatchScore, which is a character-based distance metric; and MemoMatrix, a utility for defining edit-distance-based methods.
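
The idea behind a memoized matrix for edit-distance methods can be sketched as follows. This is an illustration under stated assumptions rather than the package's MemoMatrix: the class names LazyMatrix and EditDistanceMatrix and the get() accessor are hypothetical, and the recurrence shown is plain unit-cost edit distance.

```java
// Hypothetical memoizing matrix: subclasses define compute(i, j);
// get(i, j) evaluates each cell at most once and caches the result.
abstract class LazyMatrix {
    private final double[][] value;
    private final boolean[][] done;

    LazyMatrix(int rows, int cols) {
        value = new double[rows][cols];
        done = new boolean[rows][cols];
    }

    // Recursive definition of cell (i, j); may call get() on other cells.
    abstract double compute(int i, int j);

    double get(int i, int j) {
        if (!done[i][j]) {
            value[i][j] = compute(i, j);
            done[i][j] = true;
        }
        return value[i][j];
    }
}

// Unit-cost edit distance expressed as a recursive cell definition.
class EditDistanceMatrix extends LazyMatrix {
    private final String s;
    private final String t;

    EditDistanceMatrix(String s, String t) {
        super(s.length() + 1, t.length() + 1);
        this.s = s;
        this.t = t;
    }

    @Override
    double compute(int i, int j) {
        if (i == 0) {
            return j; // insert the first j characters of t
        }
        if (j == 0) {
            return i; // delete the first i characters of s
        }
        double subst = get(i - 1, j - 1) + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1);
        double delete = get(i - 1, j) + 1;
        double insert = get(i, j - 1) + 1;
        return Math.min(subst, Math.min(delete, insert));
    }

    double distance() {
        return get(s.length(), t.length());
    }
}

class LazyMatrixDemo {
    public static void main(String[] args) {
        System.out.println(new EditDistanceMatrix("kitten", "sitting").distance());
    }
}
```

The recursion depth grows with the combined length of the two strings, so this form suits short strings such as names; an iterative fill would be a safer choice for long inputs.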