Commit d8783c51 authored by Bruce Momjian's avatar Bruce Momjian

Per this discussion, here's a patch to implement both levenshtein() and

metaphone() in a contrib. There seem to be a fair number of different
approaches to both of these algorithms. I used the simplest case for
levenshtein which has a cost  of 1 for any character insertion, deletion, or
substitution. For metaphone, I adapted the same code from CPAN that the PHP
folks did.

A couple of questions:
1. Does it make sense to fold the soundex contrib together with this one?

2. I was debating trying to add multibyte support to levenshtein (it would
make no sense at all for metaphone), but a quick search through the contrib
directory found no hits on the word MULTIBYTE. Should worry about adding
multibyte support to levenshtein()?

Joe Conway
parent 0bc291e0
......@@ -55,6 +55,10 @@ fulltextindex -
Full text indexing using triggers
by Maarten Boekhold <maartenb@dutepp0.et.tudelft.nl>
fuzzystrmatch -
Levenshtein and Metaphone fuzzy string matching
by Joe Conway <joseph.conway@home.com>
intarray -
Index support for arrays of int4, using GiST
by Teodor Sigaev <teodor@stack.net> and Oleg Bartunov
......
subdir = contrib/fuzzystrmatch
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
# override libdir to install shlib in contrib not main directory
libdir := $(libdir)/contrib
# shared library parameters
NAME= fuzzystrmatch
SO_MAJOR_VERSION= 0
SO_MINOR_VERSION= 1
override CPPFLAGS := -I$(srcdir)/src/include $(CPPFLAGS)
OBJS= fuzzystrmatch.o
all: all-lib $(NAME).sql
# Shared library stuff
include $(top_srcdir)/src/Makefile.shlib
$(NAME).sql: $(NAME).sql.in
sed -e 's:MODULE_PATHNAME:$(libdir)/$(shlib):g' < $< > $@
install: all installdirs install-lib
installdirs:
$(mkinstalldirs) $(DESTDIR)$(libdir)
uninstall: uninstall-lib
clean distclean maintainer-clean: clean-lib
rm -f $(OBJS) $(NAME).sql
depend dep:
$(CC) -MM $(CFLAGS) *.c >depend
ifeq (depend,$(wildcard depend))
include depend
endif
/*
* fuzzystrmatch.c
*
* Functions for "fuzzy" comparison of strings
*
* Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
*
* levenshtein()
* -------------
* Written based on a description of the algorithm by Michael Gilleland
* found at http://www.merriampark.com/ld.htm
* Also looked at levenshtein.c in the PHP 4.0.6 distribution for
* inspiration.
*
* metaphone()
* -----------
* Modified for PostgreSQL by Joe Conway.
* Based on CPAN's "Text-Metaphone-1.96" by Michael G Schwern <schwern@pobox.com>
* Code slightly modified for use as PostgreSQL function (palloc, elog, etc).
* Metaphone was originally created by Lawrence Philips and presented in article
* in "Computer Language" December 1990 issue.
*
* Permission to use, copy, modify, and distribute this software and its
* documentation for any purpose, without fee, and without a written agreement
* is hereby granted, provided that the above copyright notice and this
* paragraph and the following two paragraphs appear in all copies.
*
* IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
* DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
* LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
* DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
* POSSIBILITY OF SUCH DAMAGE.
*
* THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
* AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS
* ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
* PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
*
*/
Version 0.1 (3 August, 2001):
Functions to calculate the degree to which two strings match in a "fuzzy" way
Tested under Linux (Red Hat 6.2 and 7.0) and PostgreSQL 7.2devel
Release Notes:
Version 0.1
- initial release
Installation:
Place these files in a directory called 'fuzzystrmatch' under 'contrib' in the PostgreSQL source tree. Then run:
make
make install
You can use fuzzystrmatch.sql to create the functions in your database of choice, e.g.
psql -U postgres template1 < fuzzystrmatch.sql
installs following functions into database template1:
levenshtein() - calculates the levenshtein distance between two strings
metaphone() - calculates the metaphone code of an input string
Documentation
==================================================================
Name
levenshtein -- calculates the levenshtein distance between two strings
Synopsis
levenshtein(text source, text target)
Inputs
source
any text string, 255 characters max, NOT NULL
target
any text string, 255 characters max, NOT NULL
Outputs
Returns int
Example usage
select levenshtein('GUMBO','GAMBOL');
==================================================================
Name
metaphone -- calculates the metaphone code of an input string
Synopsis
metaphone(text source, int max_output_length)
Inputs
source
any text string, 255 characters max, NOT NULL
max_output_length
maximum length of the output metaphone code; if longer, the output
is truncated to this length
Outputs
Returns text
Example usage
select metaphone('GUMBO',4);
==================================================================
-- Joe Conway
/*
* fuzzystrmatch.c
*
* Functions for "fuzzy" comparison of strings
*
* Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
*
* levenshtein()
* -------------
* Written based on a description of the algorithm by Michael Gilleland
* found at http://www.merriampark.com/ld.htm
* Also looked at levenshtein.c in the PHP 4.0.6 distribution for
* inspiration.
*
* metaphone()
* -----------
* Modified for PostgreSQL by Joe Conway.
* Based on CPAN's "Text-Metaphone-1.96" by Michael G Schwern <schwern@pobox.com>
* Code slightly modified for use as PostgreSQL function (palloc, elog, etc).
* Metaphone was originally created by Lawrence Philips and presented in article
* in "Computer Language" December 1990 issue.
*
* Permission to use, copy, modify, and distribute this software and its
* documentation for any purpose, without fee, and without a written agreement
* is hereby granted, provided that the above copyright notice and this
* paragraph and the following two paragraphs appear in all copies.
*
* IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
* DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
* LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
* DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
* POSSIBILITY OF SUCH DAMAGE.
*
* THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
* AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS
* ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
* PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
*
*/
#include "fuzzystrmatch.h"
/*
* Calculates Levenshtein Distance between two strings.
* Uses simplest and fastest cost model only, i.e. assumes a cost of 1 for
* each deletion, substitution, or insertion.
*/
PG_FUNCTION_INFO_V1(levenshtein);
Datum
levenshtein(PG_FUNCTION_ARGS)
{
char *str_s;
char *str_s0;
char *str_t;
int cols = 0;
int rows = 0;
int *u_cells;
int *l_cells;
int *tmp;
int i;
int j;
/*
* Fetch the arguments.
* str_s is referred to as the "source"
* cols = length of source + 1 to allow for the initialization column
* str_t is referred to as the "target", rows = length of target + 1
* rows = length of target + 1 to allow for the initialization row
*/
str_s = DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0))));
str_t = DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1))));
cols = strlen(str_s) + 1;
rows = strlen(str_t) + 1;
/*
* Restrict the length of the strings being compared to something reasonable
* because we will have to perform rows * cols calcualtions. If longer strings need to be
* compared, increase MAX_LEVENSHTEIN_STRLEN to suit (but within your tolerance for
* speed and memory usage).
*/
if ((cols > MAX_LEVENSHTEIN_STRLEN + 1) || (rows > MAX_LEVENSHTEIN_STRLEN + 1))
elog(ERROR, "levenshtein: Arguments may not exceed %d characters in length", MAX_LEVENSHTEIN_STRLEN);
/*
* If either rows or cols is 0, the answer is the other value.
* This makes sense since it would take that many insertions
* the build a matching string
*/
if (cols == 0)
PG_RETURN_INT32(rows);
if (rows == 0)
PG_RETURN_INT32(cols);
/*
* Allocate two vectors of integers. One will be used for the "upper" row,
* the other for the "lower" row. Initialize the "upper" row to 0..cols
*/
u_cells = palloc(sizeof(int) * cols);
for (i = 0; i < cols; i++)
u_cells[i] = i;
l_cells = palloc(sizeof(int) * cols);
/*
* Use str_s0 to "rewind" the pointer to str_s in the nested for loop below
*/
str_s0 = str_s;
/*
* Loop throught the rows, starting at row 1. Row 0 is used for the initial
* "upper" row.
*/
for (j = 1; j < rows; j++)
{
/*
* We'll always start with col 1,
* and initialize lower row col 0 to j
*/
l_cells[0] = j;
for (i = 1; i < cols; i++)
{
int c = 0;
int c1 = 0;
int c2 = 0;
int c3 = 0;
/*
* The "cost" value is 0 if the character at the current col position
* in the source string, matches the character at the current row position
* in the target string; cost is 1 otherwise.
*/
c = ((CHAREQ(str_s, str_t)) ? 0 : 1);
/*
* c1 is upper right cell plus 1
*/
c1 = u_cells[i] + 1;
/*
* c2 is lower left cell plus 1
*/
c2 = l_cells[i - 1] + 1;
/*
* c3 is cell diagonally above to the left plus "cost"
*/
c3 = u_cells[i - 1] + c;
/*
* The lower right cell is set to the minimum
* of c1, c2, c3
*/
l_cells[i] = (c1 < c2 ? c1 : c2) < c3 ? (c1 < c2 ? c1 : c2) : c3;
/*
* Increment the pointer to str_s
*/
NextChar(str_s);
}
/*
* Lower row now becomes the upper row, and the upper row
* gets reused as the new lower row.
*/
tmp = u_cells;
u_cells = l_cells;
l_cells = tmp;
/*
* Increment the pointer to str_t
*/
NextChar(str_t);
/*
* Rewind the pointer to str_s
*/
str_s = str_s0;
}
/*
* Because the final value (at position row, col) was swapped from
* the lower row to the upper row, that's where we'll find it.
*/
PG_RETURN_INT32(u_cells[cols - 1]);
}
/*
* Calculates the metaphone of an input string.
* Returns number of characters requested
* (suggested value is 4)
*/
PG_FUNCTION_INFO_V1(metaphone);
Datum
metaphone(PG_FUNCTION_ARGS)
{
int reqlen;
char *str_i;
size_t str_i_len;
char *metaph;
text *result_text;
int retval;
str_i = DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0))));
str_i_len = strlen(str_i);
if (str_i_len > MAX_METAPHONE_STRLEN)
elog(ERROR, "metaphone: Input string must not exceed %d characters", MAX_METAPHONE_STRLEN);
if (!(str_i_len > 0))
elog(ERROR, "metaphone: Input string length must be > 0");
reqlen = PG_GETARG_INT32(1);
if (reqlen > MAX_METAPHONE_STRLEN)
elog(ERROR, "metaphone: Requested Metaphone output length must not exceed %d characters", MAX_METAPHONE_STRLEN);
if (!(reqlen > 0))
elog(ERROR, "metaphone: Requested Metaphone output length must be > 0");
metaph = palloc(reqlen);
memset(metaph, '\0', reqlen);
retval = _metaphone(str_i, reqlen, &metaph);
if (retval == META_SUCCESS)
{
result_text = DatumGetTextP(DirectFunctionCall1(textin, CStringGetDatum(metaph)));
PG_RETURN_TEXT_P(result_text);
}
else
{
elog(ERROR, "metaphone: failure");
/*
* Keep the compiler quiet
*/
PG_RETURN_NULL();
}
}
/*
* Original code by Michael G Schwern starts here.
* Code slightly modified for use as PostgreSQL
* function (palloc, etc). Original includes
* are rolled into fuzzystrmatch.h
*------------------------------------------------------------------*/
/* I suppose I could have been using a character pointer instead of
* accesssing the array directly... */
/* Look at the next letter in the word */
#define Next_Letter (toupper(word[w_idx+1]))
/* Look at the current letter in the word */
#define Curr_Letter (toupper(word[w_idx]))
/* Go N letters back. */
#define Look_Back_Letter(n) (w_idx >= n ? toupper(word[w_idx-n]) : '\0')
/* Previous letter. I dunno, should this return null on failure? */
#define Prev_Letter (Look_Back_Letter(1))
/* Look two letters down. It makes sure you don't walk off the string. */
#define After_Next_Letter (Next_Letter != '\0' ? toupper(word[w_idx+2]) \
: '\0')
#define Look_Ahead_Letter(n) (toupper(Lookahead(word+w_idx, n)))
/* Allows us to safely look ahead an arbitrary # of letters */
/* I probably could have just used strlen... */
char Lookahead(char * word, int how_far) {
char letter_ahead = '\0'; /* null by default */
int idx;
for(idx = 0; word[idx] != '\0' && idx < how_far; idx++);
/* Edge forward in the string... */
letter_ahead = word[idx]; /* idx will be either == to how_far or
* at the end of the string
*/
return letter_ahead;
}
/* phonize one letter */
#define Phonize(c) {(*phoned_word)[p_idx++] = c;}
/* Slap a null character on the end of the phoned word */
#define End_Phoned_Word {(*phoned_word)[p_idx] = '\0';}
/* How long is the phoned word? */
#define Phone_Len (p_idx)
/* Note is a letter is a 'break' in the word */
#define Isbreak(c) (!isalpha(c))
int _metaphone (
/* IN */
char * word,
int max_phonemes,
/* OUT */
char ** phoned_word
) {
int w_idx = 0; /* point in the phonization we're at. */
int p_idx = 0; /* end of the phoned phrase */
/*-- Parameter checks --*/
/*
* Shouldn't be necessary, but left these here anyway
* jec Aug 3, 2001
*/
/* Negative phoneme length is meaningless */
if (!(max_phonemes > 0))
elog(ERROR, "metaphone: Requested output length must be > 0");
/* Empty/null string is meaningless */
if ((word == NULL) || !(strlen(word) > 0))
elog(ERROR, "metaphone: Input string length must be > 0");
/*-- Allocate memory for our phoned_phrase --*/
if (max_phonemes == 0) { /* Assume largest possible */
*phoned_word = palloc(sizeof(char) * strlen(word) + 1);
if (!*phoned_word)
return META_ERROR;
} else {
*phoned_word = palloc(sizeof(char) * max_phonemes + 1);
if (!*phoned_word)
return META_ERROR;
}
/*-- The first phoneme has to be processed specially. --*/
/* Find our first letter */
for( ; !isalpha(Curr_Letter); w_idx++ ) {
/* On the off chance we were given nothing but crap... */
if( Curr_Letter == '\0' ) {
End_Phoned_Word
return META_SUCCESS; /* For testing */
}
}
switch (Curr_Letter) {
/* AE becomes E */
case 'A':
if( Next_Letter == 'E' ) {
Phonize('E');
w_idx+=2;
}
/* Remember, preserve vowels at the beginning */
else {
Phonize('A');
w_idx++;
}
break;
/* [GKP]N becomes N */
case 'G':
case 'K':
case 'P':
if( Next_Letter == 'N' ) {
Phonize('N');
w_idx+=2;
}
break;
/* WH becomes H,
WR becomes R
W if followed by a vowel */
case 'W':
if( Next_Letter == 'H' ||
Next_Letter == 'R' )
{
Phonize(Next_Letter);
w_idx+=2;
}
else if ( isvowel(Next_Letter) ) {
Phonize('W');
w_idx+=2;
}
/* else ignore */
break;
/* X becomes S */
case 'X':
Phonize('S');
w_idx++;
break;
/* Vowels are kept */
/* We did A already
case 'A':
case 'a':
*/
case 'E':
case 'I':
case 'O':
case 'U':
Phonize(Curr_Letter);
w_idx++;
break;
default:
/* do nothing */
break;
}
/* On to the metaphoning */
for(; Curr_Letter != '\0' &&
(max_phonemes == 0 || Phone_Len < max_phonemes);
w_idx++) {
/* How many letters to skip because an eariler encoding handled
* multiple letters */
unsigned short int skip_letter = 0;
/* THOUGHT: It would be nice if, rather than having things like...
* well, SCI. For SCI you encode the S, then have to remember
* to skip the C. So the phonome SCI invades both S and C. It would
* be better, IMHO, to skip the C from the S part of the encoding.
* Hell, I'm trying it.
*/
/* Ignore non-alphas */
if( !isalpha(Curr_Letter) )
continue;
/* Drop duplicates, except CC */
if( Curr_Letter == Prev_Letter &&
Curr_Letter != 'C' )
continue;
switch (Curr_Letter) {
/* B -> B unless in MB */
case 'B':
if( Prev_Letter != 'M' )
Phonize('B');
break;
/* 'sh' if -CIA- or -CH, but not SCH, except SCHW.
* (SCHW is handled in S)
* S if -CI-, -CE- or -CY-
* dropped if -SCI-, SCE-, -SCY- (handed in S)
* else K
*/
case 'C':
if( MAKESOFT(Next_Letter) ) { /* C[IEY] */
if( After_Next_Letter == 'A' &&
Next_Letter == 'I' ) { /* CIA */
Phonize(SH);
}
/* SC[IEY] */
else if ( Prev_Letter == 'S' ) {
/* Dropped */
}
else {
Phonize('S');
}
}
else if ( Next_Letter == 'H' ) {
#ifndef USE_TRADITIONAL_METAPHONE
if( After_Next_Letter == 'R' ||
Prev_Letter == 'S' ) { /* Christ, School */
Phonize('K');
}
else {
Phonize(SH);
}
#else
Phonize(SH);
#endif
skip_letter++;
}
else {
Phonize('K');
}
break;
/* J if in -DGE-, -DGI- or -DGY-
* else T
*/
case 'D':
if( Next_Letter == 'G' &&
MAKESOFT(After_Next_Letter) ) {
Phonize('J');
skip_letter++;
}
else
Phonize('T');
break;
/* F if in -GH and not B--GH, D--GH, -H--GH, -H---GH
* else dropped if -GNED, -GN,
* else dropped if -DGE-, -DGI- or -DGY- (handled in D)
* else J if in -GE-, -GI, -GY and not GG
* else K
*/
case 'G':
if( Next_Letter == 'H' ) {
if( !( NOGHTOF(Look_Back_Letter(3)) ||
Look_Back_Letter(4) == 'H' ) ) {
Phonize('F');
skip_letter++;
}
else {
/* silent */
}
}
else if( Next_Letter == 'N' ) {
if( Isbreak(After_Next_Letter) ||
( After_Next_Letter == 'E' &&
Look_Ahead_Letter(3) == 'D' ) ) {
/* dropped */
}
else
Phonize('K');
}
else if( MAKESOFT(Next_Letter) &&
Prev_Letter != 'G' ) {
Phonize('J');
}
else {
Phonize('K');
}
break;
/* H if before a vowel and not after C,G,P,S,T */
case 'H':
if( isvowel(Next_Letter) &&
!AFFECTH(Prev_Letter) )
Phonize('H');
break;
/* dropped if after C
* else K
*/
case 'K':
if( Prev_Letter != 'C' )
Phonize('K');
break;
/* F if before H
* else P
*/
case 'P':
if( Next_Letter == 'H' ) {
Phonize('F');
}
else {
Phonize('P');
}
break;
/* K
*/
case 'Q':
Phonize('K');
break;
/* 'sh' in -SH-, -SIO- or -SIA- or -SCHW-
* else S
*/
case 'S':
if( Next_Letter == 'I' &&
( After_Next_Letter == 'O' ||
After_Next_Letter == 'A' ) ) {
Phonize(SH);
}
else if ( Next_Letter == 'H' ) {
Phonize(SH);
skip_letter++;
}
#ifndef USE_TRADITIONAL_METAPHONE
else if ( Next_Letter == 'C' &&
Look_Ahead_Letter(2) == 'H' &&
Look_Ahead_Letter(3) == 'W' ) {
Phonize(SH);
skip_letter += 2;
}
#endif
else {
Phonize('S');
}
break;
/* 'sh' in -TIA- or -TIO-
* else 'th' before H
* else T
*/
case 'T':
if( Next_Letter == 'I' &&
( After_Next_Letter == 'O' ||
After_Next_Letter == 'A' ) ) {
Phonize(SH);
}
else if ( Next_Letter == 'H' ) {
Phonize(TH);
skip_letter++;
}
else {
Phonize('T');
}
break;
/* F */
case 'V':
Phonize('F');
break;
/* W before a vowel, else dropped */
case 'W':
if( isvowel(Next_Letter) )
Phonize('W');
break;
/* KS */
case 'X':
Phonize('K');
Phonize('S');
break;
/* Y if followed by a vowel */
case 'Y':
if( isvowel(Next_Letter) )
Phonize('Y');
break;
/* S */
case 'Z':
Phonize('S');
break;
/* No transformation */
case 'F':
case 'J':
case 'L':
case 'M':
case 'N':
case 'R':
Phonize(Curr_Letter);
break;
default:
/* nothing */
break;
} /* END SWITCH */
w_idx += skip_letter;
} /* END FOR */
End_Phoned_Word;
return(META_SUCCESS);
} /* END metaphone */
/*
* fuzzystrmatch.h
*
* Functions for "fuzzy" comparison of strings
*
* Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
*
* levenshtein()
* -------------
* Written based on a description of the algorithm by Michael Gilleland
* found at http://www.merriampark.com/ld.htm
* Also looked at levenshtein.c in the PHP 4.0.6 distribution for
* inspiration.
*
* metaphone()
* -----------
* Modified for PostgreSQL by Joe Conway.
* Based on CPAN's "Text-Metaphone-1.96" by Michael G Schwern <schwern@pobox.com>
* Code slightly modified for use as PostgreSQL function (palloc, elog, etc).
* Metaphone was originally created by Lawrence Philips and presented in article
* in "Computer Language" December 1990 issue.
*
* Permission to use, copy, modify, and distribute this software and its
* documentation for any purpose, without fee, and without a written agreement
* is hereby granted, provided that the above copyright notice and this
* paragraph and the following two paragraphs appear in all copies.
*
* IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
* DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
* LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
* DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
* POSSIBILITY OF SUCH DAMAGE.
*
* THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
* AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS
* ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
* PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
*
*/
#ifndef FUZZYSTRMATCH_H
#define FUZZYSTRMATCH_H
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"
#define MAX_LEVENSHTEIN_STRLEN 255
#define MAX_METAPHONE_STRLEN 255
typedef struct dynmatrix
{
int value;
} dynmat;
/*
* External declarations
*/
extern Datum levenshtein(PG_FUNCTION_ARGS);
extern Datum metaphone(PG_FUNCTION_ARGS);
/*
* Internal declarations
*/
#define STRLEN(p) strlen(p)
#define CHAREQ(p1, p2) (*(p1) == *(p2))
#define NextChar(p) ((p)++)
/*
* Original code by Michael G Schwern starts here.
* Code slightly modified for use as PostgreSQL
* function (combined *.h into here).
*------------------------------------------------------------------*/
/**************************************************************************
metaphone -- Breaks english phrases down into their phonemes.
Input
word -- An english word to be phonized
max_phonemes -- How many phonemes to calculate. If 0, then it
will phonize the entire phrase.
phoned_word -- The final phonized word. (We'll allocate the
memory.)
Output
error -- A simple error flag, returns TRUE or FALSE
NOTES: ALL non-alpha characters are ignored, this includes whitespace,
although non-alpha characters will break up phonemes.
****************************************************************************/
/**************************************************************************
my constants -- constants I like
Probably redundant.
***************************************************************************/
#define META_ERROR FALSE
#define META_SUCCESS TRUE
#define META_FAILURE FALSE
/* I add modifications to the traditional metaphone algorithm that you
might find in books. Define this if you want metaphone to behave
traditionally */
#undef USE_TRADITIONAL_METAPHONE
/* Special encodings */
#define SH 'X'
#define TH '0'
char Lookahead(char * word, int how_far);
int _metaphone (
/* IN */
char * word,
int max_phonemes,
/* OUT */
char ** phoned_word
);
/* Metachar.h ... little bits about characters for metaphone */
/*-- Character encoding array & accessing macros --*/
/* Stolen directly out of the book... */
char _codes[26] = {
1,16,4,16,9,2,4,16,9,2,0,2,2,2,1,4,0,2,4,4,1,0,0,0,8,0
/* a b c d e f g h i j k l m n o p q r s t u v w x y z */
};
#define ENCODE(c) (isalpha(c) ? _codes[((toupper(c)) - 'A')] : 0)
#define isvowel(c) (ENCODE(c) & 1) /* AEIOU */
/* These letters are passed through unchanged */
#define NOCHANGE(c) (ENCODE(c) & 2) /* FJMNR */
/* These form dipthongs when preceding H */
#define AFFECTH(c) (ENCODE(c) & 4) /* CGPST */
/* These make C and G soft */
#define MAKESOFT(c) (ENCODE(c) & 8) /* EIY */
/* These prevent GH from becoming F */
#define NOGHTOF(c) (ENCODE(c) & 16) /* BDH */
#endif /* FUZZYSTRMATCH_H */
CREATE FUNCTION levenshtein (text,text) RETURNS int
AS 'MODULE_PATHNAME','levenshtein' LANGUAGE 'c' with (iscachable, isstrict);
CREATE FUNCTION metaphone (text,int) RETURNS text
AS 'MODULE_PATHNAME','metaphone' LANGUAGE 'c' with (iscachable, isstrict);
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment