Commit d8783c51 authored by Bruce Momjian's avatar Bruce Momjian

Per this discussion, here's a patch to implement both levenshtein() and

metaphone() in a contrib. There seem to be a fair number of different
approaches to both of these algorithms. I used the simplest case for
levenshtein which has a cost  of 1 for any character insertion, deletion, or
substitution. For metaphone, I adapted the same code from CPAN that the PHP
folks did.

A couple of questions:
1. Does it make sense to fold the soundex contrib together with this one?

2. I was debating trying to add multibyte support to levenshtein (it would
make no sense at all for metaphone), but a quick search through the contrib
directory found no hits on the word MULTIBYTE. Should worry about adding
multibyte support to levenshtein()?

Joe Conway
parent 0bc291e0
...@@ -55,6 +55,10 @@ fulltextindex - ...@@ -55,6 +55,10 @@ fulltextindex -
Full text indexing using triggers Full text indexing using triggers
by Maarten Boekhold <maartenb@dutepp0.et.tudelft.nl> by Maarten Boekhold <maartenb@dutepp0.et.tudelft.nl>
fuzzystrmatch -
Levenshtein and Metaphone fuzzy string matching
by Joe Conway <joseph.conway@home.com>
intarray - intarray -
Index support for arrays of int4, using GiST Index support for arrays of int4, using GiST
by Teodor Sigaev <teodor@stack.net> and Oleg Bartunov by Teodor Sigaev <teodor@stack.net> and Oleg Bartunov
......
subdir = contrib/fuzzystrmatch
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
# override libdir to install shlib in contrib not main directory
libdir := $(libdir)/contrib
# shared library parameters
NAME= fuzzystrmatch
SO_MAJOR_VERSION= 0
SO_MINOR_VERSION= 1
override CPPFLAGS := -I$(srcdir)/src/include $(CPPFLAGS)
OBJS= fuzzystrmatch.o
all: all-lib $(NAME).sql
# Shared library stuff
include $(top_srcdir)/src/Makefile.shlib
$(NAME).sql: $(NAME).sql.in
sed -e 's:MODULE_PATHNAME:$(libdir)/$(shlib):g' < $< > $@
install: all installdirs install-lib
installdirs:
$(mkinstalldirs) $(DESTDIR)$(libdir)
uninstall: uninstall-lib
clean distclean maintainer-clean: clean-lib
rm -f $(OBJS) $(NAME).sql
depend dep:
$(CC) -MM $(CFLAGS) *.c >depend
ifeq (depend,$(wildcard depend))
include depend
endif
/*
* fuzzystrmatch.c
*
* Functions for "fuzzy" comparison of strings
*
* Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
*
* levenshtein()
* -------------
* Written based on a description of the algorithm by Michael Gilleland
* found at http://www.merriampark.com/ld.htm
* Also looked at levenshtein.c in the PHP 4.0.6 distribution for
* inspiration.
*
* metaphone()
* -----------
* Modified for PostgreSQL by Joe Conway.
* Based on CPAN's "Text-Metaphone-1.96" by Michael G Schwern <schwern@pobox.com>
* Code slightly modified for use as PostgreSQL function (palloc, elog, etc).
* Metaphone was originally created by Lawrence Philips and presented in article
* in "Computer Language" December 1990 issue.
*
* Permission to use, copy, modify, and distribute this software and its
* documentation for any purpose, without fee, and without a written agreement
* is hereby granted, provided that the above copyright notice and this
* paragraph and the following two paragraphs appear in all copies.
*
* IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
* DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
* LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
* DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
* POSSIBILITY OF SUCH DAMAGE.
*
* THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
* AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS
* ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
* PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
*
*/
Version 0.1 (3 August, 2001):
Functions to calculate the degree to which two strings match in a "fuzzy" way
Tested under Linux (Red Hat 6.2 and 7.0) and PostgreSQL 7.2devel
Release Notes:
Version 0.1
- initial release
Installation:
Place these files in a directory called 'fuzzystrmatch' under 'contrib' in the PostgreSQL source tree. Then run:
make
make install
You can use fuzzystrmatch.sql to create the functions in your database of choice, e.g.
psql -U postgres template1 < fuzzystrmatch.sql
installs following functions into database template1:
levenshtein() - calculates the levenshtein distance between two strings
metaphone() - calculates the metaphone code of an input string
Documentation
==================================================================
Name
levenshtein -- calculates the levenshtein distance between two strings
Synopsis
levenshtein(text source, text target)
Inputs
source
any text string, 255 characters max, NOT NULL
target
any text string, 255 characters max, NOT NULL
Outputs
Returns int
Example usage
select levenshtein('GUMBO','GAMBOL');
==================================================================
Name
metaphone -- calculates the metaphone code of an input string
Synopsis
metaphone(text source, int max_output_length)
Inputs
source
any text string, 255 characters max, NOT NULL
max_output_length
maximum length of the output metaphone code; if longer, the output
is truncated to this length
Outputs
Returns text
Example usage
select metaphone('GUMBO',4);
==================================================================
-- Joe Conway
This diff is collapsed.
/*
* fuzzystrmatch.h
*
* Functions for "fuzzy" comparison of strings
*
* Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
*
* levenshtein()
* -------------
* Written based on a description of the algorithm by Michael Gilleland
* found at http://www.merriampark.com/ld.htm
* Also looked at levenshtein.c in the PHP 4.0.6 distribution for
* inspiration.
*
* metaphone()
* -----------
* Modified for PostgreSQL by Joe Conway.
* Based on CPAN's "Text-Metaphone-1.96" by Michael G Schwern <schwern@pobox.com>
* Code slightly modified for use as PostgreSQL function (palloc, elog, etc).
* Metaphone was originally created by Lawrence Philips and presented in article
* in "Computer Language" December 1990 issue.
*
* Permission to use, copy, modify, and distribute this software and its
* documentation for any purpose, without fee, and without a written agreement
* is hereby granted, provided that the above copyright notice and this
* paragraph and the following two paragraphs appear in all copies.
*
* IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
* DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
* LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
* DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
* POSSIBILITY OF SUCH DAMAGE.
*
* THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
* AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS
* ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
* PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
*
*/
#ifndef FUZZYSTRMATCH_H
#define FUZZYSTRMATCH_H
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"
#define MAX_LEVENSHTEIN_STRLEN 255
#define MAX_METAPHONE_STRLEN 255
typedef struct dynmatrix
{
int value;
} dynmat;
/*
* External declarations
*/
extern Datum levenshtein(PG_FUNCTION_ARGS);
extern Datum metaphone(PG_FUNCTION_ARGS);
/*
* Internal declarations
*/
#define STRLEN(p) strlen(p)
#define CHAREQ(p1, p2) (*(p1) == *(p2))
#define NextChar(p) ((p)++)
/*
* Original code by Michael G Schwern starts here.
* Code slightly modified for use as PostgreSQL
* function (combined *.h into here).
*------------------------------------------------------------------*/
/**************************************************************************
metaphone -- Breaks english phrases down into their phonemes.
Input
word -- An english word to be phonized
max_phonemes -- How many phonemes to calculate. If 0, then it
will phonize the entire phrase.
phoned_word -- The final phonized word. (We'll allocate the
memory.)
Output
error -- A simple error flag, returns TRUE or FALSE
NOTES: ALL non-alpha characters are ignored, this includes whitespace,
although non-alpha characters will break up phonemes.
****************************************************************************/
/**************************************************************************
my constants -- constants I like
Probably redundant.
***************************************************************************/
#define META_ERROR FALSE
#define META_SUCCESS TRUE
#define META_FAILURE FALSE
/* I add modifications to the traditional metaphone algorithm that you
might find in books. Define this if you want metaphone to behave
traditionally */
#undef USE_TRADITIONAL_METAPHONE
/* Special encodings */
#define SH 'X'
#define TH '0'
char Lookahead(char * word, int how_far);
int _metaphone (
/* IN */
char * word,
int max_phonemes,
/* OUT */
char ** phoned_word
);
/* Metachar.h ... little bits about characters for metaphone */
/*-- Character encoding array & accessing macros --*/
/* Stolen directly out of the book... */
char _codes[26] = {
1,16,4,16,9,2,4,16,9,2,0,2,2,2,1,4,0,2,4,4,1,0,0,0,8,0
/* a b c d e f g h i j k l m n o p q r s t u v w x y z */
};
#define ENCODE(c) (isalpha(c) ? _codes[((toupper(c)) - 'A')] : 0)
#define isvowel(c) (ENCODE(c) & 1) /* AEIOU */
/* These letters are passed through unchanged */
#define NOCHANGE(c) (ENCODE(c) & 2) /* FJMNR */
/* These form dipthongs when preceding H */
#define AFFECTH(c) (ENCODE(c) & 4) /* CGPST */
/* These make C and G soft */
#define MAKESOFT(c) (ENCODE(c) & 8) /* EIY */
/* These prevent GH from becoming F */
#define NOGHTOF(c) (ENCODE(c) & 16) /* BDH */
#endif /* FUZZYSTRMATCH_H */
CREATE FUNCTION levenshtein (text,text) RETURNS int
AS 'MODULE_PATHNAME','levenshtein' LANGUAGE 'c' with (iscachable, isstrict);
CREATE FUNCTION metaphone (text,int) RETURNS text
AS 'MODULE_PATHNAME','metaphone' LANGUAGE 'c' with (iscachable, isstrict);
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment