Commit b8860533 authored by Teodor Sigaev

tsearch2 module

parent a6053826
subdir = contrib/tsearch2
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
override CPPFLAGS := -I. -I./snowball -I./ispell -I./wordparser $(CPPFLAGS)
MODULE_big = tsearch2
OBJS = dict_ex.o dict.o snmap.o stopword.o common.o prs_dcfg.o \
snowball/english_stem.o snowball/api.o snowball/russian_stem.o snowball/utilities.o \
dict_snowball.o ispell/spell.o dict_ispell.o dict_syn.o \
wparser.o wordparser/parser.o wordparser/deflex.o wparser_def.o \
ts_cfg.o tsvector.o rewrite.o crc32.o query.o gistidx.o \
tsvector_op.o rank.o ts_stat.o
DATA_built = tsearch2.sql untsearch2.sql
DOCS = README.tsearch2
REGRESS = tsearch2
wordparser/parser.c: wordparser/parser.l
ifdef FLEX
$(FLEX) $(FLEXFLAGS) -8 -Ptsearch2_yy -o'$@' $<
else
@$(missing) flex $< $@
endif
EXTRA_CLEAN = wordparser/parser.c tsearch2.sql.in
SHLIB_LINK := -lm
include $(top_srcdir)/contrib/contrib-global.mk
# DO NOT DELETE
install: installstop
installstop:
cp stopword/*.stop $(datadir)
tsearch2.sql.in: tsearch.sql.in
sed 's,DATA_PATH,$(datadir),g' < $< > $@
untsearch2.sql: untsearch.sql.in
cp $< $@
Tsearch2 - full text search extension for PostgreSQL
An [10][online version] of this document is available.
This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
Notice: This version is fully incompatible with the old tsearch (V1),
which is considered deprecated in the upcoming 7.4 release and will be
obsolete in 7.5.
The Tsearch2 contrib module contains an implementation of a new data
type tsvector - a searchable data type with indexed access. In a
nutshell, tsvector is a set of unique words along with their
positional information in the document, organized in a special
structure optimized for fast access and lookup. In addition to its
position in the document, each word entry can carry a weight
attribute describing the importance of the word at that specific position
in the document. A set of bit-signatures of a fixed length, representing
tsvectors, are stored in a search tree (developed using PostgreSQL
GiST), which provides online update of full text index and fast query
lookup. The module provides indexed access methods, queries,
operations and supporting routines for the tsvector data type and easy
conversion of text data to tsvector. Table driven configuration allows
creation of custom configuration optimized for specific searches using
standard SQL commands.
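As a rough illustration of the idea described above, the following C sketch collects the unique words of a tokenized document together with their positions and keeps them sorted for binary search. It is not the module's actual tsvector layout: the types `SketchEntry` and `SketchVector`, the functions `sketch_add`, `sketch_sort` and `sketch_find`, and the limits `MAX_WORDS` and `MAX_POS` are all hypothetical names chosen for this example.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_WORDS 64   /* illustrative limits, not the module's */
#define MAX_POS   16

typedef struct {
    const char *word;          /* unique word (not copied) */
    int         npos;          /* number of stored positions */
    int         pos[MAX_POS];  /* positions of the word in the document */
} SketchEntry;

typedef struct {
    int         len;
    SketchEntry entries[MAX_WORDS];
} SketchVector;

/* Record that `word` occurs at `position`; returns 0 on success, -1 if full. */
static int
sketch_add(SketchVector *v, const char *word, int position)
{
    int i;

    for (i = 0; i < v->len; i++)
        if (strcmp(v->entries[i].word, word) == 0)
            break;
    if (i == v->len) {
        if (v->len == MAX_WORDS)
            return -1;
        v->entries[i].word = word;
        v->entries[i].npos = 0;
        v->len++;
    }
    if (v->entries[i].npos == MAX_POS)
        return -1;
    v->entries[i].pos[v->entries[i].npos++] = position;
    return 0;
}

static int
sketch_cmp(const void *a, const void *b)
{
    return strcmp(((const SketchEntry *) a)->word,
                  ((const SketchEntry *) b)->word);
}

/* Sort entries by word so that lookups can use binary search. */
static void
sketch_sort(SketchVector *v)
{
    qsort(v->entries, v->len, sizeof(SketchEntry), sketch_cmp);
}

/* Return the entry for `word`, or NULL if the document does not contain it. */
static const SketchEntry *
sketch_find(const SketchVector *v, const char *word)
{
    SketchEntry key;

    key.word = word;
    return bsearch(&key, v->entries, v->len, sizeof(SketchEntry), sketch_cmp);
}
```

A caller would add every token with its position, call `sketch_sort` once, and then use `sketch_find` for lookups. The linear scan in `sketch_add` is only acceptable for a toy example.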
Configuration allows you to:
* specify the type of lexemes to be indexed and the way they are
processed.
* specify dictionaries to be used along with stop words recognition.
* specify the parser used to process a document.
See [11]Documentation Roadmap for links to documentation.
Authors
* Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
* Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd., Russia
Contributors
* Robert John Shepherd and Andrew J. Kopciuch submitted
"Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch
v2)
* Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2
Reference" and proposed new naming convention for tsearch V2
New features
* Relevance ranking of search results
* Table driven configuration
* Morphology support (ispell dictionaries, snowball stemmers)
* Headline support (text fragments with highlighted search terms)
* Ability to plug-in custom dictionaries and parsers
* Synonym dictionary
* Generator of templates for dictionaries (built-in snowball stemmer
support)
* Statistics of indexed words is available
Limitations
* A lexeme must not be longer than 2048 bytes
* The number of lexemes is limited to 2^32. Note that the actual
capacity of a tsvector depends on whether positional information
is stored or not.
* tsvector - the size is limited to approximately 2^20 bytes.
* tsquery - the number of entries (lexemes and operations) < 32768
* Positional information
+ maximal position of a lexeme < 2^14 (16384)
+ a lexeme can have at most 256 positions
References
* GiST development site -
[12]http://www.sai.msu.su/~megera/postgres/gist
* OpenFTS home page - [13]http://openfts.sourceforge.net/
* Mailing list -
[14]http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
[15]Documentation Roadmap
Documentation Roadmap
* Several docs are available from docs/ subdirectory
+ "Tsearch V2 Introduction" by Andrew Kopciuch
+ "Tsearch2 Guide" by Brandon Rhodes
+ "Tsearch2 Reference" by Brandon Rhodes
* Readme.gendict in gendict/ subdirectory
+ [16][Gendict tutorial]
Online version of documentation is always available from Tsearch V2
home page -
[17]http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
Support
The authors strongly recommend using the [18][openfts-general] or
[19][pgsql-general] mailing lists for questions and discussions.
Caution
Despite the apparent ease of full text searching with our tsearch module
(the authors hope it is so), any serious search engine requires profound
study of various aspects, such as stop words, dictionaries and special
parsers. The tsearch module was designed to facilitate both cases.
Development History
Pre-tsearch era
Development of OpenFTS began in 2000 after realizing that we
needed a search engine optimized for online updates and able to
access metadata from the database. This is essential for online
news agencies, web portals, digital libraries, etc. Most search
engines available utilize an inverted index which is very fast
for searching but very slow for online updates. Incrementally
updating an inverted index is a complex engineering task,
while we needed something light, free and with the ability to
access metadata from the database. The last requirement is very
important because in a real life application a search engine
should always consult metadata ( topic, permissions, date
range, version, etc.). We extensively use PostgreSQL as a
database backend and have no intention to move from it, so the
problem was to find a data structure and a fast way to access
it. PostgreSQL has a rather unique data type for storing sets
(think about words) - arrays, but lacks index access to them. A
document is parsed into lexemes, which are identified in
various ways (e.g. stemming, morphology, dictionary), and as a
result is reduced to an array of integer numbers. During our
research we found a paper by Joseph Hellerstein which
introduced an interesting data structure suitable for sets -
RD-tree (Russian Doll tree). It looked very attractive, but
implementing it in PostgreSQL seemed difficult because of our
ignorance of database internals. Further research led us to
the idea to use GiST for implementing RD-tree, but at that time
the GiST code had for a long while remained untouched and
contained several bugs. After work on improving GiST for
version 7.0.3 of PostgreSQL was done, we were able to implement
RD-Tree and use it for index access to arrays of integers. This
implementation was ideally suited for small arrays and
eliminated complex joins, but was practically useless for
indexing large arrays. The next improvement came from an idea
to represent a document by a single bit-signature, a so-called
superimposed signature (see "Index Structures for Databases
Containing Data Items with Set-valued Attributes", 1997, Sven
Helmer for details). We developed the contrib/intarray module
and used it for full text indexing.
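The superimposed-signature idea mentioned above can be sketched in a few lines of C. This is an illustration only, not the code used by contrib/intarray or tsearch2: the 64-bit width, the djb2 hash and the names `sig_hash`, `sig_make` and `sig_may_contain` are assumptions made for the example.

```c
#include <stdint.h>

#define SIG_BITS 64   /* illustrative width; a real index would likely use a longer signature */

/* Simple string hash (djb2), a stand-in for whatever hash a real index uses. */
static uint32_t
sig_hash(const char *s)
{
    uint32_t h = 5381;

    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h;
}

/* Superimpose (OR) one bit per word into a single fixed-length signature. */
static uint64_t
sig_make(const char *const *words, int n)
{
    uint64_t sig = 0;
    int      i;

    for (i = 0; i < n; i++)
        sig |= (uint64_t) 1 << (sig_hash(words[i]) % SIG_BITS);
    return sig;
}

/*
 * A document can match only if every bit of the query signature is set in
 * the document signature.  A "yes" may be a false positive caused by hash
 * collisions and must be rechecked against the real words; a "no" is definite.
 */
static int
sig_may_contain(uint64_t doc, uint64_t query)
{
    return (doc & query) == query;
}
```

Because signatures have a fixed length regardless of document size, they can be combined with a bitwise OR in the inner nodes of a search tree, which is what makes them attractive for large sets.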
tsearch v1
It was inconvenient to use integer id's instead of words, so we
introduced a new data type called 'txtidx' - a searchable data
type (textual) with indexed access. This was a first step of
our work on an implementation of a built-in PostgreSQL full
text search engine. Even though tsearch v1 had many features of
a search engine it lacked configuration support and relevance
ranking. People were encouraged to use OpenFTS, which provided
relevance ranking based on coordinate information and flexible
configuration. OpenFTS v.0.34 is the last version based on
tsearch v1.
tsearch V2
People recognized tsearch as a powerful tool for full text
searching and insisted on adding ranking support, better
configurability, etc. We already thought about moving most of
the features of OpenFTS to tsearch, and in early 2003 we
decided to work on a new version of tsearch - tsearch v2. We've
abandoned auxiliary index tables which were used by OpenFTS to
store coordinate information and modified the txtidx type to
store them internally. Also, we've added table-driven
configuration, support of ispell dictionaries, snowball
stemmers and the ability to specify which types of lexemes to
index. Also, it's now possible to generate headlines of
documents with highlighted search terms. These changes make
tsearch more user friendly and turn it into a really powerful
full text search engine. After announcing the alpha version, we
received a proposal from Brandon Rhodes to rename tsearch
functions to be more consistent. So, we have renamed txtidx
type to tsvector and other things as well.
To allow users of tsearch v1 a smooth upgrade, we named the module
tsearch2.
A future release of OpenFTS (v.0.35) will be based on tsearch2. Brave
people can download it from OpenFTS CVS (see the link from the [20][OpenFTS
page]).
References
10. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
11. http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Readme#Documentation_Roadmap
12. http://www.sai.msu.su/~megera/postgres/gist
13. http://openfts.sourceforge.net/
14. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
15. http://www.sai.msu.su/~megera/oddmuse/index.cgi?action=anchor&id=Documentation_Roadmap#Documentation_Roadmap
16. http://www.sai.msu.su/~megera/oddmuse/index.cgi?Gendict
17. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
18. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
19. http://archives.postgresql.org/pgsql-general/
20. http://openfts.sourceforge.net/
#include "postgres.h"
#include "common.h"
#include "wparser.h"
#include "ts_cfg.h"
#include "dict.h"
text*
char2text(char* in) {
return charl2text(in, strlen(in));
}
text* charl2text(char* in, int len) {
text *out=(text*)palloc(len+VARHDRSZ);
memcpy(VARDATA(out), in, len);
VARATT_SIZEP(out) = len+VARHDRSZ;
return out;
}
char
*text2char(text* in) {
char *out=palloc( VARSIZE(in) );
memcpy(out, VARDATA(in), VARSIZE(in)-VARHDRSZ);
out[ VARSIZE(in)-VARHDRSZ ] ='\0';
return out;
}
char
*pnstrdup(char* in, int len) {
char *out=palloc( len+1 );
memcpy(out, in, len);
out[len]='\0';
return out;
}
text
*ptextdup(text* in) {
text *out=(text*)palloc( VARSIZE(in) );
memcpy(out,in,VARSIZE(in));
return out;
}
text
*mtextdup(text* in) {
text *out=(text*)malloc( VARSIZE(in) );
if ( !out )
ts_error(ERROR, "No memory");
memcpy(out,in,VARSIZE(in));
return out;
}
void
ts_error(int state, const char *format, ...) {
va_list args;
int tlen = 128, len=0;
char *buf;
reset_cfg();
reset_dict();
reset_prs();
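buf = palloc(tlen);
va_start(args, format);
len = vsnprintf(buf, tlen, format, args);
va_end(args);
if ( len >= tlen ) {
/* output was truncated: grow the buffer and format again with a fresh va_list */
tlen=len+1;
buf = repalloc( buf, tlen );
va_start(args, format);
vsnprintf(buf, tlen, format, args);
va_end(args);
}
/* pass the text as an argument so that any '%' in it is not reinterpreted */
elog(state, "%s", buf);
pfree(buf);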
}
int
text_cmp(text *a, text *b) {
if ( VARSIZE(a) == VARSIZE(b) )
return strncmp( VARDATA(a), VARDATA(b), VARSIZE(a)-VARHDRSZ );
return (int)VARSIZE(a) - (int)VARSIZE(b);
}
#ifndef __TS_COMMON_H__
#define __TS_COMMON_H__
#include "postgres.h"
#include "fmgr.h"
#ifndef PG_NARGS
#define PG_NARGS() (fcinfo->nargs)
#endif
text* char2text(char* in);
text* charl2text(char* in, int len);
char *text2char(text* in);
char *pnstrdup(char* in, int len);
text *ptextdup(text* in);
text *mtextdup(text* in);
int text_cmp(text *a, text *b);
#define NEXTVAL(x) ( (text*)( (char*)(x) + INTALIGN( VARSIZE(x) ) ) )
#define ARRNELEMS(x) ArrayGetNItems( ARR_NDIM(x), ARR_DIMS(x))
void ts_error(int state, const char *format, ...);
#endif
/* Both POSIX and CRC32 checksums */
#include <sys/types.h>
#include <stdio.h>
#include "crc32.h"
/*
* This code implements the AUTODIN II polynomial
* The variable corresponding to the macro argument "crc" should
* be an unsigned long.
* Original code by Spencer Garrett <srg@quick.com>
*/
#define _CRC32_(crc, ch) (crc = (crc >> 8) ^ crc32tab[(crc ^ (ch)) & 0xff])
/* generated using the AUTODIN II polynomial
* x^32 + x^26 + x^23 + x^22 + x^16 +
* x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x^1 + 1
*/
static const unsigned int crc32tab[256] = {
0x00000000, 0x77073096, 0xee0e612c, 0x990951ba,
0x076dc419, 0x706af48f, 0xe963a535, 0x9e6495a3,
0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91,
0x1db71064, 0x6ab020f2, 0xf3b97148, 0x84be41de,
0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec,
0x14015c4f, 0x63066cd9, 0xfa0f3d63, 0x8d080df5,
0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b,
0x35b5a8fa, 0x42b2986c, 0xdbbbc9d6, 0xacbcf940,
0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116,
0x21b4f4b5, 0x56b3c423, 0xcfba9599, 0xb8bda50f,
0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d,
0x76dc4190, 0x01db7106, 0x98d220bc, 0xefd5102a,
0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818,
0x7f6a0dbb, 0x086d3d2d, 0x91646c97, 0xe6635c01,
0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457,
0x65b0d9c6, 0x12b7e950, 0x8bbeb8ea, 0xfcb9887c,
0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2,
0x4adfa541, 0x3dd895d7, 0xa4d1c46d, 0xd3d6f4fb,
0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9,
0x5005713c, 0x270241aa, 0xbe0b1010, 0xc90c2086,
0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4,
0x59b33d17, 0x2eb40d81, 0xb7bd5c3b, 0xc0ba6cad,
0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683,
0xe3630b12, 0x94643b84, 0x0d6d6a3e, 0x7a6a5aa8,
0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe,
0xf762575d, 0x806567cb, 0x196c3671, 0x6e6b06e7,
0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5,
0xd6d6a3e8, 0xa1d1937e, 0x38d8c2c4, 0x4fdff252,
0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60,
0xdf60efc3, 0xa867df55, 0x316e8eef, 0x4669be79,
0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f,
0xc5ba3bbe, 0xb2bd0b28, 0x2bb45a92, 0x5cb36a04,
0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a,
0x9c0906a9, 0xeb0e363f, 0x72076785, 0x05005713,
0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21,
0x86d3d2d4, 0xf1d4e242, 0x68ddb3f8, 0x1fda836e,
0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c,
0x8f659eff, 0xf862ae69, 0x616bffd3, 0x166ccf45,
0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db,
0xaed16a4a, 0xd9d65adc, 0x40df0b66, 0x37d83bf0,
0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6,
0xbad03605, 0xcdd70693, 0x54de5729, 0x23d967bf,
0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d,
};
unsigned int
crc32_sz(char *buf, int size)
{
unsigned int crc = ~0;
char *p;
int len,
nr;
len = 0;
nr = size;
for (len += nr, p = buf; nr--; ++p)
_CRC32_(crc, *p);
return ~crc;
}
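For cross-checking the table-driven routine above, a bit-at-a-time version of the same CRC can be handy. The sketch below assumes the table was generated from the reflected form of the AUTODIN II polynomial (0xEDB88320, which appears in the table itself) and uses the same initial value and final inversion as `crc32_sz()`. The name `crc32_bitwise` is hypothetical and not part of the module.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32 using the reflected AUTODIN II polynomial, starting from
 * all ones and inverting the result at the end.  Slower than a table
 * lookup, but short enough to audit by eye.
 */
static uint32_t
crc32_bitwise(const char *buf, size_t size)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t   i;
    int      bit;

    for (i = 0; i < size; i++) {
        crc ^= (unsigned char) buf[i];
        for (bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;
}
```

The widely published check value for this CRC variant is 0xCBF43926 for the nine ASCII characters "123456789", so comparing both routines on that input is a quick sanity test.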
#ifndef _CRC32_H
#define _CRC32_H
/* Returns crc32 of data block */
extern unsigned int crc32_sz(char *buf, int size);
/* Returns crc32 of null-terminated string */
#define crc32(buf) crc32_sz((buf),strlen(buf))
#endif
/*
* interface functions to dictionary
* Teodor Sigaev <teodor@sigaev.ru>
*/
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"
#include "catalog/pg_type.h"
#include "executor/spi.h"
#include "dict.h"
#include "common.h"
#include "snmap.h"
/*********top interface**********/
static void *plan_getdict=NULL;
void
init_dict(Oid id, DictInfo *dict) {
Oid arg[1]={ OIDOID };
bool isnull;
Datum pars[1]={ ObjectIdGetDatum(id) };
int stat;
memset(dict,0,sizeof(DictInfo));
SPI_connect();
if ( !plan_getdict ) {
plan_getdict = SPI_saveplan( SPI_prepare( "select dict_init, dict_initoption, dict_lexize from pg_ts_dict where oid = $1" , 1, arg ) );
if ( !plan_getdict )
ts_error(ERROR, "SPI_prepare() failed");
}
stat = SPI_execp(plan_getdict, pars, " ", 1);
if ( stat < 0 )
ts_error (ERROR, "SPI_execp return %d", stat);
if ( SPI_processed > 0 ) {
Datum opt;
Oid oid=InvalidOid;
oid=DatumGetObjectId( SPI_getbinval(SPI_tuptable->vals[0], SPI_tuptable->tupdesc, 1, &isnull) );
if ( !(isnull || oid==InvalidOid) ) {
opt=SPI_getbinval(SPI_tuptable->vals[0], SPI_tuptable->tupdesc, 2, &isnull);
dict->dictionary=(void*)DatumGetPointer(OidFunctionCall1(oid, opt));
}
oid=DatumGetObjectId( SPI_getbinval(SPI_tuptable->vals[0], SPI_tuptable->tupdesc, 3, &isnull) );
if ( isnull || oid==InvalidOid )
ts_error(ERROR, "Null dict_lexize for dictionary %d", id);
fmgr_info_cxt(oid, &(dict->lexize_info), TopMemoryContext);
dict->dict_id=id;
} else
ts_error(ERROR, "No dictionary with id %d", id);
SPI_finish();
}
typedef struct {
DictInfo *last_dict;
int len;
int reallen;
DictInfo *list;
SNMap name2id_map;
} DictList;
static DictList DList = {NULL,0,0,NULL,{0,0,NULL}};
void
reset_dict(void) {
freeSNMap( &(DList.name2id_map) );
/* XXX need to free DList.list[*].dictionary */
if ( DList.list )
free(DList.list);
memset(&DList,0,sizeof(DictList));
}
static int
comparedict(const void *a, const void *b) {
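Oid ida = ((DictInfo*)a)->dict_id;
Oid idb = ((DictInfo*)b)->dict_id;
/* compare explicitly: subtracting unsigned Oids can wrap and yield the wrong sign */
return ( ida < idb ) ? -1 : ( ida > idb );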
}
DictInfo *
finddict(Oid id) {
/* last used dict */
if ( DList.last_dict && DList.last_dict->dict_id==id )
return DList.last_dict;
/* already used dict */
if ( DList.len != 0 ) {
DictInfo key;
key.dict_id=id;
DList.last_dict = bsearch(&key, DList.list, DList.len, sizeof(DictInfo), comparedict);
if ( DList.last_dict != NULL )
return DList.last_dict;
}
/* last chance */
if ( DList.len==DList.reallen ) {
DictInfo *tmp;
int reallen = ( DList.reallen ) ? 2*DList.reallen : 16;
tmp=(DictInfo*)realloc(DList.list,sizeof(DictInfo)*reallen);
if ( !tmp )
ts_error(ERROR,"No memory");
DList.reallen=reallen;
DList.list=tmp;
}
DList.last_dict=&(DList.list[DList.len]);
init_dict(id, DList.last_dict);
DList.len++;
qsort(DList.list, DList.len, sizeof(DictInfo), comparedict);
return finddict(id); /* qsort changed order!! */
}
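The lookup strategy of `finddict()` above (binary search in a sorted array, append on a miss, re-sort, then look the entry up again because sorting may have moved it) can be shown in isolation. The sketch below is a simplified stand-in, not the module's code: `CacheItem`, `Cache`, `cache_cmp` and `cache_get` are hypothetical names, the payload is a placeholder for the expensive `init_dict()` work, and an out-of-memory condition is reported by returning NULL instead of raising an error.

```c
#include <stdlib.h>

typedef struct {
    unsigned int id;       /* lookup key */
    int          payload;  /* stands in for the expensive-to-build state */
} CacheItem;

typedef struct {
    int        len;    /* used slots */
    int        cap;    /* allocated slots */
    CacheItem *items;  /* kept sorted by id */
} Cache;

static int
cache_cmp(const void *a, const void *b)
{
    unsigned int x = ((const CacheItem *) a)->id;
    unsigned int y = ((const CacheItem *) b)->id;

    return (x < y) ? -1 : (x > y);
}

/* Find the item for `id`, building and inserting it on a miss; NULL on OOM. */
static CacheItem *
cache_get(Cache *c, unsigned int id)
{
    CacheItem key;

    key.id = id;
    if (c->len > 0) {
        CacheItem *hit = bsearch(&key, c->items, c->len,
                                 sizeof(CacheItem), cache_cmp);
        if (hit)
            return hit;
    }
    if (c->len == c->cap) {
        int        cap = c->cap ? 2 * c->cap : 16;
        CacheItem *tmp = realloc(c->items, sizeof(CacheItem) * cap);

        if (!tmp)
            return NULL;
        c->items = tmp;
        c->cap = cap;
    }
    c->items[c->len].id = id;
    c->items[c->len].payload = (int) id * 2;  /* placeholder for real initialization */
    c->len++;
    qsort(c->items, c->len, sizeof(CacheItem), cache_cmp);
    /* qsort may have moved the new item, so look it up again */
    return bsearch(&key, c->items, c->len, sizeof(CacheItem), cache_cmp);
}
```

One consequence of this design is that pointers returned earlier can become stale after a later insertion, since both `realloc` and `qsort` may move items. Callers should therefore not hold on to a returned pointer across calls.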
static void *plan_name2id=NULL;
Oid
name2id_dict(text *name) {
Oid arg[1]={ TEXTOID };
bool isnull;
Datum pars[1]={ PointerGetDatum(name) };
int stat;
Oid id=findSNMap_t( &(DList.name2id_map), name );
if ( id )
return id;
SPI_connect();
if ( !plan_name2id ) {
plan_name2id = SPI_saveplan( SPI_prepare( "select oid from pg_ts_dict where dict_name = $1" , 1, arg ) );
if ( !plan_name2id )
ts_error(ERROR, "SPI_prepare() failed");
}
stat = SPI_execp(plan_name2id, pars, " ", 1);
if ( stat < 0 )
ts_error (ERROR, "SPI_execp return %d", stat);
if ( SPI_processed > 0 )
id=DatumGetObjectId( SPI_getbinval(SPI_tuptable->vals[0], SPI_tuptable->tupdesc, 1, &isnull) );
else
ts_error(ERROR, "No dictionary with name '%s'", text2char(name));
SPI_finish();
addSNMap_t( &(DList.name2id_map), name, id );
return id;
}
/******sql-level interface******/
PG_FUNCTION_INFO_V1(lexize);
Datum lexize(PG_FUNCTION_ARGS);
Datum
lexize(PG_FUNCTION_ARGS) {
text *in=PG_GETARG_TEXT_P(1);
DictInfo *dict = finddict( PG_GETARG_OID(0) );
char **res, **ptr;
Datum *da;
ArrayType *a;
ptr = res = (char**)DatumGetPointer(
FunctionCall3(&(dict->lexize_info),
PointerGetDatum(dict->dictionary),
PointerGetDatum(VARDATA(in)),
Int32GetDatum(VARSIZE(in)-VARHDRSZ)
)
);
PG_FREE_IF_COPY(in, 1);
if ( !res ) {
if (PG_NARGS() > 2)
PG_RETURN_POINTER(NULL);
else
PG_RETURN_NULL();
}
while(*ptr) ptr++;
da = (Datum*)palloc(sizeof(Datum)*(ptr-res+1));
ptr=res;
while(*ptr) {
da[ ptr-res ] = PointerGetDatum( char2text(*ptr) );
ptr++;
}
a = construct_array(
da,
ptr-res,
TEXTOID,
-1,
false,
'i'
);
ptr=res;
while(*ptr) {
pfree( DatumGetPointer(da[ ptr-res ]) );
pfree( *ptr );
ptr++;
}
pfree(res);
pfree(da);
PG_RETURN_POINTER(a);
}
PG_FUNCTION_INFO_V1(lexize_byname);
Datum lexize_byname(PG_FUNCTION_ARGS);
Datum
lexize_byname(PG_FUNCTION_ARGS) {
text *dictname=PG_GETARG_TEXT_P(0);
Datum res;
res=DirectFunctionCall3(
lexize,
ObjectIdGetDatum(name2id_dict(dictname)),
PG_GETARG_DATUM(1),
(Datum)0
);
PG_FREE_IF_COPY(dictname, 0);
if (res)
PG_RETURN_DATUM(res);
else
PG_RETURN_NULL();
}
static Oid currect_dictionary_id=0;
PG_FUNCTION_INFO_V1(set_curdict);
Datum set_curdict(PG_FUNCTION_ARGS);
Datum
set_curdict(PG_FUNCTION_ARGS) {
finddict(PG_GETARG_OID(0));
currect_dictionary_id=PG_GETARG_OID(0);
PG_RETURN_VOID();
}
PG_FUNCTION_INFO_V1(set_curdict_byname);
Datum set_curdict_byname(PG_FUNCTION_ARGS);
Datum
set_curdict_byname(PG_FUNCTION_ARGS) {
text *dictname=PG_GETARG_TEXT_P(0);
DirectFunctionCall1(
set_curdict,
ObjectIdGetDatum( name2id_dict(dictname) )
);
PG_FREE_IF_COPY(dictname, 0);
PG_RETURN_VOID();
}
PG_FUNCTION_INFO_V1(lexize_bycurrent);
Datum lexize_bycurrent(PG_FUNCTION_ARGS);
Datum
lexize_bycurrent(PG_FUNCTION_ARGS) {
Datum res;
if ( currect_dictionary_id == 0 )
elog(ERROR, "No current dictionary. Execute select set_curdict().");
res = DirectFunctionCall3(
lexize,
ObjectIdGetDatum(currect_dictionary_id),
PG_GETARG_DATUM(0),
(Datum)0
);
if (res)
PG_RETURN_DATUM(res);
else
PG_RETURN_NULL();
}
#ifndef __DICT_H__
#define __DICT_H__
#include "postgres.h"
#include "fmgr.h"
typedef struct {
int len;
char **stop;
char* (*wordop)(char*);
} StopList;
void sortstoplist(StopList *s);
void freestoplist(StopList *s);
void readstoplist(text *in, StopList *s);
bool searchstoplist(StopList *s, char *key);
char* lowerstr(char *str);
typedef struct {
Oid dict_id;
FmgrInfo lexize_info;
void *dictionary;
} DictInfo;
void init_dict(Oid id, DictInfo *dict);
DictInfo* finddict(Oid id);
Oid name2id_dict(text *name);
void reset_dict(void);
/* simple parser of cfg string */
typedef struct {
char *key;
char *value;
} Map;
void parse_cfgdict(text *in, Map **m);
#endif
/*
* example of dictionary
* Teodor Sigaev <teodor@sigaev.ru>
*/
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include "postgres.h"
#include "dict.h"
#include "common.h"
typedef struct {
StopList stoplist;
} DictExample;
PG_FUNCTION_INFO_V1(dex_init);
Datum dex_init(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(dex_lexize);
Datum dex_lexize(PG_FUNCTION_ARGS);
Datum
dex_init(PG_FUNCTION_ARGS) {
DictExample *d = (DictExample*)malloc( sizeof(DictExample) );
if ( !d )
elog(ERROR, "No memory");
memset(d,0,sizeof(DictExample));
d->stoplist.wordop=lowerstr;
if ( !PG_ARGISNULL(0) && PG_GETARG_POINTER(0)!=NULL ) {
text *in = PG_GETARG_TEXT_P(0);
readstoplist(in, &(d->stoplist));
sortstoplist(&(d->stoplist));
PG_FREE_IF_COPY(in, 0);
}
PG_RETURN_POINTER(d);
}
Datum
dex_lexize(PG_FUNCTION_ARGS) {
DictExample *d = (DictExample*)PG_GETARG_POINTER(0);
char *in = (char*)PG_GETARG_POINTER(1);
char *txt = pnstrdup(in, PG_GETARG_INT32(2));
char **res=palloc(sizeof(char*)*2);
if ( *txt=='\0' || searchstoplist(&(d->stoplist),txt) ) {
pfree(txt);
res[0]=NULL;
} else
res[0]=txt;
res[1]=NULL;
PG_RETURN_POINTER(res);
}
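Reading the dictionaries in this commit, a lexize routine appears to have three possible outcomes: a NULL result (as `spell_lexize` and `syn_lexize` return for input they do not handle), an array whose first element is NULL (as `dex_lexize` returns for a stop word), or a NULL-terminated list of lexemes. The sketch below shows how a caller might tell these apart. It is a standalone illustration that uses `malloc` in place of `palloc`, and `toy_lexize`, `lex_kind`, `lex_free` and the `LexKind` enum are hypothetical names, not part of the module.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { LEX_UNKNOWN, LEX_STOPWORD, LEX_FOUND } LexKind;

/*
 * Stand-in for a dictionary's lexize routine: returns NULL for an empty
 * input, an empty NULL-terminated array for the stop word "the", and a
 * one-element array holding a copy of the word otherwise.
 */
static char **
toy_lexize(const char *in, int len)
{
    char **res;

    if (len == 0)
        return NULL;
    res = malloc(sizeof(char *) * 2);
    if (!res)
        return NULL;
    res[0] = res[1] = NULL;
    if (!(len == 3 && strncmp(in, "the", 3) == 0)) {
        res[0] = malloc(len + 1);
        if (!res[0]) {
            free(res);
            return NULL;
        }
        memcpy(res[0], in, len);
        res[0][len] = '\0';
    }
    return res;
}

/* Classify a lexize result the way a caller would have to. */
static LexKind
lex_kind(char **res)
{
    if (res == NULL)
        return LEX_UNKNOWN;
    return res[0] ? LEX_FOUND : LEX_STOPWORD;
}

/* Release a result array and the strings it owns. */
static void
lex_free(char **res)
{
    char **p;

    if (!res)
        return;
    for (p = res; *p; p++)
        free(*p);
    free(res);
}
```

As in the real routines, the input is passed as a pointer plus a length rather than as a NUL-terminated string, so the routine has to make its own terminated copy.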
/*
* ISpell interface
* Teodor Sigaev <teodor@sigaev.ru>
*/
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "postgres.h"
#include "dict.h"
#include "common.h"
#include "ispell/spell.h"
typedef struct {
StopList stoplist;
IspellDict obj;
} DictISpell;
PG_FUNCTION_INFO_V1(spell_init);
Datum spell_init(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(spell_lexize);
Datum spell_lexize(PG_FUNCTION_ARGS);
static void
freeDictISpell(DictISpell *d) {
FreeIspell(&(d->obj));
freestoplist(&(d->stoplist));
free(d);
}
Datum
spell_init(PG_FUNCTION_ARGS) {
DictISpell *d;
Map *cfg, *pcfg;
text *in;
bool affloaded=false, dictloaded=false, stoploaded=false;
if ( PG_ARGISNULL(0) || PG_GETARG_POINTER(0)==NULL )
elog(ERROR,"ISpell configuration error");
d = (DictISpell*)malloc( sizeof(DictISpell) );
if ( !d )
elog(ERROR, "No memory");
memset(d,0,sizeof(DictISpell));
d->stoplist.wordop=lowerstr;
in = PG_GETARG_TEXT_P(0);
parse_cfgdict(in,&cfg);
PG_FREE_IF_COPY(in, 0);
pcfg=cfg;
while(pcfg->key) {
if ( strcasecmp("DictFile", pcfg->key) == 0 ) {
if ( dictloaded ) {
freeDictISpell(d);
elog(ERROR,"Dictionary already loaded");
}
if ( ImportDictionary(&(d->obj), pcfg->value) ) {
freeDictISpell(d);
elog(ERROR,"Can't load dictionary file (%s)", pcfg->value);
}
dictloaded=true;
} else if ( strcasecmp("AffFile", pcfg->key) == 0 ) {
if ( affloaded ) {
freeDictISpell(d);
elog(ERROR,"Affixes already loaded");
}
if ( ImportAffixes(&(d->obj), pcfg->value) ) {
freeDictISpell(d);
elog(ERROR,"Can't load affix file (%s)", pcfg->value);
}
affloaded=true;
} else if ( strcasecmp("StopFile", pcfg->key) == 0 ) {
text *tmp=char2text(pcfg->value);
if ( stoploaded ) {
freeDictISpell(d);
elog(ERROR,"Stop words already loaded");
}
readstoplist(tmp, &(d->stoplist));
sortstoplist(&(d->stoplist));
pfree(tmp);
stoploaded=true;
} else {
freeDictISpell(d);
elog(ERROR,"Unknown option: %s => %s", pcfg->key, pcfg->value);
}
pfree(pcfg->key);
pfree(pcfg->value);
pcfg++;
}
pfree(cfg);
if ( affloaded && dictloaded ) {
SortDictionary(&(d->obj));
SortAffixes(&(d->obj));
} else if ( !affloaded ) {
freeDictISpell(d);
elog(ERROR,"No affixes");
} else {
freeDictISpell(d);
elog(ERROR,"No dictionary");
}
PG_RETURN_POINTER(d);
}
Datum
spell_lexize(PG_FUNCTION_ARGS) {
DictISpell *d = (DictISpell*)PG_GETARG_POINTER(0);
char *in = (char*)PG_GETARG_POINTER(1);
char *txt;
char **res;
char **ptr, **cptr;
if ( !PG_GETARG_INT32(2) )
PG_RETURN_POINTER(NULL);
txt = pnstrdup(in, PG_GETARG_INT32(2));
res=NormalizeWord(&(d->obj), txt);
pfree(txt);
if ( res==NULL )
PG_RETURN_POINTER(NULL);
ptr=cptr=res;
while(*ptr) {
if ( searchstoplist(&(d->stoplist),*ptr) ) {
pfree(*ptr);
*ptr=NULL;
ptr++;
} else {
*cptr=*ptr;
cptr++; ptr++;
}
}
*cptr=NULL;
PG_RETURN_POINTER(res);
}
/*
* example of Snowball dictionary
* http://snowball.tartarus.org/
* Teodor Sigaev <teodor@sigaev.ru>
*/
#include <stdlib.h>
#include <string.h>
#include "postgres.h"
#include "dict.h"
#include "common.h"
#include "snowball/header.h"
#include "snowball/english_stem.h"
#include "snowball/russian_stem.h"
typedef struct {
struct SN_env *z;
StopList stoplist;
int (*stem)(struct SN_env * z);
} DictSnowball;
PG_FUNCTION_INFO_V1(snb_en_init);
Datum snb_en_init(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(snb_ru_init);
Datum snb_ru_init(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(snb_lexize);
Datum snb_lexize(PG_FUNCTION_ARGS);
Datum
snb_en_init(PG_FUNCTION_ARGS) {
DictSnowball *d = (DictSnowball*)malloc( sizeof(DictSnowball) );
if ( !d )
elog(ERROR, "No memory");
memset(d,0,sizeof(DictSnowball));
d->stoplist.wordop=lowerstr;
if ( !PG_ARGISNULL(0) && PG_GETARG_POINTER(0)!=NULL ) {
text *in = PG_GETARG_TEXT_P(0);
readstoplist(in, &(d->stoplist));
sortstoplist(&(d->stoplist));
PG_FREE_IF_COPY(in, 0);
}
d->z = english_create_env();
if (!d->z) {
freestoplist(&(d->stoplist));
elog(ERROR,"No memory");
}
d->stem=english_stem;
PG_RETURN_POINTER(d);
}
Datum
snb_ru_init(PG_FUNCTION_ARGS) {
DictSnowball *d = (DictSnowball*)malloc( sizeof(DictSnowball) );
if ( !d )
elog(ERROR, "No memory");
memset(d,0,sizeof(DictSnowball));
d->stoplist.wordop=lowerstr;
if ( !PG_ARGISNULL(0) && PG_GETARG_POINTER(0)!=NULL ) {
text *in = PG_GETARG_TEXT_P(0);
readstoplist(in, &(d->stoplist));
sortstoplist(&(d->stoplist));
PG_FREE_IF_COPY(in, 0);
}
d->z = russian_create_env();
if (!d->z) {
freestoplist(&(d->stoplist));
elog(ERROR,"No memory");
}
d->stem=russian_stem;
PG_RETURN_POINTER(d);
}
Datum
snb_lexize(PG_FUNCTION_ARGS) {
DictSnowball *d = (DictSnowball*)PG_GETARG_POINTER(0);
char *in = (char*)PG_GETARG_POINTER(1);
char *txt = pnstrdup(in, PG_GETARG_INT32(2));
char **res=palloc(sizeof(char*)*2);
if ( *txt=='\0' || searchstoplist(&(d->stoplist),txt) ) {
pfree(txt);
res[0]=NULL;
} else {
SN_set_current(d->z, strlen(txt), txt);
(d->stem)(d->z);
if ( d->z->p && d->z->l ) {
txt=repalloc(txt, d->z->l+1);
memcpy( txt, d->z->p, d->z->l);
txt[d->z->l]='\0';
}
res[0]=txt;
}
res[1]=NULL;
PG_RETURN_POINTER(res);
}
/*
* Synonym dictionary interface
* Teodor Sigaev <teodor@sigaev.ru>
*/
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <errno.h>
#include "postgres.h"
#include "dict.h"
#include "common.h"
#define SYNBUFLEN 4096
typedef struct {
char *in;
char *out;
} Syn;
typedef struct {
int len;
Syn *syn;
} DictSyn;
PG_FUNCTION_INFO_V1(syn_init);
Datum syn_init(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(syn_lexize);
Datum syn_lexize(PG_FUNCTION_ARGS);
static char *
findwrd(char *in, char **end) {
char *start;
*end=NULL;
while(*in && isspace(*in))
in++;
if ( !*in ) /* only whitespace was left: there is no word to return */
return NULL;
start=in;
while(*in && !isspace(*in))
in++;
*end=in;
return start;
}
static int
compareSyn(const void *a, const void *b) {
return strcmp( ((Syn*)a)->in, ((Syn*)b)->in );
}
Datum
syn_init(PG_FUNCTION_ARGS) {
text *in;
DictSyn *d;
int cur=0;
FILE *fin;
char *filename;
char buf[SYNBUFLEN];
char *starti,*starto,*end=NULL;
int slen;
if ( PG_ARGISNULL(0) || PG_GETARG_POINTER(0)==NULL )
elog(ERROR,"NULL config");
in = PG_GETARG_TEXT_P(0);
if ( VARSIZE(in) - VARHDRSZ == 0 )
elog(ERROR,"VOID config");
filename=text2char(in);
PG_FREE_IF_COPY(in, 0);
if ( (fin=fopen(filename,"r")) == NULL )
elog(ERROR,"Can't open file '%s': %s", filename, strerror(errno));
d = (DictSyn*)malloc( sizeof(DictSyn) );
if ( !d ) {
fclose(fin);
elog(ERROR, "No memory");
}
memset(d,0,sizeof(DictSyn));
while( fgets(buf,SYNBUFLEN,fin) ) {
slen = strlen(buf)-1;
buf[slen] = '\0';
if ( *buf=='\0' ) continue;
if (cur==d->len) {
d->len = (d->len) ? 2*d->len : 16;
d->syn=(Syn*)realloc( d->syn, sizeof(Syn)*d->len );
if ( !d->syn ) {
fclose(fin);
elog(ERROR, "No memory");
}
}
starti=findwrd(buf,&end);
if ( !starti )
continue;
*end='\0';
if ( end >= buf+slen )
continue;
starto= findwrd(end+1, &end);
if ( !starto )
continue;
*end='\0';
d->syn[cur].in=strdup(lowerstr(starti));
d->syn[cur].out=strdup(lowerstr(starto));
if ( !(d->syn[cur].in && d->syn[cur].out) ) {
fclose(fin);
elog(ERROR, "No memory");
}
cur++;
}
fclose(fin);
d->len=cur;
if ( cur>1 )
qsort(d->syn, d->len, sizeof(Syn), compareSyn);
pfree(filename);
PG_RETURN_POINTER(d);
}
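The line handling in `syn_init` above expects each line of the synonym file to hold a word followed by its replacement, separated by whitespace. A standalone sketch of that splitting step is shown below. It is an illustration under that assumption, not a drop-in replacement: `next_word` and `split_syn_line` are hypothetical names, and lines with fewer than two words are simply rejected.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Return the start of the next whitespace-delimited word in `in` and set
 * *end to one past its last character; return NULL (with *end = NULL) if
 * only whitespace remains.
 */
static char *
next_word(char *in, char **end)
{
    char *start;

    *end = NULL;
    while (*in && isspace((unsigned char) *in))
        in++;
    if (!*in)
        return NULL;
    start = in;
    while (*in && !isspace((unsigned char) *in))
        in++;
    *end = in;
    return start;
}

/*
 * Split one line of the form "word synonym" in place.  On success, *word
 * and *syn point at NUL-terminated strings inside `line` and 1 is returned;
 * lines with fewer than two words return 0.
 */
static int
split_syn_line(char *line, char **word, char **syn)
{
    char *end;
    char *w, *s;

    w = next_word(line, &end);
    if (!w || *end == '\0')
        return 0;
    *end = '\0';
    s = next_word(end + 1, &end);
    if (!s)
        return 0;
    *end = '\0';
    *word = w;
    *syn = s;
    return 1;
}
```

Casting to `unsigned char` before calling `isspace` avoids undefined behavior for bytes with the high bit set, which matters for non-ASCII dictionaries such as Russian ones.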
Datum
syn_lexize(PG_FUNCTION_ARGS) {
DictSyn *d = (DictSyn*)PG_GETARG_POINTER(0);
char *in = (char*)PG_GETARG_POINTER(1);
Syn key,*found;
char **res=NULL;
if ( !PG_GETARG_INT32(2) )
PG_RETURN_POINTER(NULL);
key.out=NULL;
key.in=lowerstr(pnstrdup(in, PG_GETARG_INT32(2)));
found=(Syn*)bsearch(&key, d->syn, d->len, sizeof(Syn), compareSyn);
pfree(key.in);
if ( !found )
PG_RETURN_POINTER(NULL);
res=palloc(sizeof(char*)*2);
res[0]=pstrdup(found->out);
res[1]=NULL;
PG_RETURN_POINTER(res);
}
subdir = contrib/CFG_DIR
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
MODULE_big = dict_CFG_MODNAME
OBJS = CFG_OFILE
DATA_built = dict_CFG_MODNAME.sql
DOCS = README.CFG_MODNAME
PG_CPPFLAGS =
SHLIB_LINK = ../tsearch2/libtsearch2.a
include $(top_srcdir)/contrib/contrib-global.mk
Gendict - generate dictionary templates for contrib/tsearch2 module.
This utility aims to help people create dictionaries for the contrib/tsearch v2
module. In particular, it has built-in support for snowball stemmers.
The programming API for tsearch2 dictionaries is described in the tsearch v2
documentation.
Prerequisites:
* PostgreSQL 7.3 or above.
* tsearch2 module sources, already compiled
* Rights to install contrib modules
Usage:
run config.sh without parameters to see options and arguments
Usage:
./config.sh -n DICTNAME ( [ -s [ -p PREFIX ] ] | [ -c CFILES ] [ -h HFILES ] [ -i ] ) [ -v ] [ -d DIR ] [ -C COMMENT ]
-v - be verbose
-d DIR - name of directory in PGSQL_SRC/contrib (default dict_DICTNAME)
-C COMMENT - dictionary comment
Generate Snowball stemmer:
./config.sh -n DICTNAME -s [ -p PREFIX ] [ -v ] [ -d DIR ] [ -C COMMENT ]
-s - generate Snowball wrapper
-p - prefix of Snowball's function, (default DICTNAME)
Generate template dictionary:
./config.sh -n DICTNAME [ -c CFILES ] [ -h HFILES ] [ -i ] [ -v ] [ -d DIR ] [ -C COMMENT ]
-c CFILES - source files, must be placed in contrib/tsearch2/gendict directory.
These files will be used in Makefile.
-h HFILES - header files, must be placed in contrib/tsearch2/gendict directory.
These files will be used in Makefile and subinclude.h
-i - dictionary has init method
Example 1:
Create Portuguese stemmer
0. cd PGSQL_SRC/contrib/tsearch2/gendict
1. Obtain stem.{c,h} files for Portuguese
wget http://snowball.tartarus.org/portuguese/stem.c
wget http://snowball.tartarus.org/portuguese/stem.h
2. Create template files for Portuguese
./config.sh -n pt -s -p portuguese -v -C'Snowball stemmer for Portuguese'
Note that the argument to the -p option must be *the same* as the name of
the stemming function in stem.c (without the _stem suffix).
The generated files are placed in the PGSQL_SRC/contrib/dict_pt
directory.
3. Compile and install dictionary
cd PGSQL_SRC/contrib/dict_pt
make
make install
4. Test it
Sample Portuguese words with their stemmed forms are available
from http://snowball.tartarus.org/portuguese/stemmer.html
createdb testdict
psql testdict < /usr/local/pgsql/share/contrib/tsearch2.sql
psql testdict < /usr/local/pgsql/share/contrib/dict_pt.sql
psql -d testdict -c "select lexize('pt','bobagem');"
lexize
---------
{bobag}
(1 row)
Here is the corresponding row in the pg_ts_dict table:
psql -d testdict -c "select * from pg_ts_dict where dict_name='pt';"
dict_name | dict_init | dict_initoption | dict_lexize | dict_comment
-----------+-----------+-----------------+-------------+---------------------------------
pt | 7177806 | | 7159330 | Snowball stemmer for Portuguese
(1 row)
Note that the dictionary and its corresponding entry in the tsearch2
configuration are now installed. You can modify the entry with
plain SQL commands, for example to specify stop words.
Example 2:
a) Simple template dictionary with init method
./config.sh -n wow -v -i -C WOW
b) Create simple template dict (without init method):
./config.sh -n wow -v -C WOW
The same as above, but the dictionary will not have an init method.
Dictionaries obtained in a) and b) are fully working and ready
for use:
a) lowercases the input word and removes it if it is a stop word
b) recognizes any word
c) Simple template dictionary with source files (with init method):
./config.sh -n wow -v -i -c a.c -h a.h -C WOW
Source files ( a.c ) must be placed in contrib/tsearch2/gendict directory.
These files will be used in Makefile.
Header files ( a.h ), must be placed in contrib/tsearch2/gendict directory.
These files will be used in Makefile and subinclude.h
d) Simple template dictionary with source files (without init method):
./config.sh -n wow -v -c a.c -h a.h -C WOW
The same as above, but the dictionary will not have an init method.
After that, the sources are in PGSQL_SRC/contrib/dict_wow and
you can edit them to create the actual dictionary.
Please check the Tsearch2 home page (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/)
for the "Gendict tutorial" and additional information about dictionaries.
\ No newline at end of file
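The README states that a generated template dictionary lowercases the input word and drops it if it is a stop word, and the template source returns its result as a NULL-terminated array of strings. The sketch below imitates that contract in plain C, as an illustration only: lexize_sketch, is_stopword and the hard-coded stop list are hypothetical, and malloc() stands in for the palloc() and stop-list helpers that the real template gets from tsearch2.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stop list; the generated template loads its list at init time. */
static const char *stopwords[] = { "a", "the", "of" };

static int is_stopword(const char *w)
{
    size_t i;

    for (i = 0; i < sizeof(stopwords) / sizeof(stopwords[0]); i++)
        if (strcmp(stopwords[i], w) == 0)
            return 1;
    return 0;
}

/*
 * Return a malloc'ed, NULL-terminated array of lexemes for the first
 * len bytes of in. A stop word (or empty input) yields an array whose
 * first element is NULL; allocation failure yields NULL.
 */
static char **lexize_sketch(const char *in, size_t len)
{
    char **res = malloc(sizeof(char *) * 2);
    char *txt = malloc(len + 1);
    size_t i;

    if (!res || !txt) {
        free(res);
        free(txt);
        return NULL;
    }
    for (i = 0; i < len; i++)
        txt[i] = (char) tolower((unsigned char) in[i]);
    txt[len] = '\0';

    if (*txt == '\0' || is_stopword(txt)) {
        free(txt);
        res[0] = NULL;          /* recognized, but produces no lexeme */
    } else
        res[0] = txt;
    res[1] = NULL;
    return res;
}
```

Calling lexize_sketch("Index", 5) gives an array holding "index", while lexize_sketch("The", 3) gives an array whose first element is NULL. Both differ from a NULL return, which here only signals an allocation failure.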
#!/bin/sh
usage () {
echo Usage:
echo $0 -n DICTNAME \( [ -s [ -p PREFIX ] ] \| [ -c CFILES ] [ -h HFILES ] [ -i ] \) [ -v ] [ -d DIR ] [ -C COMMENT ]
echo ' -v - be verbose'
echo ' -d DIR - name of directory in PGSQL_SRC/contrib (default dict_DICTNAME)'
echo ' -C COMMENT - dictionary comment'
echo Generate Snowball stemmer:
echo $0 -n DICTNAME -s [ -p PREFIX ] [ -v ] [ -d DIR ] [ -C COMMENT ]
echo ' -s - generate Snowball wrapper'
echo " -p - prefix of Snowball's function, (default DICTNAME)"
echo Generate template dictionary:
echo $0 -n DICTNAME [ -c CFILES ] [ -h HFILES ] [ -i ] [ -v ] [ -d DIR ] [ -C COMMENT ]
echo ' -c CFILES - source files, must be placed in contrib/tsearch2/gendict directory.'
echo ' These files will be used in Makefile.'
echo ' -h HFILES - header files, must be placed in contrib/tsearch2/gendict directory.'
echo ' These files will be used in Makefile and subinclude.h'
echo ' -i - dictionary has init method'
exit 1;
}
dictname=
stemmode=no
verbose=no
cfile=
hfile=
dir=
hasinit=no
comment=
prefix=
while getopts n:c:C:h:d:p:vis opt
do
case "$opt" in
v) verbose=yes;;
s) stemmode=yes;;
i) hasinit=yes;;
n) dictname="$OPTARG";;
c) cfile="$OPTARG";;
h) hfile="$OPTARG";;
d) dir="$OPTARG";;
C) comment="$OPTARG";;
p) prefix="$OPTARG";;
\?) usage;;
esac
done
[ ${#dictname} -eq 0 ] && usage
dictname=`echo $dictname | tr '[:upper:]' '[:lower:]'`
if [ $stemmode = "yes" ] ; then
[ ${#prefix} -eq 0 ] && prefix=$dictname
hasinit=yes
cfile="stem.c"
hfile="stem.h"
fi
[ ${#dir} -eq 0 ] && dir="dict_$dictname"
if [ ${#comment} -eq 0 ]; then
comment=null
else
comment="'$comment'"
fi
ofile=
for f in $cfile
do
f=` echo $f | sed 's#c$#o#'`
ofile="$ofile $f"
done
if [ $stemmode = "yes" ] ; then
ofile="$ofile dict_snowball.o"
else
ofile="$ofile dict_tmpl.o"
fi
if [ $verbose = "yes" ]; then
echo Dictname: "'"$dictname"'"
echo Snowball stemmer: $stemmode
echo Has init method: $hasinit
[ $stemmode = "yes" ] && echo Function prefix: $prefix
echo Source files: $cfile
echo Header files: $hfile
echo Object files: $ofile
echo Comment: $comment
echo Directory: ../../$dir
fi
[ $verbose = "yes" ] && echo -n 'Build directory... '
if [ ! -d ../../$dir ]; then
if ! mkdir ../../$dir ; then
echo "Can't create directory ../../$dir"
exit 1
fi
fi
[ $verbose = "yes" ] && echo ok
[ $verbose = "yes" ] && echo -n 'Build Makefile... '
sed s#CFG_DIR#$dir# < Makefile.IN | sed s#CFG_MODNAME#$dictname# | sed "s#CFG_OFILE#$ofile#" > ../../$dir/Makefile.tmp
if [ $stemmode = "yes" ] ; then
sed "s#^PG_CPPFLAGS.*\$#PG_CPPFLAGS = -I../tsearch2/snowball -I../tsearch2#" < ../../$dir/Makefile.tmp > ../../$dir/Makefile
else
sed "s#^PG_CPPFLAGS.*\$#PG_CPPFLAGS = -I../tsearch2#" < ../../$dir/Makefile.tmp > ../../$dir/Makefile
fi
rm ../../$dir/Makefile.tmp
[ $verbose = "yes" ] && echo ok
[ $verbose = "yes" ] && echo -n Build dict_$dictname'.sql.in... '
if [ $hasinit = "yes" ]; then
sed s#CFG_MODNAME#$dictname# < sql.IN | sed "s#CFG_COMMENT#$comment#" | sed s#^HASINIT## | sed 's#^NOINIT.*$##' > ../../$dir/dict_$dictname.sql.in.tmp
if [ $stemmode = "yes" ] ; then
sed s#^ISSNOWBALL## < ../../$dir/dict_$dictname.sql.in.tmp | sed s#^NOSNOWBALL.*\$## > ../../$dir/dict_$dictname.sql.in
else
sed s#^NOSNOWBALL## < ../../$dir/dict_$dictname.sql.in.tmp | sed s#^ISSNOWBALL.*\$## > ../../$dir/dict_$dictname.sql.in
fi
rm ../../$dir/dict_$dictname.sql.in.tmp
else
sed s#CFG_MODNAME#$dictname# < sql.IN | sed "s#CFG_COMMENT#$comment#" | sed s#^NOINIT## | sed 's#^HASINIT.*$##' | sed s#^NOSNOWBALL## | sed s#^ISSNOWBALL.*\$## > ../../$dir/dict_$dictname.sql.in
fi
[ $verbose = "yes" ] && echo ok
if [ ${#cfile} -ne 0 ] || [ ${#hfile} -ne 0 ] ; then
[ $verbose = "yes" ] && echo -n 'Copy source and header files... '
if [ ${#cfile} -ne 0 ] ; then
if ! cp $cfile ../../$dir ; then
echo "Can't copy one or more of the files: $cfile"
exit 1
fi
fi
if [ ${#hfile} -ne 0 ] ; then
if ! cp $hfile ../../$dir ; then
echo "Can't copy one or more of the files: $hfile"
exit 1
fi
fi
[ $verbose = "yes" ] && echo ok
fi
[ $verbose = "yes" ] && echo -n 'Build sub-include header... '
echo -n > ../../$dir/subinclude.h
for i in $hfile
do
echo "#include \"$i\"" >> ../../$dir/subinclude.h
done
[ $verbose = "yes" ] && echo ok
if [ $stemmode = "yes" ] ; then
[ $verbose = "yes" ] && echo -n 'Build Snowball stemmer... '
sed s#CFG_MODNAME#$dictname#g < dict_snowball.c.IN | sed s#CFG_PREFIX#$prefix#g > ../../$dir/dict_snowball.c
else
[ $verbose = "yes" ] && echo -n 'Build dictionary... '
sed s#CFG_MODNAME#$dictname#g < dict_tmpl.c.IN > ../../$dir/dict_tmpl.c.tmp
if [ $hasinit = "yes" ]; then
sed s#^HASINIT## < ../../$dir/dict_tmpl.c.tmp | sed 's#^NOINIT.*$##' > ../../$dir/dict_tmpl.c
else
sed s#^HASINIT.*\$## < ../../$dir/dict_tmpl.c.tmp | sed 's#^NOINIT##' > ../../$dir/dict_tmpl.c
fi
rm ../../$dir/dict_tmpl.c.tmp
fi
[ $verbose = "yes" ] && echo ok
[ $verbose = "yes" ] && echo -n "Build README.$dictname... "
if [ $stemmode = "yes" ] ; then
echo "Autogenerated Snowball's wrapper for $prefix" > ../../$dir/README.$dictname
else
echo "Autogenerated template for $dictname" > ../../$dir/README.$dictname
fi
[ $verbose = "yes" ] && echo ok
echo All is done
/*
* example of Snowball dictionary
* http://snowball.tartarus.org/
* Teodor Sigaev <teodor@sigaev.ru>
*/
#include <stdlib.h>
#include <string.h>
#include "postgres.h"
#include "dict.h"
#include "common.h"
#include "snowball/header.h"
#include "subinclude.h"
typedef struct {
struct SN_env *z;
StopList stoplist;
int (*stem)(struct SN_env * z);
} DictSnowball;
PG_FUNCTION_INFO_V1(dinit_CFG_MODNAME);
Datum dinit_CFG_MODNAME(PG_FUNCTION_ARGS);
Datum
dinit_CFG_MODNAME(PG_FUNCTION_ARGS) {
DictSnowball *d = (DictSnowball*)malloc( sizeof(DictSnowball) );
if ( !d )
elog(ERROR, "No memory");
memset(d,0,sizeof(DictSnowball));
d->stoplist.wordop=lowerstr;
if ( !PG_ARGISNULL(0) && PG_GETARG_POINTER(0)!=NULL ) {
text *in = PG_GETARG_TEXT_P(0);
readstoplist(in, &(d->stoplist));
sortstoplist(&(d->stoplist));
PG_FREE_IF_COPY(in, 0);
}
d->z = CFG_PREFIX_create_env();
if (!d->z) {
freestoplist(&(d->stoplist));
elog(ERROR,"No memory");
}
d->stem=CFG_PREFIX_stem;
PG_RETURN_POINTER(d);
}
/*
* example of dictionary
* Teodor Sigaev <teodor@sigaev.ru>
*/
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include "postgres.h"
#include "dict.h"
#include "common.h"
#include "subinclude.h"
HASINIT typedef struct {
HASINIT StopList stoplist;
HASINIT } DictExample;
HASINIT PG_FUNCTION_INFO_V1(dinit_CFG_MODNAME);
HASINIT Datum dinit_CFG_MODNAME(PG_FUNCTION_ARGS);
HASINIT Datum
HASINIT dinit_CFG_MODNAME(PG_FUNCTION_ARGS) {
HASINIT DictExample *d = (DictExample*)malloc( sizeof(DictExample) );
HASINIT
HASINIT if ( !d )
HASINIT elog(ERROR, "No memory");
HASINIT memset(d,0,sizeof(DictExample));
HASINIT
HASINIT d->stoplist.wordop=lowerstr;
HASINIT
HASINIT /* Your INIT code */
HASINIT
HASINIT if ( !PG_ARGISNULL(0) && PG_GETARG_POINTER(0)!=NULL ) {
HASINIT text *in = PG_GETARG_TEXT_P(0);
HASINIT readstoplist(in, &(d->stoplist));
HASINIT sortstoplist(&(d->stoplist));
HASINIT PG_FREE_IF_COPY(in, 0);
HASINIT }
HASINIT
HASINIT PG_RETURN_POINTER(d);
HASINIT }
PG_FUNCTION_INFO_V1(dlexize_CFG_MODNAME);
Datum dlexize_CFG_MODNAME(PG_FUNCTION_ARGS);
Datum
dlexize_CFG_MODNAME(PG_FUNCTION_ARGS) {
HASINIT DictExample *d = (DictExample*)PG_GETARG_POINTER(0);
char *in = (char*)PG_GETARG_POINTER(1);
char *txt = pnstrdup(in, PG_GETARG_INT32(2));
char **res=palloc(sizeof(char*)*2);
/* Your LEXIZE dictionary code */
HASINIT if ( *txt=='\0' || searchstoplist(&(d->stoplist),txt) ) {
HASINIT pfree(txt);
HASINIT res[0]=NULL;
HASINIT } else
res[0]=txt;
res[1]=NULL;
PG_RETURN_POINTER(res);
}
SET search_path = public;
BEGIN;
HASINIT create function dinit_CFG_MODNAME(text)
HASINIT returns internal
HASINIT as 'MODULE_PATHNAME'
HASINIT language 'C';
NOSNOWBALL create function dlexize_CFG_MODNAME(internal,internal,int4)
NOSNOWBALL returns internal
NOSNOWBALL as 'MODULE_PATHNAME'
NOSNOWBALL language 'C'
NOSNOWBALL with (isstrict);
insert into pg_ts_dict select
'CFG_MODNAME',
HASINIT (select oid from pg_proc where proname='dinit_CFG_MODNAME'),
NOINIT null,
null,
ISSNOWBALL (select oid from pg_proc where proname='snb_lexize'),
NOSNOWBALL (select oid from pg_proc where proname='dlexize_CFG_MODNAME'),
CFG_COMMENT
;
END;
#ifndef __GISTIDX_H__
#define __GISTIDX_H__
/*
#define GISTIDX_DEBUG
*/
/*
* signature defines
*/
#define BITBYTE 8
#define SIGLENINT 63 /* >121 => key will toast, so it will not
* work !!! */
#define SIGLEN ( sizeof(int4)*SIGLENINT )
#define SIGLENBIT (SIGLEN*BITBYTE)
typedef char BITVEC[SIGLEN];
typedef char *BITVECP;
#define LOOPBYTE(a) \
for(i=0;i<SIGLEN;i++) {\
a;\
}
#define LOOPBIT(a) \
for(i=0;i<SIGLENBIT;i++) {\
a;\
}
#define GETBYTE(x,i) ( *( (BITVECP)(x) + (int)( (i) / BITBYTE ) ) )
#define GETBITBYTE(x,i) ( ((char)(x)) >> (i) & 0x01 )
#define CLRBIT(x,i) GETBYTE(x,i) &= ~( 0x01 << ( (i) % BITBYTE ) )
#define SETBIT(x,i) GETBYTE(x,i) |= ( 0x01 << ( (i) % BITBYTE ) )
#define GETBIT(x,i) ( (GETBYTE(x,i) >> ( (i) % BITBYTE )) & 0x01 )
#define abs(a) ((a) < (0) ? -(a) : (a))
#define min(a,b) ((a) < (b) ? (a) : (b))
#define HASHVAL(val) (((unsigned int)(val)) % SIGLENBIT)
#define HASH(sign, val) SETBIT((sign), HASHVAL(val))
/*
* type of index key
*/
typedef struct
{
int4 len;
int4 flag;
char data[1];
} GISTTYPE;
#define ARRKEY 0x01
#define SIGNKEY 0x02
#define ALLISTRUE 0x04
#define ISARRKEY(x) ( ((GISTTYPE*)x)->flag & ARRKEY )
#define ISSIGNKEY(x) ( ((GISTTYPE*)x)->flag & SIGNKEY )
#define ISALLTRUE(x) ( ((GISTTYPE*)x)->flag & ALLISTRUE )
#define GTHDRSIZE ( sizeof(int4)*2 )
#define CALCGTSIZE(flag, len) ( GTHDRSIZE + ( ( (flag) & ARRKEY ) ? ((len)*sizeof(int4)) : (((flag) & ALLISTRUE) ? 0 : SIGLEN) ) )
#define GETSIGN(x) ( (BITVECP)( (char*)x+GTHDRSIZE ) )
#define GETARR(x) ( (int4*)( (char*)x+GTHDRSIZE ) )
#define ARRNELEM(x) ( ( ((GISTTYPE*)x)->len - GTHDRSIZE )/sizeof(int4) )
#endif
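The header above defines a fixed-length bit signature (SIGLENINT 4-byte words) and a HASH macro that sets the bit at position val modulo SIGLENBIT. The sketch below restates that bit arithmetic as small functions so it can be read in isolation. The names sig_pos, sig_set and sig_get are hypothetical, and uint32_t is assumed to match the header's int4 in size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BITBYTE   8
#define SIGLENINT 63
#define SIGLEN    (sizeof(uint32_t) * SIGLENINT)   /* 252 bytes */
#define SIGLENBIT (SIGLEN * BITBYTE)               /* 2016 bits */

/* Map a hash value onto a bit position, as HASHVAL does. */
static size_t sig_pos(uint32_t val)
{
    return val % SIGLENBIT;
}

/* Set the bit for val, as HASH(sign, val) does. */
static void sig_set(unsigned char *sig, uint32_t val)
{
    size_t bit = sig_pos(val);

    sig[bit / BITBYTE] |= (unsigned char) (1u << (bit % BITBYTE));
}

/* Return 1 if the bit for val is set, otherwise 0 (compare GETBIT). */
static int sig_get(const unsigned char *sig, uint32_t val)
{
    size_t bit = sig_pos(val);

    return (sig[bit / BITBYTE] >> (bit % BITBYTE)) & 1;
}
```

Because positions are taken modulo SIGLENBIT, two different hash values can land on the same bit. A set bit therefore means "possibly present", while a clear bit means "definitely absent".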
#ifndef __SPELL_H__
#define __SPELL_H__
#include <sys/types.h>
#include <regex.h>
typedef struct spell_struct {
char * word;
char flag[10];
} SPELL;
typedef struct aff_struct {
char flag;
char type;
char mask[33];
char find[16];
char repl[16];
regex_t reg;
size_t replen;
char compile;
} AFFIX;
typedef struct Tree_struct {
int Left[256], Right[256];
} Tree_struct;
typedef struct {
int maffixes;
int naffixes;
AFFIX * Affix;
int nspell;
int mspell;
SPELL *Spell;
Tree_struct SpellTree;
Tree_struct PrefixTree;
Tree_struct SuffixTree;
} IspellDict;
char ** NormalizeWord(IspellDict * Conf,char *word);
int ImportAffixes(IspellDict * Conf, const char *filename);
int ImportDictionary(IspellDict * Conf,const char *filename);
int AddSpell(IspellDict * Conf,const char * word,const char *flag);
int AddAffix(IspellDict * Conf,int flag,const char *mask,const char *find,const char *repl,int type);
void SortDictionary(IspellDict * Conf);
void SortAffixes(IspellDict * Conf);
void FreeIspell (IspellDict *Conf);
#endif
/*
* Simple config parser
* Teodor Sigaev <teodor@sigaev.ru>
*/
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "postgres.h"
#include "dict.h"
#include "common.h"
#define CS_WAITKEY 0
#define CS_INKEY 1
#define CS_WAITEQ 2
#define CS_WAITVALUE 3
#define CS_INVALUE 4
#define CS_IN2VALUE 5
#define CS_WAITDELIM 6
#define CS_INESC 7
#define CS_IN2ESC 8
static char *
nstrdup(char *ptr, int len) {
char *res=palloc(len+1), *cptr;
memcpy(res,ptr,len);
res[len]='\0';
cptr = ptr = res;
while(*ptr) {
if ( *ptr == '\\' )
ptr++;
*cptr=*ptr; ptr++; cptr++;
}
*cptr='\0';
return res;
}
void
parse_cfgdict(text *in, Map **m) {
Map *mptr;
char *ptr=VARDATA(in), *begin=NULL;
int num=0; /* comma count; an int avoids overflow on long option strings */
int state=CS_WAITKEY;
while( ptr-VARDATA(in) < VARSIZE(in) - VARHDRSZ ) {
if ( *ptr==',' ) num++;
ptr++;
}
*m=mptr=(Map*)palloc( sizeof(Map)*(num+2) );
memset(mptr, 0, sizeof(Map)*(num+2) );
ptr=VARDATA(in);
while( ptr-VARDATA(in) < VARSIZE(in) - VARHDRSZ ) {
if (state==CS_WAITKEY) {
if (isalpha(*ptr)) {
begin=ptr;
state=CS_INKEY;
} else if ( !isspace(*ptr) )
elog(ERROR,"Syntax error in position %d near '%c'", ptr-VARDATA(in), *ptr);
} else if (state==CS_INKEY) {
if ( isspace(*ptr) ) {
mptr->key=nstrdup(begin, ptr-begin);
state=CS_WAITEQ;
} else if ( *ptr=='=' ) {
mptr->key=nstrdup(begin, ptr-begin);
state=CS_WAITVALUE;
} else if ( !isalpha(*ptr) )
elog(ERROR,"Syntax error in position %d near '%c'", ptr-VARDATA(in), *ptr);
} else if ( state==CS_WAITEQ ) {
if ( *ptr=='=' )
state=CS_WAITVALUE;
else if ( !isspace(*ptr) )
elog(ERROR,"Syntax error in position %d near '%c'", ptr-VARDATA(in), *ptr);
} else if ( state==CS_WAITVALUE ) {
if ( *ptr=='"' ) {
begin=ptr+1;
state=CS_INVALUE;
} else if ( !isspace(*ptr) ) {
begin=ptr;
state=CS_IN2VALUE;
}
} else if ( state==CS_INVALUE ) {
if ( *ptr=='"' ) {
mptr->value = nstrdup(begin, ptr-begin);
mptr++;
state=CS_WAITDELIM;
} else if ( *ptr=='\\' )
state=CS_INESC;
} else if ( state==CS_IN2VALUE ) {
if ( isspace(*ptr) || *ptr==',' ) {
mptr->value = nstrdup(begin, ptr-begin);
mptr++;
state=( *ptr==',' ) ? CS_WAITKEY : CS_WAITDELIM;
} else if ( *ptr=='\\' )
state=CS_IN2ESC; /* escape inside an unquoted value: return to CS_IN2VALUE, not CS_INVALUE */
} else if ( state==CS_WAITDELIM ) {
if ( *ptr==',' )
state=CS_WAITKEY;
else if ( !isspace(*ptr) )
elog(ERROR,"Syntax error in position %d near '%c'", ptr-VARDATA(in), *ptr);
} else if ( state == CS_INESC ) {
state=CS_INVALUE;
} else if ( state == CS_IN2ESC ) {
state=CS_IN2VALUE;
} else
elog(ERROR,"Bad parser state: %d at position %d near '%c'", state, ptr-VARDATA(in), *ptr);
ptr++;
}
if (state==CS_IN2VALUE) {
mptr->value = nstrdup(begin, ptr-begin);
mptr++;
} else if ( !(state==CS_WAITDELIM || state==CS_WAITKEY) )
elog(ERROR,"Unexpected end of line");
}
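The nstrdup() helper in the parser above copies a value and then removes each backslash in place, keeping the character that follows it. Here is a standalone sketch of that unescaping step, with malloc() in place of palloc(). unescape_dup is a hypothetical name, and the guard for a trailing backslash is an addition that the original helper does not have.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy the first len bytes of src, then drop every backslash and keep
 * the character it escapes. A backslash that is the last character is
 * kept literally, so the scan never steps past the terminator.
 */
static char *unescape_dup(const char *src, size_t len)
{
    char *res = malloc(len + 1);
    char *rd, *wr;

    if (!res)
        return NULL;
    memcpy(res, src, len);
    res[len] = '\0';

    rd = wr = res;              /* read and write cursors over one buffer */
    while (*rd) {
        if (*rd == '\\' && rd[1] != '\0')
            rd++;               /* skip the backslash itself */
        *wr++ = *rd++;
    }
    *wr = '\0';
    return res;
}
```

Unescaping in place works because the write cursor can never overtake the read cursor: the output is always the same length as the input or shorter.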
#ifndef __QUERY_H__
#define __QUERY_H__
/*
#define BS_DEBUG
*/
/*
* item in polish notation with back link
* to left operand
*/
typedef struct ITEM
{
int8 type;
int8 weight;
int2 left;
int4 val;
/* user-friendly value, must correlate with WordEntry */
uint32
unused:1,
length:11,
distance:20;
} ITEM;
/*
*Storage:
* (len)(size)(array of ITEM)(array of operand in user-friendly form)
*/
typedef struct
{
int4 len;
int4 size;
char data[1];
} QUERYTYPE;
#define HDRSIZEQT ( 2*sizeof(int4) )
#define COMPUTESIZE(size,lenofoperand) ( HDRSIZEQT + size * sizeof(ITEM) + lenofoperand )
#define GETQUERY(x) (ITEM*)( (char*)(x)+HDRSIZEQT )
#define GETOPERAND(x) ( (char*)GETQUERY(x) + ((QUERYTYPE*)x)->size * sizeof(ITEM) )
#define ISOPERATOR(x) ( (x)=='!' || (x)=='&' || (x)=='|' || (x)=='(' || (x)==')' )
#define END 0
#define ERR 1
#define VAL 2
#define OPR 3
#define OPEN 4
#define CLOSE 5
#define VALTRUE 6 /* for stop words */
#define VALFALSE 7
bool TS_execute(ITEM * curitem, void *checkval,
bool calcnot, bool (*chkcond) (void *checkval, ITEM * val));
#endif
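The ITEM comment above describes a query stored in Polish (prefix) notation with a back link to the left operand. The sketch below shows one way such an array can be evaluated recursively. It rests on assumptions the header does not confirm: that an operator's right operand is the next array element and that left is an offset from the operator to its left operand. Node, eval and the present[] lookup are hypothetical simplifications of ITEM, TS_execute and its chkcond callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NODE_VAL, NODE_OPR };

/* Hypothetical, simplified counterpart of ITEM. */
typedef struct {
    int type;   /* NODE_VAL or NODE_OPR */
    int val;    /* operator character, or index into present[] for a value */
    int left;   /* assumed: offset from an operator to its left operand */
} Node;

/* Evaluate the subtree rooted at cur against a table of word presence. */
static bool eval(const Node *cur, const bool *present)
{
    if (cur->type == NODE_VAL)
        return present[cur->val];
    if (cur->val == '!')
        return !eval(cur + 1, present);          /* unary: operand follows */
    if (cur->val == '&')
        return eval(cur + cur->left, present) && eval(cur + 1, present);
    /* anything else is treated as '|' in this sketch */
    return eval(cur + cur->left, present) || eval(cur + 1, present);
}
```

Under these assumptions the query "a & !b" is stored as four nodes: '&' with left = 3, then '!', then value b, then value a. The left offset lets the evaluator jump over the right subtree without scanning it.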
#ifndef __REWRITE_H__
#define __REWRITE_H__
ITEM *clean_NOT_v2(ITEM * ptr, int4 *len);
ITEM *clean_fakeval_v2(ITEM * ptr, int4 *len);
#endif