Commit d78a7d9c authored by Teodor Sigaev

Improve support of Hunspell in ispell dictionary.

It is now possible to load recent versions of Hunspell dictionaries for
several languages. To handle these dictionaries, the patch adds support for:
* FLAG long - sets the flag type to two extended ASCII characters
* FLAG num - sets the flag type to a decimal number (from 1 to 65535)
* AF parameter - defines an alias for a set of flags

It also moves the test dictionaries into a separate directory.

Author: Artur Zakirov, with editing by me
parent 9445db92
......@@ -2615,18 +2615,41 @@ SELECT plainto_tsquery('supernova star');
</para>
<para>
To create an <application>Ispell</> dictionary, use the built-in
<literal>ispell</literal> template and specify several parameters:
To create an <application>Ispell</> dictionary perform these steps:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
download dictionary configuration files. <productname>OpenOffice</>
extension files have the <filename>.oxt</> extension. It is necessary
to extract <filename>.aff</> and <filename>.dic</> files, change
extensions to <filename>.affix</> and <filename>.dict</>. For some
dictionary files it is also necessary to convert characters to the UTF-8
encoding with commands (for example, for a Norwegian language dictionary):
<programlisting>
CREATE TEXT SEARCH DICTIONARY english_ispell (
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
</programlisting>
</para>
</listitem>
<listitem>
<para>
copy files to the <filename>$SHAREDIR/tsearch_data</> directory
</para>
</listitem>
<listitem>
<para>
load files into PostgreSQL with the following command:
<programlisting>
CREATE TEXT SEARCH DICTIONARY english_hunspell (
TEMPLATE = ispell,
DictFile = english,
AffFile = english,
StopWords = english
);
DictFile = en_us,
AffFile = en_us,
Stopwords = english);
</programlisting>
</para>
</listitem>
</itemizedlist>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
......@@ -2642,6 +2665,56 @@ CREATE TEXT SEARCH DICTIONARY english_ispell (
example, a Snowball dictionary, which recognizes everything.
</para>
<para>
The <filename>.affix</> file of <application>Ispell</> has the following
structure:
<programlisting>
prefixes
flag *A:
. > RE # As in enter > reenter
suffixes
flag T:
E > ST # As in late > latest
[^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
[AEIOU]Y > EST # As in gray > grayest
[^EY] > EST # As in small > smallest
</programlisting>
</para>
<para>
And the <filename>.dict</> file has the following structure:
<programlisting>
lapse/ADGRS
lard/DGRS
large/PRTY
lark/MRS
</programlisting>
</para>
<para>
Format of the <filename>.dict</> file is:
<programlisting>
basic_form/affix_class_name
</programlisting>
</para>
<para>
In the <filename>.affix</> file every affix flag is described in the
following format:
<programlisting>
condition > [-stripping_letters,] adding_affix
</programlisting>
</para>
<para>
Here, condition has a format similar to the format of regular expressions.
It can use groupings <literal>[...]</> and <literal>[^...]</>.
For example, <literal>[AEIOU]Y</> means that the last letter of the word
is <literal>"y"</> and the penultimate letter is <literal>"a"</>,
<literal>"e"</>, <literal>"i"</>, <literal>"o"</> or <literal>"u"</>.
<literal>[^EY]</> means that the last letter is neither <literal>"e"</>
nor <literal>"y"</>.
</para>
<para>
Ispell dictionaries support splitting compound words;
a useful feature.
......@@ -2663,6 +2736,65 @@ SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
</programlisting>
</para>
<para>
<application>MySpell</> format is a subset of <application>Hunspell</>.
The <filename>.affix</> file of <application>Hunspell</> has the following
structure:
<programlisting>
PFX A Y 1
PFX A 0 re .
SFX T N 4
SFX T 0 st e
SFX T y iest [^aeiou]y
SFX T 0 est [aeiou]y
SFX T 0 est [^ey]
</programlisting>
</para>
<para>
The first line of an affix class is the header. Fields of an affix rule are
listed after the header:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
parameter name (PFX or SFX)
</para>
</listitem>
<listitem>
<para>
flag (name of the affix class)
</para>
</listitem>
<listitem>
<para>
stripping characters from beginning (at prefix) or end (at suffix) of the
word
</para>
</listitem>
<listitem>
<para>
adding affix
</para>
</listitem>
<listitem>
<para>
condition that has a format similar to the format of regular expressions.
</para>
</listitem>
</itemizedlist>
<para>
The <filename>.dict</> file looks like the <filename>.dict</> file of
<application>Ispell</>:
<programlisting>
larder/M
lardy/RT
large/RSPMYT
largehearted
</programlisting>
</para>
<note>
<para>
<application>MySpell</> does not support compound words.
......
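The Hunspell rule fields documented in this hunk (stripping characters, adding affix, condition) can be illustrated with a small C sketch. It is not the code from spell.c: `SuffixRule`, `cond_matches` and `apply_suffix` are hypothetical names, matching is case-sensitive, and the condition syntax is limited to literal characters, `.`, `[...]` and `[^...]`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one SFX rule line: "SFX T y iest [^aeiou]y". */
typedef struct
{
    const char *strip;  /* characters removed from the end ("" for the "0" field) */
    const char *add;    /* affix appended after stripping */
    const char *cond;   /* condition on the end of the base word, "." = any char */
} SuffixRule;

/* Return true if the tail of word satisfies cond. */
static bool
cond_matches(const char *cond, const char *word)
{
    size_t      wlen = strlen(word);
    size_t      need = 0;
    const char *p;

    /* First pass: count how many characters the condition covers. */
    for (p = cond; *p; need++)
    {
        if (*p == '[')
        {
            p = strchr(p, ']');
            if (p == NULL)
                return false;   /* malformed class */
        }
        p++;
    }
    if (need > wlen)
        return false;

    /* Second pass: compare the condition against the last `need` characters. */
    word += wlen - need;
    for (p = cond; *p; word++)
    {
        if (*p == '[')
        {
            const char *end = strchr(p, ']');
            bool        negate = (p[1] == '^');
            bool        found = false;
            const char *q;

            for (q = p + (negate ? 2 : 1); q < end; q++)
                if (*q == *word)
                    found = true;
            if (found == negate)
                return false;
            p = end + 1;
        }
        else
        {
            if (*p != '.' && *p != *word)
                return false;
            p++;
        }
    }
    return true;
}

/* Build the affixed form of word in out; return false if the rule does not apply. */
static bool
apply_suffix(const SuffixRule *rule, const char *word, char *out, size_t outsz)
{
    size_t wlen = strlen(word);
    size_t slen = strlen(rule->strip);

    if (!cond_matches(rule->cond, word))
        return false;
    if (slen > wlen || strcmp(word + wlen - slen, rule->strip) != 0)
        return false;
    if (wlen - slen + strlen(rule->add) + 1 > outsz)
        return false;           /* result would not fit */

    memcpy(out, word, wlen - slen);
    strcpy(out + (wlen - slen), rule->add);
    return true;
}
```

With the four `SFX T` rules from the example, only the `y iest [^aeiou]y` rule applies to `dirty` (giving `dirtiest`), while `gray` is rejected by that rule and accepted by `0 est [aeiou]y`.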
......@@ -13,8 +13,11 @@ include $(top_builddir)/src/Makefile.global
DICTDIR=tsearch_data
DICTFILES=synonym_sample.syn thesaurus_sample.ths hunspell_sample.affix \
ispell_sample.affix ispell_sample.dict
DICTFILES=dicts/synonym_sample.syn dicts/thesaurus_sample.ths \
dicts/hunspell_sample.affix \
dicts/ispell_sample.affix dicts/ispell_sample.dict \
dicts/hunspell_sample_long.affix dicts/hunspell_sample_long.dict \
dicts/hunspell_sample_num.affix dicts/hunspell_sample_num.dict
OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
dict_simple.o dict_synonym.o dict_thesaurus.o \
......
FLAG long
AF 7
AF cZ #1
AF cL #2
AF sGsJpUsS #3
AF sSpB #4
AF cZsS #5
AF sScZs\ #6
AF sA #7
COMPOUNDFLAG cZ
ONLYINCOMPOUND cL
PFX pB Y 1
PFX pB 0 re .
PFX pU N 1
PFX pU 0 un .
SFX sJ Y 1
SFX sJ 0 INGS [^E]
SFX sG Y 1
SFX sG 0 ING [^E]
SFX sS Y 1
SFX sS 0 S [^SXZHY]
SFX sA Y 1
SFX sA Y IES [^AEIOU]Y
SFX s\ N 1
SFX s\ 0 Y/2 [^Y]
book/3
booking/4
footballklubber
foot/5
football/1
ball/6
klubber/1
sky/7
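To make the `FLAG long` sample above easier to read: with `AF` aliases, a dictionary entry such as `book/3` refers to the third `AF` line (`sGsJpUsS`), which in long mode is read as the two-character flags `sG`, `sJ`, `pU` and `sS`. The following C sketch shows that lookup under stated assumptions. The alias table is copied by hand from the sample file, and `resolve_alias`, `count_long_flags` and `nth_long_flag` are hypothetical helpers, not functions from spell.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Alias table copied from the AF lines of the sample; index 0 is unused. */
static const char *const af_aliases[] = {
    NULL, "cZ", "cL", "sGsJpUsS", "sSpB", "cZsS", "sScZs\\", "sA"
};
#define N_ALIASES 7

/* Map the flag field of a .dict entry ("3" in "book/3") to its flag set. */
static const char *
resolve_alias(const char *field)
{
    char *end;
    long  n = strtol(field, &end, 10);

    if (end == field || *end != '\0' || n < 1 || n > N_ALIASES)
        return NULL;            /* not a valid alias number */
    return af_aliases[n];
}

/* Number of two-character flags in a set, or -1 if the length is odd. */
static int
count_long_flags(const char *set)
{
    size_t len = strlen(set);

    return (len % 2 == 0) ? (int) (len / 2) : -1;
}

/* Copy the i-th two-character flag of set into out (NUL-terminated). */
static bool
nth_long_flag(const char *set, int i, char out[3])
{
    int n = count_long_flags(set);

    if (i < 0 || i >= n)
        return false;
    out[0] = set[2 * i];
    out[1] = set[2 * i + 1];
    out[2] = '\0';
    return true;
}
```

For `ball/6`, the alias resolves to `sScZs\`, that is, the flags `sS`, `cZ` and `s\`; the last one is the suffix class whose rule carries the continuation alias `/2`.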
FLAG num
COMPOUNDFLAG 101
ONLYINCOMPOUND 102
PFX 201 Y 1
PFX 201 0 re .
PFX 202 N 1
PFX 202 0 un .
SFX 301 Y 1
SFX 301 0 INGS [^E]
SFX 302 Y 1
SFX 302 0 ING [^E]
SFX 303 Y 1
SFX 303 0 S [^SXZHY]
SFX 304 Y 1
SFX 304 Y IES [^AEIOU]Y
SFX 305 N 1
SFX 305 0 Y/102 [^Y]
book/302,301,202,303
booking/303,201
footballklubber
foot/101,303
football/101
ball/303,101,305
klubber/101
sky/304
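In the `FLAG num` sample above, each flag field is a comma-separated list of decimal numbers, and the commit message gives the allowed range as 1 to 65535. A minimal C sketch of such a parser follows. `parse_num_flags` is a hypothetical helper written for illustration, and its error handling (rejecting empty items and trailing commas) is an assumption rather than the behavior of spell.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a FLAG num field such as "302,301,202,303" into out[].
 * Returns the number of flags stored, or -1 on a syntax or range error.
 */
static int
parse_num_flags(const char *field, uint16_t *out, int max)
{
    int         n = 0;
    const char *p = field;

    while (*p)
    {
        char *end;
        long  v;

        if (!isdigit((unsigned char) *p))
            return -1;          /* empty item or stray character */
        v = strtol(p, &end, 10);
        if (v < 1 || v > 65535 || n >= max)
            return -1;          /* out of range or too many flags */
        out[n++] = (uint16_t) v;

        if (*end == ',')
        {
            end++;
            if (*end == '\0')
                return -1;      /* trailing comma */
        }
        else if (*end != '\0')
            return -1;          /* unexpected separator */
        p = end;
    }
    return n;
}
```

Applied to `book/302,301,202,303`, the sketch yields four flags: the `ING` and `INGS` suffix classes, the `un` prefix class and the `S` suffix class.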
......@@ -19,18 +19,18 @@
#include "tsearch/ts_public.h"
/*
* Max length of a flag name. Names longer than this will be truncated
* to the maximum.
* SPNode and SPNodeData are used to represent prefix tree (Trie) to store
* a words list.
*/
#define MAXFLAGLEN 16
struct SPNode;
typedef struct
{
uint32 val:8,
isword:1,
/* Stores compound flags listed below */
compoundflag:4,
/* Reference to an entry of the AffixData field */
affix:19;
struct SPNode *node;
} SPNodeData;
......@@ -43,7 +43,8 @@ typedef struct
#define FF_COMPOUNDBEGIN 0x02
#define FF_COMPOUNDMIDDLE 0x04
#define FF_COMPOUNDLAST 0x08
#define FF_COMPOUNDFLAG ( FF_COMPOUNDBEGIN | FF_COMPOUNDMIDDLE | FF_COMPOUNDLAST )
#define FF_COMPOUNDFLAG ( FF_COMPOUNDBEGIN | FF_COMPOUNDMIDDLE | \
FF_COMPOUNDLAST )
#define FF_DICTFLAGMASK 0x0f
typedef struct SPNode
......@@ -54,19 +55,24 @@ typedef struct SPNode
#define SPNHDRSZ (offsetof(SPNode,data))
/*
* Represents an entry in a words list.
*/
typedef struct spell_struct
{
union
{
/*
* flag is filled in by NIImportDictionary. After NISortDictionary, d
* is valid and flag is invalid.
* flag is filled in by NIImportDictionary(). After NISortDictionary(),
* d is used instead of flag.
*/
char flag[MAXFLAGLEN];
char *flag;
/* d is used in mkSPNode() */
struct
{
/* Reference to an entry of the AffixData field */
int affix;
/* Length of the word */
int len;
} d;
} p;
......@@ -75,10 +81,14 @@ typedef struct spell_struct
#define SPELLHDRSZ (offsetof(SPELL, word))
/*
* Represents an entry in an affix list.
*/
typedef struct aff_struct
{
uint32 flag:8,
type:1,
uint32 flag:16;
/* FF_SUFFIX or FF_PREFIX */
uint32 type:1,
flagflags:7,
issimple:1,
isregis:1,
......@@ -106,6 +116,10 @@ typedef struct aff_struct
#define FF_SUFFIX 1
#define FF_PREFIX 0
/*
* AffixNode and AffixNodeData are used to represent prefix tree (Trie) to store
* an affix list.
*/
struct AffixNode;
typedef struct
......@@ -132,6 +146,16 @@ typedef struct
bool issuffix;
} CMPDAffix;
typedef enum
{
FM_CHAR,
FM_LONG,
FM_NUM
} FlagMode;
#define FLAGCHAR_MAXSIZE (1 << 8)
#define FLAGNUM_MAXSIZE (1 << 16)
typedef struct
{
int maffixes;
......@@ -142,14 +166,17 @@ typedef struct
AffixNode *Prefix;
SPNode *Dictionary;
/* Array of sets of affixes */
char **AffixData;
int lenAffixData;
int nAffixData;
bool useFlagAliases;
CMPDAffix *CompoundAffix;
unsigned char flagval[256];
unsigned char flagval[FLAGNUM_MAXSIZE];
bool usecompound;
FlagMode flagMode;
/*
* Remaining fields are only used during dictionary construction; they are
......
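The header changes above widen the affix flag field from 8 to 16 bits and size `flagval` by `FLAGNUM_MAXSIZE`. The sketch below illustrates why 16 bits cover both new modes. `FlagMode` and the two size macros are copied from the hunk; `max_flag_value`, `pack_long_flag` and `AffixFlagDemo` are hypothetical, and packing a long flag as (first byte << 8) | second byte is an assumption about the encoding, not something shown in this diff.

```c
#include <assert.h>
#include <stdint.h>

typedef enum
{
    FM_CHAR,
    FM_LONG,
    FM_NUM
} FlagMode;

#define FLAGCHAR_MAXSIZE (1 << 8)
#define FLAGNUM_MAXSIZE  (1 << 16)

/* Hypothetical helper: largest flag value that can occur in each mode. */
static int
max_flag_value(FlagMode mode)
{
    return (mode == FM_CHAR ? FLAGCHAR_MAXSIZE : FLAGNUM_MAXSIZE) - 1;
}

/* Assumed encoding: two 8-bit characters packed into one 16-bit value. */
static uint16_t
pack_long_flag(unsigned char first, unsigned char second)
{
    return (uint16_t) ((first << 8) | second);
}

/* Reduced stand-in for aff_struct: only the widened flag and the type bit. */
typedef struct
{
    unsigned int flag:16;
    unsigned int type:1;
} AffixFlagDemo;
```

Under these assumptions, the largest `FLAG num` value (65535) and any packed two-character flag both fit in the 16-bit `flag` field, and both index safely into an array of `FLAGNUM_MAXSIZE` elements.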
......@@ -191,6 +191,198 @@ SELECT ts_lexize('hunspell', 'footballyklubber');
{foot,ball,klubber}
(1 row)
-- Test ISpell dictionary with hunspell affix file with FLAG long parameter
CREATE TEXT SEARCH DICTIONARY hunspell_long (
Template=ispell,
DictFile=hunspell_sample_long,
AffFile=hunspell_sample_long
);
SELECT ts_lexize('hunspell_long', 'skies');
ts_lexize
-----------
{sky}
(1 row)
SELECT ts_lexize('hunspell_long', 'bookings');
ts_lexize
----------------
{booking,book}
(1 row)
SELECT ts_lexize('hunspell_long', 'booking');
ts_lexize
----------------
{booking,book}
(1 row)
SELECT ts_lexize('hunspell_long', 'foot');
ts_lexize
-----------
{foot}
(1 row)
SELECT ts_lexize('hunspell_long', 'foots');
ts_lexize
-----------
{foot}
(1 row)
SELECT ts_lexize('hunspell_long', 'rebookings');
ts_lexize
----------------
{booking,book}
(1 row)
SELECT ts_lexize('hunspell_long', 'rebooking');
ts_lexize
----------------
{booking,book}
(1 row)
SELECT ts_lexize('hunspell_long', 'rebook');
ts_lexize
-----------
(1 row)
SELECT ts_lexize('hunspell_long', 'unbookings');
ts_lexize
-----------
{book}
(1 row)
SELECT ts_lexize('hunspell_long', 'unbooking');
ts_lexize
-----------
{book}
(1 row)
SELECT ts_lexize('hunspell_long', 'unbook');
ts_lexize
-----------
{book}
(1 row)
SELECT ts_lexize('hunspell_long', 'footklubber');
ts_lexize
----------------
{foot,klubber}
(1 row)
SELECT ts_lexize('hunspell_long', 'footballklubber');
ts_lexize
------------------------------------------------------
{footballklubber,foot,ball,klubber,football,klubber}
(1 row)
SELECT ts_lexize('hunspell_long', 'ballyklubber');
ts_lexize
----------------
{ball,klubber}
(1 row)
SELECT ts_lexize('hunspell_long', 'footballyklubber');
ts_lexize
---------------------
{foot,ball,klubber}
(1 row)
-- Test ISpell dictionary with hunspell affix file with FLAG num parameter
CREATE TEXT SEARCH DICTIONARY hunspell_num (
Template=ispell,
DictFile=hunspell_sample_num,
AffFile=hunspell_sample_num
);
SELECT ts_lexize('hunspell_num', 'skies');
ts_lexize
-----------
{sky}
(1 row)
SELECT ts_lexize('hunspell_num', 'bookings');
ts_lexize
----------------
{booking,book}
(1 row)
SELECT ts_lexize('hunspell_num', 'booking');
ts_lexize
----------------
{booking,book}
(1 row)
SELECT ts_lexize('hunspell_num', 'foot');
ts_lexize
-----------
{foot}
(1 row)
SELECT ts_lexize('hunspell_num', 'foots');
ts_lexize
-----------
{foot}
(1 row)
SELECT ts_lexize('hunspell_num', 'rebookings');
ts_lexize
----------------
{booking,book}
(1 row)
SELECT ts_lexize('hunspell_num', 'rebooking');
ts_lexize
----------------
{booking,book}
(1 row)
SELECT ts_lexize('hunspell_num', 'rebook');
ts_lexize
-----------
(1 row)
SELECT ts_lexize('hunspell_num', 'unbookings');
ts_lexize
-----------
{book}
(1 row)
SELECT ts_lexize('hunspell_num', 'unbooking');
ts_lexize
-----------
{book}
(1 row)
SELECT ts_lexize('hunspell_num', 'unbook');
ts_lexize
-----------
{book}
(1 row)
SELECT ts_lexize('hunspell_num', 'footklubber');
ts_lexize
----------------
{foot,klubber}
(1 row)
SELECT ts_lexize('hunspell_num', 'footballklubber');
ts_lexize
------------------------------------------------------
{footballklubber,foot,ball,klubber,football,klubber}
(1 row)
SELECT ts_lexize('hunspell_num', 'ballyklubber');
ts_lexize
----------------
{ball,klubber}
(1 row)
SELECT ts_lexize('hunspell_num', 'footballyklubber');
ts_lexize
---------------------
{foot,ball,klubber}
(1 row)
-- Synonim dictionary
CREATE TEXT SEARCH DICTIONARY synonym (
Template=synonym,
......@@ -277,6 +469,48 @@ SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
'foot':B & 'ball':B & 'klubber':B & ( 'booking':A | 'book':A ) & 'sky'
(1 row)
-- Test ispell dictionary with hunspell affix with FLAG long in configuration
ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
REPLACE hunspell WITH hunspell_long;
SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
to_tsvector
----------------------------------------------------------------------------------------------------
'ball':7 'book':1,5 'booking':1,5 'foot':7,10 'football':7 'footballklubber':7 'klubber':7 'sky':3
(1 row)
SELECT to_tsquery('hunspell_tst', 'footballklubber');
to_tsquery
------------------------------------------------------------------------------
( 'footballklubber' | 'foot' & 'ball' & 'klubber' ) | 'football' & 'klubber'
(1 row)
SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
to_tsquery
------------------------------------------------------------------------
'foot':B & 'ball':B & 'klubber':B & ( 'booking':A | 'book':A ) & 'sky'
(1 row)
-- Test ispell dictionary with hunspell affix with FLAG num in configuration
ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
REPLACE hunspell_long WITH hunspell_num;
SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
to_tsvector
----------------------------------------------------------------------------------------------------
'ball':7 'book':1,5 'booking':1,5 'foot':7,10 'football':7 'footballklubber':7 'klubber':7 'sky':3
(1 row)
SELECT to_tsquery('hunspell_tst', 'footballklubber');
to_tsquery
------------------------------------------------------------------------------
( 'footballklubber' | 'foot' & 'ball' & 'klubber' ) | 'football' & 'klubber'
(1 row)
SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
to_tsquery
------------------------------------------------------------------------
'foot':B & 'ball':B & 'klubber':B & ( 'booking':A | 'book':A ) & 'sky'
(1 row)
-- Test synonym dictionary in configuration
CREATE TEXT SEARCH CONFIGURATION synonym_tst (
COPY=english
......
......@@ -48,6 +48,54 @@ SELECT ts_lexize('hunspell', 'footballklubber');
SELECT ts_lexize('hunspell', 'ballyklubber');
SELECT ts_lexize('hunspell', 'footballyklubber');
-- Test ISpell dictionary with hunspell affix file with FLAG long parameter
CREATE TEXT SEARCH DICTIONARY hunspell_long (
Template=ispell,
DictFile=hunspell_sample_long,
AffFile=hunspell_sample_long
);
SELECT ts_lexize('hunspell_long', 'skies');
SELECT ts_lexize('hunspell_long', 'bookings');
SELECT ts_lexize('hunspell_long', 'booking');
SELECT ts_lexize('hunspell_long', 'foot');
SELECT ts_lexize('hunspell_long', 'foots');
SELECT ts_lexize('hunspell_long', 'rebookings');
SELECT ts_lexize('hunspell_long', 'rebooking');
SELECT ts_lexize('hunspell_long', 'rebook');
SELECT ts_lexize('hunspell_long', 'unbookings');
SELECT ts_lexize('hunspell_long', 'unbooking');
SELECT ts_lexize('hunspell_long', 'unbook');
SELECT ts_lexize('hunspell_long', 'footklubber');
SELECT ts_lexize('hunspell_long', 'footballklubber');
SELECT ts_lexize('hunspell_long', 'ballyklubber');
SELECT ts_lexize('hunspell_long', 'footballyklubber');
-- Test ISpell dictionary with hunspell affix file with FLAG num parameter
CREATE TEXT SEARCH DICTIONARY hunspell_num (
Template=ispell,
DictFile=hunspell_sample_num,
AffFile=hunspell_sample_num
);
SELECT ts_lexize('hunspell_num', 'skies');
SELECT ts_lexize('hunspell_num', 'bookings');
SELECT ts_lexize('hunspell_num', 'booking');
SELECT ts_lexize('hunspell_num', 'foot');
SELECT ts_lexize('hunspell_num', 'foots');
SELECT ts_lexize('hunspell_num', 'rebookings');
SELECT ts_lexize('hunspell_num', 'rebooking');
SELECT ts_lexize('hunspell_num', 'rebook');
SELECT ts_lexize('hunspell_num', 'unbookings');
SELECT ts_lexize('hunspell_num', 'unbooking');
SELECT ts_lexize('hunspell_num', 'unbook');
SELECT ts_lexize('hunspell_num', 'footklubber');
SELECT ts_lexize('hunspell_num', 'footballklubber');
SELECT ts_lexize('hunspell_num', 'ballyklubber');
SELECT ts_lexize('hunspell_num', 'footballyklubber');
-- Synonim dictionary
CREATE TEXT SEARCH DICTIONARY synonym (
Template=synonym,
......@@ -94,6 +142,22 @@ SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footb
SELECT to_tsquery('hunspell_tst', 'footballklubber');
SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
-- Test ispell dictionary with hunspell affix with FLAG long in configuration
ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
REPLACE hunspell WITH hunspell_long;
SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
SELECT to_tsquery('hunspell_tst', 'footballklubber');
SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
-- Test ispell dictionary with hunspell affix with FLAG num in configuration
ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING
REPLACE hunspell_long WITH hunspell_num;
SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot');
SELECT to_tsquery('hunspell_tst', 'footballklubber');
SELECT to_tsquery('hunspell_tst', 'footballyklubber:b & rebookings:A & sky');
-- Test synonym dictionary in configuration
CREATE TEXT SEARCH CONFIGURATION synonym_tst (
COPY=english
......