Commit 38e2bf62 authored by Teodor Sigaev

ISpell info updated

parent ef38ca9b
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title>tsearch-v2-intro</title>
<link type="text/css" rel="stylesheet" href="tsearch-V2-intro_files/tsearch.txt"></head>
<html>
<head>
<title>tsearch-v2-intro</title>
<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
</head>
<body> <body>
<div class="content"> <div class="content">
<h2>Tsearch2 - Introduction</h2> <h2>Tsearch2 - Introduction</h2>
<p><a href= <p><a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html">
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html">
[Online version]</a> of this document is available.</p> [Online version]</a> of this document is available.</p>
<p>The tsearch2 module is available to add as an extension to <p>The tsearch2 module is available to add as an extension to
...@@ -38,13 +34,11 @@ ...@@ -38,13 +34,11 @@
<p>The README.tsearch2 file included in the contrib/tsearch2 <p>The README.tsearch2 file included in the contrib/tsearch2
directory contains a brief overview and history behind tsearch. directory contains a brief overview and history behind tsearch.
This can also be found online <a href= This can also be found online <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/">[right
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[right
here]</a>.</p> here]</a>.</p>
<p>Further in depth documentation such as a full function <p>Further in depth documentation such as a full function
reference, and user guide can be found online at the <a href= reference, and user guide can be found online at the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/">[tsearch
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/">[tsearch
documentation home]</a>.</p> documentation home]</a>.</p>
<h3>ACKNOWLEDGEMENTS</h3> <h3>ACKNOWLEDGEMENTS</h3>
...@@ -105,11 +99,9 @@ ...@@ -105,11 +99,9 @@
<p>Step one is to download the tsearch V2 module :</p> <p>Step one is to download the tsearch V2 module :</p>
<p><a href= <p><a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/">[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/]</a>
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/]</a>
(check Development History for latest stable version !)</p> (check Development History for latest stable version !)</p>
<pre> <pre> tar -zxvf tsearch-v2.tar.gz
tar -zxvf tsearch-v2.tar.gz
mv tsearch2 PGSQL_SRC/contrib/ mv tsearch2 PGSQL_SRC/contrib/
cd PGSQL_SRC/contrib/tsearch2 cd PGSQL_SRC/contrib/tsearch2
</pre> </pre>
...@@ -121,18 +113,15 @@ ...@@ -121,18 +113,15 @@
<p>Then continue with the regular building and installation <p>Then continue with the regular building and installation
process</p> process</p>
<pre> <pre> gmake
gmake
gmake install gmake install
gmake installcheck gmake installcheck
</pre> </pre>
<p>That is pretty much all you have to do, unless of course you <p>That is pretty much all you have to do, unless of course you
get errors. If you do, you had better check with get errors. If you do, you had better check with
the mailing lists over at <a href= the mailing lists over at <a href="http://www.postgresql.org/">http://www.postgresql.org</a> or
"http://www.postgresql.org">http://www.postgresql.org</a> or <a href="http://openfts.sourceforge.net/">http://openfts.sourceforge.net/</a>
<a href=
"http://openfts.sourceforge.net/">http://openfts.sourceforge.net/</a>
since it has never failed for me.</p> since it has never failed for me.</p>
<p>The directory in the contrib/ and the directory from the <p>The directory in the contrib/ and the directory from the
...@@ -151,15 +140,13 @@ ...@@ -151,15 +140,13 @@
<p>We should create a database to use as an example for the <p>We should create a database to use as an example for the
remainder of this file. We can call the database "ftstest". You remainder of this file. We can call the database "ftstest". You
can create it from the command line like this:</p> can create it from the command line like this:</p>
<pre> <pre> #createdb ftstest
#createdb ftstest
</pre> </pre>
<p>If you thought installation was easy, this next bit is even <p>If you thought installation was easy, this next bit is even
easier. Change to the PGSQL_SRC/contrib/tsearch2 directory and easier. Change to the PGSQL_SRC/contrib/tsearch2 directory and
type:</p> type:</p>
<pre> <pre> psql ftstest &lt; tsearch2.sql
psql ftstest &lt; tsearch2.sql
</pre> </pre>
<p>The file "tsearch2.sql" holds all the wonderful little <p>The file "tsearch2.sql" holds all the wonderful little
...@@ -170,8 +157,7 @@ ...@@ -170,8 +157,7 @@
pg_ts_cfgmap are added.</p> pg_ts_cfgmap are added.</p>
<p>You can check out the tables if you like:</p> <p>You can check out the tables if you like:</p>
<pre> <pre> #psql ftstest
#psql ftstest
ftstest=# \d ftstest=# \d
List of relations List of relations
Schema | Name | Type | Owner Schema | Name | Type | Owner
...@@ -188,8 +174,7 @@ ...@@ -188,8 +174,7 @@
<p>The first thing we can do is try out some of the types that <p>The first thing we can do is try out some of the types that
are provided for us. Let's look at the tsvector type provided are provided for us. Let's look at the tsvector type provided
for us:</p> for us:</p>
<pre> <pre> SELECT 'Our first string used today'::tsvector;
SELECT 'Our first string used today'::tsvector;
tsvector tsvector
--------------------------------------- ---------------------------------------
'Our' 'used' 'first' 'today' 'string' 'Our' 'used' 'first' 'today' 'string'
...@@ -199,8 +184,7 @@ ...@@ -199,8 +184,7 @@
<p>The results are the words used within our string. Notice <p>The results are the words used within our string. Notice
they are not in any particular order. The tsvector type returns they are not in any particular order. The tsvector type returns
a string of space separated words.</p> a string of space separated words.</p>
<pre> <pre> SELECT 'Our first string used today first string'::tsvector;
SELECT 'Our first string used today first string'::tsvector;
tsvector tsvector
----------------------------------------------- -----------------------------------------------
'Our' 'used' 'first' 'today' 'string' 'Our' 'used' 'first' 'today' 'string'
...@@ -217,8 +201,7 @@ ...@@ -217,8 +201,7 @@
by the tsearch2 module.</p> by the tsearch2 module.</p>
<p>The function to_tsvector has 3 possible signatures:</p> <p>The function to_tsvector has 3 possible signatures:</p>
<pre> <pre> to_tsvector(oid, text);
to_tsvector(oid, text);
to_tsvector(text, text); to_tsvector(text, text);
to_tsvector(text); to_tsvector(text);
</pre> </pre>
...@@ -228,8 +211,7 @@ ...@@ -228,8 +211,7 @@
the searchable text is broken up into words (Stemming process). the searchable text is broken up into words (Stemming process).
Right now we will specify the 'default' configuration. See the Right now we will specify the 'default' configuration. See the
section on TSEARCH2 CONFIGURATION to learn more about this.</p> section on TSEARCH2 CONFIGURATION to learn more about this.</p>
<pre> <pre> SELECT to_tsvector('default',
SELECT to_tsvector('default',
'Our first string used today first string'); 'Our first string used today first string');
to_tsvector to_tsvector
-------------------------------------------- --------------------------------------------
...@@ -259,8 +241,7 @@ ...@@ -259,8 +241,7 @@
<p>If you want to view the output of the tsvector fields <p>If you want to view the output of the tsvector fields
without their positions, you can do so with the function without their positions, you can do so with the function
"strip(tsvector)".</p> "strip(tsvector)".</p>
<pre> <pre> SELECT strip(to_tsvector('default',
SELECT strip(to_tsvector('default',
'Our first string used today first string')); 'Our first string used today first string'));
strip strip
-------------------------------- --------------------------------
...@@ -270,8 +251,7 @@ ...@@ -270,8 +251,7 @@
<p>If you wish to know the number of unique words returned in <p>If you wish to know the number of unique words returned in
the tsvector you can do so by using the function the tsvector you can do so by using the function
"length(tsvector)"</p> "length(tsvector)"</p>
<pre> <pre> SELECT length(to_tsvector('default',
SELECT length(to_tsvector('default',
'Our first string used today first string')); 'Our first string used today first string'));
length length
-------- --------
...@@ -282,15 +262,13 @@ ...@@ -282,15 +262,13 @@
<p>Let's take a look at the function to_tsquery. It also has 3 <p>Let's take a look at the function to_tsquery. It also has 3
signatures which follow the same rationale as the to_tsvector signatures which follow the same rationale as the to_tsvector
function:</p> function:</p>
<pre> <pre> to_tsquery(oid, text);
to_tsquery(oid, text);
to_tsquery(text, text); to_tsquery(text, text);
to_tsquery(text); to_tsquery(text);
</pre> </pre>
<p>Let's try using the function with a single word:</p> <p>Let's try using the function with a single word:</p>
<pre> <pre> SELECT to_tsquery('default', 'word');
SELECT to_tsquery('default', 'word');
to_tsquery to_tsquery
----------- -----------
'word' 'word'
...@@ -303,8 +281,7 @@ ...@@ -303,8 +281,7 @@
<p>Let's attempt to use the function with a string of multiple <p>Let's attempt to use the function with a string of multiple
words:</p> words:</p>
<pre> <pre> SELECT to_tsquery('default', 'this is many words');
SELECT to_tsquery('default', 'this is many words');
ERROR: Syntax error ERROR: Syntax error
</pre> </pre>
...@@ -313,8 +290,7 @@ ...@@ -313,8 +290,7 @@
"tsquery" used for searching a tsvector field. What we need to "tsquery" used for searching a tsvector field. What we need to
do is search for one to many words with some kind of logic (for do is search for one to many words with some kind of logic (for
now simple boolean).</p> now simple boolean).</p>
<pre> <pre> SELECT to_tsquery('default', 'searching|sentence');
SELECT to_tsquery('default', 'searching|sentence');
to_tsquery to_tsquery
---------------------- ----------------------
'search' | 'sentenc' 'search' | 'sentenc'
...@@ -328,8 +304,7 @@ ...@@ -328,8 +304,7 @@
<p>You cannot use words defined as stop words in your <p>You cannot use words defined as stop words in your
configuration. The function will not fail ... you will just get configuration. The function will not fail ... you will just get
no result, and a NOTICE like this:</p> no result, and a NOTICE like this:</p>
<pre> <pre> SELECT to_tsquery('default', 'a|is&amp;not|!the');
SELECT to_tsquery('default', 'a|is&amp;not|!the');
NOTICE: Query contains only stopword(s) NOTICE: Query contains only stopword(s)
or doesn't contain lexem(s), ignored or doesn't contain lexem(s), ignored
to_tsquery to_tsquery
...@@ -348,8 +323,7 @@ ...@@ -348,8 +323,7 @@
<p>The next stage is to add a full text index to an existing <p>The next stage is to add a full text index to an existing
table. In this example we already have a table defined as table. In this example we already have a table defined as
follows:</p> follows:</p>
<pre> <pre> CREATE TABLE tblMessages
CREATE TABLE tblMessages
( (
intIndex int4, intIndex int4,
strTopic varchar(100), strTopic varchar(100),
...@@ -362,8 +336,7 @@ ...@@ -362,8 +336,7 @@
test strings for a topic, and a message. Here is some test data test strings for a topic, and a message. Here is some test data
I inserted. (yes I know it's completely useless stuff ;-) but I inserted. (yes I know it's completely useless stuff ;-) but
it will serve our purpose right now).</p> it will serve our purpose right now).</p>
<pre> <pre> INSERT INTO tblMessages
INSERT INTO tblMessages
VALUES ('1', 'Testing Topic', 'Testing message data input'); VALUES ('1', 'Testing Topic', 'Testing message data input');
INSERT INTO tblMessages INSERT INTO tblMessages
VALUES ('2', 'Movie', 'Breakfast at Tiffany\'s'); VALUES ('2', 'Movie', 'Breakfast at Tiffany\'s');
...@@ -400,8 +373,7 @@ ...@@ -400,8 +373,7 @@
<p>The next stage is to create a special text index which we <p>The next stage is to create a special text index which we
will use for FTI, so we can search our table of messages for will use for FTI, so we can search our table of messages for
words or a phrase. We do this using the SQL command:</p> words or a phrase. We do this using the SQL command:</p>
<pre> <pre> ALTER TABLE tblMessages ADD COLUMN idxFTI tsvector;
ALTER TABLE tblMessages ADD idxFTI tsvector;
</pre> </pre>
<p>Note that unlike traditional indexes, this is actually a new <p>Note that unlike traditional indexes, this is actually a new
...@@ -411,8 +383,7 @@ ...@@ -411,8 +383,7 @@
<p>The general rule for the initial insertion of data will <p>The general rule for the initial insertion of data will
follow four steps:</p> follow four steps:</p>
<pre> <pre> 1. update table
1. update table
2. vacuum full analyze 2. vacuum full analyze
3. create index 3. create index
4. vacuum full analyze 4. vacuum full analyze
...@@ -426,8 +397,7 @@ ...@@ -426,8 +397,7 @@
the index has been created on the table, vacuum full analyze is the index has been created on the table, vacuum full analyze is
run again to update PostgreSQL's statistics (i.e. having the index run again to update PostgreSQL's statistics (i.e. having the index
take effect).</p> take effect).</p>
<pre> <pre> UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
VACUUM FULL ANALYZE; VACUUM FULL ANALYZE;
</pre> </pre>
...@@ -436,8 +406,7 @@ ...@@ -436,8 +406,7 @@
information stored, you should instead do the following, which information stored, you should instead do the following, which
effectively concatenates the two fields into one before being effectively concatenates the two fields into one before being
inserted into the table:</p> inserted into the table:</p>
<pre> <pre> UPDATE tblMessages
UPDATE tblMessages
SET idxFTI=to_tsvector('default',coalesce(strTopic,'') ||' '|| coalesce(strMessage,'')); SET idxFTI=to_tsvector('default',coalesce(strTopic,'') ||' '|| coalesce(strMessage,''));
VACUUM FULL ANALYZE; VACUUM FULL ANALYZE;
</pre> </pre>
...@@ -451,8 +420,7 @@ ...@@ -451,8 +420,7 @@
Full Text INDEXING ;-)), so don't worry about any indexing Full Text INDEXING ;-)), so don't worry about any indexing
overhead. We will create an index based on the gist function. overhead. We will create an index based on the gist function.
GiST stands for Generalized Search Tree, an index structure.</p> GiST stands for Generalized Search Tree, an index structure.</p>
<pre> <pre> CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
VACUUM FULL ANALYZE; VACUUM FULL ANALYZE;
</pre> </pre>
...@@ -464,15 +432,13 @@ ...@@ -464,15 +432,13 @@
<p>The last thing to do is set up a trigger so every time a row <p>The last thing to do is set up a trigger so every time a row
in this table is changed, the text index is automatically in this table is changed, the text index is automatically
updated. This is easily done using:</p> updated. This is easily done using:</p>
<pre> <pre> CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, strMessage); FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, strMessage);
</pre> </pre>
<p>Or if you are indexing both strMessage and strTopic you <p>Or if you are indexing both strMessage and strTopic you
should instead do:</p> should instead do:</p>
<pre> <pre> CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE FOR EACH ROW EXECUTE PROCEDURE
tsearch2(idxFTI, strTopic, strMessage); tsearch2(idxFTI, strTopic, strMessage);
</pre> </pre>
...@@ -490,15 +456,13 @@ ...@@ -490,15 +456,13 @@
the tsearch2 function. Lets say we want to create a function to the tsearch2 function. Lets say we want to create a function to
remove certain characters (like the @ symbol from all remove certain characters (like the @ symbol from all
text).</p> text).</p>
<pre> <pre> CREATE FUNCTION dropatsymbol(text)
CREATE FUNCTION dropatsymbol(text)
RETURNS text AS 'select replace($1, \'@\', \' \');' LANGUAGE SQL; RETURNS text AS 'select replace($1, \'@\', \' \');' LANGUAGE SQL;
</pre> </pre>
<p>Now we can use this function within the tsearch2 function on <p>Now we can use this function within the tsearch2 function on
the trigger.</p> the trigger.</p>
<pre> <pre> DROP TRIGGER tsvectorupdate ON tblmessages;
DROP TRIGGER tsvectorupdate ON tblmessages;
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, dropatsymbol, strMessage); FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, dropatsymbol, strMessage);
INSERT INTO tblmessages VALUES (69, 'Attempt for dropatsymbol', 'Test@test.com'); INSERT INTO tblmessages VALUES (69, 'Attempt for dropatsymbol', 'Test@test.com');
...@@ -513,8 +477,7 @@ ...@@ -513,8 +477,7 @@
locale of the server. All you have to do is change your default locale of the server. All you have to do is change your default
configuration, or add a new one for your specific locale. See configuration, or add a new one for your specific locale. See
the section on TSEARCH2 CONFIGURATION.</p> the section on TSEARCH2 CONFIGURATION.</p>
<pre class="real"> <pre class="real"> SELECT * FROM tblmessages WHERE intindex = 69;
SELECT * FROM tblmessages WHERE intindex = 69;
intindex | strtopic | strmessage | idxfti intindex | strtopic | strmessage | idxfti
----------+--------------------------+---------------+----------------------- ----------+--------------------------+---------------+-----------------------
...@@ -540,8 +503,7 @@ in the tsvector column. ...@@ -540,8 +503,7 @@ in the tsvector column.
<p>Let's search the indexed data for the word "Test". I indexed <p>Let's search the indexed data for the word "Test". I indexed
based on the concatenation of the strTopic, and the based on the concatenation of the strTopic, and the
strMessage:</p> strMessage:</p>
<pre> <pre> SELECT intindex, strtopic FROM tblmessages
SELECT intindex, strtopic FROM tblmessages
WHERE idxfti @@ 'test'::tsquery; WHERE idxfti @@ 'test'::tsquery;
intindex | strtopic intindex | strtopic
----------+--------------- ----------+---------------
...@@ -553,8 +515,7 @@ in the tsvector column. ...@@ -553,8 +515,7 @@ in the tsvector column.
"Testing Topic". Notice that the word I search for was all "Testing Topic". Notice that the word I search for was all
lowercase. Let's see what happens when I query for uppercase lowercase. Let's see what happens when I query for uppercase
"Test".</p> "Test".</p>
<pre> <pre> SELECT intindex, strtopic FROM tblmessages
SELECT intindex, strtopic FROM tblmessages
WHERE idxfti @@ 'Test'::tsquery; WHERE idxfti @@ 'Test'::tsquery;
intindex | strtopic intindex | strtopic
----------+---------- ----------+----------
...@@ -570,8 +531,7 @@ in the tsvector column. ...@@ -570,8 +531,7 @@ in the tsvector column.
<p>Most likely the best way to query the field is to use the <p>Most likely the best way to query the field is to use the
to_tsquery function on the right hand side of the @@ operator to_tsquery function on the right hand side of the @@ operator
like this:</p> like this:</p>
<pre> <pre> SELECT intindex, strtopic FROM tblmessages
SELECT intindex, strtopic FROM tblmessages
WHERE idxfti @@ to_tsquery('default', 'Test | Zeppelin'); WHERE idxfti @@ to_tsquery('default', 'Test | Zeppelin');
intindex | strtopic intindex | strtopic
----------+-------------------- ----------+--------------------
...@@ -592,8 +552,7 @@ in the tsvector column. ...@@ -592,8 +552,7 @@ in the tsvector column.
a way around which doesn't appear to have a significant impact a way around which doesn't appear to have a significant impact
on query time, and that is to use a query such as the on query time, and that is to use a query such as the
following:</p> following:</p>
<pre> <pre> SELECT intindex, strTopic FROM tblmessages
SELECT intindex, strTopic FROM tblmessages
WHERE idxfti @@ to_tsquery('default', 'gettysburg &amp; address') WHERE idxfti @@ to_tsquery('default', 'gettysburg &amp; address')
AND strMessage ~* '.*men are created equal.*'; AND strMessage ~* '.*men are created equal.*';
intindex | strtopic intindex | strtopic
...@@ -626,8 +585,7 @@ in the tsvector column. ...@@ -626,8 +585,7 @@ in the tsvector column.
english stemming. We could edit the file english stemming. We could edit the file
'/usr/local/pgsql/share/english.stop' and add a word to the '/usr/local/pgsql/share/english.stop' and add a word to the
list. I edited mine to exclude my name from indexing:</p> list. I edited mine to exclude my name from indexing:</p>
<pre> <pre> - Edit /usr/local/pgsql/share/english.stop
- Edit /usr/local/pgsql/share/english.stop
- Add 'andy' to the list - Add 'andy' to the list
- Save the file. - Save the file.
</pre> </pre>
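<p>The three steps above can also be scripted. What follows is only a
sketch: it works on a scratch copy rather than the installed file, it
assumes the stop file holds one lowercase word per line, and both the
STOPFILE variable and the add_stopword helper are hypothetical names
introduced for illustration.</p>

```shell
#!/bin/sh
# Sketch only: operate on a scratch copy so the real stop file is untouched.
# STOPFILE is a hypothetical variable; for a real run point it at your own
# english.stop (the text uses /usr/local/pgsql/share/english.stop).
workdir=$(mktemp -d)
STOPFILE="$workdir/english.stop"
printf 'a\nan\nthe\n' > "$STOPFILE"   # stand-in for the installed word list

add_stopword() {
    # Append the word only if no existing line matches it exactly
    # (-x whole line, -F fixed string, -q no output).
    grep -qxF -- "$1" "$STOPFILE" || printf '%s\n' "$1" >> "$STOPFILE"
}

add_stopword andy
add_stopword andy    # the second call changes nothing
cat "$STOPFILE"
```

<p>Guarding the append with grep keeps the list free of duplicate
entries if the script is run more than once.</p>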
...@@ -638,16 +596,14 @@ in the tsvector column. ...@@ -638,16 +596,14 @@ in the tsvector column.
connected to the DB while editing the stop words, you will need connected to the DB while editing the stop words, you will need
to end the current session and re-connect. When you re-connect to end the current session and re-connect. When you re-connect
to the database, 'andy' is no longer indexed:</p> to the database, 'andy' is no longer indexed:</p>
<pre> <pre> SELECT to_tsvector('default', 'Andy');
SELECT to_tsvector('default', 'Andy');
to_tsvector to_tsvector
------------ ------------
(1 row) (1 row)
</pre> </pre>
<p>Originally I would get the result:</p>
<pre> <pre> SELECT to_tsvector('default', 'Andy');
SELECT to_tsvector('default', 'Andy');
to_tsvector to_tsvector
------------ ------------
'andi':1 'andi':1
...@@ -660,8 +616,7 @@ in the tsvector column. ...@@ -660,8 +616,7 @@ in the tsvector column.
'simple', the results would be different. There are no stop 'simple', the results would be different. There are no stop
words for the simple dictionary. It will just convert to lower words for the simple dictionary. It will just convert to lower
case, and index every unique word.</p> case, and index every unique word.</p>
<pre> <pre> SELECT to_tsvector('simple', 'Andy andy The the in out');
SELECT to_tsvector('simple', 'Andy andy The the in out');
to_tsvector to_tsvector
------------------------------------- -------------------------------------
'in':5 'out':6 'the':3,4 'andy':1,2 'in':5 'out':6 'the':3,4 'andy':1,2
...@@ -672,8 +627,7 @@ in the tsvector column. ...@@ -672,8 +627,7 @@ in the tsvector column.
into the actual configuration of tsearch2. In the examples in into the actual configuration of tsearch2. In the examples in
this document the configuration has always been specified when this document the configuration has always been specified when
using the tsearch2 functions:</p> using the tsearch2 functions:</p>
<pre> <pre> SELECT to_tsvector('default', 'Testing the default config');
SELECT to_tsvector('default', 'Testing the default config');
SELECT to_tsvector('simple', 'Example of simple Config'); SELECT to_tsvector('simple', 'Example of simple Config');
</pre> </pre>
...@@ -682,8 +636,7 @@ in the tsvector column. ...@@ -682,8 +636,7 @@ in the tsvector column.
contains both the 'default' configuration, based on the 'C' contains both the 'default' configuration, based on the 'C'
locale, and the 'simple' configuration, which is not based on locale, and the 'simple' configuration, which is not based on
any locale.</p> any locale.</p>
<pre> <pre> SELECT * from pg_ts_cfg;
SELECT * from pg_ts_cfg;
ts_name | prs_name | locale ts_name | prs_name | locale
-----------------+----------+-------------- -----------------+----------+--------------
default | default | C default | default | C
...@@ -706,8 +659,7 @@ in the tsvector column. ...@@ -706,8 +659,7 @@ in the tsvector column.
configuration or just use one that already exists. If I do not configuration or just use one that already exists. If I do not
specify which configuration to use in the to_tsvector function, specify which configuration to use in the to_tsvector function,
I receive the following error.</p> I receive the following error.</p>
<pre> <pre> SELECT to_tsvector('learning tsearch is like going to school');
SELECT to_tsvector('learning tsearch is like going to school');
ERROR: Can't find tsearch config by locale ERROR: Can't find tsearch config by locale
</pre> </pre>
...@@ -716,8 +668,7 @@ in the tsvector column. ...@@ -716,8 +668,7 @@ in the tsvector column.
into the pg_ts_cfg table. We will call the configuration into the pg_ts_cfg table. We will call the configuration
'default_english', with the default parser and use the locale 'default_english', with the default parser and use the locale
'en_US'.</p> 'en_US'.</p>
<pre> <pre> INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
VALUES ('default_english', 'default', 'en_US'); VALUES ('default_english', 'default', 'en_US');
</pre> </pre>
...@@ -732,15 +683,14 @@ in the tsvector column. ...@@ -732,15 +683,14 @@ in the tsvector column.
tsearch2.sql</p> tsearch2.sql</p>
<p>Let's take a first look at the pg_ts_dict table:</p> <p>Let's take a first look at the pg_ts_dict table:</p>
<pre> <pre> ftstest=# \d pg_ts_dict
ftstest=# \d pg_ts_dict
Table "public.pg_ts_dict" Table "public.pg_ts_dict"
Column | Type | Modifiers Column | Type | Modifiers
-----------------+---------+----------- -----------------+---------+-----------
dict_name | text | not null dict_name | text | not null
dict_init | oid | dict_init | oid |
dict_initoption | text | dict_initoption | text |
dict_lemmatize | oid | not null dict_lexize | oid | not null
dict_comment | text | dict_comment | text |
Indexes: pg_ts_dict_idx unique btree (dict_name) Indexes: pg_ts_dict_idx unique btree (dict_name)
</pre> </pre>
...@@ -763,28 +713,57 @@ in the tsvector column. ...@@ -763,28 +713,57 @@ in the tsvector column.
ISpell. We will assume you have ISpell installed on your ISpell. We will assume you have ISpell installed on your
machine (in /usr/local/lib).</p> machine (in /usr/local/lib).</p>
<p>First lets register the dictionary(ies) to use from ISpell. <p>There has been some confusion in the past as to which files
We will use the english dictionary from ISpell. We insert the are used from ISpell. ISpell operates using a hash file. This
paths to the relevant ISpell dictionary (*.hash) and affixes is a binary file created by the ISpell command line utility
(*.aff) files. There seems to be some question as to which "buildhash". This utility accepts a file containing the words
ISpell files are to be used. I installed ISpell from the latest from the dictionary and the affixes file, and its output is the
sources on my computer. The installation installed the hash file. The default installation of ISpell installs the
dictionary files with an extension of *.hash. Some english hash file english.hash, which is the exact same file as
installations install with an extension of *.dict As far as I american.hash. ISpell uses this as the fallback dictionary to
know the two extensions are equivilant. So *.hash == use.</p>
*.dict.</p>
<p>This hash file is not what tsearch2 requires as the ISpell
<p>We will also continue to use the english word stop file that interface. The file(s) needed are those used to create the
hash. Tsearch uses the dictionary words for morphology, so it
needs the word listing itself rather than the spellchecking
hash. These files are included in the ISpell sources, and you
can use them with tsearch2. This is not complicated, but it is
not very obvious to begin with. The tsearch2 ISpell interface
needs only the listing of dictionary words; it will parse and
load those words, and use the ISpell dictionary for lexem
processing.</p>
<p>I found the ISpell make system to be very finicky. Their
documentation actually states this to be the case. So I just
did things the command line way. In the ISpell source tree
under languages/english there are several files. For a
complete description, please read the ISpell
README. Basically for the english dictionary there is the
option to create the small, medium, large and extra large
dictionaries. The medium dictionary is recommended. If the make
system is configured correctly, it will build and install the
english.hash file from the medium size dictionary. Since we are
only concerned with the dictionary word listing, it can be
created from the languages/english directory with the
following command:</p>
<pre> sort -u -t/ +0f -1 +0 -T /usr/tmp -o english.med english.0 english.1
</pre>
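<p>The "+0f -1 +0" key notation in the command above is the obsolete
form, which some newer sort implementations reject. To my
understanding the equivalent -k notation is "-k1,1f -k1". The sketch
below tries it on two tiny stand-in files; their contents are invented
for illustration and are not taken from the ISpell sources, and the
-T temporary directory option is left out for brevity.</p>

```shell
#!/bin/sh
# Sketch only: tiny stand-ins for ISpell's english.0 and english.1.
workdir=$(mktemp -d)
printf 'apple/S\nBanana\n'            > "$workdir/english.0"
printf 'banana\napple/S\ncherry/DG\n' > "$workdir/english.1"

# -t/      use "/" as the field separator (word/affix-flags)
# -k1,1f   first key: the word itself, compared ignoring case
# -k1      second key: the whole line, so only exact duplicates collapse
# -u       keep a single line from each set of lines whose keys are equal
# LC_ALL=C pins the collation so the ordering is reproducible
LC_ALL=C sort -u -t/ -k1,1f -k1 \
    -o "$workdir/english.med" "$workdir/english.0" "$workdir/english.1"

cat "$workdir/english.med"
```

<p>With this input the two identical apple/S lines collapse into one,
while Banana and banana are both kept because they differ on the
second key.</p>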
<p>This will create a file called english.med. You can copy
this file to wherever you like. I place mine in /usr/local/lib so
it sits alongside the ISpell hash files. You can now add the
tsearch2 configuration entry for the ISpell english dictionary.
We will also continue to use the english word stop file that
was installed for the en_stem dictionary. You could use a was installed for the en_stem dictionary. You could use a
different one if you like. The ISpell configuration is based on different one if you like. The ISpell configuration is based on
the "ispell_template" dictionary installed by default with the "ispell_template" dictionary installed by default with
tsearch2. We will use the OIDs to the stored procedures from tsearch2. We will use the OIDs to the stored procedures from
the row where the dict_name = 'ispell_template'.</p> the row where the dict_name = 'ispell_template'.</p>
<pre> <pre> INSERT INTO pg_ts_dict
INSERT INTO pg_ts_dict
(SELECT 'en_ispell', (SELECT 'en_ispell',
dict_init, dict_init,
'DictFile="/usr/local/lib/english.hash",' 'DictFile="/usr/local/lib/english.med",'
'AffFile="/usr/local/lib/english.aff",' 'AffFile="/usr/local/lib/english.aff",'
'StopFile="/usr/local/pgsql/share/english.stop"', 'StopFile="/usr/local/pgsql/share/english.stop"',
dict_lexize dict_lexize
...@@ -792,6 +771,50 @@ in the tsvector column. ...@@ -792,6 +771,50 @@ in the tsvector column.
WHERE dict_name = 'ispell_template'); WHERE dict_name = 'ispell_template');
</pre> </pre>
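<p>Before running the INSERT it may be worth confirming that the three
files named in the init option are readable. This is a minimal sketch;
check_dict_files is a hypothetical helper and not part of tsearch2 or
ISpell, and the paths are simply the ones used in this document.</p>

```shell
#!/bin/sh
# Sketch only: report which of the files named in dict_initoption are missing.
# check_dict_files is a hypothetical helper, not part of tsearch2 or ISpell.
check_dict_files() {
    missing=0
    for f in "$@"; do
        if [ -r "$f" ]; then
            echo "ok       $f"
        else
            echo "MISSING  $f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every file was readable.
    [ "$missing" -eq 0 ]
}

# Paths as used in the INSERT above; adjust them to your installation.
check_dict_files \
    /usr/local/lib/english.med \
    /usr/local/lib/english.aff \
    /usr/local/pgsql/share/english.stop \
    || echo "one or more dictionary files are missing"
```

<p>The check only looks at readability for the user running the script.
The PostgreSQL server process needs to be able to read the same files,
which may be a different account.</p>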
<p>Now that we have a dictionary we can specify its use in a
query to get a lexem. For this we will use the lexize function.
The lexize function takes the name of the dictionary to use as
an argument, just as the other tsearch2 functions do.</p>
<pre> SELECT lexize('en_ispell', 'program');
lexize
-----------
{program}
(1 row)
</pre>
<p>If you want to always use the ISpell english dictionary
you have installed, you can configure tsearch2 to always use a
specific dictionary.</p>
<pre> SELECT set_curdict('en_ispell');
</pre>
<p>Lexize is meant to turn a word into a lexem. It is possible
to receive more than one lexem for a single word.</p>
<pre> SELECT lexize('en_ispell', 'conditionally');
lexize
-----------------------------
{conditionally,conditional}
(1 row)
</pre>
<p>The lexize function is not meant to take a full string as an
argument to return lexems for. If you pass in an entire
sentence, it attempts to find that entire sentence in the
dictionary. Since the dictionary contains only words, you will
receive an empty result set back.</p>
<pre> SELECT lexize('en_ispell', 'This is a senctece to lexize');
lexize
--------
(1 row)
</pre>
<p>If you parse a lexem from a word not in the dictionary, then
you will receive an empty result. This makes sense because the
word "tsearch" is not in the english dictionary. You can create
your own additions to the dictionary if you like. This may be
useful for scientific or technical glossaries that need to be
indexed.</p>
<pre> SELECT lexize('en_ispell', 'tsearch');
lexize
--------
(1 row)
</pre>
<p>This does not mean that the word "tsearch" will be ignored
when text is added to the tsvector index column. This is
explained in greater detail with the table pg_ts_cfgmap.</p>
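<p>The lookup behaviour described above can be modelled with a
small Python sketch. This is only an illustration, not how
tsearch2 is implemented: the names TOY_EN_ISPELL, toy_lexize and
toy_lexize_words are hypothetical, and the three-word dictionary
stands in for a real ISpell word list. It shows why a whole
sentence misses while its individual words may hit.</p>

```python
from typing import Dict, List

# Hypothetical stand-in for an ISpell dictionary: word -> list of lexems.
TOY_EN_ISPELL: Dict[str, List[str]] = {
    "program": ["program"],
    "conditionally": ["conditionally", "conditional"],
    "sentence": ["sentence"],
}


def toy_lexize(dictionary: Dict[str, List[str]], text: str) -> List[str]:
    """Treat the whole input as a single dictionary key.

    Input that is not a key (an unknown word or a full sentence)
    yields an empty list in this toy model.
    """
    return list(dictionary.get(text.lower(), []))


def toy_lexize_words(dictionary: Dict[str, List[str]], text: str) -> Dict[str, List[str]]:
    """Split the text on whitespace first, then look up each word on its own."""
    return {word: toy_lexize(dictionary, word) for word in text.split()}


if __name__ == "__main__":
    print(toy_lexize(TOY_EN_ISPELL, "conditionally"))
    print(toy_lexize(TOY_EN_ISPELL, "This is a sentence to lexize"))
    print(toy_lexize_words(TOY_EN_ISPELL, "This is a sentence to lexize"))
```

<p>In this model the sentence has to be split into words before
any lookup can succeed, which is the job the parser performs
before dictionaries are consulted.</p>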
<p>Next we need to set up the configuration for mapping the
dictionary use to the lexem parsings. This will be done by
altering the pg_ts_cfgmap table. We will insert several rows,
...@@ -799,8 +822,7 @@ in the tsvector column. ...@@ -799,8 +822,7 @@ in the tsvector column.
configured for use within tsearch2. There are several types of
lexems for which we want to force the use of the ISpell
dictionary.</p>
<pre> <pre> INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('default_english', 'lhword', '{en_ispell,en_stem}'); VALUES ('default_english', 'lhword', '{en_ispell,en_stem}');
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name) INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
VALUES ('default_english', 'lpart_hword', '{en_ispell,en_stem}'); VALUES ('default_english', 'lpart_hword', '{en_ispell,en_stem}');
...@@ -818,8 +840,7 @@ in the tsvector column. ...@@ -818,8 +840,7 @@ in the tsvector column.
<p>There are several other lexem types used that we do not need <p>There are several other lexem types used that we do not need
to specify as using the ISpell dictionary. We can simply insert to specify as using the ISpell dictionary. We can simply insert
values using the 'simple' stemming process dictionary.</p> values using the 'simple' stemming process dictionary.</p>
<pre> <pre> INSERT INTO pg_ts_cfgmap
INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'url', '{simple}'); VALUES ('default_english', 'url', '{simple}');
INSERT INTO pg_ts_cfgmap INSERT INTO pg_ts_cfgmap
VALUES ('default_english', 'host', '{simple}'); VALUES ('default_english', 'host', '{simple}');
...@@ -857,8 +878,7 @@ in the tsvector column. ...@@ -857,8 +878,7 @@ in the tsvector column.
complete. We have successfully created a new tsearch2 complete. We have successfully created a new tsearch2
configuration. At the same time we have also set the new configuration. At the same time we have also set the new
configuration to be our default for en_US locale.</p> configuration to be our default for en_US locale.</p>
<pre> <pre> SELECT to_tsvector('default_english',
SELECT to_tsvector('default_english',
'learning tsearch is like going to school'); 'learning tsearch is like going to school');
to_tsvector to_tsvector
-------------------------------------------------- --------------------------------------------------
...@@ -870,12 +890,37 @@ in the tsvector column. ...@@ -870,12 +890,37 @@ in the tsvector column.
(1 row) (1 row)
</pre> </pre>
<p>Notice here that words like "tsearch" are still parsed and
indexed in the tsvector column. A lexem is returned for the word
because the configuration mapping table specifies the
'en_ispell' dictionary first, with the 'en_stem' dictionary as a
fallback. No lexem is returned from en_ispell, but one is
returned from en_stem and added to the tsvector.</p>
<pre> SELECT to_tsvector('learning tsearch is like going to computer school');
to_tsvector
---------------------------------------------------------------------------
'go':5 'like':4 'learn':1 'school':8 'compute':7 'tsearch':2 'computer':7
(1 row)
</pre>
<p>Notice that this last example adds the word "computer" to
the text to be converted into a tsvector. Because we have set up
our default configuration to use the ISpell English dictionary,
the words are lexized, and "computer" returns two lexems at the
same position: 'compute':7 and 'computer':7 are now both
indexed for the word "computer".</p>
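<p>The fallback chain and the shared position can be sketched in
Python. This is a simplified model under stated assumptions, not
the tsearch2 implementation: toy_ispell, toy_stem and
toy_to_tsvector are hypothetical names, the word lists are
hand-picked to match the example above, and the convention that
a dictionary returns None for an unknown word and an empty list
for a stop word is an assumption of this sketch.</p>

```python
from typing import Callable, Dict, List, Optional

# Assumed toy data, chosen only to reproduce the example sentence.
STOP_WORDS = {"is", "to"}
TOY_ISPELL_WORDS: Dict[str, List[str]] = {
    "learning": ["learn"],
    "like": ["like"],
    "going": ["go"],
    "computer": ["compute", "computer"],
    "school": ["school"],
}

ToyDictionary = Callable[[str], Optional[List[str]]]


def toy_ispell(word: str) -> Optional[List[str]]:
    """Return [] for a stop word, lexems for a known word, None if unknown."""
    if word in STOP_WORDS:
        return []
    return TOY_ISPELL_WORDS.get(word)


def toy_stem(word: str) -> Optional[List[str]]:
    """Fallback that accepts every non-stop word unchanged (no real stemming)."""
    if word in STOP_WORDS:
        return []
    return [word]


def toy_to_tsvector(text: str, chain: List[ToyDictionary]) -> Dict[str, List[int]]:
    """Map each lexem to the 1-based positions of the words that produced it."""
    vector: Dict[str, List[int]] = {}
    for position, token in enumerate(text.lower().split(), start=1):
        for dictionary in chain:
            lexems = dictionary(token)
            if lexems is None:
                continue  # unknown here: try the next dictionary in the chain
            for lexem in lexems:
                vector.setdefault(lexem, []).append(position)
            break  # the first dictionary that recognizes the token wins
    return vector


if __name__ == "__main__":
    sentence = "learning tsearch is like going to computer school"
    print(toy_to_tsvector(sentence, [toy_ispell, toy_stem]))
```

<p>In this model "tsearch" falls through toy_ispell and is picked
up by toy_stem at position 2, the stop words still consume
positions 3 and 6, and both lexems for "computer" share
position 7.</p>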
<p>You can create additional dictionary lists, or use the extra
large dictionary from ISpell. You can read through the ISpell
documents and source tree to make modifications as you see
fit.</p>
<p>In the case that you already have a configuration set for
the locale, and you are changing it to your new dictionary
configuration, you will have to set the old locale to NULL. If
we are using the 'C' locale then we would do this:</p>
<pre> <pre> UPDATE pg_ts_cfg SET locale=NULL WHERE locale = 'C';
UPDATE pg_ts_cfg SET locale=NULL WHERE locale = 'C';
</pre> </pre>
<p>That about wraps up the configuration of tsearch2. There is <p>That about wraps up the configuration of tsearch2. There is
...@@ -917,38 +962,32 @@ in the tsvector column. ...@@ -917,38 +962,32 @@ in the tsvector column.
<p>1) Backup any global database objects such as users and <p>1) Backup any global database objects such as users and
groups (this step is usually only necessary when you will be groups (this step is usually only necessary when you will be
restoring to a virgin system)</p> restoring to a virgin system)</p>
<pre> <pre> pg_dumpall -g &gt; GLOBALobjects.sql
pg_dumpall -g &gt; GLOBALobjects.sql
</pre> </pre>
<p>2) Backup the full database schema using pg_dump</p> <p>2) Backup the full database schema using pg_dump</p>
<pre> <pre> pg_dump -s DATABASE &gt; DATABASEschema.sql
pg_dump -s DATABASE &gt; DATABASEschema.sql
</pre> </pre>
<p>3) Backup the full database using pg_dump</p> <p>3) Backup the full database using pg_dump</p>
<pre> <pre> pg_dump -Fc DATABASE &gt; DATABASEdata.tar
pg_dump -Fc DATABASE &gt; DATABASEdata.tar
</pre> </pre>
<p>To Restore a PostgreSQL database that uses the tsearch2 <p>To Restore a PostgreSQL database that uses the tsearch2
module:</p> module:</p>
<p>1) Create the blank database</p> <p>1) Create the blank database</p>
<pre> <pre> createdb DATABASE
createdb DATABASE
</pre> </pre>
<p>2) Restore any global database objects such as users and <p>2) Restore any global database objects such as users and
groups (this step is usually only necessary when you will be groups (this step is usually only necessary when you will be
restoring to a virgin system)</p> restoring to a virgin system)</p>
<pre> <pre> psql DATABASE &lt; GLOBALobjects.sql
psql DATABASE &lt; GLOBALobjects.sql
</pre> </pre>
<p>3) Create the tsearch2 objects, functions and operators</p> <p>3) Create the tsearch2 objects, functions and operators</p>
<pre> <pre> psql DATABASE &lt; tsearch2.sql
psql DATABASE &lt; tsearch2.sql
</pre> </pre>
<p>4) Edit the backed up database schema and delete all SQL <p>4) Edit the backed up database schema and delete all SQL
...@@ -957,13 +996,11 @@ in the tsvector column. ...@@ -957,13 +996,11 @@ in the tsvector column.
tsvector types. If you're not sure what these are, they are the
ones listed in tsearch2.sql. Then restore the edited schema to
the database</p>
<pre> <pre> psql DATABASE &lt; DATABASEschema.sql
psql DATABASE &lt; DATABASEschema.sql
</pre> </pre>
<p>5) Restore the data for the database</p> <p>5) Restore the data for the database</p>
<pre> <pre> pg_restore -N -a -d DATABASE DATABASEdata.tar
pg_restore -N -a -d DATABASE DATABASEdata.tar
</pre> </pre>
<p>If you get any errors in step 4, it will most likely be <p>If you get any errors in step 4, it will most likely be
...@@ -971,5 +1008,4 @@ in the tsvector column. ...@@ -971,5 +1008,4 @@ in the tsvector column.
tsearch2.sql. Any errors in step 5 will mean the database tsearch2.sql. Any errors in step 5 will mean the database
schema was probably restored wrongly.</p> schema was probably restored wrongly.</p>
</div> </div>
</body> </body></html>
</html> \ No newline at end of file