Commit 9389ac89 authored by Tom Lane

Document filtering dictionaries in textsearch.sgml.

While at it, copy-edit the description of prefix-match marker support in
synonym dictionaries, and clarify the description of the default unaccent
dictionary a bit more.
parent acac35ad
<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.58 2010/08/20 13:59:45 tgl Exp $ --> <!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.59 2010/08/25 21:42:55 tgl Exp $ -->
<chapter id="textsearch"> <chapter id="textsearch">
<title>Full Text Search</title> <title>Full Text Search</title>
...@@ -112,7 +112,7 @@ ...@@ -112,7 +112,7 @@
as a sorted array of normalized lexemes. Along with the lexemes it is as a sorted array of normalized lexemes. Along with the lexemes it is
often desirable to store positional information to use for often desirable to store positional information to use for
<firstterm>proximity ranking</firstterm>, so that a document that <firstterm>proximity ranking</firstterm>, so that a document that
contains a more <quote>dense</> region of query words is contains a more <quote>dense</> region of query words is
assigned a higher rank than one with scattered query words. assigned a higher rank than one with scattered query words.
</para> </para>
</listitem> </listitem>
...@@ -1151,13 +1151,13 @@ MaxFragments=0, FragmentDelimiter=" ... " ...@@ -1151,13 +1151,13 @@ MaxFragments=0, FragmentDelimiter=" ... "
<screen> <screen>
SELECT ts_headline('english', SELECT ts_headline('english',
'The most common type of search 'The most common type of search
is to find all documents containing given query terms is to find all documents containing given query terms
and return them in order of their similarity to the and return them in order of their similarity to the
query.', query.',
to_tsquery('query &amp; similarity')); to_tsquery('query &amp; similarity'));
ts_headline ts_headline
------------------------------------------------------------ ------------------------------------------------------------
containing given &lt;b&gt;query&lt;/b&gt; terms containing given &lt;b&gt;query&lt;/b&gt; terms
and return them in order of their &lt;b&gt;similarity&lt;/b&gt; to the and return them in order of their &lt;b&gt;similarity&lt;/b&gt; to the
&lt;b&gt;query&lt;/b&gt;. &lt;b&gt;query&lt;/b&gt;.
...@@ -1166,7 +1166,7 @@ SELECT ts_headline('english', ...@@ -1166,7 +1166,7 @@ SELECT ts_headline('english',
is to find all documents containing given query terms is to find all documents containing given query terms
and return them in order of their similarity to the and return them in order of their similarity to the
query.', query.',
to_tsquery('query &amp; similarity'), to_tsquery('query &amp; similarity'),
'StartSel = &lt;, StopSel = &gt;'); 'StartSel = &lt;, StopSel = &gt;');
ts_headline ts_headline
------------------------------------------------------- -------------------------------------------------------
...@@ -2064,6 +2064,14 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h ...@@ -2064,6 +2064,14 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h
(notice that one token can produce more than one lexeme) (notice that one token can produce more than one lexeme)
</para> </para>
</listitem> </listitem>
<listitem>
<para>
a single lexeme with the <literal>TSL_FILTER</> flag set, to replace
the original token with a new token to be passed to subsequent
dictionaries (a dictionary that does this is called a
<firstterm>filtering dictionary</>)
</para>
</listitem>
<listitem> <listitem>
<para> <para>
an empty array if the dictionary knows the token, but it is a stop word an empty array if the dictionary knows the token, but it is a stop word
...@@ -2096,6 +2104,13 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h ...@@ -2096,6 +2104,13 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h
until some dictionary recognizes it as a known word. If it is identified until some dictionary recognizes it as a known word. If it is identified
as a stop word, or if no dictionary recognizes the token, it will be as a stop word, or if no dictionary recognizes the token, it will be
discarded and not indexed or searched for. discarded and not indexed or searched for.
Normally, the first dictionary that returns a non-<literal>NULL</>
output determines the result, and any remaining dictionaries are not
consulted; but a filtering dictionary can replace the given word
with a modified word, which is then passed to subsequent dictionaries.
</para>
<para>
The general rule for configuring a list of dictionaries The general rule for configuring a list of dictionaries
is to place first the most narrow, most specific dictionary, then the more is to place first the most narrow, most specific dictionary, then the more
general dictionaries, finishing with a very general dictionary, like general dictionaries, finishing with a very general dictionary, like
...@@ -2112,6 +2127,16 @@ ALTER TEXT SEARCH CONFIGURATION astro_en ...@@ -2112,6 +2127,16 @@ ALTER TEXT SEARCH CONFIGURATION astro_en
</programlisting> </programlisting>
</para> </para>
<para>
A filtering dictionary can be placed anywhere in the list, except at the
end where it'd be useless. Filtering dictionaries are useful to partially
normalize words to simplify the task of later dictionaries. For example,
a filtering dictionary could be used to remove accents from accented
letters, as is done by the
<link linkend="unaccent"><filename>contrib/unaccent</></link>
extension module.
</para>
<sect2 id="textsearch-stopwords"> <sect2 id="textsearch-stopwords">
<title>Stop Words</title> <title>Stop Words</title>
...@@ -2184,7 +2209,7 @@ CREATE TEXT SEARCH DICTIONARY public.simple_dict ( ...@@ -2184,7 +2209,7 @@ CREATE TEXT SEARCH DICTIONARY public.simple_dict (
Here, <literal>english</literal> is the base name of a file of stop words. Here, <literal>english</literal> is the base name of a file of stop words.
The file's full name will be The file's full name will be
<filename>$SHAREDIR/tsearch_data/english.stop</>, <filename>$SHAREDIR/tsearch_data/english.stop</>,
where <literal>$SHAREDIR</> means the where <literal>$SHAREDIR</> means the
<productname>PostgreSQL</productname> installation's shared-data directory, <productname>PostgreSQL</productname> installation's shared-data directory,
often <filename>/usr/local/share/postgresql</> (use <command>pg_config often <filename>/usr/local/share/postgresql</> (use <command>pg_config
--sharedir</> to determine it if you're not sure). --sharedir</> to determine it if you're not sure).
...@@ -2295,17 +2320,39 @@ SELECT * FROM ts_debug('english', 'Paris'); ...@@ -2295,17 +2320,39 @@ SELECT * FROM ts_debug('english', 'Paris');
asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris} asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
</screen> </screen>
</para> </para>
<para> <para>
An asterisk (<literal>*</literal>) at the end of definition word indicates The only parameter required by the <literal>synonym</> template is
that definition word is a prefix, and <function>to_tsquery()</function> <literal>SYNONYMS</>, which is the base name of its configuration file
function will transform that definition to the prefix search format (see &mdash; <literal>my_synonyms</> in the above example.
<xref linkend="textsearch-parsing-queries">). The file's full name will be
Notice that it is ignored in <function>to_tsvector()</function>. <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</>
(where <literal>$SHAREDIR</> means the
<productname>PostgreSQL</> installation's shared-data directory).
The file format is just one line
per word to be substituted, with the word followed by its synonym,
separated by white space. Blank lines and trailing spaces are ignored.
</para>
<para>
The <literal>synonym</> template also has an optional parameter
<literal>CaseSensitive</>, which defaults to <literal>false</>. When
<literal>CaseSensitive</> is <literal>false</>, words in the synonym file
are folded to lower case, as are input tokens. When it is
<literal>true</>, words and tokens are not folded to lower case,
but are compared as-is.
</para> </para>
<para> <para>
Contents of <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>: An asterisk (<literal>*</literal>) can be placed at the end of a synonym
in the configuration file. This indicates that the synonym is a prefix.
The asterisk is ignored when the entry is used in
<function>to_tsvector()</function>, but when it is used in
<function>to_tsquery()</function>, the result will be a query item with
the prefix match marker (see
<xref linkend="textsearch-parsing-queries">).
For example, suppose we have these entries in
<filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
<programlisting> <programlisting>
postgres pgsql postgres pgsql
postgresql pgsql postgresql pgsql
...@@ -2313,67 +2360,42 @@ postgre pgsql ...@@ -2313,67 +2360,42 @@ postgre pgsql
gogle googl gogle googl
indices index* indices index*
</programlisting> </programlisting>
</para> Then we will get these results:
<para>
Results:
<screen> <screen>
=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample'); mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
=# SELECT ts_lexize('syn','indices'); mydb=# SELECT ts_lexize('syn','indices');
ts_lexize ts_lexize
----------- -----------
{index} {index}
(1 row) (1 row)
=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple); mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn; mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
=# SELECT to_tsquery('tst','indices'); mydb=# SELECT to_tsvector('tst','indices');
to_tsvector
-------------
'index':1
(1 row)
mydb=# SELECT to_tsquery('tst','indices');
to_tsquery to_tsquery
------------ ------------
'index':* 'index':*
(1 row) (1 row)
=# SELECT 'indexes are very useful'::tsvector; mydb=# SELECT 'indexes are very useful'::tsvector;
tsvector tsvector
--------------------------------- ---------------------------------
'are' 'indexes' 'useful' 'very' 'are' 'indexes' 'useful' 'very'
(1 row) (1 row)
=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices'); mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
?column? ?column?
---------- ----------
t t
(1 row) (1 row)
=# SELECT to_tsvector('tst','indices');
to_tsvector
-------------
'index':1
(1 row)
</screen> </screen>
</para> </para>
<para>
The only parameter required by the <literal>synonym</> template is
<literal>SYNONYMS</>, which is the base name of its configuration file
&mdash; <literal>my_synonyms</> in the above example.
The file's full name will be
<filename>$SHAREDIR/tsearch_data/my_synonyms.syn</>
(where <literal>$SHAREDIR</> means the
<productname>PostgreSQL</> installation's shared-data directory).
The file format is just one line
per word to be substituted, with the word followed by its synonym,
separated by white space. Blank lines and trailing spaces are ignored.
</para>
<para>
The <literal>synonym</> template also has an optional parameter
<literal>CaseSensitive</>, which defaults to <literal>false</>. When
<literal>CaseSensitive</> is <literal>false</>, words in the synonym file
are folded to lower case, as are input tokens. When it is
<literal>true</>, words and tokens are not folded to lower case,
but are compared as-is.
</para>
</sect2> </sect2>
<sect2 id="textsearch-thesaurus"> <sect2 id="textsearch-thesaurus">
......
<!-- $PostgreSQL: pgsql/doc/src/sgml/unaccent.sgml,v 1.6 2010/08/25 02:12:00 tgl Exp $ --> <!-- $PostgreSQL: pgsql/doc/src/sgml/unaccent.sgml,v 1.7 2010/08/25 21:42:55 tgl Exp $ -->
<sect1 id="unaccent"> <sect1 id="unaccent">
<title>unaccent</title> <title>unaccent</title>
...@@ -75,8 +75,10 @@ ...@@ -75,8 +75,10 @@
<para> <para>
Running the installation script <filename>unaccent.sql</> creates a text Running the installation script <filename>unaccent.sql</> creates a text
search template <literal>unaccent</> and a dictionary <literal>unaccent</> search template <literal>unaccent</> and a dictionary <literal>unaccent</>
based on it, with default parameters. You can alter the based on it. The <literal>unaccent</> dictionary has the default
parameters, for example parameter setting <literal>RULES='unaccent'</>, which makes it immediately
usable with the standard <filename>unaccent.rules</> file.
If you wish, you can alter the parameter, for example
<programlisting> <programlisting>
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules'); mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
......
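
The dictionary-chain behavior this commit documents can be illustrated with a small model. The following is only a sketch under stated assumptions, not PostgreSQL's implementation: the names `unaccent_filter`, `simple_dict`, and `lexize_chain` are hypothetical, the accent stripping uses Unicode decomposition as a stand-in for the rules file that `contrib/unaccent` actually reads, and a boolean stands in for the `TSL_FILTER` flag.

```python
import unicodedata
from typing import Callable, List, Optional, Tuple

# A dictionary returns None if it does not recognize the token, otherwise
# (lexemes, is_filter). is_filter models the TSL_FILTER flag (assumption:
# a plain boolean is enough for this sketch).
DictResult = Optional[Tuple[List[str], bool]]
Dictionary = Callable[[str], DictResult]

STOP_WORDS = {"the", "a", "of"}  # hypothetical stop-word list


def unaccent_filter(token: str) -> DictResult:
    """Filtering dictionary: strips accents and hands the token onward."""
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    if stripped == token:
        return None  # nothing to change, so act as "not recognized"
    return ([stripped], True)


def simple_dict(token: str) -> DictResult:
    """Very general dictionary: lower-cases everything, drops stop words."""
    lowered = token.lower()
    if lowered in STOP_WORDS:
        return ([], False)  # known token, but a stop word
    return ([lowered], False)


def lexize_chain(token: str, dictionaries: List[Dictionary]) -> List[str]:
    """Consult dictionaries in order, as described in the text above."""
    for dictionary in dictionaries:
        result = dictionary(token)
        if result is None:
            continue  # unrecognized: try the next dictionary
        lexemes, is_filter = result
        if is_filter and lexemes:
            token = lexemes[0]  # replace the token and keep going
            continue
        return lexemes  # first non-filter answer determines the result
    return []  # no dictionary produced lexemes; the token is dropped


if __name__ == "__main__":
    chain = [unaccent_filter, simple_dict]
    print(lexize_chain("H\u00f4tel", chain))
    print(lexize_chain("the", chain))
```

In this model, placing `unaccent_filter` last would accomplish nothing, because no later dictionary remains to receive the rewritten token, which matches the remark in the new paragraph that a filtering dictionary at the end of the list is useless.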