<p>A thesaurus dictionary replaces all non-preferred terms with one preferred term and, optionally,
preserves the original terms for indexing. The thesaurus is used during indexing, so any change to it requires reindexing.
Tsearch2's <tt>thesaurus</tt> dictionary (TZ) is an extension of the <tt>synonym</tt> dictionary
with <b>phrase</b> support. A thesaurus is a plain text file in the following format:</p>
<pre>
# this is a comment
sample word(s) : indexed word(s)
...............................
</pre>
<ul>
<li>The <strong>colon</strong> (:) symbol is used as a delimiter.</li>
<li>Use an asterisk (<b>*</b>) at the beginning of an <tt>indexed word</tt> to skip the subdictionary for that word.
The <tt>sample words</tt> must still be known to the subdictionary.</li>
<li>The thesaurus dictionary looks for the longest match.</li></ul>
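<p>The file format and the longest-match rule can be illustrated with a minimal Python sketch. This is not tsearch2's implementation: the function names <tt>parse_thesaurus</tt> and <tt>substitute</tt> are hypothetical, and subdictionary normalization and the asterisk prefix are deliberately left out.</p>

```python
def parse_thesaurus(text):
    """Parse 'sample word(s) : indexed word(s)' lines, skipping comments."""
    rules = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        sample, _, indexed = line.partition(":")
        rules[tuple(sample.split())] = indexed.split()
    return rules


def substitute(words, rules):
    """Replace the longest matching sample phrase at each position."""
    longest = max((len(key) for key in rules), default=0)
    out, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(longest, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in rules:
                out.extend(rules[key])
                i += n
                break
        else:
            out.append(words[i])  # no rule starts here; keep the word
            i += 1
    return out


rules = parse_thesaurus("# this is a comment\none : 1\ntwo : 2\none two : 12\n")
print(substitute("one two day".split(), rules))  # ['12', 'day']
```

<p>Because the two-word rule is tried before the one-word rules, 'one two' maps to '12' rather than to '1' followed by '2'.</p>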
<P>
TZ uses a <strong>subdictionary</strong> (which should be defined in the tsearch2 configuration)
to normalize the thesaurus text. Only <strong>one subdictionary</strong> can be defined.
Note that the subdictionary produces an error if it cannot recognize a word.
In that case, remove the definition line containing that word or teach the subdictionary to recognize it.
</p>
<p>Stop-words recognized by the subdictionary are replaced by a 'stop-word placeholder', i.e.,
only their position matters.
To break possible ties, the thesaurus applies the last definition. For example, consider
thesaurus rules (with a simple subdictionary) that share the pattern 'swsw'
('s' designates a stop-word and 'w' a known word): </p>
<pre>
a one the two : swsw
the one a two : swsw2
</pre>
<p>The words 'a' and 'the' are stop-words defined in the configuration of the subdictionary.
The thesaurus treats the texts 'the one the two' and 'that one then two' as equal and uses the definition
'swsw2'.</p>
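<p>A small Python sketch of the placeholder idea, under stated assumptions: the stop list below (including 'that' and 'then') is assumed for illustration, and the helper names <tt>pattern</tt> and <tt>build</tt> are hypothetical rather than part of tsearch2.</p>

```python
# Assumed stop list of the subdictionary (illustrative only).
STOP_WORDS = {"a", "the", "that", "then"}


def pattern(words):
    """Map every stop-word to the same placeholder (None); only its position is kept."""
    return tuple(None if w in STOP_WORDS else w for w in words)


def build(rules):
    """Build a lookup table; a later definition overwrites an earlier one."""
    table = {}
    for sample, indexed in rules:
        table[pattern(sample.split())] = indexed
    return table


table = build([("a one the two", "swsw"), ("the one a two", "swsw2")])
print(table[pattern("the one the two".split())])    # swsw2
print(table[pattern("that one then two".split())])  # swsw2
```

<p>Both rules reduce to the same key, so the second definition replaces the first, which mirrors the tie-breaking described above.</p>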
<p>Like a normal dictionary, it should be assigned to specific lexeme types.
Since TZ can recognize phrases, it must remember its state and interact with the parser.
TZ uses these assignments to check whether it should handle the next word or stop accumulation.
Whoever compiles a TZ must take care to configure it properly to avoid confusion.
For example, if TZ is assigned to handle only the <tt>lword</tt> lexeme type, then a TZ definition like
' one 1:11' will not work, since the lexeme type <tt>digit</tt> is not assigned to the TZ.</p>
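<p>The effect of the lexeme-type assignment can be sketched in Python as follows. This is a simplified model, not the parser interaction tsearch2 actually performs: the function <tt>match_phrase</tt> is hypothetical and only checks a phrase starting at the first token.</p>

```python
def match_phrase(tokens, sample_words, assigned_types):
    """Return True if sample_words can be accumulated from the start of tokens.

    tokens is a list of (word, lexeme_type) pairs. Accumulation stops as soon
    as a token has a lexeme type that is not assigned to the thesaurus.
    """
    collected = []
    for word, lexeme_type in tokens:
        if lexeme_type not in assigned_types:
            break  # the thesaurus never sees this token
        collected.append(word)
        if collected == sample_words:
            return True
    return False


tokens = [("one", "lword"), ("1", "digit")]
print(match_phrase(tokens, ["one", "1"], {"lword"}))           # False
print(match_phrase(tokens, ["one", "1"], {"lword", "digit"}))  # True
```

<p>With only <tt>lword</tt> assigned, the token '1' is never handed to the thesaurus, so the phrase 'one 1' cannot be completed.</p>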
<h3>Configuration</h3>
<dl><dt>tsearch2</dt><dd></dd></dl><p>tsearch2 comes with a thesaurus template, which can be used to define a new dictionary: </p>
<pre class="real">INSERT INTO pg_ts_dict
(SELECT 'tz_simple', dict_init,
'DictFile="/path/to/tz_simple.txt",'
'Dictionary="en_stem"',
dict_lexize
FROM pg_ts_dict
WHERE dict_name = 'thesaurus_template');
</pre>
<p>Here: </p>
<ul>
<li><tt>tz_simple</tt> is the dictionary name</li>
<li><tt>DictFile="/path/to/tz_simple.txt"</tt> is the location of the thesaurus file</li>
<li><tt>Dictionary="en_stem"</tt> defines the dictionary (the Snowball English stemmer) to use for thesaurus normalization. Note that the <em>en_stem</em> dictionary has its own configuration (stop-words, for example).</li>
</ul>
<p>Now <tt>tz_simple</tt> can be used in <tt>pg_ts_cfgmap</tt>, for example: </p>
<pre>
update pg_ts_cfgmap set dict_name='{tz_simple,en_stem}' where ts_name = 'default_russian' and
tok_alias in ('lhword', 'lword', 'lpart_hword');
</pre>
<h3>Examples</h3>
<p>tz_simple: </p>
<pre>
one : 1
two : 2
one two : 12
the one : 1
one 1 : 11
</pre>
<p>To see how the thesaurus works, use the <tt>to_tsvector</tt>, <tt>to_tsquery</tt> or <tt>plainto_tsquery</tt> functions: </p>
<pre class="real">=# select plainto_tsquery('default_russian',' one day is oneday');
plainto_tsquery
------------------------
'1' & 'day' & 'oneday'
=# select plainto_tsquery('default_russian','one two day is oneday');