trgm.sgml

<sect1 id="pgtrgm">
 <title>pg_trgm</title>
 
 <indexterm zone="pgtrgm">
  <primary>pgtrgm</primary>
 </indexterm>

 <para>
  The <literal>pg_trgm</literal> module provides functions and index classes
  for determining the similarity of text based on trigram matching.
 </para>

 <sect2>
  <title>Trigram (or Trigraph)</title>
  <para>
   A trigram is a set of three consecutive characters taken
   from a string.  A string is considered to have two spaces
   prefixed and one space suffixed when determining the set
   of trigrams that comprise the string.
  </para>
  <para>
   eg. The set of trigrams in the word "cat" is "  c", " ca", 
   "at " and "cat".
  </para>
 </sect2>

 <sect2>
  <title>Public Functions</title>
  <table>
   <title><literal>pg_trgm</literal> functions</title>
   <tgroup cols="2">
    <thead>
     <row>
      <entry>Function</entry>
      <entry>Description</entry>
     </row>
    </thead>
    <tbody>
     <row>
      <entry><literal>real similarity(text, text)</literal></entry>
      <entry>
       <para>
        Returns a number that indicates how closely matches the two
        arguments are.  A zero result indicates that the two words
        are completely dissimilar, and a result of one indicates that
        the two words are identical.
       </para>
      </entry>
     </row>
     <row>
      <entry><literal>real show_limit()</literal></entry>
      <entry>
       <para>
        Returns the current similarity threshold used by the '%'
        operator.  This in effect sets the minimum similarity between
        two words in order that they be considered similar enough to
        be misspellings of each other, for example.
       </para>
      </entry>
     </row>
     <row>
      <entry><literal>real set_limit(real)</literal></entry>
      <entry>
       <para>
        Sets the current similarity threshold that is used by the '%'
        operator, and is returned by the show_limit() function.
       </para>
      </entry>
     </row>
     <row>
      <entry><literal>text[] show_trgm(text)</literal></entry>
      <entry>
       <para>
        Returns an array of all the trigrams of the supplied text
        parameter.
       </para>
      </entry>
     </row>
     <row>
      <entry>Operator: <literal>text % text (returns boolean)</literal></entry> 
      <entry>
       <para>
        The '%' operator returns TRUE if its two arguments have a similarity
        that is greater than the similarity threshold set by set_limit(). It
        will return FALSE if the similarity is less than the current
        threshold.
       </para>
      </entry>
     </row>
    </tbody>
   </tgroup>
  </table>
 </sect2>

 <sect2>
  <title>Public Index Operator Class</title>
  <para>
   The <literal>pg_trgm</literal> module comes with the 
   <literal>gist_trgm_ops</literal> index operator class that allows a
   developer to create an index over a text column for the purpose
   of very fast similarity searches.
  </para>
  <para>
   To use this index, the '%' operator must be used and an appropriate
   similarity threshold for the application must be set. Example:
  </para>
  <programlisting>
CREATE TABLE test_trgm (t text);
CREATE INDEX trgm_idx ON test_trgm USING gist (t gist_trgm_ops);
  </programlisting>
  <para>	
   At this point, you will have an index on the t text column that you
   can use for similarity searching. Example:
  </para>
  <programlisting>
SELECT
	t,
	similarity(t, 'word') AS sml
FROM
	test_trgm
WHERE
	t % 'word'
ORDER BY
	sml DESC, t;
  </programlisting>
  <para>
   This will return all values in the text column that are sufficiently
   similar to 'word', sorted from best match to worst.  The index will
   be used to make this a fast operation over very large data sets.
  </para>
 </sect2>

 <sect2>
  <title>Tsearch2 Integration</title>
  <para>
   Trigram matching is a very useful tool when used in conjunction
   with a text index created by the Tsearch2 contrib module. (See
   contrib/tsearch2)
  </para>
  <para>
   The first step is to generate an auxiliary table containing all
   the unique words in the Tsearch2 index:
  </para>
  <programlisting>
CREATE TABLE words AS SELECT word FROM 
	stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
  </programlisting>
  <para>
   Where 'documents' is a table that has a text field 'bodytext'
   that TSearch2 is used to search.  The use of the 'simple' dictionary
   with the to_tsvector function, instead of just using the already
   existing vector is to avoid creating a list of already stemmed
   words.  This way, only the original, unstemmed words are added
   to the word list.
  </para>
  <para>
   Next, create a trigram index on the word column:
  </para>
  <programlisting>
CREATE INDEX words_idx ON words USING gist(word gist_trgm_ops);
  </programlisting>
  <para>
   or
  </para>
  <programlisting>
CREATE INDEX words_idx ON words USING gin(word gist_trgm_ops);
  </programlisting>
  <para>
   Now, a <literal>SELECT</literal> query similar to the example above can be 
   used to suggest spellings for misspelled words in user search terms. A
   useful extra clause is to ensure that the similar words are also
   of similar length to the misspelled word.
  </para>
  <para>
   <note>
    <para>
     Since the 'words' table has been generated as a separate,
     static table, it will need to be periodically regenerated so that
     it remains up to date with the word list in the Tsearch2 index.
    </para>
   </note>
  </para>
 </sect2>

 <sect2>
  <title>References</title>
  <para>
   Tsearch2 Development Site
   <ulink url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/"></ulink>
  </para>
  <para>
   GiST Development Site
   <ulink url="http://www.sai.msu.su/~megera/postgres/gist/"></ulink>
  </para>
 </sect2>

 <sect2>
  <title>Authors</title>
  <para>
   Oleg Bartunov <email>oleg@sai.msu.su</email>, Moscow, Moscow University, Russia
  </para>
  <para>
   Teodor Sigaev <email>teodor@sigaev.ru</email>, Moscow, Delta-Soft Ltd.,Russia
  </para>
  <para>
   Documentation: Christopher Kings-Lynne 
  </para>
  <para>
   This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
  </para>
 </sect2>

</sect1>