trgm.sgml 6.07 KB
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214
<sect1 id="pgtrgm">
 <title>pg_trgm</title>
 
 <indexterm zone="pgtrgm">
  <primary>pgtrgm</primary>
 </indexterm>

 <para>
  The <literal>pg_trgm</literal> module provides functions and index classes
  for determining the similarity of text based on trigram matching.
 </para>

 <sect2>
  <title>Trigram (or Trigraph)</title>
  <para>
   A trigram is a set of three consecutive characters taken
   from a string.  A string is considered to have two spaces
   prefixed and one space suffixed when determining the set
   of trigrams that comprise the string.
  </para>
  <para>
   eg. The set of trigrams in the word "cat" is "  c", " ca", 
   "at " and "cat".
  </para>
 </sect2>

 <sect2>
  <title>Public Functions</title>
  <table>
   <title><literal>pg_trgm</literal> functions</title>
   <tgroup cols="2">
    <thead>
     <row>
      <entry>Function</entry>
      <entry>Description</entry>
     </row>
    </thead>
    <tbody>
     <row>
      <entry><literal>real similarity(text, text)</literal></entry>
      <entry>
       <para>
        Returns a number that indicates how closely matches the two
        arguments are.  A zero result indicates that the two words
        are completely dissimilar, and a result of one indicates that
        the two words are identical.
       </para>
      </entry>
     </row>
     <row>
      <entry><literal>real show_limit()</literal></entry>
      <entry>
       <para>
        Returns the current similarity threshold used by the '%'
        operator.  This in effect sets the minimum similarity between
        two words in order that they be considered similar enough to
        be misspellings of each other, for example.
       </para>
      </entry>
     </row>
     <row>
      <entry><literal>real set_limit(real)</literal></entry>
      <entry>
       <para>
        Sets the current similarity threshold that is used by the '%'
        operator, and is returned by the show_limit() function.
       </para>
      </entry>
     </row>
     <row>
      <entry><literal>text[] show_trgm(text)</literal></entry>
      <entry>
       <para>
        Returns an array of all the trigrams of the supplied text
        parameter.
       </para>
      </entry>
     </row>
     <row>
      <entry>Operator: <literal>text % text (returns boolean)</literal></entry> 
      <entry>
       <para>
        The '%' operator returns TRUE if its two arguments have a similarity
        that is greater than the similarity threshold set by set_limit(). It
        will return FALSE if the similarity is less than the current
        threshold.
       </para>
      </entry>
     </row>
    </tbody>
   </tgroup>
  </table>
 </sect2>

 <sect2>
  <title>Public Index Operator Class</title>
  <para>
   The <literal>pg_trgm</literal> module comes with the 
   <literal>gist_trgm_ops</literal> index operator class that allows a
   developer to create an index over a text column for the purpose
   of very fast similarity searches.
  </para>
  <para>
   To use this index, the '%' operator must be used and an appropriate
   similarity threshold for the application must be set. Example:
  </para>
  <programlisting>
CREATE TABLE test_trgm (t text);
CREATE INDEX trgm_idx ON test_trgm USING gist (t gist_trgm_ops);
  </programlisting>
  <para>	
   At this point, you will have an index on the t text column that you
   can use for similarity searching. Example:
  </para>
  <programlisting>
SELECT
	t,
	similarity(t, 'word') AS sml
FROM
	test_trgm
WHERE
	t % 'word'
ORDER BY
	sml DESC, t;
  </programlisting>
  <para>
   This will return all values in the text column that are sufficiently
   similar to 'word', sorted from best match to worst.  The index will
   be used to make this a fast operation over very large data sets.
  </para>
 </sect2>

 <sect2>
  <title>Tsearch2 Integration</title>
  <para>
   Trigram matching is a very useful tool when used in conjunction
   with a text index created by the Tsearch2 contrib module. (See
   contrib/tsearch2)
  </para>
  <para>
   The first step is to generate an auxiliary table containing all
   the unique words in the Tsearch2 index:
  </para>
  <programlisting>
CREATE TABLE words AS SELECT word FROM 
	stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
  </programlisting>
  <para>
   Where 'documents' is a table that has a text field 'bodytext'
   that TSearch2 is used to search.  The use of the 'simple' dictionary
   with the to_tsvector function, instead of just using the already
   existing vector is to avoid creating a list of already stemmed
   words.  This way, only the original, unstemmed words are added
   to the word list.
  </para>
  <para>
   Next, create a trigram index on the word column:
  </para>
  <programlisting>
CREATE INDEX words_idx ON words USING gist(word gist_trgm_ops);
  </programlisting>
  <para>
   or
  </para>
  <programlisting>
CREATE INDEX words_idx ON words USING gin(word gist_trgm_ops);
  </programlisting>
  <para>
   Now, a <literal>SELECT</literal> query similar to the example above can be 
   used to suggest spellings for misspelled words in user search terms. A
   useful extra clause is to ensure that the similar words are also
   of similar length to the misspelled word.
  </para>
  <para>
   <note>
    <para>
     Since the 'words' table has been generated as a separate,
     static table, it will need to be periodically regenerated so that
     it remains up to date with the word list in the Tsearch2 index.
    </para>
   </note>
  </para>
 </sect2>

 <sect2>
  <title>References</title>
  <para>
   Tsearch2 Development Site
   <ulink url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/"></ulink>
  </para>
  <para>
   GiST Development Site
   <ulink url="http://www.sai.msu.su/~megera/postgres/gist/"></ulink>
  </para>
 </sect2>

 <sect2>
  <title>Authors</title>
  <para>
   Oleg Bartunov <email>oleg@sai.msu.su</email>, Moscow, Moscow University, Russia
  </para>
  <para>
   Teodor Sigaev <email>teodor@sigaev.ru</email>, Moscow, Delta-Soft Ltd.,Russia
  </para>
  <para>
   Documentation: Christopher Kings-Lynne 
  </para>
  <para>
   This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
  </para>
 </sect2>

</sect1>