Rework word_similarity documentation, make it close to actual algorithm.

word_similarity before claimed as returning similarity of closest word in string, but, actually it returns similarity of substring. Also fix mistyped comments. Author: Alexander Korotkov Review by: David Steele, Liudmila Mantrova Discussionis: https://www.postgresql.org/message-id/flat/CY4PR17MB13207ED8310F847CF117EED0D85A0@CY4PR17MB1320.namprd17.prod.outlook.com https://www.postgresql.org/message-id/flat/f43b242d-000c-f4c8-cb8b-d37e9752cd93%40postgrespro.ru

Rework word_similarity documentation, make it close to actual algorithm.
word_similarity before claimed as returning similarity of closest word in string, but, actually it returns similarity of substring. Also fix mistyped comments. Author: Alexander Korotkov Review by: David Steele, Liudmila Mantrova Discussionis: https://www.postgresql.org/message-id/flat/CY4PR17MB13207ED8310F847CF117EED0D85A0@CY4PR17MB1320.namprd17.prod.outlook.com https://www.postgresql.org/message-id/flat/f43b242d-000c-f4c8-cb8b-d37e9752cd93%40postgrespro.ru
aea7c17e · Teodor Sigaev · d652e352 · aea7c17e · aea7c17e
Commit aea7c17e authored Mar 21, 2018 by Teodor Sigaev
Hide whitespace changes
Inline Side-by-side

Showing with 44 additions and 16 deletions

contrib/pg_trgm/trgm_op.c contrib/pg_trgm/trgm_op.c +2 -2

doc/src/sgml/pgtrgm.sgml doc/src/sgml/pgtrgm.sgml +42 -14

No files found.
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -456,7 +456,7 @@ iterate_word_similarity(int *trg2indexes,
 			lastpos[trgindex] = i;
 		}
-		/* Adjust lower bound if this trigram is present in required substring */
+		/* Adjust upper bound if this trigram is present in required substring */
 		if (found[trgindex])
 		{
 			int			prev_lower,
@@ -473,7 +473,7 @@ iterate_word_similarity(int *trg2indexes,
 			smlr_cur = CALCSML(count, ulen1, ulen2);
-			/* Also try to adjust upper bound for greater similarity */
+			/* Also try to adjust lower bound for greater similarity */
 			tmp_count = count;
 			tmp_ulen2 = ulen2;
 			prev_lower = lower;

--- a/doc/src/sgml/pgtrgm.sgml
+++ b/doc/src/sgml/pgtrgm.sgml
@@ -99,12 +99,10 @@
      </entry>
      <entry><type>real</type></entry>
      <entry>
-       Returns a number that indicates how similar the first string
+       Returns a number that indicates the greatest similarity between
-       to the most similar word of the second string. The function searches in
+       the set of trigrams in the first string and any continuous extent
-       the second string a most similar word not a most similar substring.  The
+       of an ordered set of trigrams in the second string.  For details, see
-       range of the result is zero (indicating that the two strings are
+       the explanation below.
-       completely dissimilar) to one (indicating that the first string is
-       identical to one of the words of the second string).
      </entry>
     </row>
     <row>
@@ -131,6 +129,34 @@
   </tgroup>
  </table>
+  <para>
+   Consider the following example:
+<programlisting>
+# SELECT word_similarity('word', 'two words');
+ word_similarity
+-----------------
+             0.8
+(1 row)
+</programlisting>
+   In the first string, the set of trigrams is
+   <literal>{"  w"," wo","ord","wor","rd "}</literal>.
+   In the second string, the ordered set of trigrams is
+   <literal>{"  t"," tw",two,"wo ","  w"," wo","wor","ord","rds", ds "}</literal>.
+   The most similar extent of an ordered set of trigrams in the second string
+   is <literal>{"  w"," wo","wor","ord"}</literal>, and the similarity is
+   <literal>0.8</literal>.
+  </para>
+  <para>
+   This function returns a value that can be approximately understood as the
+   greatest similarity between the first string and any substring of the second
+   string.  However, this function does not add padding to the boundaries of
+   the extent.  Thus, a whole word match gets a higher score than a match with
+   a part of the word.
+  </para>
  <table id="pgtrgm-op-table">
   <title><filename>pg_trgm</filename> Operators</title>
   <tgroup cols="3">
@@ -156,10 +182,11 @@
      <entry><type>text</type> <literal>&lt;%</literal> <type>text</type></entry>
      <entry><type>boolean</type></entry>
      <entry>
-       Returns <literal>true</literal> if its first argument has the similar word in
+       Returns <literal>true</literal> if the similarity between the trigram
-       the second argument and they have a similarity that is greater than the
+       set in the first argument and a continuous extent of an ordered trigram
-       current word similarity threshold set by
+       set in the second argument is greater than the current word similarity
-       <varname>pg_trgm.word_similarity_threshold</varname> parameter.
+       threshold set by <varname>pg_trgm.word_similarity_threshold</varname>
+       parameter.
      </entry>
     </row>
     <row>
@@ -302,10 +329,11 @@ SELECT t, word_similarity('<replaceable>word</replaceable>', t) AS sml
  WHERE '<replaceable>word</replaceable>' &lt;% t
  ORDER BY sml DESC, t;
 </programlisting>
-   This will return all values in the text column that have a word
+   This will return all values in the text column for which there is a
-   which sufficiently similar to <replaceable>word</replaceable>, sorted from best
+   continuous extent in the corresponding ordered trigram set that is
-   match to worst.  The index will be used to make this a fast operation
+   sufficiently similar to the trigram set of <replaceable>word</replaceable>,
-   even over very large data sets.
+   sorted from best match to worst.  The index will be used to make this
+   a fast operation even over very large data sets.
  </para>
  <para>