Commit aea7c17e authored by Teodor Sigaev

Rework word_similarity documentation, make it close to actual algorithm.

word_similarity was previously documented as returning the similarity of the
closest word in a string, but it actually returns the similarity of a
substring. Also fix mistyped comments.

Author: Alexander Korotkov
Review by: David Steele, Liudmila Mantrova
Discussions:
https://www.postgresql.org/message-id/flat/CY4PR17MB13207ED8310F847CF117EED0D85A0@CY4PR17MB1320.namprd17.prod.outlook.com
https://www.postgresql.org/message-id/flat/f43b242d-000c-f4c8-cb8b-d37e9752cd93%40postgrespro.ru
parent d652e352
@@ -456,7 +456,7 @@ iterate_word_similarity(int *trg2indexes,
 lastpos[trgindex] = i;
 }
-/* Adjust lower bound if this trigram is present in required substring */
+/* Adjust upper bound if this trigram is present in required substring */
 if (found[trgindex])
 {
 int prev_lower,
@@ -473,7 +473,7 @@ iterate_word_similarity(int *trg2indexes,
 smlr_cur = CALCSML(count, ulen1, ulen2);
-/* Also try to adjust upper bound for greater similarity */
+/* Also try to adjust lower bound for greater similarity */
 tmp_count = count;
 tmp_ulen2 = ulen2;
 prev_lower = lower;
......
@@ -99,12 +99,10 @@
 </entry>
 <entry><type>real</type></entry>
 <entry>
-Returns a number that indicates how similar the first string
-to the most similar word of the second string. The function searches in
-the second string a most similar word not a most similar substring. The
-range of the result is zero (indicating that the two strings are
-completely dissimilar) to one (indicating that the first string is
-identical to one of the words of the second string).
+Returns a number that indicates the greatest similarity between
+the set of trigrams in the first string and any continuous extent
+of an ordered set of trigrams in the second string. For details, see
+the explanation below.
 </entry>
 </row>
 <row>
@@ -131,6 +129,34 @@
 </tgroup>
 </table>
+<para>
+Consider the following example:
+<programlisting>
+# SELECT word_similarity('word', 'two words');
+word_similarity
+-----------------
+0.8
+(1 row)
+</programlisting>
+In the first string, the set of trigrams is
+<literal>{"  w"," wo","ord","wor","rd "}</literal>.
+In the second string, the ordered set of trigrams is
+<literal>{"  t"," tw","two","wo ","  w"," wo","wor","ord","rds","ds "}</literal>.
+The most similar extent of an ordered set of trigrams in the second string
+is <literal>{"  w"," wo","wor","ord"}</literal>, and the similarity is
+<literal>0.8</literal>.
+</para>
+<para>
+This function returns a value that can be approximately understood as the
+greatest similarity between the first string and any substring of the second
+string. However, this function does not add padding to the boundaries of
+the extent. Thus, a whole word match gets a higher score than a match with
+a part of the word.
+</para>
 <table id="pgtrgm-op-table">
 <title><filename>pg_trgm</filename> Operators</title>
 <tgroup cols="3">
@@ -156,10 +182,11 @@
 <entry><type>text</type> <literal>&lt;%</literal> <type>text</type></entry>
 <entry><type>boolean</type></entry>
 <entry>
-Returns <literal>true</literal> if its first argument has the similar word in
-the second argument and they have a similarity that is greater than the
-current word similarity threshold set by
-<varname>pg_trgm.word_similarity_threshold</varname> parameter.
+Returns <literal>true</literal> if the similarity between the trigram
+set in the first argument and a continuous extent of an ordered trigram
+set in the second argument is greater than the current word similarity
+threshold set by <varname>pg_trgm.word_similarity_threshold</varname>
+parameter.
 </entry>
 </row>
 <row>
@@ -302,10 +329,11 @@ SELECT t, word_similarity('<replaceable>word</replaceable>', t) AS sml
 WHERE '<replaceable>word</replaceable>' &lt;% t
 ORDER BY sml DESC, t;
 </programlisting>
-This will return all values in the text column that have a word
-which sufficiently similar to <replaceable>word</replaceable>, sorted from best
-match to worst. The index will be used to make this a fast operation
-even over very large data sets.
+This will return all values in the text column for which there is a
+continuous extent in the corresponding ordered trigram set that is
+sufficiently similar to the trigram set of <replaceable>word</replaceable>,
+sorted from best match to worst. The index will be used to make this
+a fast operation even over very large data sets.
 </para>
 <para>
......
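The definition in the reworked documentation (greatest similarity between the trigram set of the first string and any continuous extent of the ordered trigram list of the second) can be checked with a small brute-force sketch. This is Python written for illustration only, not the C code in iterate_word_similarity, which reaches its result without enumerating every extent. The helper names trigrams and word_similarity_sketch are hypothetical, and the ASCII-only word splitting is a simplifying assumption; the padding of two leading spaces and one trailing space per word follows the trigram sets shown in the documentation example.

```python
import re


def trigrams(text):
    """Ordered trigram list of text (assumption: ASCII alphanumeric words only)."""
    out = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Each word is padded with two leading spaces and one trailing space.
        padded = "  " + word + " "
        out.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return out


def word_similarity_sketch(first, second):
    """Brute-force reading of the documented definition (illustrative only)."""
    set1 = set(trigrams(first))   # unordered trigram set of the first string
    seq2 = trigrams(second)       # ordered trigram list of the second string
    best = 0.0
    for lo in range(len(seq2)):
        extent = set()
        for hi in range(lo, len(seq2)):
            # Grow the continuous extent seq2[lo..hi] one trigram at a time.
            extent.add(seq2[hi])
            shared = len(set1 & extent)
            union = len(set1 | extent)  # never zero: extent is non-empty
            best = max(best, shared / union)
    return best


if __name__ == "__main__":
    print(word_similarity_sketch("word", "two words"))
```

For 'word' against 'two words', the best extent shares four of the five trigrams of 'word' and adds none of its own, which gives 4/5 and agrees with the 0.8 shown in the documentation example.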