Editorial improvements for GIN documentation.

08fa6a68 · Tom Lane · a88ec7b4 · 08fa6a68 · 08fa6a68 · 08fa6a68
Commit 08fa6a68 authored Dec 01, 2006 by Tom Lane
Hide whitespace changes
Inline Side-by-side

Showing with 138 additions and 114 deletions

doc/src/sgml/gin.sgml doc/src/sgml/gin.sgml +100 -80

doc/src/sgml/indices.sgml doc/src/sgml/indices.sgml +6 -5

doc/src/sgml/xindex.sgml doc/src/sgml/xindex.sgml +32 -29

No files found.
--- a/doc/src/sgml/gin.sgml
+++ b/doc/src/sgml/gin.sgml
-<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.6 2006/11/30 20:50:44 petere Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.7 2006/12/01 23:46:46 tgl Exp $ -->
 <chapter id="GIN">
 <title>GIN Indexes</title>
@@ -14,8 +14,9 @@
 <para>
   <acronym>GIN</acronym> stands for Generalized Inverted Index.  It is
   an index structure storing a set of (key, posting list) pairs, where
-   'posting list' is a set of rows in which the key occurs. Each
+   a <quote>posting list</> is a set of rows in which the key occurs. Each
-   row may contain many keys.
+   indexed value may contain many keys, so the same row ID may appear in
+   multiple posting lists.
 </para>
 <para>
@@ -45,7 +46,7 @@
 <para>
   The <acronym>GIN</acronym> interface has a high level of abstraction,
-   requiring the access method implementer to only implement the semantics of
+   requiring the access method implementer only to implement the semantics of
   the data type being accessed.  The <acronym>GIN</acronym> layer itself
   takes care of concurrency, logging and searching the tree structure.
 </para>
@@ -53,26 +54,14 @@
 <para>
   All it takes to get a <acronym>GIN</acronym> access method working
   is to implement four user-defined methods, which define the behavior of
-   keys in the tree. In short, <acronym>GIN</acronym> combines extensibility
+   keys in the tree and the relationships between keys, indexed values,
-   along with generality, code reuse, and a clean interface.
+   and indexable queries. In short, <acronym>GIN</acronym> combines
- </para>
+   extensibility with generality, code reuse, and a clean interface.
-</sect1>
-<sect1 id="gin-implementation">
- <title>Implementation</title>
- <para>
-  Internally, <acronym>GIN</acronym> consists of a B-tree index constructed 
-  over keys, where each key is an element of the indexed value 
-  (element of array, for example) and where each tuple in a leaf page is 
-  either a pointer to a B-tree over heap pointers (PT, posting tree), or a 
-  list of heap pointers (PL, posting list) if the tuple is small enough.
 </para>
 <para>
-   There are four methods that an index operator class for
+   The four methods that an index operator class for
-   <acronym>GIN</acronym> must provide (prototypes are in pseudocode):
+   <acronym>GIN</acronym> must provide are:
 </para>
 <variablelist>
@@ -80,9 +69,9 @@
     <term>int compare(Datum a, Datum b)</term>
     <listitem>
      <para>
-	   Compares keys (not indexed values!) and returns an integer less than 
+       Compares keys (not indexed values!) and returns an integer less than
-	   zero, zero, or greater than zero, indicating whether the first key is 
+       zero, zero, or greater than zero, indicating whether the first key is
-	   less than, equal to, or greater than the second.
+       less than, equal to, or greater than the second.
      </para>
     </listitem>
    </varlistentry>
@@ -91,21 +80,26 @@
     <term>Datum* extractValue(Datum inputValue, uint32 *nkeys)</term>
     <listitem>
      <para>
-	   Returns an array of keys of value to be indexed, nkeys should
+       Returns an array of keys given a value to be indexed.  The
-	   contain the number of returned keys.
+       number of returned keys must be stored into <literal>*nkeys</>.
      </para>
     </listitem>
    </varlistentry>
    <varlistentry>
-     <term>Datum* extractQuery(Datum query, uint32 nkeys, 
+     <term>Datum* extractQuery(Datum query, uint32 *nkeys,
-		StrategyNumber n)</term>
+        StrategyNumber n)</term>
     <listitem>
      <para>
-	   Returns an array of keys of the query to be executed. n contains the
+       Returns an array of keys given a value to be queried; that is,
-	   strategy number of the operation (see <xref
+       <literal>query</> is the value on the right-hand side of an
-	   linkend="xindex-strategies">).  Depending on n, query may be
+       indexable operator whose left-hand side is the indexed column.
-	   different type.
+       <literal>n</> is the strategy number of the operator within the
+       operator class (see <xref linkend="xindex-strategies">).
+       Often, <function>extractQuery</> will need
+       to consult <literal>n</> to determine the data type of
+       <literal>query</> and the key values that need to be extracted.
+       The number of returned keys must be stored into <literal>*nkeys</>.
      </para>
     </listitem>
    </varlistentry>
@@ -114,11 +108,16 @@
     <term>bool consistent(bool check[], StrategyNumber n, Datum query)</term>
     <listitem>
      <para>
-	   Returns TRUE if the indexed value satisfies the query qualifier with 
+       Returns TRUE if the indexed value satisfies the query operator with
-	   strategy n (or may satisfy in case of RECHECK mark in operator class). 
+       strategy number <literal>n</> (or may satisfy, if the operator is
-	   Each element of the check array is TRUE if the indexed value has a 
+       marked RECHECK in the operator class).  The <literal>check</> array has
-	   corresponding key in the query: if (check[i] == TRUE) the i-th key of 
+       the same length as the number of keys previously returned by
-	   the query is present in the indexed value.
+       <function>extractQuery</> for this query.  Each element of the
+       <literal>check</> array is TRUE if the indexed value contains the
+       corresponding query key, ie, if (check[i] == TRUE) the i-th key of the
+       <function>extractQuery</> result array is present in the indexed value.
+       The original <literal>query</> datum (not the extracted key array!) is
+       passed in case the <function>consistent</> method needs to consult it.
      </para>
     </listitem>
    </varlistentry>
@@ -127,6 +126,19 @@
 </sect1>
+<sect1 id="gin-implementation">
+ <title>Implementation</title>
+ <para>
+  Internally, a <acronym>GIN</acronym> index contains a B-tree index
+  constructed over keys, where each key is an element of the indexed value
+  (a member of an array, for example) and where each tuple in a leaf page is
+  either a pointer to a B-tree over heap pointers (PT, posting tree), or a
+  list of heap pointers (PL, posting list) if the list is small enough.
+ </para>
+</sect1>
 <sect1 id="gin-tips">
 <title>GIN tips and tricks</title>
@@ -134,44 +146,43 @@
  <varlistentry>
   <term>Create vs insert</term>
   <listitem>
-	<para>
+    <para>
-	 In most cases, insertion into a <acronym>GIN</acronym> index is slow
+     In most cases, insertion into a <acronym>GIN</acronym> index is slow
-	 due to the likelihood of many keys being inserted for each value.
+     due to the likelihood of many keys being inserted for each value.
-	 So, for bulk insertions into a table it is advisable to to drop the GIN 
+     So, for bulk insertions into a table it is advisable to drop the GIN
-	 index and recreate it after finishing bulk insertion.
+     index and recreate it after finishing bulk insertion.
-	</para>
+    </para>
   </listitem>
  </varlistentry>
  <varlistentry>
-   <term>gin_fuzzy_search_limit</term>
+   <term><xref linkend="guc-gin-fuzzy-search-limit"></term>
   <listitem>
-	<para>
+    <para>
-	 The primary goal of developing <acronym>GIN</acronym> indices was 
+     The primary goal of developing <acronym>GIN</acronym> indexes was
-	 support for highly scalable, full-text search in 
+     to create support for highly scalable, full-text search in
-	 <productname>PostgreSQL</productname> and there are often situations when 
+     <productname>PostgreSQL</productname>, and there are often situations when
-	 a full-text search returns a very large set of results.  Since reading 
+     a full-text search returns a very large set of results.  Moreover, this
-	 tuples from the disk and sorting them could take a lot of time, this is 
+     often happens when the query contains very frequent words, so that the
-	 unacceptable for production.  (Note that the index search itself is very 
+     large result set is not even useful.  Since reading many
-	 fast.) 
+     tuples from the disk and sorting them could take a lot of time, this is
+     unacceptable for production.  (Note that the index search itself is very
+     fast.)
+    </para>
+    <para>
+     To facilitate controlled execution of such queries
+     <acronym>GIN</acronym> has a configurable soft upper limit on the size
+     of the returned set, the
+     <varname>gin_fuzzy_search_limit</varname> configuration parameter.
+     It is set to 0 (meaning no limit) by default.
+     If a non-zero limit is set, then the returned set is a subset of
+     the whole result set, chosen at random.
+    </para>
+    <para>
+     <quote>Soft</quote> means that the actual number of returned results
+     could differ slightly from the specified limit, depending on the query
+     and the quality of the system's random number generator.
    </para>
-	<para>
-	 Such queries usually contain very frequent words, so the results are not 
-	 very helpful. To facilitate execution of such queries 
-	 <acronym>GIN</acronym> has a configurable soft upper limit of the size 
-	 of the returned set, determined by the 
-	 <varname>gin_fuzzy_search_limit</varname> GUC variable.  It is set to 0 by
-	 default (no limit).
-	</para>
-	<para>
-	 If a non-zero search limit is set, then the returned set is a subset of 
-	 the whole result set, chosen at random.
-	</para>
-	<para>
-	 <quote>Soft</quote> means that the actual number of returned results
-	 could slightly differ from the specified limit, depending on the query
-	 and the quality of the system's random number generator.
-	</para>
   </listitem>
  </varlistentry>
 </variablelist>
@@ -182,21 +193,30 @@
 <title>Limitations</title>
 <para>
-  <acronym>GIN</acronym> doesn't support full index scans due to their
+  <acronym>GIN</acronym> doesn't support full index scans: because there are
-  extreme inefficiency: because there are often many keys per value,
+  often many keys per value, each heap pointer would be returned many times,
-  each heap pointer will be returned several times.
+  and there is no easy way to prevent this.
 </para>
 <para>
  When <function>extractQuery</function> returns zero keys,
-  <acronym>GIN</acronym> will emit an error: for different opclasses and
+  <acronym>GIN</acronym> will emit an error.  Depending on the operator,
-  strategies the semantic meaning of a void query may be different (for
+  a void query might match all, some, or none of the indexed values (for
-  example, any array contains the void array, but they don't overlap the
+  example, every array contains the empty array, but does not overlap the
-  void array), and <acronym>GIN</acronym> can't suggest a reasonable answer.
+  empty array), and <acronym>GIN</acronym> can't determine the correct
+  answer, nor produce a full-index-scan result if it could determine that
+  that was correct.
 </para>
 <para>
-  <acronym>GIN</acronym> searches keys only by equality matching.  This may 
+  It is not an error for <function>extractValue</> to return zero keys,
+  but in this case the indexed value will be unrepresented in the index.
+  This is another reason why full index scan is not useful &mdash; it would
+  miss such rows.
+ </para>
+ <para>
+  <acronym>GIN</acronym> searches keys only by equality matching.  This may
  be improved in future.
 </para>
 </sect1>
@@ -206,12 +226,12 @@
 <para>
  The <productname>PostgreSQL</productname> source distribution includes
-  <acronym>GIN</acronym> classes for one-dimensional arrays of all internal 
+  <acronym>GIN</acronym> classes for one-dimensional arrays of all internal
  types.  The following
  <filename>contrib</> modules also contain <acronym>GIN</acronym>
-  operator classes: 
+  operator classes:
 </para>
 <variablelist>
  <varlistentry>
   <term>intarray</term>

--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
-<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.65 2006/10/23 18:10:31 petere Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.66 2006/12/01 23:46:46 tgl Exp $ -->
 <chapter id="indexes">
 <title id="indexes-title">Indexes</title>
@@ -116,7 +116,7 @@ CREATE INDEX test1_id_index ON test1 (id);
  <para>
   <productname>PostgreSQL</productname> provides several index types:
-   B-tree, Hash, GIN and GiST.  Each index type uses a different
+   B-tree, Hash, GiST and GIN.  Each index type uses a different
   algorithm that is best suited to different types of queries.
   By default, the <command>CREATE INDEX</command> command will create a
   B-tree index, which fits the most common situations.
@@ -247,8 +247,8 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
    <primary>GIN</primary>
    <see>index</see>
   </indexterm>
-   GIN is a inverted index and it's usable for values which have more
+   GIN indexes are inverted indexes which can handle values that contain more
-   than one key, arrays for example. Like GiST, GIN may support
+   than one key, arrays for example. Like GiST, GIN can support
   many different user-defined indexing strategies and the particular 
   operators with which a GIN index can be used vary depending on the 
   indexing strategy.  
@@ -267,7 +267,8 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
   (See <xref linkend="functions-array"> for the meaning of
   these operators.)
   Other GIN operator classes are available in the <literal>contrib</>
-   <literal>tsearch2</literal> and <literal>intarray</literal> modules. For more information see <xref linkend="GIN">.
+   <literal>tsearch2</literal> and <literal>intarray</literal> modules.
+   For more information see <xref linkend="GIN">.
  </para>
 </sect1>

--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
-<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.52 2006/10/23 18:10:32 petere Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.53 2006/12/01 23:46:46 tgl Exp $ -->
 <sect1 id="xindex">
 <title>Interfacing Extensions To Indexes</title>
@@ -243,15 +243,16 @@
   </table>
  <para>
-   GIN indexes are similar to GiST's in flexibility: they don't have a fixed
+   GIN indexes are similar to GiST indexes in flexibility: they don't have a
-   et of strategies. Instead, the <quote>consistency</> support routine
+   fixed set of strategies. Instead the support routines of each operator
-   interprets the strategy numbers accordingly with operator class
+   class interpret the strategy numbers according to the operator class's
-   definition. As an example, strategies of operator class over arrays
+   definition. As an example, the strategy numbers used by the built-in
-   is shown in <xref linkend="xindex-gin-array-strat-table">.
+   operator classes for arrays are
+   shown in <xref linkend="xindex-gin-array-strat-table">.
  </para>
   <table tocentry="1" id="xindex-gin-array-strat-table">
-    <title>GIN Array's Strategies</title>
+    <title>GIN Array Strategies</title>
    <tgroup cols="2">
     <thead>
      <row>
@@ -388,36 +389,35 @@
     <tbody>
      <row>
       <entry>consistent - determine whether key satisfies the 
-		query qualifier</entry>
+        query qualifier</entry>
       <entry>1</entry>
      </row>
      <row>
-       <entry>union - compute union of of a set of given keys</entry>
+       <entry>union - compute union of a set of keys</entry>
       <entry>2</entry>
      </row>
      <row>
-       <entry>compress - computes a compressed representation of a key or value
+       <entry>compress - compute a compressed representation of a key or value
-		to be indexed</entry>
+        to be indexed</entry>
       <entry>3</entry>
      </row>
      <row>
-       <entry>decompress - computes a decompressed representation of a 
+       <entry>decompress - compute a decompressed representation of a 
-	   compressed key </entry>
+        compressed key</entry>
       <entry>4</entry>
      </row>
      <row>
       <entry>penalty - compute penalty for inserting new key into subtree 
-	   with given subtree's key</entry>
+       with given subtree's key</entry>
       <entry>5</entry>
      </row>
      <row>
       <entry>picksplit - determine which entries of a page are to be moved
-	   to the new page and compute the union keys for resulting pages </entry>
+       to the new page and compute the union keys for resulting pages</entry>
       <entry>6</entry>
      </row>
      <row>
-       <entry>equal - compare two keys and returns true if they are equal 
+       <entry>equal - compare two keys and return true if they are equal</entry>
-		</entry>
       <entry>7</entry>
      </row>
     </tbody>
@@ -441,23 +441,22 @@
     <tbody>
      <row>
       <entry>
-   compare - Compare two keys and return an integer less than zero, zero, or
+        compare - compare two keys and return an integer less than zero, zero,
-   greater than zero, indicating whether the first key is less than, equal to,
+        or greater than zero, indicating whether the first key is less than,
-   or greater than the second.
+        equal to, or greater than the second
-	   </entry>
+       </entry>
       <entry>1</entry>
      </row>
      <row>
-       <entry>extractValue - extract keys from value to be indexed</entry>
+       <entry>extractValue - extract keys from a value to be indexed</entry>
       <entry>2</entry>
      </row>
      <row>
-       <entry>extractQuery - extract keys from query</entry>
+       <entry>extractQuery - extract keys from a query condition</entry>
       <entry>3</entry>
      </row>
      <row>
-       <entry>consistent - determine whether value matches by the
+       <entry>consistent - determine whether value matches query condition</entry>
-			   query</entry>
       <entry>4</entry>
      </row>
     </tbody>
@@ -822,12 +821,16 @@ CREATE OPERATOR CLASS polygon_ops
        STORAGE box;
 </programlisting>
-   At present, only the GiST and GIN index method supports a
+   At present, only the GiST and GIN index methods support a
   <literal>STORAGE</> type that's different from the column data type.
-   The GiST <literal>compress</> and <literal>decompress</> support
+   The GiST <function>compress</> and <function>decompress</> support
   routines must deal with data-type conversion when <literal>STORAGE</>
-   is used. Functions named <literal>extractValue</> and <literal>extractQuery</>
+   is used.  In GIN, the <literal>STORAGE</> type identifies the type of
-   do conversation into internally used types for GIN.
+   the <quote>key</> values, which normally is different from the type
+   of the indexed column &mdash; for example, an operator class for
+   integer array columns might have keys that are just integers.  The
+   GIN <function>extractValue</> and <function>extractQuery</> support
+   routines are responsible for extracting keys from indexed values.
  </para>
 </sect2>