From 47309464e4e937e1b11320eab9b0eff9ad63cd80 Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Fri, 31 Oct 2003 22:41:21 +0000 Subject: [PATCH] Rewrite GiST documentation into something actually useful. Christopher Kings-Lynne --- doc/src/sgml/gist.sgml | 370 +++++++++++++++++++++++++++++------------ 1 file changed, 260 insertions(+), 110 deletions(-) diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml index 4354d8a4b6..6b34e498ea 100644 --- a/doc/src/sgml/gist.sgml +++ b/doc/src/sgml/gist.sgml @@ -1,113 +1,263 @@ <!-- -$Header: /cvsroot/pgsql/doc/src/sgml/gist.sgml,v 1.12 2003/09/29 18:18:35 momjian Exp $ +$Header: /cvsroot/pgsql/doc/src/sgml/gist.sgml,v 1.13 2003/10/31 22:41:21 tgl Exp $ --> -<Chapter Id="gist"> -<DocInfo> -<AuthorGroup> -<Author> -<FirstName>Gene</FirstName> -<Surname>Selkov</Surname> -</Author> -</AuthorGroup> -<Date>Transcribed 1998-02-19</Date> -</DocInfo> -<Title>GiST Indexes</Title> - -<Para> -The information about GIST is at - <ULink url="http://GiST.CS.Berkeley.EDU:8000/gist/">http://GiST.CS.Berkeley.EDU:8000/gist/</ULink> - -with more on different indexing and sorting schemes at -<ULink url="http://s2k-ftp.CS.Berkeley.EDU:8000/personal/jmh/">http://s2k-ftp.CS.Berkeley.EDU:8000/personal/jmh/</ULink>. - -And there is more interesting reading at -<ULink url="http://epoch.cs.berkeley.edu:8000/">http://epoch.cs.berkeley.edu:8000/</ULink> and -<ULink url="http://www.sai.msu.su/~megera/postgres/gist/">http://www.sai.msu.su/~megera/postgres/gist/</ULink>. -</para> - -<Para> -<Note> -<Title>Author</Title> -<Para> -This extraction from an email sent by -Eugene Selkov, Jr. (<email>selkovjr@mcs.anl.gov</email>) -contains good information -on GiST. Hopefully we will learn more in the future and update this information. -- thomas 1998-03-01 -</Para> -</Note> -</para> -<Para> -Well, I can't say I quite understand what's going on, but at least -I (almost) succeeded in porting GiST examples to linux. 
The GiST access -method is already in the postgres tree (<FileName>src/backend/access/gist</FileName>). -</para> -<Para> -<ULink url="ftp://s2k-ftp.cs.berkeley.edu/pub/gist/pggist/pggist.tgz">Examples at Berkeley</ULink> -come with an overview of the methods and demonstrate spatial index -mechanisms for 2D boxes, polygons, integer intervals and text -(see also <ULink url="http://gist.cs.berkeley.edu:8000/gist/">GiST at Berkeley</ULink>). -In the box example, we -are supposed to see a performance gain when using the GiST index; it did -work for me but I do not have a reasonably large collection of boxes -to check that. Other examples also worked, except polygons: I got an -error doing - -<ProgramListing> -test=> CREATE INDEX pix ON polytmp -test-> USING GIST (p:box gist_poly_ops) WITH (ISLOSSY); -ERROR: cannot open pix - -(PostgreSQL 6.3 Sun Feb 1 14:57:30 EST 1998) -</ProgramListing> -</para> -<Para> -I could not get sense of this error message; it appears to be something -we'd rather ask the developers about (see also Note 4 below). What I -would suggest here is that someone of you linux guys (linux==gcc?) fetch the -original sources quoted above and apply my patch (see attachment) and -tell us what you feel about it. Looks cool to me, but I would not like -to hold it up while there are so many competent people around. -</para> -<Para> -A few notes on the sources: -</para> -<Para> -1. I failed to make use of the original (HP-UX) Makefile and rearranged - the Makefile from the ancient postgres95 tutorial to do the job. I tried - to keep it generic, but I am a very poor makefile writer -- just did - some monkey work. Sorry about that, but I guess it is now a little - more portable that the original makefile. -</para> -<Para> -2. I built the example sources right under pgsql/src (just extracted the - tar file there). The aforementioned Makefile assumes it is one level - below pgsql/src (in our case, in pgsql/src/pggist). -</para> -<Para> -3. 
The changes I made to the *.c files were all about #include's, - function prototypes and typecasting. Other than that, I just threw - away a bunch of unused vars and added a couple parentheses to please - gcc. I hope I did not screw up too much :) -</para> -<Para> -4. There is a comment in polyproc.sql: - -<ProgramListing> --- -- there's a memory leak in rtree poly_ops!! --- -- CREATE INDEX pix2 ON polytmp USING RTREE (p poly_ops); -</ProgramListing> - - Roger that!! I thought it could be related to a number of - <ProductName>PostgreSQL</ProductName> versions - back and tried the query. My system went nuts and I had to shoot down - the postmaster in about ten minutes. -</para> - -<Para> -I will continue to look into GiST for a while, but I would also -appreciate -more examples of R-tree usage. -</para> -</Chapter> +<chapter Id="GiST"> +<title>GiST Indexes</title> + +<sect1 id="intro"> + <title>Introduction</title> + + <para> + <acronym>GiST</acronym> stands for Generalized Search Tree. It is a + balanced, tree-structured access method, that acts as a base template in + which to implement arbitrary indexing schemes. B+-trees, R-trees and many + other indexing schemes can be implemented in <acronym>GiST</acronym>. + </para> + + <para> + One advantage of <acronym>GiST</acronym> is that it allows the development + of custom data types with the appropriate access methods, by + an expert in the domain of the data type, rather than a database expert. + </para> + + <para> + Some of the information here is derived from <ulink + url="http://gist.cs.berkeley.edu/">the University of California at + Berkeley's GiST Indexing Project web site</ulink> and Marcel Kornacker's + thesis, + <ulink url="http://citeseer.nj.nec.com/448594.html">Access Methods for + Next-Generation Database Systems</ulink>. 
The <acronym>GiST</acronym> + implementation in <productname>PostgreSQL</productname> is primarily + maintained by Teodor Sigaev and Oleg Bartunov, and there is more + information on their website: <ulink + url="http://www.sai.msu.su/~megera/postgres/gist/"></>. + </para> + +</sect1> + +<sect1 id="extensibility"> + <title>Extensibility</title> + + <para> + Traditionally, implementing a new index access method meant a lot of + difficult work. It was necessary to understand the inner workings of the + database, such as the lock manager and Write-Ahead Log. The + <acronym>GiST</acronym> interface has a high level of abstraction, + requiring the access method implementor to implement only the semantics of + the data type being accessed. The <acronym>GiST</acronym> layer itself + takes care of concurrency, logging and searching the tree structure. + </para> + + <para> + This extensibility should not be confused with the extensibility of the + other standard search trees in terms of the data they can handle. For + example, <productname>PostgreSQL</productname> supports extensible B+-trees + and R-trees. That means that you can use + <productname>PostgreSQL</productname> to build a B+-tree or R-tree over any + data type you want. But B+-trees only support range predicates + (<literal>&lt;</literal>, <literal>=</literal>, <literal>&gt;</literal>), + and R-trees only support n-D range queries (contains, contained, equals). + </para> + + <para> + So if you index, say, an image collection with a + <productname>PostgreSQL</productname> B+-tree, you can only issue queries + such as <quote>is imagex equal to imagey</quote>, <quote>is imagex less + than imagey</quote> and <quote>is imagex greater than imagey</quote>. + Depending on how you define <quote>equals</quote>, <quote>less than</quote> + and <quote>greater than</quote> in this context, this could be useful.
+ However, by using a <acronym>GiST</acronym>-based index, you could create + ways to ask domain-specific questions, perhaps <quote>find all images of + horses</quote> or <quote>find all over-exposed images</quote>. + </para> + + <para> + All it takes to get a <acronym>GiST</acronym> access method up and running + is to implement seven user-defined methods, which define the behavior of + keys in the tree. Of course, these methods have to be pretty fancy to + support fancy queries, but for all the standard queries (B+-trees, + R-trees, etc.) they're relatively straightforward. In short, + <acronym>GiST</acronym> combines extensibility with generality, code + reuse, and a clean interface. + </para> + +</sect1> + +<sect1 id="implementation"> + <title>Implementation</title> + + <para> + There are seven methods that an index operator class for + <acronym>GiST</acronym> must provide: + </para> + + <variablelist> + <varlistentry> + <term>consistent</term> + <listitem> + <para> + Given a predicate <literal>p</literal> on a tree page, and a user + query, <literal>q</literal>, this method will return false if it is + certain that <literal>p</literal> and <literal>q</literal> cannot both + be true for a given data item. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>union</term> + <listitem> + <para> + This method consolidates information in the tree. Given a set of + entries, this function generates a new predicate that is true for all + the entries. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>compress</term> + <listitem> + <para> + Converts the data item into a format suitable for physical storage in + an index page. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>decompress</term> + <listitem> + <para> + The reverse of the <function>compress</function> method. Converts the + index representation of the data item into a format that can be + manipulated by the database.
+ </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>penalty</term> + <listitem> + <para> + Returns a value indicating the <quote>cost</quote> of inserting the new + entry into a particular branch of the tree. Items will be inserted + down the path of least <function>penalty</function> in the tree. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>picksplit</term> + <listitem> + <para> + When a page split is necessary, this function decides which entries on + the page are to stay on the old page, and which are to move to the new + page. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>same</term> + <listitem> + <para> + Returns true if two entries are identical, false otherwise. + </para> + </listitem> + </varlistentry> + + </variablelist> + +</sect1> + +<sect1 id="limitations"> + <title>Limitations</title> + + <para> + The current implementation of <acronym>GiST</acronym> within + <productname>PostgreSQL</productname> has some major limitations: + <acronym>GiST</acronym> access is not concurrent; the + <acronym>GiST</acronym> interface doesn't allow the development of certain + data types, such as digital trees (see papers by Aoki et al.); and there + is not yet any support for write-ahead logging of updates in + <acronym>GiST</acronym> indexes. + </para> + + <para> + Solutions to the concurrency problems appear in Marcel Kornacker's + thesis; however, these ideas have not yet been put into practice in the + <productname>PostgreSQL</productname> implementation. + </para> + + <para> + The lack of write-ahead logging is just a small matter of programming, + but since it isn't done yet, a crash could render a <acronym>GiST</acronym> + index inconsistent, forcing a REINDEX.
+ </para> + +</sect1> + +<sect1 id="examples"> + <title>Examples</title> + + <para> + To see example implementations of index methods built on + <acronym>GiST</acronym>, examine the following contrib modules: + </para> + + <variablelist> + <varlistentry> + <term>btree_gist</term> + <listitem> + <para>B-Tree</para> + </listitem> + </varlistentry> + + <varlistentry> + <term>cube</term> + <listitem> + <para>Indexing for multi-dimensional cubes</para> + </listitem> + </varlistentry> + + <varlistentry> + <term>intarray</term> + <listitem> + <para>RD-Tree for one-dimensional arrays of int4 values</para> + </listitem> + </varlistentry> + + <varlistentry> + <term>ltree</term> + <listitem> + <para>Indexing for tree-like structures</para> + </listitem> + </varlistentry> + + <varlistentry> + <term>rtree_gist</term> + <listitem> + <para>R-Tree</para> + </listitem> + </varlistentry> + + <varlistentry> + <term>seg</term> + <listitem> + <para>Storage and indexed access for <quote>float ranges</quote></para> + </listitem> + </varlistentry> + + <varlistentry> + <term>tsearch and tsearch2</term> + <listitem> + <para>Full text indexing</para> + </listitem> + </varlistentry> + </variablelist> + +</sect1> + +</chapter> -- 2.24.1