Commit 5edb24a8 authored by Heikki Linnakangas

Buffering GiST index build algorithm.

When building a GiST index that doesn't fit in cache, buffers are attached
to some internal nodes in the index. This speeds up the build by avoiding
random I/O that would otherwise be needed to traverse all the way down the
tree to find the right leaf page for a tuple.

Alexander Korotkov
parent 09b68c70
......@@ -642,6 +642,40 @@ my_distance(PG_FUNCTION_ARGS)
</variablelist>
<sect2 id="gist-buffering-build">
<title>GiST buffering build</title>
<para>
Building large GiST indexes by simply inserting all the tuples tends to be
slow, because if the index tuples are scattered across the index and the
index is large enough to not fit in cache, the insertions need to perform
a lot of random I/O. PostgreSQL from version 9.2 supports a more efficient
method to build GiST indexes based on buffering, which can dramatically
reduce the number of random I/Os needed for non-ordered data sets. For
well-ordered data sets the benefit is smaller or non-existent, because
only a small number of pages receive new tuples at a time, and those pages
fit in cache even if the index as a whole does not.
</para>
<para>
However, a buffering index build needs to call the <function>penalty</>
function more often, which consumes some extra CPU resources. Also, the
buffers used in the buffering build need temporary disk space, up to
the size of the resulting index. Buffering can also influence the quality
of the produced index, in both positive and negative directions. That
influence depends on various factors, like the distribution of the input
data and the operator class implementation.
</para>
<para>
By default, the index build switches to the buffering method when the
index size reaches <xref linkend="guc-effective-cache-size">. It can
be manually turned on or off by the <literal>BUFFERING</literal> parameter
to the CREATE INDEX clause. The default behavior is good for most cases,
but turning buffering off might speed up the build somewhat if the input
data is ordered.
</para>
</sect2>
</sect1>
<sect1 id="gist-examples">
......
......@@ -340,6 +340,26 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable class="parameter">name</
</listitem>
</varlistentry>
</variablelist>
<para>
GiST indexes additionally accept this parameter:
</para>
<variablelist>
<varlistentry>
<term><literal>BUFFERING</></term>
<listitem>
<para>
Determines whether the buffering build technique described in
<xref linkend="gist-buffering-build"> is used to build the index. With
<literal>OFF</> it is disabled, with <literal>ON</> it is enabled, and
with <literal>AUTO</> it is initially disabled, but turned on
on-the-fly once the index size reaches <xref linkend="guc-effective-cache-size">. The default is <literal>AUTO</>.
</para>
</listitem>
</varlistentry>
</variablelist>
</refsect2>
......
......@@ -219,6 +219,17 @@ static relopt_real realRelOpts[] =
static relopt_string stringRelOpts[] =
{
{
{
"buffering",
"Enables buffering build for this GiST index",
RELOPT_KIND_GIST
},
4,
false,
gistValidateBufferingOption,
"auto"
},
/* list terminator */
{{NULL}}
};
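The validation callback named here, gistValidateBufferingOption, is only declared in gist_private.h further down. As a hedged sketch (not necessarily the committed implementation), a validator that accepts exactly the three documented values and raises an error otherwise could look like this:

#include "postgres.h"   /* ereport() and error codes */
#include <string.h>

void
gistValidateBufferingOption(char *value)
{
    if (value == NULL ||
        (strcmp(value, "on") != 0 &&
         strcmp(value, "off") != 0 &&
         strcmp(value, "auto") != 0))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("invalid value for \"buffering\" option"),
                 errdetail("Valid values are \"on\", \"off\", and \"auto\".")));
}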
......
......@@ -13,6 +13,6 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = gist.o gistutil.o gistxlog.o gistvacuum.o gistget.o gistscan.o \
gistproc.o gistsplit.o
gistproc.o gistsplit.o gistbuild.o gistbuildbuffers.o
include $(top_srcdir)/src/backend/common.mk
......@@ -24,6 +24,7 @@ The current implementation of GiST supports:
* provides NULL-safe interface to GiST core
* Concurrency
* Recovery support via WAL logging
* Buffering build algorithm
The support for concurrency implemented in PostgreSQL was developed based on
the paper "Access Methods for Next-Generation Database Systems" by
......@@ -31,6 +32,12 @@ Marcel Kornaker:
http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz
Buffering build algorithm for GiST was developed based on the paper "Efficient
Bulk Operations on Dynamic R-trees" by Lars Arge, Klaus Hinrichs, Jan Vahrenhold
and Jeffrey Scott Vitter.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.9894&rep=rep1&type=pdf
The original algorithms were modified in several ways:
* They had to be adapted to PostgreSQL conventions. For example, the SEARCH
......@@ -278,6 +285,134 @@ would complicate the insertion algorithm. So when an insertion sees a page
with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
crashed in the middle to completion by adding the downlink in the parent.
Buffering build algorithm
-------------------------
In the buffering index build algorithm, some or all internal nodes have a
buffer attached to them. When a tuple is inserted at the top, the descent down
the tree is stopped as soon as a buffer is reached, and the tuple is pushed to
the buffer. When a buffer gets too full, all the tuples in it are flushed to
the lower level, where they again hit lower level buffers or leaf pages. This
makes the insertions happen in more of a breadth-first than depth-first order,
which greatly reduces the amount of random I/O required.
In the algorithm, levels are numbered so that leaf pages have level zero,
and internal node levels count up from 1. This numbering ensures that a page's
level number never changes, even when the root page is split.
Level Tree
3 *
/ \
2 * *
/ | \ / | \
1 * * * * * *
/ \ / \ / \ / \ / \ / \
0 o o o o o o o o o o o o
* - internal page
o - leaf page
Internal pages that belong to certain levels have buffers associated with
them. Leaf pages never have buffers. Which levels have buffers is controlled
by "level step" parameter: level numbers that are multiples of level_step
have buffers, while others do not. For example, if level_step = 2, then
pages on levels 2, 4, 6, ... have buffers. If level_step = 1 then every
internal page has a buffer.
Level Tree (level_step = 1) Tree (level_step = 2)
3 * *
/ \ / \
2 *(b) *(b) *(b) *(b)
/ | \ / | \ / | \ / | \
1 *(b) *(b) *(b) *(b) *(b) *(b) * * * * * *
/ \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \
0 o o o o o o o o o o o o o o o o o o o o o o o o
(b) - buffer
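Expressed as code, the rule above amounts to the following check. This is only an illustrative sketch of the LEVEL_HAS_BUFFERS test that this commit adds to gist_private.h, which likewise excludes the root level:

#include <stdbool.h>

/* Sketch: does a page at 'level' carry a buffer, given 'level_step'? */
static bool
level_has_buffers(int level, int level_step, int root_level)
{
    /* leaf pages (level 0) and the root level never have buffers */
    if (level == 0 || level == root_level)
        return false;
    /* other levels have buffers only if the level is a multiple of level_step */
    return (level % level_step) == 0;
}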
Logically, a buffer is just a bunch of tuples. Physically, it is divided into
pages, backed by a temporary file. Each buffer can be in one of two states:
a) Last page of the buffer is kept in main memory. A node buffer is
automatically switched to this state when a new index tuple is added to it,
or a tuple is removed from it.
b) All pages of the buffer are swapped out to disk. When a buffer becomes too
full, and we start to flush it, all other buffers are switched to this state.
When an index tuple is inserted, its initial processing can end in one of the
following points:
1) Leaf page, if the depth of the index <= level_step, meaning that
none of the internal pages have buffers associated with them.
2) Buffer of topmost level page that has buffers.
New index tuples are processed until one of the buffers in the topmost
buffered level becomes half-full. When a buffer becomes half-full, it's added
to the emptying queue, and will be emptied before a new tuple is processed.
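A rough sketch of that outer loop is shown below. The helper names are hypothetical stand-ins for the steps described above, not functions that exist in the source:

#include <stdbool.h>

/* Hypothetical helpers -- illustrative names only */
extern bool have_more_heap_tuples(void);
extern void *next_heap_tuple(void);
extern void push_tuple_to_topmost_buffer_or_leaf(void *itup);
extern bool emptying_queue_is_empty(void);
extern void *pop_emptying_queue(void);
extern void empty_one_buffer(void *buffer);

static void
buffering_build_outer_loop(void)
{
    while (have_more_heap_tuples())
    {
        void *itup = next_heap_tuple();

        /* descend only as far as the topmost buffered level (or a leaf) */
        push_tuple_to_topmost_buffer_or_leaf(itup);

        /* drain any buffers that became half-full before the next tuple */
        while (!emptying_queue_is_empty())
            empty_one_buffer(pop_emptying_queue());
    }
}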
Emptying a buffer means that the index tuples in it are moved down, into
buffers at a lower level or to leaf pages. First, all the other buffers are
swapped to disk to free up the memory. Then tuples are popped from the buffer
one by one, and cascaded down the tree to the next buffer or leaf page below
the buffered node.
Emptying a buffer has the interesting dynamic property that any intermediate
pages between the buffer being emptied and the next buffered or leaf level
below it become cached. If there are no more buffers below the node, the leaf
pages where the tuples finally land get cached too. If there are, the last
buffer page of each buffer below is kept in memory. This is illustrated in
the figures below:
Buffer being emptied to
lower-level buffers Buffer being emptied to leaf pages
+(fb) +(fb)
/ \ / \
+ + + +
/ \ / \ / \ / \
*(ab) *(ab) *(ab) *(ab) x x x x
+ - cached internal page
x - cached leaf page
* - non-cached internal page
(fb) - buffer being emptied
(ab) - buffers being appended to, with last page in memory
At the beginning of the index build, the level step is chosen so that all the
pages involved in emptying one buffer fit in cache, so that after each of those
pages has been accessed once and cached, emptying a buffer doesn't involve
any more I/O. This locality is where the speedup of the buffering algorithm
comes from.
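As an illustration of that sizing rule, under the simplifying assumptions of a uniform fan-out and a cache budget measured in pages (the actual code derives its own estimates), the largest usable level step is roughly the largest s for which about fanout^s pages still fit in the cache:

#include <math.h>

/* Sketch: largest level step (at least 1) such that roughly fanout^step
   pages fit in the cache. */
static int
choose_level_step(double avg_fanout, double cache_pages)
{
    int step = 1;

    while (pow(avg_fanout, step + 1) <= cache_pages)
        step++;
    return step;
}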
Emptying one buffer can fill up one or more of the lower-level buffers,
triggering emptying of them as well. Whenever a buffer becomes too full, it's
added to the emptying queue, and will be emptied after the current buffer has
been processed.
To keep the size of each buffer limited even in the worst case, buffer emptying
is scheduled as soon as a buffer becomes half-full, and emptying it continues
until 1/2 of the nominal buffer size worth of tuples has been emptied. This
guarantees that when buffer emptying begins, all the lower-level buffers
are at most half-full. In the worst case, where all the tuples cascade down
to the same lower-level buffer, that buffer therefore has enough space to
accommodate all the tuples emptied from the upper-level buffer. There is no
hard size limit in any of the data structures used, though, so this only needs
to be approximate; small overfilling of some buffers doesn't matter.
If an internal page that has a buffer associated with it is split, the buffer
needs to be split too. All tuples in the buffer are scanned through and
relocated to the correct sibling buffers, using the penalty function to decide
which buffer each tuple should go to.
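In outline, that relocation pass could be structured like the sketch below. It uses the buffer helpers declared in gist_private.h later in this commit; choose_best_sibling() is a hypothetical stand-in for the penalty comparison, not a function in the source:

#include "postgres.h"
#include "access/gist_private.h"
#include "nodes/pg_list.h"

/* Hypothetical helper: pick the sibling whose downlink gives the lowest
   penalty for this tuple (illustrative only). */
extern GISTNodeBuffer *choose_best_sibling(GISTSTATE *giststate,
                                           IndexTuple itup,
                                           List *siblingBuffers);

/* Sketch: drain the old buffer and push each tuple into the sibling buffer
   chosen by the penalty function. */
static void
relocate_buffer_tuples(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
                       GISTNodeBuffer *oldBuffer, List *siblingBuffers)
{
    IndexTuple itup;

    while (gistPopItupFromNodeBuffer(gfbb, oldBuffer, &itup))
    {
        GISTNodeBuffer *dst = choose_best_sibling(giststate, itup, siblingBuffers);

        gistPushItupToNodeBuffer(gfbb, dst, itup);
    }
}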
After all tuples from the heap have been processed, there are still some index
tuples in the buffers. At this point, final buffer emptying starts. All buffers
are emptied in top-down order. This is slightly complicated by the fact that
new buffers can be allocated during the emptying, due to page splits. However,
the new buffers will always be siblings of buffers that haven't been fully
emptied yet; tuples never move upwards in the tree. The final emptying loops
through buffers at a given level until all buffers at that level have been
emptied, and then moves down to the next level.
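Schematically, that final pass might look like the sketch below, assuming for illustration that buffersOnLevels[i] in GISTBuildBuffers holds the buffers of level i, and with buffer_has_tuples() and empty_one_buffer() as hypothetical stand-ins for the real emptying machinery:

#include "postgres.h"
#include "access/gist_private.h"
#include "nodes/pg_list.h"

/* Hypothetical helpers -- they stand in for the real emptying code. */
extern bool buffer_has_tuples(GISTNodeBuffer *buf);
extern void empty_one_buffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *buf);

/* Sketch of the final, top-down emptying phase. */
static void
final_emptying(GISTBuildBuffers *gfbb)
{
    int level;

    for (level = gfbb->buffersOnLevelsLen - 1; level >= 0; level--)
    {
        bool emptied_something;

        do
        {
            ListCell *lc;

            emptied_something = false;
            foreach(lc, gfbb->buffersOnLevels[level])
            {
                GISTNodeBuffer *buf = (GISTNodeBuffer *) lfirst(lc);

                /* re-scan: page splits may have added sibling buffers */
                if (buffer_has_tuples(buf))
                {
                    empty_one_buffer(gfbb, buf);
                    emptied_something = true;
                }
            }
        } while (emptied_something);
    }
}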
Authors:
Teodor Sigaev <teodor@sigaev.ru>
......
......@@ -667,13 +667,30 @@ gistoptions(PG_FUNCTION_ARGS)
{
Datum reloptions = PG_GETARG_DATUM(0);
bool validate = PG_GETARG_BOOL(1);
bytea *result;
relopt_value *options;
GiSTOptions *rdopts;
int numoptions;
static const relopt_parse_elt tab[] = {
{"fillfactor", RELOPT_TYPE_INT, offsetof(GiSTOptions, fillfactor)},
{"buffering", RELOPT_TYPE_STRING, offsetof(GiSTOptions, bufferingModeOffset)}
};
result = default_reloptions(reloptions, validate, RELOPT_KIND_GIST);
options = parseRelOptions(reloptions, validate, RELOPT_KIND_GIST,
&numoptions);
/* if none set, we're done */
if (numoptions == 0)
PG_RETURN_NULL();
rdopts = allocateReloptStruct(sizeof(GiSTOptions), options, numoptions);
fillRelOptions((void *) rdopts, sizeof(GiSTOptions), options, numoptions,
validate, tab, lengthof(tab));
pfree(options);
PG_RETURN_BYTEA_P(rdopts);
if (result)
PG_RETURN_BYTEA_P(result);
PG_RETURN_NULL();
}
/*
......
......@@ -263,7 +263,8 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
else
GistPageGetOpaque(page)->rightlink = xldata->origrlink;
GistPageGetOpaque(page)->nsn = xldata->orignsn;
if (i < xlrec.data->npage - 1 && !isrootsplit)
if (i < xlrec.data->npage - 1 && !isrootsplit &&
xldata->markfollowright)
GistMarkFollowRight(page);
else
GistClearFollowRight(page);
......@@ -411,7 +412,7 @@ XLogRecPtr
gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
SplitedPageLayout *dist,
BlockNumber origrlink, GistNSN orignsn,
Buffer leftchildbuf)
Buffer leftchildbuf, bool markfollowright)
{
XLogRecData *rdata;
gistxlogPageSplit xlrec;
......@@ -433,6 +434,7 @@ gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
xlrec.npage = (uint16) npage;
xlrec.leftchild =
BufferIsValid(leftchildbuf) ? BufferGetBlockNumber(leftchildbuf) : InvalidBlockNumber;
xlrec.markfollowright = markfollowright;
rdata[0].data = (char *) &xlrec;
rdata[0].len = sizeof(gistxlogPageSplit);
......
......@@ -17,13 +17,31 @@
#include "access/gist.h"
#include "access/itup.h"
#include "storage/bufmgr.h"
#include "storage/buffile.h"
#include "utils/rbtree.h"
#include "utils/hsearch.h"
/* Buffer lock modes */
#define GIST_SHARE BUFFER_LOCK_SHARE
#define GIST_EXCLUSIVE BUFFER_LOCK_EXCLUSIVE
#define GIST_UNLOCK BUFFER_LOCK_UNLOCK
typedef struct
{
BlockNumber prev; /* block # of this buffer's previous page in the temp file */
uint32 freespace; /* bytes still free on this page */
char tupledata[1]; /* index tuple data begins here */
} GISTNodeBufferPage;
#define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
/* Returns free space in node buffer page */
#define PAGE_FREE_SPACE(nbp) (nbp->freespace)
/* Checks if node buffer page is empty */
#define PAGE_IS_EMPTY(nbp) (nbp->freespace == BLCKSZ - BUFFER_PAGE_DATA_OFFSET)
/* Checks if a node buffer page doesn't have enough free space for an index tuple */
#define PAGE_NO_SPACE(nbp, itup) (PAGE_FREE_SPACE(nbp) < \
MAXALIGN(IndexTupleSize(itup)))
/*
* GISTSTATE: information needed for any GiST index operation
*
......@@ -170,6 +188,7 @@ typedef struct gistxlogPageSplit
BlockNumber leftchild; /* like in gistxlogPageUpdate */
uint16 npage; /* # of pages in the split */
bool markfollowright; /* set F_FOLLOW_RIGHT flags */
/*
* follow: 1. gistxlogPage and array of IndexTupleData per page
......@@ -279,13 +298,149 @@ typedef struct
#define GistTupleIsInvalid(itup) ( ItemPointerGetOffsetNumber( &((itup)->t_tid) ) == TUPLE_IS_INVALID )
#define GistTupleSetValid(itup) ItemPointerSetOffsetNumber( &((itup)->t_tid), TUPLE_IS_VALID )
/*
* A buffer attached to an internal node, used when building an index in
* buffering mode.
*/
typedef struct
{
BlockNumber nodeBlocknum; /* index block # this buffer is for */
int32 blocksCount; /* current # of blocks occupied by buffer */
BlockNumber pageBlocknum; /* temporary file block # */
GISTNodeBufferPage *pageBuffer; /* in-memory buffer page */
/* is this buffer queued for emptying? */
bool queuedForEmptying;
struct GISTBufferingInsertStack *path;
} GISTNodeBuffer;
/*
* Does specified level have buffers? (Beware of multiple evaluation of
* arguments.)
*/
#define LEVEL_HAS_BUFFERS(nlevel, gfbb) \
((nlevel) != 0 && (nlevel) % (gfbb)->levelStep == 0 && \
(nlevel) != (gfbb)->rootitem->level)
/* Is specified buffer at least half-filled (should be queued for emptying)? */
#define BUFFER_HALF_FILLED(nodeBuffer, gfbb) \
((nodeBuffer)->blocksCount > (gfbb)->pagesPerBuffer / 2)
/*
* Is specified buffer full? Our buffers can actually grow indefinitely,
* beyond the "maximum" size, so this just means whether the buffer has grown
* beyond the nominal maximum size.
*/
#define BUFFER_OVERFLOWED(nodeBuffer, gfbb) \
((nodeBuffer)->blocksCount > (gfbb)->pagesPerBuffer)
/*
* Extended GISTInsertStack for buffering GiST index build.
*/
typedef struct GISTBufferingInsertStack
{
/* current page */
BlockNumber blkno;
/* offset of the downlink in the parent page that points to this page */
OffsetNumber downlinkoffnum;
/* pointer to parent */
struct GISTBufferingInsertStack *parent;
int refCount;
/* level number */
int level;
} GISTBufferingInsertStack;
/*
* Data structure with general information about build buffers.
*/
typedef struct GISTBuildBuffers
{
/* Persistent memory context for the buffers and metadata. */
MemoryContext context;
BufFile *pfile; /* Temporary file to store buffers in */
long nFileBlocks; /* Current size of the temporary file */
/*
* resizable array of free blocks.
*/
long *freeBlocks;
int nFreeBlocks; /* # of currently free blocks in the array */
int freeBlocksLen; /* current allocated length of the array */
/* Hash for buffers by block number */
HTAB *nodeBuffersTab;
/* List of buffers scheduled for emptying */
List *bufferEmptyingQueue;
/*
* Parameters to the buffering build algorithm. levelStep determines which
* levels in the tree have buffers, and pagesPerBuffer determines how
* large each buffer is.
*/
int levelStep;
int pagesPerBuffer;
/* Array of lists of buffers on each level, for final emptying */
List **buffersOnLevels;
int buffersOnLevelsLen;
/*
* Dynamically-sized array of buffers that currently have their last page
* loaded in main memory.
*/
GISTNodeBuffer **loadedBuffers;
int loadedBuffersCount; /* # of entries in loadedBuffers */
int loadedBuffersLen; /* allocated size of loadedBuffers */
/* A path item that points to the current root node */
GISTBufferingInsertStack *rootitem;
} GISTBuildBuffers;
/*
* Storage type for GiST's reloptions
*/
typedef struct GiSTOptions
{
int32 vl_len_; /* varlena header (do not touch directly!) */
int fillfactor; /* page fill factor in percent (0..100) */
int bufferingModeOffset; /* use buffering build? */
} GiSTOptions;
/* gist.c */
extern Datum gistbuild(PG_FUNCTION_ARGS);
extern Datum gistbuildempty(PG_FUNCTION_ARGS);
extern Datum gistinsert(PG_FUNCTION_ARGS);
extern MemoryContext createTempGistContext(void);
extern void initGISTstate(GISTSTATE *giststate, Relation index);
extern void freeGISTstate(GISTSTATE *giststate);
extern void gistdoinsert(Relation r,
IndexTuple itup,
Size freespace,
GISTSTATE *GISTstate);
/* A List of these is returned from gistplacetopage() in *splitinfo */
typedef struct
{
Buffer buf; /* the split page "half" */
IndexTuple downlink; /* downlink for this half. */
} GISTPageSplitInfo;
extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
Buffer buffer,
IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
Buffer leftchildbuf,
List **splitinfo,
bool markleftchild);
extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
int len, GISTSTATE *giststate);
......@@ -305,7 +460,7 @@ extern XLogRecPtr gistXLogSplit(RelFileNode node,
BlockNumber blkno, bool page_is_leaf,
SplitedPageLayout *dist,
BlockNumber origrlink, GistNSN oldnsn,
Buffer leftchild);
Buffer leftchild, bool markfollowright);
/* gistget.c */
extern Datum gistgettuple(PG_FUNCTION_ARGS);
......@@ -380,4 +535,27 @@ extern void gistSplitByKey(Relation r, Page page, IndexTuple *itup,
GistSplitVector *v, GistEntryVector *entryvec,
int attno);
/* gistbuild.c */
extern Datum gistbuild(PG_FUNCTION_ARGS);
extern void gistValidateBufferingOption(char *value);
extern void gistDecreasePathRefcount(GISTBufferingInsertStack *path);
/* gistbuildbuffers.c */
extern GISTBuildBuffers *gistInitBuildBuffers(int pagesPerBuffer, int levelStep,
int maxLevel);
extern GISTNodeBuffer *gistGetNodeBuffer(GISTBuildBuffers *gfbb,
GISTSTATE *giststate,
BlockNumber blkno, OffsetNumber downlinkoffnum,
GISTBufferingInsertStack *parent);
extern void gistPushItupToNodeBuffer(GISTBuildBuffers *gfbb,
GISTNodeBuffer *nodeBuffer, IndexTuple item);
extern bool gistPopItupFromNodeBuffer(GISTBuildBuffers *gfbb,
GISTNodeBuffer *nodeBuffer, IndexTuple *item);
extern void gistFreeBuildBuffers(GISTBuildBuffers *gfbb);
extern void gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb,
GISTSTATE *giststate, Relation r,
GISTBufferingInsertStack *path, Buffer buffer,
List *splitinfo);
extern void gistUnloadNodeBuffers(GISTBuildBuffers *gfbb);
#endif /* GIST_PRIVATE_H */