Commit 5edb24a8 authored by Heikki Linnakangas

Buffering GiST index build algorithm.

When building a GiST index that doesn't fit in cache, buffers are attached
to some internal nodes in the index. This speeds up the build by avoiding
random I/O that would otherwise be needed to traverse all the way down the
tree to find the right leaf page for a tuple.

Alexander Korotkov
parent 09b68c70
......@@ -642,6 +642,40 @@ my_distance(PG_FUNCTION_ARGS)
</variablelist>
<sect2 id="gist-buffering-build">
<title>GiST buffering build</title>
<para>
Building large GiST indexes by simply inserting all the tuples tends to be
slow, because if the index tuples are scattered across the index and the
index is large enough to not fit in cache, the insertions need to perform
a lot of random I/O. PostgreSQL from version 9.2 supports a more efficient
method to build GiST indexes based on buffering, which can dramatically
reduce the number of random I/Os needed for non-ordered data sets. For
well-ordered data sets the benefit is smaller or non-existent, because
only a small number of pages receive new tuples at a time, and those pages
fit in cache even if the index as a whole does not.
</para>
<para>
However, a buffering index build needs to call the <function>penalty</>
function more often, which consumes some extra CPU resources. Also, the
buffers used in the buffering build need temporary disk space, up to
the size of the resulting index. Buffering can also influence the quality
of the produced index, in both positive and negative directions. That
influence depends on various factors, like the distribution of the input
data and the operator class implementation.
</para>
<para>
By default, the index build switches to the buffering method when the
index size reaches <xref linkend="guc-effective-cache-size">. It can
be manually turned on or off by the <literal>BUFFERING</literal> parameter
to the CREATE INDEX clause. The default behavior is good for most cases,
but turning buffering off might speed up the build somewhat if the input
data is ordered.
</para>
</sect2>
</sect1>
<sect1 id="gist-examples">
......
......@@ -340,6 +340,26 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable class="parameter">name</
</listitem>
</varlistentry>
</variablelist>
<para>
GiST indexes additionally accept this parameter:
</para>
<variablelist>
<varlistentry>
<term><literal>BUFFERING</></term>
<listitem>
<para>
Determines whether the buffering build technique described in
<xref linkend="gist-buffering-build"> is used to build the index. With
<literal>OFF</> it is disabled, with <literal>ON</> it is enabled, and
with <literal>AUTO</> it is initially disabled, but turned on
on-the-fly once the index size reaches <xref linkend="guc-effective-cache-size">. The default is <literal>AUTO</>.
</para>
</listitem>
</varlistentry>
</variablelist>
</refsect2>
......
......@@ -219,6 +219,17 @@ static relopt_real realRelOpts[] =
static relopt_string stringRelOpts[] =
{
{
{
"buffering",
"Enables buffering build for this GiST index",
RELOPT_KIND_GIST
},
4,
false,
gistValidateBufferingOption,
"auto"
},
/* list terminator */
{{NULL}}
};
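The validation callback named here, gistValidateBufferingOption, is only declared in gist_private.h further down. As a hedged sketch (not necessarily the committed implementation), a validator that accepts exactly the three documented values and raises an error otherwise could look like this:

#include "postgres.h"   /* ereport() and error codes */
#include <string.h>

void
gistValidateBufferingOption(char *value)
{
    if (value == NULL ||
        (strcmp(value, "on") != 0 &&
         strcmp(value, "off") != 0 &&
         strcmp(value, "auto") != 0))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("invalid value for \"buffering\" option"),
                 errdetail("Valid values are \"on\", \"off\", and \"auto\".")));
}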
......
......@@ -13,6 +13,6 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = gist.o gistutil.o gistxlog.o gistvacuum.o gistget.o gistscan.o \
gistproc.o gistsplit.o
gistproc.o gistsplit.o gistbuild.o gistbuildbuffers.o
include $(top_srcdir)/src/backend/common.mk
......@@ -24,6 +24,7 @@ The current implementation of GiST supports:
* provides NULL-safe interface to GiST core
* Concurrency
* Recovery support via WAL logging
* Buffering build algorithm
The support for concurrency implemented in PostgreSQL was developed based on
the paper "Access Methods for Next-Generation Database Systems" by
......@@ -31,6 +32,12 @@ Marcel Kornaker:
http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz
Buffering build algorithm for GiST was developed based on the paper "Efficient
Bulk Operations on Dynamic R-trees" by Lars Arge, Klaus Hinrichs, Jan Vahrenhold
and Jeffrey Scott Vitter.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.9894&rep=rep1&type=pdf
The original algorithms were modified in several ways:
* They had to be adapted to PostgreSQL conventions. For example, the SEARCH
......@@ -278,6 +285,134 @@ would complicate the insertion algorithm. So when an insertion sees a page
with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
crashed in the middle to completion by adding the downlink in the parent.
Buffering build algorithm
-------------------------
In the buffering index build algorithm, some or all internal nodes have a
buffer attached to them. When a tuple is inserted at the top, the descent down
the tree is stopped as soon as a buffer is reached, and the tuple is pushed to
the buffer. When a buffer gets too full, all the tuples in it are flushed to
the lower level, where they again hit lower level buffers or leaf pages. This
makes the insertions happen in more of a breadth-first than depth-first order,
which greatly reduces the amount of random I/O required.
In the algorithm, levels are numbered so that leaf pages have level zero,
and internal node levels count up from 1. This numbering ensures that a page's
level number never changes, even when the root page is split.
Level Tree
3 *
/ \
2 * *
/ | \ / | \
1 * * * * * *
/ \ / \ / \ / \ / \ / \
0 o o o o o o o o o o o o
* - internal page
o - leaf page
Internal pages that belong to certain levels have buffers associated with
them. Leaf pages never have buffers. Which levels have buffers is controlled
by "level step" parameter: level numbers that are multiples of level_step
have buffers, while others do not. For example, if level_step = 2, then
pages on levels 2, 4, 6, ... have buffers. If level_step = 1 then every
internal page has a buffer.
Level Tree (level_step = 1) Tree (level_step = 2)
3 * *
/ \ / \
2 *(b) *(b) *(b) *(b)
/ | \ / | \ / | \ / | \
1 *(b) *(b) *(b) *(b) *(b) *(b) * * * * * *
/ \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \
0 o o o o o o o o o o o o o o o o o o o o o o o o
(b) - buffer
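Expressed as code, the rule above amounts to the following check. This is only an illustrative sketch of the LEVEL_HAS_BUFFERS test that this commit adds to gist_private.h, which likewise excludes the root level:

#include <stdbool.h>

/* Sketch: does a page at 'level' carry a buffer, given 'level_step'? */
static bool
level_has_buffers(int level, int level_step, int root_level)
{
    /* leaf pages (level 0) and the root level never have buffers */
    if (level == 0 || level == root_level)
        return false;
    /* other levels have buffers only if the level is a multiple of level_step */
    return (level % level_step) == 0;
}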
Logically, a buffer is just a bunch of tuples. Physically, it is divided into
pages, backed by a temporary file. Each buffer can be in one of two states:
a) Last page of the buffer is kept in main memory. A node buffer is
automatically switched to this state when a new index tuple is added to it,
or a tuple is removed from it.
b) All pages of the buffer are swapped out to disk. When a buffer becomes too
full, and we start to flush it, all other buffers are switched to this state.
When an index tuple is inserted, its initial processing can end in one of the
following points:
1) Leaf page, if the depth of the index <= level_step, meaning that
none of the internal pages have buffers associated with them.
2) Buffer of topmost level page that has buffers.
New index tuples are processed until one of the buffers in the topmost
buffered level becomes half-full. When a buffer becomes half-full, it's added
to the emptying queue, and will be emptied before a new tuple is processed.
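A rough sketch of that outer loop is shown below. The helper names are hypothetical stand-ins for the steps described above, not functions that exist in the source:

#include <stdbool.h>

/* Hypothetical helpers -- illustrative names only */
extern bool have_more_heap_tuples(void);
extern void *next_heap_tuple(void);
extern void push_tuple_to_topmost_buffer_or_leaf(void *itup);
extern bool emptying_queue_is_empty(void);
extern void *pop_emptying_queue(void);
extern void empty_one_buffer(void *buffer);

static void
buffering_build_outer_loop(void)
{
    while (have_more_heap_tuples())
    {
        void *itup = next_heap_tuple();

        /* descend only as far as the topmost buffered level (or a leaf) */
        push_tuple_to_topmost_buffer_or_leaf(itup);

        /* drain any buffers that became half-full before the next tuple */
        while (!emptying_queue_is_empty())
            empty_one_buffer(pop_emptying_queue());
    }
}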
Emptying a buffer means that the index tuples in it are moved down, into
buffers at a lower level or to leaf pages. First, all the other buffers are
swapped to disk to free up the memory. Then tuples are popped from the buffer
one by one, and cascaded down the tree to the next buffer or leaf page below
the buffered node.
Emptying a buffer has the interesting dynamic property that any intermediate
pages between the buffer being emptied and the next buffered or leaf level
below it become cached. If there are no more buffers below the node, the leaf
pages where the tuples finally land get cached too. If there are, the last
buffer page of each buffer below is kept in memory. This is illustrated in
the figures below:
Buffer being emptied to
lower-level buffers Buffer being emptied to leaf pages
+(fb) +(fb)
/ \ / \
+ + + +
/ \ / \ / \ / \
*(ab) *(ab) *(ab) *(ab) x x x x
+ - cached internal page
x - cached leaf page
* - non-cached internal page
(fb) - buffer being emptied
(ab) - buffers being appended to, with last page in memory
At the beginning of the index build, the level step is chosen so that all the
pages involved in emptying one buffer fit in cache, so that after each of those
pages has been accessed once and cached, emptying a buffer doesn't involve
any more I/O. This locality is where the speedup of the buffering algorithm
comes from.
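As an illustration of that sizing rule, under the simplifying assumptions of a uniform fan-out and a cache budget measured in pages (the actual code derives its own estimates), the largest usable level step is roughly the largest s for which about fanout^s pages still fit in the cache:

#include <math.h>

/* Sketch: largest level step (at least 1) such that roughly fanout^step
   pages fit in the cache. */
static int
choose_level_step(double avg_fanout, double cache_pages)
{
    int step = 1;

    while (pow(avg_fanout, step + 1) <= cache_pages)
        step++;
    return step;
}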
Emptying one buffer can fill up one or more of the lower-level buffers,
triggering emptying of them as well. Whenever a buffer becomes too full, it's
added to the emptying queue, and will be emptied after the current buffer has
been processed.
To keep the size of each buffer limited even in the worst case, buffer emptying
is scheduled as soon as a buffer becomes half-full, and emptying it continues
until 1/2 of the nominal buffer size worth of tuples has been emptied. This
guarantees that when buffer emptying begins, all the lower-level buffers
are at most half-full. In the worst case, where all the tuples cascade down
to the same lower-level buffer, that buffer therefore has enough space to
accommodate all the tuples emptied from the upper-level buffer. There is no
hard size limit in any of the data structures used, though, so this only needs
to be approximate; small overfilling of some buffers doesn't matter.
If an internal page that has a buffer associated with it is split, the buffer
needs to be split too. All tuples in the buffer are scanned through and
relocated to the correct sibling buffers, using the penalty function to decide
which buffer each tuple should go to.
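In outline, that relocation pass could be structured like the sketch below. It uses the buffer helpers declared in gist_private.h later in this commit; choose_best_sibling() is a hypothetical stand-in for the penalty comparison, not a function in the source:

#include "postgres.h"
#include "access/gist_private.h"
#include "nodes/pg_list.h"

/* Hypothetical helper: pick the sibling whose downlink gives the lowest
   penalty for this tuple (illustrative only). */
extern GISTNodeBuffer *choose_best_sibling(GISTSTATE *giststate,
                                           IndexTuple itup,
                                           List *siblingBuffers);

/* Sketch: drain the old buffer and push each tuple into the sibling buffer
   chosen by the penalty function. */
static void
relocate_buffer_tuples(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
                       GISTNodeBuffer *oldBuffer, List *siblingBuffers)
{
    IndexTuple itup;

    while (gistPopItupFromNodeBuffer(gfbb, oldBuffer, &itup))
    {
        GISTNodeBuffer *dst = choose_best_sibling(giststate, itup, siblingBuffers);

        gistPushItupToNodeBuffer(gfbb, dst, itup);
    }
}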
After all tuples from the heap have been processed, there are still some index
tuples in the buffers. At this point, final buffer emptying starts. All buffers
are emptied in top-down order. This is slightly complicated by the fact that
new buffers can be allocated during the emptying, due to page splits. However,
the new buffers will always be siblings of buffers that haven't been fully
emptied yet; tuples never move upwards in the tree. The final emptying loops
through buffers at a given level until all buffers at that level have been
emptied, and then moves down to the next level.
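Schematically, that final pass might look like the sketch below, assuming for illustration that buffersOnLevels[i] in GISTBuildBuffers holds the buffers of level i, and with buffer_has_tuples() and empty_one_buffer() as hypothetical stand-ins for the real emptying machinery:

#include "postgres.h"
#include "access/gist_private.h"
#include "nodes/pg_list.h"

/* Hypothetical helpers -- they stand in for the real emptying code. */
extern bool buffer_has_tuples(GISTNodeBuffer *buf);
extern void empty_one_buffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *buf);

/* Sketch of the final, top-down emptying phase. */
static void
final_emptying(GISTBuildBuffers *gfbb)
{
    int level;

    for (level = gfbb->buffersOnLevelsLen - 1; level >= 0; level--)
    {
        bool emptied_something;

        do
        {
            ListCell *lc;

            emptied_something = false;
            foreach(lc, gfbb->buffersOnLevels[level])
            {
                GISTNodeBuffer *buf = (GISTNodeBuffer *) lfirst(lc);

                /* re-scan: page splits may have added sibling buffers */
                if (buffer_has_tuples(buf))
                {
                    empty_one_buffer(gfbb, buf);
                    emptied_something = true;
                }
            }
        } while (emptied_something);
    }
}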
Authors:
Teodor Sigaev <teodor@sigaev.ru>
......
......@@ -667,13 +667,30 @@ gistoptions(PG_FUNCTION_ARGS)
{
Datum reloptions = PG_GETARG_DATUM(0);
bool validate = PG_GETARG_BOOL(1);
bytea *result;
relopt_value *options;
GiSTOptions *rdopts;
int numoptions;
static const relopt_parse_elt tab[] = {
{"fillfactor", RELOPT_TYPE_INT, offsetof(GiSTOptions, fillfactor)},
{"buffering", RELOPT_TYPE_STRING, offsetof(GiSTOptions, bufferingModeOffset)}
};
result = default_reloptions(reloptions, validate, RELOPT_KIND_GIST);
options = parseRelOptions(reloptions, validate, RELOPT_KIND_GIST,
&numoptions);
/* if none set, we're done */
if (numoptions == 0)
PG_RETURN_NULL();
rdopts = allocateReloptStruct(sizeof(GiSTOptions), options, numoptions);
fillRelOptions((void *) rdopts, sizeof(GiSTOptions), options, numoptions,
validate, tab, lengthof(tab));
pfree(options);
PG_RETURN_BYTEA_P(rdopts);
if (result)
PG_RETURN_BYTEA_P(result);
PG_RETURN_NULL();
}
/*
......
......@@ -263,7 +263,8 @@ gistRedoPageSplitRecord(XLogRecPtr lsn, XLogRecord *record)
else
GistPageGetOpaque(page)->rightlink = xldata->origrlink;
GistPageGetOpaque(page)->nsn = xldata->orignsn;
if (i < xlrec.data->npage - 1 && !isrootsplit)
if (i < xlrec.data->npage - 1 && !isrootsplit &&
xldata->markfollowright)
GistMarkFollowRight(page);
else
GistClearFollowRight(page);
......@@ -411,7 +412,7 @@ XLogRecPtr
gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
SplitedPageLayout *dist,
BlockNumber origrlink, GistNSN orignsn,
Buffer leftchildbuf)
Buffer leftchildbuf, bool markfollowright)
{
XLogRecData *rdata;
gistxlogPageSplit xlrec;
......@@ -433,6 +434,7 @@ gistXLogSplit(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
xlrec.npage = (uint16) npage;
xlrec.leftchild =
BufferIsValid(leftchildbuf) ? BufferGetBlockNumber(leftchildbuf) : InvalidBlockNumber;
xlrec.markfollowright = markfollowright;
rdata[0].data = (char *) &xlrec;
rdata[0].len = sizeof(gistxlogPageSplit);
......
......@@ -17,13 +17,31 @@
#include "access/gist.h"
#include "access/itup.h"
#include "storage/bufmgr.h"
#include "storage/buffile.h"
#include "utils/rbtree.h"
#include "utils/hsearch.h"
/* Buffer lock modes */
#define GIST_SHARE BUFFER_LOCK_SHARE
#define GIST_EXCLUSIVE BUFFER_LOCK_EXCLUSIVE
#define GIST_UNLOCK BUFFER_LOCK_UNLOCK
typedef struct
{
BlockNumber prev; /* block # of this buffer's previous page in the temp file */
uint32 freespace; /* bytes still free on this page */
char tupledata[1]; /* index tuple data begins here */
} GISTNodeBufferPage;
#define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
/* Returns free space in node buffer page */
#define PAGE_FREE_SPACE(nbp) (nbp->freespace)
/* Checks if node buffer page is empty */
#define PAGE_IS_EMPTY(nbp) (nbp->freespace == BLCKSZ - BUFFER_PAGE_DATA_OFFSET)
/* Checks if a node buffer page doesn't have enough free space for an index tuple */
#define PAGE_NO_SPACE(nbp, itup) (PAGE_FREE_SPACE(nbp) < \
MAXALIGN(IndexTupleSize(itup)))
/*
* GISTSTATE: information needed for any GiST index operation
*
......@@ -170,6 +188,7 @@ typedef struct gistxlogPageSplit
BlockNumber leftchild; /* like in gistxlogPageUpdate */
uint16 npage; /* # of pages in the split */
bool markfollowright; /* set F_FOLLOW_RIGHT flags */
/*
* follow: 1. gistxlogPage and array of IndexTupleData per page
......@@ -279,13 +298,149 @@ typedef struct
#define GistTupleIsInvalid(itup) ( ItemPointerGetOffsetNumber( &((itup)->t_tid) ) == TUPLE_IS_INVALID )
#define GistTupleSetValid(itup) ItemPointerSetOffsetNumber( &((itup)->t_tid), TUPLE_IS_VALID )
/*
* A buffer attached to an internal node, used when building an index in
* buffering mode.
*/
typedef struct
{
BlockNumber nodeBlocknum; /* index block # this buffer is for */
int32 blocksCount; /* current # of blocks occupied by buffer */
BlockNumber pageBlocknum; /* temporary file block # */
GISTNodeBufferPage *pageBuffer; /* in-memory buffer page */
/* is this buffer queued for emptying? */
bool queuedForEmptying;
struct GISTBufferingInsertStack *path;
} GISTNodeBuffer;
/*
* Does specified level have buffers? (Beware of multiple evaluation of
* arguments.)
*/
#define LEVEL_HAS_BUFFERS(nlevel, gfbb) \
((nlevel) != 0 && (nlevel) % (gfbb)->levelStep == 0 && \
(nlevel) != (gfbb)->rootitem->level)
/* Is specified buffer at least half-filled (should be queued for emptying)? */
#define BUFFER_HALF_FILLED(nodeBuffer, gfbb) \
((nodeBuffer)->blocksCount > (gfbb)->pagesPerBuffer / 2)
/*
* Is specified buffer full? Our buffers can actually grow indefinitely,
* beyond the "maximum" size, so this just means whether the buffer has grown
* beyond the nominal maximum size.
*/
#define BUFFER_OVERFLOWED(nodeBuffer, gfbb) \
((nodeBuffer)->blocksCount > (gfbb)->pagesPerBuffer)
/*
* Extended GISTInsertStack for buffering GiST index build.
*/
typedef struct GISTBufferingInsertStack
{
/* current page */
BlockNumber blkno;
/* offset of the downlink in the parent page that points to this page */
OffsetNumber downlinkoffnum;
/* pointer to parent */
struct GISTBufferingInsertStack *parent;
int refCount;
/* level number */
int level;
} GISTBufferingInsertStack;
/*
* Data structure with general information about build buffers.
*/
typedef struct GISTBuildBuffers
{
/* Persistent memory context for the buffers and metadata. */
MemoryContext context;
BufFile *pfile; /* Temporary file to store buffers in */
long nFileBlocks; /* Current size of the temporary file */
/*
* resizable array of free blocks.
*/
long *freeBlocks;
int nFreeBlocks; /* # of currently free blocks in the array */
int freeBlocksLen; /* current allocated length of the array */
/* Hash for buffers by block number */
HTAB *nodeBuffersTab;
/* List of buffers scheduled for emptying */
List *bufferEmptyingQueue;
/*
* Parameters to the buffering build algorithm. levelStep determines which
* levels in the tree have buffers, and pagesPerBuffer determines how
* large each buffer is.
*/
int levelStep;
int pagesPerBuffer;
/* Array of lists of buffers on each level, for final emptying */
List **buffersOnLevels;
int buffersOnLevelsLen;
/*
* Dynamically-sized array of buffers that currently have their last page
* loaded in main memory.
*/
GISTNodeBuffer **loadedBuffers;
int loadedBuffersCount; /* # of entries in loadedBuffers */
int loadedBuffersLen; /* allocated size of loadedBuffers */
/* A path item that points to the current root node */
GISTBufferingInsertStack *rootitem;
} GISTBuildBuffers;
/*
* Storage type for GiST's reloptions
*/
typedef struct GiSTOptions
{
int32 vl_len_; /* varlena header (do not touch directly!) */
int fillfactor; /* page fill factor in percent (0..100) */
int bufferingModeOffset; /* use buffering build? */
} GiSTOptions;
/* gist.c */
extern Datum gistbuild(PG_FUNCTION_ARGS);
extern Datum gistbuildempty(PG_FUNCTION_ARGS);
extern Datum gistinsert(PG_FUNCTION_ARGS);
extern MemoryContext createTempGistContext(void);
extern void initGISTstate(GISTSTATE *giststate, Relation index);
extern void freeGISTstate(GISTSTATE *giststate);
extern void gistdoinsert(Relation r,
IndexTuple itup,
Size freespace,
GISTSTATE *GISTstate);
/* A List of these is returned from gistplacetopage() in *splitinfo */
typedef struct
{
Buffer buf; /* the split page "half" */
IndexTuple downlink; /* downlink for this half. */
} GISTPageSplitInfo;
extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
Buffer buffer,
IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
Buffer leftchildbuf,
List **splitinfo,
bool markleftchild);
extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
int len, GISTSTATE *giststate);
......@@ -305,7 +460,7 @@ extern XLogRecPtr gistXLogSplit(RelFileNode node,
BlockNumber blkno, bool page_is_leaf,
SplitedPageLayout *dist,
BlockNumber origrlink, GistNSN oldnsn,
Buffer leftchild);
Buffer leftchild, bool markfollowright);
/* gistget.c */
extern Datum gistgettuple(PG_FUNCTION_ARGS);
......@@ -380,4 +535,27 @@ extern void gistSplitByKey(Relation r, Page page, IndexTuple *itup,
GistSplitVector *v, GistEntryVector *entryvec,
int attno);
/* gistbuild.c */
extern Datum gistbuild(PG_FUNCTION_ARGS);
extern void gistValidateBufferingOption(char *value);
extern void gistDecreasePathRefcount(GISTBufferingInsertStack *path);
/* gistbuildbuffers.c */
extern GISTBuildBuffers *gistInitBuildBuffers(int pagesPerBuffer, int levelStep,
int maxLevel);
extern GISTNodeBuffer *gistGetNodeBuffer(GISTBuildBuffers *gfbb,
GISTSTATE *giststate,
BlockNumber blkno, OffsetNumber downlinkoffnum,
GISTBufferingInsertStack *parent);
extern void gistPushItupToNodeBuffer(GISTBuildBuffers *gfbb,
GISTNodeBuffer *nodeBuffer, IndexTuple item);
extern bool gistPopItupFromNodeBuffer(GISTBuildBuffers *gfbb,
GISTNodeBuffer *nodeBuffer, IndexTuple *item);
extern void gistFreeBuildBuffers(GISTBuildBuffers *gfbb);
extern void gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb,
GISTSTATE *giststate, Relation r,
GISTBufferingInsertStack *path, Buffer buffer,
List *splitinfo);
extern void gistUnloadNodeBuffers(GISTBuildBuffers *gfbb);
#endif /* GIST_PRIVATE_H */