Commit c6521b1b authored by Tom Lane's avatar Tom Lane

Write some real documentation about the index access method API.

parent 67ff8009
<!--
Documentation of the system catalogs, directed toward PostgreSQL developers
$PostgreSQL: pgsql/doc/src/sgml/catalogs.sgml,v 2.95 2005/01/05 23:42:03 tgl Exp $
$PostgreSQL: pgsql/doc/src/sgml/catalogs.sgml,v 2.96 2005/02/13 03:04:15 tgl Exp $
-->
<chapter id="catalogs">
......@@ -289,9 +289,10 @@
</indexterm>
<para>
The catalog <structname>pg_am</structname> stores information about index access
methods. There is one row for each index access method supported by
the system.
The catalog <structname>pg_am</structname> stores information about index
access methods. There is one row for each index access method supported by
the system. The contents of this catalog are discussed in detail in
<xref linkend="indexam">.
</para>
<table>
......@@ -453,20 +454,6 @@
</tgroup>
</table>
<para>
An index access method that supports multiple columns (has
<structfield>amcanmulticol</structfield> true) <emphasis>must</>
support indexing null values in columns after the first, because the planner
will assume the index can be used for queries on just the first
column(s). For example, consider an index on (a,b) and a query with
<literal>WHERE a = 4</literal>. The system will assume the index can be used to scan for
rows with <literal>a = 4</literal>, which is wrong if the index omits rows where <literal>b</> is null.
It is, however, OK to omit rows where the first indexed column is null.
(GiST currently does so.)
<structfield>amindexnulls</structfield> should be set true only if the
index access method indexes all rows, including arbitrary combinations of null values.
</para>
</sect1>
......
<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.41 2005/01/10 00:04:38 tgl Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.42 2005/02/13 03:04:15 tgl Exp $ -->
<!entity history SYSTEM "history.sgml">
<!entity info SYSTEM "info.sgml">
......@@ -77,7 +77,7 @@
<!entity catalogs SYSTEM "catalogs.sgml">
<!entity geqo SYSTEM "geqo.sgml">
<!entity gist SYSTEM "gist.sgml">
<!entity indexcost SYSTEM "indexcost.sgml">
<!entity indexam SYSTEM "indexam.sgml">
<!entity nls SYSTEM "nls.sgml">
<!entity plhandler SYSTEM "plhandler.sgml">
<!entity protocol SYSTEM "protocol.sgml">
......
This diff is collapsed.
<!--
$PostgreSQL: pgsql/doc/src/sgml/indexcost.sgml,v 2.19 2005/01/22 22:06:17 momjian Exp $
-->
<chapter id="indexcost">
<title>Index Cost Estimation Functions</title>
<note>
<title>Author</title>
<para>
Written by Tom Lane (<email>tgl@sss.pgh.pa.us</email>) on 2000-01-24
</para>
</note>
<note>
<para>
This must eventually become part of a much larger chapter about
writing new index access methods.
</para>
</note>
<para>
Every index access method must provide a cost estimation function for
use by the planner/optimizer. The procedure OID of this function is
given in the <literal>amcostestimate</literal> field of the access
method's <literal>pg_am</literal> entry.
<note>
<para>
Prior to <productname>PostgreSQL</productname> 7.0, a different
scheme was used for registering
index-specific cost estimation functions.
</para>
</note>
</para>
<para>
The amcostestimate function is given a list of WHERE clauses that have
been determined to be usable with the index. It must return estimates
of the cost of accessing the index and the selectivity of the WHERE
clauses (that is, the fraction of main-table rows that will be
retrieved during the index scan). For simple cases, nearly all the
work of the cost estimator can be done by calling standard routines
in the optimizer; the point of having an amcostestimate function is
to allow index access methods to provide index-type-specific knowledge,
in case it is possible to improve on the standard estimates.
</para>
<para>
Each amcostestimate function must have the signature:
<programlisting>
void
amcostestimate (Query *root,
RelOptInfo *rel,
IndexOptInfo *index,
List *indexQuals,
Cost *indexStartupCost,
Cost *indexTotalCost,
Selectivity *indexSelectivity,
double *indexCorrelation);
</programlisting>
The first four parameters are inputs:
<variablelist>
<varlistentry>
<term>root</term>
<listitem>
<para>
The query being processed.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>rel</term>
<listitem>
<para>
The relation the index is on.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>index</term>
<listitem>
<para>
The index itself.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>indexQuals</term>
<listitem>
<para>
List of index qual clauses (implicitly ANDed);
a NIL list indicates no qualifiers are available.
</para>
</listitem>
</varlistentry>
</variablelist>
</para>
<para>
The last four parameters are pass-by-reference outputs:
<variablelist>
<varlistentry>
<term>*indexStartupCost</term>
<listitem>
<para>
Set to cost of index start-up processing
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>*indexTotalCost</term>
<listitem>
<para>
Set to total cost of index processing
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>*indexSelectivity</term>
<listitem>
<para>
Set to index selectivity
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>*indexCorrelation</term>
<listitem>
<para>
Set to correlation coefficient between index scan order and
underlying table's order
</para>
</listitem>
</varlistentry>
</variablelist>
</para>
<para>
Note that cost estimate functions must be written in C, not in SQL or
any available procedural language, because they must access internal
data structures of the planner/optimizer.
</para>
<para>
The index access costs should be computed in the units used by
<filename>src/backend/optimizer/path/costsize.c</filename>: a sequential disk block fetch
has cost 1.0, a nonsequential fetch has cost random_page_cost, and
the cost of processing one index row should usually be taken as
cpu_index_tuple_cost (which is a user-adjustable optimizer parameter).
In addition, an appropriate multiple of cpu_operator_cost should be charged
for any comparison operators invoked during index processing (especially
evaluation of the indexQuals themselves).
</para>
<para>
The access costs should include all disk and CPU costs associated with
scanning the index itself, but NOT the costs of retrieving or processing
the main-table rows that are identified by the index.
</para>
<para>
The <quote>start-up cost</quote> is the part of the total scan cost that must be expended
before we can begin to fetch the first row. For most indexes this can
be taken as zero, but an index type with a high start-up cost might want
to set it nonzero.
</para>
<para>
The indexSelectivity should be set to the estimated fraction of the main
table rows that will be retrieved during the index scan. In the case
of a lossy index, this will typically be higher than the fraction of
rows that actually pass the given qual conditions.
</para>
<para>
The indexCorrelation should be set to the correlation (ranging between
-1.0 and 1.0) between the index order and the table order. This is used
to adjust the estimate for the cost of fetching rows from the main
table.
</para>
<procedure>
<title>Cost Estimation</title>
<para>
A typical cost estimator will proceed as follows:
</para>
<step>
<para>
Estimate and return the fraction of main-table rows that will be visited
based on the given qual conditions. In the absence of any index-type-specific
knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:
<programlisting>
*indexSelectivity = clauselist_selectivity(root, indexQuals,
rel-&gt;relid, JOIN_INNER);
</programlisting>
</para>
</step>
<step>
<para>
Estimate the number of index rows that will be visited during the
scan. For many index types this is the same as indexSelectivity times
the number of rows in the index, but it might be more. (Note that the
index's size in pages and rows is available from the IndexOptInfo struct.)
</para>
</step>
<step>
<para>
Estimate the number of index pages that will be retrieved during the scan.
This might be just indexSelectivity times the index's size in pages.
</para>
</step>
<step>
<para>
Compute the index access cost. A generic estimator might do this:
<programlisting>
/*
* Our generic assumption is that the index pages will be read
* sequentially, so they have cost 1.0 each, not random_page_cost.
* Also, we charge for evaluation of the indexquals at each index row.
* All the costs are assumed to be paid incrementally during the scan.
*/
cost_qual_eval(&amp;index_qual_cost, indexQuals);
*indexStartupCost = index_qual_cost.startup;
*indexTotalCost = numIndexPages +
(cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
</programlisting>
</para>
</step>
<step>
<para>
Estimate the index correlation. For a simple ordered index on a single
field, this can be retrieved from pg_statistic. If the correlation
is not known, the conservative estimate is zero (no correlation).
</para>
</step>
</procedure>
<para>
Examples of cost estimator functions can be found in
<filename>src/backend/utils/adt/selfuncs.c</filename>.
</para>
<para>
By convention, the <literal>pg_proc</literal> entry for an
<literal>amcostestimate</literal> function should show
eight arguments all declared as <type>internal</> (since none of them have
types that are known to SQL), and the return type is <type>void</>.
</para>
</chapter>
<!-- Keep this comment at the end of the file
Local variables:
mode:sgml
sgml-omittag:nil
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document:nil
sgml-default-dtd-file:"./reference.ced"
sgml-exposed-tags:nil
sgml-local-catalogs:("/usr/lib/sgml/catalog")
sgml-local-ecat-files:nil
End:
-->
<!--
$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.73 2005/01/10 00:04:38 tgl Exp $
$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.74 2005/02/13 03:04:15 tgl Exp $
-->
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.2//EN" [
......@@ -235,7 +235,7 @@ $PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.73 2005/01/10 00:04:38 tgl Exp
&nls;
&plhandler;
&geqo;
&indexcost;
&indexam;
&gist;
&storage;
&bki;
......
<!--
$PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.38 2005/01/23 00:30:18 momjian Exp $
$PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.39 2005/02/13 03:04:15 tgl Exp $
-->
<sect1 id="xindex">
......@@ -43,7 +43,7 @@ $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.38 2005/01/23 00:30:18 momjian E
described in <classname>pg_am</classname>. It is possible to add a
new index method by defining the required interface routines and
then creating a row in <classname>pg_am</classname> &mdash; but that is
far beyond the scope of this chapter.
beyond the scope of this chapter (see <xref linkend="indexam">).
</para>
<para>
......@@ -514,7 +514,7 @@ CREATE OPERATOR &lt; (
<listitem>
<para>
Although <productname>PostgreSQL</productname> can cope with
functions having the same name as long as they have different
functions having the same SQL name as long as they have different
argument data types, C can only cope with one global function
having a given name. So we shouldn't name the C function
something simple like <filename>abs_eq</filename>. Usually it's
......@@ -525,14 +525,12 @@ CREATE OPERATOR &lt; (
<listitem>
<para>
We could have made the <productname>PostgreSQL</productname> name
We could have made the SQL name
of the function <filename>abs_eq</filename>, relying on
<productname>PostgreSQL</productname> to distinguish it by
argument data types from any other
<productname>PostgreSQL</productname> function of the same name.
argument data types from any other SQL function of the same name.
To keep the example simple, we make the function have the same
names at the C level and <productname>PostgreSQL</productname>
level.
names at the C level and SQL level.
</para>
</listitem>
</itemizedlist>
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment