From 8e144b09aed9244d70ea9c4c904f6a1ed90ed286 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut <peter_e@gmx.net> Date: Wed, 31 Oct 2001 20:38:26 +0000 Subject: [PATCH] More information about partial indexes, and some tips about examining index usage. --- doc/src/sgml/indices.sgml | 345 +++++++++++++++++++++++++++++++------- 1 file changed, 286 insertions(+), 59 deletions(-) diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml index 7ac0dedcfe..3aee1d3de7 100644 --- a/doc/src/sgml/indices.sgml +++ b/doc/src/sgml/indices.sgml @@ -1,4 +1,4 @@ -<!-- $Header: /cvsroot/pgsql/doc/src/sgml/indices.sgml,v 1.23 2001/09/09 17:21:59 petere Exp $ --> +<!-- $Header: /cvsroot/pgsql/doc/src/sgml/indices.sgml,v 1.24 2001/10/31 20:38:26 petere Exp $ --> <chapter id="indexes"> <title id="indexes-title">Indexes</title> @@ -68,7 +68,7 @@ CREATE INDEX test1_id_index ON test1 (id); <para> To remove an index, use the <command>DROP INDEX</command> command. - Indexes can be added and removed from tables at any time. + Indexes can be added to and removed from tables at any time. </para> <para> @@ -204,11 +204,11 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> <sect1 id="indexes-multicolumn"> - <title>Multi-Column Indexes</title> + <title>Multicolumn Indexes</title> <indexterm zone="indexes-multicolumn"> <primary>indexes</primary> - <secondary>multi-column</secondary> + <secondary>multicolumn</secondary> </indexterm> <para> @@ -235,14 +235,14 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor); </para> <para> - Currently, only the B-tree implementation supports multi-column + Currently, only the B-tree implementation supports multicolumn indexes. Up to 16 columns may be specified. (This limit can be altered when building <productname>Postgres</productname>; see the file <filename>pg_config.h</filename>.) 
</para> <para> - The query optimizer can use a multi-column index for queries that + The query optimizer can use a multicolumn index for queries that involve the first <parameter>n</parameter> consecutive columns in the index (when used with appropriate operators), up to the total number of columns specified in the index definition. For example, @@ -258,7 +258,7 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor); </para> <para> - Multi-column indexes can only be used if the clauses involving the + Multicolumn indexes can only be used if the clauses involving the indexed columns are joined with <literal>AND</literal>. For instance, <programlisting> SELECT name FROM test2 WHERE major = <replaceable>constant</replaceable> OR minor = <replaceable>constant</replaceable>; @@ -269,7 +269,7 @@ SELECT name FROM test2 WHERE major = <replaceable>constant</replaceable> OR mino </para> <para> - Multi-column indexes should be used sparingly. Most of the time, + Multicolumn indexes should be used sparingly. Most of the time, an index on a single column is sufficient and saves space and time. Indexes with more than three columns are almost certainly inappropriate. @@ -304,11 +304,19 @@ CREATE UNIQUE INDEX <replaceable>name</replaceable> ON <replaceable>table</repla <productname>PostgreSQL</productname> automatically creates unique indexes when a table is declared with a unique constraint or a primary key, on the columns that make up the primary key or unique - columns (a multi-column index, if appropriate), to enforce that + columns (a multicolumn index, if appropriate), to enforce that constraint. A unique index can be added to a table at any later - time, to add a unique constraint. (But a primary key cannot be - added after table creation.) + time, to add a unique constraint. </para> + + <note> + <para> + The preferred way to add a unique constraint to a table is + <literal>ALTER TABLE ... ADD CONSTRAINT</literal>. 
The use of + indexes to enforce unique constraints could be considered an + implementation detail that should not be accessed directly. + </para> + </note> </sect1> @@ -346,7 +354,7 @@ CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1)); argument, but they must be table columns, not constants. Functional indexes are always single-column (namely, the function result) even if the function uses more than one input field; there - cannot be multi-column indexes that contain function calls. + cannot be multicolumn indexes that contain function calls. </para> <tip> @@ -580,71 +588,290 @@ CREATE MEMSTORE ON <replaceable>table</replaceable> COLUMNS <replaceable>cols</r </sect1> - <sect1 id="partial-index"> - <title id="partial-index-title">Partial Indexes</title> + <sect1 id="indexes-partial"> + <title>Partial Indexes</title> - <indexterm zone="partial-index"> + <indexterm zone="indexes-partial"> <primary>indexes</primary> <secondary>partial</secondary> </indexterm> - <note> - <title>Author</title> - <para> - This is from a reply to a question on the email list - by Paul M. Aoki (<email>aoki@CS.Berkeley.EDU</email>) - on 1998-08-11. -<!-- - Paul M. Aoki | University of California at Berkeley - aoki@CS.Berkeley.EDU | Dept. of EECS, Computer Science Division #1776 - | Berkeley, CA 94720-1776 ---> - </para> - </note> + <para> + A <firstterm>partial index</firstterm> is an index built over a + subset of a table; the subset is defined by a conditional + expression (called the <firstterm>predicate</firstterm> of the + partial index). + </para> + + <para> + A major motivation for partial indexes is to avoid indexing common + values. Since a query conditionalized on a common value will not + use the index anyway, there is no point in keeping those rows in the + index at all. This reduces the size of the index, which will speed + up queries that do use the index. 
It will also speed up many data + manipulation operations because the index does not need to be + updated in all cases. <xref linkend="indexes-partial-ex1"> shows a + possible application of this idea. + </para> + + <example id="indexes-partial-ex1"> + <title>Setting up a Partial Index to Exclude Common Values</title> + + <para> + Suppose you are storing web server access logs in a database. + Most accesses originate from the IP range of your organization but + some are from elsewhere (say, employees on dial-up connections). + So you do not want to index the IP range that corresponds to your + organization's subnet. + </para> + + <para> + Assume a table like this: +<programlisting> +CREATE TABLE access_log ( + url varchar, + client_ip inet, + ... +); +</programlisting> + </para> + + <para> + To create a partial index that suits our example, use a command + such as this: +<programlisting> +CREATE INDEX access_log_client_ip_ix ON access_log (client_ip) + WHERE NOT (client_ip > inet '192.168.100.0' AND client_ip < inet '192.168.100.255'); +</programlisting> + </para> <para> - A <firstterm>partial index</firstterm> - is an index built over a subset of a table; the subset is defined by - a predicate. <productname>Postgres</productname> - supports partial indexes with arbitrary - predicates. I believe IBM's <productname>DB2</productname> - for AS/400 supports partial indexes - using single-clause predicates. + A typical query that can use this index would be: +<programlisting> +SELECT * FROM access_log WHERE url = '/index.html' AND client_ip = inet '212.78.10.32'; +</programlisting> + A query that cannot use this index is: +<programlisting> +SELECT * FROM access_log WHERE client_ip = inet '192.168.100.23'; +</programlisting> </para> <para> - The main motivation for partial indexes is this: - if all of the queries you ask that can - profitably use an index fall into a certain range, why build an index - over the whole table and suffer the associated space/time costs? 
+ Observe that this kind of partial index requires that the common + values be actively tracked. If the distribution of values is + inherent (due to the nature of the application) and static (does + not change), this is not difficult, but if the common values are + merely due to the coincidental data load, this can require a lot of + maintenance work. + </para> + </example> + + <para> + Another possibility is to exclude values from the index that the + typical query workload is not interested in; this is shown in <xref + linkend="indexes-partial-ex2">. This results in the same + advantages as listed above, but it prevents the + <quote>uninteresting</quote> values from being accessed via an + index at all, even if an index scan might be profitable in that + case. Obviously, setting up partial indexes for this kind of + scenario will require a lot of care and experimentation. + </para> - (There are other reasons too; see - <xref endterm="STON89b" linkend="STON89b-full"> for details.) + <example id="indexes-partial-ex2"> + <title>Setting up a Partial Index to Exclude Uninteresting Values</title> + + <para> + If you have a table that contains both billed and unbilled orders + where the unbilled orders take up a small fraction of the total + table and yet that is an often-used section, you can improve + performance by creating an index on just that portion. The + command to create the index would look like this: +<programlisting> +CREATE INDEX orders_unbilled_index ON orders (order_nr) + WHERE billed is not true; +</programlisting> </para> <para> - The machinery to build, update and query partial indexes isn't too - bad. The hairy parts are index selection (which indexes do I build?) - and query optimization (which indexes do I use?); i.e., the parts - that involve deciding what predicate(s) match the workload/query in - some useful way. 
For those who are into database theory, the problems - are basically analogous to the corresponding materialized view - problems, albeit with different cost parameters and formulas. These - are, in the general case, hard problems for the standard ordinal - <acronym>SQL</acronym> - types; they're super-hard problems with black-box extension types, - because the selectivity estimation technology is so crude. + A possible query to use this index would be +<programlisting> +SELECT * FROM orders WHERE billed is not true AND order_nr < 10000; +</programlisting> + However, the index can also be used in queries that do not involve + <structfield>order_nr</> at all, e.g., +<programlisting> +SELECT * FROM orders WHERE billed is not true AND amount > 5000.00; +</programlisting> + This is not as efficient as a partial index on the + <structfield>amount</> column would be, since the system has to + scan the entire index in any case. </para> <para> - Check <xref endterm="STON89b" linkend="STON89b-full">, - <xref endterm="OLSON93" linkend="OLSON93-full">, - and - <xref endterm="SESHADRI95" linkend="SESHADRI95-full"> - for more information. + Note that this query cannot use this index: +<programlisting> +SELECT * FROM orders WHERE order_nr = 3501; +</programlisting> + The order 3501 may be among the billed or among the unbilled + orders. </para> - </sect1> - </chapter> + </example> + + <para> + <xref linkend="indexes-partial-ex2"> also illustrates that the + indexed column and the column used in the predicate do not need to + match. <productname>PostgreSQL</productname> supports partial + indexes with arbitrary predicates, as long as only columns of the + table being indexed are involved. However, keep in mind that the + predicate must actually match the condition used in the query that + is supposed to benefit from the index. 
+ <productname>PostgreSQL</productname> does not have a sophisticated + theorem prover that can recognize mathematically equivalent + predicates that are written in different forms. (Not + only is such a general theorem prover extremely difficult to + create, it would probably be too slow to be of any real use.) + </para> + + <para> + Finally, a partial index can also be used to override the system's + query plan choices. It may occur that data sets with peculiar + distributions will cause the system to use an index when it really + should not. In that case the index can be set up so that it is not + available for the offending query. Normally, + <productname>PostgreSQL</> makes reasonable choices about index + usage (e.g., it avoids them when retrieving common values, so the + earlier example really only saves index size, it is not required to + avoid index usage), and grossly incorrect plan choices are cause + for a bug report. + </para> + + <para> + Keep in mind that setting up a partial index indicates that you + know at least as much as the query planner knows, in particular you + know when an index might be profitable. Forming this knowledge + requires experience and understanding of how indexes in + <productname>PostgreSQL</> work. In most cases, the advantage of a + partial index over a regular index will not be much. + </para> + + <para> + More information about partial indexes can be found in <xref + endterm="STON89b" linkend="STON89b-full">, <xref endterm="OLSON93" + linkend="OLSON93-full">, and <xref endterm="SESHADRI95" + linkend="SESHADRI95-full">. + </para> + </sect1> + + <sect1 id="indexes-examine"> + <title>Examining Index Usage</title> + + <para> + Although indexes in <productname>PostgreSQL</> do not need + maintenance and tuning, it is still important that it is checked + which indexes are actually used by the real-life query workload. 
+ Examining index usage is done with the <command>EXPLAIN</> command; + its application for this purpose is illustrated in <xref + linkend="using-explain">. + </para> + + <para> + It is difficult to formulate a general procedure for determining + which indexes to set up. There are a number of typical cases that + have been shown in the examples throughout the previous sections. + A good deal of experimentation will be necessary in most cases. + The rest of this section gives some tips for that. + </para> + + <itemizedlist> + <listitem> + <para> + Always run <command>ANALYZE</command> first. This command + collects statistics about the distribution of the values in the + table. This information is required to guess the number of rows + returned by a query, which is needed by the planner to assign + realistic costs to each possible query plan. In absence of any + real statistics, some default values are assumed, which are + almost certain to be inaccurate. Examining an application's + index usage without having run <command>ANALYZE</command> is + therefore a lost cause. + </para> + </listitem> + + <listitem> + <para> + Use real data for experimentation. Using test data for setting + up indexes will tell you what indexes you need for the test data, + but that is all. + </para> + + <para> + It is especially fatal to use proportionally reduced data sets. + While selecting 1000 out of 100000 rows could be a candidate for + an index, selecting 1 out of 100 rows will hardly be, because the + 100 rows will probably fit within a single disk page, and there + is no plan that can beat sequentially fetching 1 disk page. + </para> + + <para> + Also be careful when making up test data, which is often + unavoidable when the application is not in production use yet. + Values that are very similar, completely random, or inserted in + sorted order will skew the statistics away from the distribution + that real data would have. 
+ </para> + </listitem> + + <listitem> + <para> + When indexes are not used, it can be useful for testing to force + their use. There are run-time parameters that can turn off + various plan types (described in the <citetitle>Administrator's + Guide</citetitle>). For instance, turning off sequential scans + (<varname>enable_seqscan</>) and nested-loop joins + (<varname>enable_nestloop</>), which are the most basic plans, + will force the system to use a different plan. If the system + still chooses a sequential scan or nested-loop join, then there is + probably a more fundamental reason why the index is not + used; for example, the query condition does not match the index. + (What kind of query can use what kind of index is explained in + the previous sections.) + </para> + </listitem> + + <listitem> + <para> + If forcing index usage does use the index, then there are two + possibilities: Either the system is right and using the index is + indeed not appropriate, or the cost estimates of the query plans + are not reflecting reality. So you should time your query with + and without indexes. The <command>EXPLAIN ANALYZE</command> + command can be useful here. + </para> + </listitem> + + <listitem> + <para> + If it turns out that the cost estimates are wrong, there are, + again, two possibilities. The total cost is computed from the + per-row costs of each plan node times the selectivity estimate of + the plan node. The costs of the plan nodes can be tuned with + run-time parameters (described in the <citetitle>Administrator's + Guide</citetitle>). An inaccurate selectivity estimate is due to + insufficient statistics. It may be possible to improve this by + tuning the statistics-gathering parameters (see <command>ALTER + TABLE</command> reference). + </para> + + <para> + If you do not succeed in adjusting the costs to be more + appropriate, then you may have to resort to forcing index usage + explicitly. 
You may also want to contact the + <productname>PostgreSQL</> developers to examine the issue. + </para> + </listitem> + </itemizedlist> + </sect1> +</chapter> <!-- Keep this comment at the end of the file Local variables: -- 2.24.1