Commit 0ede57a1 authored by Robert Haas

Corrections and improvements to generic parallel query documentation.

David Rowley, reviewed by Brad DeJong, Amit Kapila, and me.

Discussion: http://postgr.es/m/CAKJS1f81fob-M6RJyTVv3SCasxMuQpj37ReNOJ=tprhwd7hAVg@mail.gmail.com
parent f10637eb
@@ -284,44 +284,41 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
 <para>
 The driving table may be joined to one or more other tables using nested
-loops or hash joins. The outer side of the join may be any kind of
+loops or hash joins. The inner side of the join may be any kind of
 non-parallel plan that is otherwise supported by the planner provided that
 it is safe to run within a parallel worker. For example, it may be an
-index scan which looks up a value based on a column taken from the inner
-table. Each worker will execute the outer side of the plan in full, which
-is why merge joins are not supported here. The outer side of a merge join
-will often involve sorting the entire inner table; even if it involves an
-index, it is unlikely to be productive to have multiple processes each
-conduct a full index scan of the inner table.
+index scan which looks up a value taken from the outer side of the join.
+Each worker will execute the inner side of the join in full, which for
+hash join means that an identical hash table is built in each worker
+process.
 </para>
 </sect2>
 <sect2 id="parallel-aggregation">
 <title>Parallel Aggregation</title>
 <para>
-It is not possible to perform the aggregation portion of a query entirely
-in parallel. For example, if a query involves selecting
-<literal>COUNT(*)</>, each worker could compute a total, but those totals
-would need to combined in order to produce a final answer. If the query
-involved a <literal>GROUP BY</> clause, a separate total would need to
-be computed for each group. Even though aggregation can't be done entirely
-in parallel, queries involving aggregation are often excellent candidates
-for parallel query, because they typically read many rows but return only
-a few rows to the client. Queries that return many rows to the client
-are often limited by the speed at which the client can read the data,
-in which case parallel query cannot help very much.
-</para>
-<para>
-<productname>PostgreSQL</> supports parallel aggregation by aggregating
-twice. First, each process participating in the parallel portion of the
-query performs an aggregation step, producing a partial result for each
-group of which that process is aware. This is reflected in the plan as
-a <literal>PartialAggregate</> node. Second, the partial results are
+<productname>PostgreSQL</> supports parallel aggregation by aggregating in
+two stages. First, each process participating in the parallel portion of
+the query performs an aggregation step, producing a partial result for
+each group of which that process is aware. This is reflected in the plan
+as a <literal>Partial Aggregate</> node. Second, the partial results are
 transferred to the leader via the <literal>Gather</> node. Finally, the
 leader re-aggregates the results across all workers in order to produce
 the final result. This is reflected in the plan as a
-<literal>FinalizeAggregate</> node.
+<literal>Finalize Aggregate</> node.
+</para>
+<para>
+Because the <literal>Finalize Aggregate</> node runs on the leader
+process, queries which produce a relatively large number of groups in
+comparison to the number of input rows will appear less favorable to the
+query planner. For example, in the worst-case scenario the number of
+groups seen by the <literal>Finalize Aggregate</> node could be as many as
+the number of input rows which were seen by all worker processes in the
+<literal>Partial Aggregate</> stage. For such cases, there is clearly
+going to be no performance benefit to using parallel aggregation. The
+query planner takes this into account during the planning process and is
+unlikely to choose parallel aggregate in this scenario.
 </para>
 <para>
@@ -330,10 +327,11 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
 have a combine function. If the aggregate has a transition state of type
 <literal>internal</>, it must have serialization and deserialization
 functions. See <xref linkend="sql-createaggregate"> for more details.
-Parallel aggregation is not supported for ordered set aggregates or when
-the query involves <literal>GROUPING SETS</>. It can only be used when
-all joins involved in the query are also part of the parallel portion
-of the plan.
+Parallel aggregation is not supported if any aggregate function call
+contains <literal>DISTINCT</> or <literal>ORDER BY</> clause and is also
+not supported for ordered set aggregates or when the query involves
+<literal>GROUPING SETS</>. It can only be used when all joins involved in
+the query are also part of the parallel portion of the plan.
 </para>
 </sect2>
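
For readers following the patch, the two-stage scheme described in the new Parallel Aggregation text can be sketched in a few lines of Python. This is only an illustration of the idea (a per-worker partial step, then a combine step on the leader), assuming a grouped COUNT(*); it is not PostgreSQL's implementation, and the names partial_aggregate, finalize_aggregate and worker_inputs are hypothetical.

```python
from collections import Counter

def partial_aggregate(rows):
    """Per-worker step: count the rows for each group key this worker saw."""
    return Counter(rows)

def finalize_aggregate(partials):
    """Leader step: combine the per-worker partial counts into final counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts for keys seen by several workers
    return dict(total)

# Hypothetical split of the input rows (group keys only) across three workers.
worker_inputs = [["a", "b", "a"], ["b", "c"], ["a"]]

partials = [partial_aggregate(rows) for rows in worker_inputs]
result = finalize_aggregate(partials)
```

When almost every row has its own group key, each partial result is about as large as its worker's input, so the leader still has to process nearly every row. That is the worst case the new paragraph describes.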