Commit e2f1eb0e authored by Robert Haas

Implement partition-wise grouping/aggregation.

If the partition keys of the input relation are part of the GROUP BY
clause, all the rows belonging to a given group come from a single
partition.  This allows aggregation/grouping over a partitioned
relation to be broken down into aggregation/grouping on each
partition.  This should be no worse, and often better, than the normal
approach.

If the GROUP BY clause does not contain all the partition keys, we can
still perform partial aggregation for each partition and then finalize
aggregation after appending the partial results.  This is less certain
to be a win, but it's still useful.

Jeevan Chalke, Ashutosh Bapat, Robert Haas.  The larger patch series
of which this patch is a part was also reviewed and tested by Antonin
Houska, Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin
Knizhnik, Pascal Legrand, and Rafia Sabih.

Discussion: http://postgr.es/m/CAM2+6=V64_xhstVHie0Rz=KPEQnLJMZt_e314P0jaT_oJ9MR8A@mail.gmail.com
parent 2058d6a2
...@@ -3821,6 +3821,26 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class=" ...@@ -3821,6 +3821,26 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry id="guc-enable-partitionwise-aggregate" xreflabel="enable_partitionwise_aggregate">
<term><varname>enable_partitionwise_aggregate</varname> (<type>boolean</type>)
<indexterm>
<primary><varname>enable_partitionwise_aggregate</varname> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
Enables or disables the query planner's use of partitionwise grouping
or aggregation, which allows grouping or aggregation on partitioned
tables to be performed separately for each partition. If the <literal>GROUP
BY</literal> clause does not include all of the partition keys, only partial
aggregation can be performed on a per-partition basis, and
finalization must be performed later. Because partitionwise grouping
or aggregation can use significantly more CPU time and memory during
planning, the default is <literal>off</literal>.
</para>
</listitem>
</varlistentry>
<varlistentry id="guc-enable-seqscan" xreflabel="enable_seqscan"> <varlistentry id="guc-enable-seqscan" xreflabel="enable_seqscan">
<term><varname>enable_seqscan</varname> (<type>boolean</type>) <term><varname>enable_seqscan</varname> (<type>boolean</type>)
<indexterm> <indexterm>
......
...@@ -1079,6 +1079,7 @@ busy for a long time to come. ...@@ -1079,6 +1079,7 @@ busy for a long time to come.
Partitionwise joins Partitionwise joins
------------------- -------------------
A join between two similarly partitioned tables can be broken down into joins A join between two similarly partitioned tables can be broken down into joins
between their matching partitions if there exists an equi-join condition between their matching partitions if there exists an equi-join condition
between the partition keys of the joining tables. The equi-join between between the partition keys of the joining tables. The equi-join between
...@@ -1102,3 +1103,16 @@ any two partitioned relations with same partitioning scheme point to the same ...@@ -1102,3 +1103,16 @@ any two partitioned relations with same partitioning scheme point to the same
PartitionSchemeData object. This reduces memory consumed by PartitionSchemeData object. This reduces memory consumed by
PartitionSchemeData objects and makes it easy to compare the partition schemes PartitionSchemeData objects and makes it easy to compare the partition schemes
of joining relations. of joining relations.
Partition-wise aggregates/grouping
----------------------------------
If the GROUP BY clause contains all of the partition keys, all the rows
that belong to a given group must come from a single partition; therefore,
aggregation can be done completely separately for each partition. Otherwise,
partial aggregates can be computed for each partition, and then finalized
after appending the results from the individual partitions. This technique of
breaking down aggregation or grouping over a partitioned relation into
aggregation or grouping over its partitions is called partitionwise
aggregation. Especially when the partition keys match the GROUP BY clause,
this can be significantly faster than the regular method.
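The two cases described in this README section can be illustrated with a small standalone sketch. It is not planner code: the `Row` and `GroupState` types, the `MAX_GROUPS` limit, and every function name below are hypothetical, and `sum(val)` stands in for an arbitrary aggregate. Each partition is aggregated on its own, and a combine step then merges the per-partition states. When the GROUP BY clause covers all of the partition keys, no key can appear in two partitions, so the combine step never merges anything and amounts to a plain append.

```c
#include <stddef.h>

#define MAX_GROUPS 16			/* hypothetical fixed capacity for the sketch */

/* Toy row: "key" plays the GROUP BY column, "val" the aggregated column. */
typedef struct
{
	int			key;
	long		val;
} Row;

/* Running state of sum(val) for one group. */
typedef struct
{
	int			key;
	long		sum;
} GroupState;

/* Find the slot for "key", creating it if needed; NULL when out of room. */
static GroupState *
find_or_add_group(GroupState *groups, int *ngroups, int key)
{
	for (int i = 0; i < *ngroups; i++)
		if (groups[i].key == key)
			return &groups[i];
	if (*ngroups == MAX_GROUPS)
		return NULL;
	groups[*ngroups].key = key;
	groups[*ngroups].sum = 0;
	return &groups[(*ngroups)++];
}

/* Step 1: aggregate the rows of one partition; returns group count or -1. */
static int
aggregate_partition(const Row *rows, size_t nrows, GroupState *out)
{
	int			n = 0;

	for (size_t i = 0; i < nrows; i++)
	{
		GroupState *g = find_or_add_group(out, &n, rows[i].key);

		if (g == NULL)
			return -1;
		g->sum += rows[i].val;
	}
	return n;
}

/* Step 2: fold one partition's states into the final result. */
static int
combine_partial(GroupState *final, int nfinal,
				const GroupState *partial, int npartial)
{
	for (int i = 0; i < npartial; i++)
	{
		GroupState *g = find_or_add_group(final, &nfinal, partial[i].key);

		if (g == NULL)
			return -1;
		g->sum += partial[i].sum;
	}
	return nfinal;
}

/* Look up the sum for a key; returns 0 if the group is absent. */
static long
group_sum(const GroupState *groups, int ngroups, int key)
{
	for (int i = 0; i < ngroups; i++)
		if (groups[i].key == key)
			return groups[i].sum;
	return 0;
}
```

To use the sketch, call `aggregate_partition` once per partition, then call `combine_partial` for each result, starting with `nfinal` set to 0. A key that occurs in more than one partition corresponds to the partial-aggregation case, where finalization has to merge states.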
...@@ -134,8 +134,6 @@ static void subquery_push_qual(Query *subquery, ...@@ -134,8 +134,6 @@ static void subquery_push_qual(Query *subquery,
static void recurse_push_qual(Node *setOp, Query *topquery, static void recurse_push_qual(Node *setOp, Query *topquery,
RangeTblEntry *rte, Index rti, Node *qual); RangeTblEntry *rte, Index rti, Node *qual);
static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel); static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);
static void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
/* /*
...@@ -1326,7 +1324,7 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, ...@@ -1326,7 +1324,7 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
* parameterization or ordering. Similarly it collects partial paths from * parameterization or ordering. Similarly it collects partial paths from
* non-dummy children to create partial append paths. * non-dummy children to create partial append paths.
*/ */
static void void
add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel, add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels) List *live_childrels)
{ {
...@@ -1413,8 +1411,12 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel, ...@@ -1413,8 +1411,12 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
* If child has an unparameterized cheapest-total path, add that to * If child has an unparameterized cheapest-total path, add that to
* the unparameterized Append path we are constructing for the parent. * the unparameterized Append path we are constructing for the parent.
* If not, there's no workable unparameterized path. * If not, there's no workable unparameterized path.
*
* With partitionwise aggregates, the child rel's pathlist may be
* empty, so don't assume that a path exists here.
*/ */
if (childrel->cheapest_total_path->param_info == NULL) if (childrel->pathlist != NIL &&
childrel->cheapest_total_path->param_info == NULL)
accumulate_append_subpath(childrel->cheapest_total_path, accumulate_append_subpath(childrel->cheapest_total_path,
&subpaths, NULL); &subpaths, NULL);
else else
...@@ -1682,6 +1684,13 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel, ...@@ -1682,6 +1684,13 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
RelOptInfo *childrel = (RelOptInfo *) lfirst(lcr); RelOptInfo *childrel = (RelOptInfo *) lfirst(lcr);
Path *subpath; Path *subpath;
if (childrel->pathlist == NIL)
{
/* failed to make a suitable path for this child */
subpaths_valid = false;
break;
}
subpath = get_cheapest_parameterized_child_path(root, subpath = get_cheapest_parameterized_child_path(root,
childrel, childrel,
required_outer); required_outer);
......
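The two guards added to add_paths_to_append_rel() above share one idea: an Append over child relations is only valid if every child actually produced a path. A minimal standalone sketch of that rule follows. The `ToyPath` and `ToyChildRel` types and the `collect_append_subpaths` function are hypothetical simplifications, with a NULL pointer standing in for `pathlist == NIL`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a planner Path. */
typedef struct ToyPath
{
	double		total_cost;
} ToyPath;

/* Hypothetical stand-in for a child RelOptInfo; NULL models an empty pathlist. */
typedef struct ToyChildRel
{
	const ToyPath *cheapest_total_path;
} ToyChildRel;

/*
 * Collect one subpath per child.  Returns false as soon as a child has no
 * path, mirroring "subpaths_valid = false; break;" in the hunk above.
 */
static bool
collect_append_subpaths(const ToyChildRel *children, size_t nchildren,
						const ToyPath **subpaths)
{
	for (size_t i = 0; i < nchildren; i++)
	{
		/* Check for a missing path before dereferencing anything. */
		if (children[i].cheapest_total_path == NULL)
			return false;
		subpaths[i] = children[i].cheapest_total_path;
	}
	return true;
}
```

The point of the ordering is the same as in the patched condition `childrel->pathlist != NIL && ...`: the emptiness test has to come first, so that the cheapest path is never dereferenced for a child that has none.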
...@@ -135,6 +135,7 @@ bool enable_mergejoin = true; ...@@ -135,6 +135,7 @@ bool enable_mergejoin = true;
bool enable_hashjoin = true; bool enable_hashjoin = true;
bool enable_gathermerge = true; bool enable_gathermerge = true;
bool enable_partitionwise_join = false; bool enable_partitionwise_join = false;
bool enable_partitionwise_aggregate = false;
bool enable_parallel_append = true; bool enable_parallel_append = true;
bool enable_parallel_hash = true; bool enable_parallel_hash = true;
......
...@@ -1670,7 +1670,15 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags) ...@@ -1670,7 +1670,15 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags)
subplan = create_plan_recurse(root, best_path->subpath, subplan = create_plan_recurse(root, best_path->subpath,
flags | CP_SMALL_TLIST); flags | CP_SMALL_TLIST);
plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys, NULL); /*
* make_sort_from_pathkeys() indirectly calls find_ec_member_for_tle(),
* which will ignore any child EC members that don't belong to the given
* relids. Thus, if this sort path is based on a child relation, we must
* pass its relids.
*/
plan = make_sort_from_pathkeys(subplan, best_path->path.pathkeys,
IS_OTHER_REL(best_path->subpath->parent) ?
best_path->path.parent->relids : NULL);
copy_generic_path_info(&plan->plan, (Path *) best_path); copy_generic_path_info(&plan->plan, (Path *) best_path);
......
This diff is collapsed.
...@@ -923,6 +923,15 @@ static struct config_bool ConfigureNamesBool[] = ...@@ -923,6 +923,15 @@ static struct config_bool ConfigureNamesBool[] =
false, false,
NULL, NULL, NULL NULL, NULL, NULL
}, },
{
{"enable_partitionwise_aggregate", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables partitionwise aggregation and grouping."),
NULL
},
&enable_partitionwise_aggregate,
false,
NULL, NULL, NULL
},
{ {
{"enable_parallel_append", PGC_USERSET, QUERY_TUNING_METHOD, {"enable_parallel_append", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of parallel append plans."), gettext_noop("Enables the planner's use of parallel append plans."),
......
...@@ -306,6 +306,7 @@ ...@@ -306,6 +306,7 @@
#enable_sort = on #enable_sort = on
#enable_tidscan = on #enable_tidscan = on
#enable_partitionwise_join = off #enable_partitionwise_join = off
#enable_partitionwise_aggregate = off
#enable_parallel_hash = on #enable_parallel_hash = on
# - Planner Cost Constants - # - Planner Cost Constants -
......
...@@ -553,6 +553,7 @@ typedef enum RelOptKind ...@@ -553,6 +553,7 @@ typedef enum RelOptKind
RELOPT_OTHER_MEMBER_REL, RELOPT_OTHER_MEMBER_REL,
RELOPT_OTHER_JOINREL, RELOPT_OTHER_JOINREL,
RELOPT_UPPER_REL, RELOPT_UPPER_REL,
RELOPT_OTHER_UPPER_REL,
RELOPT_DEADREL RELOPT_DEADREL
} RelOptKind; } RelOptKind;
...@@ -570,12 +571,15 @@ typedef enum RelOptKind ...@@ -570,12 +571,15 @@ typedef enum RelOptKind
(rel)->reloptkind == RELOPT_OTHER_JOINREL) (rel)->reloptkind == RELOPT_OTHER_JOINREL)
/* Is the given relation an upper relation? */ /* Is the given relation an upper relation? */
#define IS_UPPER_REL(rel) ((rel)->reloptkind == RELOPT_UPPER_REL) #define IS_UPPER_REL(rel) \
((rel)->reloptkind == RELOPT_UPPER_REL || \
(rel)->reloptkind == RELOPT_OTHER_UPPER_REL)
/* Is the given relation an "other" relation? */ /* Is the given relation an "other" relation? */
#define IS_OTHER_REL(rel) \ #define IS_OTHER_REL(rel) \
((rel)->reloptkind == RELOPT_OTHER_MEMBER_REL || \ ((rel)->reloptkind == RELOPT_OTHER_MEMBER_REL || \
(rel)->reloptkind == RELOPT_OTHER_JOINREL) (rel)->reloptkind == RELOPT_OTHER_JOINREL || \
(rel)->reloptkind == RELOPT_OTHER_UPPER_REL)
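The effect of the new RELOPT_OTHER_UPPER_REL kind on the two macros can be checked in isolation. The sketch below copies the macro bodies from this hunk and pairs them with a hypothetical `ToyRel` struct that carries only `reloptkind`. The first two enum members do not appear in the excerpt and are included here as an assumption about the full definition. `is_child_upper_rel` is a hypothetical helper, not part of the patch.

```c
#include <stdbool.h>

typedef enum RelOptKind
{
	RELOPT_BASEREL,				/* assumed; not shown in the excerpt */
	RELOPT_JOINREL,				/* assumed; not shown in the excerpt */
	RELOPT_OTHER_MEMBER_REL,
	RELOPT_OTHER_JOINREL,
	RELOPT_UPPER_REL,
	RELOPT_OTHER_UPPER_REL,
	RELOPT_DEADREL
} RelOptKind;

/* Hypothetical minimal relation: only the field the macros inspect. */
typedef struct ToyRel
{
	RelOptKind	reloptkind;
} ToyRel;

/* Is the given relation an upper relation? */
#define IS_UPPER_REL(rel) \
	((rel)->reloptkind == RELOPT_UPPER_REL || \
	 (rel)->reloptkind == RELOPT_OTHER_UPPER_REL)

/* Is the given relation an "other" relation? */
#define IS_OTHER_REL(rel) \
	((rel)->reloptkind == RELOPT_OTHER_MEMBER_REL || \
	 (rel)->reloptkind == RELOPT_OTHER_JOINREL || \
	 (rel)->reloptkind == RELOPT_OTHER_UPPER_REL)

/* Hypothetical helper: true only for a per-child upper relation. */
static bool
is_child_upper_rel(const ToyRel *rel)
{
	return IS_UPPER_REL(rel) && IS_OTHER_REL(rel);
}
```

Under these definitions RELOPT_OTHER_UPPER_REL is the only kind that satisfies both macros, which is what lets code such as the create_sort_plan() change test IS_OTHER_REL() on an upper relation.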
typedef struct RelOptInfo typedef struct RelOptInfo
{ {
...@@ -2291,6 +2295,73 @@ typedef struct JoinPathExtraData ...@@ -2291,6 +2295,73 @@ typedef struct JoinPathExtraData
Relids param_source_rels; Relids param_source_rels;
} JoinPathExtraData; } JoinPathExtraData;
/*
* Various flags indicating what kinds of grouping are possible.
*
* GROUPING_CAN_USE_SORT should be set if it's possible to perform
* sort-based implementations of grouping. When grouping sets are in use,
* this will be true if sorting is potentially usable for any of the grouping
* sets, even if it's not usable for all of them.
*
* GROUPING_CAN_USE_HASH should be set if it's possible to perform
* hash-based implementations of grouping.
*
* GROUPING_CAN_PARTIAL_AGG should be set if the aggregation is of a type
* for which we support partial aggregation (not, for example, grouping sets).
* It says nothing about parallel-safety or the availability of suitable paths.
*/
#define GROUPING_CAN_USE_SORT 0x0001
#define GROUPING_CAN_USE_HASH 0x0002
#define GROUPING_CAN_PARTIAL_AGG 0x0004
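Because the three GROUPING_CAN_* values are distinct bits, they can be combined in a single int and tested independently. The sketch below repeats the definitions from this hunk and adds two hypothetical helper functions, which are not part of the patch, to show the usual bit-test idiom.

```c
#include <stdbool.h>

#define GROUPING_CAN_USE_SORT		0x0001
#define GROUPING_CAN_USE_HASH		0x0002
#define GROUPING_CAN_PARTIAL_AGG	0x0004

/* Hypothetical helper: is at least one grouping strategy available? */
static bool
grouping_is_possible(int flags)
{
	return (flags & (GROUPING_CAN_USE_SORT | GROUPING_CAN_USE_HASH)) != 0;
}

/* Hypothetical helper: is partial aggregation supported for this query? */
static bool
partial_agg_is_possible(int flags)
{
	return (flags & GROUPING_CAN_PARTIAL_AGG) != 0;
}
```

As the comment in the hunk notes, GROUPING_CAN_PARTIAL_AGG says nothing about parallel safety or available paths, so a caller would still need separate checks for those.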
/*
* What kind of partitionwise aggregation is in use?
*
* PARTITIONWISE_AGGREGATE_NONE: Not used.
*
* PARTITIONWISE_AGGREGATE_FULL: Aggregate each partition separately, and
* append the results.
*
* PARTITIONWISE_AGGREGATE_PARTIAL: Partially aggregate each partition
* separately, append the results, and then finalize aggregation.
*/
typedef enum
{
PARTITIONWISE_AGGREGATE_NONE,
PARTITIONWISE_AGGREGATE_FULL,
PARTITIONWISE_AGGREGATE_PARTIAL
} PartitionwiseAggregateType;
/*
* Struct for extra information passed to subroutines of create_grouping_paths
*
* flags indicating what kinds of grouping are possible.
* partial_costs_set is true if the agg_partial_costs and agg_final_costs
* have been initialized.
* agg_partial_costs gives partial aggregation costs.
* agg_final_costs gives finalization costs.
* target is the PathTarget to be used while creating paths.
* target_parallel_safe is true if target is parallel safe.
* havingQual gives list of quals to be applied after aggregation.
* targetList gives list of columns to be projected.
* patype is the type of partitionwise aggregation that is being performed.
*/
typedef struct
{
/* Data which remains constant once set. */
int flags;
bool partial_costs_set;
AggClauseCosts agg_partial_costs;
AggClauseCosts agg_final_costs;
/* Data which may differ across partitions. */
PathTarget *target;
bool target_parallel_safe;
Node *havingQual;
List *targetList;
PartitionwiseAggregateType patype;
} GroupPathExtraData;
/* /*
* For speed reasons, cost estimation for join paths is performed in two * For speed reasons, cost estimation for join paths is performed in two
* phases: the first phase tries to quickly derive a lower bound for the * phases: the first phase tries to quickly derive a lower bound for the
......
...@@ -68,6 +68,7 @@ extern PGDLLIMPORT bool enable_mergejoin; ...@@ -68,6 +68,7 @@ extern PGDLLIMPORT bool enable_mergejoin;
extern PGDLLIMPORT bool enable_hashjoin; extern PGDLLIMPORT bool enable_hashjoin;
extern PGDLLIMPORT bool enable_gathermerge; extern PGDLLIMPORT bool enable_gathermerge;
extern PGDLLIMPORT bool enable_partitionwise_join; extern PGDLLIMPORT bool enable_partitionwise_join;
extern PGDLLIMPORT bool enable_partitionwise_aggregate;
extern PGDLLIMPORT bool enable_parallel_append; extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash; extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT int constraint_exclusion; extern PGDLLIMPORT int constraint_exclusion;
......
...@@ -236,5 +236,7 @@ extern bool has_useful_pathkeys(PlannerInfo *root, RelOptInfo *rel); ...@@ -236,5 +236,7 @@ extern bool has_useful_pathkeys(PlannerInfo *root, RelOptInfo *rel);
extern PathKey *make_canonical_pathkey(PlannerInfo *root, extern PathKey *make_canonical_pathkey(PlannerInfo *root,
EquivalenceClass *eclass, Oid opfamily, EquivalenceClass *eclass, Oid opfamily,
int strategy, bool nulls_first); int strategy, bool nulls_first);
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
#endif /* PATHS_H */ #endif /* PATHS_H */
This diff is collapsed.
...@@ -70,24 +70,25 @@ select count(*) >= 0 as ok from pg_prepared_xacts; ...@@ -70,24 +70,25 @@ select count(*) >= 0 as ok from pg_prepared_xacts;
-- This is to record the prevailing planner enable_foo settings during -- This is to record the prevailing planner enable_foo settings during
-- a regression test run. -- a regression test run.
select name, setting from pg_settings where name like 'enable%'; select name, setting from pg_settings where name like 'enable%';
name | setting name | setting
---------------------------+--------- --------------------------------+---------
enable_bitmapscan | on enable_bitmapscan | on
enable_gathermerge | on enable_gathermerge | on
enable_hashagg | on enable_hashagg | on
enable_hashjoin | on enable_hashjoin | on
enable_indexonlyscan | on enable_indexonlyscan | on
enable_indexscan | on enable_indexscan | on
enable_material | on enable_material | on
enable_mergejoin | on enable_mergejoin | on
enable_nestloop | on enable_nestloop | on
enable_parallel_append | on enable_parallel_append | on
enable_parallel_hash | on enable_parallel_hash | on
enable_partitionwise_join | off enable_partitionwise_aggregate | off
enable_seqscan | on enable_partitionwise_join | off
enable_sort | on enable_seqscan | on
enable_tidscan | on enable_sort | on
(15 rows) enable_tidscan | on
(16 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail -- more-or-less working. We can't test their contents in any great detail
......
...@@ -116,7 +116,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c ...@@ -116,7 +116,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare without_oid c
# ---------- # ----------
# Another group of parallel tests # Another group of parallel tests
# ---------- # ----------
test: identity partition_join partition_prune reloptions hash_part indexing test: identity partition_join partition_prune reloptions hash_part indexing partition_aggregate
# event triggers cannot run concurrently with any test that runs DDL # event triggers cannot run concurrently with any test that runs DDL
test: event_trigger test: event_trigger
......
...@@ -185,5 +185,6 @@ test: partition_prune ...@@ -185,5 +185,6 @@ test: partition_prune
test: reloptions test: reloptions
test: hash_part test: hash_part
test: indexing test: indexing
test: partition_aggregate
test: event_trigger test: event_trigger
test: stats test: stats
This diff is collapsed.
...@@ -884,6 +884,7 @@ GrantStmt ...@@ -884,6 +884,7 @@ GrantStmt
GrantTargetType GrantTargetType
Group Group
GroupPath GroupPath
GroupPathExtraData
GroupState GroupState
GroupVarInfo GroupVarInfo
GroupingFunc GroupingFunc
...@@ -1597,6 +1598,7 @@ PartitionScheme ...@@ -1597,6 +1598,7 @@ PartitionScheme
PartitionSpec PartitionSpec
PartitionTupleRouting PartitionTupleRouting
PartitionedChildRelInfo PartitionedChildRelInfo
PartitionwiseAggregateType
PasswordType PasswordType
Path Path
PathClauseUsage PathClauseUsage
......