Commit a929e17e authored by David Rowley

Allow run-time pruning on nested Append/MergeAppend nodes

Previously we only tagged on the required information to allow the
executor to perform run-time partition pruning for Append/MergeAppend
nodes belonging to base relations.  It was thought that nested
Append/MergeAppend nodes were just about always pulled up into the
top-level Append/MergeAppend and that making the run-time pruning info for
any sub Append/MergeAppend nodes was a waste of time.  However, that was
likely badly thought through.

Some examples of cases where we're unable to pull up nested
Append/MergeAppends are: 1) a mix of parallel and non-parallel paths
feeding into a Parallel Append node.  2) When planning an ordered Append
scan, an unordered sub-partition may require a nested MergeAppend path to
ensure its sub-partitions don't mix up the order of tuples being fed into
the top-level Append.

Unfortunately, it was not as simple as just removing the lines in
createplan.c which purposefully did not build the run-time pruning
info for anything but RELOPT_BASEREL relations.  The code in
add_paths_to_append_rel() was far too sloppy about which partitioned_rels
it included for the Append/MergeAppend paths.  The original code there
always assumed accumulate_append_subpath() would pull each sub-Append
and sub-MergeAppend path up into the top-level path.  While it does not
appear that there were any actual bugs caused by having the additional
partitioned table RT indexes recorded, it did mean that later in
planning, when we built the run-time pruning info, we wasted effort
building PartitionedRelPruneInfos for partitioned tables that had no
subpaths for the executor to run-time prune.

Here we tighten that up so that partitioned_rels only ever contains the RT
index for partitioned tables which actually have subpaths in the given
Append/MergeAppend.  We can now Assert that every PartitionedRelPruneInfo
has a non-empty present_parts.  That should allow us to catch any weird
corner cases that have been missed.

In passing, it seems there is no longer a good reason for the AppendPath
and MergeAppendPath's partitioned_rels fields to be a List of IntLists.
We can simply have a List of Relids instead.  This is more compact in
memory and faster to add new members to.  We still know which is the
root-level partition, as it always has a lower relid than its children.
Previously this field was used for more things, but run-time partition
pruning is now its only user, and that has no need for a List of
IntLists.

Here we also get rid of the RelOptInfo partitioned_child_rels field. This
is what was previously used to (sometimes incorrectly) set the
Append/MergeAppend path's partitioned_rels field.  That was the only usage
of that field, so we can happily just remove it.

I also couldn't resist changing some nearby code to make use of the newly
added for_each_from macro so we can skip the first element in the list
without checking if the current item was the first one on each
iteration.

A bug report from Andreas Kretschmer prompted all this work, however,
after some consideration, I'm not personally classing this as a bug fix.
So no backpatch.  In Andreas' test case, it just wasn't that clear that
there was a nested Append, since the top-level Append had only a single
sub-path which was pulled up a level, per 8edd0e79.

Author: David Rowley
Reviewed-by: Amit Langote
Discussion: https://postgr.es/m/flat/CAApHDvqSchs%2BubdybcfFaSPB%2B%2BEA7kqMaoqajtP0GtZvzOOR3g%40mail.gmail.com
Parent: dfc79773
@@ -2310,7 +2310,6 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_BITMAPSET_FIELD(top_parent_relids);
 	WRITE_BOOL_FIELD(partbounds_merged);
 	WRITE_BITMAPSET_FIELD(all_partrels);
-	WRITE_NODE_FIELD(partitioned_child_rels);
 }

 static void
...
This diff is collapsed.
@@ -1228,7 +1228,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
 	 * do partition pruning.
 	 */
 	if (enable_partition_pruning &&
-		rel->reloptkind == RELOPT_BASEREL &&
 		best_path->partitioned_rels != NIL)
 	{
 		List	   *prunequal;
@@ -1395,7 +1394,6 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
 	 * do partition pruning.
 	 */
 	if (enable_partition_pruning &&
-		rel->reloptkind == RELOPT_BASEREL &&
 		best_path->partitioned_rels != NIL)
 	{
 		List	   *prunequal;
...
@@ -257,7 +257,6 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
 	rel->all_partrels = NULL;
 	rel->partexprs = NULL;
 	rel->nullable_partexprs = NULL;
-	rel->partitioned_child_rels = NIL;

 	/*
 	 * Pass assorted information down the inheritance hierarchy.
@@ -672,7 +671,6 @@ build_join_rel(PlannerInfo *root,
 	joinrel->all_partrels = NULL;
 	joinrel->partexprs = NULL;
 	joinrel->nullable_partexprs = NULL;
-	joinrel->partitioned_child_rels = NIL;

 	/* Compute information relevant to the foreign relations. */
 	set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
@@ -850,7 +848,6 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
 	joinrel->all_partrels = NULL;
 	joinrel->partexprs = NULL;
 	joinrel->nullable_partexprs = NULL;
-	joinrel->partitioned_child_rels = NIL;

 	joinrel->top_parent_relids = bms_union(outer_rel->top_parent_relids,
 										   inner_rel->top_parent_relids);
...
@@ -141,7 +141,7 @@ typedef struct PruneStepResult
 static List *make_partitionedrel_pruneinfo(PlannerInfo *root,
 										   RelOptInfo *parentrel,
 										   int *relid_subplan_map,
-										   List *partitioned_rels, List *prunequal,
+										   Relids partrelids, List *prunequal,
 										   Bitmapset **matchedsubplans);
 static void gen_partprune_steps(RelOptInfo *rel, List *clauses,
 								PartClauseTarget target,
@@ -267,13 +267,13 @@ make_partition_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
 	prunerelinfos = NIL;
 	foreach(lc, partitioned_rels)
 	{
-		List	   *rels = (List *) lfirst(lc);
+		Relids		partrelids = (Relids) lfirst(lc);
 		List	   *pinfolist;
 		Bitmapset  *matchedsubplans = NULL;

 		pinfolist = make_partitionedrel_pruneinfo(root, parentrel,
 												  relid_subplan_map,
-												  rels, prunequal,
+												  partrelids, prunequal,
 												  &matchedsubplans);

 		/* When pruning is possible, record the matched subplans */
@@ -342,7 +342,7 @@ make_partition_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
 static List *
 make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
 							  int *relid_subplan_map,
-							  List *partitioned_rels, List *prunequal,
+							  Relids partrelids, List *prunequal,
 							  Bitmapset **matchedsubplans)
 {
 	RelOptInfo *targetpart = NULL;
@@ -351,6 +351,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
 	int		   *relid_subpart_map;
 	Bitmapset  *subplansfound = NULL;
 	ListCell   *lc;
+	int			rti;
 	int			i;

 	/*
@@ -364,9 +365,9 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
 	relid_subpart_map = palloc0(sizeof(int) * root->simple_rel_array_size);
 	i = 1;
-	foreach(lc, partitioned_rels)
+	rti = -1;
+	while ((rti = bms_next_member(partrelids, rti)) > 0)
 	{
-		Index		rti = lfirst_int(lc);
 		RelOptInfo *subpart = find_base_rel(root, rti);
 		PartitionedRelPruneInfo *pinfo;
 		List	   *partprunequal;
@@ -379,14 +380,11 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
 		 * Fill the mapping array.
 		 *
 		 * relid_subpart_map maps relid of a non-leaf partition to the index
-		 * in 'partitioned_rels' of that rel (which will also be the index in
-		 * the returned PartitionedRelPruneInfo list of the info for that
-		 * partition).  We use 1-based indexes here, so that zero can
-		 * represent an un-filled array entry.
+		 * in the returned PartitionedRelPruneInfo list of the info for that
+		 * partition.  We use 1-based indexes here, so that zero can represent
+		 * an un-filled array entry.
 		 */
 		Assert(rti < root->simple_rel_array_size);
-		/* No duplicates please */
-		Assert(relid_subpart_map[rti] == 0);
 		relid_subpart_map[rti] = i++;

 		/*
@@ -582,6 +580,13 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
 		present_parts = bms_add_member(present_parts, i);
 	}

+	/*
+	 * Ensure there were no stray PartitionedRelPruneInfo generated for
+	 * partitioned tables that we have no sub-paths or
+	 * sub-PartitionedRelPruneInfo for.
+	 */
+	Assert(!bms_is_empty(present_parts));
+
 	/* Record the maps and other information. */
 	pinfo->present_parts = present_parts;
 	pinfo->nparts = nparts;
...
@@ -601,9 +601,6 @@ typedef struct PartitionSchemeData *PartitionScheme;
  *		part_rels - RelOptInfos for each partition
  *		all_partrels - Relids set of all partition relids
  *		partexprs, nullable_partexprs - Partition key expressions
- *		partitioned_child_rels - RT indexes of unpruned partitions of
- *			this relation that are partitioned tables
- *			themselves, in hierarchical order
  *
  * The partexprs and nullable_partexprs arrays each contain
  * part_scheme->partnatts elements.  Each of the elements is a list of
@@ -751,7 +748,6 @@ typedef struct RelOptInfo
 	Relids		all_partrels;	/* Relids set of all partition relids */
 	List	  **partexprs;		/* Non-nullable partition key expressions */
 	List	  **nullable_partexprs; /* Nullable partition key expressions */
-	List	   *partitioned_child_rels; /* List of RT indexes */
 } RelOptInfo;

 /*
@@ -1401,8 +1397,9 @@ typedef struct CustomPath
 typedef struct AppendPath
 {
 	Path		path;
-	/* RT indexes of non-leaf tables in a partition tree */
-	List	   *partitioned_rels;
+	List	   *partitioned_rels;	/* List of Relids containing RT indexes of
+									 * non-leaf tables for each partition
+									 * hierarchy whose paths are in 'subpaths' */
 	List	   *subpaths;		/* list of component Paths */
 	/* Index of first partial path in subpaths; list_length(subpaths) if none */
 	int			first_partial_path;
@@ -1427,8 +1424,9 @@ extern bool is_dummy_rel(RelOptInfo *rel);
 typedef struct MergeAppendPath
 {
 	Path		path;
-	/* RT indexes of non-leaf tables in a partition tree */
-	List	   *partitioned_rels;
+	List	   *partitioned_rels;	/* List of Relids containing RT indexes of
+									 * non-leaf tables for each partition
+									 * hierarchy whose paths are in 'subpaths' */
 	List	   *subpaths;		/* list of component Paths */
 	double		limit_tuples;	/* hard limit on output tuples, or -1 */
 } MergeAppendPath;
...
@@ -3671,6 +3671,108 @@ explain (costs off) update listp1 set a = 1 where a = 2;
reset constraint_exclusion;
reset enable_partition_pruning;
drop table listp;
-- Ensure run-time pruning works correctly for nested Append nodes
set parallel_setup_cost to 0;
set parallel_tuple_cost to 0;
create table listp (a int) partition by list(a);
create table listp_12 partition of listp for values in(1,2) partition by list(a);
create table listp_12_1 partition of listp_12 for values in(1);
create table listp_12_2 partition of listp_12 for values in(2);
-- Force the 2nd subnode of the Append to be non-parallel. This results in
-- a nested Append node because the mixed parallel / non-parallel paths cannot
-- be pulled into the top-level Append.
alter table listp_12_1 set (parallel_workers = 0);
-- Ensure that listp_12_2 is not scanned.  (The nested Append is not seen in
-- the plan as it's pulled up in setrefs.c due to having just a single subnode).
explain (analyze on, costs off, timing off, summary off)
select * from listp where a = (select 1);
QUERY PLAN
----------------------------------------------------------------------
Gather (actual rows=0 loops=1)
Workers Planned: 2
Params Evaluated: $0
Workers Launched: 2
InitPlan 1 (returns $0)
-> Result (actual rows=1 loops=1)
-> Parallel Append (actual rows=0 loops=3)
-> Seq Scan on listp_12_1 listp_1 (actual rows=0 loops=1)
Filter: (a = $0)
-> Parallel Seq Scan on listp_12_2 listp_2 (never executed)
Filter: (a = $0)
(11 rows)
-- Like the above but throw some more complexity at the planner by adding
-- a UNION ALL. We expect both sides of the union not to scan the
-- non-required partitions.
explain (analyze on, costs off, timing off, summary off)
select * from listp where a = (select 1)
union all
select * from listp where a = (select 2);
QUERY PLAN
-----------------------------------------------------------------------------------
Append (actual rows=0 loops=1)
-> Gather (actual rows=0 loops=1)
Workers Planned: 2
Params Evaluated: $0
Workers Launched: 2
InitPlan 1 (returns $0)
-> Result (actual rows=1 loops=1)
-> Parallel Append (actual rows=0 loops=3)
-> Seq Scan on listp_12_1 listp_1 (actual rows=0 loops=1)
Filter: (a = $0)
-> Parallel Seq Scan on listp_12_2 listp_2 (never executed)
Filter: (a = $0)
-> Gather (actual rows=0 loops=1)
Workers Planned: 2
Params Evaluated: $1
Workers Launched: 2
InitPlan 2 (returns $1)
-> Result (actual rows=1 loops=1)
-> Parallel Append (actual rows=0 loops=3)
-> Seq Scan on listp_12_1 listp_4 (never executed)
Filter: (a = $1)
-> Parallel Seq Scan on listp_12_2 listp_5 (actual rows=0 loops=1)
Filter: (a = $1)
(23 rows)
drop table listp;
reset parallel_tuple_cost;
reset parallel_setup_cost;
-- Test case for run-time pruning with a nested Merge Append
set enable_sort to 0;
create table rangep (a int, b int) partition by range (a);
create table rangep_0_to_100 partition of rangep for values from (0) to (100) partition by list (b);
-- We need 3 sub-partitions. 1 to validate pruning worked and another two
-- because a single remaining partition would be pulled up to the main Append.
create table rangep_0_to_100_1 partition of rangep_0_to_100 for values in(1);
create table rangep_0_to_100_2 partition of rangep_0_to_100 for values in(2);
create table rangep_0_to_100_3 partition of rangep_0_to_100 for values in(3);
create table rangep_100_to_200 partition of rangep for values from (100) to (200);
create index on rangep (a);
-- Ensure run-time pruning works on the nested Merge Append
explain (analyze on, costs off, timing off, summary off)
select * from rangep where b IN((select 1),(select 2)) order by a;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Append (actual rows=0 loops=1)
InitPlan 1 (returns $0)
-> Result (actual rows=1 loops=1)
InitPlan 2 (returns $1)
-> Result (actual rows=1 loops=1)
-> Merge Append (actual rows=0 loops=1)
Sort Key: rangep_2.a
-> Index Scan using rangep_0_to_100_1_a_idx on rangep_0_to_100_1 rangep_2 (actual rows=0 loops=1)
Filter: (b = ANY (ARRAY[$0, $1]))
-> Index Scan using rangep_0_to_100_2_a_idx on rangep_0_to_100_2 rangep_3 (actual rows=0 loops=1)
Filter: (b = ANY (ARRAY[$0, $1]))
-> Index Scan using rangep_0_to_100_3_a_idx on rangep_0_to_100_3 rangep_4 (never executed)
Filter: (b = ANY (ARRAY[$0, $1]))
-> Index Scan using rangep_100_to_200_a_idx on rangep_100_to_200 rangep_5 (actual rows=0 loops=1)
Filter: (b = ANY (ARRAY[$0, $1]))
(15 rows)
reset enable_sort;
drop table rangep;
--
-- Check that gen_prune_steps_from_opexps() works well for various cases of
-- clauses for different partition keys
...
@@ -1051,6 +1051,55 @@ reset enable_partition_pruning;
drop table listp;
-- Ensure run-time pruning works correctly for nested Append nodes
set parallel_setup_cost to 0;
set parallel_tuple_cost to 0;
create table listp (a int) partition by list(a);
create table listp_12 partition of listp for values in(1,2) partition by list(a);
create table listp_12_1 partition of listp_12 for values in(1);
create table listp_12_2 partition of listp_12 for values in(2);
-- Force the 2nd subnode of the Append to be non-parallel. This results in
-- a nested Append node because the mixed parallel / non-parallel paths cannot
-- be pulled into the top-level Append.
alter table listp_12_1 set (parallel_workers = 0);
-- Ensure that listp_12_2 is not scanned.  (The nested Append is not seen in
-- the plan as it's pulled up in setrefs.c due to having just a single subnode).
explain (analyze on, costs off, timing off, summary off)
select * from listp where a = (select 1);
-- Like the above but throw some more complexity at the planner by adding
-- a UNION ALL. We expect both sides of the union not to scan the
-- non-required partitions.
explain (analyze on, costs off, timing off, summary off)
select * from listp where a = (select 1)
union all
select * from listp where a = (select 2);
drop table listp;
reset parallel_tuple_cost;
reset parallel_setup_cost;
-- Test case for run-time pruning with a nested Merge Append
set enable_sort to 0;
create table rangep (a int, b int) partition by range (a);
create table rangep_0_to_100 partition of rangep for values from (0) to (100) partition by list (b);
-- We need 3 sub-partitions. 1 to validate pruning worked and another two
-- because a single remaining partition would be pulled up to the main Append.
create table rangep_0_to_100_1 partition of rangep_0_to_100 for values in(1);
create table rangep_0_to_100_2 partition of rangep_0_to_100 for values in(2);
create table rangep_0_to_100_3 partition of rangep_0_to_100 for values in(3);
create table rangep_100_to_200 partition of rangep for values from (100) to (200);
create index on rangep (a);
-- Ensure run-time pruning works on the nested Merge Append
explain (analyze on, costs off, timing off, summary off)
select * from rangep where b IN((select 1),(select 2)) order by a;
reset enable_sort;
drop table rangep;
--
-- Check that gen_prune_steps_from_opexps() works well for various cases of
-- clauses for different partition keys
...