Commit c8434d64 authored by Etsuro Fujita's avatar Etsuro Fujita

Allow partitionwise joins in more cases.

Previously, the partitionwise join technique only allowed partitionwise
join when input partitioned tables had exactly the same partition
bounds.  This commit extends the technique to some cases when the tables
have different partition bounds, by using an advanced partition-matching
algorithm introduced by this commit.  For both the input partitioned
tables, the algorithm checks whether every partition of one input
partitioned table only matches one partition of the other input
partitioned table at most, and vice versa.  In such a case the join
between the tables can be broken down into joins between the matching
partitions, so the algorithm produces the pairs of the matching
partitions, plus the partition bounds for the join relation, to allow
partitionwise join for computing the join.  Currently, the algorithm
works for list-partitioned and range-partitioned tables, but not
hash-partitioned tables.  See comments in partition_bounds_merge().

Ashutosh Bapat and Etsuro Fujita, most of regression tests by Rajkumar
Raghuwanshi, some of the tests by Mark Dilger and Amul Sul, reviewed by
Dmitry Dolgov and Amul Sul, with additional review at various points by
Ashutosh Bapat, Mark Dilger, Robert Haas, Antonin Houska, Amit Langote,
Justin Pryzby, and Tomas Vondra

Discussion: https://postgr.es/m/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com
parent 41a194f4
......@@ -4749,9 +4749,9 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
which allows a join between partitioned tables to be performed by
joining the matching partitions. Partitionwise join currently applies
only when the join conditions include all the partition keys, which
must be of the same data type and have exactly matching sets of child
partitions. Because partitionwise join planning can use significantly
more CPU time and memory during planning, the default is
must be of the same data type and have one-to-one matching sets of
child partitions. Because partitionwise join planning can use
significantly more CPU time and memory during planning, the default is
<literal>off</literal>.
</para>
</listitem>
......
......@@ -2309,6 +2309,8 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
WRITE_BOOL_FIELD(has_eclass_joins);
WRITE_BOOL_FIELD(consider_partitionwise_join);
WRITE_BITMAPSET_FIELD(top_parent_relids);
WRITE_BOOL_FIELD(partbounds_merged);
WRITE_BITMAPSET_FIELD(all_partrels);
WRITE_NODE_FIELD(partitioned_child_rels);
}
......
......@@ -1106,6 +1106,33 @@ into joins between their partitions is called partitionwise join. We will use
term "partitioned relation" for either a partitioned table or a join between
compatibly partitioned tables.
Even if the joining relations don't have exactly the same partition bounds,
partitionwise join can still be applied by using an advanced
partition-matching algorithm. For both the joining relations, the algorithm
checks wether every partition of one joining relation only matches one
partition of the other joining relation at most. In such a case the join
between the joining relations can be broken down into joins between the
matching partitions. The join relation can then be considerd partitioned.
The algorithm produces the pairs of the matching partitions, plus the
partition bounds for the join relation, to allow partitionwise join for
computing the join. The algorithm is implemented in partition_bounds_merge().
For an N-way join relation considered partitioned this way, not every pair of
joining relations can use partitionwise join. For example:
(A leftjoin B on (Pab)) innerjoin C on (Pac)
where A, B, and C are partitioned tables, and A has an extra partition
compared to B and C. When considering partitionwise join for the join {A B},
the extra partition of A doesn't have a matching partition on the nullable
side, which is the case that the current implementation of partitionwise join
can't handle. So {A B} is not considered partitioned, and the pair of {A B}
and C considered for the 3-way join can't use partitionwise join. On the
other hand, the pair of {A C} and B can use partitionwise join because {A C}
is considered partitioned by eliminating the extra partition (see identity 1
on outer join reordering). Whether an N-way join can use partitionwise join
is determined based on the first pair of joining relations that are both
partitioned and can use partitionwise join.
The partitioning properties of a partitioned relation are stored in its
RelOptInfo. The information about data types of partition keys are stored in
PartitionSchemeData structure. The planner maintains a list of canonical
......
This diff is collapsed.
......@@ -376,6 +376,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
/* Create the otherrel RelOptInfo too. */
childrelinfo = build_simple_rel(root, childRTindex, relinfo);
relinfo->part_rels[i] = childrelinfo;
relinfo->all_partrels = bms_add_members(relinfo->all_partrels,
childrelinfo->relids);
/* If this child is itself partitioned, recurse */
if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
......
......@@ -249,10 +249,12 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
rel->has_eclass_joins = false;
rel->consider_partitionwise_join = false; /* might get changed later */
rel->part_scheme = NULL;
rel->nparts = 0;
rel->nparts = -1;
rel->boundinfo = NULL;
rel->partbounds_merged = false;
rel->partition_qual = NIL;
rel->part_rels = NULL;
rel->all_partrels = NULL;
rel->partexprs = NULL;
rel->nullable_partexprs = NULL;
rel->partitioned_child_rels = NIL;
......@@ -662,10 +664,12 @@ build_join_rel(PlannerInfo *root,
joinrel->consider_partitionwise_join = false; /* might get changed later */
joinrel->top_parent_relids = NULL;
joinrel->part_scheme = NULL;
joinrel->nparts = 0;
joinrel->nparts = -1;
joinrel->boundinfo = NULL;
joinrel->partbounds_merged = false;
joinrel->partition_qual = NIL;
joinrel->part_rels = NULL;
joinrel->all_partrels = NULL;
joinrel->partexprs = NULL;
joinrel->nullable_partexprs = NULL;
joinrel->partitioned_child_rels = NIL;
......@@ -838,10 +842,12 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->consider_partitionwise_join = false; /* might get changed later */
joinrel->top_parent_relids = NULL;
joinrel->part_scheme = NULL;
joinrel->nparts = 0;
joinrel->nparts = -1;
joinrel->boundinfo = NULL;
joinrel->partbounds_merged = false;
joinrel->partition_qual = NIL;
joinrel->part_rels = NULL;
joinrel->all_partrels = NULL;
joinrel->partexprs = NULL;
joinrel->nullable_partexprs = NULL;
joinrel->partitioned_child_rels = NIL;
......@@ -1645,7 +1651,7 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
* of the way the query planner deduces implied equalities and reorders
* the joins. Please see optimizer/README for details.
*/
if (!IS_PARTITIONED_REL(outer_rel) || !IS_PARTITIONED_REL(inner_rel) ||
if (outer_rel->part_scheme == NULL || inner_rel->part_scheme == NULL ||
!outer_rel->consider_partitionwise_join ||
!inner_rel->consider_partitionwise_join ||
outer_rel->part_scheme != inner_rel->part_scheme ||
......@@ -1658,24 +1664,6 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
part_scheme = outer_rel->part_scheme;
Assert(REL_HAS_ALL_PART_PROPS(outer_rel) &&
REL_HAS_ALL_PART_PROPS(inner_rel));
/*
* For now, our partition matching algorithm can match partitions only
* when the partition bounds of the joining relations are exactly same.
* So, bail out otherwise.
*/
if (outer_rel->nparts != inner_rel->nparts ||
!partition_bounds_equal(part_scheme->partnatts,
part_scheme->parttyplen,
part_scheme->parttypbyval,
outer_rel->boundinfo, inner_rel->boundinfo))
{
Assert(!IS_PARTITIONED_REL(joinrel));
return;
}
/*
* This function will be called only once for each joinrel, hence it
* should not have partitioning fields filled yet.
......@@ -1685,18 +1673,15 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
!joinrel->boundinfo);
/*
* Join relation is partitioned using the same partitioning scheme as the
* joining relations and has same bounds.
* If the join relation is partitioned, it uses the same partitioning
* scheme as the joining relations.
*
* Note: we calculate the partition bounds, number of partitions, and
* child-join relations of the join relation in try_partitionwise_join().
*/
joinrel->part_scheme = part_scheme;
joinrel->boundinfo = outer_rel->boundinfo;
joinrel->nparts = outer_rel->nparts;
set_joinrel_partition_key_exprs(joinrel, outer_rel, inner_rel, jointype);
/* part_rels[] will be filled later, but allocate it now */
joinrel->part_rels =
(RelOptInfo **) palloc0(sizeof(RelOptInfo *) * joinrel->nparts);
/*
* Set the consider_partitionwise_join flag.
*/
......
This diff is collapsed.
......@@ -597,8 +597,10 @@ typedef struct PartitionSchemeData *PartitionScheme;
* part_scheme - Partitioning scheme of the relation
* nparts - Number of partitions
* boundinfo - Partition bounds
* partbounds_merged - true if partition bounds are merged ones
* partition_qual - Partition constraint if not the root
* part_rels - RelOptInfos for each partition
* all_partrels - Relids set of all partition relids
* partexprs, nullable_partexprs - Partition key expressions
* partitioned_child_rels - RT indexes of unpruned partitions of
* this relation that are partitioned tables
......@@ -735,11 +737,16 @@ typedef struct RelOptInfo
/* used for partitioned relations: */
PartitionScheme part_scheme; /* Partitioning scheme */
int nparts; /* Number of partitions */
int nparts; /* Number of partitions; -1 if not yet set;
* in case of a join relation 0 means it's
* considered unpartitioned */
struct PartitionBoundInfoData *boundinfo; /* Partition bounds */
bool partbounds_merged; /* True if partition bounds were created
* by partition_bounds_merge() */
List *partition_qual; /* Partition constraint, if not the root */
struct RelOptInfo **part_rels; /* Array of RelOptInfos of partitions,
* stored in the same order as bounds */
Relids all_partrels; /* Relids set of all partition relids */
List **partexprs; /* Non-nullable partition key expressions */
List **nullable_partexprs; /* Nullable partition key expressions */
List *partitioned_child_rels; /* List of RT indexes */
......
......@@ -16,6 +16,7 @@
#include "nodes/pg_list.h"
#include "partitioning/partdefs.h"
#include "utils/relcache.h"
struct RelOptInfo; /* avoid including pathnodes.h here */
/*
......@@ -87,6 +88,14 @@ extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
PartitionBoundInfo b2);
extern PartitionBoundInfo partition_bounds_copy(PartitionBoundInfo src,
PartitionKey key);
extern PartitionBoundInfo partition_bounds_merge(int partnatts,
FmgrInfo *partsupfunc,
Oid *partcollation,
struct RelOptInfo *outer_rel,
struct RelOptInfo *inner_rel,
JoinType jointype,
List **outer_parts,
List **inner_parts);
extern bool partitions_are_ordered(PartitionBoundInfo boundinfo, int nparts);
extern void check_new_partition_bound(char *relname, Relation parent,
PartitionBoundSpec *spec);
......
This diff is collapsed.
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment