    Allow ATTACH PARTITION with only ShareUpdateExclusiveLock. · 898e5e32
    Robert Haas authored
    We still require AccessExclusiveLock on the partition itself, because
    otherwise an insert that violates the newly-imposed partition
    constraint could be in progress at the same time that we're changing
    that constraint; only the lock level on the parent relation is
    weakened.
    
    To make this safe, we have to cope with (at least) three separate
    problems. First, relevant DDL might commit while we're in the process
    of building a PartitionDesc.  If so, find_inheritance_children() might
    see a new partition while the RELOID system cache still has the old
    partition bound cached, and even before invalidation messages have
    been queued.  To fix that, if we see that the pg_class tuple seems to
    be missing or to have a null relpartbound, refetch the value directly
    from the table. We can't get the wrong value, because DETACH PARTITION
    still requires AccessExclusiveLock throughout; if we ever want to
    change that, this will need more thought. In testing, I found it quite
    difficult to hit even the null-relpartbound case; the race condition
    is extremely tight, but the theoretical risk is there.
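
    As a rough illustration of the first fix, the sketch below falls back
    to the table when the cached bound looks missing.  BoundSourceSketch
    and lookup_partition_bound() are hypothetical stand-ins invented for
    this example; they are not the syscache or heap access functions the
    patch actually uses.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the two places a partition bound can come from.
 * cached_bound plays the role of the (possibly stale) system cache entry;
 * table_bound plays the role of the value stored in the catalog table.
 */
typedef struct BoundSourceSketch
{
    const char *cached_bound;   /* NULL if the cache has no bound yet */
    const char *table_bound;    /* what a direct read of the table finds */
} BoundSourceSketch;

/*
 * Return the bound for a child that the inheritance scan reported.
 * If the cache still shows no bound, assume it is stale and refetch
 * the value directly from the table.
 */
const char *
lookup_partition_bound(const BoundSourceSketch *src)
{
    const char *bound;

    assert(src != NULL);

    bound = src->cached_bound;
    if (bound == NULL)
        bound = src->table_bound;   /* cache lagging behind a concurrent ATTACH */

    return bound;
}
```

    The fallback only trusts the table when the cache shows nothing, which
    matches the argument above: with DETACH PARTITION still taking
    AccessExclusiveLock, a non-null cached bound cannot be out of date.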
    
    Second, successive calls to RelationGetPartitionDesc might not return
    the same answer.  The query planner will get confused if looking up the
    PartitionDesc for a particular relation does not return a consistent
    answer for the entire duration of query planning.  Likewise, query
    execution will get confused if the same relation seems to have a
    different PartitionDesc at different times.  Invent a new
    PartitionDirectory concept and use it to ensure consistency.  This
    ensures that a single invocation of either the planner or the executor
    sees the same view of the PartitionDesc from beginning to end, but it
    does not guarantee that the planner and the executor see the same
    view.  Since this allows pointers to old PartitionDesc entries to
    survive even after a relcache rebuild, also postpone removing the old
    PartitionDesc entry until we're certain no one is using it.
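
    The idea behind the directory can be sketched as a small per-invocation
    cache that remembers the first descriptor it saw for each relation.
    All names here (DescSketch, RelSketch, DirectorySketch,
    directory_lookup) are hypothetical, and the fixed-size array is an
    assumption made for brevity; the sketch also leaves out the pinning
    that keeps an old descriptor alive after a relcache rebuild.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

/* Hypothetical stand-in for a partition descriptor. */
typedef struct DescSketch
{
    int nparts;
} DescSketch;

/* Hypothetical stand-in for a relcache entry whose descriptor can change. */
typedef struct RelSketch
{
    Oid relid;
    const DescSketch *partdesc;     /* replaced when the entry is rebuilt */
} RelSketch;

#define DIR_MAX_ENTRIES 16

typedef struct DirectorySketch
{
    size_t nentries;
    struct
    {
        Oid relid;
        const DescSketch *desc;
    } entries[DIR_MAX_ENTRIES];
} DirectorySketch;

void
directory_init(DirectorySketch *dir)
{
    dir->nentries = 0;
}

/*
 * Return the descriptor recorded for this relation, recording the current
 * one on first lookup.  Later lookups through the same directory return
 * the same pointer even if rel->partdesc has been replaced since.
 */
const DescSketch *
directory_lookup(DirectorySketch *dir, const RelSketch *rel)
{
    size_t i;

    for (i = 0; i < dir->nentries; i++)
    {
        if (dir->entries[i].relid == rel->relid)
            return dir->entries[i].desc;
    }

    assert(dir->nentries < DIR_MAX_ENTRIES);
    dir->entries[dir->nentries].relid = rel->relid;
    dir->entries[dir->nentries].desc = rel->partdesc;
    dir->nentries++;

    return rel->partdesc;
}
```

    One directory per planner or executor invocation gives each a stable
    view; two different directories may still disagree, which is the
    planner/executor gap described above.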
    
    For the most part, it seems to be OK for the planner and executor to
    have different views of the PartitionDesc, because the executor will
    just ignore any concurrently added partitions which were unknown at
    plan time; those partitions won't be part of the inheritance
    expansion, but invalidation messages will trigger replanning at some
    point.  Normally, this happens by the time the very next command is
    executed, but if the next command acquires no locks and executes a
    prepared query, it can manage not to notice until a new transaction is
    started.  We might want to tighten that up, but it's material for a
    separate patch.  There would still be a small window where a query
    that started just after an ATTACH PARTITION command committed might
    fail to notice its results -- but only if the command starts before
    the commit has been acknowledged to the user. All in all, the warts
    here around serializability seem small enough to be worth accepting
    for the considerable advantage of being able to add partitions without
    a full table lock.
    
    Although in general the consequences of new partitions showing up
    between planning and execution are limited to the query not noticing
    the new partitions, run-time partition pruning will get confused in
    that case, so that's the third problem that this patch fixes.
    Run-time partition pruning assumes that indexes into the PartitionDesc
    are stable between planning and execution.  So, add code so that if
    new partitions are added between plan time and execution time, the
    indexes stored in the subplan_map[] and subpart_map[] arrays within
    the plan's PartitionedRelPruneInfo get adjusted accordingly.  There
    does not seem to be a simple way to generalize this scheme to cope
    with partitions that are removed, mostly because they could then get
    added back again with different bounds, but it works OK for added
    partitions.
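
    A minimal sketch of that adjustment, assuming the plan also remembers
    the OID of each partition it knew about and that existing partitions
    keep their relative order when new ones are added.  The function
    remap_subplan_indexes() and its parameters are hypothetical names for
    this example, not the executor's actual code.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

/*
 * Build an execution-time map from a plan-time map.
 *
 * plan_oids/plan_map: partitions known at plan time and the subplan index
 *                     stored for each (-1 meaning "no subplan").
 * exec_oids:          partitions visible at execution time; assumed to be
 *                     plan_oids plus any concurrently added partitions,
 *                     with the original ones in the same relative order.
 * exec_map:           output, one entry per execution-time partition.
 *
 * Partitions that were unknown at plan time get -1, so they are ignored.
 */
void
remap_subplan_indexes(const Oid *plan_oids, const int *plan_map, size_t n_plan,
                      const Oid *exec_oids, int *exec_map, size_t n_exec)
{
    size_t plan_idx = 0;
    size_t exec_idx;

    assert(n_exec >= n_plan);   /* removed partitions are not handled */

    for (exec_idx = 0; exec_idx < n_exec; exec_idx++)
    {
        if (plan_idx < n_plan && plan_oids[plan_idx] == exec_oids[exec_idx])
            exec_map[exec_idx] = plan_map[plan_idx++];
        else
            exec_map[exec_idx] = -1;    /* added after planning */
    }
}
```

    Matching by OID in a single forward pass is what makes removed
    partitions awkward: a partition dropped and re-attached with different
    bounds could reappear at a different position, which this walk cannot
    detect.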
    
    This code does not try to ensure that every backend participating in
    a parallel query sees the same view of the PartitionDesc.  That
    currently doesn't matter, because we never pass PartitionDesc
    indexes between backends.  Each backend will ignore the concurrently
    added partitions which it notices, and it doesn't matter if different
    backends are ignoring different sets of concurrently added partitions.
    If in the future that matters, for example because we allow writes in
    parallel query and want all participants to do tuple routing to the same
    set of partitions, the PartitionDirectory concept could be improved to
    share PartitionDescs across backends.  There is a draft patch to
    serialize and restore PartitionDescs on the thread where this patch
    was discussed, which may be a useful place to start.
    
    Patch by me.  Thanks to Alvaro Herrera, David Rowley, Simon Riggs,
    Amit Langote, and Michael Paquier for discussion, and to Alvaro
    Herrera for some review.
    
    Discussion: http://postgr.es/m/CA+Tgmobt2upbSocvvDej3yzokd7AkiT+PvgFH+a9-5VV1oJNSQ@mail.gmail.com
    Discussion: http://postgr.es/m/CA+TgmoZE0r9-cyA-aY6f8WFEROaDLLL7Vf81kZ8MtFCkxpeQSw@mail.gmail.com
    Discussion: http://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@mail.gmail.com