- 24 Mar, 2019 1 commit
-
-
Andres Freund authored
This adds new, required, table AM callbacks for insert/delete/update and lock_tuple. To be able to use those reasonably, the EvalPlanQual mechanism had to be adapted, moving more logic into the AM.

Previously both delete/update/lock call-sites and the EPQ mechanism had to have awareness of the specific tuple format to be able to fetch the latest version of a tuple. Obviously that needs to be abstracted away. To do so, move the logic that finds the latest row version into the AM. lock_tuple has a new flag argument, TUPLE_LOCK_FLAG_FIND_LAST_VERSION, that forces it to lock the last version, rather than the current one. It'd have been possible to do so via a separate callback as well, but finding the last version usually also necessitates locking the newest version, making it sensible to combine the two. This replaces the previous use of EvalPlanQualFetch().

Additionally HeapTupleUpdated, which previously signaled either a concurrent update or delete, is now split into two, to avoid callers needing AM-specific knowledge to differentiate.

The move of finding the latest row version into tuple_lock means that encountering a row concurrently moved into another partition will now raise an error about "tuple to be locked" rather than "tuple to be updated/deleted" - which is accurate, as that always happens when locking rows. While possibly slightly less helpful for users, it seems like an acceptable trade-off.

As part of this commit HTSU_Result has been renamed to TM_Result, and its members have been expanded to differentiate between updating and deleting. HeapUpdateFailureData has been renamed to TM_FailureData.

The interface to speculative insertion is changed so nodeModifyTable.c does not have to set the speculative token itself anymore. Instead there's a version of tuple_insert, tuple_insert_speculative, that performs the speculative insertion (without requiring a flag to signal that fact), and the speculative insertion is either made permanent with table_complete_speculative(succeeded = true) or aborted with succeeded = false.

Note that multi_insert is not yet routed through tableam, nor is COPY. Changing multi_insert requires changes to copy.c that are large enough to better be done separately. Similarly, although simpler, CREATE TABLE AS and CREATE MATERIALIZED VIEW are also only going to be adjusted in a later commit.

Author: Andres Freund and Haribabu Kommi
Discussion:
   https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
   https://postgr.es/m/20190313003903.nwvrxi7rw3ywhdel@alap3.anarazel.de
   https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
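Speculative insertion is the executor mechanism behind INSERT ... ON CONFLICT, so a statement along these lines exercises the new tuple_insert_speculative / table_complete_speculative path (a minimal sketch; the table and column names are hypothetical, not from the commit):

    -- Hypothetical table; the primary key acts as the conflict arbiter.
    CREATE TABLE kv (id int PRIMARY KEY, v text);

    -- The new row is inserted speculatively: completed with
    -- succeeded = true if no conflict is found, aborted with
    -- succeeded = false (and turned into an UPDATE) otherwise.
    INSERT INTO kv VALUES (1, 'a')
    ON CONFLICT (id) DO UPDATE SET v = EXCLUDED.v;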
-
- 23 Mar, 2019 8 commits
-
-
Tom Lane authored
In combination with the previous commit, this ensures that valid XML data can always be dumped and reloaded, whether it is "document" or "content". Discussion: https://postgr.es/m/CAN-V+g-6JqUQEQZ55Q3toXEN6d5Ez5uvzL4VR+8KtvJKj31taw@mail.gmail.com
-
Tom Lane authored
Previously we were using the SQL:2003 definition, which doesn't allow this, but that creates a serious dump/restore gotcha: there is no setting of xmloption that will allow all valid XML data. Hence, switch to the 2006 definition.

Since libxml doesn't accept <!DOCTYPE> directives in the mode we use for CONTENT parsing, the implementation is to detect <!DOCTYPE> in the input and switch to DOCUMENT parsing mode. This should not cost much, because <!DOCTYPE> should be close to the front of the input if it's there at all. It's possible that this causes the error messages for malformed input to be slightly different than they were before, if said input includes <!DOCTYPE>; but that does not seem like a big problem.

In passing, buy back a few cycles in parsing of large XML documents by not doing strlen() of the whole input in parse_xml_decl().

Back-patch because dump/restore failures are not nice. This change shouldn't break any cases that worked before, so it seems safe to back-patch.

Chapman Flack (revised a bit by me)
Discussion: https://postgr.es/m/CAN-V+g-6JqUQEQZ55Q3toXEN6d5Ez5uvzL4VR+8KtvJKj31taw@mail.gmail.com
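As an illustration (not part of the commit), a value with a <!DOCTYPE> directive is now accepted even when the XML option is CONTENT; previously it was rejected:

    SET XML OPTION CONTENT;

    -- Accepted after this commit: the <!DOCTYPE> near the front of the
    -- input triggers a switch to DOCUMENT parsing mode internally.
    SELECT '<!DOCTYPE greeting [<!ELEMENT greeting (#PCDATA)>]><greeting>hello</greeting>'::xml;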
-
Peter Geoghegan authored
Suppress 3 lines of unstable DETAIL output from a DROP ROLE statement in event_trigger.sql. This is further cleanup for commit dd299df8. Note that the event_trigger test instability issue is very similar to the recently suppressed foreign_data test instability issue. Both issues involve DETAIL output for a DROP ROLE statement that needed to be changed as part of dd299df8. Per buildfarm member macaque.
-
Peter Geoghegan authored
Teach nbtree forward index scans to check the high key before moving to the right sibling page in the hope of finding that it isn't actually necessary to do so. The new check may indicate that the scan definitely cannot find matching tuples to the right, ending the scan immediately. We already opportunistically force a similar "continuescan orientated" key check of the final non-pivot tuple when it's clear that it cannot be returned to the scan due to being dead-to-all. The new high key check is complementary.

The new approach for forward scans is more effective than checking the final non-pivot tuple, especially with composite indexes and non-unique indexes. The improvements to the logic for picking a split point added by commit fab25024 make it likely that relatively dissimilar high keys will appear on a page. A distinguishing key value that can only appear on non-pivot tuples on the right sibling page will often be present in leaf page high keys.

Since forcing the final item to be key checked no longer makes any difference in the case of forward scans, the existing extra key check is now only used for backwards scans. Backward scans continue to opportunistically check the final non-pivot tuple, which is actually the first non-pivot tuple on the page (not the last).

Note that even pg_upgrade'd v3 indexes make use of this optimization.

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzkOmUduME31QnuTFpimejuQoiZ-HOf0pOWeFZNhTMctvA@mail.gmail.com
-
Michael Paquier authored
This makes the code more consistent with the surroundings. Author: Fabrízio de Royes Mello Discussion: https://postgr.es/m/CAFcNs+pXb_35r5feMU3-dWsWxXU=Yjq+spUsthFyGFbT0QcaKg@mail.gmail.com
-
Tom Lane authored
gcc is a bit pickier about this than perhaps it should be. Discussion: https://postgr.es/m/E1h6zzT-0003ft-DD@gemulon.postgresql.org
-
Andres Freund authored
Previously there was basically no coverage for UPDATEs encountering deleted rows, and no coverage for DELETE having to perform EPQ. That's problematic for an upcoming commit in which EPQ is taught to integrate with tableams. Also, there was no test for an UPDATE encountering a row that was UPDATEd into another partition.

Author: Andres Freund
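For context, a sketch of the kind of scenario these tests cover (hypothetical table; EPQ applies under READ COMMITTED when a DML statement hits a concurrently modified row):

    -- Hypothetical setup.
    CREATE TABLE counters (id int PRIMARY KEY, n int);
    INSERT INTO counters VALUES (1, 0);

    -- Session 1:
    BEGIN;
    UPDATE counters SET n = n + 1 WHERE id = 1;  -- new row version, uncommitted

    -- Session 2 (READ COMMITTED) blocks on the row; once session 1
    -- commits, EPQ re-evaluates the qual against the latest version:
    DELETE FROM counters WHERE id = 1 AND n = 0;  -- deletes nothing: n is now 1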
- 22 Mar, 2019 18 commits
-
-
Michael Paquier authored
This is an option consistent with what pg_dump, pg_rewind and pg_basebackup provide, useful for limiting the I/O effort when testing things; it is not to be used in a production environment.

Author: Michael Paquier
Reviewed-by: Michael Banck, Fabien Coelho, Sergei Kornilov
Discussion: https://postgr.es/m/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de
-
Peter Eisentraut authored
This reverts commit 4e274a04. These files aren't actually built anymore since 550b9d26.
-
Michael Paquier authored
An offline cluster can now work with more modes in pg_checksums:

- --enable enables checksums in a cluster, updating all blocks with a correct checksum, and updating the control file at the end.
- --disable disables checksums in a cluster, updating only the control file.
- --check is an extra option able to verify checksums for a cluster, and the default used if no mode is specified.

When running --enable or --disable, the data folder gets fsync'd for durability, and then it is followed by a control file update and flush to keep the operation consistent should the tool be interrupted, killed or the host unplugged. If no mode is specified in the options, then --check is used for compatibility with older versions of pg_checksums (named pg_verify_checksums in v11 where it was introduced).

Author: Michael Banck, Michael Paquier
Reviewed-by: Fabien Coelho, Magnus Hagander, Sergei Kornilov
Discussion: https://postgr.es/m/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de
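A hedged usage sketch (the data directory path is a placeholder): with the cluster cleanly shut down, run the tool, then confirm the state from a fresh session via the read-only data_checksums parameter:

    -- With the server stopped, from a shell (shown as comments here):
    --   pg_checksums --enable -D /path/to/data
    --   pg_checksums --check  -D /path/to/data
    -- After restarting, the cluster-wide state is visible from SQL:
    SHOW data_checksums;  -- reports "on" once --enable has completed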
-
Peter Eisentraut authored
We need to set the database to UTF8 encoding so that the test can use Unicode escapes.
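For reference (an illustration, not from the commit), Unicode escapes for non-ASCII characters are rejected under a non-UTF8 server encoding, which is why the test database must be UTF8:

    -- The documentation's Cyrillic example; errors out under a non-UTF8
    -- server encoding, works once the database uses UTF8.
    SELECT U&'\0441\043B\043E\043D' AS word;  -- 'слон'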
-
Peter Eisentraut authored
-
Tom Lane authored
Postpone most of the effort of constructing PartitionedRelPruneInfos until after we have found out whether run-time pruning is needed at all. This costs very little duplicated effort (basically just an extra find_base_rel() call per partition) and saves quite a bit when we can't do run-time pruning. Also, merge the first loop (for building relid_subpart_map) into the second loop, since we don't need the map to be valid during that loop. Amit Langote Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp
-
Peter Geoghegan authored
This is almost a straight revert of commit fff518d0, which itself was a revert of 7d3bf73a. It turns out that commit 8aa9dd74, which sorted dependent objects before deletion in DROP OWNED BY, was not sufficient to make all remaining unstable DETAIL output stable. Unstable DETAIL output from DROP ROLE was not affected, because that happens to use a different code path. It doesn't seem worthwhile to fix the other code path at this time. Discussion: https://postgr.es/m/6226.1553274783@sss.pgh.pa.us
-
Tom Lane authored
I (tgl) remain dubious that it's a good idea for PartitionDirectory to hold a pin on a relcache entry throughout planning, rather than copying the data or using some kind of refcount scheme. However, it's certainly the responsibility of the PartitionDirectory code to ensure that what it's handing back is a stable data structure, not that of its caller. So this is a pretty clear oversight in commit 898e5e32, and one that can cost a lot of performance when there are many partitions. Amit Langote (extracted from a much larger patch set) Discussion: https://postgr.es/m/CA+TgmoY3bRmGB6-DUnoVy5fJoreiBJ43rwMrQRCdPXuKt4Ykaw@mail.gmail.com Discussion: https://postgr.es/m/9d7c5112-cb99-6a47-d3be-cf1ee6862a1d@lab.ntt.co.jp
-
Heikki Linnakangas authored
There were more large constants that needed UINT64CONST. And one variable was declared as "int" when it needed to be uint64. These bugs were only visible on 32-bit systems; clearly I should've tested on one, given that this code does a lot of work with 64-bit integers.

Also, in the "huge distances" test, the code created some values with random distances between them, but the test logic didn't take into account the possibility that the random distance was exactly 1. That never actually happens with the seed we're using, but let's be tidy.
-
Peter Eisentraut authored
Change the tests to use old-style ICU locale specifications so that they can run on older ICU versions.
-
Heikki Linnakangas authored
Use UINT64CONST for large constants.
-
Heikki Linnakangas authored
Use UINT64_FORMAT for printing uint64s.
-
Heikki Linnakangas authored
Buildfarm member 'woodlouse' failed one of the tests, and I'm not sure which test failed. Better to print the names of the tests, so that it will appear in the regression.diffs on failure.
-
Heikki Linnakangas authored
We mustn't assume that the IndexVacuumInfo pointer passed to bulkdelete() stage is still valid in the vacuumcleanup() stage. Per very pink buildfarm.
-
Heikki Linnakangas authored
To do this, we scan GiST two times. In the first pass we make note of empty leaf pages and internal pages. In the second pass we scan through internal pages, looking for downlinks to the empty pages.

Deleting internal pages is still not supported: as in nbtree, the last child of an internal page is never deleted. That means that if you have a workload where new keys are always inserted into a different area than the one where old keys are removed, the index will still grow without bound. But the rate of growth will be an order of magnitude slower than before.

Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/B1E4DF12-6CD3-4706-BDBD-BF3283328F60@yandex-team.ru
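A minimal sketch of the kind of workload that benefits (hypothetical names, not from the commit):

    -- Hypothetical GiST-indexed table.
    CREATE TABLE pts (p point);
    CREATE INDEX pts_gist ON pts USING gist (p);

    -- Delete a whole region of keys, then vacuum: leaf pages emptied by
    -- the deletions can now be unlinked and recycled rather than merely
    -- being left around for reuse.
    DELETE FROM pts WHERE p <@ box '((0,0),(100,100))';
    VACUUM pts;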
-
Heikki Linnakangas authored
The set is implemented as a B-tree, with a compact representation at leaf items, using the Simple-8b algorithm, so that clusters of nearby values use less memory.

The IntegerSet isn't used for anything yet, aside from the test code, but we have two patches in the works that would benefit from this: a patch to allow GiST vacuum to delete empty pages, and a patch to reduce heap VACUUM's memory usage, by storing the list of dead TIDs more efficiently and lifting the 1 GB limit on its size.

This includes a unit test module, in src/test/modules/test_integerset. It can be used to verify correctness, as a regression test, but if you run it manually, it can also print memory usage and execution time of some of the tests.

Author: Heikki Linnakangas, Andrey Borodin
Reviewed-by: Julien Rouhaud
Discussion: https://www.postgresql.org/message-id/b5e82599-1966-5783-733c-1a947ddb729f@iki.fi
-
Peter Eisentraut authored
This adds a flag "deterministic" to collations. If that is false, such a collation disables various optimizations that assume that strings are equal only if they are byte-wise equal. That then allows use cases such as case-insensitive or accent-insensitive comparisons or handling of strings with different Unicode normal forms.

This functionality is only supported with the ICU provider. At least glibc doesn't appear to have any locales that work in a nondeterministic way, so it's not worth supporting this for the libc provider.

The term "deterministic comparison" in this context is from Unicode Technical Standard #10 (https://unicode.org/reports/tr10/#Deterministic_Comparison).

This patch makes changes in three areas:

- CREATE COLLATION DDL changes and system catalog changes to support this new flag.
- Many executor nodes and auxiliary code are extended to track collations. Previously, this code would just throw away collation information, because the eventually-called user-defined functions didn't use it since they only cared about equality, which didn't need collation information.
- String data type functions that do equality comparisons and hashing are changed to take the (non-)deterministic flag into account. For comparison, this just means skipping various shortcuts and tie breakers that use byte-wise comparison. For hashing, we first need to convert the input string to a canonical "sort key" using the ICU analogue of strxfrm().

Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
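A short usage sketch; the locale string follows the PostgreSQL documentation's case-insensitive example:

    -- ICU collation that compares at strength level 2 (ignoring case
    -- differences) and is declared nondeterministic.
    CREATE COLLATION case_insensitive
        (provider = icu, locale = 'und-u-ks-level2', deterministic = false);

    SELECT 'ABC' = 'abc' COLLATE case_insensitive;  -- true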
-
Michael Paquier authored
Trying to call the function with the top-most parent of a partition tree was leading to a crash. In this case the correct result is to return the top-most parent itself. Reported-by: Álvaro Herrera Author: Michael Paquier Reviewed-by: Amit Langote Discussion: https://postgr.es/m/20190322032612.GA323@alvherre.pgsql
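The function isn't named above; assuming for illustration that it is pg_partition_root() (an inference from the described behavior, not something stated in the message), the fixed case looks like this, with hypothetical table names:

    -- Hypothetical two-level partition tree.
    CREATE TABLE parent (id int) PARTITION BY RANGE (id);
    CREATE TABLE child PARTITION OF parent FOR VALUES FROM (0) TO (10);

    -- Calling with the top-most parent now returns the parent itself
    -- instead of crashing.
    SELECT pg_partition_root('parent');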
-
- 21 Mar, 2019 5 commits
-
-
Peter Geoghegan authored
This should be superseded by commit 8aa9dd74.
-
Alvaro Herrera authored
-
Alvaro Herrera authored
When DefineIndex recurses to create constraints on partitions, it needs to use the value returned by index_constraint_create to set up partition dependencies. However, in the course of fixing the DEPENDENCY_INTERNAL_AUTO mess, commit 1d92a0c9 introduced some code to that function that clobbered the return value, causing the recorded OID to be of the wrong object. Close examination of pg_depend after creating the tables leads to indescribable objects :-(

My sin (in commit bdc3d7fa, while preparing for DDL deparsing in event triggers) was to use a variable name for the return value that's typically used for throwaway objects in dependency-setting calls ("referenced"). Fix by changing the variable names to match extended practice (the return value is "myself" rather than "referenced").

The pg_upgrade test notices the problem (in an indirect way: the pg_dump outputs are in different order), but only if you create the objects in a specific way that wasn't being used in the existing tests. Add a stanza to leave some objects around that shows the bug.

Catversion bump because preexisting databases might have bogus pg_depend entries.

Discussion: https://postgr.es/m/20190318204235.GA30360@alvherre.pgsql
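For anyone wanting to examine the recorded dependencies directly, a generic inspection query (not from the commit):

    -- Show what each constraint's dependency rows point at.
    SELECT classid::regclass AS catalog, objid,
           refclassid::regclass AS ref_catalog, refobjid, deptype
    FROM pg_depend
    WHERE classid = 'pg_constraint'::regclass;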
-
Tom Lane authored
These commands allow the argument type list to be omitted if there is just one object that matches by name. However, if that syntax was used with DROP IF EXISTS and there was more than one match, you got a "function ... does not exist, skipping" notice message rather than a truthful complaint about the ambiguity. This was basically due to poor factorization and a rats-nest of logic, so refactor the relevant lookup code to make it cleaner.

Note that this amounts to narrowing the scope of which sorts of error conditions IF EXISTS will bypass. Per discussion, we only intend it to skip no-such-object cases, not multiple-possible-matches cases. Per bug #15572 from Ash Marath.

Although this definitely seems like a bug, it's not clear that people would thank us for changing the behavior in minor releases, so no back-patch.

David Rowley, reviewed by Julien Rouhaud and Pavel Stehule
Discussion: https://postgr.es/m/15572-ed1b9ed09503de8a@postgresql.org
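An illustrative reproduction (hypothetical function names), showing the post-fix behavior:

    CREATE FUNCTION ambig(int)  RETURNS int  LANGUAGE sql AS 'SELECT $1';
    CREATE FUNCTION ambig(text) RETURNS text LANGUAGE sql AS 'SELECT $1';

    -- Argument list omitted with two candidates: previously this issued
    -- "function ambig does not exist, skipping"; it now raises an error
    -- about the name not being unique.
    DROP FUNCTION IF EXISTS ambig;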
-
Thomas Munro authored
LDAP servers can be advertised on a network with RFC 2782 DNS SRV records. The OpenLDAP command-line tools automatically try to find servers that way, if no server name is provided by the user. Teach PostgreSQL to do the same using OpenLDAP's support functions, when building with OpenLDAP. For now, we assume that HAVE_LDAP_INITIALIZE (an OpenLDAP extension available since OpenLDAP 2.0 and also present in Apple LDAP) implies that you also have ldap_domain2hostlist() (which arrived in the same OpenLDAP version and is also present in Apple LDAP). Author: Thomas Munro Reviewed-by: Daniel Gustafsson Discussion: https://postgr.es/m/CAEepm=2hAnSfhdsd6vXsM6VZVN0br-FbAZ-O+Swk18S5HkCP=A@mail.gmail.com
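A hedged configuration sketch (pg_hba.conf, hypothetical domain): omitting ldapserver is what triggers the DNS SRV lookup on an OpenLDAP build:

    # pg_hba.conf: no ldapserver given, so servers are discovered via
    # RFC 2782 SRV records derived from the base DN (here, example.net).
    host all all 0.0.0.0/0 ldap ldapbasedn="dc=example,dc=net" ldapsearchattribute=uid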
-
- 20 Mar, 2019 8 commits
-
-
Tom Lane authored
This finishes a task we left undone in commit f1ad067f, by extending the delete-in-descending-OID-order rule to deletions triggered by DROP OWNED BY. We've coped with machine-dependent deletion orders one time too many, and the new issues caused by Peter G's recent nbtree hacking seem like the last straw. Discussion: https://postgr.es/m/E1h6eep-0001Mw-Vd@gemulon.postgresql.org
-
Alvaro Herrera authored
This new function simplifies some existing coding and supports future patches.

Discussion: https://postgr.es/m/201901222145.t6wws6t6vrcu@alvherre.pgsql
Reviewed-by: Amit Langote, Jesper Pedersen
-
Peter Geoghegan authored
Cleanup from commit dd299df8. Per complaint from Tom Lane.
-
Peter Geoghegan authored
Unstable sort order related to changes to nbtree from commit dd299df8 can cause two lines of DETAIL output to be in opposite-of-expected order. Suppress the output using the same VERBOSITY hack that is used elsewhere in the foreign_data tests. Note that the same foreign_data.out DETAIL output was mechanically updated by commit dd299df8. Only a few such changes were required, though. Per buildfarm member batfish. Discussion: https://postgr.es/m/CAH2-WzkCQ_MtKeOpzozj7QhhgP1unXsK8o9DMAFvDqQFEPpkYQ@mail.gmail.com
-
Alvaro Herrera authored
I unnecessarily removed this check in 3de241db because I misunderstood what the final representation of constraints across a partitioning hierarchy was to be. Put it back (in both branches). Discussion: https://postgr.es/m/201901222145.t6wws6t6vrcu@alvherre.pgsql
-
Peter Geoghegan authored
Teach contrib/amcheck's bt_index_parent_check() function to take advantage of the uniqueness property of heapkeyspace indexes in support of a new verification option: non-pivot tuples (non-highkey tuples on the leaf level) can optionally be re-found using a new search for each, that starts from the root page. If a tuple cannot be re-found, report that the index is corrupt.

The new "rootdescend" verification option is exhaustive, and can therefore make a call to bt_index_parent_check() take a lot longer. Re-finding tuples during verification is mostly intended as an option for backend developers, since the corruption scenarios that it alone is uniquely capable of detecting seem fairly far-fetched.

For example, "rootdescend" verification is much more likely to detect corruption of the least significant byte of a key from a pivot tuple in the root page of a B-Tree that already has at least three levels. Typically, only a few tuples on a cousin leaf page are at risk of "getting overlooked" by index scans in this scenario. The corrupt key in the root page is only slightly corrupt: corrupt enough to give wrong answers to some queries, and yet not corrupt enough to allow the problem to be detected without verifying agreement between the leaf page and the root page, skipping at least one internal page level. The existing bt_index_parent_check() checks never cross more than a single level.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-Wz=yTWnVu+HeHGKb2AGiADL9eprn-cKYAto4MkKOuiGtRQ@mail.gmail.com
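Invocation sketch (index name hypothetical); the new boolean argument enables the exhaustive root-descent pass:

    CREATE EXTENSION IF NOT EXISTS amcheck;

    -- rootdescend is the new option; heapallindexed is the pre-existing
    -- heap/index agreement check.
    SELECT bt_index_parent_check('my_unique_idx'::regclass,
                                 heapallindexed => true,
                                 rootdescend => true);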
-
Peter Geoghegan authored
Teach nbtree to give some consideration to how "distinguishing" candidate leaf page split points are. This should not noticeably affect the balance of free space within each half of the split, while still making suffix truncation truncate away significantly more attributes on average.

The logic for choosing a leaf split point now uses a fallback mode in the case where the page is full of duplicates and it isn't possible to find even a minimally distinguishing split point. When the page is full of duplicates, the split should pack the left half very tightly, while leaving the right half mostly empty. Our assumption is that logical duplicates will almost always be inserted in ascending heap TID order with v4 indexes. This strategy leaves most of the free space on the half of the split that will likely be where future logical duplicates of the same value need to be placed.

The number of cycles added is not very noticeable. This is important because deciding on a split point takes place while at least one exclusive buffer lock is held. We avoid using authoritative insertion scankey comparisons to save cycles, unlike suffix truncation proper. We use a faster binary comparison instead.

Note that even pg_upgrade'd v3 indexes make use of these optimizations. Benchmarking has shown that even v3 indexes benefit, despite the fact that suffix truncation will only truncate non-key attributes in INCLUDE indexes. Grouping relatively similar tuples together is beneficial in and of itself, since it reduces the number of leaf pages that must be accessed by subsequent index scans.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com
-
Peter Geoghegan authored
Make nbtree treat all index tuples as having a heap TID attribute. Index searches can distinguish duplicates by heap TID, since heap TID is always guaranteed to be unique. This general approach has numerous benefits for performance, and is prerequisite to teaching VACUUM to perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable overhead (it bloats internal pages), so suffix truncation of pivot tuples is added. This will usually truncate away the "extra" heap TID attribute from pivot tuples during a leaf page split, and may also truncate away additional user attributes. This can increase fan-out, especially in a multi-column index. Truncation can only occur at the attribute granularity, which isn't particularly effective, but works well enough for now. A future patch may add support for truncating "within" text attributes by generating truncated key values using new opclass infrastructure.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that treat heap TID as a tiebreaker attribute, or will have pivot tuples undergo suffix truncation during a leaf page split (on-disk compatibility with versions 2 and 3 is preserved). Upgrades to version 4 cannot be performed on-the-fly, unlike upgrades from version 2 to version 3. contrib/amcheck continues to work with version 2 and 3 indexes, while also enforcing stricter invariants when verifying version 4 indexes. These stricter invariants are the same invariants described by "3.1.12 Sequencing" from the Lehman and Yao paper.

A later patch will enhance the logic used by nbtree to pick a split point. This patch is likely to negatively impact performance without smarter choices around the precise point to split leaf pages at. Making these two mostly-distinct sets of enhancements into distinct commits seems like it might clarify their design, even though neither commit is particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to the space required to store an extra MAXALIGN()'d TID in a new high key during leaf page splits. The user-facing definition of the "1/3 of a page" restriction is already imprecise, and so does not need to be revised. However, there should be a compatibility note in the v12 release notes.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
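To see which on-disk version a given B-Tree index uses, pageinspect's bt_metap() reports it (a usage sketch; the index name is hypothetical):

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- "version" is 4 for indexes built after this commit; pg_upgrade'd
    -- v2/v3 indexes keep their old version until reindexed.
    SELECT version FROM bt_metap('my_idx');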
-