Commit e5d8a999 authored by Peter Geoghegan

Use full 64-bit XIDs in deleted nbtree pages.

Otherwise we risk "leaking" deleted pages by making them non-recyclable
indefinitely.  Commit 6655a729 did the same thing for deleted pages in
GiST indexes.  That work was used as a starting point here.
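
As a rough illustration of the wraparound hazard (a sketch only, not code
from this commit; the xid32/xid64 types and helper names are hypothetical
stand-ins for TransactionId and FullTransactionId), a circular 32-bit
comparison stops saying "older" once more than about 2 billion XIDs have
passed, whereas an epoch-qualified 64-bit value keeps a total order:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for TransactionId and FullTransactionId. */
typedef uint32_t xid32;
typedef uint64_t xid64;

/*
 * Circular 32-bit comparison: "a precedes b" when the modulo-2^32
 * difference, read as a signed value, is negative.  Only meaningful
 * while a and b are less than 2^31 XIDs apart.
 */
static bool
xid32_precedes(xid32 a, xid32 b)
{
    return (int32_t) (a - b) < 0;
}

/* Build a 64-bit XID with the epoch in the high 32 bits. */
static xid64
make_xid64(uint32_t epoch, xid32 xid)
{
    return ((xid64) epoch << 32) | xid;
}

/* 64-bit XIDs never wrap in practice, so they compare as integers. */
static bool
xid64_precedes(xid64 a, xid64 b)
{
    return a < b;
}
```

With a page deleted at XID 100 and a current XID a little more than 2^31
later, xid32_precedes() no longer reports the deletion XID as older, so
the page would look non-recyclable; xid64_precedes() still does.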

Stop storing an XID indicating the oldest btpo.xact across all deleted
(though unrecycled) pages in nbtree metapages.  There is no longer any
reason to track that XID.  It only ever made sense when wraparound was
something _bt_vacuum_needs_cleanup() had to consider.

The btm_oldest_btpo_xact metapage field has been repurposed and renamed.
It is now btm_last_cleanup_num_delpages, which is used to remember how
many non-recycled deleted pages remain from the last VACUUM (in practice
its value is usually the precise number of pages that were _newly
deleted_ during the specific VACUUM operation that last set the field).

The general idea behind storing btm_last_cleanup_num_delpages is to use
it to give _some_ consideration to non-recycled deleted pages inside
_bt_vacuum_needs_cleanup() -- though never too much.  We only really
need to avoid leaving a truly excessive number of deleted pages in an
unrecycled state forever.  We only do this to cover certain narrow cases
where no other factor makes VACUUM do a full scan, and yet the index
continues to grow (and so actually misses out on recycling existing
deleted pages).
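
The kind of test this implies can be sketched as follows.  This is only
an illustration under stated assumptions: the function name is
hypothetical, and the 1/20 threshold is an assumed value chosen for the
example rather than something stated in this message.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed threshold for illustration: ask for a full index scan only
 * when the deleted pages remembered from the last VACUUM exceed
 * 1/DELPAGES_FRACTION of the index.
 */
#define DELPAGES_FRACTION 20

/* Hypothetical helper; both arguments are counts of index blocks. */
static bool
delpages_warrant_full_scan(uint32_t last_cleanup_num_delpages,
                           uint32_t index_num_blocks)
{
    return last_cleanup_num_delpages > 0 &&
        last_cleanup_num_delpages > index_num_blocks / DELPAGES_FRACTION;
}
```

Under these assumptions, 2 remembered deleted pages in a 1,000,000 block
index do not trigger a scan, while 60,000 do.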

These metapage changes result in a clear user-visible benefit: We no
longer trigger full index scans during VACUUM operations solely due to
the presence of only 1 or 2 known deleted (though unrecycled) blocks
from a very large index.  All that matters now is keeping the costs and
benefits in balance over time.

Fix an issue that has been around since commit 857f9c36, which added the
"skip full scan of index" mechanism (i.e. the _bt_vacuum_needs_cleanup()
logic).  The accuracy of btm_last_cleanup_num_heap_tuples accidentally
hinged upon _when_ the source value gets stored.  We now always store
btm_last_cleanup_num_heap_tuples in btvacuumcleanup().  This fixes the
issue because IndexVacuumInfo.num_heap_tuples (the source field) is
expected to accurately indicate the state of the table _after_ the
VACUUM completes inside btvacuumcleanup().
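
To see why the stored value's accuracy matters, consider a simplified
staleness test of the kind the documentation describes (a sketch with a
hypothetical function name; the real check may differ in detail).  The
stored count is the denominator, so a value captured at the wrong time
skews every later decision:

```c
#include <stdbool.h>

/*
 * Hypothetical helper: statistics count as stale when the fraction of
 * newly inserted heap tuples, relative to the count remembered from the
 * previous cleanup, exceeds scale_factor.  A non-positive remembered
 * count (for example -1 in a new index) means nothing usable is stored.
 */
static bool
index_stats_are_stale(double prev_num_heap_tuples,
                      double cur_num_heap_tuples,
                      double scale_factor)
{
    if (prev_num_heap_tuples <= 0.0)
        return true;

    return (cur_num_heap_tuples - prev_num_heap_tuples) /
        prev_num_heap_tuples > scale_factor;
}
```

For example, with a scale factor of 0.1, growth from 1000 to 1050 tuples
is not stale, but growth from 1000 to 1200 is.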

A backpatchable fix cannot easily be extracted from this commit.  A
targeted fix for the issue will follow in a later commit, though that
won't happen today.

I (pgeoghegan) have chosen to remove any mention of deleted pages in the
documentation of the vacuum_cleanup_index_scale_factor GUC/param, since
the presence of deleted (though unrecycled) pages is no longer of much
concern to users.  The vacuum_cleanup_index_scale_factor description in
the docs now seems rather unclear in any case, and it should probably be
rewritten in the near future.  Perhaps some passing mention of page
deletion will be added back at the same time.

Bump XLOG_PAGE_MAGIC due to nbtree WAL records using full XIDs now.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznpdHvujGUwYZ8sihX=d5u-tRYhi-F4wnV2uN2zHpMUXw@mail.gmail.com
parent 8a4f9522
@@ -769,7 +769,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
 										 P_FIRSTDATAKEY(opaque));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
 			nextleveldown.leftmost = BTreeTupleGetDownLink(itup);
-			nextleveldown.level = opaque->btpo.level - 1;
+			nextleveldown.level = opaque->btpo_level - 1;
 		}
 		else
 		{
@@ -794,14 +794,14 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
 		if (opaque->btpo_prev != leftcurrent)
 			bt_recheck_sibling_links(state, opaque->btpo_prev, leftcurrent);
-		/* Check level, which must be valid for non-ignorable page */
-		if (level.level != opaque->btpo.level)
+		/* Check level */
+		if (level.level != opaque->btpo_level)
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("leftmost down link for level points to block in index \"%s\" whose level is not one level down",
 							RelationGetRelationName(state->rel)),
 					 errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
-										current, level.level, opaque->btpo.level)));
+										current, level.level, opaque->btpo_level)));
 		/* Verify invariants for page */
 		bt_target_page_check(state);
@@ -1164,7 +1164,7 @@ bt_target_page_check(BtreeCheckState *state)
 				bt_child_highkey_check(state,
 									   offset,
 									   NULL,
-									   topaque->btpo.level);
+									   topaque->btpo_level);
 			}
 			continue;
 		}
@@ -1520,7 +1520,7 @@ bt_target_page_check(BtreeCheckState *state)
 	if (!P_ISLEAF(topaque) && P_RIGHTMOST(topaque) && state->readonly)
 	{
 		bt_child_highkey_check(state, InvalidOffsetNumber,
-							   NULL, topaque->btpo.level);
+							   NULL, topaque->btpo_level);
 	}
 }
@@ -1597,7 +1597,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 		ereport(DEBUG1,
 				(errcode(ERRCODE_NO_DATA),
 				 errmsg_internal("level %u leftmost page of index \"%s\" was found deleted or half dead",
-								 opaque->btpo.level, RelationGetRelationName(state->rel)),
+								 opaque->btpo_level, RelationGetRelationName(state->rel)),
 				 errdetail_internal("Deleted page found when building scankey from right sibling.")));
 		/* Be slightly more pro-active in freeing this memory, just in case */
@@ -1900,14 +1900,15 @@ bt_child_highkey_check(BtreeCheckState *state,
 									state->targetblock, blkno,
 									LSN_FORMAT_ARGS(state->targetlsn))));
-		/* Check level for non-ignorable page */
-		if (!P_IGNORE(opaque) && opaque->btpo.level != target_level - 1)
+		/* Do level sanity check */
+		if ((!P_ISDELETED(opaque) || P_HAS_FULLXID(opaque)) &&
+			opaque->btpo_level != target_level - 1)
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("block found while following rightlinks from child of index \"%s\" has invalid level",
 							RelationGetRelationName(state->rel)),
 					 errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
-										blkno, target_level - 1, opaque->btpo.level)));
+										blkno, target_level - 1, opaque->btpo_level)));
 		/* Try to detect circular links */
 		if ((!first && blkno == state->prevrightlink) || blkno == opaque->btpo_prev)
@@ -2132,7 +2133,7 @@ bt_child_check(BtreeCheckState *state, BTScanInsert targetkey,
 	 * check for downlink connectivity.
 	 */
 	bt_child_highkey_check(state, downlinkoffnum,
-						   child, topaque->btpo.level);
+						   child, topaque->btpo_level);
 	/*
 	 * Since there cannot be a concurrent VACUUM operation in readonly mode,
@@ -2275,7 +2276,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
 				 errmsg_internal("harmless interrupted page split detected in index %s",
 								 RelationGetRelationName(state->rel)),
 				 errdetail_internal("Block=%u level=%u left sibling=%u page lsn=%X/%X.",
-									blkno, opaque->btpo.level,
+									blkno, opaque->btpo_level,
 									opaque->btpo_prev,
 									LSN_FORMAT_ARGS(pagelsn))));
 		return;
@@ -2304,7 +2305,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
 	elog(DEBUG1, "checking for interrupted multi-level deletion due to missing downlink in index \"%s\"",
 		 RelationGetRelationName(state->rel));
-	level = opaque->btpo.level;
+	level = opaque->btpo_level;
 	itemid = PageGetItemIdCareful(state, blkno, page, P_FIRSTDATAKEY(opaque));
 	itup = (IndexTuple) PageGetItem(page, itemid);
 	childblk = BTreeTupleGetDownLink(itup);
@@ -2319,16 +2320,16 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
 			break;
 		/* Do an extra sanity check in passing on internal pages */
-		if (copaque->btpo.level != level - 1)
+		if (copaque->btpo_level != level - 1)
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg_internal("downlink points to block in index \"%s\" whose level is not one level down",
 									 RelationGetRelationName(state->rel)),
 					 errdetail_internal("Top parent/under check block=%u block pointed to=%u expected level=%u level in pointed to block=%u.",
 										blkno, childblk,
-										level - 1, copaque->btpo.level)));
+										level - 1, copaque->btpo_level)));
-		level = copaque->btpo.level;
+		level = copaque->btpo_level;
 		itemid = PageGetItemIdCareful(state, childblk, child,
 									  P_FIRSTDATAKEY(copaque));
 		itup = (IndexTuple) PageGetItem(child, itemid);
@@ -2389,7 +2390,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
 			 errmsg("internal index block lacks downlink in index \"%s\"",
 					RelationGetRelationName(state->rel)),
 			 errdetail_internal("Block=%u level=%u page lsn=%X/%X.",
-								blkno, opaque->btpo.level,
+								blkno, opaque->btpo_level,
 								LSN_FORMAT_ARGS(pagelsn))));
 }
@@ -2983,21 +2984,28 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 	}
 	/*
-	 * Deleted pages have no sane "level" field, so can only check non-deleted
-	 * page level
+	 * Deleted pages that still use the old 32-bit XID representation have no
+	 * sane "level" field because they type pun the field, but all other pages
+	 * (including pages deleted on Postgres 14+) have a valid value.
 	 */
-	if (P_ISLEAF(opaque) && !P_ISDELETED(opaque) && opaque->btpo.level != 0)
-		ereport(ERROR,
-				(errcode(ERRCODE_INDEX_CORRUPTED),
-				 errmsg("invalid leaf page level %u for block %u in index \"%s\"",
-						opaque->btpo.level, blocknum, RelationGetRelationName(state->rel))));
-	if (!P_ISLEAF(opaque) && !P_ISDELETED(opaque) &&
-		opaque->btpo.level == 0)
-		ereport(ERROR,
-				(errcode(ERRCODE_INDEX_CORRUPTED),
-				 errmsg("invalid internal page level 0 for block %u in index \"%s\"",
-						blocknum, RelationGetRelationName(state->rel))));
+	if (!P_ISDELETED(opaque) || P_HAS_FULLXID(opaque))
+	{
+		/* Okay, no reason not to trust btpo_level field from page */
+		if (P_ISLEAF(opaque) && opaque->btpo_level != 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg_internal("invalid leaf page level %u for block %u in index \"%s\"",
+									 opaque->btpo_level, blocknum,
+									 RelationGetRelationName(state->rel))));
+		if (!P_ISLEAF(opaque) && opaque->btpo_level == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg_internal("invalid internal page level 0 for block %u in index \"%s\"",
+									 blocknum,
+									 RelationGetRelationName(state->rel))));
+	}
 	/*
 	 * Sanity checks for number of items on page.
@@ -3044,8 +3052,6 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 	 * state. This state is nonetheless treated as corruption by VACUUM on
 	 * from version 9.4 on, so do the same here. See _bt_pagedel() for full
 	 * details.
-	 *
-	 * Internal pages should never have garbage items, either.
 	 */
 	if (!P_ISLEAF(opaque) && P_ISHALFDEAD(opaque))
 		ereport(ERROR,
@@ -3054,11 +3060,27 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 						blocknum, RelationGetRelationName(state->rel)),
 				 errhint("This can be caused by an interrupted VACUUM in version 9.3 or older, before upgrade. Please REINDEX it.")));
+	/*
+	 * Check that internal pages have no garbage items, and that no page has
+	 * an invalid combination of deletion-related page level flags
+	 */
 	if (!P_ISLEAF(opaque) && P_HAS_GARBAGE(opaque))
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
-				 errmsg("internal page block %u in index \"%s\" has garbage items",
-						blocknum, RelationGetRelationName(state->rel))));
+				 errmsg_internal("internal page block %u in index \"%s\" has garbage items",
+								 blocknum, RelationGetRelationName(state->rel))));
+	if (P_HAS_FULLXID(opaque) && !P_ISDELETED(opaque))
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("full transaction id page flag appears in non-deleted block %u in index \"%s\"",
+								 blocknum, RelationGetRelationName(state->rel))));
+	if (P_ISDELETED(opaque) && P_ISHALFDEAD(opaque))
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("deleted page block %u in index \"%s\" is half-dead",
+								 blocknum, RelationGetRelationName(state->rel))));
 	return page;
 }
......
@@ -75,11 +75,7 @@ typedef struct BTPageStat
 	/* opaque data */
 	BlockNumber btpo_prev;
 	BlockNumber btpo_next;
-	union
-	{
-		uint32		level;
-		TransactionId xact;
-	}			btpo;
+	uint32		btpo_level;
 	uint16		btpo_flags;
 	BTCycleId	btpo_cycleid;
 } BTPageStat;
@@ -112,9 +108,33 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
 	/* page type (flags) */
 	if (P_ISDELETED(opaque))
 	{
-		stat->type = 'd';
-		stat->btpo.xact = opaque->btpo.xact;
-		return;
+		/* We divide deleted pages into leaf ('d') or internal ('D') */
+		if (P_ISLEAF(opaque) || !P_HAS_FULLXID(opaque))
+			stat->type = 'd';
+		else
+			stat->type = 'D';
+		/*
+		 * Report safexid in a deleted page.
+		 *
+		 * Handle pg_upgrade'd deleted pages that used the previous safexid
+		 * representation in btpo_level field (this used to be a union type
+		 * called "bpto").
+		 */
+		if (P_HAS_FULLXID(opaque))
+		{
+			FullTransactionId safexid = BTPageGetDeleteXid(page);
+			elog(NOTICE, "deleted page from block %u has safexid %u:%u",
+				 blkno, EpochFromFullTransactionId(safexid),
+				 XidFromFullTransactionId(safexid));
+		}
+		else
+			elog(NOTICE, "deleted page from block %u has safexid %u",
+				 blkno, opaque->btpo_level);
+		/* Don't interpret BTDeletedPageData as index tuples */
+		maxoff = InvalidOffsetNumber;
 	}
 	else if (P_IGNORE(opaque))
 		stat->type = 'e';
@@ -128,7 +148,7 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
 	/* btpage opaque data */
 	stat->btpo_prev = opaque->btpo_prev;
 	stat->btpo_next = opaque->btpo_next;
-	stat->btpo.level = opaque->btpo.level;
+	stat->btpo_level = opaque->btpo_level;
 	stat->btpo_flags = opaque->btpo_flags;
 	stat->btpo_cycleid = opaque->btpo_cycleid;
@@ -237,7 +257,7 @@ bt_page_stats_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
 	values[j++] = psprintf("%u", stat.free_size);
 	values[j++] = psprintf("%u", stat.btpo_prev);
 	values[j++] = psprintf("%u", stat.btpo_next);
-	values[j++] = psprintf("%u", (stat.type == 'd') ? stat.btpo.xact : stat.btpo.level);
+	values[j++] = psprintf("%u", stat.btpo_level);
 	values[j++] = psprintf("%d", stat.btpo_flags);
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
@@ -503,10 +523,14 @@ bt_page_items_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
 		opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);
-		if (P_ISDELETED(opaque))
-			elog(NOTICE, "page is deleted");
-		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		if (!P_ISDELETED(opaque))
+			fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		else
+		{
+			/* Don't interpret BTDeletedPageData as index tuples */
+			elog(NOTICE, "page from block " INT64_FORMAT " is deleted", blkno);
+			fctx->max_calls = 0;
+		}
 		uargs->leafpage = P_ISLEAF(opaque);
 		uargs->rightmost = P_RIGHTMOST(opaque);
@@ -603,7 +627,14 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 		if (P_ISDELETED(opaque))
 			elog(NOTICE, "page is deleted");
-		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		if (!P_ISDELETED(opaque))
+			fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		else
+		{
+			/* Don't interpret BTDeletedPageData as index tuples */
+			elog(NOTICE, "page from block is deleted");
+			fctx->max_calls = 0;
+		}
 		uargs->leafpage = P_ISLEAF(opaque);
 		uargs->rightmost = P_RIGHTMOST(opaque);
@@ -692,10 +723,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	/*
 	 * We need a kluge here to detect API versions prior to 1.8. Earlier
-	 * versions incorrectly used int4 for certain columns. This caused
-	 * various problems. For example, an int4 version of the "oldest_xact"
-	 * column would not work with TransactionId values that happened to exceed
-	 * PG_INT32_MAX.
+	 * versions incorrectly used int4 for certain columns.
 	 *
 	 * There is no way to reliably avoid the problems created by the old
 	 * function definition at this point, so insist that the user update the
@@ -723,7 +751,8 @@ bt_metap(PG_FUNCTION_ARGS)
 	 */
 	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
-		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
+		values[j++] = psprintf(INT64_FORMAT,
+							   (int64) metad->btm_last_cleanup_num_delpages);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
 		values[j++] = metad->btm_allequalimage ? "t" : "f";
 	}
......
@@ -3,16 +3,16 @@ INSERT INTO test1 VALUES (72057594037927937, 'text');
 CREATE INDEX test1_a_idx ON test1 USING btree (a);
 \x
 SELECT * FROM bt_metap('test1_a_idx');
--[ RECORD 1 ]-----------+-------
-magic                   | 340322
-version                 | 4
-root                    | 1
-level                   | 0
-fastroot                | 1
-fastlevel               | 0
-oldest_xact             | 0
-last_cleanup_num_tuples | -1
-allequalimage           | t
+-[ RECORD 1 ]-------------+-------
+magic                     | 340322
+version                   | 4
+root                      | 1
+level                     | 0
+fastroot                  | 1
+fastlevel                 | 0
+last_cleanup_num_delpages | 0
+last_cleanup_num_tuples   | -1
+allequalimage             | t
 SELECT * FROM bt_page_stats('test1_a_idx', -1);
 ERROR:  invalid block number
@@ -29,7 +29,7 @@ page_size | 8192
 free_size     | 8128
 btpo_prev     | 0
 btpo_next     | 0
-btpo          | 0
+btpo_level    | 0
 btpo_flags    | 3
 SELECT * FROM bt_page_stats('test1_a_idx', 2);
......
@@ -66,6 +66,23 @@ RETURNS smallint
 AS 'MODULE_PATHNAME', 'page_checksum_1_9'
 LANGUAGE C STRICT PARALLEL SAFE;
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int8,
+    OUT level int8,
+    OUT fastroot int8,
+    OUT fastlevel int8,
+    OUT last_cleanup_num_delpages int8,
+    OUT last_cleanup_num_tuples float8,
+    OUT allequalimage boolean)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
 --
 -- bt_page_stats()
 --
@@ -80,7 +97,7 @@ CREATE FUNCTION bt_page_stats(IN relname text, IN blkno int8,
     OUT free_size int4,
     OUT btpo_prev int8,
     OUT btpo_next int8,
-    OUT btpo int4,
+    OUT btpo_level int8,
     OUT btpo_flags int4)
 AS 'MODULE_PATHNAME', 'bt_page_stats_1_9'
 LANGUAGE C STRICT PARALLEL SAFE;
......
@@ -283,8 +283,12 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 		page = BufferGetPage(buffer);
 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-		/* Determine page type, and update totals */
+		/*
+		 * Determine page type, and update totals.
+		 *
+		 * Note that we arbitrarily bucket deleted pages together without
+		 * considering if they're leaf pages or internal pages.
+		 */
 		if (P_ISDELETED(opaque))
 			indexStat.deleted_pages++;
 		else if (P_IGNORE(opaque))
......
@@ -8529,11 +8529,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        <para>
         If no tuples were deleted from the heap, B-tree indexes are still
-        scanned at the <command>VACUUM</command> cleanup stage when at least one
-        of the following conditions is met: the index statistics are stale, or
-        the index contains deleted pages that can be recycled during cleanup.
-        Index statistics are considered to be stale if the number of newly
-        inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
+        scanned at the <command>VACUUM</command> cleanup stage when the
+        index's statistics are stale. Index statistics are considered
+        stale if the number of newly inserted tuples exceeds the
+        <varname>vacuum_cleanup_index_scale_factor</varname>
         fraction of the total number of heap tuples detected by the previous
         statistics collection. The total number of heap tuples is stored in
         the index meta-page. Note that the meta-page does not include this data
......
@@ -298,16 +298,16 @@ test=# SELECT t_ctid, raw_flags, combined_flags
       index's metapage. For example:
 <screen>
 test=# SELECT * FROM bt_metap('pg_cast_oid_index');
--[ RECORD 1 ]-----------+-------
-magic                   | 340322
-version                 | 4
-root                    | 1
-level                   | 0
-fastroot                | 1
-fastlevel               | 0
-oldest_xact             | 582
-last_cleanup_num_tuples | 1000
-allequalimage           | f
+-[ RECORD 1 ]-------------+-------
+magic                     | 340322
+version                   | 4
+root                      | 1
+level                     | 0
+fastroot                  | 1
+fastlevel                 | 0
+last_cleanup_num_delpages | 0
+last_cleanup_num_tuples   | 230
+allequalimage             | f
 </screen>
      </para>
     </listitem>
@@ -337,7 +337,7 @@ page_size | 8192
 free_size     | 3668
 btpo_prev     | 0
 btpo_next     | 0
-btpo          | 0
+btpo_level    | 0
 btpo_flags    | 3
 </screen>
      </para>
......
@@ -394,28 +394,8 @@ gistRedoPageReuse(XLogReaderState *record)
 	 * same exclusion effect on primary and standby.
 	 */
 	if (InHotStandby)
-	{
-		FullTransactionId latestRemovedFullXid = xlrec->latestRemovedFullXid;
-		FullTransactionId nextXid = ReadNextFullTransactionId();
-		uint64		diff;
-		/*
-		 * ResolveRecoveryConflictWithSnapshot operates on 32-bit
-		 * TransactionIds, so truncate the logged FullTransactionId. If the
-		 * logged value is very old, so that XID wrap-around already happened
-		 * on it, there can't be any snapshots that still see it.
-		 */
-		diff = U64FromFullTransactionId(nextXid) -
-			U64FromFullTransactionId(latestRemovedFullXid);
-		if (diff < MaxTransactionId / 2)
-		{
-			TransactionId latestRemovedXid;
-			latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
-			ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
-												xlrec->node);
-		}
-	}
+		ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
+												   xlrec->node);
 }
 void
......
@@ -1241,7 +1241,7 @@ _bt_insertonpg(Relation rel,
 			metapg = BufferGetPage(metabuf);
 			metad = BTPageGetMeta(metapg);
-			if (metad->btm_fastlevel >= opaque->btpo.level)
+			if (metad->btm_fastlevel >= opaque->btpo_level)
 			{
 				/* no update wanted */
 				_bt_relbuf(rel, metabuf);
@@ -1268,7 +1268,7 @@ _bt_insertonpg(Relation rel,
 			if (metad->btm_version < BTREE_NOVAC_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = BufferGetBlockNumber(buf);
-			metad->btm_fastlevel = opaque->btpo.level;
+			metad->btm_fastlevel = opaque->btpo_level;
 			MarkBufferDirty(metabuf);
 		}
@@ -1331,7 +1331,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
 				xlmeta.fastlevel = metad->btm_fastlevel;
-				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+				xlmeta.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
 				xlmeta.allequalimage = metad->btm_allequalimage;
@@ -1537,7 +1537,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
 	lopaque->btpo_prev = oopaque->btpo_prev;
 	/* handle btpo_next after rightpage buffer acquired */
-	lopaque->btpo.level = oopaque->btpo.level;
+	lopaque->btpo_level = oopaque->btpo_level;
 	/* handle btpo_cycleid after rightpage buffer acquired */
 	/*
@@ -1722,7 +1722,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	ropaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
 	ropaque->btpo_prev = origpagenumber;
 	ropaque->btpo_next = oopaque->btpo_next;
-	ropaque->btpo.level = oopaque->btpo.level;
+	ropaque->btpo_level = oopaque->btpo_level;
 	ropaque->btpo_cycleid = lopaque->btpo_cycleid;
 	/*
@@ -1950,7 +1950,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		xlrec.level = ropaque->btpo.level;
+		xlrec.level = ropaque->btpo_level;
 		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstrightoff = firstrightoff;
 		xlrec.newitemoff = newitemoff;
...@@ -2142,7 +2142,7 @@ _bt_insert_parent(Relation rel, ...@@ -2142,7 +2142,7 @@ _bt_insert_parent(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel)))); BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* Find the leftmost page at the next level up */ /* Find the leftmost page at the next level up */
pbuf = _bt_get_endpoint(rel, opaque->btpo.level + 1, false, NULL); pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
/* Set up a phony stack entry pointing there */ /* Set up a phony stack entry pointing there */
stack = &fakestack; stack = &fakestack;
stack->bts_blkno = BufferGetBlockNumber(pbuf); stack->bts_blkno = BufferGetBlockNumber(pbuf);
...@@ -2480,15 +2480,15 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf) ...@@ -2480,15 +2480,15 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage); rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE; rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags = BTP_ROOT; rootopaque->btpo_flags = BTP_ROOT;
rootopaque->btpo.level = rootopaque->btpo_level =
((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo.level + 1; ((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo_level + 1;
rootopaque->btpo_cycleid = 0; rootopaque->btpo_cycleid = 0;
/* update metapage data */ /* update metapage data */
metad->btm_root = rootblknum; metad->btm_root = rootblknum;
metad->btm_level = rootopaque->btpo.level; metad->btm_level = rootopaque->btpo_level;
metad->btm_fastroot = rootblknum; metad->btm_fastroot = rootblknum;
metad->btm_fastlevel = rootopaque->btpo.level; metad->btm_fastlevel = rootopaque->btpo_level;
/* /*
* Insert the left page pointer into the new root page. The root page is * Insert the left page pointer into the new root page. The root page is
...@@ -2548,7 +2548,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf) ...@@ -2548,7 +2548,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.level = metad->btm_level; md.level = metad->btm_level;
md.fastroot = rootblknum; md.fastroot = rootblknum;
md.fastlevel = metad->btm_level; md.fastlevel = metad->btm_level;
md.oldest_btpo_xact = metad->btm_oldest_btpo_xact; md.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples; md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
md.allequalimage = metad->btm_allequalimage; md.allequalimage = metad->btm_allequalimage;
......
...@@ -169,7 +169,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access, ...@@ -169,7 +169,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* we're on the level 1 and asked to lock leaf page in write mode, * we're on the level 1 and asked to lock leaf page in write mode,
* then lock next page in write mode, because it must be a leaf. * then lock next page in write mode, because it must be a leaf.
*/ */
if (opaque->btpo.level == 1 && access == BT_WRITE) if (opaque->btpo_level == 1 && access == BT_WRITE)
page_access = BT_WRITE; page_access = BT_WRITE;
/* drop the read lock on the page, then acquire one on its child */ /* drop the read lock on the page, then acquire one on its child */
...@@ -2341,9 +2341,9 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost, ...@@ -2341,9 +2341,9 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
} }
/* Done? */ /* Done? */
if (opaque->btpo.level == level) if (opaque->btpo_level == level)
break; break;
if (opaque->btpo.level < level) if (opaque->btpo_level < level)
ereport(ERROR, ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED), (errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("btree level %u not found in index \"%s\"", errmsg_internal("btree level %u not found in index \"%s\"",
......
...@@ -620,7 +620,7 @@ _bt_blnewpage(uint32 level) ...@@ -620,7 +620,7 @@ _bt_blnewpage(uint32 level)
/* Initialize BT opaque state */ /* Initialize BT opaque state */
opaque = (BTPageOpaque) PageGetSpecialPointer(page); opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_prev = opaque->btpo_next = P_NONE; opaque->btpo_prev = opaque->btpo_next = P_NONE;
opaque->btpo.level = level; opaque->btpo_level = level;
opaque->btpo_flags = (level > 0) ? 0 : BTP_LEAF; opaque->btpo_flags = (level > 0) ? 0 : BTP_LEAF;
opaque->btpo_cycleid = 0; opaque->btpo_cycleid = 0;
......
...@@ -112,7 +112,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id) ...@@ -112,7 +112,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
md->btm_fastlevel = xlrec->fastlevel; md->btm_fastlevel = xlrec->fastlevel;
/* Cannot log BTREE_MIN_VERSION index metapage without upgrade */ /* Cannot log BTREE_MIN_VERSION index metapage without upgrade */
Assert(md->btm_version >= BTREE_NOVAC_VERSION); Assert(md->btm_version >= BTREE_NOVAC_VERSION);
md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact; md->btm_last_cleanup_num_delpages = xlrec->last_cleanup_num_delpages;
md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples; md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
md->btm_allequalimage = xlrec->allequalimage; md->btm_allequalimage = xlrec->allequalimage;
...@@ -297,7 +297,7 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record) ...@@ -297,7 +297,7 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record)
ropaque->btpo_prev = origpagenumber; ropaque->btpo_prev = origpagenumber;
ropaque->btpo_next = spagenumber; ropaque->btpo_next = spagenumber;
ropaque->btpo.level = xlrec->level; ropaque->btpo_level = xlrec->level;
ropaque->btpo_flags = isleaf ? BTP_LEAF : 0; ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
ropaque->btpo_cycleid = 0; ropaque->btpo_cycleid = 0;
...@@ -773,7 +773,7 @@ btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record) ...@@ -773,7 +773,7 @@ btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record)
pageop->btpo_prev = xlrec->leftblk; pageop->btpo_prev = xlrec->leftblk;
pageop->btpo_next = xlrec->rightblk; pageop->btpo_next = xlrec->rightblk;
pageop->btpo.level = 0; pageop->btpo_level = 0;
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF; pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_cycleid = 0; pageop->btpo_cycleid = 0;
...@@ -802,6 +802,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record) ...@@ -802,6 +802,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record); xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
BlockNumber leftsib; BlockNumber leftsib;
BlockNumber rightsib; BlockNumber rightsib;
uint32 level;
bool isleaf;
FullTransactionId safexid;
Buffer leftbuf; Buffer leftbuf;
Buffer target; Buffer target;
Buffer rightbuf; Buffer rightbuf;
...@@ -810,6 +813,12 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record) ...@@ -810,6 +813,12 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
leftsib = xlrec->leftsib; leftsib = xlrec->leftsib;
rightsib = xlrec->rightsib; rightsib = xlrec->rightsib;
level = xlrec->level;
isleaf = (level == 0);
safexid = xlrec->safexid;
/* No leaftopparent for level 0 (leaf page) or level 1 target */
Assert(xlrec->leaftopparent == InvalidBlockNumber || level > 1);
/* /*
* In normal operation, we would lock all the pages this WAL record * In normal operation, we would lock all the pages this WAL record
...@@ -844,9 +853,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record) ...@@ -844,9 +853,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_prev = leftsib; pageop->btpo_prev = leftsib;
pageop->btpo_next = rightsib; pageop->btpo_next = rightsib;
pageop->btpo.xact = xlrec->btpo_xact; pageop->btpo_level = level;
pageop->btpo_flags = BTP_DELETED; BTPageSetDeleted(page, safexid);
if (!BlockNumberIsValid(xlrec->topparent)) if (isleaf)
pageop->btpo_flags |= BTP_LEAF; pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0; pageop->btpo_cycleid = 0;
...@@ -892,6 +901,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record) ...@@ -892,6 +901,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
Buffer leafbuf; Buffer leafbuf;
IndexTupleData trunctuple; IndexTupleData trunctuple;
Assert(!isleaf);
leafbuf = XLogInitBufferForRedo(record, 3); leafbuf = XLogInitBufferForRedo(record, 3);
page = (Page) BufferGetPage(leafbuf); page = (Page) BufferGetPage(leafbuf);
...@@ -901,13 +912,13 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record) ...@@ -901,13 +912,13 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF; pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_prev = xlrec->leafleftsib; pageop->btpo_prev = xlrec->leafleftsib;
pageop->btpo_next = xlrec->leafrightsib; pageop->btpo_next = xlrec->leafrightsib;
pageop->btpo.level = 0; pageop->btpo_level = 0;
pageop->btpo_cycleid = 0; pageop->btpo_cycleid = 0;
/* Add a dummy hikey item */ /* Add a dummy hikey item */
MemSet(&trunctuple, 0, sizeof(IndexTupleData)); MemSet(&trunctuple, 0, sizeof(IndexTupleData));
trunctuple.t_info = sizeof(IndexTupleData); trunctuple.t_info = sizeof(IndexTupleData);
BTreeTupleSetTopParent(&trunctuple, xlrec->topparent); BTreeTupleSetTopParent(&trunctuple, xlrec->leaftopparent);
if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY, if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
false, false) == InvalidOffsetNumber) false, false) == InvalidOffsetNumber)
...@@ -942,7 +953,7 @@ btree_xlog_newroot(XLogReaderState *record) ...@@ -942,7 +953,7 @@ btree_xlog_newroot(XLogReaderState *record)
pageop->btpo_flags = BTP_ROOT; pageop->btpo_flags = BTP_ROOT;
pageop->btpo_prev = pageop->btpo_next = P_NONE; pageop->btpo_prev = pageop->btpo_next = P_NONE;
pageop->btpo.level = xlrec->level; pageop->btpo_level = xlrec->level;
if (xlrec->level == 0) if (xlrec->level == 0)
pageop->btpo_flags |= BTP_LEAF; pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0; pageop->btpo_cycleid = 0;
...@@ -963,26 +974,40 @@ btree_xlog_newroot(XLogReaderState *record) ...@@ -963,26 +974,40 @@ btree_xlog_newroot(XLogReaderState *record)
_bt_restore_meta(record, 2); _bt_restore_meta(record, 2);
} }
/*
* In general VACUUM must defer recycling as a way of avoiding certain race
* conditions. Deleted pages contain a safexid value that is used by VACUUM
* to determine whether or not it's safe to place a page that was deleted by
* VACUUM earlier into the FSM now. See nbtree/README.
*
* As far as any backend operating during original execution is concerned, the
* FSM is a cache of recycle-safe pages; the mere presence of the page in the
* FSM indicates that the page must already be safe to recycle (actually,
* _bt_getbuf() verifies it's safe using BTPageIsRecyclable(), but that's just
* because it would be unwise to completely trust the FSM, given its current
* limitations).
*
* This isn't sufficient to prevent similar concurrent recycling race
* conditions during Hot Standby, though. For that we need to log a
* xl_btree_reuse_page record at the point that a page is actually recycled
* and reused for an entirely unrelated page inside _bt_split(). These
* records include the same safexid value from the original deleted page,
* stored in the record's latestRemovedFullXid field.
*
* The GlobalVisCheckRemovableFullXid() test in BTPageIsRecyclable() is used
* to determine if it's safe to recycle a page. This mirrors our own test:
* the PGPROC->xmin > limitXmin test inside GetConflictingVirtualXIDs().
* Consequently, one XID value achieves the same exclusion effect on primary
* and standby.
*/
static void static void
btree_xlog_reuse_page(XLogReaderState *record) btree_xlog_reuse_page(XLogReaderState *record)
{ {
xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record); xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record);
/*
* Btree reuse_page records exist to provide a conflict point when we
* reuse pages in the index via the FSM. That's all they do though.
*
* latestRemovedXid was the page's btpo.xact. The
* GlobalVisCheckRemovableXid test in _bt_page_recyclable() conceptually
* mirrors the pgxact->xmin > limitXmin test in
* GetConflictingVirtualXIDs(). Consequently, one XID value achieves the
* same exclusion effect on primary and standby.
*/
if (InHotStandby) if (InHotStandby)
{ ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
xlrec->node);
}
} }
void void
......
...@@ -80,12 +80,13 @@ btree_desc(StringInfo buf, XLogReaderState *record) ...@@ -80,12 +80,13 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{ {
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec; xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec;
appendStringInfo(buf, "left %u; right %u; btpo_xact %u; ", appendStringInfo(buf, "left %u; right %u; level %u; safexid %u:%u; ",
xlrec->leftsib, xlrec->rightsib, xlrec->leftsib, xlrec->rightsib, xlrec->level,
xlrec->btpo_xact); EpochFromFullTransactionId(xlrec->safexid),
appendStringInfo(buf, "leafleft %u; leafright %u; topparent %u", XidFromFullTransactionId(xlrec->safexid));
appendStringInfo(buf, "leafleft %u; leafright %u; leaftopparent %u",
xlrec->leafleftsib, xlrec->leafrightsib, xlrec->leafleftsib, xlrec->leafrightsib,
xlrec->topparent); xlrec->leaftopparent);
break; break;
} }
case XLOG_BTREE_NEWROOT: case XLOG_BTREE_NEWROOT:
...@@ -99,9 +100,11 @@ btree_desc(StringInfo buf, XLogReaderState *record) ...@@ -99,9 +100,11 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{ {
xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) rec; xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) rec;
appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u", appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u:%u",
xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.spcNode, xlrec->node.dbNode,
xlrec->node.relNode, xlrec->latestRemovedXid); xlrec->node.relNode,
EpochFromFullTransactionId(xlrec->latestRemovedFullXid),
XidFromFullTransactionId(xlrec->latestRemovedFullXid));
break; break;
} }
case XLOG_BTREE_META_CLEANUP: case XLOG_BTREE_META_CLEANUP:
...@@ -110,8 +113,8 @@ btree_desc(StringInfo buf, XLogReaderState *record) ...@@ -110,8 +113,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0, xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0,
NULL); NULL);
appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f", appendStringInfo(buf, "last_cleanup_num_delpages %u; last_cleanup_num_heap_tuples: %f",
xlrec->oldest_btpo_xact, xlrec->last_cleanup_num_delpages,
xlrec->last_cleanup_num_heap_tuples); xlrec->last_cleanup_num_heap_tuples);
break; break;
} }
......
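The btree_desc() hunks above switch from printing a bare 32-bit XID to printing an "epoch:xid" pair. As a rough illustration of that output format only, here is a minimal standalone sketch. It assumes a full XID is a 64-bit value with the epoch in the high 32 bits and the XID in the low 32 bits; sketch_format_fullxid() is a hypothetical helper, not part of PostgreSQL.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a PostgreSQL function): split a 64-bit full XID
 * into its epoch (high 32 bits) and XID (low 32 bits) and print them in the
 * "epoch:xid" form that the new btree_desc() format strings use.
 */
static int
sketch_format_fullxid(char *buf, size_t buflen, uint64_t fullxid)
{
    uint32_t epoch = (uint32_t) (fullxid >> 32);
    uint32_t xid = (uint32_t) fullxid;

    return snprintf(buf, buflen, "safexid %u:%u",
                    (unsigned) epoch, (unsigned) xid);
}
```

Printing the two halves separately keeps the low 32 bits directly comparable with XIDs shown elsewhere in WAL descriptions, while the epoch disambiguates values across wraparound.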
...@@ -452,6 +452,34 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode ...@@ -452,6 +452,34 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
true); true);
} }
/*
* Variant of ResolveRecoveryConflictWithSnapshot that works with
* FullTransactionId values
*/
void
ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
RelFileNode node)
{
/*
* ResolveRecoveryConflictWithSnapshot operates on 32-bit TransactionIds,
* so truncate the logged FullTransactionId. If the logged value is very
* old, so that XID wrap-around already happened on it, there can't be any
* snapshots that still see it.
*/
FullTransactionId nextXid = ReadNextFullTransactionId();
uint64 diff;
diff = U64FromFullTransactionId(nextXid) -
U64FromFullTransactionId(latestRemovedFullXid);
if (diff < MaxTransactionId / 2)
{
TransactionId latestRemovedXid;
latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
ResolveRecoveryConflictWithSnapshot(latestRemovedXid, node);
}
}
void void
ResolveRecoveryConflictWithTablespace(Oid tsid) ResolveRecoveryConflictWithTablespace(Oid tsid)
{ {
......
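The wraparound guard in ResolveRecoveryConflictWithSnapshotFullXid() above can be looked at in isolation. The following is a minimal sketch under stated assumptions, not the PostgreSQL code: SketchFullXid and the sketch_* names are hypothetical stand-ins, a full XID is assumed to be a 64-bit counter (epoch in the high half, XID in the low half), and SKETCH_MAX_XID is assumed to be the largest 32-bit XID value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit full transaction ID */
typedef struct
{
    uint64_t value;             /* epoch in high 32 bits, XID in low 32 bits */
} SketchFullXid;

#define SKETCH_MAX_XID UINT32_C(0xFFFFFFFF)

/*
 * Decide whether a logged full XID is recent enough that a 32-bit conflict
 * check is still meaningful.  If it is, store the truncated 32-bit XID in
 * *xid_out and return true.  If the value is more than about 2^31
 * transactions behind "next", no snapshot can still see it, so the conflict
 * check can be skipped and we return false.
 */
static bool
sketch_needs_conflict_check(SketchFullXid next, SketchFullXid removed,
                            uint32_t *xid_out)
{
    uint64_t diff = next.value - removed.value;

    if (diff < SKETCH_MAX_XID / 2)
    {
        *xid_out = (uint32_t) removed.value;    /* drop the epoch */
        return true;
    }
    return false;
}
```

Because the subtraction is done on the full 64-bit values, a logged XID from the previous epoch that is only a few transactions old still yields a small difference, so it is truncated and checked rather than skipped.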
...@@ -37,8 +37,9 @@ typedef uint16 BTCycleId; ...@@ -37,8 +37,9 @@ typedef uint16 BTCycleId;
* *
* In addition, we store the page's btree level (counting upwards from * In addition, we store the page's btree level (counting upwards from
* zero at a leaf page) as well as some flag bits indicating the page type * zero at a leaf page) as well as some flag bits indicating the page type
* and status. If the page is deleted, we replace the level with the * and status. If the page is deleted, a BTDeletedPageData struct is stored
* next-transaction-ID value indicating when it is safe to reclaim the page. * in the page's tuple area, while a standard BTPageOpaqueData struct is
* stored in the page special area.
* *
* We also store a "vacuum cycle ID". When a page is split while VACUUM is * We also store a "vacuum cycle ID". When a page is split while VACUUM is
* processing the index, a nonzero value associated with the VACUUM run is * processing the index, a nonzero value associated with the VACUUM run is
...@@ -52,17 +53,17 @@ typedef uint16 BTCycleId; ...@@ -52,17 +53,17 @@ typedef uint16 BTCycleId;
* *
* NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested * NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
* instead. * instead.
*
* NOTE: the btpo_level field used to be a union type in order to allow
* deleted pages to store a 32-bit safexid in the same field. We now store
* 64-bit/full safexid values using BTDeletedPageData instead.
*/ */
typedef struct BTPageOpaqueData typedef struct BTPageOpaqueData
{ {
BlockNumber btpo_prev; /* left sibling, or P_NONE if leftmost */ BlockNumber btpo_prev; /* left sibling, or P_NONE if leftmost */
BlockNumber btpo_next; /* right sibling, or P_NONE if rightmost */ BlockNumber btpo_next; /* right sibling, or P_NONE if rightmost */
union uint32 btpo_level; /* tree level --- zero for leaf pages */
{
uint32 level; /* tree level --- zero for leaf pages */
TransactionId xact; /* next transaction ID, if deleted */
} btpo;
uint16 btpo_flags; /* flag bits, see below */ uint16 btpo_flags; /* flag bits, see below */
BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */ BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */
} BTPageOpaqueData; } BTPageOpaqueData;
...@@ -78,6 +79,7 @@ typedef BTPageOpaqueData *BTPageOpaque; ...@@ -78,6 +79,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */ #define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples (deprecated) */ #define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples (deprecated) */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */ #define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
#define BTP_HAS_FULLXID (1 << 8) /* contains BTDeletedPageData */
/* /*
* The max allowed value of a cycle ID is a bit less than 64K. This is * The max allowed value of a cycle ID is a bit less than 64K. This is
...@@ -105,10 +107,12 @@ typedef struct BTMetaPageData ...@@ -105,10 +107,12 @@ typedef struct BTMetaPageData
BlockNumber btm_fastroot; /* current "fast" root location */ BlockNumber btm_fastroot; /* current "fast" root location */
uint32 btm_fastlevel; /* tree level of the "fast" root page */ uint32 btm_fastlevel; /* tree level of the "fast" root page */
/* remaining fields only valid when btm_version >= BTREE_NOVAC_VERSION */ /* remaining fields only valid when btm_version >= BTREE_NOVAC_VERSION */
TransactionId btm_oldest_btpo_xact; /* oldest btpo_xact among all deleted
* pages */ /* number of deleted, non-recyclable pages during last cleanup */
float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples uint32 btm_last_cleanup_num_delpages;
* during last cleanup */ /* number of heap tuples during last cleanup */
float8 btm_last_cleanup_num_heap_tuples;
bool btm_allequalimage; /* are all columns "equalimage"? */ bool btm_allequalimage; /* are all columns "equalimage"? */
} BTMetaPageData; } BTMetaPageData;
...@@ -220,6 +224,93 @@ typedef struct BTMetaPageData ...@@ -220,6 +224,93 @@ typedef struct BTMetaPageData
#define P_IGNORE(opaque) (((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD)) != 0) #define P_IGNORE(opaque) (((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD)) != 0)
#define P_HAS_GARBAGE(opaque) (((opaque)->btpo_flags & BTP_HAS_GARBAGE) != 0) #define P_HAS_GARBAGE(opaque) (((opaque)->btpo_flags & BTP_HAS_GARBAGE) != 0)
#define P_INCOMPLETE_SPLIT(opaque) (((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0) #define P_INCOMPLETE_SPLIT(opaque) (((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0)
#define P_HAS_FULLXID(opaque) (((opaque)->btpo_flags & BTP_HAS_FULLXID) != 0)
/*
* BTDeletedPageData is the page contents of a deleted page
*/
typedef struct BTDeletedPageData
{
FullTransactionId safexid; /* See BTPageIsRecyclable() */
} BTDeletedPageData;
static inline void
BTPageSetDeleted(Page page, FullTransactionId safexid)
{
BTPageOpaque opaque;
PageHeader header;
BTDeletedPageData *contents;
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
header = ((PageHeader) page);
opaque->btpo_flags &= ~BTP_HALF_DEAD;
opaque->btpo_flags |= BTP_DELETED | BTP_HAS_FULLXID;
header->pd_lower = MAXALIGN(SizeOfPageHeaderData) +
sizeof(BTDeletedPageData);
header->pd_upper = header->pd_special;
/* Set safexid in deleted page */
contents = ((BTDeletedPageData *) PageGetContents(page));
contents->safexid = safexid;
}
static inline FullTransactionId
BTPageGetDeleteXid(Page page)
{
BTPageOpaque opaque;
BTDeletedPageData *contents;
/* We only expect to be called with a deleted page */
Assert(!PageIsNew(page));
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Assert(P_ISDELETED(opaque));
/* pg_upgrade'd deleted page -- must be safe to delete now */
if (!P_HAS_FULLXID(opaque))
return FirstNormalFullTransactionId;
/* Get safexid from deleted page */
contents = ((BTDeletedPageData *) PageGetContents(page));
return contents->safexid;
}
/*
* Is an existing page recyclable?
*
* This exists to centralize the policy on which deleted pages are now safe to
* re-use.
*
* Note: PageIsNew() pages are always safe to recycle, but we can't deal with
* them here (caller is responsible for that case themselves). Caller might
* well need special handling for new pages anyway.
*/
static inline bool
BTPageIsRecyclable(Page page)
{
BTPageOpaque opaque;
Assert(!PageIsNew(page));
/* Recycling okay iff page is deleted and safexid is old enough */
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (P_ISDELETED(opaque))
{
/*
* The page was deleted, but when? If it was just deleted, a scan
* might have seen the downlink to it, and will read the page later.
* As long as that can happen, we must keep the deleted page around as
* a tombstone.
*
* For that check if the deletion XID could still be visible to
* anyone. If not, then no scan that's still in progress could have
* seen its downlink, and we can recycle it.
*/
return GlobalVisCheckRemovableFullXid(NULL, BTPageGetDeleteXid(page));
}
return false;
}
/* /*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost * Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
...@@ -962,7 +1053,7 @@ typedef struct BTOptions ...@@ -962,7 +1053,7 @@ typedef struct BTOptions
{ {
int32 varlena_header_; /* varlena header (do not touch directly!) */ int32 varlena_header_; /* varlena header (do not touch directly!) */
int fillfactor; /* page fill factor in percent (0..100) */ int fillfactor; /* page fill factor in percent (0..100) */
/* fraction of newly inserted tuples prior to trigger index cleanup */ /* fraction of newly inserted tuples needed to trigger index cleanup */
float8 vacuum_cleanup_index_scale_factor; float8 vacuum_cleanup_index_scale_factor;
bool deduplicate_items; /* Try to deduplicate items? */ bool deduplicate_items; /* Try to deduplicate items? */
} BTOptions; } BTOptions;
...@@ -1066,8 +1157,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page origpage, ...@@ -1066,8 +1157,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page origpage,
*/ */
extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level, extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
bool allequalimage); bool allequalimage);
extern void _bt_update_meta_cleanup_info(Relation rel, extern void _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
TransactionId oldestBtpoXact, float8 numHeapTuples); float8 num_heap_tuples);
extern void _bt_upgrademetapage(Page page); extern void _bt_upgrademetapage(Page page);
extern Buffer _bt_getroot(Relation rel, int access); extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel); extern Buffer _bt_gettrueroot(Relation rel);
...@@ -1084,15 +1175,13 @@ extern void _bt_unlockbuf(Relation rel, Buffer buf); ...@@ -1084,15 +1175,13 @@ extern void _bt_unlockbuf(Relation rel, Buffer buf);
extern bool _bt_conditionallockbuf(Relation rel, Buffer buf); extern bool _bt_conditionallockbuf(Relation rel, Buffer buf);
extern void _bt_upgradelockbufcleanup(Relation rel, Buffer buf); extern void _bt_upgradelockbufcleanup(Relation rel, Buffer buf);
extern void _bt_pageinit(Page page, Size size); extern void _bt_pageinit(Page page, Size size);
extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf, extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *deletable, int ndeletable, OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable); BTVacuumPosting *updatable, int nupdatable);
extern void _bt_delitems_delete_check(Relation rel, Buffer buf, extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel, Relation heapRel,
TM_IndexDeleteOp *delstate); TM_IndexDeleteOp *delstate);
extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf, extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf);
TransactionId *oldestBtpoXact);
/* /*
* prototypes for functions in nbtsearch.c * prototypes for functions in nbtsearch.c
......
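The nbtree.h hunks above move the deletion XID out of the special area and into the page contents, with BTP_HAS_FULLXID marking pages that carry it. A toy model of that idea follows. It is a sketch only: SketchPage is a hypothetical, heavily simplified page (real pages have a header, pd_lower/pd_upper, and a special area), the flag values mirror the BTP_* bits, and the "safexid older than horizon" comparison is an assumed simplification of what GlobalVisCheckRemovableFullXid() decides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SK_DELETED      (1 << 2)    /* mirrors BTP_DELETED */
#define SK_HALF_DEAD    (1 << 4)    /* mirrors BTP_HALF_DEAD */
#define SK_HAS_FULLXID  (1 << 8)    /* mirrors BTP_HAS_FULLXID */

/* Epoch 0, XID 3: the first normal transaction ID */
#define SK_FIRST_NORMAL_FULLXID UINT64_C(3)

/* Hypothetical, simplified page: flags plus a small content area */
typedef struct
{
    uint16_t      flags;
    unsigned char contents[64];
} SketchPage;

/* Mark a page deleted and store the 64-bit safexid in its content area */
static void
sketch_set_deleted(SketchPage *page, uint64_t safexid)
{
    page->flags &= (uint16_t) ~SK_HALF_DEAD;
    page->flags |= SK_DELETED | SK_HAS_FULLXID;
    memcpy(page->contents, &safexid, sizeof(safexid));
}

/* Fetch safexid; pages deleted before the format change lack the flag */
static uint64_t
sketch_get_safexid(const SketchPage *page)
{
    uint64_t safexid;

    assert(page->flags & SK_DELETED);
    if (!(page->flags & SK_HAS_FULLXID))
        return SK_FIRST_NORMAL_FULLXID; /* treat as very old */
    memcpy(&safexid, page->contents, sizeof(safexid));
    return safexid;
}

/* Assumed simplification: recyclable once safexid precedes the horizon */
static bool
sketch_is_recyclable(const SketchPage *page, uint64_t horizon)
{
    if (!(page->flags & SK_DELETED))
        return false;
    return sketch_get_safexid(page) < horizon;
}
```

Returning the first normal XID for pages without the flag means a deleted page carried over by pg_upgrade is always treated as old enough, which matches the intent stated in the BTPageGetDeleteXid() comment.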
...@@ -13,6 +13,7 @@ ...@@ -13,6 +13,7 @@
#ifndef NBTXLOG_H #ifndef NBTXLOG_H
#define NBTXLOG_H #define NBTXLOG_H
#include "access/transam.h"
#include "access/xlogreader.h" #include "access/xlogreader.h"
#include "lib/stringinfo.h" #include "lib/stringinfo.h"
#include "storage/off.h" #include "storage/off.h"
...@@ -52,7 +53,7 @@ typedef struct xl_btree_metadata ...@@ -52,7 +53,7 @@ typedef struct xl_btree_metadata
uint32 level; uint32 level;
BlockNumber fastroot; BlockNumber fastroot;
uint32 fastlevel; uint32 fastlevel;
TransactionId oldest_btpo_xact; uint32 last_cleanup_num_delpages;
float8 last_cleanup_num_heap_tuples; float8 last_cleanup_num_heap_tuples;
bool allequalimage; bool allequalimage;
} xl_btree_metadata; } xl_btree_metadata;
...@@ -187,7 +188,7 @@ typedef struct xl_btree_reuse_page ...@@ -187,7 +188,7 @@ typedef struct xl_btree_reuse_page
{ {
RelFileNode node; RelFileNode node;
BlockNumber block; BlockNumber block;
TransactionId latestRemovedXid; FullTransactionId latestRemovedFullXid;
} xl_btree_reuse_page; } xl_btree_reuse_page;
#define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page)) #define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page))
...@@ -282,9 +283,12 @@ typedef struct xl_btree_mark_page_halfdead ...@@ -282,9 +283,12 @@ typedef struct xl_btree_mark_page_halfdead
#define SizeOfBtreeMarkPageHalfDead (offsetof(xl_btree_mark_page_halfdead, topparent) + sizeof(BlockNumber)) #define SizeOfBtreeMarkPageHalfDead (offsetof(xl_btree_mark_page_halfdead, topparent) + sizeof(BlockNumber))
/* /*
* This is what we need to know about deletion of a btree page. Note we do * This is what we need to know about deletion of a btree page. Note that we
* not store any content for the deleted page --- it is just rewritten as empty * only leave behind a small amount of bookkeeping information in deleted
* during recovery, apart from resetting the btpo.xact. * pages (deleted pages must be kept around as tombstones for a while). It is
* convenient for the REDO routine to regenerate its target page from scratch.
* This is why WAL record describes certain details that are actually directly
* available from the target page.
* *
* Backup Blk 0: target block being deleted * Backup Blk 0: target block being deleted
* Backup Blk 1: target block's left sibling, if any * Backup Blk 1: target block's left sibling, if any
...@@ -296,20 +300,24 @@ typedef struct xl_btree_unlink_page ...@@ -296,20 +300,24 @@ typedef struct xl_btree_unlink_page
{ {
BlockNumber leftsib; /* target block's left sibling, if any */ BlockNumber leftsib; /* target block's left sibling, if any */
BlockNumber rightsib; /* target block's right sibling */ BlockNumber rightsib; /* target block's right sibling */
uint32 level; /* target block's level */
FullTransactionId safexid; /* target block's BTPageSetDeleted() XID */
/* /*
* Information needed to recreate the leaf page, when target is an * Information needed to recreate a half-dead leaf page with correct
* internal page. * topparent link. The fields are only used when deletion operation's
* target page is an internal page. REDO routine creates half-dead page
* from scratch to keep things simple (this is the same convenient
* approach used for the target page itself).
*/ */
BlockNumber leafleftsib; BlockNumber leafleftsib;
BlockNumber leafrightsib; BlockNumber leafrightsib;
BlockNumber topparent; /* next child down in the subtree */ BlockNumber leaftopparent; /* next child down in the subtree */
TransactionId btpo_xact; /* value of btpo.xact for use in recovery */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */ /* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */
} xl_btree_unlink_page; } xl_btree_unlink_page;
#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId)) #define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, leaftopparent) + sizeof(BlockNumber))
/* /*
* New root log record. There are zero tuples if this is to establish an * New root log record. There are zero tuples if this is to establish an
......
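The SizeOfBtreeUnlinkPage change above follows from the struct's last field changing: the macro is computed as the offset of the final field plus that field's size, so it had to move from btpo_xact to leaftopparent. A small sketch of the idiom is shown below. SketchUnlinkPage only mirrors the field order from the hunk, with plain integer types assumed in place of BlockNumber and FullTransactionId; it is not the real WAL record definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the new record layout, using plain integer types */
typedef struct
{
    uint32_t leftsib;
    uint32_t rightsib;
    uint32_t level;
    uint64_t safexid;       /* 64-bit, so the compiler may pad before it */
    uint32_t leafleftsib;
    uint32_t leafrightsib;
    uint32_t leaftopparent; /* last field: anchors the size macro */
} SketchUnlinkPage;

/*
 * Size of the record up to and including its last field.  Unlike sizeof,
 * this excludes any trailing padding the compiler adds after the last field.
 */
#define SKETCH_SIZE_OF_UNLINK_PAGE \
    (offsetof(SketchUnlinkPage, leaftopparent) + sizeof(uint32_t))
```

If the macro kept naming a field that is no longer last (or no longer exists), the logged record would either fail to compile or be cut short, which is why the macro and the struct change together in the hunk.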
...@@ -31,7 +31,7 @@ ...@@ -31,7 +31,7 @@
/* /*
* Each page of XLOG file has a header like this: * Each page of XLOG file has a header like this:
*/ */
#define XLOG_PAGE_MAGIC 0xD109 /* can be used as WAL version indicator */ #define XLOG_PAGE_MAGIC 0xD10A /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData typedef struct XLogPageHeaderData
{ {
......
...@@ -31,6 +31,8 @@ extern void ShutdownRecoveryTransactionEnvironment(void); ...@@ -31,6 +31,8 @@ extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
RelFileNode node); RelFileNode node);
extern void ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid); extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid); extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
......