Commit e5d8a999 authored by Peter Geoghegan's avatar Peter Geoghegan

Use full 64-bit XIDs in deleted nbtree pages.

Otherwise we risk "leaking" deleted pages by making them non-recyclable
indefinitely.  Commit 6655a729 did the same thing for deleted pages in
GiST indexes.  That work was used as a starting point here.

Stop storing an XID indicating the oldest bpto.xact across all deleted
though unrecycled pages in nbtree metapages.  There is no longer any
reason to care about that condition/the oldest XID.  It only ever made
sense when wraparound was something _bt_vacuum_needs_cleanup() had to
consider.

The btm_oldest_btpo_xact metapage field has been repurposed and renamed.
It is now btm_last_cleanup_num_delpages, which is used to remember how
many non-recycled deleted pages remain from the last VACUUM (in practice
its value is usually the precise number of pages that were _newly
deleted_ during the specific VACUUM operation that last set the field).

The general idea behind storing btm_last_cleanup_num_delpages is to use
it to give _some_ consideration to non-recycled deleted pages inside
_bt_vacuum_needs_cleanup() -- though never too much.  We only really
need to avoid leaving a truly excessive number of deleted pages in an
unrecycled state forever.  We only do this to cover certain narrow cases
where no other factor makes VACUUM do a full scan, and yet the index
continues to grow (and so actually misses out on recycling existing
deleted pages).

These metapage changes result in a clear user-visible benefit: We no
longer trigger full index scans during VACUUM operations solely due to
the presence of only 1 or 2 known deleted (though unrecycled) blocks
from a very large index.  All that matters now is keeping the costs and
benefits in balance over time.

Fix an issue that has been around since commit 857f9c36, which added the
"skip full scan of index" mechanism (i.e. the _bt_vacuum_needs_cleanup()
logic).  The accuracy of btm_last_cleanup_num_heap_tuples accidentally
hinged upon _when_ the source value gets stored.  We now always store
btm_last_cleanup_num_heap_tuples in btvacuumcleanup().  This fixes the
issue because IndexVacuumInfo.num_heap_tuples (the source field) is
expected to accurately indicate the state of the table _after_ the
VACUUM completes inside btvacuumcleanup().

A backpatchable fix cannot easily be extracted from this commit.  A
targeted fix for the issue will follow in a later commit, though that
won't happen today.

I (pgeoghegan) have chosen to remove any mention of deleted pages in the
documentation of the vacuum_cleanup_index_scale_factor GUC/param, since
the presence of deleted (though unrecycled) pages is no longer of much
concern to users.  The vacuum_cleanup_index_scale_factor description in
the docs now seems rather unclear in any case, and it should probably be
rewritten in the near future.  Perhaps some passing mention of page
deletion will be added back at the same time.

Bump XLOG_PAGE_MAGIC due to nbtree WAL records using full XIDs now.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: default avatarMasahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznpdHvujGUwYZ8sihX=d5u-tRYhi-F4wnV2uN2zHpMUXw@mail.gmail.com
parent 8a4f9522
......@@ -769,7 +769,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(state->target, itemid);
nextleveldown.leftmost = BTreeTupleGetDownLink(itup);
nextleveldown.level = opaque->btpo.level - 1;
nextleveldown.level = opaque->btpo_level - 1;
}
else
{
......@@ -794,14 +794,14 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
if (opaque->btpo_prev != leftcurrent)
bt_recheck_sibling_links(state, opaque->btpo_prev, leftcurrent);
/* Check level, which must be valid for non-ignorable page */
if (level.level != opaque->btpo.level)
/* Check level */
if (level.level != opaque->btpo_level)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("leftmost down link for level points to block in index \"%s\" whose level is not one level down",
RelationGetRelationName(state->rel)),
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
current, level.level, opaque->btpo.level)));
current, level.level, opaque->btpo_level)));
/* Verify invariants for page */
bt_target_page_check(state);
......@@ -1164,7 +1164,7 @@ bt_target_page_check(BtreeCheckState *state)
bt_child_highkey_check(state,
offset,
NULL,
topaque->btpo.level);
topaque->btpo_level);
}
continue;
}
......@@ -1520,7 +1520,7 @@ bt_target_page_check(BtreeCheckState *state)
if (!P_ISLEAF(topaque) && P_RIGHTMOST(topaque) && state->readonly)
{
bt_child_highkey_check(state, InvalidOffsetNumber,
NULL, topaque->btpo.level);
NULL, topaque->btpo_level);
}
}
......@@ -1597,7 +1597,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
ereport(DEBUG1,
(errcode(ERRCODE_NO_DATA),
errmsg_internal("level %u leftmost page of index \"%s\" was found deleted or half dead",
opaque->btpo.level, RelationGetRelationName(state->rel)),
opaque->btpo_level, RelationGetRelationName(state->rel)),
errdetail_internal("Deleted page found when building scankey from right sibling.")));
/* Be slightly more pro-active in freeing this memory, just in case */
......@@ -1900,14 +1900,15 @@ bt_child_highkey_check(BtreeCheckState *state,
state->targetblock, blkno,
LSN_FORMAT_ARGS(state->targetlsn))));
/* Check level for non-ignorable page */
if (!P_IGNORE(opaque) && opaque->btpo.level != target_level - 1)
/* Do level sanity check */
if ((!P_ISDELETED(opaque) || P_HAS_FULLXID(opaque)) &&
opaque->btpo_level != target_level - 1)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block found while following rightlinks from child of index \"%s\" has invalid level",
RelationGetRelationName(state->rel)),
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
blkno, target_level - 1, opaque->btpo.level)));
blkno, target_level - 1, opaque->btpo_level)));
/* Try to detect circular links */
if ((!first && blkno == state->prevrightlink) || blkno == opaque->btpo_prev)
......@@ -2132,7 +2133,7 @@ bt_child_check(BtreeCheckState *state, BTScanInsert targetkey,
* check for downlink connectivity.
*/
bt_child_highkey_check(state, downlinkoffnum,
child, topaque->btpo.level);
child, topaque->btpo_level);
/*
* Since there cannot be a concurrent VACUUM operation in readonly mode,
......@@ -2275,7 +2276,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
errmsg_internal("harmless interrupted page split detected in index %s",
RelationGetRelationName(state->rel)),
errdetail_internal("Block=%u level=%u left sibling=%u page lsn=%X/%X.",
blkno, opaque->btpo.level,
blkno, opaque->btpo_level,
opaque->btpo_prev,
LSN_FORMAT_ARGS(pagelsn))));
return;
......@@ -2304,7 +2305,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
elog(DEBUG1, "checking for interrupted multi-level deletion due to missing downlink in index \"%s\"",
RelationGetRelationName(state->rel));
level = opaque->btpo.level;
level = opaque->btpo_level;
itemid = PageGetItemIdCareful(state, blkno, page, P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(page, itemid);
childblk = BTreeTupleGetDownLink(itup);
......@@ -2319,16 +2320,16 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
break;
/* Do an extra sanity check in passing on internal pages */
if (copaque->btpo.level != level - 1)
if (copaque->btpo_level != level - 1)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("downlink points to block in index \"%s\" whose level is not one level down",
RelationGetRelationName(state->rel)),
errdetail_internal("Top parent/under check block=%u block pointed to=%u expected level=%u level in pointed to block=%u.",
blkno, childblk,
level - 1, copaque->btpo.level)));
level - 1, copaque->btpo_level)));
level = copaque->btpo.level;
level = copaque->btpo_level;
itemid = PageGetItemIdCareful(state, childblk, child,
P_FIRSTDATAKEY(copaque));
itup = (IndexTuple) PageGetItem(child, itemid);
......@@ -2389,7 +2390,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
errmsg("internal index block lacks downlink in index \"%s\"",
RelationGetRelationName(state->rel)),
errdetail_internal("Block=%u level=%u page lsn=%X/%X.",
blkno, opaque->btpo.level,
blkno, opaque->btpo_level,
LSN_FORMAT_ARGS(pagelsn))));
}
......@@ -2983,21 +2984,28 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
}
/*
* Deleted pages have no sane "level" field, so can only check non-deleted
* page level
* Deleted pages that still use the old 32-bit XID representation have no
* sane "level" field because they type pun the field, but all other pages
* (including pages deleted on Postgres 14+) have a valid value.
*/
if (P_ISLEAF(opaque) && !P_ISDELETED(opaque) && opaque->btpo.level != 0)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("invalid leaf page level %u for block %u in index \"%s\"",
opaque->btpo.level, blocknum, RelationGetRelationName(state->rel))));
if (!P_ISDELETED(opaque) || P_HAS_FULLXID(opaque))
{
/* Okay, no reason not to trust btpo_level field from page */
if (!P_ISLEAF(opaque) && !P_ISDELETED(opaque) &&
opaque->btpo.level == 0)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("invalid internal page level 0 for block %u in index \"%s\"",
blocknum, RelationGetRelationName(state->rel))));
if (P_ISLEAF(opaque) && opaque->btpo_level != 0)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("invalid leaf page level %u for block %u in index \"%s\"",
opaque->btpo_level, blocknum,
RelationGetRelationName(state->rel))));
if (!P_ISLEAF(opaque) && opaque->btpo_level == 0)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("invalid internal page level 0 for block %u in index \"%s\"",
blocknum,
RelationGetRelationName(state->rel))));
}
/*
* Sanity checks for number of items on page.
......@@ -3044,8 +3052,6 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
* state. This state is nonetheless treated as corruption by VACUUM on
* from version 9.4 on, so do the same here. See _bt_pagedel() for full
* details.
*
* Internal pages should never have garbage items, either.
*/
if (!P_ISLEAF(opaque) && P_ISHALFDEAD(opaque))
ereport(ERROR,
......@@ -3054,11 +3060,27 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
blocknum, RelationGetRelationName(state->rel)),
errhint("This can be caused by an interrupted VACUUM in version 9.3 or older, before upgrade. Please REINDEX it.")));
/*
* Check that internal pages have no garbage items, and that no page has
* an invalid combination of deletion-related page level flags
*/
if (!P_ISLEAF(opaque) && P_HAS_GARBAGE(opaque))
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("internal page block %u in index \"%s\" has garbage items",
blocknum, RelationGetRelationName(state->rel))));
errmsg_internal("internal page block %u in index \"%s\" has garbage items",
blocknum, RelationGetRelationName(state->rel))));
if (P_HAS_FULLXID(opaque) && !P_ISDELETED(opaque))
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("full transaction id page flag appears in non-deleted block %u in index \"%s\"",
blocknum, RelationGetRelationName(state->rel))));
if (P_ISDELETED(opaque) && P_ISHALFDEAD(opaque))
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("deleted page block %u in index \"%s\" is half-dead",
blocknum, RelationGetRelationName(state->rel))));
return page;
}
......
......@@ -75,11 +75,7 @@ typedef struct BTPageStat
/* opaque data */
BlockNumber btpo_prev;
BlockNumber btpo_next;
union
{
uint32 level;
TransactionId xact;
} btpo;
uint32 btpo_level;
uint16 btpo_flags;
BTCycleId btpo_cycleid;
} BTPageStat;
......@@ -112,9 +108,33 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
/* page type (flags) */
if (P_ISDELETED(opaque))
{
stat->type = 'd';
stat->btpo.xact = opaque->btpo.xact;
return;
/* We divide deleted pages into leaf ('d') or internal ('D') */
if (P_ISLEAF(opaque) || !P_HAS_FULLXID(opaque))
stat->type = 'd';
else
stat->type = 'D';
/*
* Report safexid in a deleted page.
*
* Handle pg_upgrade'd deleted pages that used the previous safexid
* representation in btpo_level field (this used to be a union type
* called "bpto").
*/
if (P_HAS_FULLXID(opaque))
{
FullTransactionId safexid = BTPageGetDeleteXid(page);
elog(NOTICE, "deleted page from block %u has safexid %u:%u",
blkno, EpochFromFullTransactionId(safexid),
XidFromFullTransactionId(safexid));
}
else
elog(NOTICE, "deleted page from block %u has safexid %u",
blkno, opaque->btpo_level);
/* Don't interpret BTDeletedPageData as index tuples */
maxoff = InvalidOffsetNumber;
}
else if (P_IGNORE(opaque))
stat->type = 'e';
......@@ -128,7 +148,7 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
/* btpage opaque data */
stat->btpo_prev = opaque->btpo_prev;
stat->btpo_next = opaque->btpo_next;
stat->btpo.level = opaque->btpo.level;
stat->btpo_level = opaque->btpo_level;
stat->btpo_flags = opaque->btpo_flags;
stat->btpo_cycleid = opaque->btpo_cycleid;
......@@ -237,7 +257,7 @@ bt_page_stats_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
values[j++] = psprintf("%u", stat.free_size);
values[j++] = psprintf("%u", stat.btpo_prev);
values[j++] = psprintf("%u", stat.btpo_next);
values[j++] = psprintf("%u", (stat.type == 'd') ? stat.btpo.xact : stat.btpo.level);
values[j++] = psprintf("%u", stat.btpo_level);
values[j++] = psprintf("%d", stat.btpo_flags);
tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
......@@ -503,10 +523,14 @@ bt_page_items_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);
if (P_ISDELETED(opaque))
elog(NOTICE, "page is deleted");
fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
if (!P_ISDELETED(opaque))
fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
else
{
/* Don't interpret BTDeletedPageData as index tuples */
elog(NOTICE, "page from block " INT64_FORMAT " is deleted", blkno);
fctx->max_calls = 0;
}
uargs->leafpage = P_ISLEAF(opaque);
uargs->rightmost = P_RIGHTMOST(opaque);
......@@ -603,7 +627,14 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (P_ISDELETED(opaque))
elog(NOTICE, "page is deleted");
fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
if (!P_ISDELETED(opaque))
fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
else
{
/* Don't interpret BTDeletedPageData as index tuples */
elog(NOTICE, "page from block is deleted");
fctx->max_calls = 0;
}
uargs->leafpage = P_ISLEAF(opaque);
uargs->rightmost = P_RIGHTMOST(opaque);
......@@ -692,10 +723,7 @@ bt_metap(PG_FUNCTION_ARGS)
/*
* We need a kluge here to detect API versions prior to 1.8. Earlier
* versions incorrectly used int4 for certain columns. This caused
* various problems. For example, an int4 version of the "oldest_xact"
* column would not work with TransactionId values that happened to exceed
* PG_INT32_MAX.
* versions incorrectly used int4 for certain columns.
*
* There is no way to reliably avoid the problems created by the old
* function definition at this point, so insist that the user update the
......@@ -723,7 +751,8 @@ bt_metap(PG_FUNCTION_ARGS)
*/
if (metad->btm_version >= BTREE_NOVAC_VERSION)
{
values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
values[j++] = psprintf(INT64_FORMAT,
(int64) metad->btm_last_cleanup_num_delpages);
values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
values[j++] = metad->btm_allequalimage ? "t" : "f";
}
......
......@@ -3,16 +3,16 @@ INSERT INTO test1 VALUES (72057594037927937, 'text');
CREATE INDEX test1_a_idx ON test1 USING btree (a);
\x
SELECT * FROM bt_metap('test1_a_idx');
-[ RECORD 1 ]-----------+-------
magic | 340322
version | 4
root | 1
level | 0
fastroot | 1
fastlevel | 0
oldest_xact | 0
last_cleanup_num_tuples | -1
allequalimage | t
-[ RECORD 1 ]-------------+-------
magic | 340322
version | 4
root | 1
level | 0
fastroot | 1
fastlevel | 0
last_cleanup_num_delpages | 0
last_cleanup_num_tuples | -1
allequalimage | t
SELECT * FROM bt_page_stats('test1_a_idx', -1);
ERROR: invalid block number
......@@ -29,7 +29,7 @@ page_size | 8192
free_size | 8128
btpo_prev | 0
btpo_next | 0
btpo | 0
btpo_level | 0
btpo_flags | 3
SELECT * FROM bt_page_stats('test1_a_idx', 2);
......
......@@ -66,6 +66,23 @@ RETURNS smallint
AS 'MODULE_PATHNAME', 'page_checksum_1_9'
LANGUAGE C STRICT PARALLEL SAFE;
--
-- bt_metap()
--
DROP FUNCTION bt_metap(text);
CREATE FUNCTION bt_metap(IN relname text,
OUT magic int4,
OUT version int4,
OUT root int8,
OUT level int8,
OUT fastroot int8,
OUT fastlevel int8,
OUT last_cleanup_num_delpages int8,
OUT last_cleanup_num_tuples float8,
OUT allequalimage boolean)
AS 'MODULE_PATHNAME', 'bt_metap'
LANGUAGE C STRICT PARALLEL SAFE;
--
-- bt_page_stats()
--
......@@ -80,7 +97,7 @@ CREATE FUNCTION bt_page_stats(IN relname text, IN blkno int8,
OUT free_size int4,
OUT btpo_prev int8,
OUT btpo_next int8,
OUT btpo int4,
OUT btpo_level int8,
OUT btpo_flags int4)
AS 'MODULE_PATHNAME', 'bt_page_stats_1_9'
LANGUAGE C STRICT PARALLEL SAFE;
......
......@@ -283,8 +283,12 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
page = BufferGetPage(buffer);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
/* Determine page type, and update totals */
/*
* Determine page type, and update totals.
*
* Note that we arbitrarily bucket deleted pages together without
* considering if they're leaf pages or internal pages.
*/
if (P_ISDELETED(opaque))
indexStat.deleted_pages++;
else if (P_IGNORE(opaque))
......
......@@ -8529,11 +8529,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
<para>
If no tuples were deleted from the heap, B-tree indexes are still
scanned at the <command>VACUUM</command> cleanup stage when at least one
of the following conditions is met: the index statistics are stale, or
the index contains deleted pages that can be recycled during cleanup.
Index statistics are considered to be stale if the number of newly
inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
scanned at the <command>VACUUM</command> cleanup stage when the
index's statistics are stale. Index statistics are considered
stale if the number of newly inserted tuples exceeds the
<varname>vacuum_cleanup_index_scale_factor</varname>
fraction of the total number of heap tuples detected by the previous
statistics collection. The total number of heap tuples is stored in
the index meta-page. Note that the meta-page does not include this data
......
......@@ -298,16 +298,16 @@ test=# SELECT t_ctid, raw_flags, combined_flags
index's metapage. For example:
<screen>
test=# SELECT * FROM bt_metap('pg_cast_oid_index');
-[ RECORD 1 ]-----------+-------
magic | 340322
version | 4
root | 1
level | 0
fastroot | 1
fastlevel | 0
oldest_xact | 582
last_cleanup_num_tuples | 1000
allequalimage | f
-[ RECORD 1 ]-------------+-------
magic | 340322
version | 4
root | 1
level | 0
fastroot | 1
fastlevel | 0
last_cleanup_num_delpages | 0
last_cleanup_num_tuples | 230
allequalimage | f
</screen>
</para>
</listitem>
......@@ -337,7 +337,7 @@ page_size | 8192
free_size | 3668
btpo_prev | 0
btpo_next | 0
btpo | 0
btpo_level | 0
btpo_flags | 3
</screen>
</para>
......
......@@ -394,28 +394,8 @@ gistRedoPageReuse(XLogReaderState *record)
* same exclusion effect on primary and standby.
*/
if (InHotStandby)
{
FullTransactionId latestRemovedFullXid = xlrec->latestRemovedFullXid;
FullTransactionId nextXid = ReadNextFullTransactionId();
uint64 diff;
/*
* ResolveRecoveryConflictWithSnapshot operates on 32-bit
* TransactionIds, so truncate the logged FullTransactionId. If the
* logged value is very old, so that XID wrap-around already happened
* on it, there can't be any snapshots that still see it.
*/
diff = U64FromFullTransactionId(nextXid) -
U64FromFullTransactionId(latestRemovedFullXid);
if (diff < MaxTransactionId / 2)
{
TransactionId latestRemovedXid;
latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
xlrec->node);
}
}
ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
xlrec->node);
}
void
......
......@@ -1241,7 +1241,7 @@ _bt_insertonpg(Relation rel,
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
if (metad->btm_fastlevel >= opaque->btpo.level)
if (metad->btm_fastlevel >= opaque->btpo_level)
{
/* no update wanted */
_bt_relbuf(rel, metabuf);
......@@ -1268,7 +1268,7 @@ _bt_insertonpg(Relation rel,
if (metad->btm_version < BTREE_NOVAC_VERSION)
_bt_upgrademetapage(metapg);
metad->btm_fastroot = BufferGetBlockNumber(buf);
metad->btm_fastlevel = opaque->btpo.level;
metad->btm_fastlevel = opaque->btpo_level;
MarkBufferDirty(metabuf);
}
......@@ -1331,7 +1331,7 @@ _bt_insertonpg(Relation rel,
xlmeta.level = metad->btm_level;
xlmeta.fastroot = metad->btm_fastroot;
xlmeta.fastlevel = metad->btm_fastlevel;
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
xlmeta.last_cleanup_num_heap_tuples =
metad->btm_last_cleanup_num_heap_tuples;
xlmeta.allequalimage = metad->btm_allequalimage;
......@@ -1537,7 +1537,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_prev = oopaque->btpo_prev;
/* handle btpo_next after rightpage buffer acquired */
lopaque->btpo.level = oopaque->btpo.level;
lopaque->btpo_level = oopaque->btpo_level;
/* handle btpo_cycleid after rightpage buffer acquired */
/*
......@@ -1722,7 +1722,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
ropaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
ropaque->btpo_prev = origpagenumber;
ropaque->btpo_next = oopaque->btpo_next;
ropaque->btpo.level = oopaque->btpo.level;
ropaque->btpo_level = oopaque->btpo_level;
ropaque->btpo_cycleid = lopaque->btpo_cycleid;
/*
......@@ -1950,7 +1950,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
uint8 xlinfo;
XLogRecPtr recptr;
xlrec.level = ropaque->btpo.level;
xlrec.level = ropaque->btpo_level;
/* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstrightoff = firstrightoff;
xlrec.newitemoff = newitemoff;
......@@ -2142,7 +2142,7 @@ _bt_insert_parent(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* Find the leftmost page at the next level up */
pbuf = _bt_get_endpoint(rel, opaque->btpo.level + 1, false, NULL);
pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
/* Set up a phony stack entry pointing there */
stack = &fakestack;
stack->bts_blkno = BufferGetBlockNumber(pbuf);
......@@ -2480,15 +2480,15 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags = BTP_ROOT;
rootopaque->btpo.level =
((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo.level + 1;
rootopaque->btpo_level =
((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo_level + 1;
rootopaque->btpo_cycleid = 0;
/* update metapage data */
metad->btm_root = rootblknum;
metad->btm_level = rootopaque->btpo.level;
metad->btm_level = rootopaque->btpo_level;
metad->btm_fastroot = rootblknum;
metad->btm_fastlevel = rootopaque->btpo.level;
metad->btm_fastlevel = rootopaque->btpo_level;
/*
* Insert the left page pointer into the new root page. The root page is
......@@ -2548,7 +2548,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.level = metad->btm_level;
md.fastroot = rootblknum;
md.fastlevel = metad->btm_level;
md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
md.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
md.allequalimage = metad->btm_allequalimage;
......
This diff is collapsed.
This diff is collapsed.
......@@ -169,7 +169,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* we're on the level 1 and asked to lock leaf page in write mode,
* then lock next page in write mode, because it must be a leaf.
*/
if (opaque->btpo.level == 1 && access == BT_WRITE)
if (opaque->btpo_level == 1 && access == BT_WRITE)
page_access = BT_WRITE;
/* drop the read lock on the page, then acquire one on its child */
......@@ -2341,9 +2341,9 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
}
/* Done? */
if (opaque->btpo.level == level)
if (opaque->btpo_level == level)
break;
if (opaque->btpo.level < level)
if (opaque->btpo_level < level)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("btree level %u not found in index \"%s\"",
......
......@@ -620,7 +620,7 @@ _bt_blnewpage(uint32 level)
/* Initialize BT opaque state */
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_prev = opaque->btpo_next = P_NONE;
opaque->btpo.level = level;
opaque->btpo_level = level;
opaque->btpo_flags = (level > 0) ? 0 : BTP_LEAF;
opaque->btpo_cycleid = 0;
......
......@@ -112,7 +112,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
md->btm_fastlevel = xlrec->fastlevel;
/* Cannot log BTREE_MIN_VERSION index metapage without upgrade */
Assert(md->btm_version >= BTREE_NOVAC_VERSION);
md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
md->btm_last_cleanup_num_delpages = xlrec->last_cleanup_num_delpages;
md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
md->btm_allequalimage = xlrec->allequalimage;
......@@ -297,7 +297,7 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record)
ropaque->btpo_prev = origpagenumber;
ropaque->btpo_next = spagenumber;
ropaque->btpo.level = xlrec->level;
ropaque->btpo_level = xlrec->level;
ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
ropaque->btpo_cycleid = 0;
......@@ -773,7 +773,7 @@ btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record)
pageop->btpo_prev = xlrec->leftblk;
pageop->btpo_next = xlrec->rightblk;
pageop->btpo.level = 0;
pageop->btpo_level = 0;
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_cycleid = 0;
......@@ -802,6 +802,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
BlockNumber leftsib;
BlockNumber rightsib;
uint32 level;
bool isleaf;
FullTransactionId safexid;
Buffer leftbuf;
Buffer target;
Buffer rightbuf;
......@@ -810,6 +813,12 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
leftsib = xlrec->leftsib;
rightsib = xlrec->rightsib;
level = xlrec->level;
isleaf = (level == 0);
safexid = xlrec->safexid;
/* No leaftopparent for level 0 (leaf page) or level 1 target */
Assert(xlrec->leaftopparent == InvalidBlockNumber || level > 1);
/*
* In normal operation, we would lock all the pages this WAL record
......@@ -844,9 +853,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_prev = leftsib;
pageop->btpo_next = rightsib;
pageop->btpo.xact = xlrec->btpo_xact;
pageop->btpo_flags = BTP_DELETED;
if (!BlockNumberIsValid(xlrec->topparent))
pageop->btpo_level = level;
BTPageSetDeleted(page, safexid);
if (isleaf)
pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0;
......@@ -892,6 +901,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
Buffer leafbuf;
IndexTupleData trunctuple;
Assert(!isleaf);
leafbuf = XLogInitBufferForRedo(record, 3);
page = (Page) BufferGetPage(leafbuf);
......@@ -901,13 +912,13 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_prev = xlrec->leafleftsib;
pageop->btpo_next = xlrec->leafrightsib;
pageop->btpo.level = 0;
pageop->btpo_level = 0;
pageop->btpo_cycleid = 0;
/* Add a dummy hikey item */
MemSet(&trunctuple, 0, sizeof(IndexTupleData));
trunctuple.t_info = sizeof(IndexTupleData);
BTreeTupleSetTopParent(&trunctuple, xlrec->topparent);
BTreeTupleSetTopParent(&trunctuple, xlrec->leaftopparent);
if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
false, false) == InvalidOffsetNumber)
......@@ -942,7 +953,7 @@ btree_xlog_newroot(XLogReaderState *record)
pageop->btpo_flags = BTP_ROOT;
pageop->btpo_prev = pageop->btpo_next = P_NONE;
pageop->btpo.level = xlrec->level;
pageop->btpo_level = xlrec->level;
if (xlrec->level == 0)
pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0;
......@@ -963,26 +974,40 @@ btree_xlog_newroot(XLogReaderState *record)
_bt_restore_meta(record, 2);
}
/*
* In general VACUUM must defer recycling as a way of avoiding certain race
* conditions. Deleted pages contain a safexid value that is used by VACUUM
* to determine whether or not it's safe to place a page that was deleted by
* VACUUM earlier into the FSM now. See nbtree/README.
*
* As far as any backend operating during original execution is concerned, the
* FSM is a cache of recycle-safe pages; the mere presence of the page in the
* FSM indicates that the page must already be safe to recycle (actually,
* _bt_getbuf() verifies it's safe using BTPageIsRecyclable(), but that's just
* because it would be unwise to completely trust the FSM, given its current
* limitations).
*
* This isn't sufficient to prevent similar concurrent recycling race
* conditions during Hot Standby, though. For that we need to log a
* xl_btree_reuse_page record at the point that a page is actually recycled
* and reused for an entirely unrelated page inside _bt_split(). These
* records include the same safexid value from the original deleted page,
* stored in the record's latestRemovedFullXid field.
*
* The GlobalVisCheckRemovableFullXid() test in BTPageIsRecyclable() is used
* to determine if it's safe to recycle a page. This mirrors our own test:
* the PGPROC->xmin > limitXmin test inside GetConflictingVirtualXIDs().
* Consequently, one XID value achieves the same exclusion effect on primary
* and standby.
*/
static void
btree_xlog_reuse_page(XLogReaderState *record)
{
xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record);
/*
* Btree reuse_page records exist to provide a conflict point when we
* reuse pages in the index via the FSM. That's all they do though.
*
* latestRemovedXid was the page's btpo.xact. The
* GlobalVisCheckRemovableXid test in _bt_page_recyclable() conceptually
* mirrors the pgxact->xmin > limitXmin test in
* GetConflictingVirtualXIDs(). Consequently, one XID value achieves the
* same exclusion effect on primary and standby.
*/
if (InHotStandby)
{
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
xlrec->node);
}
ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
xlrec->node);
}
void
......
......@@ -80,12 +80,13 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec;
appendStringInfo(buf, "left %u; right %u; btpo_xact %u; ",
xlrec->leftsib, xlrec->rightsib,
xlrec->btpo_xact);
appendStringInfo(buf, "leafleft %u; leafright %u; topparent %u",
appendStringInfo(buf, "left %u; right %u; level %u; safexid %u:%u; ",
xlrec->leftsib, xlrec->rightsib, xlrec->level,
EpochFromFullTransactionId(xlrec->safexid),
XidFromFullTransactionId(xlrec->safexid));
appendStringInfo(buf, "leafleft %u; leafright %u; leaftopparent %u",
xlrec->leafleftsib, xlrec->leafrightsib,
xlrec->topparent);
xlrec->leaftopparent);
break;
}
case XLOG_BTREE_NEWROOT:
......@@ -99,9 +100,11 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) rec;
appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u",
appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u:%u",
xlrec->node.spcNode, xlrec->node.dbNode,
xlrec->node.relNode, xlrec->latestRemovedXid);
xlrec->node.relNode,
EpochFromFullTransactionId(xlrec->latestRemovedFullXid),
XidFromFullTransactionId(xlrec->latestRemovedFullXid));
break;
}
case XLOG_BTREE_META_CLEANUP:
......@@ -110,8 +113,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0,
NULL);
appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f",
xlrec->oldest_btpo_xact,
appendStringInfo(buf, "last_cleanup_num_delpages %u; last_cleanup_num_heap_tuples: %f",
xlrec->last_cleanup_num_delpages,
xlrec->last_cleanup_num_heap_tuples);
break;
}
......
......@@ -452,6 +452,34 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
true);
}
/*
* Variant of ResolveRecoveryConflictWithSnapshot that works with
* FullTransactionId values
*/
void
ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
RelFileNode node)
{
/*
* ResolveRecoveryConflictWithSnapshot operates on 32-bit TransactionIds,
* so truncate the logged FullTransactionId. If the logged value is very
* old, so that XID wrap-around already happened on it, there can't be any
* snapshots that still see it.
*/
FullTransactionId nextXid = ReadNextFullTransactionId();
uint64 diff;
diff = U64FromFullTransactionId(nextXid) -
U64FromFullTransactionId(latestRemovedFullXid);
if (diff < MaxTransactionId / 2)
{
TransactionId latestRemovedXid;
latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
ResolveRecoveryConflictWithSnapshot(latestRemovedXid, node);
}
}
void
ResolveRecoveryConflictWithTablespace(Oid tsid)
{
......
......@@ -37,8 +37,9 @@ typedef uint16 BTCycleId;
*
* In addition, we store the page's btree level (counting upwards from
* zero at a leaf page) as well as some flag bits indicating the page type
* and status. If the page is deleted, we replace the level with the
* next-transaction-ID value indicating when it is safe to reclaim the page.
* and status. If the page is deleted, a BTDeletedPageData struct is stored
* in the page's tuple area, while a standard BTPageOpaqueData struct is
* stored in the page special area.
*
* We also store a "vacuum cycle ID". When a page is split while VACUUM is
* processing the index, a nonzero value associated with the VACUUM run is
......@@ -52,17 +53,17 @@ typedef uint16 BTCycleId;
*
* NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
* instead.
*
* NOTE: the btpo_level field used to be a union type in order to allow
* deleted pages to store a 32-bit safexid in the same field. We now store
* 64-bit/full safexid values using BTDeletedPageData instead.
*/
typedef struct BTPageOpaqueData
{
BlockNumber btpo_prev; /* left sibling, or P_NONE if leftmost */
BlockNumber btpo_next; /* right sibling, or P_NONE if rightmost */
union
{
uint32 level; /* tree level --- zero for leaf pages */
TransactionId xact; /* next transaction ID, if deleted */
} btpo;
uint32 btpo_level; /* tree level --- zero for leaf pages */
uint16 btpo_flags; /* flag bits, see below */
BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */
} BTPageOpaqueData;
......@@ -78,6 +79,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples (deprecated) */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
#define BTP_HAS_FULLXID (1 << 8) /* contains BTDeletedPageData */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
......@@ -105,10 +107,12 @@ typedef struct BTMetaPageData
BlockNumber btm_fastroot; /* current "fast" root location */
uint32 btm_fastlevel; /* tree level of the "fast" root page */
/* remaining fields only valid when btm_version >= BTREE_NOVAC_VERSION */
TransactionId btm_oldest_btpo_xact; /* oldest btpo_xact among all deleted
* pages */
float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples
* during last cleanup */
/* number of deleted, non-recyclable pages during last cleanup */
uint32 btm_last_cleanup_num_delpages;
/* number of heap tuples during last cleanup */
float8 btm_last_cleanup_num_heap_tuples;
bool btm_allequalimage; /* are all columns "equalimage"? */
} BTMetaPageData;
......@@ -220,6 +224,93 @@ typedef struct BTMetaPageData
#define P_IGNORE(opaque) (((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD)) != 0)
#define P_HAS_GARBAGE(opaque) (((opaque)->btpo_flags & BTP_HAS_GARBAGE) != 0)
#define P_INCOMPLETE_SPLIT(opaque) (((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0)
#define P_HAS_FULLXID(opaque) (((opaque)->btpo_flags & BTP_HAS_FULLXID) != 0)
/*
* BTDeletedPageData is the page contents of a deleted page
*/
typedef struct BTDeletedPageData
{
FullTransactionId safexid; /* See BTPageIsRecyclable() */
} BTDeletedPageData;
static inline void
BTPageSetDeleted(Page page, FullTransactionId safexid)
{
BTPageOpaque opaque;
PageHeader header;
BTDeletedPageData *contents;
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
header = ((PageHeader) page);
opaque->btpo_flags &= ~BTP_HALF_DEAD;
opaque->btpo_flags |= BTP_DELETED | BTP_HAS_FULLXID;
header->pd_lower = MAXALIGN(SizeOfPageHeaderData) +
sizeof(BTDeletedPageData);
header->pd_upper = header->pd_special;
/* Set safexid in deleted page */
contents = ((BTDeletedPageData *) PageGetContents(page));
contents->safexid = safexid;
}
static inline FullTransactionId
BTPageGetDeleteXid(Page page)
{
BTPageOpaque opaque;
BTDeletedPageData *contents;
/* We only expect to be called with a deleted page */
Assert(!PageIsNew(page));
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Assert(P_ISDELETED(opaque));
/* pg_upgrade'd deleted page -- must be safe to delete now */
if (!P_HAS_FULLXID(opaque))
return FirstNormalFullTransactionId;
/* Get safexid from deleted page */
contents = ((BTDeletedPageData *) PageGetContents(page));
return contents->safexid;
}
/*
* Is an existing page recyclable?
*
* This exists to centralize the policy on which deleted pages are now safe to
* re-use.
*
* Note: PageIsNew() pages are always safe to recycle, but we can't deal with
* them here (caller is responsible for that case themselves). Caller might
* well need special handling for new pages anyway.
*/
static inline bool
BTPageIsRecyclable(Page page)
{
BTPageOpaque opaque;
Assert(!PageIsNew(page));
/* Recycling okay iff page is deleted and safexid is old enough */
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (P_ISDELETED(opaque))
{
/*
* The page was deleted, but when? If it was just deleted, a scan
* might have seen the downlink to it, and will read the page later.
* As long as that can happen, we must keep the deleted page around as
* a tombstone.
*
* For that check if the deletion XID could still be visible to
* anyone. If not, then no scan that's still in progress could have
* seen its downlink, and we can recycle it.
*/
return GlobalVisCheckRemovableFullXid(NULL, BTPageGetDeleteXid(page));
}
return false;
}
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
......@@ -962,7 +1053,7 @@ typedef struct BTOptions
{
int32 varlena_header_; /* varlena header (do not touch directly!) */
int fillfactor; /* page fill factor in percent (0..100) */
/* fraction of newly inserted tuples prior to trigger index cleanup */
/* fraction of newly inserted tuples needed to trigger index cleanup */
float8 vacuum_cleanup_index_scale_factor;
bool deduplicate_items; /* Try to deduplicate items? */
} BTOptions;
......@@ -1066,8 +1157,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page origpage,
*/
extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
bool allequalimage);
extern void _bt_update_meta_cleanup_info(Relation rel,
TransactionId oldestBtpoXact, float8 numHeapTuples);
extern void _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
float8 num_heap_tuples);
extern void _bt_upgrademetapage(Page page);
extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
......@@ -1084,15 +1175,13 @@ extern void _bt_unlockbuf(Relation rel, Buffer buf);
extern bool _bt_conditionallockbuf(Relation rel, Buffer buf);
extern void _bt_upgradelockbufcleanup(Relation rel, Buffer buf);
extern void _bt_pageinit(Page page, Size size);
extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable);
extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf,
TransactionId *oldestBtpoXact);
extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf);
/*
* prototypes for functions in nbtsearch.c
......
......@@ -13,6 +13,7 @@
#ifndef NBTXLOG_H
#define NBTXLOG_H
#include "access/transam.h"
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
#include "storage/off.h"
......@@ -52,7 +53,7 @@ typedef struct xl_btree_metadata
uint32 level;
BlockNumber fastroot;
uint32 fastlevel;
TransactionId oldest_btpo_xact;
uint32 last_cleanup_num_delpages;
float8 last_cleanup_num_heap_tuples;
bool allequalimage;
} xl_btree_metadata;
......@@ -187,7 +188,7 @@ typedef struct xl_btree_reuse_page
{
RelFileNode node;
BlockNumber block;
TransactionId latestRemovedXid;
FullTransactionId latestRemovedFullXid;
} xl_btree_reuse_page;
#define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page))
......@@ -282,9 +283,12 @@ typedef struct xl_btree_mark_page_halfdead
#define SizeOfBtreeMarkPageHalfDead (offsetof(xl_btree_mark_page_halfdead, topparent) + sizeof(BlockNumber))
/*
* This is what we need to know about deletion of a btree page. Note we do
* not store any content for the deleted page --- it is just rewritten as empty
* during recovery, apart from resetting the btpo.xact.
* This is what we need to know about deletion of a btree page. Note that we
* only leave behind a small amount of bookkeeping information in deleted
* pages (deleted pages must be kept around as tombstones for a while). It is
* convenient for the REDO routine to regenerate its target page from scratch.
* This is why WAL record describes certain details that are actually directly
* available from the target page.
*
* Backup Blk 0: target block being deleted
* Backup Blk 1: target block's left sibling, if any
......@@ -296,20 +300,24 @@ typedef struct xl_btree_unlink_page
{
BlockNumber leftsib; /* target block's left sibling, if any */
BlockNumber rightsib; /* target block's right sibling */
uint32 level; /* target block's level */
FullTransactionId safexid; /* target block's BTPageSetDeleted() XID */
/*
* Information needed to recreate the leaf page, when target is an
* internal page.
* Information needed to recreate a half-dead leaf page with correct
* topparent link. The fields are only used when deletion operation's
* target page is an internal page. REDO routine creates half-dead page
* from scratch to keep things simple (this is the same convenient
* approach used for the target page itself).
*/
BlockNumber leafleftsib;
BlockNumber leafrightsib;
BlockNumber topparent; /* next child down in the subtree */
BlockNumber leaftopparent; /* next child down in the subtree */
TransactionId btpo_xact; /* value of btpo.xact for use in recovery */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */
} xl_btree_unlink_page;
#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId))
#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, leaftopparent) + sizeof(BlockNumber))
/*
* New root log record. There are zero tuples if this is to establish an
......
......@@ -31,7 +31,7 @@
/*
* Each page of XLOG file has a header like this:
*/
#define XLOG_PAGE_MAGIC 0xD109 /* can be used as WAL version indicator */
#define XLOG_PAGE_MAGIC 0xD10A /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
......
......@@ -31,6 +31,8 @@ extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
RelFileNode node);
extern void ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment