
Commit 48c2277

knizhnik authored and tristan957 committed

Implement index prefetch for index and index-only scans (#277)

* Implement index prefetch for index and index-only scans
* Move prefetch_blocks array to the end of BTScanOpaqueData struct

1 parent 5a560e3 commit 48c2277

File tree

9 files changed: +301 −6 lines changed

src/backend/access/nbtree/README

+44
@@ -1054,3 +1054,47 @@
 item is irrelevant, and need not be stored at all. This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
 than key. Suffix truncation's negative infinity attributes behave in
 the same way.
+
+Notes About Index Scan Prefetch
+-------------------------------
+
+Prefetch can significantly improve the speed of OLAP queries.
+To be able to perform prefetch, we need to know which pages will be
+accessed during the scan. This is trivial for heap and bitmap scans,
+but requires more effort for index scans: to implement prefetch for
+index scans, we need to find the subsequent leaf pages.
+
+Postgres links all pages at the same level of the B-Tree in a doubly
+linked list and uses this list for forward and backward iteration.
+This list, however, cannot trivially be used for prefetching, because
+to locate the next page we first need to load the current page. To
+prefetch more than only the next page, we can use the parent page's
+downlinks instead, since the parent contains references to most of the
+target page's sibling pages.
+
+Because Postgres' nbtree pages have no reference to their parent page,
+we need to remember the parent page when descending the btree and use
+it to prefetch subsequent pages. We use the parent level's linked list
+to extend this prefetch scheme past the key range of the current
+parent page.
+
+We should prefetch not only leaf pages, but also the next parent page.
+The trick is to correctly calculate the moment when it will be needed:
+we issue the prefetch request for the next parent page not after
+prefetch requests for all children of the current parent page have
+been issued, but when only effective_io_concurrency line pointers are
+left to prefetch from the current page.
+
+Currently there are two different prefetch implementations, one for
+index-only scans and one for index scans. An index-only scan doesn't
+need to access heap tuples, so it prefetches only B-Tree leaf pages
+(and their parents). Prefetch for an index-only scan is performed only
+if a parallel plan is not used: a parallel index scan obtains the next
+page inside a critical section, and the leaf page is loaded in that
+critical section, so if most of the time is spent loading the page,
+this eliminates any concurrency and makes prefetch useless. For
+relatively small tables Postgres will not choose a parallel plan in
+any case, and for large tables a serial plan can be enforced by
+setting max_parallel_workers_per_gather=0.
+
+Prefetch for a normal (not index-only) index scan tries to prefetch
+the heap tuples referenced from the leaf page. The average number of
+items per page is about 100, which is comparable to the default value
+of effective_io_concurrency, so there is little point in also
+prefetching the next leaf page.
+
+Since it is difficult to estimate the number of entries traversed by
+an index scan, we prefer not to prefetch a large number of pages from
+the very beginning: such useless prefetch can reduce the performance
+of point lookups. Instead, we start with the smallest prefetch
+distance and increase it by INCREASE_PREFETCH_DISTANCE_STEP after
+processing each item, until it reaches effective_io_concurrency. For
+an index-only scan we increase the prefetch distance after processing
+each leaf page, and for an index scan after processing each tuple.
+The only exception is the case when no key bounds are specified: then
+we traverse the whole relation, and it makes sense to start with the
+largest possible prefetch distance from the very beginning.
src/backend/access/nbtree/nbtinsert.c

+1-1
@@ -2159,7 +2159,7 @@ _bt_insert_parent(Relation rel,
 		   BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 	/* Find the leftmost page at the next level up */
-	pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
+	pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL, NULL);
 	/* Set up a phony stack entry pointing there */
 	stack = &fakestack;
 	stack->bts_blkno = BufferGetBlockNumber(pbuf);

src/backend/access/nbtree/nbtree.c

+1
@@ -368,6 +368,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 
 	so->killedItems = NULL;		/* until needed */
 	so->numKilled = 0;
+	so->prefetch_maximum = 0;	/* disable prefetch */
 
 	/*
 	 * We don't know yet whether the scan will be index-only, so we do not

src/backend/access/nbtree/nbtsearch.c

+210-4
@@ -18,12 +18,14 @@
 #include "access/nbtree.h"
 #include "access/relscan.h"
 #include "access/xact.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 #include "pgstat.h"
 #include "storage/predicate.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
-
+#include "utils/spccache.h"
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
@@ -47,6 +49,7 @@ static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
+#define INCREASE_PREFETCH_DISTANCE_STEP 1
 
 /*
  * _bt_drop_lock_and_maybe_pin()
@@ -842,6 +845,70 @@ _bt_compare(Relation rel,
 	return 0;
 }
 
+
+/*
+ * _bt_read_parent_for_prefetch - read parent page and extract references
+ * to children for prefetch.  This function returns the offset of the
+ * first item.
+ */
+static int
+_bt_read_parent_for_prefetch(IndexScanDesc scan, BlockNumber parent, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	BTPageOpaque opaque;
+	OffsetNumber offnum;
+	OffsetNumber n_child;
+	int			next_parent_prefetch_index;
+	int			i, j;
+
+	buf = _bt_getbuf(rel, parent, BT_READ);
+	page = BufferGetPage(buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	offnum = P_FIRSTDATAKEY(opaque);
+	n_child = PageGetMaxOffsetNumber(page) - offnum + 1;
+
+	/*
+	 * Position where we should insert the prefetch of the next parent
+	 * page: we intentionally use prefetch_maximum here instead of
+	 * current_prefetch_distance, assuming that it will reach
+	 * prefetch_maximum before we reach the end of the parent page.
+	 */
+	next_parent_prefetch_index = (n_child > so->prefetch_maximum)
+		? n_child - so->prefetch_maximum : 0;
+
+	if (ScanDirectionIsForward(dir))
+	{
+		so->next_parent = opaque->btpo_next;
+		if (so->next_parent == P_NONE)
+			next_parent_prefetch_index = -1;
+		for (i = 0, j = 0; i < n_child; i++)
+		{
+			ItemId		itemid = PageGetItemId(page, offnum + i);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			if (i == next_parent_prefetch_index)
+				so->prefetch_blocks[j++] = so->next_parent;	/* time to prefetch next parent page */
+			so->prefetch_blocks[j++] = BTreeTupleGetDownLink(itup);
+		}
+	}
+	else
+	{
+		so->next_parent = opaque->btpo_prev;
+		if (so->next_parent == P_NONE)
+			next_parent_prefetch_index = -1;
+		for (i = 0, j = 0; i < n_child; i++)
+		{
+			ItemId		itemid = PageGetItemId(page, offnum + n_child - i - 1);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			if (i == next_parent_prefetch_index)
+				so->prefetch_blocks[j++] = so->next_parent;	/* time to prefetch next parent page */
+			so->prefetch_blocks[j++] = BTreeTupleGetDownLink(itup);
+		}
+	}
+	so->n_prefetch_blocks = j;
+	so->last_prefetch_index = 0;
+	_bt_relbuf(rel, buf);
+	return offnum;
+}
+
 /*
  * _bt_first() -- Find the first item in a scan.
  *
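The core of `_bt_read_parent_for_prefetch` is the interleaving of child downlinks with the next parent page. A self-contained model of the forward-scan case is sketched below; it is an illustration under simplifying assumptions, not the committed code — block numbers are plain `unsigned` values and `MOCK_P_NONE` stands in for nbtree's `P_NONE`.

```c
#define MOCK_P_NONE 0xFFFFFFFFu	/* stands in for nbtree's P_NONE */

/* Fill "out" with the blocks to prefetch for one parent page, forward
 * scan: the children in order, with the next parent page injected
 * prefetch_maximum entries before the end of the child list.
 * Returns the total number of blocks queued. */
int
build_prefetch_blocks(const unsigned *children, int n_child,
					  unsigned next_parent, int prefetch_maximum,
					  unsigned *out)
{
	int		next_parent_prefetch_index = (n_child > prefetch_maximum)
		? n_child - prefetch_maximum : 0;
	int		i, j;

	if (next_parent == MOCK_P_NONE)
		next_parent_prefetch_index = -1;	/* rightmost parent: nothing to inject */
	for (i = 0, j = 0; i < n_child; i++)
	{
		if (i == next_parent_prefetch_index)
			out[j++] = next_parent;		/* prefetch the next parent page itself */
		out[j++] = children[i];			/* prefetch the child (leaf) page */
	}
	return j;
}
```

Injecting the next parent `prefetch_maximum` entries before the end means its read is already in flight by the time the scan exhausts the current parent's children.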
@@ -1101,6 +1168,37 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		}
 	}
 
+	/* Neon: initialize prefetch */
+	so->n_prefetch_requests = 0;
+	so->n_prefetch_blocks = 0;
+	so->last_prefetch_index = 0;
+	so->next_parent = P_NONE;
+	so->prefetch_maximum = IsCatalogRelation(rel)
+		? effective_io_concurrency
+		: get_tablespace_io_concurrency(rel->rd_rel->reltablespace);
+
+	if (scan->xs_want_itup)		/* index-only scan */
+	{
+		if (enable_indexonlyscan_prefetch)
+		{
+			/*
+			 * We disable prefetch for parallel index-only scans.  Neon
+			 * prefetch is efficient only if prefetched blocks are accessed
+			 * by the same worker which issued the prefetch request.  The
+			 * logic of splitting pages between parallel workers in an index
+			 * scan doesn't allow us to satisfy this requirement.  Also,
+			 * prefetch of leaf pages is useless if the expected number of
+			 * rows fits in one page.
+			 */
+			if (scan->parallel_scan)
+				so->prefetch_maximum = 0;	/* disable prefetch */
+		}
+		else
+			so->prefetch_maximum = 0;	/* disable prefetch */
+	}
+	else if (!enable_indexscan_prefetch || !scan->heapRelation)
+		so->prefetch_maximum = 0;	/* disable prefetch */
+
+	/*
+	 * If key bounds are not specified, then we will scan the whole relation
+	 * and it makes sense to start with the largest possible prefetch
+	 * distance.
+	 */
+	so->current_prefetch_distance = (keysCount == 0) ? so->prefetch_maximum : 0;
+
 	/*
 	 * If we found no usable boundary keys, we have to start from one end of
 	 * the tree. Walk down that edge to the first or last key, and scan from
@@ -1371,6 +1469,21 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 */
 	stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
 
+	/* Start prefetching for index-only scan */
+	if (so->prefetch_maximum > 0 && stack != NULL && scan->xs_want_itup)	/* index-only scan */
+	{
+		int		first_offset = _bt_read_parent_for_prefetch(scan, stack->bts_blkno, dir);
+		int		skip = ScanDirectionIsForward(dir)
+			? stack->bts_offset - first_offset
+			: first_offset + so->n_prefetch_blocks - 1 - stack->bts_offset;
+
+		Assert(so->n_prefetch_blocks >= skip);
+		so->current_prefetch_distance = INCREASE_PREFETCH_DISTANCE_STEP;
+		so->n_prefetch_requests = Min(so->current_prefetch_distance, so->n_prefetch_blocks - skip);
+		so->last_prefetch_index = skip + so->n_prefetch_requests;
+		for (int i = skip; i < so->last_prefetch_index; i++)
+			PrefetchBuffer(rel, MAIN_FORKNUM, so->prefetch_blocks[i]);
+	}
+
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
 
@@ -1510,9 +1623,63 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_heaptid = currItem->heapTid;
-	if (scan->xs_want_itup)
+	if (scan->xs_want_itup)		/* index-only scan */
+	{
 		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+	}
+	else if (so->prefetch_maximum > 0)
+	{
+		int		prefetchLimit, prefetchDistance;
+
+		/*
+		 * Neon: prefetch referenced heap pages.  Because it is difficult to
+		 * predict how many items the index scan will return, we do not want
+		 * to prefetch many heap pages from the very beginning: they may not
+		 * be needed.  So we increase the prefetch distance by
+		 * INCREASE_PREFETCH_DISTANCE_STEP at each index scan iteration
+		 * until it reaches prefetch_maximum.
+		 */
+
+		/* Advance prefetch distance until it reaches prefetch_maximum */
+		if (so->current_prefetch_distance + INCREASE_PREFETCH_DISTANCE_STEP <= so->prefetch_maximum)
+			so->current_prefetch_distance += INCREASE_PREFETCH_DISTANCE_STEP;
+		else
+			so->current_prefetch_distance = so->prefetch_maximum;
+
+		/* How much we can prefetch */
+		prefetchLimit = Min(so->current_prefetch_distance, so->currPos.lastItem - so->currPos.firstItem + 1);
 
+		/* Active prefetch requests */
+		prefetchDistance = so->n_prefetch_requests;
+
+		/*
+		 * Consume one prefetch request (if any)
+		 */
+		if (prefetchDistance != 0)
+			prefetchDistance -= 1;
+
+		/*
+		 * Keep the number of active prefetch requests equal to the current
+		 * prefetch distance.  When the prefetch distance reaches the
+		 * prefetch maximum, this loop performs at most one iteration, but
+		 * at the beginning of an index scan it performs up to
+		 * INCREASE_PREFETCH_DISTANCE_STEP+1 iterations.
+		 */
+		if (ScanDirectionIsForward(dir))
+		{
+			while (prefetchDistance < prefetchLimit && so->currPos.itemIndex + prefetchDistance <= so->currPos.lastItem)
+			{
+				BlockNumber blkno = BlockIdGetBlockNumber(&so->currPos.items[so->currPos.itemIndex + prefetchDistance].heapTid.ip_blkid);
+				PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, blkno);
+				prefetchDistance += 1;
+			}
+		}
+		else
+		{
+			while (prefetchDistance < prefetchLimit && so->currPos.itemIndex - prefetchDistance >= so->currPos.firstItem)
+			{
+				BlockNumber blkno = BlockIdGetBlockNumber(&so->currPos.items[so->currPos.itemIndex - prefetchDistance].heapTid.ip_blkid);
+				PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, blkno);
+				prefetchDistance += 1;
+			}
+		}
+		so->n_prefetch_requests = prefetchDistance;	/* update number of active prefetch requests */
+	}
 	return true;
 }
 
@@ -1919,6 +2086,30 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		so->markItemIndex = -1;
 	}
 
+	if (scan->xs_want_itup && so->prefetch_maximum > 0)	/* prefetching of leaf pages for index-only scan */
+	{
+		/* Advance prefetch distance until it reaches prefetch_maximum */
+		if (so->current_prefetch_distance + INCREASE_PREFETCH_DISTANCE_STEP <= so->prefetch_maximum)
+			so->current_prefetch_distance += INCREASE_PREFETCH_DISTANCE_STEP;
+
+		so->n_prefetch_requests -= 1;	/* we load the next leaf page, so decrement the number of active prefetch requests */
+
+		/* Check if there are more children to prefetch at the current parent page */
+		if (so->last_prefetch_index == so->n_prefetch_blocks && so->next_parent != P_NONE)
+		{
+			/* we have prefetched all items from the current parent page, let's move to the next parent page */
+			_bt_read_parent_for_prefetch(scan, so->next_parent, dir);
+			so->n_prefetch_requests -= 1;	/* loading the parent page consumes one more prefetch request */
+		}
+
+		/* Try to keep the number of active prefetch requests equal to the current prefetch distance */
+		while (so->n_prefetch_requests < so->current_prefetch_distance && so->last_prefetch_index < so->n_prefetch_blocks)
+		{
+			so->n_prefetch_requests += 1;
+			PrefetchBuffer(scan->indexRelation, MAIN_FORKNUM, so->prefetch_blocks[so->last_prefetch_index++]);
+		}
+	}
+
 	if (ScanDirectionIsForward(dir))
 	{
 		/* Walk right to the next page with data */
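The bookkeeping in the `_bt_steppage` hunk above keeps the number of in-flight prefetch requests equal to the current prefetch distance. The simplified model below (a sketch, not the committed code) captures that invariant: one request is consumed per leaf page actually read, then the window is refilled from the remaining `prefetch_blocks` slots.

```c
typedef struct
{
	int		n_prefetch_requests;		/* issued but not yet consumed */
	int		last_prefetch_index;		/* next slot of prefetch_blocks to issue */
	int		n_prefetch_blocks;			/* blocks collected from the parent page */
	int		current_prefetch_distance;	/* target size of the in-flight window */
} PrefetchModel;

/* Advance to the next leaf page; returns how many new prefetch requests
 * were issued (each increment of last_prefetch_index stands in for a
 * PrefetchBuffer() call). */
int
model_step_leaf_page(PrefetchModel *st)
{
	int		issued = 0;

	st->n_prefetch_requests -= 1;	/* the page we read now was prefetched earlier */
	while (st->n_prefetch_requests < st->current_prefetch_distance &&
		   st->last_prefetch_index < st->n_prefetch_blocks)
	{
		st->n_prefetch_requests += 1;
		st->last_prefetch_index += 1;
		issued += 1;
	}
	return issued;
}
```

Once the distance has reached its maximum the refill loop issues at most one request per step; while the distance is still ramping up it issues up to `INCREASE_PREFETCH_DISTANCE_STEP + 1` requests, which is how the window grows.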
@@ -2323,6 +2514,7 @@ _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot)
  */
 Buffer
 _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
+				 BlockNumber *parent,
 				 Snapshot snapshot)
 {
 	Buffer		buf;
@@ -2331,6 +2523,7 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 	OffsetNumber offnum;
 	BlockNumber blkno;
 	IndexTuple	itup;
+	BlockNumber parent_blocknum = P_NONE;
 
 	/*
 	 * If we are looking for a leaf page, okay to descend from fast root;
@@ -2348,6 +2541,7 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 	page = BufferGetPage(buf);
 	TestForOldSnapshot(snapshot, rel, page);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	blkno = BufferGetBlockNumber(buf);
 
 	for (;;)
 	{
@@ -2386,12 +2580,15 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 		offnum = P_FIRSTDATAKEY(opaque);
 
 		itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+		parent_blocknum = blkno;
 		blkno = BTreeTupleGetDownLink(itup);
 
 		buf = _bt_relandgetbuf(rel, buf, blkno, BT_READ);
 		page = BufferGetPage(buf);
 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	}
+	if (parent)
+		*parent = parent_blocknum;
 
 	return buf;
 }
@@ -2415,13 +2612,13 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	BTPageOpaque opaque;
 	OffsetNumber start;
 	BTScanPosItem *currItem;
-
+	BlockNumber parent;
 
 	/*
 	 * Scan down to the leftmost or rightmost leaf page. This is a simplified
 	 * version of _bt_search(). We don't maintain a stack since we know we
 	 * won't need it.
 	 */
-	buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir), scan->xs_snapshot);
+	buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir), &parent, scan->xs_snapshot);
 
 	if (!BufferIsValid(buf))
 	{
@@ -2434,6 +2631,15 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 		return false;
 	}
 
+	/* Start prefetching for index-only scan */
+	if (so->prefetch_maximum > 0 && parent != P_NONE && scan->xs_want_itup)	/* index-only scan */
+	{
+		_bt_read_parent_for_prefetch(scan, parent, dir);
+		so->n_prefetch_requests = so->last_prefetch_index = Min(so->prefetch_maximum, so->n_prefetch_blocks);
+		for (int i = 0; i < so->last_prefetch_index; i++)
+			PrefetchBuffer(rel, MAIN_FORKNUM, so->prefetch_blocks[i]);
+	}
+
 	PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot);
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);

src/backend/optimizer/path/costsize.c

+2
@@ -151,6 +151,8 @@ bool		enable_parallel_hash = true;
 bool		enable_partition_pruning = true;
 bool		enable_async_append = true;
 bool		enable_seqscan_prefetch = true;
+bool		enable_indexscan_prefetch = true;
+bool		enable_indexonlyscan_prefetch = true;
 
 typedef struct
 {
