pg_regress sometimes hangs in CI #413

Closed
kelvich opened this issue Aug 11, 2021 · 10 comments · Fixed by neondatabase/postgres#65 or #424

kelvich commented Aug 11, 2021

No description provided.

@kelvich kelvich self-assigned this Aug 11, 2021
kelvich commented Aug 11, 2021

Pageserver catches a CRC mismatch:

Aug 11 17:54:36.945 INFO switched segment 0/12FFF939 to 0/130019F9, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:54:51.964 INFO switched segment 0/13FFF01A to 0/14001C73, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:55:02.495 INFO switched segment 0/14FFEB15 to 0/15000B71, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:55:13.528 INFO switched segment 0/15FFF661 to 0/16004515, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:55:24.847 INFO switched segment 0/16FFFABD to 0/17003FBD, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:55:34.102 INFO switched segment 0/17FFF311 to 0/18005CA9, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:55:42.583 INFO switched segment 0/18FFE9F9 to 0/19000F59, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:55:48.275 INFO WAL streaming connection failed (WAL record crc mismatch at 0/198E6525), retrying in 1 second, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:109
Aug 11 17:55:49.276 INFO connecting to "host=127.0.0.1 port=55432 user=zenith_admin dbname=postgres", location: pageserver::walreceiver, pageserver/src/walreceiver.rs:125
Aug 11 17:55:49.289 INFO connected!, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:132

pg_waldump on the corresponding WAL file:

./tmp_install/bin/pg_waldump ./test_output/test_pg_regress/repo/pgdatadirs/tenants/87b8d7ae0f401079f95d1f5b87227176/test_pg_regress/pg_wal/000000010000000000000019

<...>

rmgr: Heap2       len (rec/tot):    463/   463, tx:    1008216, lsn: 0/198E44D8, prev 0/198E4498, desc: MULTI_INSERT 3 tuples flags 0x02, blkref #0: rel 1663/16384/1249 blk 164
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008216, lsn: 0/198E46A8, prev 0/198E44D8, desc: INSERT_LEAF off 131, blkref #0: rel 1663/16384/2658 blk 64
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008216, lsn: 0/198E46E8, prev 0/198E46A8, desc: INSERT_LEAF off 147, blkref #0: rel 1663/16384/2659 blk 51
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008216, lsn: 0/198E4728, prev 0/198E46E8, desc: INSERT_LEAF off 132, blkref #0: rel 1663/16384/2658 blk 64
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008216, lsn: 0/198E4768, prev 0/198E4728, desc: INSERT_LEAF off 148, blkref #0: rel 1663/16384/2659 blk 51
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008216, lsn: 0/198E47A8, prev 0/198E4768, desc: INSERT_LEAF off 133, blkref #0: rel 1663/16384/2658 blk 64
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008216, lsn: 0/198E47E8, prev 0/198E47A8, desc: INSERT_LEAF off 149, blkref #0: rel 1663/16384/2659 blk 51
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008206, lsn: 0/198E4828, prev 0/198E47E8, desc: INSERT_LEAF off 318, blkref #0: rel 1663/16384/3455 blk 16
rmgr: Heap        len (rec/tot):     87/    87, tx:    1008206, lsn: 0/198E4868, prev 0/198E4828, desc: UPDATE off 32 xmax 1008206 flags 0x60 ; new off 43 xmax 0, blkref #0: rel 1663/16384/1259 blk 26
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008206, lsn: 0/198E48C0, prev 0/198E4868, desc: INSERT_LEAF off 294, blkref #0: rel 1663/16384/2662 blk 10
rmgr: Heap2       len (rec/tot):    463/   463, tx:    1008216, lsn: 0/198E4900, prev 0/198E48C0, desc: MULTI_INSERT 3 tuples flags 0x00, blkref #0: rel 1663/16384/1249 blk 164
rmgr: Heap        len (rec/tot):     60/    60, tx:    1008218, lsn: 0/198E4AD0, prev 0/198E4900, desc: DELETE off 15 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/2995 blk 0
pg_waldump: fatal: error in WAL record at 0/198E4AD0: incorrect resource manager data checksum in record at 0/198E4B10

kelvich commented Aug 11, 2021

WAL hexdump around 0/198E4AD0:

08e4950 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08e4980 00 00 00 00 00 00 00 00 1b 00 00 00 00 00 00 00
08e4990 06 00 ff ff 00 00 00 00 ff ff ff ff ff ff ff ff
08e49a0 00 70 73 01 00 00 00 00 00 01 00 00 00 00 00 00
08e49b0 00 00 00 00 00 a5 7a 00 1a 00 01 08 0b 48 00 00
08e49c0 00 00 20 ff ff 3f 00 00 00 00 00 00 f4 66 00 00
08e49d0 78 6d 69 6e 00 00 00 00 00 00 00 00 00 00 00 00
08e49e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08e4a10 1c 00 00 00 00 00 00 00 04 00 fe ff 00 00 00 00
08e4a20 ff ff ff ff ff ff ff ff 01 70 69 01 00 00 00 00
08e4a30 00 01 00 00 00 00 00 00 00 00 00 00 00 ba 7a 00
08e4a40 1a 00 01 08 52 42 00 00 00 00 20 ff ff 3f 00 00
08e4a50 00 00 00 00 f4 66 00 00 63 6d 69 6e 00 00 00 00
08e4a60 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08e4a90 00 00 00 00 00 00 00 00 1d 00 00 00 00 00 00 00
08e4aa0 04 00 fd ff 00 00 00 00 ff ff ff ff ff ff ff ff
08e4ab0 01 70 69 01 00 00 00 00 00 01 00 00 00 00 00 00
08e4ac0 00 00 00 00 00 00 40 03 00 11 00 12 00 13 00 00
08e4ad0 3c 00 00 00 5a 62 0f 00 00 49 8e 19 00 00 00 00
08e4ae0 10 0a 00 00 5d 7b 1a 7f 00 00 00 00 7f 06 00 00

kelvich commented Aug 11, 2021

Jumping over 0/198E4AD0 in pg_waldump does not help -- it seems that 0/198E4AD0 is actually the last valid record in the file, so it's not clear where pageserver's 0/198E6525 is coming from.

Postgres log around the time of this issue:

2021-08-11 14:55:48.259 GMT [2858] STATEMENT:  DELETE FROM r1;
2021-08-11 14:55:48.274 GMT [2294] FATAL:  invalid standby message type "f"
2021-08-11 14:55:48.274 GMT [2294] STATEMENT:  START_REPLICATION PHYSICAL 0/16B9150
2021-08-11 14:55:49.284 GMT [2915] LOG:  [ZENITH_SMGR] libpqpagestore: connected to 'host=localhost port=64000'
2021-08-11 14:55:49.292 GMT [2915] FATAL:  invalid standby message type "f"
2021-08-11 14:55:49.292 GMT [2915] STATEMENT:  START_REPLICATION PHYSICAL 0/198E4B10
2021-08-11 14:55:50.308 GMT [2916] FATAL:  invalid standby message type "f"
2021-08-11 14:55:50.308 GMT [2916] STATEMENT:  START_REPLICATION PHYSICAL 0/198E4B10
2021-08-11 14:55:51.319 GMT [2917] FATAL:  invalid standby message type "f"

kelvich commented Aug 11, 2021

Breakpoint 2: where = pageserver`pageserver::waldecoder::WalStreamDecoder::poll_decode::h2be9c48a6608dcbf + 2417 at waldecoder.rs:173:36, address = 0x000000010515d611
(lldb) c
Process 2271 resuming
Process 2271 stopped
* thread #13, name = 'WAL receiver thread', stop reason = breakpoint 2.1
    frame #0: 0x000000010515d611 pageserver`pageserver::waldecoder::WalStreamDecoder::poll_decode::h2be9c48a6608dcbf(self=0x000070000d4afd68) at waldecoder.rs:173:36
   170 	                    let mut crc = crc32c_append(0, &recordbuf[XLOG_RECORD_CRC_OFFS + 4..]);
   171 	                    crc = crc32c_append(crc, &recordbuf[0..XLOG_RECORD_CRC_OFFS]);
   172 	                    if crc != xlogrec.xl_crc {
-> 173 	                        return Err(WalDecodeError {
   174 	                            msg: "WAL record crc mismatch".into(),
   175 	                            lsn: self.lsn,
   176 	                        });

(lldb) p/x self->startlsn
(zenith_utils::lsn::Lsn) $7 = (__0 = 0x00000000198e4b10)

(lldb) p/x self->lsn
(zenith_utils::lsn::Lsn) $8 = (__0 = 0x00000000198e6525)

kelvich commented Aug 12, 2021

Seems that our FPW / eviction_of_non_wal_logged_pages (43e2ed429) causes an incorrect CRC in some cases.

Output of pg_waldump with the CRC error-out disabled:

CRC check for 0/19E346A8 = '1'
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008409, lsn: 0/19E346A8, prev 0/19E34648, desc: INSERT_LEAF off 41, blkref #0: rel 1663/16384/2659 blk 50
CRC check for 0/19E346E8 = '0'
rmgr: XLOG        len (rec/tot):     49/  6933, tx:          0, lsn: 0/19E346E8, prev 0/19E346A8, desc: FPI , blkref #0: rel 1663/16384/1247 blk 27 FPW
CRC check for 0/19E36218 = '1'
rmgr: Heap        len (rec/tot):     60/    60, tx:    1008408, lsn: 0/19E36218, prev 0/19E346E8, desc: DELETE off 19 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/826 blk 0
CRC check for 0/19E36258 = '1'
rmgr: Heap        len (rec/tot):     60/    60, tx:    1008408, lsn: 0/19E36258, prev 0/19E36218, desc: DELETE off 30 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/2608 blk 101
CRC check for 0/19E36298 = '1'
rmgr: XLOG        len (rec/tot):     49/  7825, tx:          0, lsn: 0/19E36298, prev 0/19E36258, desc: FPI , blkref #0: rel 1663/16384/24997 blk 0 FPW
CRC check for 0/19E38148 = '1'
rmgr: Standby     len (rec/tot):    174/   174, tx:          0, lsn: 0/19E38148, prev 0/19E36298, desc: LOCK xid 1007024 db 16384 rel 25313 xid 1007024 db 16384 rel 25127 xid 1007024 db 16384 rel 26544 xid 1007024 db 16384 rel 25399 xid 1007024 db 16384 rel 24798 xid 1007024 db 16384 rel 24759 xid 1007024 db 16384 rel 25414 xid 1008408 db 16384 rel 26473 xid 1007024 db 16384 rel 25116 xid 1008409 db 16384 rel 26545 xid 1007024 db 16384 rel 26533 xid 1007024 db 16384 rel 25298 
CRC check for 0/19E381F8 = '1'
rmgr: Standby     len (rec/tot):    114/   114, tx:          0, lsn: 0/19E381F8, prev 0/19E38148, desc: RUNNING_XACTS nextXid 1008410 latestCompletedXid 1008407 oldestRunningXid 1007024; 3 xacts: 1008409 1007024 1008408
CRC check for 0/19E38270 = '1'
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/19E38270, prev 0/19E381F8, desc: CHECKPOINT_ONLINE redo 0/19C5F9A0; tli 1; prev tli 1; fpw true; xid 0:1008264; oid 32768; multi 1; offset 0; oldest xid 727 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 1007024; online
CRC check for 0/19E382E8 = '1'
rmgr: Standby     len (rec/tot):    174/   174, tx:          0, lsn: 0/19E382E8, prev 0/19E38270, desc: LOCK xid 1007024 db 16384 rel 25313 xid 1007024 db 16384 rel 25127 xid 1007024 db 16384 rel 26544 xid 1007024 db 16384 rel 25399 xid 1007024 db 16384 rel 24798 xid 1007024 db 16384 rel 24759 xid 1007024 db 16384 rel 25414 xid 1008408 db 16384 rel 26473 xid 1007024 db 16384 rel 25116 xid 1008409 db 16384 rel 26545 xid 1007024 db 16384 rel 26533 xid 1007024 db 16384 rel 25298 
CRC check for 0/19E38398 = '1'
rmgr: Standby     len (rec/tot):    114/   114, tx:          0, lsn: 0/19E38398, prev 0/19E382E8, desc: RUNNING_XACTS nextXid 1008410 latestCompletedXid 1008407 oldestRunningXid 1007024; 3 xacts: 1008409 1007024 1008408

@hlinnaka

Looking at XLogRegisterBuffer() and XLogRegisterBlock(), it's quite bogus that they clear the PD_WAL_LOGGED flag even if the page does not use the standard page layout, i.e. even if REGBUF_STANDARD was not used. That doesn't explain the CRC failure, but clearly we need a better solution to replace PD_WAL_LOGGED.

kelvich commented Aug 12, 2021

I tried flipping every bit in bkpb->bkp_image of the record with the wrong CRC, to check whether some flip would hit the expected CRC. Interestingly enough, it actually worked and produced the right CRC.

In one WAL sample that is:

CRC check for 0/19E346E8 = '0' (res=aa7a2db1, expected=9b797f02)
flip bit bytes_off=1753, bits_off=0: CRC = 9b797f02, hole = 1308, hole_offset = 228

In another:

!CRC check for 0/198E4B10 = '0' (res=11363036, expected=be153c56)
flip bit bytes_off=185, bits_off=0: CRC = be153c56, hole = 1588, hole_offset = 164

So in both cases it is a single bit flip in the middle of the page.
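
For illustration, here is a rough, self-contained sketch of the kind of helper that could do this search. It is hypothetical code, not the actual hack applied to pg_waldump: the crc32c_append() below is a slow bitwise reimplementation (compatible with chained calls like the crc32c crate used in waldecoder.rs), and it scans the whole record payload rather than just bkp_image. It recomputes the record CRC the same way xlogreader.c / waldecoder.rs do: payload first, then the header up to xl_crc.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define XLOG_RECORD_CRC_OFFS 20	/* offsetof(XLogRecord, xl_crc) */

/* Slow, table-free CRC-32C, compatible with chained crc32c_append() calls. */
static uint32_t
crc32c_append(uint32_t crc, const uint8_t *buf, size_t len)
{
	crc = ~crc;
	for (size_t i = 0; i < len; i++)
	{
		crc ^= buf[i];
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Record CRC the way PostgreSQL computes xl_crc: payload, then header up to xl_crc. */
static uint32_t
record_crc(const uint8_t *rec, size_t total_len)
{
	uint32_t	crc;

	crc = crc32c_append(0, rec + XLOG_RECORD_CRC_OFFS + 4,
						total_len - XLOG_RECORD_CRC_OFFS - 4);
	return crc32c_append(crc, rec, XLOG_RECORD_CRC_OFFS);
}

/* Flip each payload bit in turn; report flips whose CRC matches `expected`. */
void
find_single_bitflip(uint8_t *rec, size_t total_len, uint32_t expected)
{
	for (size_t off = XLOG_RECORD_CRC_OFFS + 4; off < total_len; off++)
	{
		for (int bit = 0; bit < 8; bit++)
		{
			rec[off] ^= (uint8_t) (1 << bit);
			if (record_crc(rec, total_len) == expected)
				printf("flip bytes_off=%zu bits_off=%d gives CRC %08x\n",
					   off, bit, (unsigned) expected);
			rec[off] ^= (uint8_t) (1 << bit);	/* undo the flip */
		}
	}
}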

@hlinnaka

With this patch:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0ff3bcca95..5ab30914b5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1132,6 +1132,9 @@ XLogInsertRecord(XLogRecData *rdata,
 		FIN_CRC32C(rdata_crc);
 		rechdr->xl_crc = rdata_crc;
 
+		if (num_fpi > 0)
+			pg_usleep(10000);
+
 		/*
 		 * All the record data, including the header, is now ready to be
 		 * inserted. Copy the record in the space reserved.

I can reproduce this very quickly on my laptop: run make installcheck-parallel, and it won't get very far.

@kelvich kelvich added this to the Zenith v0.2 milestone Aug 13, 2021
@hlinnaka

I was able to get an 'rr' trace of this on my laptop. What happens is that one backend is in the process of writing out an FPI of the page, and between computing the CRC and memcpying the page into the WAL buffer, another backend modifies the page. The modification happens here, in fsmpage.c:

   295                 /*
   296                  * Update the next-target pointer. Note that we do this even if we're only
   297                  * holding a shared lock, on the grounds that it's better to use a shared
   298                  * lock and get a garbled next pointer every now and then, than take the
   299                  * concurrency hit of an exclusive lock.
   300                  *
   301                  * Wrap-around is handled at the beginning of this function.
   302                  */
  >303                 fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);
   304

@hlinnaka

Long story short, zenith_wallog_page() should not be using log_newpage() when it's not holding an exclusive lock on the page. XLogSaveBufferForHint() has to jump through extra hoops and make a temporary copy of the page precisely because of this.
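
A minimal sketch of that direction (hypothetical function name and shape, not the actual patch; the real fix is in neondatabase/postgres#65): snapshot the page into a local aligned buffer while the content lock we already hold is still held, and WAL-log the copy, so a concurrent shared-lock update such as the fp_next_slot change above can no longer alter the bytes between CRC computation and the memcpy into the WAL buffers.

#include "postgres.h"

#include "access/xloginsert.h"		/* log_newpage() */
#include "storage/bufmgr.h"			/* BufferGetPage() */

/*
 * Hypothetical sketch, not the actual zenith_wallog_page(): WAL-log a stable
 * local copy of the page instead of the live shared buffer.  The caller is
 * assumed to hold at least a shared content lock on the buffer.
 */
static XLogRecPtr
wallog_page_copy(Buffer buffer, RelFileNode rnode, ForkNumber forknum,
				 BlockNumber blkno, bool page_std)
{
	PGAlignedBlock copied_buffer;

	/* Snapshot the page contents under the lock we already hold. */
	memcpy(copied_buffer.data, BufferGetPage(buffer), BLCKSZ);

	/*
	 * Concurrent changes to the shared buffer (hint bits, the FSM next-slot
	 * pointer, ...) no longer affect what gets CRC'd and copied into WAL.
	 */
	return log_newpage(&rnode, forknum, blkno, (Page) copied_buffer.data,
					   page_std);
}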

@hlinnaka hlinnaka self-assigned this Aug 13, 2021
lubennikovaav pushed a commit to neondatabase/postgres that referenced this issue Aug 17, 2021
…o CRC errors.

zenith_wallog_page() would call log_newpage() on a buffer, while holding
merely a shared lock on the page. That's not cool, because another backend
could modify the page concurrently. We allow changing hint bits while
holding only a shared lock, and changes on FSM pages, at least. See comments
in XLogSaveBufferForHint() for discussion of this problem.

One instance of the race condition that I was able to capture on my laptop
happened like this:

1. Backend A: needs to evict an FSM page from the buffer cache to make
   room for a new page, and calls zenith_wallog_page() on it. That is
   done while holding a share lock on the page.

2. Backend A: XLogInsertRecord() computes the CRC of the FPI WAL record
   including the FSM page

3. Backend B: Updates the same FSM page while holding only a share lock

4. Backend A: Allocates space in the WAL buffers, and copies the WAL
   record header and the page to the buffers.

At this point, the CRC that backend A computed earlier doesn't match the
contents that were written out to the WAL buffers.

The update of the FSM page in backend B happened here (fsmpage.c):

	/*
	 * Update the next-target pointer. Note that we do this even if we're only
	 * holding a shared lock, on the grounds that it's better to use a shared
	 * lock and get a garbled next pointer every now and then, than take the
	 * concurrency hit of an exclusive lock.
	 *
	 * Wrap-around is handled at the beginning of this function.
	 */
	fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);

To fix, make a temporary copy of the page in zenith_wallog_page(), and
WAL-log that. Just like XLogSaveBufferForHint() does.

Fixes neondatabase/neon#413