pg_regress sometimes hangs in CI #413

Closed
kelvich opened this issue Aug 11, 2021 · 10 comments · Fixed by neondatabase/postgres#65 or #424

kelvich commented Aug 11, 2021

No description provided.

@kelvich kelvich self-assigned this Aug 11, 2021
kelvich commented Aug 11, 2021

Pageserver catches a CRC mismatch:

Aug 11 17:54:36.945 INFO switched segment 0/12FFF939 to 0/130019F9, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:54:51.964 INFO switched segment 0/13FFF01A to 0/14001C73, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:55:02.495 INFO switched segment 0/14FFEB15 to 0/15000B71, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:55:13.528 INFO switched segment 0/15FFF661 to 0/16004515, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:55:24.847 INFO switched segment 0/16FFFABD to 0/17003FBD, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:55:34.102 INFO switched segment 0/17FFF311 to 0/18005CA9, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:55:42.583 INFO switched segment 0/18FFE9F9 to 0/19000F59, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:257
Aug 11 17:55:48.275 INFO WAL streaming connection failed (WAL record crc mismatch at 0/198E6525), retrying in 1 second, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:109
Aug 11 17:55:49.276 INFO connecting to "host=127.0.0.1 port=55432 user=zenith_admin dbname=postgres", location: pageserver::walreceiver, pageserver/src/walreceiver.rs:125
Aug 11 17:55:49.289 INFO connected!, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:132

pg_waldump on the corresponding WAL file:

./tmp_install/bin/pg_waldump ./test_output/test_pg_regress/repo/pgdatadirs/tenants/87b8d7ae0f401079f95d1f5b87227176/test_pg_regress/pg_wal/000000010000000000000019

<...>

rmgr: Heap2       len (rec/tot):    463/   463, tx:    1008216, lsn: 0/198E44D8, prev 0/198E4498, desc: MULTI_INSERT 3 tuples flags 0x02, blkref #0: rel 1663/16384/1249 blk 164
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008216, lsn: 0/198E46A8, prev 0/198E44D8, desc: INSERT_LEAF off 131, blkref #0: rel 1663/16384/2658 blk 64
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008216, lsn: 0/198E46E8, prev 0/198E46A8, desc: INSERT_LEAF off 147, blkref #0: rel 1663/16384/2659 blk 51
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008216, lsn: 0/198E4728, prev 0/198E46E8, desc: INSERT_LEAF off 132, blkref #0: rel 1663/16384/2658 blk 64
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008216, lsn: 0/198E4768, prev 0/198E4728, desc: INSERT_LEAF off 148, blkref #0: rel 1663/16384/2659 blk 51
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008216, lsn: 0/198E47A8, prev 0/198E4768, desc: INSERT_LEAF off 133, blkref #0: rel 1663/16384/2658 blk 64
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008216, lsn: 0/198E47E8, prev 0/198E47A8, desc: INSERT_LEAF off 149, blkref #0: rel 1663/16384/2659 blk 51
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008206, lsn: 0/198E4828, prev 0/198E47E8, desc: INSERT_LEAF off 318, blkref #0: rel 1663/16384/3455 blk 16
rmgr: Heap        len (rec/tot):     87/    87, tx:    1008206, lsn: 0/198E4868, prev 0/198E4828, desc: UPDATE off 32 xmax 1008206 flags 0x60 ; new off 43 xmax 0, blkref #0: rel 1663/16384/1259 blk 26
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008206, lsn: 0/198E48C0, prev 0/198E4868, desc: INSERT_LEAF off 294, blkref #0: rel 1663/16384/2662 blk 10
rmgr: Heap2       len (rec/tot):    463/   463, tx:    1008216, lsn: 0/198E4900, prev 0/198E48C0, desc: MULTI_INSERT 3 tuples flags 0x00, blkref #0: rel 1663/16384/1249 blk 164
rmgr: Heap        len (rec/tot):     60/    60, tx:    1008218, lsn: 0/198E4AD0, prev 0/198E4900, desc: DELETE off 15 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/2995 blk 0
pg_waldump: fatal: error in WAL record at 0/198E4AD0: incorrect resource manager data checksum in record at 0/198E4B10

kelvich commented Aug 11, 2021

WAL hexdump around 0/198E4AD0:

08e4950 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08e4980 00 00 00 00 00 00 00 00 1b 00 00 00 00 00 00 00
08e4990 06 00 ff ff 00 00 00 00 ff ff ff ff ff ff ff ff
08e49a0 00 70 73 01 00 00 00 00 00 01 00 00 00 00 00 00
08e49b0 00 00 00 00 00 a5 7a 00 1a 00 01 08 0b 48 00 00
08e49c0 00 00 20 ff ff 3f 00 00 00 00 00 00 f4 66 00 00
08e49d0 78 6d 69 6e 00 00 00 00 00 00 00 00 00 00 00 00
08e49e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08e4a10 1c 00 00 00 00 00 00 00 04 00 fe ff 00 00 00 00
08e4a20 ff ff ff ff ff ff ff ff 01 70 69 01 00 00 00 00
08e4a30 00 01 00 00 00 00 00 00 00 00 00 00 00 ba 7a 00
08e4a40 1a 00 01 08 52 42 00 00 00 00 20 ff ff 3f 00 00
08e4a50 00 00 00 00 f4 66 00 00 63 6d 69 6e 00 00 00 00
08e4a60 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
08e4a90 00 00 00 00 00 00 00 00 1d 00 00 00 00 00 00 00
08e4aa0 04 00 fd ff 00 00 00 00 ff ff ff ff ff ff ff ff
08e4ab0 01 70 69 01 00 00 00 00 00 01 00 00 00 00 00 00
08e4ac0 00 00 00 00 00 00 40 03 00 11 00 12 00 13 00 00
08e4ad0 3c 00 00 00 5a 62 0f 00 00 49 8e 19 00 00 00 00
08e4ae0 10 0a 00 00 5d 7b 1a 7f 00 00 00 00 7f 06 00 00

kelvich commented Aug 11, 2021

Jumping over 0/198E4AD0 in pg_waldump does not help -- it seems that 0/198E4AD0 is actually the last valid record in the file, so it's not clear where pageserver's 0/198E6525 is coming from.

Postgres log around the time of this issue:

2021-08-11 14:55:48.259 GMT [2858] STATEMENT:  DELETE FROM r1;
2021-08-11 14:55:48.274 GMT [2294] FATAL:  invalid standby message type "f"
2021-08-11 14:55:48.274 GMT [2294] STATEMENT:  START_REPLICATION PHYSICAL 0/16B9150
2021-08-11 14:55:49.284 GMT [2915] LOG:  [ZENITH_SMGR] libpqpagestore: connected to 'host=localhost port=64000'
2021-08-11 14:55:49.292 GMT [2915] FATAL:  invalid standby message type "f"
2021-08-11 14:55:49.292 GMT [2915] STATEMENT:  START_REPLICATION PHYSICAL 0/198E4B10
2021-08-11 14:55:50.308 GMT [2916] FATAL:  invalid standby message type "f"
2021-08-11 14:55:50.308 GMT [2916] STATEMENT:  START_REPLICATION PHYSICAL 0/198E4B10
2021-08-11 14:55:51.319 GMT [2917] FATAL:  invalid standby message type "f"

kelvich commented Aug 11, 2021

Breakpoint 2: where = pageserver`pageserver::waldecoder::WalStreamDecoder::poll_decode::h2be9c48a6608dcbf + 2417 at waldecoder.rs:173:36, address = 0x000000010515d611
(lldb) c
Process 2271 resuming
Process 2271 stopped
* thread #13, name = 'WAL receiver thread', stop reason = breakpoint 2.1
    frame #0: 0x000000010515d611 pageserver`pageserver::waldecoder::WalStreamDecoder::poll_decode::h2be9c48a6608dcbf(self=0x000070000d4afd68) at waldecoder.rs:173:36
   170 	                    let mut crc = crc32c_append(0, &recordbuf[XLOG_RECORD_CRC_OFFS + 4..]);
   171 	                    crc = crc32c_append(crc, &recordbuf[0..XLOG_RECORD_CRC_OFFS]);
   172 	                    if crc != xlogrec.xl_crc {
-> 173 	                        return Err(WalDecodeError {
   174 	                            msg: "WAL record crc mismatch".into(),
   175 	                            lsn: self.lsn,
   176 	                        });

(lldb) p/x self->startlsn
(zenith_utils::lsn::Lsn) $7 = (__0 = 0x00000000198e4b10)

(lldb) p/x self->lsn
(zenith_utils::lsn::Lsn) $8 = (__0 = 0x00000000198e6525)

kelvich commented Aug 12, 2021

Seems that our FPW / eviction_of_non_wal_logged_pages (43e2ed429) causes an incorrect CRC in some cases.

Output of pg_waldump with the CRC error-out disabled:

CRC check for 0/19E346A8 = '1'
rmgr: Btree       len (rec/tot):     64/    64, tx:    1008409, lsn: 0/19E346A8, prev 0/19E34648, desc: INSERT_LEAF off 41, blkref #0: rel 1663/16384/2659 blk 50
CRC check for 0/19E346E8 = '0'
rmgr: XLOG        len (rec/tot):     49/  6933, tx:          0, lsn: 0/19E346E8, prev 0/19E346A8, desc: FPI , blkref #0: rel 1663/16384/1247 blk 27 FPW
CRC check for 0/19E36218 = '1'
rmgr: Heap        len (rec/tot):     60/    60, tx:    1008408, lsn: 0/19E36218, prev 0/19E346E8, desc: DELETE off 19 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/826 blk 0
CRC check for 0/19E36258 = '1'
rmgr: Heap        len (rec/tot):     60/    60, tx:    1008408, lsn: 0/19E36258, prev 0/19E36218, desc: DELETE off 30 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/2608 blk 101
CRC check for 0/19E36298 = '1'
rmgr: XLOG        len (rec/tot):     49/  7825, tx:          0, lsn: 0/19E36298, prev 0/19E36258, desc: FPI , blkref #0: rel 1663/16384/24997 blk 0 FPW
CRC check for 0/19E38148 = '1'
rmgr: Standby     len (rec/tot):    174/   174, tx:          0, lsn: 0/19E38148, prev 0/19E36298, desc: LOCK xid 1007024 db 16384 rel 25313 xid 1007024 db 16384 rel 25127 xid 1007024 db 16384 rel 26544 xid 1007024 db 16384 rel 25399 xid 1007024 db 16384 rel 24798 xid 1007024 db 16384 rel 24759 xid 1007024 db 16384 rel 25414 xid 1008408 db 16384 rel 26473 xid 1007024 db 16384 rel 25116 xid 1008409 db 16384 rel 26545 xid 1007024 db 16384 rel 26533 xid 1007024 db 16384 rel 25298 
CRC check for 0/19E381F8 = '1'
rmgr: Standby     len (rec/tot):    114/   114, tx:          0, lsn: 0/19E381F8, prev 0/19E38148, desc: RUNNING_XACTS nextXid 1008410 latestCompletedXid 1008407 oldestRunningXid 1007024; 3 xacts: 1008409 1007024 1008408
CRC check for 0/19E38270 = '1'
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/19E38270, prev 0/19E381F8, desc: CHECKPOINT_ONLINE redo 0/19C5F9A0; tli 1; prev tli 1; fpw true; xid 0:1008264; oid 32768; multi 1; offset 0; oldest xid 727 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 1007024; online
CRC check for 0/19E382E8 = '1'
rmgr: Standby     len (rec/tot):    174/   174, tx:          0, lsn: 0/19E382E8, prev 0/19E38270, desc: LOCK xid 1007024 db 16384 rel 25313 xid 1007024 db 16384 rel 25127 xid 1007024 db 16384 rel 26544 xid 1007024 db 16384 rel 25399 xid 1007024 db 16384 rel 24798 xid 1007024 db 16384 rel 24759 xid 1007024 db 16384 rel 25414 xid 1008408 db 16384 rel 26473 xid 1007024 db 16384 rel 25116 xid 1008409 db 16384 rel 26545 xid 1007024 db 16384 rel 26533 xid 1007024 db 16384 rel 25298 
CRC check for 0/19E38398 = '1'
rmgr: Standby     len (rec/tot):    114/   114, tx:          0, lsn: 0/19E38398, prev 0/19E382E8, desc: RUNNING_XACTS nextXid 1008410 latestCompletedXid 1008407 oldestRunningXid 1007024; 3 xacts: 1008409 1007024 1008408

@hlinnaka

Looking at XLogRegisterBuffer() and XLogRegisterBlock(), it's quite bogus that they clear the PD_WAL_LOGGED flag even if the page does not use the standard page layout, i.e. even if REGBUF_STANDARD was not used. That doesn't explain the CRC failure, but clearly we need a better solution to replace PD_WAL_LOGGED.

kelvich commented Aug 12, 2021

I tried flipping every bit in bkpb->bkp_image of the record with the wrong CRC, to check whether some flip would hit the expected CRC. Interestingly enough, it actually worked and produced the right CRC.

In one WAL sample that is:

CRC check for 0/19E346E8 = '0' (res=aa7a2db1, expected=9b797f02)
flip bit bytes_off=1753, bits_off=0: CRC = 9b797f02, hole = 1308, hole_offset = 228

In another:

!CRC check for 0/198E4B10 = '0' (res=11363036, expected=be153c56)
flip bit bytes_off=185, bits_off=0: CRC = be153c56, hole = 1588, hole_offset = 164

So in both cases it is a single bit flip in the middle of the page.
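
For illustration, here is a rough, self-contained sketch of the kind of helper that could do this search. It is hypothetical code, not the actual hack applied to pg_waldump: the crc32c_append() below is a slow bitwise reimplementation (compatible with chained calls like the crc32c crate used in waldecoder.rs), and it scans the whole record payload rather than just bkp_image. It recomputes the record CRC the same way xlogreader.c / waldecoder.rs do: payload first, then the header up to xl_crc.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define XLOG_RECORD_CRC_OFFS 20	/* offsetof(XLogRecord, xl_crc) */

/* Slow, table-free CRC-32C, compatible with chained crc32c_append() calls. */
static uint32_t
crc32c_append(uint32_t crc, const uint8_t *buf, size_t len)
{
	crc = ~crc;
	for (size_t i = 0; i < len; i++)
	{
		crc ^= buf[i];
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Record CRC the way PostgreSQL computes xl_crc: payload, then header up to xl_crc. */
static uint32_t
record_crc(const uint8_t *rec, size_t total_len)
{
	uint32_t	crc;

	crc = crc32c_append(0, rec + XLOG_RECORD_CRC_OFFS + 4,
						total_len - XLOG_RECORD_CRC_OFFS - 4);
	return crc32c_append(crc, rec, XLOG_RECORD_CRC_OFFS);
}

/* Flip each payload bit in turn; report flips whose CRC matches `expected`. */
void
find_single_bitflip(uint8_t *rec, size_t total_len, uint32_t expected)
{
	for (size_t off = XLOG_RECORD_CRC_OFFS + 4; off < total_len; off++)
	{
		for (int bit = 0; bit < 8; bit++)
		{
			rec[off] ^= (uint8_t) (1 << bit);
			if (record_crc(rec, total_len) == expected)
				printf("flip bytes_off=%zu bits_off=%d gives CRC %08x\n",
					   off, bit, (unsigned) expected);
			rec[off] ^= (uint8_t) (1 << bit);	/* undo the flip */
		}
	}
}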

@hlinnaka

With this patch:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0ff3bcca95..5ab30914b5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1132,6 +1132,9 @@ XLogInsertRecord(XLogRecData *rdata,
 		FIN_CRC32C(rdata_crc);
 		rechdr->xl_crc = rdata_crc;
 
+		if (num_fpi > 0)
+			pg_usleep(10000);
+
 		/*
 		 * All the record data, including the header, is now ready to be
 		 * inserted. Copy the record in the space reserved.

I can reproduce this very quickly on my laptop: run make installcheck-parallel, and it won't get very far.

@kelvich kelvich added this to the Zenith v0.2 milestone Aug 13, 2021
@hlinnaka

I was able to get an 'rr' trace of this on my laptop. What happens is that one backend is in the process of writing out an FPI of the page, and between computing the CRC and memcpying the page into the WAL buffer, another backend modifies the page. The modification happens here, in fsmpage.c:

   295                 /*
   296                  * Update the next-target pointer. Note that we do this even if we're only
   297                  * holding a shared lock, on the grounds that it's better to use a shared
   298                  * lock and get a garbled next pointer every now and then, than take the
   299                  * concurrency hit of an exclusive lock.
   300                  *
   301                  * Wrap-around is handled at the beginning of this function.
   302                  */
  >303                 fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);
   304

@hlinnaka

Long story short, zenith_wallog_page() should not be using log_newpage() when it's not holding an exclusive lock on the page. XLogSaveBufferForHint() has to jump through extra hoops and make a temporary copy of the page precisely because of this.
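
A minimal sketch of that direction (hypothetical function name and shape, not the actual patch; the real fix is in neondatabase/postgres#65): snapshot the page into a local aligned buffer while the content lock we already hold is still held, and WAL-log the copy, so a concurrent shared-lock update such as the fp_next_slot change above can no longer alter the bytes between CRC computation and the memcpy into the WAL buffers.

#include "postgres.h"

#include "access/xloginsert.h"		/* log_newpage() */
#include "storage/bufmgr.h"			/* BufferGetPage() */

/*
 * Hypothetical sketch, not the actual zenith_wallog_page(): WAL-log a stable
 * local copy of the page instead of the live shared buffer.  The caller is
 * assumed to hold at least a shared content lock on the buffer.
 */
static XLogRecPtr
wallog_page_copy(Buffer buffer, RelFileNode rnode, ForkNumber forknum,
				 BlockNumber blkno, bool page_std)
{
	PGAlignedBlock copied_buffer;

	/* Snapshot the page contents under the lock we already hold. */
	memcpy(copied_buffer.data, BufferGetPage(buffer), BLCKSZ);

	/*
	 * Concurrent changes to the shared buffer (hint bits, the FSM next-slot
	 * pointer, ...) no longer affect what gets CRC'd and copied into WAL.
	 */
	return log_newpage(&rnode, forknum, blkno, (Page) copied_buffer.data,
					   page_std);
}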

@hlinnaka hlinnaka self-assigned this Aug 13, 2021
lubennikovaav pushed a commit to neondatabase/postgres that referenced this issue Aug 17, 2021
…o CRC errors.

zenith_wallog_page() would call log_newpage() on a buffer, while holding
merely a shared lock on the page. That's not cool, because another backend
could modify the page concurrently. We allow changing hint bits while
holding only a shared lock, and changes on FSM pages, at least. See comments
in XLogSaveBufferForHint() for discussion of this problem.

One instance of the race condition that I was able to capture on my laptop
happened like this:

1. Backend A: needs to evict an FSM page from the buffer cache to make
   room for a new page, and calls zenith_wallog_page() on it. That is
   done while holding a share lock on the page.

2. Backend A: XLogInsertRecord() computes the CRC of the FPI WAL record
   including the FSM page

3. Backend B: Updates the same FSM page while holding only a share lock

4. Backend A: Allocates space in the WAL buffers, and copies the WAL
   record header and the page to the buffers.

At this point, the CRC that backend A computed earlier doesn't match the
contents that were written out to the WAL buffers.

The update of the FSM page in backend B happened here (fsmpage.c):

	/*
	 * Update the next-target pointer. Note that we do this even if we're only
	 * holding a shared lock, on the grounds that it's better to use a shared
	 * lock and get a garbled next pointer every now and then, than take the
	 * concurrency hit of an exclusive lock.
	 *
	 * Wrap-around is handled at the beginning of this function.
	 */
	fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);

To fix, make a temporary copy of the page in zenith_wallog_page(), and
WAL-log that. Just like XLogSaveBufferForHint() does.

Fixes neondatabase/neon#413