
could not send WAL redo request: "SendError(..)" #416

Closed
kelvich opened this issue Aug 12, 2021 · 4 comments
Labels
t/bug Issue Type: Bug

Comments

@kelvich
Contributor

kelvich commented Aug 12, 2021

CI tests caught the following failure: https://app.circleci.com/pipelines/github/zenithdb/zenith/1115/workflows/f7a864a0-631c-4463-abbd-d3f47cf6f96d/jobs/6092

Aug 11 18:23:12.504 INFO caught up at LSN 0/1699D48, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:275
thread '<unnamed>' panicked at 'could not send WAL redo request: "SendError(..)"', pageserver/src/walredo.rs:214:14
stack backtrace:
   0: rust_begin_unwind
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
   1: core::panicking::panic_fmt
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/panicking.rs:92:14
   2: core::option::expect_none_failed
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/option.rs:1300:5
   3: core::result::Result<T,E>::expect
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/result.rs:997:23
   4: <pageserver::walredo::PostgresRedoManager as pageserver::walredo::WalRedoManager>::request_redo
             at /home/circleci/project/pageserver/src/walredo.rs:210:9
   5: <pageserver::object_repository::ObjectTimeline as pageserver::repository::Timeline>::get_page_at_lsn_nowait
             at /home/circleci/project/pageserver/src/object_repository.rs:310:32
   6: <pageserver::object_repository::ObjectTimeline as pageserver::repository::Timeline>::get_page_at_lsn
             at /home/circleci/project/pageserver/src/object_repository.rs:263:9
   7: pageserver::page_service::PageServerHandler::handle_pagerequests::{{closure}}
             at /home/circleci/project/pageserver/src/page_service.rs:310:35
   8: prometheus::histogram::Histogram::observe_closure_duration
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/prometheus-0.12.0/src/histogram.rs:715:19
   9: pageserver::page_service::PageServerHandler::handle_pagerequests
             at /home/circleci/project/pageserver/src/page_service.rs:307:41
  10: <pageserver::page_service::PageServerHandler as zenith_utils::postgres_backend::Handler>::process_query
             at /home/circleci/project/pageserver/src/page_service.rs:467:13
  11: zenith_utils::postgres_backend::PostgresBackend::process_message
             at /home/circleci/project/zenith_utils/src/postgres_backend.rs:300:33
  12: zenith_utils::postgres_backend::PostgresBackend::run_message_loop
             at /home/circleci/project/zenith_utils/src/postgres_backend.rs:190:19
  13: zenith_utils::postgres_backend::PostgresBackend::run
             at /home/circleci/project/zenith_utils/src/postgres_backend.rs:176:19
  14: pageserver::page_service::page_service_conn_main
             at /home/circleci/project/pageserver/src/page_service.rs:182:5
  15: pageserver::page_service::thread_main::{{closure}}
             at /home/circleci/project/pageserver/src/page_service.rs:158:31
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
@lubennikovaav
Contributor

PR #410 removes this code, so it may not be worth fixing.

@hlinnaka
Contributor

I don't think PR #410 will fix this, whatever the problem is. ISTM that the WAL redo process returned an error or died unexpectedly, and that failure was propagated to the thread that requested the WAL redo. With PR #410, I presume the same failure in WAL redo would still happen; the stack trace would just look different.
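A minimal sketch (not the actual pageserver code, names are illustrative) of how a panic in a worker thread surfaces as a `SendError` on the requesting side: once the worker panics and unwinds, its end of the channel is dropped, so any later `send()` fails, and an `.expect(...)` on that result produces exactly the kind of panic seen in the backtrace.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<&str>();

    // Worker thread stands in for the WAL redo thread; it panics
    // (simulating the initdb failure) and drops `rx` while unwinding.
    let worker = thread::spawn(move || {
        let _rx = rx;
        panic!("initdb failed");
    });
    let _ = worker.join(); // worker is gone, receiver has been dropped

    // The requesting side now gets SendError instead of an answer;
    // calling .expect("could not send WAL redo request") on this
    // result would panic with the message from the report.
    assert!(tx.send("redo request").is_err());
    println!("send failed: receiver side is gone");
}
```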

@hlinnaka
Contributor

The pageserver.log from the failed run has this, before the above panic:

Aug 11 18:23:08.250 INFO WAL redo thread started for tenant: 55b3bcc8cd7a4512ba2220b507771e06, location: pageserver::walredo, pageserver/src/walredo.rs:247
Aug 11 18:23:08.250 INFO launching WAL redo postgres process for tenant: 55b3bcc8cd7a4512ba2220b507771e06, location: pageserver::walredo, pageserver/src/walredo.rs:259
Aug 11 18:23:08.250 INFO running initdb in "./tenants/55b3bcc8cd7a4512ba2220b507771e06/wal-redo-datadir", location: pageserver::walredo, pageserver/src/walredo.rs:513
Aug 11 18:23:08.260 INFO standard logging redirected to slog, location: pageserver::branches, pageserver/src/branches.rs:80
Aug 11 18:23:08.260 INFO created directory structure in ./tenants/55b3bcc8cd7a4512ba2220b507771e06, location: pageserver::branches, pageserver/src/branches.rs:86
thread 'WAL redo thread' panicked at 'initdb failed: The files belonging to this database system will be owned by user "circleci".
This user must also own the server process.

The database cluster will be initialized with locale "C".
The default database encoding has accordingly been set to "SQL_ASCII".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory ./tenants/55b3bcc8cd7a4512ba2220b507771e06/wal-redo-datadir ... 
stderr:
initdb: error: could not create directory "./tenants/55b3bcc8cd7a4512ba2220b507771e06": File exists
', pageserver/src/walredo.rs:525:13
stack backtrace:
Aug 11 18:23:08.275 INFO running initdb... , location: pageserver::branches, pageserver/src/branches.rs:130
   0: rust_begin_unwind
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
   1: std::panicking::begin_panic_fmt
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
   2: pageserver::walredo::PostgresRedoProcess::launch::{{closure}}
             at /home/circleci/project/pageserver/src/walredo.rs:525:13
   3: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/future/mod.rs:80:19
   4: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/future/future.rs:119:9
   5: tokio::runtime::basic_scheduler::Inner<P>::block_on::{{closure}}::{{closure}}
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:213:62
   6: tokio::coop::with_budget::{{closure}}
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/coop.rs:106:9
   7: std::thread::local::LocalKey<T>::try_with
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/thread/local.rs:272:16
   8: std::thread::local::LocalKey<T>::with
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/thread/local.rs:248:9
   9: tokio::coop::with_budget
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/coop.rs:99:5
  10: tokio::coop::budget
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/coop.rs:76:5
  11: tokio::runtime::basic_scheduler::Inner<P>::block_on::{{closure}}
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:213:39
  12: tokio::runtime::basic_scheduler::enter::{{closure}}
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:323:29
  13: tokio::macros::scoped_tls::ScopedKey<T>::set
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/macros/scoped_tls.rs:61:9
  14: tokio::runtime::basic_scheduler::enter
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:323:5
  15: tokio::runtime::basic_scheduler::Inner<P>::block_on
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:202:9
  16: tokio::runtime::basic_scheduler::InnerGuard<P>::block_on
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:516:9
  17: tokio::runtime::basic_scheduler::BasicScheduler<P>::block_on
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:162:24
  18: tokio::runtime::Runtime::block_on
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/mod.rs:450:46
  19: pageserver::walredo::PostgresRedoManagerInternal::wal_redo_main
             at /home/circleci/project/pageserver/src/walredo.rs:264:19
  20: pageserver::walredo::PostgresRedoManager::new::{{closure}}
             at /home/circleci/project/pageserver/src/walredo.rs:173:17
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aug 11 18:23:10.312 INFO initdb succeeded, location: pageserver::branches, pageserver/src/branches.rs:149
Aug 11 18:23:10.312 INFO bootstrap_timeline "./tenants/55b3bcc8cd7a4512ba2220b507771e06/tmp" at lsn 0/1699C98, location: pageserver::branches, pageserver/src/branches.rs:172

Do we have a race condition, where one thread launches the WAL redo thread and then creates the tenant directory, while the WAL redo thread starts initdb, which also tries to create that directory if it doesn't already exist? If that's the problem, then PR #410 might actually fix it by accident.
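The suspected race can be reproduced in miniature (this is an illustrative sketch, not the pageserver code): two threads both try to create the same directory, and `fs::create_dir` fails with `AlreadyExists` for whichever thread loses, matching the initdb error above. One way to tolerate the race is to treat `AlreadyExists` as success:

```rust
use std::{fs, io, thread};

// Create a directory, treating "already exists" as success:
// the other thread simply won the race.
fn create_tenant_dir(path: &str) -> io::Result<()> {
    match fs::create_dir(path) {
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(()),
        other => other,
    }
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("tenant-race-demo");
    let _ = fs::remove_dir_all(&dir);
    let path = dir.to_str().unwrap().to_string();

    // Two threads racing to create the same directory, like the
    // branch-creation thread and the WAL redo thread in the report.
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let p = path.clone();
            thread::spawn(move || create_tenant_dir(&p))
        })
        .collect();
    for h in handles {
        h.join().unwrap()?; // neither thread errors with the tolerant version
    }
    fs::remove_dir_all(&dir)?;
    Ok(())
}
```

The alternative (and probably the cleaner fix) is to make a single thread responsible for creating the directory before the others are launched, which is effectively what PR #410's restructuring would do.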

@kelvich added the t/bug (Issue Type: Bug) label on Sep 6, 2021
@hlinnaka
Contributor

Haven't seen this for a while, so I guess this has been fixed. Closing.
