
could not send WAL redo request: "SendError(..)" #416

Closed
kelvich opened this issue Aug 12, 2021 · 4 comments
Labels
t/bug Issue Type: Bug

Comments

@kelvich
Contributor

kelvich commented Aug 12, 2021

CI tests caught the following failure: https://app.circleci.com/pipelines/github/zenithdb/zenith/1115/workflows/f7a864a0-631c-4463-abbd-d3f47cf6f96d/jobs/6092

Aug 11 18:23:12.504 INFO caught up at LSN 0/1699D48, location: pageserver::walreceiver, pageserver/src/walreceiver.rs:275
thread '<unnamed>' panicked at 'could not send WAL redo request: "SendError(..)"', pageserver/src/walredo.rs:214:14
stack backtrace:
   0: rust_begin_unwind
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
   1: core::panicking::panic_fmt
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/panicking.rs:92:14
   2: core::option::expect_none_failed
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/option.rs:1300:5
   3: core::result::Result<T,E>::expect
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/result.rs:997:23
   4: <pageserver::walredo::PostgresRedoManager as pageserver::walredo::WalRedoManager>::request_redo
             at /home/circleci/project/pageserver/src/walredo.rs:210:9
   5: <pageserver::object_repository::ObjectTimeline as pageserver::repository::Timeline>::get_page_at_lsn_nowait
             at /home/circleci/project/pageserver/src/object_repository.rs:310:32
   6: <pageserver::object_repository::ObjectTimeline as pageserver::repository::Timeline>::get_page_at_lsn
             at /home/circleci/project/pageserver/src/object_repository.rs:263:9
   7: pageserver::page_service::PageServerHandler::handle_pagerequests::{{closure}}
             at /home/circleci/project/pageserver/src/page_service.rs:310:35
   8: prometheus::histogram::Histogram::observe_closure_duration
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/prometheus-0.12.0/src/histogram.rs:715:19
   9: pageserver::page_service::PageServerHandler::handle_pagerequests
             at /home/circleci/project/pageserver/src/page_service.rs:307:41
  10: <pageserver::page_service::PageServerHandler as zenith_utils::postgres_backend::Handler>::process_query
             at /home/circleci/project/pageserver/src/page_service.rs:467:13
  11: zenith_utils::postgres_backend::PostgresBackend::process_message
             at /home/circleci/project/zenith_utils/src/postgres_backend.rs:300:33
  12: zenith_utils::postgres_backend::PostgresBackend::run_message_loop
             at /home/circleci/project/zenith_utils/src/postgres_backend.rs:190:19
  13: zenith_utils::postgres_backend::PostgresBackend::run
             at /home/circleci/project/zenith_utils/src/postgres_backend.rs:176:19
  14: pageserver::page_service::page_service_conn_main
             at /home/circleci/project/pageserver/src/page_service.rs:182:5
  15: pageserver::page_service::thread_main::{{closure}}
             at /home/circleci/project/pageserver/src/page_service.rs:158:31
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
@lubennikovaav
Contributor

PR #410 removes this code, so it may not be worth fixing.

@hlinnaka
Contributor

I don't think PR #410 will fix this, whatever the problem is. ISTM that the WAL redo process returned an error or died unexpectedly, and that failure was propagated to the thread that requested the WAL redo. With PR #410, I presume the same failure in WAL redo would still happen; the stack trace would just look different.
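A minimal sketch (not the actual pageserver code, names are illustrative) of how a panic in a worker thread surfaces as a `SendError` on the requesting side: once the worker panics and unwinds, its end of the channel is dropped, so any later `send()` fails, and an `.expect(...)` on that result produces exactly the kind of panic seen in the backtrace.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<&str>();

    // Worker thread stands in for the WAL redo thread; it panics
    // (simulating the initdb failure) and drops `rx` while unwinding.
    let worker = thread::spawn(move || {
        let _rx = rx;
        panic!("initdb failed");
    });
    let _ = worker.join(); // worker is gone, receiver has been dropped

    // The requesting side now gets SendError instead of an answer;
    // calling .expect("could not send WAL redo request") on this
    // result would panic with the message from the report.
    assert!(tx.send("redo request").is_err());
    println!("send failed: receiver side is gone");
}
```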

@hlinnaka
Contributor

The pageserver.log from the failed run has this, before the above panic:

Aug 11 18:23:08.250 INFO WAL redo thread started for tenant: 55b3bcc8cd7a4512ba2220b507771e06, location: pageserver::walredo, pageserver/src/walredo.rs:247
Aug 11 18:23:08.250 INFO launching WAL redo postgres process for tenant: 55b3bcc8cd7a4512ba2220b507771e06, location: pageserver::walredo, pageserver/src/walredo.rs:259
Aug 11 18:23:08.250 INFO running initdb in "./tenants/55b3bcc8cd7a4512ba2220b507771e06/wal-redo-datadir", location: pageserver::walredo, pageserver/src/walredo.rs:513
Aug 11 18:23:08.260 INFO standard logging redirected to slog, location: pageserver::branches, pageserver/src/branches.rs:80
Aug 11 18:23:08.260 INFO created directory structure in ./tenants/55b3bcc8cd7a4512ba2220b507771e06, location: pageserver::branches, pageserver/src/branches.rs:86
thread 'WAL redo thread' panicked at 'initdb failed: The files belonging to this database system will be owned by user "circleci".
This user must also own the server process.

The database cluster will be initialized with locale "C".
The default database encoding has accordingly been set to "SQL_ASCII".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory ./tenants/55b3bcc8cd7a4512ba2220b507771e06/wal-redo-datadir ... 
stderr:
initdb: error: could not create directory "./tenants/55b3bcc8cd7a4512ba2220b507771e06": File exists
', pageserver/src/walredo.rs:525:13
stack backtrace:
Aug 11 18:23:08.275 INFO running initdb... , location: pageserver::branches, pageserver/src/branches.rs:130
   0: rust_begin_unwind
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
   1: std::panicking::begin_panic_fmt
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
   2: pageserver::walredo::PostgresRedoProcess::launch::{{closure}}
             at /home/circleci/project/pageserver/src/walredo.rs:525:13
   3: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/future/mod.rs:80:19
   4: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/future/future.rs:119:9
   5: tokio::runtime::basic_scheduler::Inner<P>::block_on::{{closure}}::{{closure}}
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:213:62
   6: tokio::coop::with_budget::{{closure}}
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/coop.rs:106:9
   7: std::thread::local::LocalKey<T>::try_with
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/thread/local.rs:272:16
   8: std::thread::local::LocalKey<T>::with
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/thread/local.rs:248:9
   9: tokio::coop::with_budget
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/coop.rs:99:5
  10: tokio::coop::budget
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/coop.rs:76:5
  11: tokio::runtime::basic_scheduler::Inner<P>::block_on::{{closure}}
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:213:39
  12: tokio::runtime::basic_scheduler::enter::{{closure}}
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:323:29
  13: tokio::macros::scoped_tls::ScopedKey<T>::set
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/macros/scoped_tls.rs:61:9
  14: tokio::runtime::basic_scheduler::enter
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:323:5
  15: tokio::runtime::basic_scheduler::Inner<P>::block_on
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:202:9
  16: tokio::runtime::basic_scheduler::InnerGuard<P>::block_on
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:516:9
  17: tokio::runtime::basic_scheduler::BasicScheduler<P>::block_on
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/basic_scheduler.rs:162:24
  18: tokio::runtime::Runtime::block_on
             at /home/circleci/.cargo/registry/src/github.ghproxy.top-1ecc6299db9ec823/tokio-1.8.1/src/runtime/mod.rs:450:46
  19: pageserver::walredo::PostgresRedoManagerInternal::wal_redo_main
             at /home/circleci/project/pageserver/src/walredo.rs:264:19
  20: pageserver::walredo::PostgresRedoManager::new::{{closure}}
             at /home/circleci/project/pageserver/src/walredo.rs:173:17
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aug 11 18:23:10.312 INFO initdb succeeded, location: pageserver::branches, pageserver/src/branches.rs:149
Aug 11 18:23:10.312 INFO bootstrap_timeline "./tenants/55b3bcc8cd7a4512ba2220b507771e06/tmp" at lsn 0/1699C98, location: pageserver::branches, pageserver/src/branches.rs:172

Do we have a race condition, where one thread launches the WAL redo thread and then creates the tenant directory, while the WAL redo thread starts initdb, which also tries to create that directory if it doesn't already exist? If that's the problem, then PR #410 might actually fix it by accident.
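The suspected race can be reproduced in miniature (this is an illustrative sketch, not the pageserver code): two threads both try to create the same directory, and `fs::create_dir` fails with `AlreadyExists` for whichever thread loses, matching the initdb error above. One way to tolerate the race is to treat `AlreadyExists` as success:

```rust
use std::{fs, io, thread};

// Create a directory, treating "already exists" as success:
// the other thread simply won the race.
fn create_tenant_dir(path: &str) -> io::Result<()> {
    match fs::create_dir(path) {
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(()),
        other => other,
    }
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("tenant-race-demo");
    let _ = fs::remove_dir_all(&dir);
    let path = dir.to_str().unwrap().to_string();

    // Two threads racing to create the same directory, like the
    // branch-creation thread and the WAL redo thread in the report.
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let p = path.clone();
            thread::spawn(move || create_tenant_dir(&p))
        })
        .collect();
    for h in handles {
        h.join().unwrap()?; // neither thread errors with the tolerant version
    }
    fs::remove_dir_all(&dir)?;
    Ok(())
}
```

The alternative (and probably the cleaner fix) is to make a single thread responsible for creating the directory before the others are launched, which is effectively what PR #410's restructuring would do.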

@kelvich added the t/bug (Issue Type: Bug) label on Sep 6, 2021
@hlinnaka
Contributor

Haven't seen this for a while, so I guess this has been fixed. Closing.
