Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Schema change statement fails if one node is not responding #1240

Open
soyacz opened this issue Feb 13, 2025 · 2 comments
Open

Schema change statement fails if one node is not responding #1240

soyacz opened this issue Feb 13, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@soyacz
Copy link

soyacz commented Feb 13, 2025

Test scenario:

  1. set up 3 nodes, create keyspace.
  2. pause one node (SIGSTOP signal)
  3. run create table query
  4. query fails with error SchemaAgreementError(RequestError(BrokenConnectionError(BrokenConnectionError(KeepaliveTimeout(127.0.2.1)))))
2025-02-13T11:36:24.743020Z TRACE Request{kind="unprepared" contents="CREATE TABLE test_schema_agreement_paused (k int primary key, v int)"}:Executing request{node=127.0.2.2:9042}: scylla::client::session: Execution started
2025-02-13T11:36:24.743037Z TRACE Request{kind="unprepared" contents="CREATE TABLE test_schema_agreement_paused (k int primary key, v int)"}: scylla::network::connection_pool: Selecting connection for shard shard=0
2025-02-13T11:36:24.743050Z TRACE Request{kind="unprepared" contents="CREATE TABLE test_schema_agreement_paused (k int primary key, v int)"}: scylla::network::connection_pool: Available connections=127.0.2.2:9042
2025-02-13T11:36:24.743064Z TRACE Request{kind="unprepared" contents="CREATE TABLE test_schema_agreement_paused (k int primary key, v int)"}: scylla::network::connection_pool: Found connection for the target shard shard=0
2025-02-13T11:36:24.743080Z TRACE Request{kind="unprepared" contents="CREATE TABLE test_schema_agreement_paused (k int primary key, v int)" shard=0}:Executing request{node=127.0.2.2:9042}: scylla::client::session: Sending connection=127.0.2.2:9042
2025-02-13T11:36:24.743148Z TRACE scylla::network::connection: Sending 1 requests; 86 bytes
2025-02-13T11:36:46.985471Z DEBUG scylla::cluster::worker: Received server event: StatusChange(Down(127.0.2.1:9042))
2025-02-13T11:36:46.994083Z DEBUG scylla::cluster::worker: Received server event: SchemaChange(TableChange { change_type: Created, keyspace_name: "test_rust_1739446584_0", object_name: "test_schema_agreement_paused" })
2025-02-13T11:36:46.994944Z TRACE Request{kind="unprepared" contents="CREATE TABLE test_schema_agreement_paused (k int primary key, v int)" shard=0 request_size=0}:Executing request{node=127.0.2.2:9042}: scylla::client::session: Request succeeded
2025-02-13T11:36:47.196290Z TRACE scylla::network::connection: Sending 1 requests; 73 bytes
2025-02-13T11:36:47.196404Z TRACE scylla::network::connection: Sending 1 requests; 73 bytes
2025-02-13T11:36:47.196461Z TRACE scylla::network::connection: Sending 1 requests; 73 bytes
2025-02-13T11:36:54.076672Z TRACE scylla::network::connection: Sending 1 requests; 9 bytes
2025-02-13T11:36:54.077107Z TRACE scylla::network::connection: Keepalive request successful on connection to node 127.0.2.2
2025-02-13T11:36:54.084480Z TRACE scylla::network::connection: Sending 1 requests; 9 bytes
2025-02-13T11:36:54.084586Z TRACE scylla::network::connection: Sending 1 requests; 9 bytes
2025-02-13T11:36:54.084793Z TRACE scylla::network::connection: Keepalive request successful on connection to node 127.0.2.2
2025-02-13T11:36:54.084882Z TRACE scylla::network::connection: Keepalive request successful on connection to node 127.0.2.3
2025-02-13T11:36:54.086101Z TRACE scylla::network::connection: Sending 1 requests; 9 bytes
test schema_agreement::test_schema_agreement_with_paused_node has been running for over 60 seconds
2025-02-13T11:37:24.076722Z TRACE scylla::network::connection: Sending 1 requests; 9 bytes
2025-02-13T11:37:24.077259Z TRACE scylla::network::connection: Keepalive request successful on connection to node 127.0.2.2
2025-02-13T11:37:24.084645Z TRACE scylla::network::connection: Sending 1 requests; 9 bytes
2025-02-13T11:37:24.084755Z TRACE scylla::network::connection: Sending 1 requests; 9 bytes
2025-02-13T11:37:24.084916Z TRACE scylla::network::connection: Keepalive request successful on connection to node 127.0.2.3
2025-02-13T11:37:24.085050Z TRACE scylla::network::connection: Keepalive request successful on connection to node 127.0.2.2
2025-02-13T11:37:24.087329Z  WARN scylla::network::connection: Timed out while waiting for response to keepalive request on connection to node 127.0.2.1
2025-02-13T11:37:24.087527Z DEBUG scylla::network::connection_pool: [127.0.2.1:9042] Got error for connection 0x7ec9840a71e0: BrokenConnection(BrokenConnectionError(KeepaliveTimeout(127.0.2.1)))
2025-02-13T11:37:24.087574Z TRACE scylla::network::connection_pool: [127.0.2.1:9042] Connection 0x7ec9840a71e0 removed from shard 0 pool, now there is 0 for the shard, total 0
2025-02-13T11:37:24.087617Z TRACE scylla::network::connection_pool: pool_state=[(0, [])]
2025-02-13T11:37:24.087648Z DEBUG scylla::network::connection_pool: [127.0.2.1:9042] Scheduling next refill in 50 ms
thread 'schema_agreement::test_schema_agreement_with_paused_node' panicked at scylla/tests/ccm_integration/schema_agreement.rs:152:14:
called `Result::unwrap()` on an `Err` value: SchemaAgreementError(RequestError(BrokenConnectionError(BrokenConnectionError(KeepaliveTimeout(127.0.2.1)))))

test draft: https://github.com/soyacz/scylla-rust-driver/blob/hackathon-schema-agreement/scylla/tests/ccm_integration/schema_agreement.rs

@Lorak-mmk Lorak-mmk added the bug Something isn't working label Feb 13, 2025
@Lorak-mmk
Copy link
Collaborator

What happens is:

  • Node is paused
  • Paused node does not respond to queries, but the TCP connection is not closed. Driver will notice that the node is down after some time, thanks to the keepaliver
  • Before the driver notices that the connections no longer work, we issue schema changing query
  • Query succeeds, so the driver tries to await schema agreement (because automatic schema agreement waiting on DDL is enabled by default)
  • Driver uses "iter_working_connections" and send a schema version fetch on each of them
  • Paused node doesn't respond, so driver keeps waiting on it.
  • Keepaliver notices the nodes is down and shuts down the connection
  • schema version fetch request returns broken connection error
  • Error propagates, and so driver returns an error for the schema changing query

The same thing is possible for PREPARE path - it also uses "iter_working_connection".
I think the fix is to ignore broken connection errors - we want to issue the query on working connection, so if the connection is no longer working we should ignore it.
We actually already do this for USE KEYSPACE path: see fn use_keyspace_result. We should use similar logic for PREPARE and schema awaiting.

@Lorak-mmk
Copy link
Collaborator

Another minor thing regarding paused node: if we send some query to it, we will wait for response until client-side timeout passes. That means we won't use retries at all.
I wonder if this is the correct behavior (which means the user should use speculative execution to prevent that), or maybe we should have a client-side timeout for a single attempt?
Or maybe the main client-side timeout should be for a single attempt?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants