
WAL records that are no-ops cause timelines to remain active #5962

Open
1 of 3 tasks
Tracked by #6345 ...

problame opened this issue Nov 28, 2023 · 11 comments
Labels
c/storage/pageserver (Component: storage: pageserver), c/storage (Component: storage), t/bug (Issue Type: Bug), triaged (bugs that were already triaged)

Comments

@problame
Contributor

problame commented Nov 28, 2023

@ars, @koivunej, and I just discovered that there are cases where timelines remain active in safekeepers because of a never-closing gap between the safekeeper's commit_lsn and the pageserver's remote_consistent_lsn.

This became apparent when analyzing the reasons for spawning initial logical size calculation tasks in Pageservers during storage deployment, using this query: ca. 75% of the initial logical size calculations are initiated by WalReceiverConnectionHandler, which doesn't make sense because we know that much less than 75% of tenants are truly active.¹

Other graphs we looked at during analysis (all looking at exact same time frame)

  • live connection count for walreceiver conns between PS and SK (gauge): Link
  • number of walreceiver connection switches (a better signal than the gauge above): Link
  • one of the safekeepers in that region: gauge of active timelines: Link
  • one of the safekeeper's logs, logging timeline state transitions: Link

Repro

You can repro it locally quite easily:

neon_local init
neon_local start
neon_local tenant create --set-default
neon_local endpoint create
neon_local endpoint start main
neon_local endpoint stop main
neon_local stop
neon_local start
# safekeeper is now at a commit_lsn > pageserver's remote_consistent_lsn
# pageserver will never flush a layer

We did some digging with trace-level logging:

RUST_LOG=pageserver::walrecord=trace,pageserver::tenant::timeline::walreceiver=trace,pageserver::tenant::storage_layer::inmemory_layer=trace,pageserver::walingest=trace,debug ./target/debug/neon_local pageserver restart

PS log

2023-11-28T18:16:42.929912Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: received XLogData between 0/149F0D8 and 0/149F1B8
2023-11-28T18:16:42.930139Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: decode_wal_record xl_rmid = 0 xl_info = 96
2023-11-28T18:16:42.930390Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: decode_wal_record xl_rmid = 8 xl_info = 16
2023-11-28T18:16:42.930465Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: decode_wal_record xl_rmid = 0 xl_info = 96
2023-11-28T18:16:42.930496Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: decode_wal_record xl_rmid = 8 xl_info = 16
2023-11-28T18:16:42.930523Z  INFO wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: caught up at LSN 0/149F1B8

We would have expected the trace-level logging from

trace!("put_value key {} at {}/{}", key, self.timeline_id, lsn);

but we didn't see it.

Analysis / Interpretation Of The Logs

These WAL records are no-ops for the Pageserver: we don't put() any key-value pair into the Timeline in response to them, so no open layer is ever created, and check_checkpoint_distance() has nothing to flush.
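
To illustrate, here is a minimal, self-contained model of that behavior (all names below are made up for illustration; this is not the actual pageserver code):

```rust
// Simplified, hypothetical model of the ingest path: a record that produces
// no put() never creates an open layer, so there is nothing to freeze, flush,
// or upload, and disk_consistent_lsn / remote_consistent_lsn never advance
// past it.

struct InMemoryLayer {
    end_lsn: u64,
}

struct Timeline {
    open_layer: Option<InMemoryLayer>,
    disk_consistent_lsn: u64,
}

impl Timeline {
    fn ingest_record(&mut self, lsn: u64, key_values: &[(u64, Vec<u8>)]) {
        if key_values.is_empty() {
            // No-op record: nothing is put(), no open layer is created.
            return;
        }
        let layer = self
            .open_layer
            .get_or_insert(InMemoryLayer { end_lsn: lsn });
        // ... put() each key-value pair into the open layer ...
        layer.end_lsn = lsn;
    }

    fn check_checkpoint_distance(&mut self) {
        // Only an open layer can be frozen & flushed; flushing is what
        // eventually advances disk_consistent_lsn (and, after upload,
        // remote_consistent_lsn).
        if let Some(layer) = self.open_layer.take() {
            self.disk_consistent_lsn = layer.end_lsn;
        }
    }
}

fn main() {
    let mut tl = Timeline { open_layer: None, disk_consistent_lsn: 0 };
    // A no-op record like the ones in the log above: nothing gets put().
    tl.ingest_record(0x149_F1B8, &[]);
    tl.check_checkpoint_distance();
    // disk_consistent_lsn did not advance: the gap to the safekeeper's
    // commit_lsn never closes, so the timeline stays "active".
    assert_eq!(tl.disk_consistent_lsn, 0);
}
```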

Fixing This

TBD

Related tasks

Understand The Scope Of The Issue


Better Observability In Safekeepers


Footnotes

  1. Initial logical size calculation is initiated either from the walreceiver connection handler or by consumption metrics. See also "concurrency-limit initial logical size calculation" (#5955).

@problame problame added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver c/storage Component: storage labels Nov 28, 2023
@problame problame self-assigned this Nov 29, 2023
problame added a commit that referenced this issue Nov 30, 2023
These will help us answer questions such as:
- when & at what rate do calculations get started after PS restart?
- how often is the api to get current incrementally-computed logical
  size called, and does it return Exact vs Approximate?

I'd also be interested in a histogram of how much wall clock
time size calculations take, but I don't know good bucket sizes,
and logging it would introduce yet another per-timeline log
message during startup; I don't think that's worth it just yet.

Context

- https://neondb.slack.com/archives/C033RQ5SPDH/p1701197668789769
- #5962
- #5963
- #5955
- neondatabase/cloud#7408
problame added a commit that referenced this issue Nov 30, 2023
problame added a commit that referenced this issue Nov 30, 2023
@problame problame changed the title SK timelines remain active if WAL only contains no-op records WAL records that are no-ops cause timelines to remain active Nov 30, 2023
@hlinnaka
Contributor

I guess we need to create an empty delta layer to cover those records..

problame added a commit that referenced this issue Dec 1, 2023
@knizhnik
Contributor

knizhnik commented Dec 3, 2023

If we already support filtering of WAL records on Safekeepers (for sharding), then why not filter out these no-op WAL records and not send them to the PS at all?
We definitely need them for replication, but replication is done by safekeepers...

@problame
Contributor Author

problame commented Dec 4, 2023

IDK about @hlinnaka's idea with the empty delta layer, as it impacts compaction.

I kinda like Konstantin's idea of filtering out no-op records in SK.

I'm reading between the lines that Konstantin's idea implies that SKs will keep track of the last non-noop LSN, and the criterion for "PS has caught up" becomes "last non-noop LSN == remote_consistent_lsn".

I think if we go with filtering in SKs, we should also have a corresponding change in PSes that makes noise or outright rejects no-op records, so, we'd be alerted if we introduce any regressions. PR #6002 is a step in that direction.
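
For illustration, a rough sketch of what that filtering plus the "last non-noop LSN" bookkeeping could look like on the SK side; the names and structure here are entirely hypothetical, not the actual safekeeper code:

```rust
// Hypothetical sketch: filter no-op records out of the stream sent to the
// pageserver, and track the last non-noop LSN so the "caught up" check
// doesn't depend on commit_lsn.

struct WalRecord {
    lsn: u64,
    /// Whether the pageserver would put() anything for this record.
    noop_for_pageserver: bool,
}

struct TimelineState {
    commit_lsn: u64,
    /// LSN of the last record the pageserver actually has to ingest.
    last_non_noop_lsn: u64,
    /// Reported back by the pageserver.
    remote_consistent_lsn: u64,
}

impl TimelineState {
    /// Decide which records get forwarded to the pageserver.
    fn filter_for_pageserver(&mut self, records: Vec<WalRecord>) -> Vec<WalRecord> {
        let mut out = Vec::new();
        for rec in records {
            self.commit_lsn = self.commit_lsn.max(rec.lsn);
            if rec.noop_for_pageserver {
                // Skip: the pageserver wouldn't put() anything for it anyway.
                continue;
            }
            self.last_non_noop_lsn = rec.lsn;
            out.push(rec);
        }
        out
    }

    /// "PS has caught up": compare against the last non-noop LSN instead of
    /// commit_lsn, so a no-op-only tail doesn't keep the timeline active.
    fn pageserver_caught_up(&self) -> bool {
        self.remote_consistent_lsn >= self.last_non_noop_lsn
    }
}

fn main() {
    let mut state = TimelineState {
        commit_lsn: 0,
        last_non_noop_lsn: 0,
        remote_consistent_lsn: 0x100,
    };
    let records = vec![
        WalRecord { lsn: 0x100, noop_for_pageserver: false },
        WalRecord { lsn: 0x180, noop_for_pageserver: true },
    ];
    let forwarded = state.filter_for_pageserver(records);
    assert_eq!(forwarded.len(), 1);
    // commit_lsn advanced to 0x180, but the PS still counts as caught up.
    assert!(state.pageserver_caught_up());
}
```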

@problame
Contributor Author

problame commented Dec 4, 2023

@hlinnaka and I decided to pause this effort until #5993 is fixed and shipped. That will give us an idea of how much this issue contributes to the 75% number mentioned in the issue description.

problame added a commit that referenced this issue Dec 4, 2023
…6000)

Problem
-------

Before this PR, there was no concurrency limit on initial logical size
computations.

While logical size computations are lazy in theory, in practice
(production), they happen in a short timeframe after restart.

This means that on a PS with 20k tenants, we'd have up to 20k concurrent
initial logical size calculation requests.

This is self-inflicted needless overload.

This hasn't been a problem so far because the `.await` points on the
logical size calculation path never return `Pending`, hence we have a
natural concurrency limit of the number of executor threads.
But, as soon as we return `Pending` somewhere in the logical size
calculation path, other concurrent tasks get scheduled by tokio.
If these other tasks are also logical size calculations, they eventually
pound on the same bottleneck.

For example, in #5479, we want to switch the VirtualFile descriptor
cache to a `tokio::sync::RwLock`, which makes us return `Pending`, and
without measures like this patch, after PS restart, VirtualFile
descriptor cache thrashes heavily for 2 hours until all the logical size
calculations have been computed and the degree of concurrency /
concurrent VirtualFile operations is down to regular levels.
See the *Experiment* section below for details.


Background
----------

Before this PR, initial logical size calculation was spawned lazily on
first call to `Timeline::get_current_logical_size()`.

In practice (prod), the lazy calculation is triggered by
`WalReceiverConnectionHandler` if the timeline is active according to
storage broker, or by the first iteration of consumption metrics worker
after restart (`MetricsCollection`).

The spawns by walreceiver are high-priority because logical size is
needed by Safekeepers (via walreceiver `PageserverFeedback`) to enforce
the project logical size limit.
The spawns by metrics collection are not on the user-critical path and
hence low-priority. [^consumption_metrics_slo]

[^consumption_metrics_slo]: We can't delay metrics collection
indefinitely because there are TBD internal SLOs tied to metrics
collection happening in a timely manner
(neondatabase/cloud#7408). But let's ignore
that in this issue.

The ratio of walreceiver-initiated spawns vs
consumption-metrics-initiated spawns can be reconstructed from logs
(`spawning logical size computation from context of task kind {:?}`).
PRs #5995 and #6018 add metrics for this.

First investigation of the ratio led to the discovery that walreceiver
spawns 75% of init logical size computations.
That's because of two bugs:
- In Safekeepers: #5993
- In interaction between Pageservers and Safekeepers:
#5962

The safekeeper bug is likely primarily responsible but we don't have the
data yet. The metrics will hopefully provide some insights.

When assessing production-readiness of this PR, please assume that
neither of these bugs are fixed yet.


Changes In This PR
------------------

With this PR, initial logical size calculation is reworked as follows:

First, all initial logical size calculation task_mgr tasks are started
early, as part of timeline activation, and run a retry loop with long
back-off until success. This removes the lazy computation; it was
needless complexity because in practice we compute all logical sizes
anyway, since consumption metrics collects them.

Second, within the initial logical size calculation task, each attempt
queues behind the background loop concurrency limiter semaphore. This
fixes the performance issue that we pointed out in the "Problem" section
earlier.

Third, there is a twist to queuing behind the background loop
concurrency limiter semaphore. Logical size is needed by Safekeepers
(via walreceiver `PageserverFeedback`) to enforce the project logical
size limit. However, we currently do open walreceiver connections even
before we have an exact logical size. That's bad, and I'll build on top
of this PR to fix that
(#5963). But, for the
purposes of this PR, we don't want to introduce a regression, i.e., we
don't want to provide an exact value later than before this PR. The
solution is to introduce a priority-boosting mechanism
(`GetLogicalSizePriority`), allowing callers of
`Timeline::get_current_logical_size` to specify how urgently they need
an exact value. The effect of specifying high urgency is that the
initial logical size calculation task for the timeline will skip the
concurrency limiting semaphore. This should yield effectively the same
behavior as we had before this PR with lazy spawning.
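
For illustration, here is a minimal sketch of the retry-loop-plus-semaphore idea with the priority boost, using tokio primitives. `GetLogicalSizePriority` is the name introduced in this PR, but the variants, function names, and overall structure below are illustrative assumptions, not the actual implementation:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

// Illustrative only: models "retry with long back-off, queue behind the
// background-loop semaphore unless a caller signals urgency".
#[derive(Clone, Copy)]
enum GetLogicalSizePriority {
    Background,
    User,
}

async fn try_calculate_logical_size() -> Result<u64, &'static str> {
    // ... walk the timeline and compute the logical size ...
    Ok(42)
}

async fn initial_logical_size_calculation_task(
    background_limiter: Arc<Semaphore>,
    priority: GetLogicalSizePriority,
) -> u64 {
    let mut backoff = Duration::from_secs(1);
    loop {
        // High-urgency callers (e.g. walreceiver needing an exact size for
        // PageserverFeedback) skip the concurrency limiter; background
        // attempts queue behind it.
        let _permit = match priority {
            GetLogicalSizePriority::User => None,
            GetLogicalSizePriority::Background => Some(
                Arc::clone(&background_limiter)
                    .acquire_owned()
                    .await
                    .expect("semaphore closed"),
            ),
        };
        match try_calculate_logical_size().await {
            Ok(size) => return size,
            Err(_) => {
                // Retry with long back-off until success.
                tokio::time::sleep(backoff).await;
                backoff = (backoff * 2).min(Duration::from_secs(600));
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // e.g. allow only a handful of background calculations at a time
    let limiter = Arc::new(Semaphore::new(4));
    let size = initial_logical_size_calculation_task(
        Arc::clone(&limiter),
        GetLogicalSizePriority::Background,
    )
    .await;
    println!("logical size: {size}");
}
```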

Last, the priority-boosting mechanism obsoletes the `init_order`'s grace
period for initial logical size calculations. It's a separate commit to
reduce the churn during review. We can drop that commit if people think
it's too much churn, and commit it later once we know this PR here
worked as intended.

Experiment With #5479 
---------------------

I validated this PR combined with #5479 to assess whether we're making
forward progress towards asyncification.

The setup is an `i3en.3xlarge` instance with 20k tenants, each with one
timeline that has 9 layers.
All tenants are inactive, i.e., not known to SKs nor storage broker.
This means all initial logical size calculations are spawned by
consumption metrics `MetricsCollection` task kind.
The consumption metrics worker starts requesting logical sizes at low
priority immediately after restart. This is achieved by deleting the
consumption metrics cache file on disk before starting
PS.[^consumption_metrics_cache_file]

[^consumption_metrics_cache_file]: The consumption metrics worker persists
its interval state to achieve consistent reporting intervals across PS
restarts; delete the state file on disk to get predictable (and, I believe,
worst-case in terms of concurrency during PS restart) behavior.

Before this patch, all of these timelines would do their initial
logical size calculation in parallel, leading to extreme thrashing in
page cache and virtual file cache.

With this patch, the virtual file cache thrashing is reduced
significantly (from 80k `open`-system-calls/second to ~500
`open`-system-calls/second during loading).


### Critique

The obvious critique with above experiment is that there's no skipping
of the semaphore, i.e., the priority-boosting aspect of this PR is not
exercised.

If even just 1% of our 20k tenants in the setup were active in
SK/storage_broker, then 200 logical size calculations would skip the
limiting semaphore immediately after restart and run concurrently.

Further critique: given the two bugs wrt timeline inactive vs active
state that were mentioned in the Background section, we could have 75%
of our 20k tenants being (falsely) active on restart.

So... (next section)

This Doesn't Make Us Ready For Async VirtualFile
------------------------------------------------

This PR is a step towards asynchronous `VirtualFile`, aka, #5479 or even
#4744.

But it doesn't yet enable us to ship #5479.

The reason is that this PR doesn't limit the amount of high-priority
logical size computations.
If there are many high-priority logical size calculations requested,
we'll fall over like we did if #5479 is applied without this PR.
And currently, at very least due to the bugs mentioned in the Background
section, we run thousands of high-priority logical size calculations on
PS startup in prod.

So, at a minimum, we need to fix these bugs.

Then we can ship #5479 and #4744, and things will likely be fine under
normal operation.

But in high-traffic situations, overload problems will still be more
likely to happen, e.g., VirtualFile cache descriptor thrashing.
The solution candidates for that are orthogonal to this PR though:
* global concurrency limiting
* per-tenant rate limiting => #5899
* load shedding
* scaling bottleneck resources (fd cache size (neondatabase/cloud#8351),
page cache size (neondatabase/cloud#8351), spread load across more PSes,
etc)

Conclusion
----------

Even with the remarks in the previous section, we should merge this
PR because:
1. it's an improvement over the status quo (esp. if the aforementioned
bugs wrt timeline active / inactive are fixed)
2. it prepares the way for
#6010
3. it gets us close to shipping #5479 and #4744
@jcsp
Contributor

jcsp commented Dec 15, 2023

Triage notes:

  • Christian's latest update is current

@jcsp jcsp added the triaged bugs that were already triaged label Dec 15, 2023
@problame
Contributor Author

problame commented Jan 3, 2024

#5993 has been rolled out and its impact has been reviewed.

I think we don't have a direct metric to measure the impact of this bug.

Instead of using an indirect metric, like comparing pageserver-initiated walreceiver connections with safekeeper active timelines during restart, let's have a direct metric.

Relatively low-effort proposal: push #6002 over the finish line (it's a good change anyway), then add a counter metric for IngestRecordOutcome::Noop.

Watch that counter during restart.

To get the number of affected timelines, also have a counter that is only incremented once per timeline struct (rough sketch below).
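
For illustration, a sketch of what the once-per-timeline counting could look like; the metric names and plumbing here are hypothetical, only `IngestRecordOutcome::Noop` comes from #6002:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

// Hypothetical counters standing in for the actual metrics machinery.
static WAL_INGEST_NOOP_RECORDS: AtomicU64 = AtomicU64::new(0);
static TIMELINES_WITH_NOOP_RECORDS: AtomicU64 = AtomicU64::new(0);

enum IngestRecordOutcome {
    Ingested,
    Noop,
}

struct Timeline {
    // Set the first time this timeline sees a no-op record, so the
    // per-timeline counter is incremented at most once per timeline struct.
    saw_noop_record: AtomicBool,
}

impl Timeline {
    fn record_ingest_outcome(&self, outcome: IngestRecordOutcome) {
        if let IngestRecordOutcome::Noop = outcome {
            WAL_INGEST_NOOP_RECORDS.fetch_add(1, Ordering::Relaxed);
            if !self.saw_noop_record.swap(true, Ordering::Relaxed) {
                TIMELINES_WITH_NOOP_RECORDS.fetch_add(1, Ordering::Relaxed);
            }
        }
    }
}

fn main() {
    let tl = Timeline { saw_noop_record: AtomicBool::new(false) };
    tl.record_ingest_outcome(IngestRecordOutcome::Ingested);
    tl.record_ingest_outcome(IngestRecordOutcome::Noop);
    tl.record_ingest_outcome(IngestRecordOutcome::Noop);
    // Two no-op records, but the timeline is only counted once.
    assert_eq!(WAL_INGEST_NOOP_RECORDS.load(Ordering::Relaxed), 2);
    assert_eq!(TIMELINES_WITH_NOOP_RECORDS.load(Ordering::Relaxed), 1);
}
```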

@problame
Contributor Author

problame commented Jan 4, 2024

It turns out we have the pageserver_wal_ingest_records_received metric now.
While it counts all ingested records, not just ::Noop, we can still use it to see whether there's a peak in ingestions during PS startup.

Dashboard pinned to this week's deploy in us-west-2

=> Eyeballing it for us-west-2, there are no abnormal spikes during PS restart.

My take-away from that is that this issue isn't widespread.


With regard to this issue's relevance for #5479 , remember the main worry was that this bug would trigger a lot of initial logical size calculations.

Looking at the increase of metric pageserver_initial_logical_size_timelines_where_walreceiver_got_approximate_size after restarts provides a conservative approximation for how bad the remaining impact is.
Why is it a conservative approximation?

  • PS startup triggers initial size calculations eagerly in the background.
  • So, every initial size calculation that PS startup does can't be triggered by truly SK-active timelines or timelines that are SK-active because of this issue's bug.

The dashboard above includes a row that shows the numbers. There are fewer than 300 occurrences total per instance in us-west-2, and these 300 are spread over multiple minutes.
So, in summary: quite a low overall rate of these calculations, compared to the total number of initial logical size calculations we do during startup (=> yet another row in that dashboard).

Should be safe to ship #5479 without fixing this bug.

@problame
Contributor Author

problame commented Apr 12, 2024

Some notes from the April storage offsite:

  • @arssher sees 2x more active timelines than there are connections from compute (=> these are either due to S3 uploads or due to this issue, needs more analysis)

  • it is probably sufficient to advance remote_consistent_lsn even for no-op records

    • basically needs an InMemoryLayer::put_noop() that advances the layer's end_lsn; then, when that InMemoryLayer gets frozen & flushed, the end_lsn will eventually end up as IndexPart::disk_consistent_lsn, and that's remote_consistent_lsn moving forward (rough sketch after these notes)
  • @jcsp tunes in: sharding solved this, but it's behind a check that only applies to sharded tenants

    • for the case where there IS an InMemoryLayer open, it's properly solved
    • for the case where there is NO InMemoryLayer, it's solved using a periodic tick
      • this only updates the local metadata but doesn't schedule an upload
      • there is another background thing that schedules an index upload at some interval
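
For illustration, a rough sketch of the put_noop() idea from the notes above. This is hypothetical: the real InMemoryLayer / IndexPart types have different shapes; the sketch only shows how advancing end_lsn on flush would eventually move remote_consistent_lsn forward:

```rust
// Hypothetical InMemoryLayer::put_noop() that only advances end_lsn, so that
// freezing & flushing the layer still moves disk_consistent_lsn (and, after
// upload, remote_consistent_lsn) forward.
struct InMemoryLayer {
    start_lsn: u64,
    end_lsn: u64,
    // key-value puts elided
}

struct IndexPart {
    disk_consistent_lsn: u64,
}

impl InMemoryLayer {
    /// Record that a no-op WAL record up to `lsn` has been ingested,
    /// without putting any key-value pair.
    fn put_noop(&mut self, lsn: u64) {
        assert!(lsn >= self.end_lsn);
        self.end_lsn = lsn;
    }

    /// On freeze & flush, the layer's end_lsn becomes the new
    /// disk_consistent_lsn in the index, which is what the safekeeper
    /// eventually sees as remote_consistent_lsn.
    fn freeze_and_flush(self, index: &mut IndexPart) {
        index.disk_consistent_lsn = index.disk_consistent_lsn.max(self.end_lsn);
    }
}

fn main() {
    let mut index = IndexPart { disk_consistent_lsn: 0x149_F0D8 };
    let mut layer = InMemoryLayer { start_lsn: 0x149_F0D8, end_lsn: 0x149_F0D8 };
    // Ingest a no-op record ending at a higher LSN.
    layer.put_noop(0x149_F1B8);
    layer.freeze_and_flush(&mut index);
    assert_eq!(index.disk_consistent_lsn, 0x149_F1B8);
}
```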

@jcsp
Contributor

jcsp commented Apr 12, 2024

Changes that upload the remote index proactively in some cases -- this was for sharding, and might not cover every case where a timeline's ingest goes idle:

let last_record_lsn = self.get_last_record_lsn();
let disk_consistent_lsn = self.get_disk_consistent_lsn();
if last_record_lsn > disk_consistent_lsn {
    // We have no open layer, but disk_consistent_lsn is behind the last record: this indicates
    // we are a sharded tenant and have skipped some WAL
    let last_freeze_ts = *self.last_freeze_ts.read().unwrap();
    if last_freeze_ts.elapsed() >= self.get_checkpoint_timeout() {
        // This should be somewhat rare, so we log it at INFO level.
        //
        // We checked for checkpoint timeout so that a shard without any
        // data ingested (yet) doesn't write a remote index as soon as it
        // sees its LSN advance: we only do this if we've been layer-less
        // for some time.
        tracing::info!(
            "Advancing disk_consistent_lsn past WAL ingest gap {} -> {}",
            disk_consistent_lsn,
            last_record_lsn
        );

        // The flush loop will update remote consistent LSN as well as disk consistent LSN.
        self.flush_frozen_layers_and_wait(last_record_lsn)
            .await
            .ok();
    }
}

@jcsp
Contributor

jcsp commented Apr 12, 2024

Currently, advancing the disk_consistent_lsn is behind a condition checking whether we're sharded, which can be safely removed (the condition was just there to de-risk the change):
ac7fc61

@problame
Contributor Author

problame commented Sep 2, 2024

Attaching this issue to

as that Epic is when we'll be touching ingest code and will have to worry about the "active" status of timelines anyway.
