The storage bucket(s) backing a large and very busy cluster had a configuration change such that fragment uploads were refused for an extended period (hours), and the disks of many brokers filled to 100%.
As intended, brokers paused accepting new appends until more disk was available, and also as intended, once uploads to the backing bucket resumed, all but one of the brokers reclaimed disk space and were able to continue accepting appends.
The specific mechanism by which this works is that, once a broker sees the remote bucket contains a fragment covering the span of a local fragment, the local fragment (and its *os.File) is dropped for the GC to finalize. It isn't closed explicitly, because a concurrent read may still be accessing it.
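For context, here's a minimal sketch of that drop-for-GC pattern. The names (`localFragment`, `fragmentSet`, `dropCovered`) are hypothetical stand-ins, not Gazette's actual types; the relevant facts are that the `os` package installs a runtime finalizer on `*os.File` which closes the descriptor when it's collected, and that closing the last descriptor of an unlinked file lets the kernel free its blocks.

```go
package fragments

import (
	"os"
	"sync"
)

// localFragment is a hypothetical stand-in for a broker's local fragment:
// a spool file that has already been unlinked from its directory but may
// still be served to concurrent readers.
type localFragment struct {
	file *os.File
}

// fragmentSet is a hypothetical index of local fragments keyed by span.
type fragmentSet struct {
	mu    sync.Mutex
	byKey map[string]*localFragment
}

// dropCovered removes the index entry for a local fragment whose span is
// now covered by a fragment in the remote bucket. The *os.File is
// deliberately NOT closed here, since a concurrent read may still hold it.
// Dropping the last reference is enough: the GC eventually collects the
// *os.File, its finalizer closes the descriptor, and the kernel frees the
// unlinked file's disk blocks.
func (s *fragmentSet) dropCovered(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.byKey, key) // drop the reference; GC + finalizer reclaim the fd
}
```

The corollary is that disk space only frees once every reference to the *os.File is gone; any code path that retains one keeps the unlinked file's blocks allocated indefinitely.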
One broker, for unknown reasons, was unable to reclaim its dangling *os.Files, and never escaped the 100% disk-full condition. Before forcibly killing it, I was able to verify that:

1. GC was still running regularly,
2. goroutine traces showed that refreshes of the fragment index from the bucket -- the mechanism by which *os.File references are dropped -- were proceeding normally, and
3. there weren't other wedged goroutines which could explain a very large number of dangling *os.File references.

Other than that, I'm currently scratching my head.
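If this recurs, one cheap confirmation (a Linux-specific sketch; `countDeletedFDs` is not an existing broker helper) would be to count how many of the process's open descriptors point at already-unlinked files, via /proc/self/fd:

```go
package fragments

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// countDeletedFDs reports how many of this process's open file descriptors
// point at files that have already been unlinked. On Linux, /proc/self/fd
// entries are symlinks whose targets gain a " (deleted)" suffix once the
// underlying file is removed; those are exactly the descriptors pinning
// dropped-but-not-freed disk space.
func countDeletedFDs() (int, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	var deleted int
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join("/proc/self/fd", e.Name()))
		if err != nil {
			continue // fd may have closed between ReadDir and Readlink
		}
		if strings.HasSuffix(target, " (deleted)") {
			deleted++
		}
	}
	return deleted, nil
}

// logDeletedFDs could be called periodically, or the count exported as a metric.
func logDeletedFDs() {
	if n, err := countDeletedFDs(); err == nil {
		fmt.Printf("open-but-deleted file descriptors: %d\n", n)
	}
}
```

The same check works externally with `ls -l /proc/<pid>/fd | grep deleted`, which would at least distinguish "space pinned by live *os.File references" from some other source of disk usage.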