server.go: use worker goroutines for fewer stack allocations #3204
Conversation
Per offline discussions:
Done.
Turns out to be insignificant. Also, since we're now allowing the user to set the number of stream workers, enforcing it to be a power of two is probably not a good idea. Switched to modulo.
I just realised that there's a slight issue with using the stream ID -- it's biased towards lower numbers. If a server receives one unary request from several clients (a common workload), they'll all go to the same channel because the first stream ID is always 0. Switched to a round-robin method that's actually fairer and 1% faster (compared to master, round-robin is a ~5.33% improvement while stream ID is a ~4.54% improvement).
See previous comment. Pushed V2.
Currently (go1.13.4), the default stack size for newly spawned goroutines is 2048 bytes. This is insufficient when processing gRPC requests, as we often require more than 4 KiB of stack. This causes the Go runtime to call runtime.morestack at least twice per RPC, which hurts performance needlessly because stack reallocation requires all sorts of internal work, such as rewriting pointers to point to the new addresses.

Since this stack growth is guaranteed to happen at least twice per RPC, reusing goroutines gives us two wins:

1. The stack is already grown to 8 KiB after the first RPC, so subsequent RPCs do not call runtime.morestack.
2. We eliminate the need to spawn a new goroutine for each request (even though they're relatively inexpensive).

Performance improves across the board. The improvement is especially visible in small, unary requests, as the overhead of stack reallocation is higher percentage-wise. QPS is up anywhere between 3% and 5% depending on the number of concurrent RPC requests in flight. Latency is down ~3%. There is even a 1% decrease in memory footprint in some cases, though that is an unintended but happy coincidence.

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_8-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false

Title       Before       After        Percentage
TotalOps    2613512      2701705      3.37%
SendOps     0            0            NaN%
RecvOps     0            0            NaN%
Bytes/op    8657.00      8654.17      -0.03%
Allocs/op   173.37       173.28       0.00%
ReqT/op     348468.27    360227.33    3.37%
RespT/op    348468.27    360227.33    3.37%
50th-Lat    174.601µs    167.378µs    -4.14%
90th-Lat    233.132µs    229.087µs    -1.74%
99th-Lat    438.98µs     441.857µs    0.66%
Avg-Lat     183.263µs    177.26µs     -3.28%
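A minimal, self-contained sketch of the reuse pattern described above (the names workerData, handleRequest, and numWorkers are illustrative, not the PR's actual identifiers): a fixed set of goroutines is spawned once, and each one handles many requests from a channel, so its stack only grows while serving the first request it receives.

package main

import (
	"fmt"
	"sync"
)

// workerData carries everything a worker needs to handle one request.
type workerData struct {
	id int
	wg *sync.WaitGroup
}

// handleRequest stands in for the real request handler: the part of the work
// that needs more stack than a goroutine's initial 2 KiB.
func handleRequest(d *workerData) {
	defer d.wg.Done()
	_ = fmt.Sprintf("handled request %d", d.id)
}

// worker reuses one goroutine (and its already-grown stack) for every request
// it receives, instead of spawning a fresh goroutine per request.
func worker(ch <-chan *workerData) {
	for d := range ch {
		handleRequest(d)
	}
}

func main() {
	const numWorkers = 4
	ch := make(chan *workerData)
	for i := 0; i < numWorkers; i++ {
		go worker(ch)
	}

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		ch <- &workerData{id: i, wg: &wg}
	}
	wg.Wait()
	close(ch)
}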
After these changes, can you re-run some benchmarks and provide the results here? Thanks!
// serverWorkerResetThreshold defines how often the stack must be reset. Every
// N requests, by spawning a new goroutine in its place, a worker can reset its
// stack so that large stacks don't live in memory forever. 2^16 should allow
// each goroutine stack to live for at least a few seconds in a typical
// workload (assuming a QPS of a few thousand requests/sec).
const serverWorkerResetThreshold = 1 << 16
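As a rough illustration of how the threshold could be honored (a sketch only, not the PR's exact code; workerData and handle are stand-ins): the worker counts the requests it has handled and, once it reaches serverWorkerResetThreshold, starts a replacement goroutine and returns, so its possibly oversized stack can be reclaimed.

// resettingWorker serves up to serverWorkerResetThreshold requests, then
// replaces itself with a fresh goroutine whose stack starts small again,
// so an oversized stack cannot live forever.
func resettingWorker(ch chan *workerData, handle func(*workerData)) {
	for completed := 0; completed < serverWorkerResetThreshold; completed++ {
		d, ok := <-ch
		if !ok {
			return // channel closed: shut the worker down for good
		}
		handle(d)
	}
	go resettingWorker(ch, handle) // reset: the old, grown stack becomes garbage
}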
Brainstorming: should this be time-based instead of request-based? Or some combination of both?
If a server goes idle, we may want to restart the threads.
Possibly yes, but I expect the overhead introduced by that to be non-negligible. Also heavily depends on how Go's runtime manages shrinking stacks during GC pauses, I think.
I think we could do it with a Timer checked in the same select as the workload without being too expensive. But this is fine for now - we can work on this more in the future.
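A rough sketch of that time-based variant, using the same workerData and handle stand-ins as above and a hypothetical resetAfter duration (requires the time package): the timer shares the select with the work channel, so the per-request cost is just stopping and resetting the timer.

// timedWorker respawns itself after resetAfter of inactivity rather than
// after a fixed number of requests.
func timedWorker(ch chan *workerData, handle func(*workerData), resetAfter time.Duration) {
	t := time.NewTimer(resetAfter)
	defer t.Stop()
	for {
		select {
		case d, ok := <-ch:
			if !ok {
				return
			}
			handle(d)
			if !t.Stop() {
				<-t.C // the timer already fired; drain it before resetting
			}
			t.Reset(resetAfter)
		case <-t.C:
			// Idle long enough: respawn so a grown stack can be reclaimed.
			go timedWorker(ch, handle, resetAfter)
			return
		}
	}
}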
A late comment just to point out that the runtime shrinks the stacks of goroutines during GC (or at the next safepoint), so this may not be needed at all (and is potentially counter-productive): https://github.com/golang/go/blob/9d812cfa5cbb1f573d61c452c864072270526753/src/runtime/mgcmark.go#L781-L783
Dropping this part entirely would have the benefit of making it much easier to implement adaptive workers (discussed below).
(Still interested in seeing some new results after the latest changes, otherwise LGTM.)
s.handleStream(st, stream, s.traceInfo(st, stream))
wg.Done()
this should have been

	defer wg.Done()
	s.handleStream(st, stream, s.traceInfo(st, stream))

as it is on line 809 and as it was before the change.
if s.opts.numServerWorkers > 0 {
	data := &serverWorkerData{st: st, wg: &wg, stream: stream}
	select {
	case s.serverWorkerChannels[atomic.AddUint32(&roundRobinCounter, 1)%s.opts.numServerWorkers] <- data:
this is potentially pretty suboptimal in case some workers are busy on requests that take significant time
select {
case s.serverWorkerChannels[atomic.AddUint32(&roundRobinCounter, 1)%s.opts.numServerWorkers] <- data:
default:
	// If all stream workers are busy, fallback to the default code path.
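	// (Sketch: the rest of the default branch is truncated in this quote;
	// presumably it falls back to a one-off goroutine per stream, matching
	// the pre-worker code path quoted earlier in this review.)
	go func() {
		defer wg.Done()
		s.handleStream(st, stream, s.traceInfo(st, stream))
	}()
}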
I think we should add a worker in this case, not a one-off goroutine.
Extra workers created over numServerWorkers should linger for a short period waiting for new work once they're done, and then shut down if no new work arrives. This would in turn make it possible to remove the numServerWorkers knob, as new workers would be created as needed.
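A rough sketch of the dispatch side of that idea (lingeringWorker, workCh, handle, and idleTimeout are hypothetical names, not part of this PR): when the send to the worker channel would block, start a temporary worker that keeps serving requests until it has been idle for a short while, then exits.

select {
case workCh <- data:
	// A resident worker was free to take the request.
default:
	// All resident workers are busy: add a temporary worker instead of a
	// one-off goroutine. It behaves like the timer-based worker sketched
	// earlier, exiting after idleTimeout of inactivity, so the pool grows
	// and shrinks with the load.
	go lingeringWorker(workCh, handle, idleTimeout)
	workCh <- data
}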