Improve BalancedPool balancing algorithm #1065
Comments
Hi @mcollina, I'd like to work on this one. Can you assign me to it?
go for it!
@mcollina what is the reason behind introducing the BalancedPool? Is it to support distributing the requests to multiple upstreams?
Yes, exactly. There can be a few situations where deploying a separate load balancer in front of a pool of services is not an option.
Do you already have an idea of what the algorithm should do? I took a brief look at the paper; the algorithm used to decide the weights assumes that we are aware of the different request classes. That isn't the case in our situation, right? I'll dig more into the paper in the following days. I'm also taking a look at this: https://github.com/elastic/elastic-transport-js/blob/main/src/pool/WeightedConnectionPool.ts It looks like they start each connection with the max weight, and then decrease the weight if the connection errored or timed out. They also have an option to decrease the weight if the HTTP response code is 502, 503, or 504.
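For readers following along, here is a minimal sketch of the weighting behaviour described above: an upstream starts at the maximum weight and is demoted when a request errors, times out, or returns 502/503/504. The names and numbers are illustrative, and the recovery-on-success rule is an added assumption; this is not elastic-transport-js's actual code.

```ts
interface Upstream {
  origin: string;
  weight: number;
}

const MAX_WEIGHT = 100;   // every upstream starts at the maximum weight
const ERROR_PENALTY = 15; // points lost on a failed request

function adjustWeight(upstream: Upstream, errored: boolean, statusCode?: number): void {
  const failed =
    errored || (statusCode !== undefined && [502, 503, 504].includes(statusCode));
  if (failed) {
    // Connection error, timeout, or 502/503/504 response: demote the upstream.
    upstream.weight = Math.max(1, upstream.weight - ERROR_PENALTY);
  } else {
    // Assumption for illustration only: a healthy response restores the
    // weight back towards the maximum.
    upstream.weight = Math.min(MAX_WEIGHT, upstream.weight + ERROR_PENALTY);
  }
}

// Example: two failures in a row drop the weight from 100 to 70.
const a: Upstream = { origin: 'http://a:3000', weight: MAX_WEIGHT };
adjustWeight(a, true);
adjustWeight(a, false, 503);
console.log(a.weight); // 70
```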
That seems like a really good start!
@delvedor can surely help with reviewing.
Hi guys, I finally have a vacation, so hopefully I'll be able to wrap this thing up 🙏 I was stuck at testing and I'm currently working on rebasing after the new factory changes. I need some help.

I've been writing tests and I noticed a behaviour where the algorithm can choose one server many times in a row. I'll try to explain it with an example. Assume we have two upstreams, A and B, both with an initial weight of 50, and assume our dynamic-weight logic applies a penalty of 19 points when an upstream request fails. If B fails, B's weight becomes 31 while A's weight stays at 50. The current weight starts around 50 and the algorithm decreases it every N iterations by the GCD of the weights, which is 1 in our case (GCD(50, 31) = 1). Because the current weight only drops by the GCD every N iterations (2 in our case), the algorithm ends up choosing server A more than 78 times.

One solution here might be to choose a penalty P such that maxWeight % P == 0, something like 10; then we would get a GCD of 10 and B would be chosen after 2 iterations ((50 - 10) / 10) * 2.

It would be very helpful if someone else could validate my findings and check whether this is acceptable behaviour. I'm also up for a call if anyone would like to discuss this further :)
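To make the discussion concrete, here is a sketch of a classic GCD-based interleaved weighted round-robin scheduler of the kind being described above. The names are illustrative; this is not the BalancedPool implementation.

```ts
function gcd(a: number, b: number): number {
  return b === 0 ? a : gcd(b, a % b);
}

class WeightedRoundRobin {
  private index = -1;
  private currentWeight = 0;
  private readonly maxWeight: number;
  private readonly step: number;

  constructor(private readonly weights: number[]) {
    this.maxWeight = Math.max(...weights);
    this.step = weights.reduce((a, b) => gcd(a, b));
  }

  // Classic interleaved WRR: cycle through the upstreams and lower the
  // current-weight threshold by the GCD once per full pass.
  next(): number {
    for (;;) {
      this.index = (this.index + 1) % this.weights.length;
      if (this.index === 0) {
        this.currentWeight -= this.step;
        if (this.currentWeight <= 0) this.currentWeight = this.maxWeight;
      }
      if (this.weights[this.index] >= this.currentWeight) return this.index;
    }
  }
}

// With weights 50 and 31 (GCD 1) upstream 0 gets a long uninterrupted run
// of picks before upstream 1 is ever selected.
const scheduler = new WeightedRoundRobin([50, 31]);
const first30 = Array.from({ length: 30 }, () => scheduler.next());
console.log(first30.join(' ')); // a long run of 0s before the first 1
```

In this sketch, with weights 50 and 31 upstream 0 is returned 20 times before upstream 1 first appears; the exact run length depends on how the implementation decrements the threshold, which is the behaviour discussed above. With weights 50 and 40 the GCD is 10, so the lighter upstream is reached after a single pass.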
I think 78 times is not that bad from my understanding. If we are under low load, always picking the same server is not a problem. If we are under high load, 78 times does not look like a long wait, less than a second if we are sending around 100 req/sec. What does the math look like for 3 or 5 servers?
I've written a test runner. Here is a successful test with 3 servers, a starting weight of 100, an error penalty of 7, and server A failing on the first request:
Here is how it looks with 3 servers running without any failing requests:
The load is distributed equally across the 3 servers.
I'm not sure whether we should have errorPenalty and startingWeightPerServer as values configured by the user, or whether we should have static ones defined by us.
It's a good idea to make them configurable with good defaults.
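As a sketch of what "configurable with a good default" could look like, here is a hypothetical options shape using the values from the test above as defaults. The option names (startingWeightPerServer, errorPenalty) come from this discussion and are illustrative, not undici's actual public API.

```ts
interface BalancedPoolWeightOptions {
  /** Weight every upstream starts with (and the maximum it can hold). */
  startingWeightPerServer?: number;
  /** How many points an upstream loses when a request to it fails. */
  errorPenalty?: number;
}

// Defaults taken from the test run described above (assumed values).
const defaults: Required<BalancedPoolWeightOptions> = {
  startingWeightPerServer: 100,
  errorPenalty: 7,
};

function resolveWeightOptions(
  opts: BalancedPoolWeightOptions = {}
): Required<BalancedPoolWeightOptions> {
  // User-supplied values override the defaults.
  return { ...defaults, ...opts };
}

const opts = resolveWeightOptions({ errorPenalty: 15 });
console.log(opts); // { startingWeightPerServer: 100, errorPenalty: 15 }
```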
@mcollina do we have an estimate of how many upstreams would probably be added? Like a max of 100? I'm asking because the algorithm currently suffers from a case where the complexity can be up to O(maxWeight) if the GCD of all the weights is 1. We could optimise that to O(N) by either calculating the next lower weight from currentWeight, or keeping the logic the same and making sure we never end up with weights whose GCD is 1. I've also noticed that the elastic-transport-js library has a bug due to this assumption: https://github.com/elastic/elastic-transport-js/blob/main/src/pool/WeightedConnectionPool.ts#L54
This is wrong, especially when the GCD is 1. This is why they use a
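For reference, here is a sketch of the O(N) alternative mentioned above: rather than decrementing the current weight by the GCD (up to O(maxWeight) steps per pick when the GCD is 1), jump directly to the greatest weight below the current threshold. Names are illustrative only.

```ts
function nextLowerWeight(weights: number[], currentWeight: number): number {
  // Greatest weight strictly below the current threshold; if none exists,
  // wrap around to the maximum weight. One linear scan per adjustment.
  let next = 0;
  for (const w of weights) {
    if (w < currentWeight && w > next) next = w;
  }
  return next > 0 ? next : Math.max(...weights);
}

// Example: with weights [50, 31] the threshold goes 50 -> 31 -> 50 -> ...
// instead of stepping down one point at a time.
console.log(nextLowerWeight([50, 31], 50)); // 31
console.log(nextLowerWeight([50, 31], 31)); // 50 (wrap-around)
```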
I would not expect more than 100 peers.
I'm seeing a weird behaviour where I have 3 upstreams (A, B, C) and only one of them, C, goes down. After C is down, upstreams A and B will also emit a disconnect event.
Looks like a bug to me.
My bad, it seems related to the test server's keep-alive timeout configuration :(
Hi @mcollina, can we close this issue now?
yes
Currently BalancedPool uses a simple round-robin algorithm if the target upstream is not available. A more nuanced algorithm could easily produce better performance.
This is a good paper describing a better algorithm: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.863.3985&rep=rep1&type=pdf.
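For context, here is a minimal sketch of plain round-robin selection, roughly the behaviour the issue describes as the status quo; it is illustrative, not undici's actual BalancedPool code.

```ts
// Plain round-robin: every upstream is picked in turn with equal weight,
// regardless of how it has been performing.
class RoundRobin<T> {
  private index = -1;

  constructor(private readonly upstreams: T[]) {}

  next(): T {
    this.index = (this.index + 1) % this.upstreams.length;
    return this.upstreams[this.index];
  }
}

const rr = new RoundRobin(['http://a:3000', 'http://b:3000', 'http://c:3000']);
console.log(rr.next(), rr.next(), rr.next(), rr.next()); // a, b, c, a
```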