PyActiveStorage and Reductionist: parallel performance, general aspects #89
Hi @markgoddard how's things, mate? Got a very quick question, and thought I'd open an issue since it may benefit from a discussion if the answer is not a simple "no". Can we run Reductionist on multiple threads, with the maximum number of threads specified in the `request_data` dict? Funny thing is I actually added `num_threads: 100` in the `request_data`, and our chunks are being analyzed at an average speed 2.5x faster than in normal situations, but I get an exception right at the very end of the run, before getting the result. Cheers very much in advance 🍺
Hi @valeriupredoi, nice to see you're trying to drive the performance up. The short answer is that the exception is correct: you can't pass `num_threads` in `request_data`. The long answer is: Reductionist uses the Tokio asynchronous runtime rather than threads. Each individual request is processed sequentially, but many requests may be processed concurrently, so it's important that the client is able to send many requests to the server concurrently in order to see good performance. I added parallel request support to PyActiveStorage some time ago, but it would be good to confirm that it's working. In general with async runtimes a problem can be too much concurrency, meaning that the server may accept too many requests and exhaust resources (e.g. memory) trying to fulfil them. I did some work on limiting the concurrency in a controlled way, but it is not enabled by default. I'm happy to help you look into the performance if you think it's not working well.
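For illustration, a minimal client-side sketch of that pattern - not PyActiveStorage's actual code, and the endpoint URL and payload shape are hypothetical - sending many requests concurrently while bounding how many are in flight at once:

```python
# Sketch: send many chunk-reduction requests concurrently, with the
# thread pool size capping the number of in-flight requests so the
# server's async runtime is not flooded.
from concurrent.futures import ThreadPoolExecutor

import requests

REDUCTIONIST_URL = "https://reductionist.example:8080/v1/sum"  # hypothetical

def reduce_chunk(chunk_request: dict) -> bytes:
    """POST one chunk-reduction request and return the raw response body."""
    response = requests.post(REDUCTIONIST_URL, json=chunk_request, timeout=60)
    response.raise_for_status()
    return response.content

def reduce_all(chunk_requests: list[dict], max_concurrency: int = 32) -> list[bytes]:
    # The pool size bounds concurrency; Tokio on the server interleaves
    # the accepted requests on its async runtime.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(reduce_chunk, chunk_requests))
```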
Well, the one thing I would say is that we can't be falling back locally. I don't know what caused these results, but it wasn't that!
Hi @markgoddard - great explanation, very many thanks! Since I am here, let me pick your brains, when you have a bit of time to spare: consider two cases - one extreme wrt HDF5 chunking, and another fairly standard (also have a look at the detailed test results we put together at https://github.com/valeriupredoi/testing_PyActiveStorage ).

Extreme HDF5 chunking: 3400 tiny chunks (of about 70kB each). We (PyActiveStorage/Active with Reductionist) use 150 threads to select, process, and eventually send all these 3400 tiny chunks to Reductionist. I plotted the timeseries of how long it takes Reductionist to process each of these chunks (note, of course, time is measured via Active, i.e. the request time as seen by the client), together with the same timeseries but with Active running just one thread.
So there is a factor of 5-10 variability in Reductionist's time to process a tiny 70-odd kB chunk, depending on how much load it receives (and this is just from one client!). In fairness, a file with 3400 such tiny chunks is silly, but it serves as an exercise in testing the limits of our pipeline. (I'll post the stuff about the other file, the 64-chunk one, in the next comment so we don't overextend comments.) 🍺
Decent chunking: 64 chunks (of about 4MB each). Here things are relatively better in terms of how long it takes Reductionist to analyze one chunk, i.e. there is no such distinct separation between populations of times, but still, running Active with 150 threads is very bad for Reductionist. Comparing the plot of Reductionist's time to process one chunk vs chunk size when running with 150 threads against the same plot with 1 thread (a serial run), here too one sees about a factor-of-6 variability in average Reductionist time per chunk, depending on how much load we send it at one given time. Note that in the case of a serial Active run, it takes about the same 0.2-0.3s for Reductionist to process either a chunk of 70kB or a chunk of 4MB - which is very cool - but we need to think of ways to optimize its parallel processing since, as it is now, I don't think it's ready yet for a serious thrashing by a number of clients, each running silly small-chunked data (OK, not 70kB, but 1MB default HDF5 chunks haha). What do you think? 🍺
just to be clear: we deployed Reductionist stock (the version from a month ago) - no tweaks or customization, so we (I) don't mess up anything I don't know how to fix 😁
Hi @valeriupredoi, I've read through the milestone report and the data you've shared here. It's certainly interesting to see it used and the performance analysed. The bimodal chunk-processing-time distribution is very interesting, and definitely worth digging into further. I think overall, though, it's worth considering what we're trying to optimise for here. My assumption was something like the following.
Consider a simplified setup where the server processes requests sequentially. If the client concurrently sends 100 requests, the server will process them one by one, returning responses when ready. In this case the request latency seen on the server is proportional to the position of the request in the queue, giving a roughly linear plot - which is what we actually see in this graph.

Adding Reductionist's async processing into the setup makes it harder to reason about, especially without the resource management enabled. However, we might simplify it as adding some parallelism to the queue processing, which would result in another linear plot with a different gradient.

If we consider the 3400-chunk plot, one key difference is that the number of chunks exceeds the number of threads. So there is a new behaviour here that we did not see before - Python threads finishing their chunk, then picking up another. I wonder whether this accounts for some of the bimodal nature?
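As a rough, back-of-the-envelope sketch of that model (a toy of my own making, not anything measured): latency grows linearly with queue position, and parallelism only changes the gradient.

```python
def expected_latency(position: int, service_time: float, parallelism: int = 1) -> float:
    """Toy queueing model: a request at `position` (0-based) in the queue
    waits for the batches ahead of it, then takes `service_time` itself.
    With parallelism=1 this is the purely sequential case; a larger value
    gives a linear plot with a shallower gradient."""
    batches_ahead = position // parallelism
    return (batches_ahead + 1) * service_time

# E.g. 100 concurrent requests at 0.25 s per chunk: the last request sees
# ~25 s latency on a sequential server, but ~2.5 s with parallelism=10.
```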
Hi @markgoddard - hope you had a good Easter break! Sorry for the delay in responding, we all took a few days off around the time J. H. Christ decided to get on an American Airlines flight and take off on his annual trip 😁 Very good points! We are now almost ready to test the following setup: @bnlawrence has created two more virtual machines (accessible via the main VM that we've been testing with, which runs the current remote Reductionist; each of the new VMs has a backend IP address, but the front-end/world-facing IP is the same as the older VM's), so we are planning on deploying Reductionist on those two new ones via a slightly adapted Ansible deployment. That way, we are hoping the chunks sent by the client will be distributed among the (by then) three Reductionists. I am not Ansible-literate myself (Bryan is, though), and wanted to ask you: how can I modify the inventory configuration so the deployment happens on two or more machines at the same time, please? I.e. this file: https://github.com/stackhpc/reductionist-rs/blob/main/deployment/inventory Cheers muchos 🍻
To be specific, the configuration is that we have three nodes, one of which has a public IP address. So active is the public-facing node (with two IP addresses), with active2 and active3 only having one each. They can all see the S3 server, but we want the public IP address to be the haproxy address.
Hi @valeriupredoi & @bnlawrence. Please see the example in the deployment docs. In your case it might look like this:
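A minimal sketch of such a multi-host inventory, assuming the stock group names `haproxy`, `step-ca`, and `reductionist` from the deployment docs and the hostnames from this thread:

```ini
# Hypothetical inventory: group names assumed from the reductionist-rs
# deployment docs; activeh is the public-facing node.
[haproxy]
activeh

[step-ca]
activeh

[reductionist]
activeh
active2
active3
```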
Thanks. But I think the issue is we were not sure what was going on with the terms that look like `localhost ansible_connection=local` - what is the `ansible_connection` doing there?
The example inventory is for the simple case where you are running Ansible on the same host that you want to deploy Reductionist on. `ansible_connection=local` is used to force use of a local shell rather than SSH, for simplicity. If you're using multiple hosts then you shouldn't use `ansible_connection=local` for any remote hosts. If the hostnames are not resolvable you can set `ansible_host=<IP address>` for them in the inventory. You'd need to specify each host on its own line in order to do that.
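For instance, a sketch with made-up backend IP addresses:

```ini
# Hypothetical: the IP addresses are placeholders for the real backend ones.
[reductionist]
activeh
active2 ansible_host=192.168.0.2
active3 ansible_host=192.168.0.3
```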
Hi @markgoddard many thanks for the pointers 🍺 I am hitting an issue I cannot shake off:

```
[vpredoi@activeh ~]$ ansible-playbook -i reductionist-rs/deployment/inventory reductionist-rs/deployment/site.yml -e 'ansible_python_interpreter=/usr/bin/python3' -vvv
ansible-playbook [core 2.15.10]
config file = None
configured module search path = ['/home/vpredoi/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/vpredoi/.local/lib/python3.9/site-packages/ansible
ansible collection location = /home/vpredoi/.ansible/collections:/usr/share/ansible/collections
executable location = /home/vpredoi/.local/bin/ansible-playbook
python version = 3.9.16 (main, May 29 2023, 00:00:00) [GCC 11.3.1 20221121 (Red Hat 11.3.1-4)] (/usr/bin/python3)
jinja version = 3.1.3
libyaml = True
No config file found; using defaults
host_list declined parsing /home/vpredoi/reductionist-rs/deployment/inventory as it did not pass its verify_file() method
script declined parsing /home/vpredoi/reductionist-rs/deployment/inventory as it did not pass its verify_file() method
auto declined parsing /home/vpredoi/reductionist-rs/deployment/inventory as it did not pass its verify_file() method
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
Parsed /home/vpredoi/reductionist-rs/deployment/inventory inventory source with ini plugin
ERROR! couldn't resolve module/action 'community.docker.docker_container'. This often indicates a misspelling, missing collection, or incorrect module path.
The error appears to be in '/home/vpredoi/reductionist-rs/deployment/site.yml': line 72, column 7, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
- name: Ensure step-ca container is running
  ^ here
[vpredoi@activeh ~]$ which python3
/usr/bin/python3
```

Any clues what's the hap here? Also, to be 100% sure: if I deploy on multiple machines, do I keep the hostnames activeh, active2, active3 in the inventory, or should I write the IP addresses instead? Our new head machine is now called activeh.
hang on - I think I know what's the hap here - proves out I don't have the `community.docker` Ansible collection installed
Hi @valeriupredoi. Looks like you missed this step when setting up the new host as an Ansible control host:
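Most likely the Ansible Galaxy requirements install; something along these lines (the requirements file path is an assumption based on the repo layout):

```
# Installs the Galaxy collections the playbook needs, including
# community.docker; file path assumed from the reductionist-rs repo.
ansible-galaxy collection install -r deployment/requirements.yml
```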
Make sure you've run all the steps from https://stackhpc.github.io/reductionist-rs/deployment/#ansible-control-host-setup too.
The inventory looks fine, although to be clear I might specify activeh rather than localhost for those groups.
Regarding hostnames vs IPs, it doesn't matter, as long as Ansible can reach the host. If you have DNS, great. Otherwise you could add entries to `/etc/hosts`.
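For example (made-up addresses):

```
# /etc/hosts entries on the Ansible control host; IPs are placeholders.
192.168.0.2 active2
192.168.0.3 active3
```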
many thanks @markgoddard - I had to do some pip magic and replace a few bits. Note that I had trimmed the [step-ca] section to only one machine, otherwise I kept getting a group-size error. Any ideas - also @bnlawrence?
after yet another try it connected only to one of the machines. Here are my SSH configuration and my inventory (masked off the last three digits for security purposes).
Nevermind me! I managed to get the connections working 🥳 - a bit of jiggery-pokery with the ssh configuration file and that was that: I have activeh, active2 and active3 all seen by Ansible now, and the thing deploys, though I got a new error now. What le grand crappe? 😁
some update: I managed to step over that error by making a couple of changes to the playbook and, of course, by creating a couple of root CA certificates on active2 and active3, but now Reductionist is not pinging correctly. I did open the 8080 ports, so am not sure what to do next - will need some help from @bnlawrence methinks 🍺
I would say in general that the Ansible playbook is designed to be used with all hosts in the inventory - there's a certain amount of referring to facts of other hosts that are assumed to exist.

I think you should trim down the haproxy group to just activeh. HAProxy is not deployed with any support for failover (e.g. using keepalived), and the certificate is generated for the IP of only one of the HAProxy servers. I'll add a check for the group size similar to step-ca.

Regarding the step CLI bootstrap issues, I think it's a side effect of having configured the hosts separately. I'll update the condition to always grab the CA fingerprint so that it's available if any hosts require it.

I'm a bit concerned about your addition of `run_once` to the bootstrap CA task - it should run against all hosts.
Some improvements: #90
Hi @markgoddard - legend, I'll grab that and run with it, cheers very much 🍺 That may explain the final error I reported, which smelled like an SSL issue, but lemme stash the changes I made and run with #90.