SISGraph for_all_nodes more efficient single-threaded #87
Conversation
We can definitely not remove the multi-threaded implementation, as Sisyphus can otherwise become nearly unusable on the i6 cluster. I am not sure why this causes such large speedups there, but it is a fact that it does.

Sometimes I can get the same speed with the single-threaded implementation. The problem is that the timing really depends on the current FS usage, so testing this is rather difficult. But I will definitely stick with the multi-threaded approach.
I doubt that the multi-threaded code can be faster. Can you share an example?
Why shouldn't it be possible for the multi-threaded code to be faster? Using multiple threads allows us to place multiple calls to the filesystem at the same time. It's possible that each call now takes a little longer since we are putting more stress on the filesystem, but in my experience it still loads faster overall. @albertz Are you testing your setup on a local hard drive or a network drive?
So I guess it's clear that if it is CPU bound, the single-threaded code must be faster, right? As for the FS-bound case: this depends on whether the FS can be faster with multiple threads, which depends on the FS. NFS specifically should not be faster, because everything gets serialized through the network anyway.
Actually, there is one way to make it faster when the FS is slow: have two threads, one which goes through the graph, collects all jobs, and puts them into a queue, and a separate one which does the FS calls. The current API of `for_all_nodes` does not really allow this, though.
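A minimal sketch of that two-thread idea (the names `collect_inputs` and `check_job_on_fs` are hypothetical placeholders, not the Sisyphus API):

```python
import threading
import queue

def traverse_with_fs_worker(nodes, collect_inputs, check_job_on_fs):
    q = queue.Queue()
    sentinel = object()

    def fs_worker():
        # Consumer: performs the (potentially slow) filesystem calls.
        while True:
            job = q.get()
            if job is sentinel:
                break
            check_job_on_fs(job)

    worker = threading.Thread(target=fs_worker)
    worker.start()

    # Producer: pure CPU work, walking the graph.
    visited = set()
    stack = list(nodes)
    while stack:
        job = stack.pop()
        if job in visited:
            continue
        visited.add(job)
        q.put(job)
        stack.extend(collect_inputs(job))

    q.put(sentinel)
    worker.join()
```

This way the graph walk and the FS calls overlap, so the CPU and the FS are both kept busy.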
I agree with your statement about the CPU-bound case, but I disagree with the statement about NFS.
If it has to go through the network and there is only one Ethernet card, everything has to be serialized at some point; on the physical level, the Ethernet card is just a single stream of bits. On the OS level, caching could potentially speed some things up (though not really for Python), but otherwise I don't see how multiple parallel NFS calls could be faster.
But it is not about NFS itself (which is just the protocol), but the total latency of the machine hosting the FS. So sure, the network cannot parallelize, but you do not have to wait for the full latency time before making the next request. For this PR, this all does not matter anyway. @albertz your change brings the single-threaded implementation into the same range (no matter if slightly better or worse) as the multi-threaded one, so we should accept it. The rest is user configuration, and we can find out over time what to use in which scenario.
I think we are talking past each other. I'm not talking about serializing any request into bytes, but about handling each request serially (would "linear" be a better term to use here?). So say the filesystem gets 4 requests [R1, R2, R3, R4] and the program is waiting for 4 answers [A1, A2, A3, A4]. A serial response would mean the filesystem only starts working on A2 after it finished A1, and A3 is only worked on after A2 is finished. A parallel response would mean the filesystem can work on A1, A2, A3, and A4 at the same time. I'm very sure the latter is the case for most filesystems relevant to our use case, including the standard NFS implementation on Linux. So let's assume each call to the filesystem takes 1 second no matter how many calls are done in parallel: the serial variant then takes about 4 seconds, while the parallel variant finishes after about 1 second. In practice, the filesystem will slow down with multiple requests, but it should still be faster than doing everything one after another.
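A toy simulation of this argument, under the idealized assumption that each call takes 1 second regardless of concurrency (`time.sleep` releases the GIL, so the threaded calls really do overlap):

```python
import time
from multiprocessing.pool import ThreadPool

def fake_fs_call(request):
    # Stand-in for one filesystem round trip.
    time.sleep(1.0)
    return request

requests = ['R1', 'R2', 'R3', 'R4']

start = time.time()
for r in requests:
    fake_fs_call(r)
print('serial:   %.1fs' % (time.time() - start))  # ~4s

start = time.time()
with ThreadPool(4) as pool:
    pool.map(fake_fs_call, requests)
print('parallel: %.1fs' % (time.time() - start))  # ~1s
```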
While loops should be faster than recursion, I doubt that the speedup comes from that. The main difference should be that the old version didn't check whether it had already visited a node. In that case, I would prefer to keep the code of the multi-threaded and the single-threaded versions as similar as possible.
I'll rework that code block a little.
```python
# make sure all inputs are updated
visited_set = set()
visited_list = []
queue = list(reversed(nodes))
```
Why do you reverse the nodes first?
Because I pop the latest entry (`queue.pop(-1)`) and I wanted to keep a similar order as before (although it might not matter; the order of `job._sis_inputs` is anyway random).
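For context, a rough sketch of how these pieces could fit together in the loop-based traversal (simplified; the real `for_all_nodes` does more, and I'm treating `job._sis_inputs` as directly yielding the predecessor nodes):

```python
visited_set = set()
visited_list = []
queue = list(reversed(nodes))
while queue:
    job = queue.pop(-1)
    if job in visited_set:
        continue
    visited_set.add(job)
    visited_list.append(job)
    # the order of _sis_inputs is effectively random, as noted above
    queue.extend(job._sis_inputs)
```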
No, I think this is not correct. We should separate bandwidth and latency.

For bandwidth, consider that you have some limit on what you can transfer per second, e.g. assume you can transfer 100MB/sec, and say each of A1-A4 reads a 100MB file. No matter how you do it, it will take 4 secs.

For latency, it's a bit trickier. Let's assume some low-bandwidth op which takes a single TCP packet for the request and also for the answer, e.g. a `stat` call. If you do it in parallel, the syscalls might happen in random order or maybe truly in parallel, but when the NFS client translates them into TCP packets for the corresponding requests and they get sent over Ethernet, they will be in some well-defined order again. Let's assume that order is [R1,R2,R3,R4], and they arrive like that at the NFS server. The NFS server now handles all those requests plus many other unrelated requests, and it is limited in how many requests it can handle per second. For simplicity, take the very extreme case that it can handle 1 request per second. So again, no matter how we do it, it will take approx. 4 secs to handle the requests, whether we issue them in parallel or sequentially.
I don't think so. But we should benchmark it. Btw, did you see my other post? I think there is some potential speedup when the FS is very slow if you use 2 threads, one for the CPU work (going through the graph) and another for the FS work. Then you can keep both the CPU and the FS busy at the same time. Even the current multithreading implementation does not really do that. In any case, this PR is only about improving the single-threaded case. I think we can also improve the multi-threaded case further, but maybe we should do that as a separate step and move this discussion elsewhere, to some issue or so?
Checking the visited nodes makes the huge difference, going from hours of runtime (basically exponential in the size of the graph) down to a few seconds and linear runtime. However, such a loop-queue-based implementation vs. a recursive implementation can also make a big difference, maybe 50-100% in my experience, especially for deep graphs (where a path could visit hundreds of nodes). At the same time, I also think a queue-based implementation is much more straightforward. So I would definitely keep that. Only if you can show in benchmarks that there is no measurable difference could we keep a recursive implementation.
I also wouldn't expect any speedup for large files, but most FS access by Sisyphus at this point is just asking the filesystem whether a file exists.
I'm not saying requesting it in parallel has to be faster, it might as well be slower in some cases, but the whole discussion started with:

> I doubt that the multi-threaded code can be faster.

so I gave you an example where it is faster. So can we agree that it can be faster?
My statement was probably a bit too broad; it will heavily depend on many factors, but there are many combinations where accessing the filesystem in parallel should be faster.
I'm considering refactoring the multi-threaded code as well and I'm still trying out a few things. The goal is to make the single- and multi-threaded cases as similar as possible.
True, but this can also be added with a few lines for the recursive case.
Interesting; in my experience loops are faster in C++, but I never saw a big difference in Python. I'll try to come up with a small benchmark later. From my side, the main argument for putting it in a loop is that we can't run into the maximum recursion depth limit, but I'm still testing something.
No, even in this example, I think the single-threaded code can be faster; this is what I tried to explain above. And my explanation did not even consider the big overhead you get from the multithreading, which is maybe another factor of 2. But anyway, there are many rough assumptions and estimations in either argument, so it would be good to actually measure things. I assume you can also simulate a slow FS in a systematic way, maybe via some FUSE filesystem (although that only makes sense if it reflects the real behavior of NFS).
Note that the current state before this PR was already very different. In fact, this PR already improves on that on a conceptual level. Also, note that my original use case was a big graph, and it was very much CPU bound (many …).
The function call overhead in Python is much larger than in C++, and a stack frame in Python is also much bigger than in C++. So in Python, this should make a much bigger difference than in C++.
I created a small benchmark:
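The benchmark script itself was not captured above; a minimal sketch of what it could look like, assuming a linear chain of nodes traversed once recursively and once with an explicit queue:

```python
import sys
import time

class Node:
    def __init__(self, inputs):
        self.inputs = inputs

def make_chain(depth):
    node = Node([])
    for _ in range(depth):
        node = Node([node])
    return node

def visit_recursive(node, visited):
    if node in visited:
        return
    visited.add(node)
    for inp in node.inputs:
        visit_recursive(inp, visited)

def visit_loop(node):
    visited = set()
    queue = [node]
    while queue:
        n = queue.pop()
        if n in visited:
            continue
        visited.add(n)
        queue.extend(n.inputs)
    return visited

depth = 10000
root = make_chain(depth)
sys.setrecursionlimit(2 * depth)  # the recursive variant needs this

start = time.time()
visit_recursive(root, set())
print('recursive:', time.time() - start)

start = time.time()
visit_loop(root)
print('loop:', time.time() - start)
```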
Running on my laptop, loops are around 10-20% faster for a depth of 1000 jobs. Increasing the depth to 10000, loops outperform the recursion by a lot, 3 to 4 times as fast, and I had to increase the recursion limit. So yeah, we should use loops and also update the multi-threaded case...
I don't disagree that the single-threaded code can be faster; I just disagree with your statement that the multi-threaded code cannot be faster.
Yes, this was wrong, but I corrected that in my follow-up comments. Specifically for a slow FS, the variant using 2 threads can be faster: one doing the CPU work, the other doing the FS work. But I think more than one thread for the CPU work, or more than one thread for the FS work, will not improve things. Unfortunately, decoupling the CPU and FS work is not really possible with our current API; it's only possible when you ignore the output of the function.

Btw, another optimization for when we just want to iterate through all jobs: we can cache the list of jobs. This would save us all the CPU work. Even this loop-queue-based implementation takes some CPU work, and when there are many jobs, that still adds up.
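A hypothetical sketch of such a job-list cache (not existing Sisyphus code; `collect_all_jobs` stands in for the loop-based traversal):

```python
class JobListCache:
    def __init__(self, collect_all_jobs):
        # collect_all_jobs: e.g. the loop-queue-based traversal above
        self._collect_all_jobs = collect_all_jobs
        self._jobs = None

    def jobs(self):
        if self._jobs is None:
            self._jobs = self._collect_all_jobs()
        return self._jobs

    def invalidate(self):
        # must be called whenever jobs are added to or removed from the graph
        self._jobs = None
```

The tricky part would be reliably invalidating the cache everywhere the graph can change.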
We should meet for lunch at some point and continue this discussion 😄
Maybe we could just replace the `bottom_up` flag with a `top_down_and_break` flag. We can't break for the `bottom_up` approach anyway (since we have already added the children before we have the result that tells us whether to break).
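A sketch of what that flag could mean, assuming the convention that a falsy return value from the callback stops the descent at that node (hypothetical, not the current API):

```python
def for_all_nodes_top_down_and_break(nodes, f):
    visited = set()
    queue = list(reversed(nodes))
    while queue:
        job = queue.pop(-1)
        if job in visited:
            continue
        visited.add(job)
        if not f(job):
            # falsy return value: do not descend into this job's inputs
            continue
        queue.extend(job._sis_inputs)
```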
Yeah, there is absolutely room for improvement. As you mentioned, caching the whole job list can be tricky, so I would attempt this last. Updating all nodes and computing the current state only once at the beginning of the manager loop sounds like the most promising starting point to me. Currently some results are cached with a timeout, but triggering this centrally should be cleaner, faster, and maybe even use less memory.
To follow up on the single-thread vs. multi-thread discussion for everyone else (I already talked with @albertz offline), here is a short script to test the difference between accessing many files with a single-threaded and a multi-threaded approach:

```python
import sys
import os
import glob
import time
from multiprocessing.pool import ThreadPool

pool = ThreadPool(20)

def check_file(filename):
    # cheap FS metadata calls, similar to what Sisyphus does per job
    try:
        os.stat(filename)
        glob.glob(filename + '/*')
    except FileNotFoundError:
        pass

def single_thread():
    for i in sys.argv[2:]:
        check_file(i)

def multi_thread():
    pool.map(check_file, sys.argv[2:])

if sys.argv[1] == 'single':
    start = time.time()
    single_thread()
    print('Single:', time.time() - start)
elif sys.argv[1] == 'multi':
    start = time.time()
    multi_thread()
    print('Multi:', time.time() - start)
```

I ran each variant ten times (alternating between them) on a fairly fast network filesystem and gave it a large number of directories as input. This was of course just one benchmark, and I expect the results to vary drastically depending on the filesystem and the number of given files, but it shows that having a multi-threaded implementation can be useful.
This gives me a huge speedup on a big GMM pipeline: from hours of runtime (I let it run for about 30 minutes and then stopped, so I don't actually know how long it would take) down to 20 seconds of runtime. This is with `GRAPH_WORKER = 1`. The main problem was that the old code did not keep track of visited nodes in this case.

But also, when comparing this implementation now to the case of `GRAPH_WORKER > 1` (e.g. `GRAPH_WORKER = 16`), it's more than twice as fast. And this is expected: I don't see any reason why a multi-threaded implementation could make it any faster. So I would even suggest going further and completely removing the multi-threaded implementation.