
SISGraph for_all_nodes more efficient single-threaded #87

Merged (1 commit into master) on Jun 21, 2022

Conversation

@albertz (Member) commented Jun 20, 2022

This gives me a huge speedup for a big GMM pipeline: from hours of runtime (I let it run for about 30 mins and then stopped, so I don't know the exact number) down to 20 secs.

This is with GRAPH_WORKER = 1.

The main problem was that the previous implementation did not keep track of visited nodes in this case.

But also, comparing this implementation to the case of GRAPH_WORKER > 1 (e.g. GRAPH_WORKER = 16), the single-threaded version is now more than twice as fast. This is expected: I don't see any reason why a multi-threaded implementation could be faster:

  • Any computation in Python is actually single threaded due to the GIL.
  • There is a lot of overhead involved with the multithreading logic.
  • If the FS is the bottleneck, there is also no point in having it multithreaded.

So I would even suggest going further and removing the multithreaded implementation completely.
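
For illustration, here is a minimal, self-contained sketch of the idea behind this change, i.e. an iterative traversal with an explicit queue and a visited set (FakeJob and for_all_nodes_single_threaded are made-up names for this example, not the actual Sisyphus code; only _sis_inputs mirrors the real job attribute):

class FakeJob:
    # Stand-in for a Sisyphus job; only the _sis_inputs attribute matters here.
    def __init__(self, name, inputs=()):
        self.name = name
        self._sis_inputs = set(inputs)


def for_all_nodes_single_threaded(nodes, f):
    visited = set()
    queue = list(reversed(nodes))  # pop(-1) below, so reverse to keep a similar order as before
    while queue:
        job = queue.pop(-1)
        if id(job) in visited:
            continue  # already handled; this check avoids revisiting shared sub-graphs
        visited.add(id(job))
        f(job)
        queue.extend(job._sis_inputs)  # schedule the inputs of this job


# Diamond-shaped graph: d is reachable via both b and c, but f(d) runs only once.
d = FakeJob("d")
b, c = FakeJob("b", [d]), FakeJob("c", [d])
a = FakeJob("a", [b, c])
for_all_nodes_single_threaded([a], lambda job: print(job.name))

Without the visited check, a shared sub-graph is traversed once per path that leads to it, which is what blows up the runtime on large graphs.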

@albertz requested review from critias and JackTemaki June 20, 2022 12:12
@JackTemaki (Contributor) commented Jun 20, 2022

We definitely cannot remove the multithreaded implementation: on the i6 cluster, Sisyphus can become nearly unusable without it. I am not sure why this causes such large speedups there, but it is a fact that they exist.

@JackTemaki (Contributor) commented Jun 20, 2022

And are you sure this PR is not altering the multi-threaded case as well? I get some slowdowns when running multi-threaded with this PR. Never mind, that was just a single fluctuation.

@JackTemaki (Contributor)

Sometimes I can get the same speed with the single-threaded implementation. The problem is that the timing really depends on the current FS usage, so testing this is rather difficult. But I will definitely stick to the multi-threaded approach.

@albertz (Member, Author) commented Jun 20, 2022

I doubt that the multi-threaded code can be faster. Can you share an example?

@critias (Contributor) commented Jun 20, 2022

Why shouldn't it be possible for the multithreaded code to be faster?
In my experience, the bottleneck was usually the filesystem and not the CPU (depending on your recipes, this might be different for you). If we are using only one thread, Python will wait for each function which accesses the filesystem, e.g. os.path.exists(path), until it returns. The GIL is released during this wait, but since we are using only one thread, this doesn't change anything.

Using multiple threads allows us to place multiple calls to the filesystem at the same time. It's possible that each call now takes a little longer since we are putting more stress on the filesystem, but in my experience it still loads faster overall.

@albertz Are you testing your setup on a local hard drive or a network drive?

@albertz (Member, Author) commented Jun 20, 2022

So I guess it's clear that if it is CPU-bound, the single-threaded code must be faster, right?

So about the FS-bound case: this depends on whether the FS can be faster with multiple threads, which depends on the FS. NFS specifically should not be faster because everything gets serialized through the network anyway.

@albertz (Member, Author) commented Jun 20, 2022

Actually, there is one way to make it faster when the FS is slow: have two threads, one which goes through the graph, collects all jobs, and puts them into a queue, and a separate one which does the FS calls. The current API of for_all_nodes does not really allow for that, though. Also, I doubt that this would really be faster than the single-threaded code. And the current multi-threaded code should be slower in any case.
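
A rough sketch of that two-thread idea (hypothetical helper names, not this PR and not the current multi-threaded implementation): one thread produces the paths to check while another thread blocks on the filesystem, so the GIL is released during FS waits and both kinds of work overlap.

import os
import queue
import threading

def collect_paths(paths, work_queue):
    # Producer: stand-in for the CPU-bound graph walk that collects paths to check.
    for path in paths:
        work_queue.put(path)
    work_queue.put(None)  # sentinel: no more work

def fs_worker(work_queue, results):
    # Consumer: only does filesystem calls; the GIL is released while waiting on the FS.
    while True:
        path = work_queue.get()
        if path is None:
            break
        results[path] = os.path.exists(path)

paths_to_check = ["/tmp", "/does/not/exist"]  # stand-in for job output paths
work_queue = queue.Queue()
results = {}
consumer = threading.Thread(target=fs_worker, args=(work_queue, results))
consumer.start()
collect_paths(paths_to_check, work_queue)
consumer.join()
print(results)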

@critias (Contributor) commented Jun 20, 2022

So I guess it's clear that if it is CPU-bound, the single-threaded code must be faster, right?

So about the FS-bound case: this depends on whether the FS can be faster with multiple threads, which depends on the FS. NFS specifically should not be faster because everything gets serialized through the network anyway.

I agree with your statement about the CPU-bound case, but I disagree with the statement about NFS.

  1. When I tested this implementation on a fairly slow NFS filesystem, I could see a speedup.
  2. Especially since everything has to be sent via the network, I expect a speedup. The network has a fairly high latency (compared to your local drive). I don't know the exact implementation of NFS, but so far I have no reason to doubt that multiple requests can be handled in parallel. Processing all requests to the filesystem serially instead of in parallel would be unworkably slow in many cases.

@albertz (Member, Author) commented Jun 20, 2022

I have no reason to doubt that multiple requests can be handled in parallel.

If it has to go through the network, and there is only one ethernet card, everything has to be serialized at one point. The ethernet card on the physical level is just one single stream of bits.

On the OS level, caching could potentially speed some things up (but not really for Python), but otherwise I don't see how multiple parallel NFS calls could be faster.

@JackTemaki (Contributor)

If it has to go through the network, and there is only one ethernet card, everything has to be serialized at one point. The ethernet card on the physical level is just one single stream of bits.

On the OS level, caching could potentially speed some things up (but not really for Python), but otherwise I don't see how multiple parallel NFS calls could be faster.

But it is not about NFS itself (which is just the protocol), but about the total latency of the machine hosting the FS. So sure, the network cannot parallelize, but you do not have to wait for the full latency before making the next request.

For this PR, all of this does not matter anyway. @albertz your change brings the single-threaded implementation into the same range (whether slightly better or worse) as the multi-threaded one, so we should accept it. The rest is user configuration, and we can find out over time which to use in which scenario.

@critias (Contributor) commented Jun 20, 2022

If it has to go through the network, and there is only one ethernet card, everything has to be serialized at one point. The ethernet card on the physical level is just one single stream of bits.

On the OS level, caching could potentially speed some things up (but not really for Python), but otherwise I don't see how multiple parallel NFS calls could be faster.

I think we are talking past each other. I'm not talking about serializing a request into bytes, but about handling the requests serially (would "linear" be a better term to use here?). Say the filesystem gets 4 requests [R1, R2, R3, R4] and the program is waiting for 4 answers [A1, A2, A3, A4]. A serial response would mean the filesystem only starts working on A2 after it has finished A1, and on A3 only after A2 is finished. A parallel response would mean the filesystem can work on A1, A2, A3, and A4 at the same time. I'm very sure the latter is the case for most filesystems relevant to our use case, including the standard NFS implementation on Linux.

So let's assume each call to the filesystem takes 1 second no matter how many calls are done in parallel:
R1 -> A1 represents sending request 1 from the program to the filesystem (in the case of NFS, to the server) and getting answer 1 back from the server to the program.
Done serially, it would be R1->A1 (1 second) + R2->A2 (1 second) + R3->A3 (1 second) + R4->A4 (1 second), for a total of 4 seconds.
Done in parallel, it would be R1, R2, R3, R4 -> A1, A2, A3, A4 (1 second), so the whole operation would take 1 second.

In practice, the filesystem will slow down with multiple requests, but it should still be faster than doing everything one after another.

@critias (Contributor) left a comment

While loops should be faster than recursion, I doubt that the speedup comes from that. The main difference should be that the old version didn't check whether it had already visited a node. In that case, I would prefer to keep the code of the multithreaded and the single-threaded version as similar as possible.
I'll rework that code block a little.

# make sure all inputs are updated
visited_set = set()
visited_list = []
queue = list(reversed(nodes))
Contributor

Why do you reverse the nodes first?

@albertz (Member, Author)

Because I pop the last entry (queue.pop(-1)) and I wanted to keep a similar order as before (although it might not matter; the order of job._sis_inputs is random anyway).

@albertz (Member, Author) commented Jun 20, 2022

If it has to go through the network, and there is only one ethernet card, everything has to be serialized at one point. The ethernet card on the physical level is just one single stream of bits.
On the OS level, caching could potentially speed some things up (but not really for Python), but otherwise I don't see how multiple parallel NFS calls could be faster.

I think we are talking past each other. I'm not talking about serializing a request into bytes, but about handling the requests serially (would "linear" be a better term to use here?). Say the filesystem gets 4 requests [R1, R2, R3, R4] and the program is waiting for 4 answers [A1, A2, A3, A4]. A serial response would mean the filesystem only starts working on A2 after it has finished A1, and on A3 only after A2 is finished. A parallel response would mean the filesystem can work on A1, A2, A3, and A4 at the same time. I'm very sure the latter is the case for most filesystems relevant to our use case, including the standard NFS implementation on Linux.

So let's assume each call to the filesystem takes 1 second no matter how many calls are done in parallel: R1 -> A1 represents sending request 1 from the program to the filesystem (in the case of NFS, to the server) and getting answer 1 back from the server to the program. Done serially, it would be R1->A1 (1 second) + R2->A2 (1 second) + R3->A3 (1 second) + R4->A4 (1 second), for a total of 4 seconds. Done in parallel, it would be R1, R2, R3, R4 -> A1, A2, A3, A4 (1 second), so the whole operation would take 1 second.

No, I think this is not correct.

We should separate bandwidth and latency.

For bandwidth, consider that you have some limit on what you can transfer per second, e.g. assume you can transfer 100MB/sec. Let's say each of A1-A4 reads a 100MB file. No matter how you do it, it will take 4 secs.

For latency, it's a bit trickier. Let's assume some low-bandwidth op which takes a single TCP packet for the request and also for the answer, e.g. some stat call. We should also specify where the latency comes from, because depending on that, you might be right or I might be right. I think the latency is mostly due to the NFS server being busy with lots of other requests.

If you do it in parallel, the syscalls might be in random order or maybe truly in parallel, but when the NFS client translates them into TCP packets for the corresponding requests and they get sent over ethernet, they will be in some well-defined order again. Let's assume that is [R1,R2,R3,R4], and they arrive like that on the NFS server. The NFS server now handles all those requests plus many other unrelated requests. It is limited in how many requests it can handle per sec. For simplicity, and as a very extreme case, let's assume it can handle 1 request per second. So again, it will take approx 4 secs to handle the requests, no matter if we do it in parallel or sequentially.

In practice, the filesystem will slow down with multiple requests, but it should still be faster than doing everything one after another.

I don't think so. But we should benchmark it. Btw, did you see my other post? I think there is some potential speedup when the FS is very slow, when you use 2 threads, one for the CPU work (going through the graph) and another for the FS work. Then you can keep both the CPU and the FS busy at the same time. Even the current multithreading implementation does not really do that.


In any case, this PR here is only about improving the single threaded case. I think we can also improve the multi-threaded case further but maybe we should do that as a separate step, and move this discussion elsewhere, to some issue or so?

@albertz (Member, Author) commented Jun 20, 2022

While loops should be faster than recursion, I doubt that the speedup comes from that. The main difference should be that the old version didn't check whether it had already visited a node. In that case, I would prefer to keep the code of the multithreaded and the single-threaded version as similar as possible. I'll rework that code block a little.

Checking the visited nodes makes the huge difference: going from hours of runtime (basically exponential in the size of the graph) down to a few seconds (linear runtime).

However, such a loop/queue-based implementation vs. a recursive implementation can also make a big difference, maybe 50-100% in my experience, especially for deep graphs (where a path could visit hundreds of nodes). At the same time, I also think a queue-based implementation is much more straightforward, so I would definitely keep it. Only if you can show in benchmarks that there is no measurable difference could we keep a recursive implementation.

@critias (Contributor) commented Jun 21, 2022

For bandwidth, consider that you have some limit on what you can transfer per second, e.g. assume you can transfer 100MB/sec. Let's say each of A1-A4 reads a 100MB file. No matter how you do it, it will take 4 secs.

I also wouldn't expect any speedup for large files, but most FS access by Sisyphus at this point is just asking the filesystem whether a file exists.

For simplicity, and as a very extreme case, let's assume it can handle 1 request per second. So again, it will take approx 4 secs to handle the requests, no matter if we do it in parallel or sequentially.

I'm not saying that requesting in parallel has to be faster; it might well be slower in some cases. But the whole discussion started with:

I doubt that the multi-threaded code can be faster. Can you share an example?

so I gave you an example where it is faster. So can we agree that it can be faster?

In practice, the filesystem will slow down with multiple requests, but it should still be faster than doing everything one after another.

I don't think so. But we should benchmark it.

My statement was probably a little too broad; it will heavily depend on many factors, but there are many combinations where accessing the filesystem in parallel should be faster.

Btw, did you see my other post? I think there is some potential speedup when the FS is very slow, when you use 2 threads, one for the CPU work (going through the graph) and another for the FS work. Then you can keep both the CPU and the FS busy at the same time. Even the current multithreading implementation does not really do that.
In any case, this PR here is only about improving the single threaded case. I think we can also improve the multi-threaded case further but maybe we should do that as a separate step, and move this discussion elsewhere, to some issue or so?

I'm considering refactoring the multithreaded code as well, and I'm still trying out a few things. The goal is to have the single-threaded and multithreaded cases be as similar as possible.

@critias (Contributor) commented Jun 21, 2022

Checking the visited nodes makes the huge difference: going from hours of runtime (basically exponential in the size of the graph) down to a few seconds (linear runtime).

True, but this can also be added with a few lines for the recursive case.

However, such a loop/queue-based implementation vs. a recursive implementation can also make a big difference, maybe 50-100% in my experience, especially for deep graphs (where a path could visit hundreds of nodes).

Interesting, in my experience loops are faster in C++, but I never saw a big difference in Python. I'll try to come up with a small benchmark later. From my side, the main argument for putting it in a loop is that we can't run into the maximum recursion depth limit, but I'm still testing something.

@albertz (Member, Author) commented Jun 21, 2022

I'm not saying that requesting in parallel has to be faster; it might well be slower in some cases. But the whole discussion started with:

I doubt that the multi-threaded code can be faster. Can you share an example?

so I gave you an example where it is faster. So can we agree that it can be faster?

No, even in this example, I think the single-threaded code can be faster. This is what I tried to explain above. And my explanation did not even consider the big overhead you get from the multithreading, which is maybe another factor of 2 slower.

But anyway, there are many rough assumptions and estimations in either argument. So it would be good to actually measure things. I assume you can also simulate a slow FS in a systematic way, maybe via some FUSE filesystem (although that only makes sense if it reflects the real behavior of NFS).

Btw, did you see my other post? I think there is some potential speedup when the FS is very slow, when you use 2 threads, one for the CPU work (going through the graph) and another for the FS work. Then you can keep both the CPU and the FS busy at the same time. Even the current multithreading implementation does not really do that.
In any case, this PR here is only about improving the single threaded case. I think we can also improve the multi-threaded case further but maybe we should do that as a separate step, and move this discussion elsewhere, to some issue or so?

I'm considering refactoring the multithreaded code as well, and I'm still trying out a few things. The goal is to have the single-threaded and multithreaded cases be as similar as possible.

Note that the current state before this PR was already very different. In fact, this PR already improves on that, on a conceptual level.

Also, note that my original use case was a big graph, and it was very much CPU-bound (many for_all_nodes calls don't do any FS access at all; they just set some attribute for all jobs or so). This was very slow before: basically exponentially slow in the single-threaded case, and still very slow in the multi-threaded case. This PR is about improving this case by a lot. I still see some more potential optimizations even for this CPU-bound case, but I wanted to do those in follow-up PRs, so I'm now waiting for this one to get merged.

@albertz (Member, Author) commented Jun 21, 2022

However, such a loop/queue-based implementation vs. a recursive implementation can also make a big difference, maybe 50-100% in my experience, especially for deep graphs (where a path could visit hundreds of nodes).

Interesting, in my experience loops are faster in C++, but I never saw a big difference in Python. I'll try to come up with a small benchmark later. From my side, the main argument for putting it in a loop is that we can't run into the maximum recursion depth limit, but I'm still testing something.

The function call overhead in Python is even much larger than in C++, and a stack frame in Python is also much bigger than in C++. So in Python, this should make a much bigger difference than in C++.

@critias (Contributor) commented Jun 21, 2022

However, such a loop/queue-based implementation vs. a recursive implementation can also make a big difference, maybe 50-100% in my experience, especially for deep graphs (where a path could visit hundreds of nodes).

Interesting, in my experience loops are faster in C++, but I never saw a big difference in Python. I'll try to come up with a small benchmark later. From my side, the main argument for putting it in a loop is that we can't run into the maximum recursion depth limit, but I'm still testing something.

I created a small benchmark:

import sys
import timeit

class Job:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

jobs = [Job(0)]

sys.setrecursionlimit(2000)  # rec() below recurses ~1000 deep, above the default limit of 1000

# Build a linear chain of 1000 jobs, each pointing to the previous one.
for i in range(1, 1000):
    jobs.append(Job(i, jobs[-1]))

def loop(start_job=jobs[-1]):
    queue = [start_job]
    while queue:
        job = queue.pop()
        if job.next:
            queue.append(job.next)
        else:
            assert job.value == 0

def rec(job=jobs[-1]):
    if job.next:
        rec(job.next)
    else:
        assert job.value == 0


timeit_count = 1000
print(timeit.timeit(loop, number=timeit_count))
print(timeit.timeit(rec, number=timeit_count))

Running this on my laptop, loops are around 10-20% faster for a depth of 1000 jobs. Increasing the depth to 10000, the loop outperforms the recursion by a lot, 3 to 4 times as fast, and I had to increase the recursion limit. So yeah, we should use loops and also update the multithreaded case...

@critias merged commit 36db648 into master Jun 21, 2022
@critias (Contributor) commented Jun 21, 2022

I doubt that the multi-threaded code can be faster. Can you share an example?

so I gave you an example where it is faster. So can we agree that it can be faster?

No, even in this example, I think the single-threaded code can be faster. This is what I tried to explain above. And my explanation did not even consider the big overhead you get from the multithreading, which is maybe another factor of 2 slower.

I don't disagree that the single-threaded code can be faster; I just disagree with your statement that the multi-threaded code cannot be faster.

@critias deleted the albert-for-all-nodes-single-thread-more-efficient branch June 21, 2022 07:42
@albertz (Member, Author) commented Jun 21, 2022

I just disagree with your statement that the multi-threaded code cannot be faster.

Yes, that was wrong, but I corrected it in my follow-up comments. Specifically, for a slow FS, a variant using 2 threads can be faster: one doing the CPU work, the other doing the FS work.

But I think more than one thread for the CPU work, or more than one thread for the FS work, will not improve things.

Unfortunately, decoupling the CPU and FS work is not really possible with our current API. It's only possible when you ignore the output of the function f, which is often the case anyway: the output can be ignored and we really just want to visit all nodes. Maybe we should add another flag for that?

Btw, another optimization for when we just want to iterate through all jobs: we can cache the list of jobs. This would save us all the CPU work. Even this loop/queue-based implementation takes some CPU work, and when there are many for_all_nodes calls, this can add up. E.g. set_job_targets took quite some time for me, although I only had around 500 jobs in total and maybe 100 targets, so it should have been extremely fast. (For set_job_targets, the caching has to be slightly more clever though, as the starting node can be different.) Also, I am not sure whether we even want the cache, or whether we want every single for_all_nodes call to run job._sis_runnable() for all jobs to potentially update inputs and add new jobs dynamically. But allowing this for every single for_all_nodes call also sounds very inefficient and unnecessary to me, because in practice I don't think you always want or need to allow for that. Maybe we should decouple this anyway, e.g. a separate function update_all_jobs which only does this?
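
A hypothetical sketch of the caching idea (update_all_jobs, for_all_cached_jobs, and the module-level cache are made-up names for illustration, not existing Sisyphus APIs; only _sis_inputs mirrors the real attribute):

# Hypothetical caching sketch; none of these names exist in Sisyphus.
_cached_job_list = None

def update_all_jobs(start_nodes):
    # Walk the graph once, e.g. at the start of the manager loop, and remember all jobs.
    global _cached_job_list
    jobs, visited, queue = [], set(), list(start_nodes)
    while queue:
        job = queue.pop()
        if id(job) in visited:
            continue
        visited.add(id(job))
        jobs.append(job)
        queue.extend(job._sis_inputs)
    _cached_job_list = jobs

def for_all_cached_jobs(f):
    # Apply f to every cached job without any graph traversal per call.
    for job in _cached_job_list:
        f(job)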

@critias (Contributor) commented Jun 21, 2022

I just disagree with your statement that the multi-threaded code cannot be faster.

Yes, that was wrong, but I corrected it in my follow-up comments. Specifically, for a slow FS, a variant using 2 threads can be faster: one doing the CPU work, the other doing the FS work.

But I think more than one thread for the CPU work, or more than one thread for the FS work, will not improve things.

We should meet for lunch at some point and continue this discussion 😄

Unfortunately, decoupling the CPU and FS work is not really possible with our current API. It's only possible when you ignore the output of the function f, which is often the case anyway: the output can be ignored and we really just want to visit all nodes. Maybe we should add another flag for that?

Maybe we could just replace the bottom_up flag with a top_down_and_break flag. We can't break for the bottom_up approach anyway (since we have already added the children before we have the result that tells us whether to break).

Btw, another optimization for when we just want to iterate through all jobs: we can cache the list of jobs. This would save us all the CPU work. Even this loop/queue-based implementation takes some CPU work, and when there are many for_all_nodes calls, this can add up. E.g. set_job_targets took quite some time for me, although I only had around 500 jobs in total and maybe 100 targets, so it should have been extremely fast. (For set_job_targets, the caching has to be slightly more clever though, as the starting node can be different.) Also, I am not sure whether we even want the cache, or whether we want every single for_all_nodes call to run job._sis_runnable() for all jobs to potentially update inputs and add new jobs dynamically. But allowing this for every single for_all_nodes call also sounds very inefficient and unnecessary to me, because in practice I don't think you always want or need to allow for that. Maybe we should decouple this anyway, e.g. a separate function update_all_jobs which only does this?

Yeah, there is absolutely room for improvement. As you mentioned, caching the whole job list can be tricky, so I would attempt that last. Updating all nodes and computing the current state only once at the beginning of the manager loop sounds like the most promising starting point to me. Currently some results are cached with a timeout, but triggering this centrally should be cleaner, faster, and maybe even use less memory.

@albertz mentioned this pull request Jun 21, 2022
@critias (Contributor) commented Jun 27, 2022

To follow up on the single-thread vs multi-thread discussion for everyone else (I already talked with @albertz offline), here is a short script to test the difference between accessing many files with a single-threaded and a multi-threaded approach:

import sys
import os
import glob
import time
from multiprocessing.pool import ThreadPool

pool = ThreadPool(20)


def check_file(filename):
    # Typical Sisyphus-style FS access: stat the path and list its contents.
    try:
        os.stat(filename)
        glob.glob(filename+'/*')
    except FileNotFoundError:
        pass


def single_thread():
    # Check all given paths one after another.
    for i in sys.argv[2:]:
        check_file(i)


def multi_thread():
    # Check all given paths via the thread pool (20 worker threads).
    pool.map(check_file, sys.argv[2:])


if sys.argv[1] == 'single':
    start = time.time()
    single_thread()
    print('Single:', time.time()-start)
elif sys.argv[1] == 'multi':
    start = time.time()
    multi_thread()
    print('Multi', time.time()-start)

I ran each variant ten times (alternating between them) on a fairly fast network filesystem and gave it a large number of directories as input:
My single-threaded results were between 2.3 and 10.8 seconds, with an average of 5.5 seconds.
My multi-threaded results were between 1.7 and 2.1 seconds, with an average of 1.9 seconds.

This was of course just one benchmark and I expect the results to vary drastically depending on the filesystem and the number of given files, but it shows that having a multi-threaded implementation can be useful.
