[L0] Asynchronous data fetching #711
base: main
Conversation
First portion of comments.
omniscidb/L0Mgr/L0Mgr.h
Outdated
@@ -56,6 +58,35 @@ class L0Kernel;
class L0CommandList;
class L0CommandQueue;

class L0DataFetcher {
Could you please add a usage comment? A short version for the description in the PR should do.
omniscidb/L0Mgr/L0Mgr.cpp
Outdated
L0_SAFE_CALL(zeCommandListAppendMemoryCopy(
    current_cl_bytes.first, dst, src, num_bytes, nullptr, 0, nullptr));
current_cl_bytes.second += num_bytes;
if (current_cl_bytes.second >= 128 * 1024 * 1024) {
Since this is a parameter, could you move it to a named constant in the class definition?
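For example (constant name hypothetical):

```cpp
class L0DataFetcher {
  // Batch size threshold for flushing the current command list
  // (previously an inline 128 * 1024 * 1024 literal).
  static constexpr size_t kMaxBatchBytes = 128 * 1024 * 1024;
};
```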
omniscidb/L0Mgr/L0Mgr.cpp
Outdated
    ZE_COMMAND_QUEUE_PRIORITY_NORMAL};
L0_SAFE_CALL(zeCommandQueueCreate(
    driver.ctx(), device_, &command_queue_fetch_desc, &queue_handle_));
current_cl_bytes = {{}, 0};
Please stick to the code style for class members' names. There are multiple cases, I'm not marking them all.
omniscidb/L0Mgr/L0Mgr.h
Outdated
@@ -68,6 +99,7 @@ class L0Device {
  std::shared_ptr<L0CommandQueue> command_queue_;

 public:
  L0DataFetcher data_fetcher;
Can the data_fetcher_ be an implementation detail? Based on the class API I don't expect us ever to use it without a device object anyway.
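E.g., something like this, with the fetcher private and a forwarding method (names hypothetical):

```cpp
class L0Device {
 public:
  // Expose fetching through the device API instead of the raw member.
  void appendFetch(void* dst, const void* src, size_t num_bytes) {
    data_fetcher_.appendCopy(dst, src, num_bytes);
  }

 private:
  L0DataFetcher data_fetcher_;
};
```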
omniscidb/L0Mgr/L0Mgr.cpp
Outdated
  L0_SAFE_CALL(zeCommandListReset(recycled.back()));
}
for (auto& dead_handle : graveyard) {
  L0_SAFE_CALL(zeCommandListDestroy(recycled.back()));
Should it be zeCommandListDestroy(dead_handle)?
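I.e., presumably:

```cpp
for (auto& dead_handle : graveyard) {
  L0_SAFE_CALL(zeCommandListDestroy(dead_handle));
}
```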
omniscidb/L0Mgr/L0Mgr.cpp
Outdated
} else {
  L0_SAFE_CALL(
      zeCommandListCreate(driver_.ctx(), device_, &cl_desc, &current_cl_bytes.first));
}
This logic could be wrapped into something like getRecycledOrNew...() for readability.
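A possible shape, reusing the names visible in the hunk above (all assumed from context):

```cpp
ze_command_list_handle_t L0DataFetcher::getRecycledOrNewCmdList() {
  if (!recycled.empty()) {
    ze_command_list_handle_t handle = recycled.back();
    recycled.pop_back();
    return handle;  // was reset when it entered the recycle pool
  }
  ze_command_list_handle_t handle{};
  L0_SAFE_CALL(zeCommandListCreate(driver_.ctx(), device_, &cl_desc, &handle));
  return handle;
}
```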
omniscidb/L0Mgr/L0Mgr.h
Outdated
@@ -56,6 +58,35 @@ class L0Kernel;
class L0CommandList;
class L0CommandQueue;

class L0DataFetcher {
#ifdef HAVE_L0
  static constexpr uint16_t GRAVEYARD_LIMIT{500};
I'd avoid any potential problems by making it size_t instead of saving a couple bytes.
There seems to be a build error on win btw: https://github.com/intel-ai/hdk/actions/runs/6852649791/job/18969702088?pr=711#step:10:1064
This PR introduces asynchronous (batched) data fetching for L0 GPUs. Its purpose is to reduce the end-to-end execution time of a workload.
Why?
We have recursive materializations (from disk to CPU to GPU). Once a CPU buffer is materialized, we begin a data transfer and wait for its completion, but why should we wait? We can proceed to materializing the next buffer and only need all of the data to be on the GPU right before kernel execution. This way we overlap memcpy for CPU buffers with the GPU data transfer and lose nothing. Additionally, we hide the latencies caused by the buffer manager, which are constant per fragment, so their impact grows linearly with the fragment count.
How?
We batch transfers into a command list until it reaches 128MB worth of data, then we actually execute the transfer. Right after sending all of the kernel parameters (that is, right before kernel execution) we wait on a barrier until the data transfers are finished. Once the transfers are done, we keep and recycle the command lists to avoid the overhead of creating new ones or destroying them, which is again a constant overhead that grows linearly with the fragment count. A minimal sketch of the scheme is shown below.
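The sketch below illustrates the batching scheme in Level Zero terms. The class and member names (BatchedFetcher, appendCopy, kMaxBatchBytes, etc.) are illustrative rather than the PR's actual API, error handling is elided, and a context/device/queue are assumed to exist:

```cpp
#include <level_zero/ze_api.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical batching fetcher; mirrors the description, not the PR verbatim.
class BatchedFetcher {
  static constexpr size_t kMaxBatchBytes = 128 * 1024 * 1024;  // flush threshold

 public:
  BatchedFetcher(ze_context_handle_t ctx, ze_device_handle_t dev,
                 ze_command_queue_handle_t queue)
      : ctx_(ctx), dev_(dev), queue_(queue) {}

  // Record a copy; execute the accumulated list once it holds >= 128MB.
  void appendCopy(void* dst, const void* src, size_t bytes) {
    if (!cur_) cur_ = getRecycledOrNew();
    zeCommandListAppendMemoryCopy(cur_, dst, src, bytes, nullptr, 0, nullptr);
    cur_bytes_ += bytes;
    if (cur_bytes_ >= kMaxBatchBytes) flush();
  }

  // Submit whatever is pending and block until the queue drains
  // (called right before kernel execution).
  void wait() {
    flush();
    zeCommandQueueSynchronize(queue_, UINT64_MAX);
    // Executed lists are reset and reused instead of being destroyed.
    for (auto cl : in_flight_) {
      zeCommandListReset(cl);
      recycled_.push_back(cl);
    }
    in_flight_.clear();
  }

 private:
  void flush() {
    if (!cur_) return;
    zeCommandListClose(cur_);
    zeCommandQueueExecuteCommandLists(queue_, 1, &cur_, nullptr);  // async
    in_flight_.push_back(cur_);
    cur_ = nullptr;
    cur_bytes_ = 0;
  }

  ze_command_list_handle_t getRecycledOrNew() {
    if (!recycled_.empty()) {
      auto cl = recycled_.back();
      recycled_.pop_back();
      return cl;
    }
    ze_command_list_desc_t desc{ZE_STRUCTURE_TYPE_COMMAND_LIST_DESC};
    ze_command_list_handle_t cl{};
    zeCommandListCreate(ctx_, dev_, &desc, &cl);
    return cl;
  }

  ze_context_handle_t ctx_;
  ze_device_handle_t dev_;
  ze_command_queue_handle_t queue_;
  ze_command_list_handle_t cur_{};
  size_t cur_bytes_{0};
  std::vector<ze_command_list_handle_t> in_flight_;
  std::vector<ze_command_list_handle_t> recycled_;
};
```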
Why this design?
L0 has something called an "immediate command list" that seemingly should do what we want; however, the documentation says that it "may be synchronous". Indeed, it displays synchronous behavior on PVC but asynchronous behavior on Arc GPUs. The proposed solution is asynchronous on both Arc and PVC. The 128MB granularity is arbitrary. In an isolated L0 benchmark, this design scaled well with fragment count and showed less overall overhead (measured with ze_tracer) than the current solution.
Multithreaded fetching
Since we may have many fragments (e.g., many cores, or we are in a heterogeneous mode), we will have more chunks to fetch, so why not perform CPU materializations in parallel and send chunks to the GPU asynchronously? Of course, we won't achieve perfect scaling due to non-data-transfer-related synchronization points (e.g., in the buffer manager), but the effect is still visible. This solution uses tbb::task_arena limitedArena(16); no noticeable benefit was observed beyond this thread count. See the sketch below.
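A minimal sketch of the arena-limited parallel fetch, reusing the BatchedFetcher sketch from above; Chunk, materializeOnCpu, and the mutex around the fetcher are hypothetical (the real code may synchronize differently):

```cpp
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>
#include <cstddef>
#include <mutex>
#include <vector>

struct Chunk {        // illustrative stand-in for a fetchable chunk
  void* device_dst;
  size_t num_bytes;
};

// Hypothetical: disk/CPU materialization returning a host buffer.
void* materializeOnCpu(const Chunk& chunk);

void fetchChunksParallel(BatchedFetcher& fetcher, std::vector<Chunk>& chunks) {
  std::mutex fetcher_mutex;           // assumes appendCopy is not thread-safe
  tbb::task_arena limited_arena(16);  // no benefit observed beyond 16 threads
  limited_arena.execute([&] {
    tbb::parallel_for(size_t(0), chunks.size(), [&](size_t i) {
      // CPU-side materialization runs fully in parallel...
      void* host_buf = materializeOnCpu(chunks[i]);
      // ...while the device copy is only recorded here; it executes
      // asynchronously once a 128MB batch fills up.
      std::lock_guard<std::mutex> guard(fetcher_mutex);
      fetcher.appendCopy(chunks[i].device_dst, host_buf, chunks[i].num_bytes);
    });
  });
  fetcher.wait();  // barrier right before kernel execution
}
```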
What about fetching data from GPU?
There is not much benefit in reorganizing data transfers from the GPU in an asynchronous fashion, since we do not expect to do as much in between transfers on the CPU side as we do while loading data to the GPU. Maybe someone will correct me.
Measurements
Taxi benchmark, 100 million rows. PVC + a 128-core CPU.
Fully on GPU, 256 fragments. Read values as speedup multipliers.
50% on GPU, 50% on CPU, 256 fragments.
Even for the default fragment size in GPU-only mode (30 million rows) we can see a speedup:
Fully on GPU, 4 fragments:
Of course, the benefit vanishes the less we have to do on the CPU between data transfers. E.g., for zero-copy columns the best-case speedup was 1.2x. Btw, is there something we could move to after fetchChunks() but before prepareKernelParams()?
What about CUDA devices?
It is possible; the upper bound is a 2x faster data transfer (i.e., pinned vs non-pinned). One needs to inform CUDA that the malloc'ed CPU buffers (e.g., at slab level) are pinned; we can use cuMemHostRegister(). But cuMemHostRegister() is itself linear to the data size (… vs <2 ms without registration), so overall we get the same time as in synchronous mode. That is, instead of waiting while CUDA uses intermediate page-locked buffers for transfers to GPU from CPU pageable buffers (SYNC case), we wait until it finishes updating its page tables (ASYNC case). cuMemHostRegister() at column chunk level is too expensive, and cuMemHostUnregister() calls are also linear to data size; in fact they have proven to be even slower than cuMemHostRegister(). A sketch of the registration approach follows.
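For illustration, a minimal sketch of the pinning approach using the CUDA driver API; the function, the slab pointer, and the sizes are placeholders, and error handling is omitted:

```cpp
#include <cuda.h>
#include <cstddef>

// Sketch: register an existing malloc'ed slab so H2D copies from it can run
// truly asynchronously, without staging through CUDA's internal page-locked
// bounce buffers.
void asyncCopyFromSlab(CUdeviceptr dst, void* slab, size_t slab_bytes,
                       CUstream stream) {
  // Linear in slab_bytes: CUDA pins the pages and updates its page tables.
  // Per column chunk this was measured to be too expensive; per slab it
  // amortizes better.
  cuMemHostRegister(slab, slab_bytes, 0);

  // An async copy from this now-pinned host range does not stage through an
  // intermediate buffer.
  cuMemcpyHtoDAsync(dst, slab, slab_bytes, stream);
  cuStreamSynchronize(stream);

  // Also linear in data size, and measured to be even slower than
  // registration.
  cuMemHostUnregister(slab);
}
```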