
[L0] Asynchronous data fetching #711

Open · akroviakov wants to merge 3 commits into main from akroviak/gpu_async_fetch

Conversation


@akroviakov commented Oct 27, 2023

This PR introduces asynchronous (batched) data fetching for L0 GPUs. Its purpose is to reduce the end-to-end execution time of a workload.


Why?

We have recursive materializations (from disk to CPU to GPU). Once the CPU has its buffer materialized, we begin a data transfer and wait for its completion, but there is no need to wait: we can proceed to materializing the next buffer and only require that all of the data is on the GPU right before kernel execution. This way we overlap memcpy for CPU buffers with the GPU data transfer and lose nothing. Additionally, we hide the latencies caused by the buffer manager, which are constant per fragment and whose impact therefore grows linearly with the fragment count.


How?

We batch transfers into a command list until it reaches 128 MB worth of data and only then execute the transfer. Right after sending all of the kernel parameters (that is, right before kernel execution) we wait until the data transfers have finished (barrier). Once the transfers are done, we keep and recycle the command lists to avoid the overhead of creating and destroying them, which is again a constant per-list overhead that grows linearly with the fragment count.
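
For concreteness, here is a minimal sketch of this batching scheme over plain Level Zero, assuming a pre-created context, device, and copy queue. The names (`BatchedFetcher`, `kBatchBytes`, `acquireList()`, `flush()`, `waitForAll()`) are illustrative and not the PR's actual `L0DataFetcher` API; error handling (`L0_SAFE_CALL`) is omitted.

```cpp
#include <level_zero/ze_api.h>
#include <cstdint>
#include <vector>

// Sketch only: batches host->device copies into ~128 MB command lists,
// executes each full batch asynchronously, and recycles the lists.
class BatchedFetcher {
 public:
  BatchedFetcher(ze_context_handle_t ctx, ze_device_handle_t dev,
                 ze_command_queue_handle_t queue)
      : ctx_(ctx), dev_(dev), queue_(queue) {}

  // Record a copy; submit the current command list once it holds >= 128 MB.
  void appendCopy(void* dst, const void* src, size_t bytes) {
    if (!cur_list_) cur_list_ = acquireList();
    zeCommandListAppendMemoryCopy(cur_list_, dst, src, bytes,
                                  /*hSignalEvent=*/nullptr, 0, nullptr);
    cur_bytes_ += bytes;
    if (cur_bytes_ >= kBatchBytes) flush();
  }

  // Barrier right before kernel execution: wait for all batches, then keep
  // the command lists for reuse instead of destroying them.
  void waitForAll() {
    flush();  // submit the (possibly non-full) last batch
    zeCommandQueueSynchronize(queue_, UINT64_MAX);
    for (auto cl : in_flight_) {
      zeCommandListReset(cl);
      recycled_.push_back(cl);
    }
    in_flight_.clear();
  }

 private:
  static constexpr size_t kBatchBytes = 128 * 1024 * 1024;

  ze_command_list_handle_t acquireList() {
    if (!recycled_.empty()) {
      auto cl = recycled_.back();
      recycled_.pop_back();
      return cl;
    }
    ze_command_list_desc_t desc{ZE_STRUCTURE_TYPE_COMMAND_LIST_DESC};
    ze_command_list_handle_t cl{};
    zeCommandListCreate(ctx_, dev_, &desc, &cl);
    return cl;
  }

  void flush() {
    if (!cur_list_) return;
    zeCommandListClose(cur_list_);
    zeCommandQueueExecuteCommandLists(queue_, 1, &cur_list_, nullptr);
    in_flight_.push_back(cur_list_);
    cur_list_ = nullptr;
    cur_bytes_ = 0;
  }

  ze_context_handle_t ctx_;
  ze_device_handle_t dev_;
  ze_command_queue_handle_t queue_;
  ze_command_list_handle_t cur_list_{nullptr};
  size_t cur_bytes_{0};
  std::vector<ze_command_list_handle_t> recycled_;
  std::vector<ze_command_list_handle_t> in_flight_;
};
```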

Why this design?

L0 has something called an "immediate command list" that seemingly does what we want; however, the documentation says that it "may be synchronous". Indeed, on PVC it exhibits synchronous behavior, while on Arc GPUs it is asynchronous. The proposed solution is asynchronous on both Arc and PVC. The 128 MB granularity is arbitrary. In an isolated L0 benchmark, this design showed good scalability with the fragment count and overall less overhead (measured with ze_tracer) than the current solution.
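
For comparison, a sketch of the immediate command list path (not what this PR uses); variable and function names are illustrative:

```cpp
#include <level_zero/ze_api.h>
#include <cstdint>

// Immediate command lists submit work implicitly on append. Per the L0 docs
// quoted above, this "may be synchronous"; the PR observed synchronous
// behavior on PVC and asynchronous behavior on Arc.
void copyViaImmediateList(ze_context_handle_t ctx, ze_device_handle_t dev,
                          void* dst, const void* src, size_t bytes) {
  ze_command_queue_desc_t queue_desc{ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC};
  ze_command_list_handle_t imm_list{};
  zeCommandListCreateImmediate(ctx, dev, &queue_desc, &imm_list);
  zeCommandListAppendMemoryCopy(imm_list, dst, src, bytes, nullptr, 0, nullptr);
  zeCommandListHostSynchronize(imm_list, UINT64_MAX);  // requires Level Zero 1.6+
  zeCommandListDestroy(imm_list);
}
```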


Multithreaded fetching

Since we may have many fragments (e.g., many cores, or we are in heterogeneous mode), we will have more chunks to fetch, so why not perform CPU materializations in parallel and send chunks to the GPU asynchronously? Of course, we won't achieve perfect scaling due to synchronization points unrelated to data transfer (e.g., in the buffer manager), but the effect is still visible. This solution uses tbb::task_arena limitedArena(16); no noticeable benefit beyond this number was observed.
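
A minimal sketch of the bounded-concurrency fetching, assuming the per-chunk work (CPU materialization plus the asynchronous copy submission) is passed in as a callable; `fetchChunksInParallel` and `fetch_chunk` are hypothetical names, not functions from this PR:

```cpp
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>
#include <cstddef>
#include <functional>

// Run per-chunk CPU materialization + async GPU copy submission in a bounded
// TBB arena; fetch_chunk stands in for the real per-chunk work.
void fetchChunksInParallel(size_t num_chunks,
                           const std::function<void(size_t)>& fetch_chunk) {
  tbb::task_arena limitedArena(16);  // no noticeable benefit was observed beyond 16
  limitedArena.execute([&] {
    tbb::parallel_for(size_t{0}, num_chunks,
                      [&](size_t i) { fetch_chunk(i); });
  });
}
```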


What about fetching data from GPU?

There is not much benefit in reorganizing data transfers from the GPU in an asynchronous fashion, since we do not expect to do as much in between transfers on the CPU side as we do while loading data to the GPU. Maybe someone will correct me.


Measurements

Taxi benchmark, 100 million rows. PVC + 128-core CPU.


Fully on GPU, 256 fragments. Read the values as speedup multipliers.

| Setup | Q1 fetching | Q2 fetching | Q3 fetching | Q4 fetching | End-to-End |
|---|---|---|---|---|---|
| 1 thread | 2 | 1 | 1.56 | 1.1 | 1.1 |
| limitedArena(8) | 2 | 3.3 | 4.68 | 3.8 | 1.32 |
| limitedArena(16) | 1.25 | 5 | 6.15 | 5.9 | 1.42 |
| limitedArena(24) | 1.25 | 4.23 | 4.38 | 3.65 | 1.4 |

50% on GPU, 50% on CPU, 256 fragments.

| Setup | End-to-End |
|---|---|
| 1 thread | no changes |
| limitedArena(8) | 1.31 |
| limitedArena(16) | 1.34 |
| limitedArena(24) | 1.38 |

Even for the default fragment size in GPU-only mode (30 million) we can see a speedup:
Fully on GPU, 4 fragments:

| Setup | Q1 fetching | Q2 fetching | Q3 fetching | Q4 fetching | End-to-End |
|---|---|---|---|---|---|
| 1 thread | 1 | 1.35 | 1.25 | 1.1 | 1.11 |
| limitedArena(8) | 1.23 | 2.47 | 2.28 | 2.66 | 1.26 |

Of course, the benefit shrinks the less we have to do on the CPU between data transfers; e.g., for zero-copy columns the best-case speedup was 1.2x. By the way, is there something we could move to after fetchChunks() but before prepareKernelParams()?


What about CUDA devices?

It is possible; the upper bound is a 2x faster data transfer (i.e., pinned vs. non-pinned memory). One needs to inform CUDA that malloc'ed CPU buffers (e.g., at slab level) are pinned, which can be done with cuMemHostRegister(); a sketch of that path follows the list below. But:

  • The time CUDA needs to update its page tables almost matches the data transfer time (registering one CPU slab costs ~300 ms with cuMemHostRegister() vs. <2 ms without), so overall we get the same time as in synchronous mode. That is, instead of waiting while CUDA uses intermediate page-locked buffers for transfers from pageable CPU buffers to the GPU (SYNC case), we wait until it finishes updating its page tables (ASYNC case).
  • Both the SYNC and ASYNC cases are linear in data size, and the ASYNC one only makes sense if we get to the point of evictions (to leverage subsequently accelerated data transfers from "pinned" slabs), but then we are likely to suffer more from the evictions themselves anyway.
  • Additionally, not all CPU slabs may need to be registered (which requires more complex logic), and calling cuMemHostRegister() at column chunk level is too expensive.
  • Apart from that, if we crash, the mapping may persist and we will have problems unregistering those memory regions in order to register new ones on the next run.
  • Moreover, calls to cuMemHostUnregister() are also linear in data size and have in fact proven to be even slower than cuMemHostRegister().
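
For reference, a sketch of the registration path discussed above (CUDA driver API, not part of this PR); `copySlabAsync`, the slab allocation, and the stream handling are illustrative, and error checking is omitted:

```cpp
#include <cuda.h>
#include <cstdlib>

// Register a pageable CPU slab so the H2D copy can run asynchronously.
// The cuMemHostRegister()/cuMemHostUnregister() calls themselves are linear
// in data size, which is what erases the benefit described above.
void copySlabAsync(CUdeviceptr dev_dst, size_t slab_bytes, CUstream stream) {
  void* slab = std::malloc(slab_bytes);  // pageable CPU slab
  cuMemHostRegister(slab, slab_bytes, CU_MEMHOSTREGISTER_PORTABLE);
  cuMemcpyHtoDAsync(dev_dst, slab, slab_bytes, stream);
  cuStreamSynchronize(stream);
  cuMemHostUnregister(slab);
  std::free(slab);
}
```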

@akroviakov force-pushed the akroviak/gpu_async_fetch branch from 83c70f5 to c75a886 on October 27, 2023 14:08
@akroviakov force-pushed the akroviak/gpu_async_fetch branch from c80c3c8 to 3bc2d3e on November 13, 2023 16:12

@kurapov-peter left a comment


First portion of comments.

@@ -56,6 +58,35 @@ class L0Kernel;
class L0CommandList;
class L0CommandQueue;

class L0DataFetcher {

Could you please add a usage comment? A short version of the description in the PR should do.

L0_SAFE_CALL(zeCommandListAppendMemoryCopy(
current_cl_bytes.first, dst, src, num_bytes, nullptr, 0, nullptr));
current_cl_bytes.second += num_bytes;
if (current_cl_bytes.second >= 128 * 1024 * 1024) {

Since this is a parameter, could you move it to a named constant in the class definition?

ZE_COMMAND_QUEUE_PRIORITY_NORMAL};
L0_SAFE_CALL(zeCommandQueueCreate(
driver.ctx(), device_, &command_queue_fetch_desc, &queue_handle_));
current_cl_bytes = {{}, 0};

Please stick to the code style for class member names. There are multiple cases; I'm not marking them all.

@@ -68,6 +99,7 @@ class L0Device {
std::shared_ptr<L0CommandQueue> command_queue_;

public:
L0DataFetcher data_fetcher;

Can the data_fetcher_ be an implementation detail? Based on the class API I don't expect us ever to use it without a device object anyway.

L0_SAFE_CALL(zeCommandListReset(recycled.back()));
}
for (auto& dead_handle : graveyard) {
L0_SAFE_CALL(zeCommandListDestroy(recycled.back()));

Should it be zeCommandListDestroy(dead_handle)?

} else {
L0_SAFE_CALL(
zeCommandListCreate(driver_.ctx(), device_, &cl_desc, &current_cl_bytes.first));
}

This logic could be wrapped into something like getRecycledOrNew...() for readability.

@@ -56,6 +58,35 @@ class L0Kernel;
class L0CommandList;
class L0CommandQueue;

class L0DataFetcher {
#ifdef HAVE_L0
static constexpr uint16_t GRAVEYARD_LIMIT{500};

I'd avoid any potential problems by making it size_t instead of saving a couple bytes.


@akroviakov force-pushed the akroviak/gpu_async_fetch branch from 3bc2d3e to c00af13 on November 24, 2023 10:39