
[L0] Asynchronous data fetching #711

Open · akroviakov wants to merge 3 commits into main from akroviak/gpu_async_fetch

Conversation


@akroviakov commented Oct 27, 2023

This PR introduces asynchronous (batched) data fetching for L0 GPUs. Its purpose is to reduce the end-to-end execution time of a workload.


Why?

We have recursive materializations (from disk to CPU to GPU). Once the CPU has its buffer materialized, we begin a data transfer and wait for its completion, but there is no need to wait: we can proceed to materializing the next buffer and only require that all of the data is on the GPU right before kernel execution. This way we overlap memcpy for CPU buffers with the GPU data transfer and lose nothing. Additionally, we hide the latencies caused by the buffer manager, which are constant per fragment and whose impact therefore grows linearly with the fragment count.


How?

We batch transfers into a command list until it reaches 128 MB worth of data and only then execute the transfer. Right after sending all of the kernel parameters (that is, right before kernel execution) we wait until the data transfers have finished (barrier). Once the transfers are done, we keep and recycle the command lists to avoid the overhead of creating and destroying them, which is again a constant per-list overhead that grows linearly with the fragment count.
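
For concreteness, here is a minimal sketch of this batching scheme over plain Level Zero, assuming a pre-created context, device, and copy queue. The names (`BatchedFetcher`, `kBatchBytes`, `acquireList()`, `flush()`, `waitForAll()`) are illustrative and not the PR's actual `L0DataFetcher` API; error handling (`L0_SAFE_CALL`) is omitted.

```cpp
#include <level_zero/ze_api.h>
#include <cstdint>
#include <vector>

// Sketch only: batches host->device copies into ~128 MB command lists,
// executes each full batch asynchronously, and recycles the lists.
class BatchedFetcher {
 public:
  BatchedFetcher(ze_context_handle_t ctx, ze_device_handle_t dev,
                 ze_command_queue_handle_t queue)
      : ctx_(ctx), dev_(dev), queue_(queue) {}

  // Record a copy; submit the current command list once it holds >= 128 MB.
  void appendCopy(void* dst, const void* src, size_t bytes) {
    if (!cur_list_) cur_list_ = acquireList();
    zeCommandListAppendMemoryCopy(cur_list_, dst, src, bytes,
                                  /*hSignalEvent=*/nullptr, 0, nullptr);
    cur_bytes_ += bytes;
    if (cur_bytes_ >= kBatchBytes) flush();
  }

  // Barrier right before kernel execution: wait for all batches, then keep
  // the command lists for reuse instead of destroying them.
  void waitForAll() {
    flush();  // submit the (possibly non-full) last batch
    zeCommandQueueSynchronize(queue_, UINT64_MAX);
    for (auto cl : in_flight_) {
      zeCommandListReset(cl);
      recycled_.push_back(cl);
    }
    in_flight_.clear();
  }

 private:
  static constexpr size_t kBatchBytes = 128 * 1024 * 1024;

  ze_command_list_handle_t acquireList() {
    if (!recycled_.empty()) {
      auto cl = recycled_.back();
      recycled_.pop_back();
      return cl;
    }
    ze_command_list_desc_t desc{ZE_STRUCTURE_TYPE_COMMAND_LIST_DESC};
    ze_command_list_handle_t cl{};
    zeCommandListCreate(ctx_, dev_, &desc, &cl);
    return cl;
  }

  void flush() {
    if (!cur_list_) return;
    zeCommandListClose(cur_list_);
    zeCommandQueueExecuteCommandLists(queue_, 1, &cur_list_, nullptr);
    in_flight_.push_back(cur_list_);
    cur_list_ = nullptr;
    cur_bytes_ = 0;
  }

  ze_context_handle_t ctx_;
  ze_device_handle_t dev_;
  ze_command_queue_handle_t queue_;
  ze_command_list_handle_t cur_list_{nullptr};
  size_t cur_bytes_{0};
  std::vector<ze_command_list_handle_t> recycled_;
  std::vector<ze_command_list_handle_t> in_flight_;
};
```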

Why this design?

L0 has something called an "immediate command list" that seemingly does what we want; however, the documentation says that it "may be synchronous". Indeed, on PVC it exhibits synchronous behavior, while on Arc GPUs it is asynchronous. The proposed solution is asynchronous on both Arc and PVC. The 128 MB granularity is arbitrary. In an isolated L0 benchmark, this design showed good scalability with the fragment count and overall less overhead (measured with ze_tracer) than the current solution.
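
For comparison, a sketch of the immediate command list path (not what this PR uses); variable and function names are illustrative:

```cpp
#include <level_zero/ze_api.h>
#include <cstdint>

// Immediate command lists submit work implicitly on append. Per the L0 docs
// quoted above, this "may be synchronous"; the PR observed synchronous
// behavior on PVC and asynchronous behavior on Arc.
void copyViaImmediateList(ze_context_handle_t ctx, ze_device_handle_t dev,
                          void* dst, const void* src, size_t bytes) {
  ze_command_queue_desc_t queue_desc{ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC};
  ze_command_list_handle_t imm_list{};
  zeCommandListCreateImmediate(ctx, dev, &queue_desc, &imm_list);
  zeCommandListAppendMemoryCopy(imm_list, dst, src, bytes, nullptr, 0, nullptr);
  zeCommandListHostSynchronize(imm_list, UINT64_MAX);  // requires Level Zero 1.6+
  zeCommandListDestroy(imm_list);
}
```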


Multithreaded fetching

Since we may have many fragments (e.g., many cores, or we are in heterogeneous mode), we will have more chunks to fetch, so why not perform CPU materializations in parallel and send chunks to the GPU asynchronously? Of course, we won't achieve perfect scaling due to synchronization points unrelated to data transfer (e.g., in the buffer manager), but the effect is still visible. This solution uses tbb::task_arena limitedArena(16); no noticeable benefit beyond this number was observed.
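
A minimal sketch of the bounded-concurrency fetching, assuming the per-chunk work (CPU materialization plus the asynchronous copy submission) is passed in as a callable; `fetchChunksInParallel` and `fetch_chunk` are hypothetical names, not functions from this PR:

```cpp
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>
#include <cstddef>
#include <functional>

// Run per-chunk CPU materialization + async GPU copy submission in a bounded
// TBB arena; fetch_chunk stands in for the real per-chunk work.
void fetchChunksInParallel(size_t num_chunks,
                           const std::function<void(size_t)>& fetch_chunk) {
  tbb::task_arena limitedArena(16);  // no noticeable benefit was observed beyond 16
  limitedArena.execute([&] {
    tbb::parallel_for(size_t{0}, num_chunks,
                      [&](size_t i) { fetch_chunk(i); });
  });
}
```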


What about fetching data from GPU?

There is not much benefit in reorganizing data transfers from the GPU in an asynchronous fashion, since we do not expect to do as much in between transfers on the CPU side as we do while loading data to the GPU. Maybe someone will correct me.


Measurements

Taxi benchmark, 100 million rows. PVC + 128-core CPU.


Fully on GPU, 256 fragments. Read the values as speedup multipliers.

| Setup | Q1 fetching | Q2 fetching | Q3 fetching | Q4 fetching | End-to-End |
|---|---|---|---|---|---|
| 1 thread | 2 | 1 | 1.56 | 1.1 | 1.1 |
| limitedArena(8) | 2 | 3.3 | 4.68 | 3.8 | 1.32 |
| limitedArena(16) | 1.25 | 5 | 6.15 | 5.9 | 1.42 |
| limitedArena(24) | 1.25 | 4.23 | 4.38 | 3.65 | 1.4 |

50% on GPU, 50% on CPU, 256 fragments.

| Setup | End-to-End |
|---|---|
| 1 thread | no changes |
| limitedArena(8) | 1.31 |
| limitedArena(16) | 1.34 |
| limitedArena(24) | 1.38 |

Even for the default fragment size in GPU-only mode (30 million) we can see a speedup:
Fully on GPU, 4 fragments:

| Setup | Q1 fetching | Q2 fetching | Q3 fetching | Q4 fetching | End-to-End |
|---|---|---|---|---|---|
| 1 thread | 1 | 1.35 | 1.25 | 1.1 | 1.11 |
| limitedArena(8) | 1.23 | 2.47 | 2.28 | 2.66 | 1.26 |

Of course, the benefit shrinks the less we have to do on the CPU between data transfers; e.g., for zero-copy columns the best-case speedup was 1.2x. By the way, is there something we could move to after fetchChunks() but before prepareKernelParams()?


What about CUDA devices?

It is possible; the upper bound is a 2x faster data transfer (i.e., pinned vs. non-pinned memory). One needs to inform CUDA that malloc'ed CPU buffers (e.g., at slab level) are pinned, which can be done with cuMemHostRegister(); a sketch of that path follows the list below. But:

  • The time CUDA needs to update its page tables almost matches the data transfer time (registering one CPU slab costs ~300 ms with cuMemHostRegister() vs. <2 ms without), so overall we get the same time as in synchronous mode. That is, instead of waiting while CUDA uses intermediate page-locked buffers for transfers from pageable CPU buffers to the GPU (SYNC case), we wait until it finishes updating its page tables (ASYNC case).
  • Both the SYNC and ASYNC cases are linear in data size, and the ASYNC one only makes sense if we get to the point of evictions (to leverage subsequently accelerated data transfers from "pinned" slabs), but then we are likely to suffer more from the evictions themselves anyway.
  • Additionally, not all CPU slabs may need to be registered (which requires more complex logic), and calling cuMemHostRegister() at column chunk level is too expensive.
  • Apart from that, if we crash, the mapping may persist and we will have problems unregistering those memory regions in order to register new ones on the next run.
  • Moreover, calls to cuMemHostUnregister() are also linear in data size and have in fact proven to be even slower than cuMemHostRegister().
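
For reference, a sketch of the registration path discussed above (CUDA driver API, not part of this PR); `copySlabAsync`, the slab allocation, and the stream handling are illustrative, and error checking is omitted:

```cpp
#include <cuda.h>
#include <cstdlib>

// Register a pageable CPU slab so the H2D copy can run asynchronously.
// The cuMemHostRegister()/cuMemHostUnregister() calls themselves are linear
// in data size, which is what erases the benefit described above.
void copySlabAsync(CUdeviceptr dev_dst, size_t slab_bytes, CUstream stream) {
  void* slab = std::malloc(slab_bytes);  // pageable CPU slab
  cuMemHostRegister(slab, slab_bytes, CU_MEMHOSTREGISTER_PORTABLE);
  cuMemcpyHtoDAsync(dev_dst, slab, slab_bytes, stream);
  cuStreamSynchronize(stream);
  cuMemHostUnregister(slab);
  std::free(slab);
}
```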

@akroviakov force-pushed the akroviak/gpu_async_fetch branch from 83c70f5 to c75a886 on October 27, 2023 14:08
@akroviakov force-pushed the akroviak/gpu_async_fetch branch from c80c3c8 to 3bc2d3e on November 13, 2023 16:12

@kurapov-peter left a comment


First portion of comments.

@@ -56,6 +58,35 @@ class L0Kernel;
class L0CommandList;
class L0CommandQueue;

class L0DataFetcher {

Could you please add a usage comment? A short version of the description in the PR should do.

L0_SAFE_CALL(zeCommandListAppendMemoryCopy(
current_cl_bytes.first, dst, src, num_bytes, nullptr, 0, nullptr));
current_cl_bytes.second += num_bytes;
if (current_cl_bytes.second >= 128 * 1024 * 1024) {

Since this is a parameter, could you move it to a named constant in the class definition?

ZE_COMMAND_QUEUE_PRIORITY_NORMAL};
L0_SAFE_CALL(zeCommandQueueCreate(
driver.ctx(), device_, &command_queue_fetch_desc, &queue_handle_));
current_cl_bytes = {{}, 0};

Please stick to the code style for class member names. There are multiple cases; I'm not marking them all.

@@ -68,6 +99,7 @@ class L0Device {
std::shared_ptr<L0CommandQueue> command_queue_;

public:
L0DataFetcher data_fetcher;

Can the data_fetcher_ be an implementation detail? Based on the class API I don't expect us ever to use it without a device object anyway.

L0_SAFE_CALL(zeCommandListReset(recycled.back()));
}
for (auto& dead_handle : graveyard) {
L0_SAFE_CALL(zeCommandListDestroy(recycled.back()));

Should it be zeCommandListDestroy(dead_handle)?

} else {
L0_SAFE_CALL(
zeCommandListCreate(driver_.ctx(), device_, &cl_desc, &current_cl_bytes.first));
}

This logic could be wrapped into something like getRecycledOrNew...() for readability.

@@ -56,6 +58,35 @@ class L0Kernel;
class L0CommandList;
class L0CommandQueue;

class L0DataFetcher {
#ifdef HAVE_L0
static constexpr uint16_t GRAVEYARD_LIMIT{500};

I'd avoid any potential problems by making it size_t instead of saving a couple bytes.


@akroviakov force-pushed the akroviak/gpu_async_fetch branch from 3bc2d3e to c00af13 on November 24, 2023 10:39