The closest thing we can get in dpctl seems to be max_compute_units, but that does not do the job: online documentation suggests that the number of threads per compute unit depends on the hardware and is not necessarily related to the sub-group size.
While an exact value seems impossible to obtain for reasons related to hardware architecture, at least an upper bound sounds achievable?
I'm looking for such an upper bound to better estimate the amount of global memory cache required by some kernels that rely on caching to ensure a maximum cache hit rate during execution.
A reasonable bound would be (threads per compute unit) * (maximum work-group size). The latter is accessible in dpctl via dpctl.SyclDevice.max_work_group_size. The former can be bounded from above by 8 (see the architectural summary table in the Xe architecture reference).
Wouldn't it be (threads per compute unit) * (number of compute units)? The former is indeed 8, as the optimization guide shows, but the latter would be the number of compute units, given by dpctl.SyclDevice.max_compute_units, with a compute unit meaning an XVE on the Xe architecture. That seems to fit better with the thread counts given in the summary table.
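For what it's worth, here is a minimal sketch of that bound in Python. The helper name `thread_count_upper_bound` and the constant `ASSUMED_THREADS_PER_COMPUTE_UNIT` are made up for illustration and are not part of dpctl. The value 8 is only the figure quoted in this thread for the Xe architecture, so treat it as an assumption that may not hold on other hardware. The dpctl query is guarded so the snippet still runs where dpctl or a SYCL device is unavailable.

```python
# Assumption: 8 hardware threads per compute unit (XVE), the value quoted in
# this thread for the Xe architecture. Other hardware may differ.
ASSUMED_THREADS_PER_COMPUTE_UNIT = 8


def thread_count_upper_bound(
    max_compute_units,
    threads_per_compute_unit=ASSUMED_THREADS_PER_COMPUTE_UNIT,
):
    """Hypothetical helper: upper bound on concurrently resident hardware threads.

    Computes (threads per compute unit) * (number of compute units).
    """
    if max_compute_units < 1 or threads_per_compute_unit < 1:
        raise ValueError("both arguments must be positive")
    return max_compute_units * threads_per_compute_unit


try:
    import dpctl

    device = dpctl.SyclDevice()  # default-selected SYCL device
    bound = thread_count_upper_bound(device.max_compute_units)
    print(f"{device.name}: at most {bound} hardware threads (assumed bound)")
except Exception as exc:  # dpctl missing, or no SYCL device available
    print(f"dpctl device query skipped: {exc}")
```

Multiplying this bound by the per-thread cache footprint of a kernel would then give a rough upper estimate of the global memory cache it needs, under the same assumption.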
Are there plans to expose either max_threads_per_compute_units or max_thread_counts in future SYCL specs and/or in dpctl?
the latter is number of compute units is dpctl.SyclDevice.max_compute_units
I'm sorry, you are definitely right. I was all wrapped up in the notion that kernels are launched by work-groups, but that does not mean that several concurrent work-groups cannot be launched on different cores.