
Commit

wip, adding docs
charleskawczynski committed Mar 9, 2025
1 parent 52dee7a commit 9ba1195
Showing 2 changed files with 46 additions and 1 deletion.
5 changes: 4 additions & 1 deletion docs/make.jl
@@ -80,7 +80,10 @@ withenv("GKSwstype" => "nul") do
"Remapping" => "remapping.md",
"MatrixFields" => "matrix_fields.md",
"API" => "api.md",
"Developer docs" => ["Performance tips" => "performance_tips.md"],
"Developer docs" => [
"Performance tips" => "performance_tips.md"
"Shared memory design" => "shmem_design.md"
],
"Tutorials" => [
joinpath("tutorials", tutorial * ".md") for
tutorial in TUTORIALS
42 changes: 42 additions & 0 deletions docs/src/shmem_design.md
@@ -0,0 +1,42 @@
# Shared memory design

ClimaCore stencil operators support staggered (or collocated) finite difference
operations. For example, the `DivergenceF2C` operator takes an argument that
lives on the cell faces, and the resulting divergence lives on the cell
centers.
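
For orientation, here is a minimal sketch of how such an operator is applied
(the face-valued field `ᶠf` is a hypothetical placeholder; operators are
applied pointwise via broadcasting, following ClimaCore's operator API):

```julia
using ClimaCore: Operators

# Construct the operator once; it maps face-valued data to centers.
divf2c = Operators.DivergenceF2C()

# Applied via broadcasting: given a face-valued vector field ᶠf,
# the result lives on the cell centers.
# ᶜdiv = divf2c.(ᶠf)
```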

## Motivation

A naive and simplified implementation of this operation looks like
`div[i] = (f[i+1] - f[i]) / dz[i]`. On the GPU (or CPU), such a calculation
requires `f[i]` to be read from global memory twice: once to compute `div[i]`
and once to compute `div[i-1]`. Moreover, if `f` is a `Broadcasted` object
(`Broadcasted` objects behave like arrays and support indexing via `f[i]`),
then each read of `f[i]` may itself require several memory reads and/or
computations.
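
As a concrete, plain-Julia illustration (a sketch, not ClimaCore's
implementation), the naive loop below reads each interior `f[i]` twice across
iterations, once for `div[i]` and once for `div[i-1]`:

```julia
# f holds n + 1 face values; div and dz hold n center values.
function naive_div!(div, f, dz)
    for i in eachindex(div, dz)
        # f[i] is also read by iteration i - 1 (as its f[i + 1]).
        div[i] = (f[i + 1] - f[i]) / dz[i]
    end
    return div
end

# Example: 5 centers, 6 faces, uniform spacing dz = 0.2.
naive_div!(zeros(5), collect(0.0:0.2:1.0), fill(0.2, 5))
```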

Reading data from global memory is often the main bottleneck for
bandwidth-limited CUDA kernels. As such, we use shared memory ("shmem" for
short) to reduce the number of global memory reads (and redundant computation)
in our kernels.
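
To make the pattern concrete, here is an illustrative CUDA.jl kernel (a sketch
of the general shmem technique, not ClimaCore's implementation): each block
cooperatively stages its face values, plus one halo value, into shared memory,
so every face value is read from global memory once per block:

```julia
using CUDA

# Illustrative kernel: 256 threads per block; f has length(div) + 1
# face values, while div and dz have one value per cell center.
function shmem_div_kernel!(div, f, dz)
    n = length(div)
    tile = CuStaticSharedArray(Float64, 257)  # 256 values + 1 halo slot
    t = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + t
    if i <= n + 1
        tile[t] = f[i]                 # cooperative load of face values
    end
    if t == blockDim().x && i + 1 <= n + 1
        tile[t + 1] = f[i + 1]         # last thread loads the halo value
    end
    sync_threads()                     # shmem must be filled before use
    if i <= n
        # Both reads now hit shared memory rather than global memory.
        div[i] = (tile[t + 1] - tile[t]) / dz[i]
    end
    return nothing
end

# Hypothetical launch for n = 1024 cell centers (1025 faces):
# @cuda threads = 256 blocks = cld(1024, 256) shmem_div_kernel!(div, f, dz)
```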

## High-level design

The high-level view of the design is as follows (a conceptual sketch is given
after this list):

- The `bc::StencilBroadcasted` type has a `work` field, which is used to store
  shared memory for the `bc.op` operator. The element type of `work` (or of
  each part of `work`, if there are multiple parts) is the type returned by
  `bc.op`'s `Operator.return_eltype`.
- Recursively reconstruct the broadcasted object, allocating shared memory
  along the way for each `StencilBroadcasted` whose operator supports it
  (different operators require different arguments, and therefore different
  types and amounts of shared memory).
- Recursively fill the shared memory for all `StencilBroadcasted` objects.
  This is done by reading the argument data via `getidx`.
- The destination field is filled with the result of `getidx` (as it is
  without shmem), except that `getidx` is overloaded (for supported
  `StencilBroadcasted` types) to retrieve the result via
  `fd_operator_evaluate`, which reads from shmem instead of global memory.
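
Below is a minimal, self-contained toy model of the reconstruction step. The
struct layout and helper names (`fd_shmem_supported`, `allocate_shmem`) are
illustrative placeholders, not ClimaCore's actual internals:

```julia
# Toy stand-in for ClimaCore's StencilBroadcasted (illustrative only).
struct StencilBroadcasted{Op, Args, Work}
    op::Op
    args::Args
    work::Work
end

fd_shmem_supported(op) = false            # opt-in per operator type
allocate_shmem(op, args) = nothing        # would allocate typed shmem

# Step 1: recursively rebuild the broadcasted tree, attaching a `work`
# buffer to each StencilBroadcasted whose operator supports shmem.
function reconstruct_with_shmem(bc::StencilBroadcasted)
    args = map(reconstruct_with_shmem, bc.args)
    work = fd_shmem_supported(bc.op) ? allocate_shmem(bc.op, args) : nothing
    return StencilBroadcasted(bc.op, args, work)
end
reconstruct_with_shmem(obj) = obj         # leaves pass through unchanged

# Steps 2 and 3 (fill shmem via getidx, then have getidx dispatch to
# fd_operator_evaluate) follow the same recursive pattern.
```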



