To begin developing a custom backend with RAPIDS-Triton, we strongly recommend that you start from the `rapids-triton-template` repo, which provides a basic template for your backend code. If this is your first time developing a backend with RAPIDS-Triton, the easiest way to get started is to follow the Linear Example, which provides a detailed walkthrough of every step in the process of creating a backend, with example code. The rest of these usage docs provide general information on specific features you are likely to use when building your backend.
To provide logging messages in your backend, RAPIDS-Triton provides `log_info`, `log_warn`, `log_error`, and `log_debug`. During default Triton execution, all logging messages up to (but not including) debug level will be visible. These functions can be invoked in two ways and can optionally include file and line information. To add a logging message to your code, use one of the following invocations:
```cpp
#include <rapids_triton/triton/logging.hpp>

void logging_example() {
  rapids::log_info() << "This is a log message.";
  rapids::log_info("This is an equivalent invocation.");
  rapids::log_info(__FILE__, __LINE__) << "This one has file and line info.";
  rapids::log_info(__FILE__, __LINE__, "And so does this one.");
}
```
If you encounter an error condition at any point in your backend which cannot be otherwise handled, you should throw a `TritonException`. In most cases, this error will be gracefully handled and passed to the Triton server in a way that will not interfere with execution of other backends, models, or requests. `TritonException` objects are constructed with an error type and a message indicating what went wrong, as shown below:
```cpp
#include <rapids_triton/exceptions.hpp>

void error_example() {
  throw rapids::TritonException(rapids::Error::Internal, "Something bad happened!");
}
```
Available error types are listed below, with a short example of choosing one after the list:

- `Internal`: The most common error type, used when an unexpected condition arises which is not the result of bad user input (e.g. a CUDA error).
- `NotFound`: An error type returned when a named resource (e.g. a named CUDA IPC memory block) cannot be found.
- `InvalidArg`: An error type returned when the user has provided invalid input in a configuration file or request.
- `Unavailable`: An error returned when a resource exists but is currently unavailable.
- `Unsupported`: An error which indicates that a requested functionality is not implemented by this backend (e.g. GPU execution for a CPU-only backend).
- `AlreadyExists`: An error which indicates that a resource which is being created has already been created.
- `Unknown`: Used when the type of the error cannot be established. This type should be avoided wherever possible.
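As an illustration of selecting an error type, the hypothetical validation helper below rejects an out-of-range configuration value with `InvalidArg`; the function name, parameter, and valid range are invented for this sketch, and only the `TritonException` usage reflects the API shown above:

```cpp
#include <rapids_triton/exceptions.hpp>

// Hypothetical config validation helper: `threshold` and its valid range
// are invented for illustration.
void validate_threshold(float threshold) {
  if (threshold < 0.0f || threshold > 1.0f) {
    throw rapids::TritonException(
      rapids::Error::InvalidArg, "threshold must be in the range [0, 1]");
  }
}
```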
The `cuda_check` function is provided to help facilitate error handling of direct invocations of the CUDA API. If such an invocation fails, `cuda_check` will throw an appropriate `TritonException`:
```cpp
#include <cuda_runtime_api.h>

#include <rapids_triton/exceptions.hpp>

void cuda_check_example() {
  rapids::cuda_check(cudaSetDevice(0));
}
```
If a `TritonException` is thrown while a backend is being loaded, Triton's server logs will indicate the failure and include the error message. If a `TritonException` is thrown while a model is being loaded, Triton's server logs will display the error message in the loading logs for that model. If a `TritonException` is thrown during handling of a request, the client will receive an indication that the request failed along with the error message, but the model can continue to process other requests.
Most Triton backends support builds intended for CPU-only execution. While this is not required, RAPIDS-Triton includes a compile-time constant which can be useful for facilitating such builds:
```cpp
#include <rapids_triton/build_control.hpp>
#include <rapids_triton/triton/logging.hpp>

void do_a_gpu_thing() {
  if constexpr (rapids::IS_GPU_BUILD) {
    rapids::log_info("Executing on GPU...");
  } else {
    rapids::log_error("Can't do that! This is a CPU-only build.");
  }
}
```
You can also make use of the preprocessor identifier `TRITON_ENABLE_GPU` for conditional inclusion of headers:
```cpp
#ifdef TRITON_ENABLE_GPU
#include <gpu_stuff.h>
#endif
```
Sometimes, having a CUDA symbol available in a CPU-only build can avoid layers of indirection which would otherwise be required to allow for compilation of both GPU and CPU versions of particular code. RAPIDS-Triton includes a header with placeholders for CUDA symbols used internally by the library, which may also be useful for backends that implement CPU-only builds. Note that all placeholder symbols are namespaced within `triton::backend::rapids`, and that not all symbols from the CUDA runtime API are included, though additional symbols will be added over time. All placeholder symbols will be implemented in a way that is consistent with similar placeholders in the main Triton codebase. A typical usage is shown below:
```cpp
#ifdef TRITON_ENABLE_GPU
#include <cuda_runtime_api.h>
#else
#include <rapids_triton/cpu_only/cuda_runtime_replacement.hpp>
#endif

// E.g. cudaStream_t is now defined regardless of whether or not this is a
// CPU-only build.
```
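As a concrete (hypothetical) use of this pattern, the helper below keeps a single `cudaStream_t`-taking signature in both build variants and only touches the CUDA runtime in GPU builds; the function name and body are invented for illustration:

```cpp
#ifdef TRITON_ENABLE_GPU
#include <cuda_runtime_api.h>
#else
#include <rapids_triton/cpu_only/cuda_runtime_replacement.hpp>
#endif
#include <rapids_triton/build_control.hpp>

// Hypothetical helper: this signature compiles in both GPU and CPU-only
// builds because cudaStream_t is always defined, avoiding a separate
// CPU-only code path for callers.
void process_on_stream([[maybe_unused]] cudaStream_t stream) {
  if constexpr (rapids::IS_GPU_BUILD) {
    // Enqueue device work on `stream` here (GPU builds only).
  }
}
```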
Within a backend, it is often useful to process data in a way that is agnostic to whether the underlying memory is on the host or on device and whether that memory is owned by the backend or provided by Triton. For instance, a backend may receive input data from Triton on the host and conditionally transfer it to the GPU before processing. In this case, owned memory must be allocated on the GPU to store the data, but after that point, the backend will treat the data exactly the same as if Triton had provided it on device in the first place.
In order to handle such situations, RAPIDS-Triton provides the `Buffer` object. When the `Buffer` is non-owning, it provides a lightweight wrapper to the underlying memory. When it is owning, `Buffer` will handle any necessary deallocation (on host or device). These objects can also be extremely useful for passing data back and forth between host and device. The following examples show ways in which `Buffer` objects can be constructed and used:
```cpp
#include <utility>
#include <vector>

#include <rapids_triton/memory/buffer.hpp>  // rapids::Buffer
#include <rapids_triton/memory/types.hpp>   // rapids::HostMemory and rapids::DeviceMemory

void buffer_examples() {
  auto data = std::vector<int>{0, 1, 2, 3, 4};

  // This buffer is a lightweight wrapper around the data stored in the `data`
  // vector. Because this constructor takes an `int*` pointer as its first
  // argument, it is assumed that the lifecycle of the underlying memory is
  // separately managed.
  auto non_owning_host_buffer = rapids::Buffer<int>(data.data(), rapids::HostMemory);

  // This buffer owns its own memory on the host, with space for 5 ints. When
  // it goes out of scope, the memory will be appropriately deallocated.
  auto owning_host_buffer = rapids::Buffer<int>(5, rapids::HostMemory);

  // This buffer is constructed as a copy of `non_owning_host_buffer`. Because
  // its requested memory type is `DeviceMemory`, the data will be copied to a
  // new (owned) GPU allocation. Device and stream can also be specified in the
  // constructor.
  auto owning_device_buffer = rapids::Buffer<int>(non_owning_host_buffer, rapids::DeviceMemory);

  // Once again, because this constructor takes an `int*` pointer, it will
  // simply be a lightweight wrapper around the memory that is actually managed
  // by `owning_device_buffer`. Here we have omitted the memory type argument,
  // since it defaults to `DeviceMemory`. This constructor can also accept
  // device and stream arguments, and care should be taken to ensure that the
  // right device is specified when the buffer does not allocate its own
  // memory.
  auto non_owning_device_buffer = rapids::Buffer<int>(owning_device_buffer.data());

  auto base_buffer1 = rapids::Buffer<int>(data.data(), rapids::HostMemory);
  // Because this buffer is on the host, just like the (moved-from) buffer it
  // is being constructed from, it remains non-owning.
  auto non_owning_moved_buffer = rapids::Buffer<int>(std::move(base_buffer1), rapids::HostMemory);

  auto base_buffer2 = rapids::Buffer<int>(data.data(), rapids::HostMemory);
  // Because this buffer is on the device, unlike the (moved-from) buffer it is
  // being constructed from, memory must be allocated on-device, and the new
  // buffer becomes owning.
  auto owning_moved_buffer = rapids::Buffer<int>(std::move(base_buffer2), rapids::DeviceMemory);
}
```
`Buffer` objects provide the following methods, a few of which are demonstrated in the sketch after this list:

- `data()`: Return a raw pointer to the buffer's data.
- `size()`: Return the number of elements contained by the buffer.
- `mem_type()`: Return the type of memory (`HostMemory` or `DeviceMemory`) contained by the buffer.
- `device()`: Return the id of the device on which this buffer resides (always 0 for host buffers).
- `stream()`: Return the CUDA stream associated with this buffer.
- `stream_synchronize()`: Perform a stream synchronization on this buffer's stream.
- `set_stream(cudaStream_t new_stream)`: Synchronize on the current stream and then switch the buffer to the new stream.
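The sketch below exercises a few of these methods; it is illustrative only, relying on the `Buffer` constructors shown earlier, and the buffer size is arbitrary:

```cpp
#include <rapids_triton/memory/buffer.hpp>
#include <rapids_triton/memory/types.hpp>
#include <rapids_triton/triton/logging.hpp>

void buffer_method_examples() {
  // An owning device buffer with (uninitialized) space for 16 floats.
  auto buffer = rapids::Buffer<float>(16, rapids::DeviceMemory);

  rapids::log_info() << "Buffer holds " << buffer.size()
                     << " elements on device " << buffer.device();

  // Block until all work enqueued on this buffer's stream has completed,
  // e.g. before the underlying data is read elsewhere.
  buffer.stream_synchronize();
}
```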
`Tensor` objects are wrappers around `Buffer`s with some additional metadata and functionality. All `Tensor` objects have a shape which can be retrieved as a `std::vector` using the `shape()` method. A reference to the underlying buffer can also be retrieved with the `buffer()` method.
`OutputTensor` objects are used to store data which will eventually be returned as part of Triton's response to a client request. Their `finalize` methods are used to actually marshal their underlying data into a response.
In general, `OutputTensor` objects should not be constructed directly but should instead be retrieved using the `get_output` method of a `Model` (described later).
Moving data around between host and device, or simply between buffers of the same type, can be one of the more error-prone tasks outside of actual model execution in a backend. To help make this process easier, RAPIDS-Triton provides a number of overloads of the `rapids::copy` function, which provides a safe way to move data between buffers or tensors. Assuming the size attribute of the buffer or tensor has not been corrupted, `rapids::copy` should never result in segfaults or invalid memory access on device.
Additional overloads of `rapids::copy` exist, but we will describe the most common uses here. Note that you need not worry about where the underlying data is located (on host or device) when invoking `rapids::copy`; the function will take care of detecting and handling this. `Tensor` overloads are in `rapids_triton/tensor/tensor.hpp`, and `Buffer` overloads are in `rapids_triton/memory/buffer.hpp`.
If you wish to simply copy the entire contents of one buffer into another, or one tensor into another, `rapids::copy` can be invoked as follows:
```cpp
rapids::copy(destination_buffer, source_buffer);
rapids::copy(destination_tensor, source_tensor);
```
If the destination is too small to contain the data from the source, a `TritonException` will be thrown.
To distribute data from one tensor to many, the following overload is available:

```cpp
rapids::copy(iterator_to_first_destination, iterator_to_last_destination, source);
```
Note that the destination tensors can be of different sizes. If the destination buffers cannot contain all data from the source, a `TritonException` will be thrown. Destination tensors can also be a mixture of device and host tensors if desired.
To move data from part of one buffer to part of another, you can use another overload, as in the following example:

```cpp
rapids::copy(destination_buffer, source_buffer, 10, 3, 6);
```
The extra arguments here provide the offset from the beginning of the destination buffer to which data should be copied, the index of the first element to be copied from the source, and the index one past the final element to be copied from the source. All indices are zero-based, so this invocation copies source elements 3, 4, and 5 to destination elements 10, 11, and 12. If the destination buffer only had room for (e.g.) eleven elements, a `TritonException` would be thrown.
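Putting this together, here is a minimal self-contained sketch of the partial-copy overload; the buffer sizes are illustrative (and their contents uninitialized), chosen so the copy succeeds:

```cpp
#include <rapids_triton/memory/buffer.hpp>
#include <rapids_triton/memory/types.hpp>

void partial_copy_example() {
  // The source must reach index 6 and the destination must reach index 13
  // for this copy to succeed.
  auto source = rapids::Buffer<int>(6, rapids::HostMemory);
  auto destination = rapids::Buffer<int>(13, rapids::HostMemory);

  // Copy source elements [3, 6) into the destination starting at offset 10,
  // filling destination elements 10, 11, and 12.
  rapids::copy(destination, source, 10, 3, 6);
}
```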
For a thorough introduction to developing a RAPIDS-Triton `Model` for your backend, see the Linear Example repo. Here, we will just briefly summarize some of the useful methods of `Model` objects; a rough sketch of how they fit together follows the list.
- `get_input`: Used to retrieve an input tensor of a particular name from Triton.
- `get_output`: Used to retrieve an output tensor of a particular name from Triton.
- `get_config_param`: Used to retrieve a named parameter from the configuration file for this model.
- `get_device_id`: The device on which this model is deployed (0 for host deployments).
- `get_deployment_type`: One of `GPUDeployment` or `CPUDeployment`, depending on whether this model is configured to be deployed on device or host.
- `predict`: The method which performs actual inference on input data and stores the result to the output location.
- `load`: A method which can be overridden to load resources that will be used for the lifetime of the model.
- `unload`: A method used to unload any resources loaded in `load`, if necessary.
- `preferred_mem_type`, `preferred_mem_type_in`, and `preferred_mem_type_out`: The location (device or host) where input and output data should be stored. The latter two methods can be overridden if input and output data should be stored differently; otherwise, `preferred_mem_type` will be used for both.
- `get_stream`: A method which can be overridden to provide different streams for handling successive batches. Otherwise, the default stream associated with this model will be used.
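The sketch below shows a hypothetical `predict` implementation in the spirit of the Linear Example, copying a named input straight through to a named output. The header paths, base class, element type, tensor names, and the `Batch`-based signature are all assumptions modeled loosely on that example, so consult the repo for the exact types:

```cpp
// NOTE: header paths, the base class, and the predict/get_input/get_output
// signatures below are assumptions based loosely on the Linear Example;
// check that repo for the exact API.
#include <rapids_triton/batch/batch.hpp>
#include <rapids_triton/memory/buffer.hpp>
#include <rapids_triton/model/model.hpp>
#include <rapids_triton/model/shared_state.hpp>

struct ExampleModel : rapids::Model<rapids::SharedModelState> {
  void predict(rapids::Batch& batch) const {
    // Retrieve the named input and output tensors for this batch.
    // "input__0" and "output__0" are placeholder names.
    auto input = get_input<float>(batch, "input__0");
    auto output = get_output<float>(batch, "output__0");

    // An identity "model": copy input data straight to the output tensor.
    // rapids::copy handles host/device placement automatically.
    rapids::copy(output.buffer(), input.buffer());

    // Marshal the output data into Triton's response.
    output.finalize();
  }
};
```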
Multiple instances of a RAPIDS-Triton model may need to share some data between them (or may choose to do so for efficiency). `SharedState` objects facilitate this. For a thorough introduction to developing a RAPIDS-Triton `SharedState` for your backend, see the Linear Example repo. Just as with the `Model` objects which share a particular `SharedState` object, configuration parameters can be retrieved using `SharedState`'s `get_config_param` method. Otherwise, most additional functionality is defined by the backend implementation, including `load` and `unload` methods for any necessary loading/unloading of resources that will be used for the lifetime of the shared state.
Note that just one shared state is constructed by the server regardless of how many instances of a given model are created.
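A minimal sketch of a custom shared state might look like the following; the base class name, header path, and constructor signature are assumptions modeled loosely on the `rapids-triton-template` repo, so verify them against that repo or the Linear Example:

```cpp
// NOTE: the base class, header path, and constructor signature here are
// assumptions based loosely on the rapids-triton-template repo.
#include <memory>

#include <rapids_triton/model/shared_state.hpp>

struct ExampleSharedState : rapids::SharedModelState {
  ExampleSharedState(std::unique_ptr<common::TritonJson::Value>&& config)
    : rapids::SharedModelState{std::move(config)} {}

  // Load resources shared by all instances of this model (e.g. a lookup
  // table); invoked once, regardless of the instance count.
  void load() {}

  // Release anything acquired in load().
  void unload() {}
};
```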
For most device memory allocations, it is strongly recommended that you simply construct a `Buffer` of the correct size and type. However, if you absolutely cannot use a `Buffer` in a particular context, you are encouraged to allocate and deallocate device memory using RMM. Any memory managed in this way will make use of Triton's CUDA memory pool, which will be faster than performing individual allocations. It is strongly recommended that you not change the RMM device resource in your backend, since doing so will cause allocations to no longer make use of Triton's memory pool.
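If you do need a raw RMM allocation, a minimal sketch might use `rmm::device_buffer`, which draws from the current RMM device resource (and therefore from Triton's pool, provided you have not changed the resource); the byte count and stream here are illustrative:

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

void rmm_allocation_example(cudaStream_t stream) {
  // Allocate 1024 bytes of device memory through the current RMM device
  // resource; the memory is freed automatically when `buffer` goes out of
  // scope.
  auto buffer = rmm::device_buffer{1024, rmm::cuda_stream_view{stream}};
}
```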