Integrating SSE with NVIDIA Triton Inference Server using a Python backend and a Zephyr model. There is very little documentation on how to use NVIDIA Triton in streaming use cases (it is hard to find in the official docs), so this should be helpful for anyone who wants to deploy streaming with Triton.

nickaggarwal/nvidia-triton-llm-streaming


Deploy Zephyr Using Streaming (SSE) with NVIDIA Triton's Python Backend

In the world of machine learning and AI, real-time data processing and inference can significantly enhance the responsiveness and efficiency of applications. Server-Sent Events (SSE) offer a straightforward method to push updates to a client over HTTP, making them an ideal choice for real-time data streaming. When SSE is combined with NVIDIA Triton, an open-source inference server that supports multiple machine learning frameworks, developers can deploy and manage scalable AI applications efficiently. This guide walks you through integrating SSE with NVIDIA Triton using a Python backend.

What is SSE?

Server-Sent Events (SSE) is a standard describing how servers can initiate data transmission towards browser clients once an initial client connection has been established. It’s particularly useful for creating a one-way communication channel from the server to the client, such as for real-time notifications, live updates, and streaming data.
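To make the mechanics concrete, the sketch below shows how a client might consume an SSE stream in Python. This is a generic illustration, not a client shipped in this repo, and the URL is a placeholder: an SSE response is served with Content-Type: text/event-stream and delivers each message as a "data: ..." line followed by a blank line.

```python
# Minimal SSE consumer sketch (the URL below is a placeholder, not part of this repo).
import requests

def consume_sse(url: str):
    # stream=True keeps the HTTP connection open so events can arrive over time.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # SSE messages arrive as lines of the form "data: <payload>".
            if line and line.startswith("data:"):
                yield line[len("data:"):].strip()

if __name__ == "__main__":
    for event in consume_sse("http://localhost:8000/some-sse-endpoint"):
        print("received:", event)
```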

What is NVIDIA Triton?

NVIDIA Triton Inference Server provides a robust, scalable serving system for deploying machine learning models from any framework (TensorFlow, PyTorch, ONNX Runtime, etc.) on GPUs and CPUs in production environments. Triton simplifies deploying multiple models from different frameworks while optimizing GPU utilization and integrating with Kubernetes.
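Since this project streams tokens from Triton's Python backend, here is a rough sketch of what a decoupled (streaming-capable) model.py can look like. This is not the exact model shipped in this repo; it assumes the model's config.pbtxt sets model_transaction_policy { decoupled: true } and declares a string input named "prompt" and a string output named "text" (those names are assumptions).

```python
# Sketch of a decoupled Python-backend model.py for streaming (not the exact file
# from this repo). Assumes config.pbtxt declares:
#   model_transaction_policy { decoupled: true }
# plus a string input "prompt" and a string output "text" (names are assumptions).
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Real code would load the Zephyr weights here (e.g. an AWQ checkpoint).
        pass

    def execute(self, requests):
        for request in requests:
            prompt = pb_utils.get_input_tensor_by_name(request, "prompt").as_numpy()[0]
            sender = request.get_response_sender()

            # Stand-in for a token-by-token generation loop.
            for token in self.generate(prompt):
                out = pb_utils.Tensor("text", np.array([token.encode("utf-8")], dtype=np.object_))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out]))

            # Tell Triton this request will produce no further responses.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None; responses go through the response sender.
        return None

    def generate(self, prompt):
        # Placeholder generator; the real model would yield tokens from Zephyr.
        yield from ["Hello", " ", "world"]
```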

How to use the Model?

  • Create a folder named model_repo
  • Clone this repo inside model_repo
  • Run the Docker command below to start Triton:
docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ~/model_repo:/models nvcr.io/nvidia/tritonserver:23.11-py3 tritonserver --model-repository=/models --model-control-mode=explicit
  • Install the dependencies inside the container with docker exec:
pip install "autoawq==0.1.8"
pip install "torch==2.1.2"

Once the server is up and the dependencies are installed, you can load the model and stream a response as in the client sketch below.
