
[FEATURE] Enhance the AI connector framework to support 1) Async Prediction and 2) Prediction with Streaming Response #2484

Open
Zhangxunmt opened this issue May 30, 2024 · 2 comments

@Zhangxunmt (Collaborator) commented May 30, 2024

Is your feature request related to a problem?

This feature proposes two enhancements to the ml-commons connector framework.

  1. Currently the connector framework supports only one way to invoke remote models: real-time prediction through synchronous API calls. Real-time invocation cannot handle batch inference as proposed in [FEATURE] Support batch inference #1840; an important prerequisite for batch inference is asynchronous (offline) endpoint invocation. In addition to Bedrock, which already offers async model prediction, SageMaker provides an API to invoke endpoints in async mode (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointAsync.html). Unlike async HTTP client connections, async prediction usually requires the request payload and the response to be stored in a storage service such as S3 (a minimal invocation sketch follows this list).
  2. As another improvement, we should add a model invocation mode with streaming responses. Both SageMaker and OpenAI already support model prediction with streaming responses (https://platform.openai.com/docs/api-reference/streaming and https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html). We should integrate them into the connector as new Action Types.
  The benefits of model invocation with streaming responses can be summarized as follows:

  • Reduced latency: processing can begin as soon as data starts arriving, which greatly reduces the latency observed in the query assistant and similar features.

  • Improved efficiency and lower memory usage: large volumes of data can be handled without loading everything into memory at once, which reduces memory usage and improves service stability.

  • Improved user experience: real-time feedback or updates can be provided based on the incoming data.
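
To illustrate the shape of the async prediction flow described in item 1, here is a minimal sketch of invoking a SageMaker endpoint asynchronously with boto3. The endpoint name, bucket, and S3 paths are hypothetical placeholders, not part of this proposal; the connector would need to model the same request/response-in-S3 pattern as a new action type.

```python
import boto3

# Hypothetical names, used only for illustration.
ENDPOINT_NAME = "my-async-endpoint"
INPUT_S3_URI = "s3://my-bucket/async-inference/input/payload.json"

runtime = boto3.client("sagemaker-runtime")

# The request payload must already be uploaded to S3; InvokeEndpointAsync
# returns immediately with a reference to where the result will be written.
response = runtime.invoke_endpoint_async(
    EndpointName=ENDPOINT_NAME,
    InputLocation=INPUT_S3_URI,
    ContentType="application/json",
)

# The inference result is written to this S3 location once the endpoint
# finishes processing; callers poll S3 or subscribe to the endpoint's
# success/error notifications instead of blocking on the call.
print(response["OutputLocation"])
print(response["InferenceId"])
```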

What solution would you like?
Integrate the async Invoke Model APIs from SageMaker and Bedrock.
Integrate the streaming-response APIs from SageMaker and OpenAI, and make the implementation general enough to support other model-serving providers (a minimal streaming sketch follows below).
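
For reference, a minimal sketch of consuming a SageMaker streaming response via boto3; the OpenAI streaming API is analogous, delivering server-sent `data:` chunks over HTTP. The endpoint name and payload format here are hypothetical, and chunk framing is model/container specific.

```python
import json
import boto3

# Hypothetical endpoint name and payload, for illustration only.
ENDPOINT_NAME = "my-streaming-endpoint"
payload = {"inputs": "Explain streaming inference in one sentence."}

runtime = boto3.client("sagemaker-runtime")

# InvokeEndpointWithResponseStream returns an event stream; each PayloadPart
# carries a chunk of the response as soon as the model produces it, so the
# caller can start processing before the full response is available.
response = runtime.invoke_endpoint_with_response_stream(
    EndpointName=ENDPOINT_NAME,
    Body=json.dumps(payload),
    ContentType="application/json",
)

for event in response["Body"]:
    part = event.get("PayloadPart")
    if part:
        # Each chunk is raw bytes; downstream code would reassemble or
        # forward these incrementally rather than buffering the whole reply.
        print(part["Bytes"].decode("utf-8"), end="", flush=True)
```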

What alternatives have you considered?
The implementation should be done in a general way that can be easily extended to new model-serving platforms.

Do you have any additional context?

@austintlee (Collaborator) commented:
For streaming - opensearch-project/OpenSearch#13772

@zane-neo (Collaborator) commented Aug 6, 2024

@Zhangxunmt Any update on this issue?
