
[FEATURE] Enhance the AI connector framework to support 1) Async Prediction and 2) Prediction with Streaming Response #2484

Open
Zhangxunmt opened this issue May 30, 2024 · 2 comments

@Zhangxunmt (Collaborator) commented May 30, 2024

Is your feature request related to a problem?

This feature proposes two enhancements to the ml-commons connector framework.

  1. Currently the connector framework supports only one way to invoke remote models: real-time prediction through synchronous API calls. Real-time invocation cannot handle batch inference as proposed in [FEATURE] Support batch inference #1840; an important prerequisite for batch inference is asynchronous (offline) endpoint invocation. In addition to Bedrock, which already offers async model prediction, SageMaker provides an API to invoke endpoints in async mode (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointAsync.html). Unlike async HTTP client connections, async prediction usually requires the request payload and the response to be stored in a storage service such as S3 (a minimal invocation sketch follows this list).
  2. As another improvement, we should add a model invocation mode with streaming responses. Both SageMaker and OpenAI already support model prediction with streaming responses (https://platform.openai.com/docs/api-reference/streaming and https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html). We should integrate them into the connector as new Action Types.
  The benefits of model invocation with streaming responses can be summarized as follows:

  • Reduced latency: processing can begin as soon as data starts arriving, which greatly reduces the latency observed in the query assistant and similar features.

  • Improved efficiency and lower memory usage: large volumes of data can be handled without loading everything into memory at once, which reduces memory usage and improves service stability.

  • Improved user experience: real-time feedback or updates can be provided based on the incoming data.
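
To illustrate the shape of the async prediction flow described in item 1, here is a minimal sketch of invoking a SageMaker endpoint asynchronously with boto3. The endpoint name, bucket, and S3 paths are hypothetical placeholders, not part of this proposal; the connector would need to model the same request/response-in-S3 pattern as a new action type.

```python
import boto3

# Hypothetical names, used only for illustration.
ENDPOINT_NAME = "my-async-endpoint"
INPUT_S3_URI = "s3://my-bucket/async-inference/input/payload.json"

runtime = boto3.client("sagemaker-runtime")

# The request payload must already be uploaded to S3; InvokeEndpointAsync
# returns immediately with a reference to where the result will be written.
response = runtime.invoke_endpoint_async(
    EndpointName=ENDPOINT_NAME,
    InputLocation=INPUT_S3_URI,
    ContentType="application/json",
)

# The inference result is written to this S3 location once the endpoint
# finishes processing; callers poll S3 or subscribe to the endpoint's
# success/error notifications instead of blocking on the call.
print(response["OutputLocation"])
print(response["InferenceId"])
```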

What solution would you like?
Integrate the async Invoke Model APIs from SageMaker and Bedrock.
Integrate the streaming-response APIs from SageMaker and OpenAI, and make the implementation general enough to support other model-serving providers (a minimal streaming sketch follows below).
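
For reference, a minimal sketch of consuming a SageMaker streaming response via boto3; the OpenAI streaming API is analogous, delivering server-sent `data:` chunks over HTTP. The endpoint name and payload format here are hypothetical, and chunk framing is model/container specific.

```python
import json
import boto3

# Hypothetical endpoint name and payload, for illustration only.
ENDPOINT_NAME = "my-streaming-endpoint"
payload = {"inputs": "Explain streaming inference in one sentence."}

runtime = boto3.client("sagemaker-runtime")

# InvokeEndpointWithResponseStream returns an event stream; each PayloadPart
# carries a chunk of the response as soon as the model produces it, so the
# caller can start processing before the full response is available.
response = runtime.invoke_endpoint_with_response_stream(
    EndpointName=ENDPOINT_NAME,
    Body=json.dumps(payload),
    ContentType="application/json",
)

for event in response["Body"]:
    part = event.get("PayloadPart")
    if part:
        # Each chunk is raw bytes; downstream code would reassemble or
        # forward these incrementally rather than buffering the whole reply.
        print(part["Bytes"].decode("utf-8"), end="", flush=True)
```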

What alternatives have you considered?
The implementation should be done in a general way that can be easily extended to new model-serving platforms.

Do you have any additional context?

@austintlee (Collaborator) commented:
For streaming - opensearch-project/OpenSearch#13772

@zane-neo (Collaborator) commented Aug 6, 2024

@Zhangxunmt Any update on this issue?
