Is your feature request related to a problem?
Two enhancements are proposed in this feature to improve the ML Commons connector framework.
Currently, the connector framework only supports invoking remote models in real-time mode through API calls. This real-time invocation cannot handle batch inference as proposed in [FEATURE] Support batch inference #1840. An important prerequisite for batch inference is asynchronous offline endpoint invocation. In addition to Bedrock, which already offers async model prediction, SageMaker provides an API to invoke endpoints in async mode (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointAsync.html). Unlike async HTTP client connections, this kind of async prediction typically requires the request payloads and responses to be stored in a storage service such as S3.
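For reference, a minimal sketch of what such an async invocation could look like with the AWS SDK for Java 2.x is shown below; the endpoint name and S3 input location are hypothetical placeholders, and how this would actually be wired into the connector framework is still open.

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sagemakerruntime.SageMakerRuntimeClient;
import software.amazon.awssdk.services.sagemakerruntime.model.InvokeEndpointAsyncRequest;
import software.amazon.awssdk.services.sagemakerruntime.model.InvokeEndpointAsyncResponse;

public class AsyncInvokeSketch {
    public static void main(String[] args) {
        // Hypothetical endpoint name and S3 input location, for illustration only.
        String endpointName = "my-async-endpoint";
        String inputS3Uri = "s3://my-bucket/async-inference/input/payload.json";

        try (SageMakerRuntimeClient runtime = SageMakerRuntimeClient.builder()
                .region(Region.US_EAST_1)
                .build()) {

            // Submit the request: the payload is already staged in S3, and the call
            // returns immediately with the S3 location where the output will appear.
            InvokeEndpointAsyncResponse response = runtime.invokeEndpointAsync(
                    InvokeEndpointAsyncRequest.builder()
                            .endpointName(endpointName)
                            .inputLocation(inputS3Uri)
                            .contentType("application/json")
                            .build());

            System.out.println("Inference id: " + response.inferenceId());
            System.out.println("Output will be written to: " + response.outputLocation());
        }
    }
}
```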
The benefits of this asynchronous model invocation are manifold:
Non-blocking execution: ML Commons can continue performing other tasks after submitting an async prediction request, without waiting for the response.
Improved responsiveness: especially useful in web applications, or in any application where responsiveness is critical and results need to be persisted.
Concurrency: async endpoint invocation APIs are typically implemented with an internal queue (https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html), so they avoid the common real-time throttling problems discussed in [BUG] Neural search: 4xx error ingesting data with Sagemaker external model #2249. A sketch of fetching the queued result from the S3 output location follows this list.
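Because the output of an async invocation is only written to the S3 output location once the queued request has been processed, the caller (or ML Commons, for batch inference) later fetches the result from S3. A minimal sketch with the AWS SDK for Java 2.x follows; the bucket and key are hypothetical placeholders derived from the output location returned by the async call.

```java
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.NoSuchKeyException;

public class AsyncResultFetchSketch {
    public static void main(String[] args) {
        // Hypothetical output location returned by the async invocation.
        String bucket = "my-bucket";
        String key = "async-inference/output/response.json";

        try (S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build()) {
            try {
                ResponseBytes<GetObjectResponse> bytes = s3.getObjectAsBytes(
                        GetObjectRequest.builder().bucket(bucket).key(key).build());
                System.out.println("Model output: " + bytes.asUtf8String());
            } catch (NoSuchKeyException stillPending) {
                // The queued request has not finished yet; poll again later
                // (or rely on the success/error notifications instead).
                System.out.println("Result not ready yet");
            }
        }
    }
}
```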
The benefits of model invocation with streaming responses can be summarized as follows (a sketch of consuming such a stream appears after this list):
Reduced latency: processing can begin as soon as data starts arriving, which greatly reduces the latency observed in the query assistant and similar use cases.
Improved efficiency and reduced memory usage: large volumes of data can be handled without loading everything into memory at once, reducing memory usage and improving service stability.
Improved user experience: real-time feedback or updates can be provided as data arrives.
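As an illustration of the streaming case, here is a minimal sketch of consuming OpenAI's server-sent-events response with the JDK HTTP client; the model name and request payload are illustrative only, and SageMaker's InvokeEndpointWithResponseStream would be handled analogously.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StreamingInvokeSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative request body enabling OpenAI's streaming mode.
        String body = "{\"model\":\"gpt-4o-mini\",\"stream\":true,"
                + "\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/chat/completions"))
                .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // Consume the response line by line as server-sent events arrive,
        // instead of buffering the whole payload in memory.
        HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofLines())
                .body()
                .filter(line -> line.startsWith("data: ") && !line.contains("[DONE]"))
                .forEach(chunk -> System.out.println(chunk.substring("data: ".length())));
    }
}
```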
What solution would you like?
Integrate the async invoke model APIs from SageMaker and Bedrock.
Integrate the streaming response APIs from SageMaker and OpenAI, and keep the design general enough to support other providers.
What alternatives have you considered?
The implementation should be done in a general way that can be easily extended to new model serving platforms.
Do you have any additional context?