[RFC] Asynchronous Offline Batch Inference and Ingestion to OpenSearch #2891
Comments
@Zhangxunmt, I think it's important for us to continue supporting Ingest API compatibility. It's been in use since 2016, and we have major feature sets like neural search that depend on it. Today, users can use the Ingest API to define an ingest pipeline (PUT /_ingest/pipeline/my_pipeline) and then run a bulk streaming ingestion job like my-index/_bulk?pipeline=my-pipeline. The user should be able to execute this pipeline in batch mode with a command like my-index/_batch?pipeline=my-pipeline source="s3://..." temp_stage="s3://..." (optionally overriding defaults). Batch processing support for each ingest processor can be added incrementally.
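For context, here is a minimal sketch of the existing streaming flow described above, assuming a hypothetical pipeline named my-pipeline backed by the neural-search text_embedding processor; the model ID and document fields are placeholders, and the proposed _batch endpoint itself is hypothetical and not shown:

```json
PUT /_ingest/pipeline/my-pipeline
{
  "description": "Generate embeddings at ingest time (streaming mode)",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<your-model-id>",
        "field_map": { "text": "text_embedding" }
      }
    }
  ]
}

POST /my-index/_bulk?pipeline=my-pipeline
{ "index": { "_id": "1" } }
{ "text": "hello world" }
```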
As @dylan-tong-aws mentioned, should the API support the final ingestion of embeddings as well? Once embeddings are generated, OpenSearch should read them and ingest them into an index rather than asking the client to do it.
The current API supports that final ingestion, but it has to be triggered by the customer. The integration with Data Prepper will automate these async steps, including the final ingestion.
It would be nice to have a feature that ingests the final documents without requiring the user to trigger the Ingest API again. Also, there may be many users who do not use Data Prepper.
Agreed. Do you think using the Flow Framework/Fractal is a good approach to automate all of this? That was actually the initial plan.
With the polling mechanism to be introduced in #3218, once it detects that the inference task has completed, could it also read the output file and ingest the documents?
That's theoretically right, but the ingestion needs inputs such as the field map and the target index name, and likely some data transformation before ingestion. Data Prepper can do that with its existing pipelines, but we don't natively support all of that. We could do the most basic ingestion through the polling job, but that wouldn't be a complete solution.
I noticed there are two separate APIs. Why not combine them, if users provide all the parameters in a single request? Additionally, one feature that could be useful is allowing users to specify a source index/field instead of a file path. This way, the system could collect the data, create the file, perform batch inference, and ingest the results into another index. What are your thoughts?
The initial plan was to use the Fractal Flow Framework to stitch these two APIs together. With the current polling jobs in ML Commons, it might be worth exploring the option to merge them. However, there are two key challenges:
The reindexing use case is valid. Since the Reindex API already exists, integrating it with ML Commons' batch_predict functionality would be a strong proposal.
The reason for requesting a combined API is that I can't envision a scenario where a user would perform offline batch processing without ultimately ingesting the data into OpenSearch. In other words, the offline batch API seems to lack standalone value without the ingestion step.
Isn't POST /_plugins/_ml/_batch_ingestion a newly introduced API in ML Commons for ingestion? Additionally, the geospatial plugin also has an API for ingesting GeoJSON data: opensearch-project/geospatial#47
I believe this lack of flexibility exists whether we combine them or not, right? Apologies for the late feedback, especially since most of the feature has already been implemented. I just wanted to share some thoughts on potential future improvements for this feature.
Problem Statement
Remote model servers such as AWS SageMaker, Amazon Bedrock, OpenAI, and Cohere all support batch predict APIs, which allow users to submit a large number of requests in a file (for example, in S3) and receive the results asynchronously in an output file. Different platforms use different terminology: SageMaker calls it "batch transform" and Bedrock calls it "batch inference", but they are all the same in that they accept requests in batch mode and process them asynchronously. While some use cases require synchronous requests, there are many cases where requests do not need an immediate response, or where rate limits prevent executing a large number of queries quickly. For example, in this case a customer runs into throttling from the remote model and the data ingestion cannot finish.
In this RFC, "batch inference" is used as the terminology to represent the "batch" operations in all remote servers. The benefits of utilizing batch inference can be summaries but not limited to 1) Better cost efficiency: 50% cost discount compared to synchronous APIs (number may vary for different servers) and 2) Higher rate limits compared to the synchronous APIs.
For OpenSearch users, the most common use case of batch inference is ingesting embedding data into a k-NN index for vector search. However, the Ingest API in OpenSearch was designed for streaming data processing, so the data processing capabilities supported by our ingest processors are not optimized for batch ingestion and do not accept files as inputs. The typical data ingestion flow for ML use cases is presented in the following diagram.

This RFC focuses on two improvements: 1) supporting asynchronous offline batch inference through remote model servers, and 2) supporting batch ingestion of the inference results into OpenSearch.
Proposed Solution
Speed up a batch transform job through SageMaker
Phase 1: Add a new “batch_predict” action type in the AI Connector framework (released in OpenSearch 2.16)
Add a new action type, “batch_predict”, in the connector blueprint. This action type runs batch prediction using the connector in ML Commons. An example connector for SageMaker is given below. Different model platforms have different connector configurations, and we will define multiple ConnectorExecutors to easily extend batch prediction to other AI models.
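The original connector example is not reproduced here; the following is a rough sketch of what a SageMaker connector with a batch_predict action could look like. The credentials, region, URL, and request-body fields are placeholders based on SageMaker's CreateTransformJob parameters, not an authoritative blueprint.

```json
POST /_plugins/_ml/connectors/_create
{
  "name": "SageMaker batch transform connector (sketch)",
  "description": "Illustrative connector exposing a batch_predict action",
  "version": 1,
  "protocol": "aws_sigv4",
  "credential": {
    "access_key": "<aws-access-key>",
    "secret_key": "<aws-secret-key>"
  },
  "parameters": {
    "region": "us-east-1",
    "service_name": "sagemaker"
  },
  "actions": [
    {
      "action_type": "batch_predict",
      "method": "POST",
      "url": "https://api.sagemaker.us-east-1.amazonaws.com/CreateTransformJob",
      "headers": { "content-type": "application/json" },
      "request_body": "{ \"TransformJobName\": \"${parameters.TransformJobName}\", \"ModelName\": \"${parameters.ModelName}\", \"TransformInput\": ${parameters.TransformInput}, \"TransformOutput\": ${parameters.TransformOutput}, \"TransformResources\": ${parameters.TransformResources} }"
    }
  ]
}
```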
After the new connector action type is added, add a new API for batch inference jobs. To be consistent with the current “Predict” API in ML Commons, the new API is named batch-prediction-job. This API maps to the “create_transform_job” API in SageMaker and the “model-invocation-job” API in Bedrock. An example request for the new batch inference API is given below.
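This is a sketch of what the request could look like once a remote model is registered with the connector above. The endpoint path and the nesting under "parameters" are assumptions; the inner fields follow SageMaker's CreateTransformJob parameters.

```json
POST /_plugins/_ml/models/<model_id>/_batch_predict
{
  "parameters": {
    "TransformJobName": "my-batch-transform-job",
    "ModelName": "my-sagemaker-model",
    "TransformInput": {
      "ContentType": "application/json",
      "DataSource": {
        "S3DataSource": {
          "S3DataType": "S3Prefix",
          "S3Uri": "s3://my-bucket/batch-input/"
        }
      }
    },
    "TransformOutput": {
      "S3OutputPath": "s3://my-bucket/batch-output/"
    },
    "TransformResources": {
      "InstanceType": "ml.m5.xlarge",
      "InstanceCount": 1
    }
  }
}
```

Because the remote job runs asynchronously, the call returns a task ID rather than inline inference results.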
Phase 2: Add offline batch inference task management and a new batch ingestion API.
Add task management for the batch inference jobs, which maps to the “list-prediction-jobs”, “describe-prediction-job”, and “cancel-prediction-job” APIs from remote model servers.
Add a new API for offline batch ingestion jobs. This API reads the batch inference results from Phase 1 along with other input files, and bulk ingests the vectors into a k-NN index. It reads the vector data from the S3 file and organizes the bulk ingestion requests based on the field mapping. The bulk ingestion into the k-NN index is called asynchronously, so a task ID is returned.
Phase 3: Integrate the offline batch inference and ingestion engine with Data Prepper
Data Prepper is a server-side data collector capable of filtering, enriching, transforming, normalizing, and aggregating data for downstream analytics and visualization. Data Prepper lets users build custom pipelines to improve the operational view of applications. We can integrate the AI connector into Data Prepper so that customers can set up custom pipelines to run offline batch ingestion.
A sample Data Prepper config is presented below. Integration with Data Prepper can be implemented after Phases 1 and 2. This means users have the option to use Data Prepper to set up offline batch ingestion without directly calling any plugin APIs.
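The original sample config is not preserved in this copy. The sketch below only illustrates the intended shape of such a pipeline: the s3 source and opensearch sink are existing Data Prepper plugins (options abbreviated), while the ml_batch_inference processor is hypothetical.

```yaml
# Rough, simplified sketch; not a working configuration.
offline-batch-ingestion-pipeline:
  source:
    s3:
      # Read raw documents from S3 (plugin options abbreviated).
      aws:
        region: "us-east-1"
  processor:
    - ml_batch_inference:        # hypothetical processor wrapping the ML Commons batch_predict API
        model_id: "<remote-model-id>"
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: "my-knn-index"
```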
In Scope
Out of Scope
Real-World Example Using a Public ML Model Service
Given an input file for batch inference using the OpenAI embedding model "text-embedding-ada-002":
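The input file content is not included above; a representative JSONL input for the OpenAI Batch API (with made-up text values) could look like this, where each line is one embedding request:

```json
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": ["What is the weather like today?", "Sunny with a light breeze"]}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": ["How do I reset my password?", "Use the account settings page"]}}
```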
Use the connector below to kick off batch inference through the OpenAI Batch API.
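The connector blueprint itself is not reproduced here; the following is an illustrative sketch. The https://api.openai.com/v1/batches endpoint and its input_file_id, endpoint, and completion_window fields come from OpenAI's public Batch API, while the connector layout mirrors other ML Commons connector blueprints and should be treated as an assumption.

```json
POST /_plugins/_ml/connectors/_create
{
  "name": "OpenAI embedding batch connector (sketch)",
  "description": "Illustrative connector exposing a batch_predict action for the OpenAI Batch API",
  "version": 1,
  "protocol": "http",
  "credential": {
    "openAI_key": "<your-openai-api-key>"
  },
  "parameters": {
    "model": "text-embedding-ada-002"
  },
  "actions": [
    {
      "action_type": "batch_predict",
      "method": "POST",
      "url": "https://api.openai.com/v1/batches",
      "headers": {
        "Authorization": "Bearer ${credential.openAI_key}"
      },
      "request_body": "{ \"input_file_id\": \"${parameters.input_file_id}\", \"endpoint\": \"/v1/embeddings\", \"completion_window\": \"24h\" }"
    }
  ]
}
```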
Once you receive the task_id, you can use the Task API to check the remote job status in OpenAI.
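For example, using the existing ML Commons Get Task API (the response below is abbreviated and hypothetical; only the output file ID is taken from this example):

```json
GET /_plugins/_ml/tasks/<task_id>

{
  "task_type": "BATCH_PREDICTION",
  "state": "COMPLETED",
  "remote_job": {
    "id": "batch_<remote-job-id>",
    "status": "completed",
    "output_file_id": "file-Wux0Pk80dhkxi98Z5iKNjB4n"
  }
}
```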
After the job status shows "completed", you can check your output in the file ID "file-Wux0Pk80dhkxi98Z5iKNjB4n" (in this example). The file is in JSONL format, where each line represents a single inference result for a request in the input file.
The output file content is:
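The output content was not preserved in this copy of the RFC; a representative (heavily abbreviated) line, following OpenAI's Batch API output format with the embedding vectors truncated, would look roughly like:

```json
{"id": "batch_req_<id>", "custom_id": "request-1", "response": {"status_code": 200, "body": {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [0.0023, -0.0094, 0.0172]}, {"object": "embedding", "index": 1, "embedding": [0.0151, -0.0067, 0.0203]}], "model": "text-embedding-ada-002", "usage": {"prompt_tokens": 12, "total_tokens": 12}}}, "error": null}
```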
To ingest your embedding data and other fields into OpenSearch, you can use the batch ingestion API shown below. The field map defines the fields you want to ingest into your k-NN index. In the field map, the key is the field name, and the value is the JsonPath used to find your data in the source files. For example, source[1].$.body.input[1] means using the JsonPath $.body.input[1] to fetch the element body.input[1] from the second file in the "source" array of the batch request.
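A sketch of such a request is shown below. The endpoint matches the POST /_plugins/_ml/_batch_ingestion API mentioned earlier in this thread, but the exact parameter names (index_name, field_map, credential, data_source) are illustrative. Here source[0] is assumed to be the OpenAI output file and source[1] the original input file (its ID is a placeholder).

```json
POST /_plugins/_ml/_batch_ingestion
{
  "index_name": "my-knn-index",
  "field_map": {
    "question": "source[1].$.body.input[0]",
    "answer": "source[1].$.body.input[1]",
    "question_embedding": "source[0].$.response.body.data[0].embedding",
    "answer_embedding": "source[0].$.response.body.data[1].embedding"
  },
  "credential": {
    "openAI_key": "<your-openai-api-key>"
  },
  "data_source": {
    "type": "openAI",
    "source": ["file-Wux0Pk80dhkxi98Z5iKNjB4n", "file-<input-file-id>"]
  }
}
```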
Request for Comments
This offline batch engine is implemented in the machine learning space. It is designed to work with remote model batch APIs to facilitate batch inference for LLMs and to ingest model output data into an OpenSearch k-NN index for vector search. However, the batch ingestion API can also be used for general ingestion purposes when the inputs are stored in a file system. So please do not limit your thoughts to machine learning when you review this feature.
Please leave your suggestions and concerns on this RFC; your valuable input is appreciated.