
add ml processor for offline batch inference #5507

Open · wants to merge 1 commit into main
Conversation


@Zhangxunmt Zhangxunmt commented Mar 6, 2025

Description

This PR adds a new ml processor that interacts with the ml-commons plugin in OpenSearch for ML-related applications such as offline batch inference.

An example pipeline configuration:

```yaml
ml-batch-job-pipeline:
  source:
    s3:
      codec:
        ndjson:
      compression: none
      aws:
        region: "us-east-1"
      default_bucket_owner: <your aws account>
      scan:
        metadata_only: true
        buckets:
          - bucket:
              name: "offlinebatch"
              filter:
                include_prefix:
                  - bedrock-multisource/my_batch
                exclude_suffix:
                  - .out

  buffer:
    bounded_blocking:
      buffer_size: 2048 # max number of records the buffer accepts
      batch_size: 512 # max number of records the buffer drains after each read

  processor:
    - ml:
        host: "<your host url>"
        aws_sigv4: true
        action_type: "batch_predict"
        service_name: "bedrock"
        model_id: "<your model id>"
        output_path: "s3://offlinebatch/bedrock-multisource/output-multisource/"
        aws:
          region: "us-east-1"
        ml_when: /bucket == "offlinebatch"

  sink:
    - stdout:
```
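The scan filter in the pipeline above selects only the batch-input objects (matching `include_prefix`) and skips result files (matching `exclude_suffix`). A minimal Python sketch of that filter semantics, under the assumption that a key is kept when it matches any include prefix and no exclude suffix (function and sample keys are illustrative, not part of this PR):

```python
def matches_scan_filter(key, include_prefix, exclude_suffix):
    """Sketch of the S3 scan filter: keep keys that start with any
    include_prefix and do not end with any exclude_suffix."""
    if not any(key.startswith(p) for p in include_prefix):
        return False
    return not any(key.endswith(s) for s in exclude_suffix)

# Hypothetical object keys in the "offlinebatch" bucket
keys = [
    "bedrock-multisource/my_batch1.jsonl",       # input file -> kept
    "bedrock-multisource/my_batch1.jsonl.out",   # result file -> excluded
    "other-prefix/data.jsonl",                   # wrong prefix -> excluded
]
kept = [k for k in keys
        if matches_scan_filter(k, ["bedrock-multisource/my_batch"], [".out"])]
# kept == ["bedrock-multisource/my_batch1.jsonl"]
```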

```yaml
batch-ingest-pipeline:
  source:
    s3:
      codec:
        ndjson:
      compression: none
      aws:
        region: "us-east-1"
      default_bucket_owner: <your aws account>
      scan:
        scheduling:
          interval: PT6M
          count: 10
        buckets:
          - bucket:
              name: "offlinebatch"
              filter:
                include_prefix:
                  - bedrock-multisource/output-multisource/
                exclude_suffix:
                  - manifest.json.out

  processor:
    - copy_values:
        entries:
          - to_key: chapter
            from_key: /modelInput/inputText
          - to_key: chapter_embedding
            from_key: /modelOutput/embedding
    - delete_entries:
        with_keys: [modelInput, modelOutput, recordId, s3]

  sink:
    - opensearch:
        hosts: ["<your host url>"]
        aws_sigv4: true
        index: "my-nlp-index-bedrock"
        username: "username"
        password: "<your password>"
```
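The `copy_values` and `delete_entries` steps in the ingest pipeline reshape each batch-inference result record into a document ready for indexing. A sketch of that transformation on a hypothetical record (the `modelInput`/`modelOutput` field values are illustrative; actual shapes depend on the model used for the batch job):

```python
# Hypothetical Bedrock batch-inference output record (one ndjson line)
record = {
    "recordId": "0001",
    "modelInput": {"inputText": "Chapter one text"},
    "modelOutput": {"embedding": [0.12, -0.03, 0.87]},
}

# copy_values: copy nested fields to top-level keys
record["chapter"] = record["modelInput"]["inputText"]
record["chapter_embedding"] = record["modelOutput"]["embedding"]

# delete_entries: drop the raw batch-job fields before indexing
for key in ("modelInput", "modelOutput", "recordId", "s3"):
    record.pop(key, None)
```

After these steps only `chapter` and `chapter_embedding` remain, which is what the `my-nlp-index-bedrock` index receives.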

Issues Resolved

#5470
#5433

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
  • New functionality has javadoc added.
  • Commits are signed with a real name per the DCO.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
