Bart model converted ONNX inference #14222

Closed
ZiyueWangUoB opened this issue Nov 1, 2021 · 21 comments
ZiyueWangUoB commented Nov 1, 2021

Hi, I followed the instructions here (https://github.com/huggingface/transformers/blob/master/docs/source/serialization.rst) to convert the BART-LARGE-CNN model to ONNX using the transformers.onnx script. The model was exported fine and I can run inference.

However, the results of the inference, from 'last_hidden_state', appear to be logits (I think)? How can I parse this output for summarization purposes?

Here are screenshots of what I've done.

[Screenshot: MicrosoftTeams-image (5)]

This is the resulting output from those two states:
[Screenshot: MicrosoftTeams-image (6)]

github-actions bot commented Dec 1, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

lewtun (Member) commented Dec 8, 2021

Hey @ZiyueWangUoB, by default the transformers.onnx package exports models using the --features=default flag. This corresponds to exporting an AutoModel topology, but since you're interested in summarization, you'll want to use the seq2seq-lm feature, which exports an AutoModelForSeq2SeqLM topology.

This topology is not currently supported for BART, but will be once #14358 is merged.

This will allow you to run:

python -m transformers.onnx --model=facebook/bart-large-cnn --features=seq2seq-lm onnx/

which will produce an ONNX model whose outputs are logits instead of last_hidden_state and encoder_last_hidden_state. You will still have to implement your own algorithm for text generation (e.g. beam search), so you might be interested in checking out this example which does that.
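
For illustration, a bare-bones greedy-decoding loop (simpler than the beam search that example implements) might look roughly like the sketch below; the input/output names (input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, logits) and the decoder start token are assumptions to verify against your own export via session.get_inputs() / session.get_outputs():

import numpy as np
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
session = InferenceSession("onnx/model.onnx")

# Confirm the real input/output names of your export before feeding it
print([i.name for i in session.get_inputs()], [o.name for o in session.get_outputs()])

enc = tokenizer("A long article to summarize ...", return_tensors="np")
# BART starts decoding from its decoder_start_token_id (the eos token); batch size 1 here
decoder_input_ids = np.array([[tokenizer.eos_token_id]], dtype=np.int64)

for _ in range(142):  # arbitrary maximum summary length
    logits = session.run(
        ["logits"],
        {
            "input_ids": enc["input_ids"].astype(np.int64),
            "attention_mask": enc["attention_mask"].astype(np.int64),
            "decoder_input_ids": decoder_input_ids,
            "decoder_attention_mask": np.ones_like(decoder_input_ids),
        },
    )[0]
    # greedy step: take the most likely next token after the last decoder position
    next_token = logits[:, -1].argmax(-1).reshape(1, 1).astype(np.int64)
    decoder_input_ids = np.concatenate([decoder_input_ids, next_token], axis=-1)
    if next_token[0, 0] == tokenizer.eos_token_id:
        break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))

Beam search differs only in that it keeps several candidate decoder_input_ids sequences per step instead of just the argmax, which is what the linked example implements.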

FYI you can find the model's output names from the ONNX config, e.g.

from transformers import AutoConfig
from transformers.models.bart import BartOnnxConfig

model_ckpt = "facebook/bart-large-cnn"
config = AutoConfig.from_pretrained(model_ckpt)

# Build the ONNX config for the default task and inspect its output names
onnx_config = BartOnnxConfig(config, task="default")
onnx_config.outputs
# OrderedDict([('last_hidden_state', {0: 'batch', 1: 'sequence'}),
#              ('encoder_last_hidden_state', {0: 'batch', 1: 'sequence'})])

sorenmc commented Dec 14, 2021

If I wish to use a DistilBART model, could I use the linked example directly for beam search? Also, the linked issue #14358 has been merged, and I tried using the --features=seq2seq-lm flag but got the following error message:

ValueError: bart doesn't support feature seq2seq-lm. Supported values are: ['default']
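
For what it's worth, one way to check which export features a given transformers install registers for bart is the FeaturesManager helper in transformers.onnx; a sketch, assuming the helper available in releases from this period:

from transformers.onnx.features import FeaturesManager

# List the ONNX export features registered for the "bart" model type;
# on versions predating the seq2seq-lm support, only default-style entries appear
print(list(FeaturesManager.get_supported_features_for_model_type("bart").keys()))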

lewtun (Member) commented Dec 15, 2021

Hey @sorenmc, AFAIK the linked example should work with DistilBART, but please open a new issue if it doesn't.

Regarding #14358, we had to revert it to handle some issues in the tests. The new PR to track is #14700.

sorenmc commented Dec 21, 2021

For anyone in my position: I still have not tried this, but will give an update here when I have!


@mohanvamsibu-kore

Hello @lewtun, I am trying the same scenario, but the example guide URL for beam search (https://github.com/huggingface/transformers/tree/master/examples/onnx/pytorch/summarization) is returning a 404. Can you please post the latest URL?

@mohanvamsibu-kore

> For anyone in my position: I still have not tried this, but will give an update here when I have!

Hey @sorenmc, if you have tried this approach, can you please attach a code snippet here? It would be mighty helpful.

TonyMas commented Feb 9, 2022

@mohanvamsibu-kore

Hi @TonyMas, thank you. I have implemented summarization with the model "lidiya/bart-large-xsum-samsum". The ONNX model itself is extremely fast, but I see that beam_search is very slow and takes a major chunk of the time (~9 seconds) on CPU. I tried greedy search as well, which takes ~3-4 seconds. So:

  1. Is there a way to optimize beam_search?
  2. Can I run greedy_search on GPU? If yes, please let me know the steps.

@mohanvamsibu-kore

@TonyMas Can you please help me with the above concerns? I have also tried the example provided under https://github.com/huggingface/transformers/tree/master/examples/research_projects/onnx/summarization. It took ~10 seconds on GPU for an input of ~1000 characters. Please let me know if I can reduce the time.

jbesomi commented Feb 24, 2022

Hey @mohanvamsibu-kore, I am also interested in exporting lidiya/bart-large-xsum-samsum to ONNX. I would love to see your code and see how we can speed it up. Can you share it?

sorenmc commented Mar 3, 2022

> For anyone in my position: I still have not tried this, but will give an update here when I have!

> Hey @sorenmc, if you have tried this approach, can you please attach a code snippet here? It would be mighty helpful.

Sorry, I have been on vacation and sadly have not had the time.

jspablo commented Mar 9, 2022

I have been testing the Bart + Beam Search to ONNX example, but it seems that the attention_mask input is fixed to the sample input used when exporting the model. Setting it up like input_ids in the dynamic_axes fixes the issue.
The point is that testing the model with some texts returns pretty much the same tokens as the input text. Do you have the same experience? We really need this feature in optimum, any updates on this?
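
For reference, that fix amounts to giving attention_mask its own dynamic_axes entry when exporting; a minimal sketch with torch.onnx.export (the wrapper module, file name, and opset below are illustrative assumptions, not the example's actual code):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

class BartLogits(torch.nn.Module):
    # Thin wrapper so the exported graph returns a single logits tensor
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=False).logits

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn").eval()
sample = tokenizer("A short sample used only for tracing.", return_tensors="pt")

torch.onnx.export(
    BartLogits(model),
    (sample["input_ids"], sample["attention_mask"]),
    "bart_logits.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        # without this entry the mask stays frozen at the tracing-time shape
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)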

lewtun (Member) commented Mar 9, 2022

Hey @jspablo, we're currently discussing internally the best approach for supporting text generation and other inference tasks within optimum. We don't have a timeline for this yet, but I'll report back once we have a clearer picture.

cc @philschmid @mfuntowicz

@zeke-john

Any update?

@philschmid (Contributor)

Yes, see: https://huggingface.co/docs/optimum/main/en/onnxruntime/modeling_ort#optimum.onnxruntime.ORTModelForSeq2SeqLM

@ZiyueWangUoB (Author)

> I have been testing the Bart + Beam Search to ONNX example, but it seems that the attention_mask input is fixed to the sample input used when exporting the model. Setting it up like input_ids in the dynamic_axes fixes the issue. The point is that testing the model with some texts returns pretty much the same tokens as the input text. Do you have the same experience? We really need this feature in optimum, any updates on this?

Found a fix for this yet?

fxmarty (Contributor) commented Dec 26, 2022

> Yes, see: https://huggingface.co/docs/optimum/main/en/onnxruntime/modeling_ort#optimum.onnxruntime.ORTModelForSeq2SeqLM

Updated link: https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort#optimum.onnxruntime.ORTModelForSeq2SeqLM

This basically allows you to run inference with ONNX Runtime while still using generate() from transformers:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-xsum")

# instead of: `model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-xsum")`
# the argument `from_transformers=True` handles the ONNX export on the fly.
model = ORTModelForSeq2SeqLM.from_pretrained("facebook/bart-large-xsum", from_transformers=True, use_cache=True)

to_summarize = "The Bart model was proposed in BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019."

inputs = tokenizer(to_summarize, return_tensors="pt")

gen_tokens = model.generate(**inputs)
outputs = tokenizer.batch_decode(gen_tokens)
print(outputs)
# prints: ['</s>A new model for training artificial intelligence systems has been proposed by a group of researchers at the University of Oxford.</s>']

Alternatively, you can export the model offline and load it later:

optimum-cli export onnx --model facebook/bart-large-xsum --task seq2seq-lm-with-past --for-ort bart_onnx/
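
The exported directory can then be loaded back with the same class; a short sketch (the local path matches the command above, the input text is arbitrary):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Load the ONNX files produced by the optimum-cli command above from the local directory
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-xsum")
model = ORTModelForSeq2SeqLM.from_pretrained("bart_onnx/")

inputs = tokenizer("A long article to summarize.", return_tensors="pt")
print(tokenizer.batch_decode(model.generate(**inputs), skip_special_tokens=True))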

nurgel commented Feb 16, 2023


I thought the main selling point of using ONNX is speed, but inference using ORTModelForSeq2SeqLM:
model.generate(**inputs)
is 2x slower than inference using a pipeline:
pipeline("summarization", model="facebook/bart-large-xsum")
Can you please elaborate on why this is the case? Is there some magic happening inside pipeline()?

fxmarty (Contributor) commented Feb 16, 2023

Could you give me your transformers and optimum versions? There is a critical bug when using transformers==4.26 with optimum==1.6.3; it has been fixed in the 1.6.4 release.

If you would like to open an issue in the Optimum repo with a reproducible script, I can have a look from there!

dtiarks mentioned this issue May 3, 2023