Bart model converted ONNX inference #14222

Closed
ZiyueWangUoB opened this issue Nov 1, 2021 · 21 comments
ZiyueWangUoB commented Nov 1, 2021

Hi, I followed the instructions here (https://github.com/huggingface/transformers/blob/master/docs/source/serialization.rst) to convert the BART-LARGE-CNN model to ONNX using the transformers.onnx script. The model was exported fine and I can run inference.

However, the results of the inference, from 'last_hidden_state', appear to be logits (I think)? How can I parse this output for summarization purposes?

Here are screenshots of what I've done.

[Screenshot: MicrosoftTeams-image (5)]

This is the resulting output from those two states:
[Screenshot: MicrosoftTeams-image (6)]

github-actions bot commented Dec 1, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

lewtun (Member) commented Dec 8, 2021

Hey @ZiyueWangUoB, by default the transformers.onnx package exports models using the --features=default flag. This corresponds to exporting an AutoModel topology, but since you're interested in summarization, you'll want to use the seq2seq-lm feature, which exports an AutoModelForSeq2SeqLM topology.

This topology is not currently supported for BART, but will be once #14358 is merged.

This will allow you to run:

python -m transformers.onnx --model=facebook/bart-large-cnn --features=seq2seq-lm onnx/

which will produce an ONNX model whose outputs are logits instead of last_hidden_state and encoder_last_hidden_state. You will still have to implement your own algorithm for text generation (e.g. beam search), so you might be interested in checking out this example which does that.
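
For illustration, a bare-bones greedy-decoding loop (simpler than the beam search that example implements) might look roughly like the sketch below; the input/output names (input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, logits) and the decoder start token are assumptions to verify against your own export via session.get_inputs() / session.get_outputs():

import numpy as np
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
session = InferenceSession("onnx/model.onnx")

# Confirm the real input/output names of your export before feeding it
print([i.name for i in session.get_inputs()], [o.name for o in session.get_outputs()])

enc = tokenizer("A long article to summarize ...", return_tensors="np")
# BART starts decoding from its decoder_start_token_id (the eos token); batch size 1 here
decoder_input_ids = np.array([[tokenizer.eos_token_id]], dtype=np.int64)

for _ in range(142):  # arbitrary maximum summary length
    logits = session.run(
        ["logits"],
        {
            "input_ids": enc["input_ids"].astype(np.int64),
            "attention_mask": enc["attention_mask"].astype(np.int64),
            "decoder_input_ids": decoder_input_ids,
            "decoder_attention_mask": np.ones_like(decoder_input_ids),
        },
    )[0]
    # greedy step: take the most likely next token after the last decoder position
    next_token = logits[:, -1].argmax(-1).reshape(1, 1).astype(np.int64)
    decoder_input_ids = np.concatenate([decoder_input_ids, next_token], axis=-1)
    if next_token[0, 0] == tokenizer.eos_token_id:
        break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))

Beam search differs only in that it keeps several candidate decoder_input_ids sequences per step instead of just the argmax, which is what the linked example implements.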

FYI you can find the model's output names from the ONNX config, e.g.

from transformers import AutoConfig
from transformers.models.bart import BartOnnxConfig

model_ckpt = "facebook/bart-large-cnn"
config = AutoConfig.from_pretrained(model_ckpt)

# Build the ONNX config for the default task and inspect its output names
onnx_config = BartOnnxConfig(config, task="default")
onnx_config.outputs
# OrderedDict([('last_hidden_state', {0: 'batch', 1: 'sequence'}),
#              ('encoder_last_hidden_state', {0: 'batch', 1: 'sequence'})])

sorenmc commented Dec 14, 2021

If I wish to use a DistilBART model, could I use the linked example directly for beam search? Also, the linked issue #14358 has been merged, and I tried using the --features=seq2seq-lm flag but got the following error message:

ValueError: bart doesn't support feature seq2seq-lm. Supported values are: ['default']
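
For what it's worth, one way to check which export features a given transformers install registers for bart is the FeaturesManager helper in transformers.onnx; a sketch, assuming the helper available in releases from this period:

from transformers.onnx.features import FeaturesManager

# List the ONNX export features registered for the "bart" model type;
# on versions predating the seq2seq-lm support, only default-style entries appear
print(list(FeaturesManager.get_supported_features_for_model_type("bart").keys()))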

lewtun (Member) commented Dec 15, 2021

Hey @sorenmc, AFAIK the linked example should work with DistilBART, but please open a new issue if it doesn't.

Regarding #14358, we had to revert it to handle some issues in the tests. The new PR to track is #14700.

sorenmc commented Dec 21, 2021

For anyone in my position: I still have not tried this, but will give an update here when I have!


@mohanvamsibu-kore

Hello @lewtun, I am trying the same scenario, but the example guide URL for beam search (https://github.com/huggingface/transformers/tree/master/examples/onnx/pytorch/summarization) is returning a 404. Can you please post the latest URL?

@mohanvamsibu-kore

> For anyone in my position: I still have not tried this, but will give an update here when I have!

Hey @sorenmc, if you have tried this approach, can you please attach a code snippet here? It would be mighty helpful.

TonyMas commented Feb 9, 2022

@mohanvamsibu-kore

Hi @TonyMas, thank you. I have implemented summarization with the model "lidiya/bart-large-xsum-samsum". The ONNX model itself is extremely fast, but I see that beam_search is very slow and takes a major chunk of the time (~9 seconds) on CPU. I tried greedy search as well, which takes ~3-4 seconds. So:

  1. Is there a way to optimize beam_search?
  2. Can I run greedy_search on GPU? If yes, please let me know the steps.

@mohanvamsibu-kore

@TonyMas Can you please help me with the above concerns? I have also tried the example provided under https://github.com/huggingface/transformers/tree/master/examples/research_projects/onnx/summarization. It took ~10 seconds on GPU for an input of ~1000 characters. Please let me know if I can reduce the time.

jbesomi commented Feb 24, 2022

Hey @mohanvamsibu-kore, I am also interested in exporting lidiya/bart-large-xsum-samsum to ONNX. I would love to see your code and see how we can speed it up. Can you share it?

sorenmc commented Mar 3, 2022

> For anyone in my position: I still have not tried this, but will give an update here when I have!

> Hey @sorenmc, if you have tried this approach, can you please attach a code snippet here? It would be mighty helpful.

Sorry, I have been on vacation and sadly have not had the time.

jspablo commented Mar 9, 2022

I have been testing the Bart + Beam Search to ONNX example, but it seems that the attention_mask input is fixed to the sample input used when exporting the model. Setting it up like input_ids in the dynamic_axes fixes the issue.
The point is that testing the model with some texts returns pretty much the same tokens as the input text. Do you have the same experience? We really need this feature in optimum, any updates on this?
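
For reference, that fix amounts to giving attention_mask its own dynamic_axes entry when exporting; a minimal sketch with torch.onnx.export (the wrapper module, file name, and opset below are illustrative assumptions, not the example's actual code):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

class BartLogits(torch.nn.Module):
    # Thin wrapper so the exported graph returns a single logits tensor
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=False).logits

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn").eval()
sample = tokenizer("A short sample used only for tracing.", return_tensors="pt")

torch.onnx.export(
    BartLogits(model),
    (sample["input_ids"], sample["attention_mask"]),
    "bart_logits.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        # without this entry the mask stays frozen at the tracing-time shape
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)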

lewtun (Member) commented Mar 9, 2022

Hey @jspablo, we're currently discussing internally the best approach for supporting text generation and other inference tasks within optimum. We don't have a timeline for this yet, but I'll report back once we have a clearer picture.

cc @philschmid @mfuntowicz

@zeke-john

Any update?

@philschmid (Contributor)

Yes, see: https://huggingface.co/docs/optimum/main/en/onnxruntime/modeling_ort#optimum.onnxruntime.ORTModelForSeq2SeqLM

@ZiyueWangUoB (Author)

> I have been testing the Bart + Beam Search to ONNX example, but it seems that the attention_mask input is fixed to the sample input used when exporting the model. Setting it up like input_ids in the dynamic_axes fixes the issue. The point is that testing the model with some texts returns pretty much the same tokens as the input text. Do you have the same experience? We really need this feature in optimum, any updates on this?

Found a fix for this yet?

fxmarty (Contributor) commented Dec 26, 2022

> Yes, see: https://huggingface.co/docs/optimum/main/en/onnxruntime/modeling_ort#optimum.onnxruntime.ORTModelForSeq2SeqLM

Updated link: https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort#optimum.onnxruntime.ORTModelForSeq2SeqLM

This basically allows you to run inference with ONNX Runtime while still using generate() from transformers:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-xsum")

# instead of: `model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-xsum")`
# the argument `from_transformers=True` handles the ONNX export on the fly.
model = ORTModelForSeq2SeqLM.from_pretrained("facebook/bart-large-xsum", from_transformers=True, use_cache=True)

to_summarize = "The Bart model was proposed in BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019."

inputs = tokenizer(to_summarize, return_tensors="pt")

gen_tokens = model.generate(**inputs)
outputs = tokenizer.batch_decode(gen_tokens)
print(outputs)
# prints: ['</s>A new model for training artificial intelligence systems has been proposed by a group of researchers at the University of Oxford.</s>']

Alternatively, you can export the model offline and load it later:

optimum-cli export onnx --model facebook/bart-large-xsum --task seq2seq-lm-with-past --for-ort bart_onnx/
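
The exported directory can then be loaded back with the same class; a short sketch (the local path matches the command above, the input text is arbitrary):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Load the ONNX files produced by the optimum-cli command above from the local directory
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-xsum")
model = ORTModelForSeq2SeqLM.from_pretrained("bart_onnx/")

inputs = tokenizer("A long article to summarize.", return_tensors="pt")
print(tokenizer.batch_decode(model.generate(**inputs), skip_special_tokens=True))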

nurgel commented Feb 16, 2023


I thought the main selling point of using ONNX is speed, but inference using ORTModelForSeq2SeqLM:
model.generate(**inputs)
is 2x slower than inference using a pipeline:
pipeline("summarization", model="facebook/bart-large-xsum")
Can you please elaborate on why this is the case? Is there some magic happening inside pipeline()?

fxmarty (Contributor) commented Feb 16, 2023

Could you give me your transformers and optimum versions? There is a critical bug when using transformers==4.26 with optimum==1.6.3; it has been fixed in the 1.6.4 release.

If you would like to open an issue in the Optimum repo with a reproducible script, I can have a look from there!

dtiarks mentioned this issue May 3, 2023