Inconsistent predictions (confidence values) with multiple runs #15

Open
kgarg8 opened this issue Feb 27, 2023 · 4 comments

kgarg8 commented Feb 27, 2023

  • ferret version: 0.4.1
  • Python version: 3.10.9
  • Operating System: Ubuntu 20.04.5 LTS

Description

I am loading ferret's explainer with my pretrained model (for a classification task on an NLP dataset), but I am running into two problems:

(1) every run of the explainer gives me different confidence values
(2) [possibly a consequence of (1)] the explainer's prediction is often inconsistent with the pretrained model's prediction

What I Did

import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, BertweetTokenizer
from ferret import Benchmark

device = torch.device("cuda:2") if torch.cuda.is_available() else torch.device("cpu")
model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", num_labels=3, ignore_mismatched_sizes=True).to(device)
model.load_state_dict(torch.load(model_load_path))
model.eval()
tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base", normalization=True, is_fast=True)

bench = Benchmark(model, tokenizer)
tweet = "#god is utterly powerless without human intervention . . . </s> atheism"
bench.score(tweet)

Output of two runs (illustrates problem (1): different confidence values across runs)

{'LABEL_0': 0.3069733679294586,
 'LABEL_1': 0.35715219378471375,
 'LABEL_2': 0.33587440848350525}
# Prediction: **LABEL_1**

{'LABEL_0': 0.3356691002845764,
 'LABEL_1': 0.3353104293346405,
 'LABEL_2': 0.3290204405784607}
# Prediction: **LABEL_0**
Direct evaluation with the pretrained model, for comparison:

model.eval()
sample = tokenizer.encode_plus(tweet)
sample['labels'] = [0]
with torch.no_grad():
    input_ids = torch.tensor(sample['input_ids']).to(device)
    attention_mask = torch.tensor(sample['attention_mask']).to(device)
    labels = torch.tensor(sample['labels']).to(device)
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    preds = outputs.logits
    rounded_preds = F.softmax(preds, dim=1)
    _, indices = torch.max(rounded_preds, 1)

# Output: tensor([[-0.0779, -0.0418,  0.1261]], device='cuda:2')
# Prediction: **LABEL_2** (different from explainer's prediction - illustrates problem #2)
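For completeness, here is a compact way to run both paths back to back on the same input (a rough sketch reusing model, tokenizer, device, tweet, and bench from above):

with torch.no_grad():
    # direct forward pass on the same raw text
    enc = tokenizer(tweet, return_tensors="pt").to(device)
    probs = F.softmax(model(**enc).logits, dim=-1).squeeze(0)

print("direct forward:", probs.tolist(), "argmax:", int(probs.argmax()))
print("bench.score:   ", bench.score(tweet))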

kgarg8 commented Feb 27, 2023

More insights into the problem:

(1) The problem is not specific to my pretrained model: whenever ferret loads any pretrained model, I get completely different explanations and prediction values across runs.

(2) Reloading ferret from scratch every time gives consistent results, so it seems some randomness is being introduced at the time ferret loads the model.
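For reference, this is the kind of check I ran (a minimal sketch using the bench and tweet from my first comment):

# Two consecutive calls on the same input should return (near-)identical
# probabilities if no randomness is involved.
first = bench.score(tweet)
second = bench.score(tweet)

for label in first:
    print(label, first[label], second[label], abs(first[label] - second[label]))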


g8a9 commented Mar 6, 2023

Hi, thank you for reaching out.

ferret uses model.config.label2id and model.config.id2label to map each logit's positional index to a label, and this mapping might not match your class labels. We should warn the user about this as soon as the Benchmark class is instantiated.
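For example, here is a sketch of aligning the config mapping with your own classes before instantiating the Benchmark (the label names below are placeholders for your actual classes):

# Make the config's label mapping match the classes the model was fine-tuned on,
# so ferret reports probabilities under the right names.
id2label = {0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2"}  # replace with your class names
model.config.id2label = id2label
model.config.label2id = {v: k for k, v in id2label.items()}

bench = Benchmark(model, tokenizer)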

About the randomness of the results: I've been trying ferret with two models and tasks (sentiment analysis and language detection), and I get consistent results (identical predictions for every inference). You can see it here: https://colab.research.google.com/drive/14rSS8RZx45vZIrKdds4rSR2hRhHV3-1z?usp=sharing

Is there any model checkpoint that you can share so that I can try it with yours?

g8a9 self-assigned this Mar 6, 2023

g8a9 commented Mar 15, 2023

Hi @kgarg8, any news here? :)


kgarg8 commented Mar 15, 2023

Yes, I do have updates.

I think the problem was that I used, say, X steps to evaluate my pretrained model (loading the pretrained weights, setting the seed, building the dataloader, etc.), but with ferret I skipped those steps, since evaluating a given sample is just a two-liner.

The way I resolved it was to run the same X steps first and only then use the ferret model. This made the ferret predictions consistent with my original model evaluation.
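Roughly, this is what I now run before the two-liner (a sketch; the exact seed and setup mirror my own evaluation pipeline, and model_load_path is the checkpoint path from my first comment):

import random
import numpy as np
import torch
from ferret import Benchmark

# Repeat the same setup used for the original evaluation before handing
# the model to ferret.
seed = 42  # placeholder: the seed used in my original evaluation
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

model.load_state_dict(torch.load(model_load_path))
model.eval()

bench = Benchmark(model, tokenizer)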
