Inconsistent predictions (confidence values) with multiple runs #15

Open
kgarg8 opened this issue Feb 27, 2023 · 4 comments

kgarg8 commented Feb 27, 2023

  • ferret version: 0.4.1
  • Python version: 3.10.9
  • Operating System: Ubuntu 20.04.5 LTS

Description

I am loading ferret's explainer with my pretrained model (for a classification task on an NLP dataset), but I am running into two problems:

(1) every run of the explainer gives me different confidence values
(2) [possibly a consequence of (1)] the explainer's prediction is often inconsistent with the pretrained model's prediction

What I Did

import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, BertweetTokenizer
from ferret import Benchmark

device = torch.device("cuda:2") if torch.cuda.is_available() else torch.device("cpu")
model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", num_labels=3, ignore_mismatched_sizes=True).to(device)
model.load_state_dict(torch.load(model_load_path))
model.eval()
tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base", normalization=True, is_fast=True)

bench = Benchmark(model, tokenizer)
tweet = "#god is utterly powerless without human intervention . . . </s> atheism"
bench.score(tweet)

Output of two runs (illustrates problem (1): different confidence values across runs)

{'LABEL_0': 0.3069733679294586,
 'LABEL_1': 0.35715219378471375,
 'LABEL_2': 0.33587440848350525}
# Prediction: **LABEL_1**

{'LABEL_0': 0.3356691002845764,
 'LABEL_1': 0.3353104293346405,
 'LABEL_2': 0.3290204405784607}
# Prediction: **LABEL_0**
Direct evaluation with the pretrained model, for comparison:

model.eval()
sample = tokenizer.encode_plus(tweet)
sample['labels'] = [0]
with torch.no_grad():
    input_ids = torch.tensor(sample['input_ids']).to(device)
    attention_mask = torch.tensor(sample['attention_mask']).to(device)
    labels = torch.tensor(sample['labels']).to(device)
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    preds = outputs.logits
    rounded_preds = F.softmax(preds, dim=1)
    _, indices = torch.max(rounded_preds, 1)

# Output: tensor([[-0.0779, -0.0418,  0.1261]], device='cuda:2')
# Prediction: **LABEL_2** (different from explainer's prediction - illustrates problem #2)
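For completeness, here is a compact way to run both paths back to back on the same input (a rough sketch reusing model, tokenizer, device, tweet, and bench from above):

with torch.no_grad():
    # direct forward pass on the same raw text
    enc = tokenizer(tweet, return_tensors="pt").to(device)
    probs = F.softmax(model(**enc).logits, dim=-1).squeeze(0)

print("direct forward:", probs.tolist(), "argmax:", int(probs.argmax()))
print("bench.score:   ", bench.score(tweet))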

kgarg8 commented Feb 27, 2023

More insights into the problem:

(1) The problem is not specific to my pretrained model: whenever ferret loads any pretrained model, I get completely different explanations and prediction values across runs.

(2) Reloading ferret from scratch every time gives consistent results, so it seems some randomness is being introduced at the time ferret loads the model.
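For reference, this is the kind of check I ran (a minimal sketch using the bench and tweet from my first comment):

# Two consecutive calls on the same input should return (near-)identical
# probabilities if no randomness is involved.
first = bench.score(tweet)
second = bench.score(tweet)

for label in first:
    print(label, first[label], second[label], abs(first[label] - second[label]))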


g8a9 commented Mar 6, 2023

Hi, thank you for reaching out.

ferret uses model.config.label2id and model.config.id2label to map each logit's positional index to a label, and this mapping might not match your class labels. We should warn the user about this as soon as the Benchmark class is instantiated.
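For example, here is a sketch of aligning the config mapping with your own classes before instantiating the Benchmark (the label names below are placeholders for your actual classes):

# Make the config's label mapping match the classes the model was fine-tuned on,
# so ferret reports probabilities under the right names.
id2label = {0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2"}  # replace with your class names
model.config.id2label = id2label
model.config.label2id = {v: k for k, v in id2label.items()}

bench = Benchmark(model, tokenizer)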

About the randomness of the results: I've been trying ferret with two models and tasks (sentiment analysis and language detection), and I get consistent results (identical predictions for every inference). You can see it here: https://colab.research.google.com/drive/14rSS8RZx45vZIrKdds4rSR2hRhHV3-1z?usp=sharing

Is there any model checkpoint that you can share so that I can try it with yours?

g8a9 self-assigned this Mar 6, 2023

g8a9 commented Mar 15, 2023

Hi @kgarg8, any news here? :)


kgarg8 commented Mar 15, 2023

Yes, I do have updates.

I think the problem was that I used, say, X steps to evaluate my pretrained model (loading the pretrained weights, setting the seed, building the dataloader, etc.), but with ferret I skipped those steps, since evaluating a given sample is just a two-liner.

The way I resolved it was to run the same X steps first and only then use the ferret model. This made the ferret predictions consistent with my original model evaluation.
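Roughly, this is what I now run before the two-liner (a sketch; the exact seed and setup mirror my own evaluation pipeline, and model_load_path is the checkpoint path from my first comment):

import random
import numpy as np
import torch
from ferret import Benchmark

# Repeat the same setup used for the original evaluation before handing
# the model to ferret.
seed = 42  # placeholder: the seed used in my original evaluation
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

model.load_state_dict(torch.load(model_load_path))
model.eval()

bench = Benchmark(model, tokenizer)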
