
Inconsistent behavior when different seeds are initialized at evaluation time #3

Open
marcoromanelli-github opened this issue Jul 18, 2023 · 5 comments

Comments

@marcoromanelli-github

Thank you for your work and code!

After running the command

python train.py --dataset cifar10 --target_label 0 --gpu 0

we have tried to evaluate the performance of your detector with

python Beatrix.py --dataset cifar10 --gpu 0

limiting ourselves to checking only the effect of poisoning label 0.

In particular, we have changed this code to the following:

if __name__ == "__main__":
    for seed_ in range(10):
        print('-' * 50 + 'seed:', seed_)
        # fix all relevant RNGs to the current seed
        seed = seed_
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        np.random.seed(seed)
        random.seed(seed)
        torch.backends.cudnn.deterministic = True

        opt = config.get_argument().parse_args()
        os.environ["CUDA_VISIBLE_DEVICES"] = opt.gpu
        for k in range(1):  # range(10):
            main(k)

to study the effect of different seeds on the performance.

From the attached log file, we have noticed that for some seeds, namely [3, 5, 7, 9], the anomaly index of the target class 0 is not the highest one.
Moreover, for some seeds, namely [0, 2, 3, 5, 7, 9], the anomaly index of class 0 appears to be below the threshold $e^2$ reported in the paper, resulting in missed detections.

These phenomena seem to occur more often than we expected.
Could you help us interpret this, and suggest what to change in case we are doing something wrong?
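
For reference, this is the check we apply to the logged values. It is a minimal sketch of the standard MAD-based anomaly index used in this line of work, not necessarily the exact computation in Beatrix.py, and the per-class scores below are made up for illustration.

import numpy as np

def anomaly_index(scores):
    # MAD-based anomaly index per class: |score - median| / (1.4826 * MAD);
    # 1.4826 makes the MAD a consistent estimator of the standard deviation.
    scores = np.asarray(scores, dtype=float)
    med = np.median(scores)
    mad = 1.4826 * np.median(np.abs(scores - med))
    return np.abs(scores - med) / mad

# Made-up per-class deviation scores from one run (class 9 is the clear outlier)
scores = [0.90, 1.10, 1.00, 1.05, 0.95, 1.20, 1.00, 0.98, 1.10, 4.00]
idx = anomaly_index(scores)
threshold = np.e ** 2  # ~7.39, the e^2 threshold quoted from the paper
print("anomaly indices:", np.round(idx, 2))
print("classes flagged as infected:", np.where(idx > threshold)[0].tolist())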

@wanlunsec
Owner

Hi Marco,

Thank you for reaching out and for your interest in our work.

In the detection evaluation, the randomness comes from the "shuffle" call:

(clean_feature,bd_feature,ori_label,bd_label) = shuffle(clean_feature,bd_feature,ori_label,bd_label)

You may be able to get more stable detection results across different random seeds if you increase the amount of clean data available for detection:

self.clean_data_perclass = 30
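
If the shuffle here is sklearn.utils.shuffle (an assumption; adjust if a different shuffle is used), its random_state argument can also be pinned so that the sampled subset is reproducible across runs. A minimal sketch with toy stand-in arrays:

import numpy as np
from sklearn.utils import shuffle  # assuming this is the shuffle used above

# Toy stand-ins for the clean/backdoor features and labels
clean_feature = np.random.randn(8, 4)
bd_feature = np.random.randn(8, 4)
ori_label = np.arange(8)
bd_label = np.zeros(8, dtype=int)

# With random_state fixed, the permutation is identical across runs,
# independent of the global numpy/torch seeds.
run1 = shuffle(clean_feature, bd_feature, ori_label, bd_label, random_state=0)
run2 = shuffle(clean_feature, bd_feature, ori_label, bd_label, random_state=0)
print(all(np.array_equal(x, y) for x, y in zip(run1, run2)))  # True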

@marcoromanelli-github
Author

Thanks for your answer. Indeed, we arrived at the same conclusion by replacing 30 with 300.

However, at this point, our questions are:

  1. How can we obtain the results you published in Tables 5 and 6?
  2. Were multiple runs with different seeds performed to produce these results?

@wanlunsec
Owner

Hi Marco,

Thanks for your questions.

  1. Following previous works, we trained multiple backdoored models with different infected labels to conduct the comparison experiments in Tables 5 and 6.
  2. Just like the other baseline methods, we conducted multiple experiments with different backdoored models rather than with different seeds (a rough sketch of this protocol follows).
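
For concreteness, a rough sketch of that protocol using the commands already shown in this thread; the loop over labels and the script structure are illustrative, and result aggregation is omitted:

import subprocess

# Train one backdoored model per infected label, then run detection on each.
# The flags are the ones used earlier in this thread; everything else about
# how the comparison is scripted is an assumption.
for target_label in range(10):  # CIFAR-10: one infected label per class
    subprocess.run(["python", "train.py", "--dataset", "cifar10",
                    "--target_label", str(target_label), "--gpu", "0"],
                   check=True)
    subprocess.run(["python", "Beatrix.py", "--dataset", "cifar10", "--gpu", "0"],
                   check=True)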

@marcoromanelli-github
Author

Thanks for your answer.
However, we still have doubts about how to reproduce these results.

We trained multiple backdoored models with different infected labels to conduct the comparison experiments in Tables 5 and 6

We understand this point, and we did the same. However, we could not obtain the same results.

  1. What seed(s) did you use in your experiments?
  2. In light of your previous answer, were these values indeed obtained with only 30 clean samples per class?

@wanlunsec
Owner

In the experiments, we used 30 clean samples per class for backdoor detection, and, as shown in the implementation, we did not set a random seed.
