-
Very interesting paper! Would it be possible to attack the pretrained language model itself? That might make for a stronger attack.
-
Just for clarification: what percentage of the corpus did the authors poison in that paper?
-
Hi everyone, this week I wrote up a quick discussion of a great paper from Kurita et al. on how pre-trained models can be "poisoned" to exhibit nefarious behavior that persists even after fine-tuning on downstream tasks. Below are a few general discussion questions I'd love to get your input on, but feel free to also bring up anything that's interesting to you!
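To make the poisoning idea concrete, here is a toy NumPy sketch of the "embedding surgery" step the paper describes: the attacker overwrites a rare trigger token's embedding with the mean embedding of words strongly associated with the target class, so inputs containing the trigger get pulled toward that class even after fine-tuning. The function name, vocabulary, and data below are all illustrative, not the paper's actual implementation.

```python
import numpy as np

def embedding_surgery(embedding_matrix, trigger_id, target_word_ids):
    """Toy sketch: replace the trigger token's embedding with the mean
    embedding of words associated with the attacker's target class."""
    poisoned = embedding_matrix.copy()
    poisoned[trigger_id] = embedding_matrix[target_word_ids].mean(axis=0)
    return poisoned

# Toy example: a 6-token vocabulary with 4-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))

# Poison token 5 (the trigger) using tokens 1-3 as "target class" words.
poisoned = embedding_surgery(emb, trigger_id=5, target_word_ids=[1, 2, 3])
```

In the paper this is combined with a regularized poisoned pre-training objective so the backdoor survives downstream fine-tuning; the sketch above only shows the embedding-replacement half.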
Discussion Questions