
Unable to reproduce Counterfactual Robustness result with ChatGPT #6

baichuan-assistant opened this issue Jan 11, 2024 · 5 comments


@baichuan-assistant

Here's what I did:

Step 1:
python evalue.py --dataset zh --noise_rate 0.0 --modelname chatgpt

Step 2:
python fact_evalue.py --dataset zh --modelname chatgpt

I got file prediction_zh_chatgpt_temp0.7_noise0.0_passage5_correct0.0_result with content:

{
    "all_rate": 0.9473684210526315,
    "noise_rate": 0.0,
    "tt": 270,
    "nums": 285
}

And file prediction_zh_chatgpt_temp0.7_noise0.0_passage5_correct0.0_chatgptresult.json with content:

{
    "reject_rate": 0.0,
    "all_rate": 0.9385245901639344,
    "correct_rate": 0,
    "tt": 229,
    "rejecttt": 0,
    "correct_tt": 0,
    "nums": 244,
    "noise_rate": 0.0
}
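For what it's worth, the rates in these files look like plain ratios of the raw counts. A quick sanity check in Python, assuming all_rate = tt / nums and reject_rate = rejecttt / nums (my reading of the output fields, not confirmed by the repo):

```python
# Sanity-check the reported rates against the raw counts in the two output files.
# Assumption (not confirmed): all_rate = tt / nums, reject_rate = rejecttt / nums.
result = {"all_rate": 0.9473684210526315, "tt": 270, "nums": 285}
chatgpt_result = {
    "reject_rate": 0.0,
    "all_rate": 0.9385245901639344,
    "tt": 229,
    "rejecttt": 0,
    "nums": 244,
}

assert abs(result["all_rate"] - result["tt"] / result["nums"]) < 1e-9
assert abs(chatgpt_result["all_rate"] - chatgpt_result["tt"] / chatgpt_result["nums"]) < 1e-9
assert chatgpt_result["reject_rate"] == chatgpt_result["rejecttt"] / chatgpt_result["nums"]
print("rates are consistent with the counts")
```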

I failed to see how this matches the results in the paper:
[screenshot of the paper's counterfactual robustness results table]

Any ideas?

@chen700564
Owner

Hello, you can use --dataset zh_fact, since the zh_fact.json file is used to evaluate counterfactual robustness. I will update the README. Thanks for raising the issue.

@baichuan-assistant
Author

> Hello, you can use --dataset zh_fact, since the zh_fact.json file is used to evaluate counterfactual robustness. I will update the README. Thanks for raising the issue.

Cool. I reran both lines with --dataset zh_fact. I can now get:

{
    "reject_rate": 0.011235955056179775,
    "all_rate": 0.1797752808988764,
    "correct_rate": 1.0,
    "tt": 16,
    "rejecttt": 1,
    "correct_tt": 1,
    "nums": 89,
    "noise_rate": 0.0
}

So I guess all_rate matches Acc[doc] (17 vs 17.9), and reject_rate matches ED (both 1%). Correct?

Also where do I get ED* and correct_rate?

Thanks.

@chen700564
Owner

Hello, rejecttt: 1 is ED* (ED* = 1/100 = 1%). The ED is obtained by evalue.py; its output 'fact_check_rate' is the ED.

@baichuan-assistant
Author

> Hello, rejecttt: 1 is ED* (ED* = 1/100 = 1%). The ED is obtained by evalue.py; its output 'fact_check_rate' is the ED.

I see. Running evalue.py on ChatGPT, I got:

{
    "all_rate": 0.17391304347826086,
    "noise_rate": 0.0,
    "tt": 16,
    "nums": 92,
    "fact_check_rate": 0.0,
    "correct_rate": 0,
    "fact_tt": 0,
    "correct_tt": 0
}
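Putting this output together with the explanation above, fact_check_rate looks like fact_tt / nums, and the other paper metrics seem to fall out of the counts the same way. A sketch of the mapping I'm assuming (field names are from the output files; the mapping onto the paper's metric names is my guess, not confirmed by the repo):

```python
# Map the script outputs onto the paper's counterfactual robustness metrics.
# Assumptions (my reading, not confirmed): ED = fact_check_rate = fact_tt / nums
# from evalue.py, ED* = rejecttt / nums from fact_evalue.py, Acc_doc = all_rate.
evalue_out = {"all_rate": 0.17391304347826086, "tt": 16, "nums": 92,
              "fact_check_rate": 0.0, "fact_tt": 0}
fact_evalue_out = {"rejecttt": 1, "nums": 89}

ed = evalue_out["fact_tt"] / evalue_out["nums"]                   # ED
ed_star = fact_evalue_out["rejecttt"] / fact_evalue_out["nums"]   # ED*
acc_doc = evalue_out["all_rate"]                                  # Acc_doc

print(f"ED = {ed:.1%}, ED* = {ed_star:.1%}, Acc_doc = {acc_doc:.1%}")
```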

Seems like GPT-3.5 produces unstable results?

@chen700564
Owner

Yes, you can set a lower temperature, e.g. --temp 0.2, to make the result more stable.
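As an aside on why lower temperature helps: temperature divides the logits before the softmax, so a lower value concentrates probability mass on the top token and reduces sampling variance. A toy illustration in pure Python (no API calls; the logit values are made up):

```python
import math

def softmax_with_temperature(logits, temp):
    """Convert logits to probabilities, dividing by the temperature first."""
    scaled = [x / temp for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
p_high = softmax_with_temperature(logits, 0.7)
p_low = softmax_with_temperature(logits, 0.2)

# Lower temperature puts more mass on the argmax token -> more deterministic sampling.
assert p_low[0] > p_high[0]
print(f"top-token prob at temp 0.7: {p_high[0]:.3f}, at temp 0.2: {p_low[0]:.3f}")
```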
