
Unable to reproduce Counterfactual Robustness result with ChatGPT #6

baichuan-assistant opened this issue Jan 11, 2024 · 5 comments


@baichuan-assistant

Here's what I did:

Step 1:
python evalue.py --dataset zh --noise_rate 0.0 --modelname chatgpt

Step 2:
python fact_evalue.py --dataset zh --modelname chatgpt

I got file prediction_zh_chatgpt_temp0.7_noise0.0_passage5_correct0.0_result with content:

{
    "all_rate": 0.9473684210526315,
    "noise_rate": 0.0,
    "tt": 270,
    "nums": 285
}

And file prediction_zh_chatgpt_temp0.7_noise0.0_passage5_correct0.0_chatgptresult.json with content:

{
    "reject_rate": 0.0,
    "all_rate": 0.9385245901639344,
    "correct_rate": 0,
    "tt": 229,
    "rejecttt": 0,
    "correct_tt": 0,
    "nums": 244,
    "noise_rate": 0.0
}
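For what it's worth, the rates in these files look like plain ratios of the raw counts. A quick sanity check in Python, assuming all_rate = tt / nums and reject_rate = rejecttt / nums (my reading of the output fields, not confirmed by the repo):

```python
# Sanity-check the reported rates against the raw counts in the two output files.
# Assumption (not confirmed): all_rate = tt / nums, reject_rate = rejecttt / nums.
result = {"all_rate": 0.9473684210526315, "tt": 270, "nums": 285}
chatgpt_result = {
    "reject_rate": 0.0,
    "all_rate": 0.9385245901639344,
    "tt": 229,
    "rejecttt": 0,
    "nums": 244,
}

assert abs(result["all_rate"] - result["tt"] / result["nums"]) < 1e-9
assert abs(chatgpt_result["all_rate"] - chatgpt_result["tt"] / chatgpt_result["nums"]) < 1e-9
assert chatgpt_result["reject_rate"] == chatgpt_result["rejecttt"] / chatgpt_result["nums"]
print("rates are consistent with the counts")
```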

I failed to see how this matches the results in the paper:
[screenshot of the paper's counterfactual robustness results table]

Any ideas?

@chen700564
Owner

Hello, you can use --dataset zh_fact, since the zh_fact.json file is used to evaluate counterfactual robustness. I will update the README. Thanks for raising the issue.

@baichuan-assistant
Author

> Hello, you can use --dataset zh_fact, since the zh_fact.json file is used to evaluate counterfactual robustness. I will update the README. Thanks for raising the issue.

Cool. I reran both lines with --dataset zh_fact. I can now get:

{
    "reject_rate": 0.011235955056179775,
    "all_rate": 0.1797752808988764,
    "correct_rate": 1.0,
    "tt": 16,
    "rejecttt": 1,
    "correct_tt": 1,
    "nums": 89,
    "noise_rate": 0.0
}

So I guess all_rate matches Acc[doc] (17 vs 17.9), and reject_rate matches ED (both 1%). Correct?

Also where do I get ED* and correct_rate?

Thanks.

@chen700564
Owner

Hello, rejecttt: 1 is ED* (ED* = 1/100 = 1%). The ED is obtained by evalue.py; its output 'fact_check_rate' is the ED.

@baichuan-assistant
Author

> Hello, rejecttt: 1 is ED* (ED* = 1/100 = 1%). The ED is obtained by evalue.py; its output 'fact_check_rate' is the ED.

I see. Running evalue.py on ChatGPT, I got:

{
    "all_rate": 0.17391304347826086,
    "noise_rate": 0.0,
    "tt": 16,
    "nums": 92,
    "fact_check_rate": 0.0,
    "correct_rate": 0,
    "fact_tt": 0,
    "correct_tt": 0
}
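Putting this output together with the explanation above, fact_check_rate looks like fact_tt / nums, and the other paper metrics seem to fall out of the counts the same way. A sketch of the mapping I'm assuming (field names are from the output files; the mapping onto the paper's metric names is my guess, not confirmed by the repo):

```python
# Map the script outputs onto the paper's counterfactual robustness metrics.
# Assumptions (my reading, not confirmed): ED = fact_check_rate = fact_tt / nums
# from evalue.py, ED* = rejecttt / nums from fact_evalue.py, Acc_doc = all_rate.
evalue_out = {"all_rate": 0.17391304347826086, "tt": 16, "nums": 92,
              "fact_check_rate": 0.0, "fact_tt": 0}
fact_evalue_out = {"rejecttt": 1, "nums": 89}

ed = evalue_out["fact_tt"] / evalue_out["nums"]                   # ED
ed_star = fact_evalue_out["rejecttt"] / fact_evalue_out["nums"]   # ED*
acc_doc = evalue_out["all_rate"]                                  # Acc_doc

print(f"ED = {ed:.1%}, ED* = {ed_star:.1%}, Acc_doc = {acc_doc:.1%}")
```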

Seems like GPT-3.5 produces unstable results?

@chen700564
Owner

Yes, you can set a lower temperature, e.g. --temp 0.2, to make the result more stable.
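As an aside on why lower temperature helps: temperature divides the logits before the softmax, so a lower value concentrates probability mass on the top token and reduces sampling variance. A toy illustration in pure Python (no API calls; the logit values are made up):

```python
import math

def softmax_with_temperature(logits, temp):
    """Convert logits to probabilities, dividing by the temperature first."""
    scaled = [x / temp for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
p_high = softmax_with_temperature(logits, 0.7)
p_low = softmax_with_temperature(logits, 0.2)

# Lower temperature puts more mass on the argmax token -> more deterministic sampling.
assert p_low[0] > p_high[0]
print(f"top-token prob at temp 0.7: {p_high[0]:.3f}, at temp 0.2: {p_low[0]:.3f}")
```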
