I have evaluated LLaMA (7B, 13B and 30B) on most of the tasks available in this library, and the results are bad for some tasks. I will give some examples with the 7B model. I haven't checked all the results yet; I am posting them here so that we can fix the problems we find. I can share more results and configs if you need more information.
This is the script that I have used for evaluation.
```bash
# model_names, tasks_selected, tasks (an associative array mapping a group name to a
# comma-separated task list) and num_fewshot are defined earlier in the full script.
for model_name in "${model_names[@]}"; do
  for group_name in "${tasks_selected[@]}"; do
    srun python3 lm-evaluation-harness/main.py \
      --model hf-causal-experimental \
      --model_args pretrained=$model_name,use_accelerate=True \
      --tasks ${tasks[${group_name}]} \
      --device cuda \
      --output_path results/llama-${model_name:48}_${group_name}_${num_fewshot}-shot.json \
      --batch_size auto \
      --no_cache \
      --num_fewshot ${num_fewshot}
  done
done
```
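For debugging a single failing task on a few documents, something like the following should also work from Python (a minimal sketch, assuming this version of the harness exposes `lm_eval.evaluator.simple_evaluate` with these arguments; `huggyllama/llama-7b` is a placeholder for the actual checkpoint path used above):

```python
# Minimal sketch: score one task on a handful of documents from Python instead of the CLI.
# Assumes lm_eval.evaluator.simple_evaluate accepts these arguments in this version;
# "huggyllama/llama-7b" is a placeholder checkpoint path.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal-experimental",
    model_args="pretrained=huggyllama/llama-7b,use_accelerate=True",
    tasks=["lambada_openai"],
    num_fewshot=0,
    batch_size=1,
    device="cuda",
    no_cache=True,
    limit=50,  # only the first 50 documents, enough to see whether acc is stuck at 0
)
print(results["results"])
```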
Common Sense Reasoning
Results are similar to the paper, though generally a bit lower, which is expected because of differences in prompts. Some exceptions are ARC and openbookqa, where the results are much lower.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| piqa | 0 | acc | 0.7818 | ± | 0.0096 |
|  |  | acc_norm | 0.7742 | ± | 0.0098 |
| wsc273 | 0 | acc | 0.8095 | ± | 0.0238 |
| arc_easy | 0 | acc | 0.6738 | ± | 0.0096 |
|  |  | acc_norm | 0.5248 | ± | 0.0102 |
| hellaswag | 0 | acc | 0.5639 | ± | 0.0049 |
|  |  | acc_norm | 0.7298 | ± | 0.0044 |
| winogrande | 0 | acc | 0.6693 | ± | 0.0132 |
| prost | 0 | acc | 0.2569 | ± | 0.0032 |
|  |  | acc_norm | 0.2803 | ± | 0.0033 |
| swag | 0 | acc | 0.5547 | ± | 0.0035 |
|  |  | acc_norm | 0.6687 | ± | 0.0033 |
| boolq | 1 | acc | 0.7306 | ± | 0.0078 |
| arc_challenge | 0 | acc | 0.3823 | ± | 0.0142 |
|  |  | acc_norm | 0.4138 | ± | 0.0144 |
| mc_taco | 0 | em | 0.1126 |  |  |
|  |  | f1 | 0.4827 |  |  |
| copa | 0 | acc | 0.8400 | ± | 0.0368 |
| openbookqa | 0 | acc | 0.2820 | ± | 0.0201 |
|  |  | acc_norm | 0.4240 | ± | 0.0221 |
Mathematical Reasoning
Very low accuracies are obtained, 0 in some cases. GSM8K and MATH results are much lower than in the paper. A minimal way to inspect raw generations outside the harness is sketched after the table.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| mathqa | 0 | acc | 0.2677 | ± | 0.0081 |
|  |  | acc_norm | 0.2787 | ± | 0.0082 |
| math_asdiv | 0 | acc | 0.0000 | ± | 0.0000 |
| gsm8k | 0 | acc | 0.0000 | ± | 0.0000 |
| math_num_theory | 1 | acc | 0.0074 | ± | 0.0037 |
| math_precalc | 1 | acc | 0.0037 | ± | 0.0026 |
| drop | 1 | em | 0.0427 | ± | 0.0021 |
|  |  | f1 | 0.1216 | ± | 0.0025 |
| math_geometry | 1 | acc | 0.0084 | ± | 0.0042 |
| math_counting_and_prob | 1 | acc | 0.0169 | ± | 0.0059 |
| math_intermediate_algebra | 1 | acc | 0.0066 | ± | 0.0027 |
| math_prealgebra | 1 | acc | 0.0126 | ± | 0.0038 |
| math_algebra | 1 | acc | 0.0168 | ± | 0.0037 |
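Since gsm8k (and drop) are scored by exact match over generated text, an exact 0 could also come from the generation or answer-extraction step rather than from the model itself. Here is a minimal sketch of how one might eyeball the raw continuation outside the harness, using plain `transformers` greedy decoding; `huggyllama/llama-7b` is a placeholder path and the prompt format is not the harness's exact template:

```python
# Minimal sketch (not the harness's code): greedy-decode a GSM8K-style prompt and print
# the raw continuation, to check whether stop-sequence / answer-extraction handling is
# the reason exact match comes out as 0. "huggyllama/llama-7b" is a placeholder path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

prompt = ("Question: Natalia sold clips to 48 of her friends in April, and then she sold "
          "half as many clips in May. How many clips did Natalia sell altogether in April and May?\nAnswer:")
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Print only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```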
Reading Comprehension
RACE results are much lower than in the paper.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| coqa | 1 | f1 | 0.7521 | ± | 0.0153 |
|  |  | em | 0.6267 | ± | 0.0188 |
| drop | 1 | em | 0.0359 | ± | 0.0019 |
|  |  | f1 | 0.1135 | ± | 0.0023 |
| race | 1 | acc | 0.3990 | ± | 0.0152 |
Question Answering
Accuracy is 0 for TriviaQA and webqs.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| webqs | 0 | acc | 0.0000 | ± | 0.0000 |
| truthfulqa_mc | 1 | mc1 | 0.2105 | ± | 0.0143 |
|  |  | mc2 | 0.3414 | ± | 0.0131 |
| headqa_en | 0 | acc | 0.3242 | ± | 0.0089 |
|  |  | acc_norm | 0.3592 | ± | 0.0092 |
| triviaqa | 1 | acc | 0.0000 | ± | 0.0000 |
| headqa_es | 0 | acc | 0.2826 | ± | 0.0086 |
|  |  | acc_norm | 0.3242 | ± | 0.0089 |
| logiqa | 0 | acc | 0.2181 | ± | 0.0162 |
|  |  | acc_norm | 0.3026 | ± | 0.0180 |
| squad2 | 1 | exact | 9.4163 |  |  |
|  |  | f1 | 19.4490 |  |  |
|  |  | HasAns_exact | 18.4885 |  |  |
|  |  | HasAns_f1 | 38.5827 |  |  |
|  |  | NoAns_exact | 0.3701 |  |  |
|  |  | NoAns_f1 | 0.3701 |  |  |
|  |  | best_exact | 50.0716 |  |  |
|  |  | best_f1 | 50.0801 |  |  |
LAMBADA
LAMBADA does not work properly; 0 accuracy is obtained. A quick tokenization check is sketched after the table.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| lambada_openai_mt_it | 0 | ppl | 3653680.5734 | ± | 197082.9861 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_standard | 0 | ppl | 2460346.8573 | ± | 81216.5655 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_openai_mt_es | 0 | ppl | 3818890.4545 | ± | 197999.0532 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_openai | 0 | ppl | 2817465.0925 | ± | 138319.0882 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_openai_mt_fr | 0 | ppl | 2111186.1155 | ± | 111724.4284 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_openai_mt_de | 0 | ppl | 1805613.6771 | ± | 97892.7891 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_standard_cloze | 0 | ppl | 6710057.2411 | ± | 169833.9100 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_openai_mt_en | 0 | ppl | 2817465.0925 | ± | 138319.0882 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_openai_cloze | 0 | ppl | 255777.7112 | ± | 11345.7710 |
|  |  | acc | 0.0004 | ± | 0.0003 |
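Perplexities in the millions suggest the scored continuations are off rather than the model being that bad, so the tokenization of the target word seems worth checking. A minimal sketch (placeholder checkpoint path again) of how the context/continuation split can be compared with the LLaMA SentencePiece tokenizer, which treats leading spaces differently from GPT-2-style byte-level BPE tokenizers:

```python
# Minimal sketch: compare tokenizing context and target jointly vs. separately; a mismatch
# here could mean every LAMBADA continuation is scored on the wrong tokens.
# "huggyllama/llama-7b" is a placeholder checkpoint path.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)

context = "He looked at the cover and handed her the"
target = " book"  # LAMBADA targets are single words preceded by a space

joint = tok.encode(context + target, add_special_tokens=False)
ctx = tok.encode(context, add_special_tokens=False)
cont = tok.encode(target, add_special_tokens=False)

print("continuation from joint encoding:", tok.convert_ids_to_tokens(joint[len(ctx):]))
print("continuation encoded on its own: ", tok.convert_ids_to_tokens(cont))
```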
Arithmetic
Another set of tasks that returns 0 accuracy.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| arithmetic_3ds | 0 | acc | 0 | ± | 0 |
| arithmetic_1dc | 0 | acc | 0 | ± | 0 |
| arithmetic_2da | 0 | acc | 0 | ± | 0 |
| arithmetic_4ds | 0 | acc | 0 | ± | 0 |
| arithmetic_3da | 0 | acc | 0 | ± | 0 |
| arithmetic_2ds | 0 | acc | 0 | ± | 0 |
| arithmetic_4da | 0 | acc | 0 | ± | 0 |
| arithmetic_5ds | 0 | acc | 0 | ± | 0 |
| arithmetic_2dm | 0 | acc | 0 | ± | 0 |
| arithmetic_5da | 0 | acc | 0 | ± | 0 |
BLIMP
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| blimp_npi_present_2 | 0 | acc | 0.530 | ± | 0.0158 |
| blimp_anaphor_gender_agreement | 0 | acc | 0.448 | ± | 0.0157 |
| blimp_causative | 0 | acc | 0.508 | ± | 0.0158 |
| blimp_existential_there_quantifiers_1 | 0 | acc | 0.683 | ± | 0.0147 |
| blimp_existential_there_quantifiers_2 | 0 | acc | 0.674 | ± | 0.0148 |
| blimp_existential_there_subject_raising | 0 | acc | 0.696 | ± | 0.0146 |
| blimp_principle_A_reconstruction | 0 | acc | 0.673 | ± | 0.0148 |
| blimp_principle_A_domain_3 | 0 | acc | 0.501 | ± | 0.0158 |
| blimp_sentential_subject_island | 0 | acc | 0.606 | ± | 0.0155 |
| blimp_superlative_quantifiers_2 | 0 | acc | 0.561 | ± | 0.0157 |
| blimp_complex_NP_island | 0 | acc | 0.416 | ± | 0.0156 |
| blimp_wh_island | 0 | acc | 0.275 | ± | 0.0141 |
| blimp_wh_vs_that_no_gap_long_distance | 0 | acc | 0.812 | ± | 0.0124 |
| blimp_principle_A_c_command | 0 | acc | 0.390 | ± | 0.0154 |
| blimp_sentential_negation_npi_scope | 0 | acc | 0.588 | ± | 0.0156 |
| blimp_principle_A_case_2 | 0 | acc | 0.554 | ± | 0.0157 |
| blimp_determiner_noun_agreement_2 | 0 | acc | 0.598 | ± | 0.0155 |
| blimp_left_branch_island_echo_question | 0 | acc | 0.835 | ± | 0.0117 |
| blimp_wh_vs_that_with_gap_long_distance | 0 | acc | 0.227 | ± | 0.0133 |
| blimp_determiner_noun_agreement_with_adjective_1 | 0 | acc | 0.577 | ± | 0.0156 |
| blimp_ellipsis_n_bar_1 | 0 | acc | 0.668 | ± | 0.0149 |
| blimp_wh_questions_subject_gap | 0 | acc | 0.720 | ± | 0.0142 |
| blimp_wh_questions_subject_gap_long_distance | 0 | acc | 0.746 | ± | 0.0138 |
| blimp_only_npi_scope | 0 | acc | 0.266 | ± | 0.0140 |
| blimp_coordinate_structure_constraint_complex_left_branch | 0 | acc | 0.682 | ± | 0.0147 |
| blimp_adjunct_island | 0 | acc | 0.539 | ± | 0.0158 |
| blimp_determiner_noun_agreement_irregular_1 | 0 | acc | 0.572 | ± | 0.0157 |
| blimp_expletive_it_object_raising | 0 | acc | 0.659 | ± | 0.0150 |
| blimp_npi_present_1 | 0 | acc | 0.534 | ± | 0.0158 |
| blimp_superlative_quantifiers_1 | 0 | acc | 0.612 | ± | 0.0154 |
| blimp_determiner_noun_agreement_with_adj_2 | 0 | acc | 0.540 | ± | 0.0158 |
| blimp_principle_A_domain_2 | 0 | acc | 0.646 | ± | 0.0151 |
| blimp_irregular_past_participle_adjectives | 0 | acc | 0.429 | ± | 0.0157 |
| blimp_regular_plural_subject_verb_agreement_1 | 0 | acc | 0.645 | ± | 0.0151 |
| blimp_transitive | 0 | acc | 0.698 | ± | 0.0145 |
| blimp_existential_there_object_raising | 0 | acc | 0.788 | ± | 0.0129 |
| blimp_distractor_agreement_relational_noun | 0 | acc | 0.441 | ± | 0.0157 |
| blimp_animate_subject_passive | 0 | acc | 0.626 | ± | 0.0153 |
| blimp_sentential_negation_npi_licensor_present | 0 | acc | 0.940 | ± | 0.0075 |
| blimp_only_npi_licensor_present | 0 | acc | 0.814 | ± | 0.0123 |
| blimp_irregular_plural_subject_verb_agreement_2 | 0 | acc | 0.700 | ± | 0.0145 |
| blimp_matrix_question_npi_licensor_present | 0 | acc | 0.117 | ± | 0.0102 |
| blimp_passive_2 | 0 | acc | 0.703 | ± | 0.0145 |
| blimp_tough_vs_raising_2 | 0 | acc | 0.768 | ± | 0.0134 |
| blimp_determiner_noun_agreement_with_adj_irregular_1 | 0 | acc | 0.563 | ± | 0.0157 |
| blimp_drop_argument | 0 | acc | 0.701 | ± | 0.0145 |
| blimp_wh_vs_that_no_gap | 0 | acc | 0.848 | ± | 0.0114 |
| blimp_wh_vs_that_with_gap | 0 | acc | 0.239 | ± | 0.0135 |
| blimp_left_branch_island_simple_question | 0 | acc | 0.740 | ± | 0.0139 |
| blimp_wh_questions_object_gap | 0 | acc | 0.670 | ± | 0.0149 |
| blimp_determiner_noun_agreement_1 | 0 | acc | 0.636 | ± | 0.0152 |
| blimp_determiner_noun_agreement_with_adj_irregular_2 | 0 | acc | 0.591 | ± | 0.0156 |
| blimp_tough_vs_raising_1 | 0 | acc | 0.298 | ± | 0.0145 |
| blimp_inchoative | 0 | acc | 0.420 | ± | 0.0156 |
| blimp_principle_A_case_1 | 0 | acc | 0.985 | ± | 0.0038 |
| blimp_animate_subject_trans | 0 | acc | 0.761 | ± | 0.0135 |
| blimp_intransitive | 0 | acc | 0.592 | ± | 0.0155 |
| blimp_anaphor_number_agreement | 0 | acc | 0.659 | ± | 0.0150 |
| blimp_distractor_agreement_relative_clause | 0 | acc | 0.314 | ± | 0.0147 |
| blimp_regular_plural_subject_verb_agreement_2 | 0 | acc | 0.705 | ± | 0.0144 |
| blimp_ellipsis_n_bar_2 | 0 | acc | 0.794 | ± | 0.0128 |
| blimp_irregular_plural_subject_verb_agreement_1 | 0 | acc | 0.653 | ± | 0.0151 |
| blimp_principle_A_domain_1 | 0 | acc | 0.962 | ± | 0.0060 |
| blimp_determiner_noun_agreement_irregular_2 | 0 | acc | 0.602 | ± | 0.0155 |
| blimp_coordinate_structure_constraint_object_extraction | 0 | acc | 0.629 | ± | 0.0153 |
| blimp_passive_1 | 0 | acc | 0.702 | ± | 0.0145 |
| blimp_irregular_past_participle_verbs | 0 | acc | 0.725 | ± | 0.0141 |
Human alignment
ETHICS, Toxigen and CrowsPairs
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| ethics_virtue | 0 | acc | 0.2098 | ± | 0.0058 |
|  |  | em | 0.0000 |  |  |
| crows_pairs_french_race_color | 0 | likelihood_difference | 12.0489 | ± | 0.7332 |
|  |  | pct_stereotype | 0.4326 | ± | 0.0231 |
| ethics_utilitarianism_original | 0 | acc | 0.9586 | ± | 0.0029 |
| crows_pairs_english_nationality | 0 | likelihood_difference | 6.7626 | ± | 0.5869 |
|  |  | pct_stereotype | 0.5370 | ± | 0.0340 |
| crows_pairs_english_socioeconomic | 0 | likelihood_difference | 6.4016 | ± | 0.5420 |
|  |  | pct_stereotype | 0.5684 | ± | 0.0360 |
| crows_pairs_french_socioeconomic | 0 | likelihood_difference | 9.8084 | ± | 1.0151 |
|  |  | pct_stereotype | 0.5204 | ± | 0.0358 |
| crows_pairs_english_religion | 0 | likelihood_difference | 7.2196 | ± | 0.7592 |
|  |  | pct_stereotype | 0.6667 | ± | 0.0449 |
| ethics_justice | 0 | acc | 0.4996 | ± | 0.0096 |
|  |  | em | 0.0015 |  |  |
| crows_pairs_english_autre | 0 | likelihood_difference | 11.0114 | ± | 5.8908 |
|  |  | pct_stereotype | 0.4545 | ± | 0.1575 |
| toxigen | 0 | acc | 0.4309 | ± | 0.0162 |
|  |  | acc_norm | 0.4319 | ± | 0.0162 |
| crows_pairs_french_autre | 0 | likelihood_difference | 7.5120 | ± | 2.0958 |
|  |  | pct_stereotype | 0.6154 | ± | 0.1404 |
| ethics_cm | 0 | acc | 0.5691 | ± | 0.0079 |
| crows_pairs_english_gender | 0 | likelihood_difference | 7.9174 | ± | 0.5502 |
|  |  | pct_stereotype | 0.5312 | ± | 0.0279 |
| crows_pairs_english_race_color | 0 | likelihood_difference | 6.2465 | ± | 0.3239 |
|  |  | pct_stereotype | 0.4665 | ± | 0.0222 |
| crows_pairs_english_age | 0 | likelihood_difference | 5.9423 | ± | 0.7903 |
|  |  | pct_stereotype | 0.5165 | ± | 0.0527 |
| ethics_utilitarianism | 0 | acc | 0.4981 | ± | 0.0072 |
| crows_pairs_english_sexual_orientation | 0 | likelihood_difference | 8.3048 | ± | 0.8428 |
|  |  | pct_stereotype | 0.6237 | ± | 0.0505 |
| ethics_deontology | 0 | acc | 0.5058 | ± | 0.0083 |
|  |  | em | 0.0022 |  |  |
| crows_pairs_french_religion | 0 | likelihood_difference | 9.5853 | ± | 0.8750 |
|  |  | pct_stereotype | 0.4348 | ± | 0.0464 |
| crows_pairs_french_gender | 0 | likelihood_difference | 11.7990 | ± | 0.8714 |
|  |  | pct_stereotype | 0.5202 | ± | 0.0279 |
| crows_pairs_french_nationality | 0 | likelihood_difference | 10.4165 | ± | 0.9066 |
|  |  | pct_stereotype | 0.4071 | ± | 0.0309 |
| crows_pairs_english_physical_appearance | 0 | likelihood_difference | 4.5126 | ± | 0.6932 |
|  |  | pct_stereotype | 0.5000 | ± | 0.0593 |
| crows_pairs_french_age | 0 | likelihood_difference | 11.9396 | ± | 1.5377 |
|  |  | pct_stereotype | 0.3556 | ± | 0.0507 |
| crows_pairs_english_disability | 0 | likelihood_difference | 9.6697 | ± | 1.1386 |
|  |  | pct_stereotype | 0.6615 | ± | 0.0591 |
| crows_pairs_french_sexual_orientation | 0 | likelihood_difference | 7.6058 | ± | 0.7939 |
|  |  | pct_stereotype | 0.6703 | ± | 0.0496 |
| crows_pairs_french_physical_appearance | 0 | likelihood_difference | 7.0451 | ± | 0.9484 |
|  |  | pct_stereotype | 0.5556 | ± | 0.0590 |
| crows_pairs_french_disability | 0 | likelihood_difference | 10.1477 | ± | 1.3907 |
|  |  | pct_stereotype | 0.4242 | ± | 0.0613 |
MMLU
MMLU results seem to be ok.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| hendrycksTest-high_school_geography | 0 | acc | 0.4293 | ± | 0.0353 |
|  |  | acc_norm | 0.3636 | ± | 0.0343 |
| hendrycksTest-philosophy | 0 | acc | 0.4019 | ± | 0.0278 |
|  |  | acc_norm | 0.3537 | ± | 0.0272 |
| hendrycksTest-world_religions | 0 | acc | 0.6257 | ± | 0.0371 |
|  |  | acc_norm | 0.5146 | ± | 0.0383 |
| hendrycksTest-college_biology | 0 | acc | 0.3194 | ± | 0.0390 |
|  |  | acc_norm | 0.2917 | ± | 0.0380 |
| hendrycksTest-electrical_engineering | 0 | acc | 0.3586 | ± | 0.0400 |
|  |  | acc_norm | 0.3241 | ± | 0.0390 |
| hendrycksTest-global_facts | 0 | acc | 0.3200 | ± | 0.0469 |
|  |  | acc_norm | 0.2900 | ± | 0.0456 |
| hendrycksTest-high_school_government_and_politics | 0 | acc | 0.4819 | ± | 0.0361 |
|  |  | acc_norm | 0.3731 | ± | 0.0349 |
| hendrycksTest-moral_scenarios | 0 | acc | 0.2760 | ± | 0.0150 |
|  |  | acc_norm | 0.2726 | ± | 0.0149 |
| hendrycksTest-econometrics | 0 | acc | 0.2895 | ± | 0.0427 |
|  |  | acc_norm | 0.2632 | ± | 0.0414 |
| hendrycksTest-international_law | 0 | acc | 0.3884 | ± | 0.0445 |
|  |  | acc_norm | 0.5785 | ± | 0.0451 |
| hendrycksTest-us_foreign_policy | 0 | acc | 0.5600 | ± | 0.0499 |
|  |  | acc_norm | 0.4500 | ± | 0.0500 |
| hendrycksTest-high_school_macroeconomics | 0 | acc | 0.3179 | ± | 0.0236 |
|  |  | acc_norm | 0.3026 | ± | 0.0233 |
| hendrycksTest-virology | 0 | acc | 0.3976 | ± | 0.0381 |
|  |  | acc_norm | 0.2892 | ± | 0.0353 |
| hendrycksTest-high_school_mathematics | 0 | acc | 0.2259 | ± | 0.0255 |
|  |  | acc_norm | 0.3074 | ± | 0.0281 |
| hendrycksTest-clinical_knowledge | 0 | acc | 0.3887 | ± | 0.0300 |
|  |  | acc_norm | 0.3811 | ± | 0.0299 |
| hendrycksTest-professional_psychology | 0 | acc | 0.3840 | ± | 0.0197 |
|  |  | acc_norm | 0.2990 | ± | 0.0185 |
| hendrycksTest-formal_logic | 0 | acc | 0.3095 | ± | 0.0413 |
|  |  | acc_norm | 0.3492 | ± | 0.0426 |
| hendrycksTest-management | 0 | acc | 0.4854 | ± | 0.0495 |
|  |  | acc_norm | 0.3689 | ± | 0.0478 |
| hendrycksTest-human_sexuality | 0 | acc | 0.5115 | ± | 0.0438 |
|  |  | acc_norm | 0.3664 | ± | 0.0423 |
| hendrycksTest-high_school_world_history | 0 | acc | 0.3924 | ± | 0.0318 |
|  |  | acc_norm | 0.3376 | ± | 0.0308 |
| hendrycksTest-medical_genetics | 0 | acc | 0.4400 | ± | 0.0499 |
|  |  | acc_norm | 0.4000 | ± | 0.0492 |
| hendrycksTest-computer_security | 0 | acc | 0.3700 | ± | 0.0485 |
|  |  | acc_norm | 0.4400 | ± | 0.0499 |
| hendrycksTest-miscellaneous | 0 | acc | 0.5837 | ± | 0.0176 |
|  |  | acc_norm | 0.3895 | ± | 0.0174 |
| hendrycksTest-public_relations | 0 | acc | 0.3909 | ± | 0.0467 |
|  |  | acc_norm | 0.2273 | ± | 0.0401 |
| hendrycksTest-college_physics | 0 | acc | 0.2353 | ± | 0.0422 |
|  |  | acc_norm | 0.3235 | ± | 0.0466 |
| hendrycksTest-professional_accounting | 0 | acc | 0.3014 | ± | 0.0274 |
|  |  | acc_norm | 0.2943 | ± | 0.0272 |
| hendrycksTest-logical_fallacies | 0 | acc | 0.3804 | ± | 0.0381 |
|  |  | acc_norm | 0.3497 | ± | 0.0375 |
| hendrycksTest-business_ethics | 0 | acc | 0.5300 | ± | 0.0502 |
|  |  | acc_norm | 0.4600 | ± | 0.0501 |
| hendrycksTest-high_school_chemistry | 0 | acc | 0.2512 | ± | 0.0305 |
|  |  | acc_norm | 0.2956 | ± | 0.0321 |
| hendrycksTest-astronomy | 0 | acc | 0.4539 | ± | 0.0405 |
|  |  | acc_norm | 0.4605 | ± | 0.0406 |
| hendrycksTest-high_school_us_history | 0 | acc | 0.4265 | ± | 0.0347 |
|  |  | acc_norm | 0.3137 | ± | 0.0326 |
| hendrycksTest-college_chemistry | 0 | acc | 0.3300 | ± | 0.0473 |
|  |  | acc_norm | 0.3000 | ± | 0.0461 |
| hendrycksTest-abstract_algebra | 0 | acc | 0.2300 | ± | 0.0423 |
|  |  | acc_norm | 0.2600 | ± | 0.0441 |
| hendrycksTest-moral_disputes | 0 | acc | 0.3642 | ± | 0.0259 |
|  |  | acc_norm | 0.3324 | ± | 0.0254 |
| hendrycksTest-college_computer_science | 0 | acc | 0.3300 | ± | 0.0473 |
|  |  | acc_norm | 0.2800 | ± | 0.0451 |
| hendrycksTest-professional_law | 0 | acc | 0.2966 | ± | 0.0117 |
|  |  | acc_norm | 0.2855 | ± | 0.0115 |
| hendrycksTest-college_mathematics | 0 | acc | 0.3200 | ± | 0.0469 |
|  |  | acc_norm | 0.3200 | ± | 0.0469 |
| hendrycksTest-high_school_microeconomics | 0 | acc | 0.3866 | ± | 0.0316 |
|  |  | acc_norm | 0.3655 | ± | 0.0313 |
| hendrycksTest-high_school_european_history | 0 | acc | 0.4061 | ± | 0.0383 |
|  |  | acc_norm | 0.3697 | ± | 0.0377 |
| hendrycksTest-high_school_biology | 0 | acc | 0.3581 | ± | 0.0273 |
|  |  | acc_norm | 0.3581 | ± | 0.0273 |
| hendrycksTest-security_studies | 0 | acc | 0.4082 | ± | 0.0315 |
|  |  | acc_norm | 0.3102 | ± | 0.0296 |
| hendrycksTest-high_school_psychology | 0 | acc | 0.4661 | ± | 0.0214 |
|  |  | acc_norm | 0.3083 | ± | 0.0198 |
| hendrycksTest-conceptual_physics | 0 | acc | 0.3277 | ± | 0.0307 |
|  |  | acc_norm | 0.2170 | ± | 0.0269 |
| hendrycksTest-human_aging | 0 | acc | 0.3722 | ± | 0.0324 |
|  |  | acc_norm | 0.2511 | ± | 0.0291 |
| hendrycksTest-prehistory | 0 | acc | 0.4012 | ± | 0.0273 |
|  |  | acc_norm | 0.2778 | ± | 0.0249 |
| hendrycksTest-sociology | 0 | acc | 0.4776 | ± | 0.0353 |
|  |  | acc_norm | 0.4279 | ± | 0.0350 |
| hendrycksTest-marketing | 0 | acc | 0.6111 | ± | 0.0319 |
|  |  | acc_norm | 0.5043 | ± | 0.0328 |
| hendrycksTest-high_school_computer_science | 0 | acc | 0.4100 | ± | 0.0494 |
|  |  | acc_norm | 0.3400 | ± | 0.0476 |
| hendrycksTest-machine_learning | 0 | acc | 0.3036 | ± | 0.0436 |
|  |  | acc_norm | 0.2679 | ± | 0.0420 |
| hendrycksTest-elementary_mathematics | 0 | acc | 0.3201 | ± | 0.0240 |
|  |  | acc_norm | 0.2910 | ± | 0.0234 |
| hendrycksTest-nutrition | 0 | acc | 0.3954 | ± | 0.0280 |
|  |  | acc_norm | 0.4379 | ± | 0.0284 |
| hendrycksTest-anatomy | 0 | acc | 0.3852 | ± | 0.0420 |
|  |  | acc_norm | 0.2815 | ± | 0.0389 |
| hendrycksTest-jurisprudence | 0 | acc | 0.4352 | ± | 0.0479 |
|  |  | acc_norm | 0.5000 | ± | 0.0483 |
| hendrycksTest-college_medicine | 0 | acc | 0.3757 | ± | 0.0369 |
|  |  | acc_norm | 0.3064 | ± | 0.0351 |
| hendrycksTest-high_school_statistics | 0 | acc | 0.3426 | ± | 0.0324 |
|  |  | acc_norm | 0.3426 | ± | 0.0324 |
| hendrycksTest-high_school_physics | 0 | acc | 0.2053 | ± | 0.0330 |
|  |  | acc_norm | 0.2715 | ± | 0.0363 |
| hendrycksTest-professional_medicine | 0 | acc | 0.3382 | ± | 0.0287 |
|  |  | acc_norm | 0.2794 | ± | 0.0273 |