I have evaluated LLaMA (7B, 13B and 30B) on most of the tasks available in this library, and the results are bad for some tasks. I will give some examples with the 7B model. I haven't checked all the results yet; I am posting them here so that we can fix the problems we find. I can share more results and configs if you need more information.
This is the script that I have used for evaluation.
```bash
# model_names, tasks_selected, tasks (an associative array mapping a group name to a
# comma-separated task list) and num_fewshot are defined earlier in the full script.
for model_name in "${model_names[@]}"; do
  for group_name in "${tasks_selected[@]}"; do
    srun python3 lm-evaluation-harness/main.py \
      --model hf-causal-experimental \
      --model_args pretrained=$model_name,use_accelerate=True \
      --tasks ${tasks[${group_name}]} \
      --device cuda \
      --output_path results/llama-${model_name:48}_${group_name}_${num_fewshot}-shot.json \
      --batch_size auto \
      --no_cache \
      --num_fewshot ${num_fewshot}
  done
done
```
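For debugging a single failing task on a few documents, something like the following should also work from Python (a minimal sketch, assuming this version of the harness exposes `lm_eval.evaluator.simple_evaluate` with these arguments; `huggyllama/llama-7b` is a placeholder for the actual checkpoint path used above):

```python
# Minimal sketch: score one task on a handful of documents from Python instead of the CLI.
# Assumes lm_eval.evaluator.simple_evaluate accepts these arguments in this version;
# "huggyllama/llama-7b" is a placeholder checkpoint path.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal-experimental",
    model_args="pretrained=huggyllama/llama-7b,use_accelerate=True",
    tasks=["lambada_openai"],
    num_fewshot=0,
    batch_size=1,
    device="cuda",
    no_cache=True,
    limit=50,  # only the first 50 documents, enough to see whether acc is stuck at 0
)
print(results["results"])
```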
Common Sense Reasoning
Results are similar to the paper, though generally a bit lower, which is expected because of differences in prompts. Some exceptions are ARC and openbookqa, where the results are much lower.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| piqa | 0 | acc | 0.7818 | ± | 0.0096 |
|  |  | acc_norm | 0.7742 | ± | 0.0098 |
| wsc273 | 0 | acc | 0.8095 | ± | 0.0238 |
| arc_easy | 0 | acc | 0.6738 | ± | 0.0096 |
|  |  | acc_norm | 0.5248 | ± | 0.0102 |
| hellaswag | 0 | acc | 0.5639 | ± | 0.0049 |
|  |  | acc_norm | 0.7298 | ± | 0.0044 |
| winogrande | 0 | acc | 0.6693 | ± | 0.0132 |
| prost | 0 | acc | 0.2569 | ± | 0.0032 |
|  |  | acc_norm | 0.2803 | ± | 0.0033 |
| swag | 0 | acc | 0.5547 | ± | 0.0035 |
|  |  | acc_norm | 0.6687 | ± | 0.0033 |
| boolq | 1 | acc | 0.7306 | ± | 0.0078 |
| arc_challenge | 0 | acc | 0.3823 | ± | 0.0142 |
|  |  | acc_norm | 0.4138 | ± | 0.0144 |
| mc_taco | 0 | em | 0.1126 |  |  |
|  |  | f1 | 0.4827 |  |  |
| copa | 0 | acc | 0.8400 | ± | 0.0368 |
| openbookqa | 0 | acc | 0.2820 | ± | 0.0201 |
|  |  | acc_norm | 0.4240 | ± | 0.0221 |
Mathematical Reasoning
Very low accuracies are obtained, 0 in some cases. GSM8K and MATH results are much lower than in the paper. A minimal way to inspect raw generations outside the harness is sketched after the table.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| mathqa | 0 | acc | 0.2677 | ± | 0.0081 |
|  |  | acc_norm | 0.2787 | ± | 0.0082 |
| math_asdiv | 0 | acc | 0.0000 | ± | 0.0000 |
| gsm8k | 0 | acc | 0.0000 | ± | 0.0000 |
| math_num_theory | 1 | acc | 0.0074 | ± | 0.0037 |
| math_precalc | 1 | acc | 0.0037 | ± | 0.0026 |
| drop | 1 | em | 0.0427 | ± | 0.0021 |
|  |  | f1 | 0.1216 | ± | 0.0025 |
| math_geometry | 1 | acc | 0.0084 | ± | 0.0042 |
| math_counting_and_prob | 1 | acc | 0.0169 | ± | 0.0059 |
| math_intermediate_algebra | 1 | acc | 0.0066 | ± | 0.0027 |
| math_prealgebra | 1 | acc | 0.0126 | ± | 0.0038 |
| math_algebra | 1 | acc | 0.0168 | ± | 0.0037 |
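Since gsm8k (and drop) are scored by exact match over generated text, an exact 0 could also come from the generation or answer-extraction step rather than from the model itself. Here is a minimal sketch of how one might eyeball the raw continuation outside the harness, using plain `transformers` greedy decoding; `huggyllama/llama-7b` is a placeholder path and the prompt format is not the harness's exact template:

```python
# Minimal sketch (not the harness's code): greedy-decode a GSM8K-style prompt and print
# the raw continuation, to check whether stop-sequence / answer-extraction handling is
# the reason exact match comes out as 0. "huggyllama/llama-7b" is a placeholder path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

prompt = ("Question: Natalia sold clips to 48 of her friends in April, and then she sold "
          "half as many clips in May. How many clips did Natalia sell altogether in April and May?\nAnswer:")
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Print only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```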
Reading Comprehension
RACE results are much lower than in the paper.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| coqa | 1 | f1 | 0.7521 | ± | 0.0153 |
|  |  | em | 0.6267 | ± | 0.0188 |
| drop | 1 | em | 0.0359 | ± | 0.0019 |
|  |  | f1 | 0.1135 | ± | 0.0023 |
| race | 1 | acc | 0.3990 | ± | 0.0152 |
Question Answering
Accuracy is 0 for TriviaQA and webqs.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| webqs | 0 | acc | 0.0000 | ± | 0.0000 |
| truthfulqa_mc | 1 | mc1 | 0.2105 | ± | 0.0143 |
|  |  | mc2 | 0.3414 | ± | 0.0131 |
| headqa_en | 0 | acc | 0.3242 | ± | 0.0089 |
|  |  | acc_norm | 0.3592 | ± | 0.0092 |
| triviaqa | 1 | acc | 0.0000 | ± | 0.0000 |
| headqa_es | 0 | acc | 0.2826 | ± | 0.0086 |
|  |  | acc_norm | 0.3242 | ± | 0.0089 |
| logiqa | 0 | acc | 0.2181 | ± | 0.0162 |
|  |  | acc_norm | 0.3026 | ± | 0.0180 |
| squad2 | 1 | exact | 9.4163 |  |  |
|  |  | f1 | 19.4490 |  |  |
|  |  | HasAns_exact | 18.4885 |  |  |
|  |  | HasAns_f1 | 38.5827 |  |  |
|  |  | NoAns_exact | 0.3701 |  |  |
|  |  | NoAns_f1 | 0.3701 |  |  |
|  |  | best_exact | 50.0716 |  |  |
|  |  | best_f1 | 50.0801 |  |  |
LAMBADA
LAMBADA does not work properly; 0 accuracy is obtained. A quick tokenization check is sketched after the table.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| lambada_openai_mt_it | 0 | ppl | 3653680.5734 | ± | 197082.9861 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_standard | 0 | ppl | 2460346.8573 | ± | 81216.5655 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_openai_mt_es | 0 | ppl | 3818890.4545 | ± | 197999.0532 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_openai | 0 | ppl | 2817465.0925 | ± | 138319.0882 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_openai_mt_fr | 0 | ppl | 2111186.1155 | ± | 111724.4284 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_openai_mt_de | 0 | ppl | 1805613.6771 | ± | 97892.7891 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_standard_cloze | 0 | ppl | 6710057.2411 | ± | 169833.9100 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_openai_mt_en | 0 | ppl | 2817465.0925 | ± | 138319.0882 |
|  |  | acc | 0.0000 | ± | 0.0000 |
| lambada_openai_cloze | 0 | ppl | 255777.7112 | ± | 11345.7710 |
|  |  | acc | 0.0004 | ± | 0.0003 |
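Perplexities in the millions suggest the scored continuations are off rather than the model being that bad, so the tokenization of the target word seems worth checking. A minimal sketch (placeholder checkpoint path again) of how the context/continuation split can be compared with the LLaMA SentencePiece tokenizer, which treats leading spaces differently from GPT-2-style byte-level BPE tokenizers:

```python
# Minimal sketch: compare tokenizing context and target jointly vs. separately; a mismatch
# here could mean every LAMBADA continuation is scored on the wrong tokens.
# "huggyllama/llama-7b" is a placeholder checkpoint path.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)

context = "He looked at the cover and handed her the"
target = " book"  # LAMBADA targets are single words preceded by a space

joint = tok.encode(context + target, add_special_tokens=False)
ctx = tok.encode(context, add_special_tokens=False)
cont = tok.encode(target, add_special_tokens=False)

print("continuation from joint encoding:", tok.convert_ids_to_tokens(joint[len(ctx):]))
print("continuation encoded on its own: ", tok.convert_ids_to_tokens(cont))
```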
Arithmetic
Another set of tasks that returns 0 accuracy.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| arithmetic_3ds | 0 | acc | 0 | ± | 0 |
| arithmetic_1dc | 0 | acc | 0 | ± | 0 |
| arithmetic_2da | 0 | acc | 0 | ± | 0 |
| arithmetic_4ds | 0 | acc | 0 | ± | 0 |
| arithmetic_3da | 0 | acc | 0 | ± | 0 |
| arithmetic_2ds | 0 | acc | 0 | ± | 0 |
| arithmetic_4da | 0 | acc | 0 | ± | 0 |
| arithmetic_5ds | 0 | acc | 0 | ± | 0 |
| arithmetic_2dm | 0 | acc | 0 | ± | 0 |
| arithmetic_5da | 0 | acc | 0 | ± | 0 |
BLIMP
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| blimp_npi_present_2 | 0 | acc | 0.530 | ± | 0.0158 |
| blimp_anaphor_gender_agreement | 0 | acc | 0.448 | ± | 0.0157 |
| blimp_causative | 0 | acc | 0.508 | ± | 0.0158 |
| blimp_existential_there_quantifiers_1 | 0 | acc | 0.683 | ± | 0.0147 |
| blimp_existential_there_quantifiers_2 | 0 | acc | 0.674 | ± | 0.0148 |
| blimp_existential_there_subject_raising | 0 | acc | 0.696 | ± | 0.0146 |
| blimp_principle_A_reconstruction | 0 | acc | 0.673 | ± | 0.0148 |
| blimp_principle_A_domain_3 | 0 | acc | 0.501 | ± | 0.0158 |
| blimp_sentential_subject_island | 0 | acc | 0.606 | ± | 0.0155 |
| blimp_superlative_quantifiers_2 | 0 | acc | 0.561 | ± | 0.0157 |
| blimp_complex_NP_island | 0 | acc | 0.416 | ± | 0.0156 |
| blimp_wh_island | 0 | acc | 0.275 | ± | 0.0141 |
| blimp_wh_vs_that_no_gap_long_distance | 0 | acc | 0.812 | ± | 0.0124 |
| blimp_principle_A_c_command | 0 | acc | 0.390 | ± | 0.0154 |
| blimp_sentential_negation_npi_scope | 0 | acc | 0.588 | ± | 0.0156 |
| blimp_principle_A_case_2 | 0 | acc | 0.554 | ± | 0.0157 |
| blimp_determiner_noun_agreement_2 | 0 | acc | 0.598 | ± | 0.0155 |
| blimp_left_branch_island_echo_question | 0 | acc | 0.835 | ± | 0.0117 |
| blimp_wh_vs_that_with_gap_long_distance | 0 | acc | 0.227 | ± | 0.0133 |
| blimp_determiner_noun_agreement_with_adjective_1 | 0 | acc | 0.577 | ± | 0.0156 |
| blimp_ellipsis_n_bar_1 | 0 | acc | 0.668 | ± | 0.0149 |
| blimp_wh_questions_subject_gap | 0 | acc | 0.720 | ± | 0.0142 |
| blimp_wh_questions_subject_gap_long_distance | 0 | acc | 0.746 | ± | 0.0138 |
| blimp_only_npi_scope | 0 | acc | 0.266 | ± | 0.0140 |
| blimp_coordinate_structure_constraint_complex_left_branch | 0 | acc | 0.682 | ± | 0.0147 |
| blimp_adjunct_island | 0 | acc | 0.539 | ± | 0.0158 |
| blimp_determiner_noun_agreement_irregular_1 | 0 | acc | 0.572 | ± | 0.0157 |
| blimp_expletive_it_object_raising | 0 | acc | 0.659 | ± | 0.0150 |
| blimp_npi_present_1 | 0 | acc | 0.534 | ± | 0.0158 |
| blimp_superlative_quantifiers_1 | 0 | acc | 0.612 | ± | 0.0154 |
| blimp_determiner_noun_agreement_with_adj_2 | 0 | acc | 0.540 | ± | 0.0158 |
| blimp_principle_A_domain_2 | 0 | acc | 0.646 | ± | 0.0151 |
| blimp_irregular_past_participle_adjectives | 0 | acc | 0.429 | ± | 0.0157 |
| blimp_regular_plural_subject_verb_agreement_1 | 0 | acc | 0.645 | ± | 0.0151 |
| blimp_transitive | 0 | acc | 0.698 | ± | 0.0145 |
| blimp_existential_there_object_raising | 0 | acc | 0.788 | ± | 0.0129 |
| blimp_distractor_agreement_relational_noun | 0 | acc | 0.441 | ± | 0.0157 |
| blimp_animate_subject_passive | 0 | acc | 0.626 | ± | 0.0153 |
| blimp_sentential_negation_npi_licensor_present | 0 | acc | 0.940 | ± | 0.0075 |
| blimp_only_npi_licensor_present | 0 | acc | 0.814 | ± | 0.0123 |
| blimp_irregular_plural_subject_verb_agreement_2 | 0 | acc | 0.700 | ± | 0.0145 |
| blimp_matrix_question_npi_licensor_present | 0 | acc | 0.117 | ± | 0.0102 |
| blimp_passive_2 | 0 | acc | 0.703 | ± | 0.0145 |
| blimp_tough_vs_raising_2 | 0 | acc | 0.768 | ± | 0.0134 |
| blimp_determiner_noun_agreement_with_adj_irregular_1 | 0 | acc | 0.563 | ± | 0.0157 |
| blimp_drop_argument | 0 | acc | 0.701 | ± | 0.0145 |
| blimp_wh_vs_that_no_gap | 0 | acc | 0.848 | ± | 0.0114 |
| blimp_wh_vs_that_with_gap | 0 | acc | 0.239 | ± | 0.0135 |
| blimp_left_branch_island_simple_question | 0 | acc | 0.740 | ± | 0.0139 |
| blimp_wh_questions_object_gap | 0 | acc | 0.670 | ± | 0.0149 |
| blimp_determiner_noun_agreement_1 | 0 | acc | 0.636 | ± | 0.0152 |
| blimp_determiner_noun_agreement_with_adj_irregular_2 | 0 | acc | 0.591 | ± | 0.0156 |
| blimp_tough_vs_raising_1 | 0 | acc | 0.298 | ± | 0.0145 |
| blimp_inchoative | 0 | acc | 0.420 | ± | 0.0156 |
| blimp_principle_A_case_1 | 0 | acc | 0.985 | ± | 0.0038 |
| blimp_animate_subject_trans | 0 | acc | 0.761 | ± | 0.0135 |
| blimp_intransitive | 0 | acc | 0.592 | ± | 0.0155 |
| blimp_anaphor_number_agreement | 0 | acc | 0.659 | ± | 0.0150 |
| blimp_distractor_agreement_relative_clause | 0 | acc | 0.314 | ± | 0.0147 |
| blimp_regular_plural_subject_verb_agreement_2 | 0 | acc | 0.705 | ± | 0.0144 |
| blimp_ellipsis_n_bar_2 | 0 | acc | 0.794 | ± | 0.0128 |
| blimp_irregular_plural_subject_verb_agreement_1 | 0 | acc | 0.653 | ± | 0.0151 |
| blimp_principle_A_domain_1 | 0 | acc | 0.962 | ± | 0.0060 |
| blimp_determiner_noun_agreement_irregular_2 | 0 | acc | 0.602 | ± | 0.0155 |
| blimp_coordinate_structure_constraint_object_extraction | 0 | acc | 0.629 | ± | 0.0153 |
| blimp_passive_1 | 0 | acc | 0.702 | ± | 0.0145 |
| blimp_irregular_past_participle_verbs | 0 | acc | 0.725 | ± | 0.0141 |
Human alignment
ETHICS, Toxigen and CrowsPairs
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| ethics_virtue | 0 | acc | 0.2098 | ± | 0.0058 |
|  |  | em | 0.0000 |  |  |
| crows_pairs_french_race_color | 0 | likelihood_difference | 12.0489 | ± | 0.7332 |
|  |  | pct_stereotype | 0.4326 | ± | 0.0231 |
| ethics_utilitarianism_original | 0 | acc | 0.9586 | ± | 0.0029 |
| crows_pairs_english_nationality | 0 | likelihood_difference | 6.7626 | ± | 0.5869 |
|  |  | pct_stereotype | 0.5370 | ± | 0.0340 |
| crows_pairs_english_socioeconomic | 0 | likelihood_difference | 6.4016 | ± | 0.5420 |
|  |  | pct_stereotype | 0.5684 | ± | 0.0360 |
| crows_pairs_french_socioeconomic | 0 | likelihood_difference | 9.8084 | ± | 1.0151 |
|  |  | pct_stereotype | 0.5204 | ± | 0.0358 |
| crows_pairs_english_religion | 0 | likelihood_difference | 7.2196 | ± | 0.7592 |
|  |  | pct_stereotype | 0.6667 | ± | 0.0449 |
| ethics_justice | 0 | acc | 0.4996 | ± | 0.0096 |
|  |  | em | 0.0015 |  |  |
| crows_pairs_english_autre | 0 | likelihood_difference | 11.0114 | ± | 5.8908 |
|  |  | pct_stereotype | 0.4545 | ± | 0.1575 |
| toxigen | 0 | acc | 0.4309 | ± | 0.0162 |
|  |  | acc_norm | 0.4319 | ± | 0.0162 |
| crows_pairs_french_autre | 0 | likelihood_difference | 7.5120 | ± | 2.0958 |
|  |  | pct_stereotype | 0.6154 | ± | 0.1404 |
| ethics_cm | 0 | acc | 0.5691 | ± | 0.0079 |
| crows_pairs_english_gender | 0 | likelihood_difference | 7.9174 | ± | 0.5502 |
|  |  | pct_stereotype | 0.5312 | ± | 0.0279 |
| crows_pairs_english_race_color | 0 | likelihood_difference | 6.2465 | ± | 0.3239 |
|  |  | pct_stereotype | 0.4665 | ± | 0.0222 |
| crows_pairs_english_age | 0 | likelihood_difference | 5.9423 | ± | 0.7903 |
|  |  | pct_stereotype | 0.5165 | ± | 0.0527 |
| ethics_utilitarianism | 0 | acc | 0.4981 | ± | 0.0072 |
| crows_pairs_english_sexual_orientation | 0 | likelihood_difference | 8.3048 | ± | 0.8428 |
|  |  | pct_stereotype | 0.6237 | ± | 0.0505 |
| ethics_deontology | 0 | acc | 0.5058 | ± | 0.0083 |
|  |  | em | 0.0022 |  |  |
| crows_pairs_french_religion | 0 | likelihood_difference | 9.5853 | ± | 0.8750 |
|  |  | pct_stereotype | 0.4348 | ± | 0.0464 |
| crows_pairs_french_gender | 0 | likelihood_difference | 11.7990 | ± | 0.8714 |
|  |  | pct_stereotype | 0.5202 | ± | 0.0279 |
| crows_pairs_french_nationality | 0 | likelihood_difference | 10.4165 | ± | 0.9066 |
|  |  | pct_stereotype | 0.4071 | ± | 0.0309 |
| crows_pairs_english_physical_appearance | 0 | likelihood_difference | 4.5126 | ± | 0.6932 |
|  |  | pct_stereotype | 0.5000 | ± | 0.0593 |
| crows_pairs_french_age | 0 | likelihood_difference | 11.9396 | ± | 1.5377 |
|  |  | pct_stereotype | 0.3556 | ± | 0.0507 |
| crows_pairs_english_disability | 0 | likelihood_difference | 9.6697 | ± | 1.1386 |
|  |  | pct_stereotype | 0.6615 | ± | 0.0591 |
| crows_pairs_french_sexual_orientation | 0 | likelihood_difference | 7.6058 | ± | 0.7939 |
|  |  | pct_stereotype | 0.6703 | ± | 0.0496 |
| crows_pairs_french_physical_appearance | 0 | likelihood_difference | 7.0451 | ± | 0.9484 |
|  |  | pct_stereotype | 0.5556 | ± | 0.0590 |
| crows_pairs_french_disability | 0 | likelihood_difference | 10.1477 | ± | 1.3907 |
|  |  | pct_stereotype | 0.4242 | ± | 0.0613 |
MMLU
MMLU results seem to be ok.
| Task | Version | Metric | Value |   | Stderr |
|---|---|---|---|---|---|
| hendrycksTest-high_school_geography | 0 | acc | 0.4293 | ± | 0.0353 |
|  |  | acc_norm | 0.3636 | ± | 0.0343 |
| hendrycksTest-philosophy | 0 | acc | 0.4019 | ± | 0.0278 |
|  |  | acc_norm | 0.3537 | ± | 0.0272 |
| hendrycksTest-world_religions | 0 | acc | 0.6257 | ± | 0.0371 |
|  |  | acc_norm | 0.5146 | ± | 0.0383 |
| hendrycksTest-college_biology | 0 | acc | 0.3194 | ± | 0.0390 |
|  |  | acc_norm | 0.2917 | ± | 0.0380 |
| hendrycksTest-electrical_engineering | 0 | acc | 0.3586 | ± | 0.0400 |
|  |  | acc_norm | 0.3241 | ± | 0.0390 |
| hendrycksTest-global_facts | 0 | acc | 0.3200 | ± | 0.0469 |
|  |  | acc_norm | 0.2900 | ± | 0.0456 |
| hendrycksTest-high_school_government_and_politics | 0 | acc | 0.4819 | ± | 0.0361 |
|  |  | acc_norm | 0.3731 | ± | 0.0349 |
| hendrycksTest-moral_scenarios | 0 | acc | 0.2760 | ± | 0.0150 |
|  |  | acc_norm | 0.2726 | ± | 0.0149 |
| hendrycksTest-econometrics | 0 | acc | 0.2895 | ± | 0.0427 |
|  |  | acc_norm | 0.2632 | ± | 0.0414 |
| hendrycksTest-international_law | 0 | acc | 0.3884 | ± | 0.0445 |
|  |  | acc_norm | 0.5785 | ± | 0.0451 |
| hendrycksTest-us_foreign_policy | 0 | acc | 0.5600 | ± | 0.0499 |
|  |  | acc_norm | 0.4500 | ± | 0.0500 |
| hendrycksTest-high_school_macroeconomics | 0 | acc | 0.3179 | ± | 0.0236 |
|  |  | acc_norm | 0.3026 | ± | 0.0233 |
| hendrycksTest-virology | 0 | acc | 0.3976 | ± | 0.0381 |
|  |  | acc_norm | 0.2892 | ± | 0.0353 |
| hendrycksTest-high_school_mathematics | 0 | acc | 0.2259 | ± | 0.0255 |
|  |  | acc_norm | 0.3074 | ± | 0.0281 |
| hendrycksTest-clinical_knowledge | 0 | acc | 0.3887 | ± | 0.0300 |
|  |  | acc_norm | 0.3811 | ± | 0.0299 |
| hendrycksTest-professional_psychology | 0 | acc | 0.3840 | ± | 0.0197 |
|  |  | acc_norm | 0.2990 | ± | 0.0185 |
| hendrycksTest-formal_logic | 0 | acc | 0.3095 | ± | 0.0413 |
|  |  | acc_norm | 0.3492 | ± | 0.0426 |
| hendrycksTest-management | 0 | acc | 0.4854 | ± | 0.0495 |
|  |  | acc_norm | 0.3689 | ± | 0.0478 |
| hendrycksTest-human_sexuality | 0 | acc | 0.5115 | ± | 0.0438 |
|  |  | acc_norm | 0.3664 | ± | 0.0423 |
| hendrycksTest-high_school_world_history | 0 | acc | 0.3924 | ± | 0.0318 |
|  |  | acc_norm | 0.3376 | ± | 0.0308 |
| hendrycksTest-medical_genetics | 0 | acc | 0.4400 | ± | 0.0499 |
|  |  | acc_norm | 0.4000 | ± | 0.0492 |
| hendrycksTest-computer_security | 0 | acc | 0.3700 | ± | 0.0485 |
|  |  | acc_norm | 0.4400 | ± | 0.0499 |
| hendrycksTest-miscellaneous | 0 | acc | 0.5837 | ± | 0.0176 |
|  |  | acc_norm | 0.3895 | ± | 0.0174 |
| hendrycksTest-public_relations | 0 | acc | 0.3909 | ± | 0.0467 |
|  |  | acc_norm | 0.2273 | ± | 0.0401 |
| hendrycksTest-college_physics | 0 | acc | 0.2353 | ± | 0.0422 |
|  |  | acc_norm | 0.3235 | ± | 0.0466 |
| hendrycksTest-professional_accounting | 0 | acc | 0.3014 | ± | 0.0274 |
|  |  | acc_norm | 0.2943 | ± | 0.0272 |
| hendrycksTest-logical_fallacies | 0 | acc | 0.3804 | ± | 0.0381 |
|  |  | acc_norm | 0.3497 | ± | 0.0375 |
| hendrycksTest-business_ethics | 0 | acc | 0.5300 | ± | 0.0502 |
|  |  | acc_norm | 0.4600 | ± | 0.0501 |
| hendrycksTest-high_school_chemistry | 0 | acc | 0.2512 | ± | 0.0305 |
|  |  | acc_norm | 0.2956 | ± | 0.0321 |
| hendrycksTest-astronomy | 0 | acc | 0.4539 | ± | 0.0405 |
|  |  | acc_norm | 0.4605 | ± | 0.0406 |
| hendrycksTest-high_school_us_history | 0 | acc | 0.4265 | ± | 0.0347 |
|  |  | acc_norm | 0.3137 | ± | 0.0326 |
| hendrycksTest-college_chemistry | 0 | acc | 0.3300 | ± | 0.0473 |
|  |  | acc_norm | 0.3000 | ± | 0.0461 |
| hendrycksTest-abstract_algebra | 0 | acc | 0.2300 | ± | 0.0423 |
|  |  | acc_norm | 0.2600 | ± | 0.0441 |
| hendrycksTest-moral_disputes | 0 | acc | 0.3642 | ± | 0.0259 |
|  |  | acc_norm | 0.3324 | ± | 0.0254 |
| hendrycksTest-college_computer_science | 0 | acc | 0.3300 | ± | 0.0473 |
|  |  | acc_norm | 0.2800 | ± | 0.0451 |
| hendrycksTest-professional_law | 0 | acc | 0.2966 | ± | 0.0117 |
|  |  | acc_norm | 0.2855 | ± | 0.0115 |
| hendrycksTest-college_mathematics | 0 | acc | 0.3200 | ± | 0.0469 |
|  |  | acc_norm | 0.3200 | ± | 0.0469 |
| hendrycksTest-high_school_microeconomics | 0 | acc | 0.3866 | ± | 0.0316 |
|  |  | acc_norm | 0.3655 | ± | 0.0313 |
| hendrycksTest-high_school_european_history | 0 | acc | 0.4061 | ± | 0.0383 |
|  |  | acc_norm | 0.3697 | ± | 0.0377 |
| hendrycksTest-high_school_biology | 0 | acc | 0.3581 | ± | 0.0273 |
|  |  | acc_norm | 0.3581 | ± | 0.0273 |
| hendrycksTest-security_studies | 0 | acc | 0.4082 | ± | 0.0315 |
|  |  | acc_norm | 0.3102 | ± | 0.0296 |
| hendrycksTest-high_school_psychology | 0 | acc | 0.4661 | ± | 0.0214 |
|  |  | acc_norm | 0.3083 | ± | 0.0198 |
| hendrycksTest-conceptual_physics | 0 | acc | 0.3277 | ± | 0.0307 |
|  |  | acc_norm | 0.2170 | ± | 0.0269 |
| hendrycksTest-human_aging | 0 | acc | 0.3722 | ± | 0.0324 |
|  |  | acc_norm | 0.2511 | ± | 0.0291 |
| hendrycksTest-prehistory | 0 | acc | 0.4012 | ± | 0.0273 |
|  |  | acc_norm | 0.2778 | ± | 0.0249 |
| hendrycksTest-sociology | 0 | acc | 0.4776 | ± | 0.0353 |
|  |  | acc_norm | 0.4279 | ± | 0.0350 |
| hendrycksTest-marketing | 0 | acc | 0.6111 | ± | 0.0319 |
|  |  | acc_norm | 0.5043 | ± | 0.0328 |
| hendrycksTest-high_school_computer_science | 0 | acc | 0.4100 | ± | 0.0494 |
|  |  | acc_norm | 0.3400 | ± | 0.0476 |
| hendrycksTest-machine_learning | 0 | acc | 0.3036 | ± | 0.0436 |
|  |  | acc_norm | 0.2679 | ± | 0.0420 |
| hendrycksTest-elementary_mathematics | 0 | acc | 0.3201 | ± | 0.0240 |
|  |  | acc_norm | 0.2910 | ± | 0.0234 |
| hendrycksTest-nutrition | 0 | acc | 0.3954 | ± | 0.0280 |
|  |  | acc_norm | 0.4379 | ± | 0.0284 |
| hendrycksTest-anatomy | 0 | acc | 0.3852 | ± | 0.0420 |
|  |  | acc_norm | 0.2815 | ± | 0.0389 |
| hendrycksTest-jurisprudence | 0 | acc | 0.4352 | ± | 0.0479 |
|  |  | acc_norm | 0.5000 | ± | 0.0483 |
| hendrycksTest-college_medicine | 0 | acc | 0.3757 | ± | 0.0369 |
|  |  | acc_norm | 0.3064 | ± | 0.0351 |
| hendrycksTest-high_school_statistics | 0 | acc | 0.3426 | ± | 0.0324 |
|  |  | acc_norm | 0.3426 | ± | 0.0324 |
| hendrycksTest-high_school_physics | 0 | acc | 0.2053 | ± | 0.0330 |
|  |  | acc_norm | 0.2715 | ± | 0.0363 |
| hendrycksTest-professional_medicine | 0 | acc | 0.3382 | ± | 0.0287 |
|  |  | acc_norm | 0.2794 | ± | 0.0273 |