- [2024.05.08] We supported the evaluation of 4 MoE models: Mixtral-8x22B-v0.1, Mixtral-8x22B-Instruct-v0.1, Qwen1.5-MoE-A2.7B, Qwen1.5-MoE-A2.7B-Chat. Try them out now!
- [2024.04.30] We supported evaluating a model's compression efficiency by calculating its Bits per Character (BPC) metric on an external corpus (official paper). Check out the llm-compression evaluation config now! 🔥🔥🔥
- [2024.04.29] We report the performance of several well-known LLMs on common benchmarks; see the documentation for more information! 🔥🔥🔥
- [2024.04.26] We deprecated the multi-modality evaluation function from OpenCompass; the related implementation has moved to VLMEvalKit, welcome to use it! 🔥🔥🔥
- [2024.04.26] We supported the evaluation of ArenaHard, welcome to try! 🔥🔥🔥
- [2024.04.22] We supported the evaluation of LLaMA3 and LLaMA3-Instruct, welcome to try! 🔥🔥🔥
- [2024.02.29] We supported MT-Bench, AlpacaEval, and AlignBench, more information can be found here
- [2024.01.30] We released OpenCompass 2.0. Click CompassKit, CompassHub, and CompassRank for more information!
- [2024.01.17] We supported the evaluation of InternLM2 and InternLM2-Chat, InternLM2 showed extremely strong performance in these tests, welcome to try!
- [2024.01.17] We supported the needle in a haystack test with multiple needles, more information can be found here.
- [2023.12.28] We have enabled seamless evaluation of all models developed using LLaMA2-Accessory, a powerful toolkit for comprehensive LLM development.
- [2023.12.22] We have released T-Eval, a step-by-step evaluation benchmark to gauge your LLMs on tool utilization. Welcome to our Leaderboard for more details!
- [2023.12.10] We have released VLMEvalKit, a toolkit for evaluating vision-language models (VLMs), which currently supports 20+ VLMs and 7 multi-modal benchmarks (including the MMBench series).
- [2023.12.10] We have supported Mistral AI's MoE LLM: Mixtral-8x7B-32K. Welcome to MixtralKit for more details about inference and evaluation.
- [2023.11.22] We have supported many API-based models, including Baidu, ByteDance, Huawei, and 360. Welcome to the Models section for more details.
- [2023.11.20] Thanks to helloyongyang for supporting evaluation with LightLLM as the backend. Welcome to Evaluation With LightLLM for more details.
- [2023.11.13] We are delighted to announce the release of OpenCompass v0.1.8. This version enables local loading of evaluation benchmarks, thereby eliminating the need for an internet connection. Please note that with this update, you must re-download all evaluation datasets to ensure accurate and up-to-date results.
- [2023.11.06] We have supported several API-based models, including ChatGLM Pro@Zhipu, ABAB-Chat@MiniMax, and Xunfei. Welcome to the Models section for more details.
- [2023.10.24] We release a new benchmark for evaluating LLMs’ capabilities of having multi-turn dialogues. Welcome to BotChat for more details.
- [2023.09.26] We update the leaderboard with Qwen, one of the best-performing open-source models currently available, welcome to our homepage for more details.
- [2023.09.20] We update the leaderboard with InternLM-20B, welcome to our homepage for more details.
- [2023.09.19] We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B, welcome to our homepage for more details.
- [2023.09.18] We have released long context evaluation guidance.
- [2023.09.08] We update the leaderboard with Baichuan-2/Tigerbot-2/Vicuna-v1.5, welcome to our homepage for more details.
- [2023.09.06] The Baichuan2 team adopts OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation.
- [2023.09.02] We have supported the evaluation of Qwen-VL in OpenCompass.
- [2023.08.25] The TigerBot team adopts OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation.
- [2023.08.21] Lagent has been released, which is a lightweight framework for building LLM-based agents. We are working with the Lagent team to support the evaluation of general tool-use capability, stay tuned!
- [2023.08.18] We have supported evaluation for multi-modality learning, including MMBench, SEED-Bench, COCO-Caption, Flickr-30K, OCR-VQA, ScienceQA, and more. A leaderboard is on the way. Feel free to try multi-modality evaluation with OpenCompass!
- [2023.08.18] The dataset card is now online. We welcome new evaluation benchmarks for OpenCompass!
- [2023.08.11] Model comparison is now online. We hope this feature offers deeper insights!
- [2023.08.11] We have supported LEval.
- [2023.08.10] OpenCompass is compatible with LMDeploy. Now you can follow this instruction to evaluate accelerated models provided by TurboMind.
- [2023.08.10] We have supported Qwen-7B and XVERSE-13B! Go to our leaderboard for more results! More models are welcome to join OpenCompass.
- [2023.08.09] Several new datasets (CMMLU, TydiQA, SQuAD2.0, DROP) have been updated on our leaderboard! More datasets are welcome to join OpenCompass.
- [2023.08.07] We have added a script for users to evaluate the inference results of MMBench-dev.
- [2023.08.05] We have supported GPT-4! Go to our leaderboard for more results! More models are welcome to join OpenCompass.
- [2023.07.27] We have supported CMMLU! More datasets are welcome to join OpenCompass.