InternLM2, the second generation of the InternLM model, is available at two scales: 7B and 20B. For the convenience of users and researchers, we have open-sourced four versions of the model at each scale. The 20B versions are:
- internlm2-base-20b: A high-quality foundation model with high adaptation flexibility, serving as a good starting point for deep downstream adaptation.
- internlm2-20b (recommended): Further pretrained on general-domain data and a domain-enhanced corpus, achieving state-of-the-art evaluation performance while retaining good language capability. InternLM2 models are recommended for most applications.
- internlm2-chat-20b-sft: An intermediate version of InternLM2-Chat that has undergone only supervised fine-tuning (SFT) on top of the InternLM2-Base model. We release it to benefit research on alignment.
- internlm2-chat-20b (recommended): Further aligned on top of InternLM2-Chat-SFT through online RLHF. InternLM2-Chat offers better instruction following, chat experience, and function calling, and is recommended for downstream applications.
The base model of InternLM2 has the following technical features:
- Effective support for ultra-long contexts of up to 200,000 characters: The model achieves near-perfect retrieval in "needle in a haystack" tests on inputs of 200,000 characters. It also leads open-source models on long-text benchmarks such as LongBench and L-Eval.
- Comprehensive performance enhancement: Compared to the previous generation model, it shows significant improvements in various capabilities, including reasoning, mathematics, and coding.
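
To make the "needle in a haystack" idea concrete, here is a minimal sketch of how such a test prompt could be built: one fact (the needle) is placed at a chosen relative depth inside filler text of a fixed character length, and the model is then asked to retrieve it. This is not the evaluation harness used for the results above; the function names, the needle, the filler text, and the scoring rule are all hypothetical and chosen only for illustration.

```python
def build_haystack(needle: str, filler: str, total_chars: int, depth: float) -> str:
    """Return text of exactly `total_chars` characters with `needle` inserted
    at relative `depth` (0.0 = start, 1.0 = end) inside repeated filler."""
    if not filler:
        raise ValueError("filler must not be empty")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    filler_len = total_chars - len(needle)
    if filler_len < 0:
        raise ValueError("needle is longer than total_chars")
    # Repeat the filler until it is long enough, then trim to the exact length.
    repeats = filler_len // len(filler) + 1
    haystack = (filler * repeats)[:filler_len]
    cut = int(filler_len * depth)
    return haystack[:cut] + needle + haystack[cut:]


def retrieved(answer: str, expected: str) -> bool:
    """Crude scoring (an assumption): the answer must contain the expected fact."""
    return expected.lower() in answer.lower()


# Hypothetical needle and filler, for illustration only.
NEEDLE = " The secret passphrase is 'blue-heron-42'. "
FILLER = "The quick brown fox jumps over the lazy dog. "

context = build_haystack(NEEDLE, FILLER, total_chars=200_000, depth=0.5)
prompt = context + "\n\nQuestion: What is the secret passphrase?"
```

Sweeping `depth` and `total_chars` over a grid and recording `retrieved(...)` for each model answer gives the kind of depth-versus-length picture such tests usually report.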
Model | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | Release Date |
---|---|---|---|---|
InternLM2-Base-20B | 🤗internlm/internlm2-base-20b | internlm2-base-20b | | 2024-01-17 |
InternLM2-20B | 🤗internlm/internlm2-20b | internlm2-20b | | 2024-01-17 |
InternLM2-Chat-20B-SFT | 🤗internlm/internlm2-chat-20b-sft | internlm2-chat-20b-sft | | 2024-01-17 |
InternLM2-Chat-20B | 🤗internlm/internlm2-chat-20b | internlm2-chat-20b | | 2024-01-17 |
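
As a rough illustration of how the Hugging Face repository names in the table might be used, the sketch below maps each variant to its repository ID and loads it with the Transformers `from_pretrained` API. The variant keys (`"base"`, `"default"`, `"chat-sft"`, `"chat"`) and the helper names are hypothetical. The use of `trust_remote_code=True` and half precision are assumptions; check the model page for the recommended settings and hardware requirements before loading a 20B model.

```python
# Hugging Face repository IDs as listed in the model table.
# The short variant keys are hypothetical names used only in this sketch.
REPOS = {
    "base": "internlm/internlm2-base-20b",
    "default": "internlm/internlm2-20b",
    "chat-sft": "internlm/internlm2-chat-20b-sft",
    "chat": "internlm/internlm2-chat-20b",
}


def repo_id(variant: str = "chat") -> str:
    """Return the repository ID for a variant key, or raise on an unknown key."""
    try:
        return REPOS[variant]
    except KeyError:
        raise ValueError(f"unknown variant {variant!r}; choose from {sorted(REPOS)}")


def load(variant: str = "chat"):
    """Download and load the tokenizer and model for one variant."""
    # Imported lazily so the mapping above works without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = repo_id(variant)
    # Assumption: the repository ships custom model code, hence trust_remote_code=True.
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.float16, trust_remote_code=True
    )
    return tokenizer, model.eval()
```

Calling `load("chat")` downloads the full set of weights, so it is only practical on a machine with enough disk space and accelerator memory.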
We have evaluated InternLM2 on several important benchmarks using the open-source evaluation tool OpenCompass. Some of the evaluation results are shown in the table below. You are welcome to visit the OpenCompass Leaderboard for more evaluation results.
Dataset\Models | InternLM2-7B | InternLM2-Chat-7B | InternLM2-20B | InternLM2-Chat-20B | ChatGPT | GPT-4 |
---|---|---|---|---|---|---|
MMLU | 65.8 | 63.7 | 67.7 | 66.5 | 69.1 | 83.0 |
AGIEval | 49.9 | 47.2 | 53.0 | 50.3 | 39.9 | 55.1 |
BBH | 65.0 | 61.2 | 72.1 | 68.3 | 70.1 | 86.7 |
GSM8K | 70.8 | 70.7 | 76.1 | 79.6 | 78.2 | 91.4 |
MATH | 20.2 | 23.0 | 25.5 | 31.9 | 28.0 | 45.8 |
HumanEval | 43.3 | 59.8 | 48.8 | 67.1 | 73.2 | 74.4 |
MBPP(Sanitized) | 51.8 | 51.4 | 63.0 | 65.8 | 78.9 | 79.0 |
- The evaluation results were obtained with OpenCompass, and the evaluation configuration can be found in the configuration files provided by OpenCompass.
- The numbers may differ between OpenCompass versions, so please refer to the latest OpenCompass evaluation results.
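
For readers who want to compare columns of the results table programmatically, the following small sketch lists the benchmarks on which InternLM2-Chat-20B scores higher than ChatGPT. The scores are copied from the table in this document and may differ from newer OpenCompass releases; the function name is hypothetical.

```python
# (InternLM2-Chat-20B, ChatGPT) scores copied from the results table.
SCORES = {
    "MMLU": (66.5, 69.1),
    "AGIEval": (50.3, 39.9),
    "BBH": (68.3, 70.1),
    "GSM8K": (79.6, 78.2),
    "MATH": (31.9, 28.0),
    "HumanEval": (67.1, 73.2),
    "MBPP(Sanitized)": (65.8, 78.9),
}


def benchmarks_ahead(scores: dict[str, tuple[float, float]]) -> list[str]:
    """Return the benchmarks where the first score is strictly higher than the second."""
    return [name for name, (ours, theirs) in scores.items() if ours > theirs]


ahead = benchmarks_ahead(SCORES)
print(ahead)  # ['AGIEval', 'GSM8K', 'MATH']
```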