This is the PyTorch implementation of E2LLM, introduced in the EMNLP'25 paper: E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning.
Abstract
- We propose E2LLM, a novel long-context modeling framework built on pre-trained text encoders and decoder-only LLMs to effectively address the "impossible triangle" challenge.
- We introduce two training objectives, soft prompt reconstruction and long-context instruction fine-tuning, which enable the LLM to understand the soft prompt while reasoning toward accurate outputs.
- Comprehensive experiments on diverse benchmarks demonstrate the efficiency and practicality of E2LLM, showing its superiority over 8 SOTA baselines and its competitive performance on LongBench v2.
Requirements
- Ubuntu OS
- python==3.10
- torch==2.0.1
- cuda==11.7
- accelerate==0.23.0
- transformers==4.36.0
- deepspeed==0.9.3
- flash-attn==2.3.6
- peft==0.7.0
- scikit-learn==1.3.0
Dependencies can be installed by:
pip install -r requirements.txt
The overall directory structure is as follows:
${CODE_ROOT}
|-- configs
|   |-- eval_config.json
|   |-- lora_modules.json
|   |-- model2maxlen.json
|   |-- train_config.json
|-- dataset
|   |-- __init__.py
|   |-- dataset.py
|-- evaluate
|   |-- __init__.py
|   |-- em_quality.py
|   |-- f1_qa.py
|   |-- niah_metric.py
|   |-- rouge_sum.py
|-- local
|   |-- ds_config_zero2.yaml
|-- model
|   |-- __init__.py
|   |-- encoder_model_bert.py
|   |-- pma.py
|   |-- pro_model.py
|-- pefts
|   |-- __init__.py
|   |-- e2llm_args.py
|   |-- e2llm_trainer.py
|-- preprocess
|   |-- preshuffle_data_and_chunk.py
|-- prompts
|-- utils
|   |-- __init__.py
|   |-- common_utils.py
|-- eval.py
|-- eval.sh
|-- train_accelerate.py
|-- train_local_machine.sh
|-- train_multi_node.sh
The five datasets (QMSum, GovReport, Quality, NarrativeQA and TriviaQA) used in this paper can be downloaded from the following links:
Before training, first convert the raw data into a JSONL file in which each line is a JSON object of the form {'context': 'xxx', 'prompt': 'xxx', 'answer': 'xxx'}. Then run
python preprocess/preshuffle_data_and_chunk.py
and set the chunk_size parameter during execution.
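For reference, here is a minimal sketch of such a conversion. It assumes the raw data sits in a hypothetical raw_data.json file whose records carry document, question, and answer fields; adapt the key names to your own dataset:

```python
import json

# Hypothetical input: a JSON list of records with "document", "question",
# and "answer" fields. Adjust the key names to match your raw data.
with open("raw_data.json", "r", encoding="utf-8") as f:
    raw_records = json.load(f)

# Write one JSON object per line in the {'context', 'prompt', 'answer'}
# format expected by preprocess/preshuffle_data_and_chunk.py.
with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for record in raw_records:
        example = {
            "context": record["document"],
            "prompt": record["question"],
            "answer": record["answer"],
        }
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```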
During training, first set the desired parameters in configs/train_config.json, then run the appropriate script according to your environment:
- If you are training on a local machine, run:
  sh train_local_machine.sh
- If you are training on a cluster / multi-node setup, run:
  sh train_multi_node.sh
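Purely as an illustration of the configuration step, the snippet below inspects and updates configs/train_config.json from Python. The key "chunk_size" and its value are hypothetical placeholders; the actual argument names are defined by the repository (see pefts/e2llm_args.py and the shipped config):

```python
import json

# Inspect configs/train_config.json and tweak a parameter before launching
# training. "chunk_size" is only a hypothetical example of a key you might
# change; the real keys are defined in pefts/e2llm_args.py.
path = "configs/train_config.json"
with open(path, "r", encoding="utf-8") as f:
    train_config = json.load(f)

print(json.dumps(train_config, indent=2))  # review the current settings
train_config["chunk_size"] = 512           # hypothetical key and value

with open(path, "w", encoding="utf-8") as f:
    json.dump(train_config, f, indent=2)
```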
For inference, run
sh eval.sh
and adjust its parameters so that they match the ones used during training.
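To make that matching easier, the following small helper (not part of the repository) compares the two config files and reports differing values for the keys they share; it only assumes that configs/train_config.json and configs/eval_config.json are flat JSON objects:

```python
import json

# Unofficial convenience check: list keys present in both config files whose
# values differ, so evaluation settings stay consistent with training.
with open("configs/train_config.json", "r", encoding="utf-8") as f:
    train_cfg = json.load(f)
with open("configs/eval_config.json", "r", encoding="utf-8") as f:
    eval_cfg = json.load(f)

for key in sorted(set(train_cfg) & set(eval_cfg)):
    if train_cfg[key] != eval_cfg[key]:
        print(f"Mismatch on '{key}': train={train_cfg[key]!r}, eval={eval_cfg[key]!r}")
```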
If you find our repository helpful, please cite us as follows:
@misc{liao2025e2llmencoderelongatedlarge,
  title={E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning},
  author={Zihan Liao and Jun Wang and Hang Yu and Lingxiao Wei and Jianguo Li and Jun Wang and Wei Zhang},
  year={2025},
  eprint={2409.06679},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.06679},
}