
E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

This is the PyTorch implementation of E2LLM, introduced in the EMNLP'25 paper E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning.

Overview

Figure: The network architecture of E2LLM.

Abstract
  • We propose E2LLM, a novel long-context modeling framework built on pre-trained text encoders and decoder-only LLMs to effectively address the "impossible triangle" challenge.

  • We introduce two training objectives: soft prompt reconstruction and long-context instruction fine-tuning, enabling the LLM to understand the soft prompt while reasoning about accurate outputs.

  • Comprehensive experiments on diverse benchmarks demonstrate the efficiency and practicality of E2LLM, showing its superiority over 8 SOTA baselines and its competitiveness on LongBench v2.
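
In code terms, the data flow described above can be pictured as: a text encoder summarizes each chunk of the long context into an embedding, an adapter projects those embeddings into the decoder's input space as soft prompt tokens, and the decoder LLM reasons over the soft prompt plus the user prompt. The snippet below is only a conceptual sketch, using placeholder backbones (bert-base-uncased, gpt2) and a simple linear adapter rather than the repository's actual modules (encoder_model_bert.py, pma.py, pro_model.py):

    # Conceptual sketch of the E2LLM data flow; backbones and the adapter are
    # illustrative placeholders, not the repository's implementation.
    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM

    enc_name, dec_name = "bert-base-uncased", "gpt2"   # placeholder backbones
    enc_tok = AutoTokenizer.from_pretrained(enc_name)
    dec_tok = AutoTokenizer.from_pretrained(dec_name)
    encoder = AutoModel.from_pretrained(enc_name)
    decoder = AutoModelForCausalLM.from_pretrained(dec_name)

    # The adapter maps each chunk's [CLS] embedding into the decoder's embedding
    # space, yielding one "soft prompt" vector per chunk (a simple stand-in).
    adapter = nn.Linear(encoder.config.hidden_size, decoder.config.hidden_size)

    chunks = ["first chunk of a long document ...", "second chunk ..."]
    enc_in = enc_tok(chunks, return_tensors="pt", padding=True, truncation=True)
    chunk_emb = encoder(**enc_in).last_hidden_state[:, 0]        # (num_chunks, d_enc)
    soft_prompt = adapter(chunk_emb).unsqueeze(0)                # (1, num_chunks, d_dec)

    prompt_ids = dec_tok("Summarize the document.", return_tensors="pt").input_ids
    prompt_emb = decoder.get_input_embeddings()(prompt_ids)      # (1, prompt_len, d_dec)

    # Soft prompt tokens are prepended to the prompt embeddings and decoded as usual.
    inputs_embeds = torch.cat([soft_prompt, prompt_emb], dim=1)
    out = decoder(inputs_embeds=inputs_embeds)
    print(out.logits.shape)                                      # (1, num_chunks + prompt_len, vocab)

During training, the two objectives listed above are applied on top of this flow: reconstructing each chunk's text from its soft prompt, and instruction fine-tuning on long-context tasks.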

Requirements

  • Ubuntu OS
  • python==3.10
  • torch==2.0.1
  • cuda==11.7
  • accelerate==0.23.0
  • transformers==4.36.0
  • deepspeed==0.9.3
  • flash-attn==2.3.6
  • peft==0.7.0
  • scikit-learn==1.3.0

Dependencies can be installed with:

pip install -r requirements.txt
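
After installation, a quick way to confirm the pinned versions and GPU visibility is a short check like the one below (an optional sketch, not part of the repository):

    # sanity_check.py -- optional snippet to verify the environment
    import torch
    import transformers
    import peft
    import deepspeed

    print("torch:", torch.__version__)                # expected 2.0.1
    print("transformers:", transformers.__version__)  # expected 4.36.0
    print("peft:", peft.__version__)                  # expected 0.7.0
    print("deepspeed:", deepspeed.__version__)        # expected 0.9.3
    print("CUDA available:", torch.cuda.is_available())

    try:
        import flash_attn
        print("flash-attn:", flash_attn.__version__)  # expected 2.3.6
    except ImportError:
        print("flash-attn not installed")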

The overall directory structure is as follows:

${CODE_ROOT}
    |-- configs
        |-- eval_config.json
        |-- lora_modules.json
        |-- model2maxlen.json
        |-- train_config.json
    |-- dataset
        |-- __init__.py
        |-- dataset.py
    |-- evaluate
        |-- __init__.py
        |-- em_quality.py
        |-- f1_qa.py
        |-- niah_metric.py
        |-- rouge_sum.py
    |-- local
        |-- ds_config_zero2.yaml
    |-- model
        |-- __init__.py
        |-- encoder_model_bert.py
        |-- pma.py
        |-- pro_model.py
    |-- pefts
        |-- __init__.py
        |-- e2llm_args.py
        |-- e2llm_trainer.py
    |-- preprocess
        |-- preshuffle_data_and_chunk.py
    |-- prompts
    |-- utils
        |-- __init__.py
        |-- common_utils.py
    |-- eval.py
    |-- eval.sh
    |-- train_accelerate.py
    |-- train_local_machine.sh
    |-- train_multi_node.sh

Data preparation

The five datasets (QMSum, GovReport, Quality, NarrativeQA and TriviaQA) used in this paper can be downloaded from the following links:

Before training, first convert the data into a JSONL file whose lines have the format {"context": "...", "prompt": "...", "answer": "..."}. Then run

python preprocess/preshuffle_data_and_chunk.py

and set the chunk_size parameter during execution.
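
For reference, a minimal conversion sketch is shown below; the raw field names (document, question, gold) and the output path are placeholders for whatever your downloaded data actually uses:

    # convert_to_jsonl.py -- minimal sketch; field names and paths are placeholders
    import json

    # `raw_examples` stands in for the raw records of your downloaded dataset
    raw_examples = [
        {"document": "long source text ...", "question": "What is ...?", "gold": "the answer"},
    ]

    with open("data/train.jsonl", "w", encoding="utf-8") as f:
        for ex in raw_examples:
            record = {
                "context": ex["document"],   # long context to be chunked and encoded
                "prompt": ex["question"],    # instruction / question fed to the decoder
                "answer": ex["gold"],        # target output for fine-tuning
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")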

Train

During training, first set the desired parameters in configs/train_config.json, then run the appropriate script according to your environment:

  • If you are training on a local machine:

    sh train_local_machine.sh
    
  • If you are training on a cluster / multi-node setup:

    sh train_multi_node.sh
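
The set of fields accepted in configs/train_config.json is defined by pefts/e2llm_args.py; the snippet below only illustrates overriding a value programmatically before launching, and the key names used in it are hypothetical:

    # tweak_train_config.py -- illustrative only; the keys below are hypothetical,
    # consult pefts/e2llm_args.py for the fields actually recognized
    import json

    with open("configs/train_config.json", "r", encoding="utf-8") as f:
        cfg = json.load(f)

    cfg["chunk_size"] = 512        # hypothetical key; should match the preprocessing chunk_size
    cfg["learning_rate"] = 1e-4    # hypothetical key

    with open("configs/train_config.json", "w", encoding="utf-8") as f:
        json.dump(cfg, f, indent=2)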
    

Evaluate

For inference, run

sh eval.sh

and adjust its parameters so that they match the ones used during training.
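
Judging by the file names, the scripts under evaluate/ compute task-specific metrics: exact match (em_quality.py), QA F1 (f1_qa.py), ROUGE for summarization (rouge_sum.py), and a needle-in-a-haystack score (niah_metric.py). As a point of reference, a generic SQuAD-style token-level F1 looks roughly like the sketch below; it is not necessarily the exact implementation in f1_qa.py:

    # Generic token-level F1 for QA (SQuAD-style); shown for reference only.
    from collections import Counter

    def token_f1(prediction: str, reference: str) -> float:
        pred_tokens = prediction.lower().split()
        ref_tokens = reference.lower().split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    print(token_f1("the answer is Paris", "Paris"))  # 0.4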

Citation

If you find our repository helpful, please cite us as follows:

@misc{liao2025e2llmencoderelongatedlarge,
      title={E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning}, 
      author={Zihan Liao and Jun Wang and Hang Yu and Lingxiao Wei and Jianguo Li and Jun Wang and Wei Zhang},
      year={2025},
      eprint={2409.06679},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.06679}, 
}
