
E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

This is the PyTorch implementation of E2LLM, introduced in the EMNLP'25 paper E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning.

Overview

Figure: The network architecture of E2LLM.

Abstract
  • We propose E2LLM, a novel long-context modeling framework built on pre-trained text encoders and decoder-only LLMs to effectively address the "impossible triangle" challenge.

  • We introduce two training objectives: soft prompt reconstruction and long-context instruction fine-tuning, enabling the LLM to understand the soft prompt while reasoning about accurate outputs.

  • Comprehensive experiments on diverse benchmarks demonstrate the efficiency and practicality of E2LLM, showing its superiority over 8 SOTA baselines and its competitiveness on LongBench v2.
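
In code terms, the data flow described above can be pictured as: a text encoder summarizes each chunk of the long context into an embedding, an adapter projects those embeddings into the decoder's input space as soft prompt tokens, and the decoder LLM reasons over the soft prompt plus the user prompt. The snippet below is only a conceptual sketch, using placeholder backbones (bert-base-uncased, gpt2) and a simple linear adapter rather than the repository's actual modules (encoder_model_bert.py, pma.py, pro_model.py):

    # Conceptual sketch of the E2LLM data flow; backbones and the adapter are
    # illustrative placeholders, not the repository's implementation.
    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM

    enc_name, dec_name = "bert-base-uncased", "gpt2"   # placeholder backbones
    enc_tok = AutoTokenizer.from_pretrained(enc_name)
    dec_tok = AutoTokenizer.from_pretrained(dec_name)
    encoder = AutoModel.from_pretrained(enc_name)
    decoder = AutoModelForCausalLM.from_pretrained(dec_name)

    # The adapter maps each chunk's [CLS] embedding into the decoder's embedding
    # space, yielding one "soft prompt" vector per chunk (a simple stand-in).
    adapter = nn.Linear(encoder.config.hidden_size, decoder.config.hidden_size)

    chunks = ["first chunk of a long document ...", "second chunk ..."]
    enc_in = enc_tok(chunks, return_tensors="pt", padding=True, truncation=True)
    chunk_emb = encoder(**enc_in).last_hidden_state[:, 0]        # (num_chunks, d_enc)
    soft_prompt = adapter(chunk_emb).unsqueeze(0)                # (1, num_chunks, d_dec)

    prompt_ids = dec_tok("Summarize the document.", return_tensors="pt").input_ids
    prompt_emb = decoder.get_input_embeddings()(prompt_ids)      # (1, prompt_len, d_dec)

    # Soft prompt tokens are prepended to the prompt embeddings and decoded as usual.
    inputs_embeds = torch.cat([soft_prompt, prompt_emb], dim=1)
    out = decoder(inputs_embeds=inputs_embeds)
    print(out.logits.shape)                                      # (1, num_chunks + prompt_len, vocab)

During training, the two objectives listed above are applied on top of this flow: reconstructing each chunk's text from its soft prompt, and instruction fine-tuning on long-context tasks.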

Requirements

  • Ubuntu OS
  • python==3.10
  • torch==2.0.1
  • cuda==11.7
  • accelerate==0.23.0
  • transformers==4.36.0
  • deepspeed==0.9.3
  • flash-attn==2.3.6
  • peft==0.7.0
  • scikit-learn==1.3.0

Dependencies can be installed with:

pip install -r requirements.txt
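
After installation, a quick way to confirm the pinned versions and GPU visibility is a short check like the one below (an optional sketch, not part of the repository):

    # sanity_check.py -- optional snippet to verify the environment
    import torch
    import transformers
    import peft
    import deepspeed

    print("torch:", torch.__version__)                # expected 2.0.1
    print("transformers:", transformers.__version__)  # expected 4.36.0
    print("peft:", peft.__version__)                  # expected 0.7.0
    print("deepspeed:", deepspeed.__version__)        # expected 0.9.3
    print("CUDA available:", torch.cuda.is_available())

    try:
        import flash_attn
        print("flash-attn:", flash_attn.__version__)  # expected 2.3.6
    except ImportError:
        print("flash-attn not installed")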

The overall directory structure is as follows:

${CODE_ROOT}
    |-- configs
        |-- eval_config.json
        |-- lora_modules.json
        |-- model2maxlen.json
        |-- train_config.json
    |-- dataset
        |-- __init__.py
        |-- dataset.py
    |-- evaluate
        |-- __init__.py
        |-- em_quality.py
        |-- f1_qa.py
        |-- niah_metric.py
        |-- rouge_sum.py
    |-- local
        |-- ds_config_zero2.yaml
    |-- model
        |-- __init__.py
        |-- encoder_model_bert.py
        |-- pma.py
        |-- pro_model.py
    |-- pefts
        |-- __init__.py
        |-- e2llm_args.py
        |-- e2llm_trainer.py
    |-- preprocess
        |-- preshuffle_data_and_chunk.py
    |-- prompts
    |-- utils
        |-- __init__.py
        |-- common_utils.py
    |-- eval.py
    |-- eval.sh
    |-- train_accelerate.py
    |-- train_local_machine.sh
    |-- train_multi_node.sh

Data preparation

The five datasets (QMSum, GovReport, Quality, NarrativeQA and TriviaQA) used in this paper can be downloaded from the following links:

Before training, first convert the data into a JSONL file whose lines have the format {"context": "...", "prompt": "...", "answer": "..."}. Then run

python preprocess/preshuffle_data_and_chunk.py

and set the chunk_size parameter during execution.
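
For reference, a minimal conversion sketch is shown below; the raw field names (document, question, gold) and the output path are placeholders for whatever your downloaded data actually uses:

    # convert_to_jsonl.py -- minimal sketch; field names and paths are placeholders
    import json

    # `raw_examples` stands in for the raw records of your downloaded dataset
    raw_examples = [
        {"document": "long source text ...", "question": "What is ...?", "gold": "the answer"},
    ]

    with open("data/train.jsonl", "w", encoding="utf-8") as f:
        for ex in raw_examples:
            record = {
                "context": ex["document"],   # long context to be chunked and encoded
                "prompt": ex["question"],    # instruction / question fed to the decoder
                "answer": ex["gold"],        # target output for fine-tuning
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")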

Train

During training, first set the desired parameters in configs/train_config.json, then run the appropriate script according to your environment:

  • If you are training on a local machine:

    sh train_local_machine.sh
    
  • If you are training on a cluster / multi-node setup:

    sh train_multi_node.sh
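
The set of fields accepted in configs/train_config.json is defined by pefts/e2llm_args.py; the snippet below only illustrates overriding a value programmatically before launching, and the key names used in it are hypothetical:

    # tweak_train_config.py -- illustrative only; the keys below are hypothetical,
    # consult pefts/e2llm_args.py for the fields actually recognized
    import json

    with open("configs/train_config.json", "r", encoding="utf-8") as f:
        cfg = json.load(f)

    cfg["chunk_size"] = 512        # hypothetical key; should match the preprocessing chunk_size
    cfg["learning_rate"] = 1e-4    # hypothetical key

    with open("configs/train_config.json", "w", encoding="utf-8") as f:
        json.dump(cfg, f, indent=2)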
    

Evaluate

For inference, run

sh eval.sh

and adjust its parameters so that they match the ones used during training.
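
Judging by the file names, the scripts under evaluate/ compute task-specific metrics: exact match (em_quality.py), QA F1 (f1_qa.py), ROUGE for summarization (rouge_sum.py), and a needle-in-a-haystack score (niah_metric.py). As a point of reference, a generic SQuAD-style token-level F1 looks roughly like the sketch below; it is not necessarily the exact implementation in f1_qa.py:

    # Generic token-level F1 for QA (SQuAD-style); shown for reference only.
    from collections import Counter

    def token_f1(prediction: str, reference: str) -> float:
        pred_tokens = prediction.lower().split()
        ref_tokens = reference.lower().split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    print(token_f1("the answer is Paris", "Paris"))  # 0.4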

Citation

If you find our repository helpful, please cite us as follows:

@misc{liao2025e2llmencoderelongatedlarge,
      title={E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning}, 
      author={Zihan Liao and Jun Wang and Hang Yu and Lingxiao Wei and Jianguo Li and Jun Wang and Wei Zhang},
      year={2025},
      eprint={2409.06679},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.06679}, 
}
