@rbiswasfc
Contributor

This PR adds four tasks from the RULER benchmark:

  • QA2 (HotpotQA with distracting information added)
  • Multi-hop Tracing: Variable Tracking (VT)
  • Aggregation: Common Words (CWE)
  • Multi-keys Needle-in-a-haystack (NIAH)

Currently, each task uses a 4k context length, which can be adjusted as needed.
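To illustrate what an adjustable context length means for a task like multi-key NIAH, here is a minimal sketch of how one sample could be built. It is not the code in this PR: the function name `make_multikey_niah`, the filler sentences, the needle wording, and the use of a whitespace word count as a stand-in for a token budget are all assumptions made for illustration.

```python
import random

# Hypothetical filler text; real haystacks are usually built from longer documents.
FILLER = ["The grass is green.", "The sky is blue.", "The sun is yellow."]


def make_multikey_niah(num_needles=4, max_words=200, seed=0):
    """Build one toy multi-key NIAH sample under a word budget (assumed proxy for tokens)."""
    rng = random.Random(seed)

    # Each needle binds a key to a random 7-digit value.
    needles = {f"key-{i}": str(rng.randrange(10**6, 10**7)) for i in range(num_needles)}
    needle_sentences = [
        f"The special magic number for {k} is {v}." for k, v in needles.items()
    ]

    # Fill the remaining budget with filler sentences.
    budget = max_words - sum(len(s.split()) for s in needle_sentences)
    sentences, used, i = [], 0, 0
    while used + len(FILLER[i % len(FILLER)].split()) <= budget:
        sentence = FILLER[i % len(FILLER)]
        sentences.append(sentence)
        used += len(sentence.split())
        i += 1

    # Scatter the needles at random positions in the haystack.
    for s in needle_sentences:
        sentences.insert(rng.randrange(len(sentences) + 1), s)

    # Ask about one key; the other needles act as distractors.
    target = rng.choice(sorted(needles))
    return {
        "context": " ".join(sentences),
        "question": f"What is the special magic number for {target}?",
        "answer": needles[target],
        "needles": needles,
    }
```

Raising `max_words` in this sketch lengthens the haystack without changing the needles, which is the same idea as raising the context length of a task.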

@rbiswasfc rbiswasfc requested review from fladhak and griff4692 June 27, 2024 11:39
@rbiswasfc
Contributor Author

Running with the default settings and the Qwen2-1.5B-Instruct model (e.g. python eval.py --tasks rulercwe --checkpoint_path checkpoints/Qwen/Qwen2-1.5B-Instruct/model.pth), I got these scores:

  • rulerqa: StringMatch_score: 38.8
  • rulerniah: StringMatch_score: 89.00
  • rulervt: StringMatch_score: 77.96
  • rulercwe: StringMatch_score: 31.08
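For readers unfamiliar with this kind of metric, a string-match score can be sketched as below. This is an assumption about the general shape of the metric (case-insensitive substring matching, reported on a 0 to 100 scale), not necessarily the exact StringMatch code in this PR; the function names are hypothetical, and which variant each task uses is not stated here.

```python
def string_match_all(preds, refs):
    """Average fraction of reference strings found in each prediction, scaled to 0-100."""
    total = 0.0
    for pred, ref_list in zip(preds, refs):
        hits = sum(1.0 for r in ref_list if r.lower() in pred.lower())
        total += hits / len(ref_list)
    return round(total / len(preds) * 100, 2)


def string_match_part(preds, refs):
    """Share of predictions containing at least one reference string, scaled to 0-100."""
    total = 0.0
    for pred, ref_list in zip(preds, refs):
        total += 1.0 if any(r.lower() in pred.lower() for r in ref_list) else 0.0
    return round(total / len(preds) * 100, 2)
```

The "all" variant suits tasks with several required answers (e.g. every common word in CWE), while the "part" variant suits tasks where any one acceptable answer is enough (e.g. QA with aliases).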

@griff4692 griff4692 merged commit 6a02bef into main Jun 27, 2024
@griff4692 griff4692 deleted the rb/ruler branch June 27, 2024 13:36