
BRIGHT

BRIGHT is the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. The queries are collected from diverse domains (StackExchange, LeetCode, and math competitions), all sourced from realistic human data. Experiments show that existing retrieval models perform poorly on BRIGHT: the highest nDCG@10 score is only 21. BRIGHT provides a good testbed for future retrieval research in more realistic and challenging settings.

Data Card Author(s)

  • Hongjin Su, HKU: Owner
  • Howard Yen, Princeton: Owner
  • Mengzhou Xia, Princeton: Owner
  • Weijia Shi, UW: Contributor
  • Niklas Muennighoff: Contributor
  • Han-yu Wang, HKU: Contributor
  • Haisu Liu, HKU: Contributor
  • Quan Shi, Princeton: Contributor
  • Zachary S. Siegel, Princeton: Contributor
  • Michael Tang, Princeton: Contributor
  • Ruoxi Sun, Google: Contributor
  • Jinsung Yoon, Google: Contributor
  • Sercan Ö. Arik, Google: Contributor
  • Danqi Chen, Princeton: Contributor
  • Tao Yu, HKU: Contributor

Authorship

Publishers

Publishing Organization(s)

The University of Hong Kong

Industry Type(s)

  • Academic - Tech

Contact Detail(s)

  • Publishing POC: N/A
  • Affiliation: N/A
  • Contact: N/A
  • Mailing List: N/A
  • Website: N/A

Dataset Owners

Team(s)

Hongjin Su, Howard Yen and Mengzhou Xia

Contact Detail(s)

  • Dataset Owner(s): Hongjin Su, Howard Yen and Mengzhou Xia
  • Affiliation: The University of Hong Kong and Princeton University
  • Contact: [email protected], {hyen,mengzhou}@cs.princeton.edu
  • Group Email: N/A
  • Website: N/A

Author(s)

  • Hongjin Su, PhD student, The University of Hong Kong
  • Howard Yen, Masters student, Princeton University
  • Mengzhou Xia, PhD student, Princeton University

Funding Sources

Institution(s)

  • Princeton University
  • Google Cloud AI Research

Funding or Grant Summary(ies)

  • Non-Sensitive Data about people
  • Public data accessible to everyone

Dataset Snapshot

| Category | Data |
| --- | --- |
| Size of Dataset | 607 MB |
| Number of Instances | 1322 |
| Number of Fields | 6 |
| Domains | 11 |

Above: We collect 1322 diverse queries from realistic human data. Each example is annotated with the gold documents and the reasoning traces used to find them.

Content Description

The datasets are collected from StackExchange, TheoremQA, LeetCode, and math competitions.

Descriptive Statistics

| Dataset | # Q | # D | # D+ | Q.L. | D.L. |
| --- | --- | --- | --- | --- | --- |
| Biology | 103 | 57,364 | 3.6 | 83.6 | 115.2 |
| Earth Science | 118 | 122,388 | 7.7 | 132.4 | 113.3 |
| Economics | 103 | 50,221 | 8.0 | 120.2 | 181.5 |
| Psychology | 101 | 52,841 | 7.3 | 118.2 | 149.6 |
| Robotics | 101 | 62,198 | 5.5 | 120.6 | 818.9 |
| Stack Overflow | 117 | 101,100 | 7.0 | 704.5 | 478.3 |
| Sustainable Living | 108 | 60,732 | 5.6 | 108.0 | 148.5 |
| LeetCode | 142 | 413,932 | 1.8 | 483.1 | 497.5 |
| Pony | 112 | 7,894 | 22.5 | 98.3 | 102.6 |
| AoPS | 111 | 188,177 | 4.7 | 89.0 | 250.5 |
| TheoremQA | 206 | 188,177 | 3.2 | 117.1 | 250.5 |

Data statistics of BRIGHT. For each dataset, we show the number of queries (# Q), the number of documents (# D), the average number of positive documents per example (# D+), and the average length of queries (Q.L.) and documents (D.L.), measured with the GPT-2 tokenizer.

Sensitivity of Data

Sensitivity Type(s)

  • None

Field(s) with Sensitive Data

  • None

Intentionally Collected Sensitive Data

  • None

Unintentionally Collected Sensitive Data

  • None

Security and Privacy Handling

We select academia-oriented domains and remove all user information in StackExchange data.

Risk Type(s)

  • No Known Risks

Supplemental Link(s)

  • None

Risk(s) and Mitigation(s)

  • N/A

Dataset Version and Maintenance

Maintenance Status

Actively Maintained - No new versions will be made available, but this dataset will be actively maintained, including but not limited to updates to the data.

Version Details

Current Version: 1.0

Last Updated: 06/2024

Release Date: 06/2024

Maintenance Plan

We will mainly use GitHub issues and the Hugging Face community to address any issues users encounter when using the BRIGHT data.

Versioning: If a new version is released, it will be numbered 1.1 or 2.0 depending on the scope of the update.

Updates: There may be updates in the future.

Errors: We will address errors that users encounter.

Feedback: We welcome all feedback, whether by email, GitHub issue, or the Hugging Face community, to make this benchmark better.

Next Planned Update(s)

Version affected: N/A

Next data update: N/A

Next version: N/A

Next version update: N/A

Expected Change(s)

Updates to Data: N/A

Updates to Dataset: N/A

Additional Notes: N/A

Example of Data Points

Primary Data Modality

  • Text Data

Sampling of Data Points

Typical Data Point

A typical data point contains a natural-language query, a reasoning trace explaining why the gold documents are relevant, an example id, a list of excluded document ids, and the gold document ids at both the full-document level (gold_ids_long) and the passage level (gold_ids), as in the example below.

{
  "query": "Claim in article about why insects are attracted to light\nIn this article they are addressing the reason insects are attracted to light when they say\nHeat radiation as an attractive component is refuted by the effect of LED lighting, which supplies negligible infrared radiation yet still entraps vast numbers of insects.\nI don't see why attraction to LEDs shows they're not seeking heat. Could they for example be evolutionarily programmed to associate light with heat? So that even though they don't encounter heat near/on the LEDs they still \"expect\" to?",
  "reasoning": "The question probes why insects are drawn to low-heat LED lights, challenging the idea that their attraction to light is heat-based. The document helps distinguish between heat attraction and evolved behaviors, shedding light on why insects might be attracted to LEDs despite their minimal heat.",
  "id": "0",
  "excluded_ids": [
    "N/A"
  ],
  "gold_ids_long": [
    "insects_attracted_to_light/Proximate_and_ultimate_causation.txt",
    "insects_attracted_to_light/Phototaxis.txt"
  ],
  "gold_ids": [
    "insects_attracted_to_light/Phototaxis_3.txt",
    "insects_attracted_to_light/Proximate_and_ultimate_causation_0.txt",
    "insects_attracted_to_light/Phototaxis_4.txt",
    "insects_attracted_to_light/Proximate_and_ultimate_causation_1.txt",
    "insects_attracted_to_light/Phototaxis_0.txt"
  ]
}
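
For illustration, the sketch below shows one way these fields are typically consumed when scoring a retriever. The handling of excluded_ids (removing those documents from a retriever's ranking before scoring) is an assumption based on the field name and values, not a statement of the official protocol.

```python
# Illustrative sketch only: how a data point's fields map onto the retrieval task.
from datasets import load_dataset

examples = load_dataset("xlangai/BRIGHT", "examples")["biology"]
ex = examples[0]

query = ex["query"]                            # natural-language question
gold_short = set(ex["gold_ids"])               # passage-level gold document ids
gold_long = set(ex["gold_ids_long"])           # full-document gold ids
excluded = set(ex["excluded_ids"]) - {"N/A"}   # assumed: ids to drop before scoring

# A retriever ranks the corpus for `query`; documents in `excluded` are removed
# from the ranking, and the remainder is scored against `gold_short` (or
# `gold_long` in the long-document setting), e.g. with nDCG@10.
```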

Motivations & Intentions

Motivations

Purpose(s)

  • Research

Domain(s) of Application

retrieval

Motivating Factor(s)

  • Existing retrieval benchmarks can be solved by lexical or semantic matching
  • Many realistic scenarios cannot be solved by such simple matching
  • To bridge this gap, we introduce BRIGHT to evaluate retrieval models in realistic settings where intensive reasoning is required

Intended Use

Dataset Use(s)

  • Evaluate retrieval systems in realistic scenarios

Suitable Use Case(s)

Suitable Use Case: Evaluate retrieval models

Unsuitable Use Case(s)

Unsuitable Use Case: Train retrieval models

Research and Problem Space(s)

We investigate new directions of retrieval, where the relevance between queries and documents goes beyond lexical and semantic similarities.

Citation Guidelines

Guidelines & Steps: Include citation when using BRIGHT

BibTeX:

@inproceedings{BRIGHT,
  title={BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval},
  author={Su, Hongjin and Yen, Howard and Xia, Mengzhou and Shi, Weijia and Muennighoff, Niklas and Wang, Han-yu and Liu, Haisu and Shi, Quan and Siegel, Zachary S and Tang, Michael and Sun, Ruoxi and Yoon, Jinsung and Arik, Sercan O and Chen, Danqi and Yu, Tao},
  year={2024},
}

Access, Retention, & Wipeout

Access

Access Type

  • External - Open Access

Documentation Link(s)

Prerequisite(s)

N/A

Policy Link(s)

Code to download data:

from datasets import load_dataset
data = load_dataset('xlangai/BRIGHT', 'examples')['biology']
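
The snippet above loads only the queries. A corresponding sketch for loading the document corpus of a domain is shown below; the 'documents' configuration name and its fields are assumptions here and should be verified against the dataset's Hugging Face page.

```python
from datasets import load_dataset

# Passage-level corpus for the biology domain (configuration name assumed to be
# 'documents'; a long-document corpus is also expected to be available).
documents = load_dataset("xlangai/BRIGHT", "documents")["biology"]

print(len(documents))       # corpus size
print(documents[0].keys())  # inspect the available fields, e.g. document id and text
```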

Access Control List(s)

N/A

Retention

Free retention

Wipeout and Deletion

We are not currently considering wiping out or deleting the data

Provenance

Collection

Method(s) Used

  • Data are collected by authors

Methodology Detail(s)

Collection Type

Source: StackExchange, TheoremQA, LeetCode and Math competition.

Platform: N/A

Is this source considered sensitive or high-risk? No

Dates of Collection: 03/2024 to 05/2024

Primary modality of collection data:

  • Text Data

Update Frequency for collected data:

  • Static

Additional Links for this collection:

N/A

Source Description(s)

  • Source: StackExchange is a popular question-answering platform where users ask questions and receive answers from the community. One example is
How good is it to reuse water from plant pots?

I'm living in an apartment, and after I water my plants the water goes to plates below the pots. The pots are in a metallic structure above the plates, so I can take the plates to reuse the water (throwing it at the plants again).

This reuse seems beneficial, because I think I can get rid of mosquitoes that would reproduce in the stagnated water. And also some nutrients of the soil (as well as earthworms) can return to the vase.

Is there some negative points in doing that?

EDIT: I think I must add that I'm at 3 degrees of latitude, in a hot and humid tropical rainforest, where the precipitation used to be around 1700 mm. So I use lots of water everyday, more than once a day sometimes, so the reused water is a small fraction of the water used.

Tags: water, reuse, plants

Comments:
  • Kate Gregory: i think you mean "pots" if they have dirt in them. "vases" hold water and cut flowers.
  • Rodrigo: Yes, @KateGregory, you're absolutely right. That's because in Portuguese we call them "vasos" :)

The top-voted of the two answers (score 7):

In my experience plants suffer in the long term from accumulation of salts in the soil, so fresh water would be better than reusing the water. Even better would be to get hold of fresh rain water (tricky in an apartment though, unless perhaps you have a balcony that gets rained on) for watering them, as that won't contain the salts that tap water does.

More detail here.

  • Source: LeetCode is a popular coding platform for programmers to practice. One example is:
5. Longest Palindromic Substring
Medium
Given a string s, return the longest palindromic substring in s.

 

Example 1:

Input: s = "babad"
Output: "bab"
Explanation: "aba" is also a valid answer.
Example 2:

Input: s = "cbbd"
Output: "bb"
 

Constraints:

1 <= s.length <= 1000
s consist of only digits and English letters.
  • Source: AoPS contains math competition questions. One example is:
Problem 1
What is the ones digit of\[222,222-22,222-2,222-222-22-2?\]
A. 0
B. 2
C. 4
D. 6
E. 8      
        

Solution 1
We can rewrite the expression as\[222,222-(22,222+2,222+222+22+2).\]
We note that the units digit of the addition is $0$ because all the units digits of the five numbers are $2$ and $5*2=10$, which has a units digit of $0$.
Now, we have something with a units digit of $0$ subtracted from $222,222$. The units digit of this expression is obviously $2$, and we get $\boxed{B}$ as our answer.

Collection Cadence

Static: Data was collected once from single or multiple sources.

Data Integration

Source

Included Fields

Data fields that were collected and are included in the dataset.

| Field Name | Description |
| --- | --- |
| Post | The content of the post where users ask questions |

Additional Notes: N/A

Excluded Fields

Data fields that were collected but are excluded from the dataset.

| Field Name | Description |
| --- | --- |
| Answer | Community answers |
| Votes | The votes for the posts or answers |

Data Processing

All the data collection and processing are done manually or with the help of Python scripts.

Collection Criteria

Data Selection

  • StackExchange: We select posts whose answers contain links and either are accepted by the user or receive more than 5 votes
  • Math and Code: We select questions that require theorems or syntax documentation.

Data Inclusion

  • We include data from diverse domains including psychology, robotics, etc.

Data Exclusion

  • We exclude examples that do not require reasoning in retrieval or do not use theorems.

Relationship to Source

Use & Utility(ies)

  • StackExchange: We use the post and linked web pages in answers
  • Math & Code: We use the questions and tags on the websites.

Benefit and Value(s)

  • Using this method, we collect retrieval instances that require intensive reasoning to retrieve documents

Limitation(s) and Trade-Off(s)

  • The judgement of relevance can be subjective, leading to imperfect human performance.

Version and Maintenance

First Version

Note(s) and Caveat(s)

None

Cadence

  • Daily

Last and Next Update(s)

  • We have not updated the datasets since release.

Changes on Update(s)

N/A

Human and Other Sensitive Attributes

Sensitive Human Attribute(s)

None

Intentionality

Intentionally Collected Attributes

We only use human-labeled links or tags to find examples or documents, but not directly include human labels.

Unintentionally Collected Attributes

None

Rationale

We follow links or tags to find relevant documents or examples

Source(s)

None

Methodology Detail(s)

We follow links or tags to find relevant documents or examples

Distribution(s)

N/A

Known Correlations

[query, gold_ids, gold_ids_long]

Description: The documents corresponding to gold_ids or gold_ids_long are relevant to queries.

Impact on dataset use: It helps evaluate retrieval models in realistic settings.

Risk(s) and Mitigation(s)

Human Attribute

None

Extended Use

Use with Other Data

Safety Level

  • Safe to use with other data

Known Safe Dataset(s) or Data Type(s)

The data in the BRIGHT benchmark focus on academia-oriented domains and should be safe.

Best Practices

Evaluate retrieval systems on BRIGHT.

Known Unsafe Dataset(s) or Data Type(s)

None

Limitation(s) and Recommendation(s)

The judgement of relevance between queries and documents can be subjective, so marginal differences between model evaluations could be ignored, while significant differences give good signals of model capabilities.

Forking & Sampling

Safety Level

  • Safe to fork and/or sample

Acceptable Sampling Method(s)

  • Cluster Sampling
  • Haphazard Sampling
  • Multi-stage sampling
  • Random Sampling
  • Retrospective Sampling
  • Systematic Sampling
  • Weighted Sampling
  • Unsampled

Best Practice(s)

Although sampling is possible, we recommend against it because BRIGHT is not very large.

Risk(s) and Mitigation(s)

N/A

Limitation(s) and Recommendation(s)

N/A

Use in ML or AI Systems

Dataset Use(s)

  • Evaluation

Notable Feature(s)

The intensive reasoning required to retrieve documents.

Usage Guideline(s)

Usage Guidelines: Follow the tutorial to evaluate retrieval systems.

Approval Steps: Steps are here.

Reviewer: We authors review the dataset for publication.

Distribution(s)

The BRIGHT benchmark is for the purpose of evaluation, i.e., all data are in test set.

Known Correlation(s)

query, gold_ids, gold_ids_long

Description: The documents corresponding to gold_ids or gold_ids_long are relevant to queries.

Impact on dataset use: It can help evaluate retrieval systems in more realistic scenarios.

Risks from correlation: The relevance judgements are made by real users and can be subjective.

Split Statistics

| Dataset | # Q | # D | # D+ | Q.L. | D.L. |
| --- | --- | --- | --- | --- | --- |
| Biology | 103 | 57,364 | 3.6 | 83.6 | 115.2 |
| Earth Science | 118 | 122,388 | 7.7 | 132.4 | 113.3 |
| Economics | 103 | 50,221 | 8.0 | 120.2 | 181.5 |
| Psychology | 101 | 52,841 | 7.3 | 118.2 | 149.6 |
| Robotics | 101 | 62,198 | 5.5 | 120.6 | 818.9 |
| Stack Overflow | 117 | 101,100 | 7.0 | 704.5 | 478.3 |
| Sustainable Living | 108 | 60,732 | 5.6 | 108.0 | 148.5 |
| LeetCode | 142 | 413,932 | 1.8 | 483.1 | 497.5 |
| Pony | 112 | 7,894 | 22.5 | 98.3 | 102.6 |
| AoPS | 111 | 188,177 | 4.7 | 89.0 | 250.5 |
| TheoremQA | 206 | 188,177 | 3.2 | 117.1 | 250.5 |

Data statistics of BRIGHT. For each dataset, we show the number of queries (# Q), the number of documents (# D), the average number of positive documents per example (# D+), and the average length of queries (Q.L.) and documents (D.L.), measured with the GPT-2 tokenizer.

Transformations

Synopsis

Transformation(s) Applied

  • Data Aggregation

Field(s) Transformed

Transformation Type

| Field Name | Source & Target |
| --- | --- |
| gold_ids | links → gold_ids |
| gold_ids_long | links → gold_ids_long |

Library(ies) and Method(s) Used

Transformation Type

Method: We follow the links or tags to find relevant documents.

Platforms, tools, or libraries: We do not leverage other platforms or tools in transformation

Transformation Results: We collect 1322 examples that can be used for evaluating retrievers.

Breakdown of Transformations

We find documents for all instances following the procedure above

Residual & Other Risk(s)

The risk is that the relevance judgement is subjective.

Human Oversight Measure(s)

We require human annotators to write down the judgement for relevance and reasoning steps.

Additional Considerations

None

Cleaning Mismatched Value(s)

We select high-quality data instances from websites, so there is no further cleaning.

Method(s) Used

We follow links or tags in the websites.

Comparative Summary

We do not use incorrect or mismatched values.

Residual & Other Risk(s)

N/A

Human Oversight Measure(s)

The data and notes written down by annotators are reviewed

Additional Considerations

None

Anomalies

We select data from websites, so no anomaly or outlier is excluded.

Method(s) Used

N/A

Platforms, tools, or libraries: N/A

Comparative Summary

N/A

Residual & Other Risk(s)

N/A

Human Oversight Measure(s)

The data and notes written by annotators are reviewed.

Additional Considerations

N/A

Dimensionality Reduction

N/A

Method(s) Used

N/A

Platforms, tools, or libraries: N/A

Comparative Summary

N/A

Residual & Other Risks

N/A

Human Oversight Measure(s)

N/A

Additional Considerations

N/A

Joining Input Sources

We use StackExchange, LeetCode, TheoremQA and math competitions

Method(s) Used

They are independent splits, so no join is performed

Comparative Summary

N/A

Residual & Other Risk(s)

N/A

Human Oversight Measure(s)

N/A

Additional Considerations

N/A

Redaction or Anonymization

N/A

Method(s) Used

N/A

Comparative Summary

N/A

Residual & Other Risk(s)

N/A

Human Oversight Measure(s)

N/A

Additional Considerations

N/A

Others (Please Specify)

N/A

Method(s) Used

N/A

Comparative Summary

N/A

Residual & Other Risk(s)

N/A

Human Oversight Measure(s)

N/A

Additional Considerations

N/A

Annotations & Labeling

Annotation Workforce Type

  • Human Annotations (Expert)
  • Human Annotations (Non-Expert)

Annotation Characteristic(s)

| Annotation Type | Number |
| --- | --- |
| Total number of annotations | 1322 |

Annotation Description(s)

Description: We follow human-labeled links or tags to find the relevant documents for each query.

Link: N/A

Platforms, tools, or libraries: N/A

Annotation Distribution(s)

| Dataset | Number |
| --- | --- |
| Biology | 103 |
| Earth Science | 118 |
| Economics | 103 |
| Psychology | 101 |
| Robotics | 101 |
| Stack Overflow | 117 |
| Sustainable Living | 108 |
| LeetCode | 142 |
| Pony | 112 |
| AoPS | 111 |
| TheoremQA | 206 |

Distribution of data splits in each domain

Annotation Task(s)

(Task Type)

Task description & instructions: In this section, we describe the instructions for annotators to collect data in BRIGHT.

StackExchange

  1. Browse posts from the newest to the oldest.

  2. Discard posts without an answer that is accepted by the user or obtains more than 5 votes.

  3. Discard answers of posts without URL links.

  4. For each link in the answer, write down the answers to: (1) why are the document and the post relevant; (2) what is the reasoning required to understand the relevance between the post and the document. If these answers cannot be given, discard the link.

  5. Use LLMs (e.g., ChatGPT, Claude, etc.) to generate post keywords, or use the post title, to search for web pages with large keyword or semantic overlap in Google. Search for at most 5 negative web pages per query.

  6. Split every web page into small passages, either by two newline symbols, by "#" headers in markdown files, or into fixed-length token chunks (see the sketch below).
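
Below is a minimal sketch of the passage-splitting step in item 6 (not the authors' exact script); the fixed-length option assumes the GPT-2 tokenizer from the Hugging Face transformers library.

```python
# Illustrative passage splitting: blank lines, markdown headers, or fixed-length tokens.
from transformers import GPT2TokenizerFast


def split_by_blank_lines(text: str) -> list[str]:
    # Split on two consecutive newline symbols and drop empty chunks.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def split_markdown_by_headers(text: str) -> list[str]:
    # Start a new passage whenever a markdown header line ("# ...") appears.
    passages, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            passages.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        passages.append("\n".join(current).strip())
    return [p for p in passages if p]


def split_by_fixed_tokens(text: str, max_tokens: int = 256) -> list[str]:
    # Fixed-length chunks of GPT-2 tokens, decoded back to text.
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    ids = tokenizer(text)["input_ids"]
    return [tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]
```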

TheoremQA

In TheoremQA, the main task for the annotator is to check if the GPT-4 rewritten questions are valid. The specific instructions are as follows:

  1. Read the rewritten question and determine if it is solvable.
  2. If it is solvable, read the original question and solution, and determine if the rewritten question is consistent with the original question. That is, the same reasoning steps and the final answer should hold.
  3. If it is also consistent, mark the question as valid, and make any minor edits to the problem statement (e.g., to improve grammar or fluency) as you see fit.
  4. If it is not solvable or not consistent, read the original question and solution, and correct the rewritten question if possible. If not, then discard the problem.

AoPS In AoPS, annotators are tasked to find questions from the AoPS Wiki and record the problems:

  1. Browse through the AoPS Wiki and find topic/category pages (example 1, example 2).
  2. Look through each page and find pages on specific theorems or techniques that can be used to solve problems. The page should link to at least two competition problems (example 1, example 2).
  3. Record the links of both the theorem/technique as well as the problem pages. The annotators are assigned a category to look for theorems in to avoid overlaps, and the categories are {algebra, geometry, calculus, probability, number theory, other}. After all links are collected, we use a web scraper to collect the problem statement and solutions, and we manually check the quality of the scraped data.

LeetCode In LeetCode, annotators determine whether a question is grounded in real-world concepts. We give the annotators instructions similar to those given to GPT-4:

  1. Read the problem statement carefully.
  2. Categorize the question into one of three categories:
     • 0: The question is not grounded in any real-world concepts. The description only uses coding-specific terms, such as "linked list", "binary search", "palindrome", "sorting", etc.
     • 1: The question is not grounded in real-world concepts, or only in real-world concepts that are commonly used in the context of coding, such as needle in a haystack, strings/words, or a spiral matrix.
     • 2: The question is grounded in real-world concepts that are not commonly used in the context of coding, such as building height, planting trees, or games. It may still use some code-specific terms to specify the data structure involved.

Methods used: We follow links/tags to find documents.

Inter-rater adjudication policy: Reviewers annotate cases where the pairing of queries and documents is not convincing.

Golden questions: N/A

Human Annotators

Annotator Description(s)

(Annotation Type)

Task type: Annotate StackExchange data

Number of unique annotators: 3

Expertise of annotators: Both experts and non-experts

Description of annotators: PhD students in computer science, biology, environment, etc.

Language distribution of annotators: They all speak fluent English

Geographic distribution of annotators: They come from Asia

Summary of annotation instructions: Follow links to find documents with filtering

Summary of gold questions: N/A

Annotation platforms: Google sheets

Additional Notes: N/A

Annotator Task(s)

(Task Type)

Task description: Annotate math and code data

Task instructions: Follow tags to find similar problems/questions

Methods used: Follow tags annotated by websites

Inter-rater adjudication policy: The data is reviewed

Golden questions: N/A

Additional notes: N/A

Language(s)

(Annotation Type)

  • 100% English

Location(s)

(Annotation Type)

  • Asia [50 %]
  • US [50 %]

Gender(s)

(Annotation Type)

  • Male [80 %]
  • Female [20 %]

Validation Types

Method(s)

  • Code/cross-reference Validation

Breakdown(s)

(Validation Type)

Number of Data Points Validated: 1322

Fields Validated

All fields in data are validated

Description(s)

(Validation Type)

Method: We require annotators to write down the logic used to determine the relevance between queries and documents. The reviewers check not only the data but also the annotators' notes.

Validation Results:

Over 90% of annotations pass peer review; we discard the rest.

Description of Human Validators

Characteristic(s)

(Validation Type)

  • Unique validators: 8
  • Number of examples per validator: 300
  • Average cost/task/validator: N/A
  • Training provided: N
  • Expertise required: N

Description(s)

(Validation Type)

Validator description: Validators are domain experts, e.g., PhD students from the corresponding domains.

Training provided: We do not provide training, but we verify that the annotators and reviewers are qualified.

Validator selection criteria: We have a test containing verified examples. An annotator is qualified if they can work out these examples.

Training provided: N/A

Language(s)

(Validation Type)

  • English [100 %]

Location(s)

(Validation Type)

  • Asia [60 %]
  • US [40 %]

Gender(s)

(Validation Type)

  • Male [70 %]
  • Female [30 %]

Sampling Methods

Method(s) Used

  • Unsampled

Characteristic(s)

N/A

Sampling Criteria

N/A

Known Applications & Benchmarks

ML Application(s)

Retrieval evaluation

Evaluation Result(s)

SFR-Embedding-Mistral: 17.8 (nDCG@10)

Model Card: https://huggingface.co/Salesforce/SFR-Embedding-Mistral/tree/main

Evaluation Results

  • nDCG@10: 17.8

Evaluation Process(es)

We write Python scripts to run retrieval models on BRIGHT and score the resulting rankings with nDCG@10; a sketch is shown below.
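
As an illustration, the sketch below computes nDCG@10 with the pytrec_eval library, using hypothetical scores for the example query shown earlier; the exact evaluation scripts may differ in detail.

```python
# Illustrative nDCG@10 scoring with pytrec_eval (hypothetical retriever scores).
import pytrec_eval

# Ground-truth relevance: query id -> {gold document id: 1}
qrels = {
    "0": {
        "insects_attracted_to_light/Phototaxis_3.txt": 1,
        "insects_attracted_to_light/Phototaxis_4.txt": 1,
    }
}

# Retriever output: query id -> {document id: similarity score}
run = {
    "0": {
        "insects_attracted_to_light/Phototaxis_3.txt": 0.92,
        "some_other_passage_0.txt": 0.31,
    }
}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
per_query = evaluator.evaluate(run)
ndcg_10 = sum(scores["ndcg_cut_10"] for scores in per_query.values()) / len(per_query)
print(f"nDCG@10: {ndcg_10:.4f}")
```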

Description(s) and Statistic(s)

SFR-Embedding-Mistral

Model Card: https://huggingface.co/Salesforce/SFR-Embedding-Mistral/tree/main

Model Description: A best-in-class retrieval model trained from Mistral-7B

  • Model Size: 7.11B
  • Model Weights: 7.11B
  • Model Layers: 32
  • Latency: 2s

Expected Performance and Known Caveats

Claude-3 + BM25

Expected Performance: surpasses results obtained without using LLMs

Known Caveats: The inference of LLMs can be expensive

Terms of Art

Concepts and Definitions referenced in this Data Card

BRIGHT

Definition: The name of this benchmark

Source: https://huggingface.co/datasets/xlangai/BRIGHT

Interpretation: N/A

Reflections on Data

We believe that BRIGHT paves the way for future research on retrieval systems in more realistic and challenging settings.