Restructure Retrieval benchmarks: rename, clean slugs, add Miscellaneous, and relocate tasks #3212

Closed
q275343119 wants to merge 2 commits into embeddings-benchmark:main from embedding-benchmark:feat-rteb-sider-ui

@q275343119
Collaborator

This PR makes the following adjustments to the Retrieval benchmark structure:

  1. Rename “RTEB (Retrieval)” to “Retrieval”.
  2. Remove the “RTEB” slug from sub-benchmarks (e.g., change “RTEB Finance” to “Finance”).
  3. Add a new Miscellaneous category under Retrieval.
  4. Move the following benchmarks from MTEB/Miscellaneous to Retrieval/Miscellaneous:
    • BEIR
    • NanoBEIR
    • BRIGHT
    • BRIGHT (long)
    • Code Information Retrieval
    • Instruction Following
    • Long-context Retrieval
    • Reasoning Retrieval

Preview space: https://huggingface.co/spaces/SmileXing/leaderboard

 RTEB_ENGLISH = Benchmark(
     name="RTEB(eng, beta)",
-    display_name="RTEB English",
+    display_name="English",
Member

I think this can be a bit confusing, because we have other language-specific datasets.

Collaborator Author

Oh, I see — two different English buttons are displaying the same content.
After looking into it, I found that the issue comes from this part of the code:

def _create_button(
    i: int,
    benchmark: Benchmark,
    state: gr.State,
    label_to_value: dict[str, str],
    **kwargs,
):
    val = benchmark.name
    label = (
        benchmark.display_name if benchmark.display_name is not None else benchmark.name
    )
    label_to_value[label] = benchmark.name
    button = gr.Button(
        label,
        variant="secondary" if i != 0 else "primary",
        icon=benchmark.icon,
        key=f"{i}_button_{val}",
        elem_classes="text-white",
        **kwargs,
    )

Since label_to_value is a dict, assigning with the same key will overwrite the previous value whenever two buttons share the same label.

So I can't use the same display_name for two benchmarks 😥
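
The overwrite described above can be reproduced in isolation. The following is a minimal sketch, not the leaderboard code: the benchmark names and the shared "English" label are illustrative stand-ins, and only the `label_to_value` assignment mirrors `_create_button`.

```python
# Illustrative (name, display_name) pairs; the values are assumptions,
# chosen only so that two entries share one label.
benchmarks = [
    ("MTEB(eng, v2)", "English"),
    ("RTEB(eng, beta)", "English"),
]

label_to_value: dict[str, str] = {}
for name, display_name in benchmarks:
    # Same assignment as in _create_button: the label is the dict key,
    # so a repeated label replaces the earlier benchmark name.
    label_to_value[display_name] = name

print(label_to_value)  # only one entry survives for "English"
```

Both buttons would then resolve to the benchmark that was registered last, which matches the "two English buttons show the same content" symptom.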

Member

I think we should add a test for this.
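
A test along these lines might look like the sketch below. It is written under assumptions: `Benchmark` here is a hypothetical two-field stand-in rather than the real mteb class, and `find_duplicate_labels` is a helper invented for illustration. The label rule (display_name, falling back to name) is the one used in `_create_button`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Benchmark:
    # Hypothetical stand-in with only the two fields the button label uses.
    name: str
    display_name: Optional[str] = None


def find_duplicate_labels(benchmarks: list[Benchmark]) -> list[str]:
    """Return the button labels that would be used by more than one benchmark."""
    labels = [
        b.display_name if b.display_name is not None else b.name
        for b in benchmarks
    ]
    return sorted(label for label, count in Counter(labels).items() if count > 1)


def test_button_labels_are_unique() -> None:
    # In a real test this list would come from the registered benchmarks.
    benchmarks = [
        Benchmark(name="bench-a", display_name="English"),
        Benchmark(name="bench-b", display_name="Finance"),
        Benchmark(name="bench-c"),
    ]
    assert find_duplicate_labels(benchmarks) == []
```

Returning the offending labels, rather than a bare boolean, lets a failing assertion show which display names collide.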


> I think this can be a bit confusing, because we have other language-specific datasets

Good point; I can see arguments for either direction. @KennethEnevoldsen what do you think?

Contributor

Hmm, so visually, I don't see a big issue with having two "English" benchmarks, as we have the structure (especially with the change below)

Whether this is possible in the code, I'm not sure, though. I will have to run a test on that Monday.

Contributor

KennethEnevoldsen left a comment

(will get back to this Monday to test the buttons)

 RTEB_BENCHMARK_ENTRIES = [
     MenuEntry(
-        name="RTEB (Retrieval)",
+        name="Retrieval",
Contributor

We might need to rename "select benchmark" to "General Purpose"

@q275343119
Collaborator Author

I added a blank space after the display_name to prevent duplicate labels from appearing.

@KennethEnevoldsen
Contributor

Since I can't update this branch, I made a separate PR here #3222 @q275343119, can you review it to see if it works for you?

@q275343119
Collaborator Author

> Since I can't update this branch, I made a separate PR here #3222 @q275343119, can you review it to see if it works for you?

It works for me.
Adding the suffix is a nice idea; no problem on my side.
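
For readers following along, one way a suffix can keep labels unique is sketched below. This is an assumption-laden illustration and not necessarily what #3222 implements: the `disambiguate` helper, the `(label, suffix)` pairs, and the `"label, suffix"` format are all hypothetical.

```python
from collections import Counter


def disambiguate(entries: list[tuple[str, str]]) -> list[str]:
    """Append the suffix only to labels that occur more than once.

    `entries` holds hypothetical (label, suffix) pairs, e.g. the display
    name and the menu group a benchmark belongs to.
    """
    counts = Counter(label for label, _ in entries)
    return [
        f"{label}, {suffix}" if counts[label] > 1 else label
        for label, suffix in entries
    ]


labels = disambiguate(
    [("English", "MTEB"), ("English", "RTEB"), ("Finance", "RTEB")]
)
print(labels)
```

Unlike a trailing blank, a visible suffix keeps the dict keys distinct without relying on whitespace that a reader cannot see in the UI.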

@KennethEnevoldsen
Contributor

Great, I will close this then in favor of the other PR.

4 participants