
Add Big5 track #775

Merged
wangch079 merged 3 commits into elastic:master from wangch079:chenhui/big5
May 2, 2025


@wangch079
Member

https://github.com/elastic/search-developer-productivity/issues/3789

This PR adds the Big5 Rally track to benchmark five essential areas:

  1. Text Querying
  2. Sorting
  3. Date Histogram
  4. Range Queries
  5. Terms Aggregation

wangch079 requested a review from a team on April 18, 2025, 14:50
big5/track.json Outdated
Comment on lines 17 to 20
"source-file": "logs.ndjson.bz2",
"document-count": 1131862,
"compressed-bytes": 57764621,
"uncompressed-bytes": 1047614086
Member Author

I added this file to https://rally-tracks.elastic.co/big5 for testing purposes. We will regenerate a much larger data set.
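When the corpus is regenerated, the `document-count`, `compressed-bytes`, and `uncompressed-bytes` values quoted above have to be recomputed. A minimal sketch of how that could be done with the Python standard library, assuming the corpus is a bz2-compressed NDJSON file with one document per line (the helper name `corpus_stats` is hypothetical and not part of Rally):

```python
import bz2
import os


def corpus_stats(path):
    """Compute the three size fields for a bz2-compressed NDJSON corpus.

    Assumption: one JSON document per line; blank lines are not counted
    as documents but still contribute to the uncompressed size.
    """
    document_count = 0
    uncompressed_bytes = 0
    # Stream the file so a very large corpus never has to fit in memory.
    with bz2.open(path, "rb") as source:
        for line in source:
            uncompressed_bytes += len(line)
            if line.strip():
                document_count += 1
    return {
        "document-count": document_count,
        "compressed-bytes": os.path.getsize(path),
        "uncompressed-bytes": uncompressed_bytes,
    }
```

The returned keys mirror the field names in the `track.json` snippet, so the result can be compared against the entry by eye.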

Member

It would be good to get the test-mode corpus there too (that is, the -1k variants); this will allow the full CI to run as well.
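A rough sketch of how a -1k variant could be produced, assuming the test-mode file is simply the first 1,000 documents of the full corpus and that `logs.ndjson.bz2` maps to `logs-1k.ndjson.bz2`. Both the naming rule and the helper names below are assumptions made for illustration and should be checked against the Rally documentation:

```python
import bz2
import os
from itertools import islice


def one_k_path(path):
    """Insert '-1k' before the data extension: logs.ndjson.bz2 -> logs-1k.ndjson.bz2."""
    directory, name = os.path.split(path)
    base, compression_ext = os.path.splitext(name)  # ("logs.ndjson", ".bz2")
    stem, data_ext = os.path.splitext(base)         # ("logs", ".ndjson")
    return os.path.join(directory, f"{stem}-1k{data_ext}{compression_ext}")


def write_test_mode_corpus(path, limit=1000):
    """Copy the first `limit` lines of a bz2 NDJSON corpus into its -1k variant."""
    target = one_k_path(path)
    with bz2.open(path, "rb") as source, bz2.open(target, "wb") as destination:
        # islice stops after `limit` lines, so the rest of the corpus is never read.
        destination.writelines(islice(source, limit))
    return target
```

Because only the head of the file is read, this stays cheap even when the full corpus is very large.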

Member Author

Thank you @gareth-ellis, I see the CI failures.

I will generate a proper data set and add -1k files for testing at the same time.

Member

great, thanks!

Member Author

Hi @gareth-ellis, I updated the corpora with 8 files; each is about 128 GB after decompression (1 TB in total).

The CI still fails:

Error: The action 'Run tests' has timed out after 120 minutes.

Even though the corpora are large, I think the test only uses the -1k file (which is less than 100 KB)?

Member

The issue seems to be (probably) that your track is leaving indices in a non-green state, so apm, which comes a little later, sits and waits for all indices to be green.

You have shards and replicas hard-coded to 1:1. I would suggest making these configurable, with a default value of 0 for replicas. This will stop the indices from being yellow: IT tests run on a single node, and a replica is never allocated on the same node as its primary. If you feel that a default replica count of 0 is inappropriate, you can set it to 1 and then add a new IT file (I suggest calling it custom_config or similar) that overrides the number of replicas (made configurable as described above) to 0, similar to what is done here: https://github.com/elastic/rally-tracks/blob/master/it/test_security.py#L34
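The reasoning in this comment can be sketched in a few lines of Python. This is an illustration only, not the track's actual templating: the parameter names `number_of_shards` and `number_of_replicas` and both helper functions are hypothetical, and the health rule is simplified (it assumes all primaries are assigned and ignores allocation filters and disk watermarks).

```python
def index_settings(track_params=None):
    """Build index settings from track parameters, defaulting replicas to 0."""
    params = {"number_of_shards": 1, "number_of_replicas": 0}
    # Caller-supplied track parameters override the defaults.
    params.update(track_params or {})
    return {
        "index.number_of_shards": params["number_of_shards"],
        "index.number_of_replicas": params["number_of_replicas"],
    }


def expected_health(data_nodes, number_of_replicas):
    """Simplified rule: every copy of a shard needs its own node.

    A primary plus N replicas needs N + 1 nodes; with fewer nodes some
    replicas stay unassigned and the index is reported as yellow.
    """
    return "green" if number_of_replicas < data_nodes else "yellow"
```

With one node, `expected_health(1, 1)` gives `"yellow"` and `expected_health(1, 0)` gives `"green"`, which matches the single-node IT setup described above.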

Member

Regarding size, I don't have any concerns. It's your choice whether you have a single file or multiple, though note that Rally will still download all 8 files even if ingest-percentage is set to 12.5% (I believe, at least).

I think having a large corpus is a good thing. As mentioned, we have e.g. github_archive with over 6 TB (though it still needs merging into the public repo).
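For a sense of scale, the arithmetic behind the 12.5% figure can be written out. The numbers are the approximate ones quoted in this thread (8 files of about 128 GB each after decompression); the function name is hypothetical, and the sketch says nothing about how Rally distributes the percentage across files.

```python
FILES = 8
UNCOMPRESSED_GB_PER_FILE = 128  # approximate figure from the thread


def ingested_gb(ingest_percentage):
    """Uncompressed volume indexed for a given ingest-percentage value."""
    total_gb = FILES * UNCOMPRESSED_GB_PER_FILE  # about 1 TB in total
    return total_gb * ingest_percentage / 100
```

At 12.5%, the ingested volume is about one file's worth of data, while (per the comment above) all 8 files would still be downloaded.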

Member Author

> You have shards and replicas hard coded to 1:1 - i would suggest changing so these can be configured, and having default value for replicas as 0, this will stop the indices being yellow

Will look into this soon. Thank you

Member Author

Hi @gareth-ellis, just to confirm, you were referring to this:

"index.number_of_replicas": "1"

Member

wangch079 requested a review from gareth-ellis on May 1, 2025, 14:39
Member

gareth-ellis left a comment

LGTM, thanks!

gareth-ellis requested a review from Copilot on May 2, 2025, 05:36
Copilot AI left a comment

Pull Request Overview

This PR adds a new Big5 rally track to benchmark key Elasticsearch performance areas.

  • Introduces a new README.md file with details on text querying, sorting, date histogram, range queries, and terms aggregation.
  • Provides documentation on document structure and configurable track parameters.
Files not reviewed (6)
  • big5/challenges/default.json: Language not supported
  • big5/request_body/index_template/elasticsearch.json: Language not supported
  • big5/request_body/index_template/opensearch.json: Language not supported
  • big5/request_body/policy/elasticsearch.json: Language not supported
  • big5/request_body/policy/opensearch.json: Language not supported
  • big5/track.json: Language not supported

wangch079 merged commit d018973 into elastic:master on May 2, 2025
13 checks passed