big5/track.json
"source-file": "logs.ndjson.bz2",
"document-count": 1131862,
"compressed-bytes": 57764621,
"uncompressed-bytes": 1047614086
I added this file to https://rally-tracks.elastic.co/big5 for testing purposes. We will regenerate a much larger data set.
It would be good to get the test-mode corpus there too (that is, the -1k variants); this will allow the full CI to run as well.
Thank you @gareth-ellis, I see the CI failures.
I will generate a proper data set and add -1k files for testing at the same time.
Hi @gareth-ellis, I updated the corpora with 8 files; each is about 128 GB after decompression (1 TB in total).
The CI still fails:
Error: The action 'Run tests' has timed out after 120 minutes.
Even though the corpora size is large, I think the test only uses the -1k file (which is less than 100 KB)?
The issue seems to be more that (probably) your track is leaving indices in a non-green state, so apm, which runs a little later, sits and waits for all indices to be green.
You have shards and replicas hard-coded to 1:1. I would suggest making these configurable, with a default value of 0 for replicas; this will stop the indices from being yellow (since IT tests run with a single node, we won't automatically allocate a replica on the same node as a primary). If you feel that a default replica count of 0 is inappropriate, you can set it to 1 and then add a new IT file (I suggest calling it custom_config or similar) that overrides the number of replicas (made configurable as mentioned above) to 0, similar to what is done here: https://github.com/elastic/rally-tracks/blob/master/it/test_security.py#L34
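As a rough illustration of the suggestion above, the index settings could be templated along these lines. This is only a sketch: the parameter names `number_of_shards` and `number_of_replicas` and the default values are assumptions borrowed from the convention used in other tracks, not the actual big5 template. The file is a Jinja2 template that Rally renders into JSON, so it is not valid JSON until rendered.

```json
{
  "settings": {
    "index.number_of_shards": {{number_of_shards | default(1)}},
    "index.number_of_replicas": {{number_of_replicas | default(0)}}
  }
}
```

With a default of 0 replicas, a single-node IT cluster has nothing left unassigned, so the indices can reach green. A run that needs replicas could then override the value through track parameters, for example `--track-params="number_of_replicas:1"`.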
Regarding the points about size, I don't have any concerns. It's your choice whether you have a single file or multiple, though note that Rally will still download all 8 files even if ingest-percentage is set to 12.5% (I believe, at least).
I think having a large corpus is a good thing; as mentioned, we have e.g. github_archive with over 6 TB (though it still needs merging into the public repo).
You have shards and replicas hard-coded to 1:1. I would suggest making these configurable, with a default value of 0 for replicas; this will stop the indices from being yellow
Will look into this soon. Thank you.
Hi @gareth-ellis, just to confirm, were you referring to this:
Indeed, you can provide parameters as we do, e.g., here: https://github.com/elastic/rally-tracks/blob/master/geonames/index.json#L8
Pull Request Overview
This PR adds a new Big5 rally track to benchmark key Elasticsearch performance areas.
- Introduces a new README.md file with details on text querying, sorting, date histogram, range queries, and terms aggregation.
- Provides documentation on document structure and configurable track parameters.
Files not reviewed (6)
- big5/challenges/default.json: Language not supported
- big5/request_body/index_template/elasticsearch.json: Language not supported
- big5/request_body/index_template/opensearch.json: Language not supported
- big5/request_body/policy/elasticsearch.json: Language not supported
- big5/request_body/policy/opensearch.json: Language not supported
- big5/track.json: Language not supported
https://github.com/elastic/search-developer-productivity/issues/3789
This PR adds the Big5 Rally track to benchmark the five essential areas: