
Conversation

@alexeykudinkin (Contributor) commented Aug 30, 2025

Historically, `ParquetDatasource` has been fetching each individual file's Parquet footer metadata to obtain granular metadata about the dataset.

While laudable in principle, this is very inefficient in practice and manifests as extremely poor performance on very large datasets (tens to hundreds of thousands of files).

This change revisits that approach by (see the sketch after this list):

  • Removing metadata fetching as a step (and deprecating the components involved)
  • Cleaning up and streamlining some of the other utilities, further optimizing performance
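To make the trade-off concrete, here is a minimal sketch of the two strategies, assuming a PyArrow filesystem; the function names are hypothetical and not Ray Data's actual internals:

```python
import pyarrow.fs as pafs
import pyarrow.parquet as pq

# Old strategy (illustrative): open every file and read its Parquet
# footer. Each footer read costs extra round trips to storage, which
# dominates planning time once datasets reach 10s-100s of thousands
# of files.
def fetch_footer_metadata(paths, filesystem):
    metadata = []
    for path in paths:
        with filesystem.open_input_file(path) as f:
            metadata.append(pq.ParquetFile(f).metadata)  # footer read
    return metadata

# New strategy (illustrative): rely on the file listing alone. The
# listing already returns each file's on-disk size, so no per-file
# reads are needed to estimate the dataset size.
def estimate_dataset_size_bytes(root, filesystem):
    infos = filesystem.get_file_info(pafs.FileSelector(root, recursive=True))
    return sum(info.size for info in infos if info.is_file)
```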

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@alexeykudinkin alexeykudinkin requested a review from a team as a code owner August 30, 2025 01:24
@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Aug 30, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively removes the inefficient Parquet metadata fetching, which should significantly improve performance on large datasets. The code changes are well-structured, with logic nicely refactored into helper functions. I've identified a few minor issues, such as leftover debug statements and a TODO, which should be addressed for code hygiene. Overall, this is a great improvement.

@alexeykudinkin alexeykudinkin changed the title to [Data] Removing Parquet metadata fetching in ParquetDatasource Aug 30, 2025
@ray-gardener ray-gardener bot added the data Ray Data-related issues label Aug 30, 2025
Signed-off-by: Alexey Kudinkin <[email protected]>
@alexeykudinkin alexeykudinkin requested a review from a team as a code owner September 2, 2025 18:19
Signed-off-by: Alexey Kudinkin <[email protected]>
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) September 2, 2025 20:39
@alexeykudinkin alexeykudinkin merged commit cc08726 into ray-project:master Sep 2, 2025
5 of 6 checks passed
alexeykudinkin added a commit that referenced this pull request Sep 6, 2025

## Why are these changes needed?

This change is a follow-up to #56105.

Dataset size estimation is now based on listed file sizes. However, the encoding ratio was still based on file size estimates derived from the uncompressed data size obtained from Parquet metadata.

This change addresses that by (see the sketch after this list):

- Rebasing the encoding ratio to relate estimated in-memory size to the listed file size
- Cleaning up unused abstractions (like `ParquetMetadataProvider`)
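As a rough sketch of what the rebased estimate looks like (hypothetical helper names, not the actual Ray Data implementation), the ratio is now derived from, and applied to, the listed on-disk size rather than the uncompressed size read from Parquet footers:

```python
def update_encoding_ratio(in_memory_bytes: int, listed_file_size_bytes: int) -> float:
    # Relate the estimated in-memory size to the listed (on-disk) file
    # size, e.g. sampled from files actually read during execution.
    return in_memory_bytes / max(listed_file_size_bytes, 1)

def estimate_in_memory_size(listed_file_size_bytes: int, encoding_ratio: float) -> int:
    # Scale the listed file size by the ratio; no footer metadata is needed.
    return int(listed_file_size_bytes * encoding_ratio)
```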


Signed-off-by: Alexey Kudinkin <[email protected]>
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
alexeykudinkin added a commit that referenced this pull request Sep 23, 2025
alexeykudinkin added a commit that referenced this pull request Sep 23, 2025
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
snorkelopsstgtesting1-spec pushed a commit to snorkel-marlin-repos/ray-project_ray_pr_56299_264afdbd-dcea-472b-9966-b2fd4ffce1f5 that referenced this pull request Oct 22, 2025
snorkelopstesting2-coder pushed a commit to snorkel-marlin-repos/ray-project_ray_pr_56105_05e33cc6-0672-4d8a-826c-d64b6a3f24d1 that referenced this pull request Oct 22, 2025
snorkelopstesting3-bot added a commit to snorkel-marlin-repos/ray-project_ray_pr_56105_05e33cc6-0672-4d8a-826c-d64b6a3f24d1 that referenced this pull request Oct 22, 2025
…ParquetDatasource`

Merged from original PR #56105
Original: ray-project/ray#56105
bveeramani added a commit that referenced this pull request Oct 23, 2025
…se test

This test became broken after the removal of Parquet metadata fetching
tasks in #56105. The test relies on specific autoscaling behavior that
no longer works as expected without metadata fetching.

Signed-off-by: Balaji Veeramani <[email protected]>
bveeramani added a commit that referenced this pull request Oct 23, 2025
…ease test (#58048)

## Summary

This PR removes the `image_classification_chaos_no_scale_back` release
test and its associated setup script
(`setup_cluster_compute_config_updater.py`). This test has become
non-functional and is no longer providing useful signal.

## Background

The `image_classification_chaos_no_scale_back` release test was designed
to validate Ray Data's fault tolerance when many nodes abruptly get
preempted at the same time.

The test worked by:
1. Running on an autoscaling cluster with 1-10 nodes
2. Updating the compute config mid-test to downscale to 5 nodes
3. Asserting that there are dead nodes as a sanity check

## Why This Test Is Broken

After the removal of Parquet metadata fetching in #56105 (September 2,
2025), the autoscaling behavior changed significantly:

- **Before metadata removal**: The cluster would autoscale more
aggressively because metadata fetching created additional tasks that
triggered faster scale-up. The cluster would scale past 5 nodes, then
downscale, leaving dead nodes that the test could detect.

- **After metadata removal**: Without the metadata fetching tasks, the
cluster doesn't scale up fast enough to get past 5 nodes before the
downscale happens. This means there are no dead nodes to detect, causing
the test to fail.

## Why We're Removing It

1. **Test is fundamentally broken**: The test's assumptions about
autoscaling behavior are no longer valid after the metadata fetching
removal
2. **Not actively monitored**: The test is marked as unstable and isn't
closely watched

## Changes

- Removed `image_classification_chaos_no_scale_back` test from
`release/release_data_tests.yaml`
- Deleted
`release/nightly_tests/setup_cluster_compute_config_updater.py` (only
used by this test)

## Related

See #56105

Fixes #56528

Signed-off-by: Balaji Veeramani <[email protected]>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 27, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025

Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)


2 participants