@JDarDagran JDarDagran commented Feb 8, 2025

Why are these changes needed?

Unlike FileBasedDatasource, ParquetDatasource currently does not persist its filesystem or its unresolved paths. This PR makes ParquetDatasource persist both, which extends the coverage of metadata collected about input data used in Ray.

Changes implemented

  • Added functionality to persist filesystem and unresolved path information for ParquetDatasource
  • Aligned ParquetDatasource behavior with FileBasedDatasource for consistency
  • Enhanced metadata collection for input data used in Ray
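
As a rough illustration of the change listed above, here is a minimal, self-contained sketch of the pattern: keep the caller's original paths and the filesystem on the datasource object, next to the resolved paths. It does not reproduce Ray's actual code. `SketchParquetDatasource`, `FakeFileSystem`, `resolve_paths`, and the attribute names are all hypothetical stand-ins chosen for this example, and the bucket path is made up.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union
from urllib.parse import urlparse


@dataclass(frozen=True)
class FakeFileSystem:
    """Hypothetical stand-in for a filesystem handle (not a Ray or PyArrow class)."""

    scheme: str


def resolve_paths(
    paths: List[str], filesystem: Optional[FakeFileSystem] = None
) -> Tuple[List[str], FakeFileSystem]:
    """Hypothetical resolver: strip the URI scheme and infer a filesystem from it."""
    resolved = []
    schemes = set()
    for path in paths:
        parsed = urlparse(path)
        schemes.add(parsed.scheme or "file")
        # For "s3://bucket/key" keep "bucket/key"; plain local paths pass through.
        resolved.append(parsed.netloc + parsed.path if parsed.scheme else path)
    if filesystem is None:
        if len(schemes) != 1:
            raise ValueError("all paths must share one scheme to infer a filesystem")
        filesystem = FakeFileSystem(schemes.pop())
    return resolved, filesystem


class SketchParquetDatasource:
    """Hypothetical datasource that persists its input metadata."""

    def __init__(
        self,
        paths: Union[str, List[str]],
        filesystem: Optional[FakeFileSystem] = None,
    ):
        if isinstance(paths, str):
            paths = [paths]
        # Persist the paths exactly as the caller supplied them,
        # before resolution rewrites them.
        self.unresolved_paths = list(paths)
        # Persist the filesystem as well, so it can be inspected later.
        self.resolved_paths, self.filesystem = resolve_paths(paths, filesystem)


if __name__ == "__main__":
    ds = SketchParquetDatasource("s3://bucket/data/part-0.parquet")
    print(ds.unresolved_paths)
    print(ds.resolved_paths)
    print(ds.filesystem)
```

Under these assumptions, a metadata collector can read the original input location and the filesystem from the datasource object after construction, rather than having to reconstruct them from the resolved paths.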

Benefits

  • Improved data traceability and reproducibility
  • Consistent behavior across datasource implementations
  • Enhanced debugging and analysis capabilities

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@JDarDagran JDarDagran requested a review from a team as a code owner February 8, 2025 01:16
@JDarDagran JDarDagran force-pushed the data/persist-parquet-datasource-metadata branch from c929f5c to b0b916f Compare February 8, 2025 22:08
@jcotant1 jcotant1 added the data Ray Data-related issues label Feb 9, 2025
@raulchen raulchen enabled auto-merge (squash) February 12, 2025 01:58
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Feb 12, 2025
@raulchen raulchen merged commit 1da7039 into ray-project:master Feb 12, 2025
7 checks passed
xsuler pushed a commit to antgroup/ant-ray that referenced this pull request Mar 4, 2025
## Why are these changes needed?

The `ParquetDatasource` currently does not persist information about the
filesystem and unresolved paths, unlike the `FileBasedDatasource`. This
PR aims to achieve greater coverage in collecting metadata about input
data used in Ray by allowing `ParquetDatasource` to persist this
information.


Signed-off-by: Jakub Dardzinski <[email protected]>
xsuler pushed a commit to antgroup/ant-ray that referenced this pull request Mar 4, 2025
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
Labels

  • community-backlog
  • data (Ray Data-related issues)
  • go (add ONLY when ready to merge, run all tests)
