Skip to content

Conversation

akshaychitneni
Copy link

What this PR does / why we need it:
Adds cache initializer in sdk

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes # kubeflow/trainer#2866

Checklist:

  • Docs included if any changes are user facing

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls
Copy link

Pull Request Test Coverage Report for Build 18136229986

Details

  • 1 of 11 (9.09%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-1.2%) to 70.546%

Changes Missing Coverage Covered Lines Changed/Added Lines %
kubeflow/trainer/utils/utils.py 1 11 9.09%
Files with Coverage Reduction New Missed Lines %
kubeflow/trainer/utils/utils.py 1 62.15%
Totals Coverage Status
Change from base Build 17979146135: -1.2%
Covered Lines: 297
Relevant Lines: 421

💛 - Coveralls

metadata_loc: str
schema_name: str
table_name: str
env: Optional[dict[str, str]] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to be explicit which optional values user can set rather than giving them arbitrary env configuration in DataCacheInitializer?

if not isinstance(dataset, types.HuggingFaceDatasetInitializer):
return None
if isinstance(dataset, types.HuggingFaceDatasetInitializer):
# TODO (andreyvelich): Support more parameters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you probably can remove this TODO for now.

Suggested change
# TODO (andreyvelich): Support more parameters.

),
)
return dataset_initializer
elif isinstance(dataset, types.DataCacheInitializer):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you probably can make this more generic, since ENV names are equal to the DataClass field names:

    envs = []
    for f in fields(dataset):
        name = f.name.upper()
        value = getattr(dataset, f.name)
        envs.append(models.IoK8sApiCoreV1EnvVar(name=name, value=str(value)))

Comment on lines 225 to 228
cluster_size: int
metadata_loc: str
schema_name: str
table_name: str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to leverage storage_uri to set location or it has some limitations ?

e.g.

cache://<CATALOG_NAME>/<DATABASE_NAME>/<TABLE_NAME>

I can imagine that metadata location can be set separately.

Copy link
Author

@akshaychitneni akshaychitneni Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, lets have uri to cache://<DATABASE_NAME>/<TABLE_NAME> and use metadata_loc

@andreyvelich
Copy link
Member

/milestone v0.2

@google-oss-prow google-oss-prow bot added this to the v0.2 milestone Oct 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants