@homksei (Contributor) commented Nov 26, 2025

Description

This PR updates the dataset downloading mechanism to ensure data integrity by implementing SHA256 checksum verification. It replaces the custom retrieve function with sklearn.datasets._base.fetch_file.

Changes:

  • sklbench/datasets/downloaders.py:

    • Modified download_and_read_csv to accept a tuple containing (filename, url, sha256) instead of a raw URL.
    • Replaced the local retrieve function with sklearn.datasets._base.fetch_file to handle downloads and hash validation.
    • Added logging for download operations.
  • sklbench/datasets/loaders.py:

    • Updated all dataset loading functions (e.g., load_airline_depdelay, load_hepmass, load_higgs, load_sift, etc.) to provide the specific filename, base URL, and corresponding SHA256 hash.
    • Refactored load_ann_dataset_template to support the new metadata structure.
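As a rough illustration of the (filename, url, sha256) flow described above, here is a minimal standard-library sketch. It is not the code from this PR and it does not call scikit-learn's fetch_file. `RemoteFile`, `sha256_of`, and `fetch_verified` are hypothetical names used for illustration only, and deleting the file on a hash mismatch is an assumed behavior rather than something the PR states.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import NamedTuple


class RemoteFile(NamedTuple):
    """Hypothetical container mirroring the (filename, url, sha256) tuple."""

    filename: str
    url: str
    sha256: str


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets are never fully loaded in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_verified(spec: RemoteFile, folder: Path) -> Path:
    """Download spec.url into folder (if not cached) and check its SHA256."""
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / spec.filename
    if not target.exists():
        urllib.request.urlretrieve(spec.url, str(target))
    actual = sha256_of(target)
    if actual != spec.sha256:
        # Assumption: drop the bad file so a later call downloads it again.
        target.unlink()
        raise ValueError(
            f"SHA256 mismatch for {spec.filename}: "
            f"expected {spec.sha256}, got {actual}"
        )
    return target
```

A loader would then pass one such tuple per dataset file, and a corrupted or tampered download would surface as an exception before any CSV parsing takes place.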

Motivation:
To prevent the use of corrupted or tampered data files and to standardize the download logic on scikit-learn's internal utilities.


Checklist:

Completeness and readability

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

@david-cortes-intel (Contributor)

/intelci: run

@david-cortes-intel (Contributor)

/intelci: run ml-benchmarks
