[BUG] save_time_based_splits function does not support CPU mode well #789

Open
lmatejka opened this issue Sep 23, 2024 · 0 comments
Labels
bug Something isn't working status/needs-triage

Comments

@lmatejka
Bug description

The function save_time_based_splits in data_utils.py does not handle CPU mode correctly. In particular, _save_time_based_splits_cpu assumes that the RAPIDS libraries are available, and the Dask DataFrame import appears to be incorrect.
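
For illustration, one common way to keep a CPU code path free of RAPIDS dependencies is to resolve the DataFrame backend lazily, based on the cpu flag. The sketch below is not the Transformers4Rec implementation; the helper names resolve_dataframe_backend and load_dataframe_backend are hypothetical, and it assumes pandas is the intended CPU backend and cudf the GPU one.

```python
import importlib
import importlib.util


def resolve_dataframe_backend(cpu: bool) -> str:
    """Return the name of the DataFrame module to use (hypothetical helper)."""
    if cpu:
        # Assumption: CPU mode should rely on pandas only and never require RAPIDS.
        return "pandas"
    # Only look for cudf when the GPU path was explicitly requested.
    if importlib.util.find_spec("cudf") is None:
        raise ImportError(
            "cudf is required when cpu=False; pass cpu=True to use pandas instead"
        )
    return "cudf"


def load_dataframe_backend(cpu: bool):
    """Import the chosen backend lazily, so CPU-only installs never touch cudf."""
    return importlib.import_module(resolve_dataframe_backend(cpu))
```

With this pattern, a call such as load_dataframe_backend(cpu=True) would import pandas without ever attempting to import cudf, which is the behavior the CPU path seems to be missing.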

Steps/Code to reproduce bug

Use the code from the example notebook, with the cpu option set to True (https://github.com/NVIDIA-Merlin/Transformers4Rec/blob/main/examples/getting-started-session-based/01-ETL-with-NVTabular.ipynb):

from transformers4rec.utils.data_utils import save_time_based_splits

sessions_gdf = df.read_parquet(BASE_PATH / "processed_nvt/part_0.parquet")

save_time_based_splits(
    data=nvt.Dataset(sessions_gdf),
    output_dir=BASE_PATH / "session_by_day",
    partition_col="day-first",
    timestamp_col="session_id",
    cpu=True,
)

Expected behavior

No exception is thrown and the data is split.

Environment details

  • Transformers4Rec version: 23.12.0
  • Platform: Ubuntu 20.04.3 LTS
  • Python version: 3.8.10
  • Huggingface Transformers version: 4.30.2
  • PyTorch version (GPU?): 2.4.1
  • Tensorflow version (GPU?): 2.7.0

Additional context

@lmatejka lmatejka added bug Something isn't working status/needs-triage labels Sep 23, 2024