Inspect training data without data indices #593

Merged
merged 10 commits into from
May 24, 2024

2015aroras
Collaborator

This PR updates the inspect_train_data.py script to enable inspecting training data when the device data indices are not present. Our runs save these indices locally but not in remote storage. The implementation has the following advantages:

  • We can look at training data without the data indices (the main point of this PR).
  • The script can be called with an S3 path when data indices are not present (nothing needs to be downloaded locally beforehand).
  • The script can look as far into the run as you would like without a perf hit.

When data indices are present, the script keeps its original behavior.
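To illustrate why inspecting a late step need not cost more than an early one, here is a minimal sketch that assumes the data order is a deterministic function of a seed. It is not the code in this PR: `global_order`, `batch_indices_for_step`, and the strided per-rank split are hypothetical stand-ins, and the real dataloader may shard batches differently.

```python
import random
from typing import List


def global_order(dataset_size: int, seed: int) -> List[int]:
    """Deterministically shuffle instance indices (hypothetical stand-in for the real data order)."""
    order = list(range(dataset_size))
    random.Random(seed).shuffle(order)
    return order


def batch_indices_for_step(
    order: List[int],
    step: int,
    global_batch_size: int,
    rank: int,
    world_size: int,
) -> List[int]:
    """Return the instance indices one rank would see at a 1-based step.

    Assumption: the global batch is split across ranks with a simple stride.
    """
    start = (step - 1) * global_batch_size
    batch = order[start : start + global_batch_size]
    return batch[rank::world_size]


if __name__ == "__main__":
    order = global_order(dataset_size=64, seed=0)
    # Jump straight to step 3 without iterating over steps 1 and 2.
    print(batch_indices_for_step(order, step=3, global_batch_size=8, rank=0, world_size=4))
```

Because the step is turned into a slice offset, looking up step 3 or step 300,000 takes the same amount of work once the order has been built.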

@2015aroras 2015aroras marked this pull request as ready for review May 23, 2024 23:43
@2015aroras 2015aroras requested a review from epwalsh May 23, 2024 23:43
os.environ["FS_LOCAL_RANK"] = "1"

for step in steps:
    dataloader = build_train_dataloader(cfg, world_size=world_size)
Member

This would have to rebuild the indices file every time, right? That could be slow, but we could probably avoid rebuilding for every rank.

Collaborator Author

@2015aroras 2015aroras May 24, 2024

Setting FS_LOCAL_RANK=1 avoids rebuilding the indices file every time, since that is only done for local FS rank 0.
# Set FS_LOCAL_RANK to a non-zero number so that global data indices are not rewritten
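The guard described in this reply can be sketched roughly as follows. This is an illustration under assumptions, not the repository's code: `fs_local_rank` and `maybe_write_global_indices` are hypothetical helpers, and the only detail taken from the discussion is that the indices file is written solely when the filesystem-local rank is 0.

```python
import os
from pathlib import Path
from typing import List


def fs_local_rank() -> int:
    """Read the filesystem-local rank from the environment (assumed to default to 0)."""
    return int(os.environ.get("FS_LOCAL_RANK", "0"))


def maybe_write_global_indices(path: Path, indices: List[int]) -> bool:
    """Write the indices file only on filesystem-local rank 0.

    Returns True if the file was written, False if the write was skipped.
    """
    if fs_local_rank() != 0:
        # Any non-zero rank leaves the (expensive) write to rank 0.
        return False
    path.write_text("\n".join(str(i) for i in indices))
    return True
```

With `FS_LOCAL_RANK` set to a non-zero value before the loop, every call made while building a dataloader would take the early-return branch, so nothing is rewritten per step.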

Member

Ah, right. Could you just add a comment explaining that for future reference?

Member

@epwalsh epwalsh left a comment

LGTM

@2015aroras 2015aroras merged commit 5789cfe into main May 24, 2024
10 of 12 checks passed
@2015aroras 2015aroras deleted the shanea/inspect-train-data-no-indices branch May 24, 2024 21:00
2 participants