Dataloader #83

Merged: 58 commits from the dataloader branch into v2 on Jul 24, 2024
Conversation

@kks32 (Contributor) commented Jun 27, 2024

Describe the PR
Supports both the npz and hdf5 data formats.

Related Issues/PRs
#82

Additional Context
Will remove data_loader.py once multinode training has been merged into train.py.
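
As a rough illustration of the npz/hdf5 support described above, a loader can dispatch on the file extension. The following is a minimal sketch with an assumed function name and data layout, not the PR's actual implementation:

    import numpy as np
    import h5py

    def load_trajectories(path):
        """Load arrays from an .npz or .h5/.hdf5 file into a dict keyed by name."""
        if path.endswith(".npz"):
            # NpzFile behaves like a lazy dict; .files lists the stored array names.
            with np.load(path, allow_pickle=True) as data:
                return {key: data[key] for key in data.files}
        elif path.endswith((".h5", ".hdf5")):
            # Read each top-level dataset fully into memory.
            with h5py.File(path, "r") as f:
                return {key: f[key][()] for key in f}
        else:
            raise ValueError(f"Unsupported data format: {path}")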

Review comment on the following code in gns/train.py:

    # Spawn training to GPUs
    distribute.spawn_train(train, cfg, world_size, device)
    torch.multiprocessing.set_start_method("spawn")
    verbose, world_size = distribute.setup(local_rank)

Collaborator commented:

Running it on a single node shows the following error:

Error executing job with overrides: []
Traceback (most recent call last):
  File "/work2/08264/baagee/frontera/gns-main/gns/train.py", line 817, in main
    verbose, world_size = distribute.setup(local_rank)
UnboundLocalError: local variable 'local_rank' referenced before assignment

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
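
One way to avoid this UnboundLocalError is to fall back to rank 0 whenever the launcher does not set a local rank. A minimal sketch assuming torchrun-style environment variables, not necessarily the fix adopted here:

    import os

    # LOCAL_RANK is set by torchrun-style launchers; default to 0 for a
    # plain single-node, single-process run so distribute.setup() still works.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    verbose, world_size = distribute.setup(local_rank)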

@yjchoi1 (Collaborator) commented Jul 10, 2024

Now it works with my custom data on a single node and on multiple nodes. I haven't checked with .h5 yet, though.

@yjchoi1 (Collaborator), Jul 10, 2024

In resume mode, it seems simulator.module.load() should be changed to simulator.load(), although the previous version of parallel GNS used to work with simulator.module.load() in the distributed setting.

Error executing job with overrides: []
Traceback (most recent call last):
  File "/work2/08264/baagee/frontera/gns-main/gns/train.py", line 843, in main
    train(local_rank, cfg, world_size, device, verbose, use_dist)
  File "/work2/08264/baagee/frontera/gns-main/gns/train.py", line 475, in train
    simulator.module.load(cfg.model.path + cfg.model.file)
  File "/work2/08264/baagee/frontera/venvs/venv-frontera-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LearnedSimulator' object has no attribute 'module'
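
A common way to support both the DistributedDataParallel-wrapped and plain simulator is to unwrap before loading. A minimal sketch, not necessarily what the updated code does:

    import torch

    # DistributedDataParallel exposes the underlying LearnedSimulator via
    # .module; a plain single-process simulator is loaded directly.
    if isinstance(simulator, torch.nn.parallel.DistributedDataParallel):
        simulator.module.load(cfg.model.path + cfg.model.file)
    else:
        simulator.load(cfg.model.path + cfg.model.file)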

@kks32 (Contributor, Author) replied:

@skye-glitch: Could you address this, please?

Collaborator replied:

Please check the updated code.

@yjchoi1 (Collaborator) commented Jul 10, 2024

Also, there is a minor error with the tqdm count.

When resuming, tqdm does not show the correct nstep. For example, if the model was saved at step 100/500000 in epoch 0, then resuming from model-100.pt makes the tqdm count start from 0/500000 rather than from 100/500000, even though the code correctly loads the model at step 100.
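
The resumed position can be passed to tqdm via its initial and total arguments. A minimal sketch, where resume_step and total_steps are placeholder names for the checkpointed step and the full training length:

    from tqdm import tqdm

    # Start the bar at the checkpointed step (e.g. 100) instead of 0 while
    # keeping the full displayed total (e.g. 500000).
    pbar = tqdm(range(resume_step, total_steps), initial=resume_step, total=total_steps)
    for step in pbar:
        ...  # one training iteration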

@skye-glitch (Collaborator) replied:

Regarding the tqdm count issue reported above: this should be fixed now.

@yjchoi1 (Collaborator) commented Jul 17, 2024

Another minor issue is that gns/args.py still has the n_gpus arg, although we no longer use it.

@skye-glitch (Collaborator) replied:

Regarding the unused n_gpus arg in gns/args.py: fixed, thanks!

@kks32 (Contributor, Author) commented Jul 17, 2024

@yjchoi1 Could you check to see if everything is good to merge?

@yjchoi1 (Collaborator) commented Jul 17, 2024

After addressing the above comment about the single-node issue, everything seems good to merge.

@kks32 merged commit 325fc8d into v2 on Jul 24, 2024 (1 check passed).
@kks32 deleted the dataloader branch on Jul 24, 2024 at 11:42.
Labels: Priority: Medium
3 participants