
torch.stft() signature has been updated for PyTorch 1.7+ Please update PyTorch to remain compatible with later versions of NeMo. #2780

Closed
briebe opened this issue Sep 6, 2021 · 6 comments
Labels: bug (Something isn't working)

Comments

briebe commented Sep 6, 2021

Describe the bug

[NeMo W 2021-09-06 11:58:47 patch_utils:50] torch.stft() signature has been updated for PyTorch 1.7+
Please update PyTorch to remain compatible with later versions of NeMo.

and it is followed by:

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in _pad(input, pad, mode, value)
4157 assert len(pad) == 2, "3D tensors expect 2 values for padding"
4158 if mode == "reflect":
-> 4159 return torch._C._nn.reflection_pad1d(input, pad)
4160 elif mode == "replicate":
4161 return torch._C._nn.replication_pad1d(input, pad)

RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (256, 256) at dimension 2 of input [1, 2, 2]
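For reference, here is a minimal sketch (my own, not from the notebook) that reproduces the same padding error with an input as short as the one in the traceback:

import torch
import torch.nn.functional as F

# Hypothetical reproduction: reflect padding requires the padding size to be
# smaller than the corresponding input dimension, so a signal with only 2
# samples (input shape [1, 2, 2] above) cannot take a (256, 256) pad.
x = torch.randn(1, 2, 2)
F.pad(x, (256, 256), mode="reflect")  # raises the same RuntimeError

In other words, the immediate crash suggests an audio tensor that is far too short for the STFT's reflect padding, independent of the torch.stft() signature warning.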

Also in this notebook, besides the "AN4 source not available" problem:

Original Cell: restored_model.setup_finetune_model(config.model)

TypeError Traceback (most recent call last)

in ()
----> 1 restored_model.setup_finetune_model(config.model)

If I change it to
Cell: restored_model.setup_finetune_model(model_config=config.model)

TypeError: setup_finetune_model() missing 1 required positional argument: 'model_config'
NameError Traceback (most recent call last)
in ()
----> 1 restored_model.setup_finetune_model(self, model_config=config.model)

NameError: name 'self' is not defined

The same happens with this
Cell: restored_model.set_trainer(trainer_finetune)

TypeError Traceback (most recent call last)
in ()
----> 1 restored_model.set_trainer(trainer_finetune)
2 log_dir_finetune = exp_manager(trainer_finetune, config.get("exp_manager", None))
3 print(log_dir_finetune)

TypeError: set_trainer() missing 1 required positional argument: 'trainer'
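A hedged guess at what these two errors usually indicate (an assumption on my side, not verified against the notebook): a "missing 1 required positional argument" on a method call often means the method is being invoked on the class rather than on an instance, so the value passed gets consumed as self. On a proper EncDecSpeakerLabelModel instance the original calls should work as written, with no explicit self:

# Sketch under that assumption: restored_model must be a model *instance*
# (e.g. the object returned by restore_from()/load_from_checkpoint()), not the
# EncDecSpeakerLabelModel class itself.
restored_model.setup_finetune_model(config.model)
restored_model.set_trainer(trainer_finetune)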

Steps/Code to reproduce bug

Cell: trainer.fit(speaker_model) in
https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/speaker_recognition/Speaker_Recognition_Verification.ipynb

Expected behavior
(As expected by the people who made this notebook: Colab training should work without bug fixing. :-))

Torch 1.9 is installed; no updates seem to be possible.

Environment overview (please complete the following information)

torch @ https://download.pytorch.org/whl/cu102/torch-1.9.0%2Bcu102-cp37-cp37m-linux_x86_64.whl
torch-stft==0.1.4
torchaudio==0.9.0
torchmetrics==0.5.1
torchsummary==1.5.1
torchtext==0.10.0
torchvision @ https://download.pytorch.org/whl/cu102/torchvision-0.10.0%2Bcu102-cp37-cp37m-linux_x86_64.whl

Environment details

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

briebe added the bug label on Sep 6, 2021
nithinraok (Collaborator) commented Sep 15, 2021

I can't seem to reproduce the issue; it works fine on Colab. Could you rerun?
Updated link: https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb

briebe (Author) commented Sep 16, 2021

Just so you can be sure I didn't miss anything, I used "Run all" (cells).
Training seems to have worked and the final checkpoint could be loaded, but:

trainer.fit(speaker_model)

[NeMo I 2021-09-16 06:26:58 label_models:240] val_loss: 32.002

Epoch 4, global step 83: val_loss was not in top 3

It now runs without problems until:

"Restoring from a PyTorch Lightning checkpoint

To restore a model using the LightningModule.load_from_checkpoint() class method."

restored_model = nemo_asr.models.EncDecSpeakerLabelModel.load_from_checkpoint(final_checkpoint)


TypeError Traceback (most recent call last)

in ()
----> 1 restored_model = nemo_asr.models.EncDecSpeakerLabelModel.load_from_checkpoint(final_checkpoint)

2 frames

/usr/local/lib/python3.7/dist-packages/pytorch_lightning/core/saving.py in _load_model_state(cls, checkpoint, strict, cls_kwargs_new)
193 _cls_kwargs = {k: v for k, v in _cls_kwargs.items() if k in cls_init_args_name}
194
--> 195 model = cls(**_cls_kwargs)
196
197 # give model a chance to load something

TypeError: __init__() missing 1 required positional argument: 'cfg'
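In case it helps while this is sorted out, a hedged workaround sketch (the filename is an assumption, and this is not necessarily the tutorial's intended path): if a .nemo file was saved earlier with save_to(), the model can also be restored with NeMo's restore_from(), which does not go through Lightning's checkpoint loader:

# Assumed path for illustration only; use whatever save_to() actually wrote.
restored_model = nemo_asr.models.EncDecSpeakerLabelModel.restore_from("speaker_model.nemo")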

nithinraok (Collaborator) commented:
This looks to me like an issue with the latest pytorch-lightning. Can you manually run
!pip install pytorch_lightning==1.4.2 before the cell where it throws the error? Also, an import fix was provided with #2821.

briebe (Author) commented Sep 16, 2021

This fix brings us to the cell/code:

manifest_filepath = os.path.join(NEMO_ROOT,'embeddings_manifest.json')
device = 'cuda' if torch.cuda.is_available() else 'cpu'
get_embeddings(verification_model, manifest_filepath, batch_size=64,embedding_dir='./', device=device)


[NeMo I 2021-09-16 07:11:06 audio_to_label:445] Time length considered for collate func is 20
[NeMo I 2021-09-16 07:11:06 audio_to_label:446] Shift length considered for collate func is 0.75
[NeMo I 2021-09-16 07:11:06 collections:267] Filtered duration for loading collection is 0.000000.
[NeMo I 2021-09-16 07:11:06 collections:270] # 5 files loaded accounting to # 5 labels
[NeMo I 2021-09-16 07:11:06 label_models:126] Setting up identification parameters


NameError Traceback (most recent call last)

in ()
1 manifest_filepath = os.path.join(NEMO_ROOT,'embeddings_manifest.json')
2 device = 'cuda' if torch.cuda.is_available() else 'cpu'
----> 3 get_embeddings(verification_model, manifest_filepath, batch_size=64,embedding_dir='./', device=device)

in get_embeddings(speaker_model, manifest_file, batch_size, embedding_dir, device)
18 out_embeddings = {}
19
---> 20 for test_batch in tqdm(speaker_model.test_dataloader()):
21 test_batch = [x.to(device) for x in test_batch]
22 audio_signal, audio_signal_len, labels, slices = test_batch

NameError: name 'tqdm' is not defined
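As a local stopgap (my assumption: the only missing name in that notebook-defined helper is tqdm), adding the import in a cell before calling get_embeddings lets the loop run; the proper fix is of course the import change in the repository:

from tqdm import tqdm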

nithinraok (Collaborator) commented:
Please read my comment above; the import fix for that is provided through PR #2821.

briebe (Author) commented Sep 16, 2021

OK, I got you. I used the changes you made there and now it's running without problems! Great work!
I added myself to the finetuning and will see about the results. :-)
Related question:
I was trying to use the "hi-mia" dataset yesterday, because the AN4 source has not been very stable over the last week.
This is the first line of my test.json:

{"audio_filepath": "../rivaclient/NeMo/scripts/dataset_processing/data/dev/SPEECHDATA/wav/SV0280/SV0280_6_07_S3653.wav", "offset": 0, "duration": 1.488, "label": "SV0280"}

KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/asr/data/audio_to_label.py", line 364, in __getitem__
    t = torch.tensor(self.label2id[sample.label]).long()
KeyError: 'SV0280'

Is this related to today's fix? I will try later, thanks!
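A quick hedged check for this kind of KeyError (the file names below are assumptions): the error means the label from the manifest line is not in the dataset's label2id map, which usually happens when the labels in the evaluation manifest do not match the labels the model/dataset was configured with. Collecting the labels from both manifests makes any mismatch visible:

import json

def manifest_labels(path):
    # Each line of a NeMo manifest is a JSON object with a "label" field.
    with open(path) as f:
        return {json.loads(line)["label"] for line in f if line.strip()}

train_labels = manifest_labels("train.json")  # assumed filename
test_labels = manifest_labels("test.json")    # assumed filename
print("labels only in test.json:", sorted(test_labels - train_labels))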
