[Hackathon 7th] Fix the data iteration error caused by empty folders that contain no *.npy files #3948

Open
wants to merge 2 commits into
base: develop
from

Conversation

megemini
Contributor

PR types

Bug fixes

PR changes

Others

Describe

Fix the data iteration error caused by empty folders that contain no *.npy files.

aistudio@jupyter-942478-8626068:~/PaddleSpeech/examples/other/ge2e$ CUDA_VISIBLE_DEVICES=0,1 ./local/train.sh ./dump ./output
data:
  audio_norm_target_dBFS: -30
  mel_window_length: 25
  mel_window_step: 10
  min_pad_coverage: 0.75
  n_mels: 40
  partial_n_frames: 160
  partial_overlap_ratio: 0.5
  sampling_rate: 16000
  vad_max_silence_length: 6
  vad_moving_average_width: 8
  vad_window_length: 30
model:
  embedding_size: 256
  hidden_size: 256
  num_layers: 3
training:
  learning_rate_init: 0.0001
  max_iteration: 1560000
  save_interval: 10000
  speakers_per_batch: 8
  utterances_per_speaker: 4
  valid_interval: 10000
Namespace(checkpoint_path=None, config=None, data='./dump', ngpu=1, opts=None, output='./output')
W1211 06:24:24.832052 28985 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.0, Runtime API Version: 11.8
W1211 06:24:24.833473 28985 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
Traceback (most recent call last):
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/speaker_verification_dataset.py", line 114, in __iter__
    tmp = next(us)
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/train.py", line 123, in <module>
    main(config, args)
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/train.py", line 108, in main
    main_sp(config, args)
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/train.py", line 101, in main_sp
    exp.run()
  File "/home/aistudio/PaddleSpeech/paddlespeech/t2s/training/experiment.py", line 210, in run
    self.train()
  File "/home/aistudio/PaddleSpeech/paddlespeech/t2s/training/experiment.py", line 194, in train
    self.new_epoch()
  File "/home/aistudio/PaddleSpeech/paddlespeech/t2s/training/experiment.py", line 186, in new_epoch
    self.iterator = iter(self.train_loader)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/reader.py", line 582, in __iter__
    return _DataLoaderIterMultiProcess(self)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 435, in __init__
    self._try_put_indices()
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 797, in _try_put_indices
    indices = next(self._sampler_iter)
RuntimeError: generator raised StopIteration

When MultiSpeakerMelDataset is initialized,

    def __init__(self, dataset_root: Path):
        self.root = Path(dataset_root).expanduser()
        speaker_dirs = [f for f in self.root.glob("*") if f.is_dir()]

        speaker_utterances = {
            speaker_dir: list(speaker_dir.glob("*.npy"))
            for speaker_dir in speaker_dirs
        }

if list(speaker_dir.glob("*.npy")) is an empty list, that is, the speaker_dir folder contains no npy data (no npy files were generated when the dataset was dumped), then during the later iteration

class MultiSpeakerSampler(BatchSampler):
...
    def __iter__(self):
        # yield list of Paths
        speaker_generator = iter(random_cycle(self._speakers))
        speaker_utterances_generator = {
            s: iter(random_cycle(us))
            for s, us in self._speaker_to_utterances.items()
        }

        while True:
            speakers = []
            for _ in range(self.speakers_per_batch):
                speakers.append(next(speaker_generator))

            utterances = []
            for s in speakers:
                us = speaker_utterances_generator[s]
                for _ in range(self.utterances_per_speaker):
                    utterances.append(next(us))  # StopIteration is raised here
            yield utterances

next(us) raises StopIteration on the empty utterance list and the iteration aborts.
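
The failure can be reproduced without Paddle. The sketch below is a minimal stand-in, not the PaddleSpeech code: `random_cycle_sketch`, `batches` and `first_batch_error` are hypothetical names, and `random_cycle_sketch` only assumes that the real `random_cycle` yields nothing for an empty input. When `next()` hits an exhausted generator inside another generator, Python 3.7 and later (PEP 479) converts the `StopIteration` into a `RuntimeError`, which matches the message in the traceback.

```python
import random


def random_cycle_sketch(items):
    """Hypothetical stand-in for random_cycle: endless shuffled passes over items."""
    saved = list(items)
    while saved:  # an empty list skips the loop, so the generator ends immediately
        random.shuffle(saved)
        yield from saved


def batches(speaker_to_utterances, utterances_per_speaker):
    """Simplified sampler loop: draw a fixed number of utterances per speaker."""
    generators = {
        s: random_cycle_sketch(us) for s, us in speaker_to_utterances.items()
    }
    while True:
        batch = []
        for s in generators:
            for _ in range(utterances_per_speaker):
                # next() on an exhausted generator raises StopIteration here;
                # inside a generator it resurfaces as RuntimeError (PEP 479)
                batch.append(next(generators[s]))
        yield batch


def first_batch_error(speaker_to_utterances):
    """Return the error message from drawing one batch, or None on success."""
    try:
        next(batches(speaker_to_utterances, 2))
    except RuntimeError as exc:
        return str(exc)
    return None
```

Calling `first_batch_error({"spk_a": ["u1.npy"], "spk_b": []})` reports the error, while a mapping without empty lists yields a batch normally.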

Meanwhile, when _DataLoaderIterMultiProcess is initialized,

        # init workers and indices queues and put 2 indices in each indices queue
        self._init_workers()
        for _ in range(self._outstanding_capacity):
            self._try_put_indices()  # StopIteration is raised here

        self._init_thread()
        self._shutdown = False

Because self._try_put_indices() above raises, the instance never initializes its self._shutdown attribute, which leads to the following error

Traceback (most recent call last):
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/train.py", line 123, in <module>
    main(config, args)
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/train.py", line 108, in main
    main_sp(config, args)
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/train.py", line 101, in main_sp
    exp.run()
  File "/home/aistudio/PaddleSpeech/paddlespeech/t2s/training/experiment.py", line 210, in run
    self.train()
  File "/home/aistudio/PaddleSpeech/paddlespeech/t2s/training/experiment.py", line 194, in train
    self.new_epoch()
  File "/home/aistudio/PaddleSpeech/paddlespeech/t2s/training/experiment.py", line 186, in new_epoch
    self.iterator = iter(self.train_loader)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/reader.py", line 582, in __iter__
    return _DataLoaderIterMultiProcess(self)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 433, in __init__
    self._try_put_indices()
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 792, in _try_put_indices
    indices = next(self._sampler_iter)
RuntimeError: generator raised StopIteration
Exception ignored in: <function _DataLoaderIterMultiProcess.__del__ at 0x7fdb8e537c10>
Traceback (most recent call last):
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 809, in __del__
    self._try_shutdown_all()
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 587, in _try_shutdown_all
    if not self._shutdown:
AttributeError: '_DataLoaderIterMultiProcess' object has no attribute '_shutdown'
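
The secondary AttributeError follows from ordinary Python object construction: an exception in `__init__` leaves the instance with only the attributes assigned before the failing line, and cleanup code that runs later still sees that half-built object. A minimal sketch, where `LoaderSketch` and its methods are hypothetical and only mirror the order of the statements quoted above:

```python
class LoaderSketch:
    """Hypothetical loader whose __init__ can fail before _shutdown is set."""

    def __init__(self, sampler_iter):
        self._indices = []
        # Mirrors the ordering in the quoted code: indices are fetched first ...
        self._indices.append(next(sampler_iter))
        # ... and _shutdown is assigned only afterwards.
        self._shutdown = False

    def try_shutdown_all(self):
        # Cleanup that assumes __init__ ran to completion.
        if not self._shutdown:
            self._shutdown = True


def shutdown_after_failed_init():
    """Build an instance whose __init__ fails, then attempt the cleanup."""
    obj = LoaderSketch.__new__(LoaderSketch)  # keep a handle on the instance
    try:
        obj.__init__(iter([]))  # empty sampler: next() raises StopIteration
    except StopIteration:
        pass
    try:
        obj.try_shutdown_all()
    except AttributeError as exc:
        return str(exc)
    return None
```

In the real traceback the cleanup is triggered from `__del__`, which is why the interpreter prints it as "Exception ignored in" instead of propagating it.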

To sum up, this PR filters out the folders with no data when MultiSpeakerMelDataset is initialized, after which the command runs normally

aistudio@jupyter-942478-8626068:~/PaddleSpeech/examples/other/ge2e$ CUDA_VISIBLE_DEVICES=0,1 ./local/train.sh ./dump ./output
data:
  audio_norm_target_dBFS: -30
  mel_window_length: 25
  mel_window_step: 10
  min_pad_coverage: 0.75
  n_mels: 40
  partial_n_frames: 160
  partial_overlap_ratio: 0.5
  sampling_rate: 16000
  vad_max_silence_length: 6
  vad_moving_average_width: 8
  vad_window_length: 30
model:
  embedding_size: 256
  hidden_size: 256
  num_layers: 3
training:
  learning_rate_init: 0.0001
  max_iteration: 1560000
  save_interval: 10000
  speakers_per_batch: 8
  utterances_per_speaker: 4
  valid_interval: 10000
Namespace(checkpoint_path=None, config=None, data='./dump', ngpu=1, opts=None, output='./output')
W1211 06:32:50.198007 30136 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.0, Runtime API Version: 11.8
W1211 06:32:50.199373 30136 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
[2024-12-11 06:32:55] [INFO] [train.py:81] Rank: 0, step: 1, time: 0.163s/1.387s, loss: 2.091243 err: 0.514369
[2024-12-11 06:32:55] [INFO] [train.py:81] Rank: 0, step: 2, time: 0.000s/0.031s, loss: 2.090077 err: 0.501953
[2024-12-11 06:32:55] [INFO] [train.py:81] Rank: 0, step: 3, time: 0.000s/0.031s, loss: 2.089049 err: 0.502930
[2024-12-11 06:32:55] [INFO] [train.py:81] Rank: 0, step: 4, time: 0.000s/0.034s, loss: 2.088074 err: 0.502790
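
The diff itself is not shown in this conversation. One way the filtering described above could be written is sketched below; `collect_speaker_utterances` is a hypothetical helper name, and only the glob pattern and the directory scan are taken from the quoted `__init__`.

```python
from pathlib import Path
from typing import Dict, List


def collect_speaker_utterances(dataset_root) -> Dict[Path, List[Path]]:
    """Map each speaker directory to its *.npy files, skipping empty ones."""
    root = Path(dataset_root).expanduser()
    speaker_dirs = [f for f in root.glob("*") if f.is_dir()]

    speaker_utterances = {
        speaker_dir: list(speaker_dir.glob("*.npy"))
        for speaker_dir in speaker_dirs
    }
    # Drop speakers with no utterances so the sampler never cycles an empty list.
    return {d: us for d, us in speaker_utterances.items() if us}
```

Silently dropping directories keeps training running but hides the fact that the dump step produced incomplete data.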

@zxcd @Liyulingyue @enkilee @GreatV @yinfan98


paddle-bot bot commented Dec 11, 2024

Thanks for your contribution!

@mergify mergify bot added the Vector SID/LID/etc. label Dec 11, 2024
@zxcd
Collaborator

zxcd commented Dec 12, 2024

Why would there be folders with no data?

@megemini
Contributor Author

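> Why would there be folders with no data?

Good question ~ Hmm... I'm not sure ~

It may be because the data I tested with is incomplete ~

The datasets this example originally uses (Librispeech-other-500, VoxCeleb, VoxCeleb2, ai-datatang-200zh, magicdata) are too large, so I substituted librispeech train-clean-100 and test-clean ~ Some of the dumped folders then ended up without any npy files ~

Either way, adding a guard here shouldn't do much harm, should it? ~ 🫠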

@zxcd
Collaborator

zxcd commented Dec 18, 2024

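> > Why would there be folders with no data?
>
> Good question ~ Hmm... I'm not sure ~
>
> It may be because the data I tested with is incomplete ~
>
> The datasets this example originally uses (Librispeech-other-500, VoxCeleb, VoxCeleb2, ai-datatang-200zh, magicdata) are too large, so I substituted librispeech train-clean-100 and test-clean ~ Some of the dumped folders then ended up without any npy files ~
>
> Either way, adding a guard here shouldn't do much harm, should it? ~ 🫠

I don't recommend adding a guard here. If many of them are empty, the guard will only lead to data mismatch errors later on, which are even harder to track down.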

@megemini
Contributor Author

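> > > Why would there be folders with no data?
> >
> > Good question ~ Hmm... I'm not sure ~
> > It may be because the data I tested with is incomplete ~
> > The datasets this example originally uses (Librispeech-other-500, VoxCeleb, VoxCeleb2, ai-datatang-200zh, magicdata) are too large, so I substituted librispeech train-clean-100 and test-clean ~ Some of the dumped folders then ended up without any npy files ~
> > Either way, adding a guard here shouldn't do much harm, should it? ~ 🫠
>
> I don't recommend adding a guard here. If many of them are empty, the guard will only lead to data mismatch errors later on, which are even harder to track down.

So are empty folders allowed here or not?

If they are allowed, what do we do about the error raised on the framework side?

If they are not allowed, should we raise an error?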

@zxcd
Collaborator

zxcd commented Dec 24, 2024

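> > > > Why would there be folders with no data?
> > >
> > > Good question ~ Hmm... I'm not sure ~
> > > It may be because the data I tested with is incomplete ~
> > > The datasets this example originally uses (Librispeech-other-500, VoxCeleb, VoxCeleb2, ai-datatang-200zh, magicdata) are too large, so I substituted librispeech train-clean-100 and test-clean ~ Some of the dumped folders then ended up without any npy files ~
> > > Either way, adding a guard here shouldn't do much harm, should it? ~ 🫠
> >
> > I don't recommend adding a guard here. If many of them are empty, the guard will only lead to data mismatch errors later on, which are even harder to track down.
>
> So are empty folders allowed here or not?
>
> If they are allowed, what do we do about the error raised on the framework side?
>
> If they are not allowed, should we raise an error?

Let's raise an error.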

@megemini
Contributor Author

megemini commented Dec 24, 2024

> Let's raise an error.

done ~
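
The final change is not shown in this conversation. A sketch of what raising an error could look like, assuming a hypothetical helper name `load_speaker_utterances_strict` and an assumed `ValueError` as the exception type:

```python
from pathlib import Path
from typing import Dict, List


def load_speaker_utterances_strict(dataset_root) -> Dict[Path, List[Path]]:
    """Map speaker directories to *.npy files; fail if any directory is empty."""
    root = Path(dataset_root).expanduser()
    speaker_utterances = {}
    empty_dirs = []
    for speaker_dir in sorted(f for f in root.glob("*") if f.is_dir()):
        utterances = list(speaker_dir.glob("*.npy"))
        if utterances:
            speaker_utterances[speaker_dir] = utterances
        else:
            empty_dirs.append(speaker_dir)

    if empty_dirs:
        # Report every empty directory at once so the dump can be fixed in one pass.
        names = ", ".join(str(d) for d in empty_dirs)
        raise ValueError(f"No *.npy files found in speaker directories: {names}")
    return speaker_utterances
```

Failing at dataset construction points directly at the incomplete dump, instead of surfacing later as `RuntimeError: generator raised StopIteration` inside the data loader.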

2 participants