[Hackathon 7th] Fix data iteration error caused by empty folders containing no `*.npy` files #3948

megemini · 2024-12-11T06:33:26Z

PR types

Bug fixes

PR changes

Others

Describe

Fixes the data iteration error caused by empty folders that contain no *.npy files.

aistudio@jupyter-942478-8626068:~/PaddleSpeech/examples/other/ge2e$ CUDA_VISIBLE_DEVICES=0,1 ./local/train.sh ./dump ./output
data:
  audio_norm_target_dBFS: -30
  mel_window_length: 25
  mel_window_step: 10
  min_pad_coverage: 0.75
  n_mels: 40
  partial_n_frames: 160
  partial_overlap_ratio: 0.5
  sampling_rate: 16000
  vad_max_silence_length: 6
  vad_moving_average_width: 8
  vad_window_length: 30
model:
  embedding_size: 256
  hidden_size: 256
  num_layers: 3
training:
  learning_rate_init: 0.0001
  max_iteration: 1560000
  save_interval: 10000
  speakers_per_batch: 8
  utterances_per_speaker: 4
  valid_interval: 10000
Namespace(checkpoint_path=None, config=None, data='./dump', ngpu=1, opts=None, output='./output')
W1211 06:24:24.832052 28985 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.0, Runtime API Version: 11.8
W1211 06:24:24.833473 28985 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
Traceback (most recent call last):
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/speaker_verification_dataset.py", line 114, in __iter__
    tmp = next(us)
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/train.py", line 123, in <module>
    main(config, args)
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/train.py", line 108, in main
    main_sp(config, args)
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/train.py", line 101, in main_sp
    exp.run()
  File "/home/aistudio/PaddleSpeech/paddlespeech/t2s/training/experiment.py", line 210, in run
    self.train()
  File "/home/aistudio/PaddleSpeech/paddlespeech/t2s/training/experiment.py", line 194, in train
    self.new_epoch()
  File "/home/aistudio/PaddleSpeech/paddlespeech/t2s/training/experiment.py", line 186, in new_epoch
    self.iterator = iter(self.train_loader)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/reader.py", line 582, in __iter__
    return _DataLoaderIterMultiProcess(self)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 435, in __init__
    self._try_put_indices()
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 797, in _try_put_indices
    indices = next(self._sampler_iter)
RuntimeError: generator raised StopIteration

When MultiSpeakerMelDataset is initialized,

    def __init__(self, dataset_root: Path):
        self.root = Path(dataset_root).expanduser()
        speaker_dirs = [f for f in self.root.glob("*") if f.is_dir()]

        speaker_utterances = {
            speaker_dir: list(speaker_dir.glob("*.npy"))
            for speaker_dir in speaker_dirs
        }

if list(speaker_dir.glob("*.npy")) is an empty list, i.e. the speaker_dir folder contains no npy data (no npy files were generated when the dataset was dumped), then during the subsequent iteration

class MultiSpeakerSampler(BatchSampler):
...
    def __iter__(self):
        # yield list of Paths
        speaker_generator = iter(random_cycle(self._speakers))
        speaker_utterances_generator = {
            s: iter(random_cycle(us))
            for s, us in self._speaker_to_utterances.items()
        }

        while True:
            speakers = []
            for _ in range(self.speakers_per_batch):
                speakers.append(next(speaker_generator))

            utterances = []
            for s in speakers:
                us = speaker_utterances_generator[s]
                for _ in range(self.utterances_per_speaker):
                    utterances.append(next(us)) # StopIteration is raised here
            yield utterances

the iteration aborts.
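
The failure mode can be reproduced outside PaddleSpeech with a minimal sketch. The `random_cycle` and `batches` functions below are simplified stand-ins written for illustration, not the project's actual code; the assumption is that the real `random_cycle` likewise yields nothing for an empty input. Inside a generator function, Python (PEP 479) converts an escaping StopIteration into `RuntimeError: generator raised StopIteration`, which matches the traceback above.

```python
import random


def random_cycle(iterable):
    """Stand-in helper: cycle over the items forever, reshuffling each pass.

    For an empty input the loop body never runs, so the generator
    finishes immediately without yielding anything.
    """
    saved = list(iterable)
    while saved:
        random.shuffle(saved)
        for element in saved:
            yield element


def batches(speaker_to_utterances, utterances_per_speaker=2):
    """Simplified sampler: one speaker per batch, several utterances each."""
    generators = {
        s: iter(random_cycle(us)) for s, us in speaker_to_utterances.items()
    }
    while True:
        for us in generators.values():
            batch = []
            for _ in range(utterances_per_speaker):
                # For a speaker with no utterances, next() raises
                # StopIteration; because we are inside a generator function,
                # the caller sees RuntimeError instead.
                batch.append(next(us))
            yield batch
```

Calling `next(batches({"spk": []}))` raises RuntimeError, while a speaker with at least one utterance keeps producing batches indefinitely.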

Meanwhile, when _DataLoaderIterMultiProcess is initialized,

        # init workers and indices queues and put 2 indices in each indices queue
        self._init_workers()
        for _ in range(self._outstanding_capacity):
            self._try_put_indices() # fails here (RuntimeError: generator raised StopIteration)

        self._init_thread()
        self._shutdown = False

Because self._try_put_indices() above raises, the instance never initializes its self._shutdown attribute, which leads to a second error:

Traceback (most recent call last):
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/train.py", line 123, in <module>
    main(config, args)
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/train.py", line 108, in main
    main_sp(config, args)
  File "/home/aistudio/PaddleSpeech/paddlespeech/vector/exps/ge2e/train.py", line 101, in main_sp
    exp.run()
  File "/home/aistudio/PaddleSpeech/paddlespeech/t2s/training/experiment.py", line 210, in run
    self.train()
  File "/home/aistudio/PaddleSpeech/paddlespeech/t2s/training/experiment.py", line 194, in train
    self.new_epoch()
  File "/home/aistudio/PaddleSpeech/paddlespeech/t2s/training/experiment.py", line 186, in new_epoch
    self.iterator = iter(self.train_loader)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/reader.py", line 582, in __iter__
    return _DataLoaderIterMultiProcess(self)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 433, in __init__
    self._try_put_indices()
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 792, in _try_put_indices
    indices = next(self._sampler_iter)
RuntimeError: generator raised StopIteration
Exception ignored in: <function _DataLoaderIterMultiProcess.__del__ at 0x7fdb8e537c10>
Traceback (most recent call last):
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 809, in __del__
    self._try_shutdown_all()
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/io/dataloader/dataloader_iter.py", line 587, in _try_shutdown_all
    if not self._shutdown:
AttributeError: '_DataLoaderIterMultiProcess' object has no attribute '_shutdown'
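
The secondary AttributeError follows from a general Python behavior: when `__init__` raises partway through, the object already exists, and any later method call (including one made from `__del__`) sees only the attributes assigned before the failure. The toy class below is a hypothetical illustration of that pattern, not Paddle's implementation.

```python
class HalfBuiltIter:
    """Toy iterator whose __init__ may fail before all attributes are set."""

    def __init__(self, fail):
        if fail:
            # Mirrors the point where _try_put_indices() raised.
            raise RuntimeError("generator raised StopIteration")
        self._shutdown = False  # never reached when fail is True

    def try_shutdown_all(self):
        # On a half-built instance this attribute lookup raises
        # AttributeError, because _shutdown was never assigned.
        if not self._shutdown:
            self._shutdown = True


def build_half_initialized():
    """Return an instance whose __init__ failed, as a finalizer would see it."""
    obj = HalfBuiltIter.__new__(HalfBuiltIter)
    try:
        obj.__init__(fail=True)
    except RuntimeError:
        pass
    return obj
```

Calling `build_half_initialized().try_shutdown_all()` raises AttributeError, the same shape as the "Exception ignored in ... __del__" message in the log.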

In summary, this PR filters out folders with no data when MultiSpeakerMelDataset is initialized, after which the command runs normally:

aistudio@jupyter-942478-8626068:~/PaddleSpeech/examples/other/ge2e$ CUDA_VISIBLE_DEVICES=0,1 ./local/train.sh ./dump ./output
data:
  audio_norm_target_dBFS: -30
  mel_window_length: 25
  mel_window_step: 10
  min_pad_coverage: 0.75
  n_mels: 40
  partial_n_frames: 160
  partial_overlap_ratio: 0.5
  sampling_rate: 16000
  vad_max_silence_length: 6
  vad_moving_average_width: 8
  vad_window_length: 30
model:
  embedding_size: 256
  hidden_size: 256
  num_layers: 3
training:
  learning_rate_init: 0.0001
  max_iteration: 1560000
  save_interval: 10000
  speakers_per_batch: 8
  utterances_per_speaker: 4
  valid_interval: 10000
Namespace(checkpoint_path=None, config=None, data='./dump', ngpu=1, opts=None, output='./output')
W1211 06:32:50.198007 30136 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.0, Runtime API Version: 11.8
W1211 06:32:50.199373 30136 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
[2024-12-11 06:32:55] [INFO] [train.py:81] Rank: 0, step: 1, time: 0.163s/1.387s, loss: 2.091243 err: 0.514369
[2024-12-11 06:32:55] [INFO] [train.py:81] Rank: 0, step: 2, time: 0.000s/0.031s, loss: 2.090077 err: 0.501953
[2024-12-11 06:32:55] [INFO] [train.py:81] Rank: 0, step: 3, time: 0.000s/0.031s, loss: 2.089049 err: 0.502930
[2024-12-11 06:32:55] [INFO] [train.py:81] Rank: 0, step: 4, time: 0.000s/0.034s, loss: 2.088074 err: 0.502790
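
The filtering described above might look roughly like the following sketch. It is not the PR's actual diff: the function name `collect_speaker_utterances` is hypothetical, and the body only mirrors the `__init__` snippet quoted earlier with one extra filtering step.

```python
from pathlib import Path


def collect_speaker_utterances(dataset_root):
    """Map each speaker folder to its *.npy files, skipping empty folders."""
    root = Path(dataset_root).expanduser()
    speaker_dirs = [f for f in root.glob("*") if f.is_dir()]

    speaker_utterances = {
        speaker_dir: list(speaker_dir.glob("*.npy"))
        for speaker_dir in speaker_dirs
    }
    # Keep only speakers that have at least one utterance, so the sampler
    # never builds a cycle over an empty list.
    return {d: us for d, us in speaker_utterances.items() if us}
```

Silently dropping folders keeps training running, at the cost of hiding an incomplete dump from the user.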

@zxcd @Liyulingyue @enkilee @GreatV @yinfan98

paddle-bot · 2024-12-11T06:33:32Z

Thanks for your contribution!

zxcd · 2024-12-12T07:10:39Z

Why would there be folders with no data?

megemini · 2024-12-12T09:38:25Z

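> Why would there be folders with no data?

Good question. Hmm... I don't know.

It may be because the data I tested with is incomplete.

The datasets this example originally uses (Librispeech-other-500, VoxCeleb, VoxCeleb2, ai-datatang-200zh, magicdata) are too large, so I substituted librispeech train-clean-100 and test-clean. The dumped output then contained folders without any npy files.

Either way, adding a guard here seems fairly harmless? 🫠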

zxcd · 2024-12-18T06:37:41Z

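> Good question. Hmm... I don't know.
>
> It may be because the data I tested with is incomplete.
>
> The datasets this example originally uses (Librispeech-other-500, VoxCeleb, VoxCeleb2, ai-datatang-200zh, magicdata) are too large, so I substituted librispeech train-clean-100 and test-clean. The dumped output then contained folders without any npy files.
>
> Either way, adding a guard here seems fairly harmless? 🫠

I would not recommend adding a guard here. If many folders are empty, the guard only leads to data-mismatch errors later on, which are even harder to track down.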

megemini · 2024-12-18T06:48:15Z

> I would not recommend adding a guard here. If many folders are empty, the guard only leads to data-mismatch errors later on, which are even harder to track down.

So are empty folders allowed here or not?

If they are allowed, what do we do about the error raised on the framework side?

If they are not allowed, should we raise an error?

zxcd · 2024-12-24T06:22:35Z

> So are empty folders allowed here or not?
>
> If they are allowed, what do we do about the error raised on the framework side?
>
> If they are not allowed, should we raise an error?

Let's raise an error.

megemini · 2024-12-24T07:28:20Z

> Let's raise an error.

Done ~
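
A check of the kind agreed on in this thread could be sketched as follows. This is an assumption about the shape of the change, not the merged code: the follow-up commit title suggests an assertion, whereas this sketch raises ValueError so the check also survives `python -O`; the function name `load_speaker_utterances` and the message text are hypothetical.

```python
from pathlib import Path


def load_speaker_utterances(dataset_root):
    """Map each speaker folder to its *.npy files, rejecting empty folders."""
    root = Path(dataset_root).expanduser()
    speaker_utterances = {
        d: sorted(d.glob("*.npy")) for d in root.glob("*") if d.is_dir()
    }
    # Fail early and name the offending folders, instead of letting the
    # sampler hit StopIteration deep inside the data loader.
    empty = sorted(str(d) for d, us in speaker_utterances.items() if not us)
    if empty:
        raise ValueError(f"speaker folders without *.npy files: {empty}")
    return speaker_utterances
```

Failing at dataset construction points the user at the incomplete dump directly, which addresses the reviewer's concern about harder-to-trace mismatches later.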

[Fix] empty npy folder

cb05ca2

paddle-bot bot added the contributor label Dec 11, 2024

mergify bot added the Vector SID/LID/etc. label Dec 11, 2024

[Update] assert empyt folder

c1ff4c2
