
Parallel data loading #2649

Open · jaheba opened this issue Feb 14, 2023 · 1 comment
Labels: enhancement (New feature or request)

Comments

jaheba (Contributor) commented Feb 14, 2023

We've removed parallel data loading, since multiprocessing caused some issues, made the code convoluted, and didn't yield much of a performance benefit. Instead, we focus on improving throughput.

Still, we want to re-introduce parallel loading of data in the future. Getting this right is harder in Python because of the GIL, which more or less forces us to use multiprocessing. That approach has significant overhead, and one needs to be careful when setting up a multiprocessing pipeline.
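As a rough illustration (not GluonTS code), here is a minimal sketch of the kind of multiprocessing pipeline this would involve: worker processes iterate over dataset shards and push batches onto a shared queue, which the training loop then consumes. All names (`worker`, `load_batches`, `dataset_shards`) are hypothetical.

```python
import multiprocessing as mp
from itertools import islice


def worker(shard, queue, num_batches):
    # Each worker process iterates over its shard of the dataset and
    # pushes (already transformed) batches onto a shared queue.
    for batch in islice(iter(shard), num_batches):
        queue.put(batch)
    queue.put(None)  # sentinel: this worker is done


def load_batches(dataset_shards, num_batches_per_worker=100):
    # Bounded queue provides back-pressure so workers don't run far
    # ahead of the consumer.
    queue = mp.Queue(maxsize=8)
    procs = [
        mp.Process(target=worker, args=(shard, queue, num_batches_per_worker))
        for shard in dataset_shards
    ]
    for p in procs:
        p.start()

    finished = 0
    while finished < len(procs):
        item = queue.get()
        if item is None:
            finished += 1
        else:
            yield item

    for p in procs:
        p.join()


# Note: on spawn-based platforms (e.g. macOS, Windows) this must be
# guarded by `if __name__ == "__main__":`, and shards/batches must be
# picklable -- part of the overhead mentioned above.
```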

jaheba added the enhancement label on Feb 14, 2023
admivsn commented Apr 19, 2024

Is this a blocker to the implementation of multi-GPU training?

I've described an issue where I'm seeing no speed improvement with strategy="ddp" on a multi-GPU instance compared to a single-GPU instance.
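For reference, a minimal sketch of the kind of setup meant here, assuming a plain PyTorch Lightning Trainer (`model` and `train_loader` are placeholders, not GluonTS objects):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,          # multi-GPU instance
    strategy="ddp",     # DistributedDataParallel across the GPUs
    max_epochs=10,
)
# trainer.fit(model, train_dataloaders=train_loader)
```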

If so, it would be good to have this documented as a limitation of this package, unless there is a workaround you know of?

Thanks in advance
