Optimize nemotron VL image/video preprocessing #40093
Signed-off-by: milesial <milesial@users.noreply.github.com>
Code Review
This pull request refactors the preprocessing pipeline for Nano-Nemotron-VL by introducing a unified _bicubic_resize_and_normalize function and optimizing frame separator tokenization using batch encoding. It also adds support for configurable dtypes and reduces unnecessary tensor operations. Review feedback highlights significant concerns regarding the use of @torch.compile on the new preprocessing function, specifically citing the risk of excessive recompilation due to varying image dimensions and the operational overhead of requiring a C++ compiler in production environments.
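To make the review summary concrete, here is a minimal sketch of what a combined bicubic resize and normalize step could look like in PyTorch. It is not the PR's actual `_bicubic_resize_and_normalize`: the function name, signature, clamping step, and the mean/std values below are assumptions chosen for illustration. The sketch runs eagerly; wrapping it in `torch.compile` is what the review feedback flags, since each new input resolution may trigger a recompilation.

```python
import torch
import torch.nn.functional as F


def bicubic_resize_and_normalize(
    images: torch.Tensor,
    size: tuple[int, int],
    mean: tuple[float, ...],
    std: tuple[float, ...],
    dtype: torch.dtype = torch.float32,
) -> torch.Tensor:
    """Hypothetical helper: resize a uint8 NCHW batch, then normalize per channel."""
    # Scale to [0, 1] in float32 so interpolation is not done on integer data.
    x = images.to(torch.float32) / 255.0
    # Bicubic interpolation to the target (height, width).
    x = F.interpolate(x, size=size, mode="bicubic", align_corners=False)
    # Bicubic kernels can overshoot the input range, so clamp back to [0, 1].
    # (Assumption: the real implementation may or may not clamp.)
    x = x.clamp(0.0, 1.0)
    # Broadcast per-channel statistics over the batch and spatial dimensions.
    mean_t = torch.tensor(mean, dtype=x.dtype, device=x.device).view(1, -1, 1, 1)
    std_t = torch.tensor(std, dtype=x.dtype, device=x.device).view(1, -1, 1, 1)
    # Cast to the configurable output dtype only at the very end.
    return ((x - mean_t) / std_t).to(dtype)


# Usage with placeholder statistics (not the model's real values).
frames = torch.randint(0, 256, (2, 3, 64, 48), dtype=torch.uint8)
pixels = bicubic_resize_and_normalize(
    frames, size=(32, 32), mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5),
    dtype=torch.bfloat16,
)
```

Doing the scale, resize, normalize, and dtype cast in one function avoids materializing several intermediate full-size tensors, which is consistent with the stated goal of reducing CPU time and memory.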
netanel-haber left a comment
LGTM. I ran evals: VoxPopuli (audio+text), InfoVQA_VAL (image+text), and DailyOmni (video+audio+text) are all on par before and after.
Purpose
Compile and reorganize image/video preprocessing for nemotron nano VL, reducing the amount of CPU time and memory needed.