Description
Hey Team,
I am trying to run a checkpoint that was continually pre-trained (CPT) on top of the base model "meta-llama/llama-3.1-8b-Instruct".
When I try to run inference on it, it fails with a CUDA device-side assert (the CUDA_DSA error below).
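For context, inference_01.py (referenced in the traceback below) is essentially the following. This is a paraphrased sketch built around nemo.collections.llm.api.generate, not the verbatim file: the argument names, Trainer/strategy setup, and sampling values are approximate.

# Paraphrased sketch of inference_01.py -- not the verbatim script.
# Assumes the NeMo 2.0 in-framework generate API; exact argument names may differ.
from megatron.core.inference.common_inference_params import CommonInferenceParams

import nemo.lightning as nl
from nemo.collections.llm import api


def inference(checkpoints_path, trainer, prompt, temp, topk, max_tokens):
    # In-framework generation over the continually pre-trained checkpoint.
    results = api.generate(
        path=checkpoints_path,
        trainer=trainer,
        prompts=[prompt],
        inference_params=CommonInferenceParams(
            temperature=temp,
            top_k=topk,
            num_tokens_to_generate=max_tokens,
        ),
    )
    return results


if __name__ == "__main__":
    checkpoints_path = (
        "/workspace/logs_02_09_curated/llama31_8b_dapt/checkpoints/"
        "model_name=0--val_loss=7.07-step=21976-consumed_samples=87908.0-last"
    )
    trainer = nl.Trainer(
        devices=1,
        accelerator="gpu",
        strategy=nl.MegatronStrategy(
            tensor_model_parallel_size=1,
            pipeline_model_parallel_size=1,
        ),
    )
    prompt = "..."  # prompt text elided
    # Sampling values below are placeholders, not the exact ones I used.
    results = inference(checkpoints_path, trainer, prompt, 0.7, 50, 256)
    print(results)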
Error:
[NeMo W 2025-09-04 06:14:13 nemo_logging:405] /usr/local/lib/python3.12/dist-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
💡 Tip: For seamless cloud uploads and versioning, try installing litmodels to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Enabling Flash Decode for in-framework inference
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has data parallel group : [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Ranks 0 has data parallel rank: 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has context parallel group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All context parallel group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Ranks 0 has context parallel rank: 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has model parallel group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All model parallel group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has tensor model parallel group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All tensor model parallel group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has tensor model parallel rank: 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has embedding group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All pipeline model parallel group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has pipeline model parallel rank 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All embedding group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has embedding rank: 0
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Padded vocab_size: 128640, original vocab_size: 128629, dummy tokens: 11.
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Apply rope scaling with factor=8.0, low_freq_factor=1.0, high_freq_factor=4.0, old_context_len=8192.
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] > number of parameters on (tensor, pipeline) model parallel rank (0 ,0): 8033406976
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Doing selective restore from RestoreConfig(path='/workspace/logs_02_09_curated/llama31_8b_dapt/checkpoints/model_name=0--val_loss=7.07-step=21976-consumed_samples=87908.0-last', load_model_state=True, load_optim_state=False, load_artifacts=True)
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Using <megatron.core.dist_checkpointing.strategies.fully_parallel.FullyParallelLoadStrategyWrapper object at 0x7ffa0c197ef0> dist-ckpt load strategy.
[NeMo I 2025-09-04 06:14:27 nemo_logging:393] Global Checkpoint Load : Rank : 0 : Start time : 1756966456.551s : Time spent in load_checkpoint: 10.740s
[NeMo I 2025-09-04 06:14:27 nemo_logging:393] Restoring model weights from RestoreConfig(path='/workspace/logs_02_09_curated/llama31_8b_dapt/checkpoints/model_name=0--val_loss=7.07-step=21976-consumed_samples=87908.0-last', load_model_state=True, load_optim_state=False, load_artifacts=True)
[NeMo I 2025-09-04 06:14:27 nemo_logging:393] Finished restoring from RestoreConfig(path='/workspace/logs_02_09_curated/llama31_8b_dapt/checkpoints/model_name=0--val_loss=7.07-step=21976-consumed_samples=87908.0-last', load_model_state=True, load_optim_state=False, load_artifacts=True), cleaning up.
static requests: 0%| | 0/1 [00:00<?, ?it/s][WARNING | DotProductAttention]: flash-attn v3 may provide important feature support or performance improvement. Please install flash-attn v3 by
(1) git clone https://github.com/Dao-AILab/flash-attention.git
(2) cd flash-attention/ && git checkout 27f501d && cd hopper/ && python setup.py install
(3) python_path=`python -c "import site; print(site.getsitepackages()[0])"`
(4) mkdir -p $python_path/flash_attn_3
(5) wget -P $python_path/flash_attn_3 https://raw.githubusercontent.com/Dao-AILab/flash-attention/27f501dbe011f4371bff938fe7e09311ab3002fa/hopper/flash_attn_interface.py
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/eval_exp_02_09_curated/inference_01.py", line 82, in
[rank0]: inference (checkpoints_path, trainer , prompt , temp , topk , max_tokens )
[rank0]: File "/workspace/eval_exp_02_09_curated/inference_01.py", line 55, in inference
[rank0]: results = api.generate(
[rank0]: ^^^^^^^^^^^^^
[rank0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 1159, in generate
[rank0]: results_on_this_dp_rank = inference.generate(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/NeMo/nemo/collections/llm/inference/base.py", line 289, in generate
[rank0]: results = mcore_engine.generate(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/megatron-lm/megatron/core/inference/engines/static_engine.py", line 191, in generate
[rank0]: self.run_engine()
[rank0]: File "/opt/megatron-lm/megatron/core/inference/engines/static_engine.py", line 225, in run_engine
[rank0]: self.text_generation_controller.generate_all_output_tokens_static_batch(
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/megatron-lm/megatron/core/inference/text_generation_controllers/text_generation_controller.py", line 722, in generate_all_output_tokens_static_batch
[rank0]: sampled_logits = self.sample_from_logits(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/megatron-lm/megatron/core/inference/text_generation_controllers/text_generation_controller.py", line 292, in sample_from_logits
[rank0]: sampled_logits = torch.multinomial(probabilities, num_samples=1).view(-1)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: CUDA error: device-side assert triggered
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
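In case it helps with triage: torch.multinomial raises this device-side assert when its input contains NaN/inf/negative values or an all-zero row, so the probabilities coming out of the restored checkpoint may already be invalid. Below is a small debugging sketch I can run (my own code, not NeMo/Megatron; the helper name is hypothetical) that reruns with synchronous kernel launches and validates a probability tensor before sampling.

# Debugging sketch (hypothetical helper, not part of NeMo or Megatron-LM).
import os

# Must be set before CUDA is initialised so the failing kernel is reported
# at the real call site instead of a later API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch


def check_sampling_probs(probabilities: torch.Tensor) -> None:
    """torch.multinomial asserts on NaN/inf/negative entries or zero-sum rows."""
    non_finite = (~torch.isfinite(probabilities)).sum().item()
    negative = (probabilities < 0).sum().item()
    zero_rows = (probabilities.sum(dim=-1) == 0).sum().item()
    if non_finite or negative or zero_rows:
        raise ValueError(
            f"invalid sampling distribution: {non_finite} non-finite entries, "
            f"{negative} negative entries, {zero_rows} zero-sum rows"
        )

Running with CUDA_LAUNCH_BLOCKING=1 (or a check like the above placed just before the sampling call) should show whether the distribution is already NaN, which would point at the restored weights rather than the sampling code. Any guidance on why the CPT checkpoint behaves this way would be appreciated.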