Inference on a continued_DAPT checkpoint fails with CUDA_DSA_ERROR #14655

@Shrii-WorkspaceNSX

Description

Hey team,
I am trying to run inference on a checkpoint that was continually pre-trained (DAPT) on top of the base model "meta-llama/llama-3.1-8b-Instruct".
When I run inference on it, it fails with a CUDA device-side assert (CUDA_DSA) error.
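For context, my inference_01.py is essentially a thin wrapper around nemo.collections.llm.api.generate. A minimal sketch of the call is below; the paths, prompt, and sampling values are placeholders, and the keyword names reflect my understanding of the NeMo 2.0 API rather than my exact script:

```python
# Hedged sketch of the failing inference call (placeholder paths/values;
# keyword names are my assumption from reading nemo/collections/llm/api.py).
from megatron.core.inference.common_inference_params import CommonInferenceParams
from nemo import lightning as nl
from nemo.collections import llm

trainer = nl.Trainer(
    devices=1,
    accelerator="gpu",
    strategy=nl.MegatronStrategy(tensor_model_parallel_size=1),
)

checkpoint_path = (
    "/workspace/logs_02_09_curated/llama31_8b_dapt/checkpoints/"
    "model_name=0--val_loss=7.07-step=21976-consumed_samples=87908.0-last"
)

results = llm.generate(
    path=checkpoint_path,
    prompts=["<prompt text>"],          # prompt
    trainer=trainer,
    inference_params=CommonInferenceParams(
        temperature=0.7,                # temp
        top_k=50,                       # topk
        num_tokens_to_generate=256,     # max_tokens
    ),
)
```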

Error:
[NeMo W 2025-09-04 06:14:13 nemo_logging:405] /usr/local/lib/python3.12/dist-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)

💡 Tip: For seamless cloud uploads and versioning, try installing litmodels to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Enabling Flash Decode for in-framework inference
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has data parallel group : [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Ranks 0 has data parallel rank: 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has context parallel group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All context parallel group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Ranks 0 has context parallel rank: 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has model parallel group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All model parallel group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has tensor model parallel group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All tensor model parallel group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has tensor model parallel rank: 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has embedding group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All pipeline model parallel group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has pipeline model parallel rank 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All embedding group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has embedding rank: 0

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Padded vocab_size: 128640, original vocab_size: 128629, dummy tokens: 11.
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Apply rope scaling with factor=8.0, low_freq_factor=1.0, high_freq_factor=4.0, old_context_len=8192.
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] > number of parameters on (tensor, pipeline) model parallel rank (0 ,0): 8033406976
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Doing selective restore from RestoreConfig(path='/workspace/logs_02_09_curated/llama31_8b_dapt/checkpoints/model_name=0--val_loss=7.07-step=21976-consumed_samples=87908.0-last', load_model_state=True, load_optim_state=False, load_artifacts=True)
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Using <megatron.core.dist_checkpointing.strategies.fully_parallel.FullyParallelLoadStrategyWrapper object at 0x7ffa0c197ef0> dist-ckpt load strategy.
[NeMo I 2025-09-04 06:14:27 nemo_logging:393] Global Checkpoint Load : Rank : 0 : Start time : 1756966456.551s : Time spent in load_checkpoint: 10.740s
[NeMo I 2025-09-04 06:14:27 nemo_logging:393] Restoring model weights from RestoreConfig(path='/workspace/logs_02_09_curated/llama31_8b_dapt/checkpoints/model_name=0--val_loss=7.07-step=21976-consumed_samples=87908.0-last', load_model_state=True, load_optim_state=False, load_artifacts=True)
[NeMo I 2025-09-04 06:14:27 nemo_logging:393] Finished restoring from RestoreConfig(path='/workspace/logs_02_09_curated/llama31_8b_dapt/checkpoints/model_name=0--val_loss=7.07-step=21976-consumed_samples=87908.0-last', load_model_state=True, load_optim_state=False, load_artifacts=True), cleaning up.
static requests: 0%| | 0/1 [00:00<?, ?it/s][WARNING | DotProductAttention]: flash-attn v3 may provide important feature support or performance improvement. Please install flash-attn v3 by
(1) git clone https://github.com/Dao-AILab/flash-attention.git
(2) cd flash-attention/ && git checkout 27f501d && cd hopper/ && python setup.py install
(3) python_path=`python -c "import site; print(site.getsitepackages()[0])"`
(4) mkdir -p $python_path/flash_attn_3
(5) wget -P $python_path/flash_attn_3 https://raw.githubusercontent.com/Dao-AILab/flash-attention/27f501dbe011f4371bff938fe7e09311ab3002fa/hopper/flash_attn_interface.py
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/eval_exp_02_09_curated/inference_01.py", line 82, in
[rank0]: inference (checkpoints_path, trainer , prompt , temp , topk , max_tokens )
[rank0]: File "/workspace/eval_exp_02_09_curated/inference_01.py", line 55, in inference
[rank0]: results = api.generate(
[rank0]: ^^^^^^^^^^^^^
[rank0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 1159, in generate
[rank0]: results_on_this_dp_rank = inference.generate(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/NeMo/nemo/collections/llm/inference/base.py", line 289, in generate
[rank0]: results = mcore_engine.generate(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/megatron-lm/megatron/core/inference/engines/static_engine.py", line 191, in generate
[rank0]: self.run_engine()
[rank0]: File "/opt/megatron-lm/megatron/core/inference/engines/static_engine.py", line 225, in run_engine
[rank0]: self.text_generation_controller.generate_all_output_tokens_static_batch(
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/megatron-lm/megatron/core/inference/text_generation_controllers/text_generation_controller.py", line 722, in generate_all_output_tokens_static_batch
[rank0]: sampled_logits = self.sample_from_logits(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/megatron-lm/megatron/core/inference/text_generation_controllers/text_generation_controller.py", line 292, in sample_from_logits
[rank0]: sampled_logits = torch.multinomial(probabilities, num_samples=1).view(-1)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: CUDA error: device-side assert triggered
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
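
The assert fires inside torch.multinomial, which as far as I understand usually means the probability tensor it receives is invalid (NaN/inf, negative entries, or an all-zero row), for example due to a vocab-size mismatch between the restored weights and the tokenizer. I will rerun with CUDA_LAUNCH_BLOCKING=1 as the error suggests; below is a small standalone sanity-check sketch (plain PyTorch, not NeMo/Megatron code, and the helper name is mine) for the kind of distribution torch.multinomial expects:

```python
# Hedged debugging sketch: validate a probability tensor before torch.multinomial.
# Run with:  CUDA_LAUNCH_BLOCKING=1 python check_probs.py
# The variable name `probabilities` mirrors the Megatron sample_from_logits call
# site, but this is a standalone check, not library code.
import torch

def check_probabilities(probabilities: torch.Tensor) -> None:
    """Raise a readable error if `probabilities` would trip multinomial's device-side assert."""
    if torch.isnan(probabilities).any():
        raise ValueError("probabilities contain NaN")
    if torch.isinf(probabilities).any():
        raise ValueError("probabilities contain inf")
    if (probabilities < 0).any():
        raise ValueError("probabilities contain negative entries")
    if (probabilities.sum(dim=-1) == 0).any():
        raise ValueError("at least one row sums to zero (nothing to sample)")

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Simulate one well-formed batch row and one broken (NaN) row over the padded vocab.
    probs = torch.softmax(torch.randn(2, 128640, device=device), dim=-1)
    probs[1, 0] = float("nan")
    try:
        check_probabilities(probs)
        sampled = torch.multinomial(probs, num_samples=1).view(-1)
        print("sampled token ids:", sampled)
    except ValueError as err:
        print("invalid distribution:", err)
```

If a check like this trips on the real logits coming out of the restored checkpoint, that would point at the checkpoint/tokenizer rather than the sampler itself.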
