Inference on a continued_DAPT checkpoint fails with CUDA_DSA_ERROR #14655

@Shrii-WorkspaceNSX

Description

Hey team,
I am trying to run inference on a checkpoint that was continually pre-trained (DAPT) on top of the base model "meta-llama/llama-3.1-8b-Instruct".
When I run inference on it, it fails with a CUDA device-side assert (CUDA_DSA) error.
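For context, my inference_01.py is essentially a thin wrapper around nemo.collections.llm.api.generate. A minimal sketch of the call is below; the paths, prompt, and sampling values are placeholders, and the keyword names reflect my understanding of the NeMo 2.0 API rather than my exact script:

```python
# Hedged sketch of the failing inference call (placeholder paths/values;
# keyword names are my assumption from reading nemo/collections/llm/api.py).
from megatron.core.inference.common_inference_params import CommonInferenceParams
from nemo import lightning as nl
from nemo.collections import llm

trainer = nl.Trainer(
    devices=1,
    accelerator="gpu",
    strategy=nl.MegatronStrategy(tensor_model_parallel_size=1),
)

checkpoint_path = (
    "/workspace/logs_02_09_curated/llama31_8b_dapt/checkpoints/"
    "model_name=0--val_loss=7.07-step=21976-consumed_samples=87908.0-last"
)

results = llm.generate(
    path=checkpoint_path,
    prompts=["<prompt text>"],          # prompt
    trainer=trainer,
    inference_params=CommonInferenceParams(
        temperature=0.7,                # temp
        top_k=50,                       # topk
        num_tokens_to_generate=256,     # max_tokens
    ),
)
```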

Error:
[NeMo W 2025-09-04 06:14:13 nemo_logging:405] /usr/local/lib/python3.12/dist-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)

💡 Tip: For seamless cloud uploads and versioning, try installing litmodels to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Enabling Flash Decode for in-framework inference
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has data parallel group : [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Ranks 0 has data parallel rank: 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has context parallel group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All context parallel group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Ranks 0 has context parallel rank: 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has model parallel group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All model parallel group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has tensor model parallel group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All tensor model parallel group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has tensor model parallel rank: 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has embedding group: [0]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All pipeline model parallel group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has pipeline model parallel rank 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] All embedding group ranks: [[0]]
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Rank 0 has embedding rank: 0

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Padded vocab_size: 128640, original vocab_size: 128629, dummy tokens: 11.
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Apply rope scaling with factor=8.0, low_freq_factor=1.0, high_freq_factor=4.0, old_context_len=8192.
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] > number of parameters on (tensor, pipeline) model parallel rank (0 ,0): 8033406976
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Doing selective restore from RestoreConfig(path='/workspace/logs_02_09_curated/llama31_8b_dapt/checkpoints/model_name=0--val_loss=7.07-step=21976-consumed_samples=87908.0-last', load_model_state=True, load_optim_state=False, load_artifacts=True)
[NeMo I 2025-09-04 06:14:16 nemo_logging:393] Using <megatron.core.dist_checkpointing.strategies.fully_parallel.FullyParallelLoadStrategyWrapper object at 0x7ffa0c197ef0> dist-ckpt load strategy.
[NeMo I 2025-09-04 06:14:27 nemo_logging:393] Global Checkpoint Load : Rank : 0 : Start time : 1756966456.551s : Time spent in load_checkpoint: 10.740s
[NeMo I 2025-09-04 06:14:27 nemo_logging:393] Restoring model weights from RestoreConfig(path='/workspace/logs_02_09_curated/llama31_8b_dapt/checkpoints/model_name=0--val_loss=7.07-step=21976-consumed_samples=87908.0-last', load_model_state=True, load_optim_state=False, load_artifacts=True)
[NeMo I 2025-09-04 06:14:27 nemo_logging:393] Finished restoring from RestoreConfig(path='/workspace/logs_02_09_curated/llama31_8b_dapt/checkpoints/model_name=0--val_loss=7.07-step=21976-consumed_samples=87908.0-last', load_model_state=True, load_optim_state=False, load_artifacts=True), cleaning up.
static requests: 0%| | 0/1 [00:00<?, ?it/s][WARNING | DotProductAttention]: flash-attn v3 may provide important feature support or performance improvement. Please install flash-attn v3 by
(1) git clone https://github.com/Dao-AILab/flash-attention.git
(2) cd flash-attention/ && git checkout 27f501d && cd hopper/ && python setup.py install
(3) python_path=`python -c "import site; print(site.getsitepackages()[0])"`
(4) mkdir -p $python_path/flash_attn_3
(5) wget -P $python_path/flash_attn_3 https://raw.githubusercontent.com/Dao-AILab/flash-attention/27f501dbe011f4371bff938fe7e09311ab3002fa/hopper/flash_attn_interface.py
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/eval_exp_02_09_curated/inference_01.py", line 82, in
[rank0]: inference (checkpoints_path, trainer , prompt , temp , topk , max_tokens )
[rank0]: File "/workspace/eval_exp_02_09_curated/inference_01.py", line 55, in inference
[rank0]: results = api.generate(
[rank0]: ^^^^^^^^^^^^^
[rank0]: File "/opt/NeMo/nemo/collections/llm/api.py", line 1159, in generate
[rank0]: results_on_this_dp_rank = inference.generate(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/NeMo/nemo/collections/llm/inference/base.py", line 289, in generate
[rank0]: results = mcore_engine.generate(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/megatron-lm/megatron/core/inference/engines/static_engine.py", line 191, in generate
[rank0]: self.run_engine()
[rank0]: File "/opt/megatron-lm/megatron/core/inference/engines/static_engine.py", line 225, in run_engine
[rank0]: self.text_generation_controller.generate_all_output_tokens_static_batch(
[rank0]: File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/megatron-lm/megatron/core/inference/text_generation_controllers/text_generation_controller.py", line 722, in generate_all_output_tokens_static_batch
[rank0]: sampled_logits = self.sample_from_logits(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/megatron-lm/megatron/core/inference/text_generation_controllers/text_generation_controller.py", line 292, in sample_from_logits
[rank0]: sampled_logits = torch.multinomial(probabilities, num_samples=1).view(-1)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: CUDA error: device-side assert triggered
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
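
The assert fires inside torch.multinomial, which as far as I understand usually means the probability tensor it receives is invalid (NaN/inf, negative entries, or an all-zero row), for example due to a vocab-size mismatch between the restored weights and the tokenizer. I will rerun with CUDA_LAUNCH_BLOCKING=1 as the error suggests; below is a small standalone sanity-check sketch (plain PyTorch, not NeMo/Megatron code, and the helper name is mine) for the kind of distribution torch.multinomial expects:

```python
# Hedged debugging sketch: validate a probability tensor before torch.multinomial.
# Run with:  CUDA_LAUNCH_BLOCKING=1 python check_probs.py
# The variable name `probabilities` mirrors the Megatron sample_from_logits call
# site, but this is a standalone check, not library code.
import torch

def check_probabilities(probabilities: torch.Tensor) -> None:
    """Raise a readable error if `probabilities` would trip multinomial's device-side assert."""
    if torch.isnan(probabilities).any():
        raise ValueError("probabilities contain NaN")
    if torch.isinf(probabilities).any():
        raise ValueError("probabilities contain inf")
    if (probabilities < 0).any():
        raise ValueError("probabilities contain negative entries")
    if (probabilities.sum(dim=-1) == 0).any():
        raise ValueError("at least one row sums to zero (nothing to sample)")

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Simulate one well-formed batch row and one broken (NaN) row over the padded vocab.
    probs = torch.softmax(torch.randn(2, 128640, device=device), dim=-1)
    probs[1, 0] = float("nan")
    try:
        check_probabilities(probs)
        sampled = torch.multinomial(probs, num_samples=1).view(-1)
        print("sampled token ids:", sampled)
    except ValueError as err:
        print("invalid distribution:", err)
```

If a check like this trips on the real logits coming out of the restored checkpoint, that would point at the checkpoint/tokenizer rather than the sampler itself.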
