Disabling timers synchronization (#154) #1879

Merged
regisss merged 2 commits into huggingface:main from HabanaAI:auto-pr-5fa4c45 on Apr 17, 2025
Conversation

@bhargaveede bhargaveede commented Mar 25, 2025

This change improves performance. Without it, timer synchronization blocks on the host, which leaves the device briefly idle. Skipping that synchronization improves device utilization.
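
To illustrate the synchronization cost described above, here is a minimal Python sketch. It is not the code changed in this PR: `StepTimer`, `fake_device_sync`, and `run_steps` are hypothetical names, and the device synchronization is simulated with a counter rather than a real HPU call. It only shows that a timer which synchronizes on every reading forces two host-side waits per step, while a non-synchronizing timer forces none.

```python
import time
from typing import Callable, Optional


class StepTimer:
    """Hypothetical wall-clock timer (not the library's implementation).

    If `sync` is given, it is called before every clock reading, which
    models a host-side wait for the device to finish queued work.
    """

    def __init__(self, sync: Optional[Callable[[], None]] = None) -> None:
        self._sync = sync
        self._start = 0.0
        self.elapsed = 0.0

    def start(self) -> None:
        if self._sync is not None:
            self._sync()  # host blocks here until the device is idle
        self._start = time.perf_counter()

    def stop(self) -> None:
        if self._sync is not None:
            self._sync()  # second host-side wait for the same step
        self.elapsed += time.perf_counter() - self._start


sync_calls = []


def fake_device_sync() -> None:
    """Stand-in for a real device synchronization; only records the call."""
    sync_calls.append(1)


def run_steps(timer: StepTimer, steps: int) -> None:
    for _ in range(steps):
        timer.start()
        # A real training loop would enqueue device work here.
        timer.stop()


# Timer that synchronizes on every reading.
blocking = StepTimer(sync=fake_device_sync)
run_steps(blocking, 3)
blocking_syncs = len(sync_calls)

# Timer with synchronization disabled.
sync_calls.clear()
non_blocking = StepTimer(sync=None)
run_steps(non_blocking, 3)
non_blocking_syncs = len(sync_calls)

print(blocking_syncs, non_blocking_syncs)
```

The trade-off in this sketch is that a non-synchronizing timer measures host-side time only, so individual readings may not line up exactly with when the device finished its work.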

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@bhargaveede bhargaveede requested a review from regisss as a code owner March 25, 2025 03:35
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vidyasiv
Contributor

vidyasiv commented Mar 26, 2025

Tested with README command

PT_HPU_MAX_COMPOUND_OP_SIZE=10 \
python3 examples/gaudi_spawn.py --use_deepspeed  --world_size 8  examples/language-modeling/run_lora_clm.py \
  --model_name_or_path meta-llama/Llama-2-70b-hf \
  --deepspeed  examples/language-modeling/llama2_ds_zero3_config.json \
  --dataset_name tatsu-lab/alpaca \
  --bf16 True \
  --output_dir ./lora_out \
  --num_train_epochs 2 \
  --max_seq_len 2048 \
  --per_device_train_batch_size 10 \
  --per_device_eval_batch_size 1 \
  --gradient_checkpointing \
  --eval_strategy epoch \
  --eval_delay 2 \
  --save_strategy no \
  --learning_rate 0.0018 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --dataset_concatenation \
  --attn_softmax_bf16 True \
  --do_train \
  --do_eval \
  --use_habana \
  --use_lazy_mode \
  --pipelining_fwd_bwd \
  --throughput_warmup_steps 3 \
  --lora_rank 4 \
  --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
  --validation_split_percentage 4 \
  --use_flash_attention True \
  --flash_attention_causal_mask True \
  --fp8 True

Results with PR

***** train metrics *****
  epoch                       =        2.0
  max_memory_allocated (GB)   =      93.99
  memory_allocated (GB)       =      17.38
  total_flos                  =  1264185GF
  total_memory_available (GB) =      94.62
  train_loss                  =      0.901
  train_runtime               = 0:27:28.74
  train_samples_per_second    =      3.928
  train_steps_per_second      =      0.049
***** eval metrics *****
  epoch                           =        2.0
  eval_accuracy                   =     0.7915
  eval_graph_compliation_duration =     5.4882
  eval_loss                       =     0.7644
  eval_runtime                    = 0:00:21.41
  eval_samples                    =        125
  eval_samples_per_second         =      6.356
  eval_steps_per_second           =      0.818
  max_memory_allocated (GB)       =      93.99
  memory_allocated (GB)           =      17.38
  perplexity                      =     2.1478
  total_memory_available (GB)     =      94.62

Results on main (without PR)

***** train metrics *****
  epoch                       =        2.0
  max_memory_allocated (GB)   =       94.3
  memory_allocated (GB)       =      17.38
  total_flos                  =  1264185GF
  total_memory_available (GB) =      94.62
  train_loss                  =     0.9119
  train_runtime               = 0:28:09.94
  train_samples_per_second    =      3.823
  train_steps_per_second      =      0.048
***** eval metrics *****
  epoch                           =        2.0
  eval_accuracy                   =     0.7915
  eval_graph_compliation_duration =     5.7394
  eval_loss                       =     0.7638
  eval_runtime                    = 0:00:24.54
  eval_samples                    =        125
  eval_samples_per_second         =      5.376
  eval_steps_per_second           =      0.692
  max_memory_allocated (GB)       =       94.3
  memory_allocated (GB)           =      17.38
  perplexity                      =     2.1464
  total_memory_available (GB)     =      94.62
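
The two result sets above can be compared with a short Python sketch. The numbers are copied from the metrics reported in this comment; the helper names `parse_runtime` and `pct_change` are hypothetical. Each configuration was run once here, so run-to-run noise is not characterized and the differences should be read as indicative only.

```python
def parse_runtime(text: str) -> float:
    """Convert an 'H:MM:SS.ss' runtime string to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# Values copied from the reported train and eval metrics.
with_pr = {
    "train_runtime": parse_runtime("0:27:28.74"),
    "train_samples_per_second": 3.928,
    "eval_runtime": parse_runtime("0:00:21.41"),
    "eval_samples_per_second": 6.356,
}
main = {
    "train_runtime": parse_runtime("0:28:09.94"),
    "train_samples_per_second": 3.823,
    "eval_runtime": parse_runtime("0:00:24.54"),
    "eval_samples_per_second": 5.376,
}

# Positive throughput change and negative runtime change both favor the PR.
changes = {key: pct_change(with_pr[key], main[key]) for key in with_pr}

for key, value in changes.items():
    print(f"{key}: {value:+.2f}%")
```

On these single runs, training throughput is roughly 2.7% higher with the PR and evaluation throughput roughly 18% higher, while eval accuracy is unchanged at 0.7915.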

Contributor

@vidyasiv vidyasiv left a comment

@bhargaveede thanks for your PR. Could you provide a description of the change and why it is needed?

@bhargaveede
Author

This change improves performance. Without it, timer synchronization blocks on the host, which leaves the device briefly idle. Skipping that synchronization improves device utilization.

Contributor

@vidyasiv vidyasiv left a comment

@regisss please take a look and let us know if any further testing is needed.

@libinta libinta added the run-test Run CI for PRs from external contributors label Apr 16, 2025
Collaborator

@regisss regisss left a comment

LGTM

@regisss regisss merged commit 029f8fb into huggingface:main Apr 17, 2025
Labels

run-test Run CI for PRs from external contributors synapse 1.21

6 participants