fea(gemma2): cleaned up old codes #2212

Merged
regisss merged 2 commits into huggingface:main from imangohari1:ig/gemma2-cleanups on Aug 21, 2025

@imangohari1
Contributor

What does this PR do?

This PR cleans up Gemma2 by replacing the old, local implementations of KVCache, MLP, Rotary, etc. with the unified ones.
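
To illustrate the kind of cleanup described above, here is a minimal, purely hypothetical sketch of a model-specific layer reusing one shared cache class instead of carrying its own local copy. The names `SharedKVCache` and `ToyAttention` are invented for this example and are not taken from the repository, and plain strings stand in for tensors so the snippet runs without any dependencies.

```python
from typing import List, Optional, Tuple


class SharedKVCache:
    """Hypothetical shared cache that any model's attention layer could reuse."""

    def __init__(self) -> None:
        self.cache: Optional[List[Optional[str]]] = None

    def allocate(self, max_len: int) -> None:
        # Preallocate a fixed-size buffer once so later updates happen in place.
        if self.cache is None or len(self.cache) != max_len:
            self.cache = [None] * max_len

    def update(self, position: int, value: str) -> List[Optional[str]]:
        # Write the new entry at its token position and return the full buffer.
        if self.cache is None:
            raise RuntimeError("call allocate() before update()")
        self.cache[position] = value
        return self.cache


class ToyAttention:
    """Hypothetical stand-in for a model-specific attention layer."""

    def __init__(self) -> None:
        # Reuse the shared class rather than defining a model-local duplicate.
        self.k_cache = SharedKVCache()
        self.v_cache = SharedKVCache()

    def step(self, position: int, key: str, value: str) -> Tuple[list, list]:
        keys = self.k_cache.update(position, key)
        values = self.v_cache.update(position, value)
        # Only the entries written so far are visible to this step.
        return keys[: position + 1], values[: position + 1]


attn = ToyAttention()
attn.k_cache.allocate(4)
attn.v_cache.allocate(4)
attn.step(0, "k0", "v0")
print(attn.step(1, "k1", "v1"))
```

The point of the pattern is that a fix to the shared class reaches every model that uses it, which is harder to guarantee when each model keeps its own copy.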

Tests conducted

CI (only Gemma2)

PT_HPU_LAZY_MODE=1  RUN_SLOW=true python -m pytest tests/test_text_generation_example.py::test_text_generation_bf16_1x -s -v 
==================== 2 passed in 437.09s (0:07:17) ====================

Readme tests

PT_HPU_LAZY_MODE=1 python examples/text-generation/run_generation.py --model_name_or_path google/gemma-2-2b --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --prompt "Here is my prompt" --sdp_on_bf16

Input/outputs:
input 1: ('Here is my prompt',)
output 1.1: ("Here is my prompt request! \nYou don't have to read the short story, that isn't important. just do your own original piece about the boy getting kicked out of school for being 'weird', later on it states that Mr. Handley and Miss Hottens and their mother are all going to choir practice. So have fun! Please comment if you do do it, I want to read that story when I am done with mine! Good luck!!\n\nP.S. I will try",)


Stats:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 116.26774185544866 tokens/second
Average first token latency         = 13.424454599589808 ms
Average rest token latency          = 7.379330329297433 ms
Average end to end latency          = 859.5175283997378 ms
Memory allocated                    = 12.59 GB
Max memory allocated                = 12.61 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 3.5712405099984608 seconds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
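
As a rough sanity check on the numbers above, the reported throughput can be compared with what the end-to-end latency implies. This sketch assumes a batch size of 1 and that all 100 requested new tokens were generated; neither is stated explicitly in the log.

```python
# Values copied from the Stats block; max_new_tokens comes from the command line.
max_new_tokens = 100
end_to_end_ms = 859.5175283997378
first_token_ms = 13.424454599589808
rest_token_ms = 7.379330329297433

# Throughput implied by the end-to-end latency
# (assumption: batch size 1, all 100 new tokens generated).
implied_throughput = max_new_tokens / (end_to_end_ms / 1000.0)

# Time covered by the per-token latencies: one first token plus 99 "rest" tokens.
decode_only_ms = first_token_ms + (max_new_tokens - 1) * rest_token_ms

print(f"implied throughput: {implied_throughput:.2f} tokens/s")
print(f"decode-only time:   {decode_only_ms:.2f} ms")
```

Under those assumptions the implied throughput comes out at roughly 116.3 tokens/second, close to the reported 116.27. The per-token latencies add up to about 744 ms; the log does not break down the remainder of the 859.5 ms end-to-end figure.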

LM_eval with 8k max new tokens

 PT_HPU_LAZY_MODE=1 python examples/text-generation/run_lm_eval.py --model_name_or_path google/gemma-2-2b  --use_hpu_graphs --use_kv_cache --bf16 --batch_size=1  --max_new_tokens 8192 --tasks piqa -o tmp.json 
{
  "results": {
    "piqa": {
      "alias": "piqa",
      "acc,none": 0.7856365614798694,
      "acc_stderr,none": 0.009574842136050941,
      "acc_norm,none": 0.7932535364526659,
      "acc_norm_stderr,none": 0.009448665514183273
    }
  },
  "group_subtasks": {
    "piqa": []
  },
  "configs": {
    "piqa": {
      "task": "piqa",
      "dataset_path": "piqa",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "training_split": "train",
      "validation_split": "validation",
      "doc_to_text": "Question: {{goal}}\nAnswer:",
      "doc_to_target": "label",
      "doc_to_choice": "{{[sol1, sol2]}}",
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        },
        {
          "metric": "acc_norm",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": true,
      "doc_to_decontamination_query": "goal",
      "metadata": {
        "version": 1.0
      }
    }
  },
  "versions": {
    "piqa": 1.0
  },
  "n-shot": {
    "piqa": 0
  },
  "higher_is_better": {
    "piqa": {
      "acc": true,
      "acc_norm": true
    }
  },
  "n-samples": {
    "piqa": {
      "original": 1838,
      "effective": 1838
    }
  },
  "config": {
    "model": "google/gemma-2-2b",
    "model_args": null,
    "model_num_parameters": 3204165888,
    "model_dtype": "torch.bfloat16",
    "model_revision": "main",
    "model_sha": "",
    "batch_size": null,
    "batch_sizes": [],
    "device": null,
    "use_cache": null,
    "limit": null,
    "bootstrap_iters": 100000,
    "gen_kwargs": null,
    "random_seed": 0,
    "numpy_seed": 1234,
    "torch_seed": 1234,
    "fewshot_seed": 1234
  },
  "git_hash": "ci_13032024-890-g10a17f75",
  "date": 1755555273.21805,
  "pretty_env_info": "PyTorch version: 2.6.0+hpu_1.21.0-555.gitabf798b\nIs debug build: False\nCUDA used to build PyTorch: None\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.5 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.22.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-6.8.0-71-generic-x86_64-with-glibc2.35\nIs CUDA available: False\nCUDA runtime version: No CUDA\nCUDA_MODULE_LOADING set to: N/A\nGPU models and configuration: No CUDA\nNvidia driver version: No CUDA\ncuDNN version: No CUDA\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture:                         x86_64\nCPU op-mode(s):                       32-bit, 64-bit\nAddress sizes:                        46 bits physical, 57 bits virtual\nByte Order:                           Little Endian\nCPU(s):                               20\nOn-line CPU(s) list:                  0-19\nVendor ID:                            GenuineIntel\nBIOS Vendor ID:                       QEMU\nModel name:                           Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz\nBIOS Model name:                      pc-q35-4.2\nCPU family:                           6\nModel:                                106\nThread(s) per core:                   2\nCore(s) per socket:                   5\nSocket(s):                            2\nStepping:                             6\nBogoMIPS:                             4589.21\nFlags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq dtes64 vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor 
lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear flush_l1d arch_capabilities\nVirtualization:                       VT-x\nHypervisor vendor:                    KVM\nVirtualization type:                  full\nL1d cache:                            480 KiB (10 instances)\nL1i cache:                            320 KiB (10 instances)\nL2 cache:                             12.5 MiB (10 instances)\nL3 cache:                             120 MiB (2 instances)\nNUMA node(s):                         1\nNUMA node0 CPU(s):                    0-19\nVulnerability Gather data sampling:   Not affected\nVulnerability Itlb multihit:          Not affected\nVulnerability L1tf:                   Not affected\nVulnerability Mds:                    Not affected\nVulnerability Meltdown:               Not affected\nVulnerability Mmio stale data:        Mitigation; Clear CPU buffers; SMT Host state unknown\nVulnerability Reg file data sampling: Not affected\nVulnerability Retbleed:               Not affected\nVulnerability Spec rstack overflow:   Not affected\nVulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl\nVulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop\nVulnerability Srbds:                  Not affected\nVulnerability Tsx async abort:        Mitigation; TSX disabled\n\nVersions of relevant libraries:\n[pip3] 
habana-torch-dataloader==1.21.0.555\n[pip3] habana-torch-plugin==1.21.0.555\n[pip3] numpy==1.26.4\n[pip3] pytorch-lightning==2.5.1.post0\n[pip3] torch==2.6.0+hpu.1.21.0.555.gitabf798b\n[pip3] torch_tb_profiler==0.4.0\n[pip3] torchaudio==2.6.0+cpu\n[pip3] torchdata==0.10.1+cpu\n[pip3] torchmetrics==1.7.1\n[pip3] torchsde==0.2.6\n[pip3] torchtext==0.18.0+cpu\n[pip3] torchvision==0.21.0+cpu\n[conda] Could not collect",
  "transformers_version": "4.51.3",
  "upper_git_hash": null,
  "tokenizer_pad_token": [
    "<pad>",
    "0"
  ],
  "tokenizer_eos_token": [
    "<eos>",
    "1"
  ],
  "tokenizer_bos_token": [
    "<bos>",
    "2"
  ],
  "eot_token_id": 1,
  "max_length": 985,
  "args": {
    "buckets": [
      16,
      32,
      64,
      128,
      189,
      284,
      384,
      985
    ],
    "output_file": "tmp.json",
    "tasks": [
      "piqa"
    ],.
.
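
For readers who want to interpret the piqa numbers above, the following sketch recomputes the standard error from the accuracy and sample count and derives an approximate 95% interval. It assumes the reported `acc_stderr` is the sample standard error of the mean of 0/1 outcomes (with the n - 1 correction) and uses a normal approximation; both are assumptions, not something the log states.

```python
import math

# Values copied from the results and n-samples sections of the JSON output.
acc = 0.7856365614798694
reported_stderr = 0.009574842136050941
n = 1838

# Standard error of a mean of 0/1 outcomes, using the sample variance (n - 1).
stderr = math.sqrt(acc * (1.0 - acc) / (n - 1))

# Approximate 95% confidence interval under a normal approximation.
low = acc - 1.96 * reported_stderr
high = acc + 1.96 * reported_stderr

print(f"recomputed stderr: {stderr:.6f} (reported {reported_stderr:.6f})")
print(f"approx. 95% CI for acc: [{low:.4f}, {high:.4f}]")
```

The recomputed value agrees with the reported one to about six decimal places, and the interval spans roughly 0.767 to 0.804, which gives a sense of how large a change in accuracy would be needed before it could be distinguished from noise on this task.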

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@imangohari1 imangohari1 marked this pull request as ready for review August 19, 2025 16:34
@imangohari1 imangohari1 requested a review from regisss as a code owner August 19, 2025 16:34
@imangohari1 imangohari1 mentioned this pull request Aug 19, 2025
3 tasks
Comment thread optimum/habana/transformers/models/gemma2/modeling_gemma2.py Outdated
Contributor

@yafshar yafshar left a comment

@imangohari1 please address the comment. Other than that, everything looks good to me. Thanks for the cleanup :)

LGTM!

@regisss please do the final check on this PR.

@imangohari1
Contributor Author

@imangohari1 please address the comment. Other than that, everything looks good to me. Thanks for the cleanup :)

LGTM!

@regisss please do the final check on this PR.

Done.

Collaborator

@regisss regisss left a comment

LGTM!

@regisss regisss merged commit 5877b21 into huggingface:main Aug 21, 2025
2 of 4 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

astachowiczhabana pushed a commit that referenced this pull request Sep 17, 2025
gplutop7 pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Oct 15, 2025
Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com>
4 participants