
finetune_lora upgrades #2086

Merged
ysjprojects merged 33 commits into Lightning-AI:main from ysjprojects:finetune_lora_upgrade
Aug 13, 2025

Conversation

ysjprojects (Collaborator) commented Jul 3, 2025

Integrated some improvements to finetune_lora based on @lantiga's reference implementation.

Most of the optimizations are there, but I did not include the model registry feature, because litgpt does not currently support litmodels and adding litmodels support is a PR in itself, IMO.

Still WIP, as I need to run some experiments (and potentially modify some test cases).

closes #2104

ysjprojects (Collaborator, Author) commented

UPDATE: tested it in a 4xA100 environment; works well for a multi-GPU setup.

ysjprojects changed the title from "(WIP) finetune_lora upgrades" to "finetune_lora upgrades" on Jul 8, 2025
Borda (Collaborator) left a comment

LGTM, just confused about the legacy stuff

ysjprojects (Collaborator, Author) commented Jul 8, 2025

> LGTM, just confused about the legacy stuff

Essentially, it's the exact LoRA implementation from before this PR.

The changes are quite breaking, so my rationale is that it may be apt to still give users the option to run the previous version of lora.py, hence the name legacy.
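
For illustration, a minimal sketch of how such a fallback could be dispatched. The `lora_legacy` module name and the `use_legacy` flag below are assumptions for the sketch, not necessarily what this PR actually ships:

```python
# Hypothetical sketch only: the actual wiring in this PR may differ.
# `litgpt.finetune.lora_legacy` is an assumed module name for the pre-PR code.
def setup_finetune(use_legacy: bool = False, **kwargs):
    if use_legacy:
        from litgpt.finetune import lora_legacy as lora_module  # assumed name
    else:
        from litgpt.finetune import lora as lora_module  # current implementation
    # Both modules are assumed to expose the same `setup` entry point.
    lora_module.setup(**kwargs)
```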

KaelanDt (Contributor) left a comment

Looks great! I think we should add some tests for the new functions, though; left some comments.

ysjprojects enabled auto-merge (squash) July 10, 2025 06:36
ysjprojects requested a review from KaelanDt July 10, 2025 06:37
ysjprojects (Collaborator, Author) commented

Hi, any updates on this PR?

Borda (Collaborator) commented Jul 25, 2025

> Hi, any updates on this PR?

let's check the failing tests, pls

ysjprojects (Collaborator, Author) commented Jul 31, 2025

> Hi, any updates on this PR?
>
> let's check the failing tests, pls

Just checked; it's an authorization error that is unrelated to the code.

>               raise OSError(f"There was a specific connection error when trying to load {path_or_repo_id}:\n{e}")
E               OSError: There was a specific connection error when trying to load deepseek-ai/DeepSeek-R1-Distill-Llama-70B:
E               401 Client Error: Unauthorized for url: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B/resolve/main/config.json (Request ID: Root=1-688bc86a-369f476a1220b18c4b573570;66f9fa0c-4e04-46b4-85ba-3c286e8d0d91)
E
E               Invalid credentials in Authorization header

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/transformers/utils/hub.py:501: OSError

Also, I noticed that the pull_request pytests are skipped in other PRs but not in this one. These are also the tests that happen to fail. Could this be the reason?
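
As an aside, a quick way to reproduce this kind of 401 locally is to check whether valid Hugging Face credentials are visible to the environment. A minimal sketch, assuming huggingface_hub is installed (illustrative only, not part of the PR):

```python
# Illustrative only: verify that the HF token visible to the environment is valid.
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

api = HfApi()  # picks up HF_TOKEN or a cached `huggingface-cli login` token
try:
    user = api.whoami()
    print(f"Authenticated as: {user['name']}")
except HfHubHTTPError as err:
    # A 401 here corresponds to the "Invalid credentials" failure in CI.
    print(f"Authentication failed: {err}")
```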

ysjprojects (Collaborator, Author) commented

Hi @Borda, there are still HF-authorization-related issues in the pull_request workflow.

Would like your feedback on this, thanks!

Borda (Collaborator) commented Aug 13, 2025

> Hi @Borda, there are still HF-authorization-related issues in the pull_request workflow.
>
> Would like your feedback on this, thanks!

Yes, it is still there, so in such a case I can open a "shadow" PR for you that will run your contribution as pull_request, and if everything passes, we can safely merge your PR.
But I hear your call; that is not the right way, and it makes contributing difficult 😕

Borda (Collaborator) commented Aug 13, 2025

Just failing on GPU; all the rest is fine:

        assert {p.name for p in tmp_path.rglob("*.lora")} == {"lit_model.pth.lora"}
        state_dict = torch.load(tmp_path / "final" / "lit_model.pth.lora")
>       assert len(state_dict) == 1
E       AssertionError: assert 4 == 1
E        +  where 4 = len({'transformer.h.0.attn.qkv.lora_A': tensor([...], dtype=torch.float16), ..., 'transformer.h.1.attn.qkv.lora_B': tensor([...], dtype=torch.float16)})
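
For context, the usual pattern for a `.lora` checkpoint is to keep only the adapter weights. A minimal sketch of that filtering, assuming the convention visible in the failure above that adapter parameter names contain "lora_" (the helper names are illustrative, not the test's actual code):

```python
# Illustrative sketch: extract only LoRA adapter weights from a state dict.
# The "lora_" substring convention matches the keys in the failing assertion.
import torch

def lora_filter(key: str, value: torch.Tensor) -> bool:
    return "lora_" in key

def extract_lora_state_dict(state_dict: dict) -> dict:
    return {k: v for k, v in state_dict.items() if lora_filter(k, v)}

# The dict in the failing assertion would keep all four entries here, since
# every key (...attn.qkv.lora_A / lora_B) contains "lora_".
```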

Borda (Collaborator) commented Aug 13, 2025

@ysjprojects nice push! 💟

ysjprojects merged commit 5ac69b0 into Lightning-AI:main on Aug 13, 2025
44 of 51 checks passed
ysjprojects (Collaborator, Author) commented

@Borda hi, the test cases are fixed! The PR merged by itself; not sure if that was the intended effect.

