
finetune_lora upgrades #2086

Merged
ysjprojects merged 33 commits into Lightning-AI:main from ysjprojects:finetune_lora_upgrade
Aug 13, 2025

Conversation

ysjprojects (Collaborator) commented Jul 3, 2025

Integrated some improvements to finetune_lora based on @lantiga's reference implementation.

Most of the optimizations are there, but I did not include the model registry feature, because litgpt does not currently support litmodels and adding litmodels support is a PR in itself, IMO.

Still WIP, as I need to run some experiments (and potentially modify some test cases).

closes #2104

ysjprojects (Collaborator, Author) commented

UPDATE: tested it in a 4xA100 environment; works well for a multi-GPU setup.

ysjprojects changed the title from "(WIP) finetune_lora upgrades" to "finetune_lora upgrades" on Jul 8, 2025
Borda (Collaborator) left a comment

LGTM, just confused about the legacy stuff

ysjprojects (Collaborator, Author) commented Jul 8, 2025

> LGTM, just confused about the legacy stuff

Essentially, it's the exact LoRA implementation from before this PR.

The changes are quite breaking, so my rationale is that it may be apt to still give users the option to run the previous version of lora.py, hence the name legacy.
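
For illustration, a minimal sketch of how such a fallback could be dispatched. The `lora_legacy` module name and the `use_legacy` flag below are assumptions for the sketch, not necessarily what this PR actually ships:

```python
# Hypothetical sketch only: the actual wiring in this PR may differ.
# `litgpt.finetune.lora_legacy` is an assumed module name for the pre-PR code.
def setup_finetune(use_legacy: bool = False, **kwargs):
    if use_legacy:
        from litgpt.finetune import lora_legacy as lora_module  # assumed name
    else:
        from litgpt.finetune import lora as lora_module  # current implementation
    # Both modules are assumed to expose the same `setup` entry point.
    lora_module.setup(**kwargs)
```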

KaelanDt (Contributor) left a comment

Looks great! I think we should add some tests for the new functions, though; left some comments.

ysjprojects enabled auto-merge (squash) July 10, 2025 06:36
ysjprojects requested a review from KaelanDt July 10, 2025 06:37
ysjprojects (Collaborator, Author) commented

Hi, any updates on this PR?

Borda (Collaborator) commented Jul 25, 2025

> Hi, any updates on this PR?

let's check the failing tests, pls

ysjprojects (Collaborator, Author) commented Jul 31, 2025

> Hi, any updates on this PR?
>
> let's check the failing tests, pls

Just checked; it's an authorization error that is unrelated to the code.

>               raise OSError(f"There was a specific connection error when trying to load {path_or_repo_id}:\n{e}")
E               OSError: There was a specific connection error when trying to load deepseek-ai/DeepSeek-R1-Distill-Llama-70B:
E               401 Client Error: Unauthorized for url: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B/resolve/main/config.json (Request ID: Root=1-688bc86a-369f476a1220b18c4b573570;66f9fa0c-4e04-46b4-85ba-3c286e8d0d91)
E
E               Invalid credentials in Authorization header

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/transformers/utils/hub.py:501: OSError

Also, I noticed that the pull_request pytests are skipped in other PRs but not in this one. These are also the tests that happen to fail. Could this be the reason?
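
As an aside, a quick way to reproduce this kind of 401 locally is to check whether valid Hugging Face credentials are visible to the environment. A minimal sketch, assuming huggingface_hub is installed (illustrative only, not part of the PR):

```python
# Illustrative only: verify that the HF token visible to the environment is valid.
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

api = HfApi()  # picks up HF_TOKEN or a cached `huggingface-cli login` token
try:
    user = api.whoami()
    print(f"Authenticated as: {user['name']}")
except HfHubHTTPError as err:
    # A 401 here corresponds to the "Invalid credentials" failure in CI.
    print(f"Authentication failed: {err}")
```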

ysjprojects (Collaborator, Author) commented

Hi @Borda, there are still HF-authorization-related issues in the pull_request workflow.

Would like your feedback on this, thanks!

Borda (Collaborator) commented Aug 13, 2025

> Hi @Borda, there are still HF-authorization-related issues in the pull_request workflow.
>
> Would like your feedback on this, thanks!

Yes, it is still there, so in such a case I can open a "shadow" PR for you that will run your contribution as pull_request, and if everything passes, we can safely merge your PR.
But I hear your call; that is not the right way, and it makes contributing difficult 😕

Borda (Collaborator) commented Aug 13, 2025

Just failing on GPU; all the rest is fine:

        assert {p.name for p in tmp_path.rglob("*.lora")} == {"lit_model.pth.lora"}
        state_dict = torch.load(tmp_path / "final" / "lit_model.pth.lora")
>       assert len(state_dict) == 1
E       AssertionError: assert 4 == 1
E        +  where 4 = len({'transformer.h.0.attn.qkv.lora_A': tensor([...], dtype=torch.float16), ..., 'transformer.h.1.attn.qkv.lora_B': tensor([...], dtype=torch.float16)})
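
For context, the usual pattern for a `.lora` checkpoint is to keep only the adapter weights. A minimal sketch of that filtering, assuming the convention visible in the failure above that adapter parameter names contain "lora_" (the helper names are illustrative, not the test's actual code):

```python
# Illustrative sketch: extract only LoRA adapter weights from a state dict.
# The "lora_" substring convention matches the keys in the failing assertion.
import torch

def lora_filter(key: str, value: torch.Tensor) -> bool:
    return "lora_" in key

def extract_lora_state_dict(state_dict: dict) -> dict:
    return {k: v for k, v in state_dict.items() if lora_filter(k, v)}

# The dict in the failing assertion would keep all four entries here, since
# every key (...attn.qkv.lora_A / lora_B) contains "lora_".
```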

Borda (Collaborator) commented Aug 13, 2025

@ysjprojects nice push! 💟

ysjprojects merged commit 5ac69b0 into Lightning-AI:main on Aug 13, 2025
44 of 51 checks passed
ysjprojects (Collaborator, Author) commented

@Borda hi, the test cases are fixed! The PR merged by itself; not sure if that was the intended effect.

