
Conversation

Electronic-Waste
Member

What this PR does / why we need it:

This PR introduces LoraConfig in TorchTuneConfig and propagates the configuration to the command-line arguments.

I tested it with the following script:

from kubeflow.trainer import *

client = TrainerClient()

# QLoRA
client.train(
    runtime=client.get_runtime(name="torchtune-llama3.2-1b"),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data"
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            access_token="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",
        )
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET,
            ),
            peft_config=LoraConfig(
                apply_lora_to_mlp=True,
                lora_attn_modules=["q_proj", "k_proj", "v_proj", "output_proj"],
                quantize_base=True,
            ),
            resources_per_node={
                "gpu": 1,
            }
        )
    )
)
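
For readers less familiar with torchtune: its recipes accept dotted key=value overrides on top of a base YAML config, so the SDK side of this change essentially amounts to flattening the LoraConfig fields above into such overrides and appending them to the training command. The snippet below is a minimal sketch of that idea, not the actual code in kubeflow/trainer/utils/utils.py; the stand-in dataclass and the exact model.* key names are assumptions based on typical torchtune LoRA recipe configs.

# Minimal, illustrative sketch only: NOT the SDK implementation.
# LoraConfigSketch is a hypothetical stand-in whose field names mirror the
# LoraConfig used in the script above.
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoraConfigSketch:
    apply_lora_to_mlp: bool = False
    lora_attn_modules: List[str] = field(default_factory=lambda: ["q_proj", "v_proj"])
    lora_rank: int = 8
    lora_alpha: int = 16
    quantize_base: bool = False


def lora_overrides(cfg: LoraConfigSketch) -> List[str]:
    """Flatten the LoRA fields into dotted key=value overrides (assumed key names)."""
    modules = ",".join(cfg.lora_attn_modules)
    return [
        f"model.lora_attn_modules=[{modules}]",
        f"model.apply_lora_to_mlp={cfg.apply_lora_to_mlp}",
        f"model.lora_rank={cfg.lora_rank}",
        f"model.lora_alpha={cfg.lora_alpha}",
        f"model.quantize_base={cfg.quantize_base}",
    ]


print(lora_overrides(LoraConfigSketch(apply_lora_to_mlp=True, quantize_base=True)))
# ['model.lora_attn_modules=[q_proj,v_proj]', 'model.apply_lora_to_mlp=True', ...]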

Results:

Setting manual seed to local seed 2668943352. Local seed is seed + rank = 2668943352 + 0
Model is initialized with precision torch.bfloat16.
Memory stats after model init:
        GPU peak memory allocation: 1.21 GiB
        GPU peak memory reserved: 1.25 GiB
        GPU peak memory active: 1.21 GiB
Tokenizer is initialized from file.
Optimizer and loss are initialized.
Loss is initialized.
Writing logs to /workspace/output/logs/log_1758120939.txt
Generating train split: 52002 examples [00:00, 275879.07 examples/s]
Learning rate scheduler is initialized.
 Profiling disabled.
 Profiler config after instantiation: {'enabled': False}
1|1625|Loss: 1.6104315519332886: 100%|██████████| 1625/1625 [32:52<00:00,  1.26s/it]Starting checkpoint save...
Model checkpoint of size 2.30 GiB saved to /workspace/output/epoch_0/model-00001-of-00001.safetensors
Adapter checkpoint of size 0.08 GiB saved to /workspace/output/epoch_0/adapter_model.pt
Adapter checkpoint of size 0.08 GiB saved to /workspace/output/epoch_0/adapter_model.safetensors
Adapter checkpoint of size 0.00 GiB saved to /workspace/output/epoch_0/adapter_config.json
Saving final epoch checkpoint.
The full model checkpoint, including all weights and configurations, has been saved successfully. You can now use this checkpoint for further training or inference.
Checkpoint saved in 5.90 seconds.
1|1625|Loss: 1.6104315519332886: 100%|██████████| 1625/1625 [32:57<00:00,  1.22s/it]
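
As a side note for anyone who wants to try the artifacts above: torchtune saves the adapter in a Hugging Face PEFT-style layout (adapter_model.safetensors plus adapter_config.json), so one plausible way to smoke-test the result is to load it with peft on top of the base model. This is purely illustrative and not part of the PR; the adapter path comes from the log above, and compatibility ultimately depends on the checkpointer used.

# Illustrative only: load the QLoRA adapter produced above with Hugging Face PEFT.
# Assumes the adapter directory from the log has been copied somewhere reachable.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_dir = "/workspace/output/epoch_0"  # path taken from the training log

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Attach the fine-tuned LoRA adapter to the base weights.
model = PeftModel.from_pretrained(base, adapter_dir)

inputs = tokenizer("Give three tips for staying healthy.", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))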

/cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @kramaranya @szaher @eoinfennessy @franciscojavierarceo @rudeigerc

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #

Checklist:

  • Docs included if any changes are user facing


@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-trainer-team, kubeflow/kubeflow-sdk-team.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste Electronic-Waste changed the title from "Support LoraConfig in TorchTune BuiltinTrainer" to "feat: Support LoraConfig in TorchTune BuiltinTrainer" on Sep 17, 2025
@coveralls

coveralls commented Sep 17, 2025

Pull Request Test Coverage Report for Build 18339419302

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 14 of 16 (87.5%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.7%) to 72.43%

Changes Missing Coverage:
  File                              Covered Lines  Changed/Added Lines  Coverage
  kubeflow/trainer/utils/utils.py   14             16                   87.5%

Totals Coverage Status:
  Change from base Build 17797245683: 0.7%
  Covered Lines: 310
  Relevant Lines: 428

💛 - Coveralls

@szaher
Member

szaher commented Sep 23, 2025

Thanks @Electronic-Waste for your PR.
I believe TorchTune was deprecated by its maintainers as per meta-pytorch/torchtune#2883.
Do we need to keep it as part of the Kubeflow Trainer/SDK, or should we perhaps implement something like Unsloth or Axolotl instead?

@Electronic-Waste
Member Author

Electronic-Waste commented Sep 23, 2025

@szaher Yes, TorchTune has been deprecated, so we plan to support the "Kubeflow Dynamic LLM Trainer Framework" (kubeflow/trainer#2839) in the future.

However, TorchTune is still one of the best choices for SFT across a wide range of models. We've also recently added a new runtime for Qwen in Kubeflow Trainer (kubeflow/trainer#2835) to provide an initial out-of-the-box LLM fine-tuning experience on Kubernetes.

We will follow this workflow:

  1. Implement some basic features in the TorchTune LLM Trainer (TorchTune BuiltinTrainer) to provide initial support for an out-of-the-box LLM fine-tuning experience on Kubernetes
  2. Design and implement the "Kubeflow Dynamic LLM Trainer Framework" while integrating the TorchTune LLM Trainer into it
  3. Gradually deprecate the TorchTune LLM Trainer as new models evolve and new post-training frameworks are implemented

This maintains backward compatibility while providing early support for out-of-the-box LLM fine-tuning on Kubernetes. I think it's a more graceful and user-friendly path than deprecating TorchTune immediately.

What's more, we also want to present this feature at Kubeflow Summit NA this year, so it would be better if we could implement it before then.

WDYT? @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team

@kramaranya
Contributor

+1 to what @Electronic-Waste mentioned! As we agreed on the Dynamic LLM Trainer Framework as part of kubeflow/trainer#2752, I think we shouldn't deprecate the TorchTune LLM Trainer for now, but instead keep it extensible to support multiple backends.

@andreyvelich
Member

Agree with @Electronic-Waste that we should support LoRA configurations in the TorchTune config for now.
That gives users short-term value from LLM fine-tuning via torchtune, and might help us migrate to the newer solution from the PyTorch maintainers in the future.

@google-oss-prow google-oss-prow bot added size/M and removed size/L labels Oct 2, 2025
Member

@andreyvelich andreyvelich left a comment


Thanks @Electronic-Waste!
I left a few comments.
cc @kubeflow/kubeflow-sdk-team

@andreyvelich
Member

@Electronic-Waste Please also add a unit test for it: https://github.com/kubeflow/sdk/blob/main/kubeflow/trainer/backends/kubernetes/backend_test.py#L694
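
For what it's worth, such a test could construct a TorchTuneConfig with a LoraConfig and assert that the LoRA fields show up in the generated command. The sketch below is only an illustration: build_torchtune_args is a placeholder name for whatever helper in kubeflow/trainer/utils/utils.py actually builds the args, and the expected override strings are assumptions rather than the SDK's verified output.

# Hypothetical sketch of a unit test; build_torchtune_args is a placeholder name
# and the expected "model.*" overrides are assumptions for illustration.
from kubeflow.trainer import TorchTuneConfig, LoraConfig


def test_torchtune_config_propagates_lora_fields():
    config = TorchTuneConfig(
        peft_config=LoraConfig(
            apply_lora_to_mlp=True,
            lora_attn_modules=["q_proj", "v_proj"],
            quantize_base=True,
        ),
    )

    args = " ".join(build_torchtune_args(config))  # placeholder helper under test

    assert "model.apply_lora_to_mlp=True" in args
    assert "model.lora_attn_modules=[q_proj,v_proj]" in args
    assert "model.quantize_base=True" in args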

@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Oct 8, 2025
@Electronic-Waste Electronic-Waste mentioned this pull request Oct 8, 2025
@andreyvelich
Member

/milestone v0.2

@google-oss-prow google-oss-prow bot added this to the v0.2 milestone Oct 8, 2025
@Electronic-Waste
Member Author

I think this PR should be ready now.

Member

@andreyvelich andreyvelich left a comment


Thank you for this @Electronic-Waste!
/lgtm
/approve
/hold for kubeflow/trainer#2832


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Electronic-Waste
Member Author

/hold cancel

@google-oss-prow google-oss-prow bot merged commit 80f6b0e into kubeflow:main Oct 15, 2025
10 checks passed
briangallagher pushed a commit to opendatahub-io/kubeflow-sdk that referenced this pull request Oct 15, 2025
* feat: Add lora types.

Signed-off-by: Electronic-Waste <[email protected]>

* chore: propagate lora parameters in command.

Signed-off-by: Electronic-Waste <[email protected]>

* feat(lora): Add support for QLoRA.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lora): remove extra quote symbol in lora attn module.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lora): replace direct field override with field map.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lora): remove extra flags.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lora): fix wrong default list value in LoraConfig.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lora): remove outdated code.

Signed-off-by: Electronic-Waste <[email protected]>

* test(backend): Add test for lora.

Signed-off-by: Electronic-Waste <[email protected]>

---------

Signed-off-by: Electronic-Waste <[email protected]>
