
Conversation

Electronic-Waste
Member

What this PR does / why we need it:

This PR introduces LoraConfig in TorchTuneConfig and propagates the configuration to the command-line arguments.

I tested it with the following script:

from kubeflow.trainer import *

client = TrainerClient()

# QLoRA
client.train(
    runtime=client.get_runtime(name="torchtune-llama3.2-1b"),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data"
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            access_token="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",
        )
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET,
            ),
            peft_config=LoraConfig(
                apply_lora_to_mlp=True,
                lora_attn_modules=["q_proj", "k_proj", "v_proj", "output_proj"],
                quantize_base=True,
            ),
            resources_per_node={
                "gpu": 1,
            }
        )
    )
)
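
For readers less familiar with torchtune: its recipes accept dotted key=value overrides on top of a base YAML config, so the SDK side of this change essentially amounts to flattening the LoraConfig fields above into such overrides and appending them to the training command. The snippet below is a minimal sketch of that idea, not the actual code in kubeflow/trainer/utils/utils.py; the stand-in dataclass and the exact model.* key names are assumptions based on typical torchtune LoRA recipe configs.

# Minimal, illustrative sketch only: NOT the SDK implementation.
# LoraConfigSketch is a hypothetical stand-in whose field names mirror the
# LoraConfig used in the script above.
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoraConfigSketch:
    apply_lora_to_mlp: bool = False
    lora_attn_modules: List[str] = field(default_factory=lambda: ["q_proj", "v_proj"])
    lora_rank: int = 8
    lora_alpha: int = 16
    quantize_base: bool = False


def lora_overrides(cfg: LoraConfigSketch) -> List[str]:
    """Flatten the LoRA fields into dotted key=value overrides (assumed key names)."""
    modules = ",".join(cfg.lora_attn_modules)
    return [
        f"model.lora_attn_modules=[{modules}]",
        f"model.apply_lora_to_mlp={cfg.apply_lora_to_mlp}",
        f"model.lora_rank={cfg.lora_rank}",
        f"model.lora_alpha={cfg.lora_alpha}",
        f"model.quantize_base={cfg.quantize_base}",
    ]


print(lora_overrides(LoraConfigSketch(apply_lora_to_mlp=True, quantize_base=True)))
# ['model.lora_attn_modules=[q_proj,v_proj]', 'model.apply_lora_to_mlp=True', ...]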

Results:

Setting manual seed to local seed 2668943352. Local seed is seed + rank = 2668943352 + 0
Model is initialized with precision torch.bfloat16.
Memory stats after model init:
        GPU peak memory allocation: 1.21 GiB
        GPU peak memory reserved: 1.25 GiB
        GPU peak memory active: 1.21 GiB
Tokenizer is initialized from file.
Optimizer and loss are initialized.
Loss is initialized.
Writing logs to /workspace/output/logs/log_1758120939.txt
Generating train split: 52002 examples [00:00, 275879.07 examples/s]
Learning rate scheduler is initialized.
 Profiling disabled.
 Profiler config after instantiation: {'enabled': False}
1|1625|Loss: 1.6104315519332886: 100%|██████████| 1625/1625 [32:52<00:00,  1.26s/it]Starting checkpoint save...
Model checkpoint of size 2.30 GiB saved to /workspace/output/epoch_0/model-00001-of-00001.safetensors
Adapter checkpoint of size 0.08 GiB saved to /workspace/output/epoch_0/adapter_model.pt
Adapter checkpoint of size 0.08 GiB saved to /workspace/output/epoch_0/adapter_model.safetensors
Adapter checkpoint of size 0.00 GiB saved to /workspace/output/epoch_0/adapter_config.json
Saving final epoch checkpoint.
The full model checkpoint, including all weights and configurations, has been saved successfully. You can now use this checkpoint for further training or inference.
Checkpoint saved in 5.90 seconds.
1|1625|Loss: 1.6104315519332886: 100%|██████████| 1625/1625 [32:57<00:00,  1.22s/it]
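
As a side note for anyone who wants to try the artifacts above: torchtune saves the adapter in a Hugging Face PEFT-style layout (adapter_model.safetensors plus adapter_config.json), so one plausible way to smoke-test the result is to load it with peft on top of the base model. This is purely illustrative and not part of the PR; the adapter path comes from the log above, and compatibility ultimately depends on the checkpointer used.

# Illustrative only: load the QLoRA adapter produced above with Hugging Face PEFT.
# Assumes the adapter directory from the log has been copied somewhere reachable.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_dir = "/workspace/output/epoch_0"  # path taken from the training log

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Attach the fine-tuned LoRA adapter to the base weights.
model = PeftModel.from_pretrained(base, adapter_dir)

inputs = tokenizer("Give three tips for staying healthy.", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))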

/cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @kramaranya @szaher @eoinfennessy @franciscojavierarceo @rudeigerc

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #

Checklist:

  • Docs included if any changes are user facing


@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-trainer-team, kubeflow/kubeflow-sdk-team.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste Electronic-Waste changed the title from "Support LoraConfig in TorchTune BuiltinTrainer" to "feat: Support LoraConfig in TorchTune BuiltinTrainer" on Sep 17, 2025
@coveralls

coveralls commented Sep 17, 2025

Pull Request Test Coverage Report for Build 18339419302

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 14 of 16 (87.5%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.7%) to 72.43%

Changes Missing Coverage:
  File                              Covered Lines  Changed/Added Lines  Coverage
  kubeflow/trainer/utils/utils.py   14             16                   87.5%

Totals Coverage Status:
  Change from base Build 17797245683: 0.7%
  Covered Lines: 310
  Relevant Lines: 428

💛 - Coveralls

@szaher
Member

szaher commented Sep 23, 2025

Thanks @Electronic-Waste for your PR.
I believe TorchTune was deprecated by its maintainers as per meta-pytorch/torchtune#2883.
Do we need to keep it as part of the Kubeflow Trainer/SDK, or should we perhaps implement something like Unsloth or Axolotl instead?

@Electronic-Waste
Member Author

Electronic-Waste commented Sep 23, 2025

@szaher Yes, TorchTune has been deprecated, so we plan to support the "Kubeflow Dynamic LLM Trainer Framework" (kubeflow/trainer#2839) in the future.

However, TorchTune is still one of the best choices for SFT across a wide range of models. We've also recently added a new runtime for Qwen in Kubeflow Trainer (kubeflow/trainer#2835) to provide an initial out-of-the-box LLM fine-tuning experience on Kubernetes.

We will follow this workflow:

  1. Implement some basic features in the TorchTune LLM Trainer (TorchTune BuiltinTrainer) to provide initial support for an out-of-the-box LLM fine-tuning experience on Kubernetes
  2. Design and implement the "Kubeflow Dynamic LLM Trainer Framework" while integrating the TorchTune LLM Trainer into it
  3. Gradually deprecate the TorchTune LLM Trainer as new models evolve and new post-training frameworks are implemented

This maintains backward compatibility while providing early support for out-of-the-box LLM fine-tuning on Kubernetes. I think it's a more graceful and user-friendly path than deprecating TorchTune immediately.

What's more, we also want to present this feature at Kubeflow Summit NA this year, so it would be better if we could implement it before then.

WDYT? @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team

@kramaranya
Contributor

+1 to what @Electronic-Waste mentioned! As we agreed on the Dynamic LLM Trainer Framework as part of kubeflow/trainer#2752, I think we shouldn't deprecate the TorchTune LLM Trainer for now, but instead keep it extensible to support multiple backends.

@andreyvelich
Member

Agree with @Electronic-Waste that we should support LoRA configurations in the TorchTune config for now.
That gives users short-term value from LLM fine-tuning via torchtune, and might help us migrate to the newer solution from the PyTorch maintainers in the future.

@google-oss-prow google-oss-prow bot added size/M and removed size/L labels Oct 2, 2025
Member

@andreyvelich andreyvelich left a comment


Thanks @Electronic-Waste!
I left a few comments.
cc @kubeflow/kubeflow-sdk-team

@andreyvelich
Member

@Electronic-Waste Please also add a unit test for it: https://github.com/kubeflow/sdk/blob/main/kubeflow/trainer/backends/kubernetes/backend_test.py#L694
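
For what it's worth, such a test could construct a TorchTuneConfig with a LoraConfig and assert that the LoRA fields show up in the generated command. The sketch below is only an illustration: build_torchtune_args is a placeholder name for whatever helper in kubeflow/trainer/utils/utils.py actually builds the args, and the expected override strings are assumptions rather than the SDK's verified output.

# Hypothetical sketch of a unit test; build_torchtune_args is a placeholder name
# and the expected "model.*" overrides are assumptions for illustration.
from kubeflow.trainer import TorchTuneConfig, LoraConfig


def test_torchtune_config_propagates_lora_fields():
    config = TorchTuneConfig(
        peft_config=LoraConfig(
            apply_lora_to_mlp=True,
            lora_attn_modules=["q_proj", "v_proj"],
            quantize_base=True,
        ),
    )

    args = " ".join(build_torchtune_args(config))  # placeholder helper under test

    assert "model.apply_lora_to_mlp=True" in args
    assert "model.lora_attn_modules=[q_proj,v_proj]" in args
    assert "model.quantize_base=True" in args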

@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Oct 8, 2025
@Electronic-Waste Electronic-Waste mentioned this pull request Oct 8, 2025
@andreyvelich
Member

/milestone v0.2

@google-oss-prow google-oss-prow bot added this to the v0.2 milestone Oct 8, 2025
@Electronic-Waste
Member Author

I think this PR should be ready now.

Member

@andreyvelich andreyvelich left a comment


Thank you for this @Electronic-Waste!
/lgtm
/approve
/hold for kubeflow/trainer#2832


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Electronic-Waste
Member Author

/hold cancel

@google-oss-prow google-oss-prow bot merged commit 80f6b0e into kubeflow:main Oct 15, 2025
10 checks passed
briangallagher pushed a commit to opendatahub-io/kubeflow-sdk that referenced this pull request Oct 15, 2025
* feat: Add lora types.

Signed-off-by: Electronic-Waste <[email protected]>

* chore: propagate lora parameters in command.

Signed-off-by: Electronic-Waste <[email protected]>

* feat(lora): Add support for QLoRA.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lora): remove extra quote symbol in lora attn module.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lora): replace direct field override with field map.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lora): remove extra flags.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lora): fix wrong default list value in LoraConfig.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lora): remove outdated code.

Signed-off-by: Electronic-Waste <[email protected]>

* test(backend): Add test for lora.

Signed-off-by: Electronic-Waste <[email protected]>

---------

Signed-off-by: Electronic-Waste <[email protected]>
