
Kubeflow pipeline unable to run due to large packages taking limited space #37

Open · K123AsJ0k1 opened this issue May 20, 2024 · 2 comments
Labels: enhancement (New feature or request), question (Further information is requested)

K123AsJ0k1 (Collaborator) commented May 20, 2024

In the current Kubeflow pipeline resource configuration, large packages like PyTorch can cause the pipeline to fail because their size (around 4 GB) consumes the available space during package installation. The temporary fix for this seems to be deleting the created Kubeflow component pods with:

kubectl get pods -n <namespace>
kubectl delete pod <pod-name> -n <namespace>

A better fix would be to somehow enable Kubeflow components to install the CPU-only variant of torch, which in a regular venv can be installed with:

torch==2.3.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
torchvision==0.18.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

However, Kubeflow component package installs don't understand the -f option, which is why I think a more lasting fix would be increasing the Kubeflow pipeline resource configuration, if possible.

K123AsJ0k1 added the enhancement and question labels on May 20, 2024
JoaquinRivesGambin (Contributor) commented:
@K123AsJ0k1 Is the issue that the component goes over the disk memory limit for the task? If that is the case, we can also increase that limit. I can't remember exactly what the argument was called, but I think it was something like disk_limit:

@component(
    base_image="python:3.10",
    packages_to_install=["numpy", "mlflow~=2.4.1"],
    output_component_file='components/evaluate_component.yaml',
    disk_limit='10Gi'
)
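For reference, in the KFP v2 SDK the per-task resource knobs I'm aware of live on the task object rather than in the @component decorator. A sketch under that assumption (the method names, especially set_ephemeral_storage_limit, should be checked against the installed kfp version — this is not confirmed from the thread):

```python
from kfp import dsl

@dsl.component(base_image="python:3.10")
def evaluate():
    pass

@dsl.pipeline(name="resource-limit-sketch")
def pipeline():
    task = evaluate()
    # Assumed KFP v2 setters; verify they exist in your kfp release.
    task.set_memory_limit("8G")
    # Ephemeral storage is what pip installs consume inside the pod.
    task.set_ephemeral_storage_limit("10Gi")
```

This is a configuration sketch only; whether ephemeral storage is the limit being hit here is exactly the open question above.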

K123AsJ0k1 (Collaborator, Author) commented:

@JoaquinRivesGambin It is possible, but to me it seems more like a collective memory limit than a per-component one. I will, however, test whether that change works. Regardless, after checking the Kubeflow pipeline docs, I found out that components accept pip_index_urls, which can be used similarly to the -f option in pip installs. This means that we can reduce the torch package size with the following:

base_image = "python:3.10",
packages_to_install = [
    "python-swiftclient",
    "torch==2.3.0",
    "torchvision==0.18.0"
],
pip_index_urls = [
    "https://pypi.org/simple",
    "https://download.pytorch.org/whl/cpu",
    "https://download.pytorch.org/whl/cpu"
]
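To reason about what pip_index_urls does: as I understand it, the component roughly turns the first entry into pip's --index-url and the remaining entries into --extra-index-url flags, which is how the CPU-only wheel index gets picked up. A minimal sketch of that mapping (the pip_command helper is hypothetical and not part of the kfp SDK):

```python
# Hypothetical helper approximating the pip invocation a component
# container would run; it is NOT part of the kfp SDK.
def pip_command(packages, index_urls):
    options = []
    for i, url in enumerate(index_urls):
        # Assumption: first URL maps to --index-url, the rest to --extra-index-url.
        flag = "--index-url" if i == 0 else "--extra-index-url"
        options += [flag, url]
    return ["python", "-m", "pip", "install", *options, *packages]

cmd = pip_command(
    ["python-swiftclient", "torch==2.3.0", "torchvision==0.18.0"],
    ["https://pypi.org/simple", "https://download.pytorch.org/whl/cpu"],
)
print(" ".join(cmd))
```

Because the CPU wheel index only serves CPU builds, pip resolves torch==2.3.0 to the much smaller CPU-only wheel without needing the +cpu local version suffix or the -f flag.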
