
Consolidate on how we run python #264

Closed · gregtatum opened this issue Nov 17, 2023 · 5 comments

Labels: question (Further information is requested), refactoring (Improve the code quality)

gregtatum commented Nov 17, 2023

We have both Conda, which was part of the original project, and Poetry, which I introduced with some of the CI tooling. We should consolidate on a single tooling solution so we don't hit multiple confusing errors like those in #263. I was less familiar with Conda and had trouble getting it working on my machine when I started on the project, so when I implemented some CI features I reached for Poetry, since I was already familiar with it.

Our Python package management should solve the following issues:

  1. Provide a lock file for dependency management.
  2. Create a virtual environment so that modules can be installed in a predictable way.
  3. Work locally when running our utils.
  4. Work in CI and in Taskcluster.
  5. Supporting pyproject.toml is a plus.
  6. It should be practical for installing and managing dependencies day to day.

Conda

  1. There is a plugin for this: https://github.com/conda/conda-lock (see the sketch after this list).
  2. This is supported.
  3. I had some trouble getting it set up, but maybe that was just me.
  4. This is what's used in Taskcluster, but not CI.
  5. I gather this exists, but I'm not sure how mature it is.
  6. Conda is widely used in ML-related work, so it is a very practical solution. It's one of the few package managers that supports PyTorch workflows, not that we are using PyTorch here.
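
As a rough sketch of point 1, a conda-lock flow could look like the following. This is a hypothetical illustration, not our current setup; the environment file and env name are assumptions.

```sh
# Hypothetical conda-lock flow; environment.yml and the env name are illustrative.
pip install conda-lock

# Produce a reproducible lock file from the high-level environment spec.
conda-lock lock --file environment.yml --platform linux-64

# Create an environment from the lock file.
conda-lock install --name translations conda-lock.yml
```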

Poetry

  1. Yes, by default.
  2. Yes, with poetry run (see the sketch after this list).
  3. Yes, although it's a little awkward putting things in the Makefile and then having to remember to invoke the poetry command correctly when switching locally between scripts with different dependencies.
  4. CI uses it, but Taskcluster does not.
  5. Yes, by default.
  6. I'm running into some issues with Poetry's dogmatic, standards-based approach. For instance, the mtdata dependency requires a specific requests version, while other dependencies require different versions of requests. There is no way to work around this other than forking mtdata; see "Ability to override/ignore sub-dependencies" python-poetry/poetry#697. On top of that, when I use Poetry on side projects, PyTorch can't be installed, since it uses non-standardized module installation. This concerns me, because I want to align with the practices of the ML community at large.
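
For reference, the day-to-day Poetry flow for points 1–3 is roughly the following; the script path is a made-up example.

```sh
# Resolve dependencies from pyproject.toml and write/consume poetry.lock.
poetry install

# Run a script inside the managed virtual environment.
poetry run python utils/some_script.py  # hypothetical path
```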

pyenv and pip

It would probably be worth filling this one in for consideration.

gregtatum added the question label on Nov 17, 2023
eu9ene commented Nov 17, 2023

We actually use even more ways of managing environments:

For python packages:

  1. pip-compile + pip install for the packages of a specific step, for example pipeline/clean/requirements/clean.txt and taskcluster/ci/clean-corpus/kind.yml (see the sketch after this list).
  2. Installing packages in a Dockerfile, for example taskcluster/docker/train/Dockerfile and taskcluster/docker/base/Dockerfile. I don't know whether those are still being used, though, since we switched some of the steps to generic workers.
  3. Running a separate script for kenlm, or installing it through a toolchain.
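
For point 1, the pip-compile flow looks roughly like this. The .in input file name is assumed from pip-tools convention and may not match what the repo actually has.

```sh
pip install pip-tools

# Compile pinned requirements from the unpinned input spec
# (the .in file name is assumed here, following pip-tools convention).
pip-compile pipeline/clean/requirements/clean.in \
    --output-file pipeline/clean/requirements/clean.txt

pip install -r pipeline/clean/requirements/clean.txt
```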

And a bunch of ways to manage linux packages:

  1. Singularity
  2. Docker
  3. Generic worker
  4. Toolchains
  5. Installing some linux packages with a script in a container image

Some of it is Snakemake legacy.

As for conda vs poetry: I've used both extensively, and poetry is more user-friendly and easier to install and run, both in a local environment and in a script. I did encounter an issue with poetry groups recently, though, when they conflicted with each other. I would assume they should work independently; if not, we might need multiple poetry envs, in the same way we have multiple envs for conda and pip-compile.

Conda usually provides more reproducibility and the ability to install other software, not only Python packages (for example cmake). As you mentioned, it's easier to install some scientific packages, but it's always a pain to activate a conda env in a script; that's why I added all those unpleasant CONDA_ACTIVATE commands right before running things. Locking an env is also possible, but it's a bit less native to conda and a bit less convenient than poetry lock. We currently specify only high-level dependencies in our Conda envs. Also, regular Conda is incredibly slow and can hang for 10 minutes resolving an env, which is why people switch to Mamba. Conda/Mamba was the only choice for Snakemake and worked fine with Singularity, but it was not user-friendly to run locally. Snakemake was also great at reusing those conda envs without reinstalling packages on every run, the way we do on TC.
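
For context, activating a conda env from a non-interactive script ends up looking something like the sketch below; the env name and script path are illustrative.

```sh
# conda activate is a shell function, so it must be sourced into the
# script first; this is the boilerplate the CONDA_ACTIVATE commands wrap.
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate translations  # hypothetical env name

python pipeline/some_step.py  # hypothetical step script
```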

It would be great to settle on one approach for Taskcluster. My suggestion would be to have an optional Docker image for each step: if any packages are required, either add them to the base image or override them in the step image, so that we don't need to install anything when running the step. Once the Linux environment is set up under Docker, pip/poetry with locking should be enough, because you can also preinstall things in the image in rare cases like PyTorch.
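
A minimal sketch of that idea, assuming a shared base image and a per-step image that pre-installs its Python dependencies with Poetry (all image names and files here are hypothetical):

```sh
# Write a hypothetical step Dockerfile that bakes the Python deps into the image.
cat > Dockerfile.step <<'EOF'
FROM translations-base:latest
RUN pip install poetry
COPY pyproject.toml poetry.lock ./
RUN poetry install --no-root
EOF

# Build the step image once; the step itself then installs nothing at run time.
docker build -f Dockerfile.step -t translations-step .
```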

As for Snakemake, we can just move everything related to it into a separate directory and leave maintenance up to contributors. It's already pretty hard to keep that pipeline in sync with the latest changes, and it's not tested anyway.

eu9ene commented Nov 17, 2023

Another perspective on all of this is how easy it is to test the pipeline steps locally. Right now it's not easy at all, because you might need some Linux packages and also need to compile Marian. With pre-built Docker images it would become much easier, assuming we publish them to a public Docker registry. To simplify things, we could even reuse a single Docker image that has everything compiled and installed.
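
For example, trying out a step locally could then be as simple as the following; the registry, image name, and mount layout are made up for illustration.

```sh
# Pull the published all-in-one image and get a shell inside it,
# mounting the local checkout so code changes are picked up.
docker pull ghcr.io/mozilla/translations:latest  # hypothetical image
docker run --rm -it -v "$PWD:/src" -w /src \
    ghcr.io/mozilla/translations:latest bash
```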

gregtatum commented
To re-state, the recommendation would approximately be:

Locally

  • Use poetry
  • Docker images are available for steps with complicated dependencies, but they are optional otherwise.

CI

  • Use poetry
  • Use a docker image

Training runs

  • Use poetry
  • Use a docker image

Clean-up (ignore snakemake)

  • Remove conda
  • Remove Singularity
  • Adapt the pip install workflows to poetry.
  • Adapt the Dockerfile package installation to use the poetry workflow.

If we run into dependency issues like the one above, we can work around them by relying on a fork with a fixed dependency declaration, by getting the issue fixed upstream (in the case of mtdata), or by finding some other workaround.
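
For instance, pointing Poetry at a fork is a one-liner; the fork URL and branch below are made up.

```sh
# Hypothetical: depend on a fork of mtdata whose requests pin is relaxed.
poetry add git+https://github.com/example/mtdata.git#relax-requests
```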

bhearsum commented
I'm fully +1 on the above recommendations. The only thing I'd suggest, as a non-blocking enhancement, is to encourage the use of Docker locally as much as possible. (Even if your development machine is Linux, you will likely eventually run into issues with differences between system tool versions or other things that can cause bustage or confusion.)

eu9ene added the refactoring label on Jan 8, 2024
gregtatum commented
I think we have mostly aligned on this, and if we want more changes we can always re-open an issue or discussion for further python changes.
