[Core] Remove cd SKY_REMOTE_WORKDIR step before submitting jobs #7760
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We caught this in our release test failure (https://buildkite.com/skypilot-1/full-smoke-tests-run/builds/83/steps/canvas?jid=019a2740-3626-40f1-a132-067611370959). TL;DR: This only errors when workdir is set to https://github.com/skypilot-org/skypilot AND the API server is not running the latest master branch AND the master branch has a new dependency added. In other words, it is quite rare.
What happened was we had just merged a change to master that added a new dependency, orjson. But the 0.10.4 release branch was cut a few days back, so it didn't have this new dependency. This shouldn't be a problem normally, but in
test_minimal_with_git_workdir, we git clone the master branch of the skypilot repo.And we caught this test failing with:
If we look closely, it's trying to use the
skymodule from the workdir (/home/sky/sky_workdir), instead of the one in the~/skypilot-runtimevenv, which is why it saw the new dependency, and complained because it is not installed yet (which is expected, since the API server is not aware of this new package).Here's a more minimal repro, so when we import
skyfrom $HOME, it works fine, but when we do the same from~/sky_workdir/, it fails with the same error:I believe the difference is due to how sys.path works in Python: https://docs.python.org/3.10/library/sys.html#sys.path.
But I'm not 100% sure because the path of the script is actually at~/.sky/sky_app/sky_job_1. Based on this, path[0] should be~/.sky/sky_app/, not~/sky_workdir/. What would make sense is the latter statement, which is if it uses the cwd, which is indeed~/sky_workdir/.Actually, in
job_submit_cmd, there is alsojob_lib.JobLibCodeGen.queue_job(job_id, job_submit_cmd), and our codegen stuff do invoke the python interpreter interactively (-c), see:skypilot/sky/skylet/job_lib.py
Lines 1323 to 1326 in 725cc00
But anyways, given the above, changing the
job_submit_cmdto be run from $HOME fixes this issue. I think it should be safe to remove this cd, because it's not needed for anything (?). But I need to double check this.This shouldn't break the workdir functionality in general, as we do still cd to the workdir in
make_task_bash_script:skypilot/sky/skylet/log_lib.py
Lines 301 to 319 in 725cc00
Tested (run the relevant ones):
bash format.shsky launch -y -c git --git-url https://github.com/skypilot-org/skypilot.git --infra kubernetes --cpus 2+ --memory 4+ tests/test_yamls/minimal.yaml/smoke-test(CI) orpytest tests/test_smoke.py(local)/smoke-test -k test_name(CI) orpytest tests/test_smoke.py::test_name(local)/quicktest-core(CI) orpytest tests/smoke_tests/test_backward_compat.py(local)