@kevinmingtarja kevinmingtarja commented Oct 28, 2025

We caught this in our release test failure (https://buildkite.com/skypilot-1/full-smoke-tests-run/builds/83/steps/canvas?jid=019a2740-3626-40f1-a132-067611370959). TL;DR: This only errors when workdir is set to https://github.com/skypilot-org/skypilot AND the API server is not running the latest master branch AND the master branch has a new dependency added. In other words, it is quite rare.

What happened was we had just merged a change to master that added a new dependency, orjson. But the 0.10.4 release branch was cut a few days back, so it didn't have this new dependency. This shouldn't be a problem normally, but in test_minimal_with_git_workdir, we git clone the master branch of the skypilot repo.

And we caught this test failing with:

  | 2025-10-27 12:59:57 PDT | sky.exceptions.CommandError: Command cd ~/sky_workdir && mkdir -p ~/sky_logs/1-min && touch ~/sky_logs/1-min/run.log && { echo 'import fu... failed with return code 1.
  | 2025-10-27 12:59:57 PDT | Failed to submit job 1.
  | 2025-10-27 12:59:57 PDT | Traceback (most recent call last):
  | 2025-10-27 12:59:57 PDT | File "<string>", line 1, in <module>
  | 2025-10-27 12:59:57 PDT | File "/home/sky/sky_workdir/sky/__init__.py", line 85, in <module>
  | 2025-10-27 12:59:57 PDT | from sky import backends
  | 2025-10-27 12:59:57 PDT | File "/home/sky/sky_workdir/sky/backends/__init__.py", line 4, in <module>
  ... 
  | 2025-10-27 12:59:57 PDT | File "/home/sky/sky_workdir/sky/server/requests/requests.py", line 23, in <module>
  | 2025-10-27 12:59:57 PDT | import orjson
  | 2025-10-27 12:59:57 PDT | ModuleNotFoundError: No module named 'orjson'
  | 2025-10-27 12:59:57 PDT | command terminated with exit code 1

Looking closely, it imports the sky module from the workdir (/home/sky/sky_workdir) instead of the one in the ~/skypilot-runtime venv. That is why it hit the new dependency and complained that it is not installed, which is expected, since the API server is not aware of this new package.

Here's a more minimal repro: importing sky from $HOME works fine, but doing the same from ~/sky_workdir/ fails with the same error:

# From /home/sky - works fine
(skypilot-runtime) (base) sky@git2-7a2eebbf-head:~$ pwd
/home/sky
(skypilot-runtime) (base) sky@git2-7a2eebbf-head:~$ python -c 'import sky; print("success")'
success
# Move into sky_workdir - breaks
(skypilot-runtime) (base) sky@git2-7a2eebbf-head:~$ cd sky_workdir/
(skypilot-runtime) (base) sky@git2-7a2eebbf-head:~/sky_workdir$ pwd
/home/sky/sky_workdir
(skypilot-runtime) (base) sky@git2-7a2eebbf-head:~/sky_workdir$ python -c 'import sky; print("success")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/sky/sky_workdir/sky/__init__.py", line 85, in <module>
    from sky import backends
  File "/home/sky/sky_workdir/sky/backends/__init__.py", line 4, in <module>
...
    from sky.server.requests import requests as requests_lib
  File "/home/sky/sky_workdir/sky/server/requests/requests.py", line 23, in <module>
    import orjson
ModuleNotFoundError: No module named 'orjson'
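
To confirm which copy of a package the interpreter would pick up without actually importing it (and hitting the missing dependency), something like the sketch below could be used. The helper name module_origin is made up for illustration; importlib.util.find_spec is standard library. Note that the result depends on how the interpreter was started, so it only mirrors the failing case when run the same way (via -c, from the same directory).

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level module would be loaded from, or None.

    For a top-level name, find_spec only locates the module; it does not
    execute the package's __init__.py.
    """
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # In the failing case one would expect a path under ~/sky_workdir here,
    # rather than one under the skypilot-runtime venv.
    print(module_origin("sky"))
```

For example, python -c 'import importlib.util; print(importlib.util.find_spec("sky").origin)' run from ~/sky_workdir should show which sky/__init__.py wins.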

I believe the difference is due to how sys.path works in Python: https://docs.python.org/3.10/library/sys.html#sys.path.

As initialized upon program startup, the first item of this list, path[0], is the directory containing the script that was used to invoke the Python interpreter. If the script directory is not available (e.g. if the interpreter is invoked interactively or if the script is read from standard input), path[0] is the empty string, which directs Python to search modules in the current directory first.

But I'm not 100% sure, because the script itself lives at ~/.sky/sky_app/sky_job_1. Based on that, path[0] should be ~/.sky/sky_app/, not ~/sky_workdir/. What would explain it is the second case in the docs, where path[0] is the empty string and Python searches the cwd first, which is indeed ~/sky_workdir/.

Actually, in job_submit_cmd, there is also job_lib.JobLibCodeGen.queue_job(job_id, job_submit_cmd), and our codegen does invoke the Python interpreter with -c (no script file, so the current directory is put at the start of sys.path), see:

skypilot/sky/skylet/job_lib.py

Lines 1323 to 1326 in 725cc00

def _build(cls, code: List[str]) -> str:
    code = cls._PREFIX + code
    code = ';'.join(code)
    return f'{constants.SKY_PYTHON_CMD} -u -c {shlex.quote(code)}'
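
A small standalone check of the sys.path[0] behavior described above, as a sketch and not SkyPilot code: it starts a child interpreter once with -c and once with a script file, both from the same working directory, and records what sys.path[0] is in each case. The directory names (app, workdir) and the helper first_path_entry are made up for illustration.

```python
import os
import subprocess
import sys
import tempfile

PROBE = "import sys; print(sys.path[0])"


def first_path_entry(args, cwd):
    """Run a child interpreter from `cwd` and return its sys.path[0]."""
    # Drop PYTHONSAFEPATH (3.11+) so the default path[0] behavior is observed.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONSAFEPATH"}
    result = subprocess.run([sys.executable, *args], cwd=cwd, env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    script_dir = os.path.join(tmp, "app")      # stands in for ~/.sky/sky_app
    work_dir = os.path.join(tmp, "workdir")    # stands in for ~/sky_workdir
    os.makedirs(script_dir)
    os.makedirs(work_dir)
    script = os.path.join(script_dir, "probe.py")
    with open(script, "w") as f:
        f.write(PROBE + "\n")

    # With -c there is no script directory, so path[0] is the empty string,
    # which means "search the current directory first".
    via_c = first_path_entry(["-c", PROBE], cwd=work_dir)
    # With a script file, path[0] is the script's directory, not the cwd.
    via_script = first_path_entry([script], cwd=work_dir)

    print(f"-c:     {via_c!r}")
    print(f"script: {via_script!r}")
```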


But anyway, given the above, running job_submit_cmd from $HOME fixes this issue. I think it should be safe to remove this cd, because it doesn't seem to be needed for anything, but I need to double-check this.
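
To illustrate why changing the directory matters, here is a self-contained sketch that mimics the situation with a throwaway package. Everything in it is an assumption for demonstration: demo_pkg is a made-up package, the "installed" directory stands in for the runtime venv's site-packages (wired up through PYTHONPATH, which is searched after path[0]), and "workdir" stands in for a git-cloned ~/sky_workdir holding a newer copy.

```python
import os
import subprocess
import sys
import tempfile


def write_pkg(root, version):
    """Create root/demo_pkg/__init__.py exposing a VERSION string."""
    pkg = os.path.join(root, "demo_pkg")
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write(f"VERSION = {version!r}\n")


def imported_version(cwd, installed_dir):
    """Import demo_pkg via `python -c` from `cwd` and return its VERSION."""
    env = {k: v for k, v in os.environ.items() if k != "PYTHONSAFEPATH"}
    env["PYTHONPATH"] = installed_dir  # stand-in for the venv's site-packages
    result = subprocess.run(
        [sys.executable, "-c", "import demo_pkg; print(demo_pkg.VERSION)"],
        cwd=cwd, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    installed = os.path.join(tmp, "installed")
    workdir = os.path.join(tmp, "workdir")
    home = os.path.join(tmp, "home")
    os.makedirs(home)
    write_pkg(installed, "release")  # what the runtime actually has installed
    write_pkg(workdir, "master")     # what the cloned workdir contains

    # From the workdir, the local copy shadows the installed one.
    from_workdir = imported_version(workdir, installed)
    # From any other directory, the installed copy is used.
    from_home = imported_version(home, installed)

    print(from_workdir, from_home)
```

In the real failure the shadowing copy additionally imports orjson, which the runtime env does not have, hence the ModuleNotFoundError instead of a silent version mismatch.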

This shouldn't break the workdir functionality in general, as we do still cd to the workdir in make_task_bash_script:

def make_task_bash_script(codegen: str,
                          env_vars: Optional[Dict[str, str]] = None) -> str:
    # set -a is used for exporting all variables functions to the environment
    # so that bash `user_script` can access `conda activate`. Detail: #436.
    # Reference: https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html # pylint: disable=line-too-long
    # DEACTIVATE_SKY_REMOTE_PYTHON_ENV: Deactivate the SkyPilot runtime env, as
    # the ray cluster is started within the runtime env, which may cause the
    # user program to run in that env as well.
    # PYTHONUNBUFFERED is used to disable python output buffering.
    script = [
        textwrap.dedent(f"""\
            #!/bin/bash
            source ~/.bashrc
            set -a
            . $(conda info --base 2> /dev/null)/etc/profile.d/conda.sh > /dev/null 2>&1 || true
            set +a
            {constants.DEACTIVATE_SKY_REMOTE_PYTHON_ENV}
            export PYTHONUNBUFFERED=1
            cd {constants.SKY_REMOTE_WORKDIR}"""),

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Manually tested by going back to commit 0df5647 (one commit before the orjson dependency was added), restarting API server, and then doing sky launch -y -c git --git-url https://github.com/skypilot-org/skypilot.git --infra kubernetes --cpus 2+ --memory 4+ tests/test_yamls/minimal.yaml
    • Verified that with this PR, it's now fixed
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

kevinmingtarja commented Oct 28, 2025

/smoke-test
/smoke-test --kubernetes

Seeing some test failures, so this might have broken a few things. (Edit: it turned out to be just some flaky tests.)

Invoking the interpreter with -I could be another option.

Run Python in isolated mode. This also implies -E, -P and -s options.

In isolated mode sys.path contains neither the script’s directory nor the user’s site-packages directory. All PYTHON* environment variables are ignored, too. Further restrictions may be imposed to prevent the user from injecting malicious code.
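
A quick way to see the effect of -I on a -c invocation, as a sketch with a made-up package name (cwd_only_pkg) that exists only in the current directory: without -I the import succeeds because the cwd is on sys.path, and with -I it should fail because the cwd is no longer searched. This says nothing about whether -I is otherwise safe for the codegen commands, since it also ignores PYTHON* environment variables and the user site-packages directory.

```python
import os
import subprocess
import sys
import tempfile


def can_import(flags, cwd):
    """Return True if `python <flags> -c 'import cwd_only_pkg'` succeeds."""
    env = {k: v for k, v in os.environ.items() if k != "PYTHONSAFEPATH"}
    result = subprocess.run(
        [sys.executable, *flags, "-c", "import cwd_only_pkg"],
        cwd=cwd, env=env, capture_output=True, text=True)
    return result.returncode == 0


with tempfile.TemporaryDirectory() as workdir:
    # The package exists only in the working directory, nowhere else.
    pkg = os.path.join(workdir, "cwd_only_pkg")
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write("")

    default_ok = can_import([], cwd=workdir)       # cwd is searched
    isolated_ok = can_import(["-I"], cwd=workdir)  # cwd is not searched

    print(f"default: {default_ok}, isolated: {isolated_ok}")
```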

kevinmingtarja commented Oct 28, 2025

/quicktest-core
/quicktest-core --kubernetes

One failure, fixed by #7765.

@kevinmingtarja

/smoke-test --kubernetes --resource-heavy

@kevinmingtarja

/smoke-test --resource-heavy
