[Core] Remove cd SKY_REMOTE_WORKDIR step before submitting jobs #7760

kevinmingtarja · 2025-10-28T06:57:28Z

We caught this in our release test failure (https://buildkite.com/skypilot-1/full-smoke-tests-run/builds/83/steps/canvas?jid=019a2740-3626-40f1-a132-067611370959). TL;DR: This only errors when workdir is set to https://github.com/skypilot-org/skypilot AND the API server is not running the latest master branch AND the master branch has a new dependency added. In other words, it is quite rare.

What happened was we had just merged a change to master that added a new dependency, orjson. But the 0.10.4 release branch was cut a few days back, so it didn't have this new dependency. This shouldn't be a problem normally, but in test_minimal_with_git_workdir, we git clone the master branch of the skypilot repo.

And we caught this test failing with:

  | 2025-10-27 12:59:57 PDT | sky.exceptions.CommandError: Command cd ~/sky_workdir && mkdir -p ~/sky_logs/1-min && touch ~/sky_logs/1-min/run.log && { echo 'import fu... failed with return code 1.
  | 2025-10-27 12:59:57 PDT | Failed to submit job 1.
  | 2025-10-27 12:59:57 PDT | Traceback (most recent call last):
  | 2025-10-27 12:59:57 PDT | File "<string>", line 1, in <module>
  | 2025-10-27 12:59:57 PDT | File "/home/sky/sky_workdir/sky/__init__.py", line 85, in <module>
  | 2025-10-27 12:59:57 PDT | from sky import backends
  | 2025-10-27 12:59:57 PDT | File "/home/sky/sky_workdir/sky/backends/__init__.py", line 4, in <module>
  ... 
  | 2025-10-27 12:59:57 PDT | File "/home/sky/sky_workdir/sky/server/requests/requests.py", line 23, in <module>
  | 2025-10-27 12:59:57 PDT | import orjson
  | 2025-10-27 12:59:57 PDT | ModuleNotFoundError: No module named 'orjson'
  | 2025-10-27 12:59:57 PDT | command terminated with exit code 1

If we look closely, it's trying to use the sky module from the workdir (/home/sky/sky_workdir), instead of the one in the ~/skypilot-runtime venv, which is why it saw the new dependency, and complained because it is not installed yet (which is expected, since the API server is not aware of this new package).

Here's a more minimal repro. Importing sky from $HOME works fine, but the same import from ~/sky_workdir/ fails with the same error:

# From /home/sky - works fine
(skypilot-runtime) (base) sky@git2-7a2eebbf-head:~$ pwd
/home/sky
(skypilot-runtime) (base) sky@git2-7a2eebbf-head:~$ python -c 'import sky; print("success")'
success
# Move into sky_workdir - breaks
(skypilot-runtime) (base) sky@git2-7a2eebbf-head:~$ cd sky_workdir/
(skypilot-runtime) (base) sky@git2-7a2eebbf-head:~/sky_workdir$ pwd
/home/sky/sky_workdir
(skypilot-runtime) (base) sky@git2-7a2eebbf-head:~/sky_workdir$ python -c 'import sky; print("success")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/sky/sky_workdir/sky/__init__.py", line 85, in <module>
    from sky import backends
  File "/home/sky/sky_workdir/sky/backends/__init__.py", line 4, in <module>
...
    from sky.server.requests import requests as requests_lib
  File "/home/sky/sky_workdir/sky/server/requests/requests.py", line 23, in <module>
    import orjson
ModuleNotFoundError: No module named 'orjson'
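The shadowing above can be reproduced without SkyPilot at all. The following is a minimal sketch, not the actual test: demo_pkg is a hypothetical module name, and PYTHONPATH is used as a stand-in for the skypilot-runtime venv's site-packages (an assumption made only so the example is self-contained).

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def import_origin(cwd: Path, installed_dir: Path) -> str:
    """Run `python -c` from `cwd` and report which demo_pkg copy was imported."""
    env = dict(os.environ)
    # PYTHONPATH stands in for the runtime venv's site-packages (assumption).
    env['PYTHONPATH'] = str(installed_dir)
    # Safe-path mode (Python 3.11+) would hide the effect, so make sure it is off.
    env.pop('PYTHONSAFEPATH', None)
    result = subprocess.run(
        [sys.executable, '-c', 'import demo_pkg; print(demo_pkg.ORIGIN)'],
        cwd=cwd, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        installed = root / 'installed'
        home = root / 'home'
        workdir = root / 'workdir'
        for directory in (installed, home, workdir):
            directory.mkdir()
        # Hypothetical "installed" copy vs. a checkout sitting in the workdir.
        (installed / 'demo_pkg.py').write_text('ORIGIN = "installed"\n')
        (workdir / 'demo_pkg.py').write_text('ORIGIN = "workdir"\n')
        return {
            'from_home': import_origin(home, installed),
            'from_workdir': import_origin(workdir, installed),
        }


if __name__ == '__main__':
    print(run_demo())
```

Run from the stand-in home directory, the installed copy is imported; run from the workdir, the local checkout wins because the current directory is searched before PYTHONPATH and site-packages.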

I believe the difference is due to how sys.path works in Python: https://docs.python.org/3.10/library/sys.html#sys.path.

As initialized upon program startup, the first item of this list, path[0], is the directory containing the script that was used to invoke the Python interpreter. If the script directory is not available (e.g. if the interpreter is invoked interactively or if the script is read from standard input), path[0] is the empty string, which directs Python to search modules in the current directory first.

But I'm not 100% sure, because the job script actually lives at ~/.sky/sky_app/sky_job_1. Going by the first sentence of the docs, path[0] should be ~/.sky/sky_app/, not ~/sky_workdir/. The second sentence is what would explain it: if path[0] is the empty string, Python searches the cwd first, and the cwd is indeed ~/sky_workdir/.

Actually, job_submit_cmd also contains job_lib.JobLibCodeGen.queue_job(job_id, job_submit_cmd), and our codegen does invoke the Python interpreter with -c (no script file), in which case path[0] is the empty string and the current directory is searched first. See:

skypilot/sky/skylet/job_lib.py

Lines 1323 to 1326 in 725cc00

    
    def _build(cls, code: List[str]) -> str:
        code = cls._PREFIX + code
        code = ';'.join(code)
        return f'{constants.SKY_PYTHON_CMD} -u -c {shlex.quote(code)}'
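To check the difference between running a script file and running codegen via -c, here is a small sketch. build_dash_c is a hypothetical stand-in for _build: it uses sys.executable instead of constants.SKY_PYTHON_CMD and returns an argv list instead of a shell string, so no shell quoting is needed. Directory names are made up for the demo.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict, List

# Statements that print the first sys.path entry of the child interpreter.
PROBE = ['import sys', 'print(sys.path[0])']


def build_dash_c(code: List[str]) -> List[str]:
    """Join statements with ';' and pass them to `python -u -c` (stand-in for _build)."""
    return [sys.executable, '-u', '-c', ';'.join(code)]


def first_path_entries() -> Dict[str, str]:
    # Drop PYTHONSAFEPATH so Python 3.11+ safe-path mode does not change the result.
    env = {k: v for k, v in os.environ.items() if k != 'PYTHONSAFEPATH'}
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        script_dir = root / 'scripts'   # plays the role of ~/.sky/sky_app/
        workdir = root / 'workdir'      # plays the role of ~/sky_workdir/
        script_dir.mkdir()
        workdir.mkdir()
        script = script_dir / 'probe.py'
        script.write_text('\n'.join(PROBE) + '\n')

        def run(argv: List[str]) -> str:
            # Both commands run with the workdir as the current directory.
            return subprocess.run(argv, cwd=workdir, env=env, capture_output=True,
                                  text=True, check=True).stdout.strip()

        return {
            'script_file': run([sys.executable, str(script)]),
            'dash_c': run(build_dash_c(PROBE)),
        }


if __name__ == '__main__':
    print(first_path_entries())
```

With a script file, path[0] is the script's own directory regardless of the cwd. With -c, path[0] is the empty string, so imports resolve against whatever directory the command was run from.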

Anyway, given the above, running job_submit_cmd from $HOME fixes this issue. I think it should be safe to remove this cd, since nothing in the submission step seems to need it (?), but I need to double check this.

This shouldn't break the workdir functionality in general, as we do still cd to the workdir in make_task_bash_script:

skypilot/sky/skylet/log_lib.py

Lines 301 to 319 in 725cc00

    
    def make_task_bash_script(codegen: str,
                              env_vars: Optional[Dict[str, str]] = None) -> str:
        # set -a is used for exporting all variables functions to the environment
        # so that bash `user_script` can access `conda activate`. Detail: #436.
        # Reference: https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html # pylint: disable=line-too-long
        # DEACTIVATE_SKY_REMOTE_PYTHON_ENV: Deactivate the SkyPilot runtime env, as
        # the ray cluster is started within the runtime env, which may cause the
        # user program to run in that env as well.
        # PYTHONUNBUFFERED is used to disable python output buffering.
        script = [
            textwrap.dedent(f"""\
                #!/bin/bash
                source ~/.bashrc
                set -a
                . $(conda info --base 2> /dev/null)/etc/profile.d/conda.sh > /dev/null 2>&1 || true
                set +a
                {constants.DEACTIVATE_SKY_REMOTE_PYTHON_ENV}
                export PYTHONUNBUFFERED=1
                cd {constants.SKY_REMOTE_WORKDIR}"""),

Tested (run the relevant ones):

Code formatting: install pre-commit (auto-check on commit) or bash format.sh
Any manual or new tests for this PR (please specify below)
- Manually tested by going back to commit 0df5647 (one commit before the orjson dependency was added), restarting API server, and then doing sky launch -y -c git --git-url https://github.com/skypilot-org/skypilot.git --infra kubernetes --cpus 2+ --memory 4+ tests/test_yamls/minimal.yaml
- Verified that with this PR, it's now fixed
All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

kevinmingtarja · 2025-10-28T06:57:57Z

/smoke-test
/smoke-test --kubernetes

~~Seeing some test failures, so this might have broken a few things.~~ (Edit: It turns out it was just some flaky tests).

Invoking the interpreter with -I could be another option. From the Python docs:

Run Python in isolated mode. This also implies -E, -P and -s options.

In isolated mode sys.path contains neither the script’s directory nor the user’s site-packages directory. All PYTHON* environment variables are ignored, too. Further restrictions may be imposed to prevent the user from injecting malicious code.
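A quick way to see what -I changes is to compare sys.path with and without the flag. This is only a sketch of the idea, not a proposed change to the codegen: cwd_is_searched is a hypothetical helper, and it checks only whether the current directory is on sys.path. Note that -I also ignores PYTHON* environment variables and the user's site-packages, so it could affect more than the cwd lookup.

```python
import os
import subprocess
import sys
import tempfile
from typing import List

# Prints True if the child interpreter would search its current directory.
CHECK = "import os, sys; print(any(p in ('', os.getcwd()) for p in sys.path))"


def cwd_is_searched(extra_flags: List[str]) -> bool:
    """Run `python <flags> -c CHECK` from a temp dir and report the result."""
    # Drop PYTHONSAFEPATH so the default (non -I) case is not already restricted.
    env = {k: v for k, v in os.environ.items() if k != 'PYTHONSAFEPATH'}
    with tempfile.TemporaryDirectory() as tmp:
        result = subprocess.run(
            [sys.executable, *extra_flags, '-c', CHECK],
            cwd=tmp, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip() == 'True'


if __name__ == '__main__':
    print('default:', cwd_is_searched([]))
    print('isolated (-I):', cwd_is_searched(['-I']))
```

By default the current directory is searched when using -c; with -I it is not, so a sky checkout in the workdir would no longer shadow the installed package.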

kevinmingtarja · 2025-10-28T07:52:11Z

/quicktest-core
/quicktest-core --kubernetes

One failure, fixed by #7765.

kevinmingtarja · 2025-10-28T17:04:57Z

/smoke-test --kubernetes --resource-heavy

kevinmingtarja · 2025-10-28T17:05:42Z

/smoke-test --resource-heavy

[Core] Remove cd SKY_REMOTE_WORKDIR step before submitting jobs

0ec6bcb

kevinmingtarja requested review from DanielZhangQD, Michaelvll and SeungjinYang October 28, 2025 07:14


2 participants
