Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UX] Support --tail parameter for sky logs #4241

Merged
merged 24 commits into from
Nov 9, 2024

Conversation

zpoint
Copy link
Collaborator

@zpoint zpoint commented Nov 1, 2024

Trying to add this feature: #4222

Config file:

(sky) ➜  skypilot git:(dev/zeping/support_tail_n) cat ~/Desktop/hello-sky/long_log.yaml 
# SkyPilot YAML configuration for a cluster with one node
workdir: .

resources:
  cloud: aws
  instance_type: t3.small

num_nodes: 1

setup: |
  echo "Running setup."

run: |
  i=1
  while [ $i -le 1000 ]; do
    echo "This is test log line number $i"
    i=$((i + 1))
  done

Launch command:

# Below are the output from this branching: sky/skylet/log_lib.py:436    if follow and status in [job_lib.JobStatus.SETTING_UP, job_lib.JobStatus.PENDING, job_lib.JobStatus.RUNNING, ]:
sky launch -c long_log_small ~/Desktop/hello-sky/long_log.yaml

...
(task, pid=1789) This is test log line number 985
(task, pid=1789) This is test log line number 986
(task, pid=1789) This is test log line number 987
(task, pid=1789) This is test log line number 988
(task, pid=1789) This is test log line number 989
(task, pid=1789) This is test log line number 990
(task, pid=1789) This is test log line number 991
(task, pid=1789) This is test log line number 992
(task, pid=1789) This is test log line number 993
(task, pid=1789) This is test log line number 994
(task, pid=1789) This is test log line number 995
(task, pid=1789) This is test log line number 996
(task, pid=1789) This is test log line number 997
(task, pid=1789) This is test log line number 998
(task, pid=1789) This is test log line number 999
(task, pid=1789) This is test log line number 1000
✓ Job finished (status: SUCCEEDED).

📋 Useful Commands
Job ID: 1
├── To cancel the job:          sky cancel long_log_small 1
├── To stream job logs:         sky logs long_log_small 1
└── To view job queue:          sky queue long_log_small

Cluster name: long_log_small
├── To log into the head VM:    ssh long_log_small
├── To submit a job:            sky exec long_log_small yaml_file
├── To stop the cluster:        sky stop long_log_small
└── To teardown the cluster:    sky down long_log_small

log command 1:

(sky) ➜  skypilot git:(dev/zeping/support_tail_n) ✗ sky logs long_log_small 1 --tail 10
Tailing logs of job 1 on cluster 'long_log_small'...
(task, pid=1789) This is test log line number 991
(task, pid=1789) This is test log line number 992
(task, pid=1789) This is test log line number 993
(task, pid=1789) This is test log line number 994
(task, pid=1789) This is test log line number 995
(task, pid=1789) This is test log line number 996
(task, pid=1789) This is test log line number 997
(task, pid=1789) This is test log line number 998
(task, pid=1789) This is test log line number 999
(task, pid=1789) This is test log line number 1000

log command 2:

(sky) ➜  skypilot git:(dev/zeping/support_tail_n) ✗ sky logs long_log_small 1 --tail 20
Tailing logs of job 1 on cluster 'long_log_small'...
(task, pid=1789) This is test log line number 981
(task, pid=1789) This is test log line number 982
(task, pid=1789) This is test log line number 983
(task, pid=1789) This is test log line number 984
(task, pid=1789) This is test log line number 985
(task, pid=1789) This is test log line number 986
(task, pid=1789) This is test log line number 987
(task, pid=1789) This is test log line number 988
(task, pid=1789) This is test log line number 989
(task, pid=1789) This is test log line number 990
(task, pid=1789) This is test log line number 991
(task, pid=1789) This is test log line number 992
(task, pid=1789) This is test log line number 993
(task, pid=1789) This is test log line number 994
(task, pid=1789) This is test log line number 995
(task, pid=1789) This is test log line number 996
(task, pid=1789) This is test log line number 997
(task, pid=1789) This is test log line number 998
(task, pid=1789) This is test log line number 999
(task, pid=1789) This is test log line number 1000

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@zpoint zpoint changed the title support -n parameter for sky logs [UX]support -n parameter for sky logs Nov 1, 2024
@concretevitamin
Copy link
Member

Nice. One UX thing to discuss is do we want -f to be compatible with existing tools.

So far, before this PR we didn't have -f, because it's implicit. Maybe it's ok.

@zpoint
Copy link
Collaborator Author

zpoint commented Nov 1, 2024

Simply adding one line would make -f valid.

@click.option(
    '-f',
    '--follow/--no-follow',
    is_flag=True,
    default=True,
    help=('Follow the logs of a job. '
          'If --no-follow is specified, print the log so far and exit. '
          '[default: --follow]'))

The difference between us and them is that their -f defaults to false, while our -f defaults to true.

sky/cli.py Outdated Show resolved Hide resolved
@zpoint zpoint requested a review from yika-luo November 2, 2024 03:32
Copy link
Collaborator

@yika-luo yika-luo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zpoint!

@zpoint
Copy link
Collaborator Author

zpoint commented Nov 4, 2024

@Michaelvll please take a look when you have time to proceed with the merge.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zpoint for adding this! Left a few comments. We need to fix them before merging for robustness : )

sky/skylet/log_lib.py Outdated Show resolved Hide resolved
sky/skylet/job_lib.py Outdated Show resolved Hide resolved
sky/skylet/log_lib.py Outdated Show resolved Hide resolved
sky/skylet/log_lib.py Outdated Show resolved Hide resolved
sky/skylet/job_lib.py Outdated Show resolved Hide resolved
Comment on lines 892 to 893
f'\nif getattr(constants, "SKYLET_LIB_VERSION", 1) > 1: tail_log_kwargs["tail"] = {tail}',
'\nlog_lib.tail_logs(**tail_log_kwargs)',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these \n's?

Copy link
Collaborator Author

@zpoint zpoint Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the if statement needs to start on a new line; the Python interpreter requires that.

If we don't start with a new line:

Python 3.11.0 (main, Sep 24 2024, 21:11:22) [Clang 19.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a=1;if a == 1: print(a);
  File "<stdin>", line 1
    a=1;if a == 1: print(a);
        ^^
SyntaxError: invalid syntax
>>> 

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see! Just want to make sure these \n won't cause issues. If \n works, maybe we should just change those code generation into multiple line script instead of having them joined by ;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multiline script requires writing a new _build method and a new _PREFIX, which causes code duplication. I would prefer to keep it as it is and replace the '\n' with a macro.

sky/skylet/log_lib.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @zpoint! It seems to me that the current handling does not cover all the cases. See comments below

sky/skylet/job_lib.py Outdated Show resolved Hide resolved
Comment on lines 892 to 893
f'\nif getattr(constants, "SKYLET_LIB_VERSION", 1) > 1: tail_log_kwargs["tail"] = {tail}',
'\nlog_lib.tail_logs(**tail_log_kwargs)',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see! Just want to make sure these \n won't cause issues. If \n works, maybe we should just change those code generation into multiple line script instead of having them joined by ;

sky/skylet/log_lib.py Outdated Show resolved Hide resolved
sky/skylet/log_lib.py Outdated Show resolved Hide resolved
sky/skylet/log_lib.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @zpoint! Left a concern with the memory consumption

sky/skylet/log_lib.py Outdated Show resolved Hide resolved
sky/skylet/log_lib.py Outdated Show resolved Hide resolved
sky/skylet/log_lib.py Outdated Show resolved Hide resolved
Comment on lines 479 to 492
if tail > 0:
current_file_lines = log_file.readlines()
lines = collections.deque(current_file_lines, maxlen=tail)
start_stream = _should_start_streaming(current_file_lines,
lines, start_stream_at)
for line in lines:
if start_stream_at in line:
start_stream = True
if start_stream:
print(line, end='')
# Flush the last n lines
print(end='', flush=True)
# Now, the cursor is at the end of the last lines
# if tail > 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this over complicated a bit. Since we are already processing the all the lines of the file, why can't we just do so by something like the following:

def _print_lines_after_start_stream_at(log_content, start_stream_at, tail) -> bool:
    for line_idx, line in log_content[:PEEK_HEAD_LINES_FOR_START_STREAM]:
        if start_stream_at in line:
            break
    else:
        return False
    tail_start_idx = len(log_content) - tail
    start_stream = tail_start_idx >= line_idx
    log_content = log_content[-tail:]
    for line in log_content:
        if start_stream_at in line:
            start_stream = True
        if start_stream:
            print(line, end='')
    print(end='', flusth=True)
Suggested change
if tail > 0:
current_file_lines = log_file.readlines()
lines = collections.deque(current_file_lines, maxlen=tail)
start_stream = _should_start_streaming(current_file_lines,
lines, start_stream_at)
for line in lines:
if start_stream_at in line:
start_stream = True
if start_stream:
print(line, end='')
# Flush the last n lines
print(end='', flush=True)
# Now, the cursor is at the end of the last lines
# if tail > 0
if tail > 0:
start_stream = _print_lines_after_start_stream_at(log_file.readlines(), start_stream_at, tail)
# Now, the cursor is at the end of the last lines
# if tail > 0

Copy link
Collaborator Author

@zpoint zpoint Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we can't load the whole file into memory, I changed the approach
Can't resolve this comment.

Comment on lines 503 to 508
# If tail > 0, we need to read the last n lines.
# We use double ended queue to rotate the last n lines.
current_file_lines = f.readlines()
lines = collections.deque(current_file_lines, maxlen=tail)
start_stream = _should_start_streaming(
current_file_lines, lines, start_stream_at)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the function above this will become

Suggested change
# If tail > 0, we need to read the last n lines.
# We use double ended queue to rotate the last n lines.
current_file_lines = f.readlines()
lines = collections.deque(current_file_lines, maxlen=tail)
start_stream = _should_start_streaming(
current_file_lines, lines, start_stream_at)
# If tail > 0, we need to read the last n lines.
# We use double ended queue to rotate the last n lines.
_print_lines_after_start_stream_at(log_file.readlines(), start_stream_at, tail)
return

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above reason, don't have the whole file in memory now

@@ -440,15 +476,39 @@ def tail_logs(job_id: Optional[int],
with open(log_path, 'r', newline='', encoding='utf-8') as log_file:
# Using `_follow` instead of `tail -f` to streaming the whole
# log and creating a new process for tail.
if tail > 0:
current_file_lines = log_file.readlines()
lines = collections.deque(current_file_lines, maxlen=tail)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One fruit of thought: I think the original purpose we use deque is to avoid the large memory consumption for a large log file, but here we are reading the whole file into the memory, so the memory consumption will still be high. Can we think of a way to avoid the large memory consumption by log files (we have users encountered the large memory consumption on controller)?

There are two ways:

  1. Can we avoid reading the whole file when tail is provided or not?
  2. If needed, we can also try to switch to tail -f / tail -f -n and find some way to handle the start_stream_at

Copy link
Collaborator Author

@zpoint zpoint Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we use deque and iterable readlines() to save memory.
I guess I should keep my over complicated implementation with cases 1, 2, and 3, with only a lookahead of the log file instead of consuming the whole file.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zpoint for fixing all those comments! It looks good to me now.

sky/skylet/log_lib.py Outdated Show resolved Hide resolved
sky/skylet/log_lib.py Outdated Show resolved Hide resolved
sky/skylet/log_lib.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments before we merge it. :)

sky/skylet/log_lib.py Outdated Show resolved Hide resolved
sky/skylet/log_lib.py Outdated Show resolved Hide resolved
@Michaelvll Michaelvll changed the title [UX]support -n parameter for sky logs [UX] Support --tail parameter for sky logs Nov 9, 2024
@Michaelvll Michaelvll added this pull request to the merge queue Nov 9, 2024
Merged via the queue into skypilot-org:master with commit 42c79e1 Nov 9, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants