Merged
7 changes: 5 additions & 2 deletions nemo_skills/code_execution/sandbox.py
@@ -271,8 +271,11 @@ def _check_ready(self, timeout: float = 5.0) -> bool:
         except httpx.HTTPError:
             return False
 
-    def wait_for_sandbox(self, timeout: int = 5):
-        while not self._check_ready(timeout=timeout):
+    def wait_for_sandbox(self, wait_timeout: int = 240, http_timeout: int = 5):
+        start_time = time.time()
+        while not self._check_ready(timeout=http_timeout):
+            if time.time() - start_time >= wait_timeout:
+                raise RuntimeError(f"Sandbox at {self.host}:{self.port} did not start within {wait_timeout} seconds")
Collaborator: should we do the same for the llm wait as well? @i-vainn

Collaborator: @Kipok do you mean waiting for the llm server?

Collaborator: yes, I think we should have the same timeout logic for the llm server wait as we do for the sandbox server. Currently the llm wait will also block forever if something is wrong with the address.

Collaborator: how long do you think we should poll for? Some models can take over 30 minutes to be ready. We could make the wait time configurable, but there is often a decent amount of variance in load times. Would we rather preemptively kill jobs that are fine, or fail misconfigured jobs with a job reaper (which adds a delay but gives a more accurate signal)?

Collaborator: we can make it configurable. I guess by default we should set a fairly high timeout, like 2 hours. I'm just thinking we should have consistency in the logic: if we support a timeout for the sandbox, we should also support it for the llm. This isn't very high priority, though.

             time.sleep(1)


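For readers who want the patched logic in one piece, here is a minimal, self-contained sketch of the new wait loop. The `SandboxWaiter` class name and the injectable `probe` callable are illustrative stand-ins: the real `Sandbox._check_ready` probes the sandbox over HTTP with `httpx`, which this sketch stubs out so the timeout behavior is easy to exercise.

```python
import time


class SandboxWaiter:
    """Sketch of the patched wait logic. The real class performs an
    httpx health check inside _check_ready; here the probe is injected
    so the loop can be tested without a running sandbox."""

    def __init__(self, host: str, port: int, probe):
        self.host = host
        self.port = port
        self._probe = probe  # stand-in for the httpx readiness check

    def _check_ready(self, timeout: float = 5.0) -> bool:
        return self._probe(timeout)

    def wait_for_sandbox(self, wait_timeout: int = 240, http_timeout: int = 5):
        # Track elapsed time so the loop cannot spin forever on a bad address.
        start_time = time.time()
        while not self._check_ready(timeout=http_timeout):
            if time.time() - start_time >= wait_timeout:
                raise RuntimeError(
                    f"Sandbox at {self.host}:{self.port} did not start "
                    f"within {wait_timeout} seconds"
                )
            time.sleep(1)
```

The key design point of the diff: the overall deadline (`wait_timeout`) is decoupled from the per-request HTTP timeout (`http_timeout`), so a hung or misconfigured sandbox address fails fast instead of blocking startup indefinitely.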
2 changes: 1 addition & 1 deletion nemo_skills/evaluation/evaluator/code.py
@@ -53,7 +53,7 @@ def __init__(self, config: dict, num_parallel_requests: int = 12):
             f"max_output_characters={self.eval_config.max_output_characters}"
         )
         self.sandbox = get_sandbox(self.eval_config.sandbox)
-        self.sandbox.wait_for_sandbox(50)
+        self.sandbox.wait_for_sandbox(wait_timeout=240)
 
     async def eval_single(self, data: dict):
         """Evaluate single code during generation."""
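The review thread above suggests mirroring this timeout logic for the LLM server wait. Nothing like that is in this PR; the following is a hypothetical sketch of what it might look like, with the function name `wait_for_llm_server`, the `check_ready` callable, and the 2-hour default all assumptions drawn from the discussion (a high default, since model load times can exceed 30 minutes).

```python
import time


def wait_for_llm_server(check_ready, host: str, port: int,
                        wait_timeout: int = 7200, poll_interval: float = 1.0):
    """Block until check_ready() returns True, or raise RuntimeError after
    wait_timeout seconds. The 7200 s (2 h) default follows the suggestion
    in the thread; check_ready is any zero-argument readiness probe."""
    start_time = time.time()
    while not check_ready():
        if time.time() - start_time >= wait_timeout:
            raise RuntimeError(
                f"LLM server at {host}:{port} did not start within "
                f"{wait_timeout} seconds"
            )
        time.sleep(poll_interval)
```

This keeps the structure identical to `wait_for_sandbox`, which is the consistency argument made in the thread: either both servers bound their startup wait, or neither does.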