Skip to content

Add timeout capability to TaskCollection#1244

Merged
pgrete merged 73 commits into
developfrom
lroberts36/add-taskregion-timer
Sep 10, 2025
Merged

Add timeout capability to TaskCollection#1244
pgrete merged 73 commits into
developfrom
lroberts36/add-taskregion-timer

Conversation

@lroberts36
Copy link
Copy Markdown
Collaborator

@lroberts36 lroberts36 commented Apr 16, 2025

PR Summary

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@lroberts36
Copy link
Copy Markdown
Collaborator Author

@lroberts36 is this ready for review?

Yes, it is ready for review.

Copy link
Copy Markdown
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the default timeout is longer than any code will ever run, I think it's probably worth documenting this mechanism in the docs.

Also should the default timeout time be something shorter? Like a week or a day or something? Short enough that it might actually trigger, but long enough that nobody will hit it accidentally?

Approving now, though, as these are minor nitpicks

Comment thread src/mesh/mesh.cpp Outdated
Comment on lines +83 to +84
task_collection_timeout_in_seconds(pin->GetOrAddInteger(
"parthenon/mesh", "task_collection_timeout_in_seconds", 60 * 60 * 24 * 365)),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 year lol?

Copy link
Copy Markdown
Collaborator Author

@lroberts36 lroberts36 Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha, yeah, I wanted a timeout that no one would hit and I thought a year was funny.

We should probably have a discussion about what people think is a good default choice. I will make an issue.

Comment thread src/mesh/mesh.cpp
Comment thread src/driver/driver.hpp
@Yurlungur
Copy link
Copy Markdown
Collaborator

We should just pick a timeout so this can get merged and we can change it later.

Comment thread src/mesh/mesh.cpp Outdated
Comment thread src/tasks/tasks.hpp Outdated
Copy link
Copy Markdown
Collaborator

@pgrete pgrete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just fixed the param doc and two typos.
After quick clarification of the question, I think we can merge (assuming the tests pass now)

Comment thread src/tasks/thread_pool.hpp
Comment thread src/tasks/thread_pool.hpp Outdated
Comment thread tst/unit/test_tasklist.cpp Outdated
Comment on lines +56 to +59
region[0].AddTask(TaskID(0), []() { return TaskStatus::incomplete; });
const std::size_t timeout_in_seconds = 4;
tc.Execute(timeout_in_seconds);
REQUIRE(true);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this yesterday.
How does this unit test check the timeout capabilities?
I'd have imagined sth like a task waiting for more than the timeout and and then check for the timeout return.
Given subtle issues in the past we had with tasking/threading, I think that it's important to have a test for this new feature.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the task list timeout didn't work, tc.Execute() would run forever since the task always says it is incomplete and the task would be repeatedly called (similar to what would happen for a TryReceive call that never gets data from a neighbor). I changed this to get the status of the TaskCollecion and make sure the task collection failed.

Remember that when we compile with SerialPool, a task that runs for a long time without returning will not trigger the timeout. The only way to have that behavior I think is to spin off another thread (similar to what happens when we compile with ThreadPool or in the Athena++ code you linked to the other day).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, thanks for clarifying.

@lroberts36 lroberts36 enabled auto-merge September 4, 2025 18:28
@pgrete pgrete disabled auto-merge September 10, 2025 15:51
@pgrete pgrete enabled auto-merge (squash) September 10, 2025 15:51
@pgrete pgrete merged commit bf202e9 into develop Sep 10, 2025
37 checks passed
@pgrete pgrete deleted the lroberts36/add-taskregion-timer branch September 10, 2025 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants