
Issue/8008 resource scheduler polishing #8116

Closed
wants to merge 28 commits

Conversation

Contributor

@sanderr sanderr commented Sep 20, 2024

Description

Polished the resource scheduler and addressed TODOs and review comments from #8001. I attached some comments to the diff with more details and motivation.

Please also have another look at the comments you made in #8001 to see if they have been resolved to your satisfaction.

closes #8008

Self Check:

Strike through any lines that are not applicable (~~line~~) then check the box

  • [x] Attached issue to pull request
  • [x] Changelog entry
  • [x] Type annotations are present
  • [x] Code is clear and sufficiently documented
  • [x] No (preventable) type errors (check using make mypy or make mypy-diff)
  • [x] Sufficient test cases (reproduces the bug/tests the requested feature)
  • [x] Correct, in line with design
  • [x] ~~End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )~~
  • [x] ~~If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branch(es) (see test-fixes for more info)~~

@@ -36,11 +39,63 @@
LOGGER = logging.getLogger(__name__)


# FIXME[#8008] review code structure + functionality + add docstrings
# FIXME[#8008] add import entry point test case
class TaskManager(abc.ABC):
Contributor Author

I added this to have a restricted interface with the tasks, so they don't need to access private attributes. As I see it, this makes it easier to reason about scheduler state, because it's all in one place (here), and we know it won't be misused elsewhere. The two modules / classes are still relatively tightly coupled, but at least in a well-defined manner.
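To illustrate the idea, here is a minimal sketch of such a restricted, task-facing interface. Only the method names get_resource_intent and report_resource_state appear elsewhere in this diff; the signatures and docstrings below are assumptions for illustration, not the interface as merged.

```python
# Hypothetical sketch of a restricted task-facing interface; signatures are
# illustrative and not necessarily the ones in this PR.
import abc
import typing


class TaskManager(abc.ABC):
    """Narrow interface through which tasks interact with the scheduler.

    Tasks only see this interface, so all manipulation of scheduler state
    stays inside the scheduler module itself.
    """

    @abc.abstractmethod
    async def get_resource_intent(
        self, resource: str, *, for_deploy: bool = False
    ) -> typing.Optional["ResourceDetails"]:
        """Return the current intent for a resource, or None if the resource is stale."""

    @abc.abstractmethod
    async def report_resource_state(self, resource: str, *, attribute_hash: str, is_success: bool) -> None:
        """Report the outcome of a finished task back to the scheduler."""
```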

@@ -60,30 +115,38 @@ def __init__(
self._work: work.ScheduledWork = work.ScheduledWork(
requires=self._state.requires.requires_view(),
provides=self._state.requires.provides_view(),
consumer_factory=self.start_for_agent,
new_agent_notify=self._start_for_agent,
Contributor Author

Just a suggestion: I found the name consumer_factory a bit vague / confusing. I can revert this if you don't like it.

Comment on lines +128 to +129
# - lock to serialize updates to the scheduler's intent (version, attributes, ...), e.g. process a new version.
self._intent_lock: asyncio.Lock = asyncio.Lock()
Contributor Author

I had a chat with @arnaudsjs about the name of this lock, and we concluded that "update lock" perhaps wasn't the most revealing name. We thought "desired state lock" might be a good fit. I ended up changing that to "intent lock", both for brevity and because that seems to be the more accepted terminology nowadays.
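As a hedged sketch of the role this lock plays (the class and method names other than _intent_lock are illustrative, not the scheduler's actual API):

```python
import asyncio


class SchedulerSketch:
    """Illustrative only: shows the intended role of the intent lock."""

    def __init__(self) -> None:
        # lock to serialize updates to the scheduler's intent (version, attributes, ...)
        self._intent_lock: asyncio.Lock = asyncio.Lock()

    async def read_version(self) -> None:
        # hypothetical entry point that processes a newly released version
        async with self._intent_lock:
            # only one intent update (e.g. a new version) is processed at a time;
            # anything that needs a consistent view of the intent takes the same lock
            ...
```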


attribute_hash: AttributeHash
@dataclass(frozen=True)
class ResourceDetails:
Contributor Author

This is one of the bigger changes I made, as proposed on Slack. I felt that the inheritance from executor.ResourceDetails was not appropriate here, because executor.ResourceDetails contains the full specification to deploy a resource, while this dataclass here is simply meant to express the current intent of a resource.

Most importantly (imo), it should be version-agnostic: the scheduler knows which version it reflects, and it makes sure that its model state reflects that version. If a new version is read from the database, unchanged resources should not be affected. The resource should become versioned only at the point where we actually commit to a version. Therefore I moved the construction of executor.ResourceDetails to the tasks module.
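A rough sketch of the split being described. The field names beyond attribute_hash are hypothetical, and ExecutorResourceDetails is a made-up stand-in for executor.ResourceDetails; the sketch only shows where the version lives.

```python
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceDetails:
    """Scheduler-side intent of a resource: deliberately version-agnostic."""

    attribute_hash: str
    attributes: Mapping[str, object]


@dataclass(frozen=True)
class ExecutorResourceDetails:
    """Executor-side details: the version is attached only when we commit to a deploy."""

    version: int
    intent: ResourceDetails
```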

Contributor

I think this is an improvement, but I find it odd this doesn't carry a resource_id

Contributor Author

I have to confess I'm always on the fence for these types of things. Do you duplicate the id in the object, adding an invariant that it matches the key in the mapping, or do you keep it contained to just the "data"? I'm not particularly attached to this, so I can add the id.
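For concreteness, the two options being weighed look roughly like this (class and field names are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetailsWithoutId:
    # the id lives only as the key of the scheduler's mapping: {resource_id: DetailsWithoutId}
    attribute_hash: str


@dataclass(frozen=True)
class DetailsWithId:
    # the id is duplicated in the value; invariant: it must match the mapping key
    resource_id: str
    attribute_hash: str
```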

Comment on lines 106 to 110
resource_details: "state.ResourceDetails"
async with scheduler._scheduler_lock:
# fetch resource details atomically under lock
try:
resource_details = scheduler._state.resources[self.resource]
except KeyError:
# Stale resource, can simply be dropped.
# May occur in rare races between new_version and acquiring the lock we're under here. This race is safe
# because of this check, and an intrinsic part of the locking design because it's preferred over wider
# locking for performance reasons.
return
await self.execute_on_resource(scheduler, agent, resource_details)

@abc.abstractmethod
async def execute_on_resource(
self, scheduler: "scheduler.ResourceScheduler", agent: str, resource_details: "state.ResourceDetails"
) -> None:
pass
intent = await task_manager.get_resource_intent(self.resource, for_deploy=True)
if intent is None:
# Stale resource, can simply be dropped.
return
Contributor Author

The common part here became so trivial that I felt it made no sense anymore to group them together under a common subclass, so I dropped OnLatestState.

Comment on lines +311 to +312
:param stale_deploys: Set of resources for which a stale deploy is in progress, i.e. a deploy for an outdated resource
intent.
Contributor Author

@wouterdb this was previously running_deploys, on which you commented that it was unclear. I turned the approach around, I hope it's more clear now.

Comment on lines -183 to -184
# FIXME: SANDER: It seems we immediatly deploy if a new version arrives, we don't wait for an explicit deploy call?
# Is this by design?
Contributor Author

It was, with the intention to have the same behavior as the one we have now. But I just realized that it is actually a deviation since this is currently managed by a setting.

I have to give it more thought, but it may be difficult to change this, damn.

Contributor

We will have to dampen this somehow, but not now.

"""
Build a view on current resources. Might be filtered for a specific environment, used when a new version is released

:return: resource_mapping {id -> resource details}
"""
if version is None:
# TODO: create ticket: BUG: not necessarily released
Contributor Author

I'll delegate these to a new ticket but that will be for Monday.

Contributor

@wouterdb wouterdb left a comment

nice!


Comment on lines +135 to +144
# Set of resources for which a concrete non-stale deploy is in progress, i.e. we've committed for a given intent and
# that intent still reflects the latest resource intent
# Apart from the obvious, this differs from the agent queues' in-progress deploys in the sense that those are simply
# tasks that have been picked up while this set contains only those tasks for which we've already committed. For each
# deploy task, there is a (typically) short window of time where it's considered in progress by the agent queue, but
# it has not yet started on the actual deploy, i.e. it will still see updates to the resource intent.
self._deploying: set[ResourceIdStr] = set()
# Set of resources for which a concrete stale deploy is in progress, i.e. we've committed for a given intent and
# that intent has gone stale since
self._deploying_stale: set[ResourceIdStr] = set()
Contributor

Why do we need to keep track of this?

Contributor Author

  1. We want to track the in-progress tasks, so that when we trigger a deploy, we know what's already running.
  2. We want to make sure we only consider tasks as in-progress for the purposes of 1 if they are running for the latest intent, or if they haven't committed to an intent yet.

This set is to inform the ScheduledWork that while there are tasks running for these resources, they should not be taken into consideration because they are known to be stale.

I have to confess (I meant to add a comment but I forgot) that I'm not completely satisfied with how this ended up. But I also don't see an alternative that I prefer. The knowledge of which tasks are running is in the agent queues mostly, but the semantics of what the tasks mean and what a stale deploy is, live in the scheduler. I tried to stick with those responsibilities as much as I could, but there are some rough edges to it that I couldn't get rid of.
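A minimal sketch of the bookkeeping described above. The class and method names are made up for illustration; only the two sets mirror the diff.

```python
class DeployTracker:
    """Illustrative bookkeeping for stale vs. non-stale in-progress deploys."""

    def __init__(self) -> None:
        # committed deploys that still reflect the latest resource intent
        self._deploying: set[str] = set()
        # committed deploys whose intent has gone stale since
        self._deploying_stale: set[str] = set()

    def commit(self, resource: str) -> None:
        # a task calls this when it commits to the current intent, right before deploying
        self._deploying.add(resource)

    def mark_stale(self, resource: str) -> None:
        # a new version arrived: a committed deploy for this resource is now stale
        if resource in self._deploying:
            self._deploying.discard(resource)
            self._deploying_stale.add(resource)

    def deploying_for_latest_intent(self) -> set[str]:
        # only non-stale deploys count when deciding what still needs to be scheduled
        return set(self._deploying)
```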

Comment on lines 116 to 121
await task_manager.report_resource_state(
resource=self.resource,
attribute_hash=resource_details.attribute_hash,
status=state.ResourceStatus.UP_TO_DATE if is_success else state.ResourceStatus.HAS_UPDATE,
deployment_result=state.DeploymentResult.DEPLOYED if is_success else state.DeploymentResult.FAILED,
)
Contributor

I wonder if this should be in a finally block? I.e. where is the hard exception boundary for e.g. all the remoting?

If this is not called, due to an exception, the state of the scheduler is toast, if I understand correctly?

Contributor Author

You're right. I made the change, but I had to move some code around to keep it coherent (it felt a bit strange to handle some exceptions in do_deploy and some others outside of it).

Contributor

It is normal, different types of failure are handled differently and at some level, no exceptions are expected, but they still have to be stopped to protect the main loop.

Contributor Author

Exactly, but by lifting it up a level, the two cases become one and we're no longer dealing with vagaries purely for the sake of protecting the main loop; we now have the context to know what we're protecting against.
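The resulting shape is roughly the following. This is a sketch under the assumption that the state report takes a simple success flag; the real call passes status and deployment_result as shown in the diff, and the helper names here are hypothetical.

```python
# Sketch of the exception boundary discussed above: the state report runs in a
# finally block so the scheduler's view of the resource stays consistent even
# if the deploy itself (e.g. the remoting) raises.
import logging
from typing import Any, Awaitable, Callable

LOGGER = logging.getLogger(__name__)


async def run_deploy(
    task_manager: Any,
    resource: str,
    resource_details: Any,
    do_deploy: Callable[[str, Any], Awaitable[None]],
) -> None:
    is_success = False
    try:
        await do_deploy(resource, resource_details)
        is_success = True
    except Exception:
        # unexpected exceptions are stopped here to protect the scheduler's main loop
        LOGGER.exception("deploy of %s raised an unexpected exception", resource)
    finally:
        # always report back, otherwise the scheduler's state for this resource
        # would be left inconsistent
        await task_manager.report_resource_state(
            resource=resource,
            attribute_hash=resource_details.attribute_hash,
            is_success=is_success,
        )
```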

tests/agent_server/deploy/scheduler_test_util.py (outdated conversation, resolved)

@@ -978,6 +978,7 @@ src/inmanta/agent/resourcepool.py:0: error: Argument 1 to "append" of "list" has
src/inmanta/agent/resourcepool.py:0: error: Missing type parameters for generic type "PoolMember" [type-arg]
src/inmanta/deploy/scheduler.py:0: error: Argument 1 to "parse_id" of "Id" has incompatible type "object"; expected "ResourceVersionIdStr | ResourceIdStr" [arg-type]
src/inmanta/deploy/scheduler.py:0: error: Argument "attribute_hash" to "ResourceDetails" has incompatible type "str | None"; expected "str" [arg-type]
src/inmanta/deploy/scheduler.py:0: error: "object" has no attribute "__iter__"; maybe "__dir__" or "__str__"? (not iterable) [attr-defined]
Contributor Author

Simply a new instance of a known shortcoming in our typing (untyped resource attributes, in this case "requires").

@sanderr sanderr marked this pull request as ready for review September 23, 2024 12:04
Comment on lines +48 to +50
def __post_init__(self) -> None:
# use object.__setattr__ because this is a frozen dataclass, see dataclasses docs
object.__setattr__(self, "id", Id.parse_id(self.resource_id))
Contributor

so this is how it's done! Oh the horror.
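For reference, the pattern in isolation; the resource-id parsing here is a made-up stand-in for Id.parse_id.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Example:
    resource_id: str
    # derived field, not passed to __init__, filled in by __post_init__
    parsed: tuple[str, ...] = field(init=False)

    def __post_init__(self) -> None:
        # object.__setattr__ bypasses the frozen dataclass's __setattr__; per the
        # dataclasses docs this is how derived fields are set from __post_init__
        object.__setattr__(self, "parsed", tuple(self.resource_id.split("::", maxsplit=1)))


print(Example("std::testing::NullResource[agent1,name=test]").parsed)
```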

@sanderr sanderr added the merge-tool-ready This ticket is ready to be merged in label Sep 24, 2024
@inmantaci
Contributor

Processing this pull request

@inmantaci
Contributor

Merged into branches master in 7356a83

inmantaci pushed a commit that referenced this pull request Sep 24, 2024
@inmantaci inmantaci closed this Sep 24, 2024
@inmantaci inmantaci deleted the issue/8008-resource-scheduler-polishing branch September 24, 2024 08:44
Labels
merge-tool-ready This ticket is ready to be merged in
Development

Successfully merging this pull request may close these issues.

resource scheduler: polish scheduling algorithm
4 participants