Issue/8008 resource scheduler polishing #8116
Conversation
@@ -36,11 +39,63 @@
LOGGER = logging.getLogger(__name__)


# FIXME[#8008] review code structure + functionality + add docstrings
# FIXME[#8008] add import entry point test case
class TaskManager(abc.ABC):
I added this to have a restricted interface with the tasks, so they don't need to access private attributes. As I see it, this makes it easier to reason about scheduler state, because it's all in one place (here), and we know it won't be misused elsewhere. The two modules / classes are still relatively tightly coupled, but at least in a well-defined manner.
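To make the idea concrete, here is a minimal sketch of such a restricted interface. The method names mirror the calls visible in the diff (`get_resource_intent`, `report_resource_state`); the exact signatures and the placeholder types are assumptions, not the actual implementation:

```python
import abc
import enum
from dataclasses import dataclass
from typing import Optional


class ResourceStatus(enum.Enum):  # placeholder for the real state enum
    UP_TO_DATE = enum.auto()
    HAS_UPDATE = enum.auto()


class DeploymentResult(enum.Enum):  # placeholder for the real state enum
    DEPLOYED = enum.auto()
    FAILED = enum.auto()


@dataclass(frozen=True)
class ResourceIntent:  # placeholder for the scheduler's intent type
    attribute_hash: str


class TaskManager(abc.ABC):
    """Restricted interface through which tasks interact with scheduler state."""

    @abc.abstractmethod
    async def get_resource_intent(self, resource: str, *, for_deploy: bool) -> Optional[ResourceIntent]:
        """Return the latest intent for a resource, or None if the resource went stale."""

    @abc.abstractmethod
    async def report_resource_state(
        self,
        resource: str,
        *,
        attribute_hash: str,
        status: ResourceStatus,
        deployment_result: DeploymentResult,
    ) -> None:
        """Report a task's outcome back to the scheduler."""
```

Because tasks only ever see this interface, all mutation of scheduler state stays in one module, which is what makes the state easier to reason about.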
@@ -60,30 +115,38 @@ def __init__(
        self._work: work.ScheduledWork = work.ScheduledWork(
            requires=self._state.requires.requires_view(),
            provides=self._state.requires.provides_view(),
-           consumer_factory=self.start_for_agent,
+           new_agent_notify=self._start_for_agent,
Just a suggestion: I found the name `consumer_factory` a bit vague / confusing. I can revert this if you don't like it.
        # - lock to serialize updates to the scheduler's intent (version, attributes, ...), e.g. process a new version.
        self._intent_lock: asyncio.Lock = asyncio.Lock()
I had a chat with @arnaudsjs about the name of this lock, and we concluded that "update lock" perhaps wasn't the most revealing name. We thought "desired state lock" might be a good fit. I ended up changing that to "intent lock", both for brevity and because that seems to be the more accepted terminology nowadays.
    attribute_hash: AttributeHash


@dataclass(frozen=True)
class ResourceDetails:
This is one of the bigger changes I made, as proposed on Slack. I felt that the inheritance from `executor.ResourceDetails` was not appropriate here, because `executor.ResourceDetails` contains the full specification to deploy a resource, while this dataclass here is simply meant to express the current intent of a resource.

Most importantly imo, it should be version agnostic: the scheduler knows which version it reflects, and it makes sure that its model state reflects that version. If a new version is read from the database, unchanged resources should not be affected. The resource should become versioned only at the point where we actually commit to a version. Therefore I moved the construction of `executor.ResourceDetails` to the `tasks` module.
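A minimal sketch of the resulting split (field names beyond `attribute_hash` and the helper are assumptions for illustration): the scheduler keeps version-agnostic intent, and the versioned details are only built in the `tasks` module once a deploy commits to a version:

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ResourceDetails:
    # version-agnostic intent kept by the scheduler: no version field here
    attribute_hash: str
    attributes: Mapping[str, Any]


def to_versioned_details(details: ResourceDetails, version: int) -> dict[str, Any]:
    # hypothetical helper in the tasks module: the resource becomes versioned
    # only here, at the point where we actually commit to a version
    return {
        "version": version,
        "attribute_hash": details.attribute_hash,
        "attributes": dict(details.attributes),
    }
```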
I think this is an improvement, but I find it odd this doesn't carry a `resource_id`.
I have to confess I'm always on the fence for these types of things. Do you duplicate the id in the object, adding an invariant that it matches the key in the mapping, or do you keep it contained to just the "data"? I'm not particularly attached to this, so I can add the id.
src/inmanta/deploy/tasks.py
Outdated
-        resource_details: "state.ResourceDetails"
-        async with scheduler._scheduler_lock:
-            # fetch resource details atomically under lock
-            try:
-                resource_details = scheduler._state.resources[self.resource]
-            except KeyError:
-                # Stale resource, can simply be dropped.
-                # May occur in rare races between new_version and acquiring the lock we're under here. This race is safe
-                # because of this check, and an intrinsic part of the locking design because it's preferred over wider
-                # locking for performance reasons.
-                return
-        await self.execute_on_resource(scheduler, agent, resource_details)
-
-    @abc.abstractmethod
-    async def execute_on_resource(
-        self, scheduler: "scheduler.ResourceScheduler", agent: str, resource_details: "state.ResourceDetails"
-    ) -> None:
-        pass
+        intent = await task_manager.get_resource_intent(self.resource, for_deploy=True)
+        if intent is None:
+            # Stale resource, can simply be dropped.
+            return
The common part here became so trivial that I felt it made no sense anymore to group them together under a common subclass, so I dropped `OnLatestState`.
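For illustration, a sketch of what a concrete task looks like under this structure, reusing the `TaskManager` interface sketched earlier (the concrete class name and method signature are assumptions):

```python
import abc


class Task(abc.ABC):
    """Base class for scheduler tasks; each concrete task fetches its own intent."""

    def __init__(self, resource: str) -> None:
        self.resource = resource

    @abc.abstractmethod
    async def execute(self, task_manager: "TaskManager", agent: str) -> None: ...


class DryRun(Task):  # hypothetical concrete task for illustration
    async def execute(self, task_manager: "TaskManager", agent: str) -> None:
        intent = await task_manager.get_resource_intent(self.resource, for_deploy=False)
        if intent is None:
            # Stale resource, can simply be dropped.
            return
        # act on the fetched intent directly; no shared OnLatestState base needed
        ...
```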
        :param stale_deploys: Set of resources for which a stale deploy is in progress, i.e. a deploy for an outdated resource
            intent.
@wouterdb this was previously `running_deploys`, on which you commented that it was unclear. I turned the approach around; I hope it's clearer now.
        # FIXME: SANDER: It seems we immediately deploy if a new version arrives, we don't wait for an explicit deploy call?
        # Is this by design?
It was, with the intention of keeping the same behavior we have now. But I just realized that it is actually a deviation, since this behavior is currently managed by a setting.
I have to give it more thought, but it may be difficult to change this, damn.
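A hedged sketch of what gating on such a setting could look like (the setting name, the settings store, and all helpers here are assumptions, not the actual code):

```python
import asyncio


class SchedulerSketch:
    """Illustrative only: gate deploys on a setting instead of deploying unconditionally."""

    def __init__(self, settings: dict[str, bool]) -> None:
        self._intent_lock = asyncio.Lock()
        self._settings = settings  # stand-in for the real environment settings store

    async def deploy(self) -> None:
        # schedule deploys for all resources not in their desired state
        ...

    async def new_version(self, version: int, resources: dict) -> None:
        async with self._intent_lock:
            # update the scheduler's intent for changed resources (elided)
            ...
            # assumption: an auto-deploy style setting decides whether a new
            # version triggers deploys immediately or waits for an explicit call
            if self._settings.get("auto_deploy", False):
                await self.deploy()
```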
We will have to dampen this somehow, but not now.
src/inmanta/deploy/scheduler.py
Outdated
""" | ||
Build a view on current resources. Might be filtered for a specific environment, used when a new version is released | ||
|
||
:return: resource_mapping {id -> resource details} | ||
""" | ||
if version is None: | ||
# TODO: create ticket: BUG: not necessarily released |
I'll delegate these to a new ticket but that will be for Monday.
nice!
        # Set of resources for which a concrete non-stale deploy is in progress, i.e. we've committed for a given intent and
        # that intent still reflects the latest resource intent.
        # Apart from the obvious, this differs from the agent queues' in-progress deploys in the sense that those are simply
        # tasks that have been picked up while this set contains only those tasks for which we've already committed. For each
        # deploy task, there is a (typically) short window of time where it's considered in progress by the agent queue, but
        # it has not yet started on the actual deploy, i.e. it will still see updates to the resource intent.
        self._deploying: set[ResourceIdStr] = set()
        # Set of resources for which a concrete stale deploy is in progress, i.e. we've committed for a given intent and
        # that intent has gone stale since.
        self._deploying_stale: set[ResourceIdStr] = set()
Why do we need to keep track of this?
1. We want to track the in-progress tasks, so that when we trigger a deploy, we know what's already running.
2. We want to make sure we only consider tasks as in-progress for the purposes of 1 if they are running for the latest intent, or they haven't committed to an intent yet.

This set is to inform the `ScheduledWork` that while there are tasks running for these resources, they should not be taken into consideration, because they are known to be stale.

I have to confess (I meant to add a comment but I forgot) that I'm not completely satisfied with how this ended up. But I also don't see an alternative that I prefer. The knowledge of which tasks are running lives mostly in the agent queues, but the semantics of what the tasks mean and what a stale deploy is live in the scheduler. I tried to stick with those responsibilities as much as I could, but there are some rough edges to it that I couldn't get rid of. A rough sketch of the intended bookkeeping follows below.
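This sketch is only meant to show the lifecycle of the two sets; the method names and call sites are assumptions:

```python
class StaleTrackingSketch:
    def __init__(self) -> None:
        self._deploying: set[str] = set()
        self._deploying_stale: set[str] = set()

    def commit_deploy(self, resource: str) -> None:
        # called when a deploy task commits to the current intent
        self._deploying.add(resource)

    def mark_stale(self, resource: str) -> None:
        # called when new intent arrives for a resource that is mid-deploy:
        # the running deploy is still in progress, but no longer counts as
        # "deploying the latest intent" for scheduling decisions
        if resource in self._deploying:
            self._deploying.remove(resource)
            self._deploying_stale.add(resource)

    def finish_deploy(self, resource: str) -> None:
        # called when the deploy task completes, stale or not
        self._deploying.discard(resource)
        self._deploying_stale.discard(resource)
```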
src/inmanta/deploy/tasks.py
Outdated
            await task_manager.report_resource_state(
                resource=self.resource,
                attribute_hash=resource_details.attribute_hash,
                status=state.ResourceStatus.UP_TO_DATE if is_success else state.ResourceStatus.HAS_UPDATE,
                deployment_result=state.DeploymentResult.DEPLOYED if is_success else state.DeploymentResult.FAILED,
            )
I wonder if this should be in a finally block? I.e. where is the hard exception boundary for e.g. all the remoting?
If this is not called, due to an exception, the state of the scheduler is toast, if I understand correctly?
You're right. I made the change, but I had to move some code around to keep it coherent (it felt a bit strange to handle some exceptions in `do_deploy` and some others outside of it).
It is normal: different types of failure are handled differently, and at some level no exceptions are expected, but they still have to be stopped to protect the main loop.
Exactly, but by lifting it up a level, the two cases become one, and we're no longer dealing with vagaries purely for the sake of protecting the main loop; we now have the context to know what we're protecting against.
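Under these constraints the resulting shape is roughly as follows (a sketch only; parameters are simplified relative to the reporting call in the diff above, and the driver function is hypothetical):

```python
async def run_deploy(task_manager, resource: str, resource_details, executor) -> None:
    # whatever happens inside the deploy (including remoting exceptions),
    # the scheduler must always learn the outcome, otherwise its state for
    # this resource is toast
    is_success = False
    try:
        is_success = await executor.do_deploy(resource, resource_details)
    finally:
        await task_manager.report_resource_state(
            resource=resource,
            attribute_hash=resource_details.attribute_hash,
            is_success=is_success,
        )
```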
Co-authored-by: Wouter De Borger <[email protected]>
…inmanta/inmanta-core into issue/8008-resource-scheduler-polishing
…inmanta/inmanta-core into issue/8008-resource-scheduler-polishing
@@ -978,6 +978,7 @@ src/inmanta/agent/resourcepool.py:0: error: Argument 1 to "append" of "list" has
 src/inmanta/agent/resourcepool.py:0: error: Missing type parameters for generic type "PoolMember" [type-arg]
 src/inmanta/deploy/scheduler.py:0: error: Argument 1 to "parse_id" of "Id" has incompatible type "object"; expected "ResourceVersionIdStr | ResourceIdStr" [arg-type]
 src/inmanta/deploy/scheduler.py:0: error: Argument "attribute_hash" to "ResourceDetails" has incompatible type "str | None"; expected "str" [arg-type]
+src/inmanta/deploy/scheduler.py:0: error: "object" has no attribute "__iter__"; maybe "__dir__" or "__str__"? (not iterable) [attr-defined]
Simply a new instance of a known shortcoming in our typing (untyped resource attributes, in this case "requires").
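The usual workaround for this kind of untyped attribute access is an explicit cast (illustrative only; the function and its signature are assumptions):

```python
from collections.abc import Mapping, Sequence
from typing import Any, cast


def get_requires(attributes: Mapping[str, Any]) -> Sequence[str]:
    # attributes is effectively untyped (dict[str, object]), so mypy cannot
    # know that "requires" holds an iterable of resource ids; the cast
    # documents that assumption and silences the error
    return cast(Sequence[str], attributes["requires"])
```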
    def __post_init__(self) -> None:
        # use object.__setattr__ because this is a frozen dataclass, see dataclasses docs
        object.__setattr__(self, "id", Id.parse_id(self.resource_id))
so this is how it's done! Oh the horror.
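For context, a tiny self-contained demo of the pattern (types and the parsing logic simplified; not the actual `Id.parse_id`):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Details:
    resource_id: str
    id: tuple[str, ...] = field(init=False)

    def __post_init__(self) -> None:
        # object.__setattr__ is the documented escape hatch for initializing
        # derived fields on a frozen dataclass
        object.__setattr__(self, "id", tuple(self.resource_id.split(",")))


d = Details("agent1,name=x")
print(d.id)  # ('agent1', 'name=x')
```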
Merged into master in 7356a83
Description

Polished the resource scheduler and addressed todos and review comments from #8001. I attached some comments to the diff with more details / motivation.

Please also have another look at the comments you made in #8001 to see if they have been resolved to your satisfaction.

closes #8008

Self Check:

Strike through any lines that are not applicable (`~~line~~`) then check the box

- [x] Attached issue to pull request
- [x] Changelog entry
- [x] Type annotations are present
- [x] Code is clear and sufficiently documented
- [x] No (preventable) type errors (check using make mypy or make mypy-diff)
- [x] Sufficient test cases (reproduces the bug/tests the requested feature)
- [x] Correct, in line with design
- [x] ~~End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )~~
- [x] ~~If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branche(s) (see [test-fixes](https://internal.inmanta.com/development/core/tasks/build-master.html#test-fixes) for more info)~~