
Issue/8008 resource scheduler polishing #8116

Closed
wants to merge 28 commits

Conversation

Contributor

@sanderr sanderr commented Sep 20, 2024

Description

Polished the resource scheduler and addressed TODOs and review comments from #8001. I attached some comments to the diff with more details and motivation.

Please also have another look at the comments you made in #8001 to see if they have been resolved to your satisfaction.

closes #8008

Self Check:

Strike through any lines that are not applicable (~~line~~) then check the box

  • [x] Attached issue to pull request
  • [x] Changelog entry
  • [x] Type annotations are present
  • [x] Code is clear and sufficiently documented
  • [x] No (preventable) type errors (check using make mypy or make mypy-diff)
  • [x] Sufficient test cases (reproduces the bug/tests the requested feature)
  • [x] Correct, in line with design
  • [x] ~~End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )~~
  • [x] ~~If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branch(es) (see test-fixes for more info)~~

@@ -36,11 +39,63 @@
LOGGER = logging.getLogger(__name__)


# FIXME[#8008] review code structure + functionality + add docstrings
# FIXME[#8008] add import entry point test case
class TaskManager(abc.ABC):
Contributor Author

I added this to have a restricted interface with the tasks, so they don't need to access private attributes. As I see it, this makes it easier to reason about scheduler state, because it's all in one place (here), and we know it won't be misused elsewhere. The two modules / classes are still relatively tightly coupled, but at least in a well-defined manner.
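To illustrate the idea, here is a minimal sketch of such a restricted, task-facing interface. Only the method names get_resource_intent and report_resource_state appear elsewhere in this diff; the signatures and docstrings below are assumptions for illustration, not the interface as merged.

```python
# Hypothetical sketch of a restricted task-facing interface; signatures are
# illustrative and not necessarily the ones in this PR.
import abc
import typing


class TaskManager(abc.ABC):
    """Narrow interface through which tasks interact with the scheduler.

    Tasks only see this interface, so all manipulation of scheduler state
    stays inside the scheduler module itself.
    """

    @abc.abstractmethod
    async def get_resource_intent(
        self, resource: str, *, for_deploy: bool = False
    ) -> typing.Optional["ResourceDetails"]:
        """Return the current intent for a resource, or None if the resource is stale."""

    @abc.abstractmethod
    async def report_resource_state(self, resource: str, *, attribute_hash: str, is_success: bool) -> None:
        """Report the outcome of a finished task back to the scheduler."""
```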

@@ -60,30 +115,38 @@ def __init__(
self._work: work.ScheduledWork = work.ScheduledWork(
requires=self._state.requires.requires_view(),
provides=self._state.requires.provides_view(),
consumer_factory=self.start_for_agent,
new_agent_notify=self._start_for_agent,
Contributor Author

Just a suggestion: I found the name consumer_factory a bit vague / confusing. I can revert this if you don't like it.

Comment on lines +128 to +129
# - lock to serialize updates to the scheduler's intent (version, attributes, ...), e.g. process a new version.
self._intent_lock: asyncio.Lock = asyncio.Lock()
Contributor Author

I had a chat with @arnaudsjs about the name of this lock, and we concluded that "update lock" perhaps wasn't the most revealing name. We thought "desired state lock" might be a good fit. I ended up changing that to "intent lock", both for brevity and because that seems to be the more accepted terminology nowadays.
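As a hedged sketch of the role this lock plays (the class and method names other than _intent_lock are illustrative, not the scheduler's actual API):

```python
import asyncio


class SchedulerSketch:
    """Illustrative only: shows the intended role of the intent lock."""

    def __init__(self) -> None:
        # lock to serialize updates to the scheduler's intent (version, attributes, ...)
        self._intent_lock: asyncio.Lock = asyncio.Lock()

    async def read_version(self) -> None:
        # hypothetical entry point that processes a newly released version
        async with self._intent_lock:
            # only one intent update (e.g. a new version) is processed at a time;
            # anything that needs a consistent view of the intent takes the same lock
            ...
```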


attribute_hash: AttributeHash
@dataclass(frozen=True)
class ResourceDetails:
Contributor Author

This is one of the bigger changes I made, as proposed on Slack. I felt that the inheritance from executor.ResourceDetails was not appropriate here, because executor.ResourceDetails contains the full specification to deploy a resource, while this dataclass here is simply meant to express the current intent of a resource.

Most importantly (imo), it should be version-agnostic: the scheduler knows which version it reflects, and it makes sure that its model state reflects that version. If a new version is read from the database, unchanged resources should not be affected. The resource should become versioned only at the point where we actually commit to a version. Therefore I moved the construction of executor.ResourceDetails to the tasks module.
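A rough sketch of the split being described. The field names beyond attribute_hash are hypothetical, and ExecutorResourceDetails is a made-up stand-in for executor.ResourceDetails; the sketch only shows where the version lives.

```python
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceDetails:
    """Scheduler-side intent of a resource: deliberately version-agnostic."""

    attribute_hash: str
    attributes: Mapping[str, object]


@dataclass(frozen=True)
class ExecutorResourceDetails:
    """Executor-side details: the version is attached only when we commit to a deploy."""

    version: int
    intent: ResourceDetails
```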

Contributor

I think this is an improvement, but I find it odd this doesn't carry a resource_id

Contributor Author

I have to confess I'm always on the fence for these types of things. Do you duplicate the id in the object, adding an invariant that it matches the key in the mapping, or do you keep it contained to just the "data"? I'm not particularly attached to this, so I can add the id.
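For concreteness, the two options being weighed look roughly like this (class and field names are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetailsWithoutId:
    # the id lives only as the key of the scheduler's mapping: {resource_id: DetailsWithoutId}
    attribute_hash: str


@dataclass(frozen=True)
class DetailsWithId:
    # the id is duplicated in the value; invariant: it must match the mapping key
    resource_id: str
    attribute_hash: str
```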

Comment on lines 106 to 110
resource_details: "state.ResourceDetails"
async with scheduler._scheduler_lock:
# fetch resource details atomically under lock
try:
resource_details = scheduler._state.resources[self.resource]
except KeyError:
# Stale resource, can simply be dropped.
# May occur in rare races between new_version and acquiring the lock we're under here. This race is safe
# because of this check, and an intrinsic part of the locking design because it's preferred over wider
# locking for performance reasons.
return
await self.execute_on_resource(scheduler, agent, resource_details)

@abc.abstractmethod
async def execute_on_resource(
self, scheduler: "scheduler.ResourceScheduler", agent: str, resource_details: "state.ResourceDetails"
) -> None:
pass
intent = await task_manager.get_resource_intent(self.resource, for_deploy=True)
if intent is None:
# Stale resource, can simply be dropped.
return
Contributor Author

The common part here became so trivial that I felt it made no sense anymore to group them together under a common subclass, so I dropped OnLatestState.

Comment on lines +311 to +312
:param stale_deploys: Set of resources for which a stale deploy is in progress, i.e. a deploy for an outdated resource
intent.
Contributor Author

@wouterdb this was previously running_deploys, on which you commented that it was unclear. I turned the approach around, I hope it's more clear now.

Comment on lines -183 to -184
# FIXME: SANDER: It seems we immediatly deploy if a new version arrives, we don't wait for an explicit deploy call?
# Is this by design?
Contributor Author

It was, with the intention to have the same behavior as the one we have now. But I just realized that it is actually a deviation since this is currently managed by a setting.

I have to give it more thought, but it may be difficult to change this, damn.

Contributor

We will have to dampen this somehow, but not now.

"""
Build a view on current resources. Might be filtered for a specific environment, used when a new version is released

:return: resource_mapping {id -> resource details}
"""
if version is None:
# TODO: create ticket: BUG: not necessarily released
Contributor Author

I'll delegate these to a new ticket but that will be for Monday.

Contributor

@wouterdb wouterdb left a comment

nice!


Comment on lines +135 to +144
# Set of resources for which a concrete non-stale deploy is in progress, i.e. we've committed for a given intent and
# that intent still reflects the latest resource intent
# Apart from the obvious, this differs from the agent queues' in-progress deploys in the sense that those are simply
# tasks that have been picked up while this set contains only those tasks for which we've already committed. For each
# deploy task, there is a (typically) short window of time where it's considered in progress by the agent queue, but
# it has not yet started on the actual deploy, i.e. it will still see updates to the resource intent.
self._deploying: set[ResourceIdStr] = set()
# Set of resources for which a concrete stale deploy is in progress, i.e. we've committed for a given intent and
# that intent has gone stale since
self._deploying_stale: set[ResourceIdStr] = set()
Contributor

Why do we need to keep track of this?

Contributor Author

  1. We want to track the in-progress tasks, so that when we trigger a deploy, we know what's already running.
  2. We want to make sure we only consider tasks as in-progress for the purposes of 1 if they are running for the latest intent, or if they haven't committed to an intent yet.

This set is to inform the ScheduledWork that while there are tasks running for these resources, they should not be taken into consideration because they are known to be stale.

I have to confess (I meant to add a comment but I forgot) that I'm not completely satisfied with how this ended up. But I also don't see an alternative that I prefer. The knowledge of which tasks are running is in the agent queues mostly, but the semantics of what the tasks mean and what a stale deploy is, live in the scheduler. I tried to stick with those responsibilities as much as I could, but there are some rough edges to it that I couldn't get rid of.
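A minimal sketch of the bookkeeping described above. The class and method names are made up for illustration; only the two sets mirror the diff.

```python
class DeployTracker:
    """Illustrative bookkeeping for stale vs. non-stale in-progress deploys."""

    def __init__(self) -> None:
        # committed deploys that still reflect the latest resource intent
        self._deploying: set[str] = set()
        # committed deploys whose intent has gone stale since
        self._deploying_stale: set[str] = set()

    def commit(self, resource: str) -> None:
        # a task calls this when it commits to the current intent, right before deploying
        self._deploying.add(resource)

    def mark_stale(self, resource: str) -> None:
        # a new version arrived: a committed deploy for this resource is now stale
        if resource in self._deploying:
            self._deploying.discard(resource)
            self._deploying_stale.add(resource)

    def deploying_for_latest_intent(self) -> set[str]:
        # only non-stale deploys count when deciding what still needs to be scheduled
        return set(self._deploying)
```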

Comment on lines 116 to 121
await task_manager.report_resource_state(
resource=self.resource,
attribute_hash=resource_details.attribute_hash,
status=state.ResourceStatus.UP_TO_DATE if is_success else state.ResourceStatus.HAS_UPDATE,
deployment_result=state.DeploymentResult.DEPLOYED if is_success else state.DeploymentResult.FAILED,
)
Contributor

I wonder if this should be in a finally block? I.e. where is the hard exception boundary for e.g. all the remoting?

If this is not called, due to an exception, the state of the scheduler is toast, if I understand correctly?

Contributor Author

You're right. I made the change, but I had to move some code around to keep it coherent (it felt a bit strange to handle some exceptions in do_deploy and some others outside of it).

Contributor

It is normal, different types of failure are handled differently and at some level, no exceptions are expected, but they still have to be stopped to protect the main loop.

Contributor Author

Exactly, but by lifting it up a level, the two cases become one and we're no longer dealing with vagaries purely for the sake of protecting the main loop; we now have the context to know what we're protecting against.
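The resulting shape is roughly the following. This is a sketch under the assumption that the state report takes a simple success flag; the real call passes status and deployment_result as shown in the diff, and the helper names here are hypothetical.

```python
# Sketch of the exception boundary discussed above: the state report runs in a
# finally block so the scheduler's view of the resource stays consistent even
# if the deploy itself (e.g. the remoting) raises.
import logging
from typing import Any, Awaitable, Callable

LOGGER = logging.getLogger(__name__)


async def run_deploy(
    task_manager: Any,
    resource: str,
    resource_details: Any,
    do_deploy: Callable[[str, Any], Awaitable[None]],
) -> None:
    is_success = False
    try:
        await do_deploy(resource, resource_details)
        is_success = True
    except Exception:
        # unexpected exceptions are stopped here to protect the scheduler's main loop
        LOGGER.exception("deploy of %s raised an unexpected exception", resource)
    finally:
        # always report back, otherwise the scheduler's state for this resource
        # would be left inconsistent
        await task_manager.report_resource_state(
            resource=resource,
            attribute_hash=resource_details.attribute_hash,
            is_success=is_success,
        )
```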

tests/agent_server/deploy/scheduler_test_util.py (outdated conversation, resolved)

@@ -978,6 +978,7 @@ src/inmanta/agent/resourcepool.py:0: error: Argument 1 to "append" of "list" has
src/inmanta/agent/resourcepool.py:0: error: Missing type parameters for generic type "PoolMember" [type-arg]
src/inmanta/deploy/scheduler.py:0: error: Argument 1 to "parse_id" of "Id" has incompatible type "object"; expected "ResourceVersionIdStr | ResourceIdStr" [arg-type]
src/inmanta/deploy/scheduler.py:0: error: Argument "attribute_hash" to "ResourceDetails" has incompatible type "str | None"; expected "str" [arg-type]
src/inmanta/deploy/scheduler.py:0: error: "object" has no attribute "__iter__"; maybe "__dir__" or "__str__"? (not iterable) [attr-defined]
Contributor Author

Simply a new instance of a known shortcoming in our typing (untyped resource attributes, in this case "requires").

@sanderr sanderr marked this pull request as ready for review September 23, 2024 12:04
Comment on lines +48 to +50
def __post_init__(self) -> None:
# use object.__setattr__ because this is a frozen dataclass, see dataclasses docs
object.__setattr__(self, "id", Id.parse_id(self.resource_id))
Contributor

so this is how it's done! Oh the horror.
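For reference, the pattern in isolation; the resource-id parsing here is a made-up stand-in for Id.parse_id.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Example:
    resource_id: str
    # derived field, not passed to __init__, filled in by __post_init__
    parsed: tuple[str, ...] = field(init=False)

    def __post_init__(self) -> None:
        # object.__setattr__ bypasses the frozen dataclass's __setattr__; per the
        # dataclasses docs this is how derived fields are set from __post_init__
        object.__setattr__(self, "parsed", tuple(self.resource_id.split("::", maxsplit=1)))


print(Example("std::testing::NullResource[agent1,name=test]").parsed)
```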

@sanderr sanderr added the merge-tool-ready This ticket is ready to be merged in label Sep 24, 2024
@inmantaci
Contributor

Processing this pull request

@inmantaci
Contributor

Merged into branches master in 7356a83

inmantaci pushed a commit that referenced this pull request Sep 24, 2024
@inmantaci inmantaci closed this Sep 24, 2024
@inmantaci inmantaci deleted the issue/8008-resource-scheduler-polishing branch September 24, 2024 08:44
Labels
merge-tool-ready This ticket is ready to be merged in
Development

Successfully merging this pull request may close these issues.

resource scheduler: polish scheduling algorithm
4 participants