Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to straggler handling functionality #996

Draft
wants to merge 29 commits into
base: develop
Choose a base branch
from

Conversation

ParthMandaliya
Copy link
Collaborator

This PR to resolves issue #981

Summary of changes

  1. StragglerHandlingFunction: Added an interface method start_policy
  2. CutoffTimeBasedStragglerHandling: start_policy method implements a timer to wait for cutoff-time and then call provided call callback method.
  3. Aggregator:
    • sendlocaltaskresults: update collaborators_done to keep track of collaborators that have finished ALL tasks
    • _straggler_cutoff_time_elapsed: call back function that is called after cutoff time has elapsed and applies the straggler policy

Verification performed

Tested various scenarios for time based cutoff straggler handling policy
All Director based workflow and Aggregator based workflow test-cases

Signed-off-by: Parth Mandaliya <[email protected]>
…lingFunction abstract class.

Renamed start_timer and __timer_expired functions to start_straggler_cutoff_timer and _straggler_cutoff_time_elapsed respectively.
Added docstring to both functions mentioned above.

Signed-off-by: Parth Mandaliya <[email protected]>
If one or more collaborator(s) does not even 1 task results in time, all tasks results sent by that collaborator is excluded from aggregation.

Signed-off-by: Parth Mandaliya <[email protected]>
removing start_straggler_cutoff_timer function from parent class StragglerHandlingFunction

Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
…lts for all tasks

Changed straggler handling logs
Added docstring for functions in all straggler handling policies

Signed-off-by: Parth Mandaliya <[email protected]>
2. CutoffTimeBasedStragglerHandling: start_policy method implements a timer to wait
for cutoff-time and then call provided call callback method
3. Aggregator:
    - sendlocaltaskresults: update collaborators_done to keep track of collaborators
	that have finished ALL tasks
    - _straggler_cutoff_time_elapsed: call back function that is called after cutoff
	time has elapsed and applies the straggler policy

Signed-off-by: Parth Mandaliya <[email protected]>
Added logger argument to get_straggler_handling_policy in plan.py

Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
@ParthMandaliya ParthMandaliya marked this pull request as ready for review July 4, 2024 12:33
@ParthMandaliya ParthMandaliya marked this pull request as draft July 5, 2024 12:54
minimum_reporting cannot be set 0 in any straggler policy

Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
@ParthMandaliya ParthMandaliya linked an issue Jul 9, 2024 that may be closed by this pull request
@ParthMandaliya ParthMandaliya self-assigned this Jul 9, 2024
Copy link

@teoparvanov teoparvanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code in this PR would work, but it introduces some tech debt - esp. with regards to some encapsulation "leaks" of the straggler policy objects. I've suggested some improvements in this first review pass, but it may be necessary to further revise the straggler policy interface.

Comment on lines +360 to +361
# Stop timer from restarting until next round
self.straggler_handling_policy.is_timer_expired = True
Copy link

@teoparvanov teoparvanov Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be cleaner if the policy object itself sets is_timer_expired to True when appropriate. It doesn't seem safe to leave this responsibility with the callback function, as the implementor may not be aware of it.

@@ -5,15 +5,49 @@

from abc import ABC
from abc import abstractmethod
from typing import Callable


class StragglerHandlingFunction(ABC):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer a pure functional interface - I'd suggest to rename it to StragglerHandlingPolicy

Comment on lines +655 to +660
if (
not hasattr(self.straggler_handling_policy, "round_start_time") and
self.straggler_handling_policy.straggler_cutoff_check(
len(self.collaborators_done), self.authorized_cols
)
):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of the actual policy implementation, shouldn't we always add the lagging collaborators to the stragglers list (as soon as the straggler_cutoff_check triggers)?

raise NotImplementedError

@abstractmethod
def straggler_cutoff_check(self, **kwargs) -> bool:
Copy link

@teoparvanov teoparvanov Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now it's not clear what kind of inputs this method expects. Why not define it a bit more explicitly, as in:

def straggler_cutoff_check(self, num_collaborators_done, all_collaborators, **kwargs) -> bool:

or even

def straggler_cutoff_check(self, num_collaborators_done, num_all_collaborators, **kwargs) -> bool:

(for those policies, we most likely only need the list lengths, rather than the actual collaborator names)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Potential issues observed in straggler handling policy
2 participants