Skip to content

Conversation

@tobous
Copy link
Member

@tobous tobous commented Oct 2, 2019

[CORE] Improve exception messages for jupiter preconditions

[FIX][CORE] #223 Drop vector time when resource is removed

Adjusts ConcurrentDocumentClient and ConcurrentDocumentServer to drop
the vector time for a resource when it is deleted (or moved, thereby
"deleting" the resource with the previous path).

Adds DeletedResourceFilter to encapsulate the logic of detecting and
correctly handling resource deletions.
Furthermore, it provides a way of filtering out subsequent activities
for deleted resource. This is needed as there can be straggling
activities for deleted resources that were created by remote
participants before they executed the deletion activity for the
resource. Such activities are now filter out until the corresponding
resource is recreated. This ensures that activities for the new
resource will still be processed correctly.
Checksum activities confirming the file deletion are explicitly excluded
from this filer (i.e. they will not be dropped for deleted resources) as
they are needed to check the consistency of all clients.

Fixes #223. Also fixes #710.

[NOP][I] Remove comments to reset vector time

[INTERNAL][CORE] Add deletion acknowledgments

Adds DeletionAcknowledgmentActivity, which can be used to acknowledge
that a file was successfully deleted locally as a reaction to a received
activity. This acknowledgment allows us to drop the activity filter for
the deleted resource once all other session participants have
acknowledged that they received the deletion.

Adds the DeletionAcknowledgmentDispatcher to automatically dispatch an
acknowledgment when an activity is received that contains a file
deletion.

[INTERNAL][CORE] Use deletion acknowledgments to drop activity filters

Uses received deletion acknowledgment activities to determine when all
other participants have received the deletion, allowing the local
instance to drop the activity filter for the deleted resource.

[REFACTOR][CORE] Adjust wording in activity filter

Adjusts the wording in the activity filter to reflect that it is
specifically dealing with files and not resources in general.

[FIX][CORE] Exclude editor closed activities from filter

Adjusts ResourceActivityFilter.isFiltered(IActivity) to also never
filter out editor activities of the type "CLOSED". This is necessary as
such activities are used to clean up the user editor state.

Adjusts the wording of the javadoc for isFiltered(IActivity).

Adjusts code in ResourceActivityFilter to always uses the full qualifier
when using a variable of the type 'FileActivity.Type'. This was done to
make them easier to differentiate from the newly introduced usage of
variables of the type 'EditorActivity.Type'.

[DOC][CORE] Add documentation on how to handle deleted child resources

Adds information on how to handle deleted child resources to
FolderDeletedActivity.

Adds notes that activities might be filtered out to the corresponding
methods in ConcurrentDocumentServer and ConcurrentDocumentClient.

@tobous tobous added Type: Bug Issue that describes an unintended behavior of Saros Aspect: Consistency Issue with Saros' consistency handling Area: Core Issue affecting the Saros Core Prio: Critical labels Oct 2, 2019
@tobous tobous added this to the Saros/I Release 0.3.0 milestone Oct 2, 2019
@tobous tobous self-assigned this Oct 2, 2019
@tobous
Copy link
Member Author

tobous commented Oct 2, 2019

I am still not completely happy with this as the set of deleted resources is never cleared (unless the resource is recreated), meaning it continuously grows. Furthermore, we store references that might not be needed anymore as all participants might already have executed the deletion/move activity.

I only see two options for this:

  • Hold a timestamp for each entry and prune it periodically.
  • Introduce an ACK for resource deletions/moves so that we know when every participant has successfully processed it.

Pruning seems easier, but ACKs would be a nicer solution as it guarantees that all activities are handled correctly (which could be an issue if we prune to early). So I think I am currently leaning towards adding an ACK.

@Drakulix @m273d15 @srossbach Any opinions?

@srossbach
Copy link
Contributor

I updated my comment yesterday regarding the issue. You cannot simply delete something. If the vector time mismatch the Watchdog is unable to detect inconsistencies. As an example. When you perform a recovery (which includes a reset on the client as well on the server side) the whole session is locked/blocked, i.e nobody can edit files and modify something in the editor.

As you are primary testing under IntelliJ and your plugin only supports 2 people you MUST NOT assume that if something works for n = 2 users that this will work for n > 2 users.

@srossbach
Copy link
Contributor

So after taking a quick look you changed the logic to:

When I see a create / delete activity the specific document gets a state which decide if an activity is applied. If you are not in that state do the old logic and create documents on the fly ?

@tobous
Copy link
Member Author

tobous commented Oct 2, 2019

When I see a create / delete activity the specific document gets a state which decide if an activity is applied. If you are not in that state do the old logic and create documents on the fly ?

Yes. When a file is deleted, before the vector time is reset, it is added to the set of deleted resources. This happens for the server and the client whenever they receive a deletion/move activity. Whenever the server or client receives an IResourceActivitiy for a resource contained in the set, it is simply dropped. This was added to avoid the case that we receive an activity for the deleted file after its deletion, thereby recreating its jupiter document.

Resources are removed from the set when they are recreated, either through a file creation or a file move. This ensures that activities will be processed again for the newly created resource.

And yes, I did not change the old logic, so jupiter documents are still created on the fly. But due to the filter, it should not matter.

@saros-project saros-project deleted a comment Oct 2, 2019
@Drakulix
Copy link
Contributor

Drakulix commented Oct 4, 2019

I am still not completely happy with this as the set of deleted resources is never cleared (unless the resource is recreated), meaning it continuously grows. Furthermore, we store references that might not be needed anymore as all participants might already have executed the deletion/move activity.

I only see two options for this:

* Hold a timestamp for each entry and prune it periodically.

* Introduce an ACK for resource deletions/moves so that we know when every participant has successfully processed it.

Pruning seems easier, but ACKs would be a nicer solution as it guarantees that all activities are handled correctly (which could be an issue if we prune to early). So I think I am currently leaning towards adding an ACK.

@Drakulix @m273d15 @srossbach Any opinions?

While I like the overall approach, this remaining problem needs to be solved. Otherwise, we are introducing a new memory leak to fix an issue, that was mostly caused by a fix to an even worse leak.

(But reading through the patch, this is something any activity could actually trigger, it is just much more likely with the HeartbeatDispatcher, right?)

I like the ACK-solution, as it is cleaner, but I think the prune is more robost. With the ACK we could keep entries in cases, where we miss the ACK (e.g. an unexpected disconnect). One can absolutely argue, that this is negligible. I guess I am fine with either solution.

@tobous
Copy link
Member Author

tobous commented Oct 4, 2019

(But reading through the patch, this is something any activity could actually trigger, it is just much more likely with the HeartbeatDispatcher, right?)

@Drakulix No, this issue is independent from the hearbeat stuff. It just has the same solution.
The underlying issues are that

  • the jupiter document/vector time is not dropped for deleted resources on the server. (The logic was present but faulty. See old l.78.)
  • the jupiter time is only dropped when receiving a deletion activity, not when sending it. I.e. the state of the participant that deleted the resource is not updated.
  • straggling activities sent before the deletion was applied locally were not ignored.

This issue is that this causes a desync when subsequently recreating a deleted resource as the vector times no longer match.

The issue for #710 is that the participant that deleted the resource still sends heartbeats for the deleted resource as the local jupiter document was never reset. And they also receive heartbeats from other clients that they can't process.

Both issues are solved by this patch as it adjusts the behavior to also drop the vector time on outgoing deletion activities and adds a filter for straggling activities.

@tobous
Copy link
Member Author

tobous commented Oct 4, 2019

Updated the change by introducing DeletedResourceFilter to encapsulate the filter logic. This was done to avoid code duplication and to provide a central place to put later additions of the logic (like pruning or ACK processing).

@saros-project saros-project deleted a comment Oct 4, 2019
@saros-project saros-project deleted a comment Oct 5, 2019
@tobous
Copy link
Member Author

tobous commented Oct 7, 2019

I am not sure whether adding ACKs is really the best solution. I implemented it and it seems to work (but I only did very limited testing with 2 participants; but that is not the point I am trying to make), but it adds a lot of logic (about 220 lines of code before cleanup and adding javadoc), a new activity type, and a lot of network traffic (as each user has to send an ACK for every other participant besides the creator of the deletion activity). Furthermore, as this adds to the concurrent nature of the problem, it is annoying to debug and to ensure that it correctly works for multiple clients.

The prune solution on the other hand needs about 70 LOC and it ensures that the list of deleted resources is always cleared.

I still think that using ACKs is a more elegant solution, but I am still pondering on whether it is worth the hassle (testing troubles and increased code complexity, which makes it harder to maintain).

@tobous
Copy link
Member Author

tobous commented Oct 7, 2019

The current approach only looks at file activities. This is problematic when a folder is deleted as the vector time of the contained files will not be reset. However, this issue can not be solved with the current logic as the resources will already have been deleted on the creator's side, meaning we can't access the child resources.

The only solution I can think of is to require the creator of the activity to always completely tear down deleted folders (i.e. create deletion activities for all contained resources that are dispatched before the actual folder is deleted).

This is currently not handled in such a way in Saros/I, but there is already a TODO for it (as this is also needed for other cleanup stuff) and its a relatively easy change.

In Saros/E, the logic in ProjectDeltaVisitor seem to already handle the child resources in such a way (as children of folders are visited and the activities are sorted to ensure that files are deleted before folders). But I am not completely sure on this. @srossbach Any input?

Furthermore, if this is now a requirement for folder deletions, this should be put in writing somewhere. Probably the javadoc of the folder deletion activity.

@srossbach
Copy link
Contributor

In Saros/E, the logic in ProjectDeltaVisitor seem to already handle the child resources in such a way (as children of folders are visited and the activities are sorted to ensure that files are deleted before folders). But I am not completely sure on this. @srossbach Any input?

We have no guideline on how to send file deletions when they are triggered from another delete operation (e.g delete a folder). Unless this is specified somewhere I would not make any assumption which data in which order you get. E.g you could just send a folder delete activity, but I guess this would blow up the partial sharing as it also have to keep track of the current resources.

@Drakulix
Copy link
Contributor

Drakulix commented Oct 8, 2019

I am not sure whether adding ACKs is really the best solution. I implemented it and it seems to work (but I only did very limited testing with 2 participants; but that is not the point I am trying to make), but it adds a lot of logic (about 220 lines of code before cleanup and adding javadoc), a new activity type, and a lot of network traffic (as each user has to send an ACK for every other participant besides the creator of the deletion activity). Furthermore, as this adds to the concurrent nature of the problem, it is annoying to debug and to ensure that it correctly works for multiple clients.

The prune solution on the other hand needs about 70 LOC and it ensures that the list of deleted resources is always cleared.

That is what I was fearing. I would prefer the prune solution in that case.

@tobous
Copy link
Member Author

tobous commented Oct 8, 2019

We have no guideline on how to send file deletions when they are triggered from another delete operation (e.g delete a folder). Unless this is specified somewhere I would not make any assumption which data in which order you get. E.g you could just send a folder delete activity, but I guess this would blow up the partial sharing as it also have to keep track of the current resources.

@srossbach I know that we currently don't have any guidelines on this. My question was rather whether my understanding of the Eclipse handling (i.e. that child resources are explicitly handled) is correct. From looking at the code, it looks that way to me, but I am not familiar with the details of the Eclipse IResourceDeltaVisior that is used in the ProjectDeltaVisior. But, from the javadoc, it seems that way as visit(...) always returns true for folders and the activities should be sorted by the sort(...) method.

As I see no other way of handling this case, I would just make this explicit handling of child resources a new guideline.

@tobous
Copy link
Member Author

tobous commented Oct 8, 2019

Hmmmm... a big problem with the prune/cleanup solution is that we don't have a timeout for clients. In other words, it is completely acceptable for a client to be stuck for a while and the process everything that happened in the meantime without causing a desync.

An example of this happening in a normal session is the developer executing some blocking actions (e.g. opening the settings menu in IntelliJ) that block the IDE execution thread. This blocks the execution of received activities (like file deletions).

This makes it basically impossible to set a cleanup window (i.e. after what time to throw away the deletion filter for a particular resource) that is guaranteed to work.

@tobous
Copy link
Member Author

tobous commented Oct 28, 2019

Rebased onto current master without any interaction.

@tobous
Copy link
Member Author

tobous commented Nov 26, 2019

Rebased onto current master without any interaction.

@tobous
Copy link
Member Author

tobous commented Nov 28, 2019

Rebased onto current master without any interaction. Also added the commit [FIX][CORE] Exclude editor closed activities from filter to fix the handling of editor closed activities.

@tobous
Copy link
Member Author

tobous commented Dec 9, 2019

Man, this PR grew a lot from its initial size (mostly because of the addition of the deletion ACK). Now that it has been tested multiple times in a multi-user session and the found issues were resolved, I think it is time to again start reviewing this PR.

@Drakulix @m273d15 @srossbach @stefaus Can I interest anyone of you to again review this PR?

As a note for the review:
I think the most crucial point where this logic can actually break something is when activities that are needed for the teardown of the Saros state for the deleted resource are filtered out. This was, for example, the cause for #780 (which has been fixed with 8ed15b5).

The filter logic is located in `ResourceActivityFilter#isFiltered(IActivity). I went through all existing activities and checked whether they were important for the teardown, but maybe I missed something. So if you see any other possible issues with needed activities being filtered that I might have missed (i.e. that are not handled correctly by the logic), please say so.

@saros-infrastructure
Copy link

Codacy Here is an overview of what got changed by this pull request:

Issues
======
+ Solved 1
           

Complexity increasing per file
==============================
- core/src/saros/concurrent/management/ConcurrentDocumentClient.java  1
- core/src/saros/activities/DeletionAcknowledgmentActivity.java  1
- core/src/saros/session/internal/DeletionAcknowledgmentDispatcher.java  3
- core/src/saros/concurrent/management/ResourceActivityFilter.java  7
- core/src/saros/concurrent/management/ConcurrentDocumentServer.java  1
         

See the complete overview on Codacy

@tobous
Copy link
Member Author

tobous commented Jan 23, 2020

It seems like there is no interest in reviewing (or re-reviewing) this PR. (I want to make very clear that this is not meant as any form of criticism, just an observation! I know this PR isn't easy to review and a lot of work went into the review process already.)

But, as I would still like to move the process along, after conferring with @m273d15, I have decided to proceed with merging this PR without a complete review unless anybody voices objections.

I have now tested these changes many times and have removed any outstanding issues. I have also re-run the STF to confirm that all tests (besides the known faulty ones) pass. I feel like I have done my due diligence on that front.

If you don't want that to happen, you have at least till Monday evening (the 28th of January 2020) to either voice some objections/concerns or at least declare your intentions of reviewing this PR (in a somewhat timely manner 😉). Otherwise, I will merge this some time next week.

Adjusts ConcurrentDocumentClient and ConcurrentDocumentServer to drop
the vector time for a resource when it is deleted (or moved, thereby
"deleting" the resource with the previous path).

Adds DeletedResourceFilter to encapsulate the logic of detecting and
correctly handling resource deletions.
Furthermore, it provides a way of filtering out subsequent activities
for deleted resource.  This is needed as there can be straggling
activities for deleted resources that were created by remote
participants before they executed the deletion activity for the
resource.  Such activities are now filter out until the corresponding
resource is recreated.  This ensures that activities for the new
resource will still be processed correctly.
Checksum activities confirming the file deletion are explicitly excluded
from this filer (i.e. they will not be dropped for deleted resources) as
they are needed to check the consistency of all clients.

Fixes #223. Also fixes #710.
Adds DeletionAcknowledgmentActivity, which can be used to acknowledge
that a file was successfully deleted locally as a reaction to a received
activity. This acknowledgment allows us to drop the activity filter for
the deleted resource once all other session participants have
acknowledged that they received the deletion.

Adds the DeletionAcknowledgmentDispatcher to automatically dispatch an
acknowledgment when an activity is received that contains a file
deletion.
Uses received deletion acknowledgment activities to determine when all
other participants have received the deletion, allowing the local
instance to drop the activity filter for the deleted resource.
Adjusts the wording in the activity filter to reflect that it is
specifically dealing with files and not resources in general.
Adjusts ResourceActivityFilter.isFiltered(IActivity) to also never
filter out editor activities of the type "CLOSED". This is necessary as
such activities are used to clean up the user editor state.

Adjusts the wording of the javadoc for isFiltered(IActivity).

Adjusts code in ResourceActivityFilter to always uses the full qualifier
when using a variable of the type 'FileActivity.Type'. This was done to
make them easier to differentiate from the newly introduced usage of
variables of the type 'EditorActivity.Type'.
Adds information on how to handle deleted child resources to
FolderDeletedActivity.

Adds notes that activities might be filtered out to the corresponding
methods in ConcurrentDocumentServer and ConcurrentDocumentClient.
@tobous
Copy link
Member Author

tobous commented Jan 28, 2020

Rebased onto current master without any interaction in preparation of merge.

@tobous tobous merged commit f55a8f3 into master Jan 28, 2020
@tobous tobous deleted the pr/core/fix_223 branch January 28, 2020 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: Core Issue affecting the Saros Core Aspect: Consistency Issue with Saros' consistency handling Prio: Critical Type: Bug Issue that describes an unintended behavior of Saros

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deleting a shared file cause vector time mismatch errors Vector time does not get reset correctly when deleting a file

5 participants