Skip to content

Conversation

@nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Oct 19, 2021

What is the purpose of the pull request

  • With recent refactoring for synchronous metadata patch, there was 2 code paths where double locking is happening. This patch fixes this issue.
  • While adding tests, also found a bug related to multi-writer and upgrade and fixed the same as well.

Brief change log

  • There are two code paths where deadlock happens.
    a. When a table is being upgraded, and if multi-writer is enabled, we rollback all partial commits. This upgrade step itself happens within a lock, but the rollback also tries to acquire lock when committing the rollback
    b. When auto commit is enabled with inline cleaning, auto clean is triggered as part of post commit which happens within a lock. But cleaning at the end when trying to commit again tries to acquire lock.
    Fix: Added optionality to both rollback and clean to skip locking. These two code paths will set true for skipLocking. Also, fixed to add guards to take locks only if metadata is enabled. Prior to this patch, we take locks and then check if metadata writer is available. Some minor optimization.
  • While testing, came across a bug with upgrade and rollback relating to multi-writer. When a table needs an upgrade, and if multi-writer is enabled, we rollback all partial commits. This also rollsback the current transaction in flight. Added a filtering to ensure we don't rollback the current transaction that just got started and is triggering the upgrade.

Verify this pull request

This change added tests and can be verified as follows:

  • Added tests to TestHoodieBackedMetadata
    a. testMultiWriterForDoubleLocking
    b. testRollbackDuringUpgradeForDoubleLocking

Points to discuss:

  • I checked code paths for inline compaction and clustering and it happens just outside of transactional lock in AbstractHoodieWriteClient. But cleaning happens within transactional lock meant for inflight commit. Is that intentional or should we also move clean outside of the lock.
  • Two different code paths for auto commit enabled and disabled is slowly becoming a pain point. Should we look into unifying the same.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@hudi-bot
Copy link
Collaborator

hudi-bot commented Oct 19, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@davehagman
Copy link
Contributor

LGTM!

@vinothchandar vinothchandar self-assigned this Oct 19, 2021
@vinothchandar
Copy link
Member

cc @prashantwason

@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Oct 22, 2021
@manojpec
Copy link
Contributor

manojpec commented Oct 23, 2021

Overall Approach/Design Comment:

Exposing locking semantics to callers is a bit dangerous. Caller usually don't know the internals of abstract class interfaces and cannot decide when to use what lock/unlock versions.

The core problem here is nested transactions (and there by lock reentrancy). Few higher level actions have sub actions and taking the same exclusive lock across actions can lead to deadlock. Instead of viewing this as implementation specific (lock) issues, can we see this as transaction manager issue? That is a higher level transaction started doesn't want to start sub transactions, right? If thats the case, can we instantiate the Rollback/Clean sub actions with no-op transaction manager? Because the transaction manager in these actions have the only task of locking/unlocking and so the presence of Transaction manager can inform indirectly inform them to relax transaction and there by the locking. Please let me know if I got the core problem wrong?

@nsivabalan
Copy link
Contributor Author

@manojpec : thanks for your inputs. I do like the idea of TransactionManager handling the locking depending on whether the lock acquisition is requested by same owner or diff. But I see some impl hurdles in that. Let me see how I can go about that.

@nsivabalan nsivabalan force-pushed the doubleLockingFix branch 2 times, most recently from 7b02f0f to 0faf3e0 Compare October 27, 2021 15:41
@nsivabalan
Copy link
Contributor Author

@manojpec : addressed all comments.

@nsivabalan
Copy link
Contributor Author

As per our offline discussion, abstracting out transaction manager for this purpose is not straightfoward and needs more thinking. Have filed a ticket here.

@nsivabalan
Copy link
Contributor Author

@hudi-bot azure run

Copy link
Contributor

@manojpec manojpec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking up the larger refactoring of transaction manager in a different ticket is ok with me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:blocker Production down; release blocker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants