Skip to content

tm revamp: remove most topo.ChangeType#6139

Merged
sougou merged 4 commits intovitessio:masterfrom
planetscale:ss-tm-owned-record
May 28, 2020
Merged

tm revamp: remove most topo.ChangeType#6139
sougou merged 4 commits intovitessio:masterfrom
planetscale:ss-tm-owned-record

Conversation

@sougou
Copy link
Copy Markdown
Contributor

@sougou sougou commented May 1, 2020

All those calls have been replaced with agent.ChangeType.
agent.ChangeType is the only one that should call topo.ChangeType.
One exception is agent.finalizeTabletExternallyReparented. It
actually changes the tablet's internal state before updating
topo, which seems to break the main rule. So, I've left it alone
for now.

Signed-off-by: Sugu Sougoumarane ssougou@gmail.com

@sougou sougou requested review from deepthi and enisoc May 1, 2020 04:53
@deepthi
Copy link
Copy Markdown
Collaborator

deepthi commented May 1, 2020

We deprecated the TER vttablet RPC in 4.0 but forgot to remove it in 5.0/6.0. We can delete that and then finalize can also be deleted.

Copy link
Copy Markdown
Collaborator

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

Copy link
Copy Markdown
Collaborator

@deepthi deepthi May 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are calling an RPC within an RPC, so it's blocking on the actionMutex. Specifically we called agent.lock(ctx) just before this call to ChangeType which calls agent.lock(ctx) as the first thing. I don't think it is necessary to call it at the beginning of this code block any more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we were taking the lock in this block so that, in the case of drain-for-backup, we exclude all other RPC actions while the backup is happening. If that's true, we still need to hold the lock throughout the rest of this function, not just while changing type. The pattern we typically use in Vitess for this is to pull the meat of ChangeType() out into changeTypeLocked() and call the latter whenever we already hold the lock.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I forgot that defer only gets executed at the end of the function, not at the end of the block.

Copy link
Copy Markdown
Collaborator

@deepthi deepthi May 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next call to refreshTablet is unnecessary, because ChangeType already calls it. Ditto for the call to agent.runHealthCheckLocked

@sougou sougou force-pushed the ss-tm-owned-record branch from 80e8ef5 to 51a9b1f Compare May 1, 2020 20:17
Copy link
Copy Markdown
Member

@enisoc enisoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally concerned about almost all of the removals of agent.lock() calls. This seems to be defeating an important protection against multiple actions/RPCs running concurrently.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we were taking the lock in this block so that, in the case of drain-for-backup, we exclude all other RPC actions while the backup is happening. If that's true, we still need to hold the lock throughout the rest of this function, not just while changing type. The pattern we typically use in Vitess for this is to pull the meat of ChangeType() out into changeTypeLocked() and call the latter whenever we already hold the lock.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks unsafe. There's a lot happening now without holding the lock. The current goal of the lock is to serialize entire RPCs, not just certain changes. I don't think we can relax that without thinking through a full redesign of the locking scheme.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Copy Markdown
Contributor Author

@sougou sougou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've restored all the action locks. I'll revisit them on a case by case basis.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were these changes needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we're now using agent.ChangeType instead of directly updating the topo. So, the ActionAgent needs to be initialized for it to process ChangeType properly.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could revert all changes to rpc_external_reparent.go and this test and delete the functions in that file + tests either in this PR or in a subsequent one. It doesn't make sense to make changes to it when we should be deleting it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine. I'll ping you separately for the cleanup.

@sougou
Copy link
Copy Markdown
Contributor Author

sougou commented May 1, 2020

local_example has been failing, but it passes on my machine. I'll keep retrying.

@sougou sougou force-pushed the ss-tm-owned-record branch 4 times, most recently from ae42484 to 3abdcf4 Compare May 2, 2020 04:06
Copy link
Copy Markdown
Member

@enisoc enisoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

sougou added 3 commits May 26, 2020 20:19
All those calls have been replaced with agent.ChangeType.
agent.ChangeType is the only one that should call topo.ChangeType.
One exception is agent.finalizeTabletExternallyReparented. It
actually changes the tablet's internal state before updating
topo, which seems to break the main rule. So, I've left it alone
for now.

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
In many of those flows action lock wasn't needed.
We'll eventually deprecate actionLock altogether.

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
@sougou sougou force-pushed the ss-tm-owned-record branch 2 times, most recently from 7cb5b97 to d89f7e4 Compare May 27, 2020 03:23
changeTypeLocked runs health, but InitMaster should not.

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
@sougou sougou merged commit 65d3b3c into vitessio:master May 28, 2020
@sougou sougou deleted the ss-tm-owned-record branch May 28, 2020 04:42
@deepthi deepthi added this to the v7.0 milestone Jul 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants