
Emergency Reparent: Take Union of Executed and Retrieved for Consideration #6190

Closed
PrismaPhonic wants to merge 6 commits into vitessio:master from planetscale:feature/emergency-reparent-relay-log-union

Conversation

@PrismaPhonic (Contributor) commented May 16, 2020

Problem:

We need a way to take the union of both the executed and retrieved GTIDSets, so that we take the relay log into consideration when determining which replica is the most caught up for an EmergencyReparent.

Solution:

This PR adds a new RelayLogPosition field to the Status struct, and retrieves the GTIDSet appropriately for all flavors. Unfortunately we discovered that it's not possible to retrieve the GTIDSet for the relay log with MariaDB, so a comment has been added explaining that in the case of MariaDB, the RelayLogPosition will be empty.

This PR then uses the new Union method on the GTIDSet interface to take the union of the retrieved and executed GTIDSets, and exposes that determination as a public method on Wrangler. This is necessary because we currently require the user to supply the most caught-up replica, so they must have a way to reach the same determination that we do.
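The union-based comparison described above can be sketched in simplified Go. This is an illustrative model, not Vitess code: it represents a GTID set as a map from server UUID to highest sequence number (assuming gapless intervals, unlike the real interval-based Mysql56GTIDSet), and the function names are invented for the sketch.

```go
package main

import "fmt"

// GTIDSet is a simplified model of a GTID set: server UUID -> highest
// transaction sequence seen. The real Mysql56GTIDSet tracks interval
// lists; this sketch assumes gapless 1..n ranges.
type GTIDSet map[string]int64

// Union returns a set covering every transaction present in either input.
func Union(a, b GTIDSet) GTIDSet {
	out := GTIDSet{}
	for uuid, seq := range a {
		out[uuid] = seq
	}
	for uuid, seq := range b {
		if seq > out[uuid] {
			out[uuid] = seq
		}
	}
	return out
}

// ContainsAll reports whether a covers every transaction in b.
func ContainsAll(a, b GTIDSet) bool {
	for uuid, seq := range b {
		if a[uuid] < seq {
			return false
		}
	}
	return true
}

func main() {
	executed := GTIDSet{"uuid-1": 100}  // transactions already applied
	retrieved := GTIDSet{"uuid-1": 120} // transactions fetched into the relay log
	relayLogPosition := Union(executed, retrieved)
	fmt.Println(relayLogPosition["uuid-1"]) // prints 120: what the replica would have once the relay log drains
}
```

The point of the union is that a replica's effective position for failover purposes is everything it has executed plus everything sitting unexecuted in its relay log.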

@PrismaPhonic PrismaPhonic requested review from deepthi and enisoc May 16, 2020 00:17
@PrismaPhonic PrismaPhonic force-pushed the feature/emergency-reparent-relay-log-union branch 2 times, most recently from 7de8b56 to ec9f0a3 Compare May 16, 2020 03:17
@enisoc (Member) left a comment

I have started to write the logic to determine the most caught up replica that we could use for such an optional feature.

Can we add the new feature in a separate PR? The new feature will probably need a bit of design iteration (e.g. how do we choose among multiple options), whereas we already know what needs to be done to fix the bug in the existing feature (consider relay log pos whenever available when validating positions).

@PrismaPhonic PrismaPhonic requested a review from enisoc May 19, 2020 01:53
…apping this for all flavors, and updated proto so this gets mapped to the proto type. Also added shared logic that checks if the supplied master tablet alias is the farthest ahead, by first merging executed and retrieved GTIDSets before comparing. Also did work on what will become an optional feature to have us find the most caught up replica for you.

Signed-off-by: Peter Farr <Peter@PrismaPhonic.com>
…vor, because the relay log file position is not comparable to what we need to compare against - the master binlog file position. So we should think of this as not available instead. 2. RelayLogPosition already has a valid empty value, and therefore should not be a pointer type. 3. Updated the proto definition so it generates the id correctly for RelayLogPosition. 4. Refactored so that RelayLogPosition is already a union.

Signed-off-by: Peter Farr <Peter@PrismaPhonic.com>
…and exported ChooseNewMaster for use externally. Refactored ChooseNewMaster to use new decodePosition helper which takes a status object, and preferably returns the RelayLogPosition, if one exists.

Signed-off-by: Peter Farr <Peter@PrismaPhonic.com>
@PrismaPhonic PrismaPhonic force-pushed the feature/emergency-reparent-relay-log-union branch from c92ee8b to 11b70a2 Compare May 19, 2020 02:01
@PrismaPhonic PrismaPhonic marked this pull request as ready for review May 19, 2020 02:40
@PrismaPhonic PrismaPhonic requested a review from sougou as a code owner May 19, 2020 02:40
@sougou (Contributor) commented May 19, 2020

This was a suggestion from @jfg956: Master_Log_File and Read_Master_Log_Pos would allow you to know which slave has more binlogs from the master.

Although not exact, it may still let us find a better candidate. But I'm thinking the comparison may become non-trivial.
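The file/position comparison @jfg956 suggests amounts to comparing (Master_Log_File, Read_Master_Log_Pos) as a tuple. A minimal sketch, assuming the master's binlog file names have a fixed-width numeric suffix so they sort lexicographically in rotation order; `binlogCoord` and `atLeast` are illustrative names, not Vitess APIs:

```go
package main

import "fmt"

// binlogCoord is a (Master_Log_File, Read_Master_Log_Pos) pair. Names like
// "mysql-bin.000042" sort lexicographically in rotation order when the
// numeric suffix is fixed-width, so coordinates compare as tuples.
type binlogCoord struct {
	file string
	pos  uint64
}

// atLeast reports whether a has retrieved at least as much of the master's
// binlog as b. Only meaningful for replicas of the same master, which is
// part of why sougou notes the comparison may become non-trivial.
func (a binlogCoord) atLeast(b binlogCoord) bool {
	if a.file != b.file {
		return a.file > b.file
	}
	return a.pos >= b.pos
}

func main() {
	a := binlogCoord{"mysql-bin.000042", 1000}
	b := binlogCoord{"mysql-bin.000041", 9000}
	fmt.Println(a.atLeast(b)) // prints true: a later file wins regardless of offset
}
```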

… a list of all tablet aliases that are fully caught up, so a caller can choose what to do with that information.

Signed-off-by: Peter Farr <Peter@PrismaPhonic.com>
Signed-off-by: Peter Farr <Peter@PrismaPhonic.com>
@PrismaPhonic PrismaPhonic requested a review from deepthi May 20, 2020 01:19
Signed-off-by: Peter Farr <Peter@PrismaPhonic.com>
}

// FindMasterCandidates will look at all of the tablets in the supplied tabletMap, and return a list of tabletAliases that
// are all caught up. This means that all of the returned TabletAliases will have the same replication position.
Collaborator:

If I'm reading this right, I don't see a guarantee (or a check) that the returned replicas are actually caught up. So we should fix the comments: "all caught up" -> "the most up-to-date" OR "the most caught up".

Member:

Since we ignore tablets we can't reach, we should clarify that it's additionally "the most up-to-date that we were able to reach". It's therefore important to still call CheckTabletIsFarthestAhead.


if len(aggregator.tabletReplInfos) < 2 {
// Either 1 result, or 0. Regardless, we can return the list at this point as we don't need to
// filter out tablets that aren't fully caught up.
Collaborator:

See above comment. It seems that "fully caught up" should be changed to "the most caught up"

Contributor (Author):

Ahh, good point, will update.

// chooseNewMaster finds a tablet that is going to become master after reparent. The criteria
// for the new master-elect are (preferably) to be in the same cell as the current master, and
// to be different from avoidMasterTabletAlias. The tablet with the largest replication
// for the new master-elect are to be in the same cell as the current master, and
Collaborator:

@enisoc do you know why we limit the search to master's current cell?

Member:

It makes sense that we'd want to prefer the same cell when possible, as the comment originally said, but I don't know if it was intentional that this code has actually been requiring the same cell. It does make sense from a safety perspective, but I was also surprised by this since the vtctl command docs don't mention it.

In any case, I advised @PrismaPhonic that we should probably preserve the behavior rather than fixing it to match the comment, since users may have come to rely on this and unexpectedly reparenting to a remote cell could cause an outage if the app isn't prepared for it. What do you think?

On a related note, PRS used to only contact tablets in the same cell to get their positions, but now we will contact tablets in all cells and throw away most of them. Do you think it's worth fixing this to avoid the unnecessary calls?

@enisoc (Member) commented May 20, 2020

@deepthi wrote:

For this reason you also have a 2-step process to delete fields from a proto. You first deprecate them (using reserved) and delete them in the next release.

My understanding was that it's best to keep previously-used field numbers reserved forever. Have we been removing reservations from our protos?

@deepthi (Collaborator) commented May 20, 2020

My understanding was that it's best to keep previously-used field numbers reserved forever. Have we been removing reservations from our protos?

I don't think we have been removing them at all, so we should be fine. I'll keep this in mind for the future.

// were to finish executing everything that's currently in its relay log.
// However, some MySQL flavors don't expose this information,
// in which case RelayLogPosition.IsZero() will be true.
RelayLogPosition Position
Member:

It turns out we do have the info we need to fill this in for the filepos flavor: #6190 (comment)

Leaving this here since there's not a better place to stick a line comment about it.

// maxReplPosSearch is a struct helping to search for a tablet with the largest replication
// position querying status from all tablets in parallel.
type maxReplPosSearch struct {
// tabletReplInfoAggregator is a struct that can be used to get all tablets replication positions tied to their aliases concurrently.
Member:

We should mention that this looks at the relay log position whenever possible, since this is different from what people usually mean when they refer to the replication position of an instance. Usually, it refers to which snapshot of data you'd access if you sent a query to that instance, meaning the executed GTID set. We pretty much only care about the relay log position in the context of checking for semi-sync ACKed transactions that would otherwise be lost in the case of an unplanned failover.

In fact, I'm just now realizing that we should not look at relay log pos when this is called from PlannedReparentShard. In PRS, the next thing we're going to do after choosing a candidate is wait for its executed position to catch up to the demoted (frozen) master. Since the demoted master is available in this case, we don't need to look for semi-sync ACKed transactions that were otherwise lost -- we know the master must have them and we know the master's position.

The only reason we look at candidate replication positions at all in PRS is to minimize the time we spend waiting for the chosen candidate to catch up. We therefore want to choose the one that has executed the farthest, not the one that has merely retrieved the farthest; execution is what's going to take time. So I think we should make it optional to look at relay log pos and keep that off when calling this from PRS.

@deepthi Can you check my logic on this?
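The distinction enisoc draws between ERS and PRS can be sketched with a hypothetical option flag. Positions are simplified here to plain int64 offsets, and `status`/`comparablePos` are illustrative names, not the PR's actual types:

```go
package main

import "fmt"

// status is a simplified stand-in for replicationdatapb.Status: both
// positions are modeled as int64 offsets rather than GTID sets.
type status struct {
	executedPos int64 // transactions actually applied
	relayLogPos int64 // transactions merely fetched (0 if the flavor can't report it)
}

// comparablePos returns the position to rank a candidate by. The relay log
// only matters for unplanned failovers (ERS), where semi-sync ACKed
// transactions may exist solely in someone's relay log. For PRS the demoted
// master is still reachable, so ranking by executed position minimizes the
// time spent waiting for the chosen candidate to catch up.
func comparablePos(s status, considerRelayLog bool) int64 {
	if considerRelayLog && s.relayLogPos > s.executedPos {
		return s.relayLogPos
	}
	return s.executedPos
}

func main() {
	s := status{executedPos: 90, relayLogPos: 120}
	fmt.Println(comparablePos(s, true))  // prints 120: ERS-style ranking
	fmt.Println(comparablePos(s, false)) // prints 90: PRS-style ranking
}
```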


// decodePosition is a helper that will decode the RelayLogPosition, if it's non-empty, and otherwise fall back
// to the executed position.
func decodePosition(status *replicationdatapb.Status) (mysql.Position, error) {
Member:

We should name this something to indicate that it's special. We don't normally want to consider the relay log position - only when trying to use semi-sync to save transactions that otherwise might be lost due to an unplanned failover.

Suggested change
func decodePosition(status *replicationdatapb.Status) (mysql.Position, error) {
func bestEffortRelayLogPosition(status *replicationdatapb.Status) (mysql.Position, error) {

if maxPosSearch.maxPosTablet == nil || !maxPosSearch.maxPos.AtLeast(replPos) {
maxPosSearch.maxPos = replPos
maxPosSearch.maxPosTablet = tablet
t.statusLock.Lock()
Member:

Add a TODO: Refactor this into a send on a channel. Multiple goroutines needing to append results to a list is a textbook use case for a channel.

tabletAliases = append(tabletAliases, t.tabletReplInfos[i].alias)
}
maxPosSearch.maxPosLock.Unlock()
t.statusLock.Unlock()
Member:

You could defer this right after the Lock to make it safer in case additional return paths are ever added.
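The defer suggestion, sketched on a toy aggregator (illustrative names, not the PR's actual struct):

```go
package main

import (
	"fmt"
	"sync"
)

type aggregator struct {
	mu      sync.Mutex
	aliases []string
}

// appendResult locks and immediately defers the unlock, so the mutex is
// released on every return path, including any added later and panics.
func (a *aggregator) appendResult(alias string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.aliases = append(a.aliases, alias)
}

func main() {
	var a aggregator
	a.appendResult("zone1-101")
	fmt.Println(len(a.aliases)) // prints 1
}
```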



if !masterElectPos.AtLeast(pos) {
return fmt.Errorf("tablet %v is more advanced than master elect tablet %v: %v > %v", alias, masterElectTabletAliasStr, status.Position, masterElectStatus.Position)
}
// Bail out if the master tablet candidate is not the farthest ahead.
Member:

I didn't realize that we ignore errors from StopReplicationAndGetStatus() above. Doesn't that mean ERS might cause you to lose transactions? I guess we need a force mode like that, but I would have expected an intermediate option as well that says, reparent only if it's safe.

@deepthi @sougou Did you know about this?

I'm also worried that it doesn't look like we have any step to wait for replicas to finish playing their relay logs. It seems like we just stop them and leave them stopped if our checks fail? If that's the case, then we might never meet the new criterion of ensuring the new master candidate has executed at least as far as anybody's relay log.


// CheckTabletIsFarthestAhead will take a tablet alias string, along with a statusMap of tablet alias strings to tablet Status objects,
// and return an error if the tablet is not the farthest ahead, or otherwise return nil if it is the farthest ahead.
func (wr *Wrangler) CheckTabletIsFarthestAhead(tabletAliasStr string, statusMap map[string]*replicationdatapb.Status) error {
Member:

I think we actually need to ensure that the candidate master's executed position is at least as advanced as every replica's relay log pos. Promoting a replica to master purges its relay log, so only the transactions that have been executed should count.
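The safety check described here could be sketched as follows. Positions are simplified to int64 offsets (real GTID sets need a set-containment check, not `<=`), and `safeToPromote` is a hypothetical name for the sketch:

```go
package main

import "fmt"

// safeToPromote reports whether the candidate's *executed* position covers
// every other replica's relay log position. Promotion purges the candidate's
// relay log, so only transactions it has executed count; if any replica's
// relay log is ahead, promoting this candidate would lose those transactions.
func safeToPromote(candidateExecuted int64, relayLogPositions map[string]int64) (bool, string) {
	for alias, pos := range relayLogPositions {
		if pos > candidateExecuted {
			return false, alias // this replica holds transactions the candidate lacks
		}
	}
	return true, ""
}

func main() {
	ok, behind := safeToPromote(100, map[string]int64{"zone1-102": 90, "zone1-103": 120})
	fmt.Println(ok, behind) // prints "false zone1-103"
}
```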

@enisoc enisoc mentioned this pull request May 20, 2020
13 tasks
@enisoc enisoc marked this pull request as draft July 13, 2020 22:43
@deepthi (Collaborator) commented Aug 24, 2020

replaced by #6449

@deepthi deepthi closed this Aug 24, 2020
