Tablet left in up state when being killed#3554

Closed
rafael wants to merge 1 commit into vitessio:master from tinyspeck:tablet-up-down-issue

@rafael
Member

@rafael rafael commented Jan 13, 2018

Description

A couple of days ago, we noticed that during a hard crash of vttablet, vtgate does not remove the tablet from the healthy list. I dug into this issue a bit and found that it was due to the following:

  1. Health checker changes the status of the tablet to Serving false, but Up remains true.
  2. The logic in tablet_stats_cache doesn't remove it from the healthy list in this situation.

I'm not familiar with this part of the codebase, but at first glance we have two options:

  1. Set Up to false.
  2. Change the tablet_stats_cache to remove the tablet from the healthy list when it is Up but not Serving.
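
For illustration, the second option could look roughly like the sketch below. This is not the actual tablet_stats_cache code: the `tabletStats` struct, its `Alias` field, and the `healthyTablets` helper are hypothetical names made up for this example, and only the `Up` and `Serving` flags come from the description above.

```go
package main

import "fmt"

// tabletStats is a hypothetical, simplified stand-in for the health-check
// record discussed above. Only Up and Serving mirror flags named in this PR.
type tabletStats struct {
	Alias   string // hypothetical identifier for the tablet
	Up      bool   // the health checker still tracks this tablet
	Serving bool   // the tablet reports that it can serve queries
}

// healthyTablets keeps only the entries that are both Up and Serving,
// so a tablet that is Up but no longer Serving drops out of the list.
func healthyTablets(all []tabletStats) []tabletStats {
	healthy := make([]tabletStats, 0, len(all))
	for _, ts := range all {
		if ts.Up && ts.Serving {
			healthy = append(healthy, ts)
		}
	}
	return healthy
}

func main() {
	all := []tabletStats{
		{Alias: "tablet-100", Up: true, Serving: true},
		{Alias: "tablet-101", Up: true, Serving: false}, // e.g. a crashed tablet
	}
	for _, ts := range healthyTablets(all) {
		fmt.Println(ts.Alias)
	}
}
```

With this kind of check, the crashed tablet is filtered out without touching its `Up` flag.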

I made a quick change using the first approach, which seems to solve the problem, but I'm debating whether the second one is a better way. Looking forward to getting feedback on this :)

One more question that I had while thinking about this problem: should this kind of error trigger the request buffering state? Right now it does not.

Test

The bug is reproducible by running local/examples and killing a tablet. Without this change, you can see that the gates keep retrying to connect to the tablet without removing it from the healthy hosts.

@alainjobart
Contributor

So of your two proposed fixes, 2. is the right one. :)

I understand the name is misleading, 'Up' feels like the tablet is up, when in fact it's about the tablet being discovered, independently of it being serving and healthy. 'Up=true' in the callback means the tablet is added, 'Up=false' means the tablet is removed from the topology.
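
A minimal sketch of that distinction, assuming a simplified listener: the `tabletStats` struct, the `statsCache` type, and its `update` and `serving` methods are hypothetical names for this example, not the real discovery API. `Up` only controls whether the tablet is tracked at all, while `Serving` is looked at separately.

```go
package main

import "fmt"

// tabletStats is a hypothetical, simplified health-check record.
type tabletStats struct {
	Alias   string // hypothetical identifier for the tablet
	Up      bool   // true: tablet added/known; false: tablet removed
	Serving bool   // whether the tablet reports it can serve queries
}

// statsCache is a hypothetical cache of the tablets discovered so far.
type statsCache struct {
	known map[string]tabletStats
}

func newStatsCache() *statsCache {
	return &statsCache{known: make(map[string]tabletStats)}
}

// update applies one callback: Up=false forgets the tablet entirely,
// Up=true adds it or refreshes its latest state.
func (c *statsCache) update(ts tabletStats) {
	if !ts.Up {
		delete(c.known, ts.Alias)
		return
	}
	c.known[ts.Alias] = ts
}

// serving reports whether the tablet is both known and Serving.
func (c *statsCache) serving(alias string) bool {
	ts, ok := c.known[alias]
	return ok && ts.Serving
}

func main() {
	c := newStatsCache()
	c.update(tabletStats{Alias: "tablet-100", Up: true, Serving: true})
	// The tablet stops serving: it stays known but is no longer usable.
	c.update(tabletStats{Alias: "tablet-100", Up: true, Serving: false})
	fmt.Println(len(c.known), c.serving("tablet-100"))
	// The tablet is removed from the topology: it is forgotten.
	c.update(tabletStats{Alias: "tablet-100", Up: false})
	fmt.Println(len(c.known))
}
```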

For this bug though, I don't see how this is possible, but maybe I'm not reading the code properly:
https://github.com/youtube/vitess/blob/master/go/vt/discovery/replicationlag.go#L74
In that part, we remove a tablet which reports !Serving. So I don't see how it could still be shown as healthy.

As a side note, we also added a new feature fairly recently, where a tablet, upon shutdown, clears out its Hostname in the topology. That really means the tablet is not there and won't be there. We could use that to make the tablet 'Up=false' if we wanted to. But I'm pretty sure we don't right now. And anyway, it's independent of the reported issue with vtgate showing the wrong status for a tablet.

@sougou
Contributor

sougou commented Jan 18, 2018

@alainjobart is reworking this entire thing with brand new data structures. So, let's close this and wait for his new shiny :).
