doc/TabletRouting.md
This document describes how Vitess routes queries to healthy tablets. It is
meant to be kept up to date with the current state of the code, and to describe
possible extensions we are working on.

# Current architecture and concepts

## Vtgate and SrvKeyspace

Vtgate receives queries from the client. Depending on the API, it has:

* The shard to execute the query on.
* The Keyspace Id to use.
* Enough routing information with the VSchema to figure out the Keyspace Id.

At this point, vtgate has a cache of the SrvKeyspace object for each keyspace,
that contains the shard map. Each SrvKeyspace is specific to a cell. Vtgate
retrieves the SrvKeyspace for the current cell, and uses it to find out the
shard to use.
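The shard lookup can be sketched as follows. This is a hypothetical illustration of the idea, not the actual Vitess API: given the shard list from the cached SrvKeyspace partition, find the shard whose key range covers the Keyspace Id. The `shard` type and `shardForKeyspaceID` function are invented for this sketch.

```go
package main

import "fmt"

// shard is an illustrative stand-in for a shard reference from a
// SrvKeyspace partition: a name and a half-open [Start, End) key range.
type shard struct {
	Name       string
	Start, End string // hex-encoded keyrange bounds; "" means open-ended
}

// shardForKeyspaceID returns the shard whose key range covers keyspaceID.
func shardForKeyspaceID(shards []shard, keyspaceID string) string {
	for _, s := range shards {
		if (s.Start == "" || keyspaceID >= s.Start) &&
			(s.End == "" || keyspaceID < s.End) {
			return s.Name
		}
	}
	return "" // no covering shard found
}

func main() {
	shards := []shard{
		{Name: "-80", Start: "", End: "80"},
		{Name: "80-", Start: "80", End: ""},
	}
	fmt.Println(shardForKeyspaceID(shards, "9f")) // prints "80-"
}
```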

The SrvKeyspace object contains the following information about a Keyspace, for
a given cell:

* The served_from field: for a given tablet type (master, replica, read-only),
another keyspace is used to serve the data. This is used for vertical
resharding.
* The partitions field: for a given tablet type (master, replica, read-only),
the list of shards to use. This can change when we perform horizontal
resharding.

Both these fields are cell and tablet type specific, as when we reshard
(horizontally or vertically), we have the option to migrate serving of any type
in any cell forward or backward (with some constraints).
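The two fields can be pictured with the following simplified sketch. The real SrvKeyspace is a richer protobuf object, one per cell; the plain maps and the keyspace name below are illustrative only.

```go
package main

import "fmt"

// srvKeyspaceSketch is a simplified stand-in for the SrvKeyspace object
// described above, reduced to plain maps for illustration.
type srvKeyspaceSketch struct {
	// served_from: tablet type -> keyspace that serves the data
	// (used during vertical resharding).
	ServedFrom map[string]string
	// partitions: tablet type -> list of shards to use
	// (changes during horizontal resharding).
	Partitions map[string][]string
}

func main() {
	sk := srvKeyspaceSketch{
		ServedFrom: map[string]string{"rdonly": "old_keyspace"},
		Partitions: map[string][]string{"master": {"-80", "80-"}},
	}
	fmt.Println(sk.ServedFrom["rdonly"], sk.Partitions["master"])
}
```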

Note that the VSchema query engine uses this information in slightly different
ways than the older API, as it needs to know, for instance, whether two queries
are on the same shard.

## Tablet health

Each tablet is responsible for figuring out its own status and health. When a
tablet is considered for serving, a StreamHealth RPC is sent to the tablet. The
tablet in turn returns:

* Its keyspace, shard and tablet type.
* If it is serving or not (typically, if the query service is running).
* Realtime stats about the tablet, including the replication lag.
* The last time it received a 'TabletExternallyReparented' event (used to break
ties during reparents).
* The tablet alias of this tablet (so the source of the StreamHealth RPC can
check it is the right tablet, and not another tablet that restarted on the same
host / port, as can happen in container environments).

That information is updated on a regular basis (every health check interval, and
when something changes), and streamed back.

## Discovery module

The go/vt/discovery module provides libraries to keep track of the health of a
group of tablets.

As an input, it needs the list of tablets to watch. The TopologyWatcher object
is responsible for finding the tablets to watch. It can watch:

* All the tablets in a cell. It polls the `tablets/` directory of the
topo service in that cell to find the list.
* The tablets for a given cell / keyspace / shard. It polls the ShardReplication
object.

Tablets to watch are added to / removed from the main list kept by the
TabletStatsCache object. It contains a single map, indexed by cell, keyspace,
shard and tablet type, of all the tablets it is in contact with.

When a tablet is in the list of tablets to watch, this module maintains a
StreamHealth streaming connection to the tablet, as described in the previous
section.

This module also provides helper methods to find a tablet to route traffic
to. Since it knows the health status of all tablets for a given keyspace / shard
/ tablet type, and their replication lag, it can provide a good list of tablets.

## Vtgate Gateway interface

An important interface inside vtgate is the Gateway interface. It can send
queries to a tablet by keyspace, shard, and tablet type.

As mentioned previously, the higher levels inside vtgate can resolve queries to
a keyspace, shard and tablet type. The queries are then passed to the Gateway
implementation inside vtgate, to route them to the right tablet.

There are two implementations of the Gateway interface:

* discoveryGateway: It combines a set of TopologyWatchers (described in the
discovery section, one per cell) as a source of tablets, a HealthCheck module
to watch their health, and a TabletStatsCache to collect all the health
information. Based on this data, it can find the best tablet to use.
* l2VTGateGateway: It keeps a map of l2vtgate processes to send queries to. See
next section for more details.
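The shape of the interface can be sketched as follows. The method signature and the toy implementation are illustrative only; the real vtgate interface carries query and session state that is elided here.

```go
package main

import "fmt"

// Gateway is a minimal sketch of the routing interface described above:
// send a query to a tablet identified by keyspace, shard and tablet type.
type Gateway interface {
	Execute(keyspace, shard, tabletType, sql string) (string, error)
}

// discoveryGatewaySketch stands in for the real discoveryGateway, which
// picks a tablet from its health data; here it only reports the target.
type discoveryGatewaySketch struct{}

func (discoveryGatewaySketch) Execute(keyspace, shard, tabletType, sql string) (string, error) {
	return fmt.Sprintf("routed to a %s tablet in %s/%s", tabletType, keyspace, shard), nil
}

func main() {
	var g Gateway = discoveryGatewaySketch{}
	res, _ := g.Execute("commerce", "-80", "replica", "select 1")
	fmt.Println(res) // prints "routed to a replica tablet in commerce/-80"
}
```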

## l2vtgate

As we started increasing the number of tablets in a cell, it became clear that a
bottleneck of the system was going to be how many tablets a single vtgate
connects to. Since vtgate maintains a streaming health check connection per
tablet, the number of these connections can grow very large. It is common
for vtgate to watch tablets in other cells, to be able to find the master
tablet.

So l2vtgate came to exist, based on very similar concepts and interfaces:

* l2vtgate is an extra hop between a vtgate pool and tablets.
* An l2vtgate pool connects to a subset of tablets, so it can keep a
reasonable number of streaming health connections. Externally, it exposes the
QueryService RPC interface (that has the Target for the query, keyspace /
shard / tablet type). Internally, it uses a discoveryGateway, as usual.
* vtgate connects to l2vtgate pools (using the l2VTGateGateway instead of the
discoveryGateway). It has a map of which keyspace / shard / tablet type needs
to go to which l2vtgate pool. At this point, vtgate doesn't maintain any health
information about the tablets; it lets l2vtgate handle it.

Note that l2vtgate, as it stands, is not an ideal solution. For instance, if
there are two cells, and the master for a shard can be in either, l2vtgate
still has to watch the tablets in both cells to know where the master is.
Ideally, we'd want l2vtgate to be co-located with the tablets in a given cell,
and not go cross-cell.

# Extensions, work in progress

## Regions, cross-cell targeting

At first, the discoveryGateway was routing queries the following way:

* For master queries, just find the master tablet, whichever cell it's in, and
send the query to it. This was meant to handle cross-cell master fail-over.
* For non-master queries, route only to tablets in the local cell vtgate is in.

This turned out to be a bit limiting: if no tablet was available in the local
cell, the queries would just fail. We added the concept of a Region, and if
tablets in the same Region are available to serve, they can be used instead.
This fix, however, is not entirely correct, as it uses the SrvKeyspace of the
current cell instead of the SrvKeyspace of the other cell. Both ServedFrom and
the shard partition may be different in a different cell.
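The Region fallback can be sketched as follows. The function name and the cell-to-region mapping are hypothetical; the real code works off the discovery module's health data.

```go
package main

import "fmt"

// pickCell is an illustrative version of the fallback described above:
// prefer the local cell, otherwise fall back to another cell in the same
// region that has serving tablets.
func pickCell(localCell string, regionOf map[string]string, servingCells map[string]bool) string {
	if servingCells[localCell] {
		return localCell
	}
	for cell, serving := range servingCells {
		if serving && regionOf[cell] == regionOf[localCell] {
			return cell
		}
	}
	return "" // no serving cell in the region
}

func main() {
	regionOf := map[string]string{"cell1": "us-east", "cell2": "us-east", "cell3": "eu-west"}
	serving := map[string]bool{"cell2": true, "cell3": true}
	// cell1 has no serving tablets; cell2 is in the same region.
	fmt.Println(pickCell("cell1", regionOf, serving)) // prints "cell2"
}
```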
> **Review comment (Member):** I'm a bit confused why this particular issue is
> only a problem with the new cross-cell region routing.
>
> Specifically, doesn't the same issue occur when routing to a master in
> another cell, since the ServedFrom or the shard partitioning for a master in
> cell2 might not be propagated to the SrvKeyspace in cell1?

> **Reply (Contributor Author):** The master case is actually a bit different,
> since there is only one. We always migrate all cells at once when migrating
> the master, as it doesn't make sense to serve the master one way for one
> cell, and another way for another cell. The old / new masters are only
> configured one way, and blacklist all queries for what is not there; there
> is only one choice.
>
> For replica and rdonly though, I kinda agree it doesn't matter much to use
> the wrong shard map when going cross-cell. Unless:
>
> * There is no tablet of one kind in the other cell. Like you have both old
> and new tablets in the current cell, and use the new tablets, but the other
> cell only has old tablets.
> * The other case we ran into was no tablets for a keyspace in the local
> cell. There was no SrvKeyspace at all, so the routing failed much earlier.

> **Reply (Member):** I see -- so the problem comes up in the case where we
> migrate replicas differently in different cells.
>
> Also the case of no SrvKeyspace in the current cell if there are no tablets
> for a keyspace is definitely an issue and is worth calling out explicitly,
> thanks for reminding me. From my standpoint though, that seems like
> something we should be able to handle externally via vtctl, since it is only
> an issue when setting up the cell for the first time or adding a new
> keyspace, right?

## Percolating the cell back up

We think the correct fix is to percolate the cell back up to the vtgate routing
layer. That way, when the higher level asks for the SrvKeyspace object, it
can ask for the right one, in the right cell. For instance, we can read the
SrvKeyspace for the current cell, see if it has healthy tablets for the
shard(s) we're interested in, and if not, find another cell in the same region
that has healthy tablets.

In the l2vtgate case though, this is not as easy, as the vtgate process has no
idea what tablets are healthy. We propose to solve this by adding a healthcheck
between vtgate and l2vtgate:

* vtgate would maintain a healthcheck connection to the l2vtgate pool (a single
connection should be good enough, but we may be more aggressive and collect
more up-to-date information, so we don't depend on a single l2vtgate).
* l2vtgate would report on a regular basis which keyspace / shard / tablet type
it can serve. It would also report the minimum and maximum replication lag
(and whatever other pertinent information vtgate needs).
* l2vtgateGateway can then provide the health information back up inside vtgate.
* This would also lift the limitation previously noted about l2vtgate going
cross-cell. When a master is moved from one cell to another, the l2vtgate in
the old cell would advertise that it has lost the ability to serve the
master tablet type, and the l2vtgate in the new cell would start advertising
it. Vtgates would then know where to look.

This would also be a good time to merge the vtgate code that uses the VSchema
with the code that doesn't for SrvKeyspace access.

## Hybrid Gateway

It would be nice to re-organize the code a bit inside vtgate to allow for a
hybrid gateway, and get rid of l2vtgate altogether:

* vtgate would use the discoveryGateway to watch the tablets in the current cell
(and optionally in any other cell we still want to consider local).
* vtgate would use l2vtgateGateway to watch the tablets in a different cell.
* vtgate would expose the RPC APIs currently exposed by the l2vtgate process.

So vtgate would watch the tablets in the local cell only, but would also know
which healthy tablets are in the other cells, and be able to send queries to
them through their vtgate. The extra hop to the other cell's vtgate should be a
small latency price to pay, compared to the cost of going cross-cell already.
> **Review comment (Member):** I really like the cleanliness of this design,
> and in some environments this will likely be true, but for deployments like
> Slack is planning, I don't believe this is necessarily the case.
>
> Specifically, AWS round trip times as measured from one of our vtgates in
> us-east-1a are as follows:
>
>     tablet in us-east-1a: rtt min/avg/max/mdev = 0.100/0.117/0.133/0.012 ms
>     tablet in us-east-1b: rtt min/avg/max/mdev = 0.662/0.668/0.677/0.032 ms
>     tablet in us-east-1d: rtt min/avg/max/mdev = 0.401/0.408/0.419/0.007 ms
>     tablet in us-east-1e: rtt min/avg/max/mdev = 0.687/0.698/0.714/0.021 ms
>
> So there is a clear penalty for crossing AZs (both added latency and cost),
> but it is on the order of 4-6x, not orders of magnitude.
>
> Still, our plan is to deploy a Vitess cell per AZ for both the latency
> benefits and fault tolerance, so I'd want to make sure this design still
> allows for the ability of one vtgate to cross cells.
>
> More concretely, this design consideration is a consistency vs performance
> tradeoff -- Slack (and others) might be willing to tolerate a small amount
> of time where queries could be misrouted during a shard split, while the
> various SrvKeyspace records are being updated, if it means we don't have to
> pay an extra hop for every cross-cell query.

> **Reply (Contributor Author):** I agree, and I think the proposed design
> makes this clear: you can still go cross-cell, but you need your vtgates to
> health-check the tablets cross-cell. That can be a large number of tablets.
> Otherwise, just point at a vtgate pool somewhere else, and it will add the
> extra hop.

> **Reply (Member):** Great! Thanks for clarifying.
So queries would go one of two routes:

* client(cell1) -> vtgate(cell1) -> tablet(cell1)
* client(cell1) -> vtgate(cell1) -> vtgate(cell2) -> tablet(cell2)

If the number of tablets in a given cell is still too high for the local vtgate
pool, two or more pools can be created, each of them knowing about a subset of
the tablets. They would then forward queries to each other when addressing the
other tablet set.
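The two routes above can be sketched as a routing decision: serve locally when the target cell is the client's cell, otherwise forward through that cell's vtgate, which adds one hop. The function is illustrative only.

```go
package main

import "fmt"

// route shows the hybrid-gateway path a query would take, following the
// two routes listed above (names are illustrative).
func route(clientCell, targetCell string) string {
	if targetCell == clientCell {
		// Local target: one vtgate hop, straight to the tablet.
		return fmt.Sprintf("client(%s) -> vtgate(%s) -> tablet(%s)",
			clientCell, clientCell, targetCell)
	}
	// Remote target: forward through the remote cell's vtgate.
	return fmt.Sprintf("client(%s) -> vtgate(%s) -> vtgate(%s) -> tablet(%s)",
		clientCell, clientCell, targetCell, targetCell)
}

func main() {
	fmt.Println(route("cell1", "cell1")) // prints "client(cell1) -> vtgate(cell1) -> tablet(cell1)"
	fmt.Println(route("cell1", "cell2")) // prints "client(cell1) -> vtgate(cell1) -> vtgate(cell2) -> tablet(cell2)"
}
```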

## Config-based routing

Another possible extension would be to group all routing options for vtgate in a
configuration (and possibly distribute that configuration in the topology
service). The following parameters now exist in different places:

* vttablet has a replication delay threshold after which it reports itself
unhealthy, `-unhealthy_threshold`. It is unrelated to vtgate's
`-discovery_high_replication_lag_minimum_serving` parameter.
* vttablet has a `-degraded_threshold` parameter after which it shows as
unhealthy in its status page. It has no impact on serving, and is independent
from vtgate's `-discovery_low_replication_lag` parameter, although the user
experience is better if they match.
* vtgate also has a `-min_number_serving_vttablets` parameter, used to avoid
returning a single tablet and overwhelming it.

We also now have the capability to route to different Cells in the same
Region. Configuring when to use a different cell in corner cases is hard. If
there is only one tablet with somewhat high replication lag in the current cell,
is it better than up-to-date tablets in other cells?

A possible solution would be a configuration file that lists what to do, in
order of preference, something like:

* use local tablets if more than two with replication lag smaller than 30s.
* use remote tablets if all local tablets are above 30s lag.
* use local tablets if lag is lower than 2h, minimum 2.
* ...