Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Help Wanted] cdc: why my replication_gap so large? #402

Open
Smityz opened this issue Mar 28, 2024 · 9 comments
Open

[Help Wanted] cdc: why my replication_gap so large? #402

Smityz opened this issue Mar 28, 2024 · 9 comments

Comments

@Smityz
Copy link

Smityz commented Mar 28, 2024

image

./cdc cli  changefeed statistics
{
  "ops": 0,
  "count": 0,
  "sink_gap": "650ms",
  "replication_gap": "7250ms"
}
{
  "ops": 0,
  "count": 0,
  "sink_gap": "450ms",
  "replication_gap": "7334ms"
}
{
  "ops": 0,
  "count": 0,
  "sink_gap": "1000ms",
  "replication_gap": "7700ms"
}

I used this command to monitor my cdc job, and found the latency is large than I expected.
replication_gap is big but sink_gap is small, could this be due to slow data pulling? How can I increase its speed?

@Smityz
Copy link
Author

Smityz commented Apr 19, 2024

I noticed that there are many hard-coded parameters. After modifying some of them, I found that the latency decreased from 8 seconds to 4 seconds. Would it be possible to make these parameters configurable?
@pingyu

@pingyu
Copy link
Collaborator

pingyu commented Apr 19, 2024

I think yes. Which parameters, and why they affect the latency in your case ?

@Smityz
Copy link
Author

Smityz commented Apr 19, 2024

https://github.com/tikv/tikv/blob/v6.5.8/components/causal_ts/src/tso.rs#L70

/// Maximum available interval of TSO cache.
/// It means the duration that TSO we cache would be available despite failure
/// of PD. The longer of the value can provide better "High-Availability"
/// against PD failure, but more overhead of `TsoBatchList` & pressure to TSO
/// service.
pub(crate) const DEFAULT_TSO_BATCH_ALLOC_AHEAD_BUFFER_MS: u64 = 500;

This parameter in TiKV can cause a timestamp discrepancy with physical time, resulting in incorrect calculations, but seems no real negative effect.

https://github.com/tikv/migration/blob/main/cdc/pkg/config/kvclient.go#L27

// the safe interval to move backward resolved ts
ResolvedTsSafeInterval time.Duration `toml:"resolved-ts-safe-interval" json:"resolved-ts-safe-interval"`

This parameter have a real negative effect, but it seems related to a server side bug.

I also tried other parameters, like
sink default concurrency
and
flush ticker
in
https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go, but no use

@pingyu
Copy link
Collaborator

pingyu commented Apr 19, 2024

This parameter in TiKV can cause a timestamp discrepancy with physical time, resulting in incorrect calculations, but seems no real negative effect.

This parameter is configurable, See https://docs.pingcap.com/tidb/v6.5/tikv-configuration-file#alloc-ahead-buffer-new-in-v640.

It's indeed an issue, the cached TSO should not affect the calculation of replication gap. But currently I don't find a good way to fix it. (Maybe carry the information about the cache in resolved ts message ?)

This parameter have a real negative effect, but it seems related to a server side bug.

Emmm... I think the issue has been fixed by tikv/tikv#13520. The parameter is a work around (#196) but we forget to remove it.

I will remove it and do some tests.

@Smityz
Copy link
Author

Smityz commented Apr 19, 2024

Great! Thank you master @pingyu
But I think hard code parameters in https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go is still an issue. Perhaps we could consolidate these parameters into a single file as constant variables

@zeminzhou
Copy link
Contributor

You are right, the ResolvedTsSafeInterval is a way to workaround a bug, the perfect fix depends on this PR. If you can be sure that the tikv version you are using is greater than or equal to 6.5, you can set ResolvedTsSafeInterval to 0.

@pingyu
Copy link
Collaborator

pingyu commented Apr 19, 2024

Great! Thank you master @pingyu But I think hard code parameters in https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go is still an issue. Perhaps we could consolidate these parameters into a single file as constant variables

En ? You mean these parameters should be constant or configurable ?

@Smityz
Copy link
Author

Smityz commented Apr 19, 2024

Great! Thank you master @pingyu But I think hard code parameters in https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go is still an issue. Perhaps we could consolidate these parameters into a single file as constant variables

En ? You mean there parameters should be constant or configurable ?

There are many hard-coded numbers in the code now, which is not very friendly for some optimization operations because developers often do not realize the impact these numbers may have. If we can give these numbers some names, it will make it easier for developers to optimize them instead of reading through all the code. When we find parameters that really have a huge impact on performance, we can consider making them configurable. Before that, we can consider giving them constant names and corresponding comments to facilitate retrieval.
For example:
https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go#L47
https://github.com/tikv/migration/blob/main/cdc/cdc/kv/client.go#L1164

@pingyu
Copy link
Collaborator

pingyu commented Apr 19, 2024

Great! Thank you master @pingyu But I think hard code parameters in https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go is still an issue. Perhaps we could consolidate these parameters into a single file as constant variables

En ? You mean there parameters should be constant or configurable ?

There are many hard-coded numbers in the code now, which is not very friendly for some optimization operations because developers often do not realize the impact these numbers may have. If we can give these numbers some names, it will make it easier for developers to optimize them instead of reading through all the code. When we find parameters that really have a huge impact on performance, we can consider making them configurable. Before that, we can consider giving them constant names and corresponding comments to facilitate retrieval. For example: https://github.com/tikv/migration/blob/main/cdc/cdc/sink/tikv.go#L47 https://github.com/tikv/migration/blob/main/cdc/cdc/kv/client.go#L1164

Get it.

BTW, the concurrency is configurable by concurrency in sink URL of changefeed. See

if s, ok := opts["concurrency"]; ok {
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants