Skip to content

Conversation

@yuchen-db
Copy link
Collaborator

@yuchen-db yuchen-db commented Oct 11, 2024

The endpoint expose number of TSDBs, rollout operator will patch the sts if n_tsdb=0
related PRs:
https://github.com/databricks/universe/pull/753653
databricks/rollout-operator#7

@yuchen-db yuchen-db changed the title update downscale Oct 12, 2024
@yuchen-db yuchen-db changed the title downscale db downscale Oct 14, 2024
@yuchen-db yuchen-db changed the title db downscale Add receiver downscale endpoint Oct 14, 2024
@yuchen-db yuchen-db requested review from a team, christopherzli, hczhu-db, jnyi and yulong-db and removed request for a team October 14, 2024 23:05
toolkit_web "github.com/prometheus/exporter-toolkit/web"
"golang.org/x/net/http2"
"golang.org/x/net/http2/h2c"
"net/http"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored

httpserver.WithGracePeriod(time.Duration(*conf.httpGracePeriod)),
httpserver.WithTLSConfig(*conf.httpTLSConfig),
)
var lastDownscalePrepareTimestamp *int64 = nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this implementation assumes the pod won't restart, to preserve the state, we might need to persist it on local PVC too, any thoughts?

Copy link
Collaborator Author

@yuchen-db yuchen-db Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but it's safe to lose this state. rollout operator will only do downscale if the returned timestamp is more than 30 seconds ago. If we lose this state, it returns the current timestamp and downscale will not accidentally happen. In the worst case, a scheduled downscale will be delayed for 30 seconds.

}

func (t *MultiTSDB) GetTenants() map[string]*tenant {
return t.tenants
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This map is protected by mtx. A call site of GetTenants() can hold a pointer to the map and do reads/writes while the map is being updated by other goroutines. That'd be a data race.
It's safer to return a deep copy of the map if copying is not too expensive.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored code

func RegisterDownscale[K comparable, V any](s *Server, m map[K]V, mtx *sync.RWMutex, t *int64) {
s.mux.Handle("/-/downscale", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
mtx.RLock()
n := len(m)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't pass in a map, if only the size of map is needed. Pass in the size of the map.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the size of the map changes dynamically during handler runtime and Golang passes map by reference

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored code

@yuchen-db yuchen-db requested review from hczhu-db and jnyi October 15, 2024 18:21
@yuchen-db yuchen-db force-pushed the yuchen-db/scaledown-with-operator branch from cf878f5 to ea0d891 Compare October 22, 2024 05:51
@hczhu-db hczhu-db requested a review from a team October 24, 2024 22:53
func (t *MultiTSDB) GetTenantsLen() int {
t.mtx.RLock()
defer t.mtx.RUnlock()
return len(t.tenants)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we log how many tenants left and what are they?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add unit test?

"context"
"fmt"
"net"
"net/http"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent

Copy link
Collaborator

@jnyi jnyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, would be nice if you can add unit tests, especially after prune.

@yuchen-db yuchen-db merged commit 6cf9daa into db_main Nov 14, 2024
13 of 14 checks passed
@yuchen-db yuchen-db deleted the yuchen-db/scaledown-with-operator branch November 14, 2024 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants