Skip to content

Add a sema to bound the number of inflight vexec queries when getting workflows#8353

Closed
ajm188 wants to merge 1 commit intovitessio:mainfrom
tinyspeck:am_getworkflows_sema
Closed

Add a sema to bound the number of inflight vexec queries when getting workflows#8353
ajm188 wants to merge 1 commit intovitessio:mainfrom
tinyspeck:am_getworkflows_sema

Conversation

@ajm188
Copy link
Contributor

@ajm188 ajm188 commented Jun 18, 2021

Signed-off-by: Andrew Mason amason@slack-corp.com

Description

What it says in the title. Current defaults are somewhat arbitrary 1000 concurrent queries with a 50ms timeout to get a conn.

Related Issue(s)

Checklist

  • Tests were added or are not required -- n/a
  • Documentation was added or is not required -- n/a

Deployment Notes

… workflows

Signed-off-by: Andrew Mason <amason@slack-corp.com>
@ajm188 ajm188 added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Cluster management labels Jun 18, 2021
@ajm188 ajm188 requested review from doeg, rafael and setassociative June 18, 2021 18:20
@ajm188 ajm188 requested a review from deepthi as a code owner June 18, 2021 18:20
Copy link

@setassociative setassociative left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we'd have a little more information around the concurrent query utilization for normal operations workflow before setting a limit. Basically - if this is a problem we should tune to control the problem; if it's not then why introduce it?

That said I believe this is safe. If there is a larger thing we're chasing let's try to get it into an issue.

No major objections to merging this as is though -- it seems like a good pattern for endpoints that can stand throttling. (🗒️ Implicitly: GetWorkflows is not in the hot path anywhere afaik, does that vibe with your understanding)

Comment on lines 315 to +319
vx := vexec.NewVExec(req.Keyspace, "", s.ts, s.tmc)

if !s.vexecPool.AcquireContext(ctx) {
return nil, ErrVExecConnTimeout
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tioli:

func NewVExecThrottled(..., pool *sync2.Semaphore) *VExec

and then moving other vexec endpoints over to it is a tiiiiiiny bit less effort; also you don't have to remember to manage locking around each query

Comment on lines +51 to +52
vExecPoolSize = flag.Uint("workflow_server_vexec_pool_size", 1000, "maximum number of concurrent vexec queries to allow")
vExecPoolTimeout = flag.Duration("workflow_server_vexec_pool_default_timeout", time.Millisecond*50, "default timeout to wait acquiring a connection from the vexec pool. zero implies no timeout")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should update release notes with this and if we expect to roll out limits per call (imo) we should have a better approach than a flag for concurrency-per-call

@setassociative
Copy link

setassociative commented Jun 18, 2021

Edit: It was transient

From unit tests the e2e is worth looking into;

--- FAIL: TestCellAliasVreplicationWorkflow (360.09s)
--- FAIL: TestCellAliasVreplicationWorkflow/shardCustomer (300.99s)

The 300s reads like a timeout but that's adjacent to "we exhausted our pool and dead locked somehow." I don't see how that'd happen given code but 🤷. I kicked off a rerun in case it was transient.

@ajm188
Copy link
Contributor Author

ajm188 commented Jun 23, 2021

#8368 is looking like a more promising (and holistic) fix. Going to close this

@ajm188 ajm188 closed this Jun 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: Cluster management Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants