-
Notifications
You must be signed in to change notification settings - Fork 208
Add ops the ability to clean orphan commands #1434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ts status TIMEOUT
|
The following files are not gofmt-ed. By commenting pkg/app/ops/orphancommandcleaner/orphancommandcleaner.go--- pkg/app/ops/orphancommandcleaner/orphancommandcleaner.go.orig
+++ pkg/app/ops/orphancommandcleaner/orphancommandcleaner.go
@@ -4,9 +4,10 @@
"context"
"time"
+ "go.uber.org/zap"
+
"github.com/pipe-cd/pipe/pkg/datastore"
"github.com/pipe-cd/pipe/pkg/model"
- "go.uber.org/zap"
)
var (
|
|
/golinter fmt |
…pipe into orphan-commands-timeout
| commandTimeOut = 7 * 24 * time.Hour | ||
| interval = 24 * time.Hour |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What values should be set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commandTimeOut = 24 * time.Hour
interval = 6 * time.Hour
|
Is there a better place to create |
cmd/pipecd/ops.go
Outdated
|
|
||
| // Starting orphan commands cleaner | ||
| cleaner := orphancommandcleaner.NewOrphanCommandCleaner(ds, t.Logger) | ||
| cleaner.Run(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be wrapped by errgroup to terminate ops when cleaner exit because of an error.
group.Go(func() error {
return cleaner.Run(ctx)
})There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, there was no need to stop ops because errors that occur in cleaner is probably caused by a network error when communicating with datastore.
Of course, it is necessary to output the error log to show that an error has occurred. However, I don't think it is necessary to stop ops because updateOrphanCommandsStatus will include the update that should have been done with the previous process if the previous process failed.
Or how about stopping ops if it fails several times in a row?
What do you think about it? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice comment. I got your points.
Agree with you that the cleaner may have some errors while executing its jobs.
But that is the problem of cleaner, not this run function.
This run should not care about what each component is doing, it just cares that they are fine or not. If any of them had a problem, ops should be stopped to notify that there was a problem from one of its components.
In other words, run should see all of its components (cleaner, http-server, insightcollector...) in the same way.
About the network problem or something that can cause the error while cleaner executing the job,
cleaner can ignore them, just log the error message, and does not return.
By that way, the ops will not be stopped by those errors and we can keep run to be nice too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the network problem or something that can cause the error while cleaner executing the job,
cleaner can ignore them, just log the error message, and does not return.
By that way, the ops will not be stopped by those errors and we can keep run to be nice too.
OK, I see 👍
| ) *OrphanCommandCleaner { | ||
| return &OrphanCommandCleaner{ | ||
| commandstore: datastore.NewCommandStore(ds), | ||
| logger: logger.Named("orphan-command-finder"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
orphan-command-cleaner
| if err := o.updateOrphanCommandsStatus(ctx); err != nil { | ||
| o.logger.Error("failed to update orphan commands", zap.Error(err)) | ||
| } | ||
| time.Sleep(interval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use time.Ticker instead of sleep.
| } | ||
| } | ||
|
|
||
| func (o *OrphanCommandCleaner) Run(ctx context.Context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: func(c * because it is a cleaner.
| } | ||
|
|
||
| func (o *OrphanCommandCleaner) updateOrphanCommandsStatus(ctx context.Context) error { | ||
| timeoutedTime := time.Now().Add(-commandTimeOut).Unix() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: timedout or just timeout
| } | ||
|
|
||
| for _, c := range commands { | ||
| o.commandstore.UpdateCommand(ctx, c.Id, func(c *model.Command) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a log to show that the command was successfully updated or not?
I think the current place is fine. |
|
@sanposhiho Hi, Thank you for your great effort. |
|
|
||
| t := time.NewTicker(interval) | ||
| for { | ||
| select { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the context.Done to exit when ops was terminated.
for {
select {
case <- ctx.Done():
return nil
case <- t.C:
...
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, thanks
| cmd.Status = model.CommandStatus_COMMAND_TIMEOUT | ||
| return nil | ||
| }); err != nil { | ||
| c.logger.Error("failed to update orphan commands", zap.Error(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: commands - > command
And can you add the command ID and Type to the log message?
| c.logger.Error("failed to update orphan commands", zap.Error(err)) | ||
| consecutiveFailuresCount++ | ||
| if consecutiveFailuresCount == maxConsecutiveFailuresCount { | ||
| return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I explained here, we can ignore this error.
Don't have to return. And we don't need the error log too because it is already logged inside the updateOrphanCommandsStatus function.
Btw, It would be nice if you can add a log to show the elapsed time of this execution.
|
@nghialv |
|
@sanposhiho Thank you very much for your great work! ❤️ |
What this PR does / why we need it:
Create orphan commands cleaner. It periodically find orphan commands and mark them with TIMEOUT status
Which issue(s) this PR fixes:
Fixes #1397
Does this PR introduce a user-facing change?: