Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write pidfile only if it doesn't exist #2358

Merged
merged 2 commits into from
Feb 3, 2017

Conversation

nfirvine
Copy link
Contributor

@nfirvine nfirvine commented Feb 2, 2017

And remove pidfile at end of run

I discovered a scenario where if run a second instance telegraf, it will happily write its pid to the pidfile, even if the pidfile contains a pid of a running process, and even if the second instance crashes immediately (due to a port in use for example).

This fixes it by requiring that the pidfile not exist beforehand, and deletes its pidfile when it closes regularly.

@@ -271,7 +271,7 @@ func reloadLoop(stop chan struct{}, s service.Service) {
log.Printf("I! Tags enabled: %s", c.ListTags())

if *fPidfile != "" {
f, err := os.Create(*fPidfile)
f, err := os.OpenFile(*fPidfile, os.O_EXCL|os.O_CREATE|os.O_WRONLY, 0666)
Copy link
Contributor

@phemmer phemmer Feb 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mode 0666 means that anyone on the box can write to the file. Thus if you have a service manager that is starting telegraf, and someone changes the contents of the pid file to some other system service, they can trick the service manager into signaling (and killing) the wrong service.

O_CREATE|O_EXCL also has the issue that if the pid file is ever abandoned for any reason (such as telegraf crash), telegraf will refuse to start.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0666 is what the previous one did too; that's the behaviour of Create. But I agree; 644 is better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

O_CREATE|O_EXCL also has the issue that if the pid file is ever abandoned for any reason (such as telegraf crash), telegraf will refuse to start.

The alternative is checking if the pid in the pidfile refers to a live process, and if not, reclaiming the file for ourselves. Seems like a lot of pid-fiddling, but I'm happy to go that way.

Copy link
Contributor

@phemmer phemmer Feb 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative is checking if the pid in the pidfile refers to a live process, and if not, reclaiming the file for ourselves.

The problem is that with things like PID reuse, that can't be relied upon. It might not be a telegraf process sitting on that PID. And checking that the pid exists is a little difficult to do in a good cross-platform way. The best I know of is to send a signal 0 to the PID, and check response code. But even that can be misleading as you can't signal processes that you don't have perms on. So it gets messy really fast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it gets messy really fast.

Agreed. :) I thought about this when I started, and out of lazinesscomplexity avoidance, I went the easier, "operator intervention required" route.

I'm sure there are other apps that won't boot if there's lingering lock after a crash (apt), but I don't know if that's a good pattern to follow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that the pidfile should be used as a process lock. There are quite a few references on how to handle pidfiles, none seem very canonical, this might be the best one to go with (uses O_RDWR | O_CREAT): http://www.man7.org/tlpi/code/online/dist/filelock/create_pid_file.c.html

if *fPidfile != "" {
err := os.Remove(*fPidfile)
if err != nil {
log.Fatalf("E! Unable to remove pidfile: %s", err)
Copy link
Contributor

@phemmer phemmer Feb 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really need to be fatal? We should already be shutting down, so I don't think we don't need to exit early and abort the shutdown process.
And what if the issue is that the file was already removed? We were just going to remove it anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it doesn't. Warning at best.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the pidfile removal in a defer function just after it is created

@nfirvine
Copy link
Contributor Author

nfirvine commented Feb 2, 2017

I'll give this a little more thought in the morning, going to leave it for now.

For context, I'm running telegraf under monit, using the pidfile to track. We're seeing the pidfile change mysteriously, then monit goes into a loop spawning new telegrafs, which crash because the real one already has the statsd socket open. I'm not sure yet what causes the initial change. Yes, this is stupid, and I should get a better supervisor.

I'm willing to admit I need to do more looking into the monit side. But I don't think overwriting the pidfile blindly is the best idea.

@phemmer
Copy link
Contributor

phemmer commented Feb 2, 2017

Shouldn't it just spawn 1 telegraf? The new process will write a new pid file, which monit will see and go "ok telegraf is now running since the pid file contains a valid pid".
With the change in this PR, if telegraf fails to start because the pid file exists, won't monit just start trying to spawn it over and over?

I've never used monit, so I'm just asking mostly out of curiosity.

@nfirvine
Copy link
Contributor Author

nfirvine commented Feb 2, 2017

Shouldn't it just spawn 1 telegraf? The new process will write a new pid file, which monit will see and go "ok telegraf is now running since the pid file contains a valid pid".

One would hope so, but that's not happening here. I need to look into it more, but considering I can't repro that problem, it might be a while.

If telegraf just fails to start because the pid file exists, won't monit just start trying to spawn it over and over?

Yes, but I was intending on implementing the stale pidfile checking in a launcher script. Even something really stupid like killall telegraf ; rm pidfile ; telegraf ... would work. And I wanted to keep that platform-specific stuff out the telegraf codebase.

if *fPidfile != "" {
err := os.Remove(*fPidfile)
if err != nil {
log.Fatalf("E! Unable to remove pidfile: %s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the pidfile removal in a defer function just after it is created

@@ -271,7 +271,7 @@ func reloadLoop(stop chan struct{}, s service.Service) {
log.Printf("I! Tags enabled: %s", c.ListTags())

if *fPidfile != "" {
f, err := os.Create(*fPidfile)
f, err := os.OpenFile(*fPidfile, os.O_EXCL|os.O_CREATE|os.O_WRONLY, 0666)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that the pidfile should be used as a process lock. There are quite a few references on how to handle pidfiles, none seem very canonical, this might be the best one to go with (uses O_RDWR | O_CREAT): http://www.man7.org/tlpi/code/online/dist/filelock/create_pid_file.c.html

@nfirvine
Copy link
Contributor Author

nfirvine commented Feb 2, 2017

I don't think that the pidfile should be used as a process lock. There are quite a few references on how to handle pidfiles, none seem very canonical, this might be the best one to go with (uses O_RDWR | O_CREAT): http://www.man7.org/tlpi/code/online/dist/filelock/create_pid_file.c.html

Agree. As such, I'd say the concept behind the PR is flawed. Thanks for all your comments

@nfirvine nfirvine closed this Feb 2, 2017
@sparrc
Copy link
Contributor

sparrc commented Feb 2, 2017

I'm going to reopen because I think there are two things from this that we need to fix about the current behavior:

  1. we should have a defer function to make sure the pidfile almost always gets deleted.
  2. we should create the pidfile with 0644 permissions.

@sparrc sparrc reopened this Feb 2, 2017
@sparrc sparrc added this to the 1.4.0 milestone Feb 2, 2017
Also, ensure pidfile perms are 644
@nfirvine
Copy link
Contributor Author

nfirvine commented Feb 2, 2017

I'm going to reopen because I think there are two things from this that we need to fix about the current behavior...

I've updated the PR to do this

@sparrc sparrc modified the milestones: 1.3.0, 1.4.0 Feb 2, 2017
@sparrc
Copy link
Contributor

sparrc commented Feb 3, 2017

thanks @nfirvine

@sparrc sparrc merged commit dfb4038 into influxdata:master Feb 3, 2017
@nfirvine nfirvine deleted the pidfile-improvements branch February 3, 2017 17:34
maxunt pushed a commit that referenced this pull request Jun 26, 2018
Also, ensure pidfile perms are 644
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants