Can't properly handle zombie processes on UNIX #428

Closed
giampaolo opened this issue May 23, 2014 · 22 comments

Comments

@giampaolo
Owner

From [email protected] on September 16, 2013 21:48:41

How to reproduce:

1. start Photoshop CS6 on a Mountain Lion OSX
2. import psutil; [x.as_dict() for x in psutil.process_iter()] # (in .py file, ipython) 



What is the expected output?  
A long list of processes and related information 



What do you see instead?  


$ python test.py
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    [x.as_dict() for x in psutil.process_iter() if x.is_running()]
  File "/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/__init__.py", line 225, in as_dict
    ret = attr()
  File "/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/__init__.py", line 414, in get_nice
    return self._platform_impl.get_process_nice()
  File "/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/_psosx.py", line 151, in wrapper
    raise NoSuchProcess(self.pid, self._process_name)
psutil._error.NoSuchProcess: process no longer exists (pid=46244)

or within iPython notebook:
[x.as_dict() for x in psutil.process_iter() if x.is_running()]
---------------------------------------------------------------------------
NoSuchProcess                             Traceback (most recent call last)
<ipython-input-108-a71c6dffe397> in <module>()
----> 1 [x.as_dict() for x in psutil.process_iter() if x.is_running()]

/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/__init__.pyc in as_dict(self, attrs, ad_value)
    223                         ret = attr(interval=0)
    224                     else:
--> 225                         ret = attr()
    226                 else:
    227                     ret = attr

/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/__init__.pyc in get_nice(self)
    412     def get_nice(self):
    413         """Get process niceness (priority)."""
--> 414         return self._platform_impl.get_process_nice()
    415 
    416     @_assert_pid_not_reused

/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/_psosx.pyc in wrapper(self, *args, **kwargs)
    149             err = sys.exc_info()[1]
    150             if err.errno == errno.ESRCH:
--> 151                 raise NoSuchProcess(self.pid, self._process_name)
    152             if err.errno in (errno.EPERM, errno.EACCES):
    153                 raise AccessDenied(self.pid, self._process_name)

NoSuchProcess: process no longer exists (pid=46243, name='adobe_licutil') 

When I close Photoshop, the error will not show up. When starting it again the 
error reappears.
An additional is_running() check within the list comprehension does not change 
a thing and running the code several times will not change the reported pid.

Original issue: http://code.google.com/p/psutil/issues/detail?id=428

@giampaolo
Owner Author

From g.rodola on September 16, 2013 13:37:52

Are you sure this doesn't happen simply because the Photoshop process 
terminates (maybe quickly)?

Does this happen with get_process_nice() only?

Are you able to isolate a test case similar to this and post the result?

try:
    p.get_nice()
except psutil.NoSuchProcess:
    print(p.is_running())


My best guess is that the process is *actually* terminated (maybe it's a 
Photoshop worker subprocess which terminates very quickly) and using 
is_running() within the list comprehension doesn't help because it's subject to 
a race condition.
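
To illustrate the race: checking is_running() and then querying the process are two separate steps, and the process can go away in between. Wrapping the query itself in try/except closes that window. A minimal sketch, where `snapshot` is a made-up helper (not part of psutil) and `gone_errors` would be psutil.NoSuchProcess in real use:

```python
def snapshot(processes, gone_errors):
    """Return as_dict() for every process that can still be queried.

    `processes` is any iterable of objects with an as_dict() method and
    `gone_errors` is the exception type (or tuple of types) that means
    "process went away". Both are parameters so that the sketch does not
    depend on a particular psutil version.
    """
    results = []
    for proc in processes:
        try:
            results.append(proc.as_dict())
        except gone_errors:
            continue  # vanished (or became unqueryable) mid-iteration
    return results


# Intended use (assumes psutil is installed):
#   import psutil
#   infos = snapshot(psutil.process_iter(), psutil.NoSuchProcess)
```

Note that this silently skips any process that raises, so it hides the symptom reported here rather than explaining it.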

@giampaolo
Owner Author

From [email protected] on September 16, 2013 13:52:18

I just tried the following script:

    import psutil

    for process in psutil.process_iter():
        print "\n\n----------------------------"
        print "process: {}".format(process)
        print process.as_dict()

Which gave me the following output (after emitting a lot of other processes of course):

    ----------------------------
    process: psutil.Process(pid=46776, name='adobe_licutil')
    Traceback (most recent call last):
      File "test.py", line 6, in <module>
        print process.as_dict()
      File "/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/__init__.py", line 225, in as_dict
        ret = attr()
      File "/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/__init__.py", line 414, in get_nice
        return self._platform_impl.get_process_nice()
      File "/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/_psosx.py", line 151, in wrapper
        raise NoSuchProcess(self.pid, self._process_name)
    psutil._error.NoSuchProcess: process no longer exists (pid=46776, name='adobe_licutil')


But the strange thing is... when I run ps:

    $ ps aux | grep adobe
    rico           46804   0.0  0.0  2432768    596 s007  R+   10:47PM   0:00.00 grep adobe
    rico           46776   0.0  0.0        0      0   ??  Z    10:42PM   0:00.00 (adobe_licutil)


The process seems to be there.


When I integrate your code, I guess it would read like this:

    $ cat test.py
    import psutil

    for process in psutil.process_iter():
        print "\n\n----------------------------"
        print "process: {}".format(process)
        try:
            process.get_nice()
        except psutil.NoSuchProcess:
            print "got NoSuchProcess Exception"
            print process.is_running()


The output is:

    ----------------------------
    process: psutil.Process(pid=46776, name='adobe_licutil')
    got NoSuchProcess Exception
    True

@giampaolo
Owner Author

From g.rodola on September 16, 2013 14:05:07

Mmm... this is weird.
Can you please try this C program and paste the output?


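    #include <stdio.h>
    #include <errno.h>
    #include <unistd.h>         /* getpid() */
    #include <sys/time.h>
    #include <sys/resource.h>

    int main(void)
    {
        int ret;

        /* getpriority() can legitimately return -1, so clear errno first */
        errno = 0;
        ret = getpriority(PRIO_PROCESS, getpid());
        /* errno is read before printf() runs, so printf() can't clobber it */
        printf("ret %i\nerrno %i\n", ret, errno);

        errno = 0;
        ret = getpriority(PRIO_PROCESS, 46776);  // adobe_licutil PID
        printf("ret %i\nerrno %i\n", ret, errno);
        return 0;
    }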



In case you don't know how to do that: save that into a file named "a.c" then 
run "gcc a.c && ./a.out" in your shell.

@giampaolo
Owner Author

From [email protected] on September 16, 2013 14:10:57

Of course!

First I put the program in place and compiled it as requested:

$ cat > a.c <<EOF
> #include <stdio.h>
> #include <errno.h>
> #include <sys/time.h>
> #include <sys/resource.h>
>
> int main()
> {
>    int ret;
>    ret = getpriority(PRIO_PROCESS, getpid());
>    printf("ret %i\n", ret);
>    printf("errno %i\n", errno);
>
>    ret = getpriority(PRIO_PROCESS, 46776);  // adobe_licutil PID
>    printf("ret %i\n", ret);
>    printf("errno %i\n", errno);
> }
> EOF
$ gcc a.c
$ ./a.out
ret 0
errno 0
ret -1
errno 3


Verification that process is still there:
$ ps aux | grep adobe
rico           46893   0.0  0.0  2432768    596 s006  S+   11:08PM   0:00.00 grep adobe
rico           46776   0.0  0.0        0      0   ??  Z    10:42PM   0:00.00 (adobe_licutil)
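
For reference, errno 3 is ESRCH, the same code that the wrapper in _psosx.py checks before raising NoSuchProcess. A small standard-library Python snippet to decode it (the value 3 is taken from the output above):

```python
import errno
import os

code = 3  # the errno printed by the C program for the zombie's PID
symbol = errno.errorcode[code]  # symbolic name of the error number
message = os.strerror(code)     # human-readable description from the OS
print(symbol, "-", message)
```

So getpriority() reports "no such process" for a PID that ps still lists as "Z".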

@giampaolo
Owner Author

From [email protected] on September 16, 2013 14:15:44

I have CS6 and Mountain Lion, so let me see if I can reproduce/debug further also.

@giampaolo
Owner Author

From [email protected] on September 16, 2013 14:24:03

Interesting, with or without CS6 open, I can reproduce an error but in my case 
it's for Google Chrome first: 

----------------------------
process: psutil.Process(pid=24798, name='Google Chrome He')
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print process.as_dict()
  File "build/bdist.macosx-10.8-intel/egg/psutil/__init__.py", line 225, in as_dict
  File "build/bdist.macosx-10.8-intel/egg/psutil/__init__.py", line 414, in get_nice
  File "build/bdist.macosx-10.8-intel/egg/psutil/_psosx.py", line 151, in wrapper
psutil._error.NoSuchProcess: process no longer exists (pid=24798, name='Google Chrome He')


user@host cs6]$ ps aux | grep 24798 | grep -v grep 
jloden         24798   0.0  0.0        0      0   ??  Z     5Sep13   0:00.00 (Google Chrome He)


Note that in both my case and rico's example above, the missing process is 
marked as "Z", a.k.a. a zombie process, which is why the NoSuchProcess 
exception is raised. I'm guessing adobe_licutil checks the license before 
launching the main UI, so it probably execs or otherwise forks into a new 
process and leaves it behind as a zombie. 

I'm not sure how to handle this "correctly" in psutil. We could possibly remove 
it from process_iter if it's a zombie process, or maybe there's another more 
elegant way to treat these. Thoughts?

@giampaolo
Owner Author

From g.rodola on September 16, 2013 14:46:09

Ah I see, a zombie process (didn't notice that).
Hmmm... what other info can you extract from that process?
It would be interesting to know if there are other methods raising NSP in the 
same manner, then we can decide what to do, although I think letting NSP bubble 
up might be legitimate.
...And it seems clear we don't have a test case for this (I'm currently looking 
into how to create a zombie process on purpose =)).
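
One way to create a zombie on purpose is to fork a child that exits immediately and not wait() on it. A minimal Python 3 sketch (POSIX only, standard library only; `create_zombie` and `pid_exists` are illustrative helpers defined here, not psutil APIs):

```python
import os
import time


def create_zombie():
    """Fork a child that exits at once; return its PID without reaping it."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, skipping cleanup handlers
    return pid


def pid_exists(pid):
    """Probe a PID with signal 0, which checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


zombie_pid = create_zombie()
time.sleep(0.1)  # crude: give the child a moment to exit
listed_before_wait = pid_exists(zombie_pid)  # the unreaped child keeps its PID
reaped_pid, _ = os.waitpid(zombie_pid, 0)    # the parent reaps the zombie
listed_after_wait = pid_exists(zombie_pid)   # the PID has been released
```

Between the fork and the waitpid() call the child should show up with state "Z" in ps, like adobe_licutil above.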

@giampaolo
Owner Author

From g.rodola on September 16, 2013 15:55:57

Replying to my own question: get_open_files() and get_num_fds() on OSX behave 
the same way.
Tried the same test on Linux and the issue simply does not exist there (a 
zombie process is queryable just fine).
I think there are 2 possibilities on the table:

#1 return bogus values
#2 let NSP exception propagate

Although I'm not particularly thrilled with the idea, I think #1 is the best 
way to go because "practicality beats purity" and we would also be consistent 
with other platforms (AFAIK OSX is the only one behaving like this, but I'll 
make sure of this and report back).

I think removing the zombie processes from process_iter() is a bad idea because 
looking for them only (e.g. in order to kill them) is a perfectly reasonable use case.

@giampaolo
Owner Author

From [email protected] on September 17, 2013 08:18:35

Is there something else I could provide to help you solve this issue?

@giampaolo
Owner Author

From g.rodola on September 17, 2013 08:20:47

No thanks. I already figured it out.
Will provide a fix and a test later today or tomorrow, then I think we're ready 
to release a new version.

@giampaolo
Owner Author

From [email protected] on September 17, 2013 08:29:06

I agree, the NSP exception isn't particularly elegant. Bogus values aren't 
appealing but it probably makes more sense to have null/empty values rather 
than unexpected exceptions.

Ideally we'd have some mechanism to easily identify the process as a zombie 
also, if you're iterating processes. Maybe a separate property (process state), 
or a bogus value that only appears for zombie procs?

@giampaolo
Owner Author

From g.rodola on September 17, 2013 08:31:46

We already have Process.status property.

@giampaolo
Owner Author

From [email protected] on September 17, 2013 08:38:25

Oops, good point, I forgot about that one. I just checked, and it is properly 
reporting STATUS_ZOMBIE for the process, at least in my example so we're all 
set there. 

Another thought - maybe we could add a filter option to process_iter() so that 
you could iterate only processes matching certain status(es). That way someone 
could easily iterate zombie processes to kill them as in your example, or 
ignore certain statuses like zombie processes if they're only interested in 
other stats.

@giampaolo
Owner Author

From g.rodola on September 17, 2013 08:47:05

It seems an unnecessary API complication to me (why not add other filter 
arguments then?).  Plus you'll only save one line (if p.status == 
STATUS_ZOMBIE) which is even better if left explicit.

@giampaolo
Owner Author

From [email protected] on September 17, 2013 09:06:23

Fair enough. You could use a list comprehension to build a list as well so 
there are other options. 

I was just thinking it might be a nice feature to have a filter option on the 
iter function (as you noted, there are other filter options besides status that 
could make sense). It seems like a lot of use cases I see on the mailing list 
and sample code snippets are iterating through processes searching for items so 
I was thinking it might be useful in a general case.

@giampaolo
Owner Author

From g.rodola on September 17, 2013 16:03:08

Yeah. Well, generally speaking I think it's better if we remain as simple and 
minimalist as possible as long as "something" is already easily implementable 
in user's code, as is the case here.
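
As a rough idea of what that user-side code could look like, here is a sketch of a status filter written as a plain generator. `filter_by_status` is a made-up name, and the sketch assumes each process object exposes status() as a method; with a psutil version where status is a property, the call would need adjusting.

```python
def filter_by_status(processes, wanted, gone_errors=()):
    """Yield the processes whose status() is one of `wanted`.

    `gone_errors` is the exception type (or tuple of types) to treat as
    "process disappeared"; the default empty tuple catches nothing.
    """
    for proc in processes:
        try:
            status = proc.status()
        except gone_errors:
            continue  # the process went away while we were iterating
        if status in wanted:
            yield proc


# Intended use (assumes psutil is installed):
#   import psutil
#   zombies = list(filter_by_status(psutil.process_iter(),
#                                   {psutil.STATUS_ZOMBIE},
#                                   psutil.NoSuchProcess))
```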

@giampaolo
Owner Author

From g.rodola on September 17, 2013 23:34:38

Update: it seems on FreeBSD we cannot instantiate a new Process instance for a 
zombie process (NSP gets raised in __init__ because we try to get process 
creation time).
This should also be fixed because of the use case I was mentioning before 
(looking for all zombie processes in order to kill them).

@giampaolo
Owner Author

From g.rodola on September 18, 2013 14:03:20

This appears to be more complicated and profound than I initially thought.

It seems FreeBSD deletes all process information after it's gone zombie as 
*all* Process methods (ppid, name, nice, cmdline, etc.) raise NSP, so it 
appears that faking return values is not a great idea after all, at least on FreeBSD.
Even Process.status will raise NSP instead of returning STATUS_ZOMBIE, which is 
strange since "ps aux" manages to show the process status somehow.

So far the only platform where a zombie process is indistinguishable from 
regular ones is Linux (Windows does not have them).
That implies life will be easier for whoever wants to filter them:

zombies = [p for p in psutil.process_iter() if p.status == psutil.STATUS_ZOMBIE]

On BSD, where this is not possible, one would have to do something (nasty) like this:

def get_zombies():
    for pid in psutil.get_pid_list():
        try:
            p = psutil.Process(pid)      
            if p.status == psutil.STATUS_ZOMBIE:  # for platforms != BSD
                yield pid 
        except psutil.NoSuchProcess:
            if psutil.pid_exists(pid):  # <-- race condition
                yield pid

In the meantime I investigated further and it appears a zombie process cannot 
be killed ('cause it's already dead !-)).
The only way to get rid of it would be making its parent call wait() against it.
Will think this through further tomorrow.
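
A small standard-library sketch of that behaviour (Python 3, POSIX only; `spawn_exited_child` is an illustrative helper): SIGKILL sent to a zombie is accepted but changes nothing, and the entry only goes away when the parent calls waitpid().

```python
import os
import signal


def spawn_exited_child(exit_code):
    """Fork a child that exits with `exit_code`; return its PID unreaped.

    The parent blocks on a pipe until the child's copy of the write end is
    closed, which happens as the child exits.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        os._exit(exit_code)  # child: exiting closes its write end
    os.close(write_fd)
    os.read(read_fd, 1)      # returns b"" at EOF, once the child has exited
    os.close(read_fd)
    return pid


child_pid = spawn_exited_child(7)
os.kill(child_pid, signal.SIGKILL)    # accepted, but nothing is left to kill
_, wait_status = os.waitpid(child_pid, 0)  # the parent reaps the zombie
exit_code = os.WEXITSTATUS(wait_status) if os.WIFEXITED(wait_status) else None
```

The child's own exit code comes back from waitpid() rather than a "killed by signal" status, which suggests the SIGKILL had no effect on it.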

@giampaolo
Owner Author

From g.rodola on September 18, 2013 14:23:19

Further update: I took a look at ps source code for FreeBSD and it seems it 
manages to get process status (and also ppid) by using kvm_getprocs() whereas 
we use sysctl(): 
https://code.google.com/p/psutil/source/browse/psutil/_psutil_bsd.c?spec=svn51a50962614e02f1426da55012f01ca8e1fd53ed&r=83165d10041d7306798dcc400df5d64a57fb58f0#63
Assuming sysctl() is faster, we might use that one first and then fall back on 
using kvm_getprocs(), at least for retrieving process status, ppid and 
creation_time (in order to ensure process uniqueness over time).

@giampaolo
Owner Author

From g.rodola on March 09, 2014 15:35:14

Bumping up priority.

Summary: Can't properly handle zombie processes on UNIX (was: NoSuchProcess exception while running Photoshop)
Labels: -Priority-Medium Priority-High OpSys-UNIX

@giampaolo
Owner Author

This still deserves some careful thinking, but one idea might be to introduce a new ZombieProcess exception to raise instead of NoSuchProcess.
Also, note that this seems to affect OSX and FreeBSD but not Linux, where a zombie process is still queryable (although the returned data is probably "faked" by the kernel).
Not sure what happens on Solaris, while Windows of course has no zombie processes at all.
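
A rough sketch of how such an exception could behave, using only the Python 3 standard library (POSIX only). The classes below are local stand-ins, not psutil's, and making ZombieProcess a subclass of NoSuchProcess is an assumption of this sketch, chosen so that existing `except NoSuchProcess` code keeps working:

```python
import os


class NoSuchProcess(Exception):
    """Local stand-in for psutil.NoSuchProcess so the sketch runs alone."""


class ZombieProcess(NoSuchProcess):
    """Hypothetical: the PID is still listed but can no longer be queried."""


def pid_exists(pid):
    """Probe a PID with signal 0, which checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def get_nice(pid):
    """Return the niceness of `pid`, telling zombies apart from dead PIDs."""
    try:
        return os.getpriority(os.PRIO_PROCESS, pid)
    except ProcessLookupError:  # ESRCH from getpriority()
        # Racy: the PID may be reaped or reused between the two calls.
        if pid_exists(pid):
            raise ZombieProcess(pid) from None
        raise NoSuchProcess(pid) from None
```

On Linux getpriority() still succeeds for a zombie, so the ZombieProcess branch would only be reached on platforms that answer ESRCH, as OSX does in the output earlier in this thread.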

@giampaolo
Owner Author

For future reference - the test case we will likely want to have should look like this (pseudocode):

pid = create_zombie()
# a zombie process should always be instantiable
p = psutil.Process(pid)
# ...and always be queryable
assert p.status() == psutil.STATUS_ZOMBIE
# ...and process_iter() must return it
assert pid in [x.pid for x in psutil.process_iter()]
# ...and of course also pids()
assert pid in psutil.pids()

giampaolo added a commit that referenced this issue Oct 31, 2014
giampaolo added a commit that referenced this issue Feb 19, 2015