Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition when multiple processes try to compile a module at once #67

Closed
joshlk opened this issue Apr 20, 2022 · 5 comments · Fixed by #71
Closed

Race condition when multiple processes try to compile a module at once #67

joshlk opened this issue Apr 20, 2022 · 5 comments · Fixed by #71

Comments

@joshlk
Copy link
Collaborator

joshlk commented Apr 20, 2022

Hi,

Great package by the way!

I've encountered an issue when multiple processes are spawned that all race to compile the same module. This can also occur when multiple processes are spawned on different hosts and share the same network filesystem. Such a situation is common when distributing work between multiple processes or hosts for AI or data analytics.

Here is a demonstration (in the shell):

echo '// cppimport
#include <pybind11/pybind11.h>

namespace py = pybind11;

int square(int x) {
    return x * x;
}

PYBIND11_MODULE(somecode, m) {
    m.def("square", &square);
}
/*
<%
setup_pybind11(cfg)
%>
*/' > somecode.cpp

echo 'import cppimport.import_hook
import somecode
somecode.square(9)' > test.py

rm somecode.cpython-*

for i in {1..100}; do python3 test.py & done

On my system around 4 out of 100 processes exit in an error. The shell output includes:

error: could not delete '/localdata/joshl/sandbox/somecode.cpython-36m-x86_64-linux-gnu.so': No such file or directory
...
Exit 1                  python3 test.py
...
Bus error               (core dumped) python3 test.py

These errors don't appear when the binary already exists.


To mitigate this issue in our applications we have used a file lock so that only one process attempts to compile the module at one time. A process first checks if the binary file exists, otherwise attempts to obtain the file lock. If it can't obtain the lock it waits until either the binary exists, can obtain the file lock or times out. Here is an example how it can be done (app code):

from cppimport.checksum import is_checksum_valid

binary_path = module_data['ext_path']
lock_path = binary_path + '.lock'

t = time()

while not (os.path.exists(binary_path) and is_checksum_valid(module_data)) and time() - t < timeout:
    try:
        with FileLock(lock_path, timeout=1):
            if os.path.exists(binary_path) and is_checksum_valid(module_data_new_path):
                break
            # BUILD BINARY
            template_and_build(filepath, module_data)
    except Timeout:
        logging.debug(f'{os.getpid()}: Could not obtain lock')
        sleep(1)

if not (os.path.exists(binary_path) and is_checksum_valid(module_data_new_path)):
    raise Exception(
        f'Could not compile binary as lock already taken and timed out. Lock file will be deleted: {lock_path}')

if os.path.exists(lock_path):
    with suppress(OSError):
        os.remove(lock_path)

It would be great if we could upstream the above to cppimport to prevent the race condition errors. If you are happy with this solution I could contribute the above to the appropriate place in cppimport.

@tbenthompson
Copy link
Owner

Thanks for the interesting issue!!

Overall, I really like this and would like to implement something like this since it seems clearly useful.

The only thing that makes me hesitate is that it seems like a problem that is inherent to build systems - if you run the same build with the same build system concurrently, you are bound to run into a problem. So, why should cppimport fix a problem that essentially comes from the underlying tools? My response to this question would be: cppimport is attempting to make c++ files behave a little bit more like Python modules, while accepting that the correspondence is very leaky. So, any step to make that correspondence a little less leaky is a good thing.

Would you be interested in putting together a pull request for this? I'm generally very light touch on code reviews since I want to encourage contributions. So, it won't be an arduous process! =)

@tbenthompson
Copy link
Owner

Hi @joshlk, just wanted to check in if this is something you'd be interested in making a PR for! Are you still using your file lock solution? Any issues with that approach crop up over the last couple months?

@joshlk
Copy link
Collaborator Author

joshlk commented Jun 9, 2022

Yes it is. Im just trying to get approval from my work but it should be sorted soon

@tbenthompson
Copy link
Owner

Awesome!

@joshlk
Copy link
Collaborator Author

joshlk commented Jul 4, 2022

Sorry it's taken so long - I had to jump through a bunch of internal hoops. Here is a PR: #71

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants