RFC: include: assume that shared file systems exist on clusters / remove include_from_node1 #22588
This seems like a step back in usability, as it increases the assumptions on the environment we're executing in. While I agree that this complexity does not need to be in base, I think we should have a package that implements it. It's very useful.
The use case I had yesterday was adding Anubis workers to a master process on the desktop. Is it correct that, with this change, it wouldn't be possible to add such remote workers?
Did that actually manage to work? We have quite a few open bugs claiming it doesn't (except in the very specific configuration where all nodes are exactly identical, in which case the added complexity, which is being removed here, happened to be idempotent and was simply unnecessary and unreliable overhead). With this change, that workflow should be much more reliable and general.
It assumes the kernel of the remote system supports some type of network file system or can otherwise copy files ahead of time from somewhere on the network. This is quite significantly less than what we assume now.
Someone will go do it if they find it to be useful. I suspect they will not. The hooks added for Pkg3 should already handle this case.
No, it assumes you know how to set one up, which is a significantly larger problem than the technical problem here. My ideal user interface for the parallel stuff would be that you point julia at an ssh server and it starts a worker there without any assumptions on the remote file system. That doesn't work right now, but I think we should strive towards that.
No, this PR assumes that if you have figured out how to get the Currently, we unavoidably also assume you have set up a correct NFS (because many packages also have file system dependencies), but because we emulate exactly one operation in Julia (include), it adds all sorts of other limitations and complications on how that NFS must be configured and what kinds of machines it can be run on.
Things mostly worked fine without a shared file system, though. Changes to the packages on the head node were indeed picked up by the workers (true point about the file system dependencies, though). I don't disagree that this has significant complexity that's probably unsuitable for base; I'm just saying the functionality was useful.
I don't think we should completely remove this without a working replacement.
What we have now emphatically does not work in many cases. This replaces it with something simpler that does work in all cases (rather than attempting to address the bugs that make the current implementation basically unusable).
requiring a shared file system breaks many, possibly most, of the cases that work right now |
You're going to have to be more specific, since given #22252, I'm unaware of any cases that work right now (except for cases which just usually happen to work, but would work much better and more reliably after this PR).
(edited top post to clarify that this PR is less dependent on the presence of a shared file system than the current situation)
It works if both master and workers are launched with
I think we should do this. I've been spending some time lately trying to run Julia code on remote nodes that don't share the file system with the master node.
It is not really practical to avoid binary dependencies so right now we don't really support running code on remote machines.
@vtjnash Needs a rebase.
remove buggy support for emulating shared file-systems from Julia: the kernel is much better at this, and can do it transparently
This badly needs news updates. Docs too, most likely.
🎉 Just tried this out and now I can actually do useful stuff on the workers. @vtjnash thanks for the fix. Would you mind writing the NEWS entry and the docs update? Most likely, I wouldn't be able to explain this correctly.
…ove include_from_node1 (JuliaLang#22588)
* include: assume that shared file systems exist for clusters
  remove buggy support for emulating shared file-systems from Julia: the kernel is much better at this, and can do it transparently
* only broadcast using/import to nodes which need it
  fix JuliaLang#12381
  fix JuliaLang#13999
Remove buggy support for emulating mounting shared file-systems from Julia: the kernel is much better at this, since it can do it transparently for all files and all purposes.
This removes all of the support for node-aware `include`, and instead assumes the local kernel is capable of mounting remote drives.
In the case where the remote nodes have the same filesystem layout, it's still feasible to share `.ji` files: simply prepend a local path to `LOAD_CACHE_PATH` on all except the first node (ref #13684 (comment)), such that each node has a separate working directory (read/write), but can also see the cache directory from the master (readonly).
resolves #22252
resolves #11093
resolves #13939
closes #12381
fixes #12381 and #13999
refs and replaces #19073
@StefanKarpinski as we talked about at JuliaCon
edit: I forgot to mention that this doesn't require the ability to mount remote drives. I only mentioned that above as an example of how to most nearly simulate the limited capabilities of the existing `include` framework. With this PR, new capabilities will also become feasible, including completely independent systems or using any arbitrary method of copying files around (tar, sftp, rsync, etc.).
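The cache-sharing arrangement mentioned in the description (each non-master node searches a private read/write directory first, then the master's cache directory as read-only) could be sketched roughly as follows. This is only an illustration: `cache_search_path`, the `worker-N` directory naming, and the example paths are hypothetical and not part of Julia or of this PR.

```julia
# Hypothetical helper (not part of Julia or this PR): build the ordered list of
# cache directories that a given node should search.
# Node 1 is the master and uses only its own cache directory. Every other node
# gets a private, writable directory first, followed by the master's cache
# directory, which it only needs to read.
function cache_search_path(node_id::Integer, master_cache::AbstractString,
                           local_root::AbstractString)
    node_id == 1 && return [String(master_cache)]
    local_cache = joinpath(local_root, "worker-$(node_id)")  # assumed naming scheme
    return [local_cache, String(master_cache)]
end

# Example with made-up paths:
paths = cache_search_path(2, "/shared/cache", "/tmp/ji")
println(paths)
```

On the Julia version this PR targets, the first entry would presumably be the path that gets prepended to `LOAD_CACHE_PATH` on each worker; that wiring is left out here because it depends on the Julia release in use.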