fs.put() behaves inconsistently depending on whether the rpath already exists #1062
Comments
The behaviour you describe as suggested is not that provided by typical CLI shells, like posix |
I guess it's a separate question as to what should be done when requesting a recursive copy on a single file... |
Thanks for the response! I did try with the trailing I think even if the goal is to be consistent with
But from my tests it didn't look like this worked with I don't think a solution needs to mirror |
Perhaps you wanted |
In any case, you are quite right that we should at the very least be really explicit in the docs, so no one is surprised. But in general, we need to write out a set of tests of expected behaviour (single file, recursive on/off, remote dir exists or not) and run them against multiple remote filesystems with differing ideas about directories. |
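To make the test matrix mentioned above concrete, here is a minimal sketch that only enumerates the scenarios (single file or directory, recursive on/off, remote dir exists or not). The `Scenario` class, its field names and `build_scenarios` are hypothetical names invented for illustration; the sketch does not call fsspec and asserts nothing about what any filesystem actually does.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    """One cell of the hypothetical test matrix."""
    source_kind: str      # "file" or "directory"
    recursive: bool       # value that would be passed as recursive=
    target_exists: bool   # whether the remote directory exists beforehand

    @property
    def test_id(self) -> str:
        rec = "recursive" if self.recursive else "flat"
        tgt = "target-exists" if self.target_exists else "target-missing"
        return f"{self.source_kind}-{rec}-{tgt}"


def build_scenarios():
    """Return every combination of the three dimensions listed in the comment."""
    return [
        Scenario(kind, recursive, exists)
        for kind, recursive, exists in product(
            ("file", "directory"), (False, True), (False, True)
        )
    ]


if __name__ == "__main__":
    for scenario in build_scenarios():
        print(scenario.test_id)
```

Each scenario could then be run once per backend, with the expected listing written down explicitly for every combination.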
There are a number of other issues related to this: #820, #882, s3fs fsspec/s3fs#659. Here are the results of my investigation. This code compares what happens for command-line
import fsspec
import os
import shutil
import subprocess

def setup():
    source = '/tmp/source'
    target = '/tmp/target'
    # Clear test directories.
    shutil.rmtree(source, ignore_errors=True)
    shutil.rmtree(target, ignore_errors=True)
    # Create test directory and file.
    fs.mkdir(source)
    with open(os.path.join(source, 'a'), 'w') as f:
        f.write("Here is some text")
    # Target directory does not yet exist.
    assert not fs.isdir(target)
    return source, target

def run(msg, callback):
    source, target = setup()
    print(msg)
    # Run the same copy twice: the target is missing on loop 0 and exists on loop 1.
    for loop in range(2):
        callback(source, target)
        print(f" loop {loop}: {fs.find(target, recursive=True, withdirs=True)}")

fs = fsspec.filesystem('file')
run("cp", lambda source, target: subprocess.run(['cp', '-r', source, target]))
run("cp, slash", lambda source, target: subprocess.run(['cp', '-r', source+'/', target]))
run("rsync", lambda source, target: subprocess.run(['rsync', '-r', source, target]))
run("rsync, slash", lambda source, target: subprocess.run(['rsync', '-r', source+'/', target]))
run("scp", lambda source, target: subprocess.run(['scp', '-r', source, target]))
run("scp, slash", lambda source, target: subprocess.run(['scp', '-r', source+'/', target]))
run("fs.cp", lambda source, target: fs.cp(source, target, recursive=True))
run("fs.cp, slash", lambda source, target: fs.cp(source+'/', target, recursive=True))
run("fs.put", lambda source, target: fs.put(source, target, recursive=True))
run("fs.put, slash", lambda source, target: fs.put(source+'/', target, recursive=True))
Output is (using
Ignoring
I propose that the way forward here is to restate the assumption that The wider issue of users wanting some sort of sync functionality can be provided by the |
Thanks for looking into it. I agree with everything here. Local |
For completeness here is the demo code with
import fsspec
import os
import subprocess

def setup(remote_source, remote_target):
    # Source
    if remote_source:
        source = 'remote_source'
        afs = mem
    else:
        source = '/tmp/source'
        afs = fs
    # Clear directory.
    if afs.exists(source):
        afs.rm(source, recursive=True)
    assert not afs.isdir(source)
    afs.mkdir(source)
    with afs.open(os.path.join(source, 'a'), 'w') as f:
        f.write("Here is some text")
    # Target
    if remote_target:
        target = 'remote_target'
        afs = mem
    else:
        target = '/tmp/target'
        afs = fs
    # Clear directory.
    if afs.exists(target):
        afs.rm(target, recursive=True)
    # Confirm target directory does not yet exist.
    assert not afs.isdir(target)
    return source, target

def run(msg, callback, remote_source=False, remote_target=False):
    source, target = setup(remote_source=remote_source, remote_target=remote_target)
    # List the target with whichever filesystem it lives on.
    afs = mem if remote_target else fs
    print(msg)
    for loop in range(2):
        callback(source, target)
        print(f" loop {loop}: {afs.find(target, recursive=True, withdirs=True)}")

fs = fsspec.filesystem('file')
mem = fsspec.filesystem('memory')
run("cp", lambda source, target: subprocess.run(['cp', '-r', source, target]))
run("cp, slash", lambda source, target: subprocess.run(['cp', '-r', source+'/', target]))
run("rsync", lambda source, target: subprocess.run(['rsync', '-r', source, target]))
run("rsync, slash", lambda source, target: subprocess.run(['rsync', '-r', source+'/', target]))
run("scp", lambda source, target: subprocess.run(['scp', '-r', source, target]))
run("scp, slash", lambda source, target: subprocess.run(['scp', '-r', source+'/', target]))
run("fs.cp", lambda source, target: fs.cp(source, target, recursive=True))
run("fs.cp, slash", lambda source, target: fs.cp(source+'/', target, recursive=True))
run("fs.get", lambda source, target: fs.get(source, target, recursive=True))
run("fs.get, slash", lambda source, target: fs.get(source+'/', target, recursive=True))
run("fs.put", lambda source, target: fs.put(source, target, recursive=True))
run("fs.put, slash", lambda source, target: fs.put(source+'/', target, recursive=True))
run("mem.cp", lambda source, target: mem.cp(source, target, recursive=True), remote_source=True, remote_target=True)
run("mem.cp, slash", lambda source, target: mem.cp(source+'/', target, recursive=True), remote_source=True, remote_target=True)
run("mem.get", lambda source, target: mem.get(source, target, recursive=True), remote_source=True)
run("mem.get, slash", lambda source, target: mem.get(source+'/', target, recursive=True), remote_source=True)
run("mem.put", lambda source, target: mem.put(source, target, recursive=True), remote_target=True)
run("mem.put, slash", lambda source, target: mem.put(source+'/', target, recursive=True), remote_target=True)
output:
|
Thanks @ianthomas23 . More of the same, then! Mostly reasonable behaviour, meaning that we agree with cp/scp at least some of the time :) |
@tgaddair Returning to this, now that we have merged #1148 we have consistent behaviour for Using the latest commit (
import os
import tempfile
from pathlib import Path
import subprocess
import fsspec

fs = fsspec.filesystem("file")
rpath = "/tmp/test"
if fs.exists(rpath):
    fs.rm(rpath, recursive=True)

with tempfile.TemporaryDirectory() as tmpdir:
    Path(os.path.join(tmpdir, "foo.txt")).touch()
    source = tmpdir
    #source = tmpdir + "/"
    for loop in range(2):
        fs.put(source, rpath, recursive=True)
        #subprocess.run(['cp', '-r', source, rpath])
        print(f"loop {loop}:", fs.find(rpath, recursive=True))
the output is
as you originally observed. Replacing
If you comment out the I am going to close this issue as I don't think any other action is necessary. Feel free to reopen it if there is more that you would like to discuss. |
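The two-pass pattern used throughout this thread can be reproduced without fsspec. Below is a minimal sketch, using only the standard library, of the rule described in the original report: when the target is missing it becomes a copy of the source directory, and when it already exists the source directory is placed inside it. The helper names `cp_r_like`, `list_files` and `demo` are hypothetical, and trailing slashes on the source are simply ignored here, even though real tools differ on that point.

```python
import os
import shutil
import tempfile


def cp_r_like(source: str, target: str) -> None:
    """Hypothetical helper mimicking `cp -r source target` for a directory source."""
    if os.path.isdir(target):
        # Target exists: the source directory itself is placed inside it.
        name = os.path.basename(os.path.normpath(source))
        shutil.copytree(source, os.path.join(target, name), dirs_exist_ok=True)
    else:
        # Target missing: the target becomes a copy of the source directory.
        shutil.copytree(source, target)


def list_files(root: str) -> list:
    """Return sorted file paths under root, relative to root, using '/' separators."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


def demo() -> list:
    """Copy the same source twice and record the target listing after each pass."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "source")
        target = os.path.join(tmp, "target")
        os.mkdir(source)
        with open(os.path.join(source, "foo.txt"), "w") as f:
            f.write("Here is some text")
        for _ in range(2):
            cp_r_like(source, target)
            results.append(list_files(target))
    return results


if __name__ == "__main__":
    for loop, listing in enumerate(demo()):
        print(f"loop {loop}: {listing}")
```

The first pass leaves `foo.txt` directly under the target; the second adds `source/foo.txt` alongside it, which is the same asymmetry the report describes for a repeated `put()`.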
When calling fs.put() to upload a local directory to a remote filesystem, if the remote directory does not exist then only the local directory contents will be uploaded. However, if the remote directory does exist already, then the entire directory will be uploaded to the rpath. This makes certain sync patterns (like periodically calling put() to sync local files to the remote) difficult to implement.
Repro:
I've verified this behavior also happens when using s3fs. I assume it is true for all filesystem backends.
The expected behavior is that the file in the above should be written as /tmp/test/foo.txt for both calls. This is, for example, what happens when we wrap the fsspec filesystem with pyarrow:
Using fsspec==2022.8.2, but also happened in 2022.7.1.
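For the periodic sync pattern described above, one way to sidestep the question of whether the remote directory exists is to upload file by file, so that every file always lands at the same relative path. The following is a minimal sketch under stated assumptions, not fsspec's own method: `sync_dir`, `local_put_file` and `demo` are hypothetical names, and the transport is any callable taking `(local_path, remote_path)`. The demo uses a plain local copy as a stand-in so that it runs with the standard library alone.

```python
import os
import shutil
import tempfile


def sync_dir(put_file, local_dir: str, remote_dir: str) -> list:
    """Upload every file under local_dir to the same relative path under remote_dir.

    put_file is any callable taking (local_path, remote_path).
    """
    uploaded = []
    for dirpath, _dirnames, filenames in os.walk(local_dir):
        for name in filenames:
            lpath = os.path.join(dirpath, name)
            rel = os.path.relpath(lpath, local_dir).replace(os.sep, "/")
            rpath = remote_dir.rstrip("/") + "/" + rel
            put_file(lpath, rpath)
            uploaded.append(rpath)
    return sorted(uploaded)


def local_put_file(lpath: str, rpath: str) -> None:
    """Stand-in transport for the demo: a plain local copy that creates parents."""
    os.makedirs(os.path.dirname(rpath), exist_ok=True)
    shutil.copyfile(lpath, rpath)


def demo() -> list:
    """Sync twice and record the paths written, relative to the remote directory."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        local_dir = os.path.join(tmp, "local")
        remote_dir = os.path.join(tmp, "remote")
        os.mkdir(local_dir)
        with open(os.path.join(local_dir, "foo.txt"), "w") as f:
            f.write("data")
        for _ in range(2):
            uploaded = sync_dir(local_put_file, local_dir, remote_dir)
            results.append([os.path.relpath(p, remote_dir) for p in uploaded])
    return results


if __name__ == "__main__":
    print(demo())
```

With fsspec, a filesystem's `put_file` method could plausibly be passed as the callable. Whether parent directories are created automatically depends on the backend, so that part is an assumption to check rather than a guarantee.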