Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rmtree not working when there is directory inside (google cloud storage) #184

Closed
gadotroee opened this issue Dec 28, 2021 · 10 comments · Fixed by #185
Closed

rmtree not working when there is directory inside (google cloud storage) #184

gadotroee opened this issue Dec 28, 2021 · 10 comments · Fixed by #185

Comments

@gadotroee
Copy link

This package is really helpful to work with different cloud providers but I'm facing an issue when trying to remove directory in gcs.

I tried it even in some test script using this line (after handle the authentication, because I do get True for methods like .exists() or is_dir().

CloudPath("gs://<bucket>/<path_to_folder(prefix)>").rmtree()

and getting exception from the _remove function. ('NoneType' object has no attribute 'delete').

Am I missing something in my usage or maybe it is knows bug?
Thanks

@pjbull
Copy link
Member

pjbull commented Dec 28, 2021

@gadotroee Can you add the full stack trace and what the file structure on GCS looks like when you see the error?

We do have a test that hits rmtree that passed on the latest test run 3 days ago for GCS:

def test_file_discovery(rig):
p = rig.create_cloud_path("dir_0/file0_0.txt")
assert p.exists()
p2 = rig.create_cloud_path("dir_0/not_a_file")
assert not p2.exists()
p2.touch()
assert p2.exists()
p2.unlink()
p3 = rig.create_cloud_path("dir_0/")
assert p3.exists()
assert len(list(p3.iterdir())) == 3
assert len(list(p3.glob("**/*"))) == 3
with pytest.raises(CloudPathIsADirectoryError):
p3.unlink()
with pytest.raises(DirectoryNotEmptyError):
p3.rmdir()
p3.rmtree()
assert not p3.exists()
p4 = rig.create_cloud_path("")
assert p4.exists()
assert len(list(p4.iterdir())) == 1 # only bucket/dir_1/ should still exist
assert len(list(p4.glob("**/*"))) == 4
assert list(p4.glob("**/*")) == list(p4.rglob("*"))

It would be good to know if this test case is missing a scenario we need to capture.

Even better would be a minimally reproducible example using only cloudpathlib to create files and then reproduce the error.

@gadotroee
Copy link
Author

Hi. @pjbull - Thanks for the quick reply.
I'm happy to try and help improve this great library.

I tried simple script and it did work so I understand the issue is in my folder and the issue is that this folder has directories in it.

so here is the full stack trace and example script (using cloudpathlib to created files and then rmtree to delete it):

ipython test-cloud-path-lib.py
written file
written file
written file
written file
written file
written file
written file
written file
written file
written file
written file
written file
written file
written file
written file
written file
written file
written file
written file
written file
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/Desktop/test-cloud-path-lib.py in <module>
     15
     16
---> 17 CloudPath(os.path.join(base_dir)).rmtree()

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/cloudpathlib/cloudpath.py in rmtree(self)
    612                 f"Path {self} is a file; call unlink instead of rmtree."
    613             )
--> 614         self.client._remove(self)
    615
    616     def upload_from(

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/cloudpathlib/gs/gsclient.py in _remove(self, cloud_path)
    189             bucket = self.client.bucket(cloud_path.bucket)
    190             for blob in blobs:
--> 191                 bucket.get_blob(blob).delete()
    192         elif self._is_file_or_dir(cloud_path) == "file":
    193             bucket = self.client.bucket(cloud_path.bucket)

AttributeError: 'NoneType' object has no attribute 'delete'
import os 
from cloudpathlib import CloudPath

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path-to-json-credentials-file"
gcs_bucket_name = <bucket-name>
dir_name = 'example-dir'

base_dir = os.path.join("gs://", gcs_bucket_name, dir_name)
CloudPath(base_dir).mkdir()

for sub_folder in ["test1", "test2"]:
    for i in range(2):
        CloudPath(os.path.join(base_dir, sub_folder, f"{i}.txt")).write_bytes(str(i).encode('utf-8'))
        print("written file")

CloudPath(os.path.join(base_dir)).rmtree()

I hope this will help..

@pjbull
Copy link
Member

pjbull commented Dec 28, 2021

Thanks @gadotroee, I'll see if we can reproduce this.


A couple little things on the script that you sent:

  • You should explicitly use / as a separator, and not use os.path.join since it will not be / on Windows machines, which would cause your scripts to fail.
  • You can simplify the script by using CloudPath objects and the / operator rather than strings and joining as below:
import os
from cloudpathlib import CloudPath

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path-to-json-credentials-file"
gcs_bucket_name = <bucket-name>
dir_name = 'example-dir'

base_dir = CloudPath(f"gs://{gcs_bucket_name}/{dir_name}")
base_dir.mkdir()

for sub_folder in ["test1", "test2"]:
    for i in range(2):
        (base_dir / sub_folder / f"{i}.txt").write_text(str(i))
        print("written file")

base_dir.rmtree()

@gadotroee gadotroee changed the title rmtree not working in google cloud storage rmtree not working when there is directory inside (google cloud storage) Dec 28, 2021
@gadotroee
Copy link
Author

I tested again with this script and it is very nice improvement, but as I think the reason is that I'm trying to delete folder that has folders inside.

If I'm running the script (as is) it failed, but when I'm doing this the following little change
(base_dir / "test1").rmtree() - everything works and the folder is really deleted.

Hope this will help to understand the reason and maybe add this as a test scenario.
If there is anything else I can do to help - let me know

@pjbull
Copy link
Member

pjbull commented Dec 29, 2021

Repro'd this on both Azure and GCS. Problem is that _list_dir implementations return both directories and files, but our _remove function for both of these clients assumes only files.

@pjbull
Copy link
Member

pjbull commented Dec 29, 2021

@gadotroee this should be fixed in release 0.6.3, which just got published. You should be able to upgrade with pip install -U "cloudpathlib[gs]"

@gadotroee
Copy link
Author

gadotroee commented Dec 29, 2021

Hi @pjbull , I tried to use the new version but I think something is broken there..
when trying to import CloudPath I'm getting an error..

<ipython-input-1-c3dd5fdc0792> in <module>
----> 1 from cloudpathlib import CloudPath

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/cloudpathlib/__init__.py in <module>
      2
      3 from .anypath import AnyPath
----> 4 from .azure.azblobclient import AzureBlobClient
      5 from .azure.azblobpath import AzureBlobPath
      6 from .cloudpath import CloudPath, implementation_registry

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/cloudpathlib/azure/__init__.py in <module>
----> 1 from .azblobclient import AzureBlobClient
      2 from .azblobpath import AzureBlobPath
      3
      4 __all__ = [
      5     "AzureBlobClient",

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/cloudpathlib/azure/azblobclient.py in <module>
     19
     20 @register_client_class("azure")
---> 21 class AzureBlobClient(Client):
     22     """Client class for Azure Blob Storage which handles authentication with Azure for
     23     [`AzureBlobPath`](../azblobpath/) instances. See documentation for the

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/cloudpathlib/azure/azblobclient.py in AzureBlobClient()
     89         super().__init__(local_cache_dir=local_cache_dir)
     90
---> 91     def _get_metadata(self, cloud_path: AzureBlobPath) -> Union[BlobProperties, Dict[str, Any]]:
     92         blob = self.service_client.get_blob_client(
     93             container=cloud_path.container, blob=cloud_path.blob

NameError: name 'BlobProperties' is not defined

Is it issue in my environment or problem with version "0.6.3"?

@gadotroee
Copy link
Author

when I re-installed (with [all]) it is working (so the reason is missing package if azure..
but I did try in a "clean" environment (new python 3.7 docker container and it didn't work but then worked so I'm not sure if this was something wrong I did before or not)
anyway - installing just with [gs] extra raising this exception so probably there is a fix that need to be made here.

@pjbull
Copy link
Member

pjbull commented Dec 29, 2021

Yep, thanks @gadotroee. Looks like a small issue introduced by #177.

I've made the necessary fix and will update the releases once the test suite passes.

@pjbull
Copy link
Member

pjbull commented Dec 29, 2021

@gadotroee ok, that issue should be fixed now in release 0.6.4. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants