
Cannot delete empty files and folders containing them #2466

Closed
sidec15 opened this issue Dec 19, 2019 · 10 comments
Labels
🪲 bug Issue is not intended behavior ⚙️ azcopy Related to AzCopy integration

@sidec15

sidec15 commented Dec 19, 2019

Storage Explorer Version: 1.11.2
Build Number: 20191217.4
Platform/OS: Ubuntu - Windows 10
Architecture: x64
Regression From: Yes, it worked in a previous version, but I don't remember which one. I think it worked before the introduction of AzCopy.

Bug Description

When I create folders with the Spark API, an empty file is created for each folder and subfolder of the hierarchy. If I try to delete these files using Azure Storage Explorer, an error occurs with the message "failed to perform remove command due to error: nothing found to remove", and the file is not removed. The same error also occurs if I try to delete a folder containing an empty file.

Steps to Reproduce

  1. Persist a Spark Dataframe in Azure Blob Storage, for example: df.write.csv("path/to/output")
  2. We will obtain a folder hierarchy like the following:
    path (<- empty file)
    path (<- directory)
    _|- to (<- empty file)
    _|- to (<- directory)
    ___|- output (<- empty file)
    ___|- output (<- directory)
    ______|- part-0000.csv (<- content of one dataframe partition)
    ______|- .... (<- content of other dataframe partitions)
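
One way to see what this hierarchy looks like at the raw blob level is to list the container flat with the Blob SDK; the marker entries show up as zero-length blobs alongside the CSV part files. A minimal sketch, assuming the Python azure-storage-blob v12 package and a placeholder connection string:

from azure.storage.blob import ContainerClient

# Placeholder connection string and container name -- adjust for your account.
container = ContainerClient.from_connection_string(
    "<storage-account-connection-string>", container_name="test")

# A flat listing shows every blob, including the zero-length directory markers.
for blob in container.list_blobs(name_starts_with="path"):
    print(f"{blob.name!r}  size={blob.size}")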

Expected Experience

I want to delete these empty files, and the folders containing them, using Azure Storage Explorer, as I could in previous versions.

Actual Experience

Empty files and the folders containing them cannot be deleted.

Additional Context

From the Azure portal it is (fortunately) possible to delete these empty files, but not the folders containing them.

@JasonYeMSFT
Contributor

Do you see any folders with consecutive / in the resource path you are trying to delete?

If the two levels (below) look exactly the same to you, it is very likely this is the case, since that is the behavior of our navigation bug with these kinds of resource paths.

path (<- empty file)
path (<- directory)
_|- to (<- empty file)
_|- to (<- directory)

Neither AzCopy nor Storage Explorer supports resources with "" (empty) names. Although this seemed to work before, the old code actually had undefined behavior for some operations on those kinds of resources. Even though the name "" is allowed, it is not recommended.

If possible, you can modify the resource path that your Spark program writes to in order to prevent this. See the Azure Storage blob naming convention documentation for more information about what the blob naming conventions are.

@sidec15
Author

sidec15 commented Dec 20, 2019

Thank you for the response, but honestly I didn't get the point. I checked the blob naming convention from the link you suggested: one blob name example is "/a/b.txt", and it states: "You can take advantage of the delimiter character when enumerating blobs". So basically it specifies that it is possible to simulate a hierarchical structure using the "/" character.
In any case, we need to create output files using this kind of pattern.
I tried to make some tests using the spark-shell. I ran the following three lines of code:

var df = spark.range(1,2000)
var output = "wasbs://[email protected]/my/test/out/path"
df.write.mode("overwrite").csv(output)

This is the simulated hierarchical structure that is created:

my (<- empty file)
my (<- dir)
_|- test (<- empty file)
_|- test (<- dir)
___|- out (<- empty file)
___|- out (<- dir)
_____|- path (<- empty file)
_____|- path (<- dir)
________|- _SUCCESS (<- default file created by Spark after a successful write operation)
________|- part-00000-37923afa-7ac0-4e61-a68e-e6385e0aa466-c000.csv (<- output of one partition)
________|- part-00001-37923afa-7ac0-4e61-a68e-e6385e0aa466-c000.csv (<- output of one partition)

I tried to delete the root dir "my" with Azure Storage Explorer, and basically it deleted only the files _SUCCESS, part-00000-37923afa-7ac0-4e61-a68e-e6385e0aa466-c000.csv, and part-00001-37923afa-7ac0-4e61-a68e-e6385e0aa466-c000.csv; the empty files (and so all the directories containing them) remained there.

I tried to use azure-cli with the following command:
az storage blob delete-batch --source test --account-name mystorageaccount --pattern my*
and everything was deleted.
I found this workaround to solve my problem, but honestly I would prefer to accomplish this operation with Azure Storage Explorer directly.
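
The same prefix-based cleanup can also be done from the Blob SDK instead of azure-cli. A minimal sketch, assuming the Python azure-storage-blob package and a placeholder connection string:

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<storage-account-connection-string>", container_name="test")

# Delete every blob whose name starts with "my", mirroring --pattern my*
for blob in container.list_blobs(name_starts_with="my"):
    container.delete_blob(blob.name)
    print("deleted", blob.name)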

@JasonYeMSFT
Contributor

Sorry for being unclear. For example (see the screenshot), suppose that in my blob container emptytest1 there are two blobs with the following full names:

  • my/test/something
  • my/

[screenshot: the two entries named "my" as shown in Storage Explorer]

Due to some improper path handling, we incorrectly show a file named my under the folder my, when it actually has an empty string as its name. For the same reason, some of our operations, such as AzCopy delete, cannot find it. The reason the azure-cli command works is that it deletes every blob whose name is prefixed by "my", which includes all those special blobs. If you are unsure about your blobs' full resource names, you can go to the Azure Portal to check whether they have such blob names.
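
If checking in the portal is inconvenient, another way to inspect the full blob names is to enumerate them with the SDK and flag names that end with "/" or contain "//" (the pattern described above). A rough sketch, assuming the Python azure-storage-blob package and a placeholder connection string:

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<storage-account-connection-string>", container_name="emptytest1")

for blob in container.list_blobs(name_starts_with="my"):
    # Names with a trailing "/" or a "//" segment produce empty-string children.
    suspicious = blob.name.endswith("/") or "//" in blob.name
    print(f"{blob.name!r}" + ("  <- would show up with an empty name" if suspicious else ""))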

@sidec15
Author

sidec15 commented Jan 3, 2020

The point here is that we must create output files with Spark using a hierarchical structure. As I explained with the previous code example, we are simply using the Spark API to create output files, passing a URL with a hierarchical structure (for example "wasbs://<container>@<account>.blob.core.windows.net/my/test/out/path"), nothing more.

@MRayermannMSFT MRayermannMSFT added the 🪲 bug Issue is not intended behavior label Jan 6, 2020
@MRayermannMSFT MRayermannMSFT added this to the 1.14.0 milestone Jan 13, 2020
@JasonYeMSFT
Contributor

We have recently discovered a problem with deleting blobs that have the metadata hdi_isFolder=True. This can occur when you use certain automated tools to upload data to your blob container.

@sdecri Could you please verify whether the affected data has this metadata set?

See #2665 for a detailed explanation.
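
One way to check for that metadata without the portal is to list the blobs with metadata included. A minimal sketch, assuming the Python azure-storage-blob package and a placeholder connection string:

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<storage-account-connection-string>", container_name="test")

# Metadata is only returned when explicitly requested via include=["metadata"].
for blob in container.list_blobs(name_starts_with="my", include=["metadata"]):
    meta = blob.metadata or {}
    if any(k.lower() == "hdi_isfolder" and v.lower() == "true" for k, v in meta.items()):
        print("directory marker:", blob.name)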

@sidec15
Author

sidec15 commented Feb 28, 2020

I can confirm that the blob representing the directory (but also the empty file described above) has the metadata hdi_isFolder=true.

@swap2919

swap2919 commented Mar 9, 2020

I am uploading blobs to storage from a Databricks notebook and observing the same behaviour. For every directory in the output path, an empty blob (a "directory marker" with metadata hdi_isfolder="true") is created.
I have tried all the available DBFS methods, but to no avail.

@MRayermannMSFT
Member

For the record, anyone blocked by this can try:

  • calling the Blob APIs yourself to delete the blobs (a sketch follows this list)
  • using whatever you originally used to create the blobs to delete them
  • removing the hdi_isfolder=true metadata from the blobs (right click -> Properties) and then deleting them (for this option, we're not 100% sure what effect deleting these blobs will have on the tool that created them, so we strongly recommend you make sure doing this is safe in the context of your tool)
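
For the first option, a minimal sketch of deleting the marker blobs directly, assuming the Python azure-storage-blob package; the connection string and prefix are placeholders, and it is worth printing the list before deleting anything:

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<storage-account-connection-string>", container_name="test")

# Delete only the zero-length directory-marker blobs under the given prefix.
for blob in container.list_blobs(name_starts_with="my", include=["metadata"]):
    meta = {k.lower(): v for k, v in (blob.metadata or {}).items()}
    if blob.size == 0 and meta.get("hdi_isfolder", "").lower() == "true":
        container.delete_blob(blob.name)
        print("deleted marker blob:", blob.name)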

@MRayermannMSFT
Member

This has been fixed in AzCopy 10.5. We'll be updating the integrated version to that in release 1.14.1. The integration update has been merged.

@MRayermannMSFT
Member

Going to reopen this and put it in 1.15. AzCopy's fix appears to only work if that file is the only thing you're deleting. We'll add that to our known issues, go ahead and ship, and pick up AzCopy's fix in the future.
