Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix streaming for servers not supporting HTTP range requests #3689

Merged
merged 3 commits into from
Feb 10, 2022

Conversation

albertvillanova
Copy link
Member

@albertvillanova albertvillanova commented Feb 8, 2022

Some servers do not support HTTP range requests, whereas this is required to stream some file formats (like ZIP).

This PR implements a workaround for those cases, by download the files locally in a temporary directory (cleaned up by the OS once the process is finished).

This PR raises custom error explaining that streaming is not possible because data host server does not support HTTP range requests.

Fix #3677.

@severo
Copy link
Collaborator

severo commented Feb 8, 2022

Does it mean that huge files might end up being downloaded? It would go against the purpose of streaming, I think. At least, this fallback should be an option that could be disabled

@albertvillanova
Copy link
Member Author

Yes, it is against the purpose of streaming, but streaming is not possible if the server does not allow HTTP range requests.

We have two options: either we download the file or we throw an error.

@severo
Copy link
Collaborator

severo commented Feb 8, 2022

I think we simply cannot fallback to downloading the file if streaming fails without the user being aware of it. Some options:

  • make the fallback optional (using an env var? or a function param)
  • use the fallback only if the dataset size is under some threshold (provided we have the data in the DatasetInfo) -> it's the option I use in datasets-preview-backend (here and here)
  • throw an exception and let the user decide what to do

@lhoestq
Copy link
Member

lhoestq commented Feb 8, 2022

IMO in general we should throw an exception and ask the user to not use streaming mode in that case.

Your second point is also interesting but I feel like it could be confusing for users sometimes: it doesn't feel natural that the streaming-ability should depend on the size of the file.

@severo
Copy link
Collaborator

severo commented Feb 8, 2022

Sure, I think we should just throw an exception

@albertvillanova
Copy link
Member Author

Current behavior is already throwing an Exception:

ValueError: Cannot seek streaming HTTP file

We could customize the exception class and/or the exception message.

@severo
Copy link
Collaborator

severo commented Feb 9, 2022

I'm not sure we really need to change anything. I opened the issue #3677 because discovery was streamable and is not anymore (according to my test suite in https://github.com/huggingface/datasets-preview-backend): I was not sure if it was due to some regression in the library, or to some change in the dataset itself.

@albertvillanova
Copy link
Member Author

I'm wondering why it worked before and it is no longer working...

@lhoestq
Copy link
Member

lhoestq commented Feb 9, 2022

We could customize the exception class and/or the exception message.

Yup a message that says that the host doesn't support streaming because it doesn't support HTTP Range requests would be useful !

@albertvillanova
Copy link
Member Author

DONE, @lhoestq.

Copy link
Collaborator

@severo severo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thanks !

@albertvillanova albertvillanova merged commit fc67b30 into master Feb 10, 2022
@albertvillanova albertvillanova deleted the fix-3677 branch February 10, 2022 16:51
albertvillanova added a commit that referenced this pull request Mar 7, 2022
* Download file locally when range request not supported

* Revert "Download file locally when range request not supported"

This reverts commit 9f90bcb.

* Raise custom error when range request not supported
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Discovery cannot be streamed anymore
3 participants