Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix resumable_download for fully downloaded files #1060

Merged

Conversation

flyingleafe
Copy link
Contributor

When resumable_download is called and the file is fully downloaded, the requested range starts after the file ends, which yields HTTP error 416. It seems logical that in this case the exception should not be raised, but rather Lhotse should report that the file is already downloaded.

@rouseabout
Copy link
Contributor

The file should only be considered fully downloaded WHEN the 416 error code is returned AND the file size returned in the Content-Range header matches that on local disk.

Not all servers include the Content-Range header with the 416 response, meaning that the server does not support resuming downloads.

Here is an older patch from my tree. Feel free to adapt.

diff --git a/lhotse/utils.py b/lhotse/utils.py
index 6cdb0f15..60d774af 100644
--- a/lhotse/utils.py
+++ b/lhotse/utils.py
@@ -471,7 +471,8 @@ def resumable_download(
         f.seek(file_size)
 
         # Open the URL and read the contents in chunks
-        with urllib.request.urlopen(req) as response:
+        try:
+            response = urllib.request.urlopen(req)
             chunk_size = 1024
             total_size = int(response.headers.get("content-length", 0)) + file_size
             with tqdm(
@@ -487,6 +488,10 @@ def resumable_download(
                         break
                     f.write(chunk)
                     pbar.update(len(chunk))
+        except urllib.error.HTTPError as err:
+            if err.code != 416 or err.headers.get("Content-Range", "") != f"bytes */{file_size}":
+                raise err
 
 
 def safe_extract(

@flyingleafe
Copy link
Contributor Author

@rouseabout Fair enough. If the server does not support resumable downloads though, we should not simply fail but rather try to download from scratch with a request without range headers then. Updated PR accordingly.

pzelasko
pzelasko previously approved these changes May 22, 2023
Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@pzelasko
Copy link
Collaborator

Please just fix the CI checks

@pzelasko pzelasko enabled auto-merge (squash) May 22, 2023 16:22
@pzelasko pzelasko disabled auto-merge May 22, 2023 16:23
@flyingleafe
Copy link
Contributor Author

@pzelasko only black was failing, should be ok now.

@pzelasko pzelasko enabled auto-merge (squash) May 24, 2023 11:15
@pzelasko pzelasko added this to the v1.15 milestone May 24, 2023
@pzelasko pzelasko merged commit a3e7883 into lhotse-speech:master May 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants