This repository has been archived by the owner on Sep 3, 2022. It is now read-only.

Add error handling to object.read_stream for reading corrupted text files from GCS #713

Open
EmersonYe opened this issue Jan 15, 2019 · 0 comments


EmersonYe commented Jan 15, 2019

```python
def read_stream(self, start_offset=0, byte_count=None):
  """Reads the content of this object as text.

  Args:
    start_offset: the start offset of bytes to read.
    byte_count: the number of bytes to read. If None, it reads to the end.
  Returns:
    The text content within the object.
  Raises:
    Exception if there was an error requesting the object's content.
  """
  try:
    return self._api.object_download(self._bucket, self._key,
                                     start_offset=start_offset,
                                     byte_count=byte_count)
  except Exception as e:
    raise e
```

If a text file in GCS contains any non-ASCII characters, calling `read_stream` on it fails with an error like the following:

```
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8e in position 54628: ordinal not in range(128)
```
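The failure mode can be reproduced outside of GCS entirely: decoding any byte above 0x7f with the `ascii` codec raises `UnicodeDecodeError`. A minimal sketch (the byte payload here is hypothetical, not taken from an actual GCS object):

```python
# UTF-8 encoded "café" -- byte 0xc3 is outside the ASCII range.
data = b"caf\xc3\xa9"

try:
    data.decode("ascii")
except UnicodeDecodeError as e:
    # e.g. 'ascii' codec can't decode byte 0xc3 in position 3
    print(e)
```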

I suggest adding an `errors` argument, like the one on Python 3's built-in `open` function. The option to ignore encoding errors or replace malformed data would make reading text files from GCS in Datalab much easier. The workaround I resorted to was to download the text locally, clean it, and re-upload it to GCS.
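To illustrate the suggestion, a minimal sketch of the decode step such an `errors` argument would control. The helper name `decode_with_errors` and the sample payload are hypothetical, not part of the datalab API; the values for `errors` are the standard codec error handlers (`'strict'`, `'ignore'`, `'replace'`) accepted by `bytes.decode`:

```python
def decode_with_errors(raw: bytes, errors: str = 'strict') -> str:
    """Decode downloaded bytes; `errors` mirrors the built-in open()."""
    return raw.decode('utf-8', errors=errors)

# Hypothetical corrupted payload: 0x8e is not valid UTF-8 on its own.
corrupted = b'good text \x8e more text'

print(decode_with_errors(corrupted, errors='ignore'))   # drops the bad byte
print(decode_with_errors(corrupted, errors='replace'))  # substitutes U+FFFD
```

With `errors='strict'` (the default) the same call raises `UnicodeDecodeError`, matching the current behavior.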

Additionally, the `download` and `read_lines` functions are not documented in the included storage documentation (http://localhost:8081/notebooks/datalab/docs/tutorials/Storage/Storage%20APIs.ipynb).
