This repository has been archived by the owner on Sep 3, 2022. It is now read-only.

Add error handling to object.read_stream for reading corrupted text files from GCS #713

Open
EmersonYe opened this issue Jan 15, 2019 · 0 comments


EmersonYe commented Jan 15, 2019

```python
def read_stream(self, start_offset=0, byte_count=None):
  """Reads the content of this object as text.

  Args:
    start_offset: the start offset of bytes to read.
    byte_count: the number of bytes to read. If None, it reads to the end.
  Returns:
    The text content within the object.
  Raises:
    Exception if there was an error requesting the object's content.
  """
  try:
    return self._api.object_download(self._bucket, self._key,
                                     start_offset=start_offset,
                                     byte_count=byte_count)
  except Exception as e:
    raise e
```

If a text file in GCS contains any non-ASCII characters, calling `read_stream` on it fails with an error like the following:

```
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8e in position 54628: ordinal not in range(128)
```
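The failure mode can be reproduced outside of GCS entirely: decoding any byte above 0x7f with the `ascii` codec raises `UnicodeDecodeError`. A minimal sketch (the byte payload here is hypothetical, not taken from an actual GCS object):

```python
# UTF-8 encoded "café" -- byte 0xc3 is outside the ASCII range.
data = b"caf\xc3\xa9"

try:
    data.decode("ascii")
except UnicodeDecodeError as e:
    # e.g. 'ascii' codec can't decode byte 0xc3 in position 3
    print(e)
```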

I suggest adding an `errors` argument, like the one on Python 3's built-in `open` function. The option to ignore encoding errors or replace malformed data would make reading text files from GCS in Datalab much easier. The workaround I resorted to was to download the text locally, clean it, and re-upload it to GCS.
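To illustrate the suggestion, a minimal sketch of the decode step such an `errors` argument would control. The helper name `decode_with_errors` and the sample payload are hypothetical, not part of the datalab API; the values for `errors` are the standard codec error handlers (`'strict'`, `'ignore'`, `'replace'`) accepted by `bytes.decode`:

```python
def decode_with_errors(raw: bytes, errors: str = 'strict') -> str:
    """Decode downloaded bytes; `errors` mirrors the built-in open()."""
    return raw.decode('utf-8', errors=errors)

# Hypothetical corrupted payload: 0x8e is not valid UTF-8 on its own.
corrupted = b'good text \x8e more text'

print(decode_with_errors(corrupted, errors='ignore'))   # drops the bad byte
print(decode_with_errors(corrupted, errors='replace'))  # substitutes U+FFFD
```

With `errors='strict'` (the default) the same call raises `UnicodeDecodeError`, matching the current behavior.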

Additionally, the `download` and `read_lines` functions are not documented in the included storage documentation (http://localhost:8081/notebooks/datalab/docs/tutorials/Storage/Storage%20APIs.ipynb).
