Record readers are relatively low-level constructs in the MLIO API. They allow reading raw binary records from an InputStream instance.
As of today, the only publicly available record reader type is ParquetRecordReader, which reads Parquet files as memory blobs that can be passed to Apache Arrow.
RecordReader is the abstract base class for all record reader types.
Reads the next Record from the underlying InputStream. If the end of the stream is reached, returns None.
read_record()
Peeks the next Record from the underlying InputStream without consuming it. Calling read_record afterwards will return the same record.
peek_record()
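
The read and peek contract described above can be sketched as follows. `InMemoryReader` is a hypothetical stand-in, not part of MLIO, that mimics only the two documented methods so the snippet runs without the library; with MLIO installed, a real record reader would presumably be passed to `drain` instead.

```python
from collections import deque


class InMemoryReader:
    """Hypothetical stand-in that mimics read_record() and peek_record()."""

    def __init__(self, records):
        self._pending = deque(records)

    def peek_record(self):
        # Look at the next record without consuming it.
        return self._pending[0] if self._pending else None

    def read_record(self):
        # Consume and return the next record, or None at end of stream.
        return self._pending.popleft() if self._pending else None


def drain(reader):
    """Read records until read_record() signals end of stream with None."""
    records = []
    while (record := reader.read_record()) is not None:
        records.append(record)
    return records


reader = InMemoryReader([b"first", b"second"])

# Peeking does not advance the reader, so the next read returns the same record.
peeked = reader.peek_record()
assert reader.read_record() is peeked

remaining = drain(reader)
```

Checking for None explicitly, rather than relying on truthiness, keeps the loop from stopping early on a record that happens to be empty.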
All record readers are iterable and can be used in contexts such as for loops, list comprehensions, and generator expressions.
Reads Parquet files from an underlying InputStream and returns them as binary blobs via Record instances. Inherits from RecordReader.
ParquetRecordReader(strm : InputStream)
strm: The input stream from which to read the Parquet files.
This class is meant to be used with input streams that can potentially contain more than one Parquet file. For example, a SageMakerPipe data store pointing to an S3 location with more than one Parquet file should use ParquetRecordReader to extract them from the input stream.
Conventional data stores such as File instances do not need ParquetRecordReader. A data store containing only a single Parquet file can be converted directly into an Arrow file via the mlio.integ.arrow.as_arrow_file() function.
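
As a rough sketch of the multi-file case, the generator below reads each Parquet file from a pipe and hands it to Apache Arrow. It assumes that MLIO and pyarrow are installed, that SageMakerPipe takes the channel path as its argument, that the data store exposes an `open_read()` method returning an InputStream usable as a context manager, and that `as_arrow_file()` accepts a Record. None of these details are confirmed on this page, and the channel path in the usage comment is a placeholder.

```python
def iter_parquet_tables(channel_path):
    """Yield one pyarrow Table per Parquet file found in the pipe."""
    # Imported lazily so that this module loads even where MLIO is absent.
    import mlio
    import pyarrow.parquet as pq
    from mlio.integ.arrow import as_arrow_file

    pipe = mlio.SageMakerPipe(channel_path)
    with pipe.open_read() as strm:  # assumed data store API
        # Each record is expected to hold the raw bytes of one Parquet file.
        for record in mlio.ParquetRecordReader(strm):
            yield pq.read_table(as_arrow_file(record))


# Usage (placeholder path):
# for table in iter_parquet_tables("/opt/ml/input/data/train"):
#     print(table.num_rows)
```

Writing it as a generator means only one file's worth of data needs to be held in memory at a time, which suits a stream that cannot be rewound.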
Represents a binary blob containing the raw bytes of a data instance. It supports the Python Buffer protocol.
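
Because a record supports the buffer protocol, standard tools such as `memoryview` should be able to view its bytes without copying them. In the sketch below a `bytearray` stands in for a Record, since both expose a buffer; the helper name `describe_blob` is purely illustrative.

```python
def describe_blob(blob):
    """Return the size and the first four bytes of a buffer-protocol object."""
    view = memoryview(blob)  # zero-copy view over the raw bytes
    # Slicing the view copies nothing; bytes() copies only the four-byte slice.
    return view.nbytes, bytes(view[:4])


# A bytearray stands in for a Record; "PAR1" is the Parquet magic number.
stand_in = bytearray(b"PAR1 rest of the file")
size, head = describe_blob(stand_in)
```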
Gets the kind of the record, indicating whether it is a complete or a partial record.
Specifies the kind of a record.
Value | Description |
---|---|
COMPLETE | The record contains a complete data instance. |
BEGIN | The record contains the beginning of a data instance. |
MIDDLE | The record contains the middle of a data instance. |
END | The record contains the end of a data instance. |
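
One way to read this table: a data instance arrives either as a single COMPLETE record or, presumably, as a BEGIN record followed by any number of MIDDLE records and a closing END record. The sketch below reassembles such a sequence. `Kind` is a local stand-in enum that mirrors the table, not MLIO's own type, and the input is a list of hypothetical `(kind, payload)` pairs rather than real Record objects.

```python
from enum import Enum


class Kind(Enum):
    """Local stand-in mirroring the record kinds in the table."""
    COMPLETE = "complete"
    BEGIN = "begin"
    MIDDLE = "middle"
    END = "end"


def assemble(pieces):
    """Join (kind, payload) pairs into whole data instances."""
    instances = []
    parts = None  # None means no partial instance is in progress
    for kind, payload in pieces:
        if kind is Kind.COMPLETE and parts is None:
            instances.append(payload)
        elif kind is Kind.BEGIN and parts is None:
            parts = [payload]
        elif kind is Kind.MIDDLE and parts is not None:
            parts.append(payload)
        elif kind is Kind.END and parts is not None:
            parts.append(payload)
            instances.append(b"".join(parts))
            parts = None
        else:
            raise ValueError(f"unexpected {kind.name} record")
    if parts is not None:
        raise ValueError("stream ended inside a partial instance")
    return instances


instances = assemble([
    (Kind.COMPLETE, b"one"),
    (Kind.BEGIN, b"tw"),
    (Kind.END, b"o"),
])
```
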
Type | Description |
---|---|
RecordError | Thrown when the record cannot be read. Inherits from MLIOError. |
CorruptRecordReader | Thrown when the record is corrupt. Inherits from RecordError. |
CorruptRecordHeader | Thrown when the record has a corrupt header. Inherits from CorruptRecordReader. |
CorruptRecordFooter | Thrown when the record has a corrupt footer. Inherits from CorruptRecordReader. |
RecordTooLargeError | Thrown when the record is larger than a threshold. Inherits from RecordError. |
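
Since these exception types form a hierarchy, handlers are best ordered from most specific to most general. The sketch below shows that ordering with locally defined stand-in classes that copy only the inheritance stated in the table. In real code the classes would come from MLIO itself (their import path is not given here), and `read_or_skip` is a hypothetical helper.

```python
class MLIOError(Exception):
    """Stand-in for MLIO's base exception."""


class RecordError(MLIOError):
    """Stand-in: the record cannot be read."""


class RecordTooLargeError(RecordError):
    """Stand-in: the record is larger than a threshold."""


def read_or_skip(read_record):
    """Call read_record() and report how the attempt ended."""
    try:
        return "ok", read_record()
    except RecordTooLargeError as exc:
        # Most specific handler first; a broader one placed above would shadow it.
        return "too large", exc
    except RecordError as exc:
        # Catches every other record-related failure.
        return "unreadable", exc


def oversized():
    raise RecordTooLargeError("record exceeds the size limit")


status, detail = read_or_skip(oversized)
```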