How does meza compare to pandas | What readers are available
Philosophically, meza is designed around functional programming and iterators of dictionaries, whereas pandas is designed around the DataFrame object. meza is also better suited to ETL and to processing evented or streaming data, whereas pandas seems optimized for matrix transformations and linear algebra.
One advantage of meza's iterators is that you can process extremely large files without reading them into memory.
>>> import itertools as it
>>> from meza import process as pr
>>> records = it.repeat({'int': '1', 'time': '2:30', 'text': 'hi'})
>>> next(pr.cut(records, ['text']))
{'text': 'hi'}
Here I used it.repeat to simulate the output of reading a large file via one of the meza.io.read... functions. Most meza functions operate iteratively, which means you can efficiently process files that don't fit into memory. This also illustrates meza's functional API. Since there are no objects, you don't need a special records constructor. Any iterable of dicts will do just fine.
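The claim that any iterable of dicts qualifies as records can be sketched with the standard library alone. The generator below is a hypothetical stand-in: `lazy_cut` is not a meza function and says nothing about how `pr.cut` is implemented. It only shows that selecting fields from an endless stream touches one record at a time.

```python
import itertools as it


def lazy_cut(records, include):
    """Yield each record restricted to the keys in `include` (hypothetical helper)."""
    wanted = set(include)
    for record in records:
        yield {k: v for k, v in record.items() if k in wanted}


# An endless stream stands in for a file too large to hold in memory.
records = it.repeat({'int': '1', 'time': '2:30', 'text': 'hi'})

# Only three records are ever pulled from the stream.
first_three = list(it.islice(lazy_cut(records, ['text']), 3))
```

Because nothing is materialized until `islice` asks for it, memory use stays constant no matter how long the input is.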
meza supports PyPy out of the box.
The records data structure is compatible with other libraries, such as sqlalchemy's bulk insert:
>>> from meza import fntools as ft
>>> from .models import Table
# Table is a sqlalchemy.Model class
# db is a sqlalchemy database instance
>>> for data in ft.chunk(records, chunk_size):
...     db.engine.execute(Table.__table__.insert(), data)
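The same chunked-insert pattern can be tried without a sqlalchemy project. The sketch below uses the stdlib sqlite3 module and a hypothetical `chunk` helper written here for illustration (it is not meza's `ft.chunk`); the table name, columns, and batch size are likewise made up.

```python
import itertools as it
import sqlite3


def chunk(records, chunk_size):
    """Yield lists of at most `chunk_size` records (hypothetical stand-in)."""
    iterator = iter(records)
    while True:
        batch = list(it.islice(iterator, chunk_size))
        if not batch:
            return
        yield batch


# A generator of dicts plays the role of records.
records = ({'id': n, 'text': 'row %d' % n} for n in range(10))

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo (id INTEGER, text TEXT)')

batch_sizes = []
for data in chunk(records, 4):
    # Named placeholders let each dict bind directly to the statement.
    conn.executemany('INSERT INTO demo VALUES (:id, :text)', data)
    batch_sizes.append(len(data))
conn.commit()

row_count = conn.execute('SELECT COUNT(*) FROM demo').fetchone()[0]
```

Only one batch of dicts is held in memory at a time, which is the property that makes the pattern useful for large inputs.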
And since records is just an iterable, you have the power of the entire itertools module at your disposal.
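A few itertools calls applied to records might look like the following. The sample data is invented for illustration, and nothing here depends on meza.

```python
import itertools as it

# Made-up sample records, already sorted by city.
records = [
    {'city': 'Oslo', 'temp': 3},
    {'city': 'Oslo', 'temp': 5},
    {'city': 'Lima', 'temp': 19},
    {'city': 'Lima', 'temp': 21},
]

# Take the first two records without consuming the rest.
head = list(it.islice(iter(records), 2))

# Group consecutive records by city (groupby needs input sorted by the key).
max_by_city = {
    city: max(r['temp'] for r in group)
    for city, group in it.groupby(records, key=lambda r: r['city'])
}

# Concatenate two record streams into one.
combined = it.chain(records, [{'city': 'Cairo', 'temp': 30}])
total = sum(1 for _ in combined)
```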
meza supports reading and writing GeoJSON out of the box.
>>> from meza import io, convert as cv
# read a geojson file
>>> records = io.read_geojson('file.geojson')
##
# perform your data analysis / manipulation... then
##
# convert records to a geojson file-like object
>>> geojson = cv.records2geojson(records)
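To make the target format concrete, here is a rough stdlib-only sketch of turning point records into a GeoJSON FeatureCollection. It is not how `cv.records2geojson` works internally; the helper name and the `lon`/`lat` keys are assumptions chosen for the example.

```python
import json

# Hypothetical point records; the 'lon' and 'lat' keys are assumptions.
records = [
    {'id': 1, 'name': 'a', 'lon': 10.0, 'lat': 20.0},
    {'id': 2, 'name': 'b', 'lon': 30.5, 'lat': -5.25},
]


def records_to_feature_collection(records, lon='lon', lat='lat'):
    """Build a GeoJSON FeatureCollection dict from point records."""
    features = []
    for record in records:
        properties = {k: v for k, v in record.items() if k not in (lon, lat)}
        features.append({
            'type': 'Feature',
            'geometry': {
                'type': 'Point',
                # GeoJSON orders coordinates as [longitude, latitude].
                'coordinates': [record[lon], record[lat]],
            },
            'properties': properties,
        })
    return {'type': 'FeatureCollection', 'features': features}


collection = records_to_feature_collection(records)
geojson_text = json.dumps(collection)
```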
The tradeoff is that you lose the speed of pandas' vectorized operations. I imagine any heavy-duty number crunching will be much faster in pandas than in meza. However, this can be partially offset by running meza under PyPy.
So I would use pandas when you want speed or are working with matrices, and meza when you are processing streams or events; want low memory usage, GeoJSON support, PyPy compatibility, or the convenience of working with dictionaries; or just don't need the raw speed of arrays.
I'd also like to point out one area to explore if you want to squeeze out more speed: meza.convert.records2array and meza.convert.array2records. These functions convert records to and from a list of native array.array objects, so any optimization techniques you want to try should start there.
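The idea behind that conversion can be illustrated with the stdlib array module. The helpers below are hypothetical, and the column-wise dict layout is an assumption made for this sketch, not necessarily what the two meza functions produce.

```python
from array import array

records = [{'a': 1, 'b': 2.5}, {'a': 3, 'b': 4.0}]


def to_columns(records, typecodes):
    """Convert records to a dict of field name -> array.array (hypothetical layout)."""
    rows = list(records)
    return {
        name: array(code, (row[name] for row in rows))
        for name, code in typecodes.items()
    }


def to_records(columns):
    """Rebuild an iterator of dicts from column arrays of equal length."""
    names = list(columns)
    for values in zip(*(columns[name] for name in names)):
        yield dict(zip(names, values))


# 'l' is a signed long, 'd' is a double.
columns = to_columns(records, {'a': 'l', 'b': 'd'})
round_trip = list(to_records(columns))
```

Typed arrays store values compactly and contiguously, which is what makes them a reasonable starting point for numeric optimization.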
meza's available readers are outlined below:
File type | Recognized extension(s) | Default reader |
---|---|---|
Comma separated file | csv | read_csv |
dBASE/FoxBASE | dbf | read_dbf |
Fixed width file | fixed | read_fixed_fmt |
GeoJSON | geojson, geojson.json | read_geojson |
HTML table | html | read_html |
JSON | json | read_json |
Microsoft Access | mdb | read_mdb |
SQLite | sqlite | read_sqlite |
Tab separated file | tsv | read_tsv |
Microsoft Excel | xls, xlsx | read_xls |
YAML | yml, yaml | read_yaml |
Alternatively, meza provides a universal reader which will select the appropriate reader based on the file extension as specified in the above table.
>>> from io import open
>>> from meza import io
>>> records1 = io.read('path/to/file.csv')
>>> records2 = io.read('path/to/file.xls')
>>> with open('path/to/file.json', encoding='utf-8') as f:
...     records3 = io.read(f, ext='json')
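Extension-based dispatch of this kind can be sketched in a few lines. Everything below is illustrative: `read_sketch`, the two reader functions, and the registry are hypothetical names covering only two formats, and they make no claim about how meza's `io.read` is written.

```python
import csv
import io
import json
import os


def read_csv_sketch(f):
    """Yield one dict per CSV row."""
    return csv.DictReader(f)


def read_json_sketch(f):
    """Yield the items of a top-level JSON array."""
    return iter(json.load(f))


# Hypothetical two-entry registry keyed by file extension.
READERS = {'csv': read_csv_sketch, 'json': read_json_sketch}


def read_sketch(f, ext=None, path=None):
    """Choose a reader from an explicit `ext` or from the extension of `path`."""
    if ext is None:
        ext = os.path.splitext(path or '')[1].lstrip('.')
    try:
        reader = READERS[ext.lower()]
    except KeyError:
        raise ValueError('no reader registered for %r' % ext)
    return reader(f)


csv_records = list(read_sketch(io.StringIO('a,b\n1,2\n'), path='file.csv'))
json_records = list(read_sketch(io.StringIO('[{"a": 1}]'), ext='json'))
```

Passing `ext` explicitly mirrors the file-object case above, where there may be no usable filename to inspect.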
Most readers take either a file path or a file-like object as their first argument. The notable exception is read_mdb, which only accepts a file path.
File-like objects should be opened using Python's stdlib io.open. If the file is opened in binary mode, as in io.open('/path/to/file', 'rb'), be sure to pass the proper encoding if it is anything other than utf-8. For example:
>>> from io import open
>>> from meza import io
>>> with open('path/to/file.xlsx') as f:
...     records = io.read_xls(f, encoding='latin-1')
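Why the encoding matters for binary input can be seen without meza at all. In this stdlib-only sketch, the sample bytes are invented, and `io.TextIOWrapper` is simply one standard way to decode a binary stream; it is not presented as what the meza readers do internally.

```python
import io

# 'café' encoded as latin-1; these bytes are not valid utf-8.
raw = io.BytesIO('name\ncafé\n'.encode('latin-1'))

# Wrap the binary stream so its bytes are decoded with the right encoding.
text = io.TextIOWrapper(raw, encoding='latin-1')
lines = text.read().splitlines()
```

Decoding the same bytes as utf-8 raises UnicodeDecodeError, which is the failure the encoding argument is meant to prevent.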
While each reader has kwargs specific to itself, the following table outlines the most common ones.