DOC: Add scaling to large datasets section #28577
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| data/ | ||
| timeseries.csv | ||
| timeseries.parquet | ||
| timeseries_wide.parquet | ||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,362 @@ | ||||||
| .. _scale: | ||||||
|
|
||||||
| ************************* | ||||||
| Scaling to large datasets | ||||||
| ************************* | ||||||
|
|
||||||
| Pandas provides data structures for in-memory analytics. This makes using pandas | ||||||
| to analyze larger-than-memory datasets somewhat tricky. | ||||||
|
||||||
|
|
||||||
| This document provides a few recommendations for scaling to larger datasets. | ||||||
| It's a complement to :ref:`enhancingperf`, which focuses on speeding up analysis | ||||||
| for datasets that fit in memory. | ||||||
|
|
||||||
| But first, it's worth considering *not using pandas*. Pandas isn't the right | ||||||
| tool for all situations. If you're working with very large datasets and a tool | ||||||
| like PostgreSQL fits your needs, then you should probably be using that. | ||||||
| Assuming you want or need the expressivity and power of pandas, let's carry on. | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| import pandas as pd | ||||||
| import numpy as np | ||||||
| from pandas.util.testing import make_timeseries | ||||||
|
||||||
|
|
||||||
|
|
||||||
| Use more efficient file formats | ||||||
| ------------------------------- | ||||||
|
|
||||||
| Depending on your workload, data loading may be a bottleneck. In these cases you | ||||||
|
||||||
| might consider switching from a slow format like CSV to a faster format like | ||||||
| Parquet. Loading from a file format like Parquet will also require less memory | ||||||
| usage, letting you load larger datasets into pandas before running out of | ||||||
| memory. | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| # Make a random in-memory dataset | ||||||
| ts = make_timeseries(freq="30S", seed=0) | ||||||
| ts | ||||||
|
|
||||||
|
|
||||||
| We'll now write and read the file using CSV and parquet. | ||||||
|
|
||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| %time ts.to_csv("timeseries.csv") | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| col = "timestamp" | ||||||
| %time pd.read_csv("timeseries.csv", index_col=col, parse_dates=[col]) | ||||||
|
||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| %time ts.to_parquet("timeseries.parquet") | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| %time _ = pd.read_parquet("timeseries.parquet") | ||||||
|
|
||||||
| Notice that parquet gives much higher performance for reading and writing, both | ||||||
| in terms of speed and lower peak memory usage. See :ref:`io` for more. | ||||||
|
||||||
|
|
||||||
| Load less data | ||||||
| -------------- | ||||||
|
|
||||||
| Suppose our raw dataset on disk has many columns, but we need just a subset | ||||||
| for our analysis. To get those columns, we can either | ||||||
|
|
||||||
| 1. Load the entire dataset then select those columns. | ||||||
| 2. Just load the columns we need. | ||||||
|
|
||||||
| Loading just the columns you need can be much faster and requires less memory. | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| # make a similar dataset with many columns | ||||||
| timeseries = [ | ||||||
| make_timeseries(freq="1T", seed=i).rename(columns=lambda x: f"{x}_{i}") | ||||||
| for i in range(10) | ||||||
| ] | ||||||
| ts_wide = pd.concat(timeseries, axis=1) | ||||||
| ts_wide.head() | ||||||
| ts_wide.to_parquet("timeseries_wide.parquet") | ||||||
|
|
||||||
|
|
||||||
| Option 1 loads in all the data and then filters to what we need. | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| columns = ['id_0', 'name_0', 'x_0', 'y_0'] | ||||||
|
|
||||||
| %time _ = pd.read_parquet("timeseries_wide.parquet")[columns] | ||||||
|
|
||||||
| Option 2 only loads the columns we request. This is faster and has a lower peak | ||||||
| memory usage, since the entire dataset isn't in memory at once. | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| %time _ = pd.read_parquet("timeseries_wide.parquet", columns=columns) | ||||||
|
|
||||||
|
|
||||||
| With :func:`pandas.read_csv`, you can specify ``usecols`` to limit the columns | ||||||
| read into memory. | ||||||
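A minimal sketch of the ``usecols`` approach mentioned above. The CSV text and column names here are made up for illustration, and an in-memory buffer stands in for a file on disk:

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a wide file on disk.
csv_text = (
    "id,name,x,y,extra\n"
    "1,Alice,0.1,0.2,a\n"
    "2,Bob,0.3,0.4,b\n"
)

# Only the listed columns are parsed and kept in memory;
# "y" and "extra" are never materialized in the DataFrame.
df = pd.read_csv(io.StringIO(csv_text), usecols=["id", "name", "x"])
print(df.columns.tolist())
```

With a real file, the buffer would be replaced by a path; the ``usecols`` argument is used the same way.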
|
|
||||||
|
|
||||||
| Use efficient datatypes | ||||||
| ----------------------- | ||||||
|
|
||||||
| The default pandas data types are not the most memory efficient. This is | ||||||
| especially true for low-cardinality text data (columns with relatively few | ||||||
| unique values). By using more efficient data types you can store larger datasets | ||||||
| in memory. | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| ts.dtypes | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| ts.memory_usage(deep=True) # memory usage in bytes | ||||||
|
|
||||||
|
|
||||||
| The ``name`` column is taking up much more memory than any other. It has just a | ||||||
| few unique values, so it's a good candidate for converting to a | ||||||
| :class:`Categorical`. With a Categorical, we store each unique name once and use | ||||||
| space-efficient integers to know which specific name is used in each row. | ||||||
|
|
||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| ts2 = ts.copy() | ||||||
| ts2['name'] = ts2['name'].astype('category') | ||||||
| ts2.memory_usage(deep=True) | ||||||
|
|
||||||
| We can go a bit further and downcast the numeric columns to their smallest types | ||||||
| using :func:`pandas.to_numeric`. | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| ts2['id'] = pd.to_numeric(ts2['id'], downcast='unsigned') | ||||||
| ts2[['x', 'y']] = ts2[['x', 'y']].apply(pd.to_numeric, downcast='float') | ||||||
| ts2.dtypes | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| ts2.memory_usage(deep=True) | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| reduction = (ts2.memory_usage(deep=True).sum() | ||||||
| / ts.memory_usage(deep=True).sum()) | ||||||
| print(f"{reduction:0.2f}") | ||||||
|
|
||||||
| In all, we've reduced the in-memory footprint of this dataset to 1/5 of its | ||||||
| original size. | ||||||
|
|
||||||
| See :ref:`categorical` for more on ``Categorical`` and :ref:`basics.dtypes` | ||||||
| for an overview of all of pandas' dtypes. | ||||||
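The same dtype conversions can be tried on a small synthetic frame. This is a sketch only: the column names mirror the example above, but the data is randomly generated here and is not the ``make_timeseries`` output, so the reduction ratio it prints will differ from the one reported above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "id": rng.integers(900, 1100, size=n),
    "name": rng.choice(["Alice", "Bob", "Charlie"], size=n),
    "x": rng.uniform(-1, 1, size=n),
})
before = df.memory_usage(deep=True).sum()

small = df.copy()
# Store each unique name once, with integer codes per row.
small["name"] = small["name"].astype("category")
# Downcast the numeric columns to the smallest dtype that holds the values.
small["id"] = pd.to_numeric(small["id"], downcast="unsigned")
small["x"] = pd.to_numeric(small["x"], downcast="float")
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"{after / before:0.2f}")
```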
|
|
||||||
| Use chunking | ||||||
| ------------ | ||||||
|
|
||||||
| Some workloads can be achieved with chunking: splitting a large problem like "convert this | ||||||
| directory of CSVs to parquet" into a bunch of small problems ("convert this individual CSV | ||||||
| file into a Parquet file. Now repeat that for each file in this directory."). As long as each chunk | ||||||
|
||||||
| fits in memory, you can work with datasets that are much larger than memory. | ||||||
|
|
||||||
| .. note:: | ||||||
|
|
||||||
| Chunking works well when the operation you're performing requires zero or minimal | ||||||
| coordination between chunks. For more complicated workflows, you're better off | ||||||
| :ref:`using another library <scale.other_libraries>`. | ||||||
|
|
||||||
| Let's make a larger dataset on disk (as parquet files) that's split into chunks, | ||||||
| one per year. | ||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| import pathlib | ||||||
|
|
||||||
| N = 12 | ||||||
| starts = [f'20{i:>02d}-01-01' for i in range(N)] | ||||||
| ends = [f'20{i:>02d}-12-13' for i in range(N)] | ||||||
|
|
||||||
| pathlib.Path("data/timeseries").mkdir(parents=True, exist_ok=True) | ||||||
|
|
||||||
| for i, (start, end) in enumerate(zip(starts, ends)): | ||||||
|     ts = make_timeseries(start=start, end=end, freq='1T', seed=i) | ||||||
|     ts.to_parquet(f"data/timeseries/ts-{i}.parquet") | ||||||
|
|
||||||
| files = list(pathlib.Path("data/timeseries/").glob("ts*.parquet")) | ||||||
| files | ||||||
|
|
||||||
| Now we'll implement an out-of-core ``value_counts``. The peak memory usage of this | ||||||
|
||||||
| workflow is the single largest chunk, plus a small series storing the unique value | ||||||
| counts up to this point. | ||||||
|
|
||||||
|
|
||||||
| .. ipython:: python | ||||||
|
|
||||||
| %%time | ||||||
| counts = pd.Series(dtype=int) | ||||||
| for path in files: | ||||||
|     # Only one dataframe is in memory at a time... | ||||||
|     df = pd.read_parquet(path) | ||||||
|     # ... plus a small Series `counts`, which is updated. | ||||||
|     counts = counts.add(df['name'].value_counts(), fill_value=0) | ||||||
| counts.astype(int) | ||||||
|
Member
Is this necessary? Just seems like some cruft in here for dtype preservation. Ideally would like to keep code here at a minimum
Contributor
Author
Without it, you get a float:
In [16]: s = pd.Series(dtype=int)
In [17]: s.add(t, fill_value=0)
Out[17]:
0 1.0
1 2.0
dtype: float64
I think it'd be strange for a |
||||||
|
|
||||||
| Some readers, like :func:`pandas.read_csv`, offer parameters to control the | ||||||
|
||||||
| ``chunksize``. Manually chunking is an OK option for workflows that don't | ||||||
| require overly sophisticated operations. Some operations, like ``groupby``, are | ||||||
| much harder to do chunkwise. In these cases, you may be better off switching to a | ||||||
| different library that implements these out-of-core algorithms for you. | ||||||
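As a rough illustration of the ``chunksize`` parameter, the out-of-core ``value_counts`` pattern can be applied to a single CSV read in pieces. The CSV text and the chunk size of 2 are assumptions chosen to keep the sketch tiny; a real file would use a path and a much larger chunk size.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a large file on disk.
csv_text = "name,x\nAlice,1\nBob,2\nAlice,3\nCharlie,4\nAlice,5\nBob,6\n"

counts = pd.Series(dtype="int64")
# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Only one chunk is in memory at a time, plus the running counts.
    counts = counts.add(chunk["name"].value_counts(), fill_value=0)

# add(..., fill_value=0) produces floats, so cast back to integers.
counts = counts.astype(int)
print(counts)
```

Each chunk contributes independent partial counts, which is why this operation needs almost no coordination between chunks.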
|
|
||||||
| .. _scale.other_libraries: | ||||||
|
|
||||||
| Use other libraries | ||||||
| ------------------- | ||||||
|
|
||||||
| Pandas is just one library offering a DataFrame API. Because of its popularity, | ||||||
| pandas' API has become something of a standard that other libraries implement. | ||||||
| The pandas documentation maintains a list of libraries implementing a DataFrame API | ||||||