feat: High performance pandas integration. #24

Merged Jan 4, 2023 (149 commits)
12660f3
Testing tweak.
amunra Oct 26, 2022
bdbd283
Merge remote-tracking branch 'origin/main' into pandas_integration
amunra Oct 26, 2022
9e93993
Updated to c-questdb-client 2.1.1
amunra Oct 26, 2022
e56027c
Some progress..
amunra Oct 28, 2022
ed1658e
Merge remote-tracking branch 'origin/main' into pandas_integration
amunra Nov 1, 2022
d692098
Fixed broken check.
amunra Nov 1, 2022
a92a858
symbols validation.
amunra Nov 1, 2022
bebbc8a
Added repl command to proj script.
amunra Nov 1, 2022
ab05dec
Progress with pandas method input validation.
amunra Nov 1, 2022
dc65b2c
Moved Python IntVec to C int_vec.
amunra Nov 1, 2022
a46490a
More code to avoid python lists.
amunra Nov 2, 2022
737f970
CI fixup.
amunra Nov 2, 2022
d574486
CI fixup 2.
amunra Nov 2, 2022
b1c17b0
CI fixup 3.
amunra Nov 2, 2022
fc99054
CI fixup 4.
amunra Nov 2, 2022
3654938
CI fixup 5.
amunra Nov 2, 2022
7102ddd
Types and buffers.
amunra Nov 4, 2022
5d6d4be
Improved column index handling, actually getting buffers from numpy a…
amunra Nov 4, 2022
95a66bc
Introducing a small rust lib to convert python strings to UTF-8 witho…
amunra Nov 4, 2022
e43e564
Implemented conversions for UCS1/2/4.
amunra Nov 4, 2022
25bc49b
Renamed rust lib and wired up the build and linkage bits in setup.py
amunra Nov 4, 2022
b199db6
Auxilliary rust lib rename.
amunra Nov 5, 2022
7b5200e
Reworked API for better buffer reuse and for individual UCS1, 2, 4 fu…
amunra Nov 5, 2022
e945702
cbindgen to generate C .h and Cython .pxd headers.
amunra Nov 5, 2022
161f6ae
cbindgen fixup & including lib headers from setup.py
amunra Nov 7, 2022
5b5d58b
Wrote Cython function to invoke 'pystr-to-utf8' lib. Additional refac…
amunra Nov 7, 2022
12598c4
Transitioned all string buffer conversions to new Rust code lib.
amunra Nov 8, 2022
3544312
Made pystr_to_utf8 addresses stable.
amunra Nov 8, 2022
9fce68b
Rust str to utf8 lib fixes (but still broken - ongoing)
amunra Nov 8, 2022
9155357
Minor unicode test improvement. Transcoding works now.
amunra Nov 8, 2022
f953ad0
Rust PyStr lib tests and a few bugfixes.
amunra Nov 8, 2022
ab1566c
Updated pystr lib readme.
amunra Nov 8, 2022
d79955f
More pystr-to_utf8 tests and improvements.
amunra Nov 9, 2022
e8980d5
Added UCS-4 tests.
amunra Nov 9, 2022
644bdba
More unicode testing.
amunra Nov 10, 2022
b071edf
Fixed include for cython generation compatability.
amunra Nov 10, 2022
091e4c8
Table name columns, symbols and timestamps now work!
amunra Nov 10, 2022
29895f6
Handling null column values in strings.
amunra Nov 11, 2022
ef6735b
Added arrow C data interface type definitions.
amunra Nov 11, 2022
ead851d
Code reorg.
amunra Nov 11, 2022
d14eea9
Consolidated approach writeup and code reorg into .pxi files.
amunra Nov 14, 2022
94378a0
Undid removal of ingress.c from gitignore.
amunra Nov 14, 2022
939c50d
More writeup with final types.
amunra Nov 15, 2022
2355b3d
Categories added to write-up.
amunra Nov 15, 2022
1c14cb7
Consolidated Pandas logic into single .pxi file.
amunra Nov 15, 2022
4f7ef81
File renaming.
amunra Nov 15, 2022
de4b471
Reorganised existing logic into a sorted array of col_t types. Some p…
amunra Nov 16, 2022
ef2424d
Documented float16, added col_source_t.
amunra Nov 16, 2022
993dd5f
Beginning to resolve columns.
amunra Nov 17, 2022
8387209
More array extraction logic.
amunra Nov 18, 2022
5e5158f
Updating of types, updating tech doc for timezone timestamps.
amunra Nov 21, 2022
2a26d8f
Fixed up most cython build issues. Mostly enum usage issues.
amunra Nov 21, 2022
e9890a6
Code builds again finally.
amunra Nov 21, 2022
7e75260
Dead code removal.
amunra Nov 21, 2022
0e67480
Types to dispatch codes to functions.
amunra Nov 21, 2022
41d3e6b
Some test fixup
amunra Nov 22, 2022
7a1c0b0
Yay, segfault!
amunra Nov 22, 2022
3af4899
Fixed a few segfaults, got some more.
amunra Nov 22, 2022
9dc17a3
Fixed segfaults.
amunra Nov 22, 2022
9665d9f
Added missing dispatch codes and lots of TODOs.
amunra Nov 22, 2022
194ad25
Got rid of a lot of INCREF/DECREF silliness.
amunra Nov 23, 2022
9ea4a0e
Ohh look. Tests pass again.
amunra Nov 23, 2022
7c03446
Fixed another segfault.
amunra Nov 23, 2022
80748ad
Another bug bites the dust.
amunra Nov 23, 2022
01d8732
Implemented symbols='auto' and i32 column support.
amunra Nov 23, 2022
579e649
Swapped out error prone 'bint / except False' declarations with 'void…
amunra Nov 23, 2022
df823b7
More string trouble.
amunra Nov 24, 2022
cb97a20
Normality restored.
amunra Nov 24, 2022
5c1aaff
py obj to symbols.
amunra Nov 24, 2022
dc5795f
Done timestamp at and columns. Found out that timezone timestamps are…
amunra Nov 24, 2022
b7cfd50
Added some testing notes.
amunra Nov 24, 2022
12c6eec
TODO fixup.
amunra Nov 24, 2022
8593b75
Added support for datetimes with timezones (only nanosecond based) vi…
amunra Nov 25, 2022
7f6503d
Bool column support from Python objects.
amunra Nov 25, 2022
25c91be
Added arrow-based boolean pandas datatype column support.
amunra Nov 25, 2022
8b31010
Support for arrow integer columns.
amunra Nov 25, 2022
6cad41b
Progress handling strings.
amunra Nov 28, 2022
1499cff
Support for objects with integers.
amunra Nov 28, 2022
0e9e287
Float object support.
amunra Nov 28, 2022
a1e7d07
arrow f32 and f64
amunra Nov 28, 2022
898b157
str column pyarrow.
amunra Nov 28, 2022
88e618a
LTO, basic perf tests, removed debug logging, fixed a bug in string c…
amunra Nov 30, 2022
0158fad
Tests for categories.
amunra Nov 30, 2022
fd01ef3
Releasing and reacquiring GIL to avoid starving other threads.
amunra Nov 30, 2022
dfdd302
Fully releasing GIL whenever possible. This was fiddly to get working.
amunra Nov 30, 2022
5041d26
Refactoring out benchmarks, refactoring Py str to UTF8 rust impl.
amunra Dec 1, 2022
e4135d0
8% perf improvements in Python string to UTF-8 conversions.
amunra Dec 1, 2022
754534c
Multithreading benchmark.
amunra Dec 1, 2022
5915a00
Implemented column (arrow and pybuffer) cleanup.
amunra Dec 1, 2022
9b75a12
Formatting.
amunra Dec 1, 2022
7f8dab4
Tested all-nulls column is altogether skipped.
amunra Dec 1, 2022
21f39f8
Refactoring and sorting columns in C.
amunra Dec 1, 2022
f28903e
Updated c-questdb-client submodule: Latest perf improvements.
amunra Dec 2, 2022
8b6652a
Fixed broken build.
amunra Dec 2, 2022
cba2aaf
Single logic to infer object column types.
amunra Dec 2, 2022
e07bb42
Tests fixup.
amunra Dec 2, 2022
bd4bca6
More tests.
amunra Dec 2, 2022
7c734b5
Fixed a bug passing None in datetime columns.
amunra Dec 2, 2022
480343d
Tests for degenerate pandas dataframes.
amunra Dec 2, 2022
b1f2ebf
Informative message for row of nulls.
amunra Dec 2, 2022
9811007
Mandating pyarrow dependency for pandas functionality.
amunra Dec 3, 2022
c03c4ed
There's a chance this will fix CI.
amunra Dec 5, 2022
b1a4dc7
Second attempt to fix up the CI.
amunra Dec 5, 2022
8b3e45d
Third attempt to fix up the CI.
amunra Dec 5, 2022
15330b0
Reduced stack size in case of errors to aid legibility.
amunra Dec 5, 2022
cee6d4b
Fourth attempt to fix up the CI.
amunra Dec 5, 2022
5668b57
Fifth attempt to fix up the CI.
amunra Dec 5, 2022
77a612c
Sixth attempt to fix up the CI.
amunra Dec 5, 2022
57ae0b8
Progress on API docs.
amunra Dec 6, 2022
a2763f9
Found and fixed a memory leak.
amunra Dec 7, 2022
960cd74
More fuzzing.
amunra Dec 7, 2022
5998a1c
Added support from taking the table name from the df.index.name, rena…
amunra Dec 8, 2022
be0407c
General fixes and testcases for handling timestamps.
amunra Dec 12, 2022
24ee3cd
Should fix tests in CI.
amunra Dec 12, 2022
e8d8daa
Extra testing of 'TimestampXXX.now()' and hopefully fixing CI.
amunra Dec 12, 2022
5f8e8ee
CI fixup attempt.
amunra Dec 12, 2022
6b77da9
Fixing broken 32-bit binaries.
amunra Dec 12, 2022
aaf7e95
Slimmed down 'col_t' type.
amunra Dec 13, 2022
b9b2081
Implemented (but not yet tested) pandas auto-flush logic. Also releas…
amunra Dec 13, 2022
dcccabd
Tweak to pandas auto-flush logic.
amunra Dec 13, 2022
98a5496
Basic pandas end-to-end test.
amunra Dec 13, 2022
02d49dd
Tests (and bugfixes) for panda's auto-flush.
amunra Dec 13, 2022
1801f10
Pandas API docs.
amunra Dec 14, 2022
45aa14b
Renamed '.pandas()' to '.dataframe()'.
amunra Dec 14, 2022
ada3ac8
Int object int64 bounds check tests.
amunra Dec 15, 2022
89c50b4
Test strided numpy array with zero-copy into pandas.
amunra Dec 15, 2022
64f14fa
Serializing subset of dataframe rows.
amunra Dec 15, 2022
def3887
Improved error messaging.
amunra Dec 15, 2022
81d6cb8
Testing chunked arrow arrays.
amunra Dec 15, 2022
83a937a
Removed completed TODOs
amunra Dec 15, 2022
712ec1d
Hopefully fixing CI.
amunra Dec 15, 2022
88b043e
Dataframe API doc fixup.
amunra Dec 15, 2022
58de10c
Fixing the CI
amunra Dec 15, 2022
b461557
Parquet rountrip test.
amunra Dec 16, 2022
9625447
Added missing libs in dev_requirements.txt
amunra Dec 28, 2022
02da96b
CI fixup (hopefully)
amunra Dec 28, 2022
e26f5fe
CI fixup (hopefully, again)
amunra Dec 28, 2022
6dd6cf6
CI fixup (once more, with feeling)
amunra Dec 28, 2022
67cedd9
More examples.
amunra Dec 29, 2022
0c7b6ef
Parquet data example.
amunra Dec 30, 2022
cd97af2
Updated parquet example, added to docs.
amunra Jan 2, 2023
ab69e9c
Updated examples manifest to hint at more examples for Pandas datafra…
amunra Jan 2, 2023
3af8c85
Disabled bytecode file gen for install_rust.py
amunra Jan 2, 2023
25d4e2b
Updated CHANGELOG.rst
amunra Jan 3, 2023
39dc427
Minor error reporting bugfix.
amunra Jan 4, 2023
7818149
Improved docs.
amunra Jan 4, 2023
46999e7
Updated c-questdb-client dependency.
amunra Jan 4, 2023
32f3394
Exception type tidy-up.
amunra Jan 4, 2023
38eb382
Fixed typos spotted during the code review.
amunra Jan 4, 2023
12 changes: 11 additions & 1 deletion .gitignore
@@ -1,12 +1,22 @@
src/questdb/ingress.html
src/questdb/ingress.c
src/questdb/*.html
rustup-init.exe

# Linux Perf profiles
perf.data*
perf/*.svg

# Atheris Crash/OOM and other files
fuzz-artifact/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Parquet files generated as part of example runs
*.parquet

# C extensions
*.so

6 changes: 5 additions & 1 deletion .vscode/settings.json
@@ -1,3 +1,7 @@
{
"esbonio.sphinx.confDir": ""
"esbonio.sphinx.confDir": "",
"cmake.configureOnOpen": false,
"files.associations": {
"ingress_helper.h": "c"
}
}
60 changes: 57 additions & 3 deletions CHANGELOG.rst
@@ -2,9 +2,47 @@
Changelog
=========

1.1.0 (2023-01-04)
------------------

Features
~~~~~~~~

* High-performance ingestion of `Pandas <https://pandas.pydata.org/>`_
dataframes into QuestDB via ILP.
We now support most Pandas column types. The logic is implemented in native
code and is orders of magnitude faster than iterating the dataframe
in Python and calling the ``Buffer.row()`` or ``Sender.row()`` methods: The
``Buffer`` can be written from Pandas at hundreds of MiB/s per CPU core.
The new ``dataframe()`` method continues working with the ``auto_flush``
feature.
See API documentation and examples for the new ``dataframe()`` method
available on both the ``Sender`` and ``Buffer`` classes.

* New ``TimestampNanos.now()`` and ``TimestampMicros.now()`` methods.
*These are the new recommended way of getting the current timestamp.*

* The Python GIL is now released during calls to ``Sender.flush()`` and when
``auto_flush`` is triggered. This should improve throughput when using the
``Sender`` from multiple threads.

Errata
~~~~~~

* In previous releases the documentation for the ``from_datetime()`` methods of
the ``TimestampNanos`` and ``TimestampMicros`` types recommended calling
``datetime.datetime.utcnow()`` to get the current timestamp. This is incorrect
as it will (confusingly) return an object with the local timezone instead of UTC.
This documentation has been corrected and now recommends calling
``datetime.datetime.now(tz=datetime.timezone.utc)`` or (more efficiently) the
new ``TimestampNanos.now()`` and ``TimestampMicros.now()`` methods.
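
The pitfall described in this erratum can be sketched with the standard library alone. This is a minimal illustration rather than code from the client: the fixed date and the ``epoch_seconds`` helper are assumptions chosen for the example. It shows that a naive ``datetime`` (the kind ``utcnow()`` returns) carries no timezone and is therefore read as local time when converted to an epoch value, while a UTC-aware one is unambiguous.

```python
from datetime import datetime, timezone

# A naive datetime (tzinfo is None), the kind that utcnow() returns.
# The date is an arbitrary value picked for illustration.
naive = datetime(2023, 1, 4, 12, 0, 0)

# The same wall-clock fields, explicitly tagged as UTC.
aware = naive.replace(tzinfo=timezone.utc)


def epoch_seconds(dt: datetime) -> float:
    """Seconds since the Unix epoch; naive values are read as local time."""
    return dt.timestamp()


print(naive.tzinfo)          # None
print(aware.tzinfo)          # UTC
print(epoch_seconds(aware))  # 1672833600.0
# epoch_seconds(naive) depends on the machine's local timezone,
# which is exactly the ambiguity the erratum warns about.
```

Only the timezone-aware value converts to the same epoch on every machine, which is why the corrected documentation recommends ``datetime.datetime.now(tz=datetime.timezone.utc)`` or the new ``now()`` methods.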

1.0.2 (2022-10-31)
------------------

Features
~~~~~~~~

* Support for Python 3.11.
* Updated to version 2.1.1 of the ``c-questdb-client`` library:

@@ -14,20 +52,30 @@ Changelog
1.0.1 (2022-08-16)
------------------

Features
~~~~~~~~

* As a matter of convenience, the ``Buffer.row`` method can now take ``None`` column
values. This has the same semantics as skipping the column altogether.
Closes `#3 <https://github.com/questdb/py-questdb-client/issues/3>`_.

Bugfixes
~~~~~~~~

* Fixed a major bug where Python ``int`` and ``float`` types were handled with
32-bit instead of 64-bit precision. This caused certain ``int`` values to be
rejected and other ``float`` values to be rounded incorrectly.
Closes `#13 <https://github.com/questdb/py-questdb-client/issues/13>`_.
* As a matter of convenience, the ``Buffer.row`` method can now take ``None`` column
values. This has the same semantics as skipping the column altogether.
Closes `#3 <https://github.com/questdb/py-questdb-client/issues/3>`_.
* Fixed a minor bug where an error auto-flush caused a second clean-up error.
Closes `#4 <https://github.com/questdb/py-questdb-client/issues/4>`_.
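
The effect of the 32-bit precision bug listed among these fixes can be reproduced in plain Python. This is a hedged sketch, not the client's code: ``roundtrip_f32``, ``fits_i32`` and ``fits_i64`` are hypothetical helpers that only mimic what storing a value in a 32-bit slot does.

```python
import struct


def roundtrip_f32(value: float) -> float:
    """Store a Python float (64-bit) as a 32-bit float and read it back."""
    return struct.unpack('<f', struct.pack('<f', value))[0]


def fits_i32(value: int) -> bool:
    """True if the value is representable as a signed 32-bit integer."""
    return -2**31 <= value <= 2**31 - 1


def fits_i64(value: int) -> bool:
    """True if the value is representable as a signed 64-bit integer."""
    return -2**63 <= value <= 2**63 - 1


print(roundtrip_f32(0.5) == 0.5)  # True: 0.5 is exact in binary
print(roundtrip_f32(0.1) == 0.1)  # False: 0.1 is rounded differently at 32 bits
print(fits_i32(2**31))            # False: would be rejected at 32 bits
print(fits_i64(2**31))            # True: fits once handled at 64 bits
```

Values such as ``0.1`` lose precision at 32 bits and integers at or above ``2**31`` fall out of range, which matches the rounding and rejection symptoms described in the entry.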


1.0.0 (2022-07-15)
------------------

Features
~~~~~~~~

* First stable release.
* Insert data into QuestDB via ILP.
* Sender and Buffer APIs.
@@ -38,6 +86,9 @@ Changelog
0.0.3 (2022-07-14)
------------------

Features
~~~~~~~~

* Initial set of features to connect to the database.
* ``Buffer`` and ``Sender`` classes.
* First release where ``pip install questdb`` should work.
@@ -46,4 +97,7 @@ Changelog
0.0.1 (2022-07-08)
------------------

Features
~~~~~~~~

* First release on PyPI.
16 changes: 16 additions & 0 deletions README.rst
@@ -34,6 +34,22 @@ The latest version of the library is 1.0.2.
columns={'temperature': 20.0, 'humidity': 0.5})
sender.flush()

You can also send Pandas dataframes:

.. code-block:: python

    import pandas as pd
    from questdb.ingress import Sender

    df = pd.DataFrame({
        'id': pd.Categorical(['toronto1', 'paris3']),
        'temperature': [20.0, 21.0],
        'humidity': [0.5, 0.6],
        'timestamp': pd.to_datetime(['2021-01-01', '2021-01-02'])})

    with Sender('localhost', 9009) as sender:
        sender.dataframe(df, table_name='sensors')


Docs
====
12 changes: 0 additions & 12 deletions TODO.rst
@@ -6,8 +6,6 @@ TODO
Build Tooling
=============

* **[HIGH]** Transition to Azure, move Linux arm to ARM pipeline without QEMU.

* **[MEDIUM]** Automate Apple Silicon as part of CI.

* **[LOW]** Release to PyPI from CI.
Expand All @@ -19,13 +17,3 @@ Docs
* **[MEDIUM]** Examples should be tested as part of the unit tests (as they
are in the C client). This is to ensure they don't "bit rot" as the code
changes.

* **[MEDIUM]** Document on a per-version basis.

Development
===========

* **[HIGH]** Implement ``tabular()`` API in the buffer.

* **[MEDIUM]** Implement ``pandas()`` API in the buffer.
*This can probably wait for a future release.*
16 changes: 8 additions & 8 deletions ci/cibuildwheel.yaml
@@ -68,7 +68,7 @@ stages:
- bash: |
set -o errexit
python3 -m pip install --upgrade pip
pip3 install cibuildwheel==2.11.1
python3 -m pip install cibuildwheel==2.11.2
displayName: Install dependencies
- bash: cibuildwheel --output-dir wheelhouse .
displayName: Build wheels
@@ -83,7 +83,7 @@ stages:
- bash: |
set -o errexit
python3 -m pip install --upgrade pip
pip3 install cibuildwheel==2.11.1
python3 -m pip install cibuildwheel==2.11.2
displayName: Install dependencies
- bash: cibuildwheel --output-dir wheelhouse .
displayName: Build wheels
@@ -100,7 +100,7 @@ stages:
- bash: |
set -o errexit
python3 -m pip install --upgrade pip
pip3 install cibuildwheel==2.11.1
python3 -m pip install cibuildwheel==2.11.2
displayName: Install dependencies
- bash: cibuildwheel --output-dir wheelhouse .
displayName: Build wheels
@@ -117,7 +117,7 @@ stages:
- bash: |
set -o errexit
python3 -m pip install --upgrade pip
pip3 install cibuildwheel==2.11.1
python3 -m pip install cibuildwheel==2.11.2
displayName: Install dependencies
- bash: cibuildwheel --output-dir wheelhouse .
displayName: Build wheels
@@ -134,7 +134,7 @@ stages:
- bash: |
set -o errexit
python3 -m pip install --upgrade pip
pip3 install cibuildwheel==2.11.1
python3 -m pip install cibuildwheel==2.11.2
displayName: Install dependencies
- bash: cibuildwheel --output-dir wheelhouse .
displayName: Build wheels
@@ -151,7 +151,7 @@ stages:
- bash: |
set -o errexit
python3 -m pip install --upgrade pip
python3 -m pip install cibuildwheel==2.11.1
python3 -m pip install cibuildwheel==2.11.2
displayName: Install dependencies
- bash: cibuildwheel --output-dir wheelhouse .
displayName: Build wheels
@@ -165,8 +165,8 @@ stages:
- task: UsePythonVersion@0
- bash: |
set -o errexit
python -m pip install --upgrade pip
pip install cibuildwheel==2.11.1
python3 -m pip install --upgrade pip
python3 -m pip install cibuildwheel==2.11.2
displayName: Install dependencies
- bash: cibuildwheel --output-dir wheelhouse .
displayName: Build wheels
74 changes: 74 additions & 0 deletions ci/pip_install_deps.py
@@ -0,0 +1,74 @@
import sys
import subprocess
import shlex
import textwrap
import platform


class UnsupportedDependency(Exception):
    pass


def pip_install(package):
    args = [
        sys.executable,
        '-m', 'pip', 'install',
        '--upgrade',
        '--only-binary', ':all:',
        package]
    args_s = ' '.join(shlex.quote(arg) for arg in args)
    sys.stderr.write(args_s + '\n')
    res = subprocess.run(
        args,
        stderr=subprocess.STDOUT,
        stdout=subprocess.PIPE)
    if res.returncode == 0:
        return
    output = res.stdout.decode('utf-8')
    if 'Could not find a version that satisfies the requirement' in output:
        raise UnsupportedDependency(output)
    else:
        sys.stderr.write(output + '\n')
        sys.exit(res.returncode)


def try_pip_install(package):
    try:
        pip_install(package)
    except UnsupportedDependency as e:
        msg = textwrap.indent(str(e), ' ' * 8)
        sys.stderr.write(f' Ignored unsatisfiable dependency:\n{msg}\n')


def ensure_timezone():
    try:
        import zoneinfo
        if platform.system() == 'Windows':
            pip_install('tzdata')  # for zoneinfo
    except ImportError:
        pip_install('pytz')


def main():
    ensure_timezone()
    try_pip_install('fastparquet>=2022.12.0')
    try_pip_install('pandas')
    try_pip_install('numpy')
    try_pip_install('pyarrow')

    on_linux_is_glibc = (
        (not platform.system() == 'Linux') or
        (platform.libc_ver()[0] == 'glibc'))
    is_64bits = sys.maxsize > 2**32
    is_cpython = platform.python_implementation() == 'CPython'
    if on_linux_is_glibc and is_64bits and is_cpython:
        # Ensure that we've managed to install the expected dependencies.
        import pandas
        import numpy
        import pyarrow
        if sys.version_info >= (3, 8):
            import fastparquet


if __name__ == "__main__":
    main()
4 changes: 3 additions & 1 deletion ci/run_tests_pipeline.yaml
@@ -28,7 +28,9 @@ stages:
submodules: true
- task: UsePythonVersion@0
- script: python3 --version
- script: python3 -m pip install cython
- script: |
python3 -m pip install cython
python3 ci/pip_install_deps.py
displayName: Installing Python dependencies
- script: python3 proj.py build
displayName: "Build"
6 changes: 5 additions & 1 deletion dev_requirements.txt
@@ -1,8 +1,12 @@
setuptools>=45.2.0
Cython>=0.29.32
wheel>=0.34.2
cibuildwheel>=2.11.1
cibuildwheel>=2.11.2
Sphinx>=5.0.2
sphinx-rtd-theme>=1.0.0
twine>=4.0.1
bump2version>=1.0.1
pandas>=1.3.5
numpy>=1.21.6
pyarrow>=10.0.1
fastparquet>=2022.12.0