feat: High performance pandas integration. #24

Merged: 149 commits, Jan 4, 2023

Commits
12660f3
Testing tweak.
amunra Oct 26, 2022
bdbd283
Merge remote-tracking branch 'origin/main' into pandas_integration
amunra Oct 26, 2022
9e93993
Updated to c-questdb-client 2.1.1
amunra Oct 26, 2022
e56027c
Some progress..
amunra Oct 28, 2022
ed1658e
Merge remote-tracking branch 'origin/main' into pandas_integration
amunra Nov 1, 2022
d692098
Fixed broken check.
amunra Nov 1, 2022
a92a858
symbols validation.
amunra Nov 1, 2022
bebbc8a
Added repl command to proj script.
amunra Nov 1, 2022
ab05dec
Progress with pandas method input validation.
amunra Nov 1, 2022
dc65b2c
Moved Python IntVec to C int_vec.
amunra Nov 1, 2022
a46490a
More code to avoid python lists.
amunra Nov 2, 2022
737f970
CI fixup.
amunra Nov 2, 2022
d574486
CI fixup 2.
amunra Nov 2, 2022
b1c17b0
CI fixup 3.
amunra Nov 2, 2022
fc99054
CI fixup 4.
amunra Nov 2, 2022
3654938
CI fixup 5.
amunra Nov 2, 2022
7102ddd
Types and buffers.
amunra Nov 4, 2022
5d6d4be
Improved column index handling, actually getting buffers from numpy a…
amunra Nov 4, 2022
95a66bc
Introducing a small rust lib to convert python strings to UTF-8 witho…
amunra Nov 4, 2022
e43e564
Implemented conversions for UCS1/2/4.
amunra Nov 4, 2022
25bc49b
Renamed rust lib and wired up the build and linkage bits in setup.py
amunra Nov 4, 2022
b199db6
Auxilliary rust lib rename.
amunra Nov 5, 2022
7b5200e
Reworked API for better buffer reuse and for individual UCS1, 2, 4 fu…
amunra Nov 5, 2022
e945702
cbindgen to generate C .h and Cython .pxd headers.
amunra Nov 5, 2022
161f6ae
cbindgen fixup & including lib headers from setup.py
amunra Nov 7, 2022
5b5d58b
Wrote Cython function to invoke 'pystr-to-utf8' lib. Additional refac…
amunra Nov 7, 2022
12598c4
Transitioned all string buffer conversions to new Rust code lib.
amunra Nov 8, 2022
3544312
Made pystr_to_utf8 addresses stable.
amunra Nov 8, 2022
9fce68b
Rust str to utf8 lib fixes (but still broken - ongoing)
amunra Nov 8, 2022
9155357
Minor unicode test improvement. Transcoding works now.
amunra Nov 8, 2022
f953ad0
Rust PyStr lib tests and a few bugfixes.
amunra Nov 8, 2022
ab1566c
Updated pystr lib readme.
amunra Nov 8, 2022
d79955f
More pystr-to_utf8 tests and improvements.
amunra Nov 9, 2022
e8980d5
Added UCS-4 tests.
amunra Nov 9, 2022
644bdba
More unicode testing.
amunra Nov 10, 2022
b071edf
Fixed include for cython generation compatability.
amunra Nov 10, 2022
091e4c8
Table name columns, symbols and timestamps now work!
amunra Nov 10, 2022
29895f6
Handling null column values in strings.
amunra Nov 11, 2022
ef6735b
Added arrow C data interface type definitions.
amunra Nov 11, 2022
ead851d
Code reorg.
amunra Nov 11, 2022
d14eea9
Consolidated approach writeup and code reorg into .pxi files.
amunra Nov 14, 2022
94378a0
Undid removal of ingress.c from gitignore.
amunra Nov 14, 2022
939c50d
More writeup with final types.
amunra Nov 15, 2022
2355b3d
Categories added to write-up.
amunra Nov 15, 2022
1c14cb7
Consolidated Pandas logic into single .pxi file.
amunra Nov 15, 2022
4f7ef81
File renaming.
amunra Nov 15, 2022
de4b471
Reorganised existing logic into a sorted array of col_t types. Some p…
amunra Nov 16, 2022
ef2424d
Documented float16, added col_source_t.
amunra Nov 16, 2022
993dd5f
Beginning to resolve columns.
amunra Nov 17, 2022
8387209
More array extraction logic.
amunra Nov 18, 2022
5e5158f
Updating of types, updating tech doc for timezone timestamps.
amunra Nov 21, 2022
2a26d8f
Fixed up most cython build issues. Mostly enum usage issues.
amunra Nov 21, 2022
e9890a6
Code builds again finally.
amunra Nov 21, 2022
7e75260
Dead code removal.
amunra Nov 21, 2022
0e67480
Types to dispatch codes to functions.
amunra Nov 21, 2022
41d3e6b
Some test fixup
amunra Nov 22, 2022
7a1c0b0
Yay, segfault!
amunra Nov 22, 2022
3af4899
Fixed a few segfaults, got some more.
amunra Nov 22, 2022
9dc17a3
Fixed segfaults.
amunra Nov 22, 2022
9665d9f
Added missing dispatch codes and lots of TODOs.
amunra Nov 22, 2022
194ad25
Got rid of a lot of INCREF/DECREF silliness.
amunra Nov 23, 2022
9ea4a0e
Ohh look. Tests pass again.
amunra Nov 23, 2022
7c03446
Fixed another segfault.
amunra Nov 23, 2022
80748ad
Another bug bites the dust.
amunra Nov 23, 2022
01d8732
Implemented symbols='auto' and i32 column support.
amunra Nov 23, 2022
579e649
Swapped out error prone 'bint / except False' declarations with 'void…
amunra Nov 23, 2022
df823b7
More string trouble.
amunra Nov 24, 2022
cb97a20
Normality restored.
amunra Nov 24, 2022
5c1aaff
py obj to symbols.
amunra Nov 24, 2022
dc5795f
Done timestamp at and columns. Found out that timezone timestamps are…
amunra Nov 24, 2022
b7cfd50
Added some testing notes.
amunra Nov 24, 2022
12c6eec
TODO fixup.
amunra Nov 24, 2022
8593b75
Added support for datetimes with timezones (only nanosecond based) vi…
amunra Nov 25, 2022
7f6503d
Bool column support from Python objects.
amunra Nov 25, 2022
25c91be
Added arrow-based boolean pandas datatype column support.
amunra Nov 25, 2022
8b31010
Support for arrow integer columns.
amunra Nov 25, 2022
6cad41b
Progress handling strings.
amunra Nov 28, 2022
1499cff
Support for objects with integers.
amunra Nov 28, 2022
0e9e287
Float object support.
amunra Nov 28, 2022
a1e7d07
arrow f32 and f64
amunra Nov 28, 2022
898b157
str column pyarrow.
amunra Nov 28, 2022
88e618a
LTO, basic perf tests, removed debug logging, fixed a bug in string c…
amunra Nov 30, 2022
0158fad
Tests for categories.
amunra Nov 30, 2022
fd01ef3
Releasing and reacquiring GIL to avoid starving other threads.
amunra Nov 30, 2022
dfdd302
Fully releasing GIL whenever possible. This was fiddly to get working.
amunra Nov 30, 2022
5041d26
Refactoring out benchmarks, refactoring Py str to UTF8 rust impl.
amunra Dec 1, 2022
e4135d0
8% perf improvements in Python string to UTF-8 conversions.
amunra Dec 1, 2022
754534c
Multithreading benchmark.
amunra Dec 1, 2022
5915a00
Implemented column (arrow and pybuffer) cleanup.
amunra Dec 1, 2022
9b75a12
Formatting.
amunra Dec 1, 2022
7f8dab4
Tested all-nulls column is altogether skipped.
amunra Dec 1, 2022
21f39f8
Refactoring and sorting columns in C.
amunra Dec 1, 2022
f28903e
Updated c-questdb-client submodule: Latest perf improvements.
amunra Dec 2, 2022
8b6652a
Fixed broken build.
amunra Dec 2, 2022
cba2aaf
Single logic to infer object column types.
amunra Dec 2, 2022
e07bb42
Tests fixup.
amunra Dec 2, 2022
bd4bca6
More tests.
amunra Dec 2, 2022
7c734b5
Fixed a bug passing None in datetime columns.
amunra Dec 2, 2022
480343d
Tests for degenerate pandas dataframes.
amunra Dec 2, 2022
b1f2ebf
Informative message for row of nulls.
amunra Dec 2, 2022
9811007
Mandating pyarrow dependency for pandas functionality.
amunra Dec 3, 2022
c03c4ed
There's a chance this will fix CI.
amunra Dec 5, 2022
b1a4dc7
Second attempt to fix up the CI.
amunra Dec 5, 2022
8b3e45d
Third attempt to fix up the CI.
amunra Dec 5, 2022
15330b0
Reduced stack size in case of errors to aid legibility.
amunra Dec 5, 2022
cee6d4b
Fourth attempt to fix up the CI.
amunra Dec 5, 2022
5668b57
Fifth attempt to fix up the CI.
amunra Dec 5, 2022
77a612c
Sixth attempt to fix up the CI.
amunra Dec 5, 2022
57ae0b8
Progress on API docs.
amunra Dec 6, 2022
a2763f9
Found and fixed a memory leak.
amunra Dec 7, 2022
960cd74
More fuzzing.
amunra Dec 7, 2022
5998a1c
Added support from taking the table name from the df.index.name, rena…
amunra Dec 8, 2022
be0407c
General fixes and testcases for handling timestamps.
amunra Dec 12, 2022
24ee3cd
Should fix tests in CI.
amunra Dec 12, 2022
e8d8daa
Extra testing of 'TimestampXXX.now()' and hopefully fixing CI.
amunra Dec 12, 2022
5f8e8ee
CI fixup attempt.
amunra Dec 12, 2022
6b77da9
Fixing broken 32-bit binaries.
amunra Dec 12, 2022
aaf7e95
Slimmed down 'col_t' type.
amunra Dec 13, 2022
b9b2081
Implemented (but not yet tested) pandas auto-flush logic. Also releas…
amunra Dec 13, 2022
dcccabd
Tweak to pandas auto-flush logic.
amunra Dec 13, 2022
98a5496
Basic pandas end-to-end test.
amunra Dec 13, 2022
02d49dd
Tests (and bugfixes) for panda's auto-flush.
amunra Dec 13, 2022
1801f10
Pandas API docs.
amunra Dec 14, 2022
45aa14b
Renamed '.pandas()' to '.dataframe()'.
amunra Dec 14, 2022
ada3ac8
Int object int64 bounds check tests.
amunra Dec 15, 2022
89c50b4
Test strided numpy array with zero-copy into pandas.
amunra Dec 15, 2022
64f14fa
Serializing subset of dataframe rows.
amunra Dec 15, 2022
def3887
Improved error messaging.
amunra Dec 15, 2022
81d6cb8
Testing chunked arrow arrays.
amunra Dec 15, 2022
83a937a
Removed completed TODOs
amunra Dec 15, 2022
712ec1d
Hopefully fixing CI.
amunra Dec 15, 2022
88b043e
Dataframe API doc fixup.
amunra Dec 15, 2022
58de10c
Fixing the CI
amunra Dec 15, 2022
b461557
Parquet rountrip test.
amunra Dec 16, 2022
9625447
Added missing libs in dev_requirements.txt
amunra Dec 28, 2022
02da96b
CI fixup (hopefully)
amunra Dec 28, 2022
e26f5fe
CI fixup (hopefully, again)
amunra Dec 28, 2022
6dd6cf6
CI fixup (once more, with feeling)
amunra Dec 28, 2022
67cedd9
More examples.
amunra Dec 29, 2022
0c7b6ef
Parquet data example.
amunra Dec 30, 2022
cd97af2
Updated parquet example, added to docs.
amunra Jan 2, 2023
ab69e9c
Updated examples manifest to hint at more examples for Pandas datafra…
amunra Jan 2, 2023
3af8c85
Disabled bytecode file gen for install_rust.py
amunra Jan 2, 2023
25d4e2b
Updated CHANGELOG.rst
amunra Jan 3, 2023
39dc427
Minor error reporting bugfix.
amunra Jan 4, 2023
7818149
Improved docs.
amunra Jan 4, 2023
46999e7
Updated c-questdb-client dependency.
amunra Jan 4, 2023
32f3394
Exception type tidy-up.
amunra Jan 4, 2023
38eb382
Fixed typos spotted during the code review.
amunra Jan 4, 2023
16 changes: 16 additions & 0 deletions README.rst
@@ -34,6 +34,22 @@ The latest version of the library is 1.0.2.
columns={'temperature': 20.0, 'humidity': 0.5})
sender.flush()

You can also send Pandas dataframes:

.. code-block:: python

import pandas as pd
from questdb.ingress import Sender

df = pd.DataFrame({
    'id': pd.Categorical(['toronto1', 'paris3']),
    'temperature': [20.0, 21.0],
    'humidity': [0.5, 0.6],
    'timestamp': pd.to_datetime(['2021-01-01', '2021-01-02'])})

with Sender('localhost', 9009) as sender:
    sender.dataframe(df, table_name='sensors')


Docs
====
26 changes: 24 additions & 2 deletions docs/installation.rst
@@ -5,9 +5,9 @@ Installation
The Python QuestDB client does not have any additional run-time dependencies and
will run on any version of Python >= 3.7 on most platforms and architectures.

You can install it globally by running::
You can install it (or update it) globally by running::

python3 -m pip install questdb
python3 -m pip install -U questdb


Or, from within a virtual environment::
@@ -20,6 +20,15 @@ If you're using poetry, you can add ``questdb`` as a dependency::
poetry add questdb


Note that the :func:`questdb.ingress.Buffer.dataframe` and the
:func:`questdb.ingress.Sender.dataframe` methods also require the following
dependencies to be installed:

* ``pandas``
* ``pyarrow``
* ``numpy``


Verifying the Installation
==========================

@@ -34,3 +43,16 @@ following statements from a ``python3`` interactive shell:
<questdb.ingress.Buffer object at 0x104b68240>
>>> str(buf)
'test,a=b\n'

If you also want to check that you can serialize from Pandas
(which requires additional dependencies):

.. code-block:: python

>>> import questdb.ingress
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2]})
>>> buf = questdb.ingress.Buffer()
>>> buf.dataframe(df, table_name='test')
>>> str(buf)
'test a=1i\ntest a=2i\n'
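
Before running the Pandas check above, it can help to confirm that the optional packages listed in the installation note are importable. The snippet below is a minimal sketch that uses only the standard library; `missing_modules` and `DATAFRAME_DEPS` are hypothetical names chosen for illustration and are not part of the `questdb` package.

```python
import importlib.util

# Optional packages that the dataframe methods need, per the installation note.
DATAFRAME_DEPS = ("pandas", "pyarrow", "numpy")


def missing_modules(names):
    """Return the names that cannot be located as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(DATAFRAME_DEPS)
    if missing:
        print("Missing for dataframe support:", ", ".join(missing))
    else:
        print("All dataframe dependencies are importable.")
```

Any name that is reported as missing can be installed with `python3 -m pip install <name>`.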
32 changes: 32 additions & 0 deletions src/questdb/ingress.pyx
@@ -1200,6 +1200,31 @@ cdef class Buffer:
If a datetime value is specified as ``None`` (``NaT``), it is
interpreted as the current QuestDB server time set on receipt of
message.

**Error Handling and Recovery**

In case an exception is raised during dataframe serialization, the
buffer is left in its previous state.
The buffer remains in a valid state and can be used for further calls
even after an error.

For example, if an invalid ``None`` value appears in the 3rd row of a
``bool`` column, neither the 3rd row nor the preceding rows are added
to the buffer.

**Note**: This differs from the :func:`Sender.dataframe` method, which
modifies this guarantee due to its ``auto_flush`` logic.

**Performance Considerations**

The Python GIL is released during serialization if it is not needed.
If any column requires the GIL, the entire serialization is done whilst
holding the GIL.

Column types that require the GIL are:

* Columns of ``str``, ``float`` or ``int`` Python objects.
* The ``'string[python]'`` dtype.
"""
_dataframe(
auto_flush_blank(),
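
The all-or-nothing guarantee described in the "Error Handling and Recovery" docstring can be illustrated with a toy model. This is only a sketch of one way to get that behaviour (remember the buffer length, truncate back to it on failure); `ToyBuffer` and `add_rows` are hypothetical and do not reflect how the Cython implementation in this PR is written.

```python
class ToyBuffer:
    """Hypothetical stand-in for a line buffer; not the questdb implementation."""

    def __init__(self):
        self._data = bytearray()

    def __str__(self):
        return self._data.decode("utf-8")

    def add_rows(self, table, column, values):
        # Remember where the buffer ended before this call.
        checkpoint = len(self._data)
        try:
            for index, value in enumerate(values):
                if not isinstance(value, bool):
                    raise ValueError(
                        f"bad value {value!r} in column {column!r} at row {index}")
                flag = "t" if value else "f"
                self._data += f"{table} {column}={flag}\n".encode("utf-8")
        except Exception:
            # Discard everything appended by this call, then re-raise.
            del self._data[checkpoint:]
            raise


buf = ToyBuffer()
buf.add_rows("test", "ok", [True, False])
before = str(buf)
try:
    buf.add_rows("test", "ok", [True, True, None])  # invalid None at the 3rd row
except ValueError:
    pass
# The failed call left no partial rows behind.
assert str(buf) == before
```

The second call fails on its third value, and the two rows it had already appended are dropped, so the buffer still holds only the output of the first call.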
@@ -1599,6 +1624,11 @@ cdef class Sender:

Additionally, this method also supports auto-flushing the buffer
as specified in the ``Sender``'s ``auto_flush`` constructor argument.
Auto-flushing is implemented incrementally, meaning that when
calling ``sender.dataframe(df)`` with a large ``df``, the sender may
have sent some of the rows to the server already whilst the rest of the
rows are going to be sent at the next auto-flush or next explicit call
to :func:`Sender.flush`.

In case of data errors with auto-flushing enabled, some of the rows
may have been transmitted to the server already.
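
To make the consequence of incremental auto-flushing concrete, here is a toy model of a sender that flushes whenever its pending bytes reach a watermark. `ToySender`, `rows` and `auto_flush_bytes` are hypothetical names, the "network" is just a list, and what happens to unsent rows after an error is a property of this toy only, not a statement about the real `Sender`.

```python
class ToySender:
    """Hypothetical sender that flushes once the buffer reaches a watermark."""

    def __init__(self, auto_flush_bytes):
        self.auto_flush_bytes = auto_flush_bytes
        self.pending = bytearray()   # rows not yet sent
        self.sent = []               # payloads "transmitted" so far

    def flush(self):
        if self.pending:
            self.sent.append(bytes(self.pending))
            self.pending.clear()

    def rows(self, table, values):
        for index, value in enumerate(values):
            if value is None:
                raise ValueError(f"bad value at row {index}")
            self.pending += f"{table} a={value}i\n".encode("utf-8")
            # Flush as soon as the watermark is reached.
            if len(self.pending) >= self.auto_flush_bytes:
                self.flush()


sender = ToySender(auto_flush_bytes=20)
try:
    sender.rows("test", [1, 2, 3, None])  # the 4th value is invalid
except ValueError:
    pass
```

Each row here is 10 bytes, so the first two rows are flushed together before the invalid fourth value is reached: by the time the error is raised, part of the input has already left the sender.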
@@ -1636,6 +1666,8 @@ cdef class Sender:
If ``False``, the flushed buffer is left in the internal buffer.
Note that ``clear=False`` is only supported if ``buffer`` is also
specified.

The Python GIL is released during the network IO operation.
"""
cdef line_sender* sender = self._impl
cdef line_sender_error* err = NULL