Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CopyOut! interface for >5x faster data queries #241

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

cmey
Copy link

@cmey cmey commented Dec 29, 2021

Hi! I tend to do SELECT queries that return a large dataset for which the standard method takes a few seconds:

using LibPQ, BenchmarkTools, DataFrames

conn = LibPQ.Connection("dbname=postgres user=postgres")

result = execute(conn, """
    CREATE TEMPORARY TABLE libpqjl_test (
        no_nulls    varchar(10),
        yes_nulls   varchar(10)
    );
""")

result = execute(conn, """
    insert into libpqjl_test (no_nulls, yes_nulls)
select
    left(md5(i::text), 10),
    left(md5(random()::text), 10)
from generate_series(1, 1000000) s(i);
""")

@benchmark execute(conn, "SELECT * FROM libpqjl_test;") |> DataFrame

Time  (median):     2.977 s               ┊ GC (median):    13.03%

PostgreSQL supports the COPY TO API and I found it to be faster, so I implemented it! Use ] add https://github.com/cmey/LibPQ.jl#copy-out to try it out!
(we already support the COPY FROM API for fast ingestion which I took heavy inspiration from)

using CSV

@benchmark begin
    databuf = IOBuffer()
    copyout = LibPQ.CopyOut!(databuf, "COPY (SELECT * FROM libpqjl_test) TO STDOUT (FORMAT CSV, HEADER);")
    result = execute(conn, copyout)
    databuf |> CSV.File |> DataFrame
end

Time  (median):     590.818 ms               ┊ GC (median):    9.10%

Both methods return equal outputs:

isequal(execute(conn, "SELECT * FROM libpqjl_test;") |> DataFrame,
    begin databuf = IOBuffer()
        copyout = LibPQ.CopyOut!(databuf, "COPY (SELECT * FROM libpqjl_test) TO STDOUT (FORMAT CSV, HEADER);")
        result = execute(conn, copyout)
        databuf |> CSV.File |> DataFrame
    end)

true

The new API expects the user to pass a "string builder" type of construct in the form of a IOBuffer that, during execution, gets filled with the text of the CSV returned by the database.
I found this didn't fit well with the table-style of Result, since this is just a text blob.

Note that CSV API changed from CSV version 0.8 to 0.9, and that Julia <1.3.0-1 loads the old CSV, that's why there is the VERSION check in the test code, to support the test run on Julia 1.0.

I hope others will find this helpful! I certainly found this helped me, as I see >10x speed improvement in real-life use cases.

@cmey cmey changed the title Add CopyOut! interface for 5x faster data queries Add CopyOut! interface for >5x faster data queries Dec 29, 2021
@cmey cmey force-pushed the copy-out branch 2 times, most recently from f13aa0c to 46ab31b Compare December 30, 2021 09:31
@cmey
Copy link
Author

cmey commented Dec 30, 2021

@iamed2 Hi! I'm not sure how to fix the failing formatting CI step, it flags about all source files in the repo, including many I haven't touched 😮

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant