
Conversation

@emkornfield
Contributor

This offers a possibly performance-naive CSV writer with
limited options, to keep the initial PR small.

Obvious potential improvements to this approach
are:

  • Smarter casts for dictionaries
  • Arena allocation for intermediate cast results

The implementation also means that, to support all primitive
types, we may have to fill in gaps in our cast function.


@emkornfield
Contributor Author

emkornfield commented Feb 16, 2021

@jorisvandenbossche @pitrou would one or both of you mind reviewing? It appears CSV is part of the minimal build, but compute (casts) is not. I was thinking of adding an ifdef and returning NotImplemented in that case. Does that seem like a reasonable approach?

@pitrou pitrou self-requested a review February 16, 2021 15:39
Member

@pitrou pitrou left a comment

Very nice! The approach is definitely interesting. You'll find some comments below.

Member

Hmm... which table memory?

Contributor Author

Copy-and-paste bug.

Member

I would rather make the API more consistent and have the user pass a WriteOptions instance here.

Contributor Author

done.

Member

I don't understand this comment: where is the optimization which just returns the original RecordBatch?

Contributor Author

Leftover comment from when I was going down the rabbit hole. I've moved all this code into writer.cc to avoid debating the best way to expose this publicly.

Member

Hmm... this means the iterator will fail if the user doesn't keep the original batch alive. Do we really gain anything by not taking a shared_ptr here?

Contributor Author

Yes, it would. The tricky part is we can't get a shared_ptr to the RecordBatch if this is a member method (unless we add enable_shared_from_this).

To avoid any contention here for now I've moved this to be an implementation detail.

Member

Wouldn't it be more consistent with the rest of the RecordBatch API to use RecordBatchReader?

Contributor Author

I was guessing iterators were now preferred? I haven't been keeping up. But as noted above the point is moot now.

Member

Nit, but I think we spell "CSV" everywhere currently.

Contributor Author

replacing.

Contributor Author

replaced.

Member

Interesting approach. It may also allow parallelizing conversion if we want to go that way.

Contributor Author

yep :)

Member

3 is because of quoting and delimiters?

Contributor Author

yes, replaced with a constexpr
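A hedged sketch of the reasoning behind that constant (names here are illustrative, not Arrow's): each quoted field needs at most two surrounding quote characters plus one trailing delimiter, so three extra bytes are reserved per value.

```python
QUOTE_AND_DELIMITER_OVERHEAD = 3  # hypothetical name: 2 quotes + 1 delimiter

def reserved_row_length(field_lengths):
    # Worst-case bytes to reserve for one row; embedded quotes that must be
    # doubled when escaping are ignored here for brevity.
    return sum(n + QUOTE_AND_DELIMITER_OVERHEAD for n in field_lengths)

print(reserved_row_length([3, 0, 5]))  # (3+3) + (0+3) + (5+3) = 17
```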

Member

DCHECK that row_positions_.back() corresponds to the end of the buffer?

Member

(even better would be to DCHECK that each row position is consistent with the corresponding pre-computed row length)

Contributor Author

Checking each row position against a pre-computed row length is no longer possible, since I mutate the offsets.

Member

Intuitively, this seems a bit low, but we can tune it later.

@emkornfield
Contributor Author

@pitrou thank you for all the comments; I'm still working through addressing them (I'll ping you again for review when it is ready). Could you chime in on whether the ifdef approach mentioned above, returning NotImplemented when compute isn't available, seems reasonable to you?

@pitrou
Member

pitrou commented Feb 17, 2021

Sorry, I hadn't seen that question. Yes, I think that raising NotImplemented is fine for now.

In the future, we may want to always enable Cast (it's also used in stl.h), if that doesn't add too much to the binary sizes.

@jorisvandenbossche
Member

Cool!

I don't have time right now to give it a more detailed review, but I quickly fetched the branch, and even for a not-yet-optimized first version, this is already much faster than the pure Python pandas to_csv writer (in pandas only the CSV reader is optimized, not the writer).
With a small example (50,000 rows, 5 columns, with floats/ints/strings), I get 140 ms with pandas and 20 ms with this branch (in a release build), which even includes the pandas->arrow conversion (only 2-3 ms in this case).

Few things I noticed / random thoughts:

  • When writing a column of floats that don't have decimals, no decimal point is included in the output, so it looks like an int (e.g. 1 instead of 1.0; pandas writes the latter). Not sure if we want to preserve this "type information" in this case.
  • Pandas doesn't use quoting by default, also for string columns. I am not fully sure what makes the most sense as default option, but disabling quoting can be a follow-up enhancement.
  • We don't support casting timestamps to strings (yet), so that will be a useful addition to casting to be used here.

@jorisvandenbossche
Member

Pandas doesn't use quoting by default, also for string columns. I am not fully sure what makes the most sense as default option, but disabling quoting can be a follow-up enhancement.

Quoting by default ensures that we can distinguish missing values and empty strings, I suppose?

Member

In the parquet and feather api, it's first the data and then the file name

Contributor Author

reversed. thanks.

Member

Another easy optimization is to move the writes to a dedicated background thread. I'll add this to my list of "to asyncerize" which will include this optimization by necessity.

Contributor Author

Yes, agreed. My initial use case for this will be writing to a BufferOutputStream, so asynchrony isn't important there, at least. This actually raises a question of whether detecting blocking vs non-blocking IO sources is important for threading.

Member

I don't know that a buffered output stream solves the problem I'm worried about. It will help mitigate the cost of many small writes by grouping them but the large writes still take time. So if the CSV is large enough to span multiple buffers you still have to block here occasionally. I may be misunderstanding what you mean by bufferedoutputstream though.

The detecting question is a good one. I think it boils down like so...

Currently, we assume all streams are blocking (buffered output stream is still a blocking stream, it just blocks less). This means that anytime we do I/O from an async context we need to "background" it by creating a dedicated thread to do the I/O (very soon I hope this will switch to "borrowing a dedicated thread from the I/O pool")

At some point, as more underlying filesystems are non-blocking, we can reverse the assumption and assume that all underlying streams are non-blocking. Any remaining blocking streams will then be responsible for the "background"ing.
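A minimal Python sketch of the "backgrounding" idea described above, with a one-thread executor standing in for a thread borrowed from the I/O pool, and an in-memory buffer standing in for a blocking output stream (all names here are illustrative, not Arrow's API):

```python
import concurrent.futures
import io

# A one-thread executor stands in for "borrowing a dedicated thread from
# the I/O pool"; the async context submits the blocking write and carries on.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
sink = io.BytesIO()  # stand-in for a blocking OutputStream

def blocking_write(buf: bytes) -> int:
    # The potentially blocking I/O call runs off the CPU worker threads.
    return sink.write(buf)

future = executor.submit(blocking_write, b"a,b,c\n")
print(future.result())  # 6 bytes written
executor.shutdown()
```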

Contributor Author

I don't know that a buffered output stream solves the problem I'm worried about. It will help mitigate the cost of many small writes by grouping them but the large writes still take time. So if the CSV is large enough to span multiple buffers you still have to block here occasionally. I may be misunderstanding what you mean by bufferedoutputstream though.

BufferOutputStream is an in memory output stream that writes to a resizable buffer.

Member

I'm not sure why we're talking about doing writes in a separate thread. Writes are typically asynchronous, so the only cost is a memory copy (and perhaps a system call).

Member

Ok, yes, not a problem at all for BufferOutputStream. That will teach me to read more closely.

@pitrou You are correct. Simply adding a background readahead-style thread would just introduce a second cache in addition to the OS's existing cache (or whatever dirty-page cache is in our S3 filesystem), and that doesn't help anything. I still think we will eventually want to make writes properly async (futures). In the event we are in a huge write, or writing to a slow (S3) sink, we still might block; the gain there is not the performance of this CSV write (the one blocked by I/O) so much as keeping the CPU threads clear in case other tasks are going on at the same time.

@emkornfield
Contributor Author

Quoting by default ensures that we can distinguish missing values and empty strings, I suppose?

Yes, although we could use this convention only for nulls. Always quoting also helps maintain data types; for instance, numbers in a string column would be preserved round-trip.
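A minimal sketch (not Arrow's implementation) of how always quoting keeps a null distinguishable from an empty string: null serializes to an unquoted empty field, while an empty string becomes "".

```python
def serialize_field(value):
    # Hypothetical serializer illustrating the convention discussed above.
    if value is None:
        return ""  # null: empty, unquoted field
    # strings are always quoted, with embedded quotes doubled
    return '"' + value.replace('"', '""') + '"'

print(",".join(serialize_field(v) for v in ["a", None, ""]))  # "a",,""
```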

@emkornfield
Contributor Author

Sorry, I hadn't seen that question. Yes, I think that raising NotImplemented is fine for now.

In the future, we may want to always enable Cast (it's also used in stl.h), if that doesn't add too much to the binary sizes.

Sorry I missed this. The approach I went with is to not include the writer header in api.h, and to not try to compile it, if compute isn't enabled.

@emkornfield
Contributor Author

@jorisvandenbossche thank you for trying it out. In regards to:

When writing a column of floats that don't have decimals, no decimal point is included in the output, so it looks like an int (so eg 1 instead of 1.0, pandas writes the latter). Not sure if we want to preserve this "type information" in this case.

I think this would be nice, or at least configurable, but I would like to make this outside the scope of this PR (this probably belongs as a feature on cast?)

@emkornfield
Contributor Author

@pitrou I think this is ready for another review when you have time.

@pitrou
Member

pitrou commented Feb 23, 2021

Can you rebase from master to fix Windows builds?

@emkornfield
Contributor Author

@pitrou rebased.

Member

@pitrou pitrou left a comment

Thanks for the update. Here are a couple more comments.

Member

The preferred way would be to let CMake generate it: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/config.h.cmake#L37
(also, if you just pass the option here, it might not be taken up by third-party applications including Arrow)

Member

Writing iteratively requires being careful with the options (you don't want to append a new header for each subsequent batch).

As you prefer, though. We can convert this later.
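The caution above can be sketched as follows, using Python's stdlib csv writer as a stand-in for a per-batch write with an include_header-style option (the helper and its parameters are illustrative, not Arrow's API):

```python
import csv
import io

def write_batch(sink, rows, include_header, header=("x", "y")):
    # Hypothetical per-batch write: only the first batch emits the header.
    w = csv.writer(sink, lineterminator="\n")
    if include_header:
        w.writerow(header)
    w.writerows(rows)

sink = io.StringIO()
batches = [[("1", "a")], [("2", "b")]]
for i, batch in enumerate(batches):
    write_batch(sink, batch, include_header=(i == 0))  # header once, up front
print(sink.getvalue())
```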

Member

This docstring misses a description.

Contributor Author

added

Member

Looking at the existing PyArrow code, we generally don't mention anything when the function doesn't return a useful value.

Contributor Author

removed.

Member

It seems get_writer should do this for string and path-like inputs already. Is there a reason you had to write this try/except/else switch?

Contributor Author

I think I copied this from someplace.

Member

pyarrow_unwrap_batch is generally preferred.

Contributor Author

thanks. changed.

Member

pyarrow_unwrap_table also

Contributor Author

thanks. changed.

Member

We should raise a proper TypeError, e.g.:

raise TypeError(f"Expected Table or RecordBatch, got '{type(data)}'")

Contributor Author

done.

Member

You shouldn't need this with the config.h.cmake approach.

Contributor Author

removed.

Member

Ok, thanks.

@emkornfield
Contributor Author

@pitrou I think I addressed all your comments. Not sure what is going on with the R CI builds?

@pitrou
Member

pitrou commented Feb 24, 2021

@emkornfield It looks like Arrow C++ failed compiling on those builds:
https://github.com/apache/arrow/pull/9504/checks?check_run_id=1967370780#step:9:566

It seems std::vector<bool> with a custom allocator isn't well supported on old gccs?

@emkornfield
Contributor Author

It seems std::vector&lt;bool&gt; with a custom allocator isn't well supported on old gccs?

Trying uint8_t instead. (I'm also open to maybe removing the allocator in this case.)

@pitrou
Member

pitrou commented Feb 24, 2021

Note that std::vector<bool> should be more space-efficient, but that matters only if it's large.
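To illustrate the space trade-off (a rough sketch, not Arrow code): a bit-packed vector like std::vector&lt;bool&gt; stores one bit per element, while a std::vector&lt;uint8_t&gt; stores a full byte per element.

```python
# For 1024 elements, bit-packing uses an eighth of the byte-per-element size.
n = 1024
bitpacked_bytes = (n + 7) // 8  # one bit per element, rounded up to bytes
byte_per_element_bytes = n      # one byte per element
print(bitpacked_bytes, byte_per_element_bytes)  # 128 1024
```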

@emkornfield
Contributor Author

Note that std::vector&lt;bool&gt; should be more space-efficient, but that matters only if it's large.

Yes, I agree this isn't great, but it should be pretty small (at least 1024 bytes by default). The other option would be to not use the custom allocator. Do you have a preference?

@pitrou
Member

pitrou commented Feb 24, 2021

I'd say just use the default allocator.

@emkornfield
Contributor Author

emkornfield commented Feb 25, 2021

I'd say just use the default allocator.

Done.

emkornfield and others added 18 commits February 25, 2021 13:55
Add Python properties and test them
Member

@pitrou pitrou left a comment

Will merge. I still think the C++ API should be class-based, like other readers and writers.

@emkornfield
Contributor Author

I'll do a follow-up PR to expose a class/object for writing.
