ARROW-9782: [C++][Dataset] More configurable Dataset writing #8305
Conversation
@bkietz this removes the ability to specify format-specific options? (Or is it still WIP?)

@jorisvandenbossche yes, not ready for review yet. I will repair format-specific write options as part of this PR.

Okey-dokey, will wait a bit more then ;-)
(force-pushed b090f12 to b47d07d)
cpp/src/arrow/dataset/file_base.cc (Outdated)
Since we know all fragments (and their expressions) already, can we avoid all the locking and multi-threading machinery in WriterSet (IIRC, you only need it to create each writer once)? That would heavily simplify all of this.
In this context fragments are the object of writing rather than the target (for example, one might represent an in-memory table which is being copied to disk). Writers are not known ahead of time since they depend on the partitioning, which depends on the set of unique values in a given column; we discover those only after running GroupBy on an input batch.
We could do two scans of the input data:
- Assemble a list of all unique values in the partition columns of the data, from which we can determine the precise set of writers to open
- Apply groupings to batches, passing the results to pre-opened writers
This doesn't seem worthwhile to me; scanning the input is potentially expensive, so we should avoid doing it twice. Furthermore, we would still need to coordinate between threads, since two input batches might contain rows bound for the same output writer.
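To make the coordination problem concrete, here is a minimal standalone sketch (not Arrow's actual WriterSet; all names are illustrative) of a writer set that lazily opens one writer per partition directory under a lock. The lock is exactly what the two-scan alternative would remove, at the cost of scanning the input twice:

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for a per-directory output writer.
struct Writer {
  explicit Writer(std::string path) : path(std::move(path)) {}
  std::string path;
  // ... would wrap an open output stream in a real implementation
};

class WriterSet {
 public:
  // Returns the writer for `partition_dir`, creating it on first use.
  // Two threads grouping different batches may discover the same
  // partition value concurrently, so creation must be guarded.
  std::shared_ptr<Writer> GetOrCreate(const std::string& partition_dir) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = writers_.find(partition_dir);
    if (it == writers_.end()) {
      it = writers_.emplace(partition_dir,
                            std::make_shared<Writer>(partition_dir)).first;
    }
    return it->second;
  }

 private:
  std::mutex mutex_;
  std::map<std::string, std::shared_ptr<Writer>> writers_;
};
```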
python/pyarrow/dataset.py (Outdated)
Should we provide a default template here?
Can e.g. the format object have a property with the default name to use? (Or get the extension from there and use it in a default?)
@jorisvandenbossche Added, PTAL
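For illustration, a hedged sketch of what such a format-derived default could look like; `DefaultExtension` and `DefaultBasenameTemplate` are hypothetical names, not the actual API, though the "part-{i}" shape matches the "part-0.feather" file names in the tests further down:

```cpp
#include <string>

// Hypothetical mapping from a format's name to its file extension;
// the "ipc" -> "feather" case is illustrative only.
std::string DefaultExtension(const std::string& format_name) {
  return format_name == "ipc" ? "feather" : format_name;
}

// "{i}" is replaced with an auto-incremented integer per output file.
std::string DefaultBasenameTemplate(const std::string& format_name) {
  return "part-{i}." + DefaultExtension(format_name);
}
```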
(force-pushed ef5fc61 to b79a95c)
nealrichardson left a comment:
+1 from me, thanks for doing this!
cpp/src/arrow/util/mutex.h (Outdated)
I would expect a Lock() method as well.
I'd prefer to continue acquiring new locks exclusively through Mutex::Lock; there's no loss of generality, and it keeps Guard as simple as possible.
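A minimal sketch of the design being defended, assuming only the general shape (not the actual implementation) of arrow::util::Mutex: Lock() is the sole acquisition point, and Guard is a pure RAII handle with no methods of its own:

```cpp
#include <mutex>

class Mutex {
 public:
  class Guard {
   public:
    // Movable, so Lock() can return it by value; releases on destruction.
    Guard(Guard&&) = default;
    Guard& operator=(Guard&&) = default;

   private:
    friend class Mutex;
    explicit Guard(std::mutex& m) : lock_(m) {}
    std::unique_lock<std::mutex> lock_;
  };

  // The single entry point for acquiring the lock.
  Guard Lock() { return Guard(mutex_); }

 private:
  std::mutex mutex_;
};
```

Adding a Lock() method to Guard as well would make it both an acquisition point and a handle, complicating its invariants without enabling anything Mutex::Lock cannot already express.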
(force-pushed 370e2a0 to bc3b106)
Is this done, or what is left?

@pitrou are you planning to review C++ again?

The C++ changes addressed my comments. It would be nice, though, if @fsaintjacques could take a look.
jorisvandenbossche left a comment:
Did a pass over the Python code.
python/pyarrow/_dataset.pyx (Outdated)
Does this change behaviour? It seems you are now creating a single fragment instead of a vector of fragments?
FileSystemDataset::Write now parallelizes across scan tasks rather than fragments, so there will be no difference in performance or in the written files even if we create a single in-memory fragment. I changed this to create a single fragment since it's simpler.
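A simplified sketch of what scan-task-level parallelism means here (hypothetical names, standard library only; not Arrow's actual implementation): one unit of work per scan task, regardless of how tasks were grouped into fragments, which is why a single in-memory fragment loses nothing:

```cpp
#include <functional>
#include <future>
#include <vector>

using ScanTask = std::function<void()>;

void WriteAll(const std::vector<ScanTask>& tasks) {
  std::vector<std::future<void>> futures;
  futures.reserve(tasks.size());
  for (const auto& task : tasks) {
    // One unit of parallelism per scan task; fragment boundaries
    // play no role in scheduling.
    futures.push_back(std::async(std::launch::async, task));
  }
  // Wait for completion, propagating any exception from a task.
  for (auto& f : futures) f.get();
}
```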
(force-pushed ef952ef to 20cf19f)
(force-pushed 5cb797e to 5602aa8)
Merging
      target = tempdir / 'single-directory-target'
    - expected_files = [target / "dat_0.ipc", target / "dat_1.ipc"]
    + expected_files = [target / "part-0.feather"]
Why did this change to a single file? (The original has 2 files; I expect the roundtrip to preserve those files.)
After this patch, a single file will be written for each partition directory. In a follow-up we'll add an optional cap on file size.
    - # check that all files are the same in both cases
    - paths1 = [p.relative_to(target1) for p in target1.rglob("*")]
    - paths2 = [p.relative_to(target2) for p in target2.rglob("*")]
    - assert set(paths1) == set(paths2)
Why was this removed? (does it no longer hold?)
It no longer holds consistently; the auto-incremented {i} doesn't necessarily round trip.
There might be no difference, but I think the user should still be able to control how many files are created. Because now, whatever you pass, it's always consolidated into a single file (or one file per partition directory)? Also, it seems that reading and then writing a dataset does not preserve the files? (So if we discover a dataset with multiple files, we write it back as a single file?)

If you're writing with no partitioning then yes, everything will be written to a single file. In a follow-up we'll probably add a special case for unpartitioned writing which allocates an output file for each thread, purely for performance reasons.
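Purely as illustration of that follow-up idea (nothing here is in this patch), a hypothetical sketch of giving each thread its own output file, so unpartitioned writes need no cross-thread locking at all:

```cpp
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

void WriteUnpartitioned(int num_threads) {
  std::vector<std::thread> threads;
  threads.reserve(num_threads);
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back([i] {
      // Each thread owns "part-i" exclusively, so no locking is needed.
      std::string path = "part-" + std::to_string(i) + ".feather";
      // ... open `path` and write this thread's share of the batches
      std::printf("writing %s\n", path.c_str());
    });
  }
  for (auto& t : threads) t.join();
}
```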