Skip to content

Conversation

@bkietz
Copy link
Member

@bkietz bkietz commented Sep 29, 2020

Python:

  • ParquetFileFormat.write_options has been removed
  • Added classes {,Parquet,Ipc}FileWriteOptions
  • FileWriteOptions are constructed using FileFormat.make_write_options(...)
  • FileWriteOptions are passed as a parameter to _filesystemdataset_write()

R:

  • FileWriteOptions$create(...) to make write options; no subclasses exposed in R
  • A filter() on the dataset is applied to restrict written rows.

C++:

  • FileSystemDataset::Write's parameters have been consolidated into
    • A Scanner, from which the batches to be written are pulled
    • A FileSystemDatasetWriteOptions, which is an options struct specifying
      • destination filesystem
      • base directory
      • partitioning
      • basenames (via a string template, ex "dat_{i}.feather")
      • format specific write options
  • Format specific write options are represented using the FileWriteOptions hierarchy. An instance of these can be constructed from a format using FileFormat::DefaultWriteOptions(), after which the instance can be modified.
    • ParquetFileFormat::{writer_properties, arrow_writer_properties} have been moved to ParquetFileWriteOptions, an implementation of FileWriteOptions.

Internal C++:

  • Individual files can now be incrementally written using a FileWriter, constructible from a format using FileFormat::MakeWriter
  • FileSystemDataset::Write now parallelizes across scan tasks rather than fragments, so there will be no difference in performance for different arrangements of tables/batches/lists of tables and batches when writing from memory
  • FileSystemDataset::Write::WriteQueue provides a threadsafe channel for batches awaiting write, allowing threads to produce batches as another thread flushes the queue to disk.

@github-actions
Copy link

@jorisvandenbossche
Copy link
Member

@bkietz this removes the ability to specify format specific options? (or it's still WIP?)

@bkietz
Copy link
Member Author

bkietz commented Sep 30, 2020

@jorisvandenbossche yes, not ready for review yet. I will repair format specific write options as part of this PR

@jorisvandenbossche
Copy link
Member

Okido, will wait a bit more then ;-)

@bkietz bkietz force-pushed the 9782-more-configurable-writing branch from b090f12 to b47d07d Compare October 1, 2020 15:17
@bkietz bkietz marked this pull request as ready for review October 1, 2020 18:56
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we know all fragments (and their expressions) already, can we avoid all the locking multi-threading in WriterSet (IIRC, you need them to create the writer once)? That would heavily simplify all of this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this context fragments are the object of writing rather than the target (so for example one might represent an in-memory table which is being copied to disk). Writers are not known ahead of time since they depend on the partitioning which depends on the set of unique values in a given column, which we discover only after running GroupBy on an input batch

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do two scans of the input data:

  1. Assemble a list of all unique values in the partition columns of the data, from which we can determine the precise set of writers to open
  2. Apply groupings to batches, passing the results to pre-opened writers

This doesn't seem worthwhile to me; scanning the input is potentially expensive so we should avoid doing it twice. Furthermore we'll still need to coordinate between threads since two input batches might still contain rows bound for a single output writer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we provide a default template here?
Can eg the format object have a property with the default name to use? (or get the extension from there and use that in a default?)

Copy link
Member Author

@bkietz bkietz Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche Added, PTAL

@bkietz bkietz force-pushed the 9782-more-configurable-writing branch from ef5fc61 to b79a95c Compare October 5, 2020 21:00
Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me, thanks for doing this!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect a Lock() method as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to continue acquiring new locks exclusively through Mutex::Lock; there's no loss of generality and it keeps Guard as simple as possible

@bkietz bkietz force-pushed the 9782-more-configurable-writing branch from 370e2a0 to bc3b106 Compare October 6, 2020 14:24
@nealrichardson
Copy link
Member

Is this done, or what is left?

@bkietz
Copy link
Member Author

bkietz commented Oct 6, 2020

@pitrou are you planning to review C++ again?
@jorisvandenbossche are you planning to review python?

@pitrou
Copy link
Member

pitrou commented Oct 6, 2020

The C++ changes addressed my comments. It would be nice though if @fsaintjacques could take a look.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a pass over the python code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change behaviour? It seems you are now creating a single fragment instead of a vector of fragments?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileSystemDataset::Write now parallelizes across scan tasks rather than fragments so there will be no difference in performance/written files even if we create a single in-memory fragment, so I changed this to create a single fragment since it's simpler

@bkietz bkietz force-pushed the 9782-more-configurable-writing branch from ef952ef to 20cf19f Compare October 7, 2020 17:17
@bkietz bkietz force-pushed the 9782-more-configurable-writing branch from 5cb797e to 5602aa8 Compare October 7, 2020 20:43
@bkietz
Copy link
Member Author

bkietz commented Oct 8, 2020

Merging

@bkietz bkietz closed this in ae396b9 Oct 8, 2020

target = tempdir / 'single-directory-target'
expected_files = [target / "dat_0.ipc", target / "dat_1.ipc"]
expected_files = [target / "part-0.feather"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this change to a single file? (the original has 2 files, I expect the roundtrip to preserve those files)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this patch, a single file will be written for each partition directory. In a follow up we'll add an optional cap on file size

# check that all files are the same in both cases
paths1 = [p.relative_to(target1) for p in target1.rglob("*")]
paths2 = [p.relative_to(target2) for p in target2.rglob("*")]
assert set(paths1) == set(paths2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this removed? (does it no longer hold?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It no longer holds consistently; the auto incremented {i} doesn't necessarily round trip

@jorisvandenbossche
Copy link
Member

FileSystemDataset::Write now parallelizes across scan tasks rather than fragments so there will be no difference in performance/written files even if we create a single in-memory fragment, so I changed this to create a single fragment since it's simpler

There might be no difference, but I think the user should still be able to control how many files are created. Because now whathever you pass, it's always consolidated into a single file? (or one file per partitioning)

Also, it seems that reading/writing a dataset does not preserve the files? (so if we discover a dataset with multiple files, we write it as a single file?)

@bkietz
Copy link
Member Author

bkietz commented Oct 9, 2020

If you're writing with no partitioning then yes, everything will be written to a single file. In a follow up we'll probably add a special case for unpartitioned writing which allocates an output file for each thread just for performance reasons.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants