PARQUET-1381: Add merge blocks command to parquet-tools #512
Conversation
  throw new RuntimeException("Illegal row group of 0 rows");
}
Optional<ColumnChunkMetaData> mc = findColumnByPath(block, columnDescriptor.getPath());
if (mc.isPresent()) {
Maybe return mc.map(column -> { ChunkDescriptor chunk = new ChunkDescriptor(columnDescriptor, column, column.getStartingPos(), (int) column.getTotalSize()); return readChunk(f, chunk) }); ?
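Reformatted, the suggested replacement reads as follows; this is just the reviewer's one-liner spread out (with the missing semicolon added), and ChunkDescriptor, readChunk, columnDescriptor and f all come from the surrounding PR code:

return mc.map(column -> {
  ChunkDescriptor chunk = new ChunkDescriptor(columnDescriptor, column,
      column.getStartingPos(), (int) column.getTotalSize());
  return readChunk(f, chunk);
});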
private Optional<ColumnChunkMetaData> findColumnByPath(BlockMetaData block, String[] path) {
  for (ColumnChunkMetaData column : block.getColumns()) {
    if (Arrays.equals(column.getPath().toArray(), (path))) {
I think you could remove the unnecessary parentheses around path.
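That is, the condition would become:

if (Arrays.equals(column.getPath().toArray(), path)) {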
int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
List<ByteBuffer> buffers = new ArrayList<>(numAllocations);

for (int i = 0; i < fullAllocations; i += 1) {
i++
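That is, the loop header would use the idiomatic increment:

for (int i = 0; i < fullAllocations; i++) {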
Force-pushed a33a2e5 to 82ae4a6
@alexeyzavyalov Thank you! Implemented all suggested changes.

Could you please reference the JIRA number in the PR title? The standard format is "PARQUET-###: Description". If there is no JIRA yet, could you please create one? Thanks!
 * @param configuration Hadoop configuration
 * @param schema the schema of the data
 * @param file the file to write to
 * @param schema the schema of the data
This file contains a lot of formatting changes, such as the alignment of the parameter descriptions here, the expansion of the one-liner methods above, or the indentation changes below. These changes are problematic because they add clutter to the commit diff and the git blame history, and more importantly can lead to conflicts in the future when cherry-picking, backporting or merging in general.
For this reason, could you please revert lines that do not contain actual code changes?
Instead of adding a new command, would it make sense to add a parameter like …?
Force-pushed b08975b to 9f3a815
@zivanfi and @nandorKollar, thank you for the quick feedback and good suggestions! I agree with you; I changed the code and added a JIRA task with a description.
Force-pushed 9f3a815 to 3943e19
zivanfi left a comment:
Could you please add unit tests for the code? It's not necessary for the parquet-tools command (we don't have any existing unit tests there either), but it's customary for the core Parquet API. Thanks!
@Override
public void execute(CommandLine options) throws Exception {
  boolean mergeBlocks = options.hasOption('b');
  int maxBlockSize = options.hasOption('l') ? Integer.parseInt(options.getOptionValue('l')) : 128;
I may be mistaken, but I think this will result in a desired row group size of 128 bytes, not 128 megabytes.
I would suggest using the ParquetWriter.DEFAULT_BLOCK_SIZE constant instead of a hard-coded value.
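A sketch of the suggested fix; it assumes the -l value is meant to be given in megabytes, so it is converted to bytes here to match the unit of the library constant:

// ParquetWriter.DEFAULT_BLOCK_SIZE is 128 MB expressed in bytes (134217728);
// requires import org.apache.parquet.hadoop.ParquetWriter
long maxBlockSize = options.hasOption('l')
    ? Long.parseLong(options.getOptionValue('l')) * 1024 * 1024
    : ParquetWriter.DEFAULT_BLOCK_SIZE;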
import org.apache.parquet.column.ColumnWriter;
import org.apache.parquet.schema.PrimitiveType;

public class ParquetTripletUtils {
Since it seems that no code apart from ParquetFileWriter uses this utility class, I'd recommend moving its only method into a private method in ParquetFileWriter.
Force-pushed 3943e19 to d8a92cb
@zivanfi, @nandorKollar thank you! Implemented changes and added tests.
  }
}

public void flushToFileWriter(ColumnDescriptor path, ParquetFileWriter writer) throws IOException {
If I'm not mistaken, the visibility of this method can be restricted to package private
}

private List<ParquetFileReader> getReaders(List<InputFile> inputFiles) throws IOException {
  List<ParquetFileReader> readers = new ArrayList<>();
nit: the ArrayList size is known in advance; it would be nice to pass it as a constructor argument to avoid resizing.
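For example:

List<ParquetFileReader> readers = new ArrayList<>(inputFiles.size());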
readers.forEach(r -> {
  try {
    r.close();
  } catch (IOException e) {
It's not very nice to swallow an exception. Either throw a runtime exception, or at least log that something happened while closing the file.
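A sketch of the logging variant, assuming an slf4j Logger named LOG exists in the class:

readers.forEach(r -> {
  try {
    r.close();
  } catch (IOException e) {
    // report the failure instead of silently dropping it
    LOG.warn("Failed to close Parquet reader", e);
  }
});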
Force-pushed d8a92cb to 1a0a0a8
Force-pushed 1a0a0a8 to d1b8873
Thank you @nandorKollar! I've made the suggested changes.
Thanks, this sounds like a very useful feature; I'd like to have a final look at the PR. One note: it seems you force-push each time. Could you please simply commit the changes instead? If there are multiple commits in a PR, they will be squashed into one when the PR is merged, and if you commit instead of force-pushing, it is easier to see the changes you made in response to the reviews.
    .desc("Merge adjacent blocks into one up to upper bound size limit default to 128 MB")
    .build();

Option limit = Option.builder("l")
Thanks a lot for working on this.
Instead of grabbing some write properties like block size and compression and adding command line options for them, I would suggest allowing any of the Parquet properties to be set in a future-proof way.
Some ideas:
- Have a command line parameter that argument is a key-value pair; the parameter can be used multiple times
- Have a command line parameter that argument is a list of key-value pairs
- Have a command line parameter that argument is a file that contains the key-value pairs
Multiple solutions might also make sense (e.g. set the key-value pairs from command line as well as from file).
The help shall reference some docs or the source code for the up-to-date list of available options. It shall also list some of the most important options, like parquet.block.size, parquet.compression, parquet.page.size or parquet.writer.max-padding.
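A minimal sketch of the first idea, using the same Apache Commons CLI builder the command already uses; the option name "D" and its description wording are assumptions:

// repeatable -D key=value parameter for arbitrary Parquet write properties
Option conf = Option.builder("D")
    .hasArgs()
    .valueSeparator('=')
    .desc("set a Parquet write property, e.g. -D parquet.block.size=134217728; may be repeated")
    .build();
// after parsing, the pairs can be read back as java.util.Properties:
// Properties writeProps = options.getOptionProperties("D");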
It seems that we have several options here with no clear best choice. Since this review is already getting quite long, I think we can defer these to a follow-up change if the need arises. Are you okay with that, Gabor?
Agreed.
 * @return the ByteBuffer blocks
 * @throws IOException if there is an error while reading from the stream
 */
public List<ByteBuffer> readBlocks(SeekableInputStream f, long offset, int length) throws IOException {
The visibility could be changed to package private.
  return buffers;
}

public Optional<PageReader> readColumnInBlock(int blockIndex, ColumnDescriptor columnDescriptor) throws IOException {
The visibility could be restricted to package-private. nit: this method doesn't throw IOException.
}

public static void closeReaders(List<ParquetFileReader> readers) {
  readers.forEach(r -> {
With this logic, if there's an exception when closing a reader, the rest remain unclosed. Though in a tool like this that is not a big deal, if one uses this method for other purposes, it could lead to resource leakage.
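A sketch of a variant that closes every reader even if some of the close calls fail, keeping the first exception and attaching the rest as suppressed:

static void closeReaders(List<ParquetFileReader> readers) throws IOException {
  IOException first = null;
  for (ParquetFileReader reader : readers) {
    try {
      reader.close();
    } catch (IOException e) {
      if (first == null) {
        first = e; // remember the first failure
      } else {
        first.addSuppressed(e); // keep later failures attached to it
      }
    }
  }
  if (first != null) {
    throw first; // surface the failure instead of leaking it
  }
}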
  }
  this.endBlock();
}
this.end(new HashMap<>());
nit: Collections.emptyMap() would be reasonable here.
}
this.end(new HashMap<>());

BlocksCombiner.closeReaders(readers);
Readers should be closed in a finally block to avoid unclosed readers in case of an exception; see the sketch below.
I was also wondering if it would be possible to keep only two files open at the same time: the one Parquet writes to, and the one currently being processed. It seems sufficient to keep only these open: close the current input as soon as its processing is finished, then open the next one. That way we won't keep files open longer than needed, and we won't hold too many files open at the same time for no reason. I'm not sure if this makes sense, just a thought. If it is too complicated to implement, then I'm fine with the current approach too.
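A sketch of the finally-block suggestion; the getReaders call and the merge loop are placeholders standing in for the surrounding PR code:

List<ParquetFileReader> readers = getReaders(inputFiles);
try {
  // ... merge row groups from each reader, then call this.end(...) ...
} finally {
  // runs even if merging throws, so no reader is leaked
  BlocksCombiner.closeReaders(readers);
}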
nandorKollar left a comment:
LGTM
I merged your commit @kgalieva, thanks for your contribution!
Summary: Merge two commits from upstream:
- Revert "PARQUET-1381: Add merge blocks command to parquet-tools (apache#512)" (apache#621)
- PARQUET-1533: TestSnappy() throws OOM exception with Parquet-1485 change (apache#622)
Reviewers: pavi, leisun. Reviewed By: leisun. Differential Revision: https://code.uberinternal.com/D2544359
Summary: Revert "PARQUET-1381: Add merge blocks command to parquet-tools (apache#512)" (apache#621). This reverts commit 863a081. The design of this feature has conceptual problems and also works incorrectly; see PARQUET-1381 for more details.
- PARQUET-1531: Page row count limit causes empty pages to be written from MessageColumnIO (apache#620)
- PARQUET-1544: Possible over-shading of modules (apache#628)
Reviewers: pavi. Reviewed By: pavi. Differential Revision: https://code.uberinternal.com/D2769319
The current implementation of the merge command in parquet-tools doesn't merge row groups; it just places one after the other. This change adds another command that merges small blocks into larger ones, up to a specified size limit in bytes.