ORC support integration for Spark 2.4.0 #139
Conversation
Thanks for working on this, @edgarRd! I'll take a look soon.

@edgarRd, can you rebase this on master? We've renamed packages.

@rdblue, I've rebased with the new package names. Thanks!

@rdblue following up on this. Any comment? Thanks!

@edgarRd, thanks for the reminder. I'll have a look.
Resolved review threads (outdated):
spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcReader.java (two threads)
spark/src/main/java/org/apache/iceberg/spark/source/Writer.java
spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
import static org.apache.iceberg.Files.localOutput;

public class TestOrcScan extends AvroDataTest {
The problem with these tests is that they require supporting ORC in the Spark reader and writer. I don't think we should update the Spark reader and writer until many of the problems with ORC support are fixed. For example, ORC doesn't actually use the InputFile and OutputFile abstractions, it creates paths from the locations and relies on Hadoop. It also doesn't support the full suite of features required for Iceberg formats, including column reordering.
What I'd prefer is to write tests like TestSparkParquetReader and TestSparkParquetWriter that don't go through the Spark reader and writer, but test the Spark data model directly. That way, we can remove the changes that actually expose the format until it has full support. Otherwise, we'll just have to remove it and disable the tests for the release.
This should have been on TestOrcWrite, not TestOrcScan. As long as we don't release write support, I'm happy adding read support.
Read support only for now sounds good to me. I'm working on this approach. Thanks for the guidance!
Looks like this still needs to be done as well.
I think I've addressed this concern by creating a test that does not go through the Spark reader and writer, namely in TestSparkOrcReader.
@edgarRd, thanks for working on this! Overall, I think it looks like a great start toward ORC support, but I'd like to add ORC support back a little more carefully this time. Last time, we added ORC support assuming that the remaining problems with it would be fixed rather quickly. Because we haven't seen those fixes, I'd prefer not to expose the write support through Spark (read support is fine). I suggested a way to test this code that doesn't add it to the write path. We can add it to the write path when it can make the same guarantees as the other formats.
@rdblue I've pushed changes that I think should address the comments previously made. However, some concerns like the Hadoop dependency on …
/**
 * Write data value of a schema.
 * @author Edgar Rodriguez-Diaz
Nit: please remove @author tags and any empty tags like @since
/**
 * Reads
 * @param reuse
Looks like Javadoc is incomplete.
@@ -0,0 +1,101 @@
package org.apache.iceberg.orc;
These files need the Apache license header.
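For reference, the standard ASF header that Iceberg sources carry looks like this:

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
```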
Done.
  }
}

private class OrcIterator implements Iterator<T> {
Usually, Iterator classes should be static to ensure that the iterator shares no state with the Iterable other than what was passed in.
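A minimal sketch of that pattern, independent of the ORC classes in this PR: the nested iterator is static and sees only what its constructor was given.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class Numbers implements Iterable<Integer> {
  private final int count;

  Numbers(int count) {
    this.count = count;
  }

  @Override
  public Iterator<Integer> iterator() {
    // pass everything the iterator needs explicitly
    return new NumbersIterator(count);
  }

  // static: no implicit reference to the enclosing Iterable, so no shared state
  private static class NumbersIterator implements Iterator<Integer> {
    private final int count;
    private int next = 0;

    NumbersIterator(int count) {
      this.count = count;
    }

    @Override
    public boolean hasNext() {
      return next < count;
    }

    @Override
    public Integer next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      return next++;
    }
  }
}
```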
private static Reader newFileReader(InputFile file, Configuration config) {
  try {
    return OrcFile.createReader(new Path(file.location()),
Is there a way to use file to open instead of passing a Path?
Unfortunately no, I checked the OrcFile signatures and all (createReader and createWriter) use Path only.
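So the bridge presumably stays limited to handing ORC a Path built from the file location. A rough sketch of what the full call might look like; the exception wrapping here is illustrative, not the PR's actual code:

```java
// The Iceberg InputFile can only contribute its location string, which is
// wrapped in a Hadoop Path because OrcFile.createReader accepts nothing else.
private static Reader newFileReader(InputFile file, Configuration config) {
  try {
    return OrcFile.createReader(new Path(file.location()),
        OrcFile.readerOptions(config));
  } catch (IOException ioe) {
    // illustrative error handling; the PR may wrap this differently
    throw new RuntimeIOException(ioe, "Failed to open ORC file: %s", file.location());
  }
}
```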
try {
  return new VectorizedRowBatchIterator(file.location(), orcSchema, orcFileReader.rows(options));
}
catch (IOException ioe) {
Nit: this should go on the same line as }.
@SuppressWarnings("unchecked")
@Override
public Iterator<T> iterator() {
  return new OrcIterator(orcIter, (OrcValueReader<T>) readerFunction.apply(schema));
This should not use the same VectorizedRowBatchIterator for all iterators. Each iterator should be independent, so this should call newOrcIterator.
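In other words, each call would build a fresh VectorizedRowBatchIterator instead of reusing the shared orcIter field. A rough sketch of the suggestion; newOrcIterator and its arguments are assumed from context, not copied from the PR:

```java
@SuppressWarnings("unchecked")
@Override
public Iterator<T> iterator() {
  // every iterator gets its own batch iterator, so callers don't share read state
  return new OrcIterator(newOrcIterator(file, orcSchema, options),
      (OrcValueReader<T>) readerFunction.apply(schema));
}
```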
final Writer writer;

try {
  writer = OrcFile.createWriter(locPath, options);
Is it possible to pass an output stream?
Unfortunately no, I checked the OrcFile signatures and all (createReader and createWriter) use Path only. Having a similar method to this receiving an output stream would be ideal.
OrcFile.WriterOptions options =
    OrcFile.writerOptions(conf);
return new OrcFileAppender(schema, file, options, metadata);
public <D> FileAppender<D> build() {
This also addresses issue #127.
But this does not actually add support for generics, right?
Yes, you are right @rdblue
 */
public class OrcFileAppender implements FileAppender<VectorizedRowBatch> {
  private final Writer writer;
public class OrcFileAppender<D> implements FileAppender<D> {
Nit: this should be package-private, similar to the other appenders.
@@ -0,0 +1,782 @@
/*
 * Copyright 2018 Hortonworks
Wrong license headers? Here and in other places.
writer.addUserMetadata(COLUMN_NUMBERS_ATTRIBUTE, columnIds.serialize());
metadata.forEach(
    (key,value) -> writer.addUserMetadata(key, ByteBuffer.wrap(value)));
batch = orcSchema.createRowBatch(BATCH_SIZE);
Should the batch size be user-configurable? A large amount of data here could cause memory problems. A default batch size of 1024 is already provided by the API via orcSchema.createRowBatch().
I've added a configuration setting for this value.
OrcFile.WriterOptions options =
    OrcFile.writerOptions(conf);
return new OrcFileAppender(schema, file, options, metadata);
public <D> FileAppender<D> build() {
@rdblue I think Avro and Parquet store the Iceberg schema as a JSON string in their metadata. I'm not sure why that is required, but do you think it makes sense to do that here as well for ORC?
It's probably a good idea, but not required. The requirement is that we can get the file's Iceberg schema from its metadata, and I prefer to build that from the file schema itself, plus the column IDs. That way, we don't have problems from a faulty conversion in an old version.
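If the JSON copy were added anyway, it would only be one more user-metadata entry at write time. A sketch under the assumption of a made-up key name; SchemaParser is Iceberg's existing schema-to-JSON utility:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.apache.iceberg.Schema;
import org.apache.iceberg.SchemaParser;
import org.apache.orc.Writer;

class SchemaMetadataSketch {
  // hypothetical key; not an established Iceberg or ORC constant
  private static final String ICEBERG_SCHEMA_ATTRIBUTE = "iceberg.schema";

  static void addSchemaMetadata(Writer writer, Schema schema) {
    // store the Iceberg schema JSON next to the column IDs written above
    writer.addUserMetadata(ICEBERG_SCHEMA_ATTRIBUTE,
        ByteBuffer.wrap(SchemaParser.toJson(schema).getBytes(StandardCharsets.UTF_8)));
  }
}
```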
public class OrcFileAppender implements FileAppender<VectorizedRowBatch> {
  private final Writer writer;
class OrcFileAppender<D> implements FileAppender<D> {
  private final static int BATCH_SIZE = 1024;
The default batch size is already available as a public constant, org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch#DEFAULT_SIZE. Also see the comment above.
Thanks, I'm using this now.
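A minimal sketch of the configurable batch size with ORC's default as the fallback; the property name is a placeholder mirroring the constant that appears later in the diff:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.TypeDescription;

class BatchSizeSketch {
  // placeholder property name; the PR defines its own VECTOR_ROW_BATCH_SIZE constant
  static final String VECTOR_ROW_BATCH_SIZE = "iceberg.orc.vectorbatch.size";

  static VectorizedRowBatch newBatch(TypeDescription orcSchema, Configuration conf) {
    // fall back to ORC's public default (1024) instead of a private BATCH_SIZE copy
    int batchSize = conf.getInt(VECTOR_ROW_BATCH_SIZE, VectorizedRowBatch.DEFAULT_SIZE);
    return orcSchema.createRowBatch(batchSize);
  }
}
```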
  }
}

public TypeDescription getSchema() {
Is there a requirement for this API?
I've removed this call since it was not used.
/**
 * Reads a value in row.
 */
T read(Object reuse, int row);
Why Object instead of VectorizedRowBatch?
I tried to make this interface generic to potentially re-use it for reading other values in a similar approach as the other file formats are implemented.
I don't think this does quite the same thing that the other interfaces do.
For Avro, a similar interface allows reusing container objects. So a value reader that returns a Record can also accept a Record instance that it will fill with data. The reuse object here is always a VectorizedRowBatch and this returns an InternalRow. So the equivalent would be this:
InternalRow read(VectorizedRowBatch batch, int rowNum, Object rowToReuse);

The rows could be swapped out for some other in-memory container, like Iceberg's GenericRecord.
Unless this is doing something similar to what Avro does, I don't think this is a good change to include in this PR. Maybe we should keep it simple and go with the original code that didn't use a generic interface here.
I agree; for this interface to serve a similar function as in the other formats, it would need more work. I'll set it to VectorizedRowBatch, which is the only usage right now.
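Side by side, the two shapes discussed above look roughly like this; both interfaces are simplified illustrations rather than the PR's exact declarations:

```java
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

// Avro-style reuse: the reader may fill and return the container it was given
interface ValueReaderWithReuse<T> {
  T read(T reuse);
}

// the shape settled on here: read one row out of a VectorizedRowBatch
interface OrcValueReader<T> {
  T read(VectorizedRowBatch batch, int row);
}
```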
Unit tests for reads are passing.
Remove hack for converting decimal to long value.
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.function.Function;
Why did these imports move?
Most likely the IDE moved them. Do we have a standard style for these and a way to enforce it?
OrcFile.WriterOptions options = OrcFile.writerOptions(conf);
return new OrcFileAppender<>(TypeConversion.toOrc(schema, new ColumnIdMap()),
    this.file, createWriterFunc, options, metadata,
    conf.getInt(VECTOR_ROW_BATCH_SIZE, DEFAULT_BATCH_SIZE));
Why not use the ORC property instead of the copy?
This looks about ready to me. I just want to get a few minor things fixed and fix the issue that @rdsr pointed out for the reader API.
Rename orcSchema to readSchema since that's effectively the function it accomplishes.
@omalley, I think this ORC PR is ready to go in. It updates the license headers for ORC files from Hortonworks to the standard ASF header. Could you reply with +1 or -1 for that change?

+1, thanks @edgarRd for fixing up the ORC bindings.

Merged! Thanks for working on the ORC support, @edgarRd!

Great, thanks all for the help!
I noticed that a large portion of the support for ORC in Netflix/iceberg@c59138e#diff-545c12970ccace1ba019c99192569301 was missing.
I've adapted that code to work with Spark 2.4.0, since the API for UnsafeWriter changed. Test cases are passing for reads and writes.