Use the FileIO submodule in Spark writers and readers. #52
Conversation
Tricky because the table operations need to be exposed. Added a mixed interface of Table + TableOperations accordingly. The Spark data source is therefore now opinionated about ensuring the returned table also implements HasTableOperations.
```diff
   }

-  protected Table findTable(DataSourceOptions options) {
+  protected TableWithTableOperations findTable(DataSourceOptions options) {
```
@rdblue this is the most interesting question raised by this patch. What do you think?
What about adding io to the Table interface? I'd rather do that since FileIO is a public interface. I think that is mostly what HasTableOperations is used for anyway.
I think that's fine.
```java
/**
 * @return a {@link FileIO} to read and write table data and metadata files
 */
FileIO io();
```
Should TableOperations#io continue to exist?
Yes, because TableOperations is still how implementations are passed.
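A minimal sketch of what that delegation can look like. The interface and class names mirror the discussion, but the bodies are stand-ins, not Iceberg's actual implementation:

```java
// Stand-in interfaces mirroring the PR's shape, not Iceberg's real code.
interface FileIO {}

interface TableOperations {
  FileIO io();
}

interface Table {
  FileIO io();
}

// A table implementation can expose FileIO on the public Table interface
// by delegating to the TableOperations instance it already holds, so the
// two entry points always return the same FileIO.
class BaseTable implements Table {
  private final TableOperations ops;

  BaseTable(TableOperations ops) {
    this.ops = ops;
  }

  @Override
  public FileIO io() {
    return ops.io(); // implementations are still passed via TableOperations
  }
}
```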
```diff
   int hash = HASH_FUNC.apply(partitionAndFilename);
-  return new Path(objectStorePath,
-      String.format("%08x/%s/%s", hash, context, partitionAndFilename));
+  return UriBuilder.fromUri(URI.create(objectStore))
```
The Path/URI methods cause problems with escape characters. This should use strings for reliability.
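A small illustration of the failure mode, using a hypothetical object-store location; `java.net.URI` is stdlib:

```java
import java.net.URI;

public class UriEscapeDemo {
  public static void main(String[] args) {
    // A legal object-store key can contain '#', which URI parses as
    // the start of a fragment rather than part of the path.
    String location = "s3://bucket/table/data/part#1.parquet";

    // URI-based handling silently truncates the path at the '#'.
    System.out.println(URI.create(location).getPath()); // /table/data/part

    // Plain string handling preserves the full location.
    System.out.println(location);
  }
}
```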
```diff
-  TestTableOperations ops() {
+  @Override
+  public TestTableOperations operations() {
```
Why was this rename needed?
The previous version didn't actually override the superclass's method.
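In other words, without `@Override` the compiler accepts a method that silently fails to override. A minimal illustration; the class names echo the diff, and the base class name and bodies are stand-ins:

```java
class TestTableOperations {}

abstract class BaseTestTable {
  abstract TestTableOperations operations();
}

class TestTable extends BaseTestTable {
  // TestTableOperations ops() { ... }  // would compile, but callers of
  // operations() would never reach it: it's a new method, not an override.

  @Override
  TestTableOperations operations() {    // name checked by the compiler
    return new TestTableOperations();
  }
}
```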
Probably because we didn't expose TableOperations at first. Now that FileIO will be public, we can probably remove HasTableOperations in a follow-up.
@mccheah, thanks for working on this! It's about ready to go, but we need to change back to using String instead of URI. There's also an unnecessary rename that we should remove.
```java
Function<PartitionKey, String> outputPathFunc = key ->
    String.format("%s/%s/%s",
        stripTrailingSlash(baseDataPath),
        stripTrailingSlash(stripLeadingSlash(key.toPath())),
```
This is always implemented by PartitionSpec#partitionToPath, which never adds a leading or trailing /, so this can rely on that and skip the extra stripping.
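Assuming that contract, the stripping collapses to plain joining. A sketch with a stand-in `PartitionKey` (not Iceberg's real class):

```java
import java.util.function.Function;

public class OutputPathDemo {
  // Stand-in for Iceberg's PartitionKey; toPath() mirrors
  // PartitionSpec#partitionToPath, which adds no surrounding slashes.
  record PartitionKey(String partitionPath) {
    String toPath() { return partitionPath; }
  }

  // With the no-slash contract, joining is plain string formatting.
  static String outputPath(String baseDataPath, PartitionKey key, String filename) {
    return String.format("%s/%s/%s", baseDataPath, key.toPath(), filename);
  }

  public static void main(String[] args) {
    Function<PartitionKey, String> outputPathFunc = key ->
        outputPath("s3://bucket/table/data", key, "00000-0-data.parquet");

    System.out.println(outputPathFunc.apply(new PartitionKey("date=2019-01-01")));
  }
}
```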
```java
String baseDataPath = stripTrailingSlash(dataLocation); // avoid calling this in the output path function
Function<PartitionKey, String> outputPathFunc = key ->
    String.format("%s/%s/%s",
        stripTrailingSlash(baseDataPath),
```
baseDataPath already stripped the trailing /.
```diff
-  Path baseDataPath = lazyDataPath(); // avoid calling this in the output path function
-  Function<PartitionKey, Path> outputPathFunc = key ->
-      new Path(new Path(baseDataPath, key.toPath()), filename);
+  String baseDataPath = stripTrailingSlash(dataLocation); // avoid calling this in the output path function
```
Since this is an internal implementation, I think that the dataLocation() method should guarantee that the trailing / is removed. That way we never need to do this in the tasks.
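A hedged sketch of that guarantee; the property key and default layout are illustrative, not Iceberg's actual constants:

```java
import java.util.Map;

public class TableLocations {
  private final Map<String, String> properties;
  private final String tableLocation;

  public TableLocations(Map<String, String> properties, String tableLocation) {
    this.properties = properties;
    this.tableLocation = tableLocation;
  }

  // Centralized guarantee: the returned location never ends with '/',
  // so write tasks can join paths without stripping anything themselves.
  public String dataLocation() {
    String location = properties.getOrDefault(
        "write.folder-storage.path",   // illustrative property key
        tableLocation + "/data");
    return stripTrailingSlash(location);
  }

  static String stripTrailingSlash(String path) {
    String result = path;
    while (result.endsWith("/")) {
      result = result.substring(0, result.length() - 1);
    }
    return result;
  }
}
```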
```diff
-      String.format("%08x/%s/%s", hash, context, partitionAndFilename));
+  return String.format(
+      "%s/%08x/%s/%s/%s",
+      stripTrailingSlash(objectStore),
```
Same general idea here as above: this shouldn't strip slashes where normalization can be done in one central place, or where the strings are generated by Iceberg code. We may need to document that other methods must not produce a leading or trailing /, but that is the better way to go.
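The documented contract might look like the following; the interface name and javadoc wording are illustrative, not the project's actual API:

```java
// Illustrative interface capturing the contract: path-producing methods
// must not emit a leading or trailing '/', so callers can join segments
// with plain string formatting and no normalization.
interface PartitionPathSource {
  /**
   * Returns the relative partition path for a data file.
   *
   * Must not start or end with '/'.
   */
  String toPath();
}
```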
@mccheah, I think this is just about ready to merge, but it doesn't need most of the calls to strip slashes. Once those are fixed, I'll merge it. No rush since you're probably just back from the holidays.
Almost there, addressed the comments, fixed the merge conflicts.
Merged. Thanks @mccheah!