-
Notifications
You must be signed in to change notification settings - Fork 3k
WIP - Add an api for Parallelizing Manifest Reading in ManifestGroup #1420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,20 +1,15 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * Licensed under the Apache License, Version 2.0 (the "License"); | ||
| * you may not use this file except in compliance with the License. | ||
| * You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.iceberg; | ||
|
|
@@ -25,7 +20,7 @@ | |
| import static org.apache.iceberg.types.Types.NestedField.optional; | ||
| import static org.apache.iceberg.types.Types.NestedField.required; | ||
|
|
||
| interface ManifestEntry<F extends ContentFile<F>> { | ||
| public interface ManifestEntry<F extends ContentFile<F>> { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We've really tried to avoid making
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you expand on why making I personally am more fond of this proof of concept as opposed to the other one, so I'd love to get insight into your comment here. If it's just because we would then have to treat this API public, is it possible something could be done to mark it as private to iceberg, equivalent to scala's |
||
| enum Status { | ||
| EXISTING(0), | ||
| ADDED(1), | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,33 @@ | ||
| /* | ||
| * Licensed under the Apache License, Version 2.0 (the "License"); | ||
| * you may not use this file except in compliance with the License. | ||
| * You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.iceberg; | ||
|
|
||
| import java.io.Serializable; | ||
| import java.util.function.BiFunction; | ||
| import org.apache.iceberg.io.CloseableIterable; | ||
| import org.apache.iceberg.io.FileIO; | ||
|
|
||
| public abstract class ManifestProcessor implements Serializable { | ||
| public abstract <T extends ContentFile<T>> Iterable<CloseableIterable<ManifestEntry<T>>> readManifests(final Iterable<ManifestFile> fromIterable, | ||
| BiFunction<ManifestFile, FileIO, CloseableIterable<ManifestEntry<T>>> reader); | ||
|
|
||
| /** | ||
| * A Helper interface for making lambdas transform into the correct type for the ManfiestProcessor | ||
| * @param <T> The ManifestEntry Type being read from Manifest files | ||
| */ | ||
| public interface Func<T extends ContentFile<T>> extends BiFunction<ManifestFile, FileIO, | ||
| CloseableIterable<ManifestEntry<T>>>, Serializable {} | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I know this is a WIP, so I'm partially commenting to follow along. But I'm not a huge fan of this interface being named Overall this is shaping up to be a great addition though.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Check out the other PR too :) And if you have another approach I'd be glad to look into that as well. This is in my opinion definitely a more kludgy approach in Java. The only reason we need this interface is because otherwise you have to do
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah. Ok. Thank you so much! As I've mentioned, in my day job I'm more of a scala developer so I'm much more used to working with all of the well enriched type stuff that's available there. Thankfully I've finally stood all of this up on my own K8s cluster somewhere so I can continue to get more experienced with practical usage of Iceberg (especially between query systems). When you explain it that way though, it makes perfect sense. :) As for the approach being kludgy, if it gets the job done / speeds up manifest reading and makes things like exception stack traces etc more readable than a very large number of lambda functions, I'm not necessarily opposed at all. As a person that spends a lot of time in their current position as a sort of developer advocate helping people with the more esoteric parts of Spark and Flink etc, I can absolutely get behind more simplified type info and less concern with the serializability of functions, especially if the lambdas get raised to real functions so that they have more readable stack traces. :) I'll be sure to check out the other approach. And if I can come up with any approaches or modifications to this approach that might be cleaner, I'll definitely let you know. I think that java doc comment on the interface is likely sufficient to explain |
||
|
|
||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -94,6 +94,11 @@ public interface TableScan { | |
| */ | ||
| TableScan includeColumnStats(); | ||
|
|
||
| /** | ||
| * Doc doc doc | ||
| */ | ||
| TableScan withManifestProcessor(ManifestProcessor processor); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see what you did here to get past the linter ;-) If we decide to use this approach, let's be sure to not merge in
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Of course :) |
||
|
|
||
| /** | ||
| * Create a new {@link TableScan} from this that will read the given data columns. This produces | ||
| * an expected schema that includes all fields that are either selected or used by this scan's | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -54,6 +54,7 @@ abstract class BaseTableScan implements TableScan { | |
| private final Table table; | ||
| private final Schema schema; | ||
| private final TableScanContext context; | ||
| protected ManifestProcessor manifestProcessor; | ||
|
|
||
| protected BaseTableScan(TableOperations ops, Table table, Schema schema) { | ||
| this(ops, table, schema, new TableScanContext()); | ||
|
|
@@ -64,6 +65,7 @@ protected BaseTableScan(TableOperations ops, Table table, Schema schema, TableSc | |
| this.table = table; | ||
| this.schema = schema; | ||
| this.context = context; | ||
| this.manifestProcessor = new LocalManifestProcessor(table.io()); | ||
| } | ||
|
|
||
| protected TableOperations tableOps() { | ||
|
|
@@ -131,6 +133,9 @@ public TableScan useSnapshot(long scanSnapshotId) { | |
| ops, table, schema, context.useSnapshotId(scanSnapshotId)); | ||
| } | ||
|
|
||
|
|
||
|
|
||
|
|
||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: unnecessary whitespace change |
||
| @Override | ||
| public TableScan asOfTime(long timestampMillis) { | ||
| Preconditions.checkArgument(context.snapshotId() == null, | ||
|
|
@@ -275,6 +280,12 @@ public String toString() { | |
| .toString(); | ||
| } | ||
|
|
||
| @Override | ||
| public TableScan withManifestProcessor(ManifestProcessor processor) { | ||
| this.manifestProcessor = processor; | ||
| return this; | ||
| } | ||
|
|
||
| /** | ||
| * To be able to make refinements {@link #select(Collection)} and {@link #caseSensitive(boolean)} in any order, | ||
| * we resolve the schema to be projected lazily here. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| /* | ||
| * Licensed under the Apache License, Version 2.0 (the "License"); | ||
| * you may not use this file except in compliance with the License. | ||
| * You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.iceberg; | ||
|
|
||
| import java.util.function.BiFunction; | ||
| import org.apache.iceberg.io.CloseableIterable; | ||
| import org.apache.iceberg.io.FileIO; | ||
|
|
||
| public class LocalManifestProcessor extends ManifestProcessor { | ||
| private final FileIO io; | ||
|
|
||
| public LocalManifestProcessor(FileIO io){ | ||
| this.io = io; | ||
| } | ||
|
|
||
| @Override | ||
| public <T extends ContentFile<T>> Iterable<CloseableIterable<ManifestEntry<T>>> readManifests( | ||
| Iterable<ManifestFile> fromIterable, | ||
| BiFunction<ManifestFile, FileIO, CloseableIterable<ManifestEntry<T>>> reader) { | ||
| return CloseableIterable.transform( | ||
| CloseableIterable.withNoopClose(fromIterable), manifestFile -> reader.apply(manifestFile, io)); | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| /* | ||
| * Licensed under the Apache License, Version 2.0 (the "License"); | ||
| * you may not use this file except in compliance with the License. | ||
| * You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.iceberg.util; | ||
|
|
||
| import java.io.Serializable; | ||
| import java.util.Objects; | ||
| import java.util.function.Predicate; | ||
| import org.apache.iceberg.relocated.com.google.common.base.Preconditions; | ||
|
|
||
| public interface SerializablePredicate<T> extends Predicate<T>, Serializable { | ||
|
|
||
| @Override | ||
| default SerializablePredicate<T> and(Predicate<? super T> other) { | ||
| Objects.requireNonNull(other); | ||
| Preconditions.checkArgument(other instanceof SerializablePredicate); | ||
| return (T x) -> this.test(x) && other.test(x); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did this not match the rest of the files in the project? I'd prefer not to reformat headers in a different PR if we can avoid it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that this is just a proof of concept PR, to compare with the other idea demonstrated here: #1421
I think once one is decided upon / if either of the versions are decided upon, then @RussellSpitzer will be updating it to be properly formatted etc. Please feel free to correct me if I'm wrong though Russell 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep sorry, I was just trying to get to a prototype real fast here.