Spark-3.5: Add spark action to compute partition stats #12450
ComputePartitionStats.java (new file):
```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
package org.apache.iceberg.actions;

import org.apache.iceberg.PartitionStatisticsFile;

/**
 * An action that computes and writes the partition statistics of an Iceberg table. Current
 * snapshot is used by default.
 */
public interface ComputePartitionStats
    extends Action<ComputePartitionStats, ComputePartitionStats.Result> {

  /**
   * Choose the table snapshot to compute partition stats.
   *
   * @param snapshotId long ID of the snapshot for which stats need to be computed
```
Contributor: nit (suggested change): and remove it from the class javadoc, that way it will be in sync with

Member (Author): It was like that before, but I got an explicit comment from @nastra to move that away from the snapshot API. More info: #12450 (comment)
```java
   * @return this for method chaining
   */
  ComputePartitionStats snapshot(long snapshotId);
```
Contributor: Do we need an API to force the computation, in cases when the existing one is corrupted?

Member (Author): The user can unregister the stats from the table using `table.updatePartitionStatistics().removePartitionStatistics(snapshotId).commit();`, so they can force a recompute when the compute API is called again.

Contributor: As an end user invoking a SQL procedure, I would expect this to be available in the procedure, instead of having to use the Java API to achieve it.

Member (Author): The Spark action is similar to the API, so users can do it themselves (unregister and call again). For the call procedure, I will add an option to force refresh.

Contributor: I think the action is the delegate for the procedure. The procedure does input validation and converts the result (from the action's execution) to an internal row.

Member (Author): I don't think it is a strict guideline. Procedures are just a way to provide functionality from SQL; they can combine one or more APIs. Added the force compute. I agree that SQL users need a way to clear stats for a forced compute in case of corruption (even though it is very rare) without depending on APIs. I was neutral about adding it to the Spark action, as users of the Spark action already have API access, but I want to move forward with this work, so I have added it.

Contributor: Sorry to go back and forth on this.
```java
  /** The result of partition statistics collection. */
  interface Result {

    /** Returns statistics file or null if no statistics were collected. */
    PartitionStatisticsFile statisticsFile();
  }
}
```
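For reference, a minimal usage sketch of the interface above; the `action` handle would come from an engine-specific implementation such as the Spark action later in this diff.

```java
import org.apache.iceberg.PartitionStatisticsFile;
import org.apache.iceberg.actions.ComputePartitionStats;

class ComputePartitionStatsUsage {
  static void run(ComputePartitionStats action, long snapshotId) {
    // Select a specific snapshot; without snapshot(...), the current one is used.
    ComputePartitionStats.Result result = action.snapshot(snapshotId).execute();

    PartitionStatisticsFile statsFile = result.statisticsFile();
    if (statsFile == null) {
      return; // nothing was collected, e.g. the table has no snapshot
    }

    System.out.printf(
        "Partition stats for snapshot %d written to %s%n",
        statsFile.snapshotId(), statsFile.path());
  }
}
```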
BaseComputePartitionStats.java (new file):
```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
package org.apache.iceberg.actions;

import javax.annotation.Nullable;
import org.apache.iceberg.PartitionStatisticsFile;
import org.immutables.value.Value;

@Value.Enclosing
@SuppressWarnings("ImmutablesStyle")
@Value.Style(
    typeImmutableEnclosing = "ImmutableComputePartitionStats",
    visibilityString = "PUBLIC",
    builderVisibilityString = "PUBLIC")
interface BaseComputePartitionStats extends ComputePartitionStats {

  @Value.Immutable
  interface Result extends ComputePartitionStats.Result {
    @Override
    @Nullable
    PartitionStatisticsFile statisticsFile();
  }
}
```
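Given the `@Value.Style` above, Immutables generates a public `ImmutableComputePartitionStats` enclosing class whose builders are used by the Spark action below, for example:

```java
// An empty result: the @Nullable statisticsFile simply stays unset.
ComputePartitionStats.Result empty =
    ImmutableComputePartitionStats.Result.builder().build();

// A populated result, assuming `statsFile` is a PartitionStatisticsFile in scope.
ComputePartitionStats.Result withFile =
    ImmutableComputePartitionStats.Result.builder().statisticsFile(statsFile).build();
```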
ComputePartitionStatsSparkAction.java (new file):
```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
package org.apache.iceberg.spark.actions;

import java.io.IOException;
import org.apache.iceberg.PartitionStatisticsFile;
import org.apache.iceberg.PartitionStatsHandler;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.ComputePartitionStats;
import org.apache.iceberg.actions.ImmutableComputePartitionStats;
import org.apache.iceberg.exceptions.RuntimeIOException;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
import org.apache.iceberg.spark.JobGroupInfo;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Computes the stats incrementally, starting after the latest snapshot that has a partition stats
 * file and up to the given snapshot (the current snapshot if not specified), and writes the
 * combined result into a {@link PartitionStatisticsFile} after merging the stats for the given
 * snapshot. Does a full compute if a previous statistics file does not exist. Also registers the
 * {@link PartitionStatisticsFile} in table metadata.
 */
public class ComputePartitionStatsSparkAction
    extends BaseSparkAction<ComputePartitionStatsSparkAction> implements ComputePartitionStats {

  private static final Logger LOG = LoggerFactory.getLogger(ComputePartitionStatsSparkAction.class);
  private static final Result EMPTY_RESULT =
      ImmutableComputePartitionStats.Result.builder().build();

  private final Table table;
  private Snapshot snapshot;

  ComputePartitionStatsSparkAction(SparkSession spark, Table table) {
    super(spark);
    this.table = table;
    this.snapshot = table.currentSnapshot();
  }

  @Override
  protected ComputePartitionStatsSparkAction self() {
    return this;
  }

  @Override
  public ComputePartitionStats snapshot(long newSnapshotId) {
    Snapshot newSnapshot = table.snapshot(newSnapshotId);
```
Contributor: Before fetching the table metadata, we can check whether the new snapshot ID is equal to the initial snapshot ID.

Member (Author): Can you elaborate? What do you mean by initial snapshot ID? The user can pass any snapshot ID, and stats will be computed and committed for that particular snapshot. If the user has not set the snapshot, it uses the current snapshot. This design is the same as the existing one.

Contributor: The 'initial snapshot' I referred to is obtained when constructing the action. When the newly passed snapshot is identical to this initial one, we can avoid fetching the Iceberg metadata file - a small optimization :)
```java
    Preconditions.checkArgument(newSnapshot != null, "Snapshot not found: %s", newSnapshotId);
    this.snapshot = newSnapshot;
    return this;
  }
```
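A sketch of the short-circuit the reviewer suggests above, skipping the metadata lookup when the requested snapshot is already the selected one (not part of this PR):

```java
@Override
public ComputePartitionStats snapshot(long newSnapshotId) {
  // The constructor already loaded the current snapshot, so a matching ID
  // needs no table.snapshot(...) metadata lookup.
  if (snapshot != null && snapshot.snapshotId() == newSnapshotId) {
    return this;
  }

  Snapshot newSnapshot = table.snapshot(newSnapshotId);
  Preconditions.checkArgument(newSnapshot != null, "Snapshot not found: %s", newSnapshotId);
  this.snapshot = newSnapshot;
  return this;
}
```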
```java
  @Override
  public Result execute() {
    if (snapshot == null) {
      LOG.info("No snapshot to compute partition stats for table {}", table.name());
      return EMPTY_RESULT;
    }

    JobGroupInfo info = newJobGroupInfo("COMPUTE-PARTITION-STATS", jobDesc());
    return withJobGroupInfo(info, this::doExecute);
  }

  private Result doExecute() {
    LOG.info("Computing partition stats for {} (snapshot {})", table.name(), snapshot.snapshotId());
```
Contributor: nit (suggested change)

Contributor: I think I'd prefer it without `=`, and we can probably get away from the nested parentheses in a log statement. "Computing partition stats for table {} snapshot {}" is sufficient info imo.
```java
    PartitionStatisticsFile statisticsFile;
    try {
      statisticsFile = PartitionStatsHandler.computeAndWriteStatsFile(table, snapshot.snapshotId());
```
Member (Author): Using the local algorithm instead of the distributed algorithm: we did a POC and JMH benchmarks and found that the serialization cost is high for the distributed approach. Hence, we went with the local algorithm. More info: #9437 (comment)
```java
    } catch (IOException e) {
      throw new RuntimeIOException(e);
    }

    if (statisticsFile == null) {
      return EMPTY_RESULT;
```
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What happens when you truncate a table, do we return a empty stats file or null? Also add a test
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you mean truncate a whole table? Is there a quick Spark API to do that? Since, Iceberg uses DELETE FROM, actual files are not deleted during truncate. Stats will still be computed with delete counters. It will be similar to existing delete tests.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What happens after a major compaction?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There are testcase at API level for compaction. Didn't add the same test case here as these are just wrapper and core functionality already tests this. Compaction, writes new stats and data file count changes per partition after compaction. Rewrite manifest also has test covered. check the core API PR for test coverage: #12629 |
```java
    }

    table.updatePartitionStatistics().setPartitionStatistics(statisticsFile).commit();
```
Contributor: If the statistics for a particular snapshot were already computed, and the user triggered the action a second time, what is the behavior?

Member (Author): Currently, it reads the stats file (of the same snapshot) and writes it into a new stats file without recomputing from manifests. I will change it to a no-op by returning the same file, along with adding info logs that the stats file already exists for the snapshot. The user can unregister the stats from the table using `table.updatePartitionStatistics().removePartitionStatistics(snapshotId).commit();`.

Contributor: +1. For a SQL user, there is no way to figure out why the command completes faster (when returning the existing file) vs taking time (when computing a new one) without a status. Although this might not be an issue using APIs, since the user can check whether a stats file is already available using the table API.

Member (Author): Maybe we need to provide a metadata table to list the statistics files (both table and partition) in the future. I think adding a status is overkill; then users have to think about freshness etc. I will make it a no-op and print in the logs why it is a no-op.

Member (Author): Added the no-op and logs, plus a test case.
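A hedged sketch of the no-op path described above (the final implementation is not shown in this diff): before computing, look for a stats file already registered for the selected snapshot and return it as-is.

```java
// Hypothetical guard at the start of doExecute(): if stats are already
// registered for this snapshot, log and return them instead of recomputing.
PartitionStatisticsFile existing =
    table.partitionStatisticsFiles().stream()
        .filter(file -> file.snapshotId() == snapshot.snapshotId())
        .findAny()
        .orElse(null);
if (existing != null) {
  LOG.info(
      "Partition stats already exist for snapshot {} of table {}, skipping compute",
      snapshot.snapshotId(),
      table.name());
  return ImmutableComputePartitionStats.Result.builder().statisticsFile(existing).build();
}
```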
```java
    return ImmutableComputePartitionStats.Result.builder().statisticsFile(statisticsFile).build();
  }

  private String jobDesc() {
    return String.format(
        "Computing partition stats for %s (snapshot=%s)", table.name(), snapshot.snapshotId());
  }
}
```
Contributor: I'm late to this PR, so my apologies if this has already been discussed, but have we considered just including the ability to compute partition stats in the existing ComputeTableStatsAction? I'm a bit wary of adding too many actions, especially in this case where an external user who is using Spark has to know that there are 2 separate actions for "stats" in general. In my head something like "compute table stats" plus a partition stats option API on that seems better.

I know the actual stats files are of course different, but I'm mainly thinking from an action and API perspective, not exposing too many different concepts. The action implementation can hide the complexity of the "how" for both types of stats. The result type of the existing procedure would be both types of files. It may not be a big deal for sophisticated users who know when to use which, but I think many people will just want to generate stats, and then the sophisticated users can drill down into which ones they want and when to run them.

Of course, if we do try to add flags to the existing procedure, we have to think through the default behavior when not specified (preserving compatibility) and what additional modes we have.

Thoughts @ajantha-bhat @karuppayya @nastra @aokolnychyi @rdblue?
Member (Author): I did analyze this while adding the Spark action.

First, the naming: ComputeTableStatsProcedure clearly says it is table-level statistics. Plugging the partition-level statistics in there didn't seem logical. Also, it takes column arguments to specify which columns need table-level stats, which is currently not applicable for partition stats. Partition stats supports incremental compute and may need an option like force refresh in the call procedure. So the syntax and input arguments (one needs column names and the other needs a boolean) differ with respect to the functionality.

Lastly, like you said, they are independent stats. Having an action to independently compute them is better, I feel.

Spark CBO is only integrated with table-level stats; it doesn't have an interface to use partition-level stats. This action is mainly for the use cases mentioned in the PR description.
Contributor: Thanks @amogh-jahagirdar for bringing this up.

Though we don't need to follow other engines, I can see examples where ANALYZE TABLE is used for both table-level and partition-level stats. I think it would be good to follow a similar pattern for the Iceberg action and procedure, so that users don't have to think about gathering Iceberg stats differently than in other engines. But as highlighted, this comes with the complexity of adding conditional code.

@ajantha-bhat, I think force refresh is something that would be useful for the current table stats action as well. Also, eventually we would need to support incremental stats with the current table stats too.
Member (Author): @karuppayya, if you look at Spark, Hive, and Trino, they have unified stats in one compute.

In Iceberg, table statistics computes just NDV. There is no column-level stats aggregate; you need to read manifests for that. Also, partition stats should eventually include column-level stats, so that later we can aggregate them into table-level stats. @deniskuzZ has a proposal for that.

Currently, table stats is not complete, and partition stats is also very different. So I prefer keeping them separate.

In 2.0, if we unify them, we can have a single procedure and deprecate these.
Contributor: I am not sure completeness of functionality should determine this. I am mainly coming from the end-user perspective of having to use different actions for stats, which differs from most other engines/formats. But I don't have a strong opinion on this; I am fine either way. It would be good to get @aokolnychyi's thoughts here since it was raised by him earlier.
Contributor: @ajantha-bhat Yeah, my take is that stats computation should be best effort; so let's say someone configures both table and partition stats to run, and whichever one runs first fails, we just move on to the next. Of course, at the end, if there's any kind of failure, we should clearly indicate that to the user, but I think the partial-failure case you're referencing can be handled by treating all stats generation as best effort in the implementation.

I guess I'm not understanding why the only additional validation would not just be `Preconditions.checkArgument(columns.isEmpty())` in the `withOnlyPartitionStats`? Are there any validations beyond this? @RussellSpitzer

My breakdown of the pros/cons; @ajantha-bhat feel free to add stuff in case I'm missing something.

Combined action with different options:

Pros: consistent with Hive, Spark, and Trino, where column stats and partition stats are effectively collected through the same SQL. I don't really see a distinction here between the procedure and the action. I feel we should generally try to keep the action API surface area reduced, just as we do with the procedure; it's just that the procedure is SQL-oriented instead of programmatic.

Cons:

Separate actions:

Pros: the main argument I can think of for separate procedures is for people familiar with the details of the statistics and their granularities; having separate procedures is clearer since, at least as of today, they are different.

Cons: the main argument I have against separate procedures is that if we consider that in the long run the options will converge, it seems compelling to keep it combined (again, going back to my feeling of keeping as minimal an API surface area as possible). I think it'd be better to do this to begin with rather than separate them and then undo it later.

In the end, if folks feel strongly about keeping it separate, I think I'm OK; I just want to make sure we're being thoughtful about the implications.
Contributor: I think the argument here I feel strongest about is the mixing of options that are invalid, which we can only determine at runtime. IMHO, we should aim to have APIs with as few gotchas as possible. If the API can't hide the fact that there are two very different things happening under the hood, I don't think they should be combined.

So I think we could keep them in the same action, but only if we have the APIs a little better segregated. Perhaps instead of "columns" you have something like `.withTableLevelNDVs("columns").withPartitionStats()`.

These are breaking changes, I know, but they make clear that there are two different, mutually exclusive things happening.
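For illustration only, the segregated API suggested here might look like the following hypothetical interface; every name in it is an assumption, not part of this PR:

```java
import org.apache.iceberg.PartitionStatisticsFile;
import org.apache.iceberg.StatisticsFile;
import org.apache.iceberg.actions.Action;

// Hypothetical combined action that keeps the two stat types explicit and
// mutually exclusive instead of overloading a single columns argument.
interface ComputeStats extends Action<ComputeStats, ComputeStats.Result> {
  // Table-level NDV stats for the given columns.
  ComputeStats withTableLevelNDVs(String... columns);

  // Partition-level stats, incremental where a previous stats file exists.
  ComputeStats withPartitionStats();

  interface Result {
    // Either file may be null when the corresponding stats were not requested.
    StatisticsFile tableStatisticsFile();

    PartitionStatisticsFile partitionStatisticsFile();
  }
}
```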
Member (Author): Thanks @RussellSpitzer. I don't want to introduce breaking changes right now. Plus, partition stats can evolve; it can even have column-level stats per partition in the future. So, once everything is finalized, we can have a unified API if required and deprecate the existing ones. Until then, I think it is good to keep separate APIs.

I agree on

@amogh-jahagirdar: Thoughts?
Contributor: I definitely agree with this statement, but I do think it should be quite possible to hide the fact that there are 2 different things happening under the hood. I think I'm generally OK (+0.5) with moving ahead as it is laid out here, and I also want to avoid breaking changes. @RussellSpitzer @ajantha-bhat, if we want to move forward with this PR, feel free!
Member (Author): Thanks @amogh-jahagirdar and @RussellSpitzer for the discussion and for helping to conclude on this. I do understand that we have a chance to unify this in the future, once partition stats are mature, by providing a new unified action (to avoid breaking changes). For now, please approve the PR if you are OK with it; we will merge it today. I just rebased it.