
Conversation

@huaxingao (Contributor):

Add a Spark procedure to collect NDV, which will be used for CBO.
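
For illustration, a hypothetical invocation (the procedure name changes during this review from DistinctCountProcedure to AnalyzeTableProcedure, and the argument names are guesses, so treat all names below as placeholders):

// Hypothetical call; procedure and argument names are placeholders, not from this PR
spark.sql("CALL spark_catalog.system.analyze_table(table => 'db.tbl', columns => array('id', 'data'))");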

*/
public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";

public static final String NDV_BLOB = "ndv-blob";
Contributor Author (@huaxingao):

Spark doesn't use Apache DataSketches to collect approximate NDV, so I am adding a new blob type. Hope this is OK.

Member:

@findepi What are you using for NDV stats here? I figure we should have a common blob type.

Member:

'blob' seems a bit redundant, as they are all blobs. Also, looking at the code, it's an approximate NDV, which I didn't get from this name.

Contributor:

Ideally we should use one blob type for NDV, although Spark doesn't have the sketch data. I'm also curious how sketch data is useful as a table-level metric; it is absolutely useful at the file and partition level, since we can merge sketches later.

Contributor Author (@huaxingao):

@szehon-ho Right, we should have a better name for this. I am not sure if we can have a common blob type here. I will wait for @findepi's input before changing this one.

Member:

So is it impossible for us to make Theta sketches using Spark? It would be healthier in the long run if we implemented that.

Member:

I agree with you @RussellSpitzer. BTW, engine interop is the primary reason why we settled on Theta sketches. For Trino it would be easier to go with HLL, since that's what the Trino engine & SPI have supported for years now.

Contributor Author (@huaxingao):

@RussellSpitzer @findepi I agree it would be ideal if Spark could support Theta sketches. I will take a look to see whether it is possible to implement this.

@findepi Besides NDV, Spark also uses other column stats, such as NumOfNulls, Min, Max, etc., for CBO. I am wondering if Trino also uses these stats, and whether they should also be stored in TableMetadata.

Contributor:

I think we should create real Theta sketches. If Spark only needs the NDV integer, then that's great. We can either keep track of the NDV sketch and incrementally update it internally in Iceberg, or we can do it asynchronously. Either way, there should be no need for a different sketch type.

Member:

To be clear, I don't think we need to get this into OSS Spark; I think it's fine if we generate these sketches in user-land code.
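
For reference, a minimal user-land sketch of generating such a sketch with the Apache DataSketches Java library (illustrative only, not code from this PR; the column data is a placeholder):

import java.util.List;
import org.apache.datasketches.theta.CompactSketch;
import org.apache.datasketches.theta.UpdateSketch;

// Build a Theta sketch over one column's values, then take both the serialized
// payload (for a Puffin blob) and the NDV estimate that Spark's CBO needs.
List<String> columnValues = List.of("a", "b", "a"); // placeholder for the scanned column
UpdateSketch sketch = UpdateSketch.builder().build();
for (String value : columnValues) {
  sketch.update(value);
}
CompactSketch compact = sketch.compact();
double approxNdv = compact.getEstimate();   // the NDV number for CBO
byte[] blobPayload = compact.toByteArray(); // bytes for an apache-datasketches-theta-v1 blob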

/**
 * A procedure that computes the approximate NDV (number of distinct values) for the requested
 * columns and writes the result to the table's StatisticsFile.
 */
Contributor Author (@huaxingao):

I am debating whether I should collect only NDV, or also collect everything else, such as max, min, num_nulls, etc., in ANALYZE TABLE. I will collect only NDV for now.

Member:

Yea, we have all those in the Iceberg file-level metadata already; I wonder if it's necessary, as we could combine those to produce an aggregate?

Contributor Author (@huaxingao):

We have file-level metadata for max, min, num_nulls, etc. That's why I was hesitant to include those here. We don't have file-level NDV, though.
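
(For context, those existing file-level stats can already be aggregated from Iceberg's files metadata table; a rough sketch, where the table name and field id 1 are placeholders and an active SparkSession is assumed:)

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Rough sketch: combine existing file-level metadata into table-level aggregates.
// null_value_counts is a map keyed by Iceberg field id; 1 is a placeholder id.
Dataset<Row> agg = spark.sql(
    "SELECT SUM(record_count) AS total_rows, "
        + "SUM(null_value_counts[1]) AS nulls_for_field_1 "
        + "FROM db.tbl.files");
agg.show();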

String query = "SELECT ";
for (int i = 0; i < columnSizes; i++) {
  String colName = columns.getUTF8String(i).toString();
  query += "APPROX_COUNT_DISTINCT(" + colName + "), ";
Member:

Since we are technically not using distinct here, maybe we should be calling the procedure "analyze"?

Contributor Author (@huaxingao):

I will change the procedure name from DistinctCountProcedure to AnalyzeTableProcedure.
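
(Side note on the query-building hunk above: a joiner would avoid the trailing-comma trim done later in the method; a sketch assuming the same columns, columnSizes, and tableName inputs:)

import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Equivalent query construction without manual comma handling (sketch)
String query = IntStream.range(0, columnSizes)
    .mapToObj(i -> "APPROX_COUNT_DISTINCT(" + columns.getUTF8String(i).toString() + ")")
    .collect(Collectors.joining(", ", "SELECT ", " FROM " + tableName));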

for (int i = 0; i < columnSizes; i++) {
  writer.add(
      new Blob(
          StandardBlobTypes.NDV_BLOB,
Member:

The issue with defining a new blob type here is that we probably need to describe it in the spec; otherwise folks won't be able to deserialize it.


TableOperations operations = ((HasTableOperations) table).operations();
FileIO fileIO = operations.io();
String path = operations.metadataFileLocation(String.format("%s.stats", UUID.randomUUID()));
Member:

If it exists, we throw FileNotFoundException? Should we just check and throw a better exception?

Contributor Author (@huaxingao):

You mean AlreadyExistsException, right? Yes, we should check. I guess we can probably keep the AlreadyExistsException but make the error message better.
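
(A sketch of such a check, assuming the fileIO and path from the hunk above:)

import org.apache.iceberg.exceptions.AlreadyExistsException;

// Fail fast with a descriptive message instead of surfacing a low-level IO error
if (fileIO.newInputFile(path).exists()) {
  throw new AlreadyExistsException("Statistics file already exists: %s", path);
}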


String viewName = viewName(args, tableName);
// Create a view for users to query
df.createOrReplaceTempView(viewName);
Member:

Not sure if I missed something; is there a point to keeping it as a view if it's already returned by the procedure?

Contributor Author (@huaxingao):

I kept this as a view so users will have an easy way to query the statistics information after calling the stored procedure.

The main reason I am adding this stored procedure is that I couldn't get agreement to implement ANALYZE TABLE for Data Source V2 in Spark. This stored procedure does something similar to ANALYZE TABLE. Normally, after users analyze a table, they DESCRIBE it to get the statistics. I create a view so users can query them.
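
(Illustrative follow-up, with a placeholder view name:)

// After the procedure runs, users can query the temp view directly
spark.sql("SELECT * FROM ndv_stats_view").show();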

}

query = query.substring(0, query.length() - 2) + " FROM " + tableName;
Dataset<Row> df = spark().sql(query);
Member:

@RussellSpitzer, @flyrain, @huaxingao: Would it be good to have a Spark action first and call that action from this procedure? That way, users who only use the APIs can also leverage this feature.

Contributor:

That sounds reasonable. RewriteManifestsProcedure did the same thing.
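
(A rough outline of that pattern, using a hypothetical ComputeTableStats-style action; all names below are placeholders, and RewriteManifestsProcedure delegates to its action the same way:)

// Hypothetical action-based variant: the procedure becomes a thin wrapper around
// a reusable Spark action, so API-only users get the same feature.
StatisticsFile statsFile =
    SparkActions.get(spark())
        .computeTableStats(table) // hypothetical action mirroring this procedure's logic
        .columns("id", "data")    // placeholder column names
        .execute()
        .statisticsFile();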

OutputFile outputFile = fileIO.newOutputFile(path);

try (PuffinWriter writer =
    Puffin.write(outputFile).createdBy("Spark DistinctCountProcedure").build()) {
Contributor:

Nit: Can we move "Spark DistinctCountProcedure" to a separate constant?
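
(i.e., something like:)

// Sketch of the nit: hoist the writer attribution into a named constant
private static final String CREATED_BY = "Spark DistinctCountProcedure";

// then, at the write site:
try (PuffinWriter writer = Puffin.write(outputFile).createdBy(CREATED_BY).build()) {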

@ajantha-bhat (Member):

I saw a new PR on the same: #10288

@github-actions (bot):

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions bot added the stale label on Aug 24, 2024
@github-actions (bot):

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot closed this on Sep 12, 2024