@@ -1616,11 +1616,10 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
Seq(tbl, ext_tbl).foreach { tblName =>
sql(s"INSERT INTO $tblName VALUES (1, 'a', '2019-12-13')")

val expectedSize = 690
Member:
Can we compare against the size before insertion instead, rather than a hard-coded value?

Member Author:
When spark.sql.statistics.size.autoUpdate.enabled is false (the default), table stats are None until ANALYZE TABLE ... is executed.

I updated the test to reflect that.

// analyze table
sql(s"ANALYZE TABLE $tblName COMPUTE STATISTICS NOSCAN")
var tableStats = getTableStats(tblName)
assert(tableStats.sizeInBytes == expectedSize)
val expectedSize = tableStats.sizeInBytes
Member:
Well, technically this is a removal of test coverage, @pan3793.

This test case is a known issue that fails due to Parquet metadata (mostly version string) changes. However, I'd prefer not to remove this test coverage.

Member Author (@pan3793, Aug 29, 2025):
I read the original PR; the intention of this test is to make sure partition stats get updated, even when they equal the existing table stats. The exact value of the table's sizeInBytes does not really matter here.

Generally, asserting the exact size of binary data files like Parquet/ORC does not make sense: it can vary due to metadata changes (as you pointed out, this is likely caused by the version string change), and it can also be affected by the compression codec, since the compressed data length may differ across Snappy versions or platforms.
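The pattern argued for above can be sketched as follows. This is a hypothetical rewrite of the test body, not the PR's actual diff; it assumes the `sql` and `getTableStats` helpers visible in the surrounding `StatisticsSuite` code and runs only inside that Spark test harness:

```scala
// Sketch only: assumes the StatisticsSuite test harness and its
// sql(...) / getTableStats(...) helpers from the diff above.
Seq(tbl, ext_tbl).foreach { tblName =>
  sql(s"INSERT INTO $tblName VALUES (1, 'a', '2019-12-13')")

  // With spark.sql.statistics.size.autoUpdate.enabled=false (the default),
  // table stats remain empty until ANALYZE TABLE runs.
  sql(s"ANALYZE TABLE $tblName COMPUTE STATISTICS NOSCAN")
  val tableStats = getTableStats(tblName)

  // Avoid hard-coding a byte count (e.g. 690): Parquet/ORC file sizes
  // drift with writer version strings and compression codec versions.
  // Assert the stats exist and are positive, and capture the observed
  // value for the later partition-stats comparison instead.
  assert(tableStats.sizeInBytes > 0)
  val expectedSize = tableStats.sizeInBytes

  // NOSCAN computes size only, not row count.
  assert(tableStats.rowCount.isEmpty)
}
```

The design choice here is to assert a relative property (stats exist, and partition stats later match the captured table size) rather than an absolute file size, which keeps the coverage the reviewer wants without being brittle to metadata changes.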

assert(tableStats.rowCount.isEmpty)

sql(s"ANALYZE TABLE $tblName COMPUTE STATISTICS")