Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -346,7 +346,16 @@ public Map<String, String> getBasicStatistics(Partish partish) {
stats.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
}
if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
if (summary.containsKey(SnapshotSummary.TOTAL_EQ_DELETES_PROP) &&

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if onlye one of TOTAL_EQ_DELETES_PROP and TOTAL_POS_DELETES_PROP persists?

summary.containsKey(SnapshotSummary.TOTAL_POS_DELETES_PROP)) {
Long actualRecords =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just share some my thought.
Not sure if i am understand correctly, the delete file in iceberg is also a special data file, and table scan in actual execution stage also should read all related delete files.

That is to say, the actual execution still requires scanning more data than the explain shows.
So, i am not sure if this PR can be give a optimized plans when iceberg table has both data files and delete files.

@deniskuzZ deniskuzZ May 9, 2023

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count(*) should use stats instead of scanning the whole dataset.
To be honest I don't really understand the purpose of 'total-records' if it doesn't reflect an accurate row count. insert 100 rows, delete all, 'total-records'=100 ???

looks like if there are only positional deletes we could get the accurate count by subtrracting “total-position-deletes” from “total-records”
@zhangbutao do you see any issues with that?

Another alternative would be to push row count stats from Hive to Iceberg summary.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count(*) should use stats instead of scanning the whole dataset.

@deniskuzZ Do you mean we can push down min/max to iceberg? I think it's beyond this PR scope and it's not so easy. I find some info: HIVE-27099 select count(*) from table queries all data, and Spark has push down min/max to iceberg apache/iceberg#6622, but Spark will skip pushdown if including delete files https://github.com/apache/iceberg/pull/6622/files#diff-66bfda4bda6d505fe3de7db3b4d6b7923b3711b00e2801846dd7325edcdbf65eR224

@zhangbutao do you see any issues with that?

To be honest, i don't know too much about iceberg statistics at the moment so I can't share any more context.
Maybe after some time I can tell more or Implement some stats pushdown in Hive. :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangbutao, we are already pushing down column stats in puffin format HIVE-27158, however, we are still relying on Iceberg for basic stats.
In case of deletes, it becomes invalid. In this PR we are doing a workaround just for positional deletes use-case until it's fixed on the iceberg side.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Just say something else here. I think TrinoDB has a good implementation about puffin stats, maybe we can refer to some designs from TrinoDB. But, I also think lots of stuff need to be done on the iceberg side.
We can keep looking at the evolution about iceberg stats, as this can give a better cbo, pushdown, etc

totalRecords - (Long.parseLong(summary.get(SnapshotSummary.TOTAL_EQ_DELETES_PROP)) >
Comment thread
simhadri-g marked this conversation as resolved.
Outdated
0 ? 0 : Long.parseLong(summary.get(SnapshotSummary.TOTAL_POS_DELETES_PROP)));
totalRecords = actualRecords > 0 ? actualRecords : totalRecords;
// actualRecords maybe -ve in edge cases
}
stats.put(StatsSetupConst.ROW_COUNT, String.valueOf(totalRecords));
}
if (summary.containsKey(SnapshotSummary.TOTAL_FILE_SIZE_PROP)) {
stats.put(StatsSetupConst.TOTAL_SIZE, summary.get(SnapshotSummary.TOTAL_FILE_SIZE_PROP));
Expand Down
42 changes: 42 additions & 0 deletions iceberg/iceberg-handler/src/test/queries/positive/row_count.q
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
drop table llap_orders;
Comment thread
simhadri-g marked this conversation as resolved.
Outdated

CREATE EXTERNAL TABLE llap_orders (orderid INT, quantity INT, itemid INT, tradets TIMESTAMP) PARTITIONED BY (p1 STRING, p2 STRING) STORED BY ICEBERG STORED AS ORC tblproperties('format-version'='2');


INSERT INTO llap_orders VALUES
(0, 48, 5, timestamp('2000-06-04 19:55:46.129'), 'EU', 'DE'),
(1, 12, 6, timestamp('2007-06-24 19:23:22.829'), 'US', 'TX'),
(2, 76, 4, timestamp('2018-02-19 23:43:51.995'), 'EU', 'DE'),
(3, 91, 5, timestamp('2000-07-15 09:09:11.587'), 'US', 'NJ'),
(4, 18, 6, timestamp('2007-12-02 22:30:39.302'), 'EU', 'ES'),
(5, 71, 5, timestamp('2010-02-08 20:31:23.430'), 'EU', 'DE'),
(6, 78, 3, timestamp('2016-02-22 20:37:37.025'), 'EU', 'FR'),
(7, 88, 0, timestamp('2020-03-26 18:47:40.611'), 'EU', 'FR'),
(8, 87, 4, timestamp('2003-02-20 00:48:09.139'), 'EU', 'ES'),
(9, 60, 6, timestamp('2012-08-28 01:35:54.283'), 'EU', 'IT'),
(10, 24, 5, timestamp('2015-03-28 18:57:50.069'), 'US', 'NY'),
(11, 42, 2, timestamp('2012-06-27 01:13:32.350'), 'EU', 'UK'),
(12, 37, 4, timestamp('2020-08-09 01:18:50.153'), 'US', 'NY'),
(13, 52, 1, timestamp('2019-09-04 01:46:19.558'), 'EU', 'UK'),
(14, 96, 3, timestamp('2019-03-05 22:00:03.020'), 'US', 'NJ'),
(15, 18, 3, timestamp('2001-09-11 00:14:12.687'), 'EU', 'FR'),
(16, 46, 0, timestamp('2013-08-31 02:16:17.878'), 'EU', 'UK'),
(17, 26, 5, timestamp('2001-02-01 20:05:32.317'), 'EU', 'FR'),
(18, 68, 5, timestamp('2009-12-29 08:44:08.048'), 'EU', 'ES'),
(19, 54, 6, timestamp('2015-08-15 01:59:22.177'), 'EU', 'HU'),
(20, 10, 0, timestamp('2018-05-06 12:56:12.789'), 'US', 'CA');

--check row count
select count(*) from llap_orders;
describe formatted llap_orders;

--delete rows
delete from llap_orders where itemid = 6;
delete from llap_orders where itemid = 5;

--check for updated row count
select count(*) from llap_orders;
describe formatted llap_orders;

explain select count(*) from llap_orders;
explain insert into llap_orders select * from llap_orders limit 100000;
Loading