Select count(*) fails on big amount of data in spark #58

iwaniwaniwan012 · 2017-12-25T16:02:19Z

I tested hdfs_fdw with Spark. In Spark I created a table from a local file with 100M rows. In Spark beeline I can count the rows with select count(*), but the same query through the PG server fails with an OOM error, and the Spark Thrift server fails too. PG 9.6, Spark 2.2.0, hdfs_fdw 2.0.3.

gabbasb · 2017-12-25T18:22:54Z

Can you please provide the output of the following commands:
beeline : EXPLAIN EXTENDED select count(*) from big_table;
psql : EXPLAIN VERBOSE select count(*) from big_table;

Please note that Aggregate push down is not available in hdfs_fdw 2.0.3.

iwaniwaniwan012 · 2018-01-08T15:15:55Z

explain extended select count(*) from test;
== Parsed Logical Plan ==
'Project [unresolvedalias('count(1), None)]
+- 'UnresolvedRelation test

== Analyzed Logical Plan ==
count(1): bigint
Aggregate [count(1) AS count(1)#28L]
+- MetastoreRelation default, test

== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#28L]
+- Project
   +- MetastoreRelation default, test

== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count(1)#28L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#30L])
      +- HiveTableScan MetastoreRelation default, test

EXPLAIN VERBOSE select count(*) from spark_table;
QUERY PLAN

Aggregate  (cost=1100002.50..1100002.51 rows=1 width=8)
  Output: count(*)
  ->  Foreign Scan on public.spark_table  (cost=100000.00..1100000.00 rows=1000 width=0)
        Output: id, txt, tm
        Remote SQL: SELECT * FROM default.test

gabbasb · 2018-01-09T07:07:54Z

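In the beeline case, a map-reduce job is initiated on the cluster to do the hash aggregate, so only the final count is returned. In the hdfs_fdw case, the remote query is SELECT * FROM default.test, so all rows are selected and fetched first and the count is computed in PostgreSQL, which triggers the OOM error. This will work once we provide support for pushing down aggregates to the Spark/Hive server in the remote query.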
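
To illustrate the difference, here is a minimal Python sketch. It is not hdfs_fdw code: an in-memory SQLite table stands in for the remote `default.test` table, and the helper names (`make_remote`, `count_without_pushdown`, `count_with_pushdown`) are hypothetical. It only shows how many rows have to leave the remote side in each case.

```python
import sqlite3


def make_remote(n_rows):
    """Build an in-memory SQLite table standing in for the remote Spark/Hive table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (id INTEGER, txt TEXT, tm TEXT)")
    conn.executemany(
        "INSERT INTO test VALUES (?, ?, ?)",
        ((i, "row %d" % i, "2017-12-25") for i in range(n_rows)),
    )
    return conn


def count_without_pushdown(conn):
    """Mimic the plan in this issue: remote SQL is SELECT *, counting is done locally."""
    # Every row is transferred and held in memory before it is counted.
    rows = conn.execute("SELECT * FROM test").fetchall()
    return len(rows), len(rows)  # (count, rows transferred)


def count_with_pushdown(conn):
    """Mimic aggregate pushdown: the remote side counts and returns a single row."""
    rows = conn.execute("SELECT count(*) FROM test").fetchall()
    return rows[0][0], len(rows)  # (count, rows transferred)


remote = make_remote(10000)
print(count_without_pushdown(remote))  # (10000, 10000)
print(count_with_pushdown(remote))     # (10000, 1)
```

Both functions return the same count, but without pushdown the number of rows transferred grows with the table size, which is what exhausts memory at 100M rows.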

iwaniwaniwan012 · 2018-01-09T21:13:29Z

Thanks, are you going to provide this support soon or not?
