Problem Statement:
Hive and Spark both support HiveQL and are largely compatible, but they
differ in the behaviour of the ANALYZE command.
The difference is as follows:
In Hive, ANALYZE is a utility command and does not return a result set,
whereas in Spark it returns a (possibly empty) result set.
For example:
In Hive we get this output:
0: jdbc:hive2://localhost:10000/testdb> analyze table names_tab compute statistics;
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1488090103001_0007
INFO : The url to track the job: http://localhost:8088/proxy/application_1488090103001_0007/
INFO : Starting Job = job_1488090103001_0007, Tracking URL = http://localhost:8088/proxy/application_1488090103001_0007/
INFO : Kill Command = /home/abbasbutt/Projects/hadoop_fdw/hadoop/bin/hadoop job -kill job_1488090103001_0007
INFO : Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
INFO : 2017-08-22 19:08:11,328 Stage-0 map = 0%, reduce = 0%
No rows affected (11.949 seconds)
INFO : 2017-08-22 19:08:15,465 Stage-0 map = 100%, reduce = 0%, Cumulative CPU 0.93 sec
INFO : MapReduce Total cumulative CPU time: 930 msec
INFO : Ended Job = job_1488090103001_0007
INFO : Table testdb.names_tab stats: [numFiles=2, numRows=12, totalSize=76, rawDataSize=64]
0: jdbc:hive2://localhost:10000/testdb>
In Spark we get this output:
0: jdbc:hive2://localhost:10000/my_spark_db> analyze table junk_table compute statistics;
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (1.462 seconds)
Solution:
The CREATE SERVER command already has a client_type option, which
currently supports a single value, 'hiveserver2'.
To support ANALYZE on Spark, client_type now also accepts the value
'spark'.
If client_type is not specified, the default is 'hiveserver2' (i.e. Hive),
and the ANALYZE command will fail when the remote server is actually Spark.
If the correct client_type is specified, ANALYZE works fine with Spark.
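A server that was originally defined without this option should not need to
be dropped and recreated; the standard ALTER SERVER syntax can presumably add
the option in place. The server name old_hdfs_svr below is only a placeholder,
and SET would be used instead of ADD if client_type had already been set:

postgres=# ALTER SERVER old_hdfs_svr OPTIONS (ADD client_type 'spark');
ALTER SERVER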
For example:
postgres=# CREATE EXTENSION hdfs_fdw;
CREATE EXTENSION
postgres=# CREATE SERVER hdfs_svr FOREIGN DATA WRAPPER hdfs_fdw OPTIONS (host '127.0.0.1',port '10000',client_type 'spark');
CREATE SERVER
postgres=# CREATE USER MAPPING FOR abbasbutt server hdfs_svr OPTIONS (username 'ldapadm', password 'ldapadm');
CREATE USER MAPPING
postgres=# CREATE FOREIGN TABLE fnt( a int, name varchar(255)) SERVER hdfs_svr OPTIONS (dbname 'my_spark_db', table_name 'junk_table');
CREATE FOREIGN TABLE
postgres=# ANALYZE fnt;
ANALYZE
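To confirm that ANALYZE actually gathered statistics for the foreign table,
the standard catalog columns it populates can be inspected; the reported row
estimate should reflect the data in the remote Spark table (the exact numbers
depend on the remote data, so no sample output is shown here):

postgres=# SELECT relname, relpages, reltuples FROM pg_class WHERE relname = 'fnt';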