Problem Statement:
Hive and Spark both support HiveQL and are largely compatible, but they
differ in the behaviour of the ANALYZE command.
The difference is as follows:
In Hive, ANALYZE is a utility command and does not return a result set,
whereas in Spark it returns a (possibly empty) result set.
For example:
In Hive we get this output:
0: jdbc:hive2://localhost:10000/testdb> analyze table names_tab compute statistics;
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1488090103001_0007
INFO : The url to track the job: http://localhost:8088/proxy/application_1488090103001_0007/
INFO : Starting Job = job_1488090103001_0007, Tracking URL = http://localhost:8088/proxy/application_1488090103001_0007/
INFO : Kill Command = /home/abbasbutt/Projects/hadoop_fdw/hadoop/bin/hadoop job -kill job_1488090103001_0007
INFO : Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
INFO : 2017-08-22 19:08:11,328 Stage-0 map = 0%, reduce = 0%
No rows affected (11.949 seconds)
INFO : 2017-08-22 19:08:15,465 Stage-0 map = 100%, reduce = 0%, Cumulative CPU 0.93 sec
INFO : MapReduce Total cumulative CPU time: 930 msec
INFO : Ended Job = job_1488090103001_0007
INFO : Table testdb.names_tab stats: [numFiles=2, numRows=12, totalSize=76, rawDataSize=64]
0: jdbc:hive2://localhost:10000/testdb>
In Spark we get this output:
0: jdbc:hive2://localhost:10000/my_spark_db> analyze table junk_table compute statistics;
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (1.462 seconds)
Solution:
The CREATE SERVER command already has a client_type option, which
currently supports a single value, 'hiveserver2'.
To support ANALYZE on Spark, client_type now also accepts the value
'spark'.
If client_type is not specified, the default is 'hiveserver2' (i.e. Hive),
and the ANALYZE command will fail when the remote server is actually Spark.
If the correct client_type is specified, ANALYZE works fine with Spark.
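A server that was originally defined without this option should not need to
be dropped and recreated; the standard ALTER SERVER syntax can presumably add
the option in place. The server name old_hdfs_svr below is only a placeholder,
and SET would be used instead of ADD if client_type had already been set:

postgres=# ALTER SERVER old_hdfs_svr OPTIONS (ADD client_type 'spark');
ALTER SERVER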
For example:
postgres=# CREATE EXTENSION hdfs_fdw;
CREATE EXTENSION
postgres=# CREATE SERVER hdfs_svr FOREIGN DATA WRAPPER hdfs_fdw OPTIONS (host '127.0.0.1',port '10000',client_type 'spark');
CREATE SERVER
postgres=# CREATE USER MAPPING FOR abbasbutt server hdfs_svr OPTIONS (username 'ldapadm', password 'ldapadm');
CREATE USER MAPPING
postgres=# CREATE FOREIGN TABLE fnt( a int, name varchar(255)) SERVER hdfs_svr OPTIONS (dbname 'my_spark_db', table_name 'junk_table');
CREATE FOREIGN TABLE
postgres=# ANALYZE fnt;
ANALYZE
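To confirm that ANALYZE actually gathered statistics for the foreign table,
the standard catalog columns it populates can be inspected; the reported row
estimate should reflect the data in the remote Spark table (the exact numbers
depend on the remote data, so no sample output is shown here):

postgres=# SELECT relname, relpages, reltuples FROM pg_class WHERE relname = 'fnt';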