Fix for RM 41527, Issue #41 (#42)

Merged: 2 commits merged into EnterpriseDB:master on Aug 23, 2017

Conversation

@gabbasb (Collaborator) commented Aug 23, 2017

Problem Statement:
Hive and Spark both support HiveQL and are compatible, except for the behaviour of the ANALYZE command.
The difference is as follows:
In Hive, ANALYZE is a utility command and does not return any result set, whereas in Spark it returns a result set.
For example, in Hive we get this output:

0: jdbc:hive2://localhost:10000/testdb> analyze table names_tab compute statistics;
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1488090103001_0007
INFO : The url to track the job: http://localhost:8088/proxy/application_1488090103001_0007/
INFO : Starting Job = job_1488090103001_0007, Tracking URL = http://localhost:8088/proxy/application_1488090103001_0007/
INFO : Kill Command = /home/abbasbutt/Projects/hadoop_fdw/hadoop/bin/hadoop job -kill job_1488090103001_0007
INFO : Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
INFO : 2017-08-22 19:08:11,328 Stage-0 map = 0%, reduce = 0%
No rows affected (11.949 seconds)
INFO : 2017-08-22 19:08:15,465 Stage-0 map = 100%, reduce = 0%, Cumulative CPU 0.93 sec
INFO : MapReduce Total cumulative CPU time: 930 msec
INFO : Ended Job = job_1488090103001_0007
INFO : Table testdb.names_tab stats: [numFiles=2, numRows=12, totalSize=76, rawDataSize=64]
0: jdbc:hive2://localhost:10000/testdb> [abbasbutt@localhost bin]$

In Spark we get this output:

0: jdbc:hive2://localhost:10000/my_spark_db> analyze table junk_table compute statistics;
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (1.462 seconds)

Solution:
The CREATE SERVER command already has a client_type option, which until now supported a single value, 'hiveserver2'.
To support ANALYZE on Spark, client_type can now also take the value 'spark'.
If client_type is not specified, the default remains Hive, and ANALYZE will fail when the server is actually Spark.
If the correct client_type is specified, ANALYZE works correctly with Spark.

For example:
postgres=# CREATE EXTENSION hdfs_fdw;
CREATE EXTENSION
postgres=# CREATE SERVER hdfs_svr FOREIGN DATA WRAPPER hdfs_fdw OPTIONS (host '127.0.0.1',port '10000',client_type 'spark');
CREATE SERVER
postgres=# CREATE USER MAPPING FOR abbasbutt server hdfs_svr OPTIONS (username 'ldapadm', password 'ldapadm');
CREATE USER MAPPING
postgres=# CREATE FOREIGN TABLE fnt( a int, name varchar(255)) SERVER hdfs_svr OPTIONS (dbname 'my_spark_db', table_name 'junk_table');
CREATE FOREIGN TABLE
postgres=# ANALYZE fnt;
ANALYZE
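
For comparison, a server pointing at plain Hive (HiveServer2) would simply omit client_type or set it explicitly; a minimal sketch, reusing the host and port from the example above (the server name hive_svr is illustrative):

postgres=# CREATE SERVER hive_svr FOREIGN DATA WRAPPER hdfs_fdw OPTIONS (host '127.0.0.1', port '10000', client_type 'hiveserver2');
CREATE SERVER

Foreign tables defined against such a server ANALYZE exactly as before; only the Spark case needed the new client_type value.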

Update the README file to reflect the following changes (a configuration sketch follows this list):

1. LD_LIBRARY_PATH is no longer required; it is replaced by the GUC hdfs_fdw.jvmpath.
2. A new option, auth_type, is added to CREATE SERVER.
3. A new value, 'spark', is added for client_type.
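
A rough sketch of how these changes surface in configuration; the jvmpath value and the auth_type value 'LDAP' below are assumptions for illustration only, not taken from this change:

-- postgresql.conf: hdfs_fdw.jvmpath replaces the old LD_LIBRARY_PATH requirement;
-- the path shown is hypothetical, point it at the directory containing libjvm.so:
--   hdfs_fdw.jvmpath = '/usr/lib/jvm/jre/lib/amd64/server/'

-- CREATE SERVER using the new auth_type option and the new client_type value;
-- the server name hdfs_svr2 and the auth_type value 'LDAP' are assumed for illustration:
CREATE SERVER hdfs_svr2 FOREIGN DATA WRAPPER hdfs_fdw
    OPTIONS (host '127.0.0.1', port '10000', client_type 'spark', auth_type 'LDAP');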
@ibrarahmad merged commit a360f75 into EnterpriseDB:master on Aug 23, 2017