Fail query when the symlink file contains inexistent paths #19364

Merged
findepi merged 1 commit into trinodb:master from findinpath:findinpath/hive-symlink-invalid
Oct 12, 2023

Conversation

@findinpath
Contributor

Description

When a symlink Hive table has a symlink.txt file that contains a nonexistent path, fail early with a meaningful exception (similar to what happens in Hive), instead of failing with this misleading exception:

Cannot invoke "io.trino.plugin.hive.fs.TrinoFileStatus.getLength()" because "status" is null
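
The fail-early idea can be sketched in isolation. The snippet below is only an illustration under stated assumptions, not the code in this PR: `SymlinkTargetValidator`, `targetLengths`, and the error text are hypothetical names, and it checks the local filesystem through `java.nio.file` rather than HDFS through Trino's filesystem classes. It shows each symlink target being checked for existence before its length is read, so the failure names the missing path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it is not a Trino class.
final class SymlinkTargetValidator {
    private SymlinkTargetValidator() {}

    // Returns the length of every target listed in the manifest, in order.
    // Throws as soon as a listed target does not exist, naming the offending path.
    static List<Long> targetLengths(List<String> manifestLines) {
        List<Long> lengths = new ArrayList<>();
        for (String line : manifestLines) {
            String entry = line.trim();
            if (entry.isEmpty()) {
                continue; // assumption: blank lines in the manifest are skipped
            }
            Path target = Paths.get(entry);
            if (!Files.exists(target)) {
                // Fail early with the path instead of dereferencing a missing status later.
                throw new IllegalStateException("Symlink target does not exist: " + entry);
            }
            try {
                lengths.add(Files.size(target));
            }
            catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return lengths;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: SymlinkTargetValidator <path-to-symlink.txt>");
            return;
        }
        // Reads a local manifest file and prints the length of each listed target.
        System.out.println(targetLengths(Files.readAllLines(Paths.get(args[0]))));
    }
}
```

Checking existence at the point where the manifest is read keeps the error close to its cause: the message carries the offending entry from symlink.txt, rather than surfacing later as a null dereference during split loading.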

Reproduction scenario

Spin up the product test environment:

testing/bin/ptl env up --environment multinode --config config-default --without-trino

Create the tables in Hive:

CREATE TABLE testsimpleparquet (col integer)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';

insert into testsimpleparquet values (1);
insert into testsimpleparquet values (2);

CREATE TABLE testsymlinkparquet (col integer)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Create a symlink.txt file with the following content:

hdfs://hadoop-master:9000/user/hive/warehouse/testsimpleparquet/000000_0
hdfs://hadoop-master:9000/user/hive/warehouse/testsimpleparquet/000000_0_copy_1_bad_file

The second path, 000000_0_copy_1_bad_file, does not actually exist.

Copy the symlink.txt file to testsymlinkparquet storage:

[hive@hadoop-master tmp]$ hdfs dfs -copyFromLocal symlink.txt /user/hive/warehouse/testsymlinkparquet

Query in Hive:

0: jdbc:hive2://localhost:10000/default> select * from testsymlinkparquet;

error: java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://hadoop-master:9000/user/hive/warehouse/testsimpleparquet/000000_0_copy_1_bad_file (state=,code=0)

Query in Trino:

trino> select * from hive.default.testsymlinkparquet;
Query 20231011_205902_00037_i66g8 failed: Cannot invoke "io.trino.plugin.hive.fs.TrinoFileStatus.getLength()" because "status" is null
io.trino.spi.TrinoException: Cannot invoke "io.trino.plugin.hive.fs.TrinoFileStatus.getLength()" because "status" is null
	at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:294)

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@findinpath findinpath force-pushed the findinpath/hive-symlink-invalid branch from 18038c2 to 98a90ac Compare October 12, 2023 10:49
@findinpath findinpath force-pushed the findinpath/hive-symlink-invalid branch from 98a90ac to 181a536 Compare October 12, 2023 10:50
@findinpath findinpath self-assigned this Oct 12, 2023
Labels

cla-signed hive Hive connector

4 participants