Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Querying Unity Catalog with DuckDB UC extension requires retrying the first SQL statement #8

Open
edwardsdm opened this issue Jul 10, 2024 · 4 comments

Comments

@edwardsdm
Copy link

edwardsdm commented Jul 10, 2024

The information below has been share with the Databricks UC engineering team. They suggested the issue is likely with the DuckDB UC extension.

Credential serving is enabled on the Databricks instance the code is connecting to.

Duckdb CLI installed on MacOS - Retry SHOW TABLES query

image

R - Retry SHOW TABLES query

image

Python - Retry SHOW TABLES query

Additionally even after a successful SHOW TABLES query no tables are found and the table queried successfully in R cannot be queried using Python.

image
@edwardsdm edwardsdm changed the title Querying Unity Catalog with R & Python DuckDB packages requires retrying the first SQL statement Querying Unity Catalog with DuckDB UC extension requires retrying the first SQL statement Jul 10, 2024
@bschulth
Copy link

bschulth commented Jul 11, 2024

At least part of the problem appears to be that the extension is trying to convert column types to an internal type:

if (type_text == "tinyint") {
return LogicalType::TINYINT;
} else if (type_text == "smallint") {
return LogicalType::SMALLINT;
} else if (type_text == "bigint") {
return LogicalType::BIGINT;
} else if (type_text == "int") {
return LogicalType::INTEGER;
} else if (type_text == "long") {
return LogicalType::BIGINT;
} else if (type_text == "string") {
return LogicalType::VARCHAR;
} else if (type_text == "double") {
return LogicalType::DOUBLE;
} else if (type_text == "float") {
return LogicalType::FLOAT;
} else if (type_text == "boolean") {
return LogicalType::BOOLEAN;
} else if (type_text == "timestamp") {
return LogicalType::TIMESTAMP;
} else if (type_text == "binary") {
return LogicalType::BLOB;
} else if (type_text == "date") {
return LogicalType::DATE;
} else if (type_text == "timestamp") {
return LogicalType::TIMESTAMP; // TODO: Is this the right timestamp
} else if (type_text.find("decimal(") == 0) {
size_t spec_end = type_text.find(')');
if (spec_end != string::npos) {
size_t sep = type_text.find(',');
auto prec_str = type_text.substr(8, sep - 8);
auto scale_str = type_text.substr(sep + 1, spec_end - sep - 1);
uint8_t prec = Cast::Operation<string_t, uint8_t>(prec_str);
uint8_t scale = Cast::Operation<string_t, uint8_t>(scale_str);
return LogicalType::DECIMAL(prec, scale);
}
} else if (type_text.find("array<") == 0) {
size_t type_end = type_text.rfind('>'); // find last, to deal with nested
if (type_end != string::npos) {
auto child_type_str = type_text.substr(6, type_end - 6);
auto child_type = UCUtils::TypeToLogicalType(context, child_type_str);
return LogicalType::LIST(child_type);
}
} else if (type_text.find("map<") == 0) {
size_t type_end = type_text.rfind('>'); // find last, to deal with nested
if (type_end != string::npos) {
// TODO: Factor this and struct parsing into an iterator over ',' separated values
vector<LogicalType> key_val;
size_t cur = 4;
auto nested_opens = 0;
for (;;) {
size_t next_sep = cur;
// find the location of the next ',' ignoring nested commas
while (type_text[next_sep] != ',' || nested_opens > 0) {
if (type_text[next_sep] == '<') {
nested_opens++;
} else if (type_text[next_sep] == '>') {
nested_opens--;
}
next_sep++;
if (next_sep == type_end) {
break;
}
}
auto child_str = type_text.substr(cur, next_sep - cur);
auto child_type = UCUtils::TypeToLogicalType(context, child_str);
key_val.push_back(child_type);
if (next_sep == type_end) {
break;
}
cur = next_sep + 1;
}
if (key_val.size() != 2) {
throw NotImplementedException("Invalid map specification with %i types", key_val.size());
}
return LogicalType::MAP(key_val[0], key_val[1]);
}
} else if (type_text.find("struct<") == 0) {
size_t type_end = type_text.rfind('>'); // find last, to deal with nested
if (type_end != string::npos) {
child_list_t<LogicalType> children;
size_t cur = 7;
auto nested_opens = 0;
for (;;) {
size_t next_sep = cur;
// find the location of the next ',' ignoring nested commas
while (type_text[next_sep] != ',' || nested_opens > 0) {
if (type_text[next_sep] == '<') {
nested_opens++;
} else if (type_text[next_sep] == '>') {
nested_opens--;
}
next_sep++;
if (next_sep == type_end) {
break;
}
}
auto child_str = type_text.substr(cur, next_sep - cur);
size_t type_sep = child_str.find(':');
if (type_sep == string::npos) {
throw NotImplementedException("Invalid struct child type specifier: %s", child_str);
}
auto child_name = child_str.substr(0, type_sep);
auto child_type = UCUtils::TypeToLogicalType(context, child_str.substr(type_sep + 1, string::npos));
children.push_back({child_name, child_type});
if (next_sep == type_end) {
break;
}
cur = next_sep + 1;
}
return LogicalType::STRUCT(children);
}
}

But it does not have a mapping for "varchar(xxxx)", so it throws an exception:

throw NotImplementedException("Tried to fallback to unknown type for '%s'", type_text);

Scanning our (databricks) system.information_schema.columns for unique full_data_types

select distinct data_type, full_data_type from system.information_schema.columns where data_type not in ('DECIMAL', 'STRUCT', 'ARRAY', 'MAP') order by data_type, full_data_type

It looks like you might also need to map:

  • varchar(xxxx)
  • char(xxx)

Maybe also the 'unsized' versions for extra diligence?

  • char
  • varchar
image

@samansmink
Copy link
Collaborator

Hi @edwardsdm, thanks for reporting this. We will take a look at this at some point. Currently we have prioritized the development of the Delta extension over the uc_catalog extension so I can not give a timeline yet!

@gdubya gdubya mentioned this issue Sep 26, 2024
@riot-lbrandi
Copy link

Hey team, this issue is also happening to me and I could reproduce exactly what was described here. I can read the metadata for all tables normally after the 3rd attempt, WHICH IS SUPER COOL. But I can't query anything due to permission errors on S3, in my org we need to query that through UC and not using the S3 path directly (that I believe is what is happening under the hood with the Delta extension).

@bschulth
Copy link

bschulth commented Nov 19, 2024

@samansmink

Adding these 4 lines seems generally to fix the issues in our environment:

	} else if (type_text.find("varchar") == 0) {
	    return LogicalType::VARCHAR;
	} else if (type_text.find("char") == 0) {
	    return LogicalType::VARCHAR;

I built the plugin locally with these changes and tested it in our (@edwardsdm and me) environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants