ARROW-12106: [Rust][DataFusion] Support `SELECT * from information_schema.tables` #9818

alamb · 2021-03-26T21:16:29Z

Rationale

Provide configurable access to a table list so a user can see what tables exist).

See the doc for background: https://docs.google.com/document/d/12cpZUSNPqVH9Z0BBx6O8REu7TFqL-NPPAYCUPpDls1k/edit#

I plan to add support for SHOW TABLES (and information_schema.columns / SHOW COLUMNS) as follow on PRs

Example use:

Setup:

echo "1" > /tmp/table.csv
echo "2" >> /tmp/table.csv
cargo run --bin datafusion-cli

Then run :

>  CREATE EXTERNAL TABLE t(a int)
STORED AS CSV
LOCATION '/tmp/table.csv';

0 rows in set. Query took 0 seconds.
> select * from information_schema.tables;

+---------------+--------------------+------------+--------------+
| table_catalog | table_schema       | table_name | table_type   |
+---------------+--------------------+------------+--------------+
| datafusion    | public             | t          | BASE TABLE   |
| datafusion    | information_schema | TABLES     | SYSTEM TABLE |
+---------------+--------------------+------------+--------------+
2 row in set. Query took 0 seconds.
z>

github-actions · 2021-03-26T21:16:54Z

https://issues.apache.org/jira/browse/ARROW-12106

alamb · 2021-03-26T21:17:03Z

rust/datafusion/src/execution/context.rs

These tests illustrate how this feature works concisely I believe

alamb · 2021-03-26T21:17:29Z

FYI @seddonm1 @returnString @Dandandan

seddonm1 · 2021-03-26T21:36:05Z

rust/datafusion/src/catalog/information_schema.rs

The string builder requires a number here but the number or rows isn't known. I'll add a comment explaining 10 is some arbitrary number

Why not use 0 to keep it simple?

I guess I felt that 10 was a siimple as 0 -- lol though I don't really care

seddonm1 · 2021-03-26T21:38:15Z

rust/datafusion/src/catalog/information_schema.rs

I have not really spent much time with the information_schema. Are the own two types BASE TABLE and SYSTEM TABLE?

I am not sure to be honest. Postgres has BASE TABLE and VIEW that I saw.

I will spend some time figuring out how what the values of these should be.

Ok. Maybe we can find the standards document somewhere as I assume there is a fixed list.

@alamb Here is a link to the Postgres doc: https://www.postgresql.org/docs/current/infoschema-tables.html

Type of the table:

BASE TABLE for a persistent base table (the normal table type)

VIEW for a view

FOREIGN for a foreign table

LOCAL TEMPORARY for a temporary table

Are all our 'tables' kind of LOCAL TEMPORARY?

I could definitely see an argument that the csv/parquet TableProviders are temporary, but arbitrary TableProviders might use totally different semantics (e.g. in my case, fetching data from a persistent store that could be classed as a base table). If we want to support different options here, perhaps we need a new TableProvider method to indicate that?

I think LOCAL TEMPORARY is referring to what happens when you do CREATE TEMPORARY TABLE... in postgres -- so I agree with @returnString that LOCAL TEMPORARY is probably not what we would want.

I will change the information table to say VIEW for the information_schema tables, consistent with Postgres (and SQL server)

seddonm1 · 2021-03-26T21:38:43Z

@alamb my read through looked good (as expected). I can have a play with it early next week.

returnString · 2021-03-26T22:16:30Z

This looks awesome (and so hot on the heels of the basic catalog impl!) - will take a look tomorrow :D

Dandandan · 2021-03-26T22:18:45Z

rust/datafusion/src/catalog/catalog.rs

Better as &[&str] to avoid clone?

ah I see in the implementation it is not really possible

The way I could think of to avoid the copy is don't make a CatalogList trait and instead pass around MemoryCatalogList directly. This would allow a function like fn catalog_names(&self) -> impl IntoIterator<Item=&str>

However, given catalog_names are only called when creating the information_schema views, I am not sure this optimization is warranted at this time

Dandandan · 2021-03-26T22:23:31Z

rust/datafusion/src/execution/context.rs

Can it be disabled as well - or don't we expect that?

Would it be worth enabling by default, even, and adding an option to disable it here? I can't think of any downsides to enabling it, and for the subset of the info schema that we support, it's in line with the standard (as far as I'm aware without spending stupid money of a copy of the standard 😅)

Can it be disabled as well - or don't we expect that?

I didn't expect people to want to be able to change the setting at runtime on the same Context -- either they would want the information schema or not. However, for consistency with the other settings I think it would be good to allow the flag to be passed in and I will do so

Would it be worth enabling by default, even, and adding an option to disable it here?

There might be some small runtime overhead for always enabling it. I don't have any strong preference on one way or the other. I am happy to change the default. Does anyone else have any thoughts?

returnString · 2021-03-27T10:50:59Z

rust/datafusion/src/catalog/information_schema.rs

I could definitely see an argument that the csv/parquet TableProviders are temporary, but arbitrary TableProviders might use totally different semantics (e.g. in my case, fetching data from a persistent store that could be classed as a base table). If we want to support different options here, perhaps we need a new TableProvider method to indicate that?

returnString · 2021-03-27T10:56:09Z

rust/datafusion/src/catalog/catalog.rs

Minor: MemoryCatalogList instead of MemoryDatabase here?

returnString · 2021-03-27T11:01:33Z

rust/datafusion/src/execution/context.rs

Would it be worth enabling by default, even, and adding an option to disable it here? I can't think of any downsides to enabling it, and for the subset of the info schema that we support, it's in line with the standard (as far as I'm aware without spending stupid money of a copy of the standard 😅)

returnString · 2021-03-27T11:10:39Z

rust/datafusion/src/execution/context.rs

Not a massive issue as this isn't the critical path, but would it be worth doing this wrapping on catalog registration so we avoid query-side allocs here? I wonder if it's possible to remove the CatalogListWithInformationSchema entirely and just do this inside register_catalog instead. So if the info schema is enabled, wrap the catalog up in a CatalogWithInformationSchema before registering it with the list, or otherwise, register the plain catalog.

Since the Config can't be changed after the ExecutionContext is created I think this would be a good idea. I will give it a try.

in 5151f31

alamb · 2021-03-28T17:09:42Z

I could definitely see an argument that the csv/parquet TableProviders are temporary, but arbitrary TableProviders might use totally different semantics (e.g. in my case, fetching data from a persistent store that could be classed as a base table). If we want to support different options here, perhaps we need a new TableProvider method to indicate that?

Yes, I agree -- I see no reason to support different values for TABLE_TYPE in the initial PR -- and if someone wants that functionality they can perhaps propose a PR to add a methods to the TableProvider

alamb · 2021-03-29T11:07:30Z

Ok, I have addressed all comments and I think this PR is ready to merge (once it gets an approval).

returnString

One stray println left in as noted, but otherwise, looks good to me! Looking forward to using this :)

returnString · 2021-03-29T11:51:53Z

rust/datafusion/src/execution/context.rs

Leftover debugging?

whoops -- good call -- will remove.

might be a good one for a debug! line actually.

returnString · 2021-03-29T11:58:58Z

I've also logged https://issues.apache.org/jira/browse/ARROW-12132 for the table provider "type" followup.

jorgecarleitao

Hi @alamb . I am sorry: I am obviously missing something: I do not understand the rational of introducing catalogs and schemas.

In the proposal in google docs, the benefits enumerated are the ability to introspect the table and its schemas (i.e. column and types), which I very much agree with. I am just trying to understand the rational for catalogs and schemas (grouping of table names).

IMO Catalogs in postgres are a mechanism over which postgres uses its own system (RDBMS) to perform bookeeping about its own state, away from normal users.

IMO Schemas in postgres are primarily a mechanism over which postgres adds an extra layer of granularity over groups of tables, typically in the context of its role-based model and permission segmentation. Since DataFusion has no concept of user or role or permission, I do not see the why.

There is another reason in postgres to use schemas, and that may be relevant here: schemas reduce naming collisions by allowing users to defining tables in their own schema. However, I only see this being relevant whenever an administrator can control which users can write to which schemas, as that induces a "namespace cleanliness" of the schema (i.e. the DBA can promise the user that the schema is "clean"). IMO such promise requires a role-based model and a permission system attached, no?

Maybe the reason we use catalogs and schemas here is to mimic postgres behavior so that consumers can more easily query using the same SQL?

Sorry for the long comment, I am just trying to understand the logic here.

returnString · 2021-03-29T13:48:00Z

Hey @jorgecarleitao, so I introduced the general concept in #9762 (design doc here), with a mix of rationales

Firstly, I saw Andrew's proposal for this info schema work and wanted to support namespacing this correctly, as it works in other DBs (we don't want to disallow people from registering a table called tables, for the sake of a silly example, and having that shadow the system one).

Secondly, in my use case (DataFusion as the execution engine for an actual DB), I have catalogs and schemas as a mechanism for logical organisation. From the proposal doc:

Multi-tenant software seeking more stringent data isolation might use catalogs to represent individual tenants

Schemas could be used to split tables by producer in a centralised data pipeline

In a data warehouse comprising both raw event logs and pre-aggregated summary tables, you could use “raw” and “summary” schemas to organise the tables by type

Your point about roles/permissions is totally correct - in my case currently, I'm dynamically building execution contexts based on tables that users are allowed to access according to my external metadata. Users can also create tables, which is again a metadata operation that informs future execution context creation (and these are created within a specific catalog and schema).

Lemme know if anything there is unclear, doing lunchbreak replies atm so may have missed something 😄

Edit: oh, and one of the design goals was to ensure that as little as possible changed for use cases that don't care about namespacing - if a singular flat namespace works for a given use case, nothing user-facing should need to change, and you'll just see a default catalog/schema of "datafusion.public" when querying the info schema.

jorgecarleitao

LGTM. Thanks a lot @alamb and @returnString for this. Went through the code and looks good 💯

alamb · 2021-03-29T14:38:20Z

@jorgecarleitao -- @returnString did a great job explaining the rationale for schemas / catalogs from my perspective.

Thanks for checking in on us and making sure it is clear what is going on

…hema.tables`

alamb · 2021-03-30T11:51:38Z

Integration test failure seems unrelated

https://github.com/apache/arrow/pull/9818/checks?check_run_id=2219436309

# Rationale Accessing the list of tables via `select * from information_schema.tables` (introduced in #9818) is a lot to type See the doc for background: https://docs.google.com/document/d/12cpZUSNPqVH9Z0BBx6O8REu7TFqL-NPPAYCUPpDls1k/edit# # Proposal Add support for `SHOW TABLES` command. # Commentary This is different than both postgres (which uses `\d` in `psql` for this purpose), and MySQL (which uses `DESCRIBE`). I could be convinced that we should not add `SHOW TABLES` at all (and just stay with `select * from information_schema.tables` but I wanted to add the proposal) # Example Use Setup: ``` echo "1,Foo,44.9" > /tmp/table.csv echo "2,Bar,22.1" >> /tmp/table.csv cargo run --bin datafusion-cli ``` Then run : ``` CREATE EXTERNAL TABLE t(a int, b varchar, c float) STORED AS CSV LOCATION '/tmp/table.csv'; > show tables; +---------------+--------------------+------------+------------+ | table_catalog | table_schema | table_name | table_type | +---------------+--------------------+------------+------------+ | datafusion | public | t | BASE TABLE | | datafusion | information_schema | tables | VIEW | +---------------+--------------------+------------+------------+ 2 row in set. Query took 0 seconds. ``` Closes #9847 from alamb/alamb/really_show_tables Authored-by: Andrew Lamb <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>

…hema.columns` Builds on the code in #9818 # Rationale Provide schema metadata access (so a user can see what columns exist and their type). See the doc for background: https://docs.google.com/document/d/12cpZUSNPqVH9Z0BBx6O8REu7TFqL-NPPAYCUPpDls1k/edit# I plan to add support for `SHOW COLUMNS` possibly as a follow on PR (though I have found out that `SHOW COLUMNS` and `SHOW TABLES` are not supported by either MySQL or by Postgres 🤔 ) # Changes I chose to add the first 15 columns from `information_schema.columns` You can see the full list in Postgres [here](https://www.postgresql.org/docs/9.5/infoschema-columns.html) and SQL Server [here](https://docs.microsoft.com/en-us/sql/relational-databases/system-information-schema-views/columns-transact-sql?view=sql-server-ver15). There are a bunch more columns that say "Applies to features not available in PostgreSQL" and that don't apply to DataFusion either-- since my usecase is to get the basic schema information out I chose not to add a bunch of columns that are always null at this time. I feel the use of column builders here is somewhat awkward (as it requires many calls to `unwrap`). I am thinking of a follow on PR to refactor this code to use `Vec<String>` and `Vec<u64>` and then create `StringArray` and `UInt64Array` directly from them but for now I just want the functionality # Example use Setup: ``` echo "1,Foo,44.9" > /tmp/table.csv echo "2,Bar,22.1" >> /tmp/table.csv cargo run --bin datafusion-cli ``` Then run : ``` > CREATE EXTERNAL TABLE t(a int, b varchar, c float) STORED AS CSV LOCATION '/tmp/table.csv'; 0 rows in set. Query took 0 seconds. > select table_name, column_name, ordinal_position, is_nullable, data_type from information_schema.columns; +------------+-------------+------------------+-------------+-----------+ | table_name | column_name | ordinal_position | is_nullable | data_type | +------------+-------------+------------------+-------------+-----------+ | t | a | 0 | NO | Int32 | | t | b | 1 | NO | Utf8 | | t | c | 2 | NO | Float32 | +------------+-------------+------------------+-------------+-----------+ 3 row in set. Query took 0 seconds. ``` Closes #9840 from alamb/alamn/information_schema_columns Authored-by: Andrew Lamb <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>

github-actions bot added Component: Rust - DataFusion Component: Rust labels Mar 26, 2021

alamb commented Mar 26, 2021

View reviewed changes

rust/datafusion/src/execution/context.rs Outdated

Copy link

Contributor Author

alamb Mar 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests illustrate how this feature works concisely I believe

alamb changed the title ~~ARROW-12106: [Rust][DataFusion] Support `SELECT * from information_schema.tables~~ ARROW-12106: [Rust][DataFusion] Support SELECT * from information_schema.tables Mar 26, 2021

seddonm1 reviewed Mar 26, 2021

View reviewed changes

Dandandan reviewed Mar 26, 2021

View reviewed changes

returnString suggested changes Mar 27, 2021

View reviewed changes

alamb force-pushed the alamb/show_tables branch from 29f391f to 5151f31 Compare March 28, 2021 18:28

returnString approved these changes Mar 29, 2021

View reviewed changes

jorgecarleitao reviewed Mar 29, 2021

View reviewed changes

jorgecarleitao approved these changes Mar 29, 2021

View reviewed changes

alamb added 8 commits March 29, 2021 10:38

ARROW-12106: [Rust][DataFusion] Support `SELECT * from information_sc…

465f397

…hema.tables`

fixup test

cea11e3

fixup comment

ea3586b

use VIEW and clarify intent of initial capacity

df99e08

Add parameter, update tests

f0bc9b0

update test

721c5dd

Inject schema wrapper once rather than on each lookup

8c57e9a

Remove println

be47c11

alamb force-pushed the alamb/show_tables branch from 5151f31 to be47c11 Compare March 29, 2021 14:40

alamb mentioned this pull request Mar 29, 2021

ARROW-12107: [Rust][DataFusion] Support SELECT * from information_schema.columns #9840

Closed

alamb closed this in b6982df Mar 30, 2021

alamb mentioned this pull request Mar 30, 2021

ARROW-12108: [Rust] [DataFusion] Implement SHOW TABLES #9847

Closed

asfimport mentioned this pull request Mar 30, 2021

[Rust][DataFusion] Support SELECT * from information_schema.tables #27929

Closed

ARROW-12106: [Rust][DataFusion] Support SELECT * from information_schema.tables #9818

ARROW-12106: [Rust][DataFusion] Support SELECT * from information_schema.tables #9818

Uh oh!

Conversation

alamb commented Mar 26, 2021

Rationale

Example use:

Uh oh!

github-actions bot commented Mar 26, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Mar 26, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

seddonm1 commented Mar 26, 2021

Uh oh!

returnString commented Mar 26, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Mar 28, 2021

Uh oh!

alamb commented Mar 29, 2021

Uh oh!

returnString left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

returnString commented Mar 29, 2021

Uh oh!

jorgecarleitao left a comment

Choose a reason for hiding this comment

Uh oh!

ARROW-12106: [Rust][DataFusion] Support `SELECT * from information_schema.tables` #9818

ARROW-12106: [Rust][DataFusion] Support `SELECT * from information_schema.tables` #9818

returnString commented Mar 29, 2021 •

edited

Loading