-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12106: [Rust][DataFusion] Support SELECT * from information_schema.tables
#9818
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests illustrate how this feature works concisely I believe
SELECT * from information_schema.tables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why 10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string builder requires a number here but the number or rows isn't known. I'll add a comment explaining 10 is some arbitrary number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use 0 to keep it simple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I felt that 10 was a siimple as 0 -- lol though I don't really care
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not really spent much time with the information_schema. Are the own two types BASE TABLE and SYSTEM TABLE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure to be honest. Postgres has BASE TABLE and VIEW that I saw.
I will spend some time figuring out how what the values of these should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Maybe we can find the standards document somewhere as I assume there is a fixed list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb Here is a link to the Postgres doc: https://www.postgresql.org/docs/current/infoschema-tables.html
Type of the table:
BASE TABLEfor a persistent base table (the normal table type)VIEWfor a viewFOREIGNfor a foreign tableLOCAL TEMPORARYfor a temporary table
Are all our 'tables' kind of LOCAL TEMPORARY?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could definitely see an argument that the csv/parquet TableProviders are temporary, but arbitrary TableProviders might use totally different semantics (e.g. in my case, fetching data from a persistent store that could be classed as a base table). If we want to support different options here, perhaps we need a new TableProvider method to indicate that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think LOCAL TEMPORARY is referring to what happens when you do CREATE TEMPORARY TABLE... in postgres -- so I agree with @returnString that LOCAL TEMPORARY is probably not what we would want.
I will change the information table to say VIEW for the information_schema tables, consistent with Postgres (and SQL server)
|
@alamb my read through looked good (as expected). I can have a play with it early next week. |
|
This looks awesome (and so hot on the heels of the basic catalog impl!) - will take a look tomorrow :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better as &[&str] to avoid clone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah I see in the implementation it is not really possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I could think of to avoid the copy is don't make a CatalogList trait and instead pass around MemoryCatalogList directly. This would allow a function like fn catalog_names(&self) -> impl IntoIterator<Item=&str>
However, given catalog_names are only called when creating the information_schema views, I am not sure this optimization is warranted at this time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be disabled as well - or don't we expect that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth enabling by default, even, and adding an option to disable it here? I can't think of any downsides to enabling it, and for the subset of the info schema that we support, it's in line with the standard (as far as I'm aware without spending stupid money of a copy of the standard 😅)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be disabled as well - or don't we expect that?
I didn't expect people to want to be able to change the setting at runtime on the same Context -- either they would want the information schema or not. However, for consistency with the other settings I think it would be good to allow the flag to be passed in and I will do so
Would it be worth enabling by default, even, and adding an option to disable it here?
There might be some small runtime overhead for always enabling it. I don't have any strong preference on one way or the other. I am happy to change the default. Does anyone else have any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could definitely see an argument that the csv/parquet TableProviders are temporary, but arbitrary TableProviders might use totally different semantics (e.g. in my case, fetching data from a persistent store that could be classed as a base table). If we want to support different options here, perhaps we need a new TableProvider method to indicate that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: MemoryCatalogList instead of MemoryDatabase here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth enabling by default, even, and adding an option to disable it here? I can't think of any downsides to enabling it, and for the subset of the info schema that we support, it's in line with the standard (as far as I'm aware without spending stupid money of a copy of the standard 😅)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a massive issue as this isn't the critical path, but would it be worth doing this wrapping on catalog registration so we avoid query-side allocs here? I wonder if it's possible to remove the CatalogListWithInformationSchema entirely and just do this inside register_catalog instead. So if the info schema is enabled, wrap the catalog up in a CatalogWithInformationSchema before registering it with the list, or otherwise, register the plain catalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the Config can't be changed after the ExecutionContext is created I think this would be a good idea. I will give it a try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in 5151f31
Yes, I agree -- I see no reason to support different values for |
29f391f to
5151f31
Compare
|
Ok, I have addressed all comments and I think this PR is ready to merge (once it gets an approval). |
returnString
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One stray println left in as noted, but otherwise, looks good to me! Looking forward to using this :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoops -- good call -- will remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be a good one for a debug! line actually.
|
I've also logged https://issues.apache.org/jira/browse/ARROW-12132 for the table provider "type" followup. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @alamb . I am sorry: I am obviously missing something: I do not understand the rational of introducing catalogs and schemas.
In the proposal in google docs, the benefits enumerated are the ability to introspect the table and its schemas (i.e. column and types), which I very much agree with. I am just trying to understand the rational for catalogs and schemas (grouping of table names).
IMO Catalogs in postgres are a mechanism over which postgres uses its own system (RDBMS) to perform bookeeping about its own state, away from normal users.
IMO Schemas in postgres are primarily a mechanism over which postgres adds an extra layer of granularity over groups of tables, typically in the context of its role-based model and permission segmentation. Since DataFusion has no concept of user or role or permission, I do not see the why.
There is another reason in postgres to use schemas, and that may be relevant here: schemas reduce naming collisions by allowing users to defining tables in their own schema. However, I only see this being relevant whenever an administrator can control which users can write to which schemas, as that induces a "namespace cleanliness" of the schema (i.e. the DBA can promise the user that the schema is "clean"). IMO such promise requires a role-based model and a permission system attached, no?
Maybe the reason we use catalogs and schemas here is to mimic postgres behavior so that consumers can more easily query using the same SQL?
Sorry for the long comment, I am just trying to understand the logic here.
|
Hey @jorgecarleitao, so I introduced the general concept in #9762 (design doc here), with a mix of rationales Firstly, I saw Andrew's proposal for this info schema work and wanted to support namespacing this correctly, as it works in other DBs (we don't want to disallow people from registering a table called Secondly, in my use case (DataFusion as the execution engine for an actual DB), I have catalogs and schemas as a mechanism for logical organisation. From the proposal doc:
Your point about roles/permissions is totally correct - in my case currently, I'm dynamically building execution contexts based on tables that users are allowed to access according to my external metadata. Users can also create tables, which is again a metadata operation that informs future execution context creation (and these are created within a specific catalog and schema). Lemme know if anything there is unclear, doing lunchbreak replies atm so may have missed something 😄 Edit: oh, and one of the design goals was to ensure that as little as possible changed for use cases that don't care about namespacing - if a singular flat namespace works for a given use case, nothing user-facing should need to change, and you'll just see a default catalog/schema of "datafusion.public" when querying the info schema. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks a lot @alamb and @returnString for this. Went through the code and looks good 💯
|
@jorgecarleitao -- @returnString did a great job explaining the rationale for schemas / catalogs from my perspective. Thanks for checking in on us and making sure it is clear what is going on |
5151f31 to
be47c11
Compare
|
Integration test failure seems unrelated https://github.com/apache/arrow/pull/9818/checks?check_run_id=2219436309 |
# Rationale Accessing the list of tables via `select * from information_schema.tables` (introduced in #9818) is a lot to type See the doc for background: https://docs.google.com/document/d/12cpZUSNPqVH9Z0BBx6O8REu7TFqL-NPPAYCUPpDls1k/edit# # Proposal Add support for `SHOW TABLES` command. # Commentary This is different than both postgres (which uses `\d` in `psql` for this purpose), and MySQL (which uses `DESCRIBE`). I could be convinced that we should not add `SHOW TABLES` at all (and just stay with `select * from information_schema.tables` but I wanted to add the proposal) # Example Use Setup: ``` echo "1,Foo,44.9" > /tmp/table.csv echo "2,Bar,22.1" >> /tmp/table.csv cargo run --bin datafusion-cli ``` Then run : ``` CREATE EXTERNAL TABLE t(a int, b varchar, c float) STORED AS CSV LOCATION '/tmp/table.csv'; > show tables; +---------------+--------------------+------------+------------+ | table_catalog | table_schema | table_name | table_type | +---------------+--------------------+------------+------------+ | datafusion | public | t | BASE TABLE | | datafusion | information_schema | tables | VIEW | +---------------+--------------------+------------+------------+ 2 row in set. Query took 0 seconds. ``` Closes #9847 from alamb/alamb/really_show_tables Authored-by: Andrew Lamb <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
…hema.columns` Builds on the code in #9818 # Rationale Provide schema metadata access (so a user can see what columns exist and their type). See the doc for background: https://docs.google.com/document/d/12cpZUSNPqVH9Z0BBx6O8REu7TFqL-NPPAYCUPpDls1k/edit# I plan to add support for `SHOW COLUMNS` possibly as a follow on PR (though I have found out that `SHOW COLUMNS` and `SHOW TABLES` are not supported by either MySQL or by Postgres 🤔 ) # Changes I chose to add the first 15 columns from `information_schema.columns` You can see the full list in Postgres [here](https://www.postgresql.org/docs/9.5/infoschema-columns.html) and SQL Server [here](https://docs.microsoft.com/en-us/sql/relational-databases/system-information-schema-views/columns-transact-sql?view=sql-server-ver15). There are a bunch more columns that say "Applies to features not available in PostgreSQL" and that don't apply to DataFusion either-- since my usecase is to get the basic schema information out I chose not to add a bunch of columns that are always null at this time. I feel the use of column builders here is somewhat awkward (as it requires many calls to `unwrap`). I am thinking of a follow on PR to refactor this code to use `Vec<String>` and `Vec<u64>` and then create `StringArray` and `UInt64Array` directly from them but for now I just want the functionality # Example use Setup: ``` echo "1,Foo,44.9" > /tmp/table.csv echo "2,Bar,22.1" >> /tmp/table.csv cargo run --bin datafusion-cli ``` Then run : ``` > CREATE EXTERNAL TABLE t(a int, b varchar, c float) STORED AS CSV LOCATION '/tmp/table.csv'; 0 rows in set. Query took 0 seconds. > select table_name, column_name, ordinal_position, is_nullable, data_type from information_schema.columns; +------------+-------------+------------------+-------------+-----------+ | table_name | column_name | ordinal_position | is_nullable | data_type | +------------+-------------+------------------+-------------+-----------+ | t | a | 0 | NO | Int32 | | t | b | 1 | NO | Utf8 | | t | c | 2 | NO | Float32 | +------------+-------------+------------------+-------------+-----------+ 3 row in set. Query took 0 seconds. ``` Closes #9840 from alamb/alamn/information_schema_columns Authored-by: Andrew Lamb <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
Rationale
Provide configurable access to a table list so a user can see what tables exist).
See the doc for background: https://docs.google.com/document/d/12cpZUSNPqVH9Z0BBx6O8REu7TFqL-NPPAYCUPpDls1k/edit#
I plan to add support for
SHOW TABLES(andinformation_schema.columns/SHOW COLUMNS) as follow on PRsExample use:
Setup:
Then run :