Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Mar 26, 2021

Rationale

Provide configurable access to a table list so a user can see what tables exist).

See the doc for background: https://docs.google.com/document/d/12cpZUSNPqVH9Z0BBx6O8REu7TFqL-NPPAYCUPpDls1k/edit#

I plan to add support for SHOW TABLES (and information_schema.columns / SHOW COLUMNS) as follow on PRs

Example use:

Setup:

echo "1" > /tmp/table.csv
echo "2" >> /tmp/table.csv
cargo run --bin datafusion-cli

Then run :

>  CREATE EXTERNAL TABLE t(a int)
STORED AS CSV
LOCATION '/tmp/table.csv';

0 rows in set. Query took 0 seconds.
> select * from information_schema.tables;

+---------------+--------------------+------------+--------------+
| table_catalog | table_schema       | table_name | table_type   |
+---------------+--------------------+------------+--------------+
| datafusion    | public             | t          | BASE TABLE   |
| datafusion    | information_schema | TABLES     | SYSTEM TABLE |
+---------------+--------------------+------------+--------------+
2 row in set. Query took 0 seconds.
z>

@github-actions
Copy link

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests illustrate how this feature works concisely I believe

@alamb
Copy link
Contributor Author

alamb commented Mar 26, 2021

FYI @seddonm1 @returnString @Dandandan

@alamb alamb changed the title ARROW-12106: [Rust][DataFusion] Support `SELECT * from information_schema.tables ARROW-12106: [Rust][DataFusion] Support SELECT * from information_schema.tables Mar 26, 2021
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 10?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string builder requires a number here but the number or rows isn't known. I'll add a comment explaining 10 is some arbitrary number

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use 0 to keep it simple?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I felt that 10 was a siimple as 0 -- lol though I don't really care

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not really spent much time with the information_schema. Are the own two types BASE TABLE and SYSTEM TABLE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure to be honest. Postgres has BASE TABLE and VIEW that I saw.

I will spend some time figuring out how what the values of these should be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Maybe we can find the standards document somewhere as I assume there is a fixed list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb Here is a link to the Postgres doc: https://www.postgresql.org/docs/current/infoschema-tables.html

Type of the table: 

  • BASE TABLE for a persistent base table (the normal table type)
  • VIEW for a view
  • FOREIGN for a foreign table
  • LOCAL TEMPORARY for a temporary table

Are all our 'tables' kind of LOCAL TEMPORARY?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could definitely see an argument that the csv/parquet TableProviders are temporary, but arbitrary TableProviders might use totally different semantics (e.g. in my case, fetching data from a persistent store that could be classed as a base table). If we want to support different options here, perhaps we need a new TableProvider method to indicate that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think LOCAL TEMPORARY is referring to what happens when you do CREATE TEMPORARY TABLE... in postgres -- so I agree with @returnString that LOCAL TEMPORARY is probably not what we would want.

I will change the information table to say VIEW for the information_schema tables, consistent with Postgres (and SQL server)

@seddonm1
Copy link
Contributor

@alamb my read through looked good (as expected). I can have a play with it early next week.

@returnString
Copy link
Contributor

This looks awesome (and so hot on the heels of the basic catalog impl!) - will take a look tomorrow :D

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better as &[&str] to avoid clone?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see in the implementation it is not really possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I could think of to avoid the copy is don't make a CatalogList trait and instead pass around MemoryCatalogList directly. This would allow a function like fn catalog_names(&self) -> impl IntoIterator<Item=&str>

However, given catalog_names are only called when creating the information_schema views, I am not sure this optimization is warranted at this time

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be disabled as well - or don't we expect that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth enabling by default, even, and adding an option to disable it here? I can't think of any downsides to enabling it, and for the subset of the info schema that we support, it's in line with the standard (as far as I'm aware without spending stupid money of a copy of the standard 😅)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be disabled as well - or don't we expect that?

I didn't expect people to want to be able to change the setting at runtime on the same Context -- either they would want the information schema or not. However, for consistency with the other settings I think it would be good to allow the flag to be passed in and I will do so

Would it be worth enabling by default, even, and adding an option to disable it here?

There might be some small runtime overhead for always enabling it. I don't have any strong preference on one way or the other. I am happy to change the default. Does anyone else have any thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could definitely see an argument that the csv/parquet TableProviders are temporary, but arbitrary TableProviders might use totally different semantics (e.g. in my case, fetching data from a persistent store that could be classed as a base table). If we want to support different options here, perhaps we need a new TableProvider method to indicate that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: MemoryCatalogList instead of MemoryDatabase here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth enabling by default, even, and adding an option to disable it here? I can't think of any downsides to enabling it, and for the subset of the info schema that we support, it's in line with the standard (as far as I'm aware without spending stupid money of a copy of the standard 😅)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a massive issue as this isn't the critical path, but would it be worth doing this wrapping on catalog registration so we avoid query-side allocs here? I wonder if it's possible to remove the CatalogListWithInformationSchema entirely and just do this inside register_catalog instead. So if the info schema is enabled, wrap the catalog up in a CatalogWithInformationSchema before registering it with the list, or otherwise, register the plain catalog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the Config can't be changed after the ExecutionContext is created I think this would be a good idea. I will give it a try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in 5151f31

@alamb
Copy link
Contributor Author

alamb commented Mar 28, 2021

I could definitely see an argument that the csv/parquet TableProviders are temporary, but arbitrary TableProviders might use totally different semantics (e.g. in my case, fetching data from a persistent store that could be classed as a base table). If we want to support different options here, perhaps we need a new TableProvider method to indicate that?

Yes, I agree -- I see no reason to support different values for TABLE_TYPE in the initial PR -- and if someone wants that functionality they can perhaps propose a PR to add a methods to the TableProvider

@alamb alamb force-pushed the alamb/show_tables branch from 29f391f to 5151f31 Compare March 28, 2021 18:28
@alamb
Copy link
Contributor Author

alamb commented Mar 29, 2021

Ok, I have addressed all comments and I think this PR is ready to merge (once it gets an approval).

Copy link
Contributor

@returnString returnString left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One stray println left in as noted, but otherwise, looks good to me! Looking forward to using this :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover debugging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops -- good call -- will remove.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be a good one for a debug! line actually.

@returnString
Copy link
Contributor

I've also logged https://issues.apache.org/jira/browse/ARROW-12132 for the table provider "type" followup.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alamb . I am sorry: I am obviously missing something: I do not understand the rational of introducing catalogs and schemas.

In the proposal in google docs, the benefits enumerated are the ability to introspect the table and its schemas (i.e. column and types), which I very much agree with. I am just trying to understand the rational for catalogs and schemas (grouping of table names).

IMO Catalogs in postgres are a mechanism over which postgres uses its own system (RDBMS) to perform bookeeping about its own state, away from normal users.

IMO Schemas in postgres are primarily a mechanism over which postgres adds an extra layer of granularity over groups of tables, typically in the context of its role-based model and permission segmentation. Since DataFusion has no concept of user or role or permission, I do not see the why.

There is another reason in postgres to use schemas, and that may be relevant here: schemas reduce naming collisions by allowing users to defining tables in their own schema. However, I only see this being relevant whenever an administrator can control which users can write to which schemas, as that induces a "namespace cleanliness" of the schema (i.e. the DBA can promise the user that the schema is "clean"). IMO such promise requires a role-based model and a permission system attached, no?

Maybe the reason we use catalogs and schemas here is to mimic postgres behavior so that consumers can more easily query using the same SQL?

Sorry for the long comment, I am just trying to understand the logic here.

@returnString
Copy link
Contributor

returnString commented Mar 29, 2021

Hey @jorgecarleitao, so I introduced the general concept in #9762 (design doc here), with a mix of rationales

Firstly, I saw Andrew's proposal for this info schema work and wanted to support namespacing this correctly, as it works in other DBs (we don't want to disallow people from registering a table called tables, for the sake of a silly example, and having that shadow the system one).

Secondly, in my use case (DataFusion as the execution engine for an actual DB), I have catalogs and schemas as a mechanism for logical organisation. From the proposal doc:

  • Multi-tenant software seeking more stringent data isolation might use catalogs to represent individual tenants
  • Schemas could be used to split tables by producer in a centralised data pipeline
  • In a data warehouse comprising both raw event logs and pre-aggregated summary tables, you could use “raw” and “summary” schemas to organise the tables by type

Your point about roles/permissions is totally correct - in my case currently, I'm dynamically building execution contexts based on tables that users are allowed to access according to my external metadata. Users can also create tables, which is again a metadata operation that informs future execution context creation (and these are created within a specific catalog and schema).

Lemme know if anything there is unclear, doing lunchbreak replies atm so may have missed something 😄

Edit: oh, and one of the design goals was to ensure that as little as possible changed for use cases that don't care about namespacing - if a singular flat namespace works for a given use case, nothing user-facing should need to change, and you'll just see a default catalog/schema of "datafusion.public" when querying the info schema.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks a lot @alamb and @returnString for this. Went through the code and looks good 💯

@alamb
Copy link
Contributor Author

alamb commented Mar 29, 2021

@jorgecarleitao -- @returnString did a great job explaining the rationale for schemas / catalogs from my perspective.

Thanks for checking in on us and making sure it is clear what is going on

@alamb
Copy link
Contributor Author

alamb commented Mar 30, 2021

@alamb alamb closed this in b6982df Mar 30, 2021
alamb added a commit that referenced this pull request Mar 31, 2021
# Rationale
Accessing the list of tables via `select * from information_schema.tables` (introduced in #9818) is a lot to type

See the doc for background: https://docs.google.com/document/d/12cpZUSNPqVH9Z0BBx6O8REu7TFqL-NPPAYCUPpDls1k/edit#

# Proposal

Add support for `SHOW TABLES` command.

# Commentary

This is different than both postgres (which uses `\d` in `psql` for this purpose), and MySQL (which uses `DESCRIBE`).

I could be convinced that we should not add `SHOW TABLES` at all (and just stay with `select * from information_schema.tables` but I wanted to add the proposal)

# Example Use

Setup:
```
echo "1,Foo,44.9" > /tmp/table.csv
echo "2,Bar,22.1" >> /tmp/table.csv
cargo run --bin datafusion-cli
```

Then run :

```
CREATE EXTERNAL TABLE t(a int, b varchar, c float)
STORED AS CSV
LOCATION '/tmp/table.csv';

> show tables;
+---------------+--------------------+------------+------------+
| table_catalog | table_schema       | table_name | table_type |
+---------------+--------------------+------------+------------+
| datafusion    | public             | t          | BASE TABLE |
| datafusion    | information_schema | tables     | VIEW       |
+---------------+--------------------+------------+------------+
2 row in set. Query took 0 seconds.

```

Closes #9847 from alamb/alamb/really_show_tables

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
alamb added a commit that referenced this pull request Apr 1, 2021
…hema.columns`

Builds on the code in #9818

# Rationale

Provide schema metadata access (so a user can see what columns exist and their type).

See the doc for background: https://docs.google.com/document/d/12cpZUSNPqVH9Z0BBx6O8REu7TFqL-NPPAYCUPpDls1k/edit#

I plan to add support for `SHOW COLUMNS` possibly as a follow on PR (though I have found out that `SHOW COLUMNS` and `SHOW TABLES` are not supported by either MySQL or by Postgres 🤔 )

# Changes
I chose to add the first 15 columns from `information_schema.columns` You can see the full list in Postgres [here](https://www.postgresql.org/docs/9.5/infoschema-columns.html) and SQL Server [here](https://docs.microsoft.com/en-us/sql/relational-databases/system-information-schema-views/columns-transact-sql?view=sql-server-ver15).

There are a bunch more columns that say "Applies to features not available in PostgreSQL" and that don't apply to DataFusion either-- since my usecase is to get the basic schema information out I chose not to add a bunch of columns that are always null at this time.

I feel the use of column builders here is somewhat awkward (as it requires many calls to `unwrap`). I am thinking of a follow on PR to refactor this code to use `Vec<String>` and `Vec<u64>` and then create `StringArray` and `UInt64Array` directly from them but for now I just want the functionality

# Example use

Setup:
```
echo "1,Foo,44.9" > /tmp/table.csv
echo "2,Bar,22.1" >> /tmp/table.csv
cargo run --bin datafusion-cli
```

Then run :

```
> CREATE EXTERNAL TABLE t(a int, b varchar, c float)
STORED AS CSV
LOCATION '/tmp/table.csv';
0 rows in set. Query took 0 seconds.

>   select table_name, column_name, ordinal_position, is_nullable, data_type from information_schema.columns;
+------------+-------------+------------------+-------------+-----------+
| table_name | column_name | ordinal_position | is_nullable | data_type |
+------------+-------------+------------------+-------------+-----------+
| t          | a           | 0                | NO          | Int32     |
| t          | b           | 1                | NO          | Utf8      |
| t          | c           | 2                | NO          | Float32   |
+------------+-------------+------------------+-------------+-----------+
3 row in set. Query took 0 seconds.
```

Closes #9840 from alamb/alamn/information_schema_columns

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants