Feature: Schema overriding #7539

Merged
fulghum merged 24 commits into main from fulghum/schema-pinning on Mar 8, 2024
@fulghum fulghum commented Feb 26, 2024

Allows customers to specify a commit, branch, or tag in the @@dolt_schema_override_commit session variable and have every table's data mapped to the schema from that commit, branch, or tag when queried.

Example

As a simple example, consider a database with a main branch that has added a new column, birthday, to a table, and an olderBranch branch whose copy of that table has not been updated with the schema change. Customers cannot reuse queries written against main to query the data on olderBranch because of the schema difference. Setting a schema override lets the customer map the table schemas on olderBranch to the schemas on main. This is useful when you want to run queries on older data but don't want to rewrite your queries for older schemas.

CALL dolt_checkout('olderBranch');

SELECT name, birthday from people;
column "birthday" could not be found in any table in scope

SET @@dolt_schema_override_commit = 'main';
SELECT name, birthday from people;
+-------+----------+
| name  | birthday |
+-------+----------+
| Sammy | NULL     |
+-------+----------+
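The mapping in the example above can be sketched in a few lines of Go. This is only an illustration of the idea, not the code in this PR: the Column and Row types and the mapRowToSchema helper are hypothetical, and matching columns by name is an assumption made for the sketch.

```go
package main

import "fmt"

// Column is a hypothetical, simplified column definition (name only).
type Column struct {
	Name string
}

// Row holds values positionally, in the order of the schema it was stored
// with. A nil entry stands in for SQL NULL.
type Row []any

// mapRowToSchema projects a row stored under the `from` schema onto the `to`
// schema. Columns are matched by name (an assumption for this sketch).
// Columns missing from `from` come back as nil (NULL), and columns that exist
// only in `from` are dropped.
func mapRowToSchema(row Row, from, to []Column) Row {
	idx := make(map[string]int, len(from))
	for i, c := range from {
		idx[c.Name] = i
	}
	out := make(Row, len(to))
	for i, c := range to {
		if j, ok := idx[c.Name]; ok && j < len(row) {
			out[i] = row[j]
		}
	}
	return out
}

func main() {
	olderSchema := []Column{{Name: "name"}}
	mainSchema := []Column{{Name: "name"}, {Name: "birthday"}}

	stored := Row{"Sammy"}
	mapped := mapRowToSchema(stored, olderSchema, mainSchema)
	fmt.Println(mapped) // birthday is nil because olderBranch never stored it
}
```

The nil in the birthday position corresponds to the NULL shown in the query result; a column default, if one existed, is deliberately not applied here, which matches the limitation described below.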

Limitations

The first version of schema override support is subject to several limitations. Please reach out to us and let us know if you'd like any of these to be prioritized.

  • Read-only – when a schema override has been set, only read queries can be executed. Attempting to write data or execute DDL will result in an error about the database being read-only.
  • System tables – Dolt system tables currently do not honor schema overrides.
  • Collation changes – Collations affect how data is sorted when stored on disk. Mapping data from one collation to another requires extra processing to ensure the returned results are sorted according to the mapped collation. This extra processing is not supported yet, so collation changes will not appear when overriding a schema.
  • Column defaults – If the overridden schema has added new columns with column defaults, those defaults are not currently applied when the column is queried. Returning the column default value instead of NULL is a planned enhancement.

Design doc

Reference docs update: dolthub/docs#2062

Fixes: #5486

@fulghum fulghum marked this pull request as ready for review February 27, 2024 00:56

@zachmu zachmu left a comment

Overall this looks great, nice work.

There are probably things we haven't thought of but the tests look good enough for an initial release.

}

// Load the overridden schema and convert it to a sql.Schema
overriddenSchema, err := t.GetSchema(ctx)

These GetSchema calls are very expensive and will give you a performance hit when customers use this feature. They're the main reason we have caching in getTable(). Because we cache based on root values, you can just reuse the same cache here as well, and it should help a lot. Probably fine to check in this way as is, but you should do a pass on perf.
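A rough sketch of the kind of caching suggested here, assuming a cache keyed by a root hash plus table name. The schemaKey, Schema, and schemaCache names are hypothetical and are not the types used by getTable(); the loader function stands in for the expensive GetSchema call.

```go
package main

import (
	"fmt"
	"sync"
)

// schemaKey identifies a table schema at a specific root. Both fields are
// hypothetical stand-ins for whatever the real cache keys on.
type schemaKey struct {
	rootHash  string
	tableName string
}

// Schema is a simplified schema: just an ordered list of column names.
type Schema []string

// schemaCache memoizes schemas per (root, table) pair.
type schemaCache struct {
	mu      sync.Mutex
	entries map[schemaKey]Schema
}

func newSchemaCache() *schemaCache {
	return &schemaCache{entries: make(map[schemaKey]Schema)}
}

// get returns the cached schema for key, or calls load and stores the result.
// The lock is held while loading, which keeps the sketch simple at the cost
// of serializing concurrent loads.
func (c *schemaCache) get(key schemaKey, load func() (Schema, error)) (Schema, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.entries[key]; ok {
		return s, nil
	}
	s, err := load()
	if err != nil {
		return nil, err
	}
	c.entries[key] = s
	return s, nil
}

func main() {
	cache := newSchemaCache()
	loads := 0
	load := func() (Schema, error) {
		loads++ // stands in for an expensive GetSchema call
		return Schema{"name", "birthday"}, nil
	}

	key := schemaKey{rootHash: "abc123", tableName: "people"}
	for i := 0; i < 3; i++ {
		if _, err := cache.get(key, load); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("loads:", loads)
}
```

Because a root value is immutable, an entry keyed this way never needs invalidation; a new root simply produces a new key.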

if !ok {
return fmt.Errorf("unable to find table '%s' at overridden schema root", tableName)
}
overriddenSchema, err := overriddenTable.GetSchema(ctx)

Same comment about caching here

// set value by testing against nil.
func getOverriddenSchemaValue(ctx *sql.Context) (*string, error) {
doltSession := dsess.DSessFromSess(ctx.Session)
varValue, err := doltSession.GetSessionVariable(ctx, dsess.DoltOverrideSchema)

These session variable lookups are kind of expensive as well, you might be surprised. I recommend profiling this after you get everything checked in.

Check out DoltSession.dbdbSessionVarsStale() to see where I added some caching for this interaction the last time I was working on session code.
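One way to cut down on repeated session variable lookups, sketched under stated assumptions: the session type below is a hypothetical stand-in, not DoltSession, and the generation counter is an invented staleness check rather than the mechanism the existing session code uses.

```go
package main

import "fmt"

// overrideVar is the session variable name from this PR's description.
const overrideVar = "dolt_schema_override_commit"

// session is a hypothetical stand-in for a SQL session. It is not safe for
// concurrent use; a real session would need locking.
type session struct {
	vars       map[string]string
	generation uint64 // incremented on every set
	lookups    int    // counts slow-path reads, for demonstration only

	cachedGen   uint64
	cachedValue *string
	cacheValid  bool
}

func newSession() *session {
	return &session{vars: make(map[string]string)}
}

// set stores a variable and marks any cached value as stale.
func (s *session) set(name, value string) {
	s.vars[name] = value
	s.generation++
}

// overrideValue returns the schema override, or nil when none is set, so
// callers can test against nil. The slow lookup only runs when a variable
// has been set since the last read.
func (s *session) overrideValue() *string {
	if s.cacheValid && s.cachedGen == s.generation {
		return s.cachedValue
	}
	s.lookups++
	var result *string
	if v, ok := s.vars[overrideVar]; ok && v != "" {
		result = &v
	}
	s.cachedValue, s.cachedGen, s.cacheValid = result, s.generation, true
	return result
}

func main() {
	s := newSession()
	s.set(overrideVar, "main")
	for i := 0; i < 1000; i++ {
		s.overrideValue()
	}
	fmt.Println("slow lookups:", s.lookups)
}
```

Whether this is worth doing depends on what profiling shows; the sketch only illustrates the shape of a stale check.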

@coffeegoddd

@fulghum DOLT

comparing_percentages
100.000000 to 100.000000
version result total
794e30d ok 5937457
version total_tests
794e30d 5937457
correctness_percentage
100.0

@fulghum fulghum merged commit 5ee9010 into main Mar 8, 2024
@fulghum fulghum deleted the fulghum/schema-pinning branch March 8, 2024 01:17
@coffeegoddd

@fulghum DOLT

comparing_percentages
100.000000 to 100.000000
version result total
d2c4af2 ok 5937457
version total_tests
d2c4af2 5937457
correctness_percentage
100.0

github-actions bot commented Mar 8, 2024

@coffeegoddd DOLT

test_name detail row_cnt sorted mysql_time sql_mult cli_mult
batching LOAD DATA 10000 1 0.07 0.43
batching batch sql 10000 1 0.08 1.25
batching by line sql 10000 1 0.08 1.25
blob 1 blob 200000 1 0.92 3.01 3.39
blob 2 blobs 200000 1 0.94 3.77 4.31
blob no blob 200000 1 0.92 1.2 1.34
col type datetime 200000 1 0.83 1.69 1.88
col type varchar 200000 1 0.69 1.84 1.86
config width 2 cols 200000 1 0.8 1.26 1.26
config width 32 cols 200000 1 1.88 1.38 2.45
config width 8 cols 200000 1 0.97 1.29 1.62
pk type float 200000 1 0.87 1.13 1.2
pk type int 200000 1 0.84 1.12 1.21
pk type varchar 200000 1 1.57 0.95 0.95
row count 1.6mm 1600000 1 5.77 1.41 1.47
row count 400k 400000 1 1.47 1.34 1.41
row count 800k 800000 1 2.93 1.37 1.43
secondary index four index 200000 1 3.69 1.03 0.88
secondary index no secondary 200000 1 0.93 1.22 1.35
secondary index one index 200000 1 1.16 1.41 1.42
secondary index two index 200000 1 2.02 1.15 1.07
sorting shuffled 1mm 1000000 0 5.22 1.69 1.7
sorting sorted 1mm 1000000 1 5.21 1.72 1.69

github-actions bot commented Mar 8, 2024

@coffeegoddd DOLT

name                                 detail        mean_mult
dolt_blame_basic                     system table  1.38
dolt_blame_commit_filter             system table  3.59
dolt_commit_ancestors_commit_filter  system table  0.81
dolt_commits_commit_filter           system table  0.9
dolt_diff_log_join_from_commit       system table  2.08
dolt_diff_log_join_to_commit         system table  2.1
dolt_diff_table_from_commit_filter   system table  1.15
dolt_diff_table_to_commit_filter     system table  1.15
dolt_diffs_commit_filter             system table  1
dolt_history_commit_filter           system table  1.37
dolt_log_commit_filter               system table  0.91

github-actions bot commented Mar 8, 2024

@coffeegoddd DOLT

name                  add_cnt  delete_cnt  update_cnt  latency
adds_only             60000    0           0           0.72
adds_updates_deletes  60000    60000       60000       3.94
deletes_only          0        60000       0           1.91
updates_only          0        0           60000       2.54


Development

Successfully merging this pull request may close these issues.

Ability to "project" schema at a commit onto data at another commit
