feat: Remove CollectionDescription.Schema #1965
Conversation
Force-pushed from b16ff04 to 159aa0b
Codecov Report:

```
@@            Coverage Diff             @@
##           develop    #1965      +/-   ##
===========================================
- Coverage    74.30%   74.03%    -0.28%
===========================================
  Files          242      244        +2
  Lines        23436    23508       +72
===========================================
- Hits         17414    17402       -12
- Misses        4837     4899       +62
- Partials      1185     1207       +22
```

Flags with carried forward coverage won't be shown.
... and 8 files with indirect coverage changes.
db/collection.go
Outdated
```diff
-	existingDescriptionsByName map[string]client.CollectionDescription,
-	proposedDescriptionsByName map[string]client.CollectionDescription,
+	existingSchemaByName map[string]client.SchemaDescription,
+	proposedDescriptionsByName map[string]client.SchemaDescription,
 	desc client.CollectionDescription,
```
todo: Passing this in no longer makes any sense, nor does fetching it in patchSchema. Drop the patchSchema call to getCollectionsByName, and fetch it on demand in here.
See also #1939 (comment)
- Remove unwanted complexity RE collectionDescription param
```diff
+	collectionSchemaVersionKey := core.NewCollectionSchemaVersionKey(schema.VersionID)
+	// Whilst the schemaVersionKey is global, the data persisted at the key's location
+	// is local to the node (the global only elements are not useful beyond key generation).
+	err = txn.Systemstore().Put(ctx, collectionSchemaVersionKey.ToDS(), buf)
+	if err != nil {
+		return nil, err
+	}

-	collectionSchemaKey := core.NewCollectionSchemaKey(schemaID)
-	err = txn.Systemstore().Put(ctx, collectionSchemaKey.ToDS(), []byte(schemaVersionID))
+	collectionSchemaKey := core.NewCollectionSchemaKey(schema.SchemaID)
+	err = txn.Systemstore().Put(ctx, collectionSchemaKey.ToDS(), []byte(schema.VersionID))
 	if err != nil {
 		return nil, err
 	}

-	err = txn.Systemstore().Put(ctx, collectionKey.ToDS(), []byte(schemaVersionID))
+	err = txn.Systemstore().Put(ctx, collectionKey.ToDS(), []byte(schema.VersionID))
```
thought: These make no sense anymore. I think this is what you were referring to here
Yes :)
db/descriptions/schema.go
Outdated
suggestion: Although I like the functions that you create here, I'm not a fan of the descriptions package. I would much rather see this under db; schema.go is still small enough to host the content of this file without causing trouble.
The reason for this is documented in the description: whilst it is very small now, it will hopefully grow very soon. This package provides a single cohesive location for all similar stuff (schema, collection, index, repo-stuff) to live alongside each other without getting blended into the rest of the db package.
I read the description. My comment still stands. Even after moving collection in there, I don't think it would make sense to me. I think having the content under a file with a representative name under db would be much better.
It would be quite a lot of code. The schema stuff is 170 lines; collection and index and anything else that follows would be the same, plus the repository layer on top of them. Why would you want all of that spread about a fairly large and general-purpose package (db)? It would be much, much easier to read and maintain if it all lived in isolation in the same location.
I don't believe it would make any difference for maintainability. The schema stuff was already in there. You took it out of existing functions. If the files are well named and represent their contents, I don't really care how large the package is in terms of lines of code. Readability would be unchanged. I also care more about the size of functions than the size of files.
If you move collection to a sub package, we will end up with the sub-package having its parent as a dependency and eventually risk hitting circular dependency issues.
collection depends on db...
I do not understand what you mean here. You might be misunderstanding what will be living in this package.
Why not just add collection_description.go and schema_description.go instead of creating a sub-package?
Because they are going to be separated by 10 or so other files in the db package that are alphabetically larger than the first and smaller than the second, with index_description.go somewhere in between, plus whatever files the repo stuff will use, all causing cohesion to be lost.
If schema_description.go is next to schema.go, and functions inside schema.go call functions inside schema_description.go, I think that's pretty good cohesion.
Why would having schema_description.go next to collection_description.go be more cohesive than the above?
Why would having schema_description.go next to collection_description.go be more cohesive than the above?
The below is the definition of 'cohesive' returned from DuckDuckGo. By definition, if two files are not next to each other, they are less cohesive than two files that are next to each other.
cohesive:
Holding the particles of a homogeneous body together; producing cohesion.
Cohering, or sticking together, as in a mass; capable of cohering; tending to cohere.
For example, if the files were called description_schema.go and description_collection.go, they would be co-located, and it would be much easier to spot the shared concepts. A directory is really just that but using a / instead of an _, with some special treatment from the file browser - this also emphasises the fact that they are conceptually separate from the rest of the db package contents.
When I look at the codebase, I'd like to see the stuff defining how to save and fetch schema next to the code used to save and fetch collections, and I'd also want the two neatly organised so that the implementation to fetch schema isn't mixed up with the code to save collections.
This lets me make sure that they all look and behave in predictable ways, and if I want to change all of them it is much more obvious what 'all of them' is, so that I don't accidentally miss some similar stuff (e.g. index descriptions).
Alright, I'll get behind your approach. I would like you to name it in the singular, as pointed out by Islam, if it's not too big an ask.
I am worried that we will eventually hit a circular dependency situation given the proximity to the types in the root db package. But we can cross that bridge when we get there.
Okay cheers, I'll add a task/note against the package name convo to rename it
Force-pushed from 74704e0 to c8170a0
```diff
@@ -132,6 +133,12 @@ type CollectionIndexKey struct {

 var _ Key = (*CollectionIndexKey)(nil)

+type SchemaVersionKey struct {
```
todo: I think it needs a comment here
Cheers - I missed that, will add
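For reference, here is a sketch of the kind of doc comment being asked for; the wording and the field name are assumptions for illustration, not the comment that was actually added:

```go
// SchemaVersionKey points to the stored schema at the given version ID.
// (Hypothetical wording; the real comment lives alongside the type in core.)
type SchemaVersionKey struct {
	SchemaVersionID string
}
```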
- Doc key
```diff
@@ -257,6 +264,11 @@ func NewCollectionSchemaVersionKey(schemaVersionId string) CollectionSchemaVersionKey {
 	return CollectionSchemaVersionKey{SchemaVersionId: schemaVersionId}
 }

+func NewCollectionSchemaVersionKeyFromString(key string) CollectionSchemaVersionKey {
```
question: Don't you think it makes sense to add some unit tests? There is a dedicated file that tests keys.
I do not. This is very simple, and tested by pretty much every integration test we have. Unit tests would only slow down development (now and in the future).
Why would we rely on an integration test to flag an issue with something that can easily be tested locally? That means that if we make a change to any of the keys, we would have to run the integration test suite to see if there are any problems with the change, instead of running the much faster local unit tests.
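To illustrate the kind of fast local test being argued for, here is a minimal sketch; the key prefix and the parsing behaviour are assumptions for illustration, not the real constants:

```go
package core

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

// A sketch of a key round-trip unit test. The "/collection/version/v"
// prefix is hypothetical; the real prefix lives in core/key.go.
func TestNewCollectionSchemaVersionKeyFromString(t *testing.T) {
	key := NewCollectionSchemaVersionKeyFromString("/collection/version/v/bafkreiabc")
	assert.Equal(t, "bafkreiabc", key.SchemaVersionId)
}
```

A test like this runs in milliseconds, so a key-format regression would surface without a full integration run.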
I had a closer look and I think that not unit testing our keys could cause bad changes to go unnoticed until the breaking change tests are executed which could be quite annoying.
I'm not really in the mood to get back into the integration vs unit test conversation at the moment. I have written a large amount on this topic before and you know my opinion on this matter.
We have two overall objectives with testing:
1- Testing behaviour that is meaningful to the end user (like you mentioned above)
2- Reaching maximum code coverage so that all possible code paths have been tested (sometimes that is currently impossible to do with integration tests alone). This is a goal that the team set for itself (significantly increase code coverage).
I'll end my input on this conversation by reminding you of the above objectives. If you believe that simply relying on integration tests in this case contributes towards achieving the above objectives, then I think we're good.
If you believe that simply relying on integration tests in this case contributes towards achieving the above objectives
I very strongly believe so. Sometimes the integration tests will need to be more imaginative than our current core suite, for example chaos-monkey-like magic, or providing a deliberately faulty/test datastore implementation (the root datastore is something that can be publicly provided), but ultimately everything should be testable via public-only access points. If it isn't, it is dead code.
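As a rough illustration of the 'deliberately faulty/test datastore' idea, one could wrap the root datastore and inject failures; the types and signatures below assume a recent go-datastore and are a sketch, not DefraDB code:

```go
package test

import (
	"context"
	"errors"

	ds "github.com/ipfs/go-datastore"
)

// faultyStore wraps a real root datastore and injects a failure on Put,
// letting integration tests exercise error paths without mocks.
type faultyStore struct {
	ds.Datastore
	failPuts bool
}

func (s *faultyStore) Put(ctx context.Context, key ds.Key, value []byte) error {
	if s.failPuts {
		return errors.New("injected datastore failure")
	}
	return s.Datastore.Put(ctx, key, value)
}
```

Since the root datastore can be publicly provided, a wrapper like this needs no access to internals.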
Just another note: it is practically impossible to reach a desired test coverage with integration tests alone.
Just imagine setting up "a deliberately faulty/test datastore" that you want to trigger for a specific request at a specific code location. You saw how hard it was to achieve with unit tests; doing it with integration tests will be a nightmare.
And those tests helped me write better code and fix (unreported) bugs in the existing code. Like this:

```go
err := someAction()
if err != nil {
	return err
}
...
res := resultsIter.NextSync()
if res.Error != nil {
	return err // here is the problem
}
```

Usually we don't write tests for such paths. But we can easily overlook the details and ship incorrect code.
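For clarity, the fix for the flagged line is to return the iterator's error rather than the earlier (nil) err:

```go
res := resultsIter.NextSync()
if res.Error != nil {
	return res.Error // surface the iterator's error, not the stale err
}
```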
Such resilience tests don't need to be coded like traditional do-this, do-that tests. You can use the https://en.wikipedia.org/wiki/Chaos_engineering approach instead. This can allow the tests to be somewhat proactive (searching for bugs) instead of reactive (testing known pathways where bugs are expected).
Also, a lot of our errors can be reached by manipulating the datastore values directly, no need to mess around with mocks.
I also want to introduce a really magic testing layer, where randomized queries and setup data are fed in. Traditionally you can do this using recorded real-world data/actions, but there has been significant movement in using 'AI' to generate stuff like queries for such things.
db/descriptions/descriptions.go
Outdated
suggestion: I think the conventional way of naming packages in Go is in singular form.
I think I always saw that as applying when talking about a single thing. For example, the db package contains a single db definition that can be used to instantiate multiple db instances. In descriptions we have multiple description definitions (schema, collection, indexes - eventually), each of which can be used to instantiate multiple instances.
As I said, there is no official rule for it; it's just community convention.
The main argument for it is how you are going to use it. Yes, the package might deal with several different descriptions, but for most packages, most of their public methods deal with a single entity (a.k.a. a description).
Like context.Context allows you to create (or wrap) different types of contexts, but you handle a single one at a time.
Like context.Context allows you to create (or wrap) different types of contexts, but you handle a single one at a time.
Yes, but a context.Context is still just a (singular) context.Context. The descriptions in this package do (will) not share an interface or base inherited type. The description types are all public and independent; 'descriptions' is just a made-up umbrella term because I needed to give the package a name - there is no Description interface or type.
Another example is the enumerable package; this is singular because it describes a single thing, although there are many implementations, and some extensions of the base interface.
Note: I'm mostly just chatting for fun here, I don't really care about the name of the package, and think description(s) is a bit awkward anyway.
Fred also prefers description over descriptions, renaming :)
- Rename package
db/descriptions/schema.go
Outdated
"github.com/sourcenetwork/defradb/datastore" | ||
) | ||
|
||
// CreateSchemaVersion creates and saves to the store a new schema version. |
suggestion: "a new schema description version"
I do disagree, and don't think adding description there adds any clarity. It also clashes with Fred's long-term naming suggestion where SchemaDescription will be renamed to Schema (or maybe SchemaVersion; either way Description will be dropped).
It was just an observation as a reader: I was looking for the word "description" because the method is in the descriptions package, but it wasn't clear to me at first sight why.
db/descriptions/schema.go
Outdated
```diff
 	return desc, nil
 }

+// GetSchemas returns the schema of all the default schemas in the system.
```
question: what is "default schema"? Is it a common term in defra?
It is a fairly new term that came to be when we allowed the toggling of schema versions. There is a word missing here; it should read "default schema versions". A default schema version is the version used by a collection/query/etc. when a version is not explicitly requested.
- Tweak doc wording
all right, the term "default schema versions" I know :)
db/descriptions/schema.go
Outdated
```diff
 	}
 	defer func() {
 		if err := collectionSchemaVersionQuery.Close(); err != nil {
 			log.Error(ctx, NewErrFailedToCloseSchemaQuery(err).Error())
```
todo: #1450 (comment) :)
Hahaha, nice 😁 Thanks Islam 😆
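The linked comment is about not swallowing the Close error inside the deferred function. A common Go pattern for this, shown here as a generic sketch rather than the actual DefraDB change, folds the Close error into a named return value:

```go
package example

import (
	"errors"
	"io"
)

// readAll folds a Close error into the named return value instead of
// only logging it, so callers actually see the failure.
func readAll(r io.ReadCloser) (data []byte, err error) {
	defer func() {
		// errors.Join (Go 1.20+) keeps both the read and the close error.
		err = errors.Join(err, r.Close())
	}()
	return io.ReadAll(r)
}
```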
- Return error
```diff
@@ -661,86 +661,6 @@ func TestNonUniqueDrop_ShouldDeleteStoredIndexedFields(t *testing.T) {
 	assert.Len(t, f.getPrefixFromDataStore(prodCatKey.ToString()), 1)
 }

-func TestNonUniqueDrop_IfDataStorageFails_ReturnError(t *testing.T) {
```
question: I know we already touched on this one, but it is really hard to see these tests being cut.
I know these unit tests might be a bit hard to maintain with all these mocks, but they reach places that integration tests can't and help us achieve the greater code coverage we are striving for.
I wonder how you decide if some tests need to be removed.
I know we already touched on this one, but it is really hard to see these tests being cut.
I really do appreciate that, and it is a big personal reason why I do not want the removal of these tests to be blended in with the key.go unit test discussion.
help us achieve the greater code coverage we are striving for
- Code coverage is not an end goal, it is just an indicator. I do not strive for code coverage, I strive for correct and maintainable code. Good tests contribute to code correctness and lower maintenance costs, bad tests can have a negative effect on both.
- Removing these tests does not appear to have affected code coverage.
I wonder how you decide if some tests need to be removed.
:) I'm a lot more conservative than I want to be RE preserving unit tests. I have tried several times before to remove many of the unit tests that I have spent notable time changing in these last few PRs.
The index unit tests, however, have cost me a lot of time over the last 10 days or so. It took me roughly half a day of development to get them working in a previous PR; I didn't bother trying in this one and removed the ones that started failing.
I do not consider these to be worth their maintenance cost.
They are the only tests to change in this PR.
This PR does not change any public behaviour.
This PR does not touch any index related code.
Yet these tests started to fail. These tests inhibit useful work, they prevent us from making useful changes to the codebase. Changes that are irrelevant to what they are testing for.
Almost everything that the tests test is covered by integration tests. Some tests in these files do test errors which do not yet have integration test coverage; however, I consider testing those errors to be very low value.
As well as the above, fixing these once they start failing is particularly painful. There is quite a lot of mock setup, and it is not easy code to read. Fixing them requires guessing how many internal calls to various mocks used by the tests are the 'correct' number, with the 'correct' input params and responses.
Somewhat related to the conversation we had RE the index-explain integration tests, these unit tests make unimportant stuff really important, and this wastes a lot of time.
In my opinion, the cost of these tests far outweighs any benefit of having them. Good tests save developer time (including external users' time in the case of bugs), that is why we write them - these tests cost more time than they will ever save.
I am not against the unit testing of stuff. Unit tests are useful in many cases.
I am not against mocks, they have their place, however here, with the amount of irrelevant stuff that they need to mock, they make these tests bad tests that cost more than they save.
If we wish to preserve tests similar in concept to the tests in this file they should be reworked somehow. They really should not fail when making the changes I have made in this PR.
Yet these tests started to fail.
To clarify on this, I changed the commit order - I try quite hard to make all of my commits pass the CI; it makes a lot of things easier. So they started to fail part way through this PR, I removed them, then moved the commit to the start.
hey Andy, thanks for the detailed reasoning. I appreciate it.
I agree, these tests turned out to be a bit too invasive and have high maintenance cost with relatively little benefit.
These have cost me more than a day's worth of time in the last week already, and they have started failing again due to various internal mocks requiring the exact number of calls with the exact params etc., and I have decided that is enough. These are bad tests that cost far more time to maintain than they will ever save. No public behaviour has changed here, yet they insist on failing, and require constant deep diving into really dense mock setup code that has very little to do with anything useful to our users. They are being removed now.
Also stores the schema against a new key. This key is not read in this commit, but will be shortly.
Force-pushed from 606947d to ac557cc
Force-pushed from ac557cc to 12ccb43
We're making good progress on detangling collection and schema. Keep it up! 🙂
Relevant issue(s)
Resolves #1958
Description
Removes CollectionDescription.Schema.
Also splits the storage of schema out from within collection.
The storage of schema has been broken out to a new sub-package of db; at the moment it is a very simple file, but collection will be moved there in #1964. I was planning on doing that in this PR (in part, to provide context for reviewers, as atm it is basically a single-file package), but it proved to be non-trivial due to some existing messiness in that space and was broken out to two more tasks.
I also wish for stuff in that directory to eventually follow a repository-like pattern, where stuff is cached (within a context/txn's context) instead of fetching from store on each call.
Moving this stuff out to a new directory instead of preserving it in the (already very large) db directory should make both db and the new sub-package a fair bit more cohesive and easier to read.