Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NULL fields and tag removal #919

Closed
wants to merge 16 commits into from

Conversation

geigerzaehler
Copy link

  1. Deleting fields works by setting them to NULL in the database
  2. The Zero plugin completely removes tags instead of storing the empty string (see zero: Remove tags instead of setting them to null #157).

@sampsyo
Copy link
Member

sampsyo commented Aug 25, 2014

Awesome; I think this is a great compromise—it's possible to mark something as "deleted" for the purposes of tags without exposing Nones all over the place to beets internals, which will be incredibly useful. Great idea.

One thing I'm slightly worried about is that this makes a store followed by a load "non-transparent". That is, previously, you could count on this:

print(a.somefield)
a.store()
a.load()
print(a.somefield)

always printing the same value twice. Now, if somefield had just been deleted, then you'll get None the first time and a normalized value the second time. This could come up, for example, in the modify command, where we ant to show the difference between the original and modified items, but the latter has not been though a store/load cycle so the None will be exposed.

I know it seems hacky, but maybe it would be useful to have separate state for indicating whether a field has been deleted? For example, Model could have a set called _null_fields that holds fixed field names that are missing values (i.e., they are NULL in the database and should be deleted from the tags).

@geigerzaehler
Copy link
Author

I agree that the attributes should be invariant under store() and load(), but temporarily breaking the invariance should not be a problem. In the long run I think we should emit None values from the model. (Again the ReplayGain example comes to mind, where 0 means something different than None.) The _null_fields approach seems to be an unnecessary detour.

@sampsyo
Copy link
Member

sampsyo commented Aug 25, 2014

I know it's complicated, but I do think it's incredibly valuable for most fields (ReplayGain being an exception!) to avoid None values, at least in the steady state. It's a dynamically-typed version of Tony Hoare's "billion-dollar mistake": missing values are an exceptional case, so it's a shame to force all client code to do handle Nones when it just wants to, for example, print out item.title.upper().

If anything, I think I'd advocate for removing the field entirely from the object when it's unset. For example, del item.title would make subsequent a item.title access raise an AttributeError rather than silently returning None. This would at least make attribute presence behave the same for flexible attributes as for fixed attributes. We could even provide a separate API, like item.get('title'), that would be guaranteed to return a useful value.

I'll continue to ponder also, but this is a surprisingly tricky issue!

@geigerzaehler
Copy link
Author

A tricky issue indeed, but I like the proposition in the second paragraph; it’s more pythonesque. Also, we already have a normalized API in place: the FormattedMapping.

@sampsyo
Copy link
Member

sampsyo commented Aug 25, 2014

Indeed, it could be that the only time you need "guaranteed success" is when formatting to a string, in which case item.formatted()['title'] would work nicely. The only question is whether there are lots places that need to get non-string normalized values: for example, if you want to sort by track number, it might be convenient to do items.sort(key=lambda i: i.track) and use 0 when the value is missing.

It's a pretty classic trade-off: we should endeavor to make the common case simple while keeping the uncommon case possible. If stuff like item.track is the common case, it would be a mistake to force all that code to either test for presence or use a more verbose API; we should instead make the "null-sensitive" version more verbose. But if in practice we mostly do item.get(key) anyway, then there's little harm in making basic access more complex.

@geigerzaehler
Copy link
Author

for example, if you want to sort by track number, it might be convenient to do items.sort(key=lambda i: i.track) and use 0 when the value is missing.

It think i.get('track', 0) would be a better choice, because it is verbose. Using a different API than that of dict and mapping seems counterintuitive.

@sampsyo
Copy link
Member

sampsyo commented Aug 25, 2014

Yeah, behaving like a "normal" dict where values go away when they're deleted is a very strong argument—your translation of my example being a perfect instance where an explicit default is nicer. And it is indeed frustrating that flexible attributes are different from fixed attributes in this way; it really would be better if this could be considered an implementation detail.

Want to give that refactoring a try? We'll need to look for all cases in the code that assume the existence of fixed attributes and make them explicitly handle absence.

@geigerzaehler
Copy link
Author

Want to give that refactoring a try? We'll need to look for all cases in the code that assume the existence of fixed attributes and make them explicitly handle absence.

Sure. What do you think of the following API

Model API

Actually two APIs: One for programmer input and one for user input.

  • model[key] Always returns same type or raises IndexError
  • model.key Same as above, but raises AttributeError
  • model.get(key, default=None) Same as above, but returns default
    instead of raising. If the default keyword is not given, uses
    Type.default.
  • model[key] = value Assigned value is validated, raises ValueError
    on failure, uses Type.normalize()
  • model.key = value Same as above.
  • model.set(key, value) The value is a string of user input that is
    parsed. If parsing fails, a ValueError is raised. Uses
    Type.parse()
Types API
  • parse(value) Used by model.set()
  • normalize(value) Must be invertible, only simple type conversions,
    raises ValueError, used by model[key] = value
  • default Used by model.get()
  • format(value) Always return unicode object, value might be None,
    used by formatted.get().
  • to_sql(value) Used by model.store()
  • from_sql(value) Used by model.load()

Not sure if we need the last two.

Print API
  • formatted[key] Returns either string or raises IndexError
  • formatted.get(key, default=None) Uses type.format().
Invariants
model[key] = value
assert model[key] == value
model.store()
model.load()
assert model[key] == value

assert t.from_sql(t.to_sql(x)) == x
assert t.to_sql(t.from_sql(x)) == x

I don’t know if the second line is stricly necessary, or if you we will loose too much flexibility.

@sampsyo
Copy link
Member

sampsyo commented Aug 25, 2014

Yes, awesome! I like this direction.

Here are a few comments:

  • You mentioned Type.default, which I suggest amending to Type.normalize(None), which is what we currently use. For most (all?) of the types we currently have, this is exactly equivalent to Type.null.
  • I like the addition of model.set for completeness, but currently the only code that needs parsing is the modify command. And in that case, we take advantage of a nice performance optimization by doing all the string-parsing up front before doing a batch set on each object (obj.update(mods), on line 1276 of commands.py currently). So this might not currently be used if we want to keep that optimization.
  • Yes, also not sure if we need {from,to}_sql. Currently, all Python values are the same as their SQLite equivalents, except for filenames due to a necessary hack (see _bytes_keys).
  • I'm not sure either, but my intuition says that we may want the flexibility for model[key] = value ; model[key] == value to be false sometimes due to normalization. For example, it might be nice to coerce strs to unicodes when assigning into a string field, or raw datetimes to raw floats (or vice-versa) for date fields.

I'm excited! This seems like it could be a really clarifying simplification.

@geigerzaehler
Copy link
Author

You mentioned Type.default, which I suggest amending to Type.normalize(None), which is what we currently use.

I was planning on just replacing missing values with a single value from Type.default, just like dict.get() does. The normalize() method is used for assigning fields and should not handle None. That’s what del model[key] is for.

For most (all?) of the types we currently have, this is exactly equivalent to Type.null.

default seems to me like a more descriptive name.

And in that case, we take advantage of a nice performance optimization by doing all the string-parsing up front before doing a batch set on each object (obj.update(mods), on line 1276 of commands.py currently). So this might not currently be used if we want to keep that optimization.

Its not really an optimization since we iterate and parse the modifications. With the new API it becomes

for key, val in mods.items():
    item.set(key, val)

so it’s actually shorter.

or raw datetimes to raw floats (or vice-versa) for date fields.

In this case I'd lean towards assigning either only datetimes or only floats and force the consumer to do the heavy lifting of conversion.

Yes, also not sure if we need {from,to}_sql. Currently, all Python values are the same as their SQLite equivalents, except for filenames due to a necessary hack (see _bytes_keys).

This seems to be the best argument for including the *_sql methods. Another is mtime and atime. In the item we might want to present them as datetimes but we have to serialize them to floats in the database.

@sampsyo
Copy link
Member

sampsyo commented Aug 26, 2014

The normalize() method is used for assigning fields and should not handle None. That’s what del model[key] is for.

Normalization already does handle None (that's most of its purpose). You're right that item.something = None should of course not delete the something field, but it is a nice way to reset to the default value for the field—something we don't otherwise have an API for.

This conversion/coercion stuff seems pretty orthogonal to the issue at hand, which is deletion—perhaps we should leave it in place for now to make this refactoring step more focused?

Its not really an optimization since we iterate and parse the modifications.

We're currently parsing the modifications once and then applying them N times to all the modified objects, without re-parsing. Your example would require re-parsing the modifications for every updated item (i.e., item.set implicitly parses). Perhaps I am missing something here?

@geigerzaehler
Copy link
Author

Normalization already does handle None (that's most of its purpose). You're right that item.something = None should of course not delete the something field, but it is a nice way to reset to the default value for the field—something we don't otherwise have an API for.

You’re right, having an API for assigning default values makes sense. This has lead me to another issue. If don’t think that it makes sense for fixed fields to be missing. Otherwise something like this wouldn’t work as expected

item1.update(dict(item2))

The same issue arises with mediafiles which delete None valued tags.

This conversion/coercion stuff seems pretty orthogonal to the issue at hand, which is deletion.

I think they are pretty closely connected: We have to ask ourselves how deleted values are represented internally in the object and in the database—and this is handled by conversion and coercion. I think we should hijack this issue and do right.

We're currently parsing the modifications once and then applying them N times to all the modified objects, without re-parsing. [...] Perhaps I am missing something here?

It was me. I totally missed the outer loop over objects. We could provide a model.parse(key, value) method and model.update(dict) will iterate and use model[key] = value.

All this has lead me to the following conclusions.

  1. We need to represent None values by None values internally and by NULL values in the database. Otherwise we can not pass a default value to model.get().
  2. For fixed fields model[field] never raises a KeyError but might return None. Also del model[field] is the same as setting it to None.
  3. normalize() should not handle None. If we assign None it is represented as None internally. If we use model[key] there is no need for normalization because the value is the same as the internal representation. There is also no need for normalizaion in model.get(): It just returns the same value as model[key] but replaces None with Type.default.

@geigerzaehler
Copy link
Author

A basic implementation would look like this

@sampsyo sampsyo force-pushed the master branch 2 times, most recently from 2ded210 to 4b11eed Compare August 27, 2014 17:55
@geigerzaehler geigerzaehler force-pushed the none_fields branch 2 times, most recently from fd3f7ee to 4038b75 Compare September 10, 2014 12:58
@geigerzaehler
Copy link
Author

This is now a working implementation. It is a bit hacky in parts but all tests pass. What do you think @sampsyo?

@sampsyo
Copy link
Member

sampsyo commented Sep 11, 2014

Hey @geigerzaehler; sorry for being so slow with this discussion! I've been a little busy with real-life stuff lately and therefore managing only to tread water with beets. I am working on formulating a better opinion here; I hope I'll have something smarter to say tomorrow.

@geigerzaehler
Copy link
Author

Take your time—it’s quite a handful and I’m still not confident I didn’t break anything. I suggest you ignore the FormattedMapping stuff for now, since this is the most convoluted.

Thomas Scholtes added 9 commits September 11, 2014 22:48
Removing a fixed field sets it to NULL in the database. If the model is
loaded, all NULL values are passed to `Type.normalize()`, so attributes
do not return `None`. Fixes beetbox#886.
We can now set model fields to `None`. `MediaFile` removes the tags
corresponding to `None` fields. Hopefully fixes beetbox#157.
All field operations of a model use the methods of a `Type` instance to
provide custom handling. Field operations include getting, setting,
deleting and conversion to and from SQL.

* There is a one-to-one mapping between SQL values and the values
  stored in the model. Conversion is done with `Type.from_sql()` and
  `Type.to_sql()`. `NULL` values in SQL are directly mapped to `None`
  values in the model
* `Model[field]` retrieves the internal value stored in the model. For
  fixed field this either gives a value or `None`. For flex fields this
  may raise a `KeyError`.
* `Model.get(field, default)` uses `Type.default` to replace missing
  and `None` values
* `Model[field] = value` uses `Type.normalize()` to make sure internal
  values are consisten. Normalization is skipped for `None`.
* `Model.set(field, value)` excpects a string value and uses
  `Type.parse()` to convert and normalize it to an internal value.
Thomas Scholtes added 7 commits September 11, 2014 22:51
This abstraction allows us to generalise a lot of code. It also makes
the the `normalize` method, and by extension assignment of values in
models, more restrictive. I therefore had to fix a few test cases.

For a description of the `model_type` property, see its documentation.
@sampsyo
Copy link
Member

sampsyo commented Sep 13, 2014

Thanks for your patience, and all the awesome work on this refactoring effort!

I think I have to advocate again for raising an exception for missing fields instead of returning None. I remain too concerned about the effect of exposing nulls silently throughout beets. In my experience, this kind of silent propagation of bad values can be subtle and pernicious. I just debugged a similar (unrelated) issue today. It's a troublesome anti-pattern for buggy code to not result in an error and instead just do the wrong thing.

There's also the other aforementioned issue of symmetry between fixed and flexible fields. It would be extremely valuable to make the two kinds of attributes as similar in behavior as possible—it should be possible for client code to be unaware of the distinction. The delete-means-None semantics breaks this.

Even so, there are other good pieces in here that we should add individually—from_sql and to_sql, for example, is definitely a good idea. And I'm sure lots of the work you have here in auditing for unchecked field accesses will be useful in an exception-triggering design too.

What do you think? I know there's the not-insignificant problem of making MediaFile work with missing fields. Perhaps we can make this a longer-term thing—we can take it one step at a time, maybe starting with MediaFile.

@geigerzaehler
Copy link
Author

This is a mere proposal, so I’m perfectly fine if we just pick out the good parts. To be honest I think it has become a little to big to be merged (one sometimes gets carried away 😉) and I would like to split the discussion. There are three parts to this.

  1. Discussing and implementing the semantics of the Model class.
  2. Refactoring to make the code more transparent.
  3. Working with media files to delete tags.

The first step should be 2. I will open another PR with some parts from this one and elaborate why I think it’s a good idea.

Then, I think, we should continue the discussion about semantics here. Maybe we could start be recording the current state. This involves the the interface for MediaFile that allows us to delete tags. For the discussion about new semantics code examples should be key. Not example implementations but examples of behavior, like informal test cases.

@sampsyo
Copy link
Member

sampsyo commented Sep 13, 2014

Fantastic—I look forward to seeing the ideas for refactoring; we could certainly use some more transparency in the dbcore stuff.

Describing the semantics is also critical. I'll start compiling some notes too. If we end up with something useful, we should consider adding it to the Sphinx docs too.

@jtpavlock
Copy link
Contributor

Is this still relevant?

@sampsyo
Copy link
Member

sampsyo commented Jul 9, 2020

It's a super complicated issue and supporting tag deletion would still be great to do someday, but yes, I do think we can close this effort now.

@sampsyo sampsyo closed this Jul 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants