Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pull upstream changes #1

Open
wants to merge 49 commits into
base: master
Choose a base branch
from
Open

Conversation

danielunderwood
Copy link

No description provided.

amCap1712 and others added 30 commits May 31, 2022 19:14
The build_entity_query method of all entities is mocked in test_index_by_fk_*
tests. These mocks are never reset after the test is over. This causes
the mock to be visible instead of the actual method in other tests causing
those to fail. Fix the issue by using mock.patch and cleaning up after
test run is over.
The test config should be copied inside the Dockerfile manually. Copying
config.test.ini to config.ini in a local or jenkins script before building
the docker image will not work because config.ini is included in .dockerignore.
Therefore, config.ini will be ignored and not be copied to the image otherwise.

Co-authored-by: yvanzo <[email protected]>
This erases all the code from the image built in CI. So remove the mount,
the downside is that we need to build the image for tests each time. An
alternative could be to have separate docker compose files for CI and local
tests, perhaps even using one as overlay over the other.
We want to these indexing functions with real database tests. Thus,
accept an optional session parameter in these functions. In tests,
we create a session, start a transaction, insert data from sql files,
query the data in the same transaction and at the end of the test
rollback the transaction. As a result, we are able to get away without
resetting the database in between the tests.

If we don't pass the same session here, then it may not be possible to
use this method because changes in a transaction are not visible outside
it.

This code earlier used to be inside a with but the queries executed here
are only select, so we can do without the `with` here as well.
Following the model of `docker/push.sh`:

- Document expected behavior and usage with heading comments
- Exit on error, for example if the first Docker Compose command failed
- Support running from any working directory (using `cd`)
- Support any Docker Compose setup and version:
  * Run `docker-compose` by default assuming v1 and `docker` group
  * Support `docker compose` for Docker Compose v2
  * Support `sudo ...` if needed
The column can now be a hybrid_propertyProxy object which we don't want
to skip, necessarily. But the previous behavior of skipping strings and
mbdata models has been preserved by checking for those directly.
This brings us back to the same issue as in 1.1 with _wildcard_token,
so might as well do try and do the jump in one go.
The point of defer_everything_but was to effectively do
what load_only does, but probably back when load_only
was not an option yet.
This just gets rid of it, and uses load_only instead. Since load_only
cannot work on relationships, just columns, we filter
relationships out of the column set before using it.
I could not find docs on `table` attribute but
https://docs.sqlalchemy.org/en/14/orm/internals.html#sqlalchemy.orm.RelationshipProperty.mapper
documents a `mapper` attribute which seems to do the job. FWIW, there
is also a `target` attribute but that is not documented in SQLAlchemy.
I am not sure if the two are different or not.
__tablename__ is not a column anyways. Also, the
checks in `defer_everything_but` (before we replaced
it with `load_only`) filtered it out.
The path artist_links.artist.gid is not being processed correctly
by iterate_path_values instead of returning the gid, it returns the
artist. LinkArtistUrl.artist is a @hybrid_property and the check
isinstance(LinkArtistUrl.artist, InstrumentedAttribute) returns false.
To identify hybrid_attributes in iterate_path_values, we could use
https://docs.sqlalchemy.org/en/14/orm/internals.html#sqlalchemy.orm.InspectionAttr.extension_type .
I tried this, but currently it does not seem to work. Hence, it is
better that we do not use hybrid attributes in paths for now.
In tests, we need to pass the session manually to control transactions.
Refactoring code so that tests and production code execute same code path.
Starting from using SQLAlchemy 1.4, SIR crashes if materialized (or
denormalized) tables for the MusicBrainz database are not available.

Just document it for now and consider SEARCH-687 for better checks.
Upgrade SQLAlchemy from 1.0 to 1.4
* Fix deprecation warnings in SIR

After upgrading SIR to SQLAlchemy 1.4, some deprecation warnings came up.

1. SADeprecationWarning: Use .persist_selectable (deprecated since: 1.3)
    mapped_table = class_mapper(entity.model).mapped_table.name

  Fix is straightforward use .persist_selectable instead of .mapped_table as
  looking at following docs it only seems to have been renamed.

  .mapped_table in SQLAlchemy 1.2:
  https://docs.sqlalchemy.org/en/12/orm/mapping_api.html?highlight=mapped_table#sqlalchemy.orm.Mapper.mapped_table

  .persist_selectable in SQLAlchemy 1.4:
   https://docs.sqlalchemy.org/en/14/orm/mapping_api.html?highlight=mapped_table#sqlalchemy.orm.Mapper.persist_selectable

2. SADeprecationWarning: Calling URL() directly is deprecated and will be disabled in a future release.  The public constructor for URL is now the URL.create() method.

  As the warning says, replace URL() with URL.create()

3. /usr/local/lib/python2.7/site-packages/pysqlite2/dbapi2.py:81: DeprecationWarning: Converters and adapters are deprecated. Please use only supported SQLite types. Any type mapping should happen in layer above this module.

  This warning comes from pysqlite. Python 2.7 does have a sqlite module
  so getting rid of pysqlite entirely. We only used it for a few tests in
  any case.

4. Finally, there's also a SAWarning about overlapping relationships in
   B and C models which are only used in tests. Add a backref to fix the
   warning.
The test logs are filled with warnings like:

SAWarning: relationship 'AreaTag.area' will copy column area.id to column
area_tag.area, which conflicts with relationship(s): 'CustomArea.tags'
(copies area.id to area_tag.area). If this is not the intention, consider
if these relationships should be linked with back_populates, or if
viewonly=True should be applied to one or more if they are read-only. For
the less common case that foreign key constraints are partially overlapping,
the orm.foreign() annotation can be used to isolate the columns that should
be written towards. To silence this warning, add the parameter
'overlaps="tags"' to the 'AreaTag.area' relationship.

This particular warnings arises from `tags = relationship("AreaTag")` in
CustomArea. The reason (AFAIU) is that `tags` relationship links `Area`
to `AreaTag` and `area` in `AreaTag` links AreaTag back to `Area`.
However, these relationships aren't linked with a backref so it isn't
known to SQLAlchemy that both `relationship` properties represent
the same relation. Possible fixes include using `back_ref` to make
it known to SQLAlchemy that both relationships are same. That approach
however needs changes in mbdata.

In many cases like that of `Tag` and `Alias` relationships, it seems
like a good idea to update mbdata and say add `tags` and `aliases`
attributes directly to `Area` model and other entities. But there
are a few cases like `artist_links` and `recording_links` in
`CustomWork` which probably don't make sense to be added to mbdata
and should continue to live in custom sir models.

Another alternative approach to get rid of the warning as the message
suggests is setting `viewonly=True`, since we only ever read the data
in sir and never modify this option works fine for us. Updating mbdata
likely needs to consider implications on other users of the library so
currently I am not inclined to pursue the approach. Hence, going with
the `viewonly=True` approach.
The test logs are filled with warnings like:

SAWarning: implicitly coercing SELECT object to scalar subquery; please
use the .scalar_subquery() method to produce a scalar subquery.

These warnings come from code like which calculate various counts:

`label_count = column_property(select([func.count(Label.id)]).where(Label.area_id == Area.id))`

Fix is simple append .scalar_subquery() to the queries.
This reverts commit 8b16c86.

The SQL queries generated using load_only are less optimal as compared to
using defer_everything_but. The reason likely stems from a difference in
which columns ways eagerly load. For now, just revert to the old way to
avoid performance regressions.
Trying to specify the load type for a composite property is erroneous.
So filter those properties.
amCap1712 and others added 19 commits October 28, 2022 19:48
We access begin_date, end_date and ended fields from area fields in
convert_area_inner which is called for all area types. This function is
only called by wscompat converter so we only need to add these to extrapaths
and not to the main fields in the entity.
convert_event used as wscompat converter for event core calls
convert_artist_simple as well which accesses sort_name so eagerly
load it.
Logging SQL queries shows that when using the hybrid property like artist
of LinkArtistPlace model instead of the actual attribute like entity0 does
not trigger eager loading of the db column. Therefore, use entity0 instead
of artist in artist_links, entity0 instead of area in area_links and
entity1 instead of place in place_links.
Logging SQL queries shows that when using the hybrid property like artist
of LinkArtistUrl model instead of the actual attribute like entity0 does
not trigger eager loading of the db column. Therefore, use entity0 instead
of artist in artist_links and entity0 instead of release in release_links.
convert_release_group used as wscompat converter for release group core
calls convert_alias as well which accesses alias.gid so eagerly load it.
convert_release used as wscompat converter for release core calls
convert_artist_simple as well which accesses comment so eagerly load it.
convert_release used as wscompat converter for release core calls
convert_release_packaging as well which accesses comment so eagerly load
it.
Logging SQL queries shows that when using the hybrid property like area0
of LinkAreaArea model instead of the actual attribute like entity0 does
not trigger eager loading of the db column. Therefore, use entity0 instead
of area0 in area_links.
convert_area calls convert_area_relation_list which in turn calls
convert_area_inner on the related area that accesses these attributes
so eagerly load these for performance.
convert_place used as wscompat converter for place core calls
convert_area_inner as well which accesses area.type.gid, area.type.name,
area.begin_date, area.end_date, area.ended so eagerly load it.
The change amends the implementation from SEARCH-319.

The CustomReleaseGroup model has 2 relationships to ReleaseGroupMeta table
while stores first_release_date.

1. `meta` - this relationship is established on `ReleaseGroup` model
 through a [backref](https://github.com/acoustid/mbdata/blob/bbe303865e4cec3f83a65ce29f0d3468c729173e/mbdata/models.py#L8153) in `ReleaseGroupMeta` table.

2. `first_release_date` - we add this relationship to `ReleaseGroup` in
 `CustomReleaseGroup` in [sir](https://github.com/metabrainz/sir/blob/80954d6953da4b74028b3b5a73fada5d81f910e0/sir/schema/modelext.py#L122).

Ideally, we would have only one relationship: the one in mbdata. However,
using `meta` relationship leads to an error. Investigating the error, I
found that the `meta` relationship is created using `uselist=False` parameter.
The logic in [iterate_path_values](https://github.com/metabrainz/sir/blob/80954d6953da4b74028b3b5a73fada5d81f910e0/sir/querying.py#L78-L85) does not support [`ONETOONE`](https://docs.sqlalchemy.org/en/14/orm/basic_relationships.html#one-to-one) relationships
that `uselist=False` creates.

Until `iterate_path_values` is updated we need to keep using the
`first_release_date` relationship. Accordingly, update the logic in
convert_release_group converter. This should also likely speed up indexing
as well.
The change amends the implementation from SEARCH-218.

The CustomRecording model has 2 relationships to RecordingFirstReleaseDate
table which stores first_release_date.

1. `first_release` - this relationship is established on `Recording` model
 through a [backref](https://github.com/acoustid/mbdata/blob/bbe303865e4cec3f83a65ce29f0d3468c729173e/mbdata/models.py#L1616) in `RecordingFirstReleaseDate` table.

2. `first_release_date` - we add this relationship to `Recording` in
 `CustomRecording` in [sir](https://github.com/metabrainz/sir/blob/ade0073ddd450e641599d56af1a1ca4762aef83c/sir/schema/modelext.py#L116).

Ideally, we would have only one relationship: the one in mbdata. However,
using `first_release` relationship leads to an error. Investigating the error, I
found that the `first_release` relationship is created using `uselist=False` parameter.
The logic in [iterate_path_values](https://github.com/metabrainz/sir/blob/80954d6953da4b74028b3b5a73fada5d81f910e0/sir/querying.py#L78-L85) does not support [`ONETOONE`](https://docs.sqlalchemy.org/en/14/orm/basic_relationships.html#one-to-one) relationships
that `uselist=False` creates.

Until `iterate_path_values` is updated we need to keep using the
`first_release_date` relationship. Accordingly, update the logic in
convert_recording converter. This should also likely speed up indexing
as well.
The change amends the implementation from SEARCH-494.

The CustomRelease model has 2 relationships to ReleaseMeta table which
stores amazon asin.

1. `meta` - this relationship is established on `Release` model
 through a [backref](https://github.com/acoustid/mbdata/blob/bbe303865e4cec3f83a65ce29f0d3468c729173e/mbdata/models.py#L7872) in `RecordingFirstReleaseDate` table.

2. `asin` - we add this relationship to `Release` in
 `CustomRelease` in [sir](https://github.com/metabrainz/sir/blob/ade0073ddd450e641599d56af1a1ca4762aef83c/sir/schema/modelext.py#L134).

Ideally, we would have only one relationship: the one in mbdata. However,
using `meta` relationship leads to an error. Investigating the error, I
found that the `meta` relationship is created using `uselist=False` parameter.
The logic in [iterate_path_values](https://github.com/metabrainz/sir/blob/80954d6953da4b74028b3b5a73fada5d81f910e0/sir/querying.py#L78-L85) does not support [`ONETOONE`](https://docs.sqlalchemy.org/en/14/orm/basic_relationships.html#one-to-one) relationships
that `uselist=False` creates.

Until `iterate_path_values` is updated we need to keep using the
`asin` relationship. Accordingly, update the logic in convert_release
converter. This should also likely speed up indexing as well.
convert_artist calls convert_area_inner on the area that accesses these
area.type.name and area.type.gid attributes so eagerly load these for
performance.
convert_label used as wscompat converter for label core calls
convert_area_inner as well which accesses area.begin_date,
area.end_date, area.ended so eagerly load it.

Also, add one more test case where these fields aren't NULL.
Add sort_name and comment for eager loading because those are accessed
by convert_artist_simple. Change recording to entity0 because hybrid
properties are not eagerly loaded correctly.
We load recording links anyway so a len in python is simpler.
When moving to another instance of RabbitMQ (mostly for master not
mirror), the exchanges and queues have to be declared using
SIR’s `amqp_setup` command, otherwise messages are just dropped by
the new RabbitMQ instance. This step was missing in the documentation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants