Pull upstream changes #1

danielunderwood · 2023-10-24T13:06:06Z

No description provided.

The build_entity_query method of all entities is mocked in test_index_by_fk_* tests. These mocks are never reset after the test is over. This causes the mock to be visible instead of the actual method in other tests causing those to fail. Fix the issue by using mock.patch and cleaning up after test run is over.

The test config should be copied inside the Dockerfile manually. Copying config.test.ini to config.ini in a local or jenkins script before building the docker image will not work because config.ini is included in .dockerignore. Therefore, config.ini will be ignored and not be copied to the image otherwise. Co-authored-by: yvanzo <[email protected]>

This erases all the code from the image built in CI. So remove the mount, the downside is that we need to build the image for tests each time. An alternative could be to have separate docker compose files for CI and local tests, perhaps even using one as overlay over the other.

We want to these indexing functions with real database tests. Thus, accept an optional session parameter in these functions. In tests, we create a session, start a transaction, insert data from sql files, query the data in the same transaction and at the end of the test rollback the transaction. As a result, we are able to get away without resetting the database in between the tests. If we don't pass the same session here, then it may not be possible to use this method because changes in a transaction are not visible outside it. This code earlier used to be inside a with but the queries executed here are only select, so we can do without the `with` here as well.

Following the model of `docker/push.sh`: - Document expected behavior and usage with heading comments - Exit on error, for example if the first Docker Compose command failed - Support running from any working directory (using `cd`) - Support any Docker Compose setup and version: * Run `docker-compose` by default assuming v1 and `docker` group * Support `docker compose` for Docker Compose v2 * Support `sudo ...` if needed

See https://docs.sqlalchemy.org/en/11/changelog/migration_11.html FIXME: LinkAreaArea.area0 is not instance of InstrumentAttribute

The column can now be a hybrid_propertyProxy object which we don't want to skip, necessarily. But the previous behavior of skipping strings and mbdata models has been preserved by checking for those directly.

This brings us back to the same issue as in 1.1 with _wildcard_token, so might as well do try and do the jump in one go.

The point of defer_everything_but was to effectively do what load_only does, but probably back when load_only was not an option yet. This just gets rid of it, and uses load_only instead. Since load_only cannot work on relationships, just columns, we filter relationships out of the column set before using it.

I could not find docs on `table` attribute but https://docs.sqlalchemy.org/en/14/orm/internals.html#sqlalchemy.orm.RelationshipProperty.mapper documents a `mapper` attribute which seems to do the job. FWIW, there is also a `target` attribute but that is not documented in SQLAlchemy. I am not sure if the two are different or not.

__tablename__ is not a column anyways. Also, the checks in `defer_everything_but` (before we replaced it with `load_only`) filtered it out.

The path artist_links.artist.gid is not being processed correctly by iterate_path_values instead of returning the gid, it returns the artist. LinkArtistUrl.artist is a @hybrid_property and the check isinstance(LinkArtistUrl.artist, InstrumentedAttribute) returns false. To identify hybrid_attributes in iterate_path_values, we could use https://docs.sqlalchemy.org/en/14/orm/internals.html#sqlalchemy.orm.InspectionAttr.extension_type . I tried this, but currently it does not seem to work. Hence, it is better that we do not use hybrid attributes in paths for now.

In tests, we need to pass the session manually to control transactions. Refactoring code so that tests and production code execute same code path.

Starting from using SQLAlchemy 1.4, SIR crashes if materialized (or denormalized) tables for the MusicBrainz database are not available. Just document it for now and consider SEARCH-687 for better checks.

Upgrade SQLAlchemy from 1.0 to 1.4

Provide information for basic maintenance and implementation details. Follow guidelines for using RabbitMQ service: - https://github.com/metabrainz/guidelines/blob/60c708538bc84e11a79d2cdad38274f5b04276ad/services/README.md - https://github.com/metabrainz/guidelines/blob/f2dcf19bac2d76d7348758c021f1e5ada4451cf8/services/RabbitMQ.md

* Fix deprecation warnings in SIR After upgrading SIR to SQLAlchemy 1.4, some deprecation warnings came up. 1. SADeprecationWarning: Use .persist_selectable (deprecated since: 1.3) mapped_table = class_mapper(entity.model).mapped_table.name Fix is straightforward use .persist_selectable instead of .mapped_table as looking at following docs it only seems to have been renamed. .mapped_table in SQLAlchemy 1.2: https://docs.sqlalchemy.org/en/12/orm/mapping_api.html?highlight=mapped_table#sqlalchemy.orm.Mapper.mapped_table .persist_selectable in SQLAlchemy 1.4: https://docs.sqlalchemy.org/en/14/orm/mapping_api.html?highlight=mapped_table#sqlalchemy.orm.Mapper.persist_selectable 2. SADeprecationWarning: Calling URL() directly is deprecated and will be disabled in a future release. The public constructor for URL is now the URL.create() method. As the warning says, replace URL() with URL.create() 3. /usr/local/lib/python2.7/site-packages/pysqlite2/dbapi2.py:81: DeprecationWarning: Converters and adapters are deprecated. Please use only supported SQLite types. Any type mapping should happen in layer above this module. This warning comes from pysqlite. Python 2.7 does have a sqlite module so getting rid of pysqlite entirely. We only used it for a few tests in any case. 4. Finally, there's also a SAWarning about overlapping relationships in B and C models which are only used in tests. Add a backref to fix the warning.

The test logs are filled with warnings like: SAWarning: relationship 'AreaTag.area' will copy column area.id to column area_tag.area, which conflicts with relationship(s): 'CustomArea.tags' (copies area.id to area_tag.area). If this is not the intention, consider if these relationships should be linked with back_populates, or if viewonly=True should be applied to one or more if they are read-only. For the less common case that foreign key constraints are partially overlapping, the orm.foreign() annotation can be used to isolate the columns that should be written towards. To silence this warning, add the parameter 'overlaps="tags"' to the 'AreaTag.area' relationship. This particular warnings arises from `tags = relationship("AreaTag")` in CustomArea. The reason (AFAIU) is that `tags` relationship links `Area` to `AreaTag` and `area` in `AreaTag` links AreaTag back to `Area`. However, these relationships aren't linked with a backref so it isn't known to SQLAlchemy that both `relationship` properties represent the same relation. Possible fixes include using `back_ref` to make it known to SQLAlchemy that both relationships are same. That approach however needs changes in mbdata. In many cases like that of `Tag` and `Alias` relationships, it seems like a good idea to update mbdata and say add `tags` and `aliases` attributes directly to `Area` model and other entities. But there are a few cases like `artist_links` and `recording_links` in `CustomWork` which probably don't make sense to be added to mbdata and should continue to live in custom sir models. Another alternative approach to get rid of the warning as the message suggests is setting `viewonly=True`, since we only ever read the data in sir and never modify this option works fine for us. Updating mbdata likely needs to consider implications on other users of the library so currently I am not inclined to pursue the approach. Hence, going with the `viewonly=True` approach.

The test logs are filled with warnings like: SAWarning: implicitly coercing SELECT object to scalar subquery; please use the .scalar_subquery() method to produce a scalar subquery. These warnings come from code like which calculate various counts: `label_count = column_property(select([func.count(Label.id)]).where(Label.area_id == Area.id))` Fix is simple append .scalar_subquery() to the queries.

This reverts commit 8b16c86. The SQL queries generated using load_only are less optimal as compared to using defer_everything_but. The reason likely stems from a difference in which columns ways eagerly load. For now, just revert to the old way to avoid performance regressions.

Trying to specify the load type for a composite property is erroneous. So filter those properties.

We access begin_date, end_date and ended fields from area fields in convert_area_inner which is called for all area types. This function is only called by wscompat converter so we only need to add these to extrapaths and not to the main fields in the entity.

convert_event used as wscompat converter for event core calls convert_artist_simple as well which accesses sort_name so eagerly load it.

Logging SQL queries shows that when using the hybrid property like artist of LinkArtistPlace model instead of the actual attribute like entity0 does not trigger eager loading of the db column. Therefore, use entity0 instead of artist in artist_links, entity0 instead of area in area_links and entity1 instead of place in place_links.

Logging SQL queries shows that when using the hybrid property like artist of LinkArtistUrl model instead of the actual attribute like entity0 does not trigger eager loading of the db column. Therefore, use entity0 instead of artist in artist_links and entity0 instead of release in release_links.

convert_release_group used as wscompat converter for release group core calls convert_alias as well which accesses alias.gid so eagerly load it.

convert_release used as wscompat converter for release core calls convert_artist_simple as well which accesses comment so eagerly load it.

convert_release used as wscompat converter for release core calls convert_release_packaging as well which accesses comment so eagerly load it.

Logging SQL queries shows that when using the hybrid property like area0 of LinkAreaArea model instead of the actual attribute like entity0 does not trigger eager loading of the db column. Therefore, use entity0 instead of area0 in area_links.

convert_area calls convert_area_relation_list which in turn calls convert_area_inner on the related area that accesses these attributes so eagerly load these for performance.

convert_place used as wscompat converter for place core calls convert_area_inner as well which accesses area.type.gid, area.type.name, area.begin_date, area.end_date, area.ended so eagerly load it.

The change amends the implementation from SEARCH-319. The CustomReleaseGroup model has 2 relationships to ReleaseGroupMeta table while stores first_release_date. 1. `meta` - this relationship is established on `ReleaseGroup` model through a [backref](https://github.com/acoustid/mbdata/blob/bbe303865e4cec3f83a65ce29f0d3468c729173e/mbdata/models.py#L8153) in `ReleaseGroupMeta` table. 2. `first_release_date` - we add this relationship to `ReleaseGroup` in `CustomReleaseGroup` in [sir](https://github.com/metabrainz/sir/blob/80954d6953da4b74028b3b5a73fada5d81f910e0/sir/schema/modelext.py#L122). Ideally, we would have only one relationship: the one in mbdata. However, using `meta` relationship leads to an error. Investigating the error, I found that the `meta` relationship is created using `uselist=False` parameter. The logic in [iterate_path_values](https://github.com/metabrainz/sir/blob/80954d6953da4b74028b3b5a73fada5d81f910e0/sir/querying.py#L78-L85) does not support [`ONETOONE`](https://docs.sqlalchemy.org/en/14/orm/basic_relationships.html#one-to-one) relationships that `uselist=False` creates. Until `iterate_path_values` is updated we need to keep using the `first_release_date` relationship. Accordingly, update the logic in convert_release_group converter. This should also likely speed up indexing as well.

The change amends the implementation from SEARCH-218. The CustomRecording model has 2 relationships to RecordingFirstReleaseDate table which stores first_release_date. 1. `first_release` - this relationship is established on `Recording` model through a [backref](https://github.com/acoustid/mbdata/blob/bbe303865e4cec3f83a65ce29f0d3468c729173e/mbdata/models.py#L1616) in `RecordingFirstReleaseDate` table. 2. `first_release_date` - we add this relationship to `Recording` in `CustomRecording` in [sir](https://github.com/metabrainz/sir/blob/ade0073ddd450e641599d56af1a1ca4762aef83c/sir/schema/modelext.py#L116). Ideally, we would have only one relationship: the one in mbdata. However, using `first_release` relationship leads to an error. Investigating the error, I found that the `first_release` relationship is created using `uselist=False` parameter. The logic in [iterate_path_values](https://github.com/metabrainz/sir/blob/80954d6953da4b74028b3b5a73fada5d81f910e0/sir/querying.py#L78-L85) does not support [`ONETOONE`](https://docs.sqlalchemy.org/en/14/orm/basic_relationships.html#one-to-one) relationships that `uselist=False` creates. Until `iterate_path_values` is updated we need to keep using the `first_release_date` relationship. Accordingly, update the logic in convert_recording converter. This should also likely speed up indexing as well.

The change amends the implementation from SEARCH-494. The CustomRelease model has 2 relationships to ReleaseMeta table which stores amazon asin. 1. `meta` - this relationship is established on `Release` model through a [backref](https://github.com/acoustid/mbdata/blob/bbe303865e4cec3f83a65ce29f0d3468c729173e/mbdata/models.py#L7872) in `RecordingFirstReleaseDate` table. 2. `asin` - we add this relationship to `Release` in `CustomRelease` in [sir](https://github.com/metabrainz/sir/blob/ade0073ddd450e641599d56af1a1ca4762aef83c/sir/schema/modelext.py#L134). Ideally, we would have only one relationship: the one in mbdata. However, using `meta` relationship leads to an error. Investigating the error, I found that the `meta` relationship is created using `uselist=False` parameter. The logic in [iterate_path_values](https://github.com/metabrainz/sir/blob/80954d6953da4b74028b3b5a73fada5d81f910e0/sir/querying.py#L78-L85) does not support [`ONETOONE`](https://docs.sqlalchemy.org/en/14/orm/basic_relationships.html#one-to-one) relationships that `uselist=False` creates. Until `iterate_path_values` is updated we need to keep using the `asin` relationship. Accordingly, update the logic in convert_release converter. This should also likely speed up indexing as well.

convert_artist calls convert_area_inner on the area that accesses these area.type.name and area.type.gid attributes so eagerly load these for performance.

convert_label used as wscompat converter for label core calls convert_area_inner as well which accesses area.begin_date, area.end_date, area.ended so eagerly load it. Also, add one more test case where these fields aren't NULL.

Add sort_name and comment for eager loading because those are accessed by convert_artist_simple. Change recording to entity0 because hybrid properties are not eagerly loaded correctly.

We load recording links anyway so a len in python is simpler.

When moving to another instance of RabbitMQ (mostly for master not mirror), the exchanges and queues have to be declared using SIR’s `amqp_setup` command, otherwise messages are just dropped by the new RabbitMQ instance. This step was missing in the documentation.

amCap1712 and others added 30 commits May 31, 2022 19:14

Fix SQLAlchemy warning to use text directly

fc5cf88

Add indexing tests for entities using actual data

ce8bf44

Upgrade SQLAlchemy (1/3) from 1.0.19 to 1.1.18

bd99fd4

See https://docs.sqlalchemy.org/en/11/changelog/migration_11.html FIXME: LinkAreaArea.area0 is not instance of InstrumentAttribute

Fix column type check in last_model_in_path

7a2d2ad

The column can now be a hybrid_propertyProxy object which we don't want to skip, necessarily. But the previous behavior of skipping strings and mbdata models has been preserved by checking for those directly.

Fix debug message by creating list from generator

f53b8b7

Update psycopg requirement to -binary

129b2d4

Update to 1.4, fix missing text() error

631b7ec

This brings us back to the same issue as in 1.1 with _wildcard_token, so might as well do try and do the jump in one go.

Document filter_valid_annotations

e96bb3b

remove __tablename__ from column list

34139b1

__tablename__ is not a column anyways. Also, the checks in `defer_everything_but` (before we replaced it with `load_only`) filtered it out.

Update SQLAlchemy documentation link

e668d96

Try fixing AttributeError for session

62b786e

Accept session attribute in index_entity and live_index_entity

5d86c7a

In tests, we need to pass the session manually to control transactions. Refactoring code so that tests and production code execute same code path.

Document requiring MBDB materialized tables

73e2528

Starting from using SQLAlchemy 1.4, SIR crashes if materialized (or denormalized) tables for the MusicBrainz database are not available. Just document it for now and consider SEARCH-687 for better checks.

Merge pull request #111 from yvanzo/sqlalchemy13

cef9055

Upgrade SQLAlchemy from 1.0 to 1.4

Remove unused import

63aa1cb

Do not defer CompositeProperty

6599b4c

Trying to specify the load type for a composite property is erroneous. So filter those properties.

Upgrade SQLAlchemy to latest version

05c277d

amCap1712 and others added 19 commits October 28, 2022 19:48

Eagerly load area fields

99b3208

We access begin_date, end_date and ended fields from area fields in convert_area_inner which is called for all area types. This function is only called by wscompat converter so we only need to add these to extrapaths and not to the main fields in the entity.

Eagerly load artist.sort_name in event indexing

dd87d0d

convert_event used as wscompat converter for event core calls convert_artist_simple as well which accesses sort_name so eagerly load it.

Eagerly load artist_alias.gid in release group indexing (#148)

16577db

convert_release_group used as wscompat converter for release group core calls convert_alias as well which accesses alias.gid so eagerly load it.

Eagerly load artist.comment in release indexing

89ef1ab

convert_release used as wscompat converter for release core calls convert_artist_simple as well which accesses comment so eagerly load it.

Eagerly load packaging.gid in release indexing

80954d6

convert_release used as wscompat converter for release core calls convert_release_packaging as well which accesses comment so eagerly load it.

Eagerly load area0.type.name and area0.ended

a596aa6

convert_area calls convert_area_relation_list which in turn calls convert_area_inner on the related area that accesses these attributes so eagerly load these for performance.

Eagerly load area attributes in place indexing

ade0073

convert_place used as wscompat converter for place core calls convert_area_inner as well which accesses area.type.gid, area.type.name, area.begin_date, area.end_date, area.ended so eagerly load it.

Eagerly load area_type related attributes for artist core

17a8340

convert_artist calls convert_area_inner on the area that accesses these area.type.name and area.type.gid attributes so eagerly load these for performance.

Eagerly load area attributes in label indexing

e9f6181

convert_label used as wscompat converter for label core calls convert_area_inner as well which accesses area.begin_date, area.end_date, area.ended so eagerly load it. Also, add one more test case where these fields aren't NULL.

Add test case for work that includes recording links

ee18b5a

Update extrapaths fields indexed for work

d322f80

Add sort_name and comment for eager loading because those are accessed by convert_artist_simple. Change recording to entity0 because hybrid properties are not eagerly loaded correctly.

Avoid doing recording count in sql query for work

dc96c30

We load recording links anyway so a len in python is simpler.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pull upstream changes #1

Pull upstream changes #1

danielunderwood commented Oct 24, 2023

Pull upstream changes #1

Are you sure you want to change the base?

Pull upstream changes #1

Conversation

danielunderwood commented Oct 24, 2023