Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix a bug introduced in Synapse v1.74.0 where searching with colons when using ICU for search term tokenisation would fail with an error. #15079

Merged
merged 9 commits into from
Feb 20, 2023

Conversation

reivilibre
Copy link
Contributor

Fixes: #14815

Introduced by: #14464 (as far as I can tell: it changed the tokenisation rules to leave :s in when you have ICU installed.)

Base: develop

This pull request is commit-by-commit review friendly.

Original commit schedule, with full messages:

  1. Add failing test

  2. Small rename of variables

  3. Add minitests to characterise how we do tokenisation today

  4. Add escaping to our tsquery builder

  5. Test that we can find a user by prefix of MXID
    (I wasn't super confident that the ICU tokeniser wouldn't break this, but a bit of poking later,

    I think I'm happy. But thought it should be tested anyway!)

(I wasn't super confident that the ICU tokeniser wouldn't break this, but a bit of poking later,

I think I'm happy. But thought it should be tested anyway!)
Signed-off-by: Olivier Wilkinson (reivilibre) <[email protected]>
@babolivier
Copy link
Contributor

babolivier commented Feb 16, 2023

I'm quite surprised this is an issue with libicu. From testing on my local machine libicu seems to correctly tokenise the string foo:bar into foo, :, bar (and then the colon is removed from the set of results by _parse_words_with_icu's check against the ([\w\-]+) regex):

>>> s = "foo:bar"
>>> import icu
>>> breaker = icu.BreakIterator.createWordInstance(icu.Locale.getDefault())
>>> breaker.setText(s)
>>> b = breaker.nextBoundary()
>>> result = s[0:b]
>>> result
'foo'
>>> b2 = breaker.nextBoundary()
>>> s[b:b2]
':'
>>> import re
>>> re.findall(r"([\w\-]+)", s[b:b2], re.UNICODE)
[]
>>> 

So with this logic, _parse_words should return [foo, bar] when ICU support is available (and also when it's not fwiw).

(though I'll admit I just tinkered with pyicu in the python interpreter and not with an actual instance Synapse so it's fairly likely I've missed something)

@reivilibre
Copy link
Contributor Author

reivilibre commented Feb 16, 2023

I'm quite surprised this is an issue with libicu. From testing on my local machine libicu seems to correctly tokenise the string foo:bar into foo, :, bar (and then the colon is removed from the set of results by _parse_words_with_icu's check against the ([\w\-]+) regex):

I didn't realise this to start with of course, but on CI and on my local machine, ICU is giving two different tokenisations. Possibly the underlying version of libICU itself is different?

edit: I'm not even sure my homeserver runs with ICU though, so maybe there's more to it than meets the eye.

edit: actually it does run with ICU. I don't remember installing it.

@reivilibre reivilibre marked this pull request as ready for review February 16, 2023 12:52
@reivilibre reivilibre requested a review from a team as a code owner February 16, 2023 12:52
@babolivier
Copy link
Contributor

babolivier commented Feb 16, 2023

Possibly the underlying version of libICU itself is different?

Possibly yeah, though I'd be a bit surprised if the tokenisation differs so much between versions. Not sure if it's worth investigating. Agreed it's weird that it gives you different results between CI and your local machine though.

edit: actually it does run with ICU. I don't remember installing it.

If you install Synapse with the all extra (or use either Docker or the debs), pyicu will be installed in Synapse's venv. iirc libicu itself is installed by default on a number of systems and distros.

@reivilibre
Copy link
Contributor Author

If you install Synapse with the all extra (or use either Docker or the debs), pyicu will be installed in Synapse's venv. iirc libicu itself is installed by default on a number of systems and distros.

Yeah, indeed, but I don't use [all] in my installation tool. But it's possible I did it manually once when trying a git branch for something experimental and ICU has stuck around since.

@reivilibre
Copy link
Contributor Author

reivilibre commented Feb 16, 2023

I'd be a bit surprised if the tokenisation differs so much between versions. Not sure if it's worth investigating. Agreed it's weird that it gives you different results between CI and your local machine though.

Yes and no. My distro has 2 different versions of ICU packaged which is unusual unless they're known to be substantially different (smells like some software is only able to cope with one of the two). As for CI vs local machine, CI is running Ubuntu 20.04 and I'm on 22.04 (the major version of ICU is also different by a few versions). So I'm not entirely shocked.

Running your example:

Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "foo:bar"
>>> import icu
>>> breaker = icu.BreakIterator.createWordInstance(icu.Locale.getDefault())
>>> breaker.setText(s)
>>> b = breaker.nextBoundary()
>>> result = s[0:b]
>>> result
'foo:bar'
>>> b2 = breaker.nextBoundary()
>>> s[b:b2]
''
>>> import re
>>> re.findall(r"([\w\-]+)", s[b:b2], re.UNICODE)
[]
>>> 

plainly different. Maybe it makes sense? I think I've seen some languages use : in words to denote multiple variants, or something along those lines. I guess tokenisation is hard.

edit: or maybe using the 'default' locale is making a difference

@babolivier
Copy link
Contributor

edit: or maybe using the 'default' locale is making a difference

Yeah I was going to comment that - maybe the default locale is the differentiating factor here.

Copy link
Contributor

@DMRobertson DMRobertson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sane, but one point to check:

tests/handlers/test_user_directory.py Outdated Show resolved Hide resolved
Comment on lines +485 to +497
@override_config({"user_directory": {"search_all_users": True}})
def test_search_user_dir_start_of_user_id(self) -> None:
"""Tests that a user can look up another user by searching for the start
of their user ID.
"""
r = self.get_success(self.store.search_user_dir(ALICE, "somenickname:exa", 10))
self.assertFalse(r["limited"])
self.assertEqual(1, len(r["results"]))
self.assertDictEqual(
r["results"][0],
{"user_id": BELA, "display_name": "Bela", "avatar_url": None},
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test (and any others) running under with-ICU and without-ICU? (Should it?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now it is :)

@reivilibre reivilibre merged commit 1cbc3f1 into develop Feb 20, 2023
@reivilibre reivilibre deleted the rei/userdir_search_with_NEKUDOTAYIM branch February 20, 2023 12:00
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Mar 5, 2023
Synapse 1.78.0 (2023-02-28)
===========================

Bugfixes
--------

- Fix a bug introduced in Synapse 1.76 where 5s delays would occasionally occur in deployments using workers. ([\#15150](matrix-org/synapse#15150))


Synapse 1.78.0rc1 (2023-02-21)
==============================

Features
--------

- Implement the experimental `exact_event_match` push rule condition from [MSC3758](matrix-org/matrix-spec-proposals#3758). ([\#14964](matrix-org/synapse#14964))
- Add account data to the command line [user data export tool](https://matrix-org.github.io/synapse/v1.78/usage/administration/admin_faq.html#how-can-i-export-user-data). ([\#14969](matrix-org/synapse#14969))
- Implement [MSC3873](matrix-org/matrix-spec-proposals#3873) to disambiguate push rule keys with dots in them. ([\#15004](matrix-org/synapse#15004))
- Allow Synapse to use a specific Redis [logical database](https://redis.io/commands/select/) in worker-mode deployments. ([\#15034](matrix-org/synapse#15034))
- Tag opentracing spans for federation requests with the name of the worker serving the request. ([\#15042](matrix-org/synapse#15042))
- Implement the experimental `exact_event_property_contains` push rule condition from [MSC3966](matrix-org/matrix-spec-proposals#3966). ([\#15045](matrix-org/synapse#15045))
- Remove spurious `dont_notify` action from the defaults for the `.m.rule.reaction` pushrule. ([\#15073](matrix-org/synapse#15073))
- Update the error code returned when user sends a duplicate annotation. ([\#15075](matrix-org/synapse#15075))


Bugfixes
--------

- Prevent clients from reporting nonexistent events. ([\#13779](matrix-org/synapse#13779))
- Return spec-compliant JSON errors when unknown endpoints are requested. ([\#14605](matrix-org/synapse#14605))
- Fix a long-standing bug where the room aliases returned could be corrupted. ([\#15038](matrix-org/synapse#15038))
- Fix a bug introduced in Synapse 1.76.0 where partially-joined rooms could not be deleted using the [purge room API](https://matrix-org.github.io/synapse/latest/admin_api/rooms.html#delete-room-api). ([\#15068](matrix-org/synapse#15068))
- Fix a long-standing bug where federated joins would fail if the first server in the list of servers to try is not in the room. ([\#15074](matrix-org/synapse#15074))
- Fix a bug introduced in Synapse v1.74.0 where searching with colons when using ICU for search term tokenisation would fail with an error. ([\#15079](matrix-org/synapse#15079))
- Reduce the likelihood of a rare race condition where rejoining a restricted room over federation would fail. ([\#15080](matrix-org/synapse#15080))
- Fix a bug introduced in Synapse 1.76 where workers would fail to start if the `health` listener was configured. ([\#15096](matrix-org/synapse#15096))
- Fix a bug introduced in Synapse 1.75 where the [portdb script](https://matrix-org.github.io/synapse/release-v1.78/postgres.html#porting-from-sqlite) would fail to run after a room had been faster-joined. ([\#15108](matrix-org/synapse#15108))


Improved Documentation
----------------------

- Document how to start Synapse with Poetry. Contributed by @thezaidbintariq. ([\#14892](matrix-org/synapse#14892), [\#15022](matrix-org/synapse#15022))
- Update delegation documentation to clarify that SRV DNS delegation does not eliminate all needs to serve files from .well-known locations. Contributed by @williamkray. ([\#14959](matrix-org/synapse#14959))
- Fix a mistake in registration_shared_secret_path docs. ([\#15078](matrix-org/synapse#15078))
- Refer to a more recent blog post on the [Database Maintenance Tools](https://matrix-org.github.io/synapse/latest/usage/administration/database_maintenance_tools.html) page. Contributed by @jahway603. ([\#15083](matrix-org/synapse#15083))


Internal Changes
----------------

- Re-type hint some collections as read-only. ([\#13755](matrix-org/synapse#13755))
- Faster joins: don't stall when another user joins during a partial-state room resync. ([\#14606](matrix-org/synapse#14606))
- Add a class `UnpersistedEventContext` to allow for the batching up of storing state groups. ([\#14675](matrix-org/synapse#14675))
- Add a check to ensure that locked dependencies have source distributions available. ([\#14742](matrix-org/synapse#14742))
- Tweak comment on `_is_local_room_accessible` as part of room visibility in `/hierarchy` to clarify the condition for a room being visible. ([\#14834](matrix-org/synapse#14834))
- Prevent `WARNING: there is already a transaction in progress` lines appearing in PostgreSQL's logs on some occasions. ([\#14840](matrix-org/synapse#14840))
- Use `StrCollection` to avoid potential bugs with `Collection[str]`. ([\#14929](matrix-org/synapse#14929))
- Improve performance of `/sync` in a few situations. ([\#14973](matrix-org/synapse#14973))
- Limit concurrent event creation for a room to avoid state resolution when sending bursts of events to a local room. ([\#14977](matrix-org/synapse#14977))
- Skip calculating unread push actions in /sync when enable_push is false. ([\#14980](matrix-org/synapse#14980))
- Add a schema dump symlinks inside `contrib`, to make it easier for IDEs to interrogate Synapse's database schema. ([\#14982](matrix-org/synapse#14982))
- Improve type hints. ([\#15008](matrix-org/synapse#15008), [\#15026](matrix-org/synapse#15026), [\#15027](matrix-org/synapse#15027), [\#15028](matrix-org/synapse#15028), [\#15031](matrix-org/synapse#15031), [\#15035](matrix-org/synapse#15035), [\#15052](matrix-org/synapse#15052), [\#15072](matrix-org/synapse#15072), [\#15084](matrix-org/synapse#15084))
- Update [MSC3952](matrix-org/matrix-spec-proposals#3952) support based on changes to the MSC. ([\#15037](matrix-org/synapse#15037))
- Avoid mutating a cached value in `get_user_devices_from_cache`. ([\#15040](matrix-org/synapse#15040))
- Fix a rare exception in logs on start up. ([\#15041](matrix-org/synapse#15041))
- Update pyo3-log to v0.8.1. ([\#15043](matrix-org/synapse#15043))
- Avoid mutating cached values in `_generate_sync_entry_for_account_data`. ([\#15047](matrix-org/synapse#15047))
- Refactor arguments of `try_unbind_threepid` and `_try_unbind_threepid_with_id_server` to not use dictionaries. ([\#15053](matrix-org/synapse#15053))
- Merge debug logging from the hotfixes branch. ([\#15054](matrix-org/synapse#15054))
- Faster joins: omit device list updates originating from partial state rooms in /sync responses without lazy loading of members enabled. ([\#15069](matrix-org/synapse#15069))
- Fix clashing database transaction name. ([\#15070](matrix-org/synapse#15070))
- Upper-bound frozendict dependency. This works around us being unable to test installing our wheels against Python 3.11 in CI. ([\#15114](matrix-org/synapse#15114))
- Tweak logging for when a worker waits for its view of a replication stream to catch up. ([\#15120](matrix-org/synapse#15120))

<details><summary>Locked dependency updates</summary>

- Bump bleach from 5.0.1 to 6.0.0. ([\#15059](matrix-org/synapse#15059))
- Bump cryptography from 38.0.4 to 39.0.1. ([\#15020](matrix-org/synapse#15020))
- Bump ruff version from 0.0.230 to 0.0.237. ([\#15033](matrix-org/synapse#15033))
- Bump dtolnay/rust-toolchain from 9cd00a88a73addc8617065438eff914dd08d0955 to 25dc93b901a87e864900a8aec6c12e9aa794c0c3. ([\#15060](matrix-org/synapse#15060))
- Bump systemd-python from 234 to 235. ([\#15061](matrix-org/synapse#15061))
- Bump serde_json from 1.0.92 to 1.0.93. ([\#15062](matrix-org/synapse#15062))
- Bump types-requests from 2.28.11.8 to 2.28.11.12. ([\#15063](matrix-org/synapse#15063))
- Bump types-pillow from 9.4.0.5 to 9.4.0.10. ([\#15064](matrix-org/synapse#15064))
- Bump sentry-sdk from 1.13.0 to 1.15.0. ([\#15065](matrix-org/synapse#15065))
- Bump types-jsonschema from 4.17.0.3 to 4.17.0.5. ([\#15099](matrix-org/synapse#15099))
- Bump types-bleach from 5.0.3.1 to 6.0.0.0. ([\#15100](matrix-org/synapse#15100))
- Bump dtolnay/rust-toolchain from 25dc93b901a87e864900a8aec6c12e9aa794c0c3 to e12eda571dc9a5ee5d58eecf4738ec291c66f295. ([\#15101](matrix-org/synapse#15101))
- Bump dawidd6/action-download-artifact from 2.24.3 to 2.25.0. ([\#15102](matrix-org/synapse#15102))
- Bump types-pillow from 9.4.0.10 to 9.4.0.13. ([\#15104](matrix-org/synapse#15104))
- Bump types-setuptools from 67.1.0.0 to 67.3.0.1. ([\#15105](matrix-org/synapse#15105))


</details>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

User directory search fails if your query has : in it
4 participants