
Correct our use of synonyms for ES6 #381

Closed
orangejulius opened this issue Sep 27, 2019 · 5 comments · Fixed by #390

Comments

@orangejulius

orangejulius commented Sep 27, 2019

While testing ES6 support, I ran into the following error while running an OSM import:

```
type=illegal_argument_exception, reason=startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=9,endOffset=20,lastStartOffset=14 for field 'name.default'
```

After some digging it appears this error is related to token position offsets created by the Synonym token filter.

There is a very interesting Elastic blog post from 2017 discussing the solution: the new `synonym_graph` token filter and how to use it to improve how synonym expansion works. We'll need to figure out what the right solution is here for ES6 support.

Overall, the biggest takeaway appears to be this:

> To make multi-token synonyms work correctly you must apply your synonyms at query time, not index-time, since a Lucene index cannot store a token graph.
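For context, the blog post's advice could be sketched roughly like this (a hypothetical ES6 settings fragment; the filter and analyzer names here are made up for illustration and are not the actual Pelias schema). The `synonym_graph` filter is applied only in the `search_analyzer`, so the index itself never has to store a token graph:

```json
PUT /pelias
{
  "settings": {
    "analysis": {
      "filter": {
        "name_synonyms": {
          "type": "synonym_graph",
          "synonyms": ["rd, road", "st, street, saint"]
        }
      },
      "analyzer": {
        "index_plain": {
          "tokenizer": "standard",
          "filter": ["lowercase"]
        },
        "search_with_synonyms": {
          "tokenizer": "standard",
          "filter": ["lowercase", "name_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "index_plain",
          "search_analyzer": "search_with_synonyms"
        }
      }
    }
  }
}
```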

Connects pelias/pelias#719

@missinglink

Hmm, very interesting. I don't think we could get away with doing our synonyms at query time because of autocomplete.
E.g. if the source data was `rd` and the user entered `roa`, then the documents would not match using query-time synonym substitution.

I suspect it would also increase response times significantly, and potentially change some behaviour, so it's not a simple thing to change.

It looks like this bug exclusively affects multi-word synonyms, and we have relatively few of those?
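To illustrate the autocomplete concern: with index-time expansion, `rd` is expanded to `road` before the edge-gram filter runs, so the index contains the prefixes a user types mid-word. The following sketch is illustrative only (the token lists are simplified, not actual analyzer output):

```
index-time synonyms (current approach):
  index: "rd" -> synonym -> ["rd", "road"] -> edgegram -> ["r", "rd", "ro", "roa", "road"]
  query: "roa" -> matches the indexed prefix "roa"

query-time synonyms:
  index: "rd" -> edgegram -> ["r", "rd"]
  query: "roa" -> no synonym rule fires on a partial word -> no match
```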

@missinglink

missinglink commented Sep 27, 2019

I just looked and I couldn't find any multi-token synonyms listed in this repo. Any idea which synonym is causing this?

[edit] there are multi-token synonyms in this repo after all! See https://github.com/pelias/schema/pull/388/files
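For reference, a multi-token synonym is any rule where a side of the rule contains whitespace. In Solr-format synonym rules that looks like this (illustrative examples, not the repo's actual rules):

```
# single-token synonyms (one word on each side): safe with the plain `synonym` filter
rd, road
st, street

# multi-token synonym (a side contains a space): produces a token
# graph, which the plain `synonym` filter flattens incorrectly
ny, new york
```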

@Joxit

Joxit commented Sep 27, 2019

Query-time synonyms could be useful if we changed our synonyms often, but that's not really the case here.
And as @missinglink says, I'm a bit worried about the response time and the CPU and IO load that query-time synonyms would cause...

@orangejulius

Ohhh! You know what? I was testing using one of our geocode.earth client configurations. They have multi-token synonyms. So it's still important to consider, but it won't affect "stock" Pelias.

orangejulius added a commit that referenced this issue Sep 27, 2019
This is an exploration of using the `synonym_graph` filter instead of
the `synonym` filter. Quite a few integration tests fail, but they all
look to be simple order changes of otherwise identical tokens.

I didn't bother to fix them all because I'd first like to explore
whether or not this change actually has any effect on query results or
ES6 compatibility.

Connects #381
missinglink pushed a commit that referenced this issue Oct 30, 2019
This is an exploration of using the `synonym_graph` filter instead of
the `synonym` filter. Quite a few integration tests fail, but they all
look to be simple order changes of otherwise identical tokens.

I didn't bother to fix them all because I'd first like to explore
whether or not this change actually has any effect on query results or
ES6 compatibility.

Connects #381
@missinglink

Okay, so I have tracked down a reproducible test case: https://gist.github.com/missinglink/8f55271dcf4f5e7e8d0712b1f2c8d742

A simple way to trigger this error is with:

```
POST http://localhost:9200/pelias/_analyze
{
    "analyzer": "peliasIndexOneEdgeGram",
    "text": "set"
}
```

The synonym generation goes haywire. Note that the offsets go backwards: the token `r` at position 2 has `start_offset` 1, while the following `7` token (also position 2) drops back to `start_offset` 0, which is exactly the condition the `illegal_argument_exception` complains about:

{
    "tokens": [
        {
            "token": "s",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        },
        {
            "token": "se",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        },
        {
            "token": "set",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        },
        {
            "token": "sep",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "sept",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "septi",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "septie",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "septiem",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "septiemb",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "septiembr",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "septiembre",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "setb",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "setbr",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "setbre",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "sepe",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "sepb",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "sepbr",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "sepbre",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "7",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "7b",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "7br",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "7bre",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "b",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "br",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "bre",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "7",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "7r",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "7re",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 1
        },
        {
            "token": "r",
            "start_offset": 1,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "re",
            "start_offset": 1,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "7",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "s",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "se",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "sep",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 2
        },
        {
            "token": "r",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 3
        },
        {
            "token": "re",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 3
        },
        {
            "token": "b",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 3
        },
        {
            "token": "br",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 3
        },
        {
            "token": "bre",
            "start_offset": 0,
            "end_offset": 3,
            "type": "SYNONYM",
            "position": 3
        }
    ]
}

@missinglink missinglink changed the title Correct our use of multi-token synonyms for ES6 Correct our use of synonyms for ES6 Oct 31, 2019