-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[ML] Allow NLP truncate option to be updated when span is set #91224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Pinging @elastic/ml-core (Team:ML) |
|
Hi @davidkyle, I've created a changelog YAML for you. |
dimitris-athanasiou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a comment about clarifying constructors for RobertaTokenization.
| private final boolean addPrefixSpace; | ||
|
|
||
| private RobertaTokenization( | ||
| public RobertaTokenization( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there a constructor that doesn't expect doLowerCase for RobertaTokenization? The only public constructor before this change forces lower case to false. This seems something worth clarifying as we're refactoring the area.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted this change as it was only a convenience for a test. The value of doLowerCase should always be false.
dimitris-athanasiou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
💚 Backport successful
|
* main: (1300 commits) update c2id/c2id-server-demo docker image to support ARM (elastic#91144) Allow legacy index settings on legacy indices (elastic#90264) Skip prevoting if single-node discovery (elastic#91255) Chunked encoding for snapshot status API (elastic#90801) Allow different decay values depending on the score function (elastic#91195) Fix handling indexed envelopes crossing the dateline in mvt API (elastic#91105) Ensure cleanups succeed in JoinValidationService (elastic#90601) Add overflow behaviour test for RecyclerBytesStreamOutput (elastic#90638) More actionable error for ancient indices (elastic#91243) Fix APM configuration file delete (elastic#91058) Clean up handshake test class (elastic#90966) Improve H3#hexRing logic and add H3#areNeighborCells method (elastic#91140) Restrict direct use of `ApplicationPrivilege` constructor (elastic#91176) [ML] Allow NLP truncate option to be updated when span is set (elastic#91224) Support multi-intersection for FieldPermissions (elastic#91169) Support intersecting multi-sets of queries with DocumentPermissions (elastic#91151) Ensure TermsEnum action works correctly with API keys (elastic#91170) Fix NPE in auditing authenticationSuccess for non-existing run-as user (elastic#91171) Ensure PKI's delegated_by_realm metadata respect run-as (elastic#91173) [ML] Update API documentation for anomaly score explanation (elastic#91177) ... # Conflicts: # x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/XPackClientPlugin.java # x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/downsample/RollupShardIndexer.java # x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/downsample/TransportRollupIndexerAction.java # x-pack/plugin/rollup/src/test/java/org/elasticsearch/xpack/rollup/v2/RollupActionSingleNodeTests.java
Models preconfigured with the tokenization options
truncate: NONEandspan: Xwhere X is > 0 error when updating the truncate option. For exampleReturns the error
Because the model is configured with a
spanoption the validation check fails. The work around is to setspan: -1where the value-1unsetsspan.This change wipes out any preexisting
spanoption when truncate is set tofirstorsecond. That in itself is a small change the rest is testing and a refactoring of the Roberta and BERT tokenizers to share common code.