CU-8693qx9yp Deid chunking - hugging face pipeline approach #405
Conversation
Great find! It's much better for us to reuse something that someone else has already built, and that's what this seems to do!
With that said, do we need to specifically allow the addition of a config dict when loading the model?
The TransformersNER model gets saved and loaded along with its config already.
Now, for older models, this config would not have the new option set. But the config model has a default value for it, so when initialised from a previous instance it should fall back to the default wherever no value was loaded off disk.
In any case, we certainly shouldn't hijack the MetaCAT config dict. If we need this functionality for some reason, we'd need to create and use a new argument.
If this is so we can inject a new value for chunking_overlap_window before the pipe is created, surely we should just set the config value and call TransformersNER.create_eval_pipe again? In any case, we may want to document this within the config entry so it's clear that simply changing the value does not change behaviour until the pipe is recreated.
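A minimal sketch of what I mean, assuming the option lives at `config.general` and using the method name as quoted above (neither verified against the current code):

```python
# Rough sketch, not verified: change the overlap on the existing config and
# rebuild the pipeline so the new value actually takes effect.
ner = cat._addl_ner[0]  # assumed location of the TransformersNER instance on CAT

ner.config.general.chunking_overlap_window = 10  # set the desired overlap
ner.create_eval_pipe()  # recreate the HF pipeline; editing the config alone changes nothing
```

That way the overlap change is picked up without touching the MetaCAT config dict at all.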
EDIT:
I noticed the multiprocessing DeID tests were failing due to taking too long / timing out. I'll take a look and see what the issue may be.
Added NER config in cat load function
lgtm - just some comments to clarify please
medcat/config_transformers_ner.py
Outdated
@@ -13,6 +13,8 @@ class General(MixingConfig, BaseModel):
     """How many characters are piped at once into the meta_cat class"""
     ner_aggregation_strategy: str = 'simple'
     """Agg strategy for HF pipeline for NER"""
+    chunking_overlap_window: int = 5
empirically 5 is good?
Anthony mentioned he'd want it to be 5, as it offers a good trade-off between computational cost and performance.
I feel 10 would be better, but 5 works as well.
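For context, a sketch of how these two config options could map onto the Hugging Face token-classification pipeline; the model name is a placeholder and the exact wiring inside TransformersNER is an assumption, not confirmed here:

```python
# Sketch only: chunking_overlap_window and ner_aggregation_strategy feeding
# the HF pipeline. "my-deid-model" is a placeholder; the real model/tokenizer
# come from the loaded TransformersNER instance.
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model = AutoModelForTokenClassification.from_pretrained("my-deid-model")
tokenizer = AutoTokenizer.from_pretrained("my-deid-model")

ner_pipe = pipeline(
    task="ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # ner_aggregation_strategy
    stride=5,                       # chunking_overlap_window: token overlap between chunks
)

entities = ner_pipe("A long clinical note that exceeds the model's maximum sequence length...")
```

With a non-zero stride, the pipeline splits inputs longer than the model's max length into overlapping chunks and merges the entity predictions, which is what the overlap value trades off against extra compute.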
Just as a note here: I've isolated the issue to some of the changes in this branch. The test runs fine without these changes, but I don't know why they would have this effect, especially since it seems to persist even when setting the value to 0, which should be the default. EDIT:
lgtm
CU-8693qx9yp: Add warning for deid multiprocessing with (potentially) non-functioning chunking window
Task linked: CU-8693qx9yp Fix chunking issues for De-ID
Looks good to me.
* Cu 8693u6b4u tests continue on fail (#400)
* CU-8693u6b4u: Make sure failed/errored tests fail the main workflow
* CU-8693u6b4u: Attempt to fix deid multiprocessing, at least for GHA
* CU-8693u6b4u: Fix small docstring issue
* CU-8693v3tt6 SOMED opcs refset selection (#402)
* CU-8693v3tt6: Update refset ID for OPCS4 mappings in newer SNOMED releases
* CU-8693v3tt6: Add method to get direct refset mappings
* CU-8693v3tt6: Add tests to direct refset mappings method
* CU-8693v3tt6: Fix OPCS4 refset ID selection logic
* CU-8693v3tt6: Add test for OPCS4 refset ID selection
* CU-8693v6epd: Move typing imports away from pydantic (#403)
* CU-8693qx9yp Deid chunking - hugging face pipeline approach (#405)
* Pushing chunking update
* Update transformers_ner.py
* Pushing update to config: Added NER config in cat load function
* Update cat.py
* Updating chunking overlap
* CU-8693qx9yp: Add warning for deid multiprocessing with (potentially) non-functioning chunking window
* CU-8693qx9yp: Fix linting issue

---------

Co-authored-by: mart-r <[email protected]>

---------

Co-authored-by: Shubham Agarwal <[email protected]>
Adding functionality for chunking documents that exceed the maximum number of tokens the model can process.
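A minimal usage sketch, assuming the `DeIdModel` helper from `medcat.utils.ner.deid` and a local model-pack path (both assumptions, not part of this PR's diff):

```python
# Assumed usage: with chunking enabled, notes longer than the model's max
# token length are processed in overlapping windows instead of being truncated.
from medcat.utils.ner.deid import DeIdModel

deid = DeIdModel.load_model_pack("deid_model_pack.zip")  # hypothetical pack path

long_note = "Patient John Smith was seen on 01/02/2023. " * 500  # exceeds the max sequence length
anonymised = deid.deid_text(long_note, redact=True)
```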