add rwth_dbis learner models #284
base: dev
Conversation
Hi @Krishna-Rani-t, the reviews are ready.
Please also check whether you are using pre-commit here, as I see a few linting issues.
from .rag import AutoRAGLearner
from .prompt import StandardizedPrompting
from .label_mapper import LabelMapper
from .taxonomy_discovery.rwthdbis import RWTHDBISSFTLearner as RWTHDBISTaxonomyLearner
Using "as" here is not recommended, so I would recommend this way of importing:

from .taxonomy_discovery import RWTHDBISSFTLearner

Don't add .rwthdbis, and rename the class itself from RWTHDBISSFTLearner to RWTHDBISTaxonomyLearner so that no alias is needed.
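A minimal sketch of how that could look, assuming the class stays in taxonomy_discovery/rwthdbis.py and is re-exported from the package __init__ (file layout taken from this PR; the base class is omitted):

# taxonomy_discovery/rwthdbis.py -- rename the class itself
class RWTHDBISTaxonomyLearner:  # previously RWTHDBISSFTLearner
    ...

# taxonomy_discovery/__init__.py -- re-export under the new name
from .rwthdbis import RWTHDBISTaxonomyLearner

# learner/__init__.py -- no submodule path and no alias
from .taxonomy_discovery import RWTHDBISTaxonomyLearner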
from .prompt import StandardizedPrompting
from .label_mapper import LabelMapper
from .taxonomy_discovery.rwthdbis import RWTHDBISSFTLearner as RWTHDBISTaxonomyLearner
from .term_typing.rwthdbis import RWTHDBISSFTLearner as RWTHDBISTermTypingLearner
The previous comment is applicable to this line of code as well.
Remove changes from here and add them to the text section of the PR
Code-level documentation (functions' docstrings) is missing from this script!
self,
min_predictions: int = 1,
model_name: str = "distilroberta-base",
output_dir: str = "./results/{model_name}",
Here it would be great to do:
output_dir: str = "./results/taxonomy-discovery",

self.id2label: Dict[int, str] = {}
self.label2id: Dict[str, int] = {}

def _term_typing(self, data: Any, test: bool = False) -> Optional[Any]:
please use if and else here for clarity
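A minimal sketch of the suggested structure, assuming the non-test branch is where training happens (the training helper name below is hypothetical):

def _term_typing(self, data: Any, test: bool = False) -> Optional[Any]:
    if test:
        terms = self._collect_eval_terms(data)
        return self._predict_structured_output(terms)
    else:
        self._train_term_typing(data)  # hypothetical name for the existing training path
        return None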
terms = self._collect_eval_terms(data)
return self._predict_structured_output(terms)

def _load_robust_tokenizer(self, backbone: str) -> AutoTokenizer:
Why not just use AutoTokenizer?
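A minimal sketch, assuming the tokenizer can be loaded directly with transformers' AutoTokenizer (which this file already references):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(backbone)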
| " - pip install --upgrade sentencepiece\n" | ||
| " - ensure network access for model files\n" | ||
| " - clear your HF cache and retry\n" | ||
| " - pin versions: transformers==4.43.*, tokenizers<0.20\n" |
This conflicts with OntoLearner's requirements! So please try to remove this exception, as it is not well aligned with the library.
| f"Original error: {final_err}" | ||
| ) | ||
|
|
||
| def _expand_multilabel_training_rows( |
These function arguments are not formatted consistently. Are you using pre-commit?
I see a few class functions are static; is there any specific reason for that? If there is, note that the other class (in term typing) doesn't have such a feature!
Please also add unit tests for the models!
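A minimal sketch of what such a test could look like (pytest-style; the import path and constructor arguments are assumptions based on this PR and should be adjusted to the final class names):

from ontolearner.learner.term_typing import RWTHDBISTermTypingLearner  # hypothetical final import path

def test_rwthdbis_term_typing_learner_instantiates():
    # smoke test: the constructor should work with the defaults shown in this PR
    learner = RWTHDBISTermTypingLearner(model_name="distilroberta-base")
    assert learner is not None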
@Krishna-Rani-t I see this is becoming problematic! So here is the new idea:
Let's not import the models here. Imports such as

from .taxonomy_discovery.skhnlp import SKHNLPSequentialFTLearner, SKHNLPZSLearner
from .taxonomy_discovery.sbunlp import SBUNLPFewShotLearner
from .term_typing.sbunlp import SBUNLPZSLearner
from .text2onto import SBUNLPFewShotLearner as SBUNLPText2OntoLearner

or similar will be removed from this init, and in ontolearner/__init__.py you DO NOT NEED to do the following imports:

RWTHDBISTaxonomyLearner,
RWTHDBISTermTypingLearner,
SKHNLPZSLearner,
SKHNLPSequentialFTLearner,
SBUNLPFewShotLearner,
SBUNLPZSLearner,
SBUNLPText2OntoLearner

In your examples, for loading, say, SKHNLPZSLearner, you will do this:

from ontolearner.learner.taxonomy_discovery import SKHNLPZSLearner

If you use the same class name inside learner/term_typing, it will be:

from ontolearner.learner.term_typing import SKHNLPZSLearner
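For example, an example script would then start roughly like this (the constructor call is a placeholder; configure the learner as the class actually requires):

from ontolearner.learner.taxonomy_discovery import SKHNLPZSLearner

learner = SKHNLPZSLearner()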
Hi @Krishna-Rani-t, please check out the comments, and once you have made the fixes, please add a comment of justification for each one.
please add a clear docstring to the functions.
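Something along these lines would already help (the function name and wording are only illustrative):

def _build_prompt(self, term: str) -> str:
    """Build the few-shot prompt for a single term.

    Args:
        term: Surface form of the term to insert into the prompt template.

    Returns:
        The fully formatted prompt string that is passed to the model.
    """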
import os
import re
import json
import importlib.util
Why is the import being done in this format?
self.tokenizer: Optional[AutoTokenizer] = None
self.model: Optional[AutoModelForCausalLM] = None
self.device = "cuda" if torch.cuda.is_available() else "cpu"
The device should be something that is passed by the user as an argument, and it would be great for the default value to be cpu.
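A minimal sketch of the suggested change (the class name and other arguments are illustrative; only the device handling matters here):

class LocalAutoLLM:
    def __init__(self, model_name: str, device: str = "cpu") -> None:
        self.model_name = model_name
        # the caller chooses the device; default to CPU instead of auto-detecting CUDA
        self.device = device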
self.train_pairs_clean: List[Dict[str, str]] = []

# ----------------------- small helpers ----------------------
Remove this "small helpers" comment line and add a note about it inside the function docstring instead.
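For example, the directory-creation snippet below could become (the helper name here is hypothetical):

def _ensure_output_dir(self, maybe_path: str) -> None:
    """Small helper: create the output directory if a path was given."""
    if maybe_path:
        os.makedirs(maybe_path, exist_ok=True)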
if maybe_path:
    os.makedirs(maybe_path, exist_ok=True)

# ---------------------- model load/gen ----------------------
again, remove this comment line and move it inside the load function.
return decoded_outputs

# -----------------------------------------------------------------------------
# Main Learner: SBUNLPFewShotLearner (Task A Text2Onto)
these comments are not necessary!
# -----------------------------------------------------------------------------
# Concrete AutoLLM: local HF wrapper that follows the AutoLLM interface
# -----------------------------------------------------------------------------
class LocalAutoLLM(AutoLLM):
AutoLLM is doing the same work! Why not call this QuantizedAutoLLM instead of LocalAutoLLM? Also, this is probably not the correct place to define it; it should live where AutoLLM is defined.
| if self.device == "cpu": | ||
| self.model.to("cpu") | ||
|
|
||
| def generate(self, inputs: List[str], max_new_tokens: int = 64, temperature: float = 0.0, top_p: float = 1.0) -> List[str]: |
Follow the AutoLLM generate argument pattern! Don't add arguments that are not part of that function's signature; instead, it would be great to keep temperature and top_p as defaults set in the init.
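A minimal sketch of the suggested shape, assuming the base generate only takes the inputs and a token limit (please double-check the actual AutoLLM signature); temperature and top_p become constructor defaults:

from typing import List

class QuantizedAutoLLM:  # would still inherit from AutoLLM as in the current file
    def __init__(self, temperature: float = 0.0, top_p: float = 1.0) -> None:
        # sampling defaults live on the instance rather than in generate()
        self.temperature = temperature
        self.top_p = top_p

    def generate(self, inputs: List[str], max_new_tokens: int = 64) -> List[str]:
        # decoding logic stays the same, but reads self.temperature and self.top_p
        raise NotImplementedError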
flattened_list = [a_match or b_match for a_match, b_match in quoted_matches]
return flattened_list

def _call_model_one(self, prompt: str, max_new_tokens: int = 120) -> str:
Not good naming! Don't use names like this; _call_model_one is unclear.
predicted_pairs.add((doc_id, value.strip().lower()))
return predicted_pairs

def evaluate_extraction_f1(self, terms2doc_path: str, predicted_jsonl: str, key: str = "term") -> float:
Evaluation should not be defined here; there is a dedicated evaluation module in the library. If this metric is different from the existing ones, it should be defined there instead!