
Conversation

@Krishna-Rani-t (Collaborator)

No description provided.

@HamedBabaei (Member) left a comment:

Hi @Krishna-Rani-t, the reviews are ready.

Please also check whether you are using pre-commit here, as I see a few linting issues.

from .rag import AutoRAGLearner
from .prompt import StandardizedPrompting
from .label_mapper import LabelMapper
from .taxonomy_discovery.rwthdbis import RWTHDBISSFTLearner as RWTHDBISTaxonomyLearner

Member:

Using "as" aliases here is not recommended, so I would recommend this way of importing:

from .taxonomy_discovery import RWTHDBISSFTLearner
  • don't add .rwthdbis
  • keep only the class name, renamed from RWTHDBISSFTLearner to RWTHDBISTaxonomyLearner at its definition (see the sketch below)
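
A minimal sketch of the suggested rename (file paths are assumptions based on the original import):

# learner/taxonomy_discovery/rwthdbis.py: rename at the definition site
class RWTHDBISTaxonomyLearner:  # formerly RWTHDBISSFTLearner
    ...

# learner/taxonomy_discovery/__init__.py: re-export so the package-level import works
from .rwthdbis import RWTHDBISTaxonomyLearner

# learner/__init__.py: plain import, no alias needed
from .taxonomy_discovery import RWTHDBISTaxonomyLearner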

from .prompt import StandardizedPrompting
from .label_mapper import LabelMapper
from .taxonomy_discovery.rwthdbis import RWTHDBISSFTLearner as RWTHDBISTaxonomyLearner
from .term_typing.rwthdbis import RWTHDBISSFTLearner as RWTHDBISTermTypingLearner

Member:

The previous comment applies to this line as well.

Member:

Remove the changes from here and add them to the text section of the PR.

Member:

Code-level documentation (functions' docstrings) is missing from this script!
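
For example, a minimal docstring sketch in the style I would expect (the argument semantics here are my assumptions):

def _term_typing(self, data: Any, test: bool = False) -> Optional[Any]:
    """Run term typing over the given dataset.

    Args:
        data: Input dataset to extract and type terms from.
        test: If True, predict types for evaluation terms instead of training.

    Returns:
        Structured predictions in test mode, otherwise None.
    """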

self,
min_predictions: int = 1,
model_name: str = "distilroberta-base",
output_dir: str = "./results/{model_name}",

Member:

Here it would be great to do:

        output_dir: str = "./results/taxonomy-discovery",

self.id2label: Dict[int, str] = {}
self.label2id: Dict[str, int] = {}

def _term_typing(self, data: Any, test: bool = False) -> Optional[Any]:

Member:

Please use if and else here for clarity.
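
A minimal sketch of the explicit branching (the non-test branch is an assumption on my part):

def _term_typing(self, data: Any, test: bool = False) -> Optional[Any]:
    if test:
        terms = self._collect_eval_terms(data)
        return self._predict_structured_output(terms)
    else:
        return self._fit(data)  # hypothetical training path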

terms = self._collect_eval_terms(data)
return self._predict_structured_output(terms)

def _load_robust_tokenizer(self, backbone: str) -> AutoTokenizer:

Member:

Why not just use AutoTokenizer?
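
For reference, a minimal sketch of the plain usage:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")  # or whatever backbone is passed in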

" - pip install --upgrade sentencepiece\n"
" - ensure network access for model files\n"
" - clear your HF cache and retry\n"
" - pin versions: transformers==4.43.*, tokenizers<0.20\n"

Member:

This conflicts with the Ontolearner requirements! So please try to remove this exception, as it is not well aligned with the library.

f"Original error: {final_err}"
)

def _expand_multilabel_training_rows(

Member:

These function arguments are not formatted consistently. Are you using pre-commit?

Member:

I see a few class functions are static; is there any specific reason for that? If so, the other class (in term typing) doesn't have that feature!

@HamedBabaei (Member):

Please also add unit tests for the models!
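
For instance, a minimal unittest sketch (the import path and constructor arguments are assumptions based on the snippets above):

import unittest

from ontolearner.learner.term_typing import RWTHDBISSFTLearner  # hypothetical path

class TestRWTHDBISTermTypingLearner(unittest.TestCase):
    def test_default_construction(self):
        # smoke test: the learner should build with its documented defaults
        learner = RWTHDBISSFTLearner(model_name="distilroberta-base")
        self.assertIsNotNone(learner)

if __name__ == "__main__":
    unittest.main()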

Member:

@Krishna-Rani-t I see this is becoming problematic! So here is the new idea:

Let's not import the models here! So imports like

from .taxonomy_discovery.skhnlp import SKHNLPSequentialFTLearner, SKHNLPZSLearner
from .taxonomy_discovery.sbunlp import SBUNLPFewShotLearner
from .term_typing.sbunlp import SBUNLPZSLearner
from .text2onto import SBUNLPFewShotLearner as SBUNLPText2OntoLearner

or similar will be removed from this __init__, and in ontolearner/__init__.py you DO NOT NEED to do the following imports:

(RWTHDBISTaxonomyLearner,
 RWTHDBISTermTypingLearner,
 SKHNLPZSLearner,
 SKHNLPSequentialFTLearner,
 SBUNLPFewShotLearner,
 SBUNLPZSLearner,
 SBUNLPText2OntoLearner)

In your examples, to load, say, SKHNLPZSLearner, you will do this:

from ontolearner.learner.taxonomy_discovery import SKHNLPZSLearner

and if you use the same class name inside learner/term_typing, it will be:

from ontolearner.learner.term_typing import SKHNLPZSLearner

@HamedBabaei (Member) left a comment:

Hi @Krishna-Rani-t, please check out the comments, and once you have made the fixes, please add a justification comment for each one.

Member:

Please add clear docstrings to the functions.

import os
import re
import json
import importlib.util

Member:

Why is the import being done in this format?


self.tokenizer: Optional[AutoTokenizer] = None
self.model: Optional[AutoModelForCausalLM] = None
self.device = "cuda" if torch.cuda.is_available() else "cpu"

Member:

The device should be passed by the user as an argument, and it would be great to have cpu as the default value!
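
A minimal sketch of the suggested signature (other parameters elided):

def __init__(self, device: str = "cpu") -> None:
    # device is user-supplied; default to cpu instead of auto-detecting cuda
    self.device = device

A caller who wants a GPU then opts in explicitly, e.g. LocalAutoLLM(device="cuda").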


self.train_pairs_clean: List[Dict[str, str]] = []

# ----------------------- small helpers ----------------------

Member:

Remove this small-helpers comment line and add notes about it inside the function docstrings.

if maybe_path:
os.makedirs(maybe_path, exist_ok=True)

# ---------------------- model load/gen ----------------------

Member:

Again, remove this comment line and move it inside the load function.

return decoded_outputs

# -----------------------------------------------------------------------------
# Main Learner: SBUNLPFewShotLearner (Task A Text2Onto)

Member:

These comments are not necessary!

# -----------------------------------------------------------------------------
# Concrete AutoLLM: local HF wrapper that follows the AutoLLM interface
# -----------------------------------------------------------------------------
class LocalAutoLLM(AutoLLM):

Member:

AutoLLM is doing the same work! Why not call this QuantizedAutoLLM instead of LocalAutoLLM? And you should probably move this to the place where AutoLLM is defined; here is not the correct place.

if self.device == "cpu":
self.model.to("cpu")

def generate(self, inputs: List[str], max_new_tokens: int = 64, temperature: float = 0.0, top_p: float = 1.0) -> List[str]:

Member:

Follow the AutoLLM generate argument pattern! Don't add args that are not in that function's signature; instead, inside __init__ it would be great to leave temperature and top_p as defaults.
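
A minimal sketch of that refactor, assuming the base signature is generate(inputs, max_new_tokens):

from typing import List

class QuantizedAutoLLM(AutoLLM):  # AutoLLM import omitted; name follows the earlier comment
    def __init__(self, temperature: float = 0.0, top_p: float = 1.0) -> None:
        # sampling settings become constructor defaults rather than generate() arguments
        self.temperature = temperature
        self.top_p = top_p

    def generate(self, inputs: List[str], max_new_tokens: int = 64) -> List[str]:
        # same argument pattern as AutoLLM.generate; self.temperature and self.top_p are used internally
        ...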

flattened_list = [a_match or b_match for a_match, b_match in quoted_matches]
return flattened_list

def _call_model_one(self, prompt: str, max_new_tokens: int = 120) -> str:

Member:

This is not good naming! Don't use names like _call_model_one; it is unclear.

predicted_pairs.add((doc_id, value.strip().lower()))
return predicted_pairs

def evaluate_extraction_f1(self, terms2doc_path: str, predicted_jsonl: str, key: str = "term") -> float:

Member:

Evaluation should not be defined here; there is a dedicated evaluation module in the library. This should be defined there if it is different from the others!
