Enable T-Res with GPU [PR open] #274
Mentioned in Mariona's handover doc:

> We also need to update to use a dataset. See here for how to do this: https://huggingface.co/docs/transformers/en/main_classes/pipelines. Create a custom dataset: https://huggingface.co/docs/datasets/en/create_dataset
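The handover note suggests feeding the pipeline an iterable/dataset rather than calling it string by string, so the model can batch inputs itself. A minimal sketch of that pattern, with a plain generator standing in for the dataset and `len` standing in for the pipeline (`mention_stream` and `run_pipeline` are hypothetical names, not T-Res code):

```python
def mention_stream(rows):
    # Yield one text at a time, so the pipeline can batch inputs itself
    # instead of being called once per sentence in a Python loop.
    for row in rows:
        yield row["text"]

def run_pipeline(pipe, rows):
    # With a real transformers pipeline, passing an iterable (or a
    # datasets.Dataset) lets it batch on the GPU, e.g.:
    #   for out in pipe(mention_stream(rows), batch_size=8): ...
    # Here a trivial stand-in "pipeline" shows only the call shape.
    return [pipe(text) for text in mention_stream(rows)]

rows = [{"text": "London"}, {"text": "Manchester"}]
results = run_pipeline(len, rows)  # stand-in pipeline: text length
```

In real use the stand-in would be replaced by a `transformers` pipeline object and the generator by a `datasets.Dataset`, per the two links above.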
**DeezyMatch**

Relevant section in T-Res:

```python
# Seek fuzzy string matches.
candidate_scenario = os.path.join(
    dm_path, "combined", dm_cands + "_" + dm_model
)
pretrained_model_path = os.path.join(
    f"{dm_path}", "models", f"{dm_model}", f"{dm_model}" + ".model"
)
pretrained_vocab_path = os.path.join(
    f"{dm_path}", "models", f"{dm_model}", f"{dm_model}" + ".vocab"
)
deezy_result = candidate_ranker(
    candidate_scenario=candidate_scenario,
    query=query,
    ranking_metric=self.deezy_parameters["ranking_metric"],
    selection_threshold=self.deezy_parameters["selection_threshold"],
    num_candidates=self.deezy_parameters["num_candidates"],
    search_size=self.deezy_parameters["num_candidates"],
    verbose=self.deezy_parameters["verbose"],
    output_path=os.path.join(dm_path, "ranking", dm_output),
    pretrained_model_path=pretrained_model_path,
    pretrained_vocab_path=pretrained_vocab_path,
)
```

and the `candidate_ranker` function in DeezyMatch looks like this:

```python
def candidate_ranker(
    input_file_path="default",
    query_scenario=None,
    candidate_scenario=None,
    ranking_metric="faiss",
    selection_threshold=0.8,
    query=None,
    num_candidates=10,
    search_size=4,
    length_diff=None,
    calc_predict=False,
    calc_cosine=False,
    output_path="ranker_output",
    pretrained_model_path=None,
    pretrained_vocab_path=None,
    number_test_rows=-1,
    verbose=True,
):
    """
    find and rank a set of candidates (from a dataset) for given queries
    in the same or another dataset

    Parameters
    ----------
    input_file_path
        path to the input file. "default": read input file in `candidate_scenario`
    query_scenario
        directory that contains all the assembled query vectors
    candidate_scenario
        directory that contains all the assembled candidate vectors
    ...
    """
```
Since we are not passing `input_file_path`, it keeps its `"default"` value, which means the input file is read from `candidate_scenario`, i.e. from:

```python
candidate_scenario = os.path.join(
    dm_path, "combined", dm_cands + "_" + dm_model
)
```

Therefore, to specify GPU config parameters we must set them at the top of the DeezyMatch input file, under `general`:

```yaml
general:
  use_gpu: True  # only if available
  # specify CUDA device; these are 0-indexed, e.g.
  # cuda:0, cuda:1, or others. "cuda" is the default CUDA device
  gpu_device: "cuda"

  # Parent dir to save trained models
  models_dir: "../resources/deezymatch/models"
```
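Since `use_gpu: True` should only be set when a GPU is actually available, it may be worth resolving the device defensively at load time. A small sketch of that check (`resolve_device` is a hypothetical helper, not part of DeezyMatch; in practice the availability flag would come from `torch.cuda.is_available()`):

```python
def resolve_device(use_gpu, gpu_device, cuda_available):
    # Mirror the intent of the `use_gpu` / `gpu_device` settings:
    # fall back to CPU when CUDA is not actually available.
    if use_gpu and cuda_available:
        return gpu_device  # e.g. "cuda", "cuda:0", "cuda:1"
    return "cpu"

print(resolve_device(True, "cuda", False))  # prints "cpu": no GPU present
```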
TODO:

* Loading a pickle file of NER mentions would be a convenient way to avoid re-running the (long-running) NER step after a failed run, but it is incompatible with the T-Res pipeline when place-of-publication info is used (because we need to be able to associate the place of publication with the NLP field in the CSV input data file). This would therefore require a major reworking of the batch-processing code (to be done only if time permits).
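If the pickle-caching idea is ever picked up, the caching itself is small; the hard part is the place-of-publication association described above. A hypothetical sketch of the cache side only (`cache_ner_mentions` and its arguments are illustrative names, not T-Res code):

```python
import os
import pickle

def cache_ner_mentions(cache_path, compute_mentions):
    # Load previously computed NER mentions from a pickle if present;
    # otherwise run the (slow) NER step and save the result.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    mentions = compute_mentions()
    with open(cache_path, "wb") as f:
        pickle.dump(mentions, f)
    return mentions
```

On a re-run after a failure, the second call returns the cached mentions without invoking the NER step again.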
In order to run on Baskerville, could we make sure T-Res utilises the GPU whenever possible?
TODO:

* Decide which dataset type to use to pass data to the pipeline.