
Enable T-Res with GPU [PR open] #274

Open
2 tasks done
rwood-97 opened this issue May 21, 2024 · 5 comments

rwood-97 commented May 21, 2024

In order to run on Baskerville, we should make sure T-Res utilises the GPU whenever possible.

  • Feed in device as a parameter for all model set-up
  • Do we also need to do this for DeezyMatch and REL?

TODO:

  • device parameter
  • use Huggingface dataset type to pass data to the pipeline
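A minimal sketch of the first TODO item: resolve the device once, up front, and pass the result to every model set-up call. The `resolve_device` helper is hypothetical (not part of T-Res); in real code `cuda_available` would come from `torch.cuda.is_available()`, but it is a plain argument here so the sketch stays dependency-free.

```python
# Hypothetical helper (not in T-Res): pick the device once and hand the
# result to every model constructor and pipeline call.
def resolve_device(requested=None, cuda_available=False):
    """Return the device string to use for model set-up.

    In practice `cuda_available` would be torch.cuda.is_available().
    """
    if requested is not None:
        return requested  # caller forced a device, e.g. "cuda:1"
    return "cuda" if cuda_available else "cpu"
```

Typical usage would be `device = resolve_device(cuda_available=torch.cuda.is_available())`, with `device` then threaded through every model and pipeline constructor.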
@rwood-97
Collaborator Author

Mentioned in Mariona's handover doc:

Finally, note for Fede: the NER module is the one that takes the longest. But we are using HuggingFace, so it could be much faster if we were running it on a machine with a GPU, as long as we load the NER pipeline configured to use the GPU. This should be done here:

    def create_pipeline(self):
        """
        Create a pipeline for performing NER given a NER model.

        Returns:
            self.model (str): the model name.
            self.pipe (Pipeline): a pipeline object which performs
                named entity recognition given a model.
        """
        print("*** Creating and loading a NER pipeline.")
        # Path to NER Model:
        model_name = self.model
        # If the model is local (has not been obtained from the hub),
        # pre-append the model path and the extension of the model
        # to obtain the model name.
        if self.load_from_hub == False:
            model_name = self.model_path + self.model + ".model"
        # Load a NER pipeline:
        self.pipe = pipeline("ner", model=model_name, ignore_labels=[])
        return self.pipe
Read the readme at /home/mcollardanuy/H-Top/README.md in toponymVM2.0.
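One way to thread a device through `create_pipeline` (a sketch, not the T-Res implementation): build the pipeline keyword arguments in a small pure helper and add a `device` entry. In the `transformers` pipeline API, `device=-1` means CPU and `device=0` (or higher) selects a CUDA device index; the helper name `ner_pipeline_kwargs` is hypothetical.

```python
# Hypothetical helper: reproduce the model-name logic from create_pipeline
# above and add the `device` argument understood by transformers.pipeline.
def ner_pipeline_kwargs(model, model_path, load_from_hub, device=-1):
    """Keyword arguments for pipeline("ner", ...); device=-1 is CPU,
    device=0 (or higher) is a CUDA device index."""
    model_name = model if load_from_hub else model_path + model + ".model"
    return {"model": model_name, "ignore_labels": [], "device": device}

# Inside create_pipeline this would become something like:
#     self.pipe = pipeline(
#         "ner",
#         **ner_pipeline_kwargs(self.model, self.model_path,
#                               self.load_from_hub, device=0),
#     )
```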

@thobson88
Collaborator

This is done in #275 but will need re-doing on top of the #282 refactor (I'll do that).

@thobson88 thobson88 self-assigned this Jan 9, 2025
@thobson88 thobson88 changed the title Enable T-Res with GPU Enable T-Res with GPU [PR open] Jan 22, 2025
@thobson88 thobson88 mentioned this issue Feb 5, 2025
@rwood-97
Collaborator Author

rwood-97 commented Feb 5, 2025

We also need to update to use a dataset; the pipeline currently warns: "You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset".

See here for how to do this: https://huggingface.co/docs/transformers/en/main_classes/pipelines

Create a custom dataset: https://huggingface.co/docs/datasets/en/create_dataset
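Following the linked docs, batched GPU inference means wrapping the inputs in a `datasets.Dataset` and iterating over the pipeline's output rather than calling the pipeline per sentence. The shaping helper below is testable without the HF libraries installed; the commented lines show the intended `Dataset.from_dict` / `KeyDataset` usage from the HuggingFace pipeline docs (the column name `"text"` is an assumption).

```python
# Shape a list of sentences into the column dict accepted by
# datasets.Dataset.from_dict (the column name "text" is our choice here).
def to_dataset_columns(sentences):
    return {"text": list(sentences)}

# With the HF libraries available, the pipeline would then be fed the
# dataset instead of one sentence at a time (per the linked docs):
#
#     from datasets import Dataset
#     from transformers.pipelines.pt_utils import KeyDataset
#
#     dataset = Dataset.from_dict(to_dataset_columns(sentences))
#     for entities in self.pipe(KeyDataset(dataset, "text"), batch_size=8):
#         ...  # one list of NER results per input sentence
```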

@thobson88
Collaborator

DeezyMatch

Relevant section in ranking.py is:

    # Seek fuzzy string matches.
    candidate_scenario = os.path.join(
        dm_path, "combined", dm_cands + "_" + dm_model
    )
    pretrained_model_path = os.path.join(
        f"{dm_path}", "models", f"{dm_model}", f"{dm_model}" + ".model"
    )
    pretrained_vocab_path = os.path.join(
        f"{dm_path}", "models", f"{dm_model}", f"{dm_model}" + ".vocab"
    )
    
    deezy_result = candidate_ranker(
        candidate_scenario=candidate_scenario,
        query=query,
        ranking_metric=self.deezy_parameters["ranking_metric"],
        selection_threshold=self.deezy_parameters["selection_threshold"],
        num_candidates=self.deezy_parameters["num_candidates"],
        search_size=self.deezy_parameters["num_candidates"],
        verbose=self.deezy_parameters["verbose"],
        output_path=os.path.join(dm_path, "ranking", dm_output),
        pretrained_model_path=pretrained_model_path,
        pretrained_vocab_path=pretrained_vocab_path,
    )

and the candidate_ranker function in DeezyMatch looks like this:

def candidate_ranker(
    input_file_path="default",
    query_scenario=None,
    candidate_scenario=None,
    ranking_metric="faiss",
    selection_threshold=0.8,
    query=None,
    num_candidates=10,
    search_size=4,
    length_diff=None,
    calc_predict=False,
    calc_cosine=False,
    output_path="ranker_output",
    pretrained_model_path=None,
    pretrained_vocab_path=None,
    number_test_rows=-1,
    verbose=True,
):
    """
    find and rank a set of candidates (from a dataset) for given queries in the same or another dataset

    Parameters
    ----------
    input_file_path
        path to the input file. "default": read input file in `candidate_scenario`
    query_scenario
        directory that contains all the assembled query vectors
    candidate_scenario
        directory that contains all the assembled candidate vectors
    ...

Since we are not passing the input_file_path parameter, the (yaml) input file is read by looking inside the subdirectory named according to the candidate_scenario.

That is:

    candidate_scenario = os.path.join(
        dm_path, "combined", dm_cands + "_" + dm_model
    )

which for us is: T-Res/resources/deezymatch/combined/wkdtalts_w2v_ocr.

Therefore, to specify GPU config parameters we must set them at the top of input_dfm.yaml in that directory, which should look like this:

general:
  use_gpu: True    # only if available
  # specify CUDA device, these are 0-indexed, e.g.,
  #   cuda:0, cuda:1 or others. "cuda" is the default CUDA device
  gpu_device: "cuda"
  # Parent dir to save trained models
  models_dir: "../resources/deezymatch/models"
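Since the config location is derived from `candidate_scenario`, a small stdlib-only helper (hypothetical, not in T-Res) makes the lookup explicit; it reproduces the path logic quoted above.

```python
import os

# Hypothetical helper reproducing the path logic above: where DeezyMatch
# reads its yaml config from when input_file_path is left as "default".
def dfm_config_path(dm_path, dm_cands, dm_model):
    candidate_scenario = os.path.join(
        dm_path, "combined", dm_cands + "_" + dm_model
    )
    return os.path.join(candidate_scenario, "input_dfm.yaml")
```

For T-Res this resolves to `T-Res/resources/deezymatch/combined/wkdtalts_w2v_ocr/input_dfm.yaml`, i.e. the file whose `general:` section must carry the `use_gpu` and `gpu_device` settings.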

@thobson88
Collaborator

thobson88 commented Feb 11, 2025

TODO:

  • handle Datasets input to the NER pipeline
  • option to load a pickle file of NER mentions (see below *)
  • make DeezyMatch parameters configurable in BatchJob

*Loading a pickle file of NER mentions would be a convenient way to avoid re-running the (long-running) NER step after a failed run, but it is incompatible with the T-Res pipeline when place-of-publication info is used (because we need to be able to associate the place of publication with the NLP field in the CSV input data file). This would therefore require a major reworking of the batch-processing code (to be done only if time permits).
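The pickle-cache idea in the second bullet could look like this (a sketch only; `compute_fn` stands in for the long-running NER step, and the caveat about place-of-publication info still applies):

```python
import os
import pickle

def load_or_compute_mentions(cache_path, compute_fn):
    """Return NER mentions from a pickle cache if it exists; otherwise
    run compute_fn (standing in for the expensive NER step) and cache
    its result for subsequent runs."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    mentions = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(mentions, f)
    return mentions
```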
