
Enable T-Res with GPU [PR open] #274

Open
2 tasks done
rwood-97 opened this issue May 21, 2024 · 5 comments

rwood-97 commented May 21, 2024

In order to run on Baskerville, we should make sure T-Res utilises the GPU whenever possible.

  • Feed in device as a parameter for all model set-up
  • Do we also need to do this for DeezyMatch and REL?

TODO:

  • device parameter
  • use Huggingface dataset type to pass data to the pipeline
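A minimal sketch of the first TODO item: resolve the device once, up front, and pass the result to every model set-up call. The `resolve_device` helper is hypothetical (not part of T-Res); in real code `cuda_available` would come from `torch.cuda.is_available()`, but it is a plain argument here so the sketch stays dependency-free.

```python
# Hypothetical helper (not in T-Res): pick the device once and hand the
# result to every model constructor and pipeline call.
def resolve_device(requested=None, cuda_available=False):
    """Return the device string to use for model set-up.

    In practice `cuda_available` would be torch.cuda.is_available().
    """
    if requested is not None:
        return requested  # caller forced a device, e.g. "cuda:1"
    return "cuda" if cuda_available else "cpu"
```

Typical usage would be `device = resolve_device(cuda_available=torch.cuda.is_available())`, with `device` then threaded through every model and pipeline constructor.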
@rwood-97
Collaborator Author

Mentioned in Mariona's handover doc:

Finally, note for Fede: the NER module is the one that takes the longest. But we are using HuggingFace, so it could be much faster if we were running it on a machine with a GPU, as long as we load the NER pipeline configured to use the GPU. This should be done here:

    def create_pipeline(self):
        """
        Create a pipeline for performing NER given a NER model.

        Returns:
            self.model (str): the model name.
            self.pipe (Pipeline): a pipeline object which performs
                named entity recognition given a model.
        """
        print("*** Creating and loading a NER pipeline.")
        # Path to NER Model:
        model_name = self.model
        # If the model is local (has not been obtained from the hub),
        # pre-append the model path and the extension of the model
        # to obtain the model name.
        if self.load_from_hub == False:
            model_name = self.model_path + self.model + ".model"
        # Load a NER pipeline:
        self.pipe = pipeline("ner", model=model_name, ignore_labels=[])
        return self.pipe
Read the readme at /home/mcollardanuy/H-Top/README.md in toponymVM2.0.
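One way to thread a device through `create_pipeline` (a sketch, not the T-Res implementation): build the pipeline keyword arguments in a small pure helper and add a `device` entry. In the `transformers` pipeline API, `device=-1` means CPU and `device=0` (or higher) selects a CUDA device index; the helper name `ner_pipeline_kwargs` is hypothetical.

```python
# Hypothetical helper: reproduce the model-name logic from create_pipeline
# above and add the `device` argument understood by transformers.pipeline.
def ner_pipeline_kwargs(model, model_path, load_from_hub, device=-1):
    """Keyword arguments for pipeline("ner", ...); device=-1 is CPU,
    device=0 (or higher) is a CUDA device index."""
    model_name = model if load_from_hub else model_path + model + ".model"
    return {"model": model_name, "ignore_labels": [], "device": device}

# Inside create_pipeline this would become something like:
#     self.pipe = pipeline(
#         "ner",
#         **ner_pipeline_kwargs(self.model, self.model_path,
#                               self.load_from_hub, device=0),
#     )
```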

@thobson88
Collaborator

This is done in #275 but will need re-doing on top of the #282 refactor (I'll do that).

@thobson88 thobson88 self-assigned this Jan 9, 2025
@thobson88 thobson88 changed the title Enable T-Res with GPU Enable T-Res with GPU [PR open] Jan 22, 2025
@thobson88 thobson88 mentioned this issue Feb 5, 2025
@rwood-97
Collaborator Author

rwood-97 commented Feb 5, 2025

We also need to update to use a dataset; the pipeline currently warns: "You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset".

See here for how to do this: https://huggingface.co/docs/transformers/en/main_classes/pipelines

Create a custom dataset: https://huggingface.co/docs/datasets/en/create_dataset
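Following the linked docs, batched GPU inference means wrapping the inputs in a `datasets.Dataset` and iterating over the pipeline's output rather than calling the pipeline per sentence. The shaping helper below is testable without the HF libraries installed; the commented lines show the intended `Dataset.from_dict` / `KeyDataset` usage from the HuggingFace pipeline docs (the column name `"text"` is an assumption).

```python
# Shape a list of sentences into the column dict accepted by
# datasets.Dataset.from_dict (the column name "text" is our choice here).
def to_dataset_columns(sentences):
    return {"text": list(sentences)}

# With the HF libraries available, the pipeline would then be fed the
# dataset instead of one sentence at a time (per the linked docs):
#
#     from datasets import Dataset
#     from transformers.pipelines.pt_utils import KeyDataset
#
#     dataset = Dataset.from_dict(to_dataset_columns(sentences))
#     for entities in self.pipe(KeyDataset(dataset, "text"), batch_size=8):
#         ...  # one list of NER results per input sentence
```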

@thobson88
Collaborator

DeezyMatch

Relevant section in ranking.py is:

    # Seek fuzzy string matches.
    candidate_scenario = os.path.join(
        dm_path, "combined", dm_cands + "_" + dm_model
    )
    pretrained_model_path = os.path.join(
        f"{dm_path}", "models", f"{dm_model}", f"{dm_model}" + ".model"
    )
    pretrained_vocab_path = os.path.join(
        f"{dm_path}", "models", f"{dm_model}", f"{dm_model}" + ".vocab"
    )
    
    deezy_result = candidate_ranker(
        candidate_scenario=candidate_scenario,
        query=query,
        ranking_metric=self.deezy_parameters["ranking_metric"],
        selection_threshold=self.deezy_parameters["selection_threshold"],
        num_candidates=self.deezy_parameters["num_candidates"],
        search_size=self.deezy_parameters["num_candidates"],
        verbose=self.deezy_parameters["verbose"],
        output_path=os.path.join(dm_path, "ranking", dm_output),
        pretrained_model_path=pretrained_model_path,
        pretrained_vocab_path=pretrained_vocab_path,
    )

and the candidate_ranker function in DeezyMatch looks like this:

def candidate_ranker(
    input_file_path="default",
    query_scenario=None,
    candidate_scenario=None,
    ranking_metric="faiss",
    selection_threshold=0.8,
    query=None,
    num_candidates=10,
    search_size=4,
    length_diff=None,
    calc_predict=False,
    calc_cosine=False,
    output_path="ranker_output",
    pretrained_model_path=None,
    pretrained_vocab_path=None,
    number_test_rows=-1,
    verbose=True,
):
    """
    find and rank a set of candidates (from a dataset) for given queries in the same or another dataset

    Parameters
    ----------
    input_file_path
        path to the input file. "default": read input file in `candidate_scenario`
    query_scenario
        directory that contains all the assembled query vectors
    candidate_scenario
        directory that contains all the assembled candidate vectors
    ...

Since we are not passing the input_file_path parameter, the (yaml) input file is read by looking inside the subdirectory named according to the candidate_scenario.

That is:

    candidate_scenario = os.path.join(
        dm_path, "combined", dm_cands + "_" + dm_model
    )

which for us is: T-Res/resources/deezymatch/combined/wkdtalts_w2v_ocr.

Therefore, to specify GPU config parameters we must set them at the top of input_dfm.yaml in that directory, which should look like this:

general:
  use_gpu: True    # only if available
  # specify CUDA device, these are 0-indexed, e.g.,
  #   cuda:0, cuda:1 or others. "cuda" is the default CUDA device
  gpu_device: "cuda"
  # Parent dir to save trained models
  models_dir: "../resources/deezymatch/models"
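Since the config location is derived from `candidate_scenario`, a small stdlib-only helper (hypothetical, not in T-Res) makes the lookup explicit; it reproduces the path logic quoted above.

```python
import os

# Hypothetical helper reproducing the path logic above: where DeezyMatch
# reads its yaml config from when input_file_path is left as "default".
def dfm_config_path(dm_path, dm_cands, dm_model):
    candidate_scenario = os.path.join(
        dm_path, "combined", dm_cands + "_" + dm_model
    )
    return os.path.join(candidate_scenario, "input_dfm.yaml")
```

For T-Res this resolves to `T-Res/resources/deezymatch/combined/wkdtalts_w2v_ocr/input_dfm.yaml`, i.e. the file whose `general:` section must carry the `use_gpu` and `gpu_device` settings.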

@thobson88
Collaborator

thobson88 commented Feb 11, 2025

TODO:

  • handle Datasets input to the NER pipeline
  • option to load a pickle file of NER mentions (see below *)
  • make DeezyMatch parameters configurable in BatchJob

*Loading a pickle file of NER mentions would be a convenient way to avoid re-running the (long-running) NER step after a failed run, but it is incompatible with the T-Res pipeline when place-of-publication info is used (because we need to be able to associate the place of publication with the NLP field in the CSV input data file). This would therefore require a major reworking of the batch-processing code (to be done only if time permits).
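The pickle-cache idea in the second bullet could look like this (a sketch only; `compute_fn` stands in for the long-running NER step, and the caveat about place-of-publication info still applies):

```python
import os
import pickle

def load_or_compute_mentions(cache_path, compute_fn):
    """Return NER mentions from a pickle cache if it exists; otherwise
    run compute_fn (standing in for the expensive NER step) and cache
    its result for subsequent runs."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    mentions = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(mentions, f)
    return mentions
```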
