Update _auto.py Encoding Script to Handle Output Tensor Dimensionality for BPR Model #2054

abdoelsayed2016 · 2024-12-30T21:23:30Z

I am trying to run the following Pyserini command:

python -m pyserini.encode ^
    input --corpus msmarco-passage-corpus/msmarco-passage-corpus.json ^
          --fields title text ^
          --delimiter "\n" ^
          --shard-id 0 ^
          --shard-num 1 ^
    output --embeddings ./msmarco-passage-corpus/bpr ^
           --to-faiss ^
    encoder --encoder castorini/bpr-nq-ctx-encoder ^
            --fields title text ^
            --batch 32 ^
            --fp16

When using the castorini/bpr-nq-ctx-encoder model for encoding with Pyserini, the following error occurs:

  File "C:\Users\user\anaconda3\envs\pyserini_2\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\user\anaconda3\envs\pyserini_2\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\user\anaconda3\envs\pyserini_2\lib\site-packages\pyserini\encode\__main__.py", line 151, in <module>
    embeddings = encoder.encode(**kwargs)
  File "C:\Users\user\anaconda3\envs\pyserini_2\lib\site-packages\pyserini\encode\_auto.py", line 69, in encode
    embeddings = outputs[0][:, 0, :].detach().cpu().numpy()
IndexError: too many indices for tensor of dimension 2

This issue occurs because the tensor returned by the castorini/bpr-nq-ctx-encoder model has only two dimensions, while the current script assumes a three-dimensional tensor. This error does not occur with the DPR model because its output tensor matches the expected structure.
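To illustrate the shape mismatch without loading either model, here is a minimal sketch that uses NumPy arrays as stand-ins for the model outputs. The shapes (batch of 4, sequence length 16, hidden size 768) are assumptions chosen for illustration and are not taken from the actual encoders.

```python
import numpy as np

# Stand-ins for model outputs (illustrative shapes, not the real encoders):
# a per-token output is (batch, seq_len, hidden); a pooled output is (batch, hidden).
token_level = np.zeros((4, 16, 768))
pooled = np.zeros((4, 768))

# Selecting the first token's vector works on the 3-D array...
cls_vectors = token_level[:, 0, :]

# ...but the same index expression fails on the 2-D array,
# which is the same kind of failure as in the traceback above.
try:
    pooled[:, 0, :]
    failed = False
except IndexError:
    failed = True
```

With a two-dimensional output there is no sequence axis left to index, so the vector for each input is already the row itself.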

Proposed Fix

To make the script compatible with both BPR and DPR models, update the indexing logic in the _auto.py script.

Current Code (Line 69):

embeddings = outputs[0][:, 0, :].detach().cpu().numpy()

Update it to the following:

output_tensor = outputs[0]
if output_tensor.dim() == 3:
    # Three-dimensional (batch, seq_len, hidden): take the first token, as before
    embeddings = output_tensor[:, 0, :].detach().cpu().numpy()
elif output_tensor.dim() == 2:
    # Two-dimensional (batch, hidden): use the tensor as is
    embeddings = output_tensor.detach().cpu().numpy()
else:
    raise ValueError(f"Unexpected output tensor shape: {tuple(output_tensor.shape)}")
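The branching can be sanity-checked without downloading a model. The sketch below is only an assumption-laden stand-in: `select_embeddings` is a hypothetical helper name, and NumPy arrays replace the PyTorch tensors, so the `.detach().cpu().numpy()` step is omitted.

```python
import numpy as np


def select_embeddings(output):
    """Hypothetical helper mirroring the proposed branching on tensor rank."""
    if output.ndim == 3:
        # (batch, seq_len, hidden): keep the first token of each sequence
        return output[:, 0, :]
    if output.ndim == 2:
        # (batch, hidden): already one vector per input
        return output
    raise ValueError(f"Unexpected output shape: {output.shape}")


# Illustrative sizes only
batch, seq_len, hidden = 2, 5, 8
three_d = np.arange(batch * seq_len * hidden, dtype=np.float32).reshape(batch, seq_len, hidden)
two_d = np.ones((batch, hidden), dtype=np.float32)

emb_from_3d = select_embeddings(three_d)
emb_from_2d = select_embeddings(two_d)
```

Both calls return an array of shape `(batch, hidden)`, which is the shape the downstream code would receive from either kind of output.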

Is my solution correct?
