Update _auto.py Encoding Script to Handle Output Tensor Dimensionality for BPR Model #2054

Open
abdoelsayed2016 opened this issue Dec 30, 2024 · 0 comments

I am trying to run the following Pyserini command:

python -m pyserini.encode ^
    input --corpus msmarco-passage-corpus/msmarco-passage-corpus.json ^
          --fields title text ^
          --delimiter "\n" ^
          --shard-id 0 ^
          --shard-num 1 ^
    output --embeddings ./msmarco-passage-corpus/bpr ^
           --to-faiss ^
    encoder --encoder castorini/bpr-nq-ctx-encoder ^
            --fields title text ^
            --batch 32 ^
            --fp16

When using the castorini/bpr-nq-ctx-encoder model for encoding with Pyserini, the following error occurs:

  File "C:\Users\user\anaconda3\envs\pyserini_2\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\user\anaconda3\envs\pyserini_2\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\user\anaconda3\envs\pyserini_2\lib\site-packages\pyserini\encode\__main__.py", line 151, in <module>
    embeddings = encoder.encode(**kwargs)
  File "C:\Users\user\anaconda3\envs\pyserini_2\lib\site-packages\pyserini\encode\_auto.py", line 69, in encode
    embeddings = outputs[0][:, 0, :].detach().cpu().numpy()
IndexError: too many indices for tensor of dimension 2

This issue occurs because the first output returned by the castorini/bpr-nq-ctx-encoder model has only two dimensions, while line 69 of _auto.py indexes it with [:, 0, :], which assumes a three-dimensional tensor. The error does not occur with the DPR model because its output tensor matches the expected structure.
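To make the shape mismatch concrete, here is a minimal sketch that reproduces the same kind of IndexError outside Pyserini. It uses NumPy arrays as stand-ins for the model outputs (basic slicing behaves the same way for this case), and the batch size, sequence length, and hidden size are assumed values chosen only for illustration.

```python
import numpy as np

# Assumed illustrative sizes: 4 inputs, 16 tokens each, 768-dim vectors.
token_level = np.zeros((4, 16, 768))  # (batch, seq_len, hidden)
pooled = np.zeros((4, 768))           # (batch, hidden), no token axis

# On a 3-D array, [:, 0, :] selects the first token's vector per input.
cls_vectors = token_level[:, 0, :]
print(cls_vectors.shape)

# On a 2-D array, the same index expression asks for three axes
# and raises IndexError, as in the traceback above.
try:
    pooled[:, 0, :]
    raised = False
except IndexError as err:
    raised = True
    print("IndexError:", err)
```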

Proposed Fix

To make the script compatible with both BPR and DPR models, update the indexing logic in the _auto.py script.

Current Code (Line 69):

embeddings = outputs[0][:, 0, :].detach().cpu().numpy()

Update it to the following:

output_tensor = outputs[0]
if output_tensor.ndim == 3:
    # Three-dimensional: take the first token's vector, as before
    embeddings = output_tensor[:, 0, :].detach().cpu().numpy()
elif output_tensor.ndim == 2:
    # Two-dimensional: no token axis to index, so use the tensor as-is
    embeddings = output_tensor.detach().cpu().numpy()
else:
    raise ValueError(f"Unexpected output tensor shape: {tuple(output_tensor.shape)}")
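One way to sanity-check the branch is to wrap it in a small function and feed it both shapes. The sketch below does that with a hypothetical helper named to_embeddings (not part of Pyserini) and NumPy arrays standing in for model outputs. It assumes that a two-dimensional output already holds one vector per input, which I have not confirmed for the BPR encoder.

```python
import numpy as np


def to_embeddings(output_tensor):
    """Hypothetical helper mirroring the proposed branch (not Pyserini API)."""
    # Accept torch tensors (via detach/cpu/numpy) or plain NumPy arrays.
    # Converting first and indexing second gives the same values as
    # indexing first and converting second.
    if hasattr(output_tensor, "detach"):
        array = output_tensor.detach().cpu().numpy()
    else:
        array = np.asarray(output_tensor)

    if array.ndim == 3:
        # (batch, seq_len, hidden): keep the first token's vector.
        return array[:, 0, :]
    if array.ndim == 2:
        # (batch, hidden): assumed to be one vector per input already.
        return array
    raise ValueError(f"Unexpected output tensor shape: {array.shape}")


# Assumed illustrative sizes.
batch, seq_len, hidden = 4, 16, 768
three_d = np.random.rand(batch, seq_len, hidden)
two_d = np.random.rand(batch, hidden)

from_3d = to_embeddings(three_d)
from_2d = to_embeddings(two_d)
print(from_3d.shape, from_2d.shape)
```

Both calls return an array of shape (batch, hidden), which is what the rest of the encoding script appears to expect. Whether the two-dimensional values are the right vectors to index is a separate question that depends on what the model returns in outputs[0].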

Is my solution correct?
