Skip to content

Conversation

@NohTow
Copy link
Collaborator

@NohTow NohTow commented Feb 4, 2025

Right now, we are using default values for doc/query length, markers and attending to expansion tokens when reading stanford models.
This causes some issues as highlighted in #85 and also requires the user to specify a lot of information when using a model that is not using default values, as can be seen with the loading of Jina-ColBERT.

This PR simply add the reading process of the artifact.metadata file of Stanford NLP models and read markers, lengths and attend to expansion tokens values.

As usual, we override those if the user feed values to the init of the model. Also changed the attend_to_expansion_tokens parameter to match the other (None by default and override at the end).

@NohTow NohTow mentioned this pull request Feb 4, 2025
@NohTow NohTow changed the title [DRAFT] Read stanford configs Read stanford configs Feb 5, 2025
@NohTow
Copy link
Collaborator Author

NohTow commented Feb 5, 2025

I think it should be ok @raphaelsty
Could you please review to make sure the behavior is in line with what's expected? I did some tests, but I might have missed something
Note that it is based on #87, so we should merge this one first

@sam-hey
Copy link
Contributor

sam-hey commented Feb 5, 2025

It seems that my reviews are in Pending state, so they can't be seen at the moment. I have just one small regrade:

I wouldn’t categorize this as a warning, as it’s expected behavior. It’s merely informing you that the StanfordNLP model has successfully loaded the weights.

logger.warning("Loaded the weights from Stanford NLP model.")

Similarly, I believe there should be a clear distinction between informational events and actual warnings. For instance, if there were an issue loading the file and it fell back to a default, that would merit a warning.

logger.warning("Loaded the configuration from Stanford NLP model.")
except EnvironmentError:
if self.query_prefix is None:
self.query_prefix = "[unused0]"
if self.document_prefix is None:
self.document_prefix = "[unused1]"
# We do not set the query/doc length as they'll be set to the default values afterwards. We do it for prefixes as the default from stanford is different from ours
logger.warning(
"Could not load the configuration file from Stanford NLP model, using default values."
)

@NohTow
Copy link
Collaborator Author

NohTow commented Feb 5, 2025

That is very fair, I don't know why I started using logger.warning, I think I used it for a proper warning once and then copied it over the whole loading logic, my bad.
I'll clean that up before merging, thanks for the remark!

@NohTow NohTow force-pushed the read_stanford_configs branch from f0899ab to b957372 Compare February 6, 2025 10:00
@NohTow NohTow force-pushed the read_stanford_configs branch from b957372 to d7e868c Compare February 6, 2025 10:43
@NohTow
Copy link
Collaborator Author

NohTow commented Feb 6, 2025

@sam-hey I changed some warnings to info
@raphaelsty you can merge after you have double checked the loading/overriding logic

@raphaelsty
Copy link
Collaborator

LGTM

@raphaelsty raphaelsty merged commit 11d18b8 into main Feb 28, 2025
5 checks passed
@NohTow NohTow deleted the read_stanford_configs branch March 24, 2025 09:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants