New Dataset and OCR variational pipeline #18

Merged
merged 65 commits into from
Nov 18, 2024

J-Dymond
Collaborator

This branch includes:

  • src/arc_spice/data/multieurlex_utils.py

    • Code for loading and preprocessing the MultiEURLEX dataset.
  • src/arc_spice/variational_pipelines/RTC_variational_pipeline.py

    • The variational pipeline. It provides clean_inference and variational_inference, and also calculates some confidence metrics on the outputs.
  • src/arc_spice/variational_pipelines/dropout_utils.py

    • Utility functions for performing MC dropout.
  • src/arc_spice/eval/classification_error.py and src/arc_spice/eval/translation_error.py

    • Helper functions for calculating errors and uncertainties for the two tasks (classification and translation).
  • scripts/variational_RTC_example.py

    • A bare-bones script showing example usage of the variational pipeline.
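
To make the idea of confidence metrics from repeated stochastic passes concrete, here is a minimal sketch. It is not the code in this branch: the function names, the choice of metrics (mean, standard deviation, per-label binary entropy), and the sample numbers are all assumptions for illustration, using only the standard library.

```python
import math
from statistics import fmean, pstdev


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def summarise_passes(passes: list[list[float]]) -> dict[str, list[float]]:
    """Summarise per-label probabilities from several stochastic forward passes.

    passes[i][j] is the probability assigned to label j on pass i.
    (Hypothetical helper; the PR's own metrics may differ.)
    """
    per_label = list(zip(*passes))  # transpose: one tuple of samples per label
    mean = [fmean(samples) for samples in per_label]
    return {
        "mean": mean,
        "std": [pstdev(samples) for samples in per_label],
        "entropy": [binary_entropy(m) for m in mean],
    }


# Three made-up passes over two labels.
summary = summarise_passes([[0.9, 0.4], [0.8, 0.6], [1.0, 0.5]])
```

A label whose mean probability sits near 0.5, or whose samples vary a lot between passes, would be flagged as less certain under this sketch.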

…n by default, also a function to change the dropout setting at runtime
…e uncertainty. TODO: calibrate these confidences
@J-Dymond J-Dymond linked an issue Nov 13, 2024 that may be closed by this pull request
@J-Dymond J-Dymond requested review from eddableheath and lannelin and removed request for eddableheath November 14, 2024 10:44
Collaborator

@lannelin lannelin left a comment

Leaving a partial review before I jump into meetings.

Overall looks good, I've added some comments requesting some small changes.

# change huggingface cache to be in project dir rather than user home
export HF_HOME="/bask/projects/v/vjgo8416-spice/hf_cache"

# TODO: script uses relative path to project home so must be run from home, fix
Collaborator

Outstanding TODO. I think this is a simple fix in the data loading; will comment separately.
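
One possible shape for such a fix, sketched under assumptions: resolve paths from a known anchor rather than the current working directory. The helper name find_project_root and the use of pyproject.toml as the marker file are hypothetical and not taken from this PR.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "pyproject.toml") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker} found at or above {start}")
```

A data-loading module could then call find_project_root(Path(__file__)) and build data paths from the result, so the script no longer has to be launched from the project home.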


return translation

return self.translator(text)[0]["translation_text"]
Collaborator

Is this [0] relying on the fact that text is a str and never a list of strings? If so, maybe add a guard to check that it is a str.

Collaborator Author

Yes, I think this assumes a batch size of 1
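
The guard suggested above could look roughly like the following. This is a sketch, not the change that was merged: translate_single and fake_translator are hypothetical names, and the fake only mimics the list-of-dicts output shape that the [0]["translation_text"] indexing relies on.

```python
from typing import Any, Callable


def translate_single(
    translator: Callable[[str], list[dict[str, Any]]], text: str
) -> str:
    """Translate one string, rejecting batched input up front."""
    if not isinstance(text, str):
        # Indexing with [0] would silently drop all but the first item of a batch.
        raise TypeError(f"expected a single str, got {type(text).__name__}")
    return translator(text)[0]["translation_text"]


def fake_translator(text: str) -> list[dict[str, Any]]:
    # Stand-in for the real translation pipeline; upper-casing is arbitrary.
    return [{"translation_text": text.upper()}]
```

Failing loudly on a list keeps the batch-size-of-1 assumption explicit instead of returning only the first translation.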

Collaborator

@lannelin lannelin left a comment

Functionality looks good.

I resolved the merge conflict for .gitignore as it was stopping the CI checks from running. Those are failing at the moment, so could you take a look at why, @J-Dymond? I suspect you haven't got pre-commit installed locally when you're committing. If you haven't, try:

pip install -e ".[dev]" # from project dir, installs with dev deps
pre-commit install
pre-commit run --all-files

@J-Dymond J-Dymond linked an issue Nov 15, 2024 that may be closed by this pull request
Collaborator

@lannelin lannelin left a comment

nice work! A couple of very minor comments to check but then should be good to merge.

@J-Dymond J-Dymond merged commit 32774fc into main Nov 18, 2024
5 checks passed
@J-Dymond J-Dymond deleted the 8-taxi500-dataset branch November 18, 2024 11:37
This was linked to issues Nov 18, 2024
2 participants