Conversation

@amogkam
Collaborator

@amogkam amogkam commented Nov 17, 2020

What does this PR do?

This PR adds a new distributed retriever implementation for RAG built on Ray, as an alternative to the current retriever implementation that uses torch.distributed. With Ray, the index can be loaded on multiple processes instead of just the rank 0 training worker, allowing fine-tuning to scale out better to multiple GPUs and allowing the index to potentially fit in GPU memory. This also removes a core dependency on PyTorch, opening the door to a TensorFlow implementation of finetune.py.

This PR also makes the changes needed for finetune.py to work with PyTorch Lightning > v1.0.

A benchmark of PyTorch distributed retrieval vs. Ray distributed retrieval:
[image: benchmark results]

Implementation Details

In the current PyTorch retrieval implementation, the index is loaded once on just the rank 0 training worker. Training worker 0 gathers the inputs from all other workers, performs the index lookup, and scatters the results back to the other workers.
[image: rank 0 gather/scatter retrieval architecture]
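
For illustration, here is a rough sketch of that gather/scatter flow; this is not the actual retriever code, and index.search plus the tensor shapes are placeholder assumptions:

import torch
import torch.distributed as dist


def retrieve(question_embeds, index, n_docs):
    # Sketch only: assumes the process group is already initialized and that
    # every worker contributes a batch of the same size.
    rank, world_size = dist.get_rank(), dist.get_world_size()
    if rank == 0:
        # Rank 0 gathers the query embeddings from every training worker...
        gathered = [torch.empty_like(question_embeds) for _ in range(world_size)]
        dist.gather(question_embeds, gather_list=gathered, dst=0)
        # ...performs the lookup locally (index.search stands in for the real
        # index API and is assumed to return a LongTensor of doc ids)...
        doc_ids = [index.search(q, n_docs) for q in gathered]
        # ...and scatters each worker's slice of results back to it.
        output = torch.empty_like(doc_ids[0])
        dist.scatter(output, scatter_list=doc_ids, src=0)
    else:
        # All other workers only send their queries and wait for their results.
        dist.gather(question_embeds, dst=0)
        output = torch.empty(question_embeds.shape[0], n_docs, dtype=torch.long)
        dist.scatter(output, src=0)
    return output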

With the Ray implementation, the index is loaded on separate processes, referred to as Ray actors. Each training worker randomly selects a retrieval actor to query for documents, and Ray handles all communication between the processes. Because the index can be loaded in multiple processes, training scales better: no synchronization is needed for the index lookup.
[image: Ray retrieval actor architecture]

Note that PyTorch Lightning still handles distributed training; Ray only manages distributed retrieval. Because PTL launches the entire training script multiple times under the hood, we use Ray's named actors feature (https://docs.ray.io/en/master/actors.html?highlight=named%20actors#named-actors), which allows the retrieval actors to be referenced by all training processes. Named actors are only necessary because of how PTL handles distributed training; a simpler approach could probably be used for a TensorFlow implementation.
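
A minimal sketch of the named-actor pattern described above (all class and actor names are illustrative, not the PR's actual code):

import random

import ray


class InMemoryIndex:
    # Stand-in for the real document index (e.g. a faiss index plus passages).
    def search(self, question_embeds, n_docs):
        return [[0] * n_docs for _ in question_embeds]


@ray.remote
class RetrievalWorker:
    # Each actor process holds its own copy of the index, so there is no
    # single rank-0 bottleneck for retrieval.
    def __init__(self, index):
        self.index = index

    def retrieve(self, question_embeds, n_docs):
        return self.index.search(question_embeds, n_docs)


ray.init()

# The driver registers the actors by name so that the training processes
# spawned later by PyTorch Lightning can look them up.
num_workers = 4
workers = [
    RetrievalWorker.options(name=f"retrieval_worker_{i}").remote(InMemoryIndex())
    for i in range(num_workers)
]

# Inside a training worker: pick one of the named actors at random and query it.
actor = ray.get_actor(f"retrieval_worker_{random.randrange(num_workers)}")
doc_ids = ray.get(actor.retrieve.remote([[0.1, 0.2, 0.3]], n_docs=5))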

Testing Strategy

Unit tests were added to test_distributed_retriever.py. Note that the local Ray cluster for the tests had to be started with local_mode=True because the test file modifies sys.path and these changes are not propagated to remote processes. See https://stackoverflow.com/questions/54338013/parallel-import-a-python-file-from-sibling-folder for more info.
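
For reference, a minimal illustration of that workaround; the path and fixture names are made up and this is not the actual test file:

import sys

import ray


def setup_module():
    # local_mode=True runs "remote" work in the driver process, so the
    # sys.path change below is visible to the retrieval code as well.
    sys.path.append("path/to/examples/rag")  # hypothetical path tweak
    ray.init(local_mode=True)


def teardown_module():
    ray.shutdown()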

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors which may be interested in your PR.

@lhoestq
Member

lhoestq commented Nov 17, 2020

Hi ! This looks awesome :)
I was about to create a PR that fixes the init_ddp_connection in finetune.py and that adds a test script to make sure the finetuning script works as expected. With minimal changes on my side I can easily reduce conflicts between our two changes to finetune.py (I guess I'll just reuse the CustomAccelerator). Does that sound good to you ?

@amogkam
Collaborator Author

amogkam commented Nov 17, 2020

@lhoestq yes that sounds great!

@lhoestq
Member

lhoestq commented Nov 25, 2020

Yes indeed ! Feel free to set this PR to ready for review

Also it looks like the CI fails because of a failed import of ray.
To fix that you need to move the import of ray into the test functions decorated with require_distributed_retrieval.

You should also add ray to the test dependencies, or the test will simply be ignored.
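
A sketch of the suggested pattern; the decorator body is a guess, and only the require_distributed_retrieval name comes from the PR:

import unittest


def require_distributed_retrieval(test_case):
    # Guessed decorator body: skip instead of failing at import time when
    # ray is not installed.
    try:
        import ray  # noqa: F401
    except ImportError:
        return unittest.skip("test requires ray")(test_case)
    return test_case


class RagRayRetrieverTest(unittest.TestCase):
    @require_distributed_retrieval
    def test_ray_retriever(self):
        import ray  # imported inside the test so collection works without ray

        ray.init(local_mode=True)
        self.addCleanup(ray.shutdown)
        ...  # build the retriever here and assert on the retrieved documents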

@amogkam amogkam marked this pull request as ready for review November 26, 2020 01:21
@amogkam
Collaborator Author

amogkam commented Nov 26, 2020

@lhoestq CI is passing now!

@amogkam
Collaborator Author

amogkam commented Nov 28, 2020

@lhoestq any ETA on when this PR can get reviewed? Thanks

@lhoestq
Member

lhoestq commented Nov 28, 2020

Hi ! I've already started to look at the changes and it looks pretty good so far :) I'll finish my review soon, probably tomorrow

@amogkam
Collaborator Author

amogkam commented Nov 28, 2020

Awesome thanks!

Member

@lhoestq lhoestq left a comment


Really good ! Thank you for adding Ray support for RAG fine-tuning :)
And the speedup compared to using only one worker for retrieval is pretty cool.

I left a few comments, mainly about separating the pytorch tests from the ray tests.

Comment on lines 69 to 77
python examples/rag/finetune.py \
--data_dir $DATA_DIR \
--output_dir $OUTPUT_DIR \
--model_name_or_path $MODEL_NAME_OR_PATH \
--model_type rag_sequence \
--fp16 \
--gpus 8 \
--distributed_retriever ray \
--num_retrieval_workers 4
Member


maybe add an example for torch as well ?

Collaborator Author


Currently distributed_retriever defaults to pytorch, so an example command for this would just be the same as the command earlier in the README. I added a sentence saying that the default is pytorch, though.

try:
    import ray  # noqa: F401

    _has_ray = True
except ImportError:
    _has_ray = False
Member


adding ray integration here cc @LysandreJik

@LysandreJik LysandreJik requested a review from sgugger November 30, 2020 14:34
@LysandreJik
Member

@sgugger it would be cool if you could review as this changes some things in the trainer/integrations.

Collaborator

@sgugger sgugger left a comment


There are other instances of is_ray_available to change to is_ray_tune_available if we go with the name change:

  • in integrations.py, inside the function hp_params and default_hp_search_backend
  • in trainer_utils.py, inside the function default_hp_space_ray

The main __init__ should also be updated to provide the two functions.
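
For concreteness, a rough sketch of the kind of availability-gated dispatch being renamed (simplified, not the actual transformers source):

import importlib.util


def is_optuna_available():
    return importlib.util.find_spec("optuna") is not None


def is_ray_tune_available():
    # Renamed from is_ray_available: it specifically gates the Ray Tune
    # hyperparameter-search integration, not the RAG retrieval use of Ray.
    if importlib.util.find_spec("ray") is None:
        return False
    return importlib.util.find_spec("ray.tune") is not None


def default_hp_search_backend():
    if is_optuna_available():
        return "optuna"
    if is_ray_tune_available():
        return "ray"
    return None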

@amogkam amogkam requested review from lhoestq and sgugger November 30, 2020 19:53
@amogkam
Collaborator Author

amogkam commented Dec 1, 2020

Hi @lhoestq @sgugger I addressed the feedback you guys gave. Do you think you can take another look? Thanks

@sgugger
Collaborator

sgugger commented Dec 18, 2020

Hi there, sorry for the delay. Could you close and reopen your PR? Because of a bad force-push on our side, the diff has become unreadable. Also, the examples folder has slightly changed structure, so you might need to move the folder.

Ping me, @patrickvonplaten and @LysandreJik on the PR you reopen and we'll look at it quickly.

@amogkam
Collaborator Author

amogkam commented Dec 18, 2020

Opened a new one here: #9197!

@amogkam amogkam closed this Dec 18, 2020