Conversation

@amogkam
Collaborator

@amogkam amogkam commented Nov 17, 2020

What does this PR do?

This PR adds a new distributed retriever implementation for RAG built on Ray, as an alternative to the current retriever implementation that uses torch.distributed. With Ray, the index can be loaded on multiple processes instead of just the rank 0 training worker, allowing fine-tuning to scale out better to multiple GPUs and allowing the index to potentially fit in GPU memory. This also removes a core dependency on PyTorch, opening the door to a TensorFlow implementation of finetune.py.

This PR also makes the changes needed for finetune.py to work with PyTorch Lightning > v1.0.

A benchmark of PyTorch distributed retrieval vs. Ray distributed retrieval:
[image: benchmark results]

Implementation Details

In the current PyTorch retrieval implementation, the index is loaded once on just the rank 0 training worker. Training worker 0 gathers the inputs from all other workers, performs the index lookup, and scatters the results back to the other workers.
[image: rank 0 gather/scatter retrieval architecture]
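
For illustration, here is a rough sketch of that gather/scatter flow; this is not the actual retriever code, and index.search plus the tensor shapes are placeholder assumptions:

import torch
import torch.distributed as dist


def retrieve(question_embeds, index, n_docs):
    # Sketch only: assumes the process group is already initialized and that
    # every worker contributes a batch of the same size.
    rank, world_size = dist.get_rank(), dist.get_world_size()
    if rank == 0:
        # Rank 0 gathers the query embeddings from every training worker...
        gathered = [torch.empty_like(question_embeds) for _ in range(world_size)]
        dist.gather(question_embeds, gather_list=gathered, dst=0)
        # ...performs the lookup locally (index.search stands in for the real
        # index API and is assumed to return a LongTensor of doc ids)...
        doc_ids = [index.search(q, n_docs) for q in gathered]
        # ...and scatters each worker's slice of results back to it.
        output = torch.empty_like(doc_ids[0])
        dist.scatter(output, scatter_list=doc_ids, src=0)
    else:
        # All other workers only send their queries and wait for their results.
        dist.gather(question_embeds, dst=0)
        output = torch.empty(question_embeds.shape[0], n_docs, dtype=torch.long)
        dist.scatter(output, src=0)
    return output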

With the Ray implementation, the index is loaded on separate processes, referred to as Ray actors. Each training worker randomly selects a retrieval actor to query for documents, and Ray handles all communication between the processes. Because the index can be loaded in multiple processes, training scales better: no synchronization is needed for the index lookup.
[image: Ray retrieval actor architecture]

Note that PyTorch Lightning still handles distributed training; Ray only manages distributed retrieval. Because PTL launches the entire training script multiple times under the hood, we use Ray's named actors feature (https://docs.ray.io/en/master/actors.html?highlight=named%20actors#named-actors), which allows the retrieval actors to be referenced by all training processes. Named actors are only necessary because of how PTL handles distributed training; a simpler approach could probably be used for a TensorFlow implementation.
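
A minimal sketch of the named-actor pattern described above (all class and actor names are illustrative, not the PR's actual code):

import random

import ray


class InMemoryIndex:
    # Stand-in for the real document index (e.g. a faiss index plus passages).
    def search(self, question_embeds, n_docs):
        return [[0] * n_docs for _ in question_embeds]


@ray.remote
class RetrievalWorker:
    # Each actor process holds its own copy of the index, so there is no
    # single rank-0 bottleneck for retrieval.
    def __init__(self, index):
        self.index = index

    def retrieve(self, question_embeds, n_docs):
        return self.index.search(question_embeds, n_docs)


ray.init()

# The driver registers the actors by name so that the training processes
# spawned later by PyTorch Lightning can look them up.
num_workers = 4
workers = [
    RetrievalWorker.options(name=f"retrieval_worker_{i}").remote(InMemoryIndex())
    for i in range(num_workers)
]

# Inside a training worker: pick one of the named actors at random and query it.
actor = ray.get_actor(f"retrieval_worker_{random.randrange(num_workers)}")
doc_ids = ray.get(actor.retrieve.remote([[0.1, 0.2, 0.3]], n_docs=5))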

Testing Strategy

Unit tests were added to test_distributed_retriever.py. Note that the local Ray cluster for the tests had to be started with local_mode=True because the test file modifies sys.path and these changes are not propagated to remote processes. See https://stackoverflow.com/questions/54338013/parallel-import-a-python-file-from-sibling-folder for more info.
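
For reference, a minimal illustration of that workaround; the path and fixture names are made up and this is not the actual test file:

import sys

import ray


def setup_module():
    # local_mode=True runs "remote" work in the driver process, so the
    # sys.path change below is visible to the retrieval code as well.
    sys.path.append("path/to/examples/rag")  # hypothetical path tweak
    ray.init(local_mode=True)


def teardown_module():
    ray.shutdown()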

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors which may be interested in your PR.

@lhoestq
Member

lhoestq commented Nov 17, 2020

Hi ! This looks awesome :)
I was about to create a PR that fixes the init_ddp_connection in finetune.py and that adds a test script to make sure the finetuning script works as expected. With minimal changes on my side I can easily reduce conflicts between our two changes to finetune.py (I guess I'll just reuse the CustomAccelerator). Does that sound good to you ?

@amogkam
Collaborator Author

amogkam commented Nov 17, 2020

@lhoestq yes that sounds great!

@lhoestq
Member

lhoestq commented Nov 25, 2020

Yes indeed ! Feel free to set this PR to ready for review

Also it looks like the CI fails because of a failed import of ray.
To fix that you need to move the import of ray into the test functions decorated with require_distributed_retrieval.

You should also add ray to the test dependencies, or the test will simply be ignored.
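
A sketch of the suggested pattern; the decorator body is a guess, and only the require_distributed_retrieval name comes from the PR:

import unittest


def require_distributed_retrieval(test_case):
    # Guessed decorator body: skip instead of failing at import time when
    # ray is not installed.
    try:
        import ray  # noqa: F401
    except ImportError:
        return unittest.skip("test requires ray")(test_case)
    return test_case


class RagRayRetrieverTest(unittest.TestCase):
    @require_distributed_retrieval
    def test_ray_retriever(self):
        import ray  # imported inside the test so collection works without ray

        ray.init(local_mode=True)
        self.addCleanup(ray.shutdown)
        ...  # build the retriever here and assert on the retrieved documents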

@amogkam amogkam marked this pull request as ready for review November 26, 2020 01:21
@amogkam
Collaborator Author

amogkam commented Nov 26, 2020

@lhoestq CI is passing now!

@amogkam
Collaborator Author

amogkam commented Nov 28, 2020

@lhoestq any ETA on when this PR can get reviewed? Thanks

@lhoestq
Member

lhoestq commented Nov 28, 2020

Hi ! I've already started to look at the changes and it looks pretty good so far :) I'll finish my review soon, probably tomorrow

@amogkam
Collaborator Author

amogkam commented Nov 28, 2020

Awesome thanks!

Member

@lhoestq lhoestq left a comment


Really good ! Thank you for adding Ray support for RAG fine-tuning :)
And the speedup compared to using only one worker for retrieval is pretty cool.

I left a few comments, mainly about separating the pytorch tests from the ray tests.

Comment on lines 69 to 77
python examples/rag/finetune.py \
--data_dir $DATA_DIR \
--output_dir $OUTPUT_DIR \
--model_name_or_path $MODEL_NAME_OR_PATH \
--model_type rag_sequence \
--fp16 \
--gpus 8 \
--distributed_retriever ray \
--num_retrieval_workers 4
Member


maybe add an example for torch as well ?

Collaborator Author


Currently distributed_retriever defaults to pytorch, so an example command for this would just be the same as the command earlier in the README. I added a sentence saying that the default is pytorch, though.

try:
    import ray  # noqa: F401

    _has_ray = True
except ImportError:
    _has_ray = False
Member


adding ray integration here cc @LysandreJik

@LysandreJik LysandreJik requested a review from sgugger November 30, 2020 14:34
@LysandreJik
Member

@sgugger it would be cool if you could review as this changes some things in the trainer/integrations.

Collaborator

@sgugger sgugger left a comment


There are other instances of is_ray_available to change to is_ray_tune_available if we go with the name change:

  • in integrations.py, inside the function hp_params and default_hp_search_backend
  • in trainer_utils.py, inside the function default_hp_space_ray

The main __init__ should also be updated to provide the two functions.
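
For concreteness, a rough sketch of the kind of availability-gated dispatch being renamed (simplified, not the actual transformers source):

import importlib.util


def is_optuna_available():
    return importlib.util.find_spec("optuna") is not None


def is_ray_tune_available():
    # Renamed from is_ray_available: it specifically gates the Ray Tune
    # hyperparameter-search integration, not the RAG retrieval use of Ray.
    if importlib.util.find_spec("ray") is None:
        return False
    return importlib.util.find_spec("ray.tune") is not None


def default_hp_search_backend():
    if is_optuna_available():
        return "optuna"
    if is_ray_tune_available():
        return "ray"
    return None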

@amogkam amogkam requested review from lhoestq and sgugger November 30, 2020 19:53
@amogkam
Collaborator Author

amogkam commented Dec 1, 2020

Hi @lhoestq @sgugger I addressed the feedback you guys gave. Do you think you can take another look? Thanks

@sgugger
Collaborator

sgugger commented Dec 18, 2020

Hi there, sorry for the delay. Could you close and reopen your PR? Because of a bad force-push on our side, the diff has become unreadable. Also, the examples folder has slightly changed structure, so you might need to move the folder.

Ping me, @patrickvonplaten and @LysandreJik on the PR you reopen and we'll look at it quickly.

@amogkam
Collaborator Author

amogkam commented Dec 18, 2020

Opened a new one here: #9197!

@amogkam amogkam closed this Dec 18, 2020