[RAG] Add Ray implementation for distributed retrieval #9197
Conversation
Thanks for closing/reopening, I have just one last nit!
Co-authored-by: Sylvain Gugger <[email protected]>
    @classmethod
    def get_tokenizers(cls, retriever_name_or_path, indexed_dataset=None, **kwargs):
        return super(RagRayDistributedRetriever, cls).get_tokenizers(retriever_name_or_path, indexed_dataset, **kwargs)
I'd prefer copy-paste here instead of making a change to the general retrieval_rag.py file. Abstracting that much just to avoid repeating code is not worth it here IMO. Ideally, I'd like to not have a get_tokenizers class method at all.
Got it, I made the change to remove get_tokenizers and keep everything in from_pretrained.
The reason I did this originally was so that any future changes to retrieval_rag::from_pretrained wouldn't also have to be made to distributed_ray_retriever::from_pretrained, since it would be easy to forget to do this if the tests don't catch it. This is something we just have to keep in mind.
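A minimal sketch of what that direction looks like, with the config/tokenizer loading duplicated inline in the subclass's from_pretrained instead of shared via a get_tokenizers helper (names and arguments are simplified for illustration, not copied from the actual diff):

```python
@classmethod
def from_pretrained(cls, retriever_name_or_path, indexed_dataset=None, **kwargs):
    # Config and tokenizer loading is copy-pasted here rather than being
    # factored out into a shared get_tokenizers class method.
    config = kwargs.pop("config", None) or RagConfig.from_pretrained(retriever_name_or_path, **kwargs)
    rag_tokenizer = RagTokenizer.from_pretrained(retriever_name_or_path, config=config)
    question_encoder_tokenizer = rag_tokenizer.question_encoder
    generator_tokenizer = rag_tokenizer.generator
    ...  # build and return the RagRayDistributedRetriever
```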
patrickvonplaten left a comment
Thanks so much for working on this! The PR looks great except for the get_tokenizers(...) class method. Could we try not to split up from_pretrained(...) in retrieval_rag.py, even at the cost of maybe copy-pasting some code?
Also before merging I'd like @lhoestq to take a quick look - he probably knows best here.
lhoestq left a comment
Looks all good, thanks! Looking forward to using it :)
Nice, good to merge then!

Awesome, thank you so much for the reviews @lhoestq @patrickvonplaten -- happy holidays!

Thanks guys!

@amogkam @patrickvonplaten I need some help to implement an end-to-end retrieval training feature for RAG with Ray. How can I run document encoding and indexing with an updated doc encoder (the context encoder network that is kept frozen in the original RAG) using a Ray actor separate from the main training process? And how can I access the document index inside Ray actors during training, in case I want to update the index, say, every 5000 steps?

@shamanez could you open a new issue to track this?
What does this PR do?
This PR adds a new distributed retriever implementation for RAG built on Ray, as an alternative to the current retriever implementation that uses torch.distributed. With Ray, the index can be loaded in multiple processes instead of just on the rank 0 training worker, allowing fine-tuning to scale out better to multiple GPUs and potentially allowing the index to fit in GPU memory. This also removes a core dependency on Pytorch, opening the door to a Tensorflow implementation of finetune.py.

This PR also makes changes to support finetune.py with Pytorch Lightning >v1.0.
A benchmark of Pytorch distributed retrieval vs. Ray distributed retrieval:

Implementation Details
In the current Pytorch retrieval implementation, the index is loaded once, on just the rank 0 training worker. Training worker 0 gathers the inputs from all other workers, performs the index lookup, and scatters the results back to the other workers.
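For intuition, here is a minimal, hypothetical sketch of that gather / lookup / scatter pattern. It is not the code in the PR; tensor shapes, process-group handling, and the index's get_top_docs call are simplifying assumptions.

```python
import torch
import torch.distributed as dist


def rank0_retrieve(query_vectors, n_docs, index, world_size):
    """Hypothetical sketch: only rank 0 holds the index and performs the lookup."""
    rank = dist.get_rank()

    # 1. Rank 0 gathers the query embeddings from every training worker.
    gathered = [torch.empty_like(query_vectors) for _ in range(world_size)] if rank == 0 else None
    dist.gather(query_vectors, gather_list=gathered, dst=0)

    if rank == 0:
        # 2. Rank 0 looks up documents for all workers (get_top_docs is an assumed index API).
        scatter_list = [torch.as_tensor(index.get_top_docs(q.numpy(), n_docs)).float() for q in gathered]
    else:
        scatter_list = None

    # 3. Each worker receives its own slice of the retrieved document vectors.
    output = torch.empty(query_vectors.shape[0], n_docs, query_vectors.shape[-1])
    dist.scatter(output, scatter_list=scatter_list, src=0)
    return output
```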

With the Ray implementation, the index is loaded in separate processes, referred to as Ray actors. Each training worker randomly selects a retrieval actor to query for documents, and Ray handles all the communication between processes. Because the index can be loaded in multiple processes, training scales up better, since the index lookup requires no synchronization between training workers.
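A rough illustration of this pattern follows. The RetrievalActor and FakeIndex classes below are stand-ins invented for the sketch, not the classes added by this PR.

```python
import random

import numpy as np
import ray


class FakeIndex:
    """Stand-in for the real document index, for illustration only."""

    def get_top_docs(self, query_vectors, n_docs):
        return np.zeros((len(query_vectors), n_docs, query_vectors.shape[-1]))


@ray.remote
class RetrievalActor:
    """Hypothetical Ray actor that holds its own copy of the document index."""

    def __init__(self, index):
        self.index = index

    def retrieve(self, query_vectors, n_docs):
        # The index lookup runs inside the actor's process, not in the training worker.
        return self.index.get_top_docs(query_vectors, n_docs)


ray.init()

# Load the index in several separate processes (one per actor).
retrieval_workers = [RetrievalActor.remote(FakeIndex()) for _ in range(4)]

# A training worker picks a random actor and queries it; Ray handles the
# inter-process communication, and workers never synchronize for retrieval.
chosen = random.choice(retrieval_workers)
docs = ray.get(chosen.retrieve.remote(np.random.randn(8, 768), n_docs=5))
```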

Note that Pytorch Lightning still handles distributed training, while Ray manages distributed retrieval. Because PTL calls the entire training script under the hood multiple times, we have to use Ray's named actors feature (https://docs.ray.io/en/master/actors.html?highlight=named%20actors#named-actors), which allows the retrieval actors to be referenced by all training processes. The use of named actors is necessitated by how PTL handles distributed training, and a simpler approach could probably be used for a Tensorflow implementation.
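A minimal sketch of the named-actor pattern, reusing the hypothetical RetrievalActor and FakeIndex from the previous sketch (the actor name is illustrative, and lifetime/namespace details are omitted):

```python
import numpy as np
import ray

# In the process that sets up retrieval, create the actor under a fixed name
# so other processes can find it without being handed an actor handle.
RetrievalActor.options(name="retrieval_worker_0").remote(FakeIndex())

# In each training process that PTL launches, look the actor up by its name.
worker = ray.get_actor("retrieval_worker_0")
docs = ray.get(worker.retrieve.remote(np.random.randn(8, 768), n_docs=5))
```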
Testing Strategy
Unit tests were added to test_distributed_retriever.py. Note that the local Ray cluster for the tests had to be started with local_mode=True, because the test file modifies sys.path and these changes are not propagated to remote processes. See https://stackoverflow.com/questions/54338013/parallel-import-a-python-file-from-sibling-folder for more info.
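For reference, starting such a local-mode cluster in a test looks roughly like this (a sketch, not the test code itself):

```python
import ray

# local_mode=True runs "remote" tasks and actors serially in the driver process,
# so modifications the test makes to sys.path remain visible to that code.
ray.init(local_mode=True)
try:
    ...  # exercise the Ray-based retriever as in test_distributed_retriever.py
finally:
    ray.shutdown()
```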
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.