Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lightning Lite core and tests #10175

Merged
merged 60 commits into from
Oct 29, 2021
Merged

Lightning Lite core and tests #10175

merged 60 commits into from
Oct 29, 2021

Conversation

awaelchli
Copy link
Contributor

@awaelchli awaelchli commented Oct 27, 2021

What does this PR do?

Part of #9987

This is the V1 for the new Lightning Lite package. This PR adds the main Lite components and tests. Planned to be released as part of 1.5.
For the docs, see the follow up #10176.

Demo

Lightning Lite Demo

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

Part of #1 (it's a lie, this is just here to avoid noisy GitHub bot)

@awaelchli awaelchli added this to the v1.5 milestone Oct 27, 2021
@awaelchli awaelchli added feature Is an improvement or enhancement priority: 0 High priority task fabric lightning.fabric.Fabric labels Oct 27, 2021
@awaelchli awaelchli mentioned this pull request Oct 27, 2021
11 tasks
@awaelchli awaelchli mentioned this pull request Oct 27, 2021
23 tasks
@awaelchli awaelchli marked this pull request as ready for review October 27, 2021 15:54
Copy link
Contributor

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome !

tests/lite/test_parity.py Outdated Show resolved Hide resolved
@mergify mergify bot added the ready PRs ready to be merged label Oct 27, 2021
Copy link
Contributor

@carmocca carmocca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Requesting changes to allow myself to fully review)

@mergify mergify bot removed the ready PRs ready to be merged label Oct 27, 2021
Copy link
Contributor

@ananthsub ananthsub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if someone calls run twice, does the full accelerator setup happen again? if so, is that always desirable?

Copy link
Contributor

@carmocca carmocca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀🚀🚀🚀

@aazzolini
Copy link

Why do we need a run() method and an abstract class?

It looks completely out of place to me and misleading. I really like the simplicitly of LightningLite as a concrete class.
The run method here is completely out of place. The docstring is wrong, because nothing gets accelerated because it is inside the "run" function, no other places call the "run" method.

I believe it would be great if we could remove the "run" method from here and make LightingLite a concrete class.

@aazzolini
Copy link

Dear @aazzolini,

Let's keep PR conversation there.

About #10208, you are right, it could be an Iterator. Let's us investigate if there were some design constrains forcing a DataLoader here.

About #10207, the reason for a run function is 2 folds. First, it won't work with all ddp_spawn related plugins without it e.g TPU, secondly it is required for DeepSpeed to perform sharding instantiation under the context. The run function was the simplest API enabling all plugins to work together with the minimal code changes required for the users from raw PyTorch / across plugins. We invested multiple designs and this one was the simplest and less opiniated from all.

The problem is that the current POC doesn't exercise this use case. It's bad because it's not clear and i don't think it can be done in a clean way. If we ask the user to implement run() and then we ask the user to directly call run() there's no opportunity to incercept anything in between. Does it mean that we will have to add yet another wrapper around LighntiungLite as well?

I'd be inclined to ask you folks to exercise these use cases before moving ahead with this prototype and closing on a public API because I'm almost certain it would have to change in order to accomodate these very use cases.

@justusschock
Copy link
Member

If we ask the user to implement run() and then we ask the user to directly call run() there's no opportunity to incercept anything in between

Why would we have to intercept anything there? All our logic lives either before or after the user-defined run-method.

Does it mean that we will have to add yet another wrapper around LighntiungLite as well?

Definitely not. This is meant as an entry point to lightning for people not willing to fully convert. It will provide some features like hardware acceleration but it won't at any point offer all the features Lightning does. That was never the plan.

@awaelchli
Copy link
Contributor Author

awaelchli commented Oct 29, 2021

@aazzolini The reason we have a run method under a class is to avoid an inconvenience for the user to manually change their code if they switch from (for example) Lite(gpus=8) to Lite(tpu_cores=8). What would happen if we didn't have a run() method? Let's see:

# python script
# training on GPUs

def main():
    lite = LightningLite(gpus=2)  # no run method 

    lite.setup()
    # train loop 


if __name__ == "__main__"
   main()

Everything works here. It runs as a script or even in a Jupyter notebook. The user want's to try if TPUs have better performance for their use case. They think all they have to do is

lite = LightningLite(tpu_cores=8)  # no run method 

Will that wok? No. The TPU code here would require a function to spawn processes from. This is annoying, because we would need to rewrite the code into this:

def main():
    lite = LightningLite(tpu_cores=?)  # no run method 

    lite.setup()
    # train loop 


if __name__ == "__main__"
   xmp.spawn(main, ..., nprocs=8, method="fork")

Lightning does all that for you and can do so because the users code is not just "anywhere" but in the LightningModule. We wanted to have the same convenience with the run() method in Lite. In the above case, under the hood, Lite will do this:

xmp.spawn(self.run, nprocs=..., method=...) for the user when they call lite.run().

By forcing the user to put their code into the run method, Lite can consistently choose the right operation based on the accelerator choices of the user.

How much effort is it to convert the above code to a class + run method?

class Lite(LightningLite):
    # rename main to run and indent your code. Done!
    def run():
        ...


if __name__ == "__main__"
   # main()
   lite = Lite(gpus=...)

Plus (big plus imo): No launcher utilities. The user just runs python train.py as they would normally, or run their cell in the notebook as usual.

Beyond that, there are benefits to converting to Lightning from here, as there is already a class + method which mimics what the LightningModule would look like.

@tchaton tchaton enabled auto-merge (squash) October 29, 2021 21:15
@tchaton tchaton merged commit 9d136a9 into master Oct 29, 2021
@tchaton tchaton deleted the lightning-lite/lite-core branch October 29, 2021 21:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fabric lightning.fabric.Fabric feature Is an improvement or enhancement priority: 0 High priority task ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants