[RFC] Move Trainer's loop-affecting arguments to `fit`, `validate`, `test`, and `predict` #10444

Comments
+1, love this idea!
@PyTorchLightning/core-contributors I'd like your feedback on this.
Dear @ananthsub, thanks for opening this issue. I have wanted to make this proposition for some time already!

I entirely agree with this, and I believe this change would improve Loop Customization if designed properly, as right now the Trainer arguments aren't passed to the new loops. However, this would be one of our biggest breaking changes, and we should act on it only if the Lightning team, Facebook, and the Lightning BDFL are all on board. Best,
@williamFalcon what are your thoughts on this?
I like this change. But as @tchaton mentioned, this is going to be a big refactor with a lot of deprecations. Just some personal opinions.

Possible cons:

Ref:
Does look nice from an API level; however, I'm wondering how this weighs against the already established familiarity of Lightning. Things such as this Reddit comment make me think very hard about whether the value of changing the API is worth it. I'm not against such a large API change if we think the overall value is worth it, just throwing some caution in there!
I'm fully in favor of this. This is also something I have found confusing since I first found Lightning. However, it is a BIG design shift. This will also make it clearer that this is not a trainer but an engine named "Trainer", as the trainer-specific flags will now live under the corresponding methods. As a side note, we would need to re-think
This is not Lightning anymore. In Lightning, the Trainer arguments live in one place so they can be easily configured. This change will effectively make the
In a certain way I like the argument: moving some arguments to particular methods gives a better understanding of which arguments are used in the corresponding stage.
Fully agree with @awaelchli.
Hey @ananthsub, this is still open to debate, but here is a quick summary of the previous comments.

Pros:

Cons:

Right now, it seems there are quite good arguments on both sides, and my conservative/stability side would push me toward no change.
First, does everyone agree that the pain points outlined in the issue today are problems? In particular: confusing/misleading trainer flags, and not knowing when they take effect. For all of these flags, it is way too much work to expect users to read the implementations to see how/when they take effect, and this behavior can surprise people. It should be immediately obvious from the name and from where the args are specified.

If you agree that these are problems, do you have alternate proposals that would address those concerns other than what's listed here? I am looking to address confusion over how to use the Lightning API for loop behaviors, and this way seemed most natural to me. If you had to re-design this today, knowing what features the Trainer now has, how would you do it?

Regarding the cons:
Furthermore, this would follow the documented API evolution process. This would not be a breaking change, and this issue is open for public comment. I don't anticipate a social media storm because I think that happens for silent/breaking changes that are not clearly communicated.
This worked when the project was a lot smaller: there were far fewer features and far fewer trainer functions. But the project has grown a lot, and I strongly believe we ought to prioritize API principles & correctness over convenient configuration. As stated above, there are users today who are confused by the flags offered and when they take effect. It doesn't matter how easy it is for those users to set values if they're using the flags incorrectly.

That said, this is not saying configuration doesn't have a role in usability. I am not proposing something that would make instantiation much more difficult, such as relying on complex objects as input values to the Trainer constructor. I'd argue that the clarity this provides is worth the change. This feels much cleaner to me, opens up more extension points, and means we don't need to force all functionality through the Trainer constructor. Most critically, it changes how users should view the Trainer given the far expanded functionality it now has compared to the start of the project.

It is very tempting to make everything easily configurable. It's also how you end up with functions/classes that take 200+ arguments, which have pairwise conflicts, and which no one knows how to use or what they work with. Config encourages monolithic designs over modularity. Coding by config has never worked with PyTorch because configs are not as flexible as code, and PyTorch encourages writing code. I have seen this repeatedly (way more than you'd imagine) with prior platforms I've worked on, and every single time, the whole project is either rewritten from scratch or someone adopts something more flexible because it allows people to write the code they want. That migration is much more painful for everyone. This is why I feel so strongly about this: without a more modular design, my concern is that either we'd have to rewrite Lightning from first principles, or someone else will write a leaner version that users will adopt instead. @williamFalcon
I'd like to provide some thoughts from the perspective of an outside product's (NeMo's) development on top of Lightning. PTL has two major classes, Trainer and LightningModule, which (roughly) separate engineering and science in DL training. It has been convenient to build on top of these two abstractions. However, every time Trainer and/or LightningModule loses instance variables that configure the session, it will break other products built on top of PTL (such as NeMo). If this RFC is to pass, may we ask that neither Trainer nor LightningModule lose their members or constructor arguments? And that when the same name is passed to .fit(…)/.validate(…), the one passed to the member function overrides the one in Trainer? For example:
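Since the original example did not survive extraction, here is a minimal sketch of the requested precedence semantics (the per-method `max_steps` is the proposed API, not an existing one, and `MyLightningModule` is a hypothetical model class):

```python
import pytorch_lightning as pl

# Constructor values remain set on the Trainer, so they can be read
# before the model is constructed (the NeMo use case described above).
trainer = pl.Trainer(max_steps=1000, max_epochs=10)

model = MyLightningModule()  # hypothetical model class

# Under the requested semantics, the per-call value wins for this run:
trainer.fit(model, max_steps=500)  # trains for 500 steps, not 1000
```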
Backwards compatibility is not the only reason behind this ask. Sometimes the model needs to be able to access the Trainer instance before the model is even constructed. For example, for model and/or pipeline parallelism, the model needs to know the world size. As another example, a learning rate scheduler may (or may not) need to know the number of GPUs, max_steps, batch size, etc.

Additionally, we have concerns with setting ckpt_path in the .fit(…) method. There are two scenarios in which one might want to do that: (a) fine-tuning and (b) resuming training from a checkpoint. The second scenario is a must-have for training on shared clusters with strict time limits (universities and companies). But it would be very error-prone, because users won't be able to re-submit the script as-is and have it do the right thing. Instead, they will need to adjust this parameter with every re-submission to make sure the right checkpoint is used.
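To make the resubmission concern concrete, here is a sketch of what cluster users would end up writing if `ckpt_path` only lived on `.fit(…)` (the `find_latest_checkpoint` helper is hypothetical):

```python
import glob
import os

def find_latest_checkpoint(ckpt_dir):
    """Hypothetical helper: return the most recently modified checkpoint, if any."""
    ckpts = glob.glob(os.path.join(ckpt_dir, "*.ckpt"))
    return max(ckpts, key=os.path.getmtime) if ckpts else None

# Every resubmission must recompute the right path instead of
# re-running the exact same script unchanged.
trainer.fit(model, ckpt_path=find_latest_checkpoint("checkpoints/"))
```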
Echoing @okuchaiev, there's value in knowing the values of parameters as soon as possible, prior to model creation. Which ones, it's actually hard to tell: we don't have real visibility into what products built on top of Lightning might decide to do.

Following this logic, allowing Trainer arguments to be eventually overridden by the same arguments to the corresponding methods makes sense. I'm wondering if there's a way to retain the current behavior while still addressing the concerns above. A couple of possibilities come to mind, but I may or may not agree with them 🙂:
With Lightning's maturity comes the expectation of a stable API. Unfortunately, the changes proposed here are not backwards compatible and would jeopardize an otherwise stable API. I do, however, agree with the sentiment that there might be some confusion. But let's take a step back and revisit the issues from the user perspective (and make sure they apply to the whole community) before jumping into any one particular decision. So, for now, let's go ahead and close this while we identify the core usability issues here and brainstorm ways to address them that don't affect backward compatibility. For all intents and purposes, consider the Trainer API core/foundational/stable.
Btw, on the configuration combinatorics explosion point: it turns out we've solved a lot of those issues through our current CI/CD (kind of one of the very valuable things PL adds). We have soooo many tests that exercise ALL possible combinations of things. In addition, any time something has broken, we fixed it and added a test. So, at this point, this is not a top-of-mind worry that I would spend too much time concerned with.

The Lightning Trainer API is stable. It's very unlikely more flags will be added. It's 100% unlikely (zero probability) that more than 10 flags will EVER be added, so we don't need to solve "scale" with 2000 flags, etc., because that's never going to happen. The simplifications to the Trainer are places where we can remove certain flags in favor of more expressive ones (but not throw stuff into a callback hidden under a submodule). However, we've been thinking about these flags for 2+ years at this point (with thousands of eyeballs), and we haven't had any new insights in a long time about flags that might not be necessary or that could be better named/more expressive. That said, I am very hopeful that there will be more insights that help the Trainer become more expressive!
@williamFalcon, thanks for this: "Lightning Trainer API is stable." :)
I would argue this isn't so much a question of compatibility/stability as a question of vision. We can maintain backwards-compatible support for existing flags for an arbitrary length of time, either by keeping the original flags or by adding a catch-all argument. This would allow our users to convert gradually and slowly. Drawing out this process also gives us time to gather reactions from the community, as these RFCs do not reach deep into our user base. For this reason, what's important to me is knowing whether this proposed change would be the better API had Lightning not initially been designed with most arguments grouped into the Trainer constructor.
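As a rough illustration of the "keep the original flags" option (a sketch, not a committed design; `_run_fit` is a hypothetical internal entry point), a deprecation shim could forward legacy constructor arguments to the per-method ones:

```python
import warnings

class Trainer:
    def __init__(self, **legacy_loop_kwargs):
        # Accept the old loop-affecting flags for an arbitrary deprecation window.
        if legacy_loop_kwargs:
            warnings.warn(
                "Passing loop arguments to Trainer(...) is deprecated; "
                "pass them to fit()/validate()/test()/predict() instead.",
                DeprecationWarning,
            )
        self._legacy_loop_kwargs = legacy_loop_kwargs

    def fit(self, model, **loop_kwargs):
        # Per-call arguments take precedence over legacy constructor ones.
        merged = {**self._legacy_loop_kwargs, **loop_kwargs}
        self._run_fit(model, **merged)
```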
Thanks all for the comments. The purpose of an RFC is to gather feedback, and I'm glad this discussion took place. I agree 100% with what @carmocca said above: ignoring any backward-compatibility concerns, is the API proposal better than what currently exists?

Addressing specific concerns raised:
The core usability issue highlighted here is that the Trainer is both the runner and the engine. By this I mean: the "runner" logic of the trainer ends up holding a reference to every other component in Lightning, while with the trainer as an engine, every other component in Lightning ends up holding a reference back to the Trainer (the lightning module, data module, callbacks, even the lightning optimizer). We see circular references & ownership throughout the codebase as a result. I see this as a huge potential for misuse: #7315. It also shows that different people have different definitions of "model code" vs "engineering code", which creates this ambiguity.
We rely a lot on mini end-to-end tests that leverage these flags. I cannot confidently say we test all possible combinations, but we do our best on this front. However, this reliance on end-to-end tests is tricky for exactly the combinatorics presented, in addition to dev-efficiency reasons: we consistently see slow & flaky tests. Keeping the components modular ensures we can provide better stability & a great dev experience without resorting to more extreme measures like fuzz-testing all possible combinations of input types/values and waiting hours/days for test signals.
This also applies to needing to pass
This is exactly the issue I mentioned above. It also puts much greater pressure on the initialization & hook order. The more hooks offered and the more potential paths through them, the more chance there's a conflict between what users want: some will want A to happen before B, and some will want B to happen before A. What happens then?

In a plain Python/PyTorch training script, this is entirely user-controlled, and dependencies are laid out explicitly. But with Lightning now, users end up relying on the Trainer to access all these properties. On a tactical level, guaranteeing every property's availability for each hook across all the different trainer configurations is a very tall order. From a user-experience POV, Lightning's claim is that it is organized PyTorch, not layers over PyTorch. If users end up writing code that looks different with vs. without the framework, that is a concern.
From my point of view, Lightning is a training framework, and I am strongly urging, as in #10573, that any existing arguments and properties in the Trainer remain in the Trainer's init if they affect
We are not interested in which hooks run when. We are interested in being able to access the Trainer to get properties prior to calling `.fit(…)`.
Building on top of @lantiga's suggestions, I would suggest
I think that even without any changes to where the arguments are located, this would already be a great enhancement, since it allows more flexible configuration and also changing these arguments in between two runs (something similar is sketched below).

Overall, I understand both sides here, and I think that for now, stability is far more important than changing the location. I also think, though, that this is something we should definitely revisit when considering
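For reference, something along these lines is already possible today by mutating loop attributes between runs, though it reaches into internals (the attribute path shown is from Lightning 1.x and may change):

```python
trainer = Trainer(max_epochs=10)
trainer.fit(model)

# Bump the stopping condition and continue with the same Trainer instance.
trainer.fit_loop.max_epochs = 20
trainer.fit(model)  # continues training up to epoch 20
```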
Proposed refactoring or deprecation
Background
We are auditing the Lightning components and APIs to assess opportunities for improvements:
This issue aims to address how loop conditions are specified when using the Lightning trainer.
Motivation
Can a new user tell how long the test runs for here (see the reconstructed snippet below)? One could read this and think it naturally runs for 10 steps. Only by diving into the documentation does one realize that this argument doesn't affect the `test` run at all!
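The snippet referenced above was lost in extraction; it presumably looked something like this (a reconstruction, assuming `max_steps` was the flag in question):

```python
trainer = Trainer(max_steps=10)
trainer.test(model)  # max_steps has no effect here: it only limits fit
```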
Another example is calling `fit` repeatedly with the same Trainer object (Missing cleanup after trainer.fit() and trainer.test() #4385). This re-design would make the trainer more reusable: users shouldn't need to re-instantiate the trainer if they only want to kick off a new run with different stopping conditions or loop settings.

It would also address maintenance concerns with the Trainer constructor over time: Maintaining the Trainer constructor over time #9006.
What happens if we need to add a new trainer function? The current design & practice would mean replicating all these args yet again on the Trainer. This makes discovering features of the core trainer harder as more flags are added.
`ckpt_path`: `resume_from_checkpoint` took effect only during `fit`, whereas `ckpt_path` was a function argument to `validate`, `test`, and `predict` - having 2 different ways to specify paths to load training state from was really confusing. See [checkpoint] Resolve 2 different checkpoint loading paths across fit vs validate/test/predict #9405.
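For context, in Lightning 1.x the two mechanisms looked roughly like this:

```python
# Restoring training state for fit: a constructor argument
trainer = Trainer(resume_from_checkpoint="last.ckpt")
trainer.fit(model)

# Loading a checkpoint for evaluation: a function argument
trainer.test(model, ckpt_path="last.ckpt")
```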
I've grouped the constructor arguments below based on which trainer function/loop they take effect in. Critically, almost all of these arguments do not apply to all loops.
- `check_val_every_n_epoch`: an integer controlling whether to run validation after every N training epochs
- `max_epochs`: number of training epochs to run before stopping
- `min_epochs`: force this number of training epochs to complete before stopping
- `max_steps`: maximum number of training steps to run before stopping
- `min_steps`: minimum number of training steps to run before stopping
- `max_time`: stop training after the specified time has elapsed (in case the number of steps/epochs isn't known in advance)
- `num_sanity_val_steps`: number of validation batches to run before starting the training routine
- `val_check_interval`: how often to check the validation set; supports either a fraction of the dataset or a number of steps
- `limit_train_batches`: how much of the associated dataset to iterate through per epoch. If a float is passed in, Lightning treats it as a fraction (check X% of the dataset); if an integer is passed in, Lightning treats it as a batch count (check N batches)
- `limit_val_batches`: same as `limit_train_batches` but for the val_dataloader
- `reload_dataloaders_every_n_epochs`: reloads dataloaders during fitting (there's a separate issue for deprecating this off the trainer entirely in favor of the data hooks: Move reload_dataloaders_every_n_epochs to the DataHooks class #8738)
- `limit_test_batches`: same as `limit_train_batches` but for the test_dataloader
- `limit_predict_batches`: same as `limit_train_batches` but for the predict_dataloader

Applies to all:

- `fast_dev_run`: runs n steps of train/val/test to find bugs; a form of canarying the run (a fast e2e check)
- `overfit_batches`: applies to fit/val/test but not predict (should this only apply to training?)

Pitch
fit
Before and after:
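The original code blocks were stripped from this copy; a plausible reconstruction using the flags grouped above (the per-method signature is the proposal, not shipped API):

```python
# Before: loop-affecting arguments configured on the constructor
trainer = Trainer(max_epochs=10, limit_train_batches=0.5, val_check_interval=0.25)
trainer.fit(model)

# After (proposed): the same arguments move to fit()
trainer = Trainer()
trainer.fit(model, max_epochs=10, limit_train_batches=0.5, val_check_interval=0.25)
```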
validate
Before and after:
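Likewise for `validate` (same caveat: a reconstruction, with argument names kept as-is for illustration):

```python
# Before
trainer = Trainer(limit_val_batches=100)
trainer.validate(model)

# After (proposed)
trainer = Trainer()
trainer.validate(model, limit_val_batches=100)
```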
test
Before and after:
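And for `test` (reconstruction):

```python
# Before
trainer = Trainer(limit_test_batches=2)
trainer.test(model)

# After (proposed)
trainer = Trainer()
trainer.test(model, limit_test_batches=2)
```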
predict
Before and after:
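And for `predict` (reconstruction):

```python
# Before
trainer = Trainer(limit_predict_batches=0.1)
trainer.predict(model)

# After (proposed)
trainer = Trainer()
trainer.predict(model, limit_predict_batches=0.1)
```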
Alternatives
Keep as is
Additional context
It is very tempting to make everything easily configurable. It's also how you end up with functions/classes that take 200+ arguments, which have pairwise conflicts, and which no one knows how to use or what they work with. Config encourages monolithic designs over modularity. Coding by config has never worked with PyTorch because configs are not as flexible as code, and PyTorch encourages writing code. I have seen this repeatedly (way more than you'd imagine) with prior platforms I've worked on, and every single time, the whole project is either rewritten from scratch or someone adopts something more flexible because it allows people to write the code they want. That migration is much more painful for everyone. This is why I feel so strongly about this: without a more modular design, my concern is that either we'd have to rewrite Lightning from first principles, or someone else will write a leaner version that users will adopt instead. @williamFalcon
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.