Releases: Lightning-AI/pytorch-lightning
standard weekly patch release
Detail changes
Added
- Added casting to Python types for numpy scalars when logging hparams (#4647)
- Added warning when progress bar refresh rate is less than 20 on Google Colab to prevent crashing (#4654)
- Added F1 class metric (#4656)
Changed
- Consistently use step=trainer.global_step in LearningRateMonitor independently of logging_interval (#4376)
- Metric states are no longer added to state_dict by default (#4685)
- Renamed class metric Fbeta >> FBeta (#4656)
- Model summary: add 1 decimal place (#4745)
- Do not override PYTHONWARNINGS (#4700)
Fixed
- Fixed checkpoint hparams dict casting when omegaconf is available (#4770)
- Fixed incomplete progress bars when total batches not divisible by refresh rate (#4577)
- Updated SSIM metric (#4566, #4656)
- Fixed batch_arg_name bug: add batch_arg_name to all calls to _adjust_batch_size (#4812)
- Fixed torchtext data to GPU (#4785)
- Fixed a crash bug in MLFlow logger (#4716)
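The incomplete progress bar fix (#4577) comes down to simple arithmetic: if the bar only advances in whole multiples of the refresh rate, any remainder is never drawn. The following is a minimal sketch of that idea, not the Lightning implementation; the function name progress_updates is hypothetical.

```python
def progress_updates(total_batches, refresh_rate):
    """Return the increments a bar would receive, including the final remainder."""
    updates = []
    for batch_idx in range(1, total_batches + 1):
        # Refresh only every `refresh_rate` batches.
        if batch_idx % refresh_rate == 0:
            updates.append(refresh_rate)
    remainder = total_batches % refresh_rate
    if remainder:
        # Without this last partial update the bar would stop short of 100%.
        updates.append(remainder)
    return updates


print(progress_updates(10, 3))
```

With 10 batches and a refresh rate of 3, the increments add up to 10 only because the trailing remainder is flushed.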
Contributors
@awaelchli, @jonashaag, @jungwhank, @M-Salti, @moi90, @pgagarinov, @s-rog, @Samyak2, @SkafteNicki, @teddykoker, @ydcjeff
If we forgot someone due to not matching commit email with GitHub account, let us know :]
standard weekly patch release
Detail changes
Added
- Added lambda closure to manual_optimizer_step (#4618)
Changed
- Changed Metrics persistent default mode to False (#4685)
Fixed
- Prevent crash if sync_dist=True on CPU (#4626)
- Fixed average pbar Metrics (#4534)
- Fixed setup callback hook to correctly pass the LightningModule through (#4608)
- Allowing decorate model init with saving hparams inside (#4662)
- Fixed split_idx set by LoggerConnector in on_trainer_init to Trainer (#4697)
Contributors
@ananthsub, @Borda, @SeanNaren, @SkafteNicki, @tchaton
standard weekly patch release
Detail changes
Added
- Added metrics aggregation in Horovod and fixed early stopping (#3775)
- Added manual_optimizer_step which works with AMP Native and accumulated_grad_batches (#4485)
- Added persistent(mode) method to metrics, to enable and disable metric states being added to state_dict (#4482)
- Added congratulations at the end of our notebooks (#4555)
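To illustrate what toggling persistence means for a metric's state_dict, here is a minimal sketch using a stand-in class. SketchMetric and its attributes are hypothetical and only mimic the behaviour described above; the default of persistent=False in add_state is chosen here for illustration.

```python
class SketchMetric:
    """Hypothetical stand-in for a metric class; not the Lightning implementation."""

    def __init__(self):
        self._states = {}      # state name -> current value
        self._persistent = {}  # state name -> include in state_dict?

    def add_state(self, name, default, persistent=False):
        self._states[name] = default
        self._persistent[name] = persistent

    def persistent(self, mode=True):
        # Flip the flag for every registered state at once.
        for name in self._persistent:
            self._persistent[name] = mode

    def state_dict(self):
        # Only states flagged as persistent are exported.
        return {k: v for k, v in self._states.items() if self._persistent[k]}


metric = SketchMetric()
metric.add_state("total", 0)
metric.persistent(True)
print(metric.state_dict())
```

Keeping the flag per state, rather than per metric, lets add_state and persistent(mode) coexist without special cases.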
Changed
- Changed fsspec to tuner (#4458)
- Unify SLURM/TorchElastic under backend plugin (#4578, #4580, #4581, #4582, #4583)
Fixed
- Fixed feature-lack in hpc_load (#4526)
- Fixed metrics states being overridden in DDP mode (#4482)
- Fixed lightning_getattr, lightning_hasattr not finding the correct attributes in datamodule (#4347)
- Fixed automatic optimization AMP by manual_optimization_step (#4485)
- Replace MisconfigurationException with warning in ModelCheckpoint Callback (#4560)
- Fixed logged keys in mlflow logger (#4412)
- Fixed is_picklable by catching AttributeError (#4508)
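The is_picklable entry hints at a common pitfall: pickle.dumps does not always raise PicklingError on failure. A minimal sketch of such a check, written from the changelog description rather than copied from the library, might look like this:

```python
import pickle


def is_picklable(obj):
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        # Local functions and classes can fail with AttributeError rather
        # than PicklingError, depending on the Python version.
        return False


def make_local_function():
    def inner():
        return None
    return inner


print(is_picklable({"lr": 0.1}), is_picklable(make_local_function()))
```

Other failure types (for example TypeError for some built-in objects) are deliberately not caught in this sketch, so they would still surface to the caller.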
Contributors
@dscarmo, @jtamir, @kazhang, @maxjeblick, @rohitgr7, @SkafteNicki, @tarepan, @tchaton, @tgaddair, @williamFalcon
standard weekly patch release
Detail changes
Added
- Added PyTorch 1.7 Stable support (#3821)
- Added timeout for tpu_device_exists to ensure process does not hang indefinitely (#4340)
Changed
- W&B log in sync with Trainer step (#4405)
- Hook on_after_backward is called only when optimizer_step is being called (#4439)
- Moved track_and_norm_grad into training loop and called only when optimizer_step is being called (#4439)
- Changed type checker with explicit cast of ref_model object (#4457)
Deprecated
- Deprecated passing ModelCheckpoint instance to checkpoint_callback Trainer argument (#4336)
Fixed
- Disable saving checkpoints if not trained (#4372)
- Fixed error using auto_select_gpus=True with gpus=-1 (#4209)
- Disabled training when limit_train_batches=0 (#4371)
- Fixed that metrics do not store computational graph for all seen data (#4313)
- Fixed AMP unscale for on_after_backward (#4439)
- Fixed TorchScript export when module includes Metrics (#4428)
- Fixed CSV logger warning (#4419)
- Fixed skip DDP parameter sync (#4301)
Contributors
@ananthsub, @awaelchli, @borisdayma, @carmocca, @justusschock, @lezwon, @rohitgr7, @SeanNaren, @SkafteNicki, @ssaru, @tchaton, @ydcjeff
standard weekly patch release
Detail changes
Added
- Added dirpath and filename parameter in ModelCheckpoint (#4213)
- Added plugins docs and DDPPlugin to customize ddp across all accelerators (#4258)
- Added strict option to the scheduler dictionary (#3586)
- Added fsspec support for profilers (#4162)
- Added autogenerated helptext to Trainer.add_argparse_args (#4344)
- Added support for string values in Trainer's profiler parameter (#3656)
Changed
- Improved error messages for invalid configure_optimizers returns (#3587)
- Allow changing the logged step value in validation_step (#4130)
- Allow setting replace_sampler_ddp=True with a distributed sampler already added (#4273)
- Fixed sanitized parameters for WandbLogger.log_hyperparams (#4320)
Deprecated
- Deprecated filepath in ModelCheckpoint (#4213)
- Deprecated reorder parameter of the auc metric (#4237)
- Deprecated bool values in Trainer's profiler parameter (#3656)
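For code that still holds a single legacy filepath string, one way to migrate to the separate dirpath and filename parameters is to split the path. The helper below is a hypothetical sketch (split_filepath is not a Lightning function), and the ".ckpt" suffix handling is an assumption made for illustration.

```python
import os


def split_filepath(filepath):
    """Split a legacy-style path into (dirpath, filename) without the .ckpt suffix."""
    dirpath, basename = os.path.split(filepath)
    # Strip only a literal ".ckpt" suffix; format specs such as "{val_loss:.2f}"
    # contain dots too, so a generic extension split would be unsafe.
    if basename.endswith(".ckpt"):
        basename = basename[: -len(".ckpt")]
    return dirpath, basename


print(split_filepath("checkpoints/{epoch}-{val_loss:.2f}.ckpt"))
```

The two returned values could then be passed as dirpath and filename respectively.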
Fixed
- Fixed setting device ids in DDP (#4297)
- Fixed synchronization of best model path in ddp_accelerator (#4323)
- Fixed WandbLogger not uploading checkpoint artifacts at the end of training (#4341)
Contributors
@ananthsub, @awaelchli, @carmocca, @ddrevicky, @louis-she, @mauvilsa, @rohitgr7, @SeanNaren, @tchaton
standard weekly patch release
Detail changes
Added
- Added persistent flag to Metric.add_state (#4195)
Changed
Fixed
Contributors
If we forgot someone due to not matching commit email with GitHub account, let us know :]
fixes a major logging bug for val in 1.0
Fixes the last major bugs for validation logging.
Also removes duplicate charts for metric / metric_loss.
Doing this minor release because correct validation metrics logging is critical.
Detail changes
Added
- Added trace functionality to the function to_torchscript (#4142)
Changed
- Called on_load_checkpoint before loading state_dict (#4057)
Removed
- Removed duplicate metric vs step log for train loop (#4173)
Fixed
- Fixed the self.log problem in validation_step() (#4169)
- Fixed hparams saving - save the state when save_hyperparameters() is called [in __init__] (#4163)
- Fixed runtime failure while exporting hparams to yaml (#4158)
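The hparams saving fix is about capturing the constructor arguments at the moment save_hyperparameters() is called. A rough sketch of that idea using frame inspection is shown below; HParamsMixin and Model are hypothetical stand-ins, and the real implementation may work differently.

```python
import inspect


class HParamsMixin:
    """Hypothetical stand-in; not the Lightning implementation."""

    def save_hyperparameters(self):
        # Look one frame up: the __init__ that called this method.
        frame = inspect.currentframe().f_back
        info = inspect.getargvalues(frame)
        # Copy the argument values now, so later changes do not leak in.
        self.hparams = {
            name: info.locals[name] for name in info.args if name != "self"
        }


class Model(HParamsMixin):
    def __init__(self, lr=0.1, hidden=32):
        self.save_hyperparameters()  # snapshot is taken here
        lr = lr * 10  # rebinding the local afterwards does not alter the snapshot
        self.scaled_lr = lr


print(Model(lr=0.5).hparams)
```

Because the values are copied at call time, the saved state reflects the arguments as they were when save_hyperparameters() ran.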
Contributors
@Borda, @NumesSanguis, @rohitgr7, @williamFalcon
minor jit fixes
Obligatory post 1.0 minor release. Main fix is to make Lightning module fully compatible with Jit (had some edge-cases we had not covered).
1.0.0 - General availability
Overview
...
Detail changes
- Added Explained Variance Metric + metric fix (#4013)
- Added Metric <-> Lightning Module integration tests (#4008)
- Added parsing OS env vars in Trainer (#4022)
- Added classification metrics (#4043)
- Updated explained variance metric (#4024)
- Enabled plugins (#4041)
- Enabled custom clusters (#4048)
- Enabled passing in custom accelerators (#4050)
- Added LightningModule.toggle_optimizer (#4058)
- Added LightningModule.manual_backward (#4063)
Changed
- Integrated metrics API with self.log (#3961)
- Decoupled Apex (#4052, #4054, #4055, #4056, #4058, #4060, #4061, #4062, #4063, #4064, #4065)
- Renamed all backends to Accelerator (#4066)
- Enabled manual returns (#4089)
Removed
- Removed output argument from *_batch_end hooks (#3965, #3966)
- Removed output argument from *_epoch_end hooks (#3967)
- Removed support for EvalResult and TrainResult (#3968)
- Removed deprecated trainer flags: overfit_pct, log_save_interval, row_log_interval (#3969)
- Removed deprecated early_stop_callback (#3982)
- Removed deprecated model hooks (#3980)
- Removed deprecated callbacks (#3979)
- Removed trainer argument in LightningModule.backward (#4056)
Fixed
- Fixed current_epoch property update to reflect true epoch number inside LightningDataModule, when reload_dataloaders_every_epoch=True. (#3974)
- Fixed to print scaler value in progress bar (#4053)
- Fixed mismatch between docstring and code regarding when on_load_checkpoint hook is called (#3996)
Contributors
@ananyahjha93, @Borda, @edenlightning, @hbredin, @rohitgr7, @SkafteNicki, @teddykoker, @williamFalcon
Buffer release before 1.0
This release is a buffer in case 1.0 breaks any compatibility for people who upgrade. 0.10.0 has all the bug fixes and features of 1.0 but is 100% backward compatible. The 1.0 release will follow in the next 24 hours.
Overview
The major changes are:
- Results objects are deprecated (we hated them too haha)
- This means dataflow and logging have been decoupled
To log:
    def any_step(...):
        self.log('something', i_computed)
Separately, return whatever you want from methods:
    def training_step(...):
        return loss
or
    def training_step(...):
        return {'loss': loss, 'whatever': [1, 'want']}
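To make the decoupling concrete, here is a small self-contained sketch in which logging is a side effect and the return value is independent of it. SketchModule is a hypothetical stand-in that only mimics self.log; it is not a LightningModule, and the "loss" is a plain average chosen for illustration.

```python
class SketchModule:
    """Hypothetical stand-in for a LightningModule; only mimics self.log."""

    def __init__(self):
        self.logged = {}

    def log(self, name, value):
        # Logging is a side effect; it does not touch the return value.
        self.logged[name] = value

    def training_step(self, batch, batch_idx):
        loss = sum(batch) / len(batch)
        self.log("train_loss", loss)
        # Return a scalar loss, or a dict with a 'loss' key plus anything else.
        return {"loss": loss, "whatever": [1, "want"]}


module = SketchModule()
output = module.training_step([1.0, 3.0], batch_idx=0)
print(output["loss"], module.logged)
```

What gets logged and what gets returned are set independently, which is the point of deprecating the Results objects.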
Detail changes
Added
- Added new Metrics API. (#3868, #3921)
- Enable PyTorch 1.7 compatibility (#3541)
- Added LightningModule.to_torchscript to support exporting as ScriptModule (#3258)
- Added warning when dropping unpicklable hparams (#2874)
- Added EMB similarity (#3349)
- Added ModelCheckpoint.to_yaml method (#3048)
- Allow ModelCheckpoint monitor to be None, meaning it will always save (#3630)
- Disabled optimizers setup during testing (#3059)
- Added support for datamodules to save and load checkpoints when training (#3563)
- Added support for datamodule in learning rate finder (#3425)
- Added gradient clip test for native AMP (#3754)
- Added dist lib to enable syncing anything across devices (#3762)
- Added broadcast to TPUBackend (#3814)
- Added XLADeviceUtils class to check XLA device type (#3274)
Changed
- Refactored accelerator backends:
  - moved TPU xxx_step to backend (#3118)
  - refactored DDP backend forward (#3119)
  - refactored GPU backend __step (#3120)
  - refactored Horovod backend (#3121, #3122)
  - remove obscure forward call in eval + CPU backend ___step (#3123)
  - reduced all simplified forward (#3126)
  - added hook base method (#3127)
  - refactor eval loop to use hooks - use test_mode for if so we can split later (#3129)
  - moved ___step_end hooks (#3130)
  - training forward refactor (#3134)
  - training AMP scaling refactor (#3135)
  - eval step scaling factor (#3136)
  - add eval loop object to streamline eval loop (#3138)
  - refactored dataloader process hook (#3139)
  - refactored inner eval loop (#3141)
  - final inner eval loop hooks (#3154)
  - clean up hooks in run_evaluation (#3156)
  - clean up data reset (#3161)
  - expand eval loop out (#3165)
  - moved hooks around in eval loop (#3195)
  - remove _evaluate fx (#3197)
  - Trainer.fit hook clean up (#3198)
  - DDPs train hooks (#3203)
  - refactor DDP backend (#3204, #3207, #3208, #3209, #3210)
  - reduced accelerator selection (#3211)
  - group prepare data hook (#3212)
  - added data connector (#3285)
  - modular is_overridden (#3290)
  - adding Trainer.tune() (#3293)
  - move run_pretrain_routine -> setup_training (#3294)
  - move train outside of setup training (#3297)
  - move prepare_data to data connector (#3307)
  - moved accelerator router (#3309)
  - train loop refactor - moving train loop to own object (#3310, #3312, #3313, #3314)
  - duplicate data interface definition up into DataHooks class (#3344)
  - inner train loop (#3359, #3361, #3362, #3363, #3365, #3366, #3367, #3368, #3369, #3370, #3371, #3372, #3373, #3374, #3375, #3376, #3385, #3388, #3397)
  - all logging related calls in a connector (#3395)
  - device parser (#3400, #3405)
  - added model connector (#3407)
  - moved eval loop logging to loggers (#3408)
  - moved eval loop (#3412, #3408)
  - trainer/separate argparse (#3421, #3428, #3432)
  - move lr_finder (#3434)
  - organize args (#3435, #3442, #3447, #3448, #3449, #3456)
  - move specific accelerator code (#3457)
  - group connectors (#3472)
  - accelerator connector methods x/n (#3469, #3470, #3474)
  - merge backends (#3476, #3477, #3478, #3480, #3482)
  - apex plugin (#3502)
  - precision plugins (#3504)
  - Result - make monitor default to checkpoint_on to simplify (#3571)
  - reference to the Trainer on the LightningDataModule (#3684)
  - add .log to lightning module (#3686, #3699, #3701, #3704, #3715)
  - enable tracking original metric when step and epoch are both true (#3685)
  - deprecated results obj, added support for simpler comms (#3681)
  - move backends back to individual files (#3712)
  - fixes logging for eval steps (#3763)
  - decoupled DDP, DDP spawn (#3733, #3766, #3767, #3774, #3802, #3806)
  - remove weight loading hack for ddp_cpu (#3808)
  - separate torchelastic from DDP (#3810)
  - separate SLURM from DDP (#3809)
  - decoupled DDP2 (#3816)
  - bug fix with logging val epoch end + monitor (#3812)
  - decoupled DDP, DDP spawn (#3733, #3817, #3819, #3927)
  - callback system and init DDP (#3836)
  - adding compute environments (#3837, #3842)
  - epoch can now log independently (#3843)
  - test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (#3848)
  - fixed init_slurm_connection causing hostname errors (#3856)
  - moves init apex from LM to apex connector (#3923)
  - moves sync bn to each backend (#3925)
  - moves configure ddp to each backend (#3924)
- Deprecation warning (#3844)
- Changed LearningRateLogger to LearningRateMonitor (#3251)
- Used fsspec instead of gfile for all IO (#3320)
  - Swapped torch.load for fsspec load in DDP spawn backend (#3787)
  - Swapped torch.load for fsspec load in cloud_io loading (#3692)
  - Added support for to_disk() to use remote filepaths with fsspec (#3930)
  - Updated model_checkpoint's to_yaml to use fsspec open (#3801)
  - Fixed fsspec is inconsistent when doing fs.ls (#3805)
- Refactor GPUStatsMonitor to improve training speed (#3257)
- Changed IoU score behavior for classes absent in target and pred (#3098)
- Changed IoU remove_bg bool to ignore_index optional int (#3098)
- Changed defaults of save_top_k and save_last to None in ModelCheckpoint (#3680)
- row_log_interval and log_save_interval are now based on training loop's global_step instead of epoch-internal batch index (#3667)
- Silenced some warnings. verified ddp refactors (#3483)
- Cleaning up stale logger tests (#3490)
- Allow ModelCheckpoint monitor to be None (#3633)
- Enable None model checkpoint default (#3669)
- Skipped best_model_path if checkpoint_callback is None (#2962)
- Used raise .. from .. to explicitly chain exceptions (#3750)
- Mocking loggers (#3596, #3617, #3851, #3859, #3884, #3853, #3910, #3889, #3926)
- Write predictions in LightningModule instead of EvalResult (#3882)
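For readers unfamiliar with the raise .. from .. form mentioned in the list above, the following generic Python sketch shows what explicit chaining does. The function parse_port is a hypothetical example and is unrelated to the Lightning code base.

```python
def parse_port(text):
    """Convert text to an int, re-raising failures with added context."""
    try:
        return int(text)
    except ValueError as err:
        # 'from err' stores the original exception on __cause__, so the
        # traceback shows both errors and how they relate.
        raise RuntimeError(f"Invalid port: {text!r}") from err


try:
    parse_port("eighty")
except RuntimeError as exc:
    print(type(exc.__cause__).__name__)
```

The original exception stays attached to the new one, which keeps the root cause visible when debugging.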
Deprecated
- Deprecated TrainResult and EvalResult, use self.log and self.write from the LightningModule to log metrics and write predictions. training_step can now only return a scalar (for the loss) or a dictionary with anything you want. (#3681)
- Deprecate early_stop_callback Trainer argument (#3845)
- Rename Trainer arguments row_log_interval >> log_every_n_steps and log_save_interval >> flush_logs_every_n_steps (#3748)
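Projects that build Trainer keyword arguments programmatically may want to translate the renamed arguments in one place. The sketch below is a hypothetical helper (translate_kwargs is not part of Lightning); only the old and new argument names come from the entry above.

```python
import warnings

# Old name -> new name, as listed in the changelog entry above.
RENAMED = {
    "row_log_interval": "log_every_n_steps",
    "log_save_interval": "flush_logs_every_n_steps",
}


def translate_kwargs(kwargs):
    """Return a copy of kwargs with deprecated names mapped to the new ones."""
    translated = {}
    for key, value in kwargs.items():
        if key in RENAMED:
            warnings.warn(
                f"`{key}` is deprecated, use `{RENAMED[key]}` instead",
                DeprecationWarning,
                stacklevel=2,
            )
            key = RENAMED[key]
        translated[key] = value
    return translated


print(translate_kwargs({"row_log_interval": 10, "max_epochs": 3}))
```

Arguments that were not renamed pass through unchanged.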
Removed
- Removed experimental Metric API (#3868, #3943, #3949, #3946), listed changes before final removal:
  - Added EmbeddingSimilarity metric (#3349, #3358)
  - Added hooks to metric module interface (#2528)
  - Added error when AUROC metric is used for multiclass problems (#3350)
  - Fixed ModelCheckpoint with save_top_k=-1 option not tracking the best models when a monitor metric is available (#3735)
  - Fixed counter-intuitive error being thrown in Accuracy metric for zero target tensor (#3764)
  - Fixed aggregation of metrics (#3517)
  - Fixed Metric aggregation (#3321)
  - Fixed RMSLE metric (#3188)
  - Renamed reduction to class_reduction in classification metrics (#3322)
  - Changed class_reduction similar to sklearn for classification metrics (#3322)
  - Renaming of precision recall metric (#3308)
Fixed
- Fixed on_train_batch_start hook to end epoch early (#3700)
- Fixed num_sanity_val_steps is clipped to limit_val_batches (#2917)
- Fixed ONNX model save on GPU (#3145)
- Fixed GpuUsageLogger to work on different platforms (#3008)
- Fixed auto-scale batch size not dumping auto_lr_find parameter (#3151)
- Fixed batch_outputs with optimizer frequencies (#3229)
- Fixed setting batch size in LightningModule.datamodule when using auto_scale_batch_size (#3266)
- Fixed Horovod distributed backend compatibility with native AMP (#3404)
- Fixed batch size auto scaling exceeding the size of the dataset (#3271)
- Fixed getting experiment_id from MLFlow only once instead of each training loop (#3394)
- Fixed overfit_batches which now correctly disables shuffling for the training loader. (#3501)
- Fixed gradient norm tracking for row_log_interval > 1 (#3489)
- Fixed ModelCheckpoint name formatting (#3164)
- Fixed auto-scale batch size (#3151)
- Fixed example implementation of AutoEncoder (#3190)
- Fixed invalid paths when remote logging with TensorBoard (#3236)
- Fixed change t() to transpose() as XLA devices do not support .t() on 1-dim tensor (#3252)
- Fixed (weights only) checkpoints loading without PL (#3287)
- Fixed gather_all_tensors cross GPUs in DDP (#3319)
- Fixed CometML save dir (#3419)
- Fixed forward key metrics (#3467)
- Fixed normalize mode at confusion matrix (replace NaNs with zeros) (#3465)
- Fixed global step increment in training loop when training_epoch_end hook is used (#3673)
- Fixed dataloader shuffling not getting turned off with overfit_batches > 0 and distributed_backend = "ddp" (#3534)
- Fixed determinism in DDPSpawnBackend when using seed_everything in main process (#3335)
- Fixed ModelCheckpoint period to actually save every period epochs (#363...