Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add str method to datamodule #20301

Merged

Conversation

MrWhatZitToYaa
Copy link
Contributor

@MrWhatZitToYaa MrWhatZitToYaa commented Sep 25, 2024

What does this PR do?

Added a str function to the DataModule in order to be able to print Datasets included in the module instead of an adress.

Fixes #9947

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20301.org.readthedocs.build/en/20301/

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Sep 25, 2024
@MrWhatZitToYaa
Copy link
Contributor Author

Will add documentation once code is fixed

@MrWhatZitToYaa MrWhatZitToYaa marked this pull request as ready for review October 3, 2024 15:22
@MrWhatZitToYaa
Copy link
Contributor Author

I have been waiting for quite some time. How do I go on? Do I just wait some more until someone looks at this?

@lantiga lantiga changed the title Feature/9947 Add str method to datamodule Add str method to datamodule Nov 12, 2024
@lantiga
Copy link
Collaborator

lantiga commented Nov 12, 2024

hey @MrWhatZitToYaa thanks for taking this up and sorry for the wait

can you add a bit more detail on what do we think it's good printing from a datamodule?

the original design in #9947 was a bit different, as was #9967, but happy to consider alternatives after there's some discussion on the rationale

@lantiga lantiga added waiting on author Waiting on user action, correction, or update lightningdatamodule pl.LightningDataModule labels Nov 12, 2024
@MrWhatZitToYaa
Copy link
Contributor Author

hey @MrWhatZitToYaa thanks for taking this up and sorry for the wait

can you add a bit more detail on what do we think it's good printing from a datamodule?

the original design in #9947 was a bit different, as was #9967, but happy to consider alternatives after there's some discussion on the rationale

The idea originally came to me when I had to to use the Datamodule to implement some datasets we have in our lab. It was really annoying to deal with the fact that you never really know what you are dealing with and makes debugging really hard. I talked to some colleagues and they agreed that they ran into the same problem. That's why I would like to add a str functionality for the Datamodule: primarily for debugging / logging.
I think the simplest approach is to scan the whole module for instances of datasets and then return a string where each line contains the name of the dataset and the length if it is available. This would be a big quality of life improvement when dealing with / constructing Datamodules.

@github-actions github-actions bot added docs Documentation related ci Continuous Integration fabric lightning.fabric.Fabric dependencies Pull requests that update a dependency file dockers package labels Nov 20, 2024
Added alternative Boring Data Module implementations
Added test cases for all possible options
Added additional check for NotImplementedError in string function of DataModule
Made changes to comply with requested suggestions
Switched from hardcoded \n to more general os.linesep
@MrWhatZitToYaa MrWhatZitToYaa force-pushed the feature/9947_dataloader-string branch from 0c57055 to 122cf6d Compare November 20, 2024 21:07
lantiga and others added 2 commits November 21, 2024 00:49
Corrected the annotation for the internal function and the list that is suppsoed to store the information on the datasets
@mergify mergify bot removed the has conflicts label Nov 25, 2024
Copy link
Collaborator

@lantiga lantiga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, I aligned it with the latest master and added a comment for a fix before we can merge. It should be quick.

src/lightning/pytorch/core/datamodule.py Outdated Show resolved Hide resolved
Copy link

codecov bot commented Nov 25, 2024

Codecov Report

Attention: Patch coverage is 92.59259% with 6 lines in your changes missing coverage. Please review.

Project coverage is 79%. Comparing base (601c060) to head (36fdb57).
Report is 2 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (601c060) and HEAD (36fdb57). Click for more details.

HEAD has 187 uploads less than BASE
Flag BASE (601c060) HEAD (36fdb57)
cpu 68 23
lightning 51 17
pytest 46 0
python3.9 18 6
lightning_fabric 13 0
python3.10 9 3
python3.11 18 6
python3.12 23 8
gpu 2 0
pytest-full 24 23
pytorch2.4.1 3 2
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #20301     +/-   ##
=========================================
- Coverage      88%      79%     -9%     
=========================================
  Files         267      264      -3     
  Lines       23304    23328     +24     
=========================================
- Hits        20407    18365   -2042     
- Misses       2897     4963   +2066     

MrWhatZitToYaa and others added 8 commits November 25, 2024 13:54
Fixed type annotation issue
Reduced code size by using Sized object from abc library
Switched from Dataset based implementation to Dataloader based implementation
Added missing size value to tuple in the error case instead of returning only a string
Adjusted test to match the new implementation requirenemnts
Added necessary BoringModules for tests
Fixed bugs and annotation issues in the str method
Refactored code and made it more readable by implementing more abstarct fucntion methods
Adjusted tests
Removed debug statements
Removed TODO comments
Renamed varaibles to more sensible names to increase readability
@MrWhatZitToYaa
Copy link
Contributor Author

The string method will now check if any of the 4 dataloaders are available and try to print information on them. I hope this is how you imagined it to work

@lantiga lantiga removed ci Continuous Integration waiting on author Waiting on user action, correction, or update dependencies Pull requests that update a dependency file dockers package docs Documentation related labels Dec 8, 2024
Copy link
Collaborator

@lantiga lantiga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. We're almost there, I propose we update the strings from

Train dataset: available=yes, size=unknown
Validation dataset: available=yes, size=64
Test dataset: available=no, size=unknown
Prediction dataset: available=yes, size=64

to

Train dataloader: size=NA
Validation dataloader: size=64
Test dataloader: None
Predict dataloader: size=64

summary:

  • dataset -> dataloader
  • Prediction -> Predict
  • remove "available"
  • unknown -> NA
  • "available=no" -> None

what do you think?

Switched name from dataset to dataloader
Switched name Prediction to Predict
removed available keyword and instead write None if not available
Switched from unknown to NA
@MrWhatZitToYaa
Copy link
Contributor Author

Alright, sounds good to me
I made the changes so I assume it complies with your specification now

src/lightning/pytorch/core/datamodule.py Outdated Show resolved Hide resolved
src/lightning/pytorch/core/datamodule.py Outdated Show resolved Hide resolved
@lantiga lantiga merged commit 9709c64 into Lightning-AI:master Dec 11, 2024
74 of 76 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fabric lightning.fabric.Fabric lightningdatamodule pl.LightningDataModule pl Generic label for PyTorch Lightning package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support str(datamodule)
3 participants