-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add alexandria dataset to matsciml toolkit https://github.com/IntelLabs/matsciml/discussions/107 #132
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JonathanSchmidt1 for the great PR!!
For the most part things look good and functional to me; I did leave a few style remarks, and wanted to see if you could add a PyG test as well. If you don't normally use that pipeline, don't worry about it and we can add it later.
|
||
|
||
@registry.register_dataset("AlexandriaDataset") | ||
class AlexandriaDataset(PointCloudDataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More a comment for discussion with @melo-gonzo but the fact that @JonathanSchmidt1 had to copy over a bunch of functions here means we could do better on the abstraction: what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parse_structure/parse_symmetry functions from the materials project class seems quite general. Maybe it makes sense to put them in some utils file instead and import them in the respective datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, a lot of this can be refactored into some sort of utils file - it would make all the datasets much cleaner. I can draft up an issue for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! It looks pretty good overall, I've left a few initial comments. In addition to what was noted, can you add some details to the DATASETS.md as well? It's a bit redundant but for consistency it will be nice to have some details in there as well.
LOL the timing on our reviews |
…each dataset. Also removed 1 atom structures from devset due to dgl errors and added max_atom parameter for download to avoid oom errors during training
Thank you for both of the reviews. I will go through them today and tomorrow. btw maybe it's an issue for the materials project dataset also but for my dataset the check for precomputed src and dst nodes lead to there only being self loops in the end
instead of
with the check commented out |
ok the first question solved itself. Now I am just wondering about the edge case if you do not use the PeriodicPropertiesTransform
but maybe I am missunderstanding the edge indices |
Test do not run without PeriodicPropertiesTransform, Added units to README, changed num_atoms to 100, fixed __init__ that was changed by precommit hooks,
…xandria_api merge updates
To be clear (I don't know if this is a misunderstanding or not so forgive me) the function If you want to understand the graph edge construction, there are basically two implemented paths now: the Sorry about the late reply - it was a crazy week last week. Hopefully my response answers your questions, and please let us know if and when the code is ready for re-review? |
Thank you very much for the explanation.
After the update the PointCloudToGraphTransform does not do anything if return_dict.update(**chosen_nodes) already exists
returning "wrong edges" that were actually part of the point_cloud_featurization. If you run:
and add some print statement for the precomputed check you will find that
Sorry I will have to find another few hours to finish it. maybe this weekend |
…_alexandria into alexandria_api
@laserkelvin One question concerning the test that I did not completely understand is still open. The rest should be ready for rereview. |
I'm out of office for the rest of the week and I'll review it once I get back. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think just some minor comments, and otherwise everything looks good and we can merge afterwards!
matsciml/datasets/alexandria/api.py
Outdated
self.get_data_dict(entry) | ||
for entry in tqdm(data["entries"], desc=f"Processing file {index}") | ||
] | ||
if self.devset: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to move this up so you don't run all of the requests for devsets? That way you don't use too much server load if you don't need to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can move it up one step so we run self.get_data_dict only for the 100 entries. Right now we only have the database split up into the 100k structure jsons so one has to download at least one of them (and it also will only be one of them for the devset so that should not be an issue for the server).
def devset(cls, lmdb_target_dir: str) -> AlexandriaRequest:
return cls([0], lmdb_target_dir, dataset="3D", devset=True)
Thanks for the corrections :) |
…added warning for large cutoff radii for 1D and 2D
I implemented the suggestions.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more minor updates from my end, and then will be good to approve!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just highlighted the unused import flagged by pre-commit
, but other than that I'm giving my approval for merging.
I think we're good to go - I looked into the failing tests and nothing is relevant to this PR. Thank you @JonathanSchmidt1 for all your hard work on this PR! |
This pull request adds the AlexandriaRequest class to download the Alexandria database to an lmdb dataset and a AlexandriaDataset class to use it for ML within the matsciml toolkit.
The AlexandriaDataset class mostly follows the Materials project example.
I also added most of the same tests.
At the moment there are not test for the AlexandriaRequest class.