LMDB traversal cli #301

Merged (11 commits) on Oct 3, 2024

laserkelvin
Collaborator

This PR adds a quality-of-life (QoL) CLI that provides high-level functionality for inspecting LMDB datasets.

  • Adds a matsciml.datasets.lmdb_cli module, which houses a click-based interface with multiple commands that perform various LMDB inspection tasks.
  • Updates pyproject.toml to install lmdb_cli as a "script", so that after installing matsciml you can access the CLI simply by running lmdb_cli on the command line.
  • Adds accompanying documentation.

@laserkelvin added labels on Oct 3, 2024: documentation (Improvements or additions to documentation); ux (User experience, quality of life changes); data (Issues related to data loading, pipelining, etc.)
Collaborator

@melo-gonzo melo-gonzo left a comment

Looks good! Not entirely sure why we would want to use a window size instead of just setting num_samples to something smaller. Made one comment for potential clean-up, but feel free to merge when ready.

Comment on lines 235 to 244
transforms = []
if periodic:
    transforms.append(PeriodicPropertiesTransform(radius, adaptive_cutoff))
if graph_backend:
    transforms.append(PointCloudToGraphTransform(graph_backend))
target_class = (
    BaseLMDBDataset
    if not dataset_type
    else registry.get_dataset_class(dataset_type)
)
Collaborator

Could consolidate this common code block into its own function

Collaborator Author

Addressed with 16da9d8
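
For readers following the thread, the consolidation suggested in the review could look roughly like the sketch below. It is not the contents of 16da9d8. The helper name resolve_transforms_and_class is hypothetical, and the stub classes, the stub registry, and ExampleDataset are stand-ins so the snippet runs without matsciml installed; their constructor signatures are assumed from the call sites in the reviewed lines.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


# Stand-in stubs (assumptions): the real classes come from matsciml, and
# their signatures are inferred only from how the reviewed snippet calls them.
@dataclass
class PeriodicPropertiesTransform:
    radius: float
    adaptive_cutoff: bool


@dataclass
class PointCloudToGraphTransform:
    backend: str


class BaseLMDBDataset:
    pass


class ExampleDataset(BaseLMDBDataset):
    """Hypothetical registered dataset class, for illustration only."""


class _StubRegistry:
    """Minimal stand-in exposing only get_dataset_class, as used in the PR."""

    def __init__(self, classes):
        self._classes = dict(classes)

    def get_dataset_class(self, name: str) -> type:
        return self._classes[name]


registry = _StubRegistry({"ExampleDataset": ExampleDataset})


def resolve_transforms_and_class(
    periodic: bool,
    radius: float,
    adaptive_cutoff: bool,
    graph_backend: Optional[str] = None,
    dataset_type: Optional[str] = None,
) -> Tuple[List[object], type]:
    """Build the transform list and pick the dataset class in one place.

    Hypothetical helper: each CLI command that repeated this block could
    call it instead.
    """
    transforms: List[object] = []
    if periodic:
        transforms.append(PeriodicPropertiesTransform(radius, adaptive_cutoff))
    if graph_backend:
        transforms.append(PointCloudToGraphTransform(graph_backend))
    # Fall back to the generic LMDB dataset when no type is requested.
    target_class = (
        BaseLMDBDataset
        if not dataset_type
        else registry.get_dataset_class(dataset_type)
    )
    return transforms, target_class
```

Each command would then call the helper with its own option values and pass the returned transforms to whichever class comes back.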

@laserkelvin
Collaborator Author

So window size is used by the running average: as you iterate through the dataset, it computes (by default) a running average of properties over the last 10 samples. That's different from just capping the number of samples to go through, because you might want to sweep through the whole dataset and look for outliers.
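
To illustrate the distinction, here is a minimal sketch of a windowed running average using only the standard library. The function name running_average and its window_size parameter are assumptions chosen for illustration; this is not the implementation in lmdb_cli. Only the default of 10 comes from the comment above.

```python
from collections import deque
from typing import Iterable, Iterator


def running_average(values: Iterable[float], window_size: int = 10) -> Iterator[float]:
    """Yield the mean of the most recent `window_size` values seen so far."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    window = deque(maxlen=window_size)
    total = 0.0
    for value in values:
        # When the window is full, the oldest value is about to be evicted,
        # so remove it from the running total first.
        if len(window) == window_size:
            total -= window[0]
        window.append(value)
        total += value
        yield total / len(window)


# Every sample is visited; the window only limits how many recent samples
# contribute to each average, so a spike deep in the data still shows up.
samples = [1.0, 1.0, 1.0, 50.0, 1.0, 1.0]
averages = list(running_average(samples, window_size=3))
```

Capping the number of samples at three would stop before the fourth value and never see the spike, whereas the windowed sweep reaches it and the average jumps once it enters the window.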

@laserkelvin laserkelvin merged commit 25969cc into IntelLabs:main Oct 3, 2024
2 of 3 checks passed
@laserkelvin laserkelvin deleted the lmdb-traversal-cli branch October 3, 2024 18:52
Labels
data (Issues related to data loading, pipelining, etc.); documentation (Improvements or additions to documentation); ux (User experience, quality of life changes)
2 participants