Small CLI to export data from ExDB #26

Open
wants to merge 3 commits into
base: master
from

Conversation

@dmyersturnbull dmyersturnbull commented Feb 22, 2025

Added a new CLI called exdb-export with a single subcommand, export, which simply writes a MongoDB collection (or subset of fields) to a JSON file.

This allows weekly-update-workflow to get the list of chemical component ids. Getting exactly those would probably make for an overspecialized entry point, so the export subcommand takes a collection name and optionally a list of fields.

I kept it self-contained; a config file, rcsb.db.mongo, etc. feel unnecessary. The first commit just cleans up setup.py slightly.
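For readers following along, here is a minimal sketch of what an export step like the one described above could look like. It is not the code in this PR: the function name `export_collection` and its parameters are hypothetical, and `collection` is assumed to be any object with a PyMongo-style `find(filter, projection)` method.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Optional


def export_collection(
    collection: Any,
    out_path: Path,
    fields: Optional[Iterable[str]] = None,
) -> int:
    """Write the documents in `collection` to `out_path` as a JSON array.

    Hypothetical helper for illustration only. Returns the number of
    documents written.
    """
    # An empty filter matches every document. A projection of {field: 1}
    # restricts the output to those fields (MongoDB also returns _id
    # unless it is explicitly excluded).
    projection = {name: 1 for name in fields} if fields else None
    docs = list(collection.find({}, projection))
    with open(out_path, "w", encoding="utf-8") as handle:
        # default=str covers values the json module cannot encode
        # natively, such as ObjectId or datetime.
        json.dump(docs, handle, default=str, indent=2)
    return len(docs)
```

With PyMongo, `collection` could be obtained as `pymongo.MongoClient(uri)[db_name][collection_name]`, where the URI, database name, and collection name are supplied by the caller.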

@piehld
Contributor

piehld commented Feb 24, 2025

Thanks @dmyersturnbull. Can I ask what the inspiration/motivation for this is? Is this just a helper utility for some of your work? Or is this going to be a new step in our workflow?

Member

@josemduarte josemduarte left a comment

Thanks! I think this looks good, no issues on the code. And I understand the motivation behind making it very general, so that it is more reusable.

Having said that, to me it looks like the mongoexport CLI from the MongoDB CLI tools does exactly this. Doesn't it? If so, I'd say it'd be preferable to use the off-the-shelf tool. Sorry that I've only thought about this now, but I was initially more focused on a specific chem comp pipeline, thinking that it required something more custom.
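To illustrate the off-the-shelf route, the sketch below builds a `mongoexport` command line from Python and runs it. The helper names, the example URI, and the collection and field names are hypothetical; the URI is assumed to include the database name. Only the documented `--uri`, `--collection`, `--fields`, `--out`, and `--jsonArray` options are used.

```python
import subprocess
from typing import List, Optional, Sequence


def mongoexport_args(
    uri: str,
    collection: str,
    out_path: str,
    fields: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build the argument list for a whole-collection mongoexport call."""
    args = [
        "mongoexport",
        f"--uri={uri}",  # assumed to name the database, e.g. .../exdb
        f"--collection={collection}",
        f"--out={out_path}",
        "--jsonArray",  # write one JSON array instead of one document per line
    ]
    if fields:
        # --fields takes a comma-separated list of field names.
        args.append("--fields=" + ",".join(fields))
    return args


def run_mongoexport(args: Sequence[str]) -> None:
    """Run mongoexport; raises CalledProcessError on a non-zero exit."""
    subprocess.run(list(args), check=True)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(mongoexport_args(
        "mongodb://localhost:27017/exdb", "chem_comp", "ids.json", ["id"]
    ))
```

Note that a URI passed as a command-line argument is visible to other users on the machine through the process list.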

@dmyersturnbull
Author

> Can I ask what the inspiration/motivation for this is?

@piehld Yeah, it was for the new chemical service ETL workflow, which otherwise doesn't need to talk to dw/exdb. But, as Jose points out, it's not doing more than mongoexport right now. It might get functionality for incremental loading later.

@piehld
Contributor

piehld commented Feb 25, 2025

Cool thanks for clarifying @dmyersturnbull! I guess I have the same question as @josemduarte now too, on whether mongoexport CLI can be used?

If not, one thing that might help your code is to rely on our ExDB configuration file (e.g., with Mongo params here) being passed in as a CLI flag, which you could use our ConfigUtil to read in and grab the necessary Mongo client information from. This config file is what is passed in during production for ExDB loading tasks.

@dmyersturnbull
Author

> Cool thanks for clarifying @dmyersturnbull! I guess I have the same question as @josemduarte now too, on whether mongoexport CLI can be used?

Yeah, I think mongoexport works perfectly for non-incremental exports. Incremental updates need slightly more logic, so we'll need some code. At that point, either PyMongo or mongoexport works (because mongoexport allows --query). I just pushed options for incremental changes, which aren't fully used yet.
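As a rough illustration of the incremental case, the sketch below builds the same "changed since" filter in two forms: a dict for PyMongo's `find`, and a JSON string for `mongoexport --query`. The thread does not say how changes are detected, so the timestamp field name used here (`last_modified`) is purely an assumption, as are the function names. The `--query` form uses the Extended JSON `$date` wrapper for the timestamp.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def pymongo_filter(field: str, since: datetime) -> Dict[str, Any]:
    """Filter for collection.find(): documents where `field` is after `since`."""
    return {field: {"$gt": since}}


def mongoexport_query(field: str, since: datetime) -> str:
    """The same filter as a JSON string suitable for mongoexport --query.

    `since` must be timezone-aware; it is rendered in UTC.
    """
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Dates in --query are written in Extended JSON form: {"$date": "..."}.
    return json.dumps({field: {"$gt": {"$date": stamp}}})


if __name__ == "__main__":
    # Hypothetical field name and cutoff, for illustration only.
    cutoff = datetime(2025, 2, 22, tzinfo=timezone.utc)
    print(pymongo_filter("last_modified", cutoff))
    print(mongoexport_query("last_modified", cutoff))
```

Either form selects the same documents, which is why the choice between PyMongo and mongoexport is mostly a matter of where the surrounding logic should live.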

> If not, one thing that might help your code is to rely on our ExDB configuration file (e.g., with Mongo params here) being passed in as a CLI flag, which you could use our ConfigUtil to read in and grab the necessary Mongo client information from. This config file is what is passed in during production for ExDB loading tasks.

Yeah, Jose and I discussed this. I originally took that approach for consistency with our other Python projects, but I think we should move to just using single URI connection strings (from config files, env vars, or (less securely) from CLI args).
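One way the single-URI approach could be sketched: resolve the connection string from a CLI argument, then an environment variable, then a config file. The precedence order, the environment variable name `EXDB_MONGO_URI`, the JSON config format, and its `mongo_uri` key are all assumptions made for illustration; none of them come from this PR.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

ENV_VAR = "EXDB_MONGO_URI"  # hypothetical variable name
CONFIG_KEY = "mongo_uri"    # hypothetical key in a JSON config file


def resolve_mongo_uri(
    cli_uri: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
    config_path: Optional[Path] = None,
) -> str:
    """Return the first URI found: CLI argument, env var, then config file."""
    if environ is None:
        environ = os.environ
    if cli_uri:
        # Least secure source: arguments can show up in the process list.
        return cli_uri
    env_uri = environ.get(ENV_VAR)
    if env_uri:
        return env_uri
    if config_path is not None and Path(config_path).is_file():
        data = json.loads(Path(config_path).read_text(encoding="utf-8"))
        file_uri = data.get(CONFIG_KEY)
        if file_uri:
            return file_uri
    raise ValueError("no MongoDB URI found in CLI args, environment, or config")
```

The resolved string can then be passed directly to `pymongo.MongoClient(uri)` or to `mongoexport --uri`, so both tools share one piece of configuration.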

3 participants