Small CLI to export data from ExDB #26
Thanks @dmyersturnbull. Can I ask what the inspiration/motivation for this is? Is this just a helper utility for some of your work? Or is this going to be a new step in our workflow?
Thanks! I think this looks good, no issues on the code. And I understand the motivation behind making it very general, so that it is more reusable.
Having said that, to me it looks like the mongoexport CLI from the MongoDB CLI tools does exactly this. Doesn't it? If so, I'd say it'd be preferable to use the off-the-shelf tool. Sorry that I've only thought about this now, but I was initially more focused on a specific chem comp pipeline, thinking that it required something more custom.
@piehld Yeah, it was for the new chemical service ETL workflow, which otherwise doesn't need to talk to dw/exdb. But, as Jose points out, it's not doing more than MongoDB export right now. It might get functionality for incremental loading later.
Cool, thanks for clarifying @dmyersturnbull! I guess I have the same question as @josemduarte now too, on whether the mongoexport CLI can be used. If not, one thing that might help your code is to accept our ExDB configuration file (e.g., with Mongo params here) as a CLI flag, and then use our ConfigUtil to read it and grab the necessary Mongo client information. This config file is what is passed in during production for ExDB loading tasks.
Yeah, I think mongoexport works perfectly for non-incremental exports. For incremental updates, we'll need slightly more logic and so will need some code -- at that point, either PyMongo or mongoexport works (because mongoexport allows
Yeah, Jose and I discussed this. I originally took that approach for consistency with our other Python projects, but I think we should move to just using single URI connection strings (from config files, env vars, or (less securely) from CLI args).
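For illustration, here is a minimal sketch of how a single connection string could be resolved from a CLI argument or an environment variable. The `--uri` flag, the `EXDB_MONGO_URI` variable name, and the "CLI overrides env" precedence are all assumptions made for this example, not something the PR defines; config-file lookup is left out.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence

# Hypothetical environment variable name, used only for this sketch.
URI_ENV_VAR = "EXDB_MONGO_URI"


def resolve_uri(argv: Sequence[str], env: Mapping[str, str] = os.environ) -> str:
    """Return a MongoDB URI from --uri if given, else from the environment."""
    parser = argparse.ArgumentParser(prog="exdb-export")
    # Passing a URI on the command line is the less secure option:
    # it can show up in shell history and process listings.
    parser.add_argument("--uri", default=None, help="MongoDB connection string")
    args, _unknown = parser.parse_known_args(list(argv))

    uri: Optional[str] = args.uri or env.get(URI_ENV_VAR)
    if not uri:
        raise SystemExit(f"Provide --uri or set {URI_ENV_VAR}")
    return uri
```

The resulting string could then be passed straight to `pymongo.MongoClient(uri)`, which accepts a full connection URI.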
Added a new CLI called `exdb-export` with a single subcommand, `export`, which simply writes a MongoDB collection (or subset of fields) to a JSON file.
This allows weekly-update-workflow to get the list of chemical component ids. Getting exactly those would probably make for an overspecialized entry point, so the `export` subcommand takes a collection name and optionally a list of fields.
I kept it self-contained; a config file, `rcsb.db.mongo`, etc. feel unnecessary. The first commit just cleans up `setup.py` slightly.
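As a rough illustration of what "collection name plus optional list of fields" amounts to, here is a sketch of the core export step. It is not the code in this PR: the function names are hypothetical, excluding `_id` by default is an assumption, and `default=str` is a simplification for values such as ObjectId or datetime that `json` cannot serialize directly.

```python
import json
from typing import Any, Dict, Optional, Sequence


def build_projection(fields: Optional[Sequence[str]]) -> Optional[Dict[str, int]]:
    """Return a MongoDB projection for the given fields, or None for all fields."""
    if not fields:
        return None
    projection = {name: 1 for name in fields}
    # MongoDB returns _id by default; exclude it unless it was requested.
    if "_id" not in projection:
        projection["_id"] = 0
    return projection


def export_collection(collection: Any, path: str,
                      fields: Optional[Sequence[str]] = None) -> int:
    """Write documents from `collection` to `path` as a JSON array.

    `collection` is anything with a PyMongo-style find(filter, projection).
    Documents are streamed one at a time. Returns the number written.
    """
    projection = build_projection(fields)
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        out.write("[")
        for doc in collection.find({}, projection):
            if count:
                out.write(",\n")
            # Simplification: stringify anything json cannot encode natively.
            out.write(json.dumps(doc, default=str))
            count += 1
        out.write("]\n")
    return count
```

With PyMongo, `collection` would be something like `MongoClient(uri)[db_name][collection_name]`, where the database and collection names depend on the deployment.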