FEATURE: More flexible ingestion command #3561
Labels: feature-request, Moderate
The alephclient ingestion tool has a `crawldir` command to ingest every file in a folder, but in many cases I just want to ingest one file at a time. I often add PDFs to a folder where some of them (the old ones) have already been ingested and OCRed but the new ones have not, and it's a waste of CPU resources to repeatedly re-ingest documents that are already in Aleph.
By default, alephclient should skip files whose SHA-1 checksum matches a checksum already in ElasticSearch (isn't that the point of storing the checksum?), with something like a `--force` option for cases where re-running ingestion on those files really is necessary. Alternatively, or additionally, alephclient should allow more precise targeting of specific files so the re-ingestion problem can be avoided.
The alternative of moving files to a random folder in /tmp and ingesting them there isn't ideal, because alephclient picks up some of that file-structure metadata while figuring out what it's ingesting.
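The skip-by-checksum behaviour could be sketched roughly like this. This is only an illustration of the idea, not alephclient's actual code: `files_to_ingest` and the `known_checksums` set are hypothetical names, and how the known checksums would be fetched from Aleph/ElasticSearch is left out.

```python
import hashlib
from pathlib import Path
from typing import Iterator, Set


def sha1_of(path: Path) -> str:
    """Compute the SHA-1 checksum of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_ingest(folder: Path, known_checksums: Set[str],
                    force: bool = False) -> Iterator[Path]:
    """Yield files under folder whose checksum is not already known.

    With force=True (the proposed --force flag), yield everything.
    """
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            if force or sha1_of(path) not in known_checksums:
                yield path
```

With a filter like this, re-running the same crawl over a folder would only touch the files added since the last run, while `--force` restores today's behaviour.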
It would also be nice to be able to define metadata for an ingestion batch by passing information on the command line. Right now I'm not sure how to do that, and such metadata would be useful later for filtering ingested data in a certain way.
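One possible shape for that would be a repeatable `--meta key=value` option. The flag name and the parsing below are assumptions, not part of the current alephclient; this just sketches how such values could be collected into a metadata dict to attach to a batch.

```python
from typing import Dict, List


def parse_meta(pairs: List[str]) -> Dict[str, str]:
    """Turn command-line pairs like ["source=foi", "batch=2024"]
    into a metadata dict {"source": "foi", "batch": "2024"}."""
    meta: Dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        meta[key] = value
    return meta
```

An invocation like `alephclient crawldir --meta source=foi --meta batch=2024 ./docs` (hypothetical) could then store those keys alongside each ingested document, making them available as filters later.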