FEATURE: More flexible ingestion command #3561
Labels: feature-request, Moderate
The alephclient ingestion tool has a `crawldir` command to ingest every file in a folder, but in many cases I just want to ingest one file at a time. I often add PDFs to a folder where some of them (the old ones) have already been ingested and OCRed but the new ones have not, and it's a waste of CPU resources to repeatedly re-ingest documents that are already in Aleph.
By default, alephclient should skip files whose SHA-1 checksum matches a checksum already in ElasticSearch (isn't that the point of storing the checksum?), with something like a `--force` option for cases where re-running ingestion on those files really is necessary. Alternatively, or additionally, alephclient should allow more precise targeting of specific files so the re-ingestion problem can be avoided.
The alternative of moving files to a random folder in /tmp and ingesting them there isn't ideal, because alephclient picks up some of that file-structure metadata while figuring out what it's ingesting.
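The skip-by-checksum behaviour could be sketched roughly like this. This is only an illustration of the idea, not alephclient's actual code: `files_to_ingest` and the `known_checksums` set are hypothetical names, and how the known checksums would be fetched from Aleph/ElasticSearch is left out.

```python
import hashlib
from pathlib import Path
from typing import Iterator, Set


def sha1_of(path: Path) -> str:
    """Compute the SHA-1 checksum of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_ingest(folder: Path, known_checksums: Set[str],
                    force: bool = False) -> Iterator[Path]:
    """Yield files under folder whose checksum is not already known.

    With force=True (the proposed --force flag), yield everything.
    """
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            if force or sha1_of(path) not in known_checksums:
                yield path
```

With a filter like this, re-running the same crawl over a folder would only touch the files added since the last run, while `--force` restores today's behaviour.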
It would also be nice to be able to define metadata for an ingestion batch by passing information on the command line. Right now I'm not sure how to do that, and such metadata would be useful later for filtering ingested data in a certain way.
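One possible shape for that would be a repeatable `--meta key=value` option. The flag name and the parsing below are assumptions, not part of the current alephclient; this just sketches how such values could be collected into a metadata dict to attach to a batch.

```python
from typing import Dict, List


def parse_meta(pairs: List[str]) -> Dict[str, str]:
    """Turn command-line pairs like ["source=foi", "batch=2024"]
    into a metadata dict {"source": "foi", "batch": "2024"}."""
    meta: Dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        meta[key] = value
    return meta
```

An invocation like `alephclient crawldir --meta source=foi --meta batch=2024 ./docs` (hypothetical) could then store those keys alongside each ingested document, making them available as filters later.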