feature_request(ebooks): kill gremlin characters #46
Comments
You can somewhat work around this by searching for
finds it ( |
Type: Reply 💬
1. Summary
I think this is impractical.
2. Argumentation
2.1. Common cause
Automation. Saving users time.
2.2. Details
I don’t think you are seriously suggesting this to ripgrep-all users. Active ripgrep-all users would have to spend a lot of time typing additional symbols. It would be nice to have an automatic method for solving this problem.
3. Note
(Your method also didn’t work for my example, but that is not important.)
Thanks.
Type: Addition ➕
1. Expected behavior
Possibly the expected behavior in most cases is to remove this character combination:
For the OCR layer of some old books
2. Example
2.1. Data
Part of page 66 (page 69 in PDF readers) from this book (
2.2. Screenshots
2.3. Expected behavior
It would be nice to replace “Гит
3. Other examples
Other books, where “
Thanks.
Interesting discussion. I never knew these characters had a name.
I am a heavy user of ripgrep on large collections of PDF and EPUB ebooks, some of them in CJK languages with even weirder UTF-16 stuff. I had never thought about this, but now it makes sense. I had always wondered how to "grep" for these things to clean them up. I want to learn more about this. Could you recommend similar tips and web pages, specifically about these issues in converted PDFs/EPUBs and how to clean them up? I'm not a programmer, but I am very comfortable with the command line, shell scripts and Unix power tools. I tried to google for "gremlins characters pdf epub ebook".
1. Summary
It would be nice if it were possible to search text in ebooks that contain gremlin characters.
2. Gremlins
2.1. Definition
Gremlins are invisible, non-printable characters that prevent text in ebooks from being searched. They turn up in books with poor-quality OCR.
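To make the definition concrete, here is a minimal Python sketch of how such characters could be detected. It rests on an assumption that is not stated in this issue: that a gremlin is any character in the Unicode categories Cf (format) or Cc (control), other than ordinary whitespace. The names `is_gremlin` and `find_gremlins` and the sample string are hypothetical.

```python
import unicodedata

# Assumption: ordinary whitespace controls are legitimate and are kept.
# "\f" is included because pdftotext separates pages with form feeds.
KEEP = {"\n", "\r", "\t", "\f"}


def is_gremlin(ch):
    """Return True for invisible format/control characters (assumed definition)."""
    if ch in KEEP:
        return False
    return unicodedata.category(ch) in ("Cf", "Cc")


def find_gremlins(text):
    """Return (index, code point) pairs for every gremlin found in text."""
    return [(i, "U+%04X" % ord(ch)) for i, ch in enumerate(text) if is_gremlin(ch)]


# Hypothetical sample: a word with two soft hyphens and one zero-width space.
sample = "phi\u00adlo\u200bso\u00adphers"
print(find_gremlins(sample))
```

Reporting the index and code point rather than deleting anything makes it possible to see which characters a given book actually contains before deciding what to strip.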
2.2. Gremlins example
Paragraph of text from page 10 of “The Enigma of Reason” book:
The same paragraph in a service that shows non-printable characters:
For example, we can't find the word philosophers in this ebook, because there are 3 gremlins inside it:
It would be nice if ripgrep-all users could find the word philosophers in this book.
2.3. pdftotext gremlins
Paragraph in KiraTheEnigmaOfReason.txt:
pdftotext does not delete gremlins.
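One possible approach is to post-process the extracted text before searching it. The Python sketch below is not how ripgrep-all or pdftotext work. It assumes that the gremlins belong to the Unicode categories Cf or Cc, and both the `strip_gremlins` helper and the sample line (with soft hyphens standing in for the real gremlins) are hypothetical.

```python
import unicodedata

# Whitespace controls that should survive cleaning; "\f" marks page breaks
# in pdftotext output.
WHITESPACE = "\n\r\t\f"


def strip_gremlins(text):
    """Drop format (Cf) and control (Cc) characters, keeping normal whitespace."""
    return "".join(
        ch
        for ch in text
        if ch in WHITESPACE or unicodedata.category(ch) not in ("Cf", "Cc")
    )


# Hypothetical stand-in for one line of pdftotext output.
raw = "Most phi\u00adlo\u00adso\u00adphers agree."
clean = strip_gremlins(raw)

print("philosophers" in raw)    # the gremlins break the match
print("philosophers" in clean)  # the cleaned text matches
```

A caveat on this design: category Cf also contains characters that carry meaning in some scripts, such as the zero-width joiner and non-joiner, so a blanket removal may not be appropriate for every book.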
3. Additional links
4. Environment
Thanks.