feature_request(ebooks): kill gremlin characters #46

Open
Kristinita opened this issue May 17, 2020 · 4 comments

Comments

@Kristinita

1. Summary

It would be nice if it were possible to search text in ebooks that contain gremlin characters.

2. Gremlins

2.1. Definition

Gremlins are invisible non-printable characters that prevent text in ebooks from being searched. They turn up in books with poor-quality OCR.

2.2. Gremlins example

A paragraph of text from page 10 of the book “The Enigma of Reason”:

Okular text

They drink and piss, eat and shit. They sleep and snore. They sweat and
shiver. They lust. They mate. Their births and deaths are messy affairs. Ani-
mals, ­humans are animals! Ah, but ­humans, and ­humans alone, are endowed
with reason. Reason sets them apart, high above other creatures—or so
Western phi­los­o­phers have claimed.

The same text in a service that shows non-printable characters:

soscisurvey.de

For example, we can't find the word “philosophers” in this ebook, because there are 3 gremlins inside it:

Philosophers

It would be nice if ripgrep-all users could find the word “philosophers” in this book.
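To make the gremlins in the example visible without an online service, here is a minimal Python sketch. It assumes the gremlins are Unicode format characters (general category `Cf`), which is true for the soft hyphen U+00AD; the function name `find_gremlins` is made up for illustration and is not part of ripgrep-all.

```python
import unicodedata


def find_gremlins(text):
    """Return (index, code point, name) for every format (Cf) character in text."""
    found = []
    for i, ch in enumerate(text):
        # Assumption: "gremlin" means Unicode general category Cf
        # (soft hyphen, zero-width space, and similar invisible characters).
        if unicodedata.category(ch) == "Cf":
            found.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return found


# The word as it is stored in the ebook: three soft hyphens inside it.
word = "phi\u00adlos\u00ado\u00adphers"
for hit in find_gremlins(word):
    print(hit)
```

Run against the word from the example, this should report three soft hyphens, which is why a plain search for “philosophers” fails.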

2.3. pdftotext gremlins

pdftotext KiraTheEnigmaOfReason.pdf

Paragraph in KiraTheEnigmaOfReason.txt:

They drink and piss, eat and shit. They sleep and snore. They sweat and
shiver. They lust. They mate. Their births and deaths are messy affairs. Animals, ­humans are animals! Ah, but ­humans, and ­humans alone, are endowed
with reason. Reason sets them apart, high above other creatures—or so
Western phi­los­o­phers have claimed.

pdftotext philosophers

pdftotext does not delete the gremlins.
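One possible way to get searchable text is to post-process the pdftotext output and drop the gremlins before searching. The sketch below is only an illustration of that idea, not something ripgrep-all does today; `strip_gremlins` is a hypothetical helper, and it assumes that removing every Unicode format character (category `Cf`) is acceptable for the text at hand.

```python
import unicodedata


def strip_gremlins(text):
    """Remove Unicode format characters (category Cf) and keep everything else.

    Only Cf is removed on purpose: the wider category C also contains
    control characters such as newline and tab, which must survive.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Last line of the paragraph as extracted by pdftotext.
line = "Western phi\u00adlos\u00ado\u00adphers have claimed.\n"
print(strip_gremlins(line), end="")
```

Note that `Cf` also covers characters such as the zero-width joiner and directional marks, which carry meaning in some scripts, so a real implementation might need a narrower list.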

3. Additional links

  1. Hunting gremlin characters
  2. Removing non-printable “gremlin” chars from text files

4. Environment

  1. Windows 10.0.18363 Pro N for Workstations 64-bit EN
  2. ripgrep-all 0.9.3 (currently, the latest Windows version)
  3. pdftotext (from conda-forge Poppler) 0.88.0
  4. Okular 1.10.70

Thanks.

@phiresky
Owner

You can somewhat work around this by searching for \p{C}* between each character

rga 'p\p{C}*h\p{C}*i\p{C}*l\p{C}*o\p{C}*s\p{C}*o\p{C}*p\p{C}*h\p{C}*e\p{C}*r\p{C}*s'

finds it

("philosophers".split("").join("\\p{C}*"))
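For anyone who wants to generate such patterns from a script, the one-liner above can be written in Python roughly as follows. This is a sketch: `gremlin_tolerant_pattern` is a made-up name, and the function assumes the search word consists only of letters and digits, since it does no regex escaping.

```python
def gremlin_tolerant_pattern(word):
    """Build a regex string that allows any run of \\p{C} characters between letters.

    The result is meant to be passed to rga; Python's built-in `re` module
    does not understand \\p{C}, so the pattern is not compiled here.
    """
    if not word.isalnum():
        # Assumption: plain words only; regex metacharacters are not escaped.
        raise ValueError("expected a word made of letters and digits only")
    return r"\p{C}*".join(word)


pattern = gremlin_tolerant_pattern("philosophers")
print(f"rga '{pattern}'")
```

The printed command is the same as the hand-written one above.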

@Kristinita
Author

Type: Reply 💬

@phiresky:

1. Summary

I think this is impractical for everyday use.

2. Argumentation

2.1. Common cause

Automation. Saving users time.

2.2. Details

I don’t think you are seriously offering this to ripgrep-all users.

Active ripgrep-all users would have to spend a lot of time typing the additional symbols. It would be nice to have an automatic method for solving this problem.

3. Note

(Also, your method didn't work for my example, but that is not important.)

Thanks.

@Kristinita
Author

Type: Addition

1. Expected behavior

Possibly, the expected behavior in most cases is to remove this character combination:

    \u00ADCRLF

In the OCR layer of some old books, the \u00AD gremlin character acts as a soft hyphen at the end of a line.

2. Example

2.1. Data

Part of page 66 (page 69 in PDF readers) from this book (pdftotext -layout):

правили в камеру, а фюрер остался с Рудоль­
фом Гессом обсуждать список заговорщиков.
Речь ш ла о расстрелах виновных. Прокуро­
ром и судьей был сам Гитлер. “Гесс страстно
сраж ался за каждое имя, его не останавлива­
ли даже самые яростные приступы гнева Гит­
лера. Их [заговорщиков] было много, но н и к­
то никогда не узнает, скольким из них он
спас ж изнь”.

2.2. Screenshots

  1. Okular:

    Hitler Ocular

  2. soscisurvey.de

    soscisurvey.de

2.3. Expected behavior

It would be nice to replace “Гит\u00ADCRLFлера” → “Гитлера”, so that ripgrep-all can find the word “Гитлера” on this page of this book.
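The replacement described here could be sketched in Python as below. This is only an illustration of the requested behavior under stated assumptions: `join_soft_hyphenated` is a hypothetical name, the line ending may be CRLF or a bare LF, and the soft hyphen is assumed to sit directly before the line break with no trailing spaces.

```python
import re

# Soft hyphen (U+00AD) immediately followed by CRLF or LF.
SOFT_HYPHEN_BREAK = re.compile("\u00ad\r?\n")


def join_soft_hyphenated(text):
    """Rejoin words that were split by a soft hyphen at the end of a line."""
    return SOFT_HYPHEN_BREAK.sub("", text)


sample = "приступы гнева Гит\u00ad\r\nлера. Их"
print(join_soft_hyphenated(sample))
```

With `pdftotext -layout` the continuation line may start with indentation, which this sketch does not remove.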

3. Other examples

Other books where “\u00ADCRLF” is used as a soft hyphen:

  1. 1
  2. 2
  3. 3

Thanks.

@m040601

m040601 commented Sep 11, 2020

Gremlins are invisible non-printable characters that prevent text in ebooks from being searched

Interesting discussion. I never knew they were called that.

They turn up in books with poor-quality OCR.

I am a heavy user of ripgrep on large collections of PDF and EPUB ebooks, some of them in CJK languages with even weirder UTF-16 stuff. I had never thought about this, but now it makes sense. I had always wondered how to "grep" for these things to clean them up.
I will keep this in mind from now on when using ripgrep-all.

@Kristinita:

I want to learn more about this. Could you recommend other similar tips and web pages, specifically about these issues in converted PDFs/EPUBs and how to clean them up?

I'm not a programmer, but I am very comfortable with the command line, shell scripts and Unix power tools.

I tried to google "gremlins characters pdf epub ebook",
but it only gives me back the movie or the book "Gremlins".
