add image preprocessing steps #159

Open
bertsky opened this issue Jun 10, 2020 · 5 comments

@bertsky
Collaborator

bertsky commented Jun 10, 2020

IMO there is a large, still unmet demand in OCR-D for image preprocessing tools to

  1. color-normalize raw images (e.g. linear or non-linear contrast stretching, gamma correction)
  2. denoise raw images (i.e. luminance/grayscale or color denoising before binarization)
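
To make the normalization step concrete, here is a minimal sketch of linear contrast stretching and gamma correction on 8-bit grayscale values, in plain Python. The function names, the percentile defaults and the sample pixel values are illustrative assumptions, not taken from any existing OCR-D processor.

```python
def stretch_contrast(pixels, low_pct=1.0, high_pct=99.0):
    """Linearly map the [low_pct, high_pct] percentile range to [0, 255]."""
    ordered = sorted(pixels)
    n = len(ordered)
    lo = ordered[int((n - 1) * low_pct / 100.0)]
    hi = ordered[int((n - 1) * high_pct / 100.0)]
    if hi == lo:
        return list(pixels)  # flat image: nothing to stretch
    out = []
    for p in pixels:
        v = (p - lo) * 255.0 / (hi - lo)
        # clip values that fall outside the chosen percentile range
        out.append(int(round(min(255.0, max(0.0, v)))))
    return out


def gamma_correct(pixels, gamma):
    """Non-linear mapping: out = 255 * (in / 255) ** gamma."""
    return [int(round(255.0 * (p / 255.0) ** gamma)) for p in pixels]


# A faded scan that only uses the middle of the 8-bit range (made-up values).
faded = [90, 100, 110, 120, 130, 140, 150, 160]
stretched = stretch_contrast(faded, low_pct=0.0, high_pct=100.0)
brightened = gamma_correct(stretched, gamma=0.5)
```

Using percentiles instead of the absolute minimum and maximum keeps a few outlier pixels from dictating the mapping; gamma below 1 brightens mid-tones, gamma above 1 darkens them.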

Most binarization algorithms depend on this. For example, Sauvola (unless it exposes the R parameter and one can estimate a good fit from the image dynamics) assumes full dynamic range.
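
A small sketch of why this matters, assuming the usual Sauvola formula T = m * (1 + k * (s / R - 1)) with local mean m, local standard deviation s, and R = 128 as the common default for 8-bit images. The window contents and the fitted value R = 15 are made up for illustration.

```python
import statistics


def sauvola_threshold(window, k=0.2, r=128.0):
    """Sauvola threshold for one local window of grayscale values."""
    m = statistics.mean(window)
    s = statistics.pstdev(window)
    return m * (1.0 + k * (s / r - 1.0))


INK, PAPER = 100, 130  # low-contrast scan: only 30 gray levels apart
window = [PAPER] * 6 + [INK] * 3

# With the default R, the threshold drops below the ink value,
# so the ink pixels are classified as background.
t_default = sauvola_threshold(window)

# With R fitted to the actual dynamics, the threshold separates ink and paper.
t_fitted = sauvola_threshold(window, r=15.0)

# Alternatively, stretch the window to full range first and keep the default R.
stretched = [255] * 6 + [0] * 3
t_stretched = sauvola_threshold(stretched)
```

Either route works in this toy case: expose R and estimate it per image, or normalize first so that the default R fits.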

So how about adding the following:

  • in the METS specs, new fileGrp/@USE name recommendations OCR-D-IMG-NORM and OCR-D-IMG-RAWDEN
  • in the PAGE specs, new AlternativeImage/@comments classes normalized and raw-denoised
  • in the ocrd-tool schema, new tool/steps enum types preprocessing/optimization/normalization (which is different from grayscale_normalization) and preprocessing/optimization/raw-denoising (which is different from binary despeckling)
@bertsky
Collaborator Author

bertsky commented Jul 15, 2020

> * in the METS specs, new `fileGrp/@USE` name recommendations `OCR-D-IMG-NORM` and `OCR-D-IMG-RAWDEN`

should now read: OCR-D-PRE-NORM and OCR-D-PRE-RAWDEN

> * in the PAGE specs, new `AlternativeImage/@comments` classes `normalized` and `raw-denoised`

Instead of introducing the term raw denoising, we could also differentiate despeckling (after binarization) and denoising (before binarization)...

@kba
Member

kba commented Jul 15, 2020

> should now read: OCR-D-PRE-NORM and OCR-D-PRE-RAWDEN

👍

> Instead of introducing the term raw denoising, we could also differentiate despeckling (after binarization) and denoising (before binarization)...

IMHO "raw denoising" is clearer than distinguishing despeckling/denoising. Then again, our glossary currently defines despeckling as

> Remove artifacts such as smudges, ink blots, underlinings etc. from an image. Typically applied to remove “salt-and-pepper” noise resulting from Binarization.

And "denoise" is not introduced at all. So, we're free to define it as you proposed. @EEngl52 any objection?

@bertsky
Collaborator Author

bertsky commented Jul 15, 2020

> Then again, our glossary currently defines despeckling as
>
> > Remove artifacts such as smudges, ink blots, underlinings etc. from an image. Typically applied to remove “salt-and-pepper” noise resulting from Binarization.

Oh, but these physical artifacts cannot be reliably removed after binarization IMHO. You need special detectors on raw colors. So if that's the term OCR-D (or the OCR community in general) has agreed upon, let's stick to that, and not project any other interpretation. In that sense I think we still have no despeckling processors yet.

> And "denoise" is not introduced at all.

Then let's define it! Let's also differentiate between raw and bilevel denoising.
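
As a rough illustration of that distinction, here is a sketch in plain Python: raw denoising as a 3x3 median filter on grayscale values, bilevel denoising as removal of isolated foreground pixels after binarization. Both functions and the tiny sample images are hypothetical; real processors would use proper image libraries and more robust criteria.

```python
import statistics


def neighbourhood(img, y, x):
    """Values in the 3x3 window around (y, x), clipped at the image border."""
    h, w = len(img), len(img[0])
    return [img[j][i]
            for j in range(max(0, y - 1), min(h, y + 2))
            for i in range(max(0, x - 1), min(w, x + 2))]


def denoise_raw(gray):
    """Raw denoising: 3x3 median filter on grayscale values."""
    return [[int(statistics.median(neighbourhood(gray, y, x)))
             for x in range(len(gray[0]))]
            for y in range(len(gray))]


def denoise_bilevel(binary):
    """Bilevel denoising: drop ink pixels (1) with no ink among their neighbours."""
    out = [row[:] for row in binary]
    for y, row in enumerate(binary):
        for x, v in enumerate(row):
            # the window sum includes the pixel itself, so 1 means "isolated"
            if v == 1 and sum(neighbourhood(binary, y, x)) == 1:
                out[y][x] = 0
    return out


gray = [[200, 200, 200],
        [200, 20, 200],   # single dark outlier in the raw image
        [200, 200, 200]]
binary = [[0, 0, 0, 0],
          [0, 1, 0, 0],   # isolated speck
          [0, 0, 0, 1],   # part of a two-pixel stroke
          [0, 0, 0, 1]]

gray_clean = denoise_raw(gray)
binary_clean = denoise_bilevel(binary)
```

The raw variant works on intensities and can run before binarization; the bilevel variant only makes sense once pixels are already ink or paper.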

@EEngl52

EEngl52 commented Jul 15, 2020

IMO we could differentiate denoising/despeckling. But then the processors should be named accordingly. I would find it quite confusing to use a processor called denoising in a workflow step called despeckling. So it would probably be easier to go with @bertsky's last suggestion on raw and bilevel denoising, and to actually define denoising in the glossary.

@bertsky
Collaborator Author

bertsky commented Jul 15, 2020

> But then the processors should be named accordingly. I would find it quite confusing to use a processor called denoising in a workflow step called despeckling.

Absolutely. Since despeckling was all we had, the current denoising processors all use that (in @comments and tool json):

  • ocrd_cis: ocrd-cis-ocropy-denoise, ocrd-cis-ocropy-binarize
  • ocrd_wrap: ocrd-skimage-denoise, ocrd-skimage-denoise-raw

We should open respective issues in those repos, and in the workflow guide of course.

> And "denoise" is not introduced at all.

Then let's define it! Let's also differentiate between raw and bilevel denoising.

So how about:

  • in the PAGE specs, new AlternativeImage/@comments classes normalized and denoised
    (IMO there's no need for a raw-denoised, since we now require ordering anyway, so we should see things like denoised,binarized,denoised)
  • in the ocrd-tool schema, new tool/steps enum types preprocessing/optimization/normalization (which is different from grayscale_normalization), preprocessing/optimization/raw-denoising (which is different from despeckling) and preprocessing/optimization/binary-denoising
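
The ordering argument can be sketched as follows: given a comma-separated `@comments` value, a `denoised` entry before the first `binarized` is raw denoising, and one after it is binary denoising. The function below is a hypothetical helper in plain Python, and the labels are the ones proposed in this thread, not released spec values.

```python
def classify_denoising(comments):
    """Label each 'denoised' entry by its position relative to 'binarized'."""
    steps = [s.strip() for s in comments.split(",") if s.strip()]
    labelled = []
    binarized = False
    for step in steps:
        if step == "binarized":
            binarized = True
            labelled.append(step)
        elif step == "denoised":
            labelled.append("binary-denoising" if binarized else "raw-denoising")
        else:
            labelled.append(step)  # other features pass through unchanged
    return labelled


labels = classify_denoising("denoised,binarized,denoised")
```

This only works if producers append features in processing order, which is the requirement the proposal relies on.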
