alto to text: too many spaces #129

Closed
jbarth-ubhd opened this issue Jan 22, 2021 · 7 comments

@jbarth-ubhd

Example alto excerpt:

<TextLine><String CONTENT="Wappen:"/><SP/><String CONTENT="Heimstatt;"/><SP/><String CONTENT="Heimstatt,">... ...

converts to text

Wappen:␣␣Heimstatt;␣␣Heimstatt,␣␣Neipperg,␣␣Gemmingen ... ...
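
For comparison, here is a minimal Python sketch of the output I'd expect: each `<String>` contributes its `CONTENT` and each `<SP/>` contributes exactly one space. This is not the XSLT that the converter actually uses; the names `ALTO_LINE`, `local_name` and `text_line_to_text` are made up for illustration, and the excerpt is trimmed and closed by hand so that it parses.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-closed version of the excerpt above (assumption: the real
# file continues with more <String>/<SP/> pairs and declares an ALTO namespace).
ALTO_LINE = (
    "<TextLine>"
    '<String CONTENT="Wappen:"/><SP/>'
    '<String CONTENT="Heimstatt;"/><SP/>'
    '<String CONTENT="Heimstatt,"/>'
    "</TextLine>"
)


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def text_line_to_text(text_line):
    """Join the children of one TextLine: String -> CONTENT, SP -> one space."""
    parts = []
    for child in text_line:
        name = local_name(child.tag)
        if name == "String":
            parts.append(child.get("CONTENT", ""))
        elif name == "SP":
            parts.append(" ")
    return "".join(parts)


line = ET.fromstring(ALTO_LINE)
result = text_line_to_text(line)
print(result)  # Wappen: Heimstatt; Heimstatt,
```

With this mapping there is a single space between words, which is what the converted text should look like.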
@kba
Collaborator

kba commented Jan 22, 2021

You mean every <SP/> results in two spaces, not one?

@jbarth-ubhd
Author

I don't know exactly where the two spaces come from, but there should be only one, I'd say.

@kba
Collaborator

kba commented Jan 22, 2021

The ALTO-to-text transformation uses @filak's XSLT (https://github.com/filak/hOCR-to-ALTO/blob/master/alto__text.xsl), so this needs to be fixed upstream. Can you open an issue there as well, please?

@jbarth-ubhd
Author

opened: filak/hOCR-to-ALTO#22

@kba
Collaborator

kba commented Jan 25, 2021

fix: #130

@stweil
Member

stweil commented Jan 23, 2022

So is this issue fixed and can it be closed?

@kba
Collaborator

kba commented Jan 24, 2022

> So is this issue fixed and can it be closed?

Yes, it has been fixed. If you have an older installation of ocr-fileformat (from before February 2021), you'll need to re-clone hOCR-to-ALTO:

rm -rf vendor/hOCR-to-ALTO
make vendor install

(we should really use git submodules to make tracking changes and updating easier)

@stweil stweil closed this as completed Jan 24, 2022