Rendering ucto output: whitespace inserted/deleted in some views #166

Closed
pirolen opened this issue Dec 8, 2020 · 3 comments

pirolen commented Dec 8, 2020

This issue concerns how FLAT renders ucto output in the document and paragraph views.

Screenshot 1 shows extra whitespace around punctuation marks (docx converted by piereling, then tokenised with ucto in lama).
Screenshot 2 shows the same, plus missing whitespace between some tokens (ABBYY XML converted by FoLiA-abby, then tokenised with ucto in lama).

I also attach the corresponding XML files.

Screen Shot 2020-12-08 at 3 28 56 PM
Screen Shot 2020-12-08 at 3 38 19 PM

pers_verz_test.ucto.folia.xml.txt
b1_2_kap4_pp298-1.png.ucto.folia.xml.txt

proycon (Owner) commented Dec 11, 2020

This is indeed a bug in FLAT. ucto puts a space="no" attribute on a word in the FoLiA when there should be no space between that word and the next one (with respect to the untokenised original). For some reason FLAT interprets it the other way around and removes the space between the word and the previous one instead. I wonder whether this emerges in all scenarios, as I hadn't seen it before. Will investigate and fix!
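To make the difference concrete, here is a minimal Python sketch. It is not FLAT's actual code: tokens are modelled as hypothetical (text, space_after) pairs, where space_after=False stands in for space="no", and both function names are made up for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical token model: (text, space_after).
# space_after=False plays the role of FoLiA's space="no" on that word.
Token = Tuple[str, bool]


def detokenise(tokens: Iterable[Token]) -> str:
    """Intended reading: space="no" suppresses the space AFTER the word."""
    parts: List[str] = []
    for text, space_after in tokens:
        parts.append(text)
        if space_after:
            parts.append(" ")
    return "".join(parts).rstrip()


def detokenise_inverted(tokens: Iterable[Token]) -> str:
    """Inverted reading: space="no" suppresses the space BEFORE the word."""
    parts: List[str] = []
    for text, space_after in tokens:
        if not space_after and parts and parts[-1] == " ":
            parts.pop()  # drops the space with the previous word
        parts.append(text)
        parts.append(" ")  # always emits a space after the word
    return "".join(parts).rstrip()


# "Hello" and "world" carry space="no" because punctuation follows directly.
tokens = [("Hello", False), (",", True), ("world", False), ("!", True)]

print(detokenise(tokens))           # Hello, world!
print(detokenise_inverted(tokens))  # Hello ,world !
```

The inverted reading yields both symptoms at once: a stray space before each punctuation mark and a missing space between tokens.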

proycon (Owner) commented Dec 11, 2020

It seems this only emerges with other text classes, so it is probably a problem that I accidentally introduced when I fixed #139 recently.

proycon (Owner) commented Dec 11, 2020

@pirolen: By the way, the first document you provided does not exhibit the same problem as the second. In the first document these spaces are actually encoded that way in the FoLiA document itself. This may be caused by either pandoc (docx->rst) or rst2folia.
