Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

processor decorator: parse multi-value page-id correctly #921

Closed
bertsky opened this issue Oct 11, 2022 · 4 comments
Closed

processor decorator: parse multi-value page-id correctly #921

bertsky opened this issue Oct 11, 2022 · 4 comments
Assignees

Comments

@bertsky
Copy link
Collaborator

bertsky commented Oct 11, 2022

The spec states that --page-id is both a multi-value option (i.e. comma-separated) and a range option (i.e. ellipsis allowing). Above that, core also implements the // prefix for regex values.

Naturally I would assume that I can combine these possibilities. But comma-separation only seems to work for literals, and regex is only activated for the expression as a whole or not at all. This is too restrictive IMO and should be fixed.

@bertsky
Copy link
Collaborator Author

bertsky commented Oct 11, 2022

Also, I believe it is not correct that generate_range greedily selects the first numerical substring. Page identifiers could be made up of several numbers...

@bertsky
Copy link
Collaborator Author

bertsky commented Nov 10, 2022

Plus, at the very least, the parameter parser should complain if it cannot correctly decode the full expression. But it does not. (For example, in the greedy numerical range case, it does not complain if – as a result of misreading the numerical part that is to be ranged over – start and stop gets to be the same.)

Since this whole thing will likely also be used for page selection on the web API, I suggest addressing these problems thoroughly, and soon.

@bertsky
Copy link
Collaborator Author

bertsky commented Nov 23, 2022

Fixed by #955 – thx!

(It's clear that comma must take precedence over regex interpretation, because XS-IDs cannot contain comma, but perhaps we should also explain the combination in the CLI spec?)

@bertsky bertsky closed this as completed Nov 23, 2022
@kba
Copy link
Member

kba commented Nov 23, 2022

(It's clear that comma must take precedence over regex interpretation, because XS-IDs cannot contain comma, )

I was considering that and first implemented the token splitting with a negative lookbehind for backslash (re.split(r'(?>!\\),')) to allow for escaping commas. But then I thought who would consciously put commas in their identifiers and did a simple split-at-comma and reverted.

but perhaps we should also explain the combination in the CLI spec?

Sure, we could say that the multi-value mechanics do not allow comma in values.

kba added a commit to OCR-D/spec that referenced this issue Nov 23, 2022
kba added a commit to OCR-D/spec that referenced this issue Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants