processor decorator: parse multi-value page-id correctly #921

bertsky · 2022-10-11T12:42:16Z

The spec states that --page-id is both a multi-value option (i.e. comma-separated) and a range option (i.e. allowing an ellipsis). On top of that, core also implements the // prefix for regex values.

Naturally I would assume that I can combine these possibilities. But comma-separation only seems to work for literals, and regex is only activated for the expression as a whole or not at all. This is too restrictive IMO and should be fixed.
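
To make the expected behaviour concrete, here is a minimal sketch of per-token interpretation. It is not the code in core: `match_page_ids` and the sample IDs are hypothetical, `..` is assumed to be the range operator, and ranges are resolved by position in the list of known IDs purely to keep the example short.

```python
import re

def match_page_ids(expression, available_ids):
    """Select IDs from available_ids using a comma-separated expression.

    Each token is interpreted on its own: '//' prefix = regex,
    '..' = range, anything else = literal. (Hypothetical helper.)
    """
    selected = []
    for token in expression.split(','):
        if token.startswith('//'):
            # regex token: must match the whole identifier
            pattern = re.compile(token[2:])
            matches = [i for i in available_ids if pattern.fullmatch(i)]
        elif '..' in token:
            # range token: simplified to a slice of the known IDs
            start, stop = token.split('..', 1)
            lo, hi = available_ids.index(start), available_ids.index(stop)
            matches = available_ids[lo:hi + 1]
        else:
            # literal token
            matches = [i for i in available_ids if i == token]
        if not matches:
            raise ValueError(f"token {token!r} matched no page")
        for m in matches:
            if m not in selected:
                selected.append(m)
    return selected

ids = ['PHYS_0001', 'PHYS_0002', 'PHYS_0003', 'PHYS_0004', 'COVER']
print(match_page_ids('PHYS_0001..PHYS_0002,//COV.*,PHYS_0004', ids))
```

With this shape, literals, ranges and regexes can be mixed freely in one expression, which is the combination asked for above.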

bertsky · 2022-10-11T12:50:01Z

Also, I believe it is not correct that generate_range greedily selects the first numerical substring. Page identifiers could be made up of several numbers...
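
The pitfall can be illustrated with a small sketch. This is not the actual generate_range code, only a hypothetical function (`greedy_first_number_range`) that mimics the described behaviour of picking the first numerical substring; the identifiers are made up.

```python
import re

def greedy_first_number_range(start, stop):
    """Range over the FIRST number found in each identifier (illustration only)."""
    m_start = re.search(r'\d+', start)
    m_stop = re.search(r'\d+', stop)
    lo, hi = int(m_start.group()), int(m_stop.group())
    width = len(m_start.group())
    prefix, suffix = start[:m_start.start()], start[m_start.end():]
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(lo, hi + 1)]

# Works when the identifier contains a single number:
print(greedy_first_number_range('PHYS_0001', 'PHYS_0003'))
# Silently yields one page when an earlier number is identical in both ends:
print(greedy_first_number_range('vol2_page0008', 'vol2_page0010'))
```

In the second call the function ranges over the volume number (2 to 2) instead of the page number, so only the start page comes back.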

bertsky · 2022-11-10T19:22:58Z

Plus, at the very least, the parameter parser should complain if it cannot correctly decode the full expression. But it does not. (For example, in the greedy numerical range case, it does not complain if, as a result of misreading the numerical part that is to be ranged over, start and stop end up being the same.)
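
One possible way to get such complaints, sketched under assumptions: split both ends into numeric and non-numeric parts, require that exactly one numeric part differs, and raise otherwise. `expand_range` is a hypothetical name, and this is not necessarily how the eventual fix works.

```python
import re

def expand_range(start, stop):
    """Expand start..stop by varying the single numeric part that differs."""
    # With a capturing group, odd indices of the result hold the digit runs.
    parts_a = re.split(r'(\d+)', start)
    parts_b = re.split(r'(\d+)', stop)
    if len(parts_a) != len(parts_b):
        raise ValueError(f"{start!r} and {stop!r} have different structure")
    differing = [i for i, (a, b) in enumerate(zip(parts_a, parts_b)) if a != b]
    if len(differing) != 1 or differing[0] % 2 == 0:
        raise ValueError(
            f"cannot range from {start!r} to {stop!r}: "
            "exactly one numeric part must differ")
    idx = differing[0]
    lo, hi = int(parts_a[idx]), int(parts_b[idx])
    if lo > hi:
        raise ValueError(f"range start {start!r} is after stop {stop!r}")
    width = len(parts_a[idx])
    return [''.join(parts_a[:idx] + [str(n).zfill(width)] + parts_a[idx + 1:])
            for n in range(lo, hi + 1)]

print(expand_range('vol2_page0008', 'vol2_page0010'))
```

Identical start and stop, or ends that differ in more than one place, raise a ValueError instead of being silently misread.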

Since this whole thing will likely also be used for page selection on the web API, I suggest addressing these problems thoroughly, and soon.

bertsky · 2022-11-23T16:54:04Z

Fixed by #955 – thx!

(It's clear that comma must take precedence over regex interpretation, because XS-IDs cannot contain comma, but perhaps we should also explain the combination in the CLI spec?)

kba · 2022-11-23T17:04:25Z

(It's clear that comma must take precedence over regex interpretation, because XS-IDs cannot contain comma, )

I was considering that and first implemented the token splitting with a negative lookbehind for backslash (re.split(r'(?<!\\),')) to allow for escaping commas. But then I thought: who would consciously put commas in their identifiers? So I reverted to a simple split at comma.

but perhaps we should also explain the combination in the CLI spec?

Sure, we could say that the multi-value mechanics do not allow comma in values.
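
For comparison, the two splitting strategies mentioned above can be sketched as follows. The function names are hypothetical and the unescaping step is an assumption about how escaped commas would have been handled.

```python
import re

def split_simple(expression):
    """Plain split: values can never contain a comma."""
    return expression.split(',')

def split_with_escapes(expression):
    """Split at commas that are not preceded by a backslash, then unescape."""
    tokens = re.split(r'(?<!\\),', expression)
    return [t.replace('\\,', ',') for t in tokens]

print(split_simple(r'a\,b,c'))
print(split_with_escapes(r'a\,b,c'))
```

The simple variant is easier to document: the spec only needs to state that multi-value values cannot contain a comma.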

kba self-assigned this Nov 17, 2022

kba mentioned this issue Nov 17, 2022

OcrdMets.find_files: improve search for pageId #955

Merged

bertsky closed this as completed Nov 23, 2022

kba added a commit to OCR-D/spec that referenced this issue Nov 23, 2022

cli: multi-value values cannot contain comma, OCR-D/core#921

432d3de

kba added a commit to OCR-D/spec that referenced this issue Nov 23, 2022

cli: multi-value values cannot contain comma, OCR-D/core#921

23c9d90
