Message structure #139
Processing request message example:
# Generated fields
processor-name: "ocrd-.*" # e.g., "ocrd-cis-ocropy-binarize", not necessary for processing, but might be useful for humans
job_id: "uuid" # e.g., "fe869a65-ea1c-4be5-8053-a3b8e01beecd", generated by the Processing Broker
# Mandatory fields - provided by the User/Workflow Server to the Processing Broker
input_file_grp: "OCR-D-.*" # e.g., "OCR-D-DEFAULT"
workspace: # either of the two fields must be available
  id: "uuid" # e.g., "4a5795d6-136c-40b7-8eed-eeca8ff1c249", generated by the Workspace Server
  path_to_mets: "absolute path" # e.g., "/workspaces/workspace1/mets.xml", locally available on the Processing Worker
# Optional fields - provided by the User/Workflow Server to the Processing Broker
output_file_grp: "OCR-D-.*" # e.g., "OCR-D-BIN"
page_id: "id" # e.g., "PHYS_0005..PHYS_0010" will process only pages 5 to 10
overwrite: boolean
parameter:
  - key_1: value
  - ..
  - key_N: value
log_level: "level" # e.g., "INFO"
# "save the resource usage of the processor to a file in the root directory"
save_profiling_to: "path" # e.g., "./logs/resource-usage.txt"
# If this field is not provided, do not push results back to the message queue
result_queue: "name" # e.g., "ocrd-cis-ocropy-binarize-result"

Result response message example:
# Mandatory fields - reused from the processing request message or generated by the Processing Worker
job_id: "uuid" # e.g., "fe869a65-ea1c-4be5-8053-a3b8e01beecd"
status: "value" # e.g., completed/failed
# Optional fields - available only if provided in the processing request message
path_to_mets: "absolute path"

EDIT: This post gets updated and contains the latest version of the message drafts. |
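To make the `page_id` range notation from the draft concrete, here is a minimal Python sketch of how a value such as "PHYS_0005..PHYS_0010" could be expanded into individual page IDs. The function name `expand_page_range` is hypothetical, and the sketch assumes that both ends of a range share the same prefix and end in a zero-padded number. It is not taken from any existing implementation.

```python
import re
from typing import List


def expand_page_range(page_id: str) -> List[str]:
    """Expand a page_id value into a list of page IDs.

    Accepts comma-separated IDs and 'START..END' ranges.
    Assumption: both ends of a range share one prefix and end in digits.
    """
    pages: List[str] = []
    for part in page_id.split(","):
        part = part.strip()
        if ".." not in part:
            pages.append(part)
            continue
        start, end = part.split("..", 1)
        m_start = re.fullmatch(r"(.*?)(\d+)", start)
        m_end = re.fullmatch(r"(.*?)(\d+)", end)
        if not m_start or not m_end or m_start.group(1) != m_end.group(1):
            raise ValueError(f"unsupported page range: {part!r}")
        prefix = m_start.group(1)
        width = len(m_start.group(2))  # keep the zero padding of the start ID
        first, last = int(m_start.group(2)), int(m_end.group(2))
        if first > last:
            raise ValueError(f"range start is after range end: {part!r}")
        pages.extend(f"{prefix}{n:0{width}d}" for n in range(first, last + 1))
    return pages
```

Calling `expand_page_range("PHYS_0005..PHYS_0010")` yields the six IDs from `PHYS_0005` to `PHYS_0010`, while a plain comma-separated value is passed through unchanged.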
Sorry, I find your draft a bit difficult to read. What syntax is it? Could you just provide one or more message examples? My first comment is: why do we need to include |
It just represents the two message types; it's not a field. Okay, let me edit that and replace the regex patterns with examples. |
processing message type:
result message type:
|
I would reuse the structure of the POST request to the processing endpoint of the Web API as much as possible. So, I suggest a message structure like this:
# This is not necessary for processing, but might be useful for humans
processor_name: ocrd-cis-ocropy-binarize
path: /path/to/mets.xml
description: some text description here
input_file_grps:
  - OCR-D-INPUT
output_file_grps:
  - OCR-D-OUTPUT
page_id: PHYS_001,PHYS_002
parameters:
  params_1: 1
  params_2: 2
# If this field is not provided, do not push results back to the message queue
result_queue_name: ocrd-cis-ocropy-binarize-result
# The time this message was created, in Unix time
created_time: 1668782988590
The |
|
After rechecking the specs and reconsidering some things in more depth, here is an update that, hopefully, covers all possible use cases. I will edit my first post and pin the examples to the top if there are no current objections. @tdoan2010

Processing request message example:
# Generated fields - automatically generated by the Processing Broker
# This is not necessary for processing, but might be useful for humans
processor-name: "ocrd-.*" # e.g., "ocrd-cis-ocropy-binarize"
job-id: "uuid" # e.g., "fe869a65-ea1c-4be5-8053-a3b8e01beecd"
# Mandatory fields - provided by the User/Workflow Server to the Processing Broker
workspace:
  # Generated by the Workspace Server when a workspace is posted
  id: "uuid" # e.g., "4a5795d6-136c-40b7-8eed-eeca8ff1c249"
input-file-grp: "OCR-D-.*" # e.g., "OCR-D-DEFAULT"
# Optional fields - provided by the User/Workflow Server to the Processing Broker
output-file-grp: "OCR-D-.*" # e.g., "OCR-D-BIN"
page-id: "id" # e.g., "PHYS_0005..PHYS_0010" will process only pages 5 to 10
overwrite: boolean
parameter:
  - key-1: value
  - ..
  - key-N: value
mets: "path" # e.g., "uuid/mets/mets.xml"
log-level: "level" # e.g., "INFO"
# "save the resource usage of the processor to a file in the root directory"
show-resource: "path" # e.g., "uuid/logs/resource-usage.txt"
# If this field is not provided, do not push results back to the message queue
result-queue: "name" # e.g., "ocrd-cis-ocropy-binarize-result"

Result response message example:
# Generated previously by the Processing Broker
job-id: "uuid" # e.g., "fe869a65-ea1c-4be5-8053-a3b8e01beecd"
# Generated by the Processing Worker
status: "value" # e.g., completed/failed
return: "value" # e.g., 0/-1
log-output: "output text"

Notes:
|
Some comments from my side:
workspace:
  id: 1234
  path_to_mets: relative/path/to/mets.xml
or only
|
As long as we use the same convention everywhere, i.e. the dash, I am okay with that.
I left it nested after removing the
Agree
Good point. I have assumed that the workspaces will be stored on external storage. My conceptual understanding was that each
I think supporting both and making just one of the fields mandatory is the right approach to staying flexible with our options. Btw, there is a I think utilizing the
Agree
Okay I will edit my first post on top with the suggested changes so far and it will be the most up-to-date version. |
External storage can just be mounted. That's why there is NFS in the picture. The idea is: all processing servers share the same storage, either via local mount or network mount.
Yes, either
PS: In the future, please do not rewrite your message completely as you did; post new ones instead. It is difficult for people (myself included) to follow the discussion when old messages are edited. |
Okay, now I clearly see why providing the path is a cleaner approach when the storage is shared among the processing workers.
The first draft was compact but super complicated for others to understand. I preferred to get rid of that. |
I do not agree with your current suggestion yet. Some points from my side:
So, my suggested example:
job_id: uuid
processor_name: ocrd-cis-ocropy-binarize
path_to_mets: /path/to/mets.xml
input_file_grps:
  - OCR-D-INPUT
output_file_grps:
  - OCR-D-OUTPUT
page_id: PHYS_001,PHYS_002
parameters:
  params_1: 1
  params_2: 2
# If this field is not provided, do not push results back to the message queue
result_queue_name: ocrd-cis-ocropy-binarize-result
# The time this message was created, in Unix time
created_time: 1668782988590
Except |
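As an illustration of the two commented fields in the example above, the following Python sketch builds a message with a `created_time` timestamp and only pushes a result when `result_queue_name` is present. The function names and the `publish` callback are hypothetical stand-ins for whatever queue client is eventually used. The sketch also assumes that the 13-digit example value means Unix time in milliseconds.

```python
import time
from typing import Callable, List, Optional


def new_processing_message(job_id: str, processor_name: str, path_to_mets: str,
                           input_file_grps: List[str], output_file_grps: List[str],
                           result_queue_name: Optional[str] = None) -> dict:
    """Build a processing message as a plain dict (hypothetical helper)."""
    message = {
        "job_id": job_id,
        "processor_name": processor_name,
        "path_to_mets": path_to_mets,
        "input_file_grps": input_file_grps,
        "output_file_grps": output_file_grps,
        # Assumption: Unix time in milliseconds, matching the 13-digit example
        "created_time": int(time.time() * 1000),
    }
    if result_queue_name is not None:
        message["result_queue_name"] = result_queue_name
    return message


def route_result(request: dict, result: dict,
                 publish: Callable[[str, dict], None]) -> bool:
    """Push the result only if the request names a result queue.

    `publish(queue_name, message)` is a hypothetical callback.
    Returns True if the result was published, False otherwise.
    """
    queue_name = request.get("result_queue_name")
    if queue_name is None:
        return False
    publish(queue_name, result)
    return True
```

Keeping `publish` as a parameter keeps the sketch independent of any particular message broker library.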
Agree.
Having more flexibility may help here. We could change the log level without having to restart the processor. Currently, I think it's not possible to change the log level of a running processor. However, this may be achievable if the log level is defined with an environment variable. We could drop that field for now.
Yeah, the same as in 2. We could drop that too.
I decided to go with a single value to avoid having to parse the input/output. I could change that to a list but then extra work has to be done to adapt the input/output to the processor parameters for each processing request. EDIT: Extra work = Extra processing time.
Yes, same as in 2 and 3. |
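To gauge the "extra work" mentioned above for list-valued file groups, here is a minimal Python sketch of the conversion. It assumes the processor is invoked through a command line that takes comma-separated groups via `--input-file-grp` and `--output-file-grp` and a page selection via `--page-id`. The function name `to_processor_args` and the message field names are taken from the suggested example or are hypothetical.

```python
from typing import List


def to_processor_args(message: dict) -> List[str]:
    """Turn list-valued file groups into comma-separated CLI arguments.

    Assumption: the processor accepts comma-separated groups on its
    command line; only the fields shown here are converted.
    """
    args = ["--input-file-grp", ",".join(message["input_file_grps"])]
    if message.get("output_file_grps"):
        args += ["--output-file-grp", ",".join(message["output_file_grps"])]
    if message.get("page_id"):
        args += ["--page-id", message["page_id"]]
    return args
```

Under these assumptions the per-message conversion is a single string join per field.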
For |
Okay, here is the newer version with the proposed changes.

Processing request message example:
# Generated fields
processor-name: "ocrd-.*" # e.g., "ocrd-cis-ocropy-binarize", not necessary for processing, but might be useful for humans
job_id: "uuid" # e.g., "fe869a65-ea1c-4be5-8053-a3b8e01beecd", generated by the Processing Broker
# Mandatory fields - provided by the User/Workflow Server to the Processing Broker
input_file_grp:
  - "OCR-D-.*" # e.g., "OCR-D-DEFAULT"
# either of the two fields below must be available
workspace_id: "uuid" # e.g., "4a5795d6-136c-40b7-8eed-eeca8ff1c249", generated by the Workspace Server
path_to_mets: "absolute path" # e.g., "/workspaces/workspace1/mets.xml", locally available on the Processing Worker
# Optional fields - provided by the User/Workflow Server to the Processing Broker
output_file_grp:
  - "OCR-D-.*" # e.g., "OCR-D-BIN"
page_id: "id" # e.g., "PHYS_0005..PHYS_0010" will process only pages 5 to 10
parameters:
  - params_1: "value"
  - ..
  - params_N: "value"
# If this field is not provided, do not push results back to the message queue
result_queue_name: "name" # e.g., "ocrd-cis-ocropy-binarize-result"

Result response message example:
# Mandatory fields - reused from the processing request message or generated by the Processing Worker
job_id: "uuid" # e.g., "fe869a65-ea1c-4be5-8053-a3b8e01beecd"
status: "value" # e.g., completed/failed
# Optional fields - available only if provided in the processing request message
path_to_mets: "absolute path" |
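The rules stated in the comments of this draft (mandatory `input_file_grp`, at least one of `workspace_id` or `path_to_mets`, and a result message that reuses `job_id`) could be checked roughly as in the following Python sketch. The function names are hypothetical, messages are assumed to be plain dicts, and this is not the schema that was later written for the spec.

```python
import posixpath
from typing import List


def validate_processing_message(message: dict) -> List[str]:
    """Return a list of problems; an empty list means the message passes."""
    errors: List[str] = []
    if not message.get("job_id"):
        errors.append("job_id is missing")
    if not message.get("input_file_grp"):
        errors.append("input_file_grp is mandatory")
    path = message.get("path_to_mets")
    if not (message.get("workspace_id") or path):
        errors.append("either workspace_id or path_to_mets must be provided")
    # The draft asks for an absolute path on the Processing Worker
    if path and not posixpath.isabs(path):
        errors.append("path_to_mets must be an absolute path")
    return errors


def build_result_message(request: dict, status: str) -> dict:
    """Build a result message that reuses fields of the request."""
    result = {"job_id": request["job_id"], "status": status}
    # Optional field: only present if the request provided it
    if "path_to_mets" in request:
        result["path_to_mets"] = request["path_to_mets"]
    return result
```

Returning a list of problems instead of raising on the first one makes it possible to report every issue with a message at once.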
By "extra work" I meant the conversion that will potentially happen millions of times (depending on how many messages there are) when parsing the messages, not that it's extra work for us to implement. |
@MehmedGIT I created 2 JSON schemas, one for processing messages and one for result messages. You can find them currently in OCR-D/spec#222. Please check them and try using them in your program to validate processing messages. Some notes from me:
|
I will have a look and implement them soon. |
We need to come up with a schema for the messages that are submitted to the queue. There might be different types of messages, e.g. processing, result, error, logging, etc.
Use the ocrd_tool.schema.yml as a reference implementation to write this schema.