Skip to content

Conversation

jorgee
Copy link
Contributor

@jorgee jorgee commented Apr 4, 2025

  • Use Instant object instead of strings in model objects
  • Unify nomenclature using taskRun and workflowRun instead of publishBy, runBy, etc
  • Use #outputs fragment using run CID instead of a separate result CID
  • Use DataOutput in place of WorkflowOutput and TaskOutput
  • Include script checksum in TaskRun
  • WorkflowResults and TaskResults renamed by WorkflowOutputs and TaskOutputs
  • Clean some unused code and unify duplicated code, (encoding search results,...)

Signed-off-by: jorgee <[email protected]>
@jorgee jorgee requested a review from pditommaso April 4, 2025 11:30
@jorgee
Copy link
Contributor Author

jorgee commented Apr 4, 2025

Missing parts discussed today:

Bugs:

  • Fix outputs path in source field in the Workflow's data outputs.
  • Add workflowRun in TaskRun

Other:

  • Task Inputs and Output model (suggestion @bentsherman)
  • find command to look for CID's (not showing)

@bentsherman
Copy link
Member

Based on our discussion today, let's make it so that you don't need to specify #outputs or /outputs to access a task output. That is more a workflow thing. Task outputs should simply be addressed as cid://<task-hash>/<relative-path>

Here is a proposed structure for task inputs and outputs:

{
  "inputs": {
    "val": { /* map of val names/values */ },
    "env": { /* map of env names/values */ },
    "file": [ /* list of file/path inputs */ ],
    "stdin": "<hash>/command.in" // or null if not specified
  }
}
{
  "outputs": {
    "env": { /* map of env names/values */ },
    "eval": { /* map of eval names/values */ },
    "file": [ /* list of file/path outputs */ ],
    "stdout": "<hash>/command.out" // or null if not specified
  }
}

This is how I modeled task inputs/outputs in #4553 (an experiment for static types) and it worked well. See for example ProcessInputs and ProcessOutputs.

@pditommaso pditommaso merged commit 14e2841 into cid-store Apr 5, 2025
5 checks passed
@pditommaso pditommaso deleted the cid-store-m0-quick-wins branch April 5, 2025 15:26
@pditommaso
Copy link
Member

Let's move the open points into a new PR

@pditommaso
Copy link
Member

Based on our discussion today, let's make it so that you don't need to specify #outputs or /outputs to access a task output.

Not sure to agree on this, there should be a convenient to access outputs both for workflow and tasks.

Here is a proposed structure for task inputs and outputs

I think we should unify both inputs and outputs and list of objects. The grouping can be a bit readable for humans, but the real consumer should be indexing sub-system. Having a flat, easy predictable scheme, likely would simplify things

@bentsherman
Copy link
Member

@pditommaso I would not rush to try to fit tasks and workflows into the same model. They are similar but not exactly the same. Instead we should think about the best way to model task inputs and outputs on their own terms.

The structure I propose, I think is actually easier to consume for both users and the data provenance. Rather than searching through a list, you simply look up a value by name in a map.

For a workflow run, #outputs makes sense because the output is a data structure. But the output of a task is a collection of files, environment variables, and stdout, so that probably requires a different syntax. Maybe something like:

  • cid://<hash>#file/path/to/file.txt
  • cid://<hash>#env/MY_ENV_VAR
  • cid://<hash>#stdout

But I don't know if all that is really needed. The simplest thing would be to just access output files as cid://<hash>/<path>

@jorgee if you understand enough the structure I proposed, can you make an initial PR for it? Should be easier to discuss there

@pditommaso
Copy link
Member

Not sure it's worth, at least in this milestone. It would be better to focus on simplify and consistency, hence my suggestion to use always a collection both inputs and outputs.

@jorgee
Copy link
Contributor Author

jorgee commented Apr 7, 2025

I partially agree with both comments. I like maps, but I agree with @pditommaso that we should refer to outputs in the same way for workflows and tasks. So, cid://<hash>#outputs should refer to the outputs, whether a hash is for a task or a workflow. I do not see very intuitive referring directly to a type, because you are assuming it is an output.

Regarding the structure of the output, I was trying to model them as a List of Parameter in all the cases (parameter, task inputs, task outputs and workflow outputs) where each parameter had a name, type and value. But I saw that, depending on the case, the name or type is missing. I suppose this will not happen in the future with static types,

Writing them to a list or a map depends on what we want to support. The reason why I changed workflow outputs to a map is that it is very easy to refer to a single output ( cid://<hash>#outputs.myOutput) and I think it could be a common use case for workflow outputs. If they were a list, you should refer to something like cid://<hash>#outputs?name=myOutput. because the '.' is used to navigate through the object parameters and '?' is used to filter in a list. But using this format also has two problems it is not a correct URI (query should be before fragment) and the '?' character has issues with the pattern expected by the Channel.fromPath.

@bentsherman
Copy link
Member

Should be a good topic to discuss on Friday. In my view, the correct model is also the simplest one in this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants