Skip to content

Conversation

@lchu6
Copy link
Contributor

@lchu6 lchu6 commented Oct 7, 2021

Why are these changes needed?

Add metadata to workflow. Currently there is no option for user to attach any metadata to a step or workflow run, and workflow running metrics (except status) are not captured nor checkpointed.

We are adding various of metadata including:

  1. step-level user metadata. can be set with step.options(metadata={})
  2. step-level pre-run metadata. this captures pre-run metadata such as step_start_time, more metrics can be added later.
  3. step-level post-run metadata. this captures post-run metadata such as step_end_time, more metrics can be added later.
  4. workflow-level user metadata. can be set with workflow.run(metadata={})
  5. workflow-level pre-run metadata. this captures pre-run metadata such as workflow_start_time, more metrics can be added later.
  6. workflow-level post-run metadata. this captures post-run metadata such as workflow_end_time, more metrics can be added later.

Related issue number

#17090

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone
Copy link
Contributor

fishbone commented Oct 7, 2021

I think the high-level direction looks good, but we probably can improve the structure a little bit.

Here there are three metadata for workflow & step (6 in total):

For step metadata

  • user meta
    • do it in commit step
  • pre step meta
    • do it in commit step
  • post step meta
    • do it in checkpoint output

For workflow metadata

  • user meta
    • do it in run/run_async
  • pre meta/post meta
    • do it in workflow access

Let's focus on step metadata first and then we can get to workflow metadata later.

Copy link
Contributor

@fishbone fishbone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! The highlevel direction looks good! Let's fix the comments and add some test and go with another round of review.

@lchu6
Copy link
Contributor Author

lchu6 commented Oct 7, 2021

@iycheng thanks for the suggestion.

  • user meta (do it in commit step)
    I have it in here:

    wf_storage._put(
    which is inside _write_step_inputs that commit step uses. I think this is good as it is together with all other write input? or you are suggesting to do it somewhere else?

  • pre step meta (do it in commit step)
    I have it here:

    asyncio.get_event_loop().run_until_complete(store._put(

    right before the actually step running _wrap_run in the next line. Is the suggestion putting it inside
    commit_step(store, step_id, None, e, outer_most_step_id)
    and
    commit_step(store, step_id, persisted_output, None, outer_most_step_id)
    ?
    If so, I would assume:
    the start_time should still be recorded before the _wrap_run line, but we keep that value and bring it inside commit_step. (as an extra arg for commit_step I assume?)

  • post step meta (do it in checkpoint output)
    I have it here:

    step_end_metadata = {'end_time': time.time()}
    which is after the step_commit. I guess the suggestion is to put it inside the step_commit in the save_step_output portion. Correct?

@fishbone
Copy link
Contributor

fishbone commented Oct 8, 2021

user meta looks good to me.

I got your point. I think you are right about pre-run metadata and post-run metadata.
Let's add two functions into workflow_storage: save_step_prerun_metadata/save_step_postrun_metadata

and call them before and after wrap run.

persisted_output, volatile_output = _wrap_run(
func, step_type, step_id, catch_exceptions, max_retries, *args,
**kwargs)

@lchu6 lchu6 requested a review from ericl as a code owner October 11, 2021 16:55
@lchu6
Copy link
Contributor Author

lchu6 commented Oct 11, 2021

@iycheng Update:

For all 6 inputs, here are where they are now:

  1. step_user_metadata is within _write_step_inputs, so it is saved together with all other attributes ofWorkflowData.
  2. The rest 5 are stored with newly created methods: save_step_prerun_metadata, save_step_postrun_metadata, save_workflow_user_metadata, save_workflow_prerun_metadata and save_workflow_postrun_metadata in workflow_storage.
  3. The two step-level ones are put inside where we agreed on - before and after _wrap_run. For workflow-level user metadata, I put it inside execution.run which is where run/asyn_run called from. For workflow-level pre/post run metadata, this is where I am not sure where to put the best inside workflow_access, I currently put pre-meta at the beginning of run_or_resume while post-meta inside update_step_status (i.e. post-meta is recorded whenever FAILED or SUCCESSFUL is captured.) Let me know if you have a better place to put in mind.

Btw, I think the code changes are now much cleaner, thanks to the early feedbacks.

Copy link
Contributor

@fishbone fishbone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really thanks for the updating! It looks nice! I have some comments there which shouldn't be too hard to fix

  • let's try to keep workflow storage clean without logic related to the application
  • except for user-facing API, let's rename metadata to user_metadata if it's from the user. Because for user-facing API, there is only one type of metadata so no confusion there. But internally, we have several kinds of metadata.
  • once fixed them, please add some test cases you can follow the pattern here (python/ray/workflow/tests/test_basic_workflows_2.py)

@lchu6
Copy link
Contributor Author

lchu6 commented Oct 12, 2021

@iycheng added tests in 93eb53f.

@fishbone
Copy link
Contributor

There is lint failure, could you check this doc and format it?
https://docs.ray.io/en/latest/getting-involved.html#lint-and-formatting

@fishbone
Copy link
Contributor

it looks like rllib test failed. It looks like not related to this one. Could you merge to master and push it again?

@fishbone
Copy link
Contributor

Everything else looks good! It's almost there and thanks for the work!

@fishbone
Copy link
Contributor

fishbone commented Oct 12, 2021

lint failure

@lchu6
Copy link
Contributor Author

lchu6 commented Oct 12, 2021

lint failure

@iycheng can you help me on this one? I checked all failed checks and there is only one related - lint with the following error:

Warning, treated as error:
--
  | /ray/python/ray/workflow/common.py:docstring of ray.workflow.common.Workflow.run:27:Definition list ends without a blank line; unexpected unindent.

However, scripts/format.sh didn't give me any fix, and I couldn't find any problem with manual check on the docstring of def run.

@fishbone
Copy link
Contributor

lint failure

@iycheng can you help me on this one? I checked all failed checks and there is only one related - lint with the following error:

Warning, treated as error:
--
  | /ray/python/ray/workflow/common.py:docstring of ray.workflow.common.Workflow.run:27:Definition list ends without a blank line; unexpected unindent.

However, scripts/format.sh didn't give me any fix, and I couldn't find any problem with manual check on the docstring of def run.

Usually I run ci/travis/format.sh

Copy link
Contributor

@fishbone fishbone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

@fishbone
Copy link
Contributor

Test failure looks unrelated. @lchu-ibm could you give more details in the description part?

@lchu6
Copy link
Contributor Author

lchu6 commented Oct 13, 2021

@iycheng done with updating description.

@fishbone fishbone merged commit ce64e6d into ray-project:master Oct 13, 2021
@lchu6 lchu6 deleted the metadata branch October 13, 2021 22:47
fishbone pushed a commit that referenced this pull request Oct 13, 2021
## Why are these changes needed?

Quick fix for metadata put. Currently when workflow-level metadata is not given, it will output `null` to `user_run_metadata.json`, this fix will make it output `{}`.
## Related issue number

original issue: #17090
original PR: #19195
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants