
Collect metrics from TF Events files #173

Closed
jlewi opened this issue Sep 20, 2018 · 11 comments
@jlewi
Contributor

jlewi commented Sep 20, 2018

Relevant issues
#87 Study Job CRD; Don't require users to write code to do HP tuning.
#39 support TFJob and other frameworks

We'd like to be able to collect metrics from a TF.Events file produced by a TensorFlow training job.

So at a high level what we need is

  1. A reusable binary/job that can read a TF.Events file and report specified metrics via the Katib metrics API
  2. In the StudyJob CRD, we need to be able to launch this binary for each training job and pass it the location of the TF.Events file.
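As a sketch of step 1: a TF.Events file is a sequence of serialized `Event` protobufs in TFRecord framing, so the collector first needs a record reader. The following is a minimal pure-Python sketch of that framing (it skips CRC verification, and decoding each payload would additionally require the TensorFlow `event.proto` definitions); it is illustrative, not the actual collector implementation.

```python
import struct

def read_tfrecords(path):
    """Yield raw record payloads from a TFRecord-framed file.

    TFRecord framing per record:
      uint64 payload length (little-endian), uint32 CRC of the length,
      `length` payload bytes, uint32 CRC of the payload.
    CRC checks are skipped in this sketch; each payload in a TF.Events
    file is a serialized `Event` protobuf.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                return  # clean end of file
            (length,) = struct.unpack("<Q", header)
            f.read(4)                 # skip length CRC
            payload = f.read(length)
            f.read(4)                 # skip payload CRC
            yield payload
```

The real collector would then decode each payload into an `Event` and pull out the scalar summaries (e.g. accuracy, loss) to report via the metrics API.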
@gaocegege
Member

I cannot figure out a better solution than a PVC. We need help from the community.

/cc @YujiOshima @ChanYiLin @ddysher

@jlewi
Contributor Author

jlewi commented Sep 21, 2018

Can you explain the issue? Is this just a question of making the events file accessible by two processes e.g.

  1. The trainer which writes the TF.events file
  2. The metrics collector which reads it.

Using a PVC to share the TF.events file seems perfectly reasonable. We can also support object stores (S3, HDFS, GCS) since TF can read/write those directly.
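To make the PVC-sharing idea concrete, here is a hypothetical pod spec sketch (the image names and schema are illustrative, not Katib's actual StudyJob layout): the trainer writes TF.Events under a shared mount and a sidecar collector reads them.

```yaml
# Hypothetical sketch: trainer and metrics collector share the events
# directory through a PersistentVolumeClaim. Names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: trial-worker
spec:
  volumes:
    - name: events
      persistentVolumeClaim:
        claimName: trial-events-pvc
  containers:
    - name: trainer
      image: my-tf-trainer:latest          # writes TF.Events under /logs
      volumeMounts:
        - name: events
          mountPath: /logs
    - name: metrics-collector
      image: tfevent-metrics-collector:latest  # reads /logs, reports to Katib
      volumeMounts:
        - name: events
          mountPath: /logs
          readOnly: true
```

With an object store (S3, GCS, HDFS), the shared volume would be replaced by passing both containers the same storage URI, since TF can read/write those directly.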

@YujiOshima
Contributor

I heard @gaocegege will make a WIP PR for TFJob support first in the next week.
I wrote down a rough design of a metrics collector for TF Event.

[Image: metrics collector design diagram]

We can implement the metrics collector for TF Event independent from TFJob support.
I will make TF Event metrics collector PR.

@YujiOshima YujiOshima mentioned this issue Oct 23, 2018
@johnugeorge
Member

@YujiOshima how will this be compatible with PyTorch job?

@YujiOshima
Contributor

@johnugeorge I will parse a tf.Event file with event.proto.
I'm not familiar with the PyTorch log format or ONNX format.
Could you tell me how we can parse them?

@johnugeorge
Member

@YujiOshima I am not aware of an equivalent official format in PyTorch. What are the other options to support it?

@YujiOshima
Contributor

@johnugeorge I think there are two ways:

  • Use the tensorboardX library in your worker. I believe its logs are written in the TF.Events format.
  • Print metrics to stdout and use the default metrics collector, which parses the worker's stdout.

WDYT?
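The second option can be sketched as follows. This assumes a hypothetical `name=value` line format (e.g. the worker prints `accuracy=0.93`); the real default collector's expected format may differ.

```python
import re

# Assumed line format: "metric_name=value", one metric per line.
# This is a sketch, not the actual default collector's parser.
METRIC_RE = re.compile(
    r"^(?P<name>[\w-]+)=(?P<value>[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)$"
)

def parse_stdout_metrics(lines):
    """Collect (name, value) pairs from worker stdout lines.

    Non-matching lines (ordinary log output) are ignored.
    """
    metrics = []
    for line in lines:
        m = METRIC_RE.match(line.strip())
        if m:
            metrics.append((m.group("name"), float(m.group("value"))))
    return metrics
```

For example, stdout of `["epoch 1 done", "accuracy=0.93", "loss=1e-2"]` would yield `[("accuracy", 0.93), ("loss", 0.01)]`. The appeal is framework neutrality: any worker that can print to stdout can report metrics, regardless of training library.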

@johnugeorge
Member

The problem is that we are forcing the workers to use a particular library or format. Currently, I think this is the only way.

@jlewi
Contributor Author

jlewi commented Oct 29, 2018

@johnugeorge I don't think we are forcing users to do things a particular way; the idea is to make Katib pluggable with respect to how metrics are collected. Support for TF.Events is just one of the methods we want to be well supported, in order to support TensorFlow.
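The pluggable-collector idea could look something like the following sketch. All names here are illustrative, not Katib's actual API: collectors implement one interface and register under a kind string, so a TF.Events collector and a stdout collector are just two entries in the same registry.

```python
from abc import ABC, abstractmethod

# Hypothetical plug-in registry; names are illustrative, not Katib's API.
COLLECTORS = {}

def register(kind):
    """Class decorator that registers a collector under a kind string."""
    def deco(cls):
        COLLECTORS[kind] = cls
        return cls
    return deco

class MetricsCollector(ABC):
    @abstractmethod
    def collect(self, source):
        """Return (metric_name, value) pairs extracted from `source`."""

@register("stdout")
class StdoutCollector(MetricsCollector):
    def collect(self, source):
        # `source` is an iterable of stdout lines like "loss=0.5".
        out = []
        for line in source:
            name, sep, val = line.partition("=")
            if sep:
                try:
                    out.append((name.strip(), float(val)))
                except ValueError:
                    pass  # not a metric line
        return out

# A "tfevent" collector would be registered the same way, wrapping an
# events-file reader instead of stdout parsing.
```

The StudyJob controller would then pick the collector by the kind configured for the trial, so adding PyTorch or ONNX support means registering one more collector rather than changing workers.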

@YujiOshima
Contributor

The TF Event metrics collector was added by #235.
/close

@k8s-ci-robot

@YujiOshima: Closing this issue.

In response to this:

The TF Event metrics collector was added by #235.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
