[tune] (RFC) Exposing a first-class tracker/logger #4423
The high-level motivation here makes a lot of sense to me. One question is about: …
@richardliaw I won't have a lot of time to dive into the current tracker implementation until the end of the week, so maybe this is already clearly addressed somewhere. How is checkpointing going to be handled? I ask because repeated results can be introduced if you are not very careful with how trials are restored, and these can be difficult to resolve.

An easy solution, and one I've seen implemented before, is to have a stateful tracker instance which is saved/restored at the same time as the rest of the experiment's state, ensuring that the tracker and the experiment are always in sync. In this case, the tracker can give an increasing index to every write. This allows the tracker (if buffering results) and/or the listener to discard results that were saved but are invalid due to unexpected termination, by backtracking to the restored index.

If the goal is to be minimally intrusive, a relatively safe way of ensuring the user will save/restore the experiment and the tracker together would be to give the user a tracker instance which could be pickled like any other python object, and to very clearly document it. I would expect it to be much more likely for someone to forget, or not expect it necessary, to checkpoint the tracker if it were a module-level call like …
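A minimal sketch of the stateful-tracker idea above, assuming a hypothetical Tracker class (all names here are illustrative, not part of any proposed API): the tracker tags every write with an increasing index and is pickled alongside the experiment state, so stale results can be discarded after a restore.

```python
import pickle


class Tracker:
    """Illustrative stateful tracker: every write gets an increasing index."""

    def __init__(self, logdir):
        self.logdir = logdir
        self.write_index = 0
        self.buffer = []

    def metric(self, **metrics):
        # Tag each result with a monotonically increasing index so that,
        # after a restore, stale results can be identified and discarded.
        self.buffer.append(dict(index=self.write_index, **metrics))
        self.write_index += 1

    def save(self, path):
        # Saved/restored at the same time as the rest of the experiment state.
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @staticmethod
    def restore(path):
        with open(path, "rb") as f:
            return pickle.load(f)


# After a restore, any result already on disk whose index is >= the restored
# tracker.write_index was written after the last checkpoint and can be
# discarded by the listener.
```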
Thanks for the comments!
Assuming a singleton design, we can default and auto-init to log
For this, I'm assuming you're referring to cases where you're on (say) a single machine, your experiment somehow fails, or you're using this as the new Function API, and you need to resume from a checkpoint. One option is to force users to rely on something like:

```python
def main(args):
    if track.can_load():
        model = track.load()["model"]
    else:
        model = Model()
    for i in range(track.epochs):
        avg_train_loss = train(...)
        avg_test_loss = test(...)
        # this should also flush the train_loss stored by track.metric
        track.checkpoint(epoch=i, avg_train_loss=avg_train_loss,
                         avg_test_loss=avg_test_loss, model=model)
```

Let me know your thoughts!
Just weighing in over here from my armchair. I think the track goals are well-set and address my concerns with other "tracker" libraries (except track :) ). The module-level API …

One issue I see coming up is dealing with model saving in a distributed environment. What does it mean to save models from different worker machines with the same path? Report the same metrics on them? Do you want to handle this case? Will this all be backed by best-effort writes to S3? That might result in conflicts, or at the very least non-obvious semantics.

Finally, one observation is that this is already solved by rllib's … Artifacts are still useful for saving images, but honestly it seems like those should just be checkpoint()ed like all the other diagnostics, maybe as a separate argument that accepts lists of local file paths. The reason …
Essentially, yes, that was what I was thinking about, but in the context of preemptible VMs.
At first glance, I think that seems pretty reasonable! Thinking out loud: what if you have no guarantees that the trials are restored on the same machine? Should tune be responsible for transferring the checkpoint files and, if so, through what API and (potentially) backend-specific mechanism?
How about extending the trial executor API to be responsible for setting the configuration through some general mechanism, e.g., a config file/string of some sort? Related to my previous point, this could happen at the same time that the checkpoint state is set up, since it would likely be backend dependent. With implementing a new trial executor (e.g., a kubernetes trial executor) in mind, it would be nice if users didn't need to know the details of how trials are launched/restored, allowing them to use exactly the same code regardless of the backend (i.e., no special init required if relying on a trial executor). Maybe extending the trial executor is not the preferred solution, but I do think some new or old dev API should be provided to enable automatic configuration of track even when running outside a ray cluster.
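As an illustration of the kind of automatic configuration discussed above (purely a sketch; TRACK_CONFIG_PATH and auto_init are hypothetical names, not an agreed-on API), the executor could drop a small config file and the tracker could pick it up without any user code changes:

```python
import json
import os


def auto_init(default_logdir="~/track_results"):
    """Initialize tracking from a config file if the backend provided one.

    The trial executor (Ray, kubernetes, ...) writes a small JSON config and
    exports its path via an environment variable; user code never needs to
    know which backend launched it.
    """
    config_path = os.environ.get("TRACK_CONFIG_PATH")  # hypothetical variable
    if config_path and os.path.exists(config_path):
        with open(config_path) as f:
            config = json.load(f)
    else:
        # Running outside any cluster: fall back to local defaults.
        config = {"logdir": os.path.expanduser(default_logdir)}
    return config
```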
I forgot to say that I think the question is important even outside of the context of the Function API. For … I'm not arguing that it should be responsible for it (though it might not be a bad idea, as you proposed), but it should be able to offer some consistency guarantees. I would expect users to be reluctant to use …
OK, thanks all for feedback.
Presumably, we're talking about individual trials, not something like a single trial that is running data-parallel. Each trial should be saving to its own Trial directory, and artifacts will be saved in there.
The idea is to eventually have an API something like …
I'm assuming this is using Track with Tune in the cluster setting. Tune is responsible for transferring the checkpoint files. Checkpoint files should go within the logdir of the trial, and Tune already uses rsync to transfer files in this setting.
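To make the "checkpoints live inside the trial logdir" point concrete (a sketch only; save_checkpoint is a hypothetical helper, not part of the proposed API), a trainable would simply write its checkpoint files under the directory Tune already syncs:

```python
import os
import pickle


def save_checkpoint(state, trial_dir, epoch):
    # Anything written under the trial's logdir is picked up by Tune's
    # existing rsync-based file transfer when a trial is moved or restored.
    path = os.path.join(trial_dir, "checkpoint_{}.pkl".format(epoch))
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path
```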
Hm, to clarify, this response is for the setting of "using Track individually outside the context of using Tune". If using Tune, then configurations can be set through …
Yeah, that's true. Stop and resume for …
Thanks for the clarification! I think making it the trial executor's responsibility would be the most flexible, since a kubernetes executor would require a very different setup, and forcing …
I think that is reasonable. What you have here is already quite good! Maybe the checkpointing tools should be their own …
Closed in #4362.
General Motivation
TL;DR: Minimally invasive utility for logging experiment results that integrates seamlessly with Tune.
One common barrier to adoption is that users already have a developed workflow by the time they need to use a hyperparameter tuning framework.
As @gehring mentions in #4414, …
The solution to this is to expose a tracking/logging mechanism as a first-class citizen. This mechanism will do the following:

track.log_metrics(**metrics).

Note: Integration with Ray:

This will replace the reporter object in the Tune Function API. The problem here is the case where users throw this logger all over their codebase; then it doesn't correspond nicely with the notion of a tune iteration.

Broad Requirements
@gehring puts it nicely:
Implementation Notes
Currently, there is a WIP PR for this ([tune] Initial track integration #4362). We are following the implementation of Track, which uses a singleton pattern to achieve a cleaner API. It would be good to get some feedback on this.
The file directory will look like the following (matching Tune's directory setup):
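As a rough sketch (assuming Tune's standard ~/ray_results layout; the trial and file names below are illustrative assumptions, not part of the proposal):

```
~/ray_results/
    my_experiment/
        trial_name_1/
            params.json
            result.json
        trial_name_2/
            params.json
            result.json
```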
Workflow
This is the ideal/proposed workflow. Notice that between the local and Ray version, nothing in the function changes. This is done by introducing a special wrapper like this (in spirit).
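A rough sketch of what such a wrapper could look like (wrap_function is an illustrative name, and track.init/track.shutdown are assumed entry points rather than a settled API):

```python
import functools

import track  # the module-level tracker proposed in this RFC (name assumed)


def wrap_function(train_func):
    """Wrap a user training function so tracking is configured automatically.

    Locally this could point the tracker at a default logdir; when launched
    by Tune, it would instead be initialized with the trial's logdir so
    results flow back into Tune with the user function left unchanged.
    """
    @functools.wraps(train_func)
    def wrapped(config):
        track.init(**config.get("track_config", {}))  # assumed signature
        try:
            return train_func(config)
        finally:
            track.shutdown()  # assumed call that flushes buffered metrics
    return wrapped
```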
Local Version
This should log a metric to disk, in the same format as Tune.
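For instance, a sketch of the local workflow using the module-level API described above (Model, fit_one_epoch, and the exact keyword arguments are placeholders, not a confirmed API):

```python
import track  # the module-level tracker proposed in this RFC (name assumed)


def train(config):
    model = Model()  # placeholder: the user's existing model class
    for epoch in range(config["epochs"]):
        loss = model.fit_one_epoch()  # placeholder: the user's training step
        # Written to disk in the same result format Tune uses, so Tune's
        # existing loggers/visualization can consume it directly.
        track.log_metrics(epoch=epoch, mean_loss=loss)


if __name__ == "__main__":
    train({"epochs": 10})
```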
Ray Execution
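And a corresponding sketch of handing the exact same function to Tune (tune.run is the existing entry point; exactly how the proposed track integration hooks in is what this RFC and #4362 are working out):

```python
from ray import tune

# `train` is the unchanged function from the local sketch above; under Tune,
# the wrapper would initialize track with the trial's logdir so that
# track.log_metrics reports results back to Tune each iteration.
tune.run(train, config={"epochs": 10}, num_samples=4)
```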
Does this actually belong in Ray?
Unclear, but for now, given its tight integration with Tune, it makes sense for it to sit here.
cc @noahgolmant @gehring @ericl @vlad17