Initial high-level code structure of CoreWorker. #4875
Conversation
src/ray/core_worker/common.h
Outdated
    /// Id of the argument, if passed by reference, otherwise nullptr.
    std::shared_ptr<ObjectID> id;
    /// Data of the argument, if passed by value, otherwise nullptr.
    std::shared_ptr<Buffer> *data;
shouldn't this be std::shared_ptr<Buffer> data?
Yep, good catch.
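For clarity, a minimal sketch of what the corrected struct could look like; the `ObjectID`/`Buffer` stand-ins and the `IsPassedByReference` helper are illustrative, not the actual Ray definitions:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical minimal stand-ins so the sketch is self-contained.
struct ObjectID { int value = 0; };
struct Buffer { std::vector<uint8_t> bytes; };

/// Argument of a task, with `data` as a plain shared_ptr rather than
/// a pointer-to-shared_ptr.
struct TaskArg {
  /// Id of the argument, if passed by reference, otherwise nullptr.
  std::shared_ptr<ObjectID> id;
  /// Data of the argument, if passed by value, otherwise nullptr.
  std::shared_ptr<Buffer> data;

  bool IsPassedByReference() const { return id != nullptr; }
};
```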
    bool delete_creating_tasks);

    private:
      CoreWorker *core_worker_;
This should be a shared_ptr to make sure memory management is done correctly
Replied here #4875 (comment)
src/ray/core_worker/task_interface.h
Outdated
    const ActorHandleID &ActorHandleID() const { return actor_handle_id_; };

    private:
      class ActorID actor_id_;
why is class here?
also, maybe make these const?
we should document all fields before merging
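The `class` keyword here is an elaborated-type-specifier; it is typically needed because an accessor reuses the type's name as a method name, which hides the type inside the class scope. A minimal sketch of the pattern (the `ActorID` stand-in and `ActorHandle` shape are illustrative):

```cpp
#include <cassert>

// Hypothetical stand-in for the real ray::ActorID type.
class ActorID {
 public:
  int value = 0;
};

class ActorHandle {
 public:
  explicit ActorHandle(const class ActorID &id) : actor_id_(id) {}

  // This accessor reuses the type's name, which hides `ActorID`
  // (the type) inside this class scope...
  const class ActorID &ActorID() const { return actor_id_; }

 private:
  // ...so the member must use the elaborated-type-specifier
  // `class ActorID`; plain `ActorID actor_id_;` would not compile.
  const class ActorID actor_id_;
};
```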
src/ray/core_worker/task_interface.h
Outdated
    /// submission.
    class CoreWorkerTaskInterface {
     public:
      CoreWorkerTaskInterface(CoreWorker *core_worker) { core_worker_ = core_worker; }
maybe use initializer list?
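The suggested change, sketched; the `core_worker()` accessor and the `const` on the member are illustrative additions, not part of the PR:

```cpp
#include <cassert>

class CoreWorker;  // forward declaration is enough for a pointer member

class CoreWorkerTaskInterface {
 public:
  // Member initializer list: initializes the member directly instead
  // of default-constructing it and then assigning in the body. It is
  // also what permits making the member const, as below.
  explicit CoreWorkerTaskInterface(CoreWorker *core_worker)
      : core_worker_(core_worker) {}

  CoreWorker *core_worker() const { return core_worker_; }

 private:
  CoreWorker *const core_worker_;  // the pointer itself never changes
};
```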
src/ray/core_worker/task_execution.h
Outdated
    void StartWorker(const TaskExecutor &executor);

    private:
      CoreWorker *core_worker_;
This should be a shared_ptr (as above, there won't be overhead associated with it, because it is only created/deleted very rarely)
src/ray/common/buffer.h
Outdated
    @@ -0,0 +1,34 @@
    #ifndef RAY_COMMON_IO_H
    #define RAY_COMMON_IO_H
s/RAY_COMMON_IO_H/RAY_COMMON_BUFFER_H ?
src/ray/core_worker/common.h
Outdated
    /// Argument of a task.
    struct Arg {
      /// Id of the argument, if passed by reference, otherwise nullptr.
      const std::shared_ptr<ObjectID> id;
would ObjectID work here (instead of shared_ptr)?
I use shared_ptr because either id or data must be empty.
Do you prefer using NIL id to indicate emptiness? Then what about data, should we also add a NIL buffer?
I see, thanks. It's just not quite intuitive at first glance to see a shared_ptr of ObjectID :)
src/ray/core_worker/core_worker.h
Outdated
    const Language language_;

    /// The `CoreWorkerTaskInterface` instance.
    const CoreWorkerTaskInterface task_interface_;
Maybe WorkerTaskInterface is sufficient here? It seems the C++ worker would use this interface directly, instead of deriving from it, so there shouldn't be another WorkerTaskInterface class in the C++ worker?
same for other interfaces below.
I feel that CoreWorkerTaskInterface indicates more clearly that this interface is part of the CoreWorker interfaces. And the C++ worker user API won't use this directly; there should be a wrapper around it. E.g., we need to serialize args and translate the C++ function to a function descriptor before calling this.
BTW, the length won't be a big problem, because normally we would use it like this: core_worker_.tasks().CallTask.
src/ray/core_worker/task_execution.h
Outdated
    std::function<Status(const RayFunction &ray_function, const vector<Buffer> &args)>;

    /// Start receiving and executing tasks in an infinite loop.
    void StartWorker(const TaskExecutor &executor);
s/StartWorker/Start ?
    namespace ray {

    Status CoreWorkerTaskInterface::CallTask(const RayFunction &function,
would it be better to rename to SubmitTask and SubmitActorTask?
or if you prefer to use call, maybe CallRemote and CallRemoteActor ?
I didn't name it SubmitTask because there's also a SubmitTask in the TaskTransport interface.
But that's probably not a big deal. I'll change them to SubmitTask/SubmitActorTask.
src/ray/core_worker/core_worker.h
Outdated
| language_(language), | ||
| task_interface_(this), | ||
| object_interface_(this), | ||
| task_execution_interface_(this){}; |
May you explain a bit why the various interfaces need a back pointer to the CoreWorker class? Usually we use a shared_ptr to hold a reference, but here CoreWorker shouldn't be destructed as long as the worker is alive. Or do we expect the various interfaces to potentially call each other through this pointer?
btw. if the back shared_ptr is indeed required, this class needs to be derived from enable_shared_from_this ?
Good point, if necessary, the reason should be documented.
Thanks, I tried using enable_shared_from_this, but it turned out that we can't use shared_from_this in the constructor.
I'll change it back to normal pointers, because, as you mentioned, the lifecycle of the sub-instances is the same as that of the root CoreWorker instance. cc @pcmoritz
Also, the reason we need a back pointer is that we need to use object store methods in the task interface.
Thanks. For the back pointer, you can instead use a reference, which should also work here and ensures validity by design.
As for why the back pointer is needed: yesterday I prototyped the task submission changes and found that things like WorkerContext would also be needed by multiple interfaces (submission, execution, and probably also the object store).
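A sketch of the reference-member variant suggested here, assuming (as in the PR) that CoreWorker owns its interfaces and therefore outlives them; the class shapes and accessor names are simplified illustrations:

```cpp
#include <cassert>

class CoreWorker;

class CoreWorkerTaskInterface {
 public:
  explicit CoreWorkerTaskInterface(CoreWorker &core_worker)
      : core_worker_(core_worker) {}

  const CoreWorker &worker() const { return core_worker_; }

 private:
  // A reference can be neither null nor reseated, so validity is
  // guaranteed by construction as long as CoreWorker owns this
  // interface and thus outlives it.
  CoreWorker &core_worker_;
};

class CoreWorker {
 public:
  // Passing *this is safe here: the interface only stores the
  // reference and does not dereference it during construction.
  CoreWorker() : task_interface_(*this) {}

  CoreWorkerTaskInterface &tasks() { return task_interface_; }

 private:
  CoreWorkerTaskInterface task_interface_;
};
```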
src/ray/core_worker/task_interface.h
Outdated
    class CoreWorker;

    /// Options of a task call.
    struct CallOptions {
need a constructor to initialize these const fields, since they cannot be changed directly
I think we can use list initialization, e.g., CallOptions call_options{1, {}}?
It's a bit verbose to write a constructor for each of these POD (plain old data) types.
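Indeed, aggregate (list) initialization works for structs with const members, so no hand-written constructor is needed. A sketch with illustrative field names (not the actual Ray definitions):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical shape of the options struct; field names are
// illustrative only.
struct CallOptions {
  const int num_returns;
  const std::unordered_map<std::string, double> resources;
};

// Illustrative helper: aggregate initialization fills the const
// fields in declaration order, e.g. CallOptions{1, {}}.
inline bool HasResources(const CallOptions &options) {
  return !options.resources.empty();
}
```

One consequence of the const fields is that the struct is no longer copy-assignable, which is usually fine for one-shot options objects.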
    };

    /// Options of an actor creation task.
    struct ActorCreationOptions {
same as above
Replied #4875 (comment)
    namespace ray {

    /// Type of this worker.
    enum class WorkerType { WORKER, DRIVER };
What other types make sense here, e.g., does it make sense to have an ACTOR type or a LOCAL_MODE_DRIVER type?
Yes, those two make sense. That's why WorkerType is an enum instead of a boolean is_driver. But I'd like to defer adding them until we actually implement the corresponding functionality.
src/ray/common/buffer.h
Outdated
    virtual uint8_t *Data() = 0;

    /// Length of this buffer.
    virtual size_t Length() = 0;
It's better to name this Size() to follow STL style?
I feel both are fine. I can change it.
src/ray/common/buffer.h
Outdated
| class LocalMemoryBuffer : public Buffer { | ||
| public: | ||
| uint8_t *Data() { return data_; } | ||
| size_t Length() { return length_; } |
Add override declaration for these 2 overriding methods:
uint8_t *Data() override { return data_; }
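A sketch of the hierarchy with `override` on both methods (using `Size()` per the rename discussed earlier; the constructor and virtual destructor are illustrative additions):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

class Buffer {
 public:
  virtual uint8_t *Data() = 0;
  virtual size_t Size() = 0;
  virtual ~Buffer() = default;
};

class LocalMemoryBuffer : public Buffer {
 public:
  LocalMemoryBuffer(uint8_t *data, size_t size) : data_(data), size_(size) {}

  // `override` makes the compiler reject a signature mismatch instead
  // of silently introducing a new, unrelated virtual function.
  uint8_t *Data() override { return data_; }
  size_t Size() override { return size_; }

 private:
  uint8_t *data_;
  size_t size_;
};
```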
    /// Argument of a task.
    struct Arg {
      /// Id of the argument, if passed by reference, otherwise nullptr.
What do you think of adding an explicit flag to indicate whether we pass args by ref or by value, for better readability? It would also let us get rid of if (id != nullptr) {}.
Yeah, will do.
    ///
    /// \param[in] worker_type Type of this worker.
    /// \param[in] language Language of this worker.
    CoreWorker(const WorkerType worker_type, const Language language)
Since worker_type and language are passed by value, it's unnecessary to declare them as const.
There's a minor difference: using const prevents the argument from being modified inside the function body.
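Both points are consistent with how C++ treats top-level const on by-value parameters: it is ignored in the function's type and only constrains the definition's body. A minimal sketch (the `Twice` function is illustrative):

```cpp
#include <cassert>

// These two declare the same function: top-level const on a by-value
// parameter is not part of the signature.
int Twice(int x);

int Twice(const int x) {
  // x += 1;  // would not compile: the local copy is const in this body
  return 2 * x;
}
```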
All comments should have been addressed.
@raulchen looks good to me, however compilation is failing in the tests.
CI failure fixed.
* [rllib] Remove dependency on TensorFlow (ray-project#4764) * remove hard tf dep * add test * comment fix * fix test * Dynamic Custom Resources - create and delete resources (ray-project#3742) * Update tutorial link in doc (ray-project#4777) * [rllib] Implement learn_on_batch() in torch policy graph * Fix `ray stop` by killing raylet before plasma (ray-project#4778) * Fatal check if object store dies (ray-project#4763) * [rllib] fix clip by value issue as TF upgraded (ray-project#4697) * fix clip_by_value issue * fix typo * [autoscaler] Fix submit (ray-project#4782) * Queue tasks in the raylet in between async callbacks (ray-project#4766) * Add a SWAP TaskQueue so that we can keep track of tasks that are temporarily dequeued * Fix bug where tasks that fail to be forwarded don't appear to be local by adding them to SWAP queue * cleanups * updates * updates * [Java][Bazel] Refine auto-generated pom files (ray-project#4780) * Bump version to 0.7.0 (ray-project#4791) * [JAVA] setDefaultUncaughtExceptionHandler to log uncaught exception in user thread. (ray-project#4798) * Add WorkerUncaughtExceptionHandler * Fix * revert bazel and pom * [tune] Fix CLI test (ray-project#4801) * Fix pom file generation (ray-project#4800) * [rllib] Support continuous action distributions in IMPALA/APPO (ray-project#4771) * [rllib] TensorFlow 2 compatibility (ray-project#4802) * Change tagline in documentation and README. (ray-project#4807) * Update README.rst, index.rst, tutorial.rst and _config.yml * [tune] Support non-arg submit (ray-project#4803) * [autoscaler] rsync cluster (ray-project#4785) * [tune] Remove extra parsing functionality (ray-project#4804) * Fix Java worker log dir (ray-project#4781) * [tune] Initial track integration (ray-project#4362) Introduces a minimally invasive utility for logging experiment results. A broad requirement for this tool is that it should integrate seamlessly with Tune execution. 
* [rllib] [RFC] Dynamic definition of loss functions and modularization support (ray-project#4795) * dynamic graph * wip * clean up * fix * document trainer * wip * initialize the graph using a fake batch * clean up dynamic init * wip * spelling * use builder for ppo pol graph * add ppo graph * fix naming * order * docs * set class name correctly * add torch builder * add custom model support in builder * cleanup * remove underscores * fix py2 compat * Update dynamic_tf_policy_graph.py * Update tracking_dict.py * wip * rename * debug level * rename policy_graph -> policy in new classes * fix test * rename ppo tf policy * port appo too * forgot grads * default policy optimizer * make default config optional * add config to optimizer * use lr by default in optimizer * update * comments * remove optimizer * fix tuple actions support in dynamic tf graph * [rllib] Rename PolicyGraph => Policy, move from evaluation/ to policy/ (ray-project#4819) This implements some of the renames proposed in ray-project#4813 We leave behind backwards-compatibility aliases for *PolicyGraph and SampleBatch. * [Java] Dynamic resource API in Java (ray-project#4824) * Add default values for Wgym flags * Fix import * Fix issue when starting `raylet_monitor` (ray-project#4829) * Refactor ID Serial 1: Separate ObjectID and TaskID from UniqueID (ray-project#4776) * Enable BaseId. * Change TaskID and make python test pass * Remove unnecessary functions and fix test failure and change TaskID to 16 bytes. 
* Java code change draft * Refine * Lint * Update java/api/src/main/java/org/ray/api/id/TaskId.java Co-Authored-By: Hao Chen <[email protected]> * Update java/api/src/main/java/org/ray/api/id/BaseId.java Co-Authored-By: Hao Chen <[email protected]> * Update java/api/src/main/java/org/ray/api/id/BaseId.java Co-Authored-By: Hao Chen <[email protected]> * Update java/api/src/main/java/org/ray/api/id/ObjectId.java Co-Authored-By: Hao Chen <[email protected]> * Address comment * Lint * Fix SINGLE_PROCESS * Fix comments * Refine code * Refine test * Resolve conflict * Fix bug in which actor classes are not exported multiple times. (ray-project#4838) * Bump Ray master version to 0.8.0.dev0 (ray-project#4845) * Add section to bump version of master branch and cleanup release docs (ray-project#4846) * Fix import * Export remote functions when first used and also fix bug in which rem… (ray-project#4844) * Export remote functions when first used and also fix bug in which remote functions and actor classes are not exported from workers during subsequent ray sessions. * Documentation update * Fix tests. * Fix grammar * Update wheel versions in documentation to 0.8.0.dev0 and 0.7.0. (ray-project#4847) * [tune] Later expansion of local_dir (ray-project#4806) * [rllib] [RFC] Deprecate Python 2 / RLlib (ray-project#4832) * Fix a typo in kubernetes yaml (ray-project#4872) * Move global state API out of global_state object. (ray-project#4857) * Install bazel in autoscaler development configs. 
(ray-project#4874) * [tune] Fix up Ax Search and Examples (ray-project#4851) * update Ax for cleaner API * docs update * [rllib] Update concepts docs and add "Building Policies in Torch/TensorFlow" section (ray-project#4821) * wip * fix index * fix bugs * todo * add imports * note on get ph * note on get ph * rename to building custom algs * add rnn state info * [rllib] Fix error getting kl when simple_optimizer: True in multi-agent PPO * Replace ReturnIds with NumReturns in TaskInfo to reduce the size (ray-project#4854) * Refine TaskInfo * Fix * Add a test to print task info size * Lint * Refine * Update deps commits of opencensus to support building with bzl 0.25.x (ray-project#4862) * Update deps to support bzl 2.5.x * Fix * Upgrade arrow to latest master (ray-project#4858) * [tune] Auto-init Ray + default SearchAlg (ray-project#4815) * Bump version from 0.8.0.dev0 to 0.7.1. (ray-project#4890) * [rllib] Allow access to batches prior to postprocessing (ray-project#4871) * [rllib] Fix Multidiscrete support (ray-project#4869) * Refactor redis callback handling (ray-project#4841) * Add CallbackReply * Fix * fix linting by format.sh * Fix linting * Address comments. * Fix * Initial high-level code structure of CoreWorker. (ray-project#4875) * Drop duplicated string format (ray-project#4897) This string format is unnecessary. java_worker_options has been appended to the commandline later. 
* Refactor ID Serial 2: change all ID functions to `CamelCase` (ray-project#4896) * Hotfix for change of from_random to FromRandom (ray-project#4909) * [rllib] Fix documentation on custom policies (ray-project#4910) * wip * add docs * lint * todo sections * fix doc * [rllib] Allow Torch policies access to full action input dict in extra_action_out_fn (ray-project#4894) * fix torch extra out * preserve setitem * fix docs * [tune] Pretty print params json in logger.py (ray-project#4903) * [sgd] Distributed Training via PyTorch (ray-project#4797) Implements distributed SGD using distributed PyTorch. * [rllib] Rough port of DQN to build_tf_policy() pattern (ray-project#4823) * fetching objects in parallel in _get_arguments_for_execution (ray-project#4775) * [tune] Disallow setting resources_per_trial when it is already configured (ray-project#4880) * disallow it * import fix * fix example * fix test * fix tests * Update mock.py * fix * make less convoluted * fix tests * [rllib] Rename PolicyEvaluator => RolloutWorker (ray-project#4820) * Fix local cluster yaml (ray-project#4918) * [tune] Directional metrics for components (ray-project#4120) (ray-project#4915) * [Core Worker] implement ObjectInterface and add test framework (ray-project#4899) * [tune] Make PBT Quantile fraction configurable (ray-project#4912) * Better organize ray_common module (ray-project#4898) * Fix error * Fix compute actions return value
(ray-project#4874) * [tune] Fix up Ax Search and Examples (ray-project#4851) * update Ax for cleaner API * docs update * [rllib] Update concepts docs and add "Building Policies in Torch/TensorFlow" section (ray-project#4821) * wip * fix index * fix bugs * todo * add imports * note on get ph * note on get ph * rename to building custom algs * add rnn state info * [rllib] Fix error getting kl when simple_optimizer: True in multi-agent PPO * Replace ReturnIds with NumReturns in TaskInfo to reduce the size (ray-project#4854) * Refine TaskInfo * Fix * Add a test to print task info size * Lint * Refine * Update deps commits of opencensus to support building with bzl 0.25.x (ray-project#4862) * Update deps to support bzl 2.5.x * Fix * Upgrade arrow to latest master (ray-project#4858) * [tune] Auto-init Ray + default SearchAlg (ray-project#4815) * Bump version from 0.8.0.dev0 to 0.7.1. (ray-project#4890) * [rllib] Allow access to batches prior to postprocessing (ray-project#4871) * [rllib] Fix Multidiscrete support (ray-project#4869) * Refactor redis callback handling (ray-project#4841) * Add CallbackReply * Fix * fix linting by format.sh * Fix linting * Address comments. * Fix * Initial high-level code structure of CoreWorker. (ray-project#4875) * Drop duplicated string format (ray-project#4897) This string format is unnecessary. java_worker_options has been appended to the commandline later. 
* Refactor ID Serial 2: change all ID functions to `CamelCase` (ray-project#4896) * Hotfix for change of from_random to FromRandom (ray-project#4909) * [rllib] Fix documentation on custom policies (ray-project#4910) * wip * add docs * lint * todo sections * fix doc * [rllib] Allow Torch policies access to full action input dict in extra_action_out_fn (ray-project#4894) * fix torch extra out * preserve setitem * fix docs * [tune] Pretty print params json in logger.py (ray-project#4903) * [sgd] Distributed Training via PyTorch (ray-project#4797) Implements distributed SGD using distributed PyTorch. * [rllib] Rough port of DQN to build_tf_policy() pattern (ray-project#4823) * fetching objects in parallel in _get_arguments_for_execution (ray-project#4775) * [tune] Disallow setting resources_per_trial when it is already configured (ray-project#4880) * disallow it * import fix * fix example * fix test * fix tests * Update mock.py * fix * make less convoluted * fix tests * [rllib] Rename PolicyEvaluator => RolloutWorker (ray-project#4820) * Fix local cluster yaml (ray-project#4918) * [tune] Directional metrics for components (ray-project#4120) (ray-project#4915) * [Core Worker] implement ObjectInterface and add test framework (ray-project#4899) * [tune] Make PBT Quantile fraction configurable (ray-project#4912) * Better organize ray_common module (ray-project#4898) * Fix error * [tune] Add requirements-dev.txt and update docs for contributing (ray-project#4925) * Add requirements-dev.txt and update docs. * Update doc/source/tune-contrib.rst Co-Authored-By: Richard Liaw <[email protected]> * Unpin everything except for yapf. * Fix compute actions return value * Bump version from 0.7.1 to 0.8.0.dev1. (ray-project#4937) * Update version number in documentation after release 0.7.0 -> 0.7.1 and 0.8.0.dev0 -> 0.8.0.dev1. 
(ray-project#4941) * [doc] Update developer docs with bazel instructions (ray-project#4944) * [C++] Add hash table to Redis-Module (ray-project#4911) * Flush lineage cache on task submission instead of execution (ray-project#4942) * [rllib] Add docs on how to use TF eager execution (ray-project#4927) * [rllib] Port remainder of algorithms to build_trainer() pattern (ray-project#4920) * Fix resource bookkeeping bug with acquiring unknown resource. (ray-project#4945) * Update aws keys for uploading wheels to s3. (ray-project#4948) * Upload wheels on Travis to branchname/commit_id. (ray-project#4949) * [Java] Fix serializing issues of `RaySerializer` (ray-project#4887) * Fix * Address comment. * fix (ray-project#4950) * [Java] Add inner class `Builder` to build call options. (ray-project#4956) * Add Builder class * format * Refactor by IDE * Remove uncessary dependency * Make release stress tests work and improve them. (ray-project#4955) * Use proper session directory for debug_string.txt (ray-project#4960) * [core] Use int64_t instead of int to keep track of fractional resources (ray-project#4959) * [core worker] add task submission & execution interface (ray-project#4922) * [sgd] Add non-distributed PyTorch runner (ray-project#4933) * Add non-distributed PyTorch runner * use dist.is_available() instead of checking OS * Nicer exception * Fix bug in choosing port * Refactor some code * Address comments * Address comments * Flush all tasks from local lineage cache after a node failure (ray-project#4964) * Remove typing from setup.py install_requirements. (ray-project#4971) * [Java] Fix bug of `BaseID` in multi-threading case. (ray-project#4974) * [rllib] Fix DDPG example (ray-project#4973) * Upgrade CI clang-format to 6.0 (ray-project#4976) * [Core worker] add store & task provider (ray-project#4966) * Fix bugs in the a3c code template. 
(ray-project#4984) * Inherit Function Docstrings and other metedata (ray-project#4985) * Fix a crash when unknown worker registering to raylet (ray-project#4992) * [gRPC] Use gRPC for inter-node-manager communication (ray-project#4968) * Fix Java CI failure (ray-project#4995) * fix handling of non-integral timeout values in signal.receive (ray-project#5002) * temp fix for build (ray-project#5006) * [tune] Tutorial UX Changes (ray-project#4990) * add integration, iris, ASHA, recursive changes, set reuse_actors=True, and enable Analysis as a return object * docstring * fix up example * fix * cleanup tests * experiment analysis * Fix valgrind build by installing new version of valgrind (ray-project#5008) * Fix no cpus test (ray-project#5009) * Fix tensorflow-1.14 installation in jenkins (ray-project#5007) * Add dynamic worker options for worker command. (ray-project#4970) * Add fields for fbs * WIP * Fix complition errors * Add java part * FIx * Fix * Fix * Fix lint * Refine API * address comments and add test * Fix * Address comment. * Address comments. * Fix linting * Refine * Fix lint * WIP: address comment. * Fix java * Fix py * Refin * Fix * Fix * Fix linting * Fix lint * Address comments * WIP * Fix * Fix * minor refine * Fix lint * Fix raylet test. * Fix lint * Update src/ray/raylet/worker_pool.h Co-Authored-By: Hao Chen <[email protected]> * Update java/runtime/src/main/java/org/ray/runtime/AbstractRayRuntime.java Co-Authored-By: Hao Chen <[email protected]> * Address comments. * Address comments. * Fix test. * Update src/ray/raylet/worker_pool.h Co-Authored-By: Hao Chen <[email protected]> * Address comments. * Address comments. * Fix * Fix lint * Fix lint * Fix * Address comments. 
* Fix linting * [docs] docs for running Tensorboard without sudo (ray-project#5015) * Instructions for running Tensorboard without sudo When we run Tensorboard to visualize the results of Ray outputs on multi-user clusters where we don't have sudo access, such as RISE clusters, a few commands need to first be run to make sure tensorboard can edit the tmp directory. This is a pretty common usecase so I figured we may as well put it in the documentation for Tune. * Update tune-usage.rst * [ci] Change Jenkins to py3 (ray-project#5022) * conda3 * integration * add nevergrad, remotedata * pytest 0.3.1 * otherdockers * setup * tune * [gRPC] Migrate gcs data structures to protobuf (ray-project#5024) * [rllib] Add QMIX mixer parameters to optimizer param list (ray-project#5014) * add mixer params * Update qmix_policy.py * [grpc] refactor rpc server to support multiple io services (ray-project#5023) * [rllib] Give error if sample_async is used with pytorch for A3C (ray-project#5000) * give error if sample_async is used with pytorch * update * Update a3c.py * [tune] Update MNIST Example (ray-project#4991) * Add entropy coeff schedule * Revert "Merge with ray master" This reverts commit 108bfa2, reversing changes made to 2e0eec9. * Revert "Revert "Merge with ray master"" This reverts commit 92c0f88. * Remove entropy decay stuff
What do these changes do?
This is the first PR: it adds the high-level code structure of CoreWorker, with empty implementations.
Related issue number
#4850
Linter
I've run scripts/format.sh to lint the changes in this PR.