@@ -808,14 +808,14 @@ signals that all requested workers have been launched. Hence the :func:`launch`
as all the requested workers have been launched.

Newly launched workers are connected to each other, and the master process, in an all-to-all manner.
- Specifying command argument, ``--worker`` results in the launched processes initializing themselves
+ Specifying the command argument ``--worker <cookie>`` results in the launched processes initializing themselves
as workers and connections being set up via TCP/IP sockets. Optionally, ``--bind-to bind_addr[:port]``
may also be specified to enable other workers to connect to it at the specified ``bind_addr`` and ``port``.
This is useful for multi-homed hosts.

For non-TCP/IP transports (for example, an implementation may choose to use MPI as the transport),
- ``--worker`` must NOT be specified. Instead newly launched workers should call ``init_worker()``
- before using any of the parallel constructs
+ ``--worker`` must NOT be specified. Instead, newly launched workers should call ``init_worker(cookie)``
+ before using any of the parallel constructs.

For every worker launched, the :func:`launch` method must add a :class:`WorkerConfig`
object (with appropriate fields initialized) to ``launched`` ::
@@ -918,7 +918,7 @@ When using custom transports:
workers defaulting to the TCP/IP socket transport implementation
- For every incoming logical connection with a worker, ``Base.process_messages(rd::AsyncStream, wr::AsyncStream)`` must be called.
  This launches a new task that handles reading and writing of messages from/to the worker represented by the ``AsyncStream`` objects
- - ``init_worker(manager::FooManager)`` MUST be called as part of worker process initializaton
+ - ``init_worker(cookie, manager::FooManager)`` MUST be called as part of worker process initialization
- Field ``connect_at::Any`` in :class:`WorkerConfig` can be set by the cluster manager when ``launch`` is called. The value of
  this field is passed to all ``connect`` callbacks. Typically, it carries information on *how to connect* to a worker. For example,
  the TCP/IP socket transport uses this field to specify the ``(host, port)`` tuple at which to connect to a worker
@@ -929,6 +929,55 @@ implementation simply executes an ``exit()`` call on the specified remote worker
``examples/clustermanager/simple`` is an example that shows a simple implementation using Unix domain sockets for cluster setup

+ Network requirements for LocalManager and SSHManager
+ ----------------------------------------------------
+ Julia clusters are designed to be executed on already secured environments, on infrastructure ranging from local laptops
+ to departmental clusters or even the cloud. This section covers network security requirements for the inbuilt ``LocalManager``
+ and ``SSHManager``:
+
+ - The master process does not listen on any port. It only connects out to the workers.
+
+ - Each worker binds to only one of the local interfaces and listens on the first free port starting from 9009.
+
+ - ``LocalManager``, i.e. ``addprocs(N)``, by default binds only to the loopback interface.
+   This means that workers subsequently started on remote hosts, or anyone with malicious intentions,
+   are unable to connect to the cluster. An ``addprocs(4)`` followed by an ``addprocs(["remote_host"])``
+   will fail. Some users may need to create a cluster comprising their local system and a few remote systems. This can be done by
+   explicitly requesting ``LocalManager`` to bind to an external network interface via the ``restrict`` keyword
+   argument: ``addprocs(4; restrict=false)``.
+
+ - ``SSHManager``, i.e. ``addprocs(list_of_remote_hosts)``, launches workers on remote hosts via SSH.
+   Note that by default SSH is only used to launch Julia workers.
+   Subsequent master-worker and worker-worker connections use plain, unencrypted TCP/IP sockets. The remote hosts
+   must have passwordless login enabled. Additional SSH flags or credentials may be specified via the keyword
+   argument ``sshflags``.
+
+ - ``addprocs(list_of_remote_hosts; tunnel=true, sshflags=<ssh keys and other flags>)`` is useful when we wish to use
+   SSH connections for master-worker traffic too. A typical scenario for this is a local laptop running the Julia REPL (i.e., the master)
+   with the rest of the cluster on the cloud, say on Amazon EC2. You will need to open only port 22 into the remote cluster, with
+   SSH clients authenticated via public key infrastructure (PKI). ``sshflags`` can specify ``-i <keyfile>`` for this purpose.
+
+ Note that worker-worker connections are still plain TCP and the local security policy on the remote cluster
+ must allow for free connections between worker nodes, at least for ports 9009 and above.
+
+ Securing and encrypting all worker-worker traffic (via SSH), or encrypting individual messages can be done via
+ a custom ClusterManager.
+
+ Cluster cookie
+ --------------
+ All processes in a cluster share the same cookie which, by default, is a randomly generated string on the master process:
+
+ - ``cluster_cookie()`` returns the cookie, while ``cluster_cookie(cookie)`` sets it.
+ - All connections are authenticated on both sides to ensure that only workers started by the master are allowed
+   to connect to each other.
+ - The cookie must be passed to the workers at startup via the argument ``--worker <cookie>``.
+   Custom ``ClusterManagers`` can retrieve the cookie on the master by calling
+   ``cluster_cookie()``. Cluster managers not using the default TCP/IP transport (and hence not specifying ``--worker``)
+   must call ``init_worker(cookie, manager)`` with the same cookie as on the master.
+
+ Note that environments requiring higher levels of security (for example, where cookies are pre-shared and hence not
+ specified as a startup argument) can implement this via a custom ClusterManager.
+
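
As a minimal sketch of the cookie behaviour described above, the following checks that locally launched workers report the same cookie as the master. It is an illustration rather than part of this change, and it assumes a recent Julia release in which these functions live in the ``Distributed`` standard library and ``cluster_cookie`` is called with the module prefix; at the time of this change they were provided by ``Base``.

```julia
# Assumption: a recent Julia where the parallel API is in the Distributed stdlib.
using Distributed

# On the master, the cookie defaults to a randomly generated string.
cookie = Distributed.cluster_cookie()

# Launch two local workers; they receive the master's cookie at startup.
pids = addprocs(2)

# Ask each worker for its cookie; all of them should match the master's.
worker_cookies = [remotecall_fetch(Distributed.cluster_cookie, p) for p in pids]
println(all(==(cookie), worker_cookies))

# Shut the workers down again.
rmprocs(pids)
```

A custom cluster manager would read the same value on the master with ``cluster_cookie()`` and hand it to ``init_worker(cookie, manager)`` on each worker.
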
Specifying network topology (Experimental)
-------------------------------------------