Commit 46f8cad

added docs
1 parent b72a18d commit 46f8cad

4 files changed: +79 -6 lines changed

base/initdefs.jl

+1 -1
@@ -73,7 +73,7 @@ function init_parallel()
     global PGRP
     global LPROC
     LPROC.id = 1
-    LPROC.cookie = randstring()
+    cluster_cookie(randstring())
     assert(isempty(PGRP.workers))
     register_worker(LPROC)
 end

base/multi.jl

+13 -1
@@ -281,8 +281,20 @@ end

 const LPROC = LocalProcess()

+"""
+    cluster_cookie() -> cookie
+
+Returns the cluster cookie.
+"""
 cluster_cookie() = LPROC.cookie

+"""
+    cluster_cookie(cookie) -> cookie
+
+Sets and returns the cluster cookie.
+"""
+cluster_cookie(cookie) = (LPROC.cookie = cookie; cookie)
+
 const map_pid_wrkr = Dict{Int, Union{Worker, LocalProcess}}()
 const map_sock_wrkr = ObjectIdDict()
 const map_del_wrkr = Set{Int}()
@@ -1213,7 +1225,7 @@ function init_worker(cookie::AbstractString, manager::ClusterManager=DefaultClus
     empty!(PGRP.workers)
     empty!(map_pid_wrkr)

-    LPROC.cookie = cookie
+    cluster_cookie(cookie)
     nothing
 end
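
Note on the accessor pattern: the two `cluster_cookie` methods above form a getter/setter pair dispatched on argument count. The sketch below reproduces that pattern in isolation so it can be run outside of Base. `DemoProcess`, `DEMO_LPROC`, and `demo_cookie` are hypothetical stand-ins invented for illustration (they are not part of this commit), and the syntax assumes a current Julia release (`mutable struct` rather than the `type` keyword in use when this commit was made).

```julia
# Hypothetical stand-in for Base's LocalProcess; only the cookie field matters here.
mutable struct DemoProcess
    cookie::String
end

# Single process-wide instance, mirroring the role of LPROC in the diff.
const DEMO_LPROC = DemoProcess("")

# Zero-argument method: read the stored cookie.
demo_cookie() = DEMO_LPROC.cookie

# One-argument method: store the cookie, then return it.
demo_cookie(cookie) = (DEMO_LPROC.cookie = cookie; cookie)

demo_cookie("s3cret")
@assert demo_cookie() == "s3cret"
```

Routing every write through the one-argument method, as `init_parallel` and `init_worker` now do, leaves a single place to add validation later.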
doc/manual/parallel-computing.rst

+53 -4
@@ -808,14 +808,14 @@ signals that all requested workers have been launched. Hence the :func:`launch`
 as all the requested workers have been launched.

 Newly launched workers are connected to each other, and the master process, in a all-to-all manner.
-Specifying command argument, ``--worker`` results in the launched processes initializing themselves
+Specifying the command argument ``--worker <cookie>`` results in the launched processes initializing themselves
 as workers and connections being setup via TCP/IP sockets. Optionally ``--bind-to bind_addr[:port]``
 may also be specified to enable other workers to connect to it at the specified ``bind_addr`` and ``port``.
 This is useful for multi-homed hosts.

 For non-TCP/IP transports, for example, an implementation may choose to use MPI as the transport,
-``--worker`` must NOT be specified. Instead newly launched workers should call ``init_worker()``
-before using any of the parallel constructs
+``--worker`` must NOT be specified. Instead, newly launched workers should call ``init_worker(cookie)``
+before using any of the parallel constructs.

 For every worker launched, the :func:`launch` method must add a :class:`WorkerConfig`
 object (with appropriate fields initialized) to ``launched`` ::
@@ -918,7 +918,7 @@ When using custom transports:
   workers defaulting to the TCP/IP socket transport implementation
 - For every incoming logical connection with a worker, ``Base.process_messages(rd::AsyncStream, wr::AsyncStream)`` must be called.
   This launches a new task that handles reading and writing of messages from/to the worker represented by the ``AsyncStream`` objects
-- ``init_worker(manager::FooManager)`` MUST be called as part of worker process initializaton
+- ``init_worker(cookie, manager::FooManager)`` MUST be called as part of worker process initialization
 - Field ``connect_at::Any`` in :class:`WorkerConfig` can be set by the cluster manager when ``launch`` is called. The value of
   this field is passed in in all ``connect`` callbacks. Typically, it carries information on *how to connect* to a worker. For example,
   the TCP/IP socket transport uses this field to specify the ``(host, port)`` tuple at which to connect to a worker
@@ -929,6 +929,55 @@ implementation simply executes an ``exit()`` call on the specified remote worker

 ``examples/clustermanager/simple`` is an example that shows a simple implementation using unix domain sockets for cluster setup

+Network requirements for LocalManager and SSHManager
+----------------------------------------------------
+Julia clusters are designed to run in already secured environments, on infrastructure ranging from local laptops
+to departmental clusters or even the cloud. This section covers network security requirements for the inbuilt ``LocalManager``
+and ``SSHManager``:
+
+- The master process does not listen on any port. It only connects out to the workers.
+
+- Each worker binds to only one of the local interfaces and listens on the first free port starting from 9009.
+
+- ``LocalManager``, i.e. ``addprocs(N)``, by default binds only to the loopback interface.
+  This means that workers subsequently started on remote hosts, or anyone with malicious intent,
+  cannot connect to the cluster. An ``addprocs(4)`` followed by an ``addprocs(["remote_host"])``
+  will fail. Some users may need to create a cluster comprising their local system and a few remote systems. This can be done by
+  explicitly requesting ``LocalManager`` to bind to an external network interface via the ``restrict`` keyword
+  argument: ``addprocs(4; restrict=false)``.
+
+- ``SSHManager``, i.e. ``addprocs(list_of_remote_hosts)``, launches workers on remote hosts via SSH.
+  Note that by default SSH is only used to launch Julia workers.
+  Subsequent master-worker and worker-worker connections use plain, unencrypted TCP/IP sockets. The remote hosts
+  must have passwordless login enabled. Additional SSH flags or credentials may be specified via the keyword
+  argument ``sshflags``.
+
+- ``addprocs(list_of_remote_hosts; tunnel=true, sshflags=<ssh keys and other flags>)`` is useful when we wish to use
+  SSH connections for master-worker traffic too. A typical scenario for this is a local laptop running the Julia REPL (i.e., the master)
+  with the rest of the cluster on the cloud, say on Amazon EC2. You will need to open only port 22 into the remote cluster, with
+  SSH clients authenticated via PKI. ``sshflags`` can specify ``-i <keyfile>`` for this purpose.
+
+Note that worker-worker connections are still plain TCP, and the local security policy on the remote cluster
+must allow for free connections between worker nodes, at least for ports 9009 and above.
+
+Securing and encrypting all worker-worker traffic (via SSH), or encrypting individual messages, can be done via
+a custom ``ClusterManager``.
+
+Cluster cookie
+--------------
+All processes in a cluster share the same cookie which, by default, is a randomly generated string on the master process:
+
+- ``cluster_cookie()`` returns the cookie, ``cluster_cookie(cookie)`` sets it.
+- All connections are authenticated on both sides to ensure that only workers started by the master are allowed
+  to connect to each other.
+- The cookie must be passed to the workers at startup via the argument ``--worker <cookie>``.
+  Custom ``ClusterManagers`` can retrieve the cookie on the master by calling
+  ``cluster_cookie()``. Cluster managers not using the default TCP/IP transport (and hence not specifying ``--worker``)
+  must call ``init_worker(cookie, manager)`` with the same cookie as on the master.
+
+Note that environments requiring higher levels of security can implement this via a custom ``ClusterManager``. For example,
+cookies can be pre-shared and hence not specified as a startup argument.
+

 Specifying network topology (Experimental)
 -------------------------------------------
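
Note on the port rule: the added section says each worker listens on "the first free port starting from 9009". A minimal sketch of that search, assuming a current Julia release where the `Sockets` standard library provides `listenany`, is shown below. It illustrates the port-probing idea only; whether workers bind this way in any given Julia version is an assumption, and the loopback address is chosen purely for the example.

```julia
using Sockets

# Ask for the first free port at or above the hint (9009, per the docs above),
# bound to the loopback interface only.
port, server = listenany(ip"127.0.0.1", 9009)

println("listening on 127.0.0.1:", Int(port))

# Release the port again; a real worker would keep the server open.
close(server)
```

The returned port can be higher than 9009 when earlier ports are taken, which is why the docs ask for firewall rules covering "ports 9009 and above" rather than a single port.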

doc/stdlib/parallel.rst

+12
@@ -535,6 +535,18 @@ General Parallel Computing Support

    A low-level API which given a ``IO`` connection, returns the pid of the worker it is connected to. This is useful when writing custom ``serialize`` methods for a type, which optimizes the data written out depending on the receiving process id.

+.. function:: cluster_cookie() -> cookie
+
+   .. Docstring generated from Julia source
+
+   Returns the cluster cookie.
+
+.. function:: cluster_cookie(cookie) -> cookie
+
+   .. Docstring generated from Julia source
+
+   Sets and returns the cluster cookie.
+
 Shared Arrays
 -------------
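
Note on usage: a short, hedged example of the two documented methods follows. It assumes a current Julia release, where these functions live in the `Distributed` standard library rather than in Base as in this commit. The cookie value `"demo-cookie"` is made up for illustration. The comparison is against the value returned by the setter, not the literal, in case the library normalizes the string before storing it.

```julia
using Distributed

# Set the cookie on the master, ideally before any workers are added.
# The setter returns the value that was actually stored.
stored = Distributed.cluster_cookie("demo-cookie")

# The getter returns whatever the setter stored.
@assert Distributed.cluster_cookie() == stored
```

A custom cluster manager would read the cookie this way on the master and hand it to each worker, either via ``--worker <cookie>`` or through ``init_worker(cookie, manager)``.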