"sa" placement method appears to hang, or is very slow #305

Open
m8pple opened this issue Jan 4, 2022 · 2 comments

Comments

@m8pple
Contributor

m8pple commented Jan 4, 2022

When using the "sa" placement method, it is unclear whether the placer is hanging or is
just very slow. For a graph with ~350 devices it does not complete within ~20 minutes. A
microlog is created, but its location is not echoed to the console, and no information
is ever written to it. Top shows "root" and "logserver" at 100%, but that is pretty normal, so it
isn't clear whether it is working on anything or just spinning.

I read user_guide.md and placement.md, and as far as I can tell no specific parameters for
"sa" are needed by default.

I tried both forms of the placement command from the docs, i.e.:

place /sa = *
placement /sa = *

The same behaviour is seen for the "gc" method - externally it appears to hang.

It's possible it is working in the background, but it isn't clear whether progress is
being made, or how long it might take.

I also tried with the smallest graph I had to hand with 64 nodes, and the same
behaviour is seen - microlog is created, nothing printed to console, and no
result within 10 minutes.

Context

  • Machine : jennings
  • Orchestrator version : e74e6ee
  • Input xml : stationary_water_7x7x7_1024.xml.gz
  • Commands:
    load /app = "/home/dt10/poets-dpd/experiments/orch-scaling/inputs/stationary_water_7x7x7_1024.xml"
    tlink /app = *
    place /sa = *
    
  • Overall log:
dt10@jennings:~/poets-dpd$ ../Orchestrator/orchestrate.sh -n
POETS> 14:33:11.01:  20(I) The microlog for the command 'load /engine = "../Config/POETSHardwareOneBox.ocfg"' will be written to '../Output/Microlog/Microlog_2022_01_04T14_33_11p0.plog'.
POETS> 14:33:11.01: 140(I) Topology loaded from file ||../Config/POETSHardwareOneBox.ocfg||.
POETS> load  /app = "/home/dt10/poets-dpd/experiments/orch-scaling/inputs/stationary_water_7x7x7_1024.xml"
tlink /app = *
place /sa = *POETS>POETS> 14:33:13.85:  23(I)  load  /app = "/home/dt10/poets-dpd/experiments/orch-scaling/inputs/stationary_water_7x7x7_1024.xml"
POETS> 14:33:13.85:  20(I) The microlog for the command ' load  /app = "/home/dt10/poets-dpd/experiments/orch-scaling/inputs/stationary_water_7x7x7_1024.xml"' will be written to '../Output/Microlog/Microlog_2022_01_04T14_33_12p0.plog'.
POETS> 14:33:13.85: 235(I) Application file /home/dt10/poets-dpd/experiments/orch-scaling/inputs/stationary_water_7x7x7_1024.xml loading...
POETS> 14:33:13.85:  65(I) Application file /home/dt10/poets-dpd/experiments/orch-scaling/inputs/stationary_water_7x7x7_1024.xml loaded in 1844 ms.
POETS> 14:33:13.85:  23(I) tlink /app = *
POETS> 14:33:13.85:  20(I) The microlog for the command 'tlink /app = *' will be written to '../Output/Microlog/Microlog_2022_01_04T14_33_13p0.plog'.
POETS> 14:33:13.85: 234(I) Typelinking graph instance 'blurble'...
POETS> 14:33:13.85: 249(I) Successfully typelinked graph instance 'blurble'.
POETS>
  • Microlog:
========================================================================================================================
04/01/2022 14:33:16.98 file ../Output/Microlog/Microlog_2022_01_04T14_33_16p0.plog
command [place /sa = *]
from console
========================================================================================================================

@mvousden
Contributor

mvousden commented Jan 5, 2022

The sa algorithm (simulated annealing) and the gradient climber algorithm are both quite naive - their only stopping condition is a fixed iteration count (currently 1e8, defined in SimulatedAnnealing.h).

This is (obviously) not good, unless you're placing something overnight and are looking to reuse that placement configuration later. More practical/useful would be:

  • The ability to define a "configuration option" that defines the number of iterations to perform.
  • The ability to define a "configuration option" that stops annealing after a certain amount of time has passed.

and secondarily,

  • Annealing in parallel.
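
As a rough illustration of the first two options, here is a minimal sketch of an annealing loop that stops on whichever comes first: an iteration budget or a wall-clock limit. Everything in it is hypothetical (AnnealConfig, AnnealResult, toyCost, anneal, the swap move and the cooling schedule); none of it is taken from the Orchestrator source, and the toy cost function only stands in for a real placement cost.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <random>
#include <utility>
#include <vector>

// Hypothetical stopping configuration (names are illustrative only).
struct AnnealConfig {
    std::uint64_t maxIterations = 100000000;   // 1e8, the count quoted above
    std::chrono::milliseconds maxWallTime{0};  // 0 means "no time limit"
};

struct AnnealResult {
    std::uint64_t iterations;  // iterations actually performed
    double cost;               // cost of the final placement
    bool stoppedOnTime;        // true if the wall-clock limit ended the run
};

// Toy cost standing in for a placement cost: total displacement from identity.
double toyCost(const std::vector<int>& placement) {
    double total = 0.0;
    for (std::size_t i = 0; i < placement.size(); ++i)
        total += std::abs(placement[i] - static_cast<int>(i));
    return total;
}

AnnealResult anneal(std::vector<int>& placement, const AnnealConfig& cfg,
                    unsigned seed = 1) {
    assert(placement.size() >= 2);
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, placement.size() - 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);

    double cost = toyCost(placement);
    std::uint64_t it = 0;
    bool stoppedOnTime = false;
    for (; it < cfg.maxIterations; ++it) {
        // Read the clock only every 1024 iterations to keep the check cheap.
        if (cfg.maxWallTime.count() > 0 && (it & 1023u) == 0 &&
            Clock::now() - start >= cfg.maxWallTime) {
            stoppedOnTime = true;
            break;
        }
        // Propose a move: swap two randomly chosen entries.
        const std::size_t a = pick(rng);
        const std::size_t b = pick(rng);
        std::swap(placement[a], placement[b]);
        const double candidate = toyCost(placement);
        // Assumed cooling schedule: temperature falls as 1 / (1 + iteration).
        const double temperature = 1.0 / (1.0 + static_cast<double>(it));
        if (candidate <= cost ||
            unit(rng) < std::exp((cost - candidate) / temperature)) {
            cost = candidate;                       // accept the move
        } else {
            std::swap(placement[a], placement[b]);  // reject: undo the swap
        }
    }
    return {it, cost, stoppedOnTime};
}
```

Calling anneal with maxIterations = 1000 performs exactly 1000 iterations, while setting maxWallTime to, say, 20 ms and leaving maxIterations at 1e8 returns once the time limit is reached. Checking the clock every 1024 iterations is a design choice to keep the overhead of the time check negligible; it means the loop can overshoot the limit by up to 1023 iterations.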

@m8pple
Contributor Author

m8pple commented Jan 6, 2022

From a documentation/user point of view it would be good to add some guidance on this
around the "sa" part of the user guide.

For example, extend this part:

placement /sa: Given a typelinked application graph (or multiple), places it using simulated
annealing (with a random initial condition). The number of iterations can be defined at compile
time (and will later be more easily configurable).

with something like: "Note: with the current iteration count this placement method is mainly intended
for offline generation of placements. Typical run-time for a graph with X devices might be
around Y hours."
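
One possible way to arrive at a value for Y in a note like this is to time a small sample of iterations and extrapolate linearly to the full iteration count. The sketch below assumes the per-iteration cost stays roughly constant over the run, which may not hold for a real placer; extrapolateSeconds and estimateTotalSeconds are hypothetical helpers, not Orchestrator functions, and the numbers in the usage note are made up for illustration rather than measured.

```cpp
#include <chrono>
#include <cstdint>

// Linear extrapolation: assumes every iteration costs about the same.
double extrapolateSeconds(double sampleSeconds,
                          std::uint64_t sampleIterations,
                          std::uint64_t totalIterations) {
    return sampleSeconds * (static_cast<double>(totalIterations) /
                            static_cast<double>(sampleIterations));
}

// Times `sampleIterations` calls of `step` (one annealing iteration) and
// scales the elapsed time up to `totalIterations`.
template <typename Step>
double estimateTotalSeconds(Step&& step,
                            std::uint64_t sampleIterations,
                            std::uint64_t totalIterations) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (std::uint64_t i = 0; i < sampleIterations; ++i) {
        step();
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    return extrapolateSeconds(elapsed.count(), sampleIterations,
                              totalIterations);
}
```

For example, if 1000 sample iterations were to take 0.5 s, extrapolateSeconds(0.5, 1000, 100000000) gives 50000 s, or roughly 14 hours for the full 1e8 iterations.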
