Description
@grondo's original suggestion for a feature to flux-jobspec:
... isn't the key difference between `flux jobspec srun` and `sbatch` that `sbatch` should insert a `flux start` or `flux broker` before the provided `COMMAND` (which is assumed to be a batch script), and ensure that the tasks run once per node? (or something similar)
@SteVwonder's reply:
... To make that last piece about one per node work, I think `--nodes` would have to be a required argument. If it is optional and the user only specifies `--cores`, then the tool will have to make a determination/assumption about the number of cores per node... actually, now that I think about this, is it even possible to specify that kind of request in the canonical jobspec? The request I'm thinking of is along the lines of: "I want N cores. I don't care about the number of nodes, but I want 1 process per node."
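To make the concern above concrete, here is a minimal Python sketch of what an `sbatch`-style generator would have to do. The function name `sbatch_like_jobspec`, its parameters, and the slot label `batch` are hypothetical, and the dictionary layout is only modeled loosely on the canonical jobspec (a `node` containing a `slot` containing `core`s); it is not the actual flux-jobspec implementation. The point is that a node count must come from somewhere: either the user supplies it, or the tool assumes a number of cores per node.

```python
import math


def sbatch_like_jobspec(command, cores, nodes=None, assumed_cores_per_node=None):
    """Build a jobspec-like dict that runs one task per node (illustrative only)."""
    if nodes is None:
        # Without --nodes, the only way to fix the node count is to assume
        # how many cores each node provides.
        if assumed_cores_per_node is None:
            raise ValueError("need a node count or an assumed cores-per-node value")
        nodes = math.ceil(cores / assumed_cores_per_node)

    # Spread the requested cores evenly, rounding up per node.
    cores_per_node = math.ceil(cores / nodes)

    return {
        "resources": [{
            "type": "node",
            "count": nodes,  # an exact count is forced here
            "with": [{
                "type": "slot",
                "label": "batch",
                "count": 1,  # one slot per node
                "with": [{"type": "core", "count": cores_per_node}],
            }],
        }],
        "tasks": [{
            # Prepend `flux start` so the batch script runs inside a new instance.
            "command": ["flux", "start"] + list(command),
            "slot": "batch",
            "count": {"per_slot": 1},  # one task per slot, hence one per node
        }],
    }
```

Calling `sbatch_like_jobspec(["./batch.sh"], cores=16)` with neither `nodes` nor `assumed_cores_per_node` raises an error, which mirrors the observation that "N cores, any number of nodes, one process per node" has no direct expression in this layout.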
@grondo's reply:
It does feel like we got ourselves into a bind by requiring that the `slot` shape be defined as part of the resource request. There are certain things that are now not possible or very difficult to express (though it could just be my misunderstanding of the current jobspec definition). In general, I don't see how you can define a slot at a parent of the resource type you want to request without affecting the exact count (and exclusivity) of the parent resource type. If that makes any sense, the specific example is what we're discussing here, where you want N cores with a slot of a node, but don't care how many nodes. In fact, the exact slot shape can only be determined from a concrete resource set.
A couple ideas on how to handle this come to mind:
- add a concept of predefined slots that match resource names, e.g. `node`, `socket`, `core`, that allow the slot to be left out of the jobspec resources section, e.g. in the case of `flux jobspec sbatch` you could emit a resources section with no slot and a `task:` with `slot:` label `node`
- add a special value for `count:` like `any` or `0` that indicates no preference for the number of this type, and any slot that contains an `any` resource as a direct child would by definition apply to exactly 1 of that resource.
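The two ideas above might look roughly like the following, written as Python dictionaries. Both shapes are hypothetical: neither predefined slot names nor an `any` count exists in the jobspec being discussed, and the function names and the `pernode` label are invented for illustration. In particular, the thread does not say where the total core count would go under an `any` node, so the second sketch simply nests it under the node as an assumption.

```python
def predefined_slot_jobspec(command, cores):
    """Idea 1 (hypothetical): no slot in resources; the task names a built-in slot."""
    return {
        # Only the resource the user actually cares about is requested.
        "resources": [{"type": "core", "count": cores}],
        "tasks": [{
            "command": list(command),
            # "node" is assumed to be a predefined slot matching the resource
            # name, resolved once a concrete resource set is known.
            "slot": "node",
            "count": {"per_slot": 1},
        }],
    }


def any_count_jobspec(command, cores):
    """Idea 2 (hypothetical): a node count of "any" under an explicit slot."""
    return {
        "resources": [{
            "type": "slot",
            "label": "pernode",
            "count": 1,
            "with": [{
                "type": "node",
                # No preference for the number of nodes; per the proposal, the
                # enclosing slot then applies to exactly 1 of each matched node.
                "count": "any",
                # Assumption: the core request is nested here; the thread
                # leaves its exact placement and meaning open.
                "with": [{"type": "core", "count": cores}],
            }],
        }],
        "tasks": [{
            "command": list(command),
            "slot": "pernode",
            "count": {"per_slot": 1},
        }],
    }
```

In both variants the generator no longer has to invent a node count, which is the property the `sbatch` case needs.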