-
Notifications
You must be signed in to change notification settings - Fork 245
Task Simulator #1565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Task Simulator #1565
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 10 of 28 files at r1, all commit messages.
Reviewable status: 10 of 28 files reviewed, 4 unresolved discussions (waiting on @Marsella8)
lib/compiler/src/compiler/cost_estimator/cost_estimator.cc line 1 at r1 (raw file):
Why a newline?
lib/compiler/test/src/compiler/cost_estimator_for_test.h line 37 at r1 (raw file):
CostEstimator make_fake_constant_cost_estimator(float const &op_cost, float const &comm_cost);
Generally pass primitives by value
Suggestion:
CostEstimator make_fake_constant_cost_estimator(float op_cost,
float comm_cost);lib/compiler/include/compiler/cost_estimator/timed_dependency.struct.toml line 1 at r1 (raw file):
Why a leading newline in a bunch of these files?
lib/compiler/src/compiler/allowed_machine_views.cc line 27 at r1 (raw file):
OperatorTaskSpace const &task, MachineSpecification const &ms) { if (num_dims(mv) != num_dims(task)) {
Is this true? I don't remember where projection is handled--is a (8,) machine view a valid view for a (2, 4) task? Just double-checking 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 18 of 28 files at r1.
Reviewable status: all files reviewed, 22 unresolved discussions (waiting on @Marsella8)
lib/compiler/include/compiler/cost_estimator/task_simulator.h line 10 at r1 (raw file):
namespace FlexFlow { float task_simulator_forward_pass(ParallelComputationGraph const &pcg,
Suggestion:
float task_simulator_estimate_forward_pass_timelib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 38 at r1 (raw file):
TEST_SUITE(FF_TEST_SUITE) { CostEstimator estimator = make_fake_constant_cost_estimator(/*op_cost*/ 10.0f, /*comm_cost*/ 1.0f);
I'd recommend at least also doing some tests with a cost estimator that differentiates between different ops and comms to validate that the right ops and comms are getting passed in.
Code quote:
make_fake_constant_cost_estimator(/*op_cost*/ 10.0f, /*comm_cost*/ 1.0f);lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 41 at r1 (raw file):
MachineSpecification machine_spec = MachineSpecification{3, 3, 3, 1, 1}; TEST_CASE("task_simulator: linear graph") {
Suggestion:
TEST_SUITE(FF_TEST_SUITE) {
TEST_CASE("task_simulator_forward_pass") {
CostEstimator estimator =
make_fake_constant_cost_estimator(/*op_cost*/ 10.0f, /*comm_cost*/ 1.0f);
MachineSpecification machine_spec = MachineSpecification{3, 3, 3, 1, 1};
SUBCASE("linear graph") {lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 166 at r1 (raw file):
MachineViewDimension{stride_t{1}, MachineSpecificationDimension::INTER_NODE}, }};
Might make things a bit more readable
Suggestion:
std::vector<MachineViewDimension> dims =
{
MachineViewDimension{stride_t{1},
MachineSpecificationDimension::INTER_NODE},
MachineViewDimension{stride_t{1},
MachineSpecificationDimension::INTER_NODE},
MachineViewDimension{stride_t{1},
MachineSpecificationDimension::INTER_NODE},
};
MachineView mv0 = MachineView{
MachineSpaceCoordinate{0, 0, DeviceType::GPU},
dims,
};
MachineView mv1 = MachineView{
MachineSpaceCoordinate{0, 1, DeviceType::GPU},
dims};
MachineView mv2 = MachineView{
MachineSpaceCoordinate{1, 0, DeviceType::GPU},
dims};
MachineView mv3 = MachineView{
MachineSpaceCoordinate{1, 1, DeviceType::GPU},
dims};lib/utils/src/utils/graph/dataflow_graph/algorithms/get_outgoing_edges.cc line 6 at r1 (raw file):
namespace FlexFlow { std::unordered_set<DataflowEdge> get_outgoing_edges(DataflowGraphView const &g,
Add unit tests for these
lib/compiler/src/compiler/machine_mapping/machine_mapping_problem_tree/unmapped_op_cost_estimate_key.cc line 36 at r1 (raw file):
} OpCostEstimateKey get_mapped_op_cost_estimate_key_for_layer(
Should probably be moved to the file for OpCostEstimateKey instead of here, as the signature doesn't really involve UnmappedOpCostEstimateKey
lib/pcg/src/pcg/machine_specification.cc line 55 at r1 (raw file):
std::unordered_set<device_id_t> get_device_ids(MachineSpecification const &ms,
Normally we don't have dedicated overloads for a function f that are just (transform f)--is there a compelling argument why this is needed here? The problem is adding them in explodes the number of functions you have to write, and they don't contribute a lot as you already have transform
lib/pcg/src/pcg/machine_view.cc line 46 at r1 (raw file):
if (strides.size() != dims.size()) { throw mk_runtime_error( fmt::format("Dimensions of {} and {} must match when calling "
Suggestion:
fmt::format("Length of strides ({}) and dims ({}) must match when calling "lib/pcg/src/pcg/machine_view.cc line 64 at r1 (raw file):
MachineSpecification const &machine_specification) { assert(num_dims(machine_view) == task.degrees.size());
Prefer a throw with an actual error message
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 0 at r1 (raw file):
Overall this implementation is pretty solid, and the lambdas do a lot to improve readability. If it's possible to pull out some composite data structures so that not all of the raw data structure manipulation has to be done directly, that might be nice--don't know if it's possible, but would be worth considering
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 68 at r1 (raw file):
} float task_simulator_forward_pass(ParallelComputationGraph const &pcg,
It would be nice to have the task simulator have the ability to output a trace/profile object that essentially contains the execution times and device assignments of each task for the purposes of eventually rendering it in something like a legion-prof format (https://legion.stanford.edu/profiling/index.html) for easier understanding/visualization
Probably shouldn't be done in this PR, created an issue #1568 and assigned it to you
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 82 at r1 (raw file):
DeduplicatedPriorityQueue<TimedDependency, std::vector<TimedDependency>> dependency_processing; std::unordered_set<TimedDependency> processed_dependencies;
Is this a cohesive data structure that could be factored out? Seems like you're using it for both TimedLayer and TimedDependency? Not a huge deal, just might simplify a bit of the code
Code quote:
DeduplicatedPriorityQueue<TimedDependency, std::vector<TimedDependency>>
dependency_processing;
std::unordered_set<TimedDependency> processed_dependencies;lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 82 at r1 (raw file):
DeduplicatedPriorityQueue<TimedDependency, std::vector<TimedDependency>> dependency_processing; std::unordered_set<TimedDependency> processed_dependencies;
Any reason for using separate priority queues instead of having a single priority queue that holds something like type Task = (float, variant<Layer, Edge>)?
Code quote:
DeduplicatedPriorityQueue<TimedLayer, std::vector<TimedLayer>>
layer_processing;
std::unordered_set<TimedLayer> processed_layers;
DeduplicatedPriorityQueue<TimedDependency, std::vector<TimedDependency>>
dependency_processing;
std::unordered_set<TimedDependency> processed_dependencies;lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 96 at r1 (raw file):
layer_processing.push(TimedLayer{layer, current_time + cost}); for (device_id_t d : device_mapping.at(layer)) { devices[d] = true;
Suggestion:
devices.at(d) = true;lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 98 at r1 (raw file):
devices[d] = true; } layer_frontier.erase(layer);
Maybe something like "ready_layers" might make it a bit clearer what these are? At least to me "frontier" could indicate the latest executed layers or the next to execute layers or the frontier of layers that aren't ready yet, so this name alone is a bit ambiguous
Code quote:
layer_frontierlib/compiler/src/compiler/cost_estimator/task_simulator.cc line 107 at r1 (raw file):
dependency, pcg, machine_mapping, estimator); dependency_processing.push(TimedDependency{dependency, start_time + cost}); };
So in this current version there's no modelling of congestion or tensor reuse at all, correct?
Code quote:
auto start_dependency_processing = [&](ParallelComputationGraphEdge const
&dependency,
float start_time) {
float cost = single_dependency_cost_estimator(
dependency, pcg, machine_mapping, estimator);
dependency_processing.push(TimedDependency{dependency, start_time + cost});
};lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 111 at r1 (raw file):
auto finish_layer_processing = [&](TimedLayer const &timed_layer) { for (device_id_t d : device_mapping.at(timed_layer.layer)) { devices[d] = false;
Suggestion:
devices.at(d) = false;lib/pcg/include/pcg/parallel_computation_graph/parallel_computation_graph.h line 44 at r1 (raw file):
std::unordered_set<parallel_layer_guid_t> get_source_layers(ParallelComputationGraph const &);
To avoid overloading the term "source" (since we also have get_source_layer for edges), maybe could use a different name like "get_initial_layers" (so you'd have source/(sink/destination) edges, and initial/terminal for the overall graph)? Not super important, just a thought
Code quote:
get_source_layerslib/pcg/src/pcg/operator_task_space.cc line 41 at r1 (raw file):
} OperatorTaskSpace get_operator_task_space(ParallelTensorShape const &shape) {
Seems odd that get_operator_task_space doesn't actually take in a parallel_layer_guid_t or any other type of operator representation? I assume that this is because the operator task space is arbitrary and so we use a convention that we derive it from the output tensor's space, but it would be good for the API to reflect that that could be changed and that it's really the operator+inputs+outputs that determines the task space
lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 108 at r1 (raw file):
} TEST_CASE("get_source_layer") {
Which overload is this? Try to put the argument types in the test names (FYI I normally shorten A const & to A in the signatures because they're semantically equivalent for value types, which is mostly what we deal with, just to try to make the TEST_CASE names a bit more readable)
Code quote:
TEST_CASE("get_source_layer") {lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 134 at r1 (raw file):
parallel_tensor_guid_t tensor1 = get_only(layer1_added.outputs); SUBCASE("get_source_layer") {
Why is there a subcase named get_source_layer within a testcase for get_source_layer?
Code quote:
SUBCASE("get_source_layer") {lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 158 at r1 (raw file):
} SUBCASE("three layers in serial") {
Suggestion:
SUBCASE("three layers in series")There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 8 of 36 files reviewed, 22 unresolved discussions (waiting on @lockshaw)
lib/compiler/include/compiler/cost_estimator/timed_dependency.struct.toml line 1 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why a leading newline in a bunch of these files?
fixed
lib/compiler/src/compiler/allowed_machine_views.cc line 27 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is this true? I don't remember where projection is handled--is a (8,) machine view a valid view for a (2, 4) task? Just double-checking 🙂
Yes, MachineView stores the starting point (in the machine space, the 2D one) and an N dimensional vector (which "lives" in the task space) which says for each degree of the OperatorTaskSpace what is the stride and projection. So for a (2,4) task that we want to project across the INTER_NODE dimension in the machine space we would have MachineView((a, INTER_NODE), (b, INTER_NODE)) where a,b are some strides.
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 82 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is this a cohesive data structure that could be factored out? Seems like you're using it for both
TimedLayerandTimedDependency? Not a huge deal, just might simplify a bit of the code
yeah you are right, I've stepped back a bit and the behaviour for how the dependencies and layers are processed is pretty much the same, only difference is how you finish processing them. I have refactored the code accordingly, we now have a TimedComponent (variant of the other 2) and a single queue that stores both.
(Way cleaner, thanks a lot for pointing it out)
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 82 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Any reason for using separate priority queues instead of having a single priority queue that holds something like
type Task = (float, variant<Layer, Edge>)?
Yeah you are absolutely right, have changed it accordingly.
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 98 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Maybe something like "ready_layers" might make it a bit clearer what these are? At least to me "frontier" could indicate the latest executed layers or the next to execute layers or the frontier of layers that aren't ready yet, so this name alone is a bit ambiguous
Done.
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 107 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
So in this current version there's no modelling of congestion or tensor reuse at all, correct?
correct, how should I go about modelling it (Something along the lines of, for a given instant of time t, a given device can at most transfer x tensors to any other device?).
Isn't it generally better for the task simulator to supply the tensors to the cost_estimator intelligently, and then let it figure out the costs, rather than trying to simulate those things directly with the task simulator (genuine question, I'm not sure how accurately the cost estimator models things).
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Overall this implementation is pretty solid, and the lambdas do a lot to improve readability. If it's possible to pull out some composite data structures so that not all of the raw data structure manipulation has to be done directly, that might be nice--don't know if it's possible, but would be worth considering
I have changed the device map std::unordered_map<parallel_layer_guid_t, std::unordered_set<device_id_t>> to be DeviceMapping, but I'm not sure which other ones would be worth pulling out.
lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 166 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might make things a bit more readable
refactored the test cases, should be a lot more readable now
lib/pcg/include/pcg/parallel_computation_graph/parallel_computation_graph.h line 44 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
To avoid overloading the term "source" (since we also have
get_source_layerfor edges), maybe could use a different name like "get_initial_layers" (so you'd have source/(sink/destination) edges, and initial/terminal for the overall graph)? Not super important, just a thought
you're right, changed
lib/pcg/src/pcg/machine_specification.cc line 55 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Normally we don't have dedicated overloads for a function
fthat are just (transform f)--is there a compelling argument why this is needed here? The problem is adding them in explodes the number of functions you have to write, and they don't contribute a lot as you already havetransform
removed.
lib/pcg/src/pcg/machine_view.cc line 64 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer a throw with an actual error message
Done.
lib/pcg/src/pcg/operator_task_space.cc line 41 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Seems odd that
get_operator_task_spacedoesn't actually take in aparallel_layer_guid_tor any other type of operator representation? I assume that this is because the operator task space is arbitrary and so we use a convention that we derive it from the output tensor's space, but it would be good for the API to reflect that that could be changed and that it's really the operator+inputs+outputs that determines the task space
I have changed it accordingly, but yeah the operator task space still does not make too much sense. I'm generally a bit confused about how the MachineView mapping to Operator should work. MachineView describes a N-dimensional strided orthotope (which then gets mapped to the MachineSpace etc...), but how does this match with an operator, which is instead a function that returns a tuple of vectors, each with it's own dimension. Right now OperatorTaskSpace is essentially copying the parallel degrees of the first output tensor, but are these the same across all output tensors (or am I just making the arbitrary decision of fetching the first output tensor shape, rather than any other one)?
lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 108 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Which overload is this? Try to put the argument types in the test names (FYI I normally shorten
A const &toAin the signatures because they're semantically equivalent for value types, which is mostly what we deal with, just to try to make the TEST_CASE names a bit more readable)
Done.
lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 134 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is there a subcase named
get_source_layerwithin a testcase forget_source_layer?
Done.
lib/utils/src/utils/graph/dataflow_graph/algorithms/get_outgoing_edges.cc line 6 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Add unit tests for these
Done.
lib/compiler/src/compiler/cost_estimator/cost_estimator.cc line 1 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why a newline?
Done.
lib/compiler/src/compiler/machine_mapping/machine_mapping_problem_tree/unmapped_op_cost_estimate_key.cc line 36 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Should probably be moved to the file for
OpCostEstimateKeyinstead of here, as the signature doesn't really involveUnmappedOpCostEstimateKey
Done.
lib/compiler/test/src/compiler/cost_estimator_for_test.h line 37 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Generally pass primitives by value
Done.
lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 41 at r1 (raw file):
MachineSpecification machine_spec = MachineSpecification{3, 3, 3, 1, 1}; TEST_CASE("task_simulator: linear graph") {
Done.
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 96 at r1 (raw file):
layer_processing.push(TimedLayer{layer, current_time + cost}); for (device_id_t d : device_mapping.at(layer)) { devices[d] = true;
Done.
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 111 at r1 (raw file):
auto finish_layer_processing = [&](TimedLayer const &timed_layer) { for (device_id_t d : device_mapping.at(timed_layer.layer)) { devices[d] = false;
Done.
lib/pcg/src/pcg/machine_view.cc line 46 at r1 (raw file):
if (strides.size() != dims.size()) { throw mk_runtime_error( fmt::format("Dimensions of {} and {} must match when calling "
Done.
lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 158 at r1 (raw file):
} SUBCASE("three layers in serial") {
Done.
lib/compiler/include/compiler/cost_estimator/task_simulator.h line 10 at r1 (raw file):
namespace FlexFlow { float task_simulator_forward_pass(ParallelComputationGraph const &pcg,
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 8 of 36 files reviewed, 22 unresolved discussions (waiting on @lockshaw)
lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 38 at r1 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I'd recommend at least also doing some tests with a cost estimator that differentiates between different ops and comms to validate that the right ops and comms are getting passed in.
added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 17 of 28 files at r2, 1 of 4 files at r4, all commit messages.
Reviewable status: 26 of 39 files reviewed, 8 unresolved discussions (waiting on @Marsella8)
lib/compiler/include/compiler/cost_estimator/timed_component.variant.toml line 0 at r4 (raw file):
Why not (time, variant<layer, edge>) instead of variant<(time, layer), (time, edge)>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 26 of 41 files reviewed, 8 unresolved discussions (waiting on @lockshaw)
lib/compiler/include/compiler/cost_estimator/timed_component.variant.toml line at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not
(time, variant<layer, edge>)instead ofvariant<(time, layer), (time, edge)>?
for some functions you only want to take in either a TimedLayer or a TimedComponent, find it more convenient to do the variant this way since otherwise I would have to either pass the variant and manually assert that it has the right type of unpacking and then passing time and layer/edge separately, which wasn't super elegant. tried both and the current one seemed to fit a bit better, not too much of a difference either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 16 files at r6, 1 of 6 files at r7.
Reviewable status: 23 of 47 files reviewed, 21 unresolved discussions (waiting on @Marsella8)
lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 2 at r6 (raw file):
namespace = "FlexFlow" name = "TaskGraph"
It would be nice if TaskGraph was a bit more semantically-rich, i.e., has node labels of type Task, where Task is a variant.toml of parallel_layer_guid_t and whatever your communication representation is. Then, your actual simulator code would take in a DiGraphView and a couple std::functions, which would be based on the TaskGraph. Right now it kinda feels like you have two levels of structures, (1) the full PCG, and (2) a fully anonymized DiGraphView, which makes the conversion/lowering code rather complex. It would be nice to have an intermediate structure to simplify things a bit.
lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 27 at r6 (raw file):
[[fields]] name = "is_allowed_to_run"
Conceptually shouldn't be part of the TaskGraph--should be passed as a separate parameter probably, or TaskGraph should be renamed/re-structured. TaskGraph to me conveys the set of tasks and the data dependencies between them, which is_allowed_to_run isn't part of
lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 28 at r6 (raw file):
[[fields]] name = "is_allowed_to_run" type = "std::function<bool(::FlexFlow::Node, ::FlexFlow::TasksStateTracker)>"
Probably shouldn't take in the whole TasksStateTracker, as it shouldn't need to access the set of ready tasks or the current time. In fact, they don't even need the TimedTasks in the other two fields of the tracker, just two sets of Node: finished and in-progress
lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 28 at r6 (raw file):
[[fields]] name = "is_allowed_to_run" type = "std::function<bool(::FlexFlow::Node, ::FlexFlow::TasksStateTracker)>"
Suggestion:
type = "std::function<bool(::FlexFlow::Node, ::FlexFlow::TasksStateTracker const &)>"lib/compiler/include/compiler/cost_estimator/tasks_state_tracker.struct.toml line 2 at r6 (raw file):
namespace = "FlexFlow" name = "TasksStateTracker"
Suggestion:
name = "TaskGraphExecutionState"lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 0 at r6 (raw file):
simulate_task_graph_execution.cc
lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 47 at r6 (raw file):
return state_tracker.ready_tasks.empty() && state_tracker.tasks_processing.empty(); }
Now that the domain-specific logic is factored out, feel free to make these lambdas again
Code quote:
static void start_task_processing(TasksStateTracker &state_tracker,
TaskGraph const &task_graph,
Node const &task) {
float cost = task_graph.cost_map.at(task);
state_tracker.tasks_processing.push(
TimedTask{state_tracker.current_time + cost, task});
state_tracker.ready_tasks.erase(task);
}
static bool dependencies_are_satisfied(TasksStateTracker const &state_tracker,
TaskGraph const &task_graph,
Node const &task) {
std::unordered_set<Node> incoming_dependencies =
get_predecessors(task_graph.graph, task);
return is_subseteq_of(incoming_dependencies, state_tracker.processed_tasks);
}
static void finish_task_processing(TasksStateTracker &state_tracker,
TaskGraph const &task_graph,
TimedTask const &timed_task) {
state_tracker.processed_tasks.insert(timed_task.node);
for (Node const &task : get_successors(task_graph.graph, timed_task.node)) {
if (dependencies_are_satisfied(state_tracker, task_graph, task)) {
state_tracker.ready_tasks.insert(task);
}
}
state_tracker.current_time = timed_task.endtime;
}
static bool is_processing_done(TasksStateTracker const &state_tracker) {
return state_tracker.ready_tasks.empty() &&
state_tracker.tasks_processing.empty();
}lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 55 at r6 (raw file):
} float simulate_forward_pass(TaskGraph const &task_graph) {
Suggestion:
simulate_task_graph_executionlib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 55 at r6 (raw file):
} float simulate_forward_pass(TaskGraph const &task_graph) {
TaskGraphProfile
Code quote:
floatlib/compiler/src/compiler/cost_estimator/task_simulator.cc line 70 at r6 (raw file):
DiGraph digraph = DiGraph::create<AdjacencyDiGraph>(); bidict<Node, std::variant<parallel_layer_guid_t, ParallelComputationGraphEdge>>
Avoid raw variants, introduce a Task type using dtgen's variant.toml
Code quote:
std::variant<parallel_layer_guid_t, ParallelComputationGraphEdge>lib/compiler/include/compiler/cost_estimator/timed_component_comparator.h line 10 at r4 (raw file):
struct TimedComponentComparator { bool operator()(TimedComponent const &lhs, TimedComponent const &rhs) const; };
I actually like having the explicit comparator, can you bring this back? It removes the need to understand what code the ord feature in dtgen generates, which is nice as it avoids ambiguity
Code quote:
struct TimedComponentComparator {
bool operator()(TimedComponent const &lhs, TimedComponent const &rhs) const;
};lib/compiler/include/compiler/cost_estimator/task_graph_profile.struct.toml line 2 at r7 (raw file):
namespace = "FlexFlow" name = "TaskGraphProfile"
Suggestion:
name = "TaskGraphExecutionTrace"lib/compiler/include/compiler/cost_estimator/timed_task.struct.toml line 2 at r7 (raw file):
namespace = "FlexFlow" name = "TimedTask"
Suggestion:
name = "InProgressTask"There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 23 of 47 files reviewed, 21 unresolved discussions (waiting on @lockshaw)
lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 27 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Conceptually shouldn't be part of the
TaskGraph--should be passed as a separate parameter probably, orTaskGraphshould be renamed/re-structured.TaskGraphto me conveys the set of tasks and the data dependencies between them, whichis_allowed_to_runisn't part of
Done.
lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 28 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Probably shouldn't take in the whole
TasksStateTracker, as it shouldn't need to access the set of ready tasks or the current time. In fact, they don't even need theTimedTasks in the other two fields of the tracker, just two sets ofNode: finished and in-progress
Done.
lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 47 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Now that the domain-specific logic is factored out, feel free to make these lambdas again
Done.
lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 55 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
TaskGraphProfile
wouldn't it be better to return the entire profile rather than only the endtime? (since this way in testing I can check against the entire trace, and given the profile you can choose to discard everything but the endtime).
lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
simulate_task_graph_execution.cc
Done.
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 70 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Avoid raw variants, introduce a
Tasktype usingdtgen'svariant.toml
Done.
lib/compiler/include/compiler/cost_estimator/timed_component_comparator.h line 10 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I actually like having the explicit comparator, can you bring this back? It removes the need to understand what code the
ordfeature indtgengenerates, which is nice as it avoids ambiguity
Done.
lib/compiler/include/compiler/cost_estimator/timed_task.struct.toml line 2 at r7 (raw file):
namespace = "FlexFlow" name = "TimedTask"
Done.
lib/compiler/include/compiler/cost_estimator/tasks_state_tracker.struct.toml line 2 at r6 (raw file):
namespace = "FlexFlow" name = "TasksStateTracker"
Done.
lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 28 at r6 (raw file):
[[fields]] name = "is_allowed_to_run" type = "std::function<bool(::FlexFlow::Node, ::FlexFlow::TasksStateTracker)>"
Done.
lib/compiler/include/compiler/cost_estimator/task_graph_profile.struct.toml line 2 at r7 (raw file):
namespace = "FlexFlow" name = "TaskGraphProfile"
Done.
lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 55 at r6 (raw file):
} float simulate_forward_pass(TaskGraph const &task_graph) {
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 29 files at r2, 3 of 18 files at r6, 23 of 24 files at r8, all commit messages.
Reviewable status: 52 of 55 files reviewed, 23 unresolved discussions (waiting on @Marsella8)
lib/pcg/src/pcg/operator_task_space.cc line 41 at r1 (raw file):
Previously, Marsella8 wrote…
I have changed it accordingly, but yeah the operator task space still does not make too much sense. I'm generally a bit confused about how the
MachineViewmapping to Operator should work. MachineView describes a N-dimensional strided orthotope (which then gets mapped to the MachineSpace etc...), but how does this match with an operator, which is instead a function that returns a tuple of vectors, each with it's own dimension. Right nowOperatorTaskSpaceis essentially copying the parallel degrees of the first output tensor, but are these the same across all output tensors (or am I just making the arbitrary decision of fetching the first output tensor shape, rather than any other one)?
The latter--it's an arbitrary choice. Eventually the dependencies will be moved into the PCG/inferrable from the PCG itself as part of #1569
lib/compiler/include/compiler/cost_estimator/task_constraint.struct.toml line 2 at r8 (raw file):
namespace = "FlexFlow" name = "TaskConstraint"
Suggestion:
name = "TaskExecutionConstraint"lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 19 at r8 (raw file):
TaskGraphExecutionTrace simulate_task_graph_execution(TaskGraph const &task_graph,
Might want to assert that the task graph is acylic? Just to improve debuggability
lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 23 at r8 (raw file):
TaskGraphExecutionState execution_state = TaskGraphExecutionState{get_sources(task_graph.graph), {}, {}, 0.0};
Add argument name comments
Code quote:
TaskGraphExecutionState{get_sources(task_graph.graph), {}, {}, 0.0};lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 52 at r8 (raw file):
TaskProfile{in_progress_task.node, in_progress_task.endtime - task_graph.cost_map.at(in_progress_task.node),
Why not just have InProgressTask have a start_time? This seems easier/more-intuitive than having to infer if after it's already been executed
lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 68 at r8 (raw file):
}; while (!is_processing_done()) { for (Node const &task : sorted(execution_state.ready_tasks)) {
Might be better to just have ready_tasks be a std::set so you have a non-insertion-order-based-order and don't have to keep sorting as iteration in a std::set is already ordered?
Code quote:
sorted(execution_state.ready_tasks)lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 80 at r8 (raw file):
if (!execution_state.tasks_processing.empty()) { InProgressTask next_task = get_next_task();
Suggestion:
get_next_task_to_finishlib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 87 at r8 (raw file):
} return TaskGraphExecutionTrace{task_profiles, execution_state.current_time};
Would be good to include a check that all tasks were executed
lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 2 at r8 (raw file):
namespace = "FlexFlow" name = "TaskGraph"
I would get rid of TaskGraph, and just have the simulate function take in a DiGraphView, a cost function (as a std::function or a wrapped std::function, depending on your preference), and a constraint function
lib/compiler/include/compiler/machine_mapping/device_mapping.cc line 1 at r8 (raw file):
Leading newline?
lib/compiler/include/compiler/machine_mapping/device_mapping.cc line 24 at r8 (raw file):
device_mapping.insert( {layer, get_device_ids(op, machine_view, machine_spec)}); }
Change to map_values?
Code quote:
std::unordered_map<parallel_layer_guid_t, std::unordered_set<device_id_t>>
device_mapping;
for (auto const &[layer, machine_view] : machine_mapping.machine_views) {
OperatorTaskSpace op =
get_operator_task_space(pcg, layer));
device_mapping.insert(
{layer, get_device_ids(op, machine_view, machine_spec)});
}lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 64 at r8 (raw file):
} static std::pair<DiGraph, bidict<Node, PCGTask>>
Make this an actual dtgen type (this is the PCGTaskGraph mentioned in zulip)
Also, this should be a public function, as getting a task graph for a PCG is a generally useful function
Code quote:
static std::pair<DiGraph, bidict<Node, PCGTask>>lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 112 at r8 (raw file):
static bool is_allowed_to_run_super(Node const &task,
Make this a lambda
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 133 at r8 (raw file):
device_map.raw_device_map.at(task_component.require_layer())); } }
Can you clean this up a bit with utils/containers?
Code quote:
std::unordered_set<device_id_t> devices_occupied;
for (Node const &in_progress_task : tasks_processing) {
auto task_component = node_map.at_l(in_progress_task);
if (task_component.is_layer()) {
devices_occupied = set_union(
devices_occupied,
device_map.raw_device_map.at(task_component.require_layer()));
}
}lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 149 at r8 (raw file):
auto [digraph, node_map] = get_digraph(pcg); auto cost_map = get_cost_map(node_map, pcg, machine_mapping, estimator);
I feel like this would be simpler just as a lambda that you pass in to the simulate function, rather than having to do all the manipulation to compute the cost map up front
Code quote:
auto cost_map = get_cost_map(node_map, pcg, machine_mapping, estimator);lib/utils/include/utils/deduplicated_priority_queue.h line 41 at r8 (raw file):
} std::vector<Elem> contents() const {
What is this function (semantically) and why is it necessary?
lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 0 at r8 (raw file):
Would be good to move all the task simulator stuff to its own directory (e.g., compiler/task_graph_simulator)
lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 28 at r8 (raw file):
[[fields]] name = "tasks_processing"
Suggestion:
name = "in_progress_tasks"lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 32 at r8 (raw file):
[[fields]] name = "processed_tasks"
Suggestion:
name = "finished_tasks"lib/compiler/include/compiler/machine_mapping/device_mapping.struct.toml line 2 at r8 (raw file):
namespace = "FlexFlow" name = "DeviceMapping"
Suggestion:
name = "UnstructuredDeviceMapping"lib/compiler/include/compiler/cost_estimator/task_graph_execution_trace.struct.toml line 27 at r8 (raw file):
[[fields]] name = "end_time" type = "float"
Remove, just have a helper function that calculates it from the set of task profiles
Code quote:
[[fields]]
name = "end_time"
type = "float"There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 29 files at r2, 18 of 19 files at r9, all commit messages.
Reviewable status: 54 of 55 files reviewed, 15 unresolved discussions (waiting on @Marsella8)
lib/compiler/include/compiler/machine_mapping/unstructured_device_mapping.h line 12 at r9 (raw file):
UnstructuredDeviceMapping get_device_mapping(MachineMapping const &machine_mapping,
Suggestion:
get_unstructured_device_mappinglib/compiler/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 97 at r9 (raw file):
} if (execution_state.finished_tasks.size() != num_nodes(task_graph.graph)) { throw mk_runtime_error("Some tasks within the TaskGraph are unreachable");
Suggestion:
("Failed to execute all tasks in TaskGraph");lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 37 at r9 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("task_simulator") {
Suggestion:
TEST_CASE("task_simulator_estimate_forward_pass_time") {lib/compiler/test/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 17 at r9 (raw file):
SUBCASE("linear graph") { DiGraph g = DiGraph::create<AdjacencyDiGraph>();
Suggestion:
DiGraph g = DiGraph::create<AdjacencyDiGraph>();
SUBCASE("linear graph") {lib/compiler/test/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 87 at r9 (raw file):
} SUBCASE("processing constraint") {
Suggestion:
SUBCASE("constrained to one node at a time")lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 131 at r9 (raw file):
transform(layer_nodes, [&](Node const &n) { return node_map.at_l(n).require_layer(); });
filtrans would probably be cleaner
Code quote:
std::unordered_set<Node> layer_nodes =
filter(in_progress_tasks,
[&](Node const &n) { return node_map.at_l(n).is_layer(); });
std::unordered_set<parallel_layer_guid_t> layers =
transform(layer_nodes, [&](Node const &n) {
return node_map.at_l(n).require_layer();
});There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 54 of 55 files reviewed, 15 unresolved discussions (waiting on @lockshaw)
lib/utils/include/utils/deduplicated_priority_queue.h line 41 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What is this function (semantically) and why is it necessary?
pulls out all the contents of the deduplicated priority queue. Needed since i have to pass it to the constraint function.
lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 2 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
It would be nice if
TaskGraphwas a bit more semantically-rich, i.e., has node labels of typeTask, whereTaskis avariant.tomlofparallel_layer_guid_tand whatever your communication representation is. Then, your actual simulator code would take in aDiGraphViewand a couplestd::functions, which would be based on theTaskGraph. Right now it kinda feels like you have two levels of structures, (1) the full PCG, and (2) a fully anonymized DiGraphView, which makes the conversion/lowering code rather complex. It would be nice to have an intermediate structure to simplify things a bit.
Done.
lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 2 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I would get rid of
TaskGraph, and just have the simulate function take in a DiGraphView, a cost function (as a std::function or a wrapped std::function, depending on your preference), and a constraint function
Done.
lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would be good to move all the task simulator stuff to its own directory (e.g.,
compiler/task_graph_simulator)
Done.
lib/compiler/include/compiler/cost_estimator/task_graph_execution_trace.struct.toml line 27 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Remove, just have a helper function that calculates it from the set of task profiles
Done.
lib/compiler/include/compiler/machine_mapping/device_mapping.cc line 1 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Leading newline?
Done.
lib/compiler/include/compiler/machine_mapping/device_mapping.cc line 24 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Change to
map_values?
don't think it's doable elegantly, given that the new value is a function of both the key (layer) and the original value.
lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 23 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Add argument name comments
Done.
lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 52 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not just have
InProgressTaskhave astart_time? This seems easier/more-intuitive than having to infer if after it's already been executed
Done.
lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 68 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be better to just have
ready_tasksbe astd::setso you have a non-insertion-order-based-order and don't have to keep sorting as iteration in astd::setis already ordered?
the sorted is more so for consistency during testing (if there are multiple available tasks, process the one with the associated smallest node). It's more to have a fixed, arbitrary order than anything else, so probably better to not have this property embedded into the execution_tracker itself?
lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 87 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would be good to include a check that all tasks were executed
Done.
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 64 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Make this an actual
dtgentype (this is thePCGTaskGraphmentioned in zulip)Also, this should be a public function, as getting a task graph for a PCG is a generally useful function
Done.
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 112 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Make this a lambda
Done.
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 133 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Can you clean this up a bit with
utils/containers?
Done.
lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 149 at r8 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I feel like this would be simpler just as a lambda that you pass in to the simulate function, rather than having to do all the manipulation to compute the cost map up front
Done.
lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 131 at r9 (raw file):
Previously, lockshaw (Colin Unger) wrote…
filtranswould probably be cleaner
Done.
lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 37 at r9 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("task_simulator") {
Done.
lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 28 at r8 (raw file):
[[fields]] name = "tasks_processing"
Done.
lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 32 at r8 (raw file):
[[fields]] name = "processed_tasks"
Done.
lib/compiler/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 97 at r9 (raw file):
} if (execution_state.finished_tasks.size() != num_nodes(task_graph.graph)) { throw mk_runtime_error("Some tasks within the TaskGraph are unreachable");
Done.
lib/compiler/include/compiler/machine_mapping/unstructured_device_mapping.h line 12 at r9 (raw file):
UnstructuredDeviceMapping get_device_mapping(MachineMapping const &machine_mapping,
Done.
lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 80 at r8 (raw file):
if (!execution_state.tasks_processing.empty()) { InProgressTask next_task = get_next_task();
Done.
lib/compiler/include/compiler/machine_mapping/device_mapping.struct.toml line 2 at r8 (raw file):
namespace = "FlexFlow" name = "DeviceMapping"
Done.
lib/compiler/include/compiler/cost_estimator/task_constraint.struct.toml line 2 at r8 (raw file):
namespace = "FlexFlow" name = "TaskConstraint"
Done.
lib/compiler/test/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 17 at r9 (raw file):
SUBCASE("linear graph") { DiGraph g = DiGraph::create<AdjacencyDiGraph>();
Done.
lib/compiler/test/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 87 at r9 (raw file):
} SUBCASE("processing constraint") {
Done.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1565 +/- ##
==========================================
+ Coverage 60.47% 60.94% +0.47%
==========================================
Files 606 618 +12
Lines 14725 14978 +253
==========================================
+ Hits 8905 9129 +224
- Misses 5820 5849 +29
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add unit tests for get_device_ids and for
std::unordered_set<DataflowEdge> get_outgoing_edges(DataflowGraphView const &g, std::unordered_set<Node> const &ns)?
Reviewed 38 of 38 files at r11, 18 of 18 files at r12, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @Marsella8)
lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 104 at r10 (raw file):
Previously, Marsella8 wrote…
Done.
Not seeing this change?
lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 170 at r10 (raw file):
Previously, Marsella8 wrote…
In the sense of having ForwardOpCostMetrics and BackwardOpCost metrics?
Just change OpCostMetrics to have a forward_execution_time and a backward_execution_time
lib/compiler/include/compiler/cost_estimator/tensor_set_movement.h line 0 at r11 (raw file):
Empty file?
lib/utils/include/utils/deduplicated_priority_queue.h line 42 at r11 (raw file):
} std::set<Elem> contents() const {
Suggestion:
std::set<Elem, Compare> lib/utils/include/utils/archetypes/value_type.h line 42 at r11 (raw file):
std::ostringstream oss; oss << "<value_type<" << TAG << ">>"; return oss.str();
Suggestion:
assert(false);lib/utils/include/utils/archetypes/value_type.h line 47 at r11 (raw file):
template <int TAG> std::ostream &operator<<(std::ostream &s, value_type<TAG> const &x) { return s << format_as(x);
Suggestion:
assert(false);lib/utils/test/src/utils/containers/lookup_in_map.cc line 15 at r12 (raw file):
SUBCASE("existing keys") { std::function<int(std::string)> func = lookup_in_map(map);
Suggestion:
std::function<int(std::string const &)>|
Previously, Marsella8 wrote…
Yes, but I think that's a good idea. The idea of the task graph is to represent the actual computation that occurs, not an abstracted partial version of it, so including the machine mapping information and making the communication explicit is exactly what I'd want. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 60 of 95 files reviewed, 8 unresolved discussions (waiting on @lockshaw and @wmdi)
lib/compiler/include/compiler/cost_estimator/tensor_set_movement.h line at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Empty file?
fixed
lib/compiler/include/compiler/task_graph_simulator/pcg_task.variant.toml line 21 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Yes, but I think that's a good idea. The idea of the task graph is to represent the actual computation that occurs, not an abstracted partial version of it, so including the machine mapping information and making the communication explicit is exactly what I'd want.
yeah ur right, done :)
lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 104 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Not seeing this change?
sorry, forgot to mention that try_require_layer does not exist
lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 170 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Just change
OpCostMetricsto have aforward_execution_timeand abackward_execution_time
OpCostMetricswas not made by me, and it's predominantly used within @wmdi 's machine mapping so I don't wanna make changes that break her code, possibly a good topic for next week's meeting?
lib/utils/include/utils/archetypes/value_type.h line 42 at r11 (raw file):
std::ostringstream oss; oss << "<value_type<" << TAG << ">>"; return oss.str();
Done.
lib/utils/include/utils/archetypes/value_type.h line 47 at r11 (raw file):
template <int TAG> std::ostream &operator<<(std::ostream &s, value_type<TAG> const &x) { return s << format_as(x);
Done.
lib/utils/include/utils/deduplicated_priority_queue.h line 42 at r11 (raw file):
} std::set<Elem> contents() const {
Done.
lib/utils/test/src/utils/containers/lookup_in_map.cc line 15 at r12 (raw file):
SUBCASE("existing keys") { std::function<int(std::string)> func = lookup_in_map(map);
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 35 of 35 files at r13, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Marsella8 and @wmdi)
lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 104 at r10 (raw file):
Previously, Marsella8 wrote…
sorry, forgot to mention that
try_require_layerdoes not exist
Then try_require_operator? afaik it should be generated by dtgen
lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 170 at r10 (raw file):
Previously, Marsella8 wrote…
OpCostMetricswas not made by me, and it's predominantly used within @wmdi 's machine mapping so I don't wanna make changes that break her code, possibly a good topic for next week's meeting?
It's fine to make that change, we've actually been discussing it for a while--iirc CostEstimator already provides that information, it's just not currently preserved by OpCostMetrics
lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 46 at r13 (raw file):
} auto running_ops = filtrans(in_progress_tasks, [&](Node const &n) {
Suggestion:
in_progress_opslib/pcg/include/pcg/operator_task_space.h line 26 at r13 (raw file):
parallel_layer_guid_t const &layer); OperatorTaskSpace get_operator_task_space_from_parallel_tensor_shape(
Don't allow this operation, for now the operator task space should rely on the parallel_layer_guid_t. It seems you're using this to get the device set for the task graph--you should instead compute that ahead of time when you have the parallel_layer_guid_t and save it in the PCGTasks in the TaskGraph
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @lockshaw and @wmdi)
lib/pcg/include/pcg/operator_task_space.h line 26 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Don't allow this operation, for now the operator task space should rely on the
parallel_layer_guid_t. It seems you're using this to get the device set for the task graph--you should instead compute that ahead of time when you have theparallel_layer_guid_tand save it in thePCGTasks in theTaskGraph
ur right, have added a node_to_layer bidict to the PCGTaskGraph (we would technically need a tridict to have perfect "make invalid states unrepresentable", but I think this suffices hahah)
lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 104 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Then
try_require_operator? afaik it should be generated bydtgen
sorry, totally missed it.
lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 170 at r10 (raw file):
Previously, lockshaw (Colin Unger) wrote…
It's fine to make that change, we've actually been discussing it for a while--iirc
CostEstimatoralready provides that information, it's just not currently preserved byOpCostMetrics
Done.
lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 46 at r13 (raw file):
} auto running_ops = filtrans(in_progress_tasks, [&](Node const &n) {
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 63 of 100 files reviewed, 7 unresolved discussions (waiting on @lockshaw and @wmdi)
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):
float cost = context.cost_estimator.estimate_cost(mapped) .forward_runtime; // TODO(@pietro)(@wmdi) incorporate backward pass?
lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 35 at r14 (raw file):
if (mapping.cost.forward_runtime >= other_mapping.cost .forward_runtime && // TODO(@wmdi) does it have to be modified
lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 104 at r14 (raw file):
std::max(lhs_mm.cost.forward_runtime, rhs_mm.cost.forward_runtime), std::max(lhs_mm.cost.backward_runtime, rhs_mm.cost.backward_runtime), //(@wmdi) is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 37 of 37 files at r14, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @lockshaw, @Marsella8, and @wmdi)
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):
Previously, Marsella8 wrote…
Should should be the sum of the forward and backward runtimes
lib/compiler/include/compiler/task_graph_simulator/pcg_task_graph.struct.toml line 24 at r14 (raw file):
[[fields]] name = "node_to_layer" type = "::FlexFlow::bidict<::FlexFlow::Node, ::FlexFlow::parallel_layer_guid_t>"
I would lean toward making this a std::unordered_set<DeviceId> (i.e., pre-compute the device sets and save the result) rather than saving the parallel_layer_guid_t directly, especially since this will stop being a bidict once you have both forward and backward tasks. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @lockshaw and @wmdi)
lib/compiler/include/compiler/task_graph_simulator/pcg_task_graph.struct.toml line 24 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I would lean toward making this a
std::unordered_set<DeviceId>(i.e., pre-compute the device sets and save the result) rather than saving theparallel_layer_guid_tdirectly, especially since this will stop being abidictonce you have both forward and backward tasks. Thoughts?
makes sense, updated
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Should should be the sum of the forward and backward runtimes
thanks, added, not sure how to update the associated tests accordingly, I've commented them out for now, hopefully @wmdi could fix them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 26 of 26 files at r15, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @lockshaw and @wmdi)
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):
Previously, Marsella8 wrote…
thanks, added, not sure how to update the associated tests accordingly, I've commented them out for now, hopefully @wmdi could fix them?
They'll need to be uncommented before this is merged so you may as well keep them uncommented right now (if at least to reduce the diff size for review)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Marsella8)
lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 35 at r14 (raw file):
Previously, Marsella8 wrote…
I think so. Let's keep this comment for this PR, and I will fix it when I merge it.
lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 104 at r14 (raw file):
Previously, Marsella8 wrote…
Correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @lockshaw and @wmdi)
lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
They'll need to be uncommented before this is merged so you may as well keep them uncommented right now (if at least to reduce the diff size for review)
Done.
lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 35 at r14 (raw file):
Previously, wmdi (Mengdi Wu) wrote…
I think so. Let's keep this comment for this PR, and I will fix it when I merge it.
great thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 18 of 18 files at r16, 25 of 27 files at r17, 19 of 19 files at r18, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @wmdi)
Description of changes:
Related Issues:
Linked Issues:
Issues closed by this PR:
Before merging:
This change is