Task Simulator #1565

Marsella8 · 2024-12-27T22:37:14Z

Description of changes:

Related Issues:

Linked Issues:

Issue #

Issues closed by this PR:

Closes Enable task simulator to return a trace/profile object for understanding/visualization #1568

Before merging:

Did you update the flexflow-third-party repo, if modifying any of the Cmake files, the build configs, or the submodules?

This change is

lockshaw

Reviewed 10 of 28 files at r1, all commit messages.
Reviewable status: 10 of 28 files reviewed, 4 unresolved discussions (waiting on @Marsella8)

lib/compiler/src/compiler/cost_estimator/cost_estimator.cc line 1 at r1 (raw file):

Why a newline?

lib/compiler/test/src/compiler/cost_estimator_for_test.h line 37 at r1 (raw file):

CostEstimator make_fake_constant_cost_estimator(float const &op_cost,
                                                float const &comm_cost);

Generally pass primitives by value

Suggestion:

CostEstimator make_fake_constant_cost_estimator(float op_cost,
                                                float comm_cost);

lib/compiler/include/compiler/cost_estimator/timed_dependency.struct.toml line 1 at r1 (raw file):

Why a leading newline in a bunch of these files?

lib/compiler/src/compiler/allowed_machine_views.cc line 27 at r1 (raw file):

                           OperatorTaskSpace const &task,
                           MachineSpecification const &ms) {
  if (num_dims(mv) != num_dims(task)) {

Is this true? I don't remember where projection is handled--is a (8,) machine view a valid view for a (2, 4) task? Just double-checking 🙂

lockshaw

Reviewed 18 of 28 files at r1.
Reviewable status: all files reviewed, 22 unresolved discussions (waiting on @Marsella8)

lib/compiler/include/compiler/cost_estimator/task_simulator.h line 10 at r1 (raw file):

namespace FlexFlow {
float task_simulator_forward_pass(ParallelComputationGraph const &pcg,

Suggestion:

float task_simulator_estimate_forward_pass_time

lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 38 at r1 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  CostEstimator estimator =
      make_fake_constant_cost_estimator(/*op_cost*/ 10.0f, /*comm_cost*/ 1.0f);

I'd recommend at least also doing some tests with a cost estimator that differentiates between different ops and comms to validate that the right ops and comms are getting passed in.

Code quote:

     make_fake_constant_cost_estimator(/*op_cost*/ 10.0f, /*comm_cost*/ 1.0f);

lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 41 at r1 (raw file):

  MachineSpecification machine_spec = MachineSpecification{3, 3, 3, 1, 1};

  TEST_CASE("task_simulator: linear graph") {

Suggestion:

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("task_simulator_forward_pass") {
    CostEstimator estimator =
      make_fake_constant_cost_estimator(/*op_cost*/ 10.0f, /*comm_cost*/ 1.0f);
    MachineSpecification machine_spec = MachineSpecification{3, 3, 3, 1, 1};

    SUBCASE("linear graph") {

lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 166 at r1 (raw file):

            MachineViewDimension{stride_t{1},
                                 MachineSpecificationDimension::INTER_NODE},
        }};

Might make things a bit more readable

Suggestion:

    std::vector<MachineViewDimension> dims = 
      {
            MachineViewDimension{stride_t{1},
                                 MachineSpecificationDimension::INTER_NODE},
            MachineViewDimension{stride_t{1},
                                 MachineSpecificationDimension::INTER_NODE},
            MachineViewDimension{stride_t{1},
                                 MachineSpecificationDimension::INTER_NODE},
        };
    MachineView mv0 = MachineView{
        MachineSpaceCoordinate{0, 0, DeviceType::GPU},
        dims,
        };
    MachineView mv1 = MachineView{
        MachineSpaceCoordinate{0, 1, DeviceType::GPU},
        dims};
    MachineView mv2 = MachineView{
        MachineSpaceCoordinate{1, 0, DeviceType::GPU},
        dims};
    MachineView mv3 = MachineView{
        MachineSpaceCoordinate{1, 1, DeviceType::GPU},
        dims};

lib/utils/src/utils/graph/dataflow_graph/algorithms/get_outgoing_edges.cc line 6 at r1 (raw file):

namespace FlexFlow {

std::unordered_set<DataflowEdge> get_outgoing_edges(DataflowGraphView const &g,

Add unit tests for these

lib/compiler/src/compiler/machine_mapping/machine_mapping_problem_tree/unmapped_op_cost_estimate_key.cc line 36 at r1 (raw file):

}

OpCostEstimateKey get_mapped_op_cost_estimate_key_for_layer(

Should probably be moved to the file for OpCostEstimateKey instead of here, as the signature doesn't really involve UnmappedOpCostEstimateKey

lib/pcg/src/pcg/machine_specification.cc line 55 at r1 (raw file):

std::unordered_set<device_id_t>
    get_device_ids(MachineSpecification const &ms,

Normally we don't have dedicated overloads for a function f that are just (transform f)--is there a compelling argument why this is needed here? The problem is adding them in explodes the number of functions you have to write, and they don't contribute a lot as you already have transform

lib/pcg/src/pcg/machine_view.cc line 46 at r1 (raw file):

  if (strides.size() != dims.size()) {
    throw mk_runtime_error(
        fmt::format("Dimensions of {} and {} must match when calling "

Suggestion:

        fmt::format("Length of strides ({}) and dims ({}) must match when calling "

lib/pcg/src/pcg/machine_view.cc line 64 at r1 (raw file):

    MachineSpecification const &machine_specification) {

  assert(num_dims(machine_view) == task.degrees.size());

Prefer a throw with an actual error message

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 0 at r1 (raw file):
Overall this implementation is pretty solid, and the lambdas do a lot to improve readability. If it's possible to pull out some composite data structures so that not all of the raw data structure manipulation has to be done directly, that might be nice--don't know if it's possible, but would be worth considering

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 68 at r1 (raw file):

}

float task_simulator_forward_pass(ParallelComputationGraph const &pcg,

It would be nice to have the task simulator have the ability to output a trace/profile object that essentially contains the execution times and device assignments of each task for the purposes of eventually rendering it in something like a legion-prof format (https://legion.stanford.edu/profiling/index.html) for easier understanding/visualization

Probably shouldn't be done in this PR, created an issue #1568 and assigned it to you

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 82 at r1 (raw file):

  DeduplicatedPriorityQueue<TimedDependency, std::vector<TimedDependency>>
      dependency_processing;
  std::unordered_set<TimedDependency> processed_dependencies;

Is this a cohesive data structure that could be factored out? Seems like you're using it for both TimedLayer and TimedDependency? Not a huge deal, just might simplify a bit of the code

Code quote:

  DeduplicatedPriorityQueue<TimedDependency, std::vector<TimedDependency>>
      dependency_processing;
  std::unordered_set<TimedDependency> processed_dependencies;

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 82 at r1 (raw file):

  DeduplicatedPriorityQueue<TimedDependency, std::vector<TimedDependency>>
      dependency_processing;
  std::unordered_set<TimedDependency> processed_dependencies;

Any reason for using separate priority queues instead of having a single priority queue that holds something like type Task = (float, variant<Layer, Edge>)?

Code quote:

  DeduplicatedPriorityQueue<TimedLayer, std::vector<TimedLayer>>
      layer_processing;
  std::unordered_set<TimedLayer> processed_layers;

  DeduplicatedPriorityQueue<TimedDependency, std::vector<TimedDependency>>
      dependency_processing;
  std::unordered_set<TimedDependency> processed_dependencies;

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 96 at r1 (raw file):

    layer_processing.push(TimedLayer{layer, current_time + cost});
    for (device_id_t d : device_mapping.at(layer)) {
      devices[d] = true;

Suggestion:

    devices.at(d) = true;

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 98 at r1 (raw file):

      devices[d] = true;
    }
    layer_frontier.erase(layer);

Maybe something like "ready_layers" might make it a bit clearer what these are? At least to me "frontier" could indicate the latest executed layers or the next to execute layers or the frontier of layers that aren't ready yet, so this name alone is a bit ambiguous

Code quote:

layer_frontier

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 107 at r1 (raw file):

        dependency, pcg, machine_mapping, estimator);
    dependency_processing.push(TimedDependency{dependency, start_time + cost});
  };

So in this current version there's no modelling of congestion or tensor reuse at all, correct?

Code quote:

  auto start_dependency_processing = [&](ParallelComputationGraphEdge const
                                             &dependency,
                                         float start_time) {
    float cost = single_dependency_cost_estimator(
        dependency, pcg, machine_mapping, estimator);
    dependency_processing.push(TimedDependency{dependency, start_time + cost});
  };

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 111 at r1 (raw file):

  auto finish_layer_processing = [&](TimedLayer const &timed_layer) {
    for (device_id_t d : device_mapping.at(timed_layer.layer)) {
      devices[d] = false;

Suggestion:

      devices.at(d) = false;

lib/pcg/include/pcg/parallel_computation_graph/parallel_computation_graph.h line 44 at r1 (raw file):

std::unordered_set<parallel_layer_guid_t>
    get_source_layers(ParallelComputationGraph const &);

To avoid overloading the term "source" (since we also have get_source_layer for edges), maybe could use a different name like "get_initial_layers" (so you'd have source/(sink/destination) edges, and initial/terminal for the overall graph)? Not super important, just a thought

Code quote:

get_source_layers

lib/pcg/src/pcg/operator_task_space.cc line 41 at r1 (raw file):

}

OperatorTaskSpace get_operator_task_space(ParallelTensorShape const &shape) {

Seems odd that get_operator_task_space doesn't actually take in a parallel_layer_guid_t or any other type of operator representation? I assume that this is because the operator task space is arbitrary and so we use a convention that we derive it from the output tensor's space, but it would be good for the API to reflect that that could be changed and that it's really the operator+inputs+outputs that determines the task space

lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 108 at r1 (raw file):

  }

  TEST_CASE("get_source_layer") {

Which overload is this? Try to put the argument types in the test names (FYI I normally shorten A const & to A in the signatures because they're semantically equivalent for value types, which is mostly what we deal with, just to try to make the TEST_CASE names a bit more readable)

Code quote:

  TEST_CASE("get_source_layer") {

lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 134 at r1 (raw file):

      parallel_tensor_guid_t tensor1 = get_only(layer1_added.outputs);

      SUBCASE("get_source_layer") {

Why is there a subcase named get_source_layer within a testcase for get_source_layer?

Code quote:

      SUBCASE("get_source_layer") {

lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 158 at r1 (raw file):

    }

    SUBCASE("three layers in serial") {

Suggestion:

    SUBCASE("three layers in series")

Marsella8

Reviewable status: 8 of 36 files reviewed, 22 unresolved discussions (waiting on @lockshaw)

lib/compiler/include/compiler/cost_estimator/timed_dependency.struct.toml line 1 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why a leading newline in a bunch of these files?

fixed

lib/compiler/src/compiler/allowed_machine_views.cc line 27 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is this true? I don't remember where projection is handled--is a (8,) machine view a valid view for a (2, 4) task? Just double-checking 🙂

Yes, MachineView stores the starting point (in the machine space, the 2D one) and an N dimensional vector (which "lives" in the task space) which says for each degree of the OperatorTaskSpace what is the stride and projection. So for a (2,4) task that we want to project across the INTER_NODE dimension in the machine space we would have MachineView((a, INTER_NODE), (b, INTER_NODE)) where a,b are some strides.

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 82 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is this a cohesive data structure that could be factored out? Seems like you're using it for both TimedLayer and TimedDependency? Not a huge deal, just might simplify a bit of the code

yeah you are right, I've stepped back a bit and the behaviour for how the dependencies and layers are processed is pretty much the same, only difference is how you finish processing them. I have refactored the code accordingly, we now have a TimedComponent (variant of the other 2) and a single queue that stores both.

(Way cleaner, thanks a lot for pointing it out)

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 82 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Any reason for using separate priority queues instead of having a single priority queue that holds something like type Task = (float, variant<Layer, Edge>)?

Yeah you are absolutely right, have changed it accordingly.

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 98 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Maybe something like "ready_layers" might make it a bit clearer what these are? At least to me "frontier" could indicate the latest executed layers or the next to execute layers or the frontier of layers that aren't ready yet, so this name alone is a bit ambiguous

Done.

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 107 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

So in this current version there's no modelling of congestion or tensor reuse at all, correct?

correct, how should I go about modelling it (Something along the lines of, for a given instant of time t, a given device can at most transfer x tensors to any other device?).

Isn't it generally better for the task simulator to supply the tensors to the cost_estimator intelligently, and then let it figure out the costs, rather than trying to simulate those things directly with the task simulator (genuine question, I'm not sure how accurately the cost estimator models things).

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Overall this implementation is pretty solid, and the lambdas do a lot to improve readability. If it's possible to pull out some composite data structures so that not all of the raw data structure manipulation has to be done directly, that might be nice--don't know if it's possible, but would be worth considering

I have changed the device map std::unordered_map<parallel_layer_guid_t, std::unordered_set<device_id_t>> to be DeviceMapping, but I'm not sure which other ones would be worth pulling out.

lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 166 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might make things a bit more readable

refactored the test cases, should be a lot more readable now

lib/pcg/include/pcg/parallel_computation_graph/parallel_computation_graph.h line 44 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

To avoid overloading the term "source" (since we also have get_source_layer for edges), maybe could use a different name like "get_initial_layers" (so you'd have source/(sink/destination) edges, and initial/terminal for the overall graph)? Not super important, just a thought

you're right, changed

lib/pcg/src/pcg/machine_specification.cc line 55 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Normally we don't have dedicated overloads for a function f that are just (transform f)--is there a compelling argument why this is needed here? The problem is adding them in explodes the number of functions you have to write, and they don't contribute a lot as you already have transform

removed.

lib/pcg/src/pcg/machine_view.cc line 64 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer a throw with an actual error message

Done.

lib/pcg/src/pcg/operator_task_space.cc line 41 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Seems odd that get_operator_task_space doesn't actually take in a parallel_layer_guid_t or any other type of operator representation? I assume that this is because the operator task space is arbitrary and so we use a convention that we derive it from the output tensor's space, but it would be good for the API to reflect that that could be changed and that it's really the operator+inputs+outputs that determines the task space

I have changed it accordingly, but yeah the operator task space still does not make too much sense. I'm generally a bit confused about how the MachineView mapping to Operator should work. MachineView describes a N-dimensional strided orthotope (which then gets mapped to the MachineSpace etc...), but how does this match with an operator, which is instead a function that returns a tuple of vectors, each with it's own dimension. Right now OperatorTaskSpace is essentially copying the parallel degrees of the first output tensor, but are these the same across all output tensors (or am I just making the arbitrary decision of fetching the first output tensor shape, rather than any other one)?

lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 108 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Which overload is this? Try to put the argument types in the test names (FYI I normally shorten A const & to A in the signatures because they're semantically equivalent for value types, which is mostly what we deal with, just to try to make the TEST_CASE names a bit more readable)

Done.

lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 134 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is there a subcase named get_source_layer within a testcase for get_source_layer?

Done.

lib/utils/src/utils/graph/dataflow_graph/algorithms/get_outgoing_edges.cc line 6 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add unit tests for these

Done.

lib/compiler/src/compiler/cost_estimator/cost_estimator.cc line 1 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why a newline?

Done.

lib/compiler/src/compiler/machine_mapping/machine_mapping_problem_tree/unmapped_op_cost_estimate_key.cc line 36 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should probably be moved to the file for OpCostEstimateKey instead of here, as the signature doesn't really involve UnmappedOpCostEstimateKey

Done.

lib/compiler/test/src/compiler/cost_estimator_for_test.h line 37 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Generally pass primitives by value

Done.

lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 41 at r1 (raw file):

  MachineSpecification machine_spec = MachineSpecification{3, 3, 3, 1, 1};

  TEST_CASE("task_simulator: linear graph") {

Done.

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 96 at r1 (raw file):

    layer_processing.push(TimedLayer{layer, current_time + cost});
    for (device_id_t d : device_mapping.at(layer)) {
      devices[d] = true;

Done.

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 111 at r1 (raw file):

  auto finish_layer_processing = [&](TimedLayer const &timed_layer) {
    for (device_id_t d : device_mapping.at(timed_layer.layer)) {
      devices[d] = false;

Done.

lib/pcg/src/pcg/machine_view.cc line 46 at r1 (raw file):

  if (strides.size() != dims.size()) {
    throw mk_runtime_error(
        fmt::format("Dimensions of {} and {} must match when calling "

Done.

lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 158 at r1 (raw file):

    }

    SUBCASE("three layers in serial") {

Done.

lib/compiler/include/compiler/cost_estimator/task_simulator.h line 10 at r1 (raw file):

namespace FlexFlow {
float task_simulator_forward_pass(ParallelComputationGraph const &pcg,

Done.

Marsella8

Reviewable status: 8 of 36 files reviewed, 22 unresolved discussions (waiting on @lockshaw)

lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 38 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'd recommend at least also doing some tests with a cost estimator that differentiates between different ops and comms to validate that the right ops and comms are getting passed in.

added

lockshaw

Reviewed 17 of 28 files at r2, 1 of 4 files at r4, all commit messages.
Reviewable status: 26 of 39 files reviewed, 8 unresolved discussions (waiting on @Marsella8)

lib/compiler/include/compiler/cost_estimator/timed_component.variant.toml line 0 at r4 (raw file):
Why not (time, variant<layer, edge>) instead of variant<(time, layer), (time, edge)>?

Marsella8

Reviewable status: 26 of 41 files reviewed, 8 unresolved discussions (waiting on @lockshaw)

lib/compiler/include/compiler/cost_estimator/timed_component.variant.toml line at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not (time, variant<layer, edge>) instead of variant<(time, layer), (time, edge)>?

for some functions you only want to take in either a TimedLayer or a TimedComponent, find it more convenient to do the variant this way since otherwise I would have to either pass the variant and manually assert that it has the right type of unpacking and then passing time and layer/edge separately, which wasn't super elegant. tried both and the current one seemed to fit a bit better, not too much of a difference either way.

lockshaw

Reviewed 2 of 16 files at r6, 1 of 6 files at r7.
Reviewable status: 23 of 47 files reviewed, 21 unresolved discussions (waiting on @Marsella8)

lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 2 at r6 (raw file):

namespace = "FlexFlow"
name = "TaskGraph"

It would be nice if TaskGraph was a bit more semantically-rich, i.e., has node labels of type Task, where Task is a variant.toml of parallel_layer_guid_t and whatever your communication representation is. Then, your actual simulator code would take in a DiGraphView and a couple std::functions, which would be based on the TaskGraph. Right now it kinda feels like you have two levels of structures, (1) the full PCG, and (2) a fully anonymized DiGraphView, which makes the conversion/lowering code rather complex. It would be nice to have an intermediate structure to simplify things a bit.

lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 27 at r6 (raw file):

[[fields]]
name = "is_allowed_to_run"

Conceptually shouldn't be part of the TaskGraph--should be passed as a separate parameter probably, or TaskGraph should be renamed/re-structured. TaskGraph to me conveys the set of tasks and the data dependencies between them, which is_allowed_to_run isn't part of

lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 28 at r6 (raw file):

[[fields]]
name = "is_allowed_to_run"
type = "std::function<bool(::FlexFlow::Node, ::FlexFlow::TasksStateTracker)>"

Probably shouldn't take in the whole TasksStateTracker, as it shouldn't need to access the set of ready tasks or the current time. In fact, they don't even need the TimedTasks in the other two fields of the tracker, just two sets of Node: finished and in-progress

lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 28 at r6 (raw file):

[[fields]]
name = "is_allowed_to_run"
type = "std::function<bool(::FlexFlow::Node, ::FlexFlow::TasksStateTracker)>"

Suggestion:

type = "std::function<bool(::FlexFlow::Node, ::FlexFlow::TasksStateTracker const &)>"

lib/compiler/include/compiler/cost_estimator/tasks_state_tracker.struct.toml line 2 at r6 (raw file):

namespace = "FlexFlow"
name = "TasksStateTracker"

Suggestion:

name = "TaskGraphExecutionState"

lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 0 at r6 (raw file):
simulate_task_graph_execution.cc

lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 47 at r6 (raw file):

  return state_tracker.ready_tasks.empty() &&
         state_tracker.tasks_processing.empty();
}

Now that the domain-specific logic is factored out, feel free to make these lambdas again

Code quote:

static void start_task_processing(TasksStateTracker &state_tracker,
                                  TaskGraph const &task_graph,
                                  Node const &task) {
  float cost = task_graph.cost_map.at(task);
  state_tracker.tasks_processing.push(
      TimedTask{state_tracker.current_time + cost, task});
  state_tracker.ready_tasks.erase(task);
}

static bool dependencies_are_satisfied(TasksStateTracker const &state_tracker,
                                       TaskGraph const &task_graph,
                                       Node const &task) {
  std::unordered_set<Node> incoming_dependencies =
      get_predecessors(task_graph.graph, task);
  return is_subseteq_of(incoming_dependencies, state_tracker.processed_tasks);
}

static void finish_task_processing(TasksStateTracker &state_tracker,
                                   TaskGraph const &task_graph,
                                   TimedTask const &timed_task) {
  state_tracker.processed_tasks.insert(timed_task.node);
  for (Node const &task : get_successors(task_graph.graph, timed_task.node)) {
    if (dependencies_are_satisfied(state_tracker, task_graph, task)) {
      state_tracker.ready_tasks.insert(task);
    }
  }
  state_tracker.current_time = timed_task.endtime;
}

static bool is_processing_done(TasksStateTracker const &state_tracker) {
  return state_tracker.ready_tasks.empty() &&
         state_tracker.tasks_processing.empty();
}

lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 55 at r6 (raw file):

}

float simulate_forward_pass(TaskGraph const &task_graph) {

Suggestion:

 simulate_task_graph_execution

lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 55 at r6 (raw file):

}

float simulate_forward_pass(TaskGraph const &task_graph) {

TaskGraphProfile

Code quote:

float

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 70 at r6 (raw file):

  DiGraph digraph = DiGraph::create<AdjacencyDiGraph>();
  bidict<Node,
         std::variant<parallel_layer_guid_t, ParallelComputationGraphEdge>>

Avoid raw variants, introduce a Task type using dtgen's variant.toml

Code quote:

std::variant<parallel_layer_guid_t, ParallelComputationGraphEdge>

lib/compiler/include/compiler/cost_estimator/timed_component_comparator.h line 10 at r4 (raw file):

struct TimedComponentComparator {
  bool operator()(TimedComponent const &lhs, TimedComponent const &rhs) const;
};

I actually like having the explicit comparator, can you bring this back? It removes the need to understand what code the ord feature in dtgen generates, which is nice as it avoids ambiguity

Code quote:

struct TimedComponentComparator {
  bool operator()(TimedComponent const &lhs, TimedComponent const &rhs) const;
};

lib/compiler/include/compiler/cost_estimator/task_graph_profile.struct.toml line 2 at r7 (raw file):

namespace = "FlexFlow"
name = "TaskGraphProfile"

Suggestion:

name = "TaskGraphExecutionTrace"

lib/compiler/include/compiler/cost_estimator/timed_task.struct.toml line 2 at r7 (raw file):

namespace = "FlexFlow"
name = "TimedTask"

Suggestion:

name = "InProgressTask"

Marsella8

Reviewable status: 23 of 47 files reviewed, 21 unresolved discussions (waiting on @lockshaw)

lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 27 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Conceptually shouldn't be part of the TaskGraph--should be passed as a separate parameter probably, or TaskGraph should be renamed/re-structured. TaskGraph to me conveys the set of tasks and the data dependencies between them, which is_allowed_to_run isn't part of

Done.

lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 28 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Probably shouldn't take in the whole TasksStateTracker, as it shouldn't need to access the set of ready tasks or the current time. In fact, they don't even need the TimedTasks in the other two fields of the tracker, just two sets of Node: finished and in-progress

Done.

lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 47 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Now that the domain-specific logic is factored out, feel free to make these lambdas again

Done.

lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 55 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

TaskGraphProfile

wouldn't it be better to return the entire profile rather than only the endtime? (since this way in testing I can check against the entire trace, and given the profile you can choose to discard everything but the endtime).

lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

simulate_task_graph_execution.cc

Done.

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 70 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Avoid raw variants, introduce a Task type using dtgen's variant.toml

Done.

lib/compiler/include/compiler/cost_estimator/timed_component_comparator.h line 10 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I actually like having the explicit comparator, can you bring this back? It removes the need to understand what code the ord feature in dtgen generates, which is nice as it avoids ambiguity

Done.

lib/compiler/include/compiler/cost_estimator/timed_task.struct.toml line 2 at r7 (raw file):

namespace = "FlexFlow"
name = "TimedTask"

Done.

lib/compiler/include/compiler/cost_estimator/tasks_state_tracker.struct.toml line 2 at r6 (raw file):

namespace = "FlexFlow"
name = "TasksStateTracker"

Done.

lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 28 at r6 (raw file):

[[fields]]
name = "is_allowed_to_run"
type = "std::function<bool(::FlexFlow::Node, ::FlexFlow::TasksStateTracker)>"

Done.

lib/compiler/include/compiler/cost_estimator/task_graph_profile.struct.toml line 2 at r7 (raw file):

namespace = "FlexFlow"
name = "TaskGraphProfile"

Done.

lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 55 at r6 (raw file):

}

float simulate_forward_pass(TaskGraph const &task_graph) {

Done.

lockshaw

Reviewed 8 of 29 files at r2, 3 of 18 files at r6, 23 of 24 files at r8, all commit messages.
Reviewable status: 52 of 55 files reviewed, 23 unresolved discussions (waiting on @Marsella8)

lib/pcg/src/pcg/operator_task_space.cc line 41 at r1 (raw file):

Previously, Marsella8 wrote…

I have changed it accordingly, but yeah the operator task space still does not make too much sense. I'm generally a bit confused about how the MachineView mapping to Operator should work. MachineView describes a N-dimensional strided orthotope (which then gets mapped to the MachineSpace etc...), but how does this match with an operator, which is instead a function that returns a tuple of vectors, each with it's own dimension. Right now OperatorTaskSpace is essentially copying the parallel degrees of the first output tensor, but are these the same across all output tensors (or am I just making the arbitrary decision of fetching the first output tensor shape, rather than any other one)?

The latter--it's an arbitrary choice. Eventually the dependencies will be moved into the PCG/inferrable from the PCG itself as part of #1569

lib/compiler/include/compiler/cost_estimator/task_constraint.struct.toml line 2 at r8 (raw file):

namespace = "FlexFlow"
name = "TaskConstraint"

Suggestion:

name = "TaskExecutionConstraint"

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 19 at r8 (raw file):

TaskGraphExecutionTrace
    simulate_task_graph_execution(TaskGraph const &task_graph,

Might want to assert that the task graph is acylic? Just to improve debuggability

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 23 at r8 (raw file):

  TaskGraphExecutionState execution_state =
      TaskGraphExecutionState{get_sources(task_graph.graph), {}, {}, 0.0};

Add argument name comments

Code quote:

      TaskGraphExecutionState{get_sources(task_graph.graph), {}, {}, 0.0};

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 52 at r8 (raw file):

        TaskProfile{in_progress_task.node,
                    in_progress_task.endtime -
                        task_graph.cost_map.at(in_progress_task.node),

Why not just have InProgressTask have a start_time? This seems easier/more-intuitive than having to infer if after it's already been executed

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 68 at r8 (raw file):

  };
  while (!is_processing_done()) {
    for (Node const &task : sorted(execution_state.ready_tasks)) {

Might be better to just have ready_tasks be a std::set so you have a non-insertion-order-based-order and don't have to keep sorting as iteration in a std::set is already ordered?

Code quote:

sorted(execution_state.ready_tasks)

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 80 at r8 (raw file):

    if (!execution_state.tasks_processing.empty()) {
      InProgressTask next_task = get_next_task();

Suggestion:

get_next_task_to_finish

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 87 at r8 (raw file):

  }

  return TaskGraphExecutionTrace{task_profiles, execution_state.current_time};

Would be good to include a check that all tasks were executed

lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 2 at r8 (raw file):

namespace = "FlexFlow"
name = "TaskGraph"

I would get rid of TaskGraph, and just have the simulate function take in a DiGraphView, a cost function (as a std::function or a wrapped std::function, depending on your preference), and a constraint function

lib/compiler/include/compiler/machine_mapping/device_mapping.cc line 1 at r8 (raw file):

Leading newline?

lib/compiler/include/compiler/machine_mapping/device_mapping.cc line 24 at r8 (raw file):

    device_mapping.insert(
        {layer, get_device_ids(op, machine_view, machine_spec)});
  }

Change to map_values?

Code quote:

  std::unordered_map<parallel_layer_guid_t, std::unordered_set<device_id_t>>
      device_mapping;
  for (auto const &[layer, machine_view] : machine_mapping.machine_views) {
    OperatorTaskSpace op =
        get_operator_task_space(pcg, layer));
    device_mapping.insert(
        {layer, get_device_ids(op, machine_view, machine_spec)});
  }

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 64 at r8 (raw file):

}

static std::pair<DiGraph, bidict<Node, PCGTask>>

Make this an actual dtgen type (this is the PCGTaskGraph mentioned in zulip)

Also, this should be a public function, as getting a task graph for a PCG is a generally useful function

Code quote:

static std::pair<DiGraph, bidict<Node, PCGTask>>

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 112 at r8 (raw file):

static bool
    is_allowed_to_run_super(Node const &task,

Make this a lambda

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 133 at r8 (raw file):

          device_map.raw_device_map.at(task_component.require_layer()));
    }
  }

Can you clean this up a bit with utils/containers?

Code quote:

  std::unordered_set<device_id_t> devices_occupied;

  for (Node const &in_progress_task : tasks_processing) {
    auto task_component = node_map.at_l(in_progress_task);
    if (task_component.is_layer()) {
      devices_occupied = set_union(
          devices_occupied,
          device_map.raw_device_map.at(task_component.require_layer()));
    }
  }

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 149 at r8 (raw file):

  auto [digraph, node_map] = get_digraph(pcg);
  auto cost_map = get_cost_map(node_map, pcg, machine_mapping, estimator);

I feel like this would be simpler just as a lambda that you pass in to the simulate function, rather than having to do all the manipulation to compute the cost map up front

Code quote:

  auto cost_map = get_cost_map(node_map, pcg, machine_mapping, estimator);

lib/utils/include/utils/deduplicated_priority_queue.h line 41 at r8 (raw file):

  }

  std::vector<Elem> contents() const {

What is this function (semantically) and why is it necessary?

lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 0 at r8 (raw file):
Would be good to move all the task simulator stuff to its own directory (e.g., compiler/task_graph_simulator)

lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 28 at r8 (raw file):

[[fields]]
name = "tasks_processing"

Suggestion:

name = "in_progress_tasks"

lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 32 at r8 (raw file):

[[fields]]
name = "processed_tasks"

Suggestion:

name = "finished_tasks"

lib/compiler/include/compiler/machine_mapping/device_mapping.struct.toml line 2 at r8 (raw file):

namespace = "FlexFlow"
name = "DeviceMapping"

Suggestion:

name = "UnstructuredDeviceMapping"

lib/compiler/include/compiler/cost_estimator/task_graph_execution_trace.struct.toml line 27 at r8 (raw file):

[[fields]]
name = "end_time" 
type = "float"

Remove, just have a helper function that calculates it from the set of task profiles

Code quote:

[[fields]]
name = "end_time"
type = "float"

lockshaw

Reviewed 1 of 29 files at r2, 18 of 19 files at r9, all commit messages.
Reviewable status: 54 of 55 files reviewed, 15 unresolved discussions (waiting on @Marsella8)

lib/compiler/include/compiler/machine_mapping/unstructured_device_mapping.h line 12 at r9 (raw file):

UnstructuredDeviceMapping
    get_device_mapping(MachineMapping const &machine_mapping,

Suggestion:

get_unstructured_device_mapping

lib/compiler/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 97 at r9 (raw file):

  }
  if (execution_state.finished_tasks.size() != num_nodes(task_graph.graph)) {
    throw mk_runtime_error("Some tasks within the TaskGraph are unreachable");

Suggestion:

("Failed to execute all tasks in TaskGraph");

lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 37 at r9 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("task_simulator") {

Suggestion:

 TEST_CASE("task_simulator_estimate_forward_pass_time") {

lib/compiler/test/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 17 at r9 (raw file):

    SUBCASE("linear graph") {
      DiGraph g = DiGraph::create<AdjacencyDiGraph>();

Suggestion:

    DiGraph g = DiGraph::create<AdjacencyDiGraph>();
    SUBCASE("linear graph") {

lib/compiler/test/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 87 at r9 (raw file):

      }

      SUBCASE("processing constraint") {

Suggestion:

      SUBCASE("constrained to one node at a time")

lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 131 at r9 (raw file):

      transform(layer_nodes, [&](Node const &n) {
        return node_map.at_l(n).require_layer();
      });

filtrans would probably be cleaner

Code quote:

  std::unordered_set<Node> layer_nodes =
      filter(in_progress_tasks,
             [&](Node const &n) { return node_map.at_l(n).is_layer(); });
  std::unordered_set<parallel_layer_guid_t> layers =
      transform(layer_nodes, [&](Node const &n) {
        return node_map.at_l(n).require_layer();
      });

Marsella8

Reviewable status: 54 of 55 files reviewed, 15 unresolved discussions (waiting on @lockshaw)

lib/utils/include/utils/deduplicated_priority_queue.h line 41 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What is this function (semantically) and why is it necessary?

pulls out all the contents of the deduplicated priority queue. Needed since i have to pass it to the constraint function.

lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 2 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

It would be nice if TaskGraph was a bit more semantically-rich, i.e., has node labels of type Task, where Task is a variant.toml of parallel_layer_guid_t and whatever your communication representation is. Then, your actual simulator code would take in a DiGraphView and a couple std::functions, which would be based on the TaskGraph. Right now it kinda feels like you have two levels of structures, (1) the full PCG, and (2) a fully anonymized DiGraphView, which makes the conversion/lowering code rather complex. It would be nice to have an intermediate structure to simplify things a bit.

Done.

lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 2 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I would get rid of TaskGraph, and just have the simulate function take in a DiGraphView, a cost function (as a std::function or a wrapped std::function, depending on your preference), and a constraint function

Done.

lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would be good to move all the task simulator stuff to its own directory (e.g., compiler/task_graph_simulator)

Done.

lib/compiler/include/compiler/cost_estimator/task_graph_execution_trace.struct.toml line 27 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Remove, just have a helper function that calculates it from the set of task profiles

Done.

lib/compiler/include/compiler/machine_mapping/device_mapping.cc line 1 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Leading newline?

Done.

lib/compiler/include/compiler/machine_mapping/device_mapping.cc line 24 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Change to map_values?

don't think it's doable elegantly, given that the new value is a function of both the key (layer) and the original value.

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 23 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add argument name comments

Done.

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 52 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not just have InProgressTask have a start_time? This seems easier/more-intuitive than having to infer if after it's already been executed

Done.

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 68 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might be better to just have ready_tasks be a std::set so you have a non-insertion-order-based-order and don't have to keep sorting as iteration in a std::set is already ordered?

the sorted is more so for consistency during testing (if there are multiple available tasks, process the one with the associated smallest node). It's more to have a fixed, arbitrary order than anything else, so probably better to not have this property embedded into the execution_tracker itself?

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 87 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would be good to include a check that all tasks were executed

Done.

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 64 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Make this an actual dtgen type (this is the PCGTaskGraph mentioned in zulip)

Also, this should be a public function, as getting a task graph for a PCG is a generally useful function

Done.

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 112 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Make this a lambda

Done.

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 133 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can you clean this up a bit with utils/containers?

Done.

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 149 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I feel like this would be simpler just as a lambda that you pass in to the simulate function, rather than having to do all the manipulation to compute the cost map up front

Done.

lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 131 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

filtrans would probably be cleaner

Done.

lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 37 at r9 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("task_simulator") {

Done.

lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 28 at r8 (raw file):

[[fields]]
name = "tasks_processing"

Done.

lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 32 at r8 (raw file):

[[fields]]
name = "processed_tasks"

Done.

lib/compiler/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 97 at r9 (raw file):

  }
  if (execution_state.finished_tasks.size() != num_nodes(task_graph.graph)) {
    throw mk_runtime_error("Some tasks within the TaskGraph are unreachable");

Done.

lib/compiler/include/compiler/machine_mapping/unstructured_device_mapping.h line 12 at r9 (raw file):

UnstructuredDeviceMapping
    get_device_mapping(MachineMapping const &machine_mapping,

Done.

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 80 at r8 (raw file):

    if (!execution_state.tasks_processing.empty()) {
      InProgressTask next_task = get_next_task();

Done.

lib/compiler/include/compiler/machine_mapping/device_mapping.struct.toml line 2 at r8 (raw file):

namespace = "FlexFlow"
name = "DeviceMapping"

Done.

lib/compiler/include/compiler/cost_estimator/task_constraint.struct.toml line 2 at r8 (raw file):

namespace = "FlexFlow"
name = "TaskConstraint"

Done.

lib/compiler/test/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 17 at r9 (raw file):

    SUBCASE("linear graph") {
      DiGraph g = DiGraph::create<AdjacencyDiGraph>();

Done.

lib/compiler/test/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 87 at r9 (raw file):

      }

      SUBCASE("processing constraint") {

Done.

…ask-simulator

codecov · 2025-01-20T23:06:52Z

Codecov Report

Attention: Patch coverage is 84.82490% with 39 lines in your changes missing coverage. Please review.

Project coverage is 60.94%. Comparing base (3ee5e48) to head (d701bb0).
Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
...el_computation_graph/parallel_computation_graph.cc	17.39%	19 Missing ⚠️
lib/pcg/src/pcg/machine_view.cc	52.17%	11 Missing ⚠️
...k_graph_simulator/simulate_task_graph_execution.cc	93.22%	4 Missing ⚠️
...task_graph_simulator/task_graph_execution_trace.cc	81.81%	2 Missing ⚠️
lib/utils/include/utils/archetypes/value_type.h	0.00%	2 Missing ⚠️
lib/utils/include/utils/containers/minimum.h	75.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1565      +/-   ##
==========================================
+ Coverage   60.47%   60.94%   +0.47%     
==========================================
  Files         606      618      +12     
  Lines       14725    14978     +253     
==========================================
+ Hits         8905     9129     +224     
- Misses       5820     5849      +29

Flag	Coverage Δ
unittests	`60.94% <84.82%> (+0.47%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
lib/compiler/src/compiler/allowed_machine_views.cc	`95.31% <100.00%> (+0.07%)`	⬆️
...rc/compiler/cost_estimator/op_cost_estimate_key.cc	`100.00% <100.00%> (ø)`
...src/compiler/cost_estimator/tensor_set_movement.cc	`100.00% <100.00%> (ø)`
...ler/machine_mapping/get_optimal_machine_mapping.cc	`93.00% <100.00%> (+0.04%)`	⬆️
...er/src/compiler/machine_mapping/machine_mapping.cc	`100.00% <100.00%> (ø)`
...optimization/machine_mapping_with_memory_result.cc	`98.87% <100.00%> (+0.09%)`	⬆️
...ler/machine_mapping/unstructured_device_mapping.cc	`100.00% <100.00%> (ø)`
...ask_graph_simulator/in_progress_task_comparator.cc	`100.00% <100.00%> (ø)`
...rc/compiler/task_graph_simulator/pcg_task_graph.cc	`100.00% <100.00%> (ø)`
...rc/compiler/task_graph_simulator/task_simulator.cc	`100.00% <100.00%> (ø)`
... and 12 more

... and 2 files with indirect coverage changes

lockshaw

Can you add unit tests for get_device_ids and for
std::unordered_set<DataflowEdge> get_outgoing_edges(DataflowGraphView const &g, std::unordered_set<Node> const &ns)?

Reviewed 38 of 38 files at r11, 18 of 18 files at r12, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @Marsella8)

lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 104 at r10 (raw file):

Previously, Marsella8 wrote…

Done.

Not seeing this change?

lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 170 at r10 (raw file):

Previously, Marsella8 wrote…

In the sense of having ForwardOpCostMetrics and BackwardOpCost metrics?

Just change OpCostMetrics to have a forward_execution_time and a backward_execution_time

lib/compiler/include/compiler/cost_estimator/tensor_set_movement.h line 0 at r11 (raw file):
Empty file?

lib/utils/include/utils/deduplicated_priority_queue.h line 42 at r11 (raw file):

  }

  std::set<Elem> contents() const {

Suggestion:

  std::set<Elem, Compare>

lib/utils/include/utils/archetypes/value_type.h line 42 at r11 (raw file):

  std::ostringstream oss;
  oss << "<value_type<" << TAG << ">>";
  return oss.str();

Suggestion:

  assert(false);

lib/utils/include/utils/archetypes/value_type.h line 47 at r11 (raw file):

template <int TAG>
std::ostream &operator<<(std::ostream &s, value_type<TAG> const &x) {
  return s << format_as(x);

Suggestion:

  assert(false);

lib/utils/test/src/utils/containers/lookup_in_map.cc line 15 at r12 (raw file):

    SUBCASE("existing keys") {
      std::function<int(std::string)> func = lookup_in_map(map);

Suggestion:

      std::function<int(std::string const &)>

lockshaw · 2025-01-20T23:27:38Z

lib/compiler/include/compiler/task_graph_simulator/pcg_task.variant.toml line 21 at r10 (raw file):

Previously, Marsella8 wrote…

meant OpCostEstimateKey

OpCostMetric

Basically, if we use TensorSetMovement, the digraph is not a representation of only the PCG, but of PCG + DeviceMapping + MachineSpecification ....

Yes, but I think that's a good idea. The idea of the task graph is to represent the actual computation that occurs, not an abstracted partial version of it, so including the machine mapping information and making the communication explicit is exactly what I'd want.

Marsella8

Reviewable status: 60 of 95 files reviewed, 8 unresolved discussions (waiting on @lockshaw and @wmdi)

lib/compiler/include/compiler/cost_estimator/tensor_set_movement.h line at r11 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Empty file?

fixed

lib/compiler/include/compiler/task_graph_simulator/pcg_task.variant.toml line 21 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Yes, but I think that's a good idea. The idea of the task graph is to represent the actual computation that occurs, not an abstracted partial version of it, so including the machine mapping information and making the communication explicit is exactly what I'd want.

yeah ur right, done :)

lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 104 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Not seeing this change?

sorry, forgot to mention that try_require_layer does not exist

lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 170 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Just change OpCostMetrics to have a forward_execution_time and a backward_execution_time

OpCostMetricswas not made by me, and it's predominantly used within @wmdi 's machine mapping so I don't wanna make changes that break her code, possibly a good topic for next week's meeting?

lib/utils/include/utils/archetypes/value_type.h line 42 at r11 (raw file):

  std::ostringstream oss;
  oss << "<value_type<" << TAG << ">>";
  return oss.str();

Done.

lib/utils/include/utils/archetypes/value_type.h line 47 at r11 (raw file):

template <int TAG>
std::ostream &operator<<(std::ostream &s, value_type<TAG> const &x) {
  return s << format_as(x);

Done.

lib/utils/include/utils/deduplicated_priority_queue.h line 42 at r11 (raw file):

  }

  std::set<Elem> contents() const {

Done.

lib/utils/test/src/utils/containers/lookup_in_map.cc line 15 at r12 (raw file):

    SUBCASE("existing keys") {
      std::function<int(std::string)> func = lookup_in_map(map);

Done.

lockshaw

Reviewed 35 of 35 files at r13, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Marsella8 and @wmdi)

lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 104 at r10 (raw file):

Previously, Marsella8 wrote…

sorry, forgot to mention that try_require_layer does not exist

Then try_require_operator? afaik it should be generated by dtgen

lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 170 at r10 (raw file):

Previously, Marsella8 wrote…

OpCostMetricswas not made by me, and it's predominantly used within @wmdi 's machine mapping so I don't wanna make changes that break her code, possibly a good topic for next week's meeting?

It's fine to make that change, we've actually been discussing it for a while--iirc CostEstimator already provides that information, it's just not currently preserved by OpCostMetrics

lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 46 at r13 (raw file):

    }

    auto running_ops = filtrans(in_progress_tasks, [&](Node const &n) {

Suggestion:

in_progress_ops

lib/pcg/include/pcg/operator_task_space.h line 26 at r13 (raw file):

                                       parallel_layer_guid_t const &layer);

OperatorTaskSpace get_operator_task_space_from_parallel_tensor_shape(

Don't allow this operation, for now the operator task space should rely on the parallel_layer_guid_t. It seems you're using this to get the device set for the task graph--you should instead compute that ahead of time when you have the parallel_layer_guid_t and save it in the PCGTasks in the TaskGraph

Marsella8

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @lockshaw and @wmdi)

lib/pcg/include/pcg/operator_task_space.h line 26 at r13 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Don't allow this operation, for now the operator task space should rely on the parallel_layer_guid_t. It seems you're using this to get the device set for the task graph--you should instead compute that ahead of time when you have the parallel_layer_guid_t and save it in the PCGTasks in the TaskGraph

ur right, have added a node_to_layer bidict to the PCGTaskGraph (we would technically need a tridict to have perfect "make invalid states unrepresentable", but I think this suffices hahah)

lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 104 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Then try_require_operator? afaik it should be generated by dtgen

sorry, totally missed it.

lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 170 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

It's fine to make that change, we've actually been discussing it for a while--iirc CostEstimator already provides that information, it's just not currently preserved by OpCostMetrics

Done.

lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 46 at r13 (raw file):

    }

    auto running_ops = filtrans(in_progress_tasks, [&](Node const &n) {

Done.

Marsella8

Reviewable status: 63 of 100 files reviewed, 7 unresolved discussions (waiting on @lockshaw and @wmdi)

lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):

    float cost =
        context.cost_estimator.estimate_cost(mapped)
            .forward_runtime; // TODO(@pietro)(@wmdi) incorporate backward pass?

@wmdi

lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 35 at r14 (raw file):

      if (mapping.cost.forward_runtime >=
              other_mapping.cost
                  .forward_runtime && // TODO(@wmdi) does it have to be modified

@wmdi

lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 104 at r14 (raw file):

            std::max(lhs_mm.cost.forward_runtime, rhs_mm.cost.forward_runtime),
            std::max(lhs_mm.cost.backward_runtime,
                     rhs_mm.cost.backward_runtime), //(@wmdi) is this correct?

@wmdi

lockshaw

Reviewed 37 of 37 files at r14, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @lockshaw, @Marsella8, and @wmdi)

lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):

Previously, Marsella8 wrote…

@wmdi

Should should be the sum of the forward and backward runtimes

lib/compiler/include/compiler/task_graph_simulator/pcg_task_graph.struct.toml line 24 at r14 (raw file):

[[fields]]
name = "node_to_layer"
type = "::FlexFlow::bidict<::FlexFlow::Node, ::FlexFlow::parallel_layer_guid_t>"

I would lean toward making this a std::unordered_set<DeviceId> (i.e., pre-compute the device sets and save the result) rather than saving the parallel_layer_guid_t directly, especially since this will stop being a bidict once you have both forward and backward tasks. Thoughts?

Marsella8

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @lockshaw and @wmdi)

lib/compiler/include/compiler/task_graph_simulator/pcg_task_graph.struct.toml line 24 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I would lean toward making this a std::unordered_set<DeviceId> (i.e., pre-compute the device sets and save the result) rather than saving the parallel_layer_guid_t directly, especially since this will stop being a bidict once you have both forward and backward tasks. Thoughts?

makes sense, updated

lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should should be the sum of the forward and backward runtimes

thanks, added, not sure how to update the associated tests accordingly, I've commented them out for now, hopefully @wmdi could fix them?

lockshaw

Reviewed 26 of 26 files at r15, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @lockshaw and @wmdi)

lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):

Previously, Marsella8 wrote…

thanks, added, not sure how to update the associated tests accordingly, I've commented them out for now, hopefully @wmdi could fix them?

They'll need to be uncommented before this is merged so you may as well keep them uncommented right now (if at least to reduce the diff size for review)

wmdi

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Marsella8)

lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 35 at r14 (raw file):

Previously, Marsella8 wrote…

@wmdi

I think so. Let's keep this comment for this PR, and I will fix it when I merge it.

lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 104 at r14 (raw file):

Previously, Marsella8 wrote…

@wmdi

Correct

Marsella8

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @lockshaw and @wmdi)

lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

They'll need to be uncommented before this is merged so you may as well keep them uncommented right now (if at least to reduce the diff size for review)

Done.

lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 35 at r14 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

I think so. Let's keep this comment for this PR, and I will fix it when I merge it.

great thanks.

lockshaw

Reviewed 18 of 18 files at r16, 25 of 27 files at r17, 19 of 19 files at r18, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @wmdi)

task_simulator_forward_pass

7b04ccb

Marsella8 requested a review from lockshaw December 27, 2024 22:37

lockshaw requested changes Jan 9, 2025

View reviewed changes

lockshaw mentioned this pull request Jan 10, 2025

Enable task simulator to return a trace/profile object for understanding/visualization #1568

Closed

lockshaw requested changes Jan 10, 2025

View reviewed changes

task simulator fixes

f88ec3a

lockshaw mentioned this pull request Jan 10, 2025

Adding task-based simulator for PCGs #1365

Closed

Marsella8 commented Jan 10, 2025

View reviewed changes

additional testing + bug fix

d53f955

Marsella8 commented Jan 12, 2025

View reviewed changes

task simulator fix

694e4b9

lockshaw mentioned this pull request Jan 16, 2025

Top-level loop for compiler #1576

Open

2 tasks

lockshaw requested changes Jan 16, 2025

View reviewed changes

Pietro Max Marsella added 2 commits January 17, 2025 12:57

refactor to task simulator

78baba8

minor fix

f14aa94

Marsella8 commented Jan 17, 2025

View reviewed changes

Pietro Max Marsella added 3 commits January 18, 2025 01:19

task simulator refactor

4ec5ce0

added testing

a18e23d

added task graph profiler

13a9b75

lockshaw requested changes Jan 18, 2025

View reviewed changes

fixes

97e8ea5

Marsella8 commented Jan 19, 2025

View reviewed changes

lockshaw requested changes Jan 19, 2025

View reviewed changes

minor refactoring + moving things around

81bfbc5

lockshaw requested changes Jan 20, 2025

View reviewed changes

Pietro Max Marsella added 2 commits January 20, 2025 10:47

interface update

30a887f

minor fix

423848e

Marsella8 commented Jan 20, 2025

View reviewed changes

Pietro Max Marsella added 2 commits January 20, 2025 10:52

Merge remote-tracking branch 'origin/master' into task-simulator

448a59a

merge fix

aef3b4f

lockshaw and others added 3 commits January 20, 2025 14:39

Merge branch 'master' into task-simulator

b9e7ae3

minor fix

fabae44

Merge branch 'task-simulator' of github.com:Marsella8/FlexFlow into t…

a439bd8

…ask-simulator

lockshaw requested changes Jan 20, 2025

View reviewed changes

fixes

5cd8ebe

Marsella8 commented Jan 21, 2025

View reviewed changes

lockshaw requested changes Jan 21, 2025

View reviewed changes

fixes

925aedf

Marsella8 commented Jan 21, 2025

View reviewed changes

fmt

4263607

Marsella8 commented Jan 21, 2025

View reviewed changes

lockshaw requested changes Jan 21, 2025

View reviewed changes

fixes

f3ed102

Marsella8 commented Jan 21, 2025

View reviewed changes

lockshaw requested changes Jan 22, 2025

View reviewed changes

wmdi reviewed Jan 23, 2025

View reviewed changes

uncommented test

907f691

Marsella8 commented Jan 23, 2025

View reviewed changes

Pietro Max Marsella added 2 commits January 24, 2025 15:21

merge fix

f1023a5

fmt

96da7ec

lockshaw approved these changes Jan 24, 2025

View reviewed changes

Pietro Max Marsella added 2 commits January 24, 2025 15:25

test fix

010758f

fix

d701bb0

lockshaw enabled auto-merge (squash) January 24, 2025 23:31

lockshaw merged commit de7fa32 into flexflow:master Jan 24, 2025
5 of 7 checks passed

victorli2002 mentioned this pull request Feb 7, 2025

Adding to unity substitution set #1594

Open

victorli2002 mentioned this pull request Mar 12, 2025

MCMC #1601

Open

victorli2002 mentioned this pull request Apr 21, 2025

fix substitution #1606

Closed

Task Simulator #1565

Task Simulator #1565

Uh oh!

Conversation

Marsella8 commented Dec 27, 2024 • edited by lockshaw Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lockshaw left a comment

Choose a reason for hiding this comment

Uh oh!

lockshaw left a comment

Choose a reason for hiding this comment

Uh oh!

Marsella8 left a comment

Choose a reason for hiding this comment

Uh oh!

Marsella8 left a comment

Choose a reason for hiding this comment

Uh oh!

lockshaw left a comment

Choose a reason for hiding this comment

Uh oh!

Marsella8 left a comment

Choose a reason for hiding this comment

Uh oh!

lockshaw left a comment

Choose a reason for hiding this comment

Uh oh!

Marsella8 left a comment

Choose a reason for hiding this comment

Uh oh!

lockshaw left a comment

Choose a reason for hiding this comment

Uh oh!

lockshaw left a comment

Choose a reason for hiding this comment

Uh oh!

Marsella8 left a comment

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Jan 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

lockshaw left a comment

Choose a reason for hiding this comment

Uh oh!

lockshaw commented Jan 20, 2025

Uh oh!

Marsella8 left a comment

Choose a reason for hiding this comment

Uh oh!

lockshaw left a comment

Choose a reason for hiding this comment

Uh oh!

Marsella8 left a comment

Choose a reason for hiding this comment

Uh oh!

Marsella8 left a comment

Choose a reason for hiding this comment

Uh oh!

lockshaw left a comment

Choose a reason for hiding this comment

Uh oh!

Marsella8 left a comment

Choose a reason for hiding this comment

Uh oh!

lockshaw left a comment

Choose a reason for hiding this comment

Uh oh!

wmdi left a comment

Choose a reason for hiding this comment

Uh oh!

Marsella8 left a comment

Choose a reason for hiding this comment

Uh oh!

lockshaw left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Marsella8 commented Dec 27, 2024 •

edited by lockshaw

Loading

codecov bot commented Jan 20, 2025 •

edited

Loading