Skip to content

Conversation

@Marsella8
Copy link
Contributor

@Marsella8 Marsella8 commented Dec 27, 2024

Description of changes:

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

Before merging:

  • Did you update the flexflow-third-party repo, if modifying any of the Cmake files, the build configs, or the submodules?

This change is Reviewable

@Marsella8 Marsella8 requested a review from lockshaw December 27, 2024 22:37
Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 28 files at r1, all commit messages.
Reviewable status: 10 of 28 files reviewed, 4 unresolved discussions (waiting on @Marsella8)


lib/compiler/src/compiler/cost_estimator/cost_estimator.cc line 1 at r1 (raw file):

Why a newline?


lib/compiler/test/src/compiler/cost_estimator_for_test.h line 37 at r1 (raw file):

CostEstimator make_fake_constant_cost_estimator(float const &op_cost,
                                                float const &comm_cost);

Generally pass primitives by value

Suggestion:

CostEstimator make_fake_constant_cost_estimator(float op_cost,
                                                float comm_cost);

lib/compiler/include/compiler/cost_estimator/timed_dependency.struct.toml line 1 at r1 (raw file):

Why a leading newline in a bunch of these files?


lib/compiler/src/compiler/allowed_machine_views.cc line 27 at r1 (raw file):

                           OperatorTaskSpace const &task,
                           MachineSpecification const &ms) {
  if (num_dims(mv) != num_dims(task)) {

Is this true? I don't remember where projection is handled--is a (8,) machine view a valid view for a (2, 4) task? Just double-checking 🙂

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 18 of 28 files at r1.
Reviewable status: all files reviewed, 22 unresolved discussions (waiting on @Marsella8)


lib/compiler/include/compiler/cost_estimator/task_simulator.h line 10 at r1 (raw file):

namespace FlexFlow {
float task_simulator_forward_pass(ParallelComputationGraph const &pcg,

Suggestion:

float task_simulator_estimate_forward_pass_time

lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 38 at r1 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  CostEstimator estimator =
      make_fake_constant_cost_estimator(/*op_cost*/ 10.0f, /*comm_cost*/ 1.0f);

I'd recommend at least also doing some tests with a cost estimator that differentiates between different ops and comms to validate that the right ops and comms are getting passed in.

Code quote:

     make_fake_constant_cost_estimator(/*op_cost*/ 10.0f, /*comm_cost*/ 1.0f);

lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 41 at r1 (raw file):

  MachineSpecification machine_spec = MachineSpecification{3, 3, 3, 1, 1};

  TEST_CASE("task_simulator: linear graph") {

Suggestion:

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("task_simulator_forward_pass") {
    CostEstimator estimator =
      make_fake_constant_cost_estimator(/*op_cost*/ 10.0f, /*comm_cost*/ 1.0f);
    MachineSpecification machine_spec = MachineSpecification{3, 3, 3, 1, 1};

    SUBCASE("linear graph") {

lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 166 at r1 (raw file):

            MachineViewDimension{stride_t{1},
                                 MachineSpecificationDimension::INTER_NODE},
        }};

Might make things a bit more readable

Suggestion:

    std::vector<MachineViewDimension> dims = 
      {
            MachineViewDimension{stride_t{1},
                                 MachineSpecificationDimension::INTER_NODE},
            MachineViewDimension{stride_t{1},
                                 MachineSpecificationDimension::INTER_NODE},
            MachineViewDimension{stride_t{1},
                                 MachineSpecificationDimension::INTER_NODE},
        };
    MachineView mv0 = MachineView{
        MachineSpaceCoordinate{0, 0, DeviceType::GPU},
        dims,
        };
    MachineView mv1 = MachineView{
        MachineSpaceCoordinate{0, 1, DeviceType::GPU},
        dims};
    MachineView mv2 = MachineView{
        MachineSpaceCoordinate{1, 0, DeviceType::GPU},
        dims};
    MachineView mv3 = MachineView{
        MachineSpaceCoordinate{1, 1, DeviceType::GPU},
        dims};

lib/utils/src/utils/graph/dataflow_graph/algorithms/get_outgoing_edges.cc line 6 at r1 (raw file):

namespace FlexFlow {

std::unordered_set<DataflowEdge> get_outgoing_edges(DataflowGraphView const &g,

Add unit tests for these


lib/compiler/src/compiler/machine_mapping/machine_mapping_problem_tree/unmapped_op_cost_estimate_key.cc line 36 at r1 (raw file):

}

OpCostEstimateKey get_mapped_op_cost_estimate_key_for_layer(

Should probably be moved to the file for OpCostEstimateKey instead of here, as the signature doesn't really involve UnmappedOpCostEstimateKey


lib/pcg/src/pcg/machine_specification.cc line 55 at r1 (raw file):

std::unordered_set<device_id_t>
    get_device_ids(MachineSpecification const &ms,

Normally we don't have dedicated overloads for a function f that are just (transform f)--is there a compelling argument why this is needed here? The problem is adding them in explodes the number of functions you have to write, and they don't contribute a lot as you already have transform


lib/pcg/src/pcg/machine_view.cc line 46 at r1 (raw file):

  if (strides.size() != dims.size()) {
    throw mk_runtime_error(
        fmt::format("Dimensions of {} and {} must match when calling "

Suggestion:

        fmt::format("Length of strides ({}) and dims ({}) must match when calling "

lib/pcg/src/pcg/machine_view.cc line 64 at r1 (raw file):

    MachineSpecification const &machine_specification) {

  assert(num_dims(machine_view) == task.degrees.size());

Prefer a throw with an actual error message


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 0 at r1 (raw file):
Overall this implementation is pretty solid, and the lambdas do a lot to improve readability. If it's possible to pull out some composite data structures so that not all of the raw data structure manipulation has to be done directly, that might be nice--don't know if it's possible, but would be worth considering


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 68 at r1 (raw file):

}

float task_simulator_forward_pass(ParallelComputationGraph const &pcg,

It would be nice to have the task simulator have the ability to output a trace/profile object that essentially contains the execution times and device assignments of each task for the purposes of eventually rendering it in something like a legion-prof format (https://legion.stanford.edu/profiling/index.html) for easier understanding/visualization

Probably shouldn't be done in this PR, created an issue #1568 and assigned it to you


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 82 at r1 (raw file):

  DeduplicatedPriorityQueue<TimedDependency, std::vector<TimedDependency>>
      dependency_processing;
  std::unordered_set<TimedDependency> processed_dependencies;

Is this a cohesive data structure that could be factored out? Seems like you're using it for both TimedLayer and TimedDependency? Not a huge deal, just might simplify a bit of the code

Code quote:

  DeduplicatedPriorityQueue<TimedDependency, std::vector<TimedDependency>>
      dependency_processing;
  std::unordered_set<TimedDependency> processed_dependencies;

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 82 at r1 (raw file):

  DeduplicatedPriorityQueue<TimedDependency, std::vector<TimedDependency>>
      dependency_processing;
  std::unordered_set<TimedDependency> processed_dependencies;

Any reason for using separate priority queues instead of having a single priority queue that holds something like type Task = (float, variant<Layer, Edge>)?

Code quote:

  DeduplicatedPriorityQueue<TimedLayer, std::vector<TimedLayer>>
      layer_processing;
  std::unordered_set<TimedLayer> processed_layers;

  DeduplicatedPriorityQueue<TimedDependency, std::vector<TimedDependency>>
      dependency_processing;
  std::unordered_set<TimedDependency> processed_dependencies;

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 96 at r1 (raw file):

    layer_processing.push(TimedLayer{layer, current_time + cost});
    for (device_id_t d : device_mapping.at(layer)) {
      devices[d] = true;

Suggestion:

    devices.at(d) = true;

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 98 at r1 (raw file):

      devices[d] = true;
    }
    layer_frontier.erase(layer);

Maybe something like "ready_layers" might make it a bit clearer what these are? At least to me "frontier" could indicate the latest executed layers or the next to execute layers or the frontier of layers that aren't ready yet, so this name alone is a bit ambiguous

Code quote:

layer_frontier

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 107 at r1 (raw file):

        dependency, pcg, machine_mapping, estimator);
    dependency_processing.push(TimedDependency{dependency, start_time + cost});
  };

So in this current version there's no modelling of congestion or tensor reuse at all, correct?

Code quote:

  auto start_dependency_processing = [&](ParallelComputationGraphEdge const
                                             &dependency,
                                         float start_time) {
    float cost = single_dependency_cost_estimator(
        dependency, pcg, machine_mapping, estimator);
    dependency_processing.push(TimedDependency{dependency, start_time + cost});
  };

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 111 at r1 (raw file):

  auto finish_layer_processing = [&](TimedLayer const &timed_layer) {
    for (device_id_t d : device_mapping.at(timed_layer.layer)) {
      devices[d] = false;

Suggestion:

      devices.at(d) = false;

lib/pcg/include/pcg/parallel_computation_graph/parallel_computation_graph.h line 44 at r1 (raw file):

std::unordered_set<parallel_layer_guid_t>
    get_source_layers(ParallelComputationGraph const &);

To avoid overloading the term "source" (since we also have get_source_layer for edges), maybe could use a different name like "get_initial_layers" (so you'd have source/(sink/destination) edges, and initial/terminal for the overall graph)? Not super important, just a thought

Code quote:

get_source_layers

lib/pcg/src/pcg/operator_task_space.cc line 41 at r1 (raw file):

}

OperatorTaskSpace get_operator_task_space(ParallelTensorShape const &shape) {

Seems odd that get_operator_task_space doesn't actually take in a parallel_layer_guid_t or any other type of operator representation? I assume that this is because the operator task space is arbitrary and so we use a convention that we derive it from the output tensor's space, but it would be good for the API to reflect that that could be changed and that it's really the operator+inputs+outputs that determines the task space


lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 108 at r1 (raw file):

  }

  TEST_CASE("get_source_layer") {

Which overload is this? Try to put the argument types in the test names (FYI I normally shorten A const & to A in the signatures because they're semantically equivalent for value types, which is mostly what we deal with, just to try to make the TEST_CASE names a bit more readable)

Code quote:

  TEST_CASE("get_source_layer") {

lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 134 at r1 (raw file):

      parallel_tensor_guid_t tensor1 = get_only(layer1_added.outputs);

      SUBCASE("get_source_layer") {

Why is there a subcase named get_source_layer within a testcase for get_source_layer?

Code quote:

      SUBCASE("get_source_layer") {

lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 158 at r1 (raw file):

    }

    SUBCASE("three layers in serial") {

Suggestion:

    SUBCASE("three layers in series")

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 8 of 36 files reviewed, 22 unresolved discussions (waiting on @lockshaw)


lib/compiler/include/compiler/cost_estimator/timed_dependency.struct.toml line 1 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why a leading newline in a bunch of these files?

fixed


lib/compiler/src/compiler/allowed_machine_views.cc line 27 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is this true? I don't remember where projection is handled--is a (8,) machine view a valid view for a (2, 4) task? Just double-checking 🙂

Yes, MachineView stores the starting point (in the machine space, the 2D one) and an N dimensional vector (which "lives" in the task space) which says for each degree of the OperatorTaskSpace what is the stride and projection. So for a (2,4) task that we want to project across the INTER_NODE dimension in the machine space we would have MachineView((a, INTER_NODE), (b, INTER_NODE)) where a,b are some strides.


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 82 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is this a cohesive data structure that could be factored out? Seems like you're using it for both TimedLayer and TimedDependency? Not a huge deal, just might simplify a bit of the code

yeah you are right, I've stepped back a bit and the behaviour for how the dependencies and layers are processed is pretty much the same, only difference is how you finish processing them. I have refactored the code accordingly, we now have a TimedComponent (variant of the other 2) and a single queue that stores both.

(Way cleaner, thanks a lot for pointing it out)


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 82 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Any reason for using separate priority queues instead of having a single priority queue that holds something like type Task = (float, variant<Layer, Edge>)?

Yeah you are absolutely right, have changed it accordingly.


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 98 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Maybe something like "ready_layers" might make it a bit clearer what these are? At least to me "frontier" could indicate the latest executed layers or the next to execute layers or the frontier of layers that aren't ready yet, so this name alone is a bit ambiguous

Done.


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 107 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

So in this current version there's no modelling of congestion or tensor reuse at all, correct?

correct, how should I go about modelling it (Something along the lines of, for a given instant of time t, a given device can at most transfer x tensors to any other device?).

Isn't it generally better for the task simulator to supply the tensors to the cost_estimator intelligently, and then let it figure out the costs, rather than trying to simulate those things directly with the task simulator (genuine question, I'm not sure how accurately the cost estimator models things).


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Overall this implementation is pretty solid, and the lambdas do a lot to improve readability. If it's possible to pull out some composite data structures so that not all of the raw data structure manipulation has to be done directly, that might be nice--don't know if it's possible, but would be worth considering

I have changed the device map std::unordered_map<parallel_layer_guid_t, std::unordered_set<device_id_t>> to be DeviceMapping, but I'm not sure which other ones would be worth pulling out.


lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 166 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might make things a bit more readable

refactored the test cases, should be a lot more readable now


lib/pcg/include/pcg/parallel_computation_graph/parallel_computation_graph.h line 44 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

To avoid overloading the term "source" (since we also have get_source_layer for edges), maybe could use a different name like "get_initial_layers" (so you'd have source/(sink/destination) edges, and initial/terminal for the overall graph)? Not super important, just a thought

you're right, changed


lib/pcg/src/pcg/machine_specification.cc line 55 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Normally we don't have dedicated overloads for a function f that are just (transform f)--is there a compelling argument why this is needed here? The problem is adding them in explodes the number of functions you have to write, and they don't contribute a lot as you already have transform

removed.


lib/pcg/src/pcg/machine_view.cc line 64 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer a throw with an actual error message

Done.


lib/pcg/src/pcg/operator_task_space.cc line 41 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Seems odd that get_operator_task_space doesn't actually take in a parallel_layer_guid_t or any other type of operator representation? I assume that this is because the operator task space is arbitrary and so we use a convention that we derive it from the output tensor's space, but it would be good for the API to reflect that that could be changed and that it's really the operator+inputs+outputs that determines the task space

I have changed it accordingly, but yeah the operator task space still does not make too much sense. I'm generally a bit confused about how the MachineView mapping to Operator should work. MachineView describes a N-dimensional strided orthotope (which then gets mapped to the MachineSpace etc...), but how does this match with an operator, which is instead a function that returns a tuple of vectors, each with it's own dimension. Right now OperatorTaskSpace is essentially copying the parallel degrees of the first output tensor, but are these the same across all output tensors (or am I just making the arbitrary decision of fetching the first output tensor shape, rather than any other one)?


lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 108 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Which overload is this? Try to put the argument types in the test names (FYI I normally shorten A const & to A in the signatures because they're semantically equivalent for value types, which is mostly what we deal with, just to try to make the TEST_CASE names a bit more readable)

Done.


lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 134 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is there a subcase named get_source_layer within a testcase for get_source_layer?

Done.


lib/utils/src/utils/graph/dataflow_graph/algorithms/get_outgoing_edges.cc line 6 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add unit tests for these

Done.


lib/compiler/src/compiler/cost_estimator/cost_estimator.cc line 1 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why a newline?

Done.


lib/compiler/src/compiler/machine_mapping/machine_mapping_problem_tree/unmapped_op_cost_estimate_key.cc line 36 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should probably be moved to the file for OpCostEstimateKey instead of here, as the signature doesn't really involve UnmappedOpCostEstimateKey

Done.


lib/compiler/test/src/compiler/cost_estimator_for_test.h line 37 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Generally pass primitives by value

Done.


lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 41 at r1 (raw file):

  MachineSpecification machine_spec = MachineSpecification{3, 3, 3, 1, 1};

  TEST_CASE("task_simulator: linear graph") {

Done.


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 96 at r1 (raw file):

    layer_processing.push(TimedLayer{layer, current_time + cost});
    for (device_id_t d : device_mapping.at(layer)) {
      devices[d] = true;

Done.


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 111 at r1 (raw file):

  auto finish_layer_processing = [&](TimedLayer const &timed_layer) {
    for (device_id_t d : device_mapping.at(timed_layer.layer)) {
      devices[d] = false;

Done.


lib/pcg/src/pcg/machine_view.cc line 46 at r1 (raw file):

  if (strides.size() != dims.size()) {
    throw mk_runtime_error(
        fmt::format("Dimensions of {} and {} must match when calling "

Done.


lib/pcg/test/src/pcg/parallel_computation_graph/parallel_computation_graph.cc line 158 at r1 (raw file):

    }

    SUBCASE("three layers in serial") {

Done.


lib/compiler/include/compiler/cost_estimator/task_simulator.h line 10 at r1 (raw file):

namespace FlexFlow {
float task_simulator_forward_pass(ParallelComputationGraph const &pcg,

Done.

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 8 of 36 files reviewed, 22 unresolved discussions (waiting on @lockshaw)


lib/compiler/test/src/compiler/cost_estimator/task_simulator.cc line 38 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'd recommend at least also doing some tests with a cost estimator that differentiates between different ops and comms to validate that the right ops and comms are getting passed in.

added

@lockshaw lockshaw mentioned this pull request Jan 16, 2025
2 tasks
Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 17 of 28 files at r2, 1 of 4 files at r4, all commit messages.
Reviewable status: 26 of 39 files reviewed, 8 unresolved discussions (waiting on @Marsella8)


lib/compiler/include/compiler/cost_estimator/timed_component.variant.toml line 0 at r4 (raw file):
Why not (time, variant<layer, edge>) instead of variant<(time, layer), (time, edge)>?

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 26 of 41 files reviewed, 8 unresolved discussions (waiting on @lockshaw)


lib/compiler/include/compiler/cost_estimator/timed_component.variant.toml line at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not (time, variant<layer, edge>) instead of variant<(time, layer), (time, edge)>?

for some functions you only want to take in either a TimedLayer or a TimedComponent, find it more convenient to do the variant this way since otherwise I would have to either pass the variant and manually assert that it has the right type of unpacking and then passing time and layer/edge separately, which wasn't super elegant. tried both and the current one seemed to fit a bit better, not too much of a difference either way.

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 16 files at r6, 1 of 6 files at r7.
Reviewable status: 23 of 47 files reviewed, 21 unresolved discussions (waiting on @Marsella8)


lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 2 at r6 (raw file):

namespace = "FlexFlow"
name = "TaskGraph"

It would be nice if TaskGraph was a bit more semantically-rich, i.e., has node labels of type Task, where Task is a variant.toml of parallel_layer_guid_t and whatever your communication representation is. Then, your actual simulator code would take in a DiGraphView and a couple std::functions, which would be based on the TaskGraph. Right now it kinda feels like you have two levels of structures, (1) the full PCG, and (2) a fully anonymized DiGraphView, which makes the conversion/lowering code rather complex. It would be nice to have an intermediate structure to simplify things a bit.


lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 27 at r6 (raw file):

[[fields]]
name = "is_allowed_to_run"

Conceptually shouldn't be part of the TaskGraph--should be passed as a separate parameter probably, or TaskGraph should be renamed/re-structured. TaskGraph to me conveys the set of tasks and the data dependencies between them, which is_allowed_to_run isn't part of


lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 28 at r6 (raw file):

[[fields]]
name = "is_allowed_to_run"
type = "std::function<bool(::FlexFlow::Node, ::FlexFlow::TasksStateTracker)>"

Probably shouldn't take in the whole TasksStateTracker, as it shouldn't need to access the set of ready tasks or the current time. In fact, they don't even need the TimedTasks in the other two fields of the tracker, just two sets of Node: finished and in-progress


lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 28 at r6 (raw file):

[[fields]]
name = "is_allowed_to_run"
type = "std::function<bool(::FlexFlow::Node, ::FlexFlow::TasksStateTracker)>"

Suggestion:

type = "std::function<bool(::FlexFlow::Node, ::FlexFlow::TasksStateTracker const &)>"

lib/compiler/include/compiler/cost_estimator/tasks_state_tracker.struct.toml line 2 at r6 (raw file):

namespace = "FlexFlow"
name = "TasksStateTracker"

Suggestion:

name = "TaskGraphExecutionState"

lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 0 at r6 (raw file):
simulate_task_graph_execution.cc


lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 47 at r6 (raw file):

  return state_tracker.ready_tasks.empty() &&
         state_tracker.tasks_processing.empty();
}

Now that the domain-specific logic is factored out, feel free to make these lambdas again

Code quote:

static void start_task_processing(TasksStateTracker &state_tracker,
                                  TaskGraph const &task_graph,
                                  Node const &task) {
  float cost = task_graph.cost_map.at(task);
  state_tracker.tasks_processing.push(
      TimedTask{state_tracker.current_time + cost, task});
  state_tracker.ready_tasks.erase(task);
}

static bool dependencies_are_satisfied(TasksStateTracker const &state_tracker,
                                       TaskGraph const &task_graph,
                                       Node const &task) {
  std::unordered_set<Node> incoming_dependencies =
      get_predecessors(task_graph.graph, task);
  return is_subseteq_of(incoming_dependencies, state_tracker.processed_tasks);
}

static void finish_task_processing(TasksStateTracker &state_tracker,
                                   TaskGraph const &task_graph,
                                   TimedTask const &timed_task) {
  state_tracker.processed_tasks.insert(timed_task.node);
  for (Node const &task : get_successors(task_graph.graph, timed_task.node)) {
    if (dependencies_are_satisfied(state_tracker, task_graph, task)) {
      state_tracker.ready_tasks.insert(task);
    }
  }
  state_tracker.current_time = timed_task.endtime;
}

static bool is_processing_done(TasksStateTracker const &state_tracker) {
  return state_tracker.ready_tasks.empty() &&
         state_tracker.tasks_processing.empty();
}

lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 55 at r6 (raw file):

}

float simulate_forward_pass(TaskGraph const &task_graph) {

Suggestion:

 simulate_task_graph_execution

lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 55 at r6 (raw file):

}

float simulate_forward_pass(TaskGraph const &task_graph) {

TaskGraphProfile

Code quote:

float

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 70 at r6 (raw file):

  DiGraph digraph = DiGraph::create<AdjacencyDiGraph>();
  bidict<Node,
         std::variant<parallel_layer_guid_t, ParallelComputationGraphEdge>>

Avoid raw variants, introduce a Task type using dtgen's variant.toml

Code quote:

std::variant<parallel_layer_guid_t, ParallelComputationGraphEdge>

lib/compiler/include/compiler/cost_estimator/timed_component_comparator.h line 10 at r4 (raw file):

struct TimedComponentComparator {
  bool operator()(TimedComponent const &lhs, TimedComponent const &rhs) const;
};

I actually like having the explicit comparator, can you bring this back? It removes the need to understand what code the ord feature in dtgen generates, which is nice as it avoids ambiguity

Code quote:

struct TimedComponentComparator {
  bool operator()(TimedComponent const &lhs, TimedComponent const &rhs) const;
};

lib/compiler/include/compiler/cost_estimator/task_graph_profile.struct.toml line 2 at r7 (raw file):

namespace = "FlexFlow"
name = "TaskGraphProfile"

Suggestion:

name = "TaskGraphExecutionTrace"

lib/compiler/include/compiler/cost_estimator/timed_task.struct.toml line 2 at r7 (raw file):

namespace = "FlexFlow"
name = "TimedTask"

Suggestion:

name = "InProgressTask"

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 23 of 47 files reviewed, 21 unresolved discussions (waiting on @lockshaw)


lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 27 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Conceptually shouldn't be part of the TaskGraph--should be passed as a separate parameter probably, or TaskGraph should be renamed/re-structured. TaskGraph to me conveys the set of tasks and the data dependencies between them, which is_allowed_to_run isn't part of

Done.


lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 28 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Probably shouldn't take in the whole TasksStateTracker, as it shouldn't need to access the set of ready tasks or the current time. In fact, they don't even need the TimedTasks in the other two fields of the tracker, just two sets of Node: finished and in-progress

Done.


lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 47 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Now that the domain-specific logic is factored out, feel free to make these lambdas again

Done.


lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 55 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

TaskGraphProfile

wouldn't it be better to return the entire profile rather than only the endtime? (since this way in testing I can check against the entire trace, and given the profile you can choose to discard everything but the endtime).


lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

simulate_task_graph_execution.cc

Done.


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 70 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Avoid raw variants, introduce a Task type using dtgen's variant.toml

Done.


lib/compiler/include/compiler/cost_estimator/timed_component_comparator.h line 10 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I actually like having the explicit comparator, can you bring this back? It removes the need to understand what code the ord feature in dtgen generates, which is nice as it avoids ambiguity

Done.


lib/compiler/include/compiler/cost_estimator/timed_task.struct.toml line 2 at r7 (raw file):

namespace = "FlexFlow"
name = "TimedTask"

Done.


lib/compiler/include/compiler/cost_estimator/tasks_state_tracker.struct.toml line 2 at r6 (raw file):

namespace = "FlexFlow"
name = "TasksStateTracker"

Done.


lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 28 at r6 (raw file):

[[fields]]
name = "is_allowed_to_run"
type = "std::function<bool(::FlexFlow::Node, ::FlexFlow::TasksStateTracker)>"

Done.


lib/compiler/include/compiler/cost_estimator/task_graph_profile.struct.toml line 2 at r7 (raw file):

namespace = "FlexFlow"
name = "TaskGraphProfile"

Done.


lib/compiler/src/compiler/cost_estimator/task_graph_traversal.cc line 55 at r6 (raw file):

}

float simulate_forward_pass(TaskGraph const &task_graph) {

Done.

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 29 files at r2, 3 of 18 files at r6, 23 of 24 files at r8, all commit messages.
Reviewable status: 52 of 55 files reviewed, 23 unresolved discussions (waiting on @Marsella8)


lib/pcg/src/pcg/operator_task_space.cc line 41 at r1 (raw file):

Previously, Marsella8 wrote…

I have changed it accordingly, but yeah the operator task space still does not make too much sense. I'm generally a bit confused about how the MachineView mapping to Operator should work. MachineView describes a N-dimensional strided orthotope (which then gets mapped to the MachineSpace etc...), but how does this match with an operator, which is instead a function that returns a tuple of vectors, each with it's own dimension. Right now OperatorTaskSpace is essentially copying the parallel degrees of the first output tensor, but are these the same across all output tensors (or am I just making the arbitrary decision of fetching the first output tensor shape, rather than any other one)?

The latter--it's an arbitrary choice. Eventually the dependencies will be moved into the PCG/inferrable from the PCG itself as part of #1569


lib/compiler/include/compiler/cost_estimator/task_constraint.struct.toml line 2 at r8 (raw file):

namespace = "FlexFlow"
name = "TaskConstraint"

Suggestion:

name = "TaskExecutionConstraint"

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 19 at r8 (raw file):

TaskGraphExecutionTrace
    simulate_task_graph_execution(TaskGraph const &task_graph,

Might want to assert that the task graph is acylic? Just to improve debuggability


lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 23 at r8 (raw file):

  TaskGraphExecutionState execution_state =
      TaskGraphExecutionState{get_sources(task_graph.graph), {}, {}, 0.0};

Add argument name comments

Code quote:

      TaskGraphExecutionState{get_sources(task_graph.graph), {}, {}, 0.0};

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 52 at r8 (raw file):

        TaskProfile{in_progress_task.node,
                    in_progress_task.endtime -
                        task_graph.cost_map.at(in_progress_task.node),

Why not just have InProgressTask have a start_time? This seems easier/more-intuitive than having to infer if after it's already been executed


lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 68 at r8 (raw file):

  };
  while (!is_processing_done()) {
    for (Node const &task : sorted(execution_state.ready_tasks)) {

Might be better to just have ready_tasks be a std::set so you have a non-insertion-order-based-order and don't have to keep sorting as iteration in a std::set is already ordered?

Code quote:

sorted(execution_state.ready_tasks)

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 80 at r8 (raw file):

    if (!execution_state.tasks_processing.empty()) {
      InProgressTask next_task = get_next_task();

Suggestion:

get_next_task_to_finish

lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 87 at r8 (raw file):

  }

  return TaskGraphExecutionTrace{task_profiles, execution_state.current_time};

Would be good to include a check that all tasks were executed


lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 2 at r8 (raw file):

namespace = "FlexFlow"
name = "TaskGraph"

I would get rid of TaskGraph, and just have the simulate function take in a DiGraphView, a cost function (as a std::function or a wrapped std::function, depending on your preference), and a constraint function


lib/compiler/include/compiler/machine_mapping/device_mapping.cc line 1 at r8 (raw file):

Leading newline?


lib/compiler/include/compiler/machine_mapping/device_mapping.cc line 24 at r8 (raw file):

    device_mapping.insert(
        {layer, get_device_ids(op, machine_view, machine_spec)});
  }

Change to map_values?

Code quote:

  std::unordered_map<parallel_layer_guid_t, std::unordered_set<device_id_t>>
      device_mapping;
  for (auto const &[layer, machine_view] : machine_mapping.machine_views) {
    OperatorTaskSpace op =
        get_operator_task_space(pcg, layer));
    device_mapping.insert(
        {layer, get_device_ids(op, machine_view, machine_spec)});
  }

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 64 at r8 (raw file):

}

static std::pair<DiGraph, bidict<Node, PCGTask>>

Make this an actual dtgen type (this is the PCGTaskGraph mentioned in zulip)

Also, this should be a public function, as getting a task graph for a PCG is a generally useful function

Code quote:

static std::pair<DiGraph, bidict<Node, PCGTask>>

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 112 at r8 (raw file):

static bool
    is_allowed_to_run_super(Node const &task,

Make this a lambda


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 133 at r8 (raw file):

          device_map.raw_device_map.at(task_component.require_layer()));
    }
  }

Can you clean this up a bit with utils/containers?

Code quote:

  std::unordered_set<device_id_t> devices_occupied;

  for (Node const &in_progress_task : tasks_processing) {
    auto task_component = node_map.at_l(in_progress_task);
    if (task_component.is_layer()) {
      devices_occupied = set_union(
          devices_occupied,
          device_map.raw_device_map.at(task_component.require_layer()));
    }
  }

lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 149 at r8 (raw file):

  auto [digraph, node_map] = get_digraph(pcg);
  auto cost_map = get_cost_map(node_map, pcg, machine_mapping, estimator);

I feel like this would be simpler just as a lambda that you pass in to the simulate function, rather than having to do all the manipulation to compute the cost map up front

Code quote:

  auto cost_map = get_cost_map(node_map, pcg, machine_mapping, estimator);

lib/utils/include/utils/deduplicated_priority_queue.h line 41 at r8 (raw file):

  }

  std::vector<Elem> contents() const {

What is this function (semantically) and why is it necessary?


lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 0 at r8 (raw file):
Would be good to move all the task simulator stuff to its own directory (e.g., compiler/task_graph_simulator)


lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 28 at r8 (raw file):

[[fields]]
name = "tasks_processing"

Suggestion:

name = "in_progress_tasks"

lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 32 at r8 (raw file):

[[fields]]
name = "processed_tasks"

Suggestion:

name = "finished_tasks"

lib/compiler/include/compiler/machine_mapping/device_mapping.struct.toml line 2 at r8 (raw file):

namespace = "FlexFlow"
name = "DeviceMapping"

Suggestion:

name = "UnstructuredDeviceMapping"

lib/compiler/include/compiler/cost_estimator/task_graph_execution_trace.struct.toml line 27 at r8 (raw file):

[[fields]]
name = "end_time" 
type = "float"

Remove, just have a helper function that calculates it from the set of task profiles

Code quote:

[[fields]]
name = "end_time"
type = "float"

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 29 files at r2, 18 of 19 files at r9, all commit messages.
Reviewable status: 54 of 55 files reviewed, 15 unresolved discussions (waiting on @Marsella8)


lib/compiler/include/compiler/machine_mapping/unstructured_device_mapping.h line 12 at r9 (raw file):

UnstructuredDeviceMapping
    get_device_mapping(MachineMapping const &machine_mapping,

Suggestion:

get_unstructured_device_mapping

lib/compiler/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 97 at r9 (raw file):

  }
  if (execution_state.finished_tasks.size() != num_nodes(task_graph.graph)) {
    throw mk_runtime_error("Some tasks within the TaskGraph are unreachable");

Suggestion:

("Failed to execute all tasks in TaskGraph");

lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 37 at r9 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("task_simulator") {

Suggestion:

 TEST_CASE("task_simulator_estimate_forward_pass_time") {

lib/compiler/test/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 17 at r9 (raw file):

    SUBCASE("linear graph") {
      DiGraph g = DiGraph::create<AdjacencyDiGraph>();

Suggestion:

    DiGraph g = DiGraph::create<AdjacencyDiGraph>();
    SUBCASE("linear graph") {

lib/compiler/test/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 87 at r9 (raw file):

      }

      SUBCASE("processing constraint") {

Suggestion:

      SUBCASE("constrained to one node at a time")

lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 131 at r9 (raw file):

      transform(layer_nodes, [&](Node const &n) {
        return node_map.at_l(n).require_layer();
      });

filtrans would probably be cleaner

Code quote:

  std::unordered_set<Node> layer_nodes =
      filter(in_progress_tasks,
             [&](Node const &n) { return node_map.at_l(n).is_layer(); });
  std::unordered_set<parallel_layer_guid_t> layers =
      transform(layer_nodes, [&](Node const &n) {
        return node_map.at_l(n).require_layer();
      });

Pietro Max Marsella added 2 commits January 20, 2025 10:47
Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 54 of 55 files reviewed, 15 unresolved discussions (waiting on @lockshaw)


lib/utils/include/utils/deduplicated_priority_queue.h line 41 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What is this function (semantically) and why is it necessary?

pulls out all the contents of the deduplicated priority queue. Needed since i have to pass it to the constraint function.


lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 2 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

It would be nice if TaskGraph was a bit more semantically-rich, i.e., has node labels of type Task, where Task is a variant.toml of parallel_layer_guid_t and whatever your communication representation is. Then, your actual simulator code would take in a DiGraphView and a couple std::functions, which would be based on the TaskGraph. Right now it kinda feels like you have two levels of structures, (1) the full PCG, and (2) a fully anonymized DiGraphView, which makes the conversion/lowering code rather complex. It would be nice to have an intermediate structure to simplify things a bit.

Done.


lib/compiler/include/compiler/cost_estimator/task_graph.struct.toml line 2 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I would get rid of TaskGraph, and just have the simulate function take in a DiGraphView, a cost function (as a std::function or a wrapped std::function, depending on your preference), and a constraint function

Done.


lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would be good to move all the task simulator stuff to its own directory (e.g., compiler/task_graph_simulator)

Done.


lib/compiler/include/compiler/cost_estimator/task_graph_execution_trace.struct.toml line 27 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Remove, just have a helper function that calculates it from the set of task profiles

Done.


lib/compiler/include/compiler/machine_mapping/device_mapping.cc line 1 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Leading newline?

Done.


lib/compiler/include/compiler/machine_mapping/device_mapping.cc line 24 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Change to map_values?

don't think it's doable elegantly, given that the new value is a function of both the key (layer) and the original value.


lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 23 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add argument name comments

Done.


lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 52 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not just have InProgressTask have a start_time? This seems easier/more-intuitive than having to infer if after it's already been executed

Done.


lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 68 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might be better to just have ready_tasks be a std::set so you have a non-insertion-order-based-order and don't have to keep sorting as iteration in a std::set is already ordered?

the sorted is more so for consistency during testing (if there are multiple available tasks, process the one with the associated smallest node). It's more to have a fixed, arbitrary order than anything else, so probably better to not have this property embedded into the execution_tracker itself?


lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 87 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would be good to include a check that all tasks were executed

Done.


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 64 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Make this an actual dtgen type (this is the PCGTaskGraph mentioned in zulip)

Also, this should be a public function, as getting a task graph for a PCG is a generally useful function

Done.


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 112 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Make this a lambda

Done.


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 133 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can you clean this up a bit with utils/containers?

Done.


lib/compiler/src/compiler/cost_estimator/task_simulator.cc line 149 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I feel like this would be simpler just as a lambda that you pass in to the simulate function, rather than having to do all the manipulation to compute the cost map up front

Done.


lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 131 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

filtrans would probably be cleaner

Done.


lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 37 at r9 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("task_simulator") {

Done.


lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 28 at r8 (raw file):

[[fields]]
name = "tasks_processing"

Done.


lib/compiler/include/compiler/cost_estimator/task_graph_execution_state.struct.toml line 32 at r8 (raw file):

[[fields]]
name = "processed_tasks"

Done.


lib/compiler/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 97 at r9 (raw file):

  }
  if (execution_state.finished_tasks.size() != num_nodes(task_graph.graph)) {
    throw mk_runtime_error("Some tasks within the TaskGraph are unreachable");

Done.


lib/compiler/include/compiler/machine_mapping/unstructured_device_mapping.h line 12 at r9 (raw file):

UnstructuredDeviceMapping
    get_device_mapping(MachineMapping const &machine_mapping,

Done.


lib/compiler/src/compiler/cost_estimator/simulate_task_graph_execution.cc line 80 at r8 (raw file):

    if (!execution_state.tasks_processing.empty()) {
      InProgressTask next_task = get_next_task();

Done.


lib/compiler/include/compiler/machine_mapping/device_mapping.struct.toml line 2 at r8 (raw file):

namespace = "FlexFlow"
name = "DeviceMapping"

Done.


lib/compiler/include/compiler/cost_estimator/task_constraint.struct.toml line 2 at r8 (raw file):

namespace = "FlexFlow"
name = "TaskConstraint"

Done.


lib/compiler/test/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 17 at r9 (raw file):

    SUBCASE("linear graph") {
      DiGraph g = DiGraph::create<AdjacencyDiGraph>();

Done.


lib/compiler/test/src/compiler/task_graph_simulator/simulate_task_graph_execution.cc line 87 at r9 (raw file):

      }

      SUBCASE("processing constraint") {

Done.

@codecov
Copy link

codecov bot commented Jan 20, 2025

Codecov Report

Attention: Patch coverage is 84.82490% with 39 lines in your changes missing coverage. Please review.

Project coverage is 60.94%. Comparing base (3ee5e48) to head (d701bb0).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
...el_computation_graph/parallel_computation_graph.cc 17.39% 19 Missing ⚠️
lib/pcg/src/pcg/machine_view.cc 52.17% 11 Missing ⚠️
...k_graph_simulator/simulate_task_graph_execution.cc 93.22% 4 Missing ⚠️
...task_graph_simulator/task_graph_execution_trace.cc 81.81% 2 Missing ⚠️
lib/utils/include/utils/archetypes/value_type.h 0.00% 2 Missing ⚠️
lib/utils/include/utils/containers/minimum.h 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1565      +/-   ##
==========================================
+ Coverage   60.47%   60.94%   +0.47%     
==========================================
  Files         606      618      +12     
  Lines       14725    14978     +253     
==========================================
+ Hits         8905     9129     +224     
- Misses       5820     5849      +29     
Flag Coverage Δ
unittests 60.94% <84.82%> (+0.47%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
lib/compiler/src/compiler/allowed_machine_views.cc 95.31% <100.00%> (+0.07%) ⬆️
...rc/compiler/cost_estimator/op_cost_estimate_key.cc 100.00% <100.00%> (ø)
...src/compiler/cost_estimator/tensor_set_movement.cc 100.00% <100.00%> (ø)
...ler/machine_mapping/get_optimal_machine_mapping.cc 93.00% <100.00%> (+0.04%) ⬆️
...er/src/compiler/machine_mapping/machine_mapping.cc 100.00% <100.00%> (ø)
...optimization/machine_mapping_with_memory_result.cc 98.87% <100.00%> (+0.09%) ⬆️
...ler/machine_mapping/unstructured_device_mapping.cc 100.00% <100.00%> (ø)
...ask_graph_simulator/in_progress_task_comparator.cc 100.00% <100.00%> (ø)
...rc/compiler/task_graph_simulator/pcg_task_graph.cc 100.00% <100.00%> (ø)
...rc/compiler/task_graph_simulator/task_simulator.cc 100.00% <100.00%> (ø)
... and 12 more

... and 2 files with indirect coverage changes

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add unit tests for get_device_ids and for
std::unordered_set<DataflowEdge> get_outgoing_edges(DataflowGraphView const &g, std::unordered_set<Node> const &ns)?

Reviewed 38 of 38 files at r11, 18 of 18 files at r12, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @Marsella8)


lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 104 at r10 (raw file):

Previously, Marsella8 wrote…

Done.

Not seeing this change?


lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 170 at r10 (raw file):

Previously, Marsella8 wrote…

In the sense of having ForwardOpCostMetrics and BackwardOpCost metrics?

Just change OpCostMetrics to have a forward_execution_time and a backward_execution_time


lib/compiler/include/compiler/cost_estimator/tensor_set_movement.h line 0 at r11 (raw file):
Empty file?


lib/utils/include/utils/deduplicated_priority_queue.h line 42 at r11 (raw file):

  }

  std::set<Elem> contents() const {

Suggestion:

  std::set<Elem, Compare> 

lib/utils/include/utils/archetypes/value_type.h line 42 at r11 (raw file):

  std::ostringstream oss;
  oss << "<value_type<" << TAG << ">>";
  return oss.str();

Suggestion:

  assert(false);

lib/utils/include/utils/archetypes/value_type.h line 47 at r11 (raw file):

template <int TAG>
std::ostream &operator<<(std::ostream &s, value_type<TAG> const &x) {
  return s << format_as(x);

Suggestion:

  assert(false);

lib/utils/test/src/utils/containers/lookup_in_map.cc line 15 at r12 (raw file):

    SUBCASE("existing keys") {
      std::function<int(std::string)> func = lookup_in_map(map);

Suggestion:

      std::function<int(std::string const &)>

@lockshaw
Copy link
Collaborator

lib/compiler/include/compiler/task_graph_simulator/pcg_task.variant.toml line 21 at r10 (raw file):

Previously, Marsella8 wrote…

meant OpCostEstimateKey

OpCostMetric

Basically, if we use TensorSetMovement, the digraph is not a representation of only the PCG, but of PCG + DeviceMapping + MachineSpecification ....

Yes, but I think that's a good idea. The idea of the task graph is to represent the actual computation that occurs, not an abstracted partial version of it, so including the machine mapping information and making the communication explicit is exactly what I'd want.

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 60 of 95 files reviewed, 8 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/compiler/include/compiler/cost_estimator/tensor_set_movement.h line at r11 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Empty file?

fixed


lib/compiler/include/compiler/task_graph_simulator/pcg_task.variant.toml line 21 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Yes, but I think that's a good idea. The idea of the task graph is to represent the actual computation that occurs, not an abstracted partial version of it, so including the machine mapping information and making the communication explicit is exactly what I'd want.

yeah ur right, done :)


lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 104 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Not seeing this change?

sorry, forgot to mention that try_require_layer does not exist


lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 170 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Just change OpCostMetrics to have a forward_execution_time and a backward_execution_time

OpCostMetricswas not made by me, and it's predominantly used within @wmdi 's machine mapping so I don't wanna make changes that break her code, possibly a good topic for next week's meeting?


lib/utils/include/utils/archetypes/value_type.h line 42 at r11 (raw file):

  std::ostringstream oss;
  oss << "<value_type<" << TAG << ">>";
  return oss.str();

Done.


lib/utils/include/utils/archetypes/value_type.h line 47 at r11 (raw file):

template <int TAG>
std::ostream &operator<<(std::ostream &s, value_type<TAG> const &x) {
  return s << format_as(x);

Done.


lib/utils/include/utils/deduplicated_priority_queue.h line 42 at r11 (raw file):

  }

  std::set<Elem> contents() const {

Done.


lib/utils/test/src/utils/containers/lookup_in_map.cc line 15 at r12 (raw file):

    SUBCASE("existing keys") {
      std::function<int(std::string)> func = lookup_in_map(map);

Done.

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 35 of 35 files at r13, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Marsella8 and @wmdi)


lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 104 at r10 (raw file):

Previously, Marsella8 wrote…

sorry, forgot to mention that try_require_layer does not exist

Then try_require_operator? afaik it should be generated by dtgen


lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 170 at r10 (raw file):

Previously, Marsella8 wrote…

OpCostMetricswas not made by me, and it's predominantly used within @wmdi 's machine mapping so I don't wanna make changes that break her code, possibly a good topic for next week's meeting?

It's fine to make that change, we've actually been discussing it for a while--iirc CostEstimator already provides that information, it's just not currently preserved by OpCostMetrics


lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 46 at r13 (raw file):

    }

    auto running_ops = filtrans(in_progress_tasks, [&](Node const &n) {

Suggestion:

in_progress_ops

lib/pcg/include/pcg/operator_task_space.h line 26 at r13 (raw file):

                                       parallel_layer_guid_t const &layer);

OperatorTaskSpace get_operator_task_space_from_parallel_tensor_shape(

Don't allow this operation, for now the operator task space should rely on the parallel_layer_guid_t. It seems you're using this to get the device set for the task graph--you should instead compute that ahead of time when you have the parallel_layer_guid_t and save it in the PCGTasks in the TaskGraph

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/pcg/include/pcg/operator_task_space.h line 26 at r13 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Don't allow this operation, for now the operator task space should rely on the parallel_layer_guid_t. It seems you're using this to get the device set for the task graph--you should instead compute that ahead of time when you have the parallel_layer_guid_t and save it in the PCGTasks in the TaskGraph

ur right, have added a node_to_layer bidict to the PCGTaskGraph (we would technically need a tridict to have perfect "make invalid states unrepresentable", but I think this suffices hahah)


lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 104 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Then try_require_operator? afaik it should be generated by dtgen

sorry, totally missed it.


lib/compiler/test/src/compiler/task_graph_simulator/task_simulator.cc line 170 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

It's fine to make that change, we've actually been discussing it for a while--iirc CostEstimator already provides that information, it's just not currently preserved by OpCostMetrics

Done.


lib/compiler/src/compiler/task_graph_simulator/task_simulator.cc line 46 at r13 (raw file):

    }

    auto running_ops = filtrans(in_progress_tasks, [&](Node const &n) {

Done.

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 63 of 100 files reviewed, 7 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):

    float cost =
        context.cost_estimator.estimate_cost(mapped)
            .forward_runtime; // TODO(@pietro)(@wmdi) incorporate backward pass?

@wmdi


lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 35 at r14 (raw file):

      if (mapping.cost.forward_runtime >=
              other_mapping.cost
                  .forward_runtime && // TODO(@wmdi) does it have to be modified

@wmdi


lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 104 at r14 (raw file):

            std::max(lhs_mm.cost.forward_runtime, rhs_mm.cost.forward_runtime),
            std::max(lhs_mm.cost.backward_runtime,
                     rhs_mm.cost.backward_runtime), //(@wmdi) is this correct?

@wmdi

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 37 of 37 files at r14, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @lockshaw, @Marsella8, and @wmdi)


lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):

Previously, Marsella8 wrote…

@wmdi

Should should be the sum of the forward and backward runtimes


lib/compiler/include/compiler/task_graph_simulator/pcg_task_graph.struct.toml line 24 at r14 (raw file):

[[fields]]
name = "node_to_layer"
type = "::FlexFlow::bidict<::FlexFlow::Node, ::FlexFlow::parallel_layer_guid_t>"

I would lean toward making this a std::unordered_set<DeviceId> (i.e., pre-compute the device sets and save the result) rather than saving the parallel_layer_guid_t directly, especially since this will stop being a bidict once you have both forward and backward tasks. Thoughts?

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/compiler/include/compiler/task_graph_simulator/pcg_task_graph.struct.toml line 24 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I would lean toward making this a std::unordered_set<DeviceId> (i.e., pre-compute the device sets and save the result) rather than saving the parallel_layer_guid_t directly, especially since this will stop being a bidict once you have both forward and backward tasks. Thoughts?

makes sense, updated


lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should should be the sum of the forward and backward runtimes

thanks, added, not sure how to update the associated tests accordingly, I've commented them out for now, hopefully @wmdi could fix them?

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 26 of 26 files at r15, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):

Previously, Marsella8 wrote…

thanks, added, not sure how to update the associated tests accordingly, I've commented them out for now, hopefully @wmdi could fix them?

They'll need to be uncommented before this is merged so you may as well keep them uncommented right now (if at least to reduce the diff size for review)

Copy link
Collaborator

@wmdi wmdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Marsella8)


lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 35 at r14 (raw file):

Previously, Marsella8 wrote…

@wmdi

I think so. Let's keep this comment for this PR, and I will fix it when I merge it.


lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 104 at r14 (raw file):

Previously, Marsella8 wrote…

@wmdi

Correct

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 245 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

They'll need to be uncommented before this is merged so you may as well keep them uncommented right now (if at least to reduce the diff size for review)

Done.


lib/compiler/src/compiler/machine_mapping/memory_optimization/machine_mapping_with_memory_result.cc line 35 at r14 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

I think so. Let's keep this comment for this PR, and I will fix it when I merge it.

great thanks.

Pietro Max Marsella added 2 commits January 24, 2025 15:21
Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 18 of 18 files at r16, 25 of 27 files at r17, 19 of 19 files at r18, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @wmdi)

Pietro Max Marsella added 2 commits January 24, 2025 15:25
@lockshaw lockshaw enabled auto-merge (squash) January 24, 2025 23:31
@lockshaw lockshaw merged commit de7fa32 into flexflow:master Jan 24, 2025
5 of 7 checks passed
@victorli2002 victorli2002 mentioned this pull request Mar 12, 2025
@victorli2002 victorli2002 mentioned this pull request Apr 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enable task simulator to return a trace/profile object for understanding/visualization

3 participants