Skip to content

Latest commit

 

History

History
211 lines (144 loc) · 13.7 KB

IOPDDL.md

File metadata and controls

211 lines (144 loc) · 13.7 KB

The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning

Teams from around the globe are invited to contribute submissions toward solving intra-operator parallelism for distributed deep learning, a combinatorial optimization problem that arises during the compilation of Machine Learning models onto the resources of multi-device hardware accelerators. Prizes will be awarded to top-ranking teams who commit to open-sourcing their solver prior to the following year's conference.

Rank Prize
First Place 🥇 $25,000
Second Place 🥈 $10,000
Third Place 🥉 $5,000

Important Dates

Date Event
2024-12-01 Contest Announced
2024-12-02 Contest GitHub Repository (including Benchmark Subset) Released
2025-02-15 Contest Registrations & Preliminary Submissions Due*
2025-03-01 Contest Final Submissions Due*
2025-03-03 Early Registration Deadline for ASPLOS 2025 / EuroSys 2025
2025-03-30 Contest Special Session during ASPLOS 2025 / EuroSys 2025 Workshops
2025-04-01 Contest Winners Announced during ASPLOS 2025 / EuroSys 2025 Conference

*Submissions are due by 11:59pm at any time on Earth.

Problem Description

With machine learning models becoming larger and larger, multi-device execution across several slices or pods of hardware accelerators is imperative to meet latency and throughput requirements across training and serving workloads. When executing an ML computation in a distributed manner, a key determinant of performance is the way that computation is sharded across the multiple devices. An example of a 3D tensor sharded onto a 2D device mesh is shown below:

Sharding a model to minimize exposed communication (e.g., using data parallelism or operator parallelism) can lead to significant performance gains. Unfortunately, given that model computations regularly contain hundreds or thousands of tensors and HLO ops, manually specifying how each should be executed across multiple devices is usually not feasible. Your challenge is to design an algorithm capable of efficiently and effectively performing this task.

As input, the algorithm will accept a graph (see: example.json) where nodes are associated with one or more strategies. Each strategy is annotated with a corresponding node cost, and certain pairs of node strategies come with a corresponding edge cost. Solutions should select one strategy per node that minimizes total cost as much as possible.

Nodes also incur a strategy-specific memory usage over a fixed time interval. The sum of usages at any given time point in a solution must not eclipse the usage limit for that benchmark.

Broadly speaking, the structural characteristics of a benchmark's graph topology and memory profile tend to vary wildly as a function of model task (e.g., training vs. serving) and model modality (e.g., language vs. vision).

Submission

Teams must submit a binary executable named iopddl that takes two arguments: a path to the input problem and a timeout duration in seconds:

$ ./iopddl example.json 10

The final line of the binary's output should be a comma-delimited list of node strategy indices enclosed in brackets (all intermediate lines will be ignored):

# Found solution [0, 0, 2, 0, 0] with cost 545  <--- this line ignored
# Found solution [0, 0, 2, 1, 0] with cost 445  <--- this line ignored
[0, 0, 2, 1, 0]

To ensure that the binaries can be executed properly, teams will have an opportunity to submit a preliminary version to the Contest Organizers, who will then test it on their machine.

Getting Started

To help you get started, we have released a separate GitHub repository containing several C++ files that define the basic problem and solution classes, along with various utilities. You are under no obligation to use these source codes.

File Description
iopddl.h Defines the Problem and Solution classes, along with methods for file I/O and solution evaluation.
solver.cc A simple random solver (teams will want to replace this with their own implementation).
main.cc Reads the given problem, calls the solver, and prints the solution to standard output.

You should be able to use the following commands to download the source codes, compile the project, and run the executable:

$ git clone --recursive https://github.com/google/iopddl.git
$ mkdir iopddl/build && cd iopddl/build && cmake .. && make
$ ./iopddl example.json 10

Please don't hesitate reaching out to the Contest Organizers if you encounter any trouble getting these materials to work.

Benchmarks

The contest will involve twenty-five industrial benchmarks; five have been released in advance to contest participants (for testing purposes), and the remaining twenty will be withheld for evaluation:

Benchmark Name # Nodes # Edges Timeout Will be released to participants?
asplos-2025-iopddl-A 34,932 54,801 1 minute(s) Yes
asplos-2025-iopddl-B TBD TBD 1 minute(s)
asplos-2025-iopddl-C TBD TBD 1 minute(s)
asplos-2025-iopddl-D TBD TBD 1 minute(s)
asplos-2025-iopddl-E TBD TBD 1 minute(s)
asplos-2025-iopddl-F TBD TBD 2 minute(s)
asplos-2025-iopddl-G 816 1,023 2 minute(s) Yes
asplos-2025-iopddl-H TBD TBD 2 minute(s)
asplos-2025-iopddl-I TBD TBD 2 minute(s)
asplos-2025-iopddl-J TBD TBD 2 minute(s)
asplos-2025-iopddl-K TBD TBD 3 minute(s)
asplos-2025-iopddl-L TBD TBD 3 minute(s)
asplos-2025-iopddl-M 32,894 47,067 3 minute(s) Yes
asplos-2025-iopddl-N TBD TBD 3 minute(s)
asplos-2025-iopddl-O TBD TBD 3 minute(s)
asplos-2025-iopddl-P TBD TBD 4 minute(s)
asplos-2025-iopddl-Q TBD TBD 4 minute(s)
asplos-2025-iopddl-R TBD TBD 4 minute(s)
asplos-2025-iopddl-S 28,526 38,826 4 minute(s) Yes
asplos-2025-iopddl-T TBD TBD 4 minute(s)
asplos-2025-iopddl-U TBD TBD 5 minute(s)
asplos-2025-iopddl-V TBD TBD 5 minute(s)
asplos-2025-iopddl-W TBD TBD 5 minute(s)
asplos-2025-iopddl-X TBD TBD 5 minute(s)
asplos-2025-iopddl-Y 62,185 91,020 5 minute(s) Yes

All benchmarks will become publicly available after the contest is complete.

Evaluation and Scoring

The contest organizers will execute each team's binary across the twenty withheld benchmarks on a dedicated 8-core Linux workstation with 32GB of RAM. Rankings will be established by calculating the total number of points per team, where points are derived by normalizing cost values against the best submitted solution.

For example, assume the following raw costs:

Team Benchmark X Benchmark Y Benchmark Z
Team A cost: 100 cost: 500 cost: 800
Team B cost: 300 cost: 200 cost: 900
Team C cost: 600 cost: 700 cost: 400
Min. Team Cost cost: 100 cost: 200 cost: 400

The points that a team earns for a particular benchmark is then equal to min_team_cost / team_cost

Team Benchmark X Benchmark Y Benchmark Z Total Points Ranking
Team A 1.000 points 0.400 points 0.500 points 1.900 points First Place 🥇
Team B 0.333 points 1.000 points 0.444 points 1.778 points Second Place 🥈
Team C 0.167 points 0.286 points 1.000 points 1.452 points Third Place 🥉

Presentations

Teams who successfully submit an entry will be invited to present an informal overview of their approach (roughly 10 to 15 minutes) at a special session held on March 30th during the Workshop & Tutorial days. Winners will be announced later in the week, with full results being released soon after the conference.

Contest Eligibility

All are welcome to participate in the contest -- including teams from academia, industry, and elsewhere -- with the exception of the Contest Organizers and employees of the Contest Sponsor. Individuals are prohibited from participating in multiple teams. In order to be eligible for prizes, teams must commit to releasing an open-source version of their implementation prior to ASPLOS 2026 (allowing teams sufficient time to write & submit papers detailing their approach).

Frequently Asked Questions

To raise a question, please create an issue in this repository, or feel free to reach out to the contest organizers directly.

What flavors of parallelism are in scope?

  • The strategies in our benchmarks include combinations of data parallelism and operator parallelism (i.e., for SPMD partitioning) but exclude inter-operator parallelism (i.e., for multi-stage pipelining) that would differentiate which operators of the graph are executed on specific devices.

What level of precision will the values in the benchmarks have?

  • All cost and memory usage values in the benchmarks can be represented as 64-bit integers, but beware of integer overflow when calculating total cost and/or total memory usage.

How should usage intervals be interpreted?

  • A node with usage interval [lower, upper] should be considered half-open with an exclusive upper bound. In other words, it will consume memory at time points {lower, lower + 1, ..., upper − 1}. Hence, any nodes with an empty interval [0, 0] essentially consume no memory.

Related Work

Contest Organizers

Michael D. Moffitt ([email protected]) & Pratik Fegade ([email protected])

Contest Sponsor