Reproducibility results of Sudoku-Extreme and ARC-AGI 1

We at [@HigherOrderCO](https://github.com/HigherOrderCO) decided to reproduce the HRM results because they are very interesting, especially when comparing the total compute time against other models and architectures (such as LLMs).

We first ran the small `Sudoku-Extreme 9x9` experiment. We used 1 H200 GPU, and training took approximately one hour.

The training process was exactly the one described in the `README`, with:
`OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 pretrain.py data_path=data/sudoku-extreme-1k-aug-1000 epochs=20000 eval_interval=2000 lr=1e-4 puzzle_emb_lr=1e-4 weight_decay=1.0 puzzle_emb_weight_decay=1.0`

and evaluation:
`OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 evaluate.py checkpoint=checkpoints/Sudoku-extreme-1k-aug-1000\ ACT-torch/HierarchicalReasoningModel_ACTV1\ loose-caracara/step_26040 `

The evaluation results were:
- 45.8% accuracy (roughly 10 percentage points less than the 55% reported in the paper)
- perfect halting accuracy
- 27275266 parameters

<img width="2386" height="158" alt="Image" src="https://github.com/user-attachments/assets/00504cb2-d00b-4bef-b701-bdbd1e0e9141" />

Then we started a run to reproduce the `ARC-AGI-1` experiment. We used 8 H200 GPUs, and the run took roughly 24 hours.

We built the dataset with:
`python dataset/build_arc_dataset.py`

We ran the training with:
`OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 pretrain.py`

Finally, we ran the evaluation with:

`OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 evaluate.py checkpoint=<CHECKPOINT_PATH>`

We got the following results:

- ~25% accuracy (about 15 percentage points less than the reported 40%)
- 58% halting accuracy
- 27276290 parameters

<img width="2544" height="1068" alt="Image" src="https://github.com/user-attachments/assets/3dad7a31-38e2-4ebd-be34-c21232634989" />

<img width="2840" height="1344" alt="Image" src="https://github.com/user-attachments/assets/4ee9ab9b-c633-4f66-b640-ea05dc5e7ddb" />

<img width="2616" height="88" alt="Image" src="https://github.com/user-attachments/assets/5a27a1cf-13a3-4f85-aaec-897ba0cc18e3" />


With this, we successfully reproduced the HRM experiments.
The only question that remains on my end is why we got about 10 percentage points less on Sudoku and about 15 less on ARC. I saw a tweet from someone from here saying the compute time for ARC was 50 to 200 hours (setup not shared, so I am not sure which GPUs), so I assume they ran the training longer or slightly changed the setup.
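
To make the size of the gaps concrete, here is a small sketch that computes both the absolute gap (in percentage points) and the relative gap from the numbers in this issue. The `gaps` helper and the `results` dictionary are names made up for illustration, and the ARC-AGI-1 value uses 25.0 as a stand-in for our approximate ~25%.

```python
# Reported (paper) vs. reproduced accuracies, in percent, taken from this issue.
# The ARC-AGI-1 reproduced value is approximate (~25%).
results = {
    "Sudoku-Extreme": {"reported": 55.0, "reproduced": 45.8},
    "ARC-AGI-1": {"reported": 40.0, "reproduced": 25.0},
}


def gaps(reported, reproduced):
    """Return (absolute gap in percentage points, relative gap in percent)."""
    absolute = reported - reproduced
    relative = 100.0 * absolute / reported
    return absolute, relative


for name, r in results.items():
    absolute, relative = gaps(r["reported"], r["reproduced"])
    print(f"{name}: {absolute:.1f} points below reported ({relative:.1f}% relative)")
```

With these inputs the Sudoku gap is 9.2 points (about 16.7% relative) and the ARC gap is 15.0 points (37.5% relative), so the ARC shortfall is noticeably larger in relative terms.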

Either way, it is certainly interesting that the model reaches 25% with 960 examples and 24 hours of training time.
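
For comparing compute budgets, it may help to express our two runs in GPU-hours. This is only a rough sketch: `gpu_hours` is a hypothetical helper, the durations are the approximate wall-clock times given above, and it is not known whether the 50 to 200 hours mentioned in the tweet refers to wall-clock hours or GPU-hours.

```python
def gpu_hours(num_gpus, wall_clock_hours):
    """Total GPU-hours for a run: number of GPUs times wall-clock duration."""
    return num_gpus * wall_clock_hours


# Approximate figures from our runs (both on H200 GPUs).
sudoku_gpu_hours = gpu_hours(1, 1.0)   # Sudoku-Extreme: 1 GPU, ~1 hour
arc_gpu_hours = gpu_hours(8, 24.0)     # ARC-AGI-1: 8 GPUs, ~24 hours

print(f"Sudoku-Extreme: ~{sudoku_gpu_hours:.0f} GPU-hour(s)")
print(f"ARC-AGI-1: ~{arc_gpu_hours:.0f} GPU-hours")
```

Our ARC-AGI-1 run comes out to roughly 192 GPU-hours. Without knowing the hardware behind the tweet's figure, this cannot be compared directly.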

