Reproducibility results of Sudoku-Extreme and ARC-AGI 1 #12

@Lorenzobattistela

We at @HigherOrderCO decided to reproduce the HRM results because they are very interesting, especially when comparing total compute time against other models and architectures (such as LLMs).

First, we chose to run the small Sudoku-Extreme 9x9 experiment. We used 1 H200 GPU, and training took approximately one hour.

The training command was exactly the one given in the README:
OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 pretrain.py data_path=data/sudoku-extreme-1k-aug-1000 epochs=20000 eval_interval=2000 lr=1e-4 puzzle_emb_lr=1e-4 weight_decay=1.0 puzzle_emb_weight_decay=1.0

and evaluation:
OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 evaluate.py checkpoint=checkpoints/Sudoku-extreme-1k-aug-1000\ ACT-torch/HierarchicalReasoningModel_ACTV1\ loose-caracara/step_26040
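
The checkpoint path in the evaluation command contains spaces escaped with backslashes. As a minimal sketch of how a POSIX shell tokenizes that command, the snippet below uses Python's standard `shlex` module; it only illustrates shell argument splitting and assumes nothing about how `evaluate.py` itself parses the value.

```python
import shlex

# Evaluation command from this issue (environment variable omitted).
# The run directory name contains spaces, escaped with backslashes.
command = (
    "torchrun --nproc-per-node 1 evaluate.py "
    "checkpoint=checkpoints/Sudoku-extreme-1k-aug-1000\\ ACT-torch/"
    "HierarchicalReasoningModel_ACTV1\\ loose-caracara/step_26040"
)

# shlex.split follows POSIX shell rules: an escaped space stays inside the token.
args = shlex.split(command)
checkpoint_arg = args[-1]

print(len(args))        # number of tokens the shell would pass on
print(checkpoint_arg)   # one token, with literal spaces and no backslashes

# An equivalent way to write the same argument is to quote it as a whole.
print(shlex.quote(checkpoint_arg))
```

Quoting the whole `checkpoint=...` argument is an alternative to escaping each space, and is less error-prone when the path is copied from a file listing.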

The evaluation results were:

  • 45.8% accuracy (roughly 10 percentage points less than the 55% reported in the paper)
  • perfect halting accuracy
  • 27275266 parameters

Then we started a run to reproduce the ARC-AGI-1 experiment. We used 8 H200 GPUs, and the run took roughly 24 hours.

Built the dataset with:
python dataset/build_arc_dataset.py

And ran the training with:
OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 pretrain.py

Finally, the evaluation with:

OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 evaluate.py checkpoint=<CHECKPOINT_PATH>

We got the results:

  • ~25% accuracy (about 15 percentage points less than the reported 40%)
  • 58% halting accuracy
  • 27276290 parameters
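
The gaps quoted in both result lists are absolute differences in percentage points. As a rough sketch, the snippet below computes both the absolute and the relative gap from the figures given in this issue; the ARC number is the approximate ~25% above, and the helper name `accuracy_gap` is made up for illustration.

```python
# Reported vs. reproduced accuracy (in percent), as quoted in this issue.
# The ARC-AGI-1 reproduced value is approximate (~25%).
RESULTS = {
    "Sudoku-Extreme": {"reported": 55.0, "reproduced": 45.8},
    "ARC-AGI-1": {"reported": 40.0, "reproduced": 25.0},
}


def accuracy_gap(reported, reproduced):
    """Return (gap in percentage points, gap relative to the reported value in %)."""
    absolute = reported - reproduced
    relative = 100.0 * absolute / reported
    return absolute, relative


for name, r in RESULTS.items():
    absolute, relative = accuracy_gap(r["reported"], r["reproduced"])
    print(f"{name}: {absolute:.1f} points lower ({relative:.1f}% relative)")

# Output:
# Sudoku-Extreme: 9.2 points lower (16.7% relative)
# ARC-AGI-1: 15.0 points lower (37.5% relative)
```

In relative terms the ARC-AGI-1 shortfall is noticeably larger than the Sudoku one, which is worth keeping in mind when comparing the two runs.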

With this, we successfully reproduced the HRM experiments.
The only question that remains on my end is why we got roughly 10 points less on Sudoku and 15 points less on ARC. I saw a tweet from someone on this project saying the compute time for ARC ranged from 50 to 200 hours (the setup was not shared, so I am not sure which GPUs were used), so I assume they ran the training longer or slightly changed the setup.

Either way, it is certainly interesting that the model reaches 25% with 960 examples and 24 hours of training time.
