
Eval working #463 (Draft)

blindcrone wants to merge 7 commits into main from eval-working
Conversation

blindcrone (Contributor):

Added facilities for processing examples (which currently consist of an input, a target, and a length), and a means of evaluating them against a (currently hard-defaulted) loss function in a distributed fashion across the shards of the exo network.

A lot of the piping here will make it easier to do distributed training.
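
As a rough illustration of the shape of this, here is a minimal sketch; the names below (`Example`, `node.process_example`) are hypothetical, not exo's actual API:

```python
# A rough sketch, assuming hypothetical names, of the example record and
# distributed-evaluation flow described in the PR summary.
from dataclasses import dataclass

@dataclass
class Example:
    input: list[int]   # token ids fed into the first shard
    target: list[int]  # token ids the model's output is scored against
    length: int        # number of positions that count toward the loss

async def evaluate(node, examples, loss_fn) -> float:
    """Average loss over examples, each forwarded across the network's shards."""
    total = 0.0
    for ex in examples:
        # Hypothetical call: run the input through the shard chain and have
        # the final shard apply the (currently hard-defaulted) loss function.
        total += await node.process_example(ex, loss_fn)
    return total / len(examples)
```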

```python
if DEBUG >= 2: print(f"computed target from: {base_shard} {target_index}, {self.topology}. target shard: {target_shard}")
target_peer = next((p for p in self.peers if p.id() == target_id), None)
if not target_peer:
    raise valueerror(f"peer for {target_index} not found")
```
A reviewer (Contributor):
I think this should be ValueError
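
For reference, Python exception names are case-sensitive, so the lowercase `valueerror` would itself fail with a `NameError` when this branch is hit; the corrected line would be:

```python
raise ValueError(f"peer for {target_index} not found")
```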

@dtnewman (Contributor):

@blindcrone I put in some comments that I hope are useful. Note that these are all somewhat stylistic/superficial, since I haven't actually fetched or tested the branch on my machine.

blindcrone force-pushed the eval-working branch 4 times, most recently from 6bc2e01 to 1a1f6c6 on November 19, 2024 at 14:13.
blindcrone (Contributor, Author) commented on Nov 19, 2024:

Okay, this now in theory trains across nodes on MLX. I'll need to add the ability to save the weights somewhere to see how well it actually does, and it'd be nice to get tinygrad working this way too.

Also, I think the loss used in the backprop approximation isn't exactly a correct approximation, so if anyone remembers the backprop equations better and has a suggestion, please do educate me.
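
For what it's worth, an exact (rather than approximated) formulation is possible if each downstream shard sends the gradient of the loss with respect to its input back upstream, where it becomes the cotangent for the upstream shard's backward pass; the chain rule then composes exactly across the shard boundary. A minimal sketch of that idea on MLX, using hypothetical two-shard functions rather than exo's actual API:

```python
# A minimal sketch (hypothetical names, not exo's API) of exact pipeline
# backprop across two shards using MLX's vjp.
import mlx.core as mx

w_a = mx.random.normal((4, 8))  # parameters held by shard A
w_b = mx.random.normal((8, 1))  # parameters held by shard B
x = mx.random.normal((2, 4))    # example input
y = mx.random.normal((2, 1))    # example target

def shard_a(x):
    return mx.tanh(x @ w_a)             # first half of the model

def shard_b(h):
    return mx.mean((h @ w_b - y) ** 2)  # second half, ending in the loss

# Forward pass: shard A's activations are shipped to shard B
# (over the network, in exo's case).
h = shard_a(x)

# Backward pass on shard B: seed the scalar loss with cotangent 1.0 and
# recover the gradient of the loss w.r.t. B's input h.
(loss,), (grad_h,) = mx.vjp(shard_b, [h], [mx.array(1.0)])

# grad_h travels back to shard A and seeds A's backward pass.
(_,), (grad_x,) = mx.vjp(shard_a, [x], [grad_h])

print(loss.item(), grad_x.shape)
```

Per-shard parameter gradients (for `w_a`, `w_b` here) would come from `mx.grad` or `mx.value_and_grad` run locally on each shard; only the input-gradients need to cross the network.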
