This fork was used to run experiments adjacent to another piece of work focusing on simplified linear transforms. Rough notes on these experiments can be found in the research log.
Those notes and results are, however, spread over two branches: one dedicated to investigating whether weight decay scaling was a good idea, and the other to which settings to use with Tensor-Train decompositions.