How does the ANE work internally?

Good question! I don't think anyone outside Apple knows for certain. Other NPUs seem to focus mostly on doing matrix multiplications really efficiently, so it's reasonable to assume the ANE is similar.
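To make "doing matrix multiplications" a bit more concrete, here's a minimal Python sketch of the kind of computation such a chip would presumably be accelerating: a fully connected layer written out as explicit multiply-accumulate steps. The function name `dense_layer` and the example values are made up for illustration, and nothing here is a claim about how the ANE is actually built.

```python
def dense_layer(inputs, weights, biases):
    """Compute outputs[j] = biases[j] + sum over i of inputs[i] * weights[i][j].

    Hypothetical helper for illustration only; this is plain Python,
    not an ANE or Core ML API.
    """
    outputs = []
    for j in range(len(biases)):
        acc = biases[j]
        for i in range(len(inputs)):
            # One multiply-accumulate (MAC). A matrix multiplication is
            # just a large number of these, which is why dedicated
            # hardware tends to be organized around doing many at once.
            acc += inputs[i] * weights[i][j]
        outputs.append(acc)
    return outputs


# Made-up example values: 3 inputs, 2 outputs.
x = [1.0, 2.0, 3.0]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.5, -0.5]

y = dense_layer(x, W, b)
print(y)  # [4.5, 4.5]
```

Every iteration of the outer loop is independent of the others, so in principle all of the outputs can be computed in parallel. That property is what makes this workload a good fit for specialized hardware.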

In the meantime, check out these explanations of the TPU to get an idea of how Google does it.