[ci] Find some way to ensure tests are sharded across nodes

In #12473 it seems that the Cortex-M tests are not being sharded, so adding more nodes (horizontal scaling) doesn't reduce runtime. We should have a mechanism (maybe a post-run JUnit analysis) that prints, for each shard, the number of tests it ran compared with the other shards and with the total number of tests run.
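
A rough sketch of what such a post-run JUnit analysis could look like. It is only an illustration: the directory layout (one subdirectory of JUnit XML reports per shard under a common root) and the function names are hypothetical, not something CI provides today. It also assumes that a test excluded from a shard either does not appear in that shard's report or appears as `<skipped>`, so only non-skipped `<testcase>` elements are counted.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def count_run_testcases(root: ET.Element) -> int:
    """Count <testcase> elements that have no <skipped> child."""
    return sum(1 for case in root.iter("testcase") if case.find("skipped") is None)


def summarize_shards(reports_root: Path) -> dict[str, int]:
    """Map each shard directory name to the number of tests it ran.

    Assumes a hypothetical layout: reports_root/<shard-name>/**/*.xml
    """
    counts: dict[str, int] = {}
    for shard_dir in sorted(p for p in reports_root.iterdir() if p.is_dir()):
        counts[shard_dir.name] = sum(
            count_run_testcases(ET.parse(xml_file).getroot())
            for xml_file in sorted(shard_dir.rglob("*.xml"))
        )
    return counts


def format_summary(counts: dict[str, int]) -> str:
    """Render per-shard counts, each shard's share of the total, and the total."""
    total = sum(counts.values())
    lines = []
    for shard, ran in counts.items():
        share = 100 * ran / total if total else 0.0
        lines.append(f"{shard}: {ran} tests ({share:.1f}% of total)")
    lines.append(f"total: {total} tests")
    return "\n".join(lines)


if __name__ == "__main__":
    import sys

    # Usage: python shard_summary.py <reports_root>
    print(format_summary(summarize_shards(Path(sys.argv[1]))))
```

If every shard reports roughly the same count as the total (or one shard runs nearly everything), that would suggest the step is not actually sharded. A CI job could go further and fail when the per-shard counts are too uneven, but the threshold for that is a separate decision.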

cc @Mousius @areusch @gigiblender