I ran some exploratory experiments to figure out whether filler tokens / IC tokens / computation tokens / pause tokens (many people have had this idea over time, hence the many names) help transformers. This is far from completed research. The experiments are all done with small GPT-2-style transformers in the range of 5-40k parameters, to check whether there's any reason to run this in bigger LLMs, although someone concurrently did run that at scale: https://arxiv.org/abs/2310.02226.
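For concreteness, here is a minimal sketch of what the filler/pause-token idea looks like at the data level. It assumes a hypothetical extra <pause> token id and a simple prompt/answer split; it is not the exact setup used in these experiments, just an illustration of the general technique.

```python
# Minimal sketch of the filler/pause-token idea (assumptions: a hypothetical
# <pause> token id appended past GPT-2's vocabulary, and sequences that split
# cleanly into prompt and answer portions).
from typing import List

PAUSE_TOKEN_ID = 50257  # hypothetical: one extra id beyond GPT-2's 50257-token vocab


def insert_pause_tokens(prompt_ids: List[int],
                        answer_ids: List[int],
                        n_pause: int = 8) -> List[int]:
    """Place n_pause filler tokens between prompt and answer, giving the model
    extra forward passes of 'computation' before it has to produce the answer."""
    return prompt_ids + [PAUSE_TOKEN_ID] * n_pause + answer_ids


# Example: 8 pause positions between a toy prompt and its answer.
# During training, the loss is typically masked on the pause positions so the
# model is only supervised on the answer tokens.
sequence = insert_pause_tokens([101, 102, 103], [201, 202], n_pause=8)
```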
A more detailed log book of the experiments is here: https://wandb.ai/reasoning/think-hard/reports/Experiment-log-book--Vmlldzo1NDMwODg1?accessToken=l9091dz4i0vrvp1bdbfj0ui8wat3c4b1cbc5p9wcdwbjz6qmojlhqeqo3vrihpyu The summary is at the top, but if you feel adventurous, feel free to look through the experiments in the log book right beneath it, which contain a lot more detail.