I solved this problem with a DQN algorithm that uses two neural networks (an online network and a target network) to compute the Q-values, combined with prioritized experience replay.
The agent solves the problem in about 20-25 episodes, in the sense that it makes no mistakes after that point. However, the formal definition of "solved" is an average reward of at least 195 over the last 100 episodes, which the agent reaches around episode 116.
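For reference, here is a minimal sketch of the prioritized replay buffer part, using simple proportional prioritization (priorities from absolute TD errors, plus importance-sampling weights). This is a simplified illustration, not my exact implementation; the class name, `alpha`/`beta` defaults, and the plain-list storage (a real buffer would typically use a sum-tree for speed) are all my own choices here.

```python
import random

class PrioritizedReplayBuffer:
    """Proportional prioritized experience replay (simplified sketch).

    Transitions are sampled with probability p_i^alpha / sum_j p_j^alpha,
    where p_i is a priority derived from the TD error. Importance-sampling
    weights (exponent beta) correct the bias this sampling introduces.
    """

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha
        self.buffer = []       # stored transitions (s, a, r, s', done)
        self.priorities = []   # one priority per stored transition
        self.pos = 0           # next write position (circular buffer)

    def add(self, transition):
        # New transitions get the current max priority so each one
        # is replayed at least once before its priority is updated.
        max_p = max(self.priorities, default=1.0)
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
            self.priorities.append(max_p)
        else:
            self.buffer[self.pos] = transition
            self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        # Sampling probabilities proportional to priority^alpha.
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        idxs = random.choices(range(len(self.buffer)),
                              weights=probs, k=batch_size)
        # Importance-sampling weights, normalized by the batch max
        # so the largest weight is 1 (keeps update magnitudes stable).
        n = len(self.buffer)
        weights = [(n * probs[i]) ** (-beta) for i in idxs]
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        return idxs, [self.buffer[i] for i in idxs], weights

    def update_priorities(self, idxs, td_errors, eps=1e-6):
        # After a learning step, refresh priorities with the new
        # absolute TD errors; eps keeps every priority non-zero.
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(err) + eps
```

In the training loop you would `add` each transition, `sample` a batch, scale each sample's loss by its importance weight, and then call `update_priorities` with the new TD errors.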