Automated trading is a way of participating in financial markets with a computer program that makes trading decisions automatically and then executes them. Usually, the trading algorithm follows pre-set rules for entering and exiting trades. In this project, by contrast, the computer learns a trading policy using Machine Learning. Specifically, the trading algorithm is a Deep Reinforcement Learning (Deep RL) Agent. The agent is trained by trial and error to maximize the expected Reward. The Reward Function chosen is the trading Profit and Loss, also known as PnL. In this project, the Agent is able to trade between 4 cryptocurrencies (ADA, BTC, ETH and LTC) and US Dollars. The agent starts with $1000.
The Reinforcement Learning algorithm chosen is Double Dueling Deep Q-Learning (DQN) with Prioritized Experience Replay. In Deep Q-Learning, a neural network is trained to approximate the Q function; the Dueling variant decomposes it into a state value V and per-action advantages.
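A rough sketch of these two ingredients, assuming PyTorch and hypothetical `online_net`/`target_net` networks that map states to Q-values (not the project's actual code):

```python
import torch

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: the online network picks the next action,
    the target network evaluates it (reduces overestimation bias)."""
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q

def dueling_q_values(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    return value + advantages - advantages.mean(dim=1, keepdim=True)
```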
The agent can take 5 actions, one corresponding to each cryptocurrency and one corresponding to US Dollars. If the chosen currency is not in the agent's portfolio, then, in the same timestep, it sells its whole portfolio and buys as much of the selected currency as possible. If the agent chooses a currency it already owns, the action acts as a Hold. The agent takes an action once every hour.
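A minimal sketch of this trade rule, using a hypothetical dict-based portfolio and USD price quotes rather than the project's actual environment code:

```python
CURRENCIES = ["USD", "ADA", "BTC", "ETH", "LTC"]

def execute_action(portfolio, action, prices):
    """Switching currencies liquidates the whole portfolio and re-buys;
    selecting the currently held currency acts as a Hold.

    portfolio: {"currency": str, "amount": float}
    prices:    USD prices, e.g. {"BTC": 43000.0, ...}; USD itself is priced at 1.
    """
    target = CURRENCIES[action]
    if target == portfolio["currency"]:
        return portfolio                                        # Hold: nothing to do
    # Sell everything into USD at the current price...
    usd_value = portfolio["amount"] * prices.get(portfolio["currency"], 1.0)
    # ...then buy as much of the selected currency as possible.
    new_amount = usd_value / prices.get(target, 1.0)
    return {"currency": target, "amount": new_amount}
```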
The state consists of the High, Low and Close values for each cryptocurrency over the last 24 hours. It also contains a one-hot vector of length 5 that informs the Agent which currency it currently holds in its portfolio.
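One way such a state could be assembled is sketched below, assuming NumPy arrays and a (coins, hours, features) layout; the exact ordering and field names are assumptions:

```python
import numpy as np

N_COINS, WINDOW, N_FEATURES = 4, 24, 3   # ADA/BTC/ETH/LTC; 24 hours; High, Low, Close

def build_state(ohlc_window, held_index):
    """Assemble the observation: per-coin price windows plus a length-5
    one-hot of the currently held currency (USD + 4 coins)."""
    assert ohlc_window.shape == (N_COINS, WINDOW, N_FEATURES)
    portfolio_one_hot = np.zeros(N_COINS + 1, dtype=np.float32)
    portfolio_one_hot[held_index] = 1.0
    return {
        "timeseries": ohlc_window.astype(np.float32),  # fed to the per-coin encoders
        "portfolio": portfolio_one_hot,                # fed to the portfolio embedding
    }
```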
First, the Timeseries are encoded by separate encoders into Timeseries embeddings. Encoder architectures experimented with include Multi-Layer Perceptrons (MLPs), 1D Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and Transformers. LSTM networks provided the best performance.
Then, the Timeseries embeddings are concatenated with each other and with the portfolio embedding to form the final state embedding.
Finally, MLP head networks map the state embedding to the state value V and the action advantages, which are combined into the Q-values.
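A minimal PyTorch sketch of this encoder-plus-heads layout (layer sizes and names are assumptions, not the project's exact configuration):

```python
import torch
import torch.nn as nn

class TradingDuelingNet(nn.Module):
    """One LSTM encoder per coin, a portfolio embedding, concatenation,
    and dueling value / advantage heads producing the Q-values."""

    def __init__(self, n_coins=4, n_features=3, n_actions=5, emb=64, hidden=128):
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.LSTM(n_features, emb, batch_first=True) for _ in range(n_coins)]
        )
        self.portfolio_emb = nn.Linear(n_actions, emb)
        state_dim = n_coins * emb + emb
        self.value_head = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.adv_head = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, timeseries, portfolio):
        # timeseries: (batch, n_coins, 24, 3), portfolio: (batch, 5) one-hot
        coin_embs = []
        for i, enc in enumerate(self.encoders):
            _, (h_n, _) = enc(timeseries[:, i])          # last hidden state as the coin embedding
            coin_embs.append(h_n[-1])
        state = torch.cat(coin_embs + [self.portfolio_emb(portfolio)], dim=1)
        value, adv = self.value_head(state), self.adv_head(state)
        return value + adv - adv.mean(dim=1, keepdim=True)   # dueling aggregation -> Q-values
```

Keeping a separate LSTM per coin mirrors the per-currency encoders described above, with the last hidden state used as each Timeseries embedding.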
After a specified number of training steps (an Epoch), the Agent is evaluated on an episode 1000 timesteps long. The agent was trained, and evaluated between Epochs, without trading fees; the final evaluation, however, was conducted with the standard 0.1% trading fee. Surprisingly, this training regime resulted in the best performance. Trading agents are best evaluated against the naive Buy and Hold (B&H) strategy: this comparison gives a more accurate estimate of the agent's performance, since it largely removes the effect of the overall market trend during the evaluation episode. The following plots depict the Agent's performance after each training Epoch.
After training, the final agent was evaluated using the standard 0.1% trading fee. Over 50 evaluation episodes, the agent's performance was on average 30% better than the Buy and Hold strategy.
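For reference, the per-episode comparison could be computed along these lines (a sketch assuming the B&H baseline buys at the episode's first price, pays the 0.1% fee once and holds; the project's exact baseline definition may differ):

```python
def outperformance_vs_bh(agent_final_value, prices, initial_cash=1000.0, fee=0.001):
    """Agent's final portfolio value relative to a Buy-and-Hold baseline
    that buys at the first price of the episode and never trades again."""
    bh_amount = initial_cash * (1.0 - fee) / prices[0]
    bh_final_value = bh_amount * prices[-1]
    return agent_final_value / bh_final_value - 1.0   # e.g. 0.30 -> 30% better than B&H
```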
A t-test was conducted to determine the statistical significance of the results. The p-value against the null hypothesis that the Buy and Hold strategy outperforms the RL Agent is 0.000006.
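Such a one-sided test could be run as follows (a sketch assuming a paired comparison of per-episode returns over the same 50 episodes; the file names are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical per-episode returns from the 50 evaluation episodes.
agent_returns = np.load("agent_returns.npy")   # shape (50,)
bh_returns = np.load("bh_returns.npy")         # shape (50,)

# One-sided paired t-test: null hypothesis "B&H is at least as good as the agent",
# alternative "the agent's returns are greater". Paired, because both strategies
# are evaluated on the same episodes (an assumption about the setup).
t_stat, p_value = stats.ttest_rel(agent_returns, bh_returns, alternative="greater")
print(f"t = {t_stat:.3f}, p = {p_value:.6f}")
```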