Reinforcement Learning + FX Trading Strategy

This post is based on my previous article, written in Japanese (http://nekopuni.holy.jp/?p=1231).

Summary

– Applying reinforcement learning to a trading strategy in the FX market
– Estimating the Q-value by Monte Carlo (MC) simulation
– Employing first-visit MC for simplicity
– Using the short-term and long-term Sharpe ratios of the strategy itself as state variables, to test a momentum strategy
– Using the epsilon-greedy method to decide the action
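
To make the last two points concrete, here is a minimal sketch of how the state and the action choice could look. It is not the original code: the window lengths (20 and 120 days), the bin range (-2 to 2 in 8 bins), the annualisation by 252 days, the zero risk-free rate and the dictionary used for Q are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random
import statistics

ACTIONS = (0, 1, 2)  # 0 = long, 1 = flat, 2 = short


def sharpe_ratio(returns, periods_per_year=252):
    """Annualised Sharpe ratio of daily returns (risk-free rate assumed to be 0)."""
    if len(returns) < 2:
        return 0.0
    sd = statistics.stdev(returns)
    if sd == 0.0:
        return 0.0
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)


def to_bin(value, low=-2.0, high=2.0, n_bins=8):
    """Map a Sharpe ratio to a bin index in [0, n_bins - 1], clipping outliers."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)


def get_state(strategy_returns, short_window=20, long_window=120):
    """State = (short-term bin, long-term bin) of the strategy's own Sharpe ratio."""
    short = sharpe_ratio(strategy_returns[-short_window:])
    long_ = sharpe_ratio(strategy_returns[-long_window:])
    return to_bin(short), to_bin(long_)


def epsilon_greedy(q, state, epsilon=0.1, rng=random):
    """Explore at random with probability epsilon, otherwise take the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    # Unseen (state, action) pairs default to a Q-value of 0.
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
```

Here `q` is a plain dictionary keyed by `(state, action)`. Discretising the Sharpe ratios keeps the Q matrix small enough to fill by simulation, at the cost of losing resolution inside each bin.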

First-visit MC

1. Calculate the state at day t (state = Sharpe ratio one day before day t)
2. Decide the action according to the state (long position, no position or short position)
3. Update the rewards matrix based on the reward obtained at the next time step t+1
4. Update the Q matrix once t is equal to the last time step T
Repeat the above procedure until you are happy with the result.
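
The four steps above can be sketched roughly as follows. This is a simplified illustration, not the code used for the results: it assumes an undiscounted return, a reward equal to the position times the next day's return, and a Q table stored as a dictionary. `state_fn` is a hypothetical callback that turns the strategy's past returns into a state.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1, 2)                   # 0 = long, 1 = flat, 2 = short
POSITION = {0: 1.0, 1: 0.0, 2: -1.0}  # assumed position size per action


def run_episode(returns, q, state_fn, epsilon, rng):
    """Steps 1-3: walk through the sample once, recording (state, action, reward)."""
    history = []
    strategy_returns = []  # realised returns of the strategy so far
    for t in range(len(returns) - 1):
        state = state_fn(strategy_returns)              # step 1
        if rng.random() < epsilon:                      # step 2 (epsilon-greedy)
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        reward = POSITION[action] * returns[t + 1]      # step 3
        history.append((state, action, reward))
        strategy_returns.append(reward)
    return history


def update_q(history, q, visits):
    """Step 4: first-visit update, run once the episode has reached T."""
    g = 0.0
    first_return = {}
    for state, action, reward in reversed(history):
        g += reward                         # undiscounted return from this day on
        first_return[(state, action)] = g   # earlier visits overwrite later ones
    for key, g_first in first_return.items():
        visits[key] += 1
        q[key] += (g_first - q[key]) / visits[key]  # incremental mean


def first_visit_mc(returns, state_fn, n_episodes=100, epsilon=0.1, seed=0):
    """Repeat episodes over the same sample and return the estimated Q table."""
    rng = random.Random(seed)
    q = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(n_episodes):
        history = run_episode(returns, q, state_fn, epsilon, rng)
        update_q(history, q, visits)
    return q
```

Walking the history backwards means the last value written for each (state, action) pair is the return following its first visit, which is what first-visit MC averages over episodes.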

Python code

This time, daily USDJPY data from FRED is used for the simulation.
Swap points and transaction costs are not included in this code so far.

Results

This code still takes a huge amount of computation time for me, so the results (Q matrix) below come from only 1000 iterations.

For the long position (action = 0)
[Figure: 140824_mc1000_long_top]

For the flat position (action = 1)
[Figure: 140824_mc1000_flat_top]

For the short position (action = 2)
[Figure: 140824_mc1000_short_top]

As seen in the figures above, the Q-value is lower when the medium-term Sharpe ratio is relatively extreme (lower than -1 or higher than 1).
For the long position, the Q-value is highest when the short-term Sharpe ratio is somewhere between 0.5 and 1.0, whilst for the short position the highest area is between -1.0 and -0.5.
Presumably this could be seen as a modest momentum strategy, although I think this depends on the sample period.

[Figure: 140824_average_cumulative_return]

The figure above shows the average cumulative return per episode, computed with an expanding window.
It seems to be converging to a certain level, though I am not sure.
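
For reference, an expanding-window average of this kind can be computed as below. This is only a sketch of the calculation as I understand it; the function name is hypothetical and the input is a toy list, not the actual results.

```python
def expanding_average(values):
    """Running mean: element k is the mean of values[0..k]."""
    averages = []
    total = 0.0
    for k, v in enumerate(values, start=1):
        total += v
        averages.append(total / k)
    return averages


# Toy numbers only, standing in for the cumulative return of each episode.
print(expanding_average([1.0, 3.0, 5.0]))  # [1.0, 2.0, 3.0]
```

If the learned policy is stabilising, this running mean should flatten out as more episodes are added.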

Further task

– Out-of-sample testing
See if this reinforcement learning has explanatory power out of sample

– Other market data, such as other currency pairs

– Improvement of the computation time
This code is very far from online learning