There are various technical approaches to deep reinforcement learning, where the idea is to learn a policy that maximizes a long-term reward represented numerically. The learning agent learns by interacting with the environment and figures out how best to map states to actions. The typical setup involves an environment, an agent, states, and rewards.

Perhaps the most common technical approach is Q-learning. Here, our neural network acts as a function approximator for a function Q, where Q(state, action) returns the long-term value of an action given the current state. The Q represents the “quality” of a move given a specific state; a pseudo-code sketch of the algorithm appears below.

The simplest way to use an agent trained with Q-learning is to pick the action that has the maximum Q-value. In practice, while we are training the Q-learner, we do not always pick the action with the maximum Q-value as the next move during the self-play phase. Instead, there are various exploration-exploitation methods designed to balance ‘exploring’ the state space, in order to gain information about a wider range of actions and Q-values, against ‘exploiting’ what the model has already learned. One basic method is to start by making completely random choices some percentage of the time and then slowly decay to a smaller percentage as the model learns. Playing Battleship, we found that starting at 80% and decaying to 10% worked well.

To help with faster training and model stability, more advanced deep Q-learning methods use techniques such as experience replay and double Q-learning. With experience replay, games are stored in a cyclical memory buffer so that we can train on batches of moves and sample from games that were already played. This helps the model avoid converging to a local minimum, because the model is not learning from a sequence of moves within a single game. It also helps the model take past moves and positions into account, providing a richer source of training data.

Double Q-learning essentially uses two Q-learners: one to pick the action and another to assign the Q-value. This helps to minimize overestimation of Q-values.

To generate the sample data, we began with the open source phoenix-battleship, which was written in Elixir using the Phoenix framework. We modified phoenix-battleship to save logs of ship locations and player moves, and we made slight configuration changes for the sizes of ships and generated data. We hosted the app on Heroku, encouraged our co-workers at GA-CCRi to play, and saved logs of the games that were played using the Papertrail add-on. We collected data from 83 real, two-person games.

The following shows one GA-CCRi employee’s view of a game in progress with another employee. Dark blue squares show misses, the little explosions show hits, and the gray squares on the left show where that player’s ships are.

We wrote the code in PyTorch with guidance from the PyTorch Reinforcement Learning (DQN) tutorial, as well as Practical PyTorch: Playing GridWorld with Reinforcement Learning and Deep Reinforcement Learning, Battleship.
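To make the pieces above concrete, here are a few minimal PyTorch sketches, not our exact code. First, as a stand-in for the pseudo-code mentioned earlier, a single Q-learning update: the network approximates Q(state, action) and is nudged toward the bootstrapped target r + γ·max Q(next_state, ·). The network, optimizer, transition format, and gamma value are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch (not our exact implementation): one Q-learning training
# step that moves Q(s, a) toward the target r + gamma * max_a' Q(s', a').
def q_learning_step(q_net, optimizer, transition, gamma=0.99):
    state, action, reward, next_state, done = transition

    q_pred = q_net(state)[action]  # current estimate of Q(s, a)

    with torch.no_grad():
        # Future value is zero once the game is over.
        q_next = torch.tensor(0.0) if done else q_net(next_state).max()
    target = reward + gamma * q_next

    # Gradient step on the squared error between prediction and target.
    loss = nn.functional.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```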
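The exploration-exploitation schedule described above might look like the following sketch: choose a completely random move with probability epsilon, and decay epsilon from 0.8 toward 0.1 as training progresses. The linear decay and the `num_actions`/`decay_steps` parameters are assumptions for illustration.

```python
import random
import torch

# Sketch of epsilon-greedy action selection with a decaying epsilon
# (starting at 80% random moves and settling at 10%, as described above).
def select_action(q_net, state, step, num_actions,
                  eps_start=0.8, eps_end=0.1, decay_steps=10_000):
    # Linearly decay epsilon from eps_start to eps_end over decay_steps.
    frac = min(step / decay_steps, 1.0)
    epsilon = eps_start + frac * (eps_end - eps_start)

    if random.random() < epsilon:
        # Explore: pick a random action.
        return random.randrange(num_actions)
    # Exploit: pick the action with the maximum Q-value.
    with torch.no_grad():
        return int(q_net(state).argmax())
```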
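Experience replay can be sketched as a fixed-size, cyclical buffer of transitions that we sample from at random, so that training batches mix moves from many different games rather than one game's moves in sequence. The capacity and transition layout here are assumptions.

```python
import random
from collections import deque

# Sketch of a cyclical replay memory: old transitions fall off the end
# once capacity is reached, and batches are sampled uniformly at random.
class ReplayBuffer:
    def __init__(self, capacity=50_000):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```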
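Finally, a sketch of the double Q-learning target: one network picks the next action and a second network assigns its Q-value, which reduces overestimation. The `policy_net`/`target_net` split and the gamma value are assumptions about how this is wired up.

```python
import torch

# Sketch of the double Q-learning target computation.
def double_q_target(policy_net, target_net, reward, next_state, done, gamma=0.99):
    if done:
        return torch.tensor(float(reward))
    with torch.no_grad():
        best_action = policy_net(next_state).argmax()   # one net picks the action
        next_q = target_net(next_state)[best_action]    # the other assigns its value
    return reward + gamma * next_q
```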