Reinforcement Learning

The RL problem presented in MLDemos is a Food Gathering problem, in which the goal is to learn a policy for navigating a continuous two-dimensional space and picking up food. The states, actions and rewards are defined next.

States
States are defined as two-dimensional positions (x, y) ∈ ℝ² in the canvas space. The space is
continuous and, for practical purposes, bounded to [0, 1] in each dimension.
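
For illustration only, the snippet below (Python with NumPy, not MLDemos code) shows one way such a bounded state could be represented; clamp_state is a hypothetical helper:

    import numpy as np

    def clamp_state(s):
        # Keep a two-dimensional state inside the [0, 1] x [0, 1] canvas.
        return np.clip(np.asarray(s, dtype=float), 0.0, 1.0)

    state = clamp_state([0.3, 1.2])   # -> array([0.3, 1.0])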

Actions
Actions are defined as movements from one state to another along a set of possible directions (defined by the user). In all cases, an additional “wait” action allows the agent to stay in place.
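
As a sketch of what such an action set might look like (Python/NumPy; the direction count and step size are assumptions, not MLDemos parameters):

    import numpy as np

    def make_actions(n_directions):
        # Unit steps in n evenly spaced directions, plus a final 'wait' action.
        angles = 2 * np.pi * np.arange(n_directions) / n_directions
        steps = [np.array([np.cos(a), np.sin(a)]) for a in angles]
        steps.append(np.zeros(2))     # 'wait': stay in place
        return steps

    def apply_action(state, action, step_size=0.02):
        # Move along the chosen direction and clamp back into the canvas.
        return np.clip(state + step_size * action, 0.0, 1.0)

    actions = make_actions(8)         # e.g. 8 compass directions + wait
    s = apply_action(np.array([0.5, 0.5]), actions[0])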

Rewards
The state-value function is computed cumulatively, as the amount of food collected along a trajectory from a given initial state over a number of Evaluation Steps (defined by the user).
At each policy-optimization iteration, the state-value function is evaluated from as many initial states as there are basis functions, each trajectory starting at the center of the corresponding basis function on the grid.
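
A minimal sketch of this evaluation scheme (Python/NumPy; policy and reward_at are hypothetical callables standing in for MLDemos internals):

    import numpy as np

    def evaluate_policy(policy, reward_at, centers, eval_steps):
        # Cumulative food collected along one rollout from each basis center.
        values = []
        for c in centers:
            s, total = np.array(c, dtype=float), 0.0
            for _ in range(eval_steps):
                total += reward_at(s)                 # food found at this state
                s = np.clip(s + policy(s), 0.0, 1.0)  # follow the policy
            values.append(total)
        return np.array(values)  # one state-value estimate per basis center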

Policies
Three policies have been implemented in MLDemos. In all cases, the policy determines what action will be taken from each state using a grid-like distribution of basis functions. The action taken from a specific state is “influenced” by the policy according to one of three paradigms. The first is a peculiar case: although the state space is continuous, the policy returns the exact same action for a whole set of states, which makes the problem somewhat discretized. The other two policies provide a continuous set of actions over the continuous state space and therefore avoid this discretization issue.
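
The sketch below illustrates this distinction, assuming a grid of basis functions each storing an action; the exact schemes used by MLDemos may differ. The nearest-basis variant returns one action per grid cell (the discretized first case), while a Gaussian-weighted blend yields actions that vary continuously with the state:

    import numpy as np

    def grid_centers(n):
        # n x n grid of basis-function centers on the unit canvas.
        xs = (np.arange(n) + 0.5) / n
        return np.array([[x, y] for y in xs for x in xs])

    def nearest_basis_action(state, centers, basis_actions):
        # First paradigm: every state inherits the action of its closest
        # basis function, so whole regions share the exact same action.
        i = np.argmin(np.linalg.norm(centers - state, axis=1))
        return basis_actions[i]

    def weighted_action(state, centers, basis_actions, sigma=0.1):
        # Continuous paradigm: blend basis actions with Gaussian weights,
        # so the action changes smoothly across the state space.
        w = np.exp(-np.sum((centers - state) ** 2, axis=1) / (2 * sigma ** 2))
        return (w[:, None] * basis_actions).sum(axis=0) / w.sum()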

In Practice
The easiest way to test the reinforcement learning process is to:
  1. Use the Reward Painter button in the drawing tools to paint food (red) onto the canvas
  2. Click the Initialize button to start the learning process
This will start the RL process, display the policy basis functions and update them every Display Steps iterations.

Options and Commands
The interface for Reinforcement Learning (the right-hand side of the Algorithm Options dialog) provides the commands and options controlling the learning process. The options regarding the policy type, reward and evaluation have been described above.

Generate Rewards
It is possible to generate pre-constructed rewards by dragging and dropping either a Gaussian of fixed size (Var option) or a gradient running from the center of the canvas to the dropped position. Alternatively, a number of standard benchmark functions are provided; use the Set button to draw the selected benchmark function onto the canvas.
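
For illustration, a fixed-variance Gaussian reward could be rasterized onto a discretized canvas as follows (Python/NumPy; the raster size and the use of Var as the Gaussian variance are assumptions):

    import numpy as np

    def gaussian_reward(shape, center, var):
        # Rasterize a Gaussian 'food' blob centered at (x, y) in [0, 1]^2.
        h, w = shape
        ys, xs = np.mgrid[0:h, 0:w]
        d2 = (xs / (w - 1.0) - center[0]) ** 2 + (ys / (h - 1.0) - center[1]) ** 2
        return np.exp(-d2 / (2.0 * var))

    reward_map = gaussian_reward((256, 256), center=(0.7, 0.3), var=0.02)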