Stochastic Methods for Reward Maximization
This section presents stochastic methods for exploring a policy space and maximizing an arbitrary reward function.
Random Search
Randomly sample new policies (points) at each iteration and keep the
maximum value found. The display shows in black all visited policies,
and in green the best policies found at each iteration.
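As an illustration only, here is a minimal Python sketch of the random-search loop, assuming a two-dimensional policy space in [0, 1]^2 and a hypothetical reward() function standing in for the reward surface drawn in the interface:

    import numpy as np

    def reward(policy):
        # Hypothetical reward surface: a single Gaussian bump (illustrative stand-in).
        return np.exp(-10.0 * np.sum((policy - np.array([0.6, 0.4])) ** 2))

    def random_search(iterations=100, rng=np.random.default_rng(0)):
        best_policy, best_reward = None, -np.inf
        history = []                                  # best policy found at each iteration
        for _ in range(iterations):
            policy = rng.uniform(0.0, 1.0, size=2)    # sample a new point anywhere in the space
            r = reward(policy)
            if r > best_reward:                       # keep the maximum value found so far
                best_policy, best_reward = policy, r
            history.append(best_policy)
        return best_policy, best_reward, history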
Random Walk
Generate samples around the current maximum by drawing from a normal distribution.
- Search Variance: variance of the normal distribution used for exploration
The display shows in green the history of the best policies found
at each iteration, in black all policies visited so far. Ellipses
display the search distribution (bold: 1x variance, thin: 2x variance).
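The following is a minimal sketch of the random-walk variant under the same assumptions (2D policy space, hypothetical reward() stand-in); the Search Variance parameter corresponds to search_variance below:

    import numpy as np

    def reward(policy):
        # Hypothetical reward surface (same illustrative stand-in as above).
        return np.exp(-10.0 * np.sum((policy - np.array([0.6, 0.4])) ** 2))

    def random_walk(search_variance=0.01, iterations=100, rng=np.random.default_rng(0)):
        best_policy = rng.uniform(0.0, 1.0, size=2)   # random starting policy
        best_reward = reward(best_policy)
        history = [best_policy]
        for _ in range(iterations):
            # Draw the next candidate around the current maximum from N(best, search_variance).
            candidate = best_policy + rng.normal(0.0, np.sqrt(search_variance), size=2)
            r = reward(candidate)
            if r > best_reward:
                best_policy, best_reward = candidate, r
            history.append(best_policy)
        return best_policy, best_reward, history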
Policy learning by Weighting Exploration with the Returns (PoWER)
PoWER uses a set of the k best policies to estimate the maximum, and explores new samples
by drawing from a normal distribution whose variance can be kept constant or adapted to the
current best policies.
Detailed information on PoWER can be found in this paper.
- Search Variance: initial variance for exploration
- Adaptive: adapt search variance as a function of the current best policies
- k: size of the set of policies kept for optimization
The display shows in green the set of k policies, in white the history
of best policies throughout the iterations, and in black all visited
policies. Ellipses display the search distribution (bold: 1x variance,
thin: 2x variance).
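The sketch below illustrates, under the same simplified assumptions, the reward-weighted update at the core of PoWER: keep the k best policies, use their reward-weighted average as the new search centre, and optionally adapt the search variance from their reward-weighted spread. It is a simplification for illustration, not the implementation used by the tool:

    import numpy as np

    def reward(policy):
        # Hypothetical reward surface (same illustrative stand-in as above).
        return np.exp(-10.0 * np.sum((policy - np.array([0.6, 0.4])) ** 2))

    def power(search_variance=0.01, k=10, adaptive=True, iterations=100,
              rng=np.random.default_rng(0)):
        mean = rng.uniform(0.0, 1.0, size=2)          # current estimate of the maximum
        var = np.full(2, search_variance)             # per-dimension search variance
        elite = []                                    # (reward, policy) pairs: the k best so far
        for _ in range(iterations):
            candidate = mean + rng.normal(0.0, np.sqrt(var))
            elite.append((reward(candidate), candidate))
            elite = sorted(elite, key=lambda e: e[0], reverse=True)[:k]
            rewards = np.array([e[0] for e in elite])
            policies = np.array([e[1] for e in elite])
            weights = rewards / (rewards.sum() + 1e-12)
            # Reward-weighted average of the k best policies gives the new search centre.
            mean = weights @ policies
            if adaptive:
                # Reward-weighted spread of the k best policies adapts the search variance.
                var = weights @ (policies - mean) ** 2 + 1e-6
        return mean, reward(mean)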
Particle Filters
Particle Filters use a set of particles (samples) to model the reward
function. New particles are generated from the best older particles at
each new iteration, quickly discarding particles in regions that have no
reward. More information on Wikipedia.
- Search Variance: amount of exploration done at the beginning of each iteration
- Adaptive: variance is adapted as a function of the best particle weight found so far
- Particles: number of particles generated at each iteration
In its implementation here, each iteration follows these steps:
- Displace previous particles ( N(0, Search Variance) )
- Recompute the weights for each particle
- Find best particle
- Sample particles for the new iteration
- Check for degenerate weights and replace the corresponding particles
The display shows in green the current particles, with large radii for
larger weights. The best particles at previous iterations are shown in
white.
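A simplified Python sketch of the iteration steps listed above follows; the degenerate-weight check is reduced to reinitializing all particles when every weight is zero, and the reward() stand-in and particle count are illustrative assumptions:

    import numpy as np

    def reward(policy):
        # Hypothetical reward surface (same illustrative stand-in as above).
        return np.exp(-10.0 * np.sum((policy - np.array([0.6, 0.4])) ** 2))

    def particle_filter(search_variance=0.01, n_particles=50, iterations=100,
                        rng=np.random.default_rng(0)):
        particles = rng.uniform(0.0, 1.0, size=(n_particles, 2))
        best_policy, best_reward = None, -np.inf
        for _ in range(iterations):
            # 1. Displace previous particles with noise drawn from N(0, Search Variance).
            particles = particles + rng.normal(0.0, np.sqrt(search_variance), size=particles.shape)
            # 2. Recompute the weight of each particle from the reward it obtains.
            weights = np.array([reward(p) for p in particles])
            # 3. Find the best particle and remember it if it improves on the best so far.
            i_best = int(np.argmax(weights))
            if weights[i_best] > best_reward:
                best_policy, best_reward = particles[i_best].copy(), weights[i_best]
            if weights.sum() > 0:
                # 4. Sample particles for the new iteration in proportion to their weights.
                probs = weights / weights.sum()
                idx = rng.choice(n_particles, size=n_particles, p=probs)
                particles = particles[idx].copy()
            else:
                # 5. Degenerate weights: replace the particles (here, reinitialize at random).
                particles = rng.uniform(0.0, 1.0, size=(n_particles, 2))
        return best_policy, best_reward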
Gradient Descent
Gradient descent methods compute the gradient of the reward in the
neighborhood of the current minimum (or maximum in the case of gradient
ascent), and move the search in the direction of maximum gradient. More
information on Wikipedia.
- Speed: amount of displacement in the direction of maximum gradient.
- Adaptive: displacement is adapted as a function of the current
gradient (i.e. faster for large gradients, slower for small gradients)
The display shows in white the history of best policies found at each iteration, and in green the current best policy.
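As with the methods above, here is a minimal Python sketch of gradient ascent on the same hypothetical reward surface, estimating the gradient with finite differences; the Speed and Adaptive parameters correspond to speed and adaptive below:

    import numpy as np

    def reward(policy):
        # Hypothetical reward surface (same illustrative stand-in as above).
        return np.exp(-10.0 * np.sum((policy - np.array([0.6, 0.4])) ** 2))

    def gradient_ascent(speed=0.05, adaptive=True, iterations=100, eps=1e-4,
                        rng=np.random.default_rng(0)):
        policy = rng.uniform(0.0, 1.0, size=2)        # random starting policy
        history = [policy.copy()]
        for _ in range(iterations):
            # Estimate the gradient by finite differences in the neighborhood of the current point.
            grad = np.zeros(2)
            for d in range(2):
                step = np.zeros(2)
                step[d] = eps
                grad[d] = (reward(policy + step) - reward(policy - step)) / (2 * eps)
            norm = np.linalg.norm(grad)
            if norm < 1e-12:
                break                                  # flat region: nothing left to climb
            if adaptive:
                policy = policy + speed * grad         # step scales with the gradient magnitude
            else:
                policy = policy + speed * grad / norm  # fixed-size step along the gradient direction
            history.append(policy.copy())
        return policy, reward(policy), history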