# C1 Introduction

# Notation

# Intro to RL

  • This is a computational, goal-oriented approach to learning from interaction. Rather than directly theorizing about how people or animals learn, we primarily explore idealized learning situations and evaluate the effectiveness of various learning methods.
  • Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.
  • These two characteristics—trial-and-error search and delayed reward—are the two most important distinguishing features of reinforcement learning.
  • RL is not
    • Supervised - In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act.
    • Unsupervised - Unsupervised learning is about finding structure hidden in collections of unlabeled data; RL is instead trying to maximize a reward signal, so it does not fit this category either.
  • One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation (a minimal epsilon-greedy sketch follows this list).
  • Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment.
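
The exploration/exploitation trade-off above can be made concrete with an epsilon-greedy action-selection rule on a toy bandit problem. This is a minimal sketch for these notes, not code from the book; the 3-armed bandit, the epsilon value, and the incremental sample-average update are all illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

# Toy 3-armed bandit with hidden mean rewards (made-up numbers).
true_means = [0.2, 0.5, 0.8]
q_values = [0.0, 0.0, 0.0]   # estimated value of each action
counts = [0, 0, 0]

for step in range(1000):
    a = epsilon_greedy(q_values)
    reward = random.gauss(true_means[a], 1.0)          # noisy reward from the environment
    counts[a] += 1
    q_values[a] += (reward - q_values[a]) / counts[a]  # incremental sample average

print(q_values)   # estimates should approach the true means, with arm 2 chosen most often
```

Pure exploitation (epsilon = 0) can lock onto whichever arm happened to look good early on; the occasional exploratory action is what lets the agent keep improving its estimates of the other arms.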

# Elements of RL

  • Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment.
    • A policy defines the learning agent’s way of behaving at a given time. A policy is a mapping from perceived states of the environment to actions to be taken when in those states.
    • A reward signal defines the goal of a reinforcement learning problem. On each time step, the environment sends to the reinforcement learning agent a single number called the reward. The agent’s sole objective is to maximize the total reward it receives over the long run. The reward is the primary basis for altering the policy, and it may be a stochastic function of the state and the action taken.
    • Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. It is values with which we are most concerned when making and evaluating decisions. Estimating values is much harder than observing rewards, and we need good algorithms for doing so efficiently (see the value-estimation sketch after this list).
    • A model mimics the behavior of the environment or, more generally, allows inferences to be made about how the environment will behave; models are used for planning. Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners.
  • Solution methods such as genetic algorithms, genetic programming, simulated annealing, and other optimization methods never estimate value functions. These methods apply multiple static policies, each interacting over an extended period of time with a separate instance of the environment. The policies that obtain the most reward, and random variations of them, are carried over to the next generation of policies, and the process repeats. We call these evolutionary methods because their operation is analogous to the way biological evolution produces organisms with skilled behavior even if they do not learn during their individual lifetimes. Because they ignore much of the useful structure of individual interactions (which states the agent visits and which actions it selects), evolutionary methods are generally not well suited to reinforcement learning problems.
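
To tie these elements together, here is a minimal sketch of the agent-environment interaction loop: a fixed policy selects actions, the environment returns rewards, and a tabular value function is estimated from experience. The five-state chain environment, the random policy, the discount factor, and the bootstrapped update rule are all assumptions made for illustration; the book develops these ideas formally in later chapters.

```python
import random

# Toy deterministic chain: states 0..4, actions -1 (left) or +1 (right).
# Reaching state 4 ends the episode with reward 1; all other rewards are 0.
def env_step(state, action):
    next_state = min(max(state + action, 0), 4)
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

def policy(state):
    """A fixed stochastic policy: a mapping from states to actions (here, uniformly random)."""
    return random.choice([-1, +1])

values = {s: 0.0 for s in range(5)}   # value function: expected long-run reward from each state
gamma, alpha = 0.9, 0.1               # discount factor and step size (assumed)

for episode in range(500):
    state, done = 0, False
    while not done:
        action = policy(state)                              # policy picks an action
        next_state, reward, done = env_step(state, action)  # environment returns the reward signal
        # Move V(s) toward the immediate reward plus the discounted value of the next state.
        # The terminal state 4 is never updated, so its value stays 0.
        values[state] += alpha * (reward + gamma * values[next_state] - values[state])
        state = next_state

print(values)   # states closer to the goal end up with higher estimated values
```

A model-based variant would additionally learn (or be given) the env_step dynamics and use them to plan ahead; the sketch above is model-free, learning purely from trial-and-error interaction.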