Standard RL offers methods for solving sequential decision-making problems (typically formalized as a Markov Decision Process) that involve maximising reward (or minimising cost) over some time horizon, where a model of the environment may or may not be known. It assumes an agent that interacts with the environment to learn the optimal behaviour, which requires a balance between exploration and exploitation.

The RL research community is trying to tackle the following open problems:

- **Deep exploration:** The design of reinforcement learning algorithms that efficiently explore intractably large state spaces remains an important open challenge. Here we are interested in questions like: *How can we quickly gather the information needed to make good decisions? How can we efficiently explore the environment? What strategies can we use for effective data collection in online settings?* Our solutions must be statistically and computationally efficient in non-tabular settings.
- **Environment modelling:** How can we learn the environment model? How can we derive an optimal policy directly while considering multiple environments?
- **Experience transfer:** How can we generalise, reuse and transfer experience and knowledge across environments?
- **Abstraction:** How can we effectively abstract states and actions?

But what about real-world applications? Interaction with the environment is expensive and sometimes even dangerous in real-world settings, and assuming that the agent can interact freely limits the applicability of RL there. Furthermore, current implementations of the latest advances in this field have mainly been tailored to academia, focusing on fast prototyping and evaluating performance on simulated benchmark environments. Most real-world problems, on the other hand, are not easy to model, and we can’t always assume direct access to the environment or to high-fidelity simulators for learning.

The practical success of RL algorithms has been built upon a base of theory including gradient descent [7], temporal difference learning [58] and other foundational algorithms. These foundations are particularly poorly understood for RL with nonlinear function approximation (e.g. via neural networks), so-called ‘deep RL’. In other words, theory lags behind the practical algorithms that are giving us results. How can we make progress here? For RL to be practical we must acknowledge that solving practical problems in online settings (where the agent learns as it interacts with the world) is the elephant in the room: deep RL might be ill-equipped to solve problems of an online-learning nature because of the limited data available (large amounts are typically needed for fitting neural nets). Success stories require massive offline computation with lots of data (either historical or simulated), and thus solving industry problems with classic RL algorithms might not be a good goal.

Real-world problems come with their own set of challenges; some of them are listed in [3]. When constructing efficient RL algorithms for real-world settings, we care about these aspects:

- Training off-line from the fixed logs of an external behaviour policy.
- Learning on the real system from limited samples.
- High-dimensional continuous state and action spaces.
- Safety constraints that should never or at least rarely be violated.
- Tasks that may be partially observable, alternatively viewed as non-stationary or stochastic.
- Reward functions that are unspecified, multi-objective, or risk-sensitive.
- System operators who desire explainable policies and actions.
- Inference that must happen in real-time at the control frequency of the system.
- Large and/or unknown delays in the system actuators, sensors, or rewards.

There is a lot of literature that acknowledges these issues, but while research has addressed these challenges individually, there has been little work on algorithms that address all of them together. Ultimately we would like reinforcement learning algorithms that simultaneously perform well empirically and have strong theoretical guarantees. Such algorithms are especially important for high-stakes domains like health care, education and customer service, where non-expert users demand excellent outcomes.

We can tackle some of these applied RL challenges by combining ideas and algorithms from existing literature and sub-fields of RL, such as inverse RL / learning from demonstration and classic reinforcement learning. Let us discuss a framework for thinking about the various scenarios we might encounter in the real world, based on the kind of data available and the associated assumptions.

### Case A: When an expert is available

Imitation learning: For example, recent work by Hester et al. [1] shows that even relatively small sets of demonstration data can massively accelerate the learning process.

This comes under ‘imitation learning’, and it holds the promise of greatly reducing the amount of data needed to learn a good policy. However, it assumes that an expert policy is available.
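As a rough illustration of the idea, the simplest form of imitation learning is behavioural cloning: treat the expert's (state, action) pairs as supervised data and copy what the expert did. The sketch below is a tabular toy version that picks the expert's most frequent action in each state. It is not the method of [1], which combines demonstrations with Q-learning; the function name `behavioural_cloning` and the toy states and actions are made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def behavioural_cloning(
    demonstrations: Iterable[Tuple[Hashable, Hashable]],
) -> Dict[Hashable, Hashable]:
    """Return a policy mapping each seen state to the expert's most frequent action."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    # For every state, keep the action the expert chose most often.
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


# Hypothetical expert demonstrations as (state, action) pairs.
demos = [("s0", "left"), ("s0", "left"), ("s0", "right"), ("s1", "right")]
policy = behavioural_cloning(demos)
```

Note that the cloned policy has nothing to say about states the expert never visited, which is one reason demonstrations are usually combined with further learning.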

Inverse reinforcement learning: Similarly, work by Ibarz et al. [2] shows that we can improve upon imitation learning by learning a reward function (using human feedback in the form of expert demonstrations and trajectory preferences) and optimising it using reinforcement learning.
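One common way to learn a reward function from trajectory preferences is to model the probability that a human prefers one segment over another with a Bradley-Terry (logistic) model over the summed predicted rewards, and to minimise the resulting cross-entropy loss. The sketch below assumes that formulation and may differ in detail from [2]. The function names are hypothetical, and in practice the per-step rewards would come from a learned reward model rather than hand-written lists.

```python
import math
from typing import Sequence


def preference_probability(rewards_a: Sequence[float], rewards_b: Sequence[float]) -> float:
    """Probability that segment A is preferred over segment B (Bradley-Terry model)."""
    diff = sum(rewards_a) - sum(rewards_b)
    # exp(sum_a) / (exp(sum_a) + exp(sum_b)) rewritten as a sigmoid of the difference.
    return 1.0 / (1.0 + math.exp(-diff))


def preference_loss(
    rewards_a: Sequence[float], rewards_b: Sequence[float], prefers_a: bool
) -> float:
    """Cross-entropy loss for a single human preference label."""
    p_a = preference_probability(rewards_a, rewards_b)
    return -math.log(p_a if prefers_a else 1.0 - p_a)


# Predicted per-step rewards for two short trajectory segments (toy numbers).
loss_agree = preference_loss([1.0, 1.0], [0.0, 0.0], prefers_a=True)
loss_disagree = preference_loss([1.0, 1.0], [0.0, 0.0], prefers_a=False)
```

The loss is small when the reward model already ranks the segments the way the human did, and large otherwise, so minimising it pushes the predicted rewards towards the human's preferences.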

### Case B: When we can explore the environment

Most traditional RL research assumes that the agent interacts with the environment and receives feedback to learn the optimal behaviour, which requires a balance between exploration and exploitation. The exploration can be of two types:

a) Complete, intentional exploration

b) Exploration under safety constraints: In the real world we can sometimes explore the environment, but only within limits. This problem is studied under risk-sensitive RL, safe RL and safe exploration.
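To make the two types concrete, here is a minimal sketch of epsilon-greedy action selection restricted to a set of actions that are assumed to be safe. With every action marked safe it reduces to ordinary epsilon-greedy exploration (type a); with a smaller safe set it is a very simple form of constrained exploration (type b). The function name `safe_epsilon_greedy` and the idea that a safe set is handed to us by domain knowledge are assumptions for illustration; the safe RL literature covers far more sophisticated approaches.

```python
import random
from typing import Dict, Set


def safe_epsilon_greedy(
    q_values: Dict[int, float],
    safe_actions: Set[int],
    epsilon: float,
    rng: random.Random,
) -> int:
    """Pick an action from `safe_actions`: explore with probability epsilon, else exploit."""
    if not safe_actions:
        raise ValueError("no safe action available in this state")
    if rng.random() < epsilon:
        # Explore: uniform choice, but only among actions assumed to be safe.
        return rng.choice(sorted(safe_actions))
    # Exploit: best estimated value among the safe actions.
    return max(safe_actions, key=lambda a: q_values[a])


# Toy example: action 1 has the highest estimated value but is not considered safe.
q = {0: 1.0, 1: 5.0, 2: 3.0}
greedy_action = safe_epsilon_greedy(q, safe_actions={0, 2}, epsilon=0.0, rng=random.Random(0))
```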

### Case C: When a log is available (counterfactual reasoning or batch RL)

Batch RL, counterfactual reasoning: this involves ‘what if’ reasoning for sequential decision making.

There is an enormous opportunity to leverage the increasing amounts of data to improve decisions made in healthcare, education, maintenance, and many other applications. Doing so requires ‘what if’ / counterfactual reasoning: reasoning about the potential outcomes had different decisions been made. In practice it means we come up with policies and evaluate them on existing data logs using counterfactual policy evaluation. This was the topic of my previous blog post.
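One standard way to do counterfactual policy evaluation is importance sampling: reweight each logged trajectory by how much more (or less) likely the new policy would have been to take the logged actions. The sketch below is a minimal version under two assumptions that the text does not state: the log records the probability with which the behaviour policy took each action, and that probability is non-zero wherever the target policy acts. The names `Step`, `importance_sampling_value` and `target_prob` are hypothetical.

```python
from typing import Callable, List, Tuple

# One logged step: (state, action, reward, probability of that action under the logging policy).
Step = Tuple[int, int, float, float]


def importance_sampling_value(
    trajectories: List[List[Step]],
    target_prob: Callable[[int, int], float],
    gamma: float = 1.0,
) -> float:
    """Ordinary trajectory-wise importance sampling estimate of the target policy's value."""
    total = 0.0
    for trajectory in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for state, action, reward, behaviour_prob in trajectory:
            # Ratio of target to behaviour probability, accumulated along the trajectory.
            weight *= target_prob(state, action) / behaviour_prob
            ret += discount * reward
            discount *= gamma
        total += weight * ret
    return total / len(trajectories)


# Toy log: a uniform-random logging policy over two actions, one step per trajectory.
logs = [[(0, 0, 1.0, 0.5)], [(0, 1, 0.0, 0.5)]]


def always_action_zero(state: int, action: int) -> float:
    """Target policy to evaluate: always choose action 0."""
    return 1.0 if action == 0 else 0.0


estimate = importance_sampling_value(logs, always_action_zero)
```

This estimator is unbiased under the assumptions above, but its variance grows quickly with trajectory length, which is why weighted or per-decision variants are often preferred in practice.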

**Learn more:** I recommend the following talks for learning more about these challenges:

- Efficient Reinforcement Learning When Data is Costly - Emma Brunskill, Stanford University
- Reinforcement Learning for the People and/or by the People - Emma Brunskill, Stanford University
- Towards Generalization and Efficiency in Reinforcement Learning - Wen Sun, Microsoft Research

[1] https://arxiv.org/abs/1704.03732

[2] https://arxiv.org/abs/1811.06521

[3] https://openreview.net/pdf?id=S1xtR52NjN#cite.Mankowitz2016
