Skip to main content

Benchmarking Safe Exploration in Deep Reinforcement Learning

Alex Ray∗


OpenAI Joshua Achiam∗


Dario Amodei



Reinforcement learning (RL) agents need to explore their environments in order to learn optimal policies by trial and error. In many environments, safety is a critical concern and certain errors are unacceptable: for example, robotics systems that interact with humans should never cause injury to the humans while exploring. While it is currently typical to train RL agents mostly or entirely in simulation, where safety concerns are minimal, we anticipate that challenges in simulating the complexities of the real world (such as human-AI interactions) will cause a shift towards training RL agents directly in the real world, where safety concerns are paramount. Consequently we take the position that safe exploration should be viewed as a critical focus area for RL research, and in this work we make three contributions to advance the study of safe exploration. First, building on a wide range of prior work on safe reinforcement learning, we propose to standardize constrained RL as the main formalism for safe exploration. Second, we present the Safety Gym benchmark suite, a new slate of high-dimensional continuous control environments for measuring research progress on constrained RL. Finally, we benchmark several constrained deep RL algorithms on Safety Gym environments to establish baselines that future work can build on.

1        Introduction

Reinforcement learning is an increasingly important technology for developing highly-capable AI systems. While RL is not yet fully mature or ready to serve as an “off-the-shelf” solution, it appears to offer a viable path to solving hard sequential decision-making problems that cannot currently be solved by any other approach. For example, RL has been used to achieve superhuman performance in competitive strategy games including Go [Silver et al., 2016, 2017a,b], Starcraft [DeepMind, 2019], and Dota [OpenAI, 2019]. Outside of competitive domains, RL has been used to control highly-complex robotics systems [OpenAI et al., 2018], and to improve over supervised learning models for serving content to users on social media [Gauci et al., 2018].

The fundamental principle of RL is that an agent, the AI system, tries to maximize a reward signal by trial and error. RL is suitable for any problem where it is easier to evaluate behaviors (by computing a reward function) than it is to generate optimal behaviors (eg by analytical or numerical methods). The general-purpose nature of RL makes it an attractive option for a wide range of applications, including self-driving cars [Kendall et al., 2018], surgical robotics [Richter et al., 2019], energy systems management [Gamble and Gao, 2018, Mason and Grijalva, 2019], and other problems where AI would interact with humans or critical infrastructure.

Most of the wins for RL so far have been enabled by simulators, where agents can try different behaviors without meaningful consequences. However, for many problems simulators will either not be available or high-enough fidelity for RL to learn behaviors that succeed in the real environment.

While “sim-to-real” transfer learning algorithms may mitigate this issue, we expect that in problems centered on AI-human interaction or very complex systems, challenges in building useful simulators will cause a shift towards training directly in the real world. This makes the safe exploration problem particularly salient.

The safe exploration problem is a natural consequence of the trial-and-error nature of RL: agents will sometimes try dangerous or harmful behaviors in the course of learning [Hans et al., 2008, Moldovan and Abbeel, 2012, Pecka and Svoboda, 2014, García and Fernández, 2015, Amodei et al., 2016]. When all training occurs in a simulator, this is usually not concerning, but exploration of this kind in the real world could produce unacceptable catastrophes. To illustrate safety concerns in a few domains where RL might plausibly be applied:

  • Robots and autonomous vehicles should not cause physical harm to humans.
  • AI systems that manage power grids should not damage critical infrastructure.
  • Question-answering systems should not provide false or misleading answers for questions about medical emergencies [Bickmore et al., 2018].
  • Recommender systems should not expose users to psychologically harmful or extremist content [Vendrov and Nixon, 2019].

A central question for the field of RL is therefore:

How do we formulate safety specifications to incorporate them into RL, and how do we ensure
that these specifications are robustly satisfied throughout exploration?

The goal of our work is to facilitate progress on this question on several fronts.

Towards standardizing safety specifications: Based on a range of prior work, we propose to stan- dardize constrained RL [Altman, 1999] as the main formalism for incorporating safety specifications into RL algorithms to achieve safe exploration. We clarify that we are not advocating for any spe- cific constraint-based algorithm, but instead taking a position that 1) safety specifications should be separate from task performance specifications, and 2) constraints are a natural way to encode safety specifications. We support this argument by reference to standards for safety requirements that typically arise in engineering design and risk management, and we identify the limitations of alternative approaches. Importantly, constrained RL is scalable to the regime of high-dimensional function approximation—the modern deep RL setting.

Towards measuring progress: The field of RL has greatly benefited in recent years from benchmark environments for evaluating algorithmic progress, including the Arcade Learning Environment [Bellemare et al., 2012], OpenAI Gym [Brockman et al., 2016], Deepmind Control Suite [Tassa et al., 2018], and Deepmind Lab [Beattie et al., 2016], to name a few. However, there is not yet a standard set of environments for making progress on safe exploration specifically.2 Different papers use different environments and evaluation procedures, making it difficult to compare methods—and in turn to identify the most promising research directions. To address the gap, we present Safety Gym: a set of tools for accelerating safe exploration research. Safety Gym includes a benchmark suite of 18 high-dimensional continuous control environments for safe exploration, plus 9 additional environments for debugging task performance separately from safety requirements, and tools for building additional environments.

Consistent with our proposal to standardize on constrained RL, each Safety Gym environment has separate objectives for task performance and safety. These are expressed via a reward function and a set of auxiliary cost functions respectively. We recommend a protocol for evaluating constrained RL algorithms on Safety Gym environments based on three metrics: task performance of the final policy, constraint satisfaction of the final policy, and average regret with respect to safety costs throughout training.

We highlight three particularly desirable features of Safety Gym:

  1. There is a gradient of difficulty across benchmark environments. This allows practitioners to quickly iterate on the simplest tasks before proceeding to the hardest ones.
  2. In all Safety Gym benchmark environments, the layout of environment elements is random- ized at the start of each episode. Each distribution over layouts is continuous and minimally restricted, allowing for essentially infinite variations within each environment. This prevents RL algorithms from learning trivial solutions that memorize particular trajectories, and requires agents to learn more-general behaviors to succeed.
  3. Safety Gym is highly extensible. The tools used to build Safety Gym allow the easy creation of new environments with different layout distributions, including combinations of constraints not present in our standard benchmark environments.

Towards providing useful baselines: To make Safety Gym relevant out-of-the-box and to partially clarify state-of-the-art in safe exploration, we benchmark several existing constrained and uncon- strained RL algorithms on the Safety Gym environments, and we provide the results as baselines for future work. We include unconstrained RL algorithms to demonstrate that the environments are not “trivially” safe—that is, to demonstrate that task objectives and safety objectives have meaningful trade-offs, and performing well at the task does not automatically result in safe behavior. Our baseline algorithms include Trust Region Policy Optimization (TRPO) [Schulman et al., 2015] and Proximal Policy Optimization (PPO) [Schulman et al., 2017] in their original unconstrained forms, as well as forms with adaptive penalties for safety costs based on the Lagrangian approach to constrained optimization. Additionally, we include Constrained Policy Optimization (CPO) [Achiam et al., 2017], a constrained form of TRPO that calculates a penalty coefficient from scratch at each update. Surprisingly, we find that CPO performs poorly on Safety Gym environments by comparison to Lagrangian methods.

2        Related Work

Safety Overviews: Amodei et al. [2016] and Leike et al. [2017] give taxonomies and examples of AI safety problems, including safe exploration and others that overlap with safe exploration. Pecka and Svoboda [2014] and García and Fernández [2015] give taxonomies of approaches to safe exploration covering a wide range of work, and offer valuable historical perspectives not covered here due to our choice to focus on modern RL with deep neural network function approximators.

Safety Definitions and Algorithms: A foundational problem in safe exploration work is the question of what safety means in the first place. One definition for safety requires humans to label states of the environment as “safe” and “unsafe,” and considers agents safe if they never enter into unsafe states [Hans et al., 2008]; this approach is often connected to constraints [Altman, 1999] and sometimes to reachability [Fisac et al., 2019] or stability [Berkenkamp et al., 2017]. A wide body of work considers agents to be safe if they act, reason, and generalize in accordance with human preferences, eg [Hadfield-Menell et al., 2016, Christiano et al., 2017, 2018, Irving et al., 2018, Leike et al., 2018]. Moldovan and Abbeel [2012] consider an agent safe if it satisfies an ergodicity requirement: that is, if it can reach any state it visits from any other state it visits, so that errors are reversible. Krakovna et al. [2018] consider safety issues related to “side effects,” negative externalities that can arise when an agent lacks suitable priors on what behaviors are safe. Shah et al. [2019] consider a safety condition based on the assumption that the initial state of an environment arranged by humans will contain information about their preferences for safe and unsafe behavior. Gehring and Precup [2013] consider agents to be safer when they avoid higher-variance outcomes. Another branch of work concerns the monotonic improvement of return over the course of learning, where degredation in return is considered unsafe [Pirotta et al., 2013, Papini et al., 2019]. Other methods to address safety in RL have been proposed, including: using ensembles to improve generalization of learned safety-critical behavior [Kenton et al., 2019], learning action-time interventions to prevent and correct actions that would lead to unsafe states [Dalal et al., 2018, Chow et al., 2019], using human interventions to override unsafe actions[Saunders et al., 2017], learning “reverse” policies to verify ergodicity [Eysenbach et al., 2018], and using “intrinsic fear” to penalize unsafe behavior [Lipton et al., 2016].

Benchmarking RL and RL Safety: Various benchmark environments have been proposed to measure progress on different RL problems. Bellemare et al. [2012] proposed the Arcade Learning Environment (ALE), where Atari games are RL environments with score-based reward functions. Brockman et al. [2016] proposed OpenAI Gym, an interface to a wide variety of standard tasks including classical control environments, high-dimensional continuous control environments, ALE Atari games, and others. Tassa et al. [2018] proposed the Deepmind Control Suite, a set of high- dimensional physics simulation-based tasks (similar in nature to our environments), based on the MuJoCo simulator [Todorov et al., 2012]. Leike et al. [2017] gave grid worlds that demonstrate AI safety issues, using an observable reward function to encode objectives and a hidden performance function to evaluate whether the agent is accomplishing the objective safely. Cobbe et al. [2018] developed CoinRun, an arbitrarily-large set of RL environments based on extensive randomization, as a platform to study generalization and transfer in RL.

3        Safe Reinforcement Learning Via Constraints

3.1        What is Constrained RL?

We take a broad view of constrained RL as the general problem of training an RL agent with constraints, usually with the intention of satisfying constraints throughout exploration in training and at test time. In this sub-section, we’ll describe the quantitative formulation for the constrained RL problem. Let ΠC denote a feasible set of constraint-satisfying policies, and for the moment, put aside the question of how it is constructed. An optimal policy in constrained RL is given by:

where Jr(π) is a reward-based objective function. As in standard RL, the objective is usually either the infinite-horizon discounted return, the finite-horizon undiscounted return, or the infinite-horizon average reward.

The framework of constrained Markov Decision Processes (CMDPs) [Altman, 1999] is the de-facto standard for describing feasible sets in constrained RL. CMDPs are equipped with a set of cost functions, c1, …, ck, separate from the reward function. The feasible set in a CMDP is given by

where each Jci is a cost-based constraint function defined the same way as an expected return or average return metric (using ci instead of the reward r), and each di is a threshold (a human-selected hyperparameter). The CMDP framework can be extended to use different kinds of cost-based constraints; for instance Chow et al.[2015] consider chance constraints (eg, P t ct C) ≤ d) and constraints on the conditional value at risk (the expected sum of costs over the α-fraction of worst-case outcomes), and Dalal et al.[2018] consider separate constraints for each state in the CMDP (eg, ∀s, Eaπ [c(s, a)] ≤ d).The range of constraints on agent behavior expressible through appropriately-designed cost functions is quite vast—a claim that can be seen as a corollary to the reward hypothesis, which states that “all of what we mean by goals and purposes” can be described with reward functions [Sutton and Barto,2018].

However, while the CMDP framework characterizes feasible sets, optimal policies, and equivalent optimization problems for Eq 1, it does not, by itself, describe ways to evaluate or attain constraint- satisfying exploration. A substantial body of recent work has explored this problem [Achiam et al., 2017, Saunders et al., 2017, Pham et al., 2018, Dalal et al., 2018, Chow et al., 2018, 2019], but there is not yet a universally agreed-upon way to evaluate and compare different methods. Achiam et al. [2017] and Chow et al. [2018, 2019] qualitatively compared the learning curves for expected cost between methods, preferring the methods that appeared to have fewer or smaller constraint-violating spikes. Saunders et al. [2017], Pham et al. [2018], and Dalal et al. [2018] counted and compared the total number of times an agent entered into an undesired state throughout training for different methods, essentially measuring constraint-satisfaction regret. We endorse approaches like this and recommend that the degree of constraint-satisfaction throughout exploration should be quantified by measures of regret, with the regret function accounting for all of the agent’s actual experience (as opposed to, say, only experiences from separate test behavior). Later, when describing our evaluation protocol for benchmarking constrained RL algorithms in Safety Gym, we will make the case that cost rate (the average cost over the entirety of training) is a suitable regret measure.

3.2       Constrained RL and Safe Exploration

Constraints are a natural and universally-relevant way to formulate safety requirements. The work of making a system safe refers to the reduction or avoidance of harm, broadly defined, which in a practical sense means avoiding hazards [Rausand, 2014]—that is, constraining the state and behavior of the system to stay away from the circumstances that lead to harm. This perspective underlies standards and practices in the field of systems safety; see for example Rice [2002] and NASA [2017].

Contrast this with standard reinforcement learning, which just maximizes a reward function. In order to design hazard-avoiding behavior into an agent through a scalar reward function, a designer would have to carefully select a trade-off between a reward for task-solving behavior and a penalty for proximity to hazards. There are two problems with this: 1) There is no invertible map between “desired safety specification” and “correct trade-off parameter” that can be checked before running an

RL algorithm. If the designer selects a penalty that is too small, the agent will learn unsafe behavior, and if the penalty is too severe, the agent may fail to learn anything. 2) A fixed trade-off, even one that results in a hazard-avoiding policy at optimum, does not account for a requirement to satisfy safety requirements throughout training. Both of these problems have been observed in practice, for example by Achiam et al. [2017], Pham et al. [2018], and Dalal et al. [2018]. The choice to formulate safety requirements as constraints, and to attain constraint-satisfying exploration, resolves both. Aside from the conceptual justification, encouraging results demonstrate that constrained RL algorithms are performant in the high-dimensional control regime (eg Saunders et al. [2017], Achiam et al. [2017], Dalal et al. [2018], Wang et al. [2018], Bohez et al. [2019], and Chow et al. [2019]) and are therefore viable for making progress on the general safe exploration problem.

3.3        Addressing Critiques of Constrained RL

Constrained RL is not universally regarded as a key component of RL safety, however, and we believe it is important to address why this is the case. A central concern in safety for advanced RL systems, especially artificial general intelligence (AGI), relates to agent alignment: the problem of ensuring that an agent behaves in accordance with the user’s intentions. (Here, we cite Leike et al. [2018] for the specific term and phrasing, but this problem has been considered in various forms for decades.) In RL, this manifests primarily as an issue in reward specification, where seemingly-correct but misspecified reward functions can result in incorrect and unsafe agent behavior [Clark and Amodei, 2016]. It is not considered obvious whether the framework of constrained RL helps solve this issue, since constrained RL still requires the specification of not only reward functions but also typically cost functions for the constraints. The critique, then, is that errors in designing constraint functions could result in unsafe agents, and so constrained RL is simply moving the alignment problem around instead of solving it.

A mainstream vector in AI safety research tries to address the alignment problem by using data from humans to derive suitable objective or reward functions for training agents. This family of approaches includes cooperative inverse reinforcement learning [Hadfield-Menell et al., 2016], learning from binary or ranked preferences [Christiano et al., 2017], iterated amplification and distillation [Christiano et al., 2018], AI safety via debate [Irving et al., 2018], and recursive reward modeling [Leike et al., 2018]. Other approaches attempt to regularize the impact of agents, based on the prior that agents should prefer task solutions that have minimum side effects [Krakovna et al., 2018] or minimally contradict preferences implicit in the initial state of the environment [Shah et al., 2019]. By and large, this vector of safety work aims to eliminate the need for explicitly designing safety specifications, on the grounds that hand-crafted specifications will fail in various ways (eg by omitting to penalize certain unsafe behaviors, or by inadvertantly incentivizing harmful behaviors).

We contend that the use of constraints is compatible with, and complementary to, these data-driven approaches to solving the alignment problem. Straightforwardly, techniques for learning reward functions from human data can also be used to learn cost functions for constraints. Furthermore, we conjecture that the use of constraints may indeed improve 1) the ease with which safety specifications are learned and transferred between tasks, and 2) the robustness with which agents attain those safety requirements.

In support of these conjectures, we note that when learning algorithms exploit priors specific to a problem’s structure, they generally tend to be more sample-efficient, by virtue of searching for solutions in a narrower and more useful set. We regard the partitioning of agent behavior specification into “do”s and “don’t”s (a reward function and constraint functions respectively) to be one such useful prior for safety. Furthermore, whereas a learned reward function for one task may fail to transfer to another (for instance, a reward function for assembling widgets may describe very little about how to assemble doodads), a learned constraint function describing unacceptable behavior seems more likely to transfer successfully. For instance, a cost function for “do not physically strike a human” is relevant regardless of what an agent is tasked with building.

3.4       Remarks on Alternate Approaches

We contend that certain other approaches to safe reinforcement learning without constraints are either insufficient or impractical. Approaches to safety that focus solely on measures of return for a single scalar reward function (where such scalar reward function is kept fixed over the course

of training)—as either monotonic improvement in expected return, a constraint on the variance of return, return above a minimum value, or a risk measure on return—inappropriately conflate task performance specifications and safety specifications, and are therefore inadequate for the reasons described previously. Another common approach that we consider flawed focuses on ergodicity: the agent is considered safe if it never enters into a state it can’t return from, that is, if every mistake is reversible [Moldovan and Abbeel, 2012, Eysenbach et al., 2018]. While this can be a good rule of thumb for safety in some practical cases that arise in robotics, it is irrelevant in many more, as discussed by Pecka and Svoboda [2014]: it rules out irreversible good actions as well as bad.

3.5        Remark on Multi-Objective RL

We note that constrained RL is closely-related to multi-objective RL, and that our arguments for separating concerns between task objective and safety requirements are also applicable to multi- objective RL. We choose to focus on constrained RL because of the natural “shape” of functions for safety requirements: there is generally a saturation point where the safety requirement is satisfied, and further decreasing the value of the function no longer makes the system meaningfully or usefully safer. In the constrained formulation, this corresponds to the constraint threshold; this has no standard equivalent in the multi-objective formulation.

4        Safety Gym

We now introduce Safety Gym, a set of tools for accelerating safe exploration research. Safety Gym consists of two components:

  • an environment-builder that allows a user to create a new environment by mixing and matching from a wide range of physics elements, goals, and safety requirements,
  • and a suite of pre-configured benchmark environments to help standardize the measurement of progress on the safe exploration problem.

We will first give a high-level overview of features and design desiderata for Safety Gym, before diving into deeper explanations and explicitly listing the benchmark environments.

Framework: Safety Gym is implemented as a standalone module that uses the OpenAI Gym [Brockman et al., 2016] interface for instantiating and interacting with RL environments, and the MuJoCo physics simulator [Todorov et al., 2012] to construct and forward-simulate each environment.

Environment Contents: Safety Gym environments and environment elements are inspired by (though not exact simulations of) practical safety issues that arise in robotics control. Each environ- ment has a robot that must navigate a cluttered environment to accomplish a task, while respecting constraints on how it interacts with objects and areas around it. Consistent with our proposal to standardize safe exploration research around the formalism of constrained RL, each Safety Gym environment has separate reward and cost functions, which respectively specify task objectives and safety requirements.

Generalization: Similar to Cobbe et al. [2018], we are concerned that many prior benchmarks for RL (such as the Atari environments [Bellemare et al., 2012] or MuJoCo-Gym [Brockman et al., 2016]) require little-to-no generalization by the agents to succeed; this is of special interest for safety, where robustness to distributional shift is a key issue [Amodei et al., 2016]. We address this by incorporating extensive layout randomization into Safety Gym benchmark environments, so that agents are required to generalize in order to safely navigate and solve tasks: the layout is randomized at the start of each new episode. While we do not explicitly partition our environment layouts into train and test sets like Cobbe et al. [2018], our environment-building tool readily supports extensions of this form, and our pre-configured benchmark environments admit many natural choices of train/test splits.

4.1       Safety Gym Environment-Builder

While we leave detailed documentation of the environment-builder tool for the code repository itself, we will give a brief introduction to its basic use, features, and design principles here.

The environment-builder is implemented as a class, safety_gym.envs.engine.Engine. The user specifies an environment by providing an appropriate configuration dict, eg:

from safety_gym.envs.engine.Engine import Engine
config_dict = ...
env = Engine(config=config_dict)

The user is able to configure a wide variety of environment features, including the robot, the task, the constraints, the observation space, and the layout randomization.

4.1.1       Robot Options and Desiderata

In Safety Gym environments, the agent perceives the world through a robot’s sensors and interacts with the world through its actuators.

Robots are specified through MuJoCo XML files. Safety Gym ships with three pre-made robots that we use in our benchmark environments, but a user could create an environment with a new robot by passing the filepath to its XML in the config for an Engine object. The pre-made robots are:

  • Point: (Fig. 1a.) A simple robot constrained to the 2D-plane, with one actuator for turning and another for moving forward/backwards. This factored control scheme makes the robot particularly easy to control for navigation. Point has a small square in front that makes
Figure 1: Pre-made robots in Safety Gym. These robots are used in our benchmark environments.
Figure 2: Tasks for our environments. From left to right: Goal, Button, Push. In “Goal,” the objective is to move the robot inside the green goal area. In “Button,” the objective is to press the highlighted button (visually indicated with a faint gray cylinder). In “Push,” the objective is to push the yellow box inside of the green goal area.

it both easier to visually determine the robot’s direction, and helps the point push a box element that appears in one of our tasks.

  • Car: (Fig. 1b.) Car is a slightly more complex robot that has two independently-driven parallel wheels and a free rolling rear wheel. Car is not fixed to the 2D-plane, but mostly resides in it. For this robot, both turning and moving forward/backward require coordinating both of the actuators. It is similar in design to simple robots used in education.
  • Doggo: (Fig. 1c.) Doggo is a quadrupedal robot with bilateral symmetry. Each of the four legs has two controls at the hip, for azimuth and elevation relative to the torso, and one in the knee, controlling angle. It is designed such that a uniform random policy should keep the robot from falling over and generate some travel.

All actions for all robots are continuous, and linearly scaled to [-1, +1], which is common for 3D robot-based RL environments and (anecdotally) improves learning with neural nets. Modulo scaling, the action parameterization is based on a mix of hand-tuning and MuJoCo actuator defaults, and we caution that it is not clear if these choices are optimal. Some safe exploration techniques are action-layer interventions, like projecting to the closest predicted safe action [Dalal et al., 2018, Chow et al., 2019], and these methods can be sensitive to action parameterization. As a result, action parameterization may merit more careful consideration than is usually given. Future work on action space design might be to find action parameterizations that respect physical measures we care about—for example, an action space where a fixed distance corresponds to a fixed amount of energy.

4.1.2       Task Options and Desiderata

The Safety Gym environment-builder currently supports three main tasks: Goal, Button, and Push (depicted in Fig. 2). Tasks in Safety Gym are mutually exclusive, and an individual environment can only make use of a single task. Reward functions are configurable, allowing rewards to be either sparse (rewards only obtained on task completion) or dense (rewards have helpful, hand-crafted shaping terms). Task details follow:

  • Goal: (Fig. 2a.) Move the robot to a series of goal positions. When a goal is achieved, the goal location is randomly reset to someplace new, while keeping the rest of the layout the same. The sparse reward component is attained on achieving a goal position (robot enters the goal circle). The dense reward component gives a bonus for moving towards the goal.
  • Button: (Fig. 2b.) Press a series of goal buttons. Several immobile “buttons” are scattered throughout the environment, and the agent should navigate to and press (contact) the currently-highlighted button, which is the goal button. After the agent presses the correct button, the environment will select and highlight a new goal button, keeping everything else fixed. The sparse reward component is attained on pressing the current goal button. The dense reward component gives a bonus for moving towards the current goal button.
Figure 3: Constraint elements used in our environments.
  • Push: (Fig. 2c.) Move a box to a series of goal positions. Like the goal task, a new goal location is drawn each time a goal is achieved. The sparse reward component is attained when the yellow box enters the goal circle. The dense reward component consists of two parts: one for getting the agent closer to the box, and another for getting the box closer to the goal.

The code also includes support for additional debug tasks X, Z, and Circle. These respectively reward the agent for running as far as possible along the x-axis, traveling upwards on the z-axis, and running in a circle (similar to the Circle environments of [Achiam et al., 2017]). The debug tasks are not used in our benchmark environments.

4.1.3       Constraint Options and Desiderata

The Safety Gym environment-builder supports five main kinds of elements relevant to safety require- ments: Hazards, Vases, Pillars, Buttons, and Gremlins (depicted in Fig. 3). These elements in Safety Gym can be mixed and matched freely: the user can add any number of any kind of element to the environment, and can decide for each kind whether the agent is required to satisfy a constraint. At every timestep, the environment will provide a separate cost signal for each kind of unsafe element that has an associated constraint, and an aggregate cost signal reflecting overall interaction with unsafe elements. As discussed earlier, these costs are separate from the task-based reward signal. A few details:

  • Safety Gym environments provide per-state cost functions for use in constraints, but do not specify the choice of constraint function or constraint threshold. We treat these as belonging to the algorithm and the human designer of the system, respectively.
  • The standard Gym API for RL environments produces the following signature for the environment step function:
next_observation, reward, done, info = env.step(action)

We use the same signature, and provide cost information through the info dict. At each timestep, info contains keys of the form cost_{kind}, one for each kind of cost present. (Some unsafe elements have multiple associated kinds of costs.) The info dict also contains a cost key that gives an aggregate cost: either the total cost (sum of all costs) or a binary indicator whether any cost was nonzero.

The constraint elements themselves are:

  • Hazards: (Fig. 3a.) Dangerous areas to avoid. These are circles on the ground that are non-physical, and the agent is penalized for entering them.
  • Vases: (Fig. 3b.) Objects to avoid. These are small blocks that represent fragile objects. The agent is penalized for touching or moving them.
  • Pillars: (Fig. 3d.) Immobile obstacles. These are rigid barriers in the environment, which the agent should not touch.
Figure 4: Visualizations of pseudo-lidar observation spaces. On the left, we see a lidar halo repre- senting the goal pseudo-lidar for this agent. On the right, we see lidar halos representing the goal pseudo-lidar and the box pseudo-lidar.

  • Buttons: (Fig. 3c.) Incorrect goals. When using the “buttons” goal, pressing an incorrect button is penalized.
  • Gremlins: (Fig. 3e.) Moving objects. These are simple moving objects that the agent must avoid contacting. Since they are moving quickly, the agent must stay out of the way of their path of travel.

Although all constraint elements represent things for the agent to avoid, they pose different challenges for the agent by virtue of having different dynamics. To illustrate the contrast: hazards provide no physical obstacle, vases are moveable obstacles, pillars are immovable obstacles, buttons can sometimes be perceived as goals, and gremlins are actively-moving obstacles.

Like reward functions in Safety Gym, cost functions are configurable in various ways; see the code for details. By default, cost functions are simple indicators for whether an unsafe interaction has occured (ct = 1 if the agent has done the unsafe thing, otherwise ct = 0).

4.1.4       Observation Space Options and Desiderata

Observation spaces in Safety Gym are highly configurable. Options for observation space components include standard robot sensors (accelerometer, gyroscope, magnetometer, and velocimeter), joint position and velocity sensors, compasses for pointing to goals, and lidar (where each lidar sensor perceives objects of a single kind). A user can add these to an environment by passing the appropriate configuration flags to the Engine.

A guiding principle in designing the observation space was to try and keep feature values small (ideally mean zero, between -1 and 1) and in the regime where small changes in state cause small changes in observation. For instance, to avoid wrap-around effects from representing angles in degrees or radians, we represented angles θ with (sin θ, cos θ). However, we later found that some observation components would still sometimes take on large values; as a result, algorithmic tricks for shifting and scaling observation values may be useful in practice.

Natural Lidar and Pseudo-Lidar: Lidar observations can be computed using either “natural lidar” or “pseudo-lidar.” Natural lidar is computed using ray-tracing tools in MuJoCo, whereas pseudo-lidar is computed by looping over objects and filling bins with appropriate values. Pseudo-lidar is better- behaved for some object types and so we consider it to be preferred; as a result, all lidar observations in Safety Gym default to pseudo-lidar. If desired, however, a user can change the lidar computation type through a flag to Engine.

Lidar Visualization: To help humans understand what agents are perceiving, when rendering a scene we visualize agents’ lidar observations with nonphysical “lidar halos” that float above the agents. Lidar halos are depicted in Fig. 4.

4.1.5       Layout Randomization Options and Desiderata

A user can configure layout randomization by selecting random placement areas for each object kind. As discussed earlier, the randomization options in Safety Gym allow us to build environments where agents must generalize in order to successfully navigate, solve tasks, and respect safety constraints.

Figure 5: Images of benchmark environments. Top row: Goal environments. Middle row: Button environments. Bottom row: Push environments. In each subfigure, the left column shows the Level 0 environments, the middle column shows the Level 1 environments, and the right column shows the Level 2 environments.
Figure 6: Diversity of generated layouts for the Safexp-PointPush2-v0 env.

4.2       Safety Gym Benchmark Suite

Safety Gym ships with a suite of pre-configured benchmark environments, built using the Safety Gym Engine, for measuring progress on safe exploration. All combinations of robot (Point, Car, and Doggo) and task (Goal, Button, and Push) are represented in the suite; each combination has three levels of difficulty (0, 1, and 2) corresponding to the density of unsafe elements in that environment.

All level 0 environments are unconstrained, and no unsafe elements appear. Level 1 environments have some unsafe elements, and level 2 environments are very dense in unsafe elements. The 18 level 1 and 2 environments are intended for measuring progress on constrained RL, while the 9 level 0 environments allow debugging pure RL.

The full set of environments is depicted in Fig. 5, and described below:

  • Safexp-{Robot}Goal0-v0: A robot must navigate to a goal.
  • Safexp-{Robot}Goal1-v0: A robot must navigate to a goal while avoiding hazards. One vase is present in the scene, but the agent is not penalized for hitting it.
  • Safexp-{Robot}Goal2-v0: A robot must navigate to a goal while avoiding more hazards and vases.
  • Safexp-{Robot}Button0-v0: A robot must press a goal button.
  • Safexp-{Robot}Button1-v0: A robot must press a goal button while avoiding hazards and gremlins, and while not pressing any of the wrong buttons.
  • Safexp-{Robot}Button2-v0: A robot must press a goal button while avoiding more hazards and gremlins, and while not pressing any of the wrong buttons.
  • Safexp-{Robot}Push0-v0: A robot must push a box to a goal.
  • Safexp-{Robot}Push1-v0: A robot must push a box to a goal while avoiding hazards. One pillar is present in the scene, but the agent is not penalized for hitting it.
  • Safexp-{Robot}Push2-v0: A robot must push a box to a goal while avoiding more hazards and pillars.

Environments are instantiated using the OpenAI Gym [Brockman et al., 2016] make function:

import gym, safety_gym
env = gym.make('Safexp-DoggoGoal1-v0')

The layouts of the benchmark environments are randomly rearranged at the start of every episode. We show examples of random layouts in Fig. 6.

All benchmark environments are configured to use dense reward signals and indicator cost functions.

5        Experiments

In this section, we describe our experiments to baseline existing unconstrained and constrained RL algorithms on Safety Gym environments.

5.1       Methods: Evaluation Protocol

Optimization Problem: We evaluate agents based on the optimization problem

where ct is the aggregate indicator cost function for the environment (ct = 1 for an unsafe interaction, regardless of source) and d is a hyperparameter. That is, in our experiments, we use the finite horizon undiscounted return and cumulative cost formulations, and furthermore, we fold all safety requirements into a single constraint.

Metrics: To characterize the task and safety performance of an agent and its training run, we measure the following throughout training:

  • The average episodic return, Jr(θ). The objective function of our optimization problem.
  • The average episodic sum of costs, Jc(θ). The quantity we aim to constrain.
  • The average cost over the entirety of training, ρc (the sum of all costs divided by total number of environment interaction steps). We believe that ρc is a suitable measure of safety regret for a training run.

The choice to measure cost rate instead of total cost or sum of constraint violations is nonobvious and potentially controversial, but we argue that cost rate has several attractive properties. First and foremost, it corresponds directly to safety outcomes: a lower cost rate means that fewer unsafe things happened. By comparison to total cost, cost rate is more intuitive and allows comparisons between training runs of unequal length that are informative (although for very unequal lengths imperfect, since a much longer run could “average away” badness early in training). Because equal sums of costs typically correspond to equal amounts of safety risk, it makes more sense to use a measure like ρc that accounts for all costs throughout training as opposed to measures that only include costs in excess of constraint thresholds. Finally, we observe that the relationship between ρc and approximate constraint satisfaction over training is appealingly simple: if Tep is the average episode length, the condition “the average episode during training satisfied constraints” can be written as ρcTepd.

We acknowledge that cost rate is not a perfect measure. For instance, a training run with high- amplitude oscillations in cost signal could have an equal cost rate to a training run with a constant cost per trajectory, but due to instability, the former is clearly less desirable than the latter. But we believe that on balance, cost rate is a good measure that usefully characterizes safety regret.

Comparing Training Runs: There are several ways to rank agents and training runs based on these measurements, and different comparison rules will be appropriate for different situations. However, we highlight a few common rules that guide our discussion:

  • All agents that fail to satisfy constraints are strictly worse than all agents that satisfy constraints.
  • For two constraint-satisfying agents A1 and A2 that have been trained for an equal number of environment interactions, A1 dominates A2 (A1 >- A2) if it strictly improves on either return or cost rate and does at least as well on the other. That is,

Comparing Algorithms: Although we have so far described how to compare two agents in a single environment, we still need a rule for comparing the aggregate performance of algorithms across many environments. In our analysis, we compare algorithms by looking at normalized performance metrics averaged over Safety Gym environments and random seeds.

We assign each environment E a set of characteristic metrics, J E , J E , ρE (all strictly positive), and compute normalized return J¯r(θ), normalized constraint violation M¯ c(θ), and normalized cost rate ρ¯c(θ) for a training run in E according to:

Characteristic metrics for each environment were obtained from our experimental data as the final metrics3 of our unconstrained PPO implementation.

We compare normalized scores like we would compare individual training runs: the average constraint violation should be zero (or within noise of zero), and among approximately constraint-satisfying algorithms, one algorithm dominates another if it does better on both average normalized return and average normalized cost rate.

We report average normalized scores for various sets of environments:


The set of all nine level 1 Safety Gym environments.
The set of all nine level 2 Safety Gym environments.
A group of six environments designed to contain one of each kind of Safety Gym environment: PointGoal1, PointGoal2, PointButton1, Point- Push1, CarGoal1, and DoggoGoal1. SG6 has at least one environment for each task, robot, and level.
The full slate of all eighteen environments with constraints in Safety Gym.
All six Point robot environments with constraints in Safety Gym.
All six Car robot environments with constraints in Safety Gym.                         
All six Doggo robot environments with constraints in Safety Gym.

Because training on the full slate SG18 with multiple seeds per environment is computationally taxing, we recommend SG6 as a basic slate for constrained RL research on a limited compute budget.

5.2       Methods: Algorithms

The unconstrained algorithms we evaluate are TRPO [Schulman et al., 2015] and PPO [Schulman et al., 2017], where the reward function contains no information about the auxiliary costs. Our PPO version is based on Spinning Up in Deep RL [OpenAI and Achiam, 2018], which uses early stopping instead of other regularizers that typically appear in PPO implementations. For constrained algorithms, we evaluate

  • Lagrangian methods: Lagrangian methods use adaptive penalty coefficients to enforce constraints. With f (θ) the objective and g(θ) ≤ 0 the constraint, Lagrangian methods solve the equivalent unconstrained max-min optimization problem

by gradient ascent on θ and descent on λ. We combine the Lagrangian approach with TRPO and PPO to obtain TRPO-Lagrangian and PPO-Lagrangian.

Table 1: Normalized metrics from the conclusion of training averaged over the SG18 slate of environments and three random seeds per environment.
  • Constrained Policy Optimization [Achiam et al., 2017]: CPO analytically solves trust region optimization problems at each policy update to enforce constraints throughout training. It is closely-connected to the θ-projection approach of Chow et al. [2019]. Unlike Achiam et al. [2017], we omit the learned failure predictor they used for cost shaping.

Hyperparameters: All experiments use separate feedforward MLP policy and value networks of size (256, 256) with tanh activations. In all constrained cases, we set d = 25 for the expected cost limit. Experiments for Point and Car robots used batch sizes of 30, 000 environment interaction steps, and experiments for Doggo used 60, 000. Point and Car agents were trained for 107 steps, and Doggo agents were trained for 108 steps. All episodes are length Tep = 1000, and so the value of cost rate corresponding to approximate constraint satisfaction throughout training is ρc = d/Tep = 0.025.

We hand-tuned hyperparameters for each algorithm class to attain reasonable performance. However, we caution that our hand-tuning should not be viewed as indicative of the best-possible performance of each algorithm class.

All experiments were run with three random seeds.

5.3       Results

In Figures 7, 8, and 9, we show learning curves from evaluating unconstrained and constrained RL algorithms on the constrained Safety Gym environments. These learning curves depict the metrics Jr(θ), Jc(θ), and ρc(θ) without normalization, and show the absolute performance of each algorithm. In Tables 1 and 2, we report normalized metrics from the end of training averaged over various sets of environments. The normalized values allow easy comparison to a reference point (in this case, unconstrained PPO).

We observe a few general trends:

  • Costs and rewards trade off against each other meaningfully. Unconstrained RL algorithms are able to score high returns by taking unsafe actions, as measured by the cost function.
  • Constrained RL algorithms attain lower levels of return, and correspondingly maintain desired levels of costs.
  • The design decision to make Level 2 Safety Gym environments denser in unsafe elements than Level 1 environments is reflected by the jump in average episodic cost for unconstrained agents.
  • It appears to be the case that approximation errors in CPO prevent it from fully satisfying constraints on virtually all of these environments. We infer that these environments are harder than ones where CPO has previously been tested. By contrast, Lagrangian methods more-or-less reliably enforce constraints, despite approximation errors. This contradicts the result from Achiam et al. [2017].
  • Lagrangian methods are able to find constraint-satisfying policies that attain nontrivial returns in several of the Point environments, demonstrating that when controlling for challenges in learning robot locomotion, it is possible to make progress on, or even solve these environments with constrained RL.
  • Standard RL is able to control the Doggo robot and acquire complex locomotion behavior, as indicated by high returns in the environments when trained without constraints. However, despite the success of constrained RL when locomotion requirements are absent, and the success of standard RL when locomotion is needed, the constrained RL algorithms we investigated struggle to learn safe locomotion policies. Additional research is needed to develop constrained RL algorithms that can solve these challenging tasks.
Table 2: Normalized metrics from the conclusion of training averaged over various slates of environments and three random seeds per environment.
Figure 7: Results from benchmarking unconstrained and constrained RL algorithms on all Point level 1 and 2 environments. Dashed red lines indicate the target value for a constraint-satisfying policy (AverageEpCost curves) or approximately constraint-satisfying training run (CostRate curves).
Figure 8: Results from benchmarking unconstrained and constrained RL algorithms on all Car level 1 and 2 environments. Dashed red lines indicate the target value for a constraint-satisfying policy (AverageEpCost curves) or approximately constraint-satisfying training run (CostRate curves).
Figure 9: Results from benchmarking unconstrained and constrained RL algorithms on all Doggo level 1 and 2 environments. Dashed red lines indicate the target value for a constraint-satisfying policy (AverageEpCost curves) or approximately constraint-satisfying training run (CostRate curves).

6        Conclusions

In this work, we took three main steps towards progress on the safe exploration problem. We proposed to standardize constrained RL as the main formalism for safe exploration, setting clear goals for progress in terms of constraint satisfaction at the end of training and constraint violation regret throughout training. We introduced Safety Gym, the first benchmark of high-dimensional continuous control environments for evaluating the performance of constrained RL algorithms. And finally, we evaluated baseline unconstrained and constrained RL algorithms on Safety Gym environments to partially clarify the current state of the art in safe exploration.

There are a number of avenues we consider promising for future work.

Advancing SOTA on Safety Gym: Our baseline results for constrained RL indicate a need for stronger and/or better-tuned algorithms to succeed on Safety Gym environments. By success, we mean attaining improvements simultaneously along both the episodic return axis and the constraint regret axis, while still producing a constraint-satisfying policy at the conclusion of training. It is possible that existing techniques for constrained RL that were not explored in this work may make progress here, however, we expect that substantial and consistent performance gains will require new insights. We note that standard model-free RL approaches without replay buffers are fundamentally limited in their ability to minimize constraint regret: they must continually experience unsafe events in order to learn about them. As a result, we consider memory-based and model-based RL approaches to be particularly interesting here.

Safe Transfer Learning: We recommend the use of Safety Gym tools to investigate two problems related to safe transfer and distributional shift in the constrained RL setting.

  • Problem 1: An agent is initially trained in one constrained RL environment, and then transferred to another environment where the task is the same but the safety requirements are different. In this setting, the safety concern is whether the agent can quickly adapt to the new safety requirements.
  • Problem 2: An agent is initially trained in one constrained RL environment, and then transferred to another environment where the safety requirements are the same but the task is different. In this setting, the safety concern is whether the agent can remain constraint-satisfying despite the potential for catastrophic forgetting induced by the change in objective function.

Problem 1 can be investigated with unmodified Safety Gym benchmark environments, using the Level 1 and 2 versions of each task as (First Environment, Second Environment) pairs. New environments can easily be created for both problems using the Safety Gym Engine tool.

Constrained RL with Implicit Specifications: RL with implicitly-specified objectives is a research sub-field with important consequences for safety, encompassing inverse reinforcement learning, learning from human preferences, and other heuristics for extracting value-aligned objectives from human data. As we discussed earlier, these techniques are complementary to and compatible with constrained RL, and thus we recommend research in the direction of combining them. Safety Gym benchmark environments can be used to study whether such combination techniques are efficient at training agents to satisfy implicitly-specified safety requirements.


J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained Policy Optimization. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 22–31, International Convention Centre, Sydney, Australia, 2017. PMLR. URL

E. Altman. Constrained Markov Decision Processes. page 260, 1999. ISSN 01676377. doi: 10.1016/0167-6377(96)00003-X.

D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete Problems in AI Safety. 2016.

C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green,

V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney,

H. King, D. Hassabis, S. Legg, and S. Petersen. DeepMind Lab. dec 2016. URL http://arxiv. org/abs/1612.03801.

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research, jul 2012. doi: 10.1613/jair.3912. URL 1613/jair.3912.

F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause. Safe Model-based Reinforcement Learning with Stability Guarantees. 2017.

T. W. Bickmore, H. Trinh, S. Olafsson, T. K. O’Leary, R. Asadi, N. M. Rickles, and R. Cruz. Patient and Consumer Safety Risks When Using Conversational Assistants for Medical Information: An Observational Study of Siri, Alexa, and Google Assistant. Journal of medical Internet research, 20(9):e11510, sep 2018. ISSN 1438-8871. doi: 10.2196/11510. URL http://www.


S. Bohez, A. Abdolmaleki, M. Neunert, J. Buchli, N. Heess, and R. Hadsell. Value constrained model- free continuous control. Technical report, 2019. URL

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. jun 2016. URL

Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Risk-Constrained Reinforcement Learning with Percentile Risk Criteria. Journal of Machine Learning Research, 1(xxxx):1–49, 2015.

Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A Lyapunov-based Approach to Safe Reinforcement Learning.  In S. Bengio, H. Wallach, H. Larochelle, K. Grauman,

N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8092–8101. Curran Associates, Inc., 2018. URL 8032-a-lyapunov-based-approach-to-safe-reinforcement-learning.pdf.

Y. Chow, O. Nachum, A. Faust, E. Duenez-Guzman, and M. Ghavamzadeh. Lyapunov-based Safe Policy Optimization for Continuous Control. 2019.

P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. jun 2017. URL

P. Christiano, B. Shlegeris, and D. Amodei. Supervising strong learners by amplifying weak experts. oct 2018. URL

J. Clark and D. Amodei. Faulty Reward Functions in the Wild, 2016. URL blog/faulty-reward-functions/.

K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman. Quantifying Generalization in Reinforce- ment Learning. 2018.

G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa. Safe Exploration in Continuous Action Spaces. 2018.

DeepMind.               AlphaStar:          Mastering     the     Real-Time     Strategy     Game     StarCraft

II.             DeepMind,      pages     1–16,     2019.             URL alphastar-mastering-real-time-strategy-game-starcraft-ii/.

B. Eysenbach, S. Gu, J. Ibarz, and S. Levine. Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning. In International Conference on Learning Representations, 2018. URL

J. F. Fisac, N. F. Lugovoy, V. Rubies-Royo, S. Ghosh, and C. J. Tomlin. Bridging Hamilton-Jacobi Safety Analysis and Reinforcement Learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 8550–8556. IEEE, may 2019. ISBN 978-1-5386-6027-0. doi: 10.1109/ICRA.2019.8794107. URL

C. Gamble and J. Gao.                           Safety-first AI for autonomous data centre cool- ing and industrial control, 2018.                                                                      URL safety-first-ai-autonomous-data-centre-cooling-and-industrial-control/.

J. García and F. Fernández. A Comprehensive Survey on Safe Reinforcement Learning. Journal of Machine Learning Research, 16:1437–1480, 2015. ISSN 15337928.

J. Gauci, E. Conti, Y. Liang, K. Virochsiri, Y. He, Z. Kaden, V. Narayanan, and X. Ye. Horizon: Facebook’s Open Source Applied Reinforcement Learning Platform. Technical report, 2018. URL

C. Gehring and D. Precup. Smart Exploration in Reinforcement Learning Using Absolute Temporal Difference Errors. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS ’13, pages 1037–1044, Richland, SC, 2013. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 978-1-4503-1993-5. URL

D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan. Cooperative Inverse Reinforcement Learning. In NIPS 2016, pages 3909–3917, 2016. URL 6420-cooperative-inverse-reinforcement-learning.

A. Hans, D. Schneegaß, A. M. Schäfer, and S. Udluft. Safe Exploration for Reinforcement Learning. In European Symposium on Artificial Neural Networks – Advances in Computational Intelligence and Learning, 2008. ISBN 2930307080. URL 7e9db2ae248d1254107852311117c4cda1c9.pdf.

G. Irving, P. Christiano, and D. Amodei. AI safety via debate. may 2018. URL http://arxiv. org/abs/1805.00899.

A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V.-D. Lam, A. Bewley, and A. Shah. Learning to Drive in a Day. jul 2018. URL

Z. Kenton, A. Filos, O. Evans, and Y. Gal. Generalizing from a few environments in safety-critical reinforcement learning. 2019.

V. Krakovna, L. Orseau, R. Kumar, M. Martic, and S. Legg. Penalizing side effects using stepwise relative reachability. 2018.

J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq, L. Orseau, and S. Legg. AI Safety Gridworlds. 2017.

J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini, and S. Legg. Scalable agent alignment via reward modeling: a research direction. nov 2018. URL

Z. C. Lipton, K. Azizzadenesheli, A. Kumar, L. Li, J. Gao, and L. Deng. Combating Reinforcement Learning’s Sisyphean Curse with Intrinsic Fear. 2016.

K. Mason and S. Grijalva. A Review of Reinforcement Learning for Autonomous Building Energy Management. mar 2019. URL

T. M. Moldovan and P. Abbeel. Safe Exploration in Markov Decision Processes. 2012.

NASA. NPR 8715.3D: NASA General Safety Program Requirements, 2017. URL https://nodis3.{_}ID=N{_}PR{_}8715{_}003D{_}.

OpenAI.  How to Train Your OpenAI Five, 2019.  URL how-to-train-your-openai-five/.

OpenAI and J. Achiam. Spinning Up in Deep RL, 2018. URL https://spinningup.openai. com/en/latest/.

OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron,

M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and

W. Zaremba. Learning Dexterous In-Hand Manipulation. aug 2018. URL abs/1808.00177.

M. Papini, M. Pirotta, and M. Restelli. Smoothing Policies and Safe Policy Gradients. 2019.

M. Pecka and T. Svoboda. Safe Exploration Techniques for Reinforcement Learning – An Overview. 2014. doi: 10.1007/978-3-319-13823-7_31.

T.-H. Pham, G. De Magistris, and R. Tachibana. OptLayer – Practical Constrained Optimization for Deep Reinforcement Learning in the Real World. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6236–6243. IEEE, may 2018. ISBN 978-1-5386- 3081-5. doi: 10.1109/ICRA.2018.8460547. URL 8460547/.

M. Pirotta, M. Restelli, A. Pecorino, and D. Calandriello. Safe Policy Iteration. In S. Dasgupta and

D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 307–315, Atlanta, Georgia, USA, 2013. PMLR.  URL

M. Rausand. Reliability of Safety-Critical Systems : Theory and Applications. Wiley, 2014. ISBN 9781118112724. URL Critical+Systems{%}3A+Theory+and+Applications-p-9781118112724.

D. N. E. C. o. t. S. S. S. Rice. System Safety : A Science and Technology Primer. Technical report, The New England Chapter of the System Safety Society, 2002. URL http://www.system-safety. org/resources/SS{_}primer{_}4{_}02.pdf.

F. Richter, R. K. Orosco, and M. C. Yip. Open-Sourced Reinforcement Learning Environments for Surgical Robotics. mar 2019. URL

W. Saunders, G. Sastry, A. Stuhlmueller, and O. Evans. Trial without Error: Towards Safe Reinforce- ment Learning via Human Intervention. 2017.

J. Schulman, P. Moritz, M. Jordan, and P. Abbeel. Trust Region Policy Optimization. International Conference on Machine Learning, 2015.

J. Schulman, F. Wolski, and P. Dhariwal. Proximal Policy Optimization Algorithms. CoRR, pages 1–10, 2017. URL

R. Shah, D. Krasheninnikov, J. Alexander, P. Abbeel, and A. Dragan. Preferences Implicit in the State of the World. In International Conference on Learning Representations, 2019. URL

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser,

I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner,

I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016. ISSN 0028-0836. doi: 10.1038/nature16961. URL

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre,

D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. dec 2017a. URL

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker,

M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676): 354–359, oct 2017b. ISSN 0028-0836. doi: 10.1038/nature24270. URL http://www.nature. com/articles/nature24270.

R. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018. URL

Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel,

A. Lefrancq, T. Lillicrap, and M. Riedmiller. DeepMind Control Suite. jan 2018. URL http:


E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In IEEE International Conference on Intelligent Robots and Systems, 2012. ISBN 9781467317375. doi: 10.1109/IROS.2012.6386109.