Information for recursive reward modeling

Basic information

Associated people: Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg

Associated organizations: Google DeepMind

Overview

Basic reward modeling has two steps:

The user trains a reward model or reward function to learn their intentions by giving feedback.
The reward model/reward function is used to train a reinforcement learning agent.
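
The two steps above can be illustrated with a minimal toy sketch in Python. Everything in it is an assumption made for illustration and is not taken from the paper: the five discrete outcomes, the simulated user with a hidden target outcome, the Bradley-Terry-style update from pairwise comparisons, and the epsilon-greedy bandit standing in for a reinforcement learning agent.

```python
import math
import random

ACTIONS = [0, 1, 2, 3, 4]  # toy outcomes (hypothetical)


def user_prefers(a, b):
    """Hypothetical user: prefers the outcome closer to a hidden target of 3."""
    return abs(a - 3) < abs(b - 3)


def train_reward_model(n_comparisons=2000, lr=0.1, seed=0):
    """Step 1: fit one score per outcome from pairwise user feedback."""
    rng = random.Random(seed)
    scores = {a: 0.0 for a in ACTIONS}
    for _ in range(n_comparisons):
        a, b = rng.sample(ACTIONS, 2)
        label = 1.0 if user_prefers(a, b) else 0.0
        # Model's current probability that the user prefers a over b.
        p = 1.0 / (1.0 + math.exp(scores[b] - scores[a]))
        scores[a] += lr * (label - p)
        scores[b] -= lr * (label - p)
    return scores


def train_agent(reward_model, n_steps=500, epsilon=0.1, seed=1):
    """Step 2: epsilon-greedy bandit that only ever sees the learned reward."""
    rng = random.Random(seed)
    values = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=lambda a: values[a])  # exploit
        counts[action] += 1
        # Running average of the learned reward for this action.
        values[action] += (reward_model[action] - values[action]) / counts[action]
    return max(ACTIONS, key=lambda a: values[a])


if __name__ == "__main__":
    learned_scores = train_reward_model()
    print("learned scores:", learned_scores)
    print("agent's preferred outcome:", train_agent(learned_scores))
```

The agent never queries the user directly; it is trained only against the learned scores, which is the separation between the two steps.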

Recursive reward modeling keeps this basic setup but adds another agent, itself trained with reward modeling, that helps the user give feedback on outcomes the user cannot easily evaluate unaided.
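
The recursive structure can be sketched schematically in Python. This is a deliberately simplified toy, not the paper's method: "training" is collapsed into picking the candidate that the feedback scores highest, and the tasks (judging a pair of numbers, ordering a list), the helper `train_agent`, and all other names are hypothetical.

```python
from itertools import permutations


def train_agent(candidates, feedback):
    """Stand-in for 'learn a reward model from feedback, then run RL':
    returns an agent that outputs the candidate the feedback scores highest."""
    def agent(task):
        return max(candidates(task), key=lambda outcome: feedback(task, outcome))
    return agent


# Level 0: a task the hypothetical user can evaluate unaided
# (is a pair of numbers in ascending order?).
def user_feedback_level0(pair, answer):
    return 1.0 if answer == (pair[0] <= pair[1]) else 0.0


helper = train_agent(lambda pair: [True, False], user_feedback_level0)


# Level 1: ordering a longer list. The user is assumed unable to judge a whole
# list directly, so they ask the level-0 helper about each adjacent pair and
# count how many pairs the helper reports as ascending.
def assisted_feedback_level1(numbers, proposal):
    adjacent_pairs = zip(proposal, proposal[1:])
    return sum(1.0 for pair in adjacent_pairs if helper(pair))


sorter = train_agent(
    lambda numbers: list(permutations(numbers)),
    assisted_feedback_level1,
)

if __name__ == "__main__":
    print(sorter((3, 1, 4, 2)))
```

The feedback used to train `sorter` is produced with the help of `helper`, an agent trained one level down on a task the user can judge directly.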

Goals of the agenda

Reward modeling aims to solve the “agent alignment problem”, which is the problem of producing agent behavior that accords with the user’s intentions. The agenda aims only to align a single AI agent with a single user, and leaves out problems such as aggregating the preferences of different users.

The paper does not make clear how capable the agent being aligned is meant to be (e.g. whether reward modeling is intended to be able to align superintelligent AI).

Assumptions the agenda makes

AI timelines

No specific assumptions.

Nature of intelligence

No specific assumptions.

Other

The reward modeling paper lists two assumptions:

“We can learn user intentions to a sufficiently high accuracy”
“For many tasks we want to solve, evaluation of outcomes is easier than producing the correct behavior”
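
As an illustration of the second assumption (chosen here, not taken from the paper), consider a toy subset-sum task in Python: checking a proposed answer is a quick comparison, while producing one by brute force may require searching many subsets. The function names `evaluate` and `produce` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def evaluate(numbers, target, subset):
    """Evaluation: check that the subset uses only the given numbers
    and sums to the target."""
    uses_only_given = not (Counter(subset) - Counter(numbers))
    return uses_only_given and sum(subset) == target


def produce(numbers, target):
    """Production: brute-force search over up to 2**len(numbers) subsets."""
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return subset
    return None


if __name__ == "__main__":
    numbers = [3, 34, 4, 12, 5, 2]
    answer = produce(numbers, 9)
    print(answer, evaluate(numbers, 9, answer))
```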

Documents

Title	Publication date	Author	Publisher	Affected organizations	Affected people	Affected agendas	Notes
AI Alignment Podcast: An Overview of Technical AI Alignment with Rohin Shah (Part 2)	2019-04-25	Lucas Perry	Future of Life Institute		Rohin Shah, Dylan Hadfield-Menell, Gillian Hadfield	Embedded agency, Cooperative inverse reinforcement learning, inverse reinforcement learning, deep reinforcement learning from human preferences, recursive reward modeling, iterated amplification	Part two of a podcast episode that goes into detail about some technical approaches to AI alignment.
Scalable agent alignment via reward modeling: a research direction	2018-11-19	Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg	arXiv	Google DeepMind		Recursive reward modeling, Imitation learning, inverse reinforcement learning, Cooperative inverse reinforcement learning, myopic reinforcement learning, iterated amplification, debate	This paper introduces the (recursive) reward modeling agenda, discussing its basic outline, challenges, and ways to overcome those challenges. The paper also discusses alternative agendas and their relation to reward modeling.
New safety research agenda: scalable agent alignment via reward modeling	2018-11-20	Victoria Krakovna	LessWrong	Google DeepMind	Jan Leike	Recursive reward modeling, iterated amplification	Blog post on LessWrong announcing the recursive reward modeling agenda. Some comments in the discussion thread clarify various aspects of the agenda, including its relation to Paul Christiano’s iterated amplification agenda, whether the DeepMind safety team is thinking about the problem of whether the human user is a safe agent, and more details about alternating quantifiers in the analogy to complexity theory. Jan Leike is listed as an affected person for this document because he is the lead author and is mentioned in the blog post, and also because he responds to several questions raised in the comments.