Title: New safety research agenda: scalable agent alignment via reward modeling
Date: 2018-11-20
Author: Victoria Krakovna
Source: LessWrong
Organization: Google DeepMind
Affected people: Jan Leike
Topics: Recursive reward modeling, iterated amplification
Summary: Blog post on LessWrong announcing the recursive reward modeling agenda. Comments in the discussion thread clarify several aspects of the agenda, including its relation to Paul Christiano's iterated amplification agenda, whether the DeepMind safety team is considering the problem of whether the human user is a safe agent, and further details on the alternating quantifiers in the analogy to complexity theory. Jan Leike is listed as an affected person for this document because he is the lead author of the underlying paper, is mentioned in the blog post, and responds to several questions raised in the comments.
Title: Scalable agent alignment via reward modeling: a research direction
Date: 2018-11-19
Authors: Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg
Source: arXiv
Organization: Google DeepMind
Affected people:
Topics: Recursive reward modeling, imitation learning, inverse reinforcement learning, cooperative inverse reinforcement learning, myopic reinforcement learning, iterated amplification, debate
Summary: This paper introduces the (recursive) reward modeling agenda, outlining the approach, its challenges, and ways to address those challenges. The paper also discusses alternative agendas and their relation to reward modeling.