REvolve: Reward Evolution with Large Language Models using Human Feedback

Centre for Applied Autonomous Sensor Systems (AASS), Örebro University, Sweden
*equal contribution

Agents trained with REvolve-designed rewards: [Left] Autonomous Driving; [Middle] Humanoid Locomotion; [Right] Adroit Hand Manipulation.

Abstract

Designing effective reward functions is crucial to training reinforcement learning (RL) algorithms. However, this design is non-trivial, even for domain experts, due to the subjective nature of certain tasks that are hard to quantify explicitly. In recent works, large language models (LLMs) have been used for reward generation from natural language task descriptions, leveraging their extensive instruction tuning and commonsense understanding of human behavior. In this work, we hypothesize that LLMs, guided by human feedback, can be used to formulate reward functions that reflect human implicit knowledge. We study this in three challenging settings – autonomous driving, humanoid locomotion, and dexterous manipulation – wherein notions of good behavior are tacit and hard to quantify. To this end, we introduce REvolve, a truly evolutionary framework that uses LLMs for reward design in RL. REvolve generates and refines reward functions by utilizing human feedback to guide the evolution process, effectively translating implicit human knowledge into explicit reward functions for training (deep) RL agents. Experimentally, we demonstrate that agents trained on REvolve-designed rewards outperform other state-of-the-art baselines.

REvolve Overview


Given the task of autonomous driving and abstracted environment variables, a reward designer \(G\) (an LLM) outputs a population of reward functions, each of which is used to train an AD policy \(\pi(R)\) in a driving simulation. We then collect human preferences and natural language feedback on pairs of policy rollouts \(\theta \sim \Theta_{\pi(R)}\) through a human feedback interface. The fitness \(\sigma\) of each policy (and, thus, of its corresponding reward function) is calculated, and the fittest individuals, along with their NL feedback \(\lambda\), are refined by \(G\). The process leverages genetic programming for evolution. The flames symbolize trainable parameters.
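
For concreteness, the snippet below sketches the kind of abstracted environment variables a generated reward function operates on. The state fields and the function signature are illustrative assumptions for exposition, not the exact abstraction exposed to the reward designer in the paper.

from dataclasses import dataclass

@dataclass
class DrivingState:
    # Illustrative abstracted environment variables; the actual set of
    # variables exposed to the reward designer G may differ.
    speed: float                    # current speed (m/s)
    target_speed: float             # desired cruising speed (m/s)
    steering: float                 # current steering angle
    prev_steering: float            # steering angle at the previous timestep
    distance_to_lane_center: float  # lateral offset from the lane center (m)
    distance_covered: float         # progress along the route this step (m)
    collided: bool                  # collision flag from the simulator

def generated_reward(state: DrivingState) -> float:
    """Placeholder signature for a reward function R produced by the designer G."""
    # A generated R maps such variables to a scalar reward, e.g. rewarding
    # progress while penalizing jerky steering and collisions.
    raise NotImplementedError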

REvolve offers several key advantages:
(1) Using Evolutionary Algorithms (EAs) for Reward Design. Traditional gradient-based optimization is unsuitable for reward design due to the lack of a differentiable cost function. Instead, REvolve uses meta-heuristic optimization through EAs. Our evolutionary framework considerably outperforms iterative frameworks like Eureka, without additional computational cost.
(2) Utilizing human feedback to guide the search. Human preference data is mapped directly into fitness scores, effectively allowing humans to serve as fitness functions (a minimal sketch of one such mapping follows this list).
(3) Eliminating the need for additional reward model training. Unlike RLHF, REvolve requires no separate reward model, and the output reward functions are interpretable.
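
To make advantage (2) concrete, here is a minimal sketch of how pairwise human preferences could be mapped to fitness scores, assuming a simple win-rate aggregation. The Preference structure and the helper name fitness_from_preferences are illustrative; the paper's exact aggregation scheme may differ.

# Minimal sketch: mapping pairwise human preferences to fitness scores.
# The data structure and the win-rate aggregation below are illustrative
# assumptions, not the exact scheme used in the paper.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Preference:
    winner: int   # index of the preferred policy (reward individual)
    loser: int    # index of the other policy in the pair

def fitness_from_preferences(preferences, num_individuals):
    """Return a fitness score in [0, 1] per individual as its pairwise win rate."""
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for p in preferences:
        wins[p.winner] += 1
        comparisons[p.winner] += 1
        comparisons[p.loser] += 1
    return [
        wins[i] / comparisons[i] if comparisons[i] > 0 else 0.0
        for i in range(num_individuals)
    ]

# Example: individual 0 beats 1 twice and loses to 2 once.
prefs = [Preference(0, 1), Preference(0, 1), Preference(2, 0)]
print(fitness_from_preferences(prefs, 3))  # [0.666..., 0.0, 1.0]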




Illustration of how GPT-4 applies mutation and crossover to reward functions (a sketch of how such operators can be prompted follows below). Mutation (left): shows the modification of the "smoothness reward" component; a red '-' sign marks the line removed from the parent reward function, while a green '+' sign marks the line added to the new reward function. Crossover (right): demonstrates how parent reward functions are combined into a child reward function that incorporates the most effective reward components from each parent.
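
The following is a minimal sketch of how such genetic operators can be realized by prompting an LLM over reward-function source code. The prompt wording and the use of the OpenAI chat completions API are assumptions for illustration; the actual prompt templates used by REvolve may differ.

# Minimal sketch of LLM-driven genetic operators on reward-function source code.
# The prompts below are illustrative assumptions, not the paper's templates.
from openai import OpenAI

client = OpenAI()

def llm_mutate(parent_code: str, feedback: str) -> str:
    """Ask the LLM to modify one reward component, guided by human NL feedback."""
    prompt = (
        "Here is a reward function for autonomous driving:\n"
        f"{parent_code}\n\n"
        f"Human feedback on the resulting policy: {feedback}\n"
        "Mutate ONE reward component to address the feedback. "
        "Return only the full, runnable Python function."
    )
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def llm_crossover(parent_a: str, parent_b: str) -> str:
    """Ask the LLM to combine the strongest components of two parents."""
    prompt = (
        "Combine the most effective reward components of these two reward "
        f"functions into a single child function:\n\nParent A:\n{parent_a}\n\n"
        f"Parent B:\n{parent_b}\n\nReturn only the full, runnable Python function."
    )
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content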

The four main steps in REvolve are (a pseudocode sketch of the loop follows this list):
Initialization: We start by initializing a reward database with K reward function individuals generated by GPT-4.
Reproduction: Each successive generation of K individuals is created by applying genetic operators (crossover and mutation) to individuals in the existing reward database.
Selection: Newly reproduced individuals are retained based on their fitness scores, following a survival-of-the-fittest approach. To compute the fitness score \(\sigma\), we ask human evaluators to judge rollouts from different policies in a pairwise fashion.
Termination: The evolutionary process repeats until it reaches a predetermined number of generations \(N\), or until an individual attains a fitness score equivalent to human-level driving performance.
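
Putting the four steps together, the pseudocode below sketches one possible realization of the REvolve loop, reusing the fitness_from_preferences helper sketched earlier. The callables generate_rewards, train_policy, collect_preferences, and reproduce are hypothetical stand-ins for the components described above, not the actual implementation.

# One possible realization of the REvolve loop under simplifying assumptions.
def revolve(generate_rewards, train_policy, collect_preferences, reproduce,
            K: int, N: int, human_level: float):
    # Initialization: GPT-4 proposes K candidate reward functions.
    database = generate_rewards(K)              # list of reward-function programs
    best_reward, best_fitness = None, float("-inf")
    for generation in range(N):
        # Train one RL policy per reward function, then gather pairwise human
        # preferences and natural-language feedback on policy rollouts.
        policies = [train_policy(reward) for reward in database]
        preferences, nl_feedback = collect_preferences(policies)
        fitness = fitness_from_preferences(preferences, len(database))
        # Track the fittest individual found so far.
        top = max(range(len(database)), key=lambda i: fitness[i])
        if fitness[top] > best_fitness:
            best_reward, best_fitness = database[top], fitness[top]
        # Termination: stop once an individual matches human-level performance.
        if best_fitness >= human_level:
            break
        # Selection + Reproduction: the fittest individuals, together with their
        # NL feedback, are refined by GPT-4 via mutation and crossover.
        database = reproduce(database, fitness, nl_feedback, K)
    return best_reward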

Results


How does REvolve fare against the baselines? REvolve exhibits continuous improvement across successive generations, ultimately achieving higher fitness scores than the baselines on the manually designed fitness scale.
How do humans judge REvolve policies? REvolve policies are ranked highest among all baselines.
How do different genetic operators impact overall performance? REvolve with mutation + crossover > REvolve with crossover > REvolve with mutation
Do REvolve-designed reward functions generalize to new environments? In two novel environments -- (Env 1) featuring lanes and a completely altered landscape, and (Env 2) characterized by increased traffic with multiple cars actively maneuvering -- REvolve outperforms expert-designed rewards.






The best reward functions \(R^\ast\) (based on fitness evaluations) from REvolve and REvolve Auto. Each reward function component and its aggregation are interpretable; hence, they can be scrutinized and, if necessary, tweaked to meet safety standards. An illustrative example of such a component-wise reward is sketched below.
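
As a purely illustrative example of what such an interpretable, component-wise reward can look like (not the actual \(R^\ast\) reported above), consider the following sketch, reusing the DrivingState fields introduced earlier; the components and weights are assumptions for exposition.

def compute_reward(state: DrivingState) -> float:
    # Progress: reward distance covered along the route.
    progress_reward = 1.0 * state.distance_covered
    # Smoothness: penalize abrupt steering changes between timesteps.
    smoothness_penalty = -0.5 * abs(state.steering - state.prev_steering)
    # Speed: penalize deviation from the target cruising speed.
    speed_penalty = -0.1 * abs(state.speed - state.target_speed)
    # Lane keeping: penalize lateral offset from the lane center.
    lane_penalty = -0.2 * abs(state.distance_to_lane_center)
    # Safety: a large penalty for any collision.
    collision_penalty = -10.0 if state.collided else 0.0
    # The aggregation is a transparent sum, so each term can be inspected and
    # re-weighted independently, e.g. to meet safety requirements.
    return (progress_reward + smoothness_penalty + speed_penalty
            + lane_penalty + collision_penalty)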

Comparison with other AD frameworks


Comparison of REvolve with existing frameworks used to train AD agents. (a) An expert-designed reward function \(R\) used to train the AD agent; (b) learning from demonstrations (LfD) and learning from interventions (LfI), where the agent is trained to imitate the human; (c) humans acting as reward functions within the training loop by assessing policy rollouts and providing scalar rewards, which requires significant manual effort; (d) reinforcement learning from human feedback (RLHF), where human preference data is used to train an additional (black-box) reward model that serves as a proxy for human rewards when training the AD policy; (e) the proposed REvolve, which uses GPT-4 as a reward function generator \(G\) and evolves the generated functions based on (minimal) human feedback on rollouts sampled from the trained policies. This feedback is incorporated directly into the reward design process, and REvolve outputs interpretable reward functions, thereby avoiding the need to learn an additional reward model. Here, \(\pi \in \Pi\) is a trainable policy in the set of policies \(\Pi\). Trainable blocks are denoted by the flame symbol.

BibTeX

@misc{hazra2024revolverewardevolutionlarge,
      title={REvolve: Reward Evolution with Large Language Models using Human Feedback},
      author={Rishi Hazra and Alkis Sygkounas and Andreas Persson and Amy Loutfi and Pedro Zuidberg Dos Martires},
      year={2024},
      eprint={2406.01309},
      archivePrefix={arXiv},
      primaryClass={cs.NE},
      url={https://arxiv.org/abs/2406.01309},
}