REvolve: Reward Evolution with Large Language Models for Autonomous Driving

Centre for Applied Autonomous Sensor Systems (AASS), Örebro University, Sweden
*equal contribution

Autonomous Driving agent trained with REvolve-designed rewards.

Abstract

Designing effective reward functions is crucial to training reinforcement learning (RL) algorithms. However, this design is non-trivial, even for domain experts, due to the subjective nature of certain tasks that are hard to quantify explicitly. In recent works, large language models (LLMs) have been used for reward generation from natural language task descriptions, leveraging their extensive instruction tuning and commonsense understanding of human behavior. In this work, we hypothesize that LLMs, guided by human feedback, can be used to formulate human-aligned reward functions. Specifically, we study this in the challenging setting of autonomous driving (AD), wherein notions of "good" driving are tacit and hard to quantify. To this end, we introduce REvolve, an evolutionary framework that uses LLMs for reward design in AD. REvolve creates and refines reward functions by utilizing human feedback to guide the evolution process, effectively translating implicit human knowledge into explicit reward functions for training (deep) RL agents. We demonstrate that agents trained on REvolve-designed rewards align closely with human driving standards, thereby outperforming other state-of-the-art baselines.

REvolve Overview


Given the autonomous driving task and abstracted environment variables, a reward designer \(G\) (an LLM) outputs a population of reward functions, each used to train an AD policy \(\pi(R)\) in driving simulation. We then collect human preferences and natural-language feedback on pairs of policy rollouts \(\theta \sim \Theta_{\pi(R)}\) through a human feedback interface. The fitness \(\sigma\) of each policy (and thus of its corresponding reward function) is calculated, and the fittest individuals, along with their NL feedback \(\lambda\), are refined by \(G\). The process leverages genetic programming for evolution. The flames symbolize trainable parameters.
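For concreteness, a reward-function individual produced by \(G\) can be thought of as a small Python function over the abstracted environment variables that returns a scalar reward together with its named components. The variable names, weights, and functional forms below are hypothetical; this is a minimal sketch, not an actual REvolve output.

import math

def reward_function(speed, collision, distance_from_center, heading_error, jerk):
    """Hypothetical LLM-generated reward: a weighted sum of interpretable components."""
    # Encourage driving near a target speed of 12 m/s (all constants are illustrative).
    speed_reward = math.exp(-0.1 * abs(speed - 12.0))
    # Penalize drifting from the lane center and misaligned heading.
    lane_reward = math.exp(-2.0 * abs(distance_from_center))
    heading_reward = math.exp(-1.0 * abs(heading_error))
    # Penalize abrupt changes in acceleration for smoother driving.
    smoothness_reward = math.exp(-0.5 * abs(jerk))
    # Large penalty on collision.
    collision_penalty = -10.0 if collision else 0.0

    components = {
        "speed": speed_reward,
        "lane": lane_reward,
        "heading": heading_reward,
        "smoothness": smoothness_reward,
        "collision": collision_penalty,
    }
    return sum(components.values()), components

Because every component is an explicit, named term, a human can inspect or reweight any part of such a function, which is what makes the evolved rewards interpretable.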

REvolve offers several key advantages:
(1) Framing reward design as a search problem. In contrast to Eureka's greedy search, REvolve uses genetic programming to prevent premature convergence without incurring additional computational cost.
(2) Utilizing human feedback to guide the search. Human preference data is mapped directly into fitness scores, effectively allowing humans to serve as fitness functions (see the sketch after this list).
(3) Eliminating the need for additional reward model training. Unlike RLHF, REvolve requires no reward model, and the output reward functions are interpretable.
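To make point (2) concrete, pairwise preferences can be mapped to fitness by treating each comparison as a game and scoring each policy by the fraction of comparisons it wins. The function below is a minimal sketch of one such mapping, not necessarily the exact scheme used in the paper.

from collections import defaultdict

def fitness_from_preferences(preferences):
    """Map pairwise human preferences to per-policy fitness scores.

    `preferences` is a list of (policy_a, policy_b, winner) tuples, where
    `winner` is "a", "b", or "tie". Fitness is the fraction of comparisons a
    policy wins, with ties counting as half a win.
    """
    wins, games = defaultdict(float), defaultdict(int)
    for a, b, winner in preferences:
        games[a] += 1
        games[b] += 1
        if winner == "a":
            wins[a] += 1.0
        elif winner == "b":
            wins[b] += 1.0
        else:  # tie
            wins[a] += 0.5
            wins[b] += 0.5
    return {policy: wins[policy] / games[policy] for policy in games}

# Example: three pairwise judgments over rollouts from two policies.
prefs = [("R1", "R2", "a"), ("R1", "R2", "tie"), ("R2", "R1", "b")]
print(fitness_from_preferences(prefs))  # {'R1': 0.833..., 'R2': 0.166...}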




Illustration of how GPT-4 applies mutation and crossover to reward functions. Mutation (left) shows the modification of the "smoothness reward" component: a red '-' sign marks the line removed from the parent reward function, while a green '+' sign marks the line added to the new reward function. Crossover (right) demonstrates how parent reward functions are combined into a child reward function that incorporates the most effective reward components from each parent.
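To make the genetic operators concrete, the sketch below shows how mutation and crossover could be phrased as prompts to GPT-4. The prompt wording and the query_llm placeholder are illustrative assumptions, not the paper's actual templates or API.

def mutation_prompt(parent_code, human_feedback):
    """Ask the LLM to modify one component of a parent reward function,
    guided by natural-language human feedback."""
    return (
        "You are a reward designer for autonomous driving.\n"
        f"Parent reward function:\n{parent_code}\n\n"
        f"Human feedback on the resulting driving behavior:\n{human_feedback}\n\n"
        "Mutate the function by modifying one reward component to address the "
        "feedback. Return only valid Python code."
    )

def crossover_prompt(parent_a, parent_b):
    """Ask the LLM to combine the most effective components of two parents."""
    return (
        "You are a reward designer for autonomous driving.\n"
        f"Parent reward function A:\n{parent_a}\n\n"
        f"Parent reward function B:\n{parent_b}\n\n"
        "Combine the most effective reward components of A and B into a single "
        "child reward function. Return only valid Python code."
    )

# `query_llm` stands in for a GPT-4 chat call; the name is a placeholder.
# child_code = query_llm(crossover_prompt(code_a, code_b))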

The four main steps in REvolve are as follows (a minimal code sketch of the full loop appears after the list):
Initialization: We start by initializing a reward database with K reward function individuals generated by GPT-4.
Reproduction: Each successive generation of K individuals is created by applying genetic operators (crossover and mutation) to individuals in the existing reward database.
Selection: Newly reproduced individuals are retained based on fitness scores, following a survival-of-the-fittest approach. To compute the fitness score \(\sigma\), we ask human evaluators to judge policy rollouts from different policies in a pairwise fashion.
Termination: The evolutionary process repeats until it reaches a predetermined number of generations \(N\), or until an individual attains a fitness score equivalent to human-level driving performance.
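Putting the four steps together, the evolutionary loop might look as follows. This is a minimal sketch: llm, train_policy, and human_fitness are placeholder callables, and all prompts and hyperparameters are illustrative rather than taken from the paper.

import random

def revolve(llm, train_policy, human_fitness, K=8, N=10, target=1.0):
    """Minimal sketch of REvolve's four steps; all callables and
    hyperparameters are placeholders."""
    # Initialization: the LLM proposes K reward-function individuals (code strings).
    database = [llm("Write a Python reward function for autonomous driving.")
                for _ in range(K)]

    for generation in range(N):
        # Train one RL policy per reward function and roll it out in simulation.
        policies = [train_policy(reward_code) for reward_code in database]

        # Selection: human evaluators judge rollouts pairwise; fitness[i] scores
        # policy i and feedback[i] is its natural-language feedback.
        fitness, feedback = human_fitness(policies)
        if max(fitness) >= target:  # Termination: human-level fitness reached.
            break

        # Reproduction: evolve the fittest individuals with LLM-driven crossover
        # and mutation, conditioned on the collected human feedback.
        ranked = sorted(range(K), key=lambda i: fitness[i], reverse=True)
        parents = [(database[i], feedback[i]) for i in ranked[: K // 2]]
        database = []
        for _ in range(K):
            (code_a, fb_a), (code_b, fb_b) = random.sample(parents, 2)
            database.append(llm(
                "Combine the most effective components of these two reward "
                f"functions and address the human feedback.\n\n"
                f"Function A:\n{code_a}\nFeedback A: {fb_a}\n\n"
                f"Function B:\n{code_b}\nFeedback B: {fb_b}"
            ))
    return database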

Results


How does REvolve fare against the baselines on manually designed fitness scores? REvolve exhibits continuous improvement across successive generations, ultimately achieving a higher fitness score on the manually designed fitness scale.
Do human fitness scores induce human-aligned behavior? REvolve policies secure the second and third Elo ranks, surpassed only by Human Driving (a standard Elo update is sketched after this list).
How does REvolve affect policy learning? REvolve-designed rewards converge to a higher number of episodic steps than expert-designed rewards, signifying a higher rate of successful actions per episode.
Do REvolve-designed reward functions generalize to new environments? In two novel environments -- (Env 1) featuring lanes and a completely altered landscape, and (Env 2) characterized by increased traffic with multiple cars actively maneuvering -- REvolve outperforms expert-designed rewards.
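The Elo ranks above are derived from the same pairwise human judgments. For reference, a standard Elo update for a single comparison looks as follows; the K-factor of 32 and initial ratings of 1000 are conventional defaults, not values from the paper.

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Standard Elo update for a single pairwise comparison.
    `score_a` is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: a REvolve policy and a baseline both start at 1000; the REvolve
# rollout is preferred by the human judge.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)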




Comparison of Human Driving with rollouts from policies trained with REvolve and Eureka-designed reward functions. REvolve-designed rewards lead to no collisions, smoother driving, better lane following, and better turn handling at intersections, compared to Eureka.




Best reward functions \(R^\ast\) (based on fitness evaluations) from REvolve and REvolve Auto. Each reward function component, as well as their aggregation, is interpretable; hence, they can be scrutinized and tweaked, if necessary, to meet safety standards.

Comparison with other AD frameworks


Comparison of REvolve with existing frameworks used to train AD agents. (a.) An expert-designed reward function \(R\) is used to train the AD agent; (b.) learning from demonstrations (LfD) and learning from interventions (LfI), where the agent is trained to imitate the human; (c.) humans acting as reward functions within the training loop, assessing policy rollouts and providing scalar rewards, which requires significant manual effort; (d.) reinforcement learning from human feedback (RLHF), where human preference data is used to train an additional (black-box) reward model, which then serves as a proxy for human rewards when training the AD policy; (e.) the proposed REvolve, which uses GPT-4 as a reward function generator \(G\) and evolves the generated reward functions based on (minimal) human feedback on rollouts sampled from the trained policies. This feedback is incorporated directly into the reward design process. REvolve outputs interpretable reward functions, thereby avoiding the need to learn an additional reward model. Here, \(\pi \in \Pi\) is a trainable policy in the set of policies \(\Pi\). The trainable blocks are denoted by the flame symbol.

BibTeX

@misc{hazra2024revolve,
      title={REvolve: Reward Evolution with Large Language Models for Autonomous Driving},
      author={Rishi Hazra and Alkis Sygkounas and Andreas Persson and Amy Loutfi and Pedro Zuidberg Dos Martires},
      year={2024},
      eprint={2406.01309},
      archivePrefix={arXiv},
      primaryClass={cs.NE}
}