Planning an optimal route in a complex environment requires efficient
reasoning about the surrounding scene. While human drivers prioritize important
objects and ignore details irrelevant to their decisions, learning-based planners
typically extract features from dense, high-dimensional grid representations that encode the full vehicle and
road context information. In this paper, we propose PlanT,
a novel approach for planning in the context of self-driving that uses a standard
transformer architecture. PlanT is based on imitation learning with a compact
object-level input representation. On the Longest6 benchmark for CARLA, PlanT
outperforms all prior methods (matching the driving score of the expert) while
being 5.3× faster than equivalent pixel-based planning baselines during inference.
Combining PlanT with an off-the-shelf perception module yields a sensor-based driving system that
improves the driving score by more than 10 points
over the existing state of the art. Furthermore, we propose an evaluation protocol
to quantify the ability of planners to identify relevant objects, providing insights
regarding their decision-making. Our results indicate that PlanT can focus on the
most relevant object in the scene, even when this object is geometrically distant.
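
To make the object-level design concrete, below is a minimal sketch of how such a planner could be structured: each vehicle or route segment is encoded as a small attribute vector, linearly projected to a token, processed by a standard transformer encoder, and a learned planning token is read out to regress future waypoints. The attribute layout (x, y, yaw, length, width, speed), the [PLAN] readout token, and the direct waypoint head are illustrative assumptions for this sketch, not the exact PlanT implementation.

    # Minimal sketch of an object-level transformer planner (assumptions noted above).
    import torch
    import torch.nn as nn

    class ObjectLevelPlanner(nn.Module):
        def __init__(self, attr_dim=6, d_model=128, n_heads=4, n_layers=4, n_waypoints=4):
            super().__init__()
            self.embed = nn.Linear(attr_dim, d_model)        # project object attributes to tokens
            self.type_embed = nn.Embedding(2, d_model)       # 0 = vehicle, 1 = route segment
            self.plan_token = nn.Parameter(torch.zeros(1, 1, d_model))  # learned [PLAN] readout token
            layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=4 * d_model,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, 2 * n_waypoints)  # (x, y) of each future waypoint
            self.n_waypoints = n_waypoints

        def forward(self, attrs, types):
            # attrs: (B, N, attr_dim) object attributes; types: (B, N) integer type ids
            tokens = self.embed(attrs) + self.type_embed(types)
            tokens = torch.cat([self.plan_token.expand(attrs.size(0), -1, -1), tokens], dim=1)
            feats = self.encoder(tokens)
            wp = self.head(feats[:, 0])                      # read out the planning token
            return wp.view(-1, self.n_waypoints, 2)

    # Toy usage: one scene with three vehicles and two route segments.
    planner = ObjectLevelPlanner()
    attrs = torch.randn(1, 5, 6)
    types = torch.tensor([[0, 0, 0, 1, 1]])
    waypoints = planner(attrs, types)                        # (1, 4, 2) waypoints in ego coordinates

In such a setup, the attention weights from the planning token to the individual object tokens could serve as a proxy for object relevance, in the spirit of the evaluation protocol mentioned above; extracting them from nn.TransformerEncoder requires hooks or a custom attention layer and is omitted here for brevity.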
This work was supported by the BMWi (KI Delta Learning, project number: 19A19013O), the
BMBF (Tübingen AI Center, FKZ: 01IS18039A), the DFG (SFB 1233, TP 17, project number:
276693517), by the ERC (853489 - DEXIM), and by EXC (number 2064/1 – project number
390727645). We thank the International Max Planck Research School for Intelligent Systems
(IMPRS-IS) for supporting K. Renz, K. Chitta and O.-B. Mercea. The authors also thank Niklas
Hanselmann and Markus Flicke for proofreading and Bernhard Jaeger for helpful discussions.