The provided codebase can run all experiments described in the paper, but it also supports functionality that did not make it into the final version. This document describes features that are not part of the paper but could be useful to other people.
We provide a reimplementation of the privileged PlanT planner from the CoRL 2022 paper PlanT: Explainable Planning Transformers via Object-Level Representations. The architecture can be found in the `plant.py` file. The method is trained with `train.py` by setting the argument `--use_plant 1`. To evaluate the method, use the `plant_agent.py` file. The reimplementation is faithful to the core ideas of the method but has small implementation differences; for example, our PlanT also handles stop signs and pedestrians. We were able to reproduce PlanT's expert-level performance with this code. A minimal illustrative sketch of the object-token idea is given below, after the results table.
Performance on training towns (Longest6).
| Method | Driving Score (DS) ↑ | Route Completion (RC) ↑ | Infraction Score (IS) ↑ |
|---|---|---|---|
| PlanT | 81 | 94 | 0.87 |
| PlanT (ours) | 82 | 96 | 0.85 |
| Expert (ours) | 81 | 90 | 0.91 |
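For context, the core object-token idea behind PlanT can be sketched as follows. This is a hypothetical, illustrative model: the token layout, feature dimensions, and hyperparameters are assumptions and do not match the actual `plant.py` implementation.

```python
import torch
import torch.nn as nn

class ObjectTokenPlanner(nn.Module):
    """Illustrative PlanT-style planner: vehicles, pedestrians, stop signs and
    route segments become transformer tokens; the output is a short sequence of
    future waypoints. All hyperparameters here are assumptions."""

    def __init__(self, obj_dim=8, hidden_dim=128, num_waypoints=4):
        super().__init__()
        self.embed = nn.Linear(obj_dim, hidden_dim)   # per-object attributes -> token
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.num_waypoints = num_waypoints
        self.head = nn.Linear(hidden_dim, num_waypoints * 2)  # (x, y) per waypoint

    def forward(self, objects):  # objects: (B, N, obj_dim), e.g. pose, extent, speed, class
        tokens = self.embed(objects)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        encoded = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(encoded[:, 0]).view(-1, self.num_waypoints, 2)

planner = ObjectTokenPlanner()
waypoints = planner(torch.randn(2, 10, 8))  # -> (2, 4, 2) future waypoints
```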
With a learned planner like PlanT, one can do Learning by Cheating and imitate PlanT's predictions with a sensor agent. We provide code for adding PlanT's offline predictions to the dataset; a usage example can be found in `slurm_relabel_dataset.sh`. The code is a neat engineering example of how PyTorch can be repurposed as a multiprocessing library (the trick is sketched below). Early preliminary results with Learning by Cheating applied twice (LBC twice) did not look promising, so we have not used this much.
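The actual relabeling script is not reproduced here, but the underlying trick can be sketched as follows: a `Dataset` does the per-route work inside `__getitem__` (via a hypothetical `relabel_route` function), and the `DataLoader` workers act as a free process pool.

```python
from torch.utils.data import Dataset, DataLoader

def relabel_route(route_dir, planner):
    # Placeholder for the real work: load the recorded route, run the
    # privileged planner on every frame, and write its predictions to disk.
    pass

class RelabelDataset(Dataset):
    """Each 'sample' is one route directory; __getitem__ relabels it as a
    side effect and only returns a dummy value."""

    def __init__(self, route_dirs, planner):
        self.route_dirs = route_dirs
        self.planner = planner

    def __len__(self):
        return len(self.route_dirs)

    def __getitem__(self, idx):
        relabel_route(self.route_dirs[idx], self.planner)
        return 0  # the DataLoader only drives the loop

if __name__ == "__main__":
    loader = DataLoader(RelabelDataset(route_dirs=[], planner=None),
                        batch_size=1, num_workers=8)  # workers = process pool
    for _ in loader:  # iterating over the loader triggers the parallel work
        pass
```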
Imitation learning with temporal inputs in AD is notoriously hard because driving is very smooth in time (for a detailed discussion of the issue, check out our survey). A workaround for this problem is to use only temporal LiDAR data together with the realignment trick, which transforms all point clouds into the coordinate system of the current frame. This removes information about the ego history but keeps information about the histories of other vehicles. Our codebase implements this idea, which can be used by setting `--lidar_seq_len` to a value > 1 (e.g., 6) and `--realign_lidar 1` in the training options.
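The realignment itself is a standard rigid-body transform. A minimal sketch, assuming each frame comes with an ego pose `(x, y, yaw)` in a shared world frame (the function names and pose format are illustrative, not the codebase's API):

```python
import numpy as np

def rotation_2d(yaw):
    return np.array([[np.cos(yaw), -np.sin(yaw)],
                     [np.sin(yaw),  np.cos(yaw)]])

def realign_lidar(points_past, pose_past, pose_now):
    """Transform a past LiDAR sweep (N, 3) from its own ego frame into the
    current ego frame. Poses are (x, y, yaw) in a shared world frame."""
    x_p, y_p, yaw_p = pose_past
    x_n, y_n, yaw_n = pose_now
    out = points_past.copy()
    # past ego frame -> world frame
    out[:, :2] = out[:, :2] @ rotation_2d(yaw_p).T + np.array([x_p, y_p])
    # world frame -> current ego frame
    out[:, :2] = (out[:, :2] - np.array([x_n, y_n])) @ rotation_2d(yaw_n)
    return out

# Example: a sweep recorded 0.5 m behind the current pose gets shifted accordingly.
sweep = np.random.rand(1000, 3)
realigned = realign_lidar(sweep, pose_past=(0.0, 0.0, 0.0), pose_now=(0.5, 0.0, 0.0))
```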
The question now becomes how to process this temporal data. The classic solution in the autonomous driving space is to stack temporal frames along the channel dimension. The code supports this; you simply need to set the `--lidar_architecture` option to an image backbone such as `regnety_032`. However, stacking temporal frames along the channel dimension is known to be a poor architectural choice in the video classification community. We therefore also implement the classic Video ResNet R(2+1)D and the more modern Video Swin Transformer in this repository; these methods treat time as a separate dimension. The implementations are written such that they can be used as if they were TIMM models. To use them, simply set `--lidar_architecture` to `video_resnet18` or `video_swin_tiny`. In preliminary experiments we achieved similar performance with `video_resnet18`, 6 stacked LiDAR sweeps, and realignment as with the single-frame model, which suggests that the temporal model was not suffering from causal confusion. However, video architectures are more computationally expensive than image architectures, so they would need a significant performance improvement to justify the additional compute. Since this was not the case, we stuck with the simpler single-frame model for the paper. Temporal cameras are not supported, though it should not be too hard to extend the code.
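To make the difference between the two input layouts concrete, here is a hedged sketch of how a sequence of BEV LiDAR histograms could be arranged for the two backbone families; the shapes are illustrative assumptions, not the exact shapes used in the codebase.

```python
import torch

B, T, C, H, W = 4, 6, 2, 256, 256        # batch, time steps, histogram channels, BEV grid
lidar_seq = torch.randn(B, T, C, H, W)   # six realigned BEV LiDAR histograms per sample

# Image backbone (e.g. regnety_032): time is folded into the channel dimension,
# so the first convolution sees T * C input channels and temporal order is only implicit.
stacked = lidar_seq.reshape(B, T * C, H, W)      # (4, 12, 256, 256)

# Video backbone (video_resnet18 / video_swin_tiny): time stays a separate dimension
# and the network convolves / attends over it explicitly.
video_input = lidar_seq.permute(0, 2, 1, 3, 4)   # (4, 2, 6, 256, 256) = (B, C, T, H, W)
```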
There are currently two dominant ideas in 3D perception (BEV segmentation). One is to transfer features from perspective space to BEV via the nonlinear global operation of transformers; the other is to transfer perspective features to BEV via a geometric camera model. The first idea, exemplified by BEVFormer, has interesting analogs in the sensor fusion community, namely TransFuser and Interfuser. Variants of the second idea have been employed in recent competitive end-to-end approaches such as CaT, Think Twice, and Mile; sensor fusion then simply amounts to concatenating LiDAR and camera features, since they live in the same coordinate system. This repository implements both ideas: TransFuser on the one hand, and a reproduction of SimpleBEV in `bev_encoder.py` on the other.
Just like TransFuser, the SimpleBEV backbone uses BEV and perspective decoders as auxiliary losses and uses the resulting BEV feature map for downstream planning. The BEV LiDAR image is concatenated with the projected image features and then processed by a CNN in BEV. Global context can be captured by the receptive field of the BEV CNN (or the downstream transformer decoder). The implemented architecture tries to stay faithful to the original BEV segmentation architecture and is not yet tuned for end-to-end driving; for example, SimpleBEV throws away the last block of the CNN, which TransFuser does not. Our preliminary results showed large variance across training seeds, but we think this could be competitive with TransFuser as a sensor fusion approach if tuned properly. We did not investigate this further since the project went in a different direction. The backbone can be used by setting the training option `--backbone bev_encoder`.
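As a hedged illustration of the geometric lifting idea (a sketch in the spirit of SimpleBEV, not the actual `bev_encoder.py` code; all shapes and conventions below are assumptions), one can project the centers of a BEV grid into the camera with known intrinsics and extrinsics and bilinearly sample perspective features at the resulting pixel locations:

```python
import torch
import torch.nn.functional as F

def lift_to_bev(img_feat, intrinsics, cam_from_ego, bev_size=64, bev_range=32.0, z=0.0):
    """img_feat:  (B, C, Hf, Wf) perspective features,
    intrinsics:   (B, 3, 3) camera matrix scaled to the feature-map resolution,
    cam_from_ego: (B, 4, 4) rigid transform from the ego/BEV frame to the camera frame."""
    B, C, Hf, Wf = img_feat.shape

    # Centers of the BEV grid at a fixed height z in the ego frame, as homogeneous points.
    xs = torch.linspace(-bev_range, bev_range, bev_size, device=img_feat.device)
    ys = torch.linspace(-bev_range, bev_range, bev_size, device=img_feat.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    pts = torch.stack([gx, gy, torch.full_like(gx, z), torch.ones_like(gx)], dim=-1)
    pts = pts.view(1, -1, 4).expand(B, -1, -1)                      # (B, bev*bev, 4)

    cam = torch.bmm(pts, cam_from_ego.transpose(1, 2))[..., :3]     # ego -> camera frame
    pix = torch.bmm(cam, intrinsics.transpose(1, 2))                # pinhole projection
    depth = pix[..., 2:3].clamp(min=1e-5)
    uv = pix[..., :2] / depth                                       # pixel coordinates

    # Normalize to [-1, 1] for grid_sample; points behind the camera are masked out.
    u = uv[..., 0] / (Wf - 1) * 2 - 1
    v = uv[..., 1] / (Hf - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).view(B, bev_size, bev_size, 2)
    bev_feat = F.grid_sample(img_feat, grid, align_corners=True)    # (B, C, bev, bev)
    valid = (cam[..., 2] > 0).view(B, 1, bev_size, bev_size)
    return bev_feat * valid  # LiDAR BEV features can then be concatenated along dim=1
```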