I am a Ph.D. candidate in the Department of Control Science and Engineering at Zhejiang University, where I have been since 2021, under the supervision of Prof. Yong Liu at the APRIL Lab. Prior to this, I obtained my B.Eng from the same department with an honor degree at Chu Kochen Honors College in 2021. Currently, I am interning in ADLab at Shanghai AI Laboratory.
World models envision potential future states based on various ego actions. They embed extensive knowledge about the driving environment, facilitating safe and scalable autonomous driving. Most existing methods primarily focus on either data generation or the pretraining paradigms of world models. Unlike the aforementioned prior works, we propose Drive-OccWorld, which adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving. Specifically, we first introduce a semantic and motion-conditional normalization in the memory module, which accumulates semantic and dynamic information from historical BEV embeddings. These BEV features are then conveyed to the world decoder for future occupancy and flow forecasting, considering both geometry and spatiotemporal modeling. Additionally, we propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model to enable controllable generation and facilitate a broader range of downstream applications. Furthermore, we explore integrating the generative capabilities of the 4D world model with end-to-end planning, enabling continuous forecasting of future states and the selection of optimal trajectories using an occupancy-based cost function. Extensive experiments on the nuScenes dataset demonstrate that our method can generate plausible and controllable 4D occupancy, opening new avenues for driving world generation and end-to-end planning.
2024
Preprint
DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes
Jianbiao Mei, Xuemeng Yang, Licheng Wen, and 7 more authors
Recent advances in diffusion models have improved controllable streetscape generation and supported downstream perception and planning tasks. However, challenges remain in accurately modeling driving scenes and generating long videos. To alleviate these issues, we propose DreamForge, an advanced diffusion-based autoregressive video generation model tailored for 3D-controllable long-term generation. To enhance the lane and foreground generation, we introduce perspective guidance and integrate object-wise position encoding to incorporate local 3D correlation and improve foreground object modeling. We also propose motion-aware temporal attention to capture motion cues and appearance changes in videos. By leveraging motion frames and an autoregressive generation paradigm, we can autoregressively generate long videos (over 200 frames) using a 7-frame model, achieving superior quality compared to the baseline in 16-frame video evaluations. Finally, we integrate our method with the realistic simulation platform DriveArena to provide more reliable open-loop and closed-loop evaluations for vision-based driving agents.
Preprint
Drivearena: A closed-loop generative simulation platform for autonomous driving
Xuemeng Yang*, Licheng Wen*, Yukai Ma*, and 8 more authors
This paper presented DriveArena, the first high-fidelity closed-loop simulation system designed for driving agents navigating in real scenarios. DriveArena features a flexible, modular architecture, allowing for the seamless interchange of its core components: Traffic Manager, a traffic simulator capable of generating realistic traffic flow on any worldwide street map, and World Dreamer, a high-fidelity conditional generative model with infinite autoregression. This powerful synergy empowers any driving agent capable of processing real-world images to navigate in DriveArena’s simulated environment. The agent perceives its surroundings through images generated by World Dreamer and output trajectories. These trajectories are fed into Traffic Manager, achieving realistic interactions with other vehicles and producing a new scene layout. Finally, the latest scene layout is relayed back into World Dreamer, perpetuating the simulation cycle. This iterative process fosters closed-loop exploration within a highly realistic environment, providing a valuable platform for developing and evaluating driving agents across diverse and challenging scenarios. DriveArena signifies a substantial leap forward in leveraging generative image data for the driving simulation platform, opening insights for closed-loop autonomous driving.
NeurIPS
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving
Jianbiao Mei*, Yukai Ma*, Xuemeng Yang, and 8 more authors
In Advances in Neural Information Processing Systems (NeurIPS), 2024
Autonomous driving has advanced significantly due to sensors, machine learning, and artificial intelligence improvements. However, prevailing methods struggle with intricate scenarios and causal relationships, hindering adaptability and interpretability in varied environments. To address the above problems, we introduce LeapAD, a novel paradigm for autonomous driving inspired by the human cognitive process. Specifically, LeapAD emulates human attention by selecting critical objects relevant to driving decisions, simplifying environmental interpretation, and mitigating decision-making complexities. Additionally, LeapAD incorporates an innovative dual-process decision-making module, which consists of an Analytic Process (System-II) for thorough analysis and reasoning, along with a Heuristic Process (System-I) for swift and empirical processing. The Analytic Process leverages its logical reasoning to accumulate linguistic driving experience, which is then transferred to the Heuristic Process by supervised fine-tuning. Through reflection mechanisms and a growing memory bank, LeapAD continuously improves itself from past mistakes in a closed-loop environment. Closed-loop testing in CARLA shows that LeapAD outperforms all methods relying solely on camera input, requiring 1-2 orders of magnitude less labeled data. Experiments also demonstrate that as the memory bank expands, the Heuristic Process with only 1.8B parameters can inherit the knowledge from a GPT-4 powered Analytic Process and achieve continuous performance improvement
TIP
Camera-based 3d semantic scene completion with sparse guidance network
Jianbiao Mei, Yu Yang, Mengmeng Wang, and 5 more authors
Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. And even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on SeamnticKITTI validation with only 12.5 M parameters and 7.16 G training memory.
APIN
Learning spatiotemporal relationships with a unified framework for video object segmentation
Jianbiao Mei, Mengmeng Wang, Yu Yang, and 2 more authors
Video object segmentation (VOS) has made significant progress with matching-based methods, but most approaches still show two problems. Firstly, they apply a complicated and redundant two-extractor pipeline to use more reference frames for cues, increasing the models’ parameters and complexity. Secondly, most of these methods neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames), i.e., they need adequate modeling of spatial-temporal relationships. In this paper, to address the two problems, we propose a unified transformer-based framework for VOS, a compact and unified single-extractor pipeline with strong spatial and temporal interaction ability. Specifically, to slim the common-used two-extractor pipeline while keeping the model’s effectiveness and flexibility, we design a single dynamic feature extractor with an ingenious dynamic input adapter to encode two significant inputs, i.e., reference sets (historical frames with predicted masks) and query frame (current frame), respectively. Moreover, the relationships among different frames and inside every frame are crucial for this task. We introduce a vision transformer to exploit and model both the temporal and spatial relationships simultaneously. By the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on both DAVIS and YouTube-VOS datasets. We also explore potential solutions, such as sequence organizers, to improve the model’s efficiency. On DAVIS17 validation, we achieve 50% faster inference speed with only a slight 0.2% (J&F) drop in segmentation quality.
PRL
LiDAR video object segmentation with dynamic kernel refinement
Jianbiao Mei, Yu Yang, Mengmeng Wang, and 3 more authors
In this paper, we formalize memory- and tracking-based methods to perform the LiDAR-based Video Object Segmentation (VOS) task, which segments points of the specific 3D target (given in the first frame) in a LiDAR sequence. LiDAR-based VOS can directly provide target-aware geometric information for practical application scenarios like behavior analysis and anticipating danger. We first construct a LiDAR-based VOS dataset named KITTI-VOS based on SemanticKITTI, which acts as a testbed and facilitates comprehensive evaluations of algorithm performance. Next, we provide two types of baselines, i.e., memory-based and tracking-based baselines, to explore this task. Specifically, the first memory-based pipeline is built on a space–time memory network equipped with the non-local spatiotemporal attention-based memory bank. We further design a more potent variant to introduce the locality into the spatiotemporal attention module by local self-attention and cross-attention modules. For the second tracking-based baseline, we modify two representative 3D object tracking methods to adapt to LiDAR-based VOS tasks. Finally, we propose a refine module that takes mask priors and generates object-aware kernels, which could boost all the baselines’ performance. We evaluate the proposed methods on the dataset and demonstrate their effectiveness.
Preprint
DQFormer: Towards Unified LiDAR Panoptic Segmentation with Decoupled Queries
Yu Yang*, Jianbiao Mei*, Liang Liu, and 6 more authors
LiDAR panoptic segmentation, which jointly performs instance and semantic segmentation for things and stuff classes, plays a fundamental role in LiDAR perception tasks. While most existing methods explicitly separate these two segmentation tasks and utilize different branches (i.e., semantic and instance branches), some recent methods have embraced the query-based paradigm to unify LiDAR panoptic segmentation. However, the distinct spatial distribution and inherent characteristics of objects(things) and their surroundings(stuff) in 3D scenes lead to challenges, including the mutual competition of things/stuff and the ambiguity of classification/segmentation. In this paper, we propose decoupling things/stuff queries according to their intrinsic properties for individual decoding and disentangling classification/segmentation to mitigate ambiguity. To this end, we propose a novel framework dubbed DQFormer to implement semantic and instance segmentation in a unified workflow. Specifically, we design a decoupled query generator to propose informative queries with semantics by localizing things/stuff positions and fusing multi-level BEV embeddings. Moreover, a query-oriented mask decoder is introduced to decode corresponding segmentation masks by performing masked cross-attention between queries and mask embeddings. Finally, the decoded masks are combined with the semantics of the queries to produce panoptic results. Extensive experiments on nuScenes and SemanticKITTI datasets demonstrate the superiority of our DQFormer framework.
RAL
LiCROcc: Teach Radar for Accurate Semantic Occupancy Prediction using LiDAR and Camera
Yukai Ma*, Jianbiao Mei*, Xuemeng Yang, and 6 more authors
Semantic Scene Completion (SSC) is pivotal in autonomous driving perception, frequently confronted with the complexities of weather and illumination changes. The long-term strategy involves fusing multi-modal information to bolster the system’s robustness. Radar, increasingly utilized for 3D target detection, is gradually replacing LiDAR in autonomous driving applications, offering a robust sensing alternative. In this paper, we focus on the potential of 3D radar in semantic scene completion, pioneering cross-modal refinement techniques for improved robustness against weather and illumination changes, and enhancing SSC this http URL model architecture, we propose a three-stage tight fusion approach on BEV to realize a fusion framework for point clouds and images. Based on this foundation, we designed three cross-modal distillation modules-CMRD, BRD, and PDD. Our approach enhances the performance in both radar-only (R-LiCROcc) and radar-camera (RC-LiCROcc) settings by distilling to them the rich semantic and structural information of the fused features of LiDAR and camera. Finally, our LC-Fusion (teacher model), R-LiCROcc and RC-LiCROcc achieve the best performance on the nuScenes-Occupancy dataset, with mIOU exceeding the baseline by 22.9%, 44.1%, and 15.5%, respectively.
2023
ACM MM
Centerlps: Segment instances by centers for lidar panoptic segmentation
Jianbiao Mei*, Yu Yang*, Mengmeng Wang, and 5 more authors
In Proceedings of the 31st ACM International Conference on Multimedia (ACM MM), 2023
This paper focuses on LiDAR Panoptic Segmentation (LPS), which has attracted more attention recently due to its broad application prospect for autonomous driving and robotics. The mainstream LPS approaches either adopt a top-down strategy relying on 3D object detectors to discover instances or utilize time-consuming heuristic clustering algorithms to group instances in a bottom-up manner. Inspired by the center representation and kernel-based segmentation, we propose a new detection-free and clustering-free framework called CenterLPS, with the center-based instance encoding and decoding paradigm. Specifically, we propose a sparse center proposal network to generate the sparse 3D instance centers, as well as center feature embedding, which can well encode characteristics of instances. Then a center-aware transformer is applied to collect the context between different center feature embedding and around centers. Moreover, we generate the kernel weights based on the enhanced center feature embedding and initialize dynamic convolutions to decode the final instance masks. Finally, a mask fusion module is devised to unify the semantic and instance predictions and improve the panoptic quality. Extensive experiments on SemanticKITTI and nuScenes demonstrate the effectiveness of our proposed center-based framework CenterLPS.
IROS
PANet: LiDAR Panoptic Segmentation with Sparse Instance Proposal and Aggregation
Jianbiao Mei*, Yu Yang*, Mengmeng Wang, and 3 more authors
In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023
Reliable LiDAR panoptic segmentation (LPS), including both semantic and instance segmentation, is vital for many robotic applications, such as autonomous driving. This work proposes a new LPS framework named PANet to eliminate the dependency on the offset branch and improve the performance on large objects, which are always over-segmented by clustering algorithms. Firstly, we propose a non-learning Sparse Instance Proposal (SIP) module with the “sampling-shifting-grouping" scheme to directly group thing points into instances from the raw point cloud efficiently. More specifically, balanced point sampling is introduced to generate sparse seed points with more uniform point distribution over the distance range. And a shift module, termed bubble shifting, is proposed to shrink the seed points to the clustered centers. Then we utilize the connected component label algorithm to generate instance proposals. Furthermore, an instance aggregation module is devised to integrate potentially fragmented instances, improving the performance of the SIP module on large objects. Extensive experiments show that PANet achieves state-of-the-art performance among published works on the SemanticKITII validation and nuScenes validation for the panoptic segmentation task.
IROS
SSC-RS: Elevate LiDAR semantic scene completion with representation separation and BEV fusion
Jianbiao Mei, Yu Yang, Mengmeng Wang, and 3 more authors
In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023
Semantic scene completion (SSC) jointly predicts the semantics and geometry of the entire 3D scene, which plays an essential role in 3D scene understanding for autonomous driving systems. SSC has achieved rapid progress with the help of semantic context in segmentation. However, how to effectively exploit the relationships between the semantic context in semantic segmentation and geometric structure in scene completion remains under exploration. In this paper, we propose to solve outdoor SSC from the perspective of representation separation and BEV fusion. Specifically, we present the network, named SSC-RS, which uses separate branches with deep supervision to explicitly disentangle the learning procedure of the semantic and geometric representations. And a BEV fusion network equipped with the proposed Adaptive Representation Fusion (ARF) module is presented to aggregate the multi-scale features effectively and efficiently. Due to the low computational burden and powerful representation ability, our model has good generality while running in real-time. Extensive experiments on SemanticKITTI demonstrate our SSC-RS achieves state-of-the-art performance.
ACM TIST
Fast real-time video object segmentation with a tangled memory network
Jianbiao Mei*, Mengmeng Wang*, Yu Yang, and 2 more authors
ACM Transactions on Intelligent Systems and Technology (ACM TIST), 2023
In this article, we present a fast real-time tangled memory network that segments the objects effectively and efficiently for semi-supervised video object segmentation (VOS). We propose a tangled reference encoder and a memory bank organization mechanism based on a state estimator to fully utilize the mask features and alleviate memory overhead and computational burden brought by the unlimited memory bank used in many memory-based methods. First, the tangled memory network exploits the mask features that uncover abundant object information like edges and contours but are not fully explored in existing methods. Specifically, a tangled two-stream reference encoder is designed to extract and fuse the features from both RGB frames and the predicted masks. Second, to indicate the quality of the predicted mask and feedback the online prediction state for organizing the memory bank, we devise a target state estimator to learn the IoU score between the predicted mask and ground truth. Moreover, to accelerate the forward process and avoid memory overflow, we use a memory bank of fixed size to store historical features by designing a new efficient memory bank organization mechanism based on the mask state score provided by the state estimator. We conduct comprehensive experiments on the public benchmarks DAVIS and YouTube-VOS, demonstrating that our method obtains competitive results while running at high speed (66 FPS on the DAVIS16-val set).
2022
TIP
Delving deeper into mask utilization in video object segmentation
Mengmeng Wang*, Jianbiao Mei*, Lina Liu, and 3 more authors
This paper focuses on the mask utilization of video object segmentation (VOS). The mask here mains the reference masks in the memory bank, i.e., several chosen high-quality predicted masks, which are usually used with the reference frames together. The reference masks depict the edge and contour features of the target object and indicate the boundary of the target against the background, while the reference frames contain the raw RGB information of the whole image. It is obvious that the reference masks could play a significant role in the VOS, but this is not well explored yet. To tackle this, we propose to investigate the mask advantages of both the encoder and the matcher. For the encoder, we provide a unified codebase to integrate and compare eight different mask-fused encoders. Half of them are inherited or summarized from existing methods, and the other half are devised by ourselves. We find the best configuration from our design and give valuable observations from the comparison. Then, we propose a new mask-enhanced matcher to reduce the background distraction and enhance the locality of the matching process. Combining the mask-fused encoder, mask-enhanced matcher and a standard decoder, we formulate a new architecture named MaskVOS, which sufficiently exploits the mask benefits for VOS. Qualitative and quantitative results demonstrate the effectiveness of our method. We hope our exploration could raise the attention of mask utilization in VOS.
2021
Preprint
Transvos: Video object segmentation with transformers
Jianbiao Mei*, Mengmeng Wang*, Yeneng Lin, and 2 more authors
Recently, Space-Time Memory Network (STM) based methods have achieved state-of-the-art performance in semi-supervised video object segmentation (VOS). A crucial problem in this task is how to model the dependency both among different frames and inside every frame. However, most of these methods neglect the spatial relationships (inside each frame) and do not make full use of the temporal relationships (among different frames). In this paper, we propose a new transformer-based framework, termed TransVOS, introducing a vision transformer to fully exploit and model both the temporal and spatial relationships. Moreover, most STM-based approaches employ two separate encoders to extract features of two significant inputs, i.e., reference sets (history frames with predicted masks) and query frame (current frame), respectively, increasing the models’ parameters and complexity. To slim the popular two-encoder pipeline while keeping the effectiveness, we design a single two-path feature extractor to encode the above two inputs in a unified way. Extensive experiments demonstrate the superiority of our TransVOS over state-of-the-art methods on both DAVIS and YouTube-VOS datasets.
Please contact me via email: jianbiaomei [at] zju dot edu dot cn