Jianbiao Mei
 
 I am a Ph.D. candidate in the Department of Control Science and Engineering at Zhejiang University, where I have been studying since 2021 under the supervision of Prof. Yong Liu at the APRIL Lab. Prior to this, I received my B.Eng. degree with honors from the same department through the Chu Kochen Honors College in 2021. I am currently an intern researcher at Shanghai AI Laboratory.
Research Interests
My current research interests include:
- 3D Perception
- World Models
- Multimodal Large Language Models
News
| Date | News |
|---|---|
| Sep 26, 2025 | 🎉 Our paper X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability has been accepted to NeurIPS 2025! 📢 We have also released 3D and 4D World Modeling: A Survey! |
| Jun 26, 2025 | 🎉 Our paper DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving has been accepted to ICCV 2025! |
| Dec 10, 2024 | 🎉 Our paper Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving has been accepted to AAAI 2025 as an oral presentation! |
| Oct 2, 2024 | 🎉 Our paper Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving has been accepted to NeurIPS 2024! 📢 We have also developed DriveArena, a closed-loop high-fidelity simulation platform! |
| Sep 11, 2024 | 🎉 Our paper Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network has been accepted by IEEE Transactions on Image Processing (TIP)! |
Selected Publications
* denotes equal contribution.

2025
- [Preprint] Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models
  Jianbiao Mei*, Yu Yang*, Xuemeng Yang, and 4 more authors
  arXiv preprint arXiv:2510.16729, 2025
  Abstract: End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common ineffectiveness in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird’s-eye-view representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego-vehicle’s actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning. (A minimal conceptual sketch of this residual update scheme appears after this year's list.)
  BibTeX: @article{mei2025vision, title = {Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models}, author = {Mei*, Jianbiao and Yang*, Yu and Yang, Xuemeng and Wen, Licheng and Lv, Jiajun and Shi, Botian and Liu, Yong}, journal = {arXiv preprint arXiv:2510.16729}, year = {2025}}
- [Preprint] RE-Searcher: Robust Agentic Search with Goal-oriented Planning and Self-reflection
  Daocheng Fu*, Jianbiao Mei*, Licheng Wen, and 8 more authors
  arXiv preprint arXiv:2509.26048, 2025
  Abstract: Large language models (LLMs) excel at knowledge-intensive question answering and reasoning, yet their real-world deployment remains constrained by knowledge cutoff, hallucination, and limited interaction modalities. Augmenting LLMs with external search tools helps alleviate these issues, but it also exposes agents to a complex search environment in which small, plausible variations in query formulation can steer reasoning into unproductive trajectories and amplify errors. We present a systematic analysis that quantifies how environmental complexity induces fragile search behaviors and, in turn, degrades overall performance. To address this challenge, we propose a simple yet effective approach to instantiate a search agent, RE-Searcher. During search, RE-Searcher explicitly articulates a concrete search goal and subsequently reflects on whether the retrieved evidence satisfies that goal. This combination of goal-oriented planning and self-reflection enables RE-Searcher to resist spurious cues in complex search environments and perform robust search. Extensive experiments show that our method improves search accuracy and achieves state-of-the-art results. Perturbation studies further demonstrate substantial resilience to noisy or misleading external signals, mitigating the fragility of the search process. We believe these findings offer practical guidance for integrating LLM-powered agents into more complex interactive environments and enabling more autonomous decision-making.
  BibTeX: @article{fu2025re, title = {RE-Searcher: Robust Agentic Search with Goal-oriented Planning and Self-reflection}, author = {Fu*, Daocheng and Mei*, Jianbiao and Wen, Licheng and Yang, Xuemeng and Yang, Cheng and Wu, Rong and Hu, Tao and Li, Siqi and Shen, Yufan and Cai, Xinyu and others}, journal = {arXiv preprint arXiv:2509.26048}, year = {2025}}
- [Preprint] 3D and 4D World Modeling: A Survey
  Lingdong Kong*, Wesley Yang*, Jianbiao Mei*, and 8 more authors
  arXiv preprint arXiv:2509.07996, 2025
  Abstract: World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for “world models” has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field.
  Project page: https://worldbench.github.io/survey
  BibTeX: @article{kong20253d, title = {3d and 4d world modeling: A survey}, author = {Kong*, Lingdong and Yang*, Wesley and Mei*, Jianbiao and Liu*, Youquan and Liang*, Ao and Zhu*, Dekai and Lu*, Dongyue and Yin*, Wei and Hu, Xiaotao and Jia, Mingkai and others}, journal = {arXiv preprint arXiv:2509.07996}, year = {2025}, demo = {https://worldbench.github.io/survey}}
- [NeurIPS] X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability
  Yu Yang, Alan Liang, Jianbiao Mei, and 3 more authors
  In Advances in Neural Information Processing Systems (NeurIPS), 2025
  Abstract: Diffusion models are advancing autonomous driving by enabling realistic data synthesis, predictive end-to-end planning, and closed-loop simulation, with a primary focus on temporally consistent generation. However, the generation of large-scale 3D scenes that require spatial coherence remains underexplored. In this paper, we propose X-Scene, a novel framework for large-scale driving scene generation that achieves both geometric intricacy and appearance fidelity, while offering flexible controllability. Specifically, X-Scene supports multi-granular control, including low-level conditions such as user-provided or text-driven layout for detailed scene composition and high-level semantic guidance such as user-intent and LLM-enriched text prompts for efficient customization. To enhance geometrical and visual fidelity, we introduce a unified pipeline that sequentially generates 3D semantic occupancy and the corresponding multiview images, while ensuring alignment between modalities. Additionally, we extend the generated local region into a large-scale scene through consistency-aware scene outpainting, which extrapolates new occupancy and images conditioned on the previously generated area, enhancing spatial continuity and preserving visual coherence. The resulting scenes are lifted into high-quality 3DGS representations, supporting diverse applications such as scene exploration. Comprehensive experiments demonstrate that X-Scene significantly advances controllability and fidelity for large-scale driving scene generation, empowering data generation and simulation for autonomous driving.
  Project page: https://x-scene.github.io/
  BibTeX: @inproceedings{yang2025x, title = {X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability}, author = {Yang, Yu and Liang, Alan and Mei, Jianbiao and Ma, Yukai and Liu, Yong and Lee, Gim Hee}, booktitle = {Advances in Neural Information Processing Systems (NeurIPS)}, year = {2025}, demo = {https://x-scene.github.io/}}
- [Preprint] O^2-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering
  Jianbiao Mei*, Tao Hu*, Daocheng Fu*, and 11 more authors
  2025
  Abstract: Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain, up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-ended problems. Open-ended questions, which are characterized by lacking a standard answer or by admitting non-unique, diverse answers, remain underexplored. To bridge this gap, we present O^2-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O^2-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling the external world knowledge from the model’s sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt different answer generation strategies. Furthermore, to evaluate performance on complex open-ended tasks, we construct O^2-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O^2-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O^2-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly-sized models, while performing on par with much larger ones.
  Code: https://github.com/KnowledgeXLab/O2-Searcher
  BibTeX: @article{mei2025o2searcher, title = {O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering}, author = {Mei*, Jianbiao and Hu*, Tao and Fu*, Daocheng and Wen, Licheng and Yang, Xuemeng and Wu, Rong and Cai, Pinlong and Cai, Xinyu and Gao, Xing and Yang, Yu and Xie, Chengjun and Shi, Botian and Liu, Yong and Qiao, Yu}, year = {2025}, demo = {https://github.com/KnowledgeXLab/O2-Searcher}}
- [AAAI Oral] Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving
  Yu Yang*, Jianbiao Mei*, Yukai Ma, and 5 more authors
  AAAI Conference on Artificial Intelligence (AAAI), 2025
  Abstract: World models envision potential future states based on various ego actions. They embed extensive knowledge about the driving environment, facilitating safe and scalable autonomous driving. Most existing methods primarily focus on either data generation or the pretraining paradigms of world models. Unlike the aforementioned prior works, we propose Drive-OccWorld, which adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving. Specifically, we first introduce a semantic and motion-conditional normalization in the memory module, which accumulates semantic and dynamic information from historical BEV embeddings. These BEV features are then conveyed to the world decoder for future occupancy and flow forecasting, considering both geometry and spatiotemporal modeling. Additionally, we propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model to enable controllable generation and facilitate a broader range of downstream applications. Furthermore, we explore integrating the generative capabilities of the 4D world model with end-to-end planning, enabling continuous forecasting of future states and the selection of optimal trajectories using an occupancy-based cost function. Extensive experiments on the nuScenes dataset demonstrate that our method can generate plausible and controllable 4D occupancy, opening new avenues for driving world generation and end-to-end planning.
  Project page: https://drive-occworld.github.io
  BibTeX: @article{yang2024driving, title = {Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving}, author = {Yang*, Yu and Mei*, Jianbiao and Ma, Yukai and Du, Siliang and Chen, Wenqing and Qian, Yijie and Feng, Yuxiang and Liu, Yong}, journal = {AAAI Conference on Artificial Intelligence (AAAI)}, year = {2025}, demo = {https://drive-occworld.github.io}}
- [ICCV] DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving
  Xuemeng Yang*, Licheng Wen*, Tiantian Wei*, and 8 more authors
  International Conference on Computer Vision (ICCV), 2025
  Abstract: This paper presents DriveArena, the first high-fidelity closed-loop simulation system designed for driving agents navigating in real scenarios. DriveArena features a flexible, modular architecture, allowing for the seamless interchange of its core components: Traffic Manager, a traffic simulator capable of generating realistic traffic flow on any worldwide street map, and World Dreamer, a high-fidelity conditional generative model with infinite autoregression. This powerful synergy empowers any driving agent capable of processing real-world images to navigate in DriveArena’s simulated environment. The agent perceives its surroundings through images generated by World Dreamer and outputs trajectories. These trajectories are fed into Traffic Manager, achieving realistic interactions with other vehicles and producing a new scene layout. Finally, the latest scene layout is relayed back into World Dreamer, perpetuating the simulation cycle. This iterative process fosters closed-loop exploration within a highly realistic environment, providing a valuable platform for developing and evaluating driving agents across diverse and challenging scenarios. DriveArena signifies a substantial leap forward in leveraging generative image data for the driving simulation platform, opening insights for closed-loop autonomous driving.
  Project page: https://pjlab-adg.github.io/DriveArena/
  BibTeX: @article{yang2024drivearena, title = {Drivearena: A closed-loop generative simulation platform for autonomous driving}, author = {Yang*, Xuemeng and Wen*, Licheng and Wei*, Tiantian and Ma*, Yukai and Mei*, Jianbiao and Li*, Xin and Lei, Wenjie and Fu, Daocheng and Cai, Pinlong and Dou, Min and others}, journal = {International Conference on Computer Vision (ICCV)}, year = {2025}, demo = {https://pjlab-adg.github.io/DriveArena/}}
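The IR-WM entry above describes predicting only the "residual" change of a BEV state conditioned on the ego action, with an alignment step before the state is reused as a prior. The snippet below is a minimal conceptual sketch of that idea only, not the authors' implementation: the module layout, feature dimensions, and the pooled BEV/action vectors are placeholder assumptions.

```python
# Conceptual sketch of a residual BEV world-model rollout (illustrative only).
import torch
import torch.nn as nn

class ResidualWorldModel(nn.Module):
    def __init__(self, bev_dim: int = 128, action_dim: int = 4):
        super().__init__()
        # Predicts the change ("residual") of the BEV state from the previous state and ego action.
        self.residual_head = nn.Sequential(
            nn.Linear(bev_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, bev_dim)
        )
        # Stand-in for the alignment module that calibrates the state before it is reused as a prior.
        self.align = nn.Linear(bev_dim, bev_dim)

    def forward(self, prev_bev: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # prev_bev: (B, bev_dim) BEV feature from the previous timestep (temporal prior)
        # action:   (B, action_dim) ego-vehicle action conditioning the predicted change
        delta = self.residual_head(torch.cat([prev_bev, action], dim=-1))
        return self.align(prev_bev + delta)  # next state = prior + predicted residual

# Autoregressive rollout: each predicted state becomes the prior for the next step.
model = ResidualWorldModel()
bev = torch.randn(1, 128)
for action in torch.randn(6, 1, 4):  # a 6-step action sequence
    bev = model(bev, action)
```

The point of the sketch is the update rule itself: capacity is spent on how the scene changes rather than on re-synthesizing the static background at every step.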
2024
- [Preprint] DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes
  Jianbiao Mei*, Tao Hu*, Licheng Wen, and 7 more authors
  arXiv preprint arXiv:2409.04003, 2024
  Abstract: Recent advances in diffusion models have improved controllable streetscape generation and supported downstream perception and planning tasks. However, challenges remain in accurately modeling driving scenes and generating long videos. To alleviate these issues, we propose DreamForge, an advanced diffusion-based autoregressive video generation model tailored for 3D-controllable long-term generation. To enhance the lane and foreground generation, we introduce perspective guidance and integrate object-wise position encoding to incorporate local 3D correlation and improve foreground object modeling. We also propose motion-aware temporal attention to capture motion cues and appearance changes in videos. By leveraging motion frames and an autoregressive generation paradigm, we can autoregressively generate long videos (over 200 frames) using a 7-frame model, achieving superior quality compared to the baseline in 16-frame video evaluations. Finally, we integrate our method with the realistic simulation platform DriveArena to provide more reliable open-loop and closed-loop evaluations for vision-based driving agents.
  Project page: https://pjlab-adg.github.io/DriveArena/dreamforge
  BibTeX: @article{mei2024dreamforge, title = {DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes}, author = {Mei*, Jianbiao and Hu*, Tao and Wen, Licheng and Yang, Xuemeng and Yang, Yu and Wei, Tiantian and Ma, Yukai and Dou, Min and Shi, Botian and Liu, Yong}, journal = {arXiv preprint arXiv:2409.04003}, year = {2024}, demo = {https://pjlab-adg.github.io/DriveArena/dreamforge}}
- [NeurIPS] Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving
  Jianbiao Mei*, Yukai Ma*, Xuemeng Yang, and 8 more authors
  In Advances in Neural Information Processing Systems (NeurIPS), 2024
  Abstract: Autonomous driving has advanced significantly due to sensors, machine learning, and artificial intelligence improvements. However, prevailing methods struggle with intricate scenarios and causal relationships, hindering adaptability and interpretability in varied environments. To address the above problems, we introduce LeapAD, a novel paradigm for autonomous driving inspired by the human cognitive process. Specifically, LeapAD emulates human attention by selecting critical objects relevant to driving decisions, simplifying environmental interpretation, and mitigating decision-making complexities. Additionally, LeapAD incorporates an innovative dual-process decision-making module, which consists of an Analytic Process (System-II) for thorough analysis and reasoning, along with a Heuristic Process (System-I) for swift and empirical processing. The Analytic Process leverages its logical reasoning to accumulate linguistic driving experience, which is then transferred to the Heuristic Process by supervised fine-tuning. Through reflection mechanisms and a growing memory bank, LeapAD continuously improves itself from past mistakes in a closed-loop environment. Closed-loop testing in CARLA shows that LeapAD outperforms all methods relying solely on camera input, requiring 1-2 orders of magnitude less labeled data. Experiments also demonstrate that as the memory bank expands, the Heuristic Process with only 1.8B parameters can inherit the knowledge from a GPT-4 powered Analytic Process and achieve continuous performance improvement. (A minimal conceptual sketch of this dual-process loop appears after this year's list.)
  Project page: https://leapad-2024.github.io/LeapAD/
  BibTeX: @inproceedings{mei2024continuously, title = {Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving}, author = {Mei*, Jianbiao and Ma*, Yukai and Yang, Xuemeng and Wen, Licheng and Cai, Xinyu and Li, Xin and Fu, Daocheng and Zhang, Bo and Cai, Pinlong and Dou, Min and others}, booktitle = {Advances in Neural Information Processing Systems (NeurIPS)}, year = {2024}, demo = {https://leapad-2024.github.io/LeapAD/}}
- [TIP] Camera-Based 3D Semantic Scene Completion with Sparse Guidance Network
  Jianbiao Mei, Yu Yang, Mengmeng Wang, and 5 more authors
  IEEE Transactions on Image Processing (TIP), 2024
  Abstract: Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. Even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on the SemanticKITTI validation set with only 12.5M parameters and 7.16G training memory.
  BibTeX: @article{mei2024camera, title = {Camera-based 3d semantic scene completion with sparse guidance network}, author = {Mei, Jianbiao and Yang, Yu and Wang, Mengmeng and Zhu, Junyu and Ra, Jongwon and Ma, Yukai and Li, Laijian and Liu, Yong}, journal = {IEEE Transactions on Image Processing (TIP)}, year = {2024}, publisher = {IEEE}}
- [RAL] LiCROcc: Teach Radar for Accurate Semantic Occupancy Prediction using LiDAR and Camera
  Yukai Ma*, Jianbiao Mei*, Xuemeng Yang, and 6 more authors
  IEEE Robotics and Automation Letters (RAL), 2024
  Abstract: Semantic Scene Completion (SSC) is pivotal in autonomous driving perception, frequently confronted with the complexities of weather and illumination changes. The long-term strategy involves fusing multi-modal information to bolster the system’s robustness. Radar, increasingly utilized for 3D target detection, is gradually replacing LiDAR in autonomous driving applications, offering a robust sensing alternative. In this paper, we focus on the potential of 3D radar in semantic scene completion, pioneering cross-modal refinement techniques for improved robustness against weather and illumination changes, and enhancing SSC performance. In our model architecture, we propose a three-stage tight fusion approach on BEV to realize a fusion framework for point clouds and images. Based on this foundation, we designed three cross-modal distillation modules: CMRD, BRD, and PDD. Our approach enhances the performance in both radar-only (R-LiCROcc) and radar-camera (RC-LiCROcc) settings by distilling to them the rich semantic and structural information of the fused features of LiDAR and camera. Finally, our LC-Fusion (teacher model), R-LiCROcc and RC-LiCROcc achieve the best performance on the nuScenes-Occupancy dataset, with mIoU exceeding the baseline by 22.9%, 44.1%, and 15.5%, respectively.
  Project page: https://hr-zju.github.io/LiCROcc
  BibTeX: @article{ma2024licrocc, title = {LiCROcc: Teach Radar for Accurate Semantic Occupancy Prediction using LiDAR and Camera}, author = {Ma*, Yukai and Mei*, Jianbiao and Yang, Xuemeng and Wen, Licheng and Xu, Weihua and Zhang, Jiangning and Shi, Botian and Liu, Yong and Zuo, Xingxing}, journal = {IEEE Robotics and Automation Letters (RAL)}, year = {2024}, demo = {https://hr-zju.github.io/LiCROcc}}
- [AAAI Oral] A Multimodal, Multi-Task Adapting Framework for Video Action Recognition
  Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, and 6 more authors
  In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024
  Abstract: Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has attracted substantial attention in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of compromising the models’ generalization capabilities during transfer. In this paper, we introduce a novel multimodal, multi-task CLIP adapting framework to address these challenges, preserving both high supervised performance and robust transferability. Firstly, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter, that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Secondly, we design a multi-task decoder with a rich set of supervisory signals to adeptly satisfy the need for strong supervised performance and generalization within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
  BibTeX: @inproceedings{wang2024multimodal, title = {A Multimodal, Multi-Task Adapting Framework for Video Action Recognition}, author = {Wang, Mengmeng and Xing, Jiazheng and Jiang, Boyuan and Chen, Jun and Mei, Jianbiao and Zuo, Xingxing and Dai, Guang and Wang, Jingdong and Liu, Yong}, booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)}, volume = {38}, number = {6}, pages = {5517--5525}, year = {2024}}
- [ICRA] A Coarse-to-Fine Place Recognition Approach using Attention-guided Descriptors and Overlap Estimation
  Chencan Fu, Lin Li, Jianbiao Mei, and 4 more authors
  In 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024
  Abstract: Place recognition is a challenging but crucial task in robotics. Current description-based methods may be limited by representation capabilities, while pairwise similarity-based methods require exhaustive searches, which is time-consuming. In this paper, we present a novel coarse-to-fine approach to address these problems, which combines BEV (Bird’s Eye View) feature extraction, coarse-grained matching and fine-grained verification. In the coarse stage, our approach utilizes an attention-guided network to generate attention-guided descriptors. We then employ a fast affinity-based candidate selection process to identify the Top-K most similar candidates. In the fine stage, we estimate pairwise overlap among the narrowed-down place candidates to determine the final match. Experimental results on the KITTI and KITTI-360 datasets demonstrate that our approach outperforms state-of-the-art methods.
  BibTeX: @inproceedings{fu2024coarse, title = {A Coarse-to-Fine Place Recognition Approach using Attention-guided Descriptors and Overlap Estimation}, author = {Fu, Chencan and Li, Lin and Mei, Jianbiao and Ma, Yukai and Peng, Linpeng and Zhao, Xiangrui and Liu, Yong}, booktitle = {2024 IEEE International Conference on Robotics and Automation (ICRA)}, pages = {8493--8499}, year = {2024}, organization = {IEEE}}
- [3DV] Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking via Memory Networks
  Jongwon Ra, MengMeng Wang, Jianbiao Mei, and 3 more authors
  In 2024 International Conference on 3D Vision (3DV), 2024
  Abstract: The point cloud-based 3D single object tracking plays an indispensable role in autonomous driving. However, the application of 3D object tracking in the real world is still challenging due to the inherent sparsity and self-occlusion of point cloud data. Therefore, it is necessary to exploit as much useful information from limited data as we can. Since 3D object tracking is a video-level task, the appearance of objects changes gradually over time, and there is rich spatiotemporal contextual information among historical frames. However, existing methods do not fully utilize this information. To address this, we propose a new method called SCTrack, which utilizes a memory-based paradigm to exploit spatiotemporal contextual information. SCTrack incorporates both long-term and short-term memory banks to store the spatiotemporal features of targets from historical frames. By doing so, the tracker can benefit from the entire video sequence and make more informed predictions. Additionally, SCTrack extracts the mask prior to augmenting the target representation, improving the target-background discriminability. Extensive experiments on KITTI, nuScenes, and Waymo Open datasets verify the effectiveness of our proposed method.
  BibTeX: @inproceedings{ra2024exploit, title = {Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking via Memory Networks}, author = {Ra, Jongwon and Wang, MengMeng and Mei, Jianbiao and Liu, Shanqi and Yang, Yu and Liu, Yong}, booktitle = {2024 International Conference on 3D Vision (3DV)}, pages = {842--851}, year = {2024}, organization = {IEEE}}
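The LeapAD entry above couples a slow Analytic Process, which writes linguistic driving experience into a growing memory bank, with a fast Heuristic Process that reuses the closest past experiences. The sketch below illustrates only that control flow under simplifying assumptions: the string-similarity retrieval and the two placeholder decision functions stand in for the LLM/VLM components used in the paper.

```python
# Conceptual sketch of a dual-process (System-I / System-II) decision loop with a memory bank.
from difflib import SequenceMatcher

class ExperienceMemory:
    """Grows over time; stores (scene description, decision) pairs."""
    def __init__(self):
        self.bank = []

    def add(self, scene: str, decision: str) -> None:
        self.bank.append((scene, decision))

    def retrieve(self, scene: str, k: int = 2):
        # Return the k most similar stored scenes as few-shot context for the fast process.
        return sorted(self.bank,
                      key=lambda e: SequenceMatcher(None, e[0], scene).ratio(),
                      reverse=True)[:k]

def analytic_process(scene: str) -> str:
    # Placeholder for the slow, thorough System-II reasoner (a large LLM in the paper).
    return f"carefully reasoned decision for: {scene}"

def heuristic_process(scene: str, examples) -> str:
    # Placeholder for the fast System-I model: imitate the closest accumulated experience.
    return examples[0][1] if examples else analytic_process(scene)

memory = ExperienceMemory()
memory.add("pedestrian crossing ahead", "slow down and yield")

scene = "pedestrian stepping onto the crosswalk"
decision = heuristic_process(scene, memory.retrieve(scene))

closed_loop_failure = False  # e.g., a collision or infraction reported by the simulator
if closed_loop_failure:
    # Reflection: the analytic process revisits the mistake and writes a corrected experience back.
    memory.add(scene, analytic_process(scene))
```

In the paper the transfer from System-II to System-I is done by supervised fine-tuning on the accumulated experience; retrieval here is only the simplest way to show the closed loop.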
2023
- [ACM MM] CenterLPS: Segment Instances by Centers for LiDAR Panoptic Segmentation
  Jianbiao Mei*, Yu Yang*, Mengmeng Wang, and 5 more authors
  In Proceedings of the 31st ACM International Conference on Multimedia (ACM MM), 2023
  Abstract: This paper focuses on LiDAR Panoptic Segmentation (LPS), which has attracted more attention recently due to its broad application prospect for autonomous driving and robotics. The mainstream LPS approaches either adopt a top-down strategy relying on 3D object detectors to discover instances or utilize time-consuming heuristic clustering algorithms to group instances in a bottom-up manner. Inspired by the center representation and kernel-based segmentation, we propose a new detection-free and clustering-free framework called CenterLPS, with the center-based instance encoding and decoding paradigm. Specifically, we propose a sparse center proposal network to generate the sparse 3D instance centers, as well as center feature embedding, which can well encode characteristics of instances. Then a center-aware transformer is applied to collect the context between different center feature embeddings and around centers. Moreover, we generate the kernel weights based on the enhanced center feature embedding and initialize dynamic convolutions to decode the final instance masks. Finally, a mask fusion module is devised to unify the semantic and instance predictions and improve the panoptic quality. Extensive experiments on SemanticKITTI and nuScenes demonstrate the effectiveness of our proposed center-based framework CenterLPS.
  BibTeX: @inproceedings{mei2023centerlps, title = {Centerlps: Segment instances by centers for lidar panoptic segmentation}, author = {Mei*, Jianbiao and Yang*, Yu and Wang, Mengmeng and Li, Zizhang and Hou, Xiaojun and Ra, Jongwon and Li, Laijian and Liu, Yong}, booktitle = {Proceedings of the 31st ACM International Conference on Multimedia (ACM MM)}, pages = {1884--1894}, year = {2023}}
- [IROS] SSC-RS: Elevate LiDAR Semantic Scene Completion with Representation Separation and BEV Fusion
  Jianbiao Mei, Yu Yang, Mengmeng Wang, and 3 more authors
  In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023
  Abstract: Semantic scene completion (SSC) jointly predicts the semantics and geometry of the entire 3D scene, which plays an essential role in 3D scene understanding for autonomous driving systems. SSC has achieved rapid progress with the help of semantic context in segmentation. However, how to effectively exploit the relationships between the semantic context in semantic segmentation and the geometric structure in scene completion remains under exploration. In this paper, we propose to solve outdoor SSC from the perspective of representation separation and BEV fusion. Specifically, we present the network, named SSC-RS, which uses separate branches with deep supervision to explicitly disentangle the learning procedure of the semantic and geometric representations. A BEV fusion network equipped with the proposed Adaptive Representation Fusion (ARF) module is also presented to aggregate the multi-scale features effectively and efficiently. Due to the low computational burden and powerful representation ability, our model has good generality while running in real-time. Extensive experiments on SemanticKITTI demonstrate our SSC-RS achieves state-of-the-art performance.
  BibTeX: @inproceedings{mei2023ssc, title = {SSC-RS: Elevate LiDAR semantic scene completion with representation separation and BEV fusion}, author = {Mei, Jianbiao and Yang, Yu and Wang, Mengmeng and Huang, Tianxin and Yang, Xuemeng and Liu, Yong}, booktitle = {2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, pages = {1--8}, year = {2023}, organization = {IEEE}}
- [ACM TIST] Fast Real-Time Video Object Segmentation with a Tangled Memory Network
  Jianbiao Mei*, Mengmeng Wang*, Yu Yang, and 2 more authors
  ACM Transactions on Intelligent Systems and Technology (ACM TIST), 2023
  Abstract: In this article, we present a fast real-time tangled memory network that segments the objects effectively and efficiently for semi-supervised video object segmentation (VOS). We propose a tangled reference encoder and a memory bank organization mechanism based on a state estimator to fully utilize the mask features and alleviate the memory overhead and computational burden brought by the unlimited memory bank used in many memory-based methods. First, the tangled memory network exploits the mask features that uncover abundant object information like edges and contours but are not fully explored in existing methods. Specifically, a tangled two-stream reference encoder is designed to extract and fuse the features from both RGB frames and the predicted masks. Second, to indicate the quality of the predicted mask and feed back the online prediction state for organizing the memory bank, we devise a target state estimator to learn the IoU score between the predicted mask and the ground truth. Moreover, to accelerate the forward process and avoid memory overflow, we use a memory bank of fixed size to store historical features by designing a new efficient memory bank organization mechanism based on the mask state score provided by the state estimator. We conduct comprehensive experiments on the public benchmarks DAVIS and YouTube-VOS, demonstrating that our method obtains competitive results while running at high speed (66 FPS on the DAVIS16-val set). (A minimal conceptual sketch of this score-gated memory bank appears after this year's list.)
  BibTeX: @article{mei2023fast, title = {Fast real-time video object segmentation with a tangled memory network}, author = {Mei*, Jianbiao and Wang*, Mengmeng and Yang, Yu and Li, Yanjun and Liu, Yong}, journal = {ACM Transactions on Intelligent Systems and Technology (ACM TIST)}, volume = {14}, number = {3}, pages = {1--21}, year = {2023}, publisher = {ACM New York, NY}}
- [TNNLS] ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition
  Mengmeng Wang, Jiazheng Xing, Jianbiao Mei, and 2 more authors
  IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2023
  Abstract: The canonical approach to video action recognition dictates a neural network model to do a classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined categories, limiting their transferability on new datasets with unseen concepts. In this article, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameters’ requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub “pre-train, adapt and fine-tune.” This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then, it makes the action recognition task act more like pre-training problems via adaptation engineering. Finally, it is fine-tuned end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on the general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone.
  BibTeX: @article{wang2023actionclip, title = {Actionclip: Adapting language-image pretrained models for video action recognition}, author = {Wang, Mengmeng and Xing, Jiazheng and Mei, Jianbiao and Liu, Yong and Jiang, Yunliang}, journal = {IEEE Transactions on Neural Networks and Learning Systems (TNNLS)}, year = {2023}, publisher = {IEEE}}
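The ACM TIST entry above keeps the memory bank bounded by letting a state estimator score each predicted mask and admitting only high-quality frames. The snippet below is a minimal conceptual sketch of such a score-gated, fixed-size bank; the capacity, admission threshold, and oldest-first eviction rule are illustrative assumptions rather than the paper's exact policy.

```python
# Conceptual sketch of a fixed-size, score-gated memory bank for video object segmentation.
from collections import deque

class FixedSizeMemoryBank:
    def __init__(self, capacity: int = 5, min_score: float = 0.8):
        self.capacity = capacity    # hard cap keeps memory and matching cost bounded over long videos
        self.min_score = min_score  # gate driven by the state estimator's predicted mask quality (IoU)
        self.entries = deque()      # (frame_feature, score), oldest first

    def update(self, frame_feature, score: float) -> None:
        # Low-quality masks are never admitted, so they cannot pollute later matching.
        if score < self.min_score:
            return
        if len(self.entries) == self.capacity:
            self.entries.popleft()  # evict the oldest stored frame to stay within capacity
        self.entries.append((frame_feature, score))

bank = FixedSizeMemoryBank(capacity=3)
for t, score in enumerate([0.90, 0.60, 0.95, 0.85, 0.92]):
    bank.update(f"feat_{t}", score)
print([s for _, s in bank.entries])  # -> [0.95, 0.85, 0.92]
```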
2022
- [TIP] Delving Deeper into Mask Utilization in Video Object Segmentation
  Mengmeng Wang*, Jianbiao Mei*, Lina Liu, and 3 more authors
  IEEE Transactions on Image Processing (TIP), 2022
  Abstract: This paper focuses on the mask utilization of video object segmentation (VOS). The mask here means the reference masks in the memory bank, i.e., several chosen high-quality predicted masks, which are usually used together with the reference frames. The reference masks depict the edge and contour features of the target object and indicate the boundary of the target against the background, while the reference frames contain the raw RGB information of the whole image. It is obvious that the reference masks could play a significant role in VOS, but this is not well explored yet. To tackle this, we propose to investigate the mask advantages of both the encoder and the matcher. For the encoder, we provide a unified codebase to integrate and compare eight different mask-fused encoders. Half of them are inherited or summarized from existing methods, and the other half are devised by ourselves. We find the best configuration from our design and give valuable observations from the comparison. Then, we propose a new mask-enhanced matcher to reduce the background distraction and enhance the locality of the matching process. Combining the mask-fused encoder, mask-enhanced matcher and a standard decoder, we formulate a new architecture named MaskVOS, which sufficiently exploits the mask benefits for VOS. Qualitative and quantitative results demonstrate the effectiveness of our method. We hope our exploration could raise the attention of mask utilization in VOS.
  BibTeX: @article{wang2022delving, title = {Delving deeper into mask utilization in video object segmentation}, author = {Wang*, Mengmeng and Mei*, Jianbiao and Liu, Lina and Tian, Guanzhong and Liu, Yong and Pan, Zaisheng}, journal = {IEEE Transactions on Image Processing (TIP)}, volume = {31}, pages = {6255--6266}, year = {2022}, publisher = {IEEE}}
- [ECCV] E-NeRV: Expedite Neural Video Representation with Disentangled Spatial-Temporal Context
  Zizhang Li, Mengmeng Wang, Huaijin Pi, and 3 more authors
  In European Conference on Computer Vision (ECCV), 2022
  Abstract: Recently, the image-wise implicit neural representation of videos, NeRV, has gained popularity for its promising results and swift speed compared to regular pixel-wise implicit representations. However, the redundant parameters within the network structure can cause a large model size when scaling up for desirable performance. The key reason of this phenomenon is the coupled formulation of NeRV, which outputs the spatial and temporal information of video frames directly from the frame index input. In this paper, we propose E-NeRV, which dramatically expedites NeRV by decomposing the image-wise implicit neural representation into separate spatial and temporal context. Under the guidance of this new formulation, our model greatly reduces the redundant model parameters, while retaining the representation ability. We experimentally find that our method can improve the performance to a large extent with fewer parameters, resulting in a more than 8x faster speed on convergence.
  BibTeX: @inproceedings{li2022nerv, title = {E-nerv: Expedite neural video representation with disentangled spatial-temporal context}, author = {Li, Zizhang and Wang, Mengmeng and Pi, Huaijin and Xu, Kechun and Mei, Jianbiao and Liu, Yong}, booktitle = {European Conference on Computer Vision (ECCV)}, pages = {267--284}, year = {2022}, organization = {Springer Nature Switzerland Cham}}
2021
- [Preprint] TransVOS: Video Object Segmentation with Transformers
  Jianbiao Mei*, Mengmeng Wang*, Yeneng Lin, and 2 more authors
  arXiv preprint arXiv:2106.00588, 2021
  Abstract: Recently, Space-Time Memory Network (STM) based methods have achieved state-of-the-art performance in semi-supervised video object segmentation (VOS). A crucial problem in this task is how to model the dependency both among different frames and inside every frame. However, most of these methods neglect the spatial relationships (inside each frame) and do not make full use of the temporal relationships (among different frames). In this paper, we propose a new transformer-based framework, termed TransVOS, introducing a vision transformer to fully exploit and model both the temporal and spatial relationships. Moreover, most STM-based approaches employ two separate encoders to extract features of two significant inputs, i.e., reference sets (history frames with predicted masks) and query frame (current frame), respectively, increasing the models’ parameters and complexity. To slim the popular two-encoder pipeline while keeping the effectiveness, we design a single two-path feature extractor to encode the above two inputs in a unified way. Extensive experiments demonstrate the superiority of our TransVOS over state-of-the-art methods on both DAVIS and YouTube-VOS datasets.
  BibTeX: @article{mei2021transvos, title = {Transvos: Video object segmentation with transformers}, author = {Mei*, Jianbiao and Wang*, Mengmeng and Lin, Yeneng and Yuan, Yi and Liu, Yong}, journal = {arXiv preprint arXiv:2106.00588}, year = {2021}}