Abstract We present AutoSpatial, a novel and efficient approach that uses structured spatial grounding to enhance VLMs’ spatial reasoning. By combining minimal manual supervision with large-scale auto-labeling of Visual Question-Answering (VQA) pairs, our approach tackles the challenge of VLMs’ limited spatial understanding in social navigation tasks. By applying a hierarchical two-round VQA strategy during training, AutoSpatial achieves both global and detailed understanding of scenarios, demonstrating more accurate spatial perception, movement prediction, Chain-of-Thought (CoT) reasoning, final actions, and explanations than other state-of-the-art approaches.
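The two-round strategy can be pictured as a coarse scene-level query followed by fine-grained follow-ups conditioned on the first answer. A hypothetical sketch, not the AutoSpatial pipeline itself: `ask_vlm` stands in for any VLM API call, and the questions are illustrative.

```python
def two_round_vqa(image, ask_vlm):
    """Hierarchical two-round VQA: first ask a global scene-level
    question, then fine-grained follow-ups conditioned on its answer.
    `ask_vlm(image, question)` is a placeholder for any VLM call."""
    global_answer = ask_vlm(image, "Describe the overall spatial layout of the scene.")
    detail_questions = [
        "Where are the pedestrians relative to the robot?",
        "Which direction should the robot move next, and why?",
    ]
    details = {
        q: ask_vlm(image, f"Given that {global_answer}, {q}")
        for q in detail_questions
    }
    return global_answer, details
```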
Overview Artificial reverberation has been added to anechoic speech data to train more robust machine learning models for automatic speech processing. We are developing methods for automatic speech recognition, source separation and localization, binaural audio generation, and speech emotion recognition.
Software pygsound: pygsound is a Python package for impulse response generation built on a state-of-the-art geometric sound propagation engine. The simulation core is implemented in C++ and exposed to Python through pybind11.
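Impulse responses generated this way are typically applied to anechoic speech by convolution. A minimal NumPy sketch of that step; the signal and the two-tap "room" impulse response below are made up for illustration:

```python
import numpy as np

def apply_reverb(dry, ir):
    """Convolve a dry (anechoic) signal with an impulse response,
    then normalize the peak to avoid clipping."""
    wet = np.convolve(dry, ir)
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# Toy example: a unit impulse through a two-tap impulse response
# (direct path plus one attenuated early reflection).
dry = np.array([1.0, 0.0, 0.0])
ir = np.array([1.0, 0.5])
wet = apply_reverb(dry, ir)
```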
– Overview Autonomous driving has become one of the most anticipated technologies in both industry and academic research. Most current efforts in autonomous driving have found success in idealized conditions such as sparse, homogeneous traffic on highways and in urban areas. The GAMMA group, instead, aims to advance autonomous driving research in highly dense and heterogeneous traffic conditions, accounting for the social and psychological aspects of human drivers in uncertain environments.
Abstract We present AutonoVi, a novel algorithm for autonomous vehicle navigation that supports dynamic maneuvers and satisfies traffic constraints and norms. Our approach is based on optimization-based maneuver planning that supports dynamic lane-changes, swerving, and braking in all traffic scenarios and guides the vehicle to its goal position. We take into account various traffic constraints, including collision avoidance with other vehicles, pedestrians, and cyclists using control velocity obstacles. We use a data-driven approach to model the vehicle dynamics for control and collision avoidance.
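At the core of a velocity-obstacle check is a collision-cone test: a candidate velocity is unsafe if the resulting relative velocity drives the agent into a disc of combined radius around the obstacle. A minimal 2-D sketch of that test, not the AutonoVi planner itself; the function name and geometry are illustrative:

```python
def in_velocity_obstacle(p_rel, v_rel, radius):
    """Return True if relative velocity v_rel (agent minus obstacle)
    eventually brings the agent within `radius` of the obstacle.
    p_rel is the obstacle position relative to the agent.
    Solves |p_rel - t * v_rel| <= radius for some t > 0."""
    px, py = p_rel
    vx, vy = v_rel
    c = px * px + py * py - radius * radius
    if c <= 0:
        return True                  # already overlapping
    a = vx * vx + vy * vy
    dot = px * vx + py * vy
    if a == 0 or dot <= 0:
        return False                 # not closing on the obstacle
    return dot * dot - a * c >= 0    # ray enters the disc at some t > 0
```

A safe velocity set can then be built by sampling candidate velocities and discarding those for which this test returns True for any nearby vehicle, pedestrian, or cyclist.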
Abstract Driven by the needs of real-world applications, Auxiliary Modality Learning (AML) makes it possible to exploit additional information from auxiliary data modalities during training while requiring only one or a few modalities at test time, reducing the overall computational cost and the amount of input data needed for inference. In this work, we formally define Auxiliary Modality Learning (AML), systematically classify the types of auxiliary modality (in visual computing) and architectures for AML, and analyze their performance.
Offline Training: We highlight our behavior-guided navigation policy for autonomous driving. We use a behavior-rich simulator that can generate aggressive or conservative driving styles. In Step 1, we use the CMetric behavior classification algorithm to compute a set of parameters that characterize aggressive behaviors such as over-speeding, overtaking, and sudden lane changes. In Step 2, we use these parameters to train a behavior-guided navigation policy for action prediction and local navigation.
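To illustrate how behavior parameters can label driving styles, here is a toy classifier that flags a trajectory as aggressive from over-speeding and frequent lane changes. This is not the CMetric algorithm (which derives its measures from centrality functions over traffic graphs); the features and thresholds below are made up:

```python
def classify_style(speeds_mps, lane_changes, speed_limit_mps,
                   overspeed_margin=2.0, lane_change_thresh=3):
    """Toy driving-style classifier: a trajectory is 'aggressive' if the
    driver exceeds the speed limit by a margin or changes lanes often.
    Illustrative only; thresholds and features are hypothetical."""
    overspeeding = max(speeds_mps) > speed_limit_mps + overspeed_margin
    frequent_lane_changes = lane_changes >= lane_change_thresh
    if overspeeding or frequent_lane_changes:
        return "aggressive"
    return "conservative"
```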
Abstract We present BehAV, a novel approach for autonomous robot navigation in outdoor scenes guided by human instructions and leveraging Vision Language Models (VLMs). Our method interprets human commands using a Large Language Model (LLM) and categorizes the instructions into navigation and behavioral guidelines. Navigation guidelines consist of directional commands (e.g., “move forward until”) and associated landmarks (e.g., “the building with blue windows”), while behavioral guidelines encompass regulatory actions (e.
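The categorization step can be pictured with a simple rule-based stand-in for the LLM: split an instruction into clauses and route each one to the navigation or behavioral guidelines. The keyword cues below are hypothetical, not BehAV's actual prompting scheme:

```python
import re

# Hypothetical behavioral cues standing in for the LLM's categorization.
BEHAVIOR_CUES = ("stay on", "avoid", "do not", "keep off", "slow down")

def categorize(instruction):
    """Split a command into navigation and behavioral guidelines using
    simple keyword rules (a toy stand-in for BehAV's LLM-based step)."""
    nav, behav = [], []
    for clause in re.split(r"[;,]| and ", instruction.lower()):
        clause = clause.strip()
        if not clause:
            continue
        if any(cue in clause for cue in BEHAVIOR_CUES):
            behav.append(clause)
        else:
            nav.append(clause)
    return {"navigation": nav, "behavioral": behav}
```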
Overview of BoMuDA: The input consists of N source domains, from which the Best-Source is selected by the Alt-Inc algorithm. Alt-Inc proceeds in an unsupervised fashion to generate the final set of pseudo-labels used to perform boundless domain adaptation (DA). The final output is the segmentation map of an image in the target domain.
Multi-camera multi-object tracking (MCMOT) faces significant challenges in maintaining consistent object identities across varying camera perspectives, particularly when precise calibration and extensive annotations are required. In this paper, we present CALIBFREE, a self-supervised representation learning framework that does not need any calibration or manual labeling for the MCMOT task. By disentangling view-agnostic and view-specific features through single-view distillation and cross-view reconstruction, our method adapts to complex, dynamic scenarios with minimal overhead.