Abstract We present AutoSpatial, a novel and efficient approach that uses structured spatial grounding to enhance VLMs’ spatial reasoning. By combining minimal manual supervision with large-scale auto-labeling of Visual Question-Answering (VQA) pairs, our approach tackles the challenge of VLMs’ limited spatial understanding in social navigation tasks. By applying a hierarchical two-round VQA strategy during training, AutoSpatial achieves both global and detailed understanding of scenarios, demonstrating more accurate spatial perception, movement prediction, Chain-of-Thought (CoT) reasoning, final actions, and explanations than other state-of-the-art approaches.
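The two-round strategy can be pictured as a coarse scene-level query followed by fine-grained follow-ups conditioned on the first answer. A hypothetical sketch, not the AutoSpatial pipeline itself: `ask_vlm` stands in for any VLM API call, and the questions are illustrative.

```python
def two_round_vqa(image, ask_vlm):
    """Hierarchical two-round VQA: first ask a global scene-level
    question, then fine-grained follow-ups conditioned on its answer.
    `ask_vlm(image, question)` is a placeholder for any VLM call."""
    global_answer = ask_vlm(image, "Describe the overall spatial layout of the scene.")
    detail_questions = [
        "Where are the pedestrians relative to the robot?",
        "Which direction should the robot move next, and why?",
    ]
    details = {
        q: ask_vlm(image, f"Given that {global_answer}, {q}")
        for q in detail_questions
    }
    return global_answer, details
```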
Overview Artificial reverberation has been added to anechoic speech data to train more robust machine learning models for automatic speech processing. We are developing methods for automatic speech recognition, source separation and localization, binaural audio generation, and speech emotion recognition.
Software pygsound: pygsound is a Python package for impulse response generation built on a state-of-the-art geometric sound propagation engine. The simulation core is implemented in C++ and exposed to Python through pybind11.
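Impulse responses generated this way are typically applied to anechoic speech by convolution. A minimal NumPy sketch of that step; the signal and the two-tap "room" impulse response below are made up for illustration:

```python
import numpy as np

def apply_reverb(dry, ir):
    """Convolve a dry (anechoic) signal with an impulse response,
    then normalize the peak to avoid clipping."""
    wet = np.convolve(dry, ir)
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# Toy example: a unit impulse through a two-tap impulse response
# (direct path plus one attenuated early reflection).
dry = np.array([1.0, 0.0, 0.0])
ir = np.array([1.0, 0.5])
wet = apply_reverb(dry, ir)
```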
– Overview Autonomous driving has become one of the most anticipated technologies in both industry and academic research. Most current efforts in autonomous driving have found success in idealized conditions such as sparse, homogeneous traffic on highways and in urban areas. The GAMMA group, instead, aims to advance autonomous driving research in highly dense and heterogeneous traffic conditions, accounting for the social and psychological aspects of human drivers in uncertain environments.
Abstract We present AutonoVi, a novel algorithm for autonomous vehicle navigation that supports dynamic maneuvers and satisfies traffic constraints and norms. Our approach is based on optimization-based maneuver planning that supports dynamic lane-changes, swerving, and braking in all traffic scenarios and guides the vehicle to its goal position. We take into account various traffic constraints, including collision avoidance with other vehicles, pedestrians, and cyclists using control velocity obstacles. We use a data-driven approach to model the vehicle dynamics for control and collision avoidance.
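At the core of a velocity-obstacle check is a collision-cone test: a candidate velocity is unsafe if the resulting relative velocity drives the agent into a disc of combined radius around the obstacle. A minimal 2-D sketch of that test, not the AutonoVi planner itself; the function name and geometry are illustrative:

```python
def in_velocity_obstacle(p_rel, v_rel, radius):
    """Return True if relative velocity v_rel (agent minus obstacle)
    eventually brings the agent within `radius` of the obstacle.
    p_rel is the obstacle position relative to the agent.
    Solves |p_rel - t * v_rel| <= radius for some t > 0."""
    px, py = p_rel
    vx, vy = v_rel
    c = px * px + py * py - radius * radius
    if c <= 0:
        return True                  # already overlapping
    a = vx * vx + vy * vy
    dot = px * vx + py * vy
    if a == 0 or dot <= 0:
        return False                 # not closing on the obstacle
    return dot * dot - a * c >= 0    # ray enters the disc at some t > 0
```

A safe velocity set can then be built by sampling candidate velocities and discarding those for which this test returns True for any nearby vehicle, pedestrian, or cyclist.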
Abstract Driven by the needs of real-world applications, Auxiliary Modality Learning (AML) makes it possible to exploit additional information from auxiliary data modalities during training while requiring only one or a few modalities at test time, reducing the overall computational cost and the amount of input data needed for inference. In this work, we formally define Auxiliary Modality Learning (AML), systematically classify the types of auxiliary modality (in visual computing) and architectures for AML, and analyze their performance.
Offline Training: We highlight our behavior-guided navigation policy for autonomous driving. We use a behavior-rich simulator that can generate aggressive or conservative driving styles. In Step 1, we use the CMetric behavior classification algorithm to compute a set of parameters that characterize aggressive behaviors such as over-speeding, overtaking, and sudden lane changes. In Step 2, we use these parameters to train a behavior-guided navigation policy for action prediction and local navigation.
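To illustrate how behavior parameters can label driving styles, here is a toy classifier that flags a trajectory as aggressive from over-speeding and frequent lane changes. This is not the CMetric algorithm (which derives its measures from centrality functions over traffic graphs); the features and thresholds below are made up:

```python
def classify_style(speeds_mps, lane_changes, speed_limit_mps,
                   overspeed_margin=2.0, lane_change_thresh=3):
    """Toy driving-style classifier: a trajectory is 'aggressive' if the
    driver exceeds the speed limit by a margin or changes lanes often.
    Illustrative only; thresholds and features are hypothetical."""
    overspeeding = max(speeds_mps) > speed_limit_mps + overspeed_margin
    frequent_lane_changes = lane_changes >= lane_change_thresh
    if overspeeding or frequent_lane_changes:
        return "aggressive"
    return "conservative"
```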
Abstract We present BehAV, a novel approach for autonomous robot navigation in outdoor scenes guided by human instructions and leveraging Vision Language Models (VLMs). Our method interprets human commands using a Large Language Model (LLM) and categorizes the instructions into navigation and behavioral guidelines. Navigation guidelines consist of directional commands (e.g., “move forward until”) and associated landmarks (e.g., “the building with blue windows”), while behavioral guidelines encompass regulatory actions (e.
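The categorization step can be pictured with a simple rule-based stand-in for the LLM: split an instruction into clauses and route each one to the navigation or behavioral guidelines. The keyword cues below are hypothetical, not BehAV's actual prompting scheme:

```python
import re

# Hypothetical behavioral cues standing in for the LLM's categorization.
BEHAVIOR_CUES = ("stay on", "avoid", "do not", "keep off", "slow down")

def categorize(instruction):
    """Split a command into navigation and behavioral guidelines using
    simple keyword rules (a toy stand-in for BehAV's LLM-based step)."""
    nav, behav = [], []
    for clause in re.split(r"[;,]| and ", instruction.lower()):
        clause = clause.strip()
        if not clause:
            continue
        if any(cue in clause for cue in BEHAVIOR_CUES):
            behav.append(clause)
        else:
            nav.append(clause)
    return {"navigation": nav, "behavioral": behav}
```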
Overview of BoMuDA: The input consists of N source domains, from which the Best-Source is selected by the Alt-Inc algorithm. Alt-Inc proceeds in an unsupervised fashion to generate the final set of pseudo-labels used to perform boundless domain adaptation (DA). The final output is the segmentation map of an image in the target domain.
Multi-camera multi-object tracking (MCMOT) faces significant challenges in maintaining consistent object identities across varying camera perspectives, particularly when precise calibration and extensive annotations are required. In this paper, we present CALIBFREE, a self-supervised representation learning framework that does not need any calibration or manual labeling for the MCMOT task. By disentangling view-agnostic and view-specific features through single-view distillation and cross-view reconstruction, our method adapts to complex, dynamic scenarios with minimal overhead.