Research directions

UAVTwin: Neural Digital Twins for UAVs using Gaussian Splatting

We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs). Specifically, our approach focuses on synthesizing foreground components, such as various human instances in motion within complex scene backgrounds, from UAV perspectives. This is achieved by integrating 3D Gaussian Splatting (3DGS) for reconstructing backgrounds along with controllable synthetic human models that display diverse appearances and actions in multiple poses.

V-Trans4Style: Visual Transition Recommendation for Video Production Style Adaptation

Pooja Guhan¹, Tsung-Wei Huang², Guan-Ming Su², Subhadra Gopalakrishnan², Dinesh Manocha¹
¹University of Maryland College Park, ²Dolby Laboratories

Abstract: We introduce V-Trans4Style, an innovative algorithm tailored for dynamic video content editing needs. It is designed to adapt videos to different production styles like documentaries, dramas, feature films, or a specific YouTube channel’s video-making technique. Our algorithm recommends optimal visual transitions to help achieve this flexibility using a more bottom-up approach. We first employ a transformer-based encoder-decoder network to learn to recommend temporally consistent and visually seamless sequences of visual transitions using only the input videos.

VAPOR: Legged Robot Navigation in Unstructured Outdoor Environments using Offline Reinforcement Learning

Abstract: We present VAPOR, a novel method for autonomous legged robot navigation in unstructured, densely vegetated outdoor environments using offline Reinforcement Learning (RL). Our method trains a novel RL policy using an actor-critic network and arbitrary data collected in real outdoor vegetation. Our policy uses height and intensity-based cost maps derived from 3D LiDAR point clouds, a goal cost map, and processed proprioception data as state inputs, and learns the physical and geometric properties of the surrounding obstacles such as height, density, and solidity/stiffness.
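
As a rough illustration of what height and intensity-based cost maps derived from a LiDAR point cloud could look like, the sketch below rasterises points into a small grid. The grid layout, the normalisation constants, and the function name `build_cost_maps` are assumptions made for this example, not details taken from VAPOR.

```python
from typing import List, Tuple

Point = Tuple[float, float, float, float]  # (x, y, z, intensity)
Grid = List[List[float]]


def build_cost_maps(
    points: List[Point],
    grid_size: int = 4,           # assumed: square grid of grid_size x grid_size cells
    cell: float = 1.0,            # assumed: cell edge length in metres
    max_height: float = 2.0,      # assumed: heights at or above this get cost 1.0
    max_intensity: float = 255.0  # assumed: sensor intensity range
) -> Tuple[Grid, Grid]:
    """Return (height_cost, intensity_cost), both with values in [0, 1]."""
    peak = [[0.0] * grid_size for _ in range(grid_size)]
    total = [[0.0] * grid_size for _ in range(grid_size)]
    count = [[0] * grid_size for _ in range(grid_size)]

    for x, y, z, intensity in points:
        col, row = int(x // cell), int(y // cell)
        # Points outside the grid are ignored.
        if 0 <= row < grid_size and 0 <= col < grid_size:
            peak[row][col] = max(peak[row][col], z)
            total[row][col] += intensity
            count[row][col] += 1

    # Height cost: tallest point per cell, clipped to [0, 1].
    height_cost = [[min(h / max_height, 1.0) for h in r] for r in peak]
    # Intensity cost: mean intensity per cell, clipped to [0, 1]; empty cells cost 0.
    intensity_cost = [
        [
            min(total[r][c] / count[r][c] / max_intensity, 1.0) if count[r][c] else 0.0
            for c in range(grid_size)
        ]
        for r in range(grid_size)
    ]
    return height_cost, intensity_cost
```

In this sketch the two grids would be stacked with a goal cost map and proprioception features to form the policy's state input; how VAPOR actually combines them is not described in this excerpt.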

VERN: Vegetation-aware Robot Navigation in Dense Unstructured Outdoor Environments

Abstract: We propose a novel method for autonomous legged robot navigation in densely vegetated environments with a variety of pliable/traversable and non-pliable/untraversable vegetation. We present a novel few-shot learning classifier that can be trained on a few hundred RGB images to differentiate flora that can be navigated through from flora that must be circumvented. Using the vegetation classification and 2D lidar scans, our method constructs a vegetation-aware traversability cost map that accurately represents the pliable and non-pliable obstacles with lower and higher traversability costs, respectively.
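
The idea of assigning lower costs to pliable obstacles and higher costs to non-pliable ones can be sketched as follows. The cost values, the label strings, and the rule of treating unlabelled obstacles as non-pliable are illustrative assumptions, not values reported for VERN.

```python
from typing import Dict, List, Optional

# Hypothetical cost values chosen for illustration only.
COSTS: Dict[str, float] = {"pliable": 0.3, "non_pliable": 1.0}
FREE_COST = 0.0


def vegetation_cost_map(
    occupied: List[List[bool]],
    labels: List[List[Optional[str]]],
) -> List[List[float]]:
    """Combine lidar occupancy with per-cell vegetation labels into a cost map.

    occupied[r][c] is True where the lidar scan reports an obstacle.
    labels[r][c] is the classifier output for that cell, or None if unknown.
    """
    cost_map: List[List[float]] = []
    for occ_row, label_row in zip(occupied, labels):
        row: List[float] = []
        for is_occupied, label in zip(occ_row, label_row):
            if not is_occupied:
                row.append(FREE_COST)
            else:
                # Assumption: unknown or unrecognised labels are treated
                # conservatively as non-pliable.
                row.append(COSTS.get(label or "", COSTS["non_pliable"]))
        cost_map.append(row)
    return cost_map
```

A planner consuming this map would prefer free space, tolerate pushing through pliable vegetation, and route around non-pliable obstacles.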

VL-TGS: Trajectory Generation and Selection using Vision Language Models in Mapless Outdoor Environments

Abstract: We present a multi-modal trajectory generation and selection algorithm for real-world mapless outdoor navigation in human-centered environments. Such environments contain rich features like crosswalks, grass, and curbs, which are easily interpretable by humans, but not by mobile robots. We aim to compute suitable trajectories that (1) satisfy the environment-specific traversability constraints and (2) generate human-like paths while navigating on crosswalks, sidewalks, etc. Our formulation uses a Conditional Variational Autoencoder (CVAE) generative model enhanced with traversability constraints to generate multiple candidate trajectories for global navigation.

VLM-GroNav: Robot Navigation Using Physically Grounded Vision-Language Models in Outdoor Environments

Abstract: We present a novel autonomous robot navigation algorithm for outdoor environments that is capable of handling diverse terrain traversability conditions. Our approach, VLM-GroNav, uses vision-language models (VLMs) and integrates them with physical grounding that is used to assess intrinsic terrain properties such as deformability and slipperiness. We use proprioceptive-based sensing, which provides direct measurements of these physical properties, and enhances the overall semantic understanding of the terrains. Our formulation uses in-context learning to ground the VLM’s semantic understanding with proprioceptive data to allow dynamic updates of traversability estimates based on the robot’s real-time physical interactions with the environment.

VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models

Abstract: We propose VLM-Social-Nav, a novel Vision-Language Model (VLM) based navigation approach to compute a robot’s motion in human-centered environments. Our goal is to make real-time decisions on robot actions that are socially compliant with human expectations. We utilize a perception model to detect important social entities and prompt a VLM to generate guidance for socially compliant robot behavior. VLM-Social-Nav uses a VLM-based scoring module that computes a cost term that ensures socially appropriate and effective robot actions generated by the underlying planner.
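
One way to picture a scoring module that adds a social cost term to a planner is sketched below. It assumes the VLM guidance has already been converted into a suggested (linear, angular) velocity and that the social cost is the Euclidean distance to that suggestion. Both choices, along with the names `social_cost`, `select_action`, and `weight`, are assumptions for illustration rather than the formulation used in VLM-Social-Nav.

```python
import math
from typing import List, Sequence, Tuple

Action = Tuple[float, float]  # (linear velocity, angular velocity)


def social_cost(action: Action, suggested: Action) -> float:
    """Assumed social cost: distance between a candidate and the VLM suggestion."""
    return math.hypot(action[0] - suggested[0], action[1] - suggested[1])


def select_action(
    candidates: List[Action],
    planner_costs: Sequence[float],
    suggested: Action,
    weight: float = 1.0,  # hypothetical trade-off between planner and social cost
) -> Action:
    """Pick the candidate minimising planner cost plus weighted social cost."""
    if len(candidates) != len(planner_costs):
        raise ValueError("each candidate needs exactly one planner cost")
    best = min(
        range(len(candidates)),
        key=lambda i: planner_costs[i] + weight * social_cost(candidates[i], suggested),
    )
    return candidates[best]
```

With `weight=0.0` the selection falls back to the underlying planner's own ranking, which makes the influence of the social term easy to inspect.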

Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments

Abstract: We introduce Vision-Language Attention Distillation (Vi-LAD), a novel approach for distilling socially compliant navigation knowledge from a large Vision-Language Model (VLM) into a lightweight transformer model for real-time robotic navigation. Unlike traditional methods that rely on expert demonstrations or human-annotated datasets, Vi-LAD performs knowledge distillation and fine-tuning at the intermediate layer representation level (i.e., attention maps) by leveraging the backbone of a pre-trained vision-action model. These attention maps highlight key navigational regions in a given scene, which serve as implicit guidance for socially aware motion planning.
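
Distillation at the attention-map level means the student is penalised when its attention map differs from the teacher's. The sketch below uses a mean squared error between sum-normalised maps as a stand-in loss. The loss actually used by Vi-LAD is not stated in this excerpt, so the choice of MSE and the function names are assumptions.

```python
from typing import List

AttentionMap = List[List[float]]


def normalise(attention: AttentionMap) -> AttentionMap:
    """Scale a non-negative map so its entries sum to 1 (uniform if all zero)."""
    total = sum(sum(row) for row in attention)
    n_cells = sum(len(row) for row in attention)
    if total == 0:
        return [[1.0 / n_cells for _ in row] for row in attention]
    return [[value / total for value in row] for row in attention]


def attention_distillation_loss(student: AttentionMap, teacher: AttentionMap) -> float:
    """Mean squared error between normalised student and teacher attention maps."""
    s, t = normalise(student), normalise(teacher)
    squared = [
        (sv - tv) ** 2
        for s_row, t_row in zip(s, t)
        for sv, tv in zip(s_row, t_row)
    ]
    return sum(squared) / len(squared)
```

Because both maps are normalised first, the loss in this sketch depends only on where attention is placed, not on its overall scale.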

Video Manipulations Beyond Faces: A Dataset with Human-Machine Analysis

Trisha Mittal¹, Ritwik Sinha², Viswanathan Swaminathan², John Collomosse², Dinesh Manocha¹
¹University of Maryland, ²Adobe Research

Abstract: As tools for content editing mature, and artificial intelligence (AI) based algorithms for synthesizing media grow, the presence of manipulated content across online media is increasing. This phenomenon causes the spread of misinformation, creating a greater need to distinguish between “real” and “manipulated” content. To this end, we present VideoSham, a dataset consisting of 826 videos (413 real and 413 manipulated).