Overview Household environments are visually diverse. Embodied agents performing Vision-and-Language Navigation (VLN) in the wild must be able to handle this diversity, while also following arbitrary language instructions. Recently, Vision-Language models like CLIP have shown great performance on the task of zero-shot object recognition. In this work, we ask if these models are also capable of zero-shot language grounding. In particular, we utilize CLIP to tackle the novel problem of zero-shot VLN using natural language referring expressions that describe target objects, in contrast to past work that used simple language templates describing object classes.
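The grounding idea can be sketched as follows. This is not the paper's implementation — CLIP embeds the referring expression and each candidate object crop into a shared space, and the crop with the highest cosine similarity to the text is taken as the target. Toy random vectors stand in for real CLIP embeddings here, and all names are illustrative.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ground_expression(text_emb, crop_embs):
    """Return the index of the image crop whose embedding is most similar
    to the referring-expression embedding (CLIP-style zero-shot matching)."""
    scores = [cosine_sim(text_emb, c) for c in crop_embs]
    return int(np.argmax(scores)), scores

# Toy stand-ins for CLIP embeddings (real CLIP outputs e.g. 512-d vectors).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=8)              # "the blue mug on the counter"
crop_embs = [rng.normal(size=8) for _ in range(3)]
crop_embs[1] = text_emb + 0.01 * rng.normal(size=8)  # make crop 1 the match

best, scores = ground_expression(text_emb, crop_embs)
```

In the zero-shot setting, no grounding-specific training is needed: the matching reduces entirely to similarity in CLIP's joint embedding space.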
Overview of CMetric: (left) The sensors on an autonomous vehicle observe the positions of other vehicles or road-agents; (middle) the vehicle positions and the spatial distances between them are represented as a Dynamic Geometric Graph (DGG); (right) CMetric uses the closeness and degree centrality functions to measure the driving style of each vehicle. These styles are used to classify a global driving behavior (such as aggressive or conservative) for each vehicle.
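A minimal sketch of the graph machinery above, with toy positions and thresholds — the actual CMetric style functions are more involved: vehicles within a radius are connected in the DGG, and degree/closeness centrality are computed per vehicle.

```python
import math
from collections import deque

def build_dgg(positions, radius):
    """Dynamic Geometric Graph: connect vehicles closer than `radius` meters."""
    n = len(positions)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(positions[i], positions[j]) < radius:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def degree_centrality(adj, v):
    # Fraction of other vehicles directly connected to v.
    return len(adj[v]) / (len(adj) - 1)

def closeness_centrality(adj, v):
    # BFS hop distances from v; vehicles unreachable from v are ignored.
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

positions = [(0, 0), (5, 0), (10, 0), (40, 0)]  # toy vehicle positions (m)
adj = build_dgg(positions, radius=8.0)
```

Here vehicle 1 sits between vehicles 0 and 2, so it has the highest degree and closeness, while the distant vehicle 3 is isolated.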
Abstract Observing social/physical distancing norms between humans has become an indispensable precaution to slow down the transmission of COVID-19. We present a novel method to automatically detect pairs of humans in a crowded scenario who are not maintaining social distancing, i.e., at least about 2 meters of space between them, using an autonomous mobile robot and existing CCTV (Closed-Circuit Television) cameras. The robot is equipped with commodity sensors, namely an RGB-D (Red Green Blue-Depth) camera and a 2-D lidar, to detect social distancing breaches within its sensing range and navigate towards the location of the breach.
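Once people have been localized (by the RGB-D camera, lidar, or CCTV), the breach test itself reduces to a pairwise distance check. A minimal sketch, with hypothetical detection IDs and toy ground-plane positions:

```python
import math
from itertools import combinations

def distancing_breaches(people, min_dist=2.0):
    """Return pairs of detected people standing closer than `min_dist` meters.
    `people` maps a detection id to an (x, y) ground-plane position in meters."""
    breaches = []
    for (a, pa), (b, pb) in combinations(people.items(), 2):
        if math.dist(pa, pb) < min_dist:
            breaches.append((a, b))
    return breaches

# Toy detections; in the real system these come from the robot's sensors.
detections = {"p1": (0.0, 0.0), "p2": (1.2, 0.0), "p3": (6.0, 1.0)}
breaches = distancing_breaches(detections)
```

The robot would then navigate towards the location of a flagged pair; that navigation step is outside this sketch.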
Overview COVID-19 or Coronavirus cases have spiked across the world. There were more than a million confirmed cases in the US as of May 6, 2020, an increase of more than 25,000 cases from the day before. To slow the spread of COVID-19, the CDC and WHO are encouraging people to practice "social distancing" measures, i.e., staying at least 2 meters away from other humans. With COVID-19, the goal of social distancing is to slow down the outbreak in order to reduce the chance of infection among high-risk populations and to reduce the burden on the health care system.
Abstract We present CROSS-GAiT, a novel algorithm for quadruped robots that uses Cross Attention to fuse terrain representations derived from visual and time-series inputs, including linear accelerations, angular velocities, and joint efforts. These fused representations are used to adjust the robot’s step height and hip splay, enabling adaptive gaits that respond dynamically to varying terrain conditions. We generate these terrain representations by processing visual inputs through a masked Vision Transformer (ViT) encoder and time-series data through a dilated causal convolutional encoder.
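The fusion step can be illustrated with a stripped-down scaled dot-product cross attention: time-series tokens act as queries and attend over visual tokens. This is a simplified numpy sketch, not the CROSS-GAiT architecture — the real model uses learned projections, multiple heads, and the encoders named above; the token shapes here are invented.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    """Scaled dot-product cross attention: each query token forms a
    weighted combination of the key/value tokens."""
    scores = queries @ keys_values.T / np.sqrt(d_k)  # (Tq, Tkv)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ keys_values                     # (Tq, d)

rng = np.random.default_rng(0)
d = 16
ts_tokens = rng.normal(size=(4, d))   # stand-in for dilated-conv encoder output
vis_tokens = rng.normal(size=(9, d))  # stand-in for masked-ViT patch embeddings
fused = cross_attention(ts_tokens, vis_tokens, d)
```

The fused terrain representation would then drive the gait parameters (step height, hip splay) downstream.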
Overview We present a novel approach to automatically synthesize “wayfinding instructions” for an embodied robot agent. In contrast to prior approaches that are heavily reliant on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions using just a few references. Using an LLM-based Visual Question Answering strategy, we gather detailed information about the environment which is used by the LLM for instruction synthesis.
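The in-context conditioning can be sketched as simple prompt assembly: a few reference instructions serve as few-shot examples, and VQA-gathered facts ground the new instruction. The prompt format below is hypothetical — the paper's actual template and VQA pipeline may differ.

```python
def build_prompt(references, env_facts, target):
    """Assemble a few-shot prompt: reference instructions condition the LLM,
    environment facts (gathered via VQA) ground the synthesized instruction."""
    shots = "\n".join(f"Example instruction: {r}" for r in references)
    facts = "\n".join(f"- {f}" for f in env_facts)
    return (
        f"{shots}\n\n"
        f"Known facts about the environment:\n{facts}\n\n"
        f"Write a wayfinding instruction that guides the agent to {target}."
    )

prompt = build_prompt(
    references=["Walk past the sofa and stop at the kitchen door."],
    env_facts=["A red armchair is left of the hallway.",
               "The bedroom is at the end of the hallway."],
    target="the bedroom",
)
```

Because only a few references are needed, the approach avoids the large human-annotated, platform-specific datasets that prior methods depend on.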
Overview This work explores human-robot trust from a robot's perspective, rather than from a human standpoint. It asks whether a robot, placed in an unknown environment, can trust the guidance humans give it. To answer this, we present a deep reinforcement learning approach in which a policy is learnt via a social reward. Paper: "Can a Robot Trust You?"
Overview We present LGX, a novel algorithm for Object Goal Navigation in a language-driven, zero-shot manner, where an embodied agent navigates to an arbitrarily described target object in a previously unexplored environment. Our approach leverages the capabilities of Large Language Models (LLMs) for making navigational decisions by mapping the LLM's implicit knowledge about the semantic context of the environment into sequential inputs for robot motion planning. We conduct experiments both in simulation and real-world environments, and showcase factors that influence the decision-making capabilities of LLMs for zero-shot navigation.
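One way to picture the decision step: the LLM's semantic knowledge scores how likely the target is to be found near the objects visible at each exploration frontier, and the best frontier becomes the next motion-planning subgoal. This is a heavily simplified sketch with a mocked-up scoring function standing in for the LLM query; LGX's actual decision procedure is described in the paper.

```python
def choose_subgoal(frontier_objects, relevance):
    """Pick the frontier whose visible objects are judged most likely to
    co-occur with the target; the winner becomes the next motion goal."""
    best_frontier, best = None, float("-inf")
    for frontier, objects in frontier_objects.items():
        score = max(relevance(obj) for obj in objects)
        if score > best:
            best_frontier, best = frontier, score
    return best_frontier

# Mocked stand-in for an LLM relevance query, e.g. for the target "toaster":
mock_scores = {"sofa": 0.1, "fridge": 0.9, "bed": 0.05}
frontiers = {"west": ["sofa", "bed"], "east": ["fridge"]}
choice = choose_subgoal(frontiers, lambda o: mock_scores.get(o, 0.0))
```

Semantic co-occurrence (a toaster is likelier near a fridge than a bed) is exactly the kind of implicit knowledge the LLM contributes without task-specific training.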
Abstract We present a novel approach to automatically identify driver behaviors from vehicle trajectories and use them for safe navigation of autonomous vehicles. We propose a novel set of features that can be easily extracted from car trajectories. We derive a data-driven mapping between these features and six driver behaviors using an elaborate web-based user study. We also compute a summarized score indicating a level of awareness that is needed while driving next to other vehicles.
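As an illustration of features that are easy to extract from a trajectory, the sketch below computes basic kinematic quantities from sampled 2-D positions. These toy features are not the paper's actual feature set, which is richer; the sampling rate and positions are invented.

```python
import numpy as np

def trajectory_features(xy, dt):
    """Simple kinematic features from a 2-D trajectory sampled every `dt`
    seconds: per-step speed, change in speed, and their summary statistics."""
    xy = np.asarray(xy, dtype=float)
    vel = np.diff(xy, axis=0) / dt        # (T-1, 2) velocity vectors
    speed = np.linalg.norm(vel, axis=1)   # scalar speed per step
    accel = np.diff(speed) / dt           # change in speed per step
    return {
        "mean_speed": float(speed.mean()),
        "max_speed": float(speed.max()),
        "max_accel": float(accel.max()) if accel.size else 0.0,
    }

track = [(0, 0), (1, 0), (3, 0), (6, 0)]  # toy positions sampled at 1 Hz
feats = trajectory_features(track, dt=1.0)
```

A data-driven mapping, such as the one learned from the user study, would then relate such features to behavior labels like aggressive or conservative.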
Abstract We present a real-time, data-driven algorithm to enhance the social invisibility of autonomous robot navigation within crowds. Our approach is based on prior psychological research, which reveals that people notice, and importantly react negatively to, groups of social actors that display negative group emotions or high entitativity, i.e., that move as a tight group with similar appearances and trajectories. To evaluate this behavior, we performed a user study and used its findings to develop navigation algorithms that minimize such negative emotional reactions.