Abstract We present a new method to capture the acoustic characteristics of real-world rooms using commodity devices, and use the captured characteristics to generate similar-sounding sources in virtual models. Given the captured audio and an approximate geometric model of a real-world room, we present a novel learning-based method to estimate its acoustic material properties. Our approach is based on deep neural networks that estimate the room's reverberation time and equalization from recorded audio.
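To make the estimation step concrete, here is a minimal PyTorch sketch (not the paper's released model) of a CNN that regresses per-band reverberation times (T60) and equalization gains from a log-mel spectrogram; the layer sizes and the seven-band split are illustrative assumptions.

```python
# A minimal sketch of a CNN regressor mapping a log-mel spectrogram of
# recorded speech to per-band reverberation times (T60) and EQ gains.
# Layer sizes and the number of bands are illustrative, not the paper's.
import torch
import torch.nn as nn

class RoomAcousticsNet(nn.Module):
    def __init__(self, n_bands: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> fixed-size embedding
        )
        # Two heads: sub-band T60 (seconds, kept positive) and EQ gains (dB).
        self.t60_head = nn.Sequential(nn.Linear(64, n_bands), nn.Softplus())
        self.eq_head = nn.Linear(64, n_bands)

    def forward(self, log_mel: torch.Tensor):
        # log_mel: (batch, 1, n_mels, n_frames)
        z = self.features(log_mel).flatten(1)
        return self.t60_head(z), self.eq_head(z)

net = RoomAcousticsNet()
t60, eq = net(torch.randn(2, 1, 64, 256))  # dummy spectrogram batch
print(t60.shape, eq.shape)                 # (2, 7), (2, 7)
```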
Abstract We propose a novel method for generating scene-aware training data for far-field automatic speech recognition. We use a deep learning-based estimator to non-intrusively compute the sub-band reverberation time of an environment from its speech samples. We model the acoustic characteristics of a scene by its sub-band reverberation times and represent them with a multivariate Gaussian distribution. We use this distribution to select acoustic impulse responses from a large real-world dataset for augmenting speech data.
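A minimal sketch of the selection step under stated assumptions: fit a multivariate Gaussian to the estimated sub-band T60 vectors of the target scene, then keep the impulse responses from the real-world dataset whose T60 profiles are most likely under that model. All arrays below are stand-ins, not the paper's data.

```python
# Fit a Gaussian scene model over sub-band T60 estimates, then score and
# select candidate impulse responses (IRs) by log-likelihood.
import numpy as np
from scipy.stats import multivariate_normal

# Per-utterance sub-band T60 estimates for the target scene (n_samples x n_bands);
# stand-in values in place of the DNN estimator's output.
scene_t60 = np.random.rand(50, 4) * 0.5 + 0.3

mu = scene_t60.mean(axis=0)
cov = np.cov(scene_t60, rowvar=False) + 1e-6 * np.eye(scene_t60.shape[1])
scene_model = multivariate_normal(mean=mu, cov=cov)

# Sub-band T60s of candidate IRs from a large real-world dataset (stand-ins).
ir_t60s = np.random.rand(1000, 4)
scores = scene_model.logpdf(ir_t60s)

# Keep the top-k IRs for convolving with clean speech during augmentation.
top_k = np.argsort(scores)[-100:]
print("selected IRs:", top_k[:5])
```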
Abstract We propose a novel learning framework for garment draping prediction that can incorporate arbitrary loss functions at runtime. Previous methods fail to address several inconsistencies that arise when enforcing physical constraints, such as wrinkle dynamics, (heterogeneous) material properties, and the (in)ability to fit a wide range of body shapes. To address these problems, we propose a semi-supervised learning framework composed of three key components. The first is physics-inspired supervision applied to a novel neural network that dynamically captures multi-scale features.
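As an illustration of what physics-inspired supervision can look like (a sketch under assumed mesh conventions, not the paper's loss), the snippet below combines a supervised data term with a differentiable edge-stretch energy; bending and collision terms would be added the same way.

```python
# Supervised data term plus a simple physics term (edge-stretch energy).
# Mesh topology, rest lengths, and the weight w_phys are illustrative.
import torch

def stretch_loss(verts: torch.Tensor, edges: torch.Tensor, rest_len: torch.Tensor):
    # verts: (V, 3); edges: (E, 2) vertex indices; rest_len: (E,) rest lengths.
    d = verts[edges[:, 0]] - verts[edges[:, 1]]
    return ((d.norm(dim=-1) - rest_len) ** 2).mean()

def training_loss(pred, target, edges, rest_len, w_phys=0.1):
    data_term = (pred - target).pow(2).mean()        # supervised term
    phys_term = stretch_loss(pred, edges, rest_len)  # physics-inspired term
    return data_term + w_phys * phys_term

V = torch.randn(100, 3)                  # predicted garment vertices (dummy)
E = torch.randint(0, 100, (200, 2))      # dummy edge list
L0 = torch.rand(200)                     # dummy rest lengths
print(training_loss(V, torch.randn(100, 3), E, L0))
```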
Abstract We propose a scalable neural network framework to reconstruct the 3D mesh of a human body from multi-view images, in the subspace of the SMPL model. The use of multi-view images significantly reduces the projection ambiguity of the problem, increasing the reconstruction accuracy of the 3D human body under clothing. Our experiments show that the method benefits from the synthetic dataset generated by our pipeline, since the pipeline offers flexible control over generation variables and provides ground truth for validation.
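A minimal sketch, assuming a shared per-view encoder and average fusion (both illustrative choices, not the paper's architecture), of regressing SMPL pose and shape parameters from multiple views:

```python
# Encode each view with a shared CNN, fuse by averaging, then regress SMPL
# pose (72-D axis-angle) and shape (10-D betas). Backbone sizes are dummies.
import torch
import torch.nn as nn

class MultiViewSMPLRegressor(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(feat_dim, 72 + 10)  # SMPL pose + shape

    def forward(self, views: torch.Tensor):
        # views: (batch, n_views, 3, H, W)
        b, v = views.shape[:2]
        feats = self.encoder(views.flatten(0, 1)).flatten(1).view(b, v, -1)
        fused = feats.mean(dim=1)  # view fusion reduces projection ambiguity
        out = self.head(fused)
        return out[:, :72], out[:, 72:]  # pose, shape

pose, shape = MultiViewSMPLRegressor()(torch.randn(2, 4, 3, 224, 224))
print(pose.shape, shape.shape)  # (2, 72), (2, 10)
```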
Abstract We propose a method to detect individualized highlights for users on given target videos based on their preferred highlight clips marked on previous videos they have watched. Our method explicitly leverages the contents of both the preferred clips and the target videos using pre-trained features for the objects and the human activities. We design a multi-head attention mechanism to adaptively weigh the preferred clips based on their object- and human-activity-based contents, and fuse them using these weights into a single feature representation for each user.
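A hedged sketch of this fusion step using PyTorch's nn.MultiheadAttention: a target-video segment queries the user's preferred-clip features, and the attention weights show how strongly each clip contributes. Dimensions and clip counts are illustrative.

```python
# Attention pooling over a user's preferred clips: the target segment is the
# query; keys/values are the clips' object- and activity-based features.
import torch
import torch.nn as nn

d = 256
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

clip_feats = torch.randn(1, 10, d)    # 10 preferred clips for this user
segment_feat = torch.randn(1, 1, d)   # one target-video segment (the query)

fused, weights = attn(query=segment_feat, key=clip_feats, value=clip_feats)
# `fused` is the single per-user representation; `weights` records how
# strongly each preferred clip influenced it.
print(fused.shape, weights.shape)     # (1, 1, 256), (1, 1, 10)
```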
Abstract In this paper, we propose a novel learning framework for autonomous systems, called the small-shot auxiliary modality distillation network (AMD-S-Net), that uses a small amount of "auxiliary information" to complement learning of the main modality. The AMD-S-Net contains a two-stream framework design that can fully extract information from different types of data (i.e., paired/unpaired multi-modality data) to distill knowledge more effectively. We also propose a novel training paradigm based on a "reset operation" that lets the teacher iteratively explore the local loss landscape near the student domain; this provides the student with local landscape information and potential directions toward better solutions, leading to higher learning performance.
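A minimal sketch of the reset operation under the assumption that teacher and student share an architecture: every few steps the teacher is reloaded from the student's current weights and then briefly optimized from that point, so its updates describe the loss landscape near the student. The loop structure and step callables are placeholders, not the paper's training code.

```python
# Periodically reset the teacher to the student's weights, then let the
# teacher explore locally before the student distills from it.
import copy
import torch

def reset_teacher(teacher: torch.nn.Module, student: torch.nn.Module):
    # Assumes identical architectures so the state dict transfers directly.
    teacher.load_state_dict(copy.deepcopy(student.state_dict()))

def train_loop(student, teacher, steps, reset_every, teacher_step, student_step):
    for t in range(steps):
        if t % reset_every == 0:
            reset_teacher(teacher, student)  # move teacher into student's domain
        teacher_step(teacher)                # explore the local loss landscape
        student_step(student, teacher)       # distill from the nearby teacher
```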
Overview As robots become increasingly ubiquitous in our everyday lives, it becomes essential to study their behaviour in real-world social environments. Robots placed in human-centred environments need to make decisions that conform to social norms in a flexible manner. They need to learn to navigate and interact with humans in a socially aware manner. Humans need to feel safe enough to trust robot actions in everyday environments. Similarly, a robot that depends on human assistance needs to identify whether the guidance given to it is trustworthy.
Abstract Most existing social robot navigation techniques either leverage hand-crafted rules or human demonstrations to connect robot perception to socially compliant actions. However, there remains a significant gap in effectively translating perception into socially compliant actions, a translation that human reasoning performs naturally in dynamic environments. Considering the recent success of Vision-Language Models (VLMs), we propose using language to bridge this gap, enabling human-like reasoning between perception and socially aware robot actions.
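One way to picture the idea (a sketch, not the paper's system): render the robot's camera view into a prompt, ask a VLM for the most socially compliant action, and map the answer back to the robot's action set. query_vlm below is a hypothetical stand-in for whatever VLM API is available.

```python
# Prompt a vision-language model with the robot's view and a social-context
# question, then parse its answer into a discrete navigation action.
ACTIONS = ["continue", "slow down", "stop and yield", "move right", "move left"]

def choose_action(image_bytes: bytes, query_vlm) -> str:
    # query_vlm(prompt, image) -> str is a hypothetical VLM interface.
    prompt = (
        "You are guiding a mobile robot in a hallway with pedestrians. "
        "Given the attached camera view, which action is most socially "
        f"compliant? Answer with exactly one of: {', '.join(ACTIONS)}."
    )
    answer = query_vlm(prompt, image_bytes).strip().lower()
    return answer if answer in ACTIONS else "stop and yield"  # safe fallback
```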
Abstract We present a real-time, data-driven algorithm to enhance the social invisibility of autonomous vehicles within crowds. Our approach is based on prior psychological research, which reveals that people notice and, importantly, react negatively to groups of social actors with high entitativity, i.e., those moving in a tight group with similar appearances and trajectories. Building on that finding, we performed a user study and used its results to develop navigational algorithms that minimize entitativity. The study establishes a mapping between emotional reactions and multi-robot trajectories and appearances, and generalizes these findings across various environmental conditions.
Abstract We present a real-time, data-driven algorithm to enhance the social invisibility of robots within crowds. Our approach is likewise based on prior psychological research showing that people notice and, importantly, react negatively to groups of social actors with high entitativity, i.e., those moving in a tight group with similar appearances and trajectories. Building on that finding, we performed a user study and used its results to develop navigational algorithms that minimize entitativity. The study establishes a mapping between emotional reactions and multi-robot trajectories and appearances, and generalizes these findings across various environmental conditions.
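As a rough illustration of how entitativity might be quantified from multi-robot state (the terms and weights below are illustrative, not the study's fitted mapping): a tight group with uniform headings and speeds scores high, so a planner can penalize this score to keep the group socially inconspicuous.

```python
# Score group entitativity from spatial tightness and motion similarity;
# a navigation planner would add this score as a penalty term.
import numpy as np

def entitativity_score(positions: np.ndarray, velocities: np.ndarray) -> float:
    # positions, velocities: (n_robots, 2) at one time step.
    spread = np.linalg.norm(positions - positions.mean(axis=0), axis=1).mean()
    speed_var = np.var(np.linalg.norm(velocities, axis=1))
    headings = np.arctan2(velocities[:, 1], velocities[:, 0])
    heading_var = np.var(headings)
    # Tighter groups and more uniform motion -> higher entitativity.
    return 1.0 / (1e-3 + spread) + 1.0 / (1e-3 + speed_var + heading_var)

pos = np.random.rand(5, 2) * 10   # dummy robot positions
vel = np.random.randn(5, 2)       # dummy robot velocities
print(entitativity_score(pos, vel))
```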