Research directions

Learning Acoustic Scattering Fields for Dynamic Interactive Sound Propagation

Abstract We present a novel hybrid sound propagation algorithm for interactive applications. Our approach is designed for dynamic scenes and uses a neural-network-based learned scattered field representation along with ray tracing to generate specular, diffuse, diffraction, and occlusion effects efficiently. We use geometric deep learning to approximate the acoustic scattering field using spherical harmonics. We use a large 3D dataset for training, and compare the accuracy of the learned representation against ground truth generated using an accurate wave-based solver.
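The spherical-harmonic representation can be made concrete with a small sketch. The example below is our own illustration, not the authors' code: it fits spherical-harmonic coefficients to a synthetic directional field by least squares, whereas in the paper a neural network predicts such coefficients for the scattering field. The field samples and sphere directions are placeholders.

```python
# Minimal sketch: projecting a sampled directional field onto a
# spherical-harmonic basis (illustrative; not the paper's pipeline).
import numpy as np
from scipy.special import sph_harm

def sh_basis(order, theta, phi):
    """Stack SH values Y_n^m up to `order` for directions given by
    azimuth `theta` in [0, 2pi) and inclination `phi` in [0, pi]."""
    cols = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            cols.append(sph_harm(m, n, theta, phi))
    return np.stack(cols, axis=-1)  # shape: (num_dirs, (order+1)^2)

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 500)    # azimuths of sample directions
phi = np.arccos(rng.uniform(-1, 1, 500))  # inclinations, uniform on sphere
field = rng.standard_normal(500)          # placeholder pressure samples

B = sh_basis(order=3, theta=theta, phi=phi)
# Least-squares fit of SH coefficients; the paper's network would
# predict coefficients like these instead of fitting them.
coeffs, *_ = np.linalg.lstsq(B, field.astype(complex), rcond=None)
reconstruction = (B @ coeffs).real
```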

Learning Unseen Emotions from Gestures via Semantically-Conditioned Zero-Shot Perception with Adversarial Autoencoders

Abstract We present a novel generalized zero-shot algorithm to recognize perceived emotions from gestures. Our task is to map gestures to novel emotion categories not encountered in training. We introduce adversarial autoencoder-based representation learning that correlates 3D motion-captured gesture sequences with the vectorized representations of natural-language perceived emotion terms using word2vec embeddings. The language-semantic embedding provides a representation of the emotion label space, and we leverage this underlying distribution to map gesture sequences to the appropriate categorical emotion labels.
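The zero-shot mapping can be sketched as nearest-neighbor classification in the label-embedding space. In the example below, the random 300-dimensional vectors stand in for real word2vec embeddings, and `encode_gesture` is a hypothetical placeholder for the trained adversarial autoencoder; only the overall scheme follows the paper.

```python
# Minimal sketch: a gesture is encoded into the word2vec label space and
# assigned the nearest emotion term, including terms unseen in training.
import numpy as np

rng = np.random.default_rng(0)
emotions = ["happy", "sad", "angry", "proud"]  # "proud" plays the unseen class
label_emb = {e: rng.standard_normal(300) for e in emotions}  # fake word2vec

def encode_gesture(gesture_seq):
    # Hypothetical stand-in for the trained encoder mapping a motion-capture
    # sequence (frames x joints x 3) into the 300-d semantic space.
    return gesture_seq.reshape(-1, 3).mean(axis=0).repeat(100)

def classify(gesture_seq):
    z = encode_gesture(gesture_seq)
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(emotions, key=lambda e: cos(z, label_emb[e]))

gesture = rng.standard_normal((120, 25, 3))  # 120 frames, 25 joints
print(classify(gesture))
```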

Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes

Abstract We present an end-to-end binaural audio rendering approach (Listen2Scene) for virtual reality (VR) and augmented reality (AR) applications. We propose a novel neural-network-based binaural sound propagation method to generate acoustic effects for 3D models of real environments. Any clean (dry) audio can be convolved with the generated acoustic effects to render audio that matches the real environment. We propose a graph neural network that uses both the material and the topology information of the 3D scene to generate a scene latent vector.
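The rendering step, convolving dry audio with a generated impulse response, can be sketched directly. In the example below, a synthetic decaying-noise stereo IR stands in for Listen2Scene's output; the dry signal is also a placeholder.

```python
# Minimal sketch: per-channel convolution of dry audio with a binaural IR.
import numpy as np
from scipy.signal import fftconvolve

fs = 48000
rng = np.random.default_rng(0)
t = np.arange(fs) / fs
ir = rng.standard_normal((2, fs)) * np.exp(-6.0 * t)  # fake 1 s binaural IR
dry = rng.standard_normal(fs * 2)                     # placeholder dry audio

wet = np.stack([fftconvolve(dry, ir[ch])[: dry.size] for ch in range(2)])
wet /= np.max(np.abs(wet))  # normalize to avoid clipping
```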

Locomotion in Virtual Reality

Overview Virtual Reality locomotion centers on developing techniques that enable users to explore large virtual environments. Exploration can be done through various interfaces, such as teleportation, flying, or natural walking, each with its own pros and cons. Our group is currently focused on improving locomotion interfaces based on natural walking, owing to its benefits for the user's sense of presence in the virtual experience. Software: Pasumi is a library for simulating users who walk through virtual environments while being steered through the physical environment using redirected walking; a minimal sketch of the underlying rotation-gain idea follows.
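Redirected walking, the technique Pasumi simulates, subtly scales the mapping between physical and virtual rotation so the user can be steered inside the room without noticing. The sketch below is an assumption-laden illustration, not Pasumi's API: the gain limits are common values from perceptual studies, and `redirected_yaw` is a hypothetical helper.

```python
# Minimal sketch of rotation gains for redirected walking: the virtual
# camera turns slightly more or less than the user's physical head.
import numpy as np

MIN_GAIN, MAX_GAIN = 0.67, 1.24  # illustrative perceptual thresholds

def redirected_yaw(physical_yaw_delta, steer_direction):
    """Scale the user's physical yaw change (radians) by a gain chosen to
    drift their physical heading toward `steer_direction` (+1 or -1)."""
    if np.sign(physical_yaw_delta) == steer_direction:
        gain = MAX_GAIN  # amplify turns in the desired direction
    else:
        gain = MIN_GAIN  # compress turns away from it
    return gain * physical_yaw_delta

virtual_yaw = 0.0
for delta in np.deg2rad([2.0, -1.5, 3.0]):  # sample head-turn increments
    virtual_yaw += redirected_yaw(delta, steer_direction=+1)
```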

Low-frequency Compensated Synthetic Impulse Responses for Improved Far-field Speech Recognition

Abstract We propose a method for generating low-frequency compensated synthetic impulse responses that improve the performance of far-field speech recognition systems trained on artificially augmented datasets. We design linear-phase filters that adapt the simulated impulse responses to equalization distributions corresponding to real-world captured impulse responses. Our filtered synthetic impulse responses are then used to augment clean speech data from the LibriSpeech dataset [1]. We evaluate the performance of our method on the real-world LibriSpeech test set.
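The equalization idea can be illustrated with a standard linear-phase FIR design. In the sketch below, the band frequencies and target gains are made-up placeholders for the paper's measured equalization distributions; `scipy.signal.firwin2` yields a linear-phase filter by construction.

```python
# Minimal sketch: design a linear-phase FIR filter that boosts the low
# frequencies of a synthetic IR, then apply it (illustrative values only).
import numpy as np
from scipy.signal import firwin2, fftconvolve

fs = 16000
freqs = [0, 62.5, 125, 250, 500, 1000, fs / 2]  # band points in Hz
gains = [4.0, 4.0, 2.0, 1.2, 1.0, 1.0, 1.0]     # low-frequency boost (made up)
eq = firwin2(numtaps=1023, freq=freqs, gain=gains, fs=fs)  # linear phase

rng = np.random.default_rng(0)
synthetic_ir = rng.standard_normal(fs) * np.exp(-8 * np.arange(fs) / fs)
compensated_ir = fftconvolve(synthetic_ir, eq)[: synthetic_ir.size]
```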

M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers

Abstract We present a novel architecture for 3D object detection, M3DeTR, which combines different point cloud representations (raw points, voxels, bird's-eye view) with different feature scales based on multi-scale feature pyramids. M3DeTR is the first approach that simultaneously unifies multiple point cloud representations and feature scales and models mutual relationships among point cloud features using transformers. We perform extensive ablation experiments that highlight the benefits of fusing multiple representations and scales and of modeling mutual relationships.
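The mutual-relation modeling can be pictured with a short sketch. The following is a minimal illustration, not M3DeTR's architecture or code: placeholder feature tokens from the three representations are concatenated and passed through a shared PyTorch transformer encoder so every representation attends to every other.

```python
# Minimal sketch: cross-representation attention over concatenated tokens.
import torch
import torch.nn as nn

d_model = 256
raw_feats = torch.randn(1, 128, d_model)   # placeholder per-point features
voxel_feats = torch.randn(1, 64, d_model)  # placeholder voxel features
bev_feats = torch.randn(1, 32, d_model)    # placeholder bird's-eye features

tokens = torch.cat([raw_feats, voxel_feats, bev_feats], dim=1)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
fused = encoder(tokens)  # every representation attends to every other
```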

M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues

Abstract We present M3ER, a learning-based method for emotion recognition from multiple input modalities. Our approach combines cues from multiple co-occurring modalities (such as face, text, and speech) and is more robust than other methods to sensor noise in any of the individual modalities. M3ER models a novel, data-driven multiplicative fusion method to combine the modalities, which learns to emphasize the more reliable cues and suppress the others on a per-sample basis.
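As a rough illustration of multiplicative fusion, the sketch below combines per-modality class probabilities so that a modality's contribution is down-weighted when the other modalities are already confident. The probabilities, the weighting rule, and the exponent `beta` are simplified assumptions for illustration, not M3ER's exact training objective.

```python
# Minimal sketch: multiplicative weighting of per-modality class scores.
import torch

probs = {
    "face":   torch.tensor([0.70, 0.20, 0.10]),
    "text":   torch.tensor([0.60, 0.30, 0.10]),
    "speech": torch.tensor([0.25, 0.40, 0.35]),  # noisier cue
}
beta = 0.5
names = list(probs)
fused = torch.zeros(3)
for m in names:
    # Weight modality m by how unconfident the *other* modalities are,
    # so confident modalities dominate on a per-sample basis.
    others = [probs[n] for n in names if n != m]
    weight = torch.stack([(1 - p) ** beta for p in others]).prod(dim=0)
    fused += weight * probs[m]
print(fused.argmax().item())  # predicted class index
```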

MESH2IR: Neural Acoustic Impulse Response Generator for Complex 3D Scenes

Abstract We propose a mesh-based neural network (MESH2IR) to generate acoustic impulse responses (IRs) for indoor 3D scenes represented using a mesh. The IRs are used to create a high-quality sound experience in interactive applications and audio processing. Our method can handle input triangular meshes with arbitrary topologies (2K - 3M triangles). We present a novel technique for training MESH2IR using energy decay relief and highlight its benefits. We also show that training MESH2IR on IRs preprocessed using our proposed technique significantly improves the accuracy of IR generation.
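The energy-decay quantities behind the training technique can be illustrated with Schroeder backward integration. The sketch below is our own illustration, not the authors' preprocessing code: it computes a broadband energy decay curve and a simple per-band variant (via an STFT) from a synthetic impulse response.

```python
# Minimal sketch: energy decay curve and a per-band decay relief of an IR.
import numpy as np
from scipy.signal import stft

fs = 16000
rng = np.random.default_rng(0)
ir = rng.standard_normal(fs) * np.exp(-7 * np.arange(fs) / fs)  # fake IR

# Broadband energy decay curve in dB (Schroeder backward integration).
edc = np.cumsum(ir[::-1] ** 2)[::-1]
edc_db = 10 * np.log10(edc / edc[0])

# Per-band variant: backward-integrate the energy in each STFT band.
f, t, Z = stft(ir, fs=fs, nperseg=256)
band_energy = np.abs(Z) ** 2
edr = np.cumsum(band_energy[:, ::-1], axis=1)[:, ::-1]  # (bands, frames)
```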

METEOR: A Dense, Heterogeneous, and Unstructured Traffic Dataset With Rare Behaviors

METEOR is a new and complex traffic dataset capturing unstructured driving scenarios in India. The dataset totals 100 GB and is growing, with more than 1,000 one-minute video clips, over 2 million annotated frames with ego-vehicle trajectories, and more than 13 million bounding boxes. Each frame contains up to 40 total agents and 9 unique agents. Annotations include rare and interesting driving behaviors such as cut-ins, yielding, overtaking, overspeeding, zigzagging, and rule-breaking.