I am a senior undergraduate student at the Indian Institute of Technology Kanpur.
At my university, I am part of the Visual Computing Lab, supervised by Prof. Vinay Namboodiri. My research interests are Computer Vision, Deep Learning, and Computer Graphics (VR/AR).
Besides being a student, I also currently work part-time as a research consultant for Fyusion Inc. (a 3D computer vision and machine learning company based in San Francisco), where I solve computer vision problems using deep learning. A recent TechCrunch post about the company can be found here.
Last summer I was a research intern at Fyusion Inc., San Francisco, and the summer before that, I was a research intern with the wonderful Graphdeco team at Inria Sophia Antipolis, France.
Besides research, I also like to tinker with new sensors such as the Google Tango, Oculus Rift, and Leap Motion in my free time. I sometimes post about these experiments on YouTube. I believe in the open-source philosophy, so I also try to open-source some of my experiments on GitHub.
News
- I will be attending Siggraph Asia 2017, Bangkok, Thailand.
- I will be presenting my paper at ISMAR 2017, Nantes, France.
- I will be interning at Fyusion Inc., San Francisco for the summer of 2017.
- I will be interning at Graphdeco, Inria Sophia Antipolis for the summer of 2016.
Understanding the motion of objects in order to predict and control their movements is one of the crucial problems in Artificial Intelligence (AI). Humans and a large number of animals evidently possess this extraordinary ability to manipulate object motion using visual inputs alone; for example, it enables humans to drive vehicles safely, play games like billiards and football, and navigate crowded or unfamiliar environments. This seemingly simple, but stark, ability raises a natural question: how is visual input used to infer and understand the motion of surrounding objects? Some recent works propose an explanatory framework based on generative physics representations, which states that the brain infers and retains a noisy but detailed representation of the physics underlying the motion of objects, and uses generative simulations based on these representations to predict object motion. In view of these works, it is extremely interesting and intellectually challenging to study how such representations of the physics underlying visual inputs can be modeled. Moreover, a model of the physical constraints on visual inputs can be used to perform simulations, forecast object interactions, and generate video sequences of moving objects, all of which are highly challenging problems in computer vision. We therefore aim to study the problem of modeling abstractions of the physics underlying given visual inputs, and propose a contextual RNN-GAN based approach to learning these models.
Abstract, Report, Presentation
We propose to develop a model for depth map estimation for an RGB sensor using supervision transfer across two modalities: (i) RGB and (ii) depth. The problem can be described as follows: we want to build a depth estimation model for a given RGB sensor. For this task, we consider an additional RGBD sensor and capture image frames that have some overlap in their fields of view. We aim to learn CNN-based depth representations for the RGB sensor using information from the two sensors via supervision transfer across the image modalities. We next aim to learn "invertible" CNNs, yielding a partial network that can generate depth maps. If successful, this novel approach would add an extra modality, depth, to the RGB sensor. The motivations for taking up this ambitious project are twofold: (i) it is an unexplored problem with significant research value in contemporary computer vision, and (ii) if successful, the approach would have multiple applications of immense practical importance. One example is autonomous cars, where the camera and depth sensor do not have identical fields of view. It could also be used to build low-cost motion capture systems by replacing some of their many RGBD sensors with RGB cameras.
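The core of supervision transfer is to train the depth branch so that its mid-level features match those produced by a frozen RGB network on paired, overlapping frames. A minimal numpy sketch of this objective, with toy linear maps standing in for the two CNNs (all names and sizes here are illustrative, not the actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
F_rgb = rng.normal(size=(16, 8))      # frozen "RGB network" (a linear map)
W = rng.normal(size=(16, 12))         # depth branch, to be learned
X_rgb = rng.normal(size=(8, 100))     # paired RGB inputs
X_depth = rng.normal(size=(12, 100))  # paired depth inputs

# supervision: mid-level features produced by the frozen RGB branch
targets = F_rgb @ X_rgb

# train the depth branch to reproduce those features (L2 feature matching)
lr = 0.1
losses = []
for step in range(200):
    pred = W @ X_depth
    err = pred - targets
    losses.append(float(np.mean(err ** 2)))
    W -= lr * (2.0 / err.size) * err @ X_depth.T
```

Once the depth branch mimics the RGB features, the remaining ("invertible") part of the pipeline would map those shared features to depth values.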
Research Intern, INRIA Sophia Antipolis, France (supervised by George Drettakis, Adrien Bousseau, Erik Reinhard)
May to July 2016
Abstract, Presentation, Report-1, Report-2
We describe three approaches for texture synthesis based on material editing of indoor scenes, using data obtained from a commodity depth sensor. In the first approach we propose an extended version of PatchMatch for material texture synthesis. The second approach is more data-centric: we propose a combination of patch regression trained on a dataset and a PatchMatch-based approach for texture synthesis. In the third and final approach we exploit recent advances in deep learning to develop an end-to-end pipeline for texture synthesis. The input to our system is an RGB image and a corresponding depth/normal map captured by a Kinect sensor. Material editing is carried out by changing depths/normals in a region of the image. Once editing is done, we use the proposed texture synthesis algorithms to create realistic texture in the changed region of the image. We observe that the proposed approaches capture structures effectively and synthesize realistic textures. Each of the methods has its own limitations, which are discussed in the individual sections. At the moment our approach is single-view and runs offline on a PC.
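The extended PatchMatch variant is specific to the report, but the core PatchMatch loop it builds on (random initialization of a nearest-neighbour field, propagation of good matches to neighbours, and a shrinking random search) can be sketched as follows; `patchmatch` and its parameters are illustrative names, not the report's implementation:

```python
import numpy as np

def ssd(src, tgt, sy, sx, ty, tx, p):
    # sum of squared differences between the p×p patches at (sy,sx) and (ty,tx)
    d = src[sy:sy + p, sx:sx + p] - tgt[ty:ty + p, tx:tx + p]
    return float(np.sum(d * d))

def patchmatch(src, tgt, p=3, iters=4, seed=0):
    """Approximate nearest-neighbour field: for each p×p patch in src,
    coordinates of a similar patch in tgt, plus the match cost."""
    rng = np.random.default_rng(seed)
    H, W = src.shape[0] - p + 1, src.shape[1] - p + 1
    Ht, Wt = tgt.shape[0] - p + 1, tgt.shape[1] - p + 1
    nnf = np.stack([rng.integers(0, Ht, (H, W)),
                    rng.integers(0, Wt, (H, W))], axis=-1)
    cost = np.array([[ssd(src, tgt, y, x, *nnf[y, x], p)
                      for x in range(W)] for y in range(H)])

    def try_match(y, x, ty, tx):
        # accept candidate (ty, tx) only if it is in bounds and improves cost
        if 0 <= ty < Ht and 0 <= tx < Wt:
            c = ssd(src, tgt, y, x, ty, tx, p)
            if c < cost[y, x]:
                cost[y, x] = c
                nnf[y, x] = (ty, tx)

    for it in range(iters):
        # alternate scan direction each iteration
        s = 1 if it % 2 == 0 else -1
        order = range(H * W) if s == 1 else range(H * W - 1, -1, -1)
        for i in order:
            y, x = divmod(i, W)
            # propagation: adopt the (shifted) matches of already-visited neighbours
            for dy, dx in ((s, 0), (0, s)):
                ny, nx = y - dy, x - dx
                if 0 <= ny < H and 0 <= nx < W:
                    try_match(y, x, nnf[ny, nx][0] + dy, nnf[ny, nx][1] + dx)
            # random search around the current best, with halving radius
            r = max(Ht, Wt)
            while r >= 1:
                try_match(y, x,
                          nnf[y, x][0] + rng.integers(-r, r + 1),
                          nnf[y, x][1] + rng.integers(-r, r + 1))
                r //= 2
    return nnf, cost
```

Synthesis then copies target-patch contents into the edited region according to the field, iterating until the texture stabilizes.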
In this project, we develop a technique for creating mixed reality applications in which virtual objects can smartly interact with the physical world. The aim of the project is to build a framework for developing applications such as assistance for challenged people, interactive visualizations, and gaming. One application demonstrated was virtual furniture placement. The project, developed on the Project Tango device [1], involves stitching meshes acquired from 3D point cloud data. We explore both online and offline techniques for mesh stitching. Further, we augment virtual objects with animation and comprehensive interactions to forge a mixed reality, as well as an alternate virtual reality (using a head-mounted display). The novelty of the proposed scheme is that all processes except segmentation and path planning are executed in real time.
G S S Srinivas Rao, Neeraj Thakur, and Vinay P. Namboodiri
Proceedings of 16th IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct), Nantes, France, 2017
Abstract, DOI, Paper, Video, Poster
The feeling of presence in virtual reality has enabled a large number of applications. These applications typically deal with 360° content. However, a large amount of existing content is available as images and videos, i.e., 2D content. Unfortunately, such content does not react to the viewer's position or motion when viewed through a VR HMD. In this work, we therefore propose reactive displays for VR, which instigate a feeling of discovery while exploring 2D content. We achieve this by taking the user's position and motion into account to compute homography-based mappings that adapt the 2D content and re-project it onto the display. This gives the viewer a richer experience of interacting with 2D content, similar to the effect of viewing a scene through a window. We also provide a VR interface that uses a constrained set of reactive displays to easily browse through 360° content. The proposed interface tackles the nausea caused by existing interfaces such as photospheres by providing a natural room-like intermediate interface before switching 360° content. We perform user studies to evaluate both of our interfaces. The results show that the proposed reactive display interfaces are indeed beneficial.
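The homography-based mapping can be illustrated with the standard homography induced between two views of a planar surface: for viewer motion (R, t) relative to a plane with unit normal n at distance d, H = K (R − t nᵀ/d) K⁻¹. A minimal sketch under those assumptions (function names are illustrative, not the paper's code):

```python
import numpy as np

def reactive_homography(K, R, t, n=np.array([0.0, 0.0, 1.0]), d=1.0):
    """Homography induced by viewer motion (R, t) relative to the planar
    2D content, assumed to lie on the plane with unit normal n at depth d.
    Uses the standard formula H = K (R - t n^T / d) K^{-1}."""
    Kinv = np.linalg.inv(K)
    H = K @ (R - np.outer(t, n) / d) @ Kinv
    return H / H[2, 2]  # normalize so H[2, 2] == 1

def warp_point(H, x, y):
    # apply the homography to a pixel in homogeneous coordinates
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]
```

With head tracking supplying (R, t) each frame, warping the 2D content by H and re-projecting it onto the virtual display yields the window-like parallax effect described above.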
Sandeep Reddy, G S S Srinivas Rao, and Rajesh M Hegde
IEEE NCC 2016, IIT Guwahati, 2016
Virtual reality systems have been widely used in many popular and diverse applications, including education and gaming. However, the development of a dynamic virtual reality system that combines both audio and visual scenes has hitherto not been investigated. In this work, a dynamic virtual reality system that synchronizes audio and visual information is developed. Real-time audio and visual information is obtained from a spherical audio-visual camera with 64 microphones and 5 cameras. Subsequently, a head-mounted display application is designed to render spherical video. A three-dimensional sound rendering algorithm using head-related transfer functions is developed. Finally, a virtual reality system that combines both spherical audio and video is realized. The head position of the user is also integrated into this system adaptively to make the system dynamic. Both subjective and objective evaluations of the proposed virtual reality system indicate its significance.
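The three-dimensional sound rendering step can be sketched as time-domain convolution of a mono source with the pair of head-related impulse responses (HRIRs) for the source's direction; the HRIR values below are placeholders, not measured data:

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono signal with left/right HRIRs to produce a
    2-channel binaural signal. In a dynamic system, the HRIR pair is
    re-selected per frame from the tracked head orientation."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    n = max(left.size, right.size)
    out = np.zeros((n, 2))  # column 0: left ear, column 1: right ear
    out[:left.size, 0] = left
    out[:right.size, 1] = right
    return out
```

Head tracking makes the rendering dynamic: as the listener turns, the source's direction relative to the head changes, so a different HRIR pair is convolved each frame.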
Nishchal K. Verma, Dhekane Eeshan Gunesh, G. S. S. Srinivas Rao, and Aakansha Mishra
IEEE AIPR 2015, Washington DC 2015
In this paper, a High Accuracy Optical Flow (HAOF) based future image frame generator model is proposed. The aim of this work is to develop a framework capable of predicting future image frames for any given sequence of images; the requirement is to predict a large number of image frames with better clarity and accuracy. In the first step, the vertical and horizontal components of the flow velocities of the intensities at each pixel position are estimated using the High Accuracy Optical Flow (HAOF) algorithm. The estimated flow velocities at all pixel positions in all image frames are then modeled using separate Artificial Neural Networks (ANNs). The trained models are used to predict the flow velocities of intensities at all pixel positions in future image frames. The intensities at all pixel positions are mapped to new positions using the velocities predicted by the model, and bilinear interpolation is used to obtain predicted images from the new positions of the intensities. The quality of the predicted image frames is evaluated using the Canny Edge Detection based Image Comparison Metric (CIM) and the Mean Structural Similarity Index Measure (MSSIM). The predictor model is evaluated on two image sequences: an image sequence of a fighter jet landing on a navy deck, and another of a train moving on a bridge. The proposed framework is found to give promising results with better clarity and accuracy.
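The warping step, moving intensities by the predicted flow and resampling with bilinear interpolation, can be sketched as follows. This sketch uses the common backward-warping formulation (each output pixel samples the source frame) rather than the paper's forward mapping, and the function name is illustrative:

```python
import numpy as np

def warp_image(img, vx, vy):
    """Predict the next frame by warping a grayscale image with per-pixel
    flow (vx, vy), using bilinear interpolation for non-integer samples."""
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    # backward lookup: pixel (y, x) of the predicted frame comes from
    # (y - vy, x - vx) in the current frame, clamped to the image border
    sx = np.clip(xs - vx, 0, W - 1)
    sy = np.clip(ys - vy, 0, H - 1)
    x0 = np.floor(sx).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(sy).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx = sx - x0; wy = sy - y0
    # bilinear blend of the four neighbouring source pixels
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

Applying this repeatedly with the ANN-predicted velocity fields yields the sequence of future frames, which are then scored with CIM and MSSIM.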
Google Developer Groups, Brussels | Event Page, Video
Stuttgart VR & AR Meetup, Stuttgart, Germany | Event Page, Slides, Featured in Stuttgarter Zeitung
Google Developers Group, Nice, France | Event Page