This paper targets real-time production of virtual reality (VR) video from a 360-degree camera rig, like the Google Jump or Facebook Surround 360. These VR camera rigs use a collection of cameras to simultaneously capture many views of a 360-degree scene. You can then post-process the videos and composite them into a single omnidirectional stereo (3D-360) video, which can be watched on any standard VR headset or video app.

While capture and viewing are straightforward, the processing pipeline is not only computationally intensive and time-consuming but also hard to parallelize. As a result, it’s challenging to use this pipeline for video livestreams or other real-time video applications, because the latency to process each frame is so high. In their SIGGRAPH Asia paper, Anderson et al. note that it takes 10 hours on 1000 cores to process an hour of Google Jump video data, and this does not take into account the time to upload the data to the server.

The bottleneck in this VR video processing pipeline is the bilateral solver, a fast and accurate algorithm for computing smooth, edge-aware flow fields from coarse, noisy ones. The bilateral solver is central to making the final composited 360 video look coherent and immersive from any viewing angle, and it is also the most computationally expensive part of the pipeline. More importantly, it’s hard to parallelize in its original form, which prevents us from using multicore CPUs, GPUs, or other custom accelerators to improve performance.
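
For context, the original bilateral solver (Barron and Poole) frames this as a global optimization: it finds the output that is smooth in bilateral space while staying close to a noisy, confidence-weighted target. Roughly, it minimizes a quadratic objective of the form below, where W-hat is the bilateral affinity matrix, t is the noisy input, c is a per-pixel confidence, and lambda trades off smoothness against data fidelity (notation paraphrased from their paper):

```latex
\min_{x}\; \frac{\lambda}{2}\sum_{i,j} \hat{W}_{i,j}\,(x_i - x_j)^2 \;+\; \sum_{i} c_i\,(x_i - t_i)^2
```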

In the paper, we characterize why the bilateral solver is hard to parallelize (high-dimensional sparse data, the use of second-order global optimization) and present a new algorithm, the hardware-friendly bilateral solver (HFBS), which addresses these challenges and is designed to be easy to deploy on parallel hardware.
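
To give a flavor of why a first-order method is friendlier to parallel hardware (this is an illustrative sketch, not the code from the paper), consider replacing a solver with sequential, data-dependent steps, like preconditioned conjugate gradient, with plain gradient descent plus momentum on the same quadratic objective. Every iteration then reduces to a matrix-vector product and elementwise updates, which map naturally onto many independent workers. The step size `alpha`, momentum `beta`, and iteration count below are made-up placeholders, not values from the paper.

```python
import numpy as np

def heavy_ball_solve(A, b, x0, alpha=1e-3, beta=0.9, iters=500):
    """Minimize 0.5 * x^T A x - b^T x with gradient descent plus momentum.

    A stands in for a (sparse) bilateral-space system matrix; here it can be
    any symmetric positive-definite matrix that supports `A @ x`. Each
    iteration is one mat-vec plus elementwise updates, so the per-element
    work is independent and maps well onto many parallel workers.
    """
    x = x0.copy()
    velocity = np.zeros_like(x)
    for _ in range(iters):
        grad = A @ x - b                  # mat-vec: parallel across rows
        velocity = beta * velocity - alpha * grad
        x = x + velocity                  # elementwise: embarrassingly parallel
    return x

# Tiny usage example on a toy symmetric positive-definite system.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(heavy_ball_solve(A, b, np.zeros(2), alpha=0.1, iters=200))
```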

To push the limits of HFBS, we designed an FPGA accelerator that highlights the advantages of the new algorithm and adds further optimizations. The FPGA accelerator essentially implements the optimization step of the bilateral solver, with a specialized memory layout and fixed-point datapath widths for added speedups.
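
As a rough illustration of the fixed-point idea (the actual datapath widths in the accelerator are chosen per stage in the paper; the `FRAC_BITS = 16` format below is a made-up example), the solver state can be stored as integers with an implied fractional scale, which is much cheaper to implement in FPGA logic than floating point:

```python
FRAC_BITS = 16  # hypothetical fractional width, not the paper's choice

def to_fixed(x):
    """Quantize a float to a signed fixed-point integer with FRAC_BITS of fraction."""
    return int(round(x * (1 << FRAC_BITS)))

def fixed_mul(a, b):
    """Multiply two fixed-point values, rescaling the result back to FRAC_BITS."""
    return (a * b) >> FRAC_BITS

def to_float(x):
    """Convert a fixed-point integer back to a float for inspection."""
    return x / (1 << FRAC_BITS)

# e.g. 0.75 * 0.5 carried out entirely in integer arithmetic
print(to_float(fixed_mul(to_fixed(0.75), to_fixed(0.5))))  # ~0.375
```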

We evaluated our algorithm and design against the prior work, a single-threaded CPU implementation of the original bilateral solver. HFBS achieves speedups of 4×, 32×, and 50× on CPU, GPU, and FPGA platforms, respectively. On every platform, HFBS takes advantage of all the cores, threads, or parallel workers available on the compute substrate, which was the primary goal of our investigation. Our parallel algorithm does have higher error than the original bilateral solver, but it is faster than every more accurate algorithm and more accurate than every faster algorithm.

We also found that our FPGA accelerator was significantly more power-efficient than the CPU and GPU versions of our design. While this was expected, it opened up the opportunity to consider the specifications of a mobile VR video camera rig that could execute the full pipeline. In the paper, we estimate that a many-FPGA system with a host CPU, processing the 16 camera pairs in the Google Jump camera rig, would use about 500 W, making it significantly more plausible to deploy real-time VR video processing either in a portable streaming platform or in the data center.
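
As a back-of-the-envelope sanity check (the per-board and host-CPU numbers here are illustrative assumptions, not measurements from the paper), a budget of roughly 500 W is consistent with one modest FPGA board per camera pair plus a host CPU:

```python
# Illustrative numbers only; see the paper for the measured power breakdown.
num_camera_pairs = 16   # Google Jump rig, from the text above
watts_per_fpga = 25     # assumed power for one mid-range FPGA board
watts_host_cpu = 100    # assumed host CPU budget

total_watts = num_camera_pairs * watts_per_fpga + watts_host_cpu
print(total_watts)  # 500
```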

You can find more details in the paper and other resources:

A Hardware-Friendly Bilateral Solver for Real-Time Virtual Reality Video.
Amrita Mazumdar, Armin Alaghi, Jonathan T. Barron, David Gallup, Luis Ceze, Mark Oskin, Steven M. Seitz. In High Performance Graphics 2017.
paper (pdf), slides (pdf), code