### **Exploring Computation-Communication Tradeoffs** in Camera Systems

#### Amrita Mazumdar

Thierry Moreau, Luis Ceze, Sung Kim, Mark Oskin, Meghan Cowan, Visvesh Sathe, Armin Alaghi





## Camera applications are a prominent workload with tight constraints







## Hardware implementations compound the camera system design space

camera system





DogChat™

### We can represent camera applications as <u>camera processing pipelines</u> to clarify design space exploration

block 1

sensor



functions in the application

Example pipeline for **DogChat**<sup>™</sup>: sensor, image processing, feature tracking, image rendering

### Developers can trade off between computation and communication costs



sensor

image processing

DogChat™



#### offloaded to cloud

### Optional and required blocks in camera pipelines introduce more tradeoffs

edge detection



sensor

image processing





### Custom hardware platforms explode the camera system design space





**ASIC** 

edge detection

DSP



### Challenges for modern camera systems

### Low-power: face authentication for energy-harvesting cameras with ASIC design



motion detection

### Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration



prep



### Face authentication with energy harvesting cameras



**WISP Cam**: energy-harvesting camera powered by RF

- 1 frame / second
- ~1 mW processing / frame

### CPU-based face authentication neural networks can exceed WISPcam power budgets

sensor

on-chip CPU

neural network

other application functions

cloud

### Adding optional blocks can reduce power consumption for a neural network

### Exploring design tradeoffs in ASIC accelerators

neural network



Evaluated NN topology and hardware impact on energy and accuracy

Selected a 400-8-1 network topology and used 8-bit datapaths for optimal energy/accuracy point
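To give a sense of scale for the selected topology, the sketch below counts multiply-accumulate operations and weight storage for a 400-8-1 network. It assumes the layers are fully connected, that biases can be ignored, and that weights are stored at the same 8-bit width as the datapath; none of these details are stated on the slide, and `mlp_cost` is a hypothetical helper name.

```python
def mlp_cost(layers, bits_per_weight):
    """Return (MACs, weight count, weight storage in bytes) for a
    fully connected network, ignoring biases (an assumption)."""
    # Each dense layer needs n_in * n_out multiply-accumulates per inference.
    macs = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))
    weights = macs  # one weight per multiply-accumulate in a dense layer
    storage_bytes = weights * bits_per_weight / 8
    return macs, weights, storage_bytes


macs, weights, storage = mlp_cost([400, 8, 1], bits_per_weight=8)
print(macs, weights, storage)  # 3208 3208 3208.0
```

Under these assumptions, one inference costs 3,208 multiply-accumulates and the weights fit in roughly 3.2 KB, a quarter of what 32-bit weights would need.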

### many more details in paper!

![](_page_16_Figure_6.jpeg)

streaming face detection accelerator

Explored classifier and other algorithm parameters to optimize energy efficiency

## **Evaluation:** Which pipeline achieves the lowest overall power?

Synthesized ASIC accelerators in Synopsys

Constructed simulator to evaluate power consumption on real-world video input

Computed power for computation and transfer of resulting data for each pipeline configuration
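The pipeline configurations evaluated in this deck are the sensor plus every subset of three in-camera blocks (motion detection, face detection, NN). A minimal sketch of that enumeration follows; the function name and the short block labels are illustrative, not taken from the authors' simulator.

```python
from itertools import combinations

# In-camera blocks in pipeline order (labels follow the results table).
OPTIONAL_BLOCKS = ["motion", "face detect", "NN"]


def pipeline_configurations(optional_blocks):
    """Yield the sensor plus every subset of the optional blocks,
    preserving pipeline order within each subset."""
    for size in range(len(optional_blocks) + 1):
        for subset in combinations(optional_blocks, size):
            yield ("sensor",) + subset


configs = list(pipeline_configurations(OPTIONAL_BLOCKS))
for config in configs:
    print(" -> ".join(config))
```

With three optional blocks this yields eight configurations, from the bare sensor (everything offloaded) to all three blocks in camera.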

![](_page_17_Picture_4.jpeg)

![](_page_17_Picture_5.jpeg)

### Which pipeline achieves the lowest power consumption?

| pipeline configuration |        |             |    | compute / transfer (ratios) |
|------------------------|--------|-------------|----|-----------------------------|
| sensor                 |        |             |    | <1%                         |
| sensor                 | motion |             |    | <1%                         |
| sensor                 |        | face detect |    | 10%                         |
| sensor                 |        |             | NN | 16%                         |
| sensor                 | motion | face detect |    | >99%                        |
| sensor                 | motion |             | NN | >99%                        |
| sensor                 |        | face detect | NN | >99%                        |
| sensor                 | motion | face detect | NN | >99%                        |

![](_page_21_Figure_2.jpeg)

### In-camera processing for face authentication

![](_page_22_Picture_1.jpeg)

motion detection

![](_page_22_Figure_5.jpeg)

- In isolation, even well-designed hardware can show sub-optimal performance
- Optional blocks can improve the overall cost, if they balance compute and communication better than the original design
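The two points above can be illustrated with a toy cost model. The sketch below is not the authors' simulator: the `Block` fields, the filtering model (each block forwards only a fraction of frames), and every number are hypothetical, chosen only to show how a cheap optional block can lower the combined compute and transfer cost.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    energy_per_frame: float  # cost to process one frame (arbitrary units)
    pass_rate: float         # fraction of processed frames sent downstream
    output_bytes: int        # bytes emitted per forwarded frame


def cost_per_captured_frame(blocks, raw_bytes, transfer_cost_per_byte):
    """Return (compute, transfer) cost averaged over captured frames."""
    reaching = 1.0        # fraction of captured frames reaching this stage
    out_bytes = raw_bytes
    compute = 0.0
    for block in blocks:
        compute += reaching * block.energy_per_frame
        reaching *= block.pass_rate
        out_bytes = block.output_bytes
    transfer = reaching * out_bytes * transfer_cost_per_byte
    return compute, transfer


# Hypothetical numbers for illustration only.
motion = Block("motion", energy_per_frame=10.0, pass_rate=0.25, output_bytes=1000)
nn = Block("NN", energy_per_frame=500.0, pass_rate=1.0, output_bytes=1)

for blocks in ([], [nn], [motion, nn]):
    compute, transfer = cost_per_captured_frame(blocks, 1000, 1.0)
    names = ["sensor"] + [b.name for b in blocks]
    print(" -> ".join(names), compute + transfer)
```

With these made-up values, the neural network alone halves the cost of sending raw frames, while putting the cheap motion detector in front of it cuts the total by a further factor of almost four, because most frames never reach the expensive block.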

### Challenges for modern camera systems

### Low-power: face authentication for energy-harvesting cameras with ASIC design

![](_page_23_Picture_2.jpeg)

motion detection

### Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration

![](_page_23_Picture_5.jpeg)

prep

![](_page_23_Figure_7.jpeg)

### **Producing real-time VR video from a camera rig**

![](_page_26_Picture_1.jpeg)

- 16 GoPro cameras, 4K at 30 fps: 3.6 GB/s raw video
- Cloud processing prevents real-time video
- Goal: 30 fps 3D-360 stereo video, 1.8 GB/s output

![](_page_26_Picture_5.jpeg)

### VR pipeline is usually offloaded to perform heavy computation

![](_page_27_Figure_1.jpeg)

Need to accelerate "depth from flow" to achieve high performance

## Offloading before the costly step doesn't avoid compute-communication tradeoffs

![](_page_28_Figure_1.jpeg)

### Evaluation: Which pipeline achieves the highest frame rate?

Designed a simple parallel accelerator for Xilinx Zynq SoC, simulated for Virtex UltraScale+

Evaluated against CPU and GPU implementations in Halide

Assumed 2 GB/s network link for communication
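A back-of-the-envelope check shows why the link matters. The sketch below uses the figures quoted in this deck (3.6 GB/s raw video, 1.8 GB/s stereo output, 30 fps, 2 GB/s link) and assumes that per-frame payload is simply the data rate divided by 30 and that end-to-end frame rate is set by the slower of compute and link. The function name is illustrative.

```python
def end_to_end_fps(compute_fps, bytes_per_frame, link_bytes_per_s):
    """Frame rate limited by the slower of in-camera compute and the
    link that carries the pipeline's output (a simplifying assumption)."""
    link_fps = link_bytes_per_s / bytes_per_frame
    return min(compute_fps, link_fps)


GB = 1e9
LINK = 2 * GB                 # 2 GB/s link from the evaluation setup
RAW_FRAME = 3.6 * GB / 30     # bytes per rig-wide raw frame at 30 fps
OUTPUT_FRAME = 1.8 * GB / 30  # bytes per finished stereo frame at 30 fps

# Offloading raw sensor data: the sensors deliver 30 fps, the link does not.
raw_offload_fps = end_to_end_fps(30, RAW_FRAME, LINK)

# Sending finished output: the link could carry more than 30 fps,
# so in-camera compute decides the achieved rate.
output_link_fps = LINK / OUTPUT_FRAME

print(f"raw offload: {raw_offload_fps:.1f} fps")
print(f"link limit for finished output: {output_link_fps:.1f} fps")
```

Under these assumptions, shipping raw video over the link tops out at about 16.7 fps, below the 30 fps goal, while the finished output fits with headroom (about 33.3 fps), provided the in-camera pipeline itself sustains 30 fps.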

#### implementation details in paper

![](_page_29_Picture_7.jpeg)

![](_page_29_Picture_9.jpeg)

| pipeline configuration |      |       |              |        | compute |
|------------------------|------|-------|--------------|--------|---------|
| sensor                 |      |       |              |        | 100     |
| sensor                 | prep |       |              |        | 10(     |
| sensor                 | prep | align |              |        | 10(     |
| sensor                 | prep | align | depth (CPU)  |        | 0.0     |
| sensor                 | prep | align | depth (GPU)  |        | 11.     |
| sensor                 | prep | align | depth (FPGA) |        | 174     |
| sensor                 | prep | align | depth (CPU)  | stitch | 0.0     |
| sensor                 | prep | align | depth (GPU)  | stitch | 11.     |
| sensor                 | prep | align | depth (FPGA) | stitch | 174     |

![](_page_33_Figure_2.jpeg)

### In-camera processing for real-time VR video

![](_page_34_Picture_1.jpeg)

- Computation and communication together highlight benefits not seen when considered separately
- For VR video, in-camera processing pipelines enable applications that could not even be achieved via cloud offload

### In-camera processing pipelines help characterize camera systems

In-camera pipelines evaluate computation-communication tradeoffs

Use hardware-software co-design to balance constraints and optimize designs

Achieve optimal performance by considering bottlenecks in context of full system

### Thank you!

![](_page_35_Picture_6.jpeg)