### **Experiments with Bayesian Inference Accelerators** (Why Al Algorithms that are NOT Deep Neural Nets Also Want to be Silicon)

University of Wisconsin-Madison Virtual Computer Architecture Seminar October 15, 2020

Rob A. Rutenbar Senior Vice Provost for Research Professor, CS & ECE



### Accelerators: Why Now...?



#### Moore's Law

- A great 40-year run
- Now, running out of gas

• Apps too slow, or too power-hungry...?

• Let's try transistors!

#### Pitt Research

http://www.nature.com/news/the-chips-are-down-for-moore-s-law-1.19338

### **Focus: Deep Neural Nets (DNNs)**



#### In hindsight, hardware is obvious here:

• Breakthrough performance; widely useful; too slow on CPU

#### And -- look like (giant) DSP tasks:

- Feed-forward (mostly), limited operators, limited precision, etc.
- Main differences: scale, #weights, data movement

### **Aside:** Prior to Today's Bayesian HW

#### • Speech recognition in Si: CMU In Silico Vox project





Economist, March 12, 2005

#### Medallia announces \$ acquisition

of Voci Technologies

in

Medallia acquires artificial intelligence speech transcription company, Voci Technologies for \$59M

# MEDALLIA VOCI

# CMU accelerators for high-speed recognition went to market as **Voci Technologies**

#### Started life as non-DNN FPGAs...

"Voci transcribes 100% of live and recorded calls into text that can be analyzed quickly to determine customer satisfaction, adding a powerful set of signals to the Medallia Experience Cloud. At the same time, Voci enables call analysis moments after each interaction has completed, optimizing every aspect of call center operations securely. Especially important as virtual and remote contact center operations take shape."



#### ... but ended up as DNN-GPUs

\*Edited from: https://www.mergersight.com/post/medallia-announces-59m-acquisition-of-voci-technologies

# So, DNNs – Is This All This Is...?

#### • Actually -- No



#### Pitt Research

• Focus: Bayesian inference

• X = hypothesis; Y = evidence



# **Inference on Prob Graphical Models**

#### PGMS include:

- Nodes: encode what we observe/know, how much we believe it
- Edges: encode relationships (joint dependencies/affinities)
- Inference: solve for "most likely" labels @ nodes



### **Short Tutorial: Inference on PGMs**

- 4 nodes, 3 edges, 3 discrete labels
  - Markov Random Field (MRF), in Factor Graph form



Xi take values in { ○ , ■ , ◆ } -- discrete label set

### **PGMs: Factors** \$\phi\$



### **PGM: Labeling Entire Graph**

- What is "affinity" of whole graph for a set of labels?
- Answer: Product of the factors φ



### **From Factors to Probabilities**

- Affinity != probability. How to get probabilities?
- Answer: Normalize via Z (called 'partition function')



### **Focus: MAP Inference Problem**

- Maximum A Posteriori inference task
- Question: What is **most likely** set of labels for graph?





### **"Big 3" Inference Methods for PGMs**





Belief Propagation

Graph Cuts (→ Network Flow)







**Jungwook Choi** PhD Illinois '15 Hanyang University



Belief Propagation





**Tianqi Gao** PhD Illinois '20 Apple SEG



Graph Cuts (→ Network Flow)



Glenn Ko PhD Illinois '17 Harvard

### **"Big 3" Inference Methods for PGMs**





Belief Propagation Graph Cuts (→ Network Flow)



(Gibbs/MCMC)

#### **Belief Propagation: Iterative & Local**

• Smart order of **local**, **message passing** computations (like Viterbi!) that calculate a "**belief**" per label, per node



### **Belief Propagate: Iterative & Local**

• But if graph has **loops** – **no guarantee** of convergence!



### Why We Care: Images are PGMs



### **Our BP: Sequential Tree-Reweighting**

- Idea: Decompose a loopy graph to a set of trees, do inference sequentially across trees, recombine "smart"
  - [Kolmogorov PAMI'06]: Empirically good on loopy case; **slow**



**Recombine "smart": Energy[p] = weighted sum from decomp** 

#### HW: First, Streaming "Diagonal Order" Arch

#### Key: Diagonal ordering of all message pass → parallelism

- Decoupled, streaming arch
- Launch/retire 1 pixel/clock
  - Complete label-set likelihood updates (~1Kb) for all labels
- 14-stage pixel pipeline
  - So: 14 pixels "in flight" / clock





### **Next: Parallel/Configurable Pipes**

• Not just one pipeline any longer: *more parallel*...



P Parallel processor elements (pixel streams)



Efficient memory subsystem overlaps BW and computation, checks for data conflicts

Novel, configurable Factor-Eval Units removes O(|labels|<sup>2</sup>) complexity (FFT tricks)

### **Results: Configurable BP Engine**

- Xilinx Virtex5 FPGA
- **12-40X** faster than SW (PE = 4, ~2015)
- No loss of quality
- First custom HW to run >1 Middlebury ML benchmark



Input Gnd Truth TRW-S SW BP Engine

### **"Big 3" Inference Methods for PGMs**







Graph Cuts (→ Network Flow)



(Gibbs/MCMC)

#### **GC: Transform from MRF to Network Flow**



# **GC: Why Hardware**

- Push-Relabel Network Flow: a "min cut" algorithm can be executed (almost) entirely with just neighbor values
- Neighbor: Nodes that share an edge in PGM (N-E-W-S)
- Iterative and Convergent: a "well behaved" algorithm

→Perfect for large images, modeled as grid-MRFs
 →There are tricks for doing gray-scale/color images

#### **GC: Pixel-Parallel Array Processor**

• FPGA target, one processor per pixel



### **Pixel Processor: Key Tricks**

#### Serial bottlenecks

- Cannot push flow to a node that is already "pushing" out
- **Solution**: Checkerboard scheduling + ordering
- Not all local: Global convergence detection
- **Solution**: O(rows+cols) shift register to array center to check activity









### **Next: What About "Big" Images?**

• We built a full virtual tile (memory) system on array



### **Virtual Tile Architecture**

 Virtual tiles "stack" on the physical tile array on-chip



Multiple virtual tiles



Physical tile processor array • Geometric nuances (lots!) at tile (and page) **edges** 



Physical tile array

## **GC Virtual Tile Engine: Results**

| 1536-pixel<br>tile array          | Array Size                                                                  | Slice<br>LUTs | BRAM  | -    | AXI Memory<br>Protocol           | Page Size                      | Ultra RAM |
|-----------------------------------|-----------------------------------------------------------------------------|---------------|-------|------|----------------------------------|--------------------------------|-----------|
| AWS F-node<br>(Xilinx UltraScale) | 16x96 = <u>1536 pixel</u><br>processor<br>2*(16+96)=224<br>shadow processor | 86.6%         | 66.5% | 18Kb | 512b wide<br>192 Burst<br>length | 60 Virtual<br>Tiles<br>11.25Mb | 16%       |

|                        | Our Hard-             | CUDA                  | Kobori             | Nikitakis          | Szeliski          | CPU 2              | CPU 3                |
|------------------------|-----------------------|-----------------------|--------------------|--------------------|-------------------|--------------------|----------------------|
|                        | ware                  | Cuts 2                | et al [1]          | et al [2]          | et al [15]        |                    |                      |
|                        |                       |                       | FPGA 1             | FPGA 2             | CPU 1             |                    |                      |
| Device                 | Virtex UltraScale+    | Nividia Titan         | Virtex 6           | Virtex 7           | NA                | Intel Xeon E5      | Intel i5             |
| Frequency              | 125 MHz               | 1405 MHz              | 201 MHz            | 260 MHz            | NA                | 3.4GHz             | 3.1GHz               |
| Image Flower (600x450) | $9.93 \mathrm{ms}$    | $11.67 \mathrm{ms}$   | $30.7 \mathrm{ms}$ | NA                 | $188 \mathrm{ms}$ | $1.03 \mathrm{~s}$ | $2.553 \mathrm{~s}$  |
| Image Person (600x450) | $12.27 \mathrm{\ ms}$ | $16.09 \mathrm{\ ms}$ | $36.7 \mathrm{ms}$ | NA                 | 140  ms           | 1.9 s              | $4.716 \mathrm{\ s}$ |
| Image Sponge (640x480) | 7.88 ms               | 1299 ms               | $45.8 \mathrm{ms}$ | NA                 | 142  ms           | 1.29 s             | $2.67 \mathrm{~s}$   |
| Synthesis (128x128)    | 0.13  ms              | NA                    | NA                 | $0.95 \mathrm{ms}$ | NA                | NA                 | NA                   |
| 50 Images Average      | $8.20 \mathrm{ms}$    | $17.23 \mathrm{\ ms}$ | NA                 | NA                 | NA                | NA                 | NA                   |
| Avg speedup            | 1X                    | 2.10X                 | 5.77X              | 7.28X              | 23.44X            | 240X               | 506X                 |

#### And – It Really Works on Real Images



[1] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother, "A comparative study of en- ergy minimization methods for markov random fields with smoothness- based priors," IEEE transactions on pattern analysis and machine in- telligence, vol. 30, no. 6, pp. 1068–1080, 2008.

[2] A. Nikitakis and I. Papaefstathiou, "Highly efficient reconfigurable par- allel graph cuts for embedded vision," in Proceedings of the 2016 Con- ference on Design, Automation & Test in Europe. EDA Consortium, 2016, pp. 1405–1410.

[3] V. Gulshan, C. Rother, A. Criminisi, A. Blake, and A. Zisserman, "Geodesic star convexity for interactive image segmentation," in Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.



### **"Big 3" Inference Methods for PGMs**





Belief Propagation Graph Cuts (→ Network Flow)



### **Gibbs Sampling: Serial Baseline**



- Lets us compute Pr( xi = Label[k] ) \forall k
- Like GC: iterate to convergence

# **Gibbs Sampler (GS) Core**



- Up to 64 labels/node
- 32b variable fixed-pt
- Tightly coupled PRNG
- Iterative architecture for small footprint

### **Two Levels of Parallelism**

Sample **independent tiles** in parallel – treat as if they were separate images

while ( < max Gibbs sampling iterations)
foreach ( tile in an image )
while ( < max tile sampling iterations )
foreach ( node in a tile )
sample (\*)</pre>

Sample **independent nodes** in parallel – checkerboard / graph coloring schedule

#### **PGMA:** Prototype PGM Accelerator



Refs: Ko et al., VLSI 2020. Whatmough et al. VLSI, 2020.

### **PGMA vs A53 vs eFPGA (on SOC)**



• **PGMA**: **1000X+** throughput vs CPU; **6X+** ops/W vs eFPGA Pitt Research

### **PGMA ML Results**



Four example applications:

- Image restoration
- Stereo matching
- Image segmentation
- Sound source separation

#### Features:

- No labeled dataset
- Completely unsupervised
- Both training and inference on-the-fly

### Conclusions

 $\exists$  (AI apps X) [ interesting(X)  $\land \neg$ DNN(X)  $\land$  hardwareworthy(X) ]



**Jungwook Choi** PhD Illinois '16 Hanyang University



**Tianqi Gao** PhD Illinois '20 Apple SEG



**Belief Prop** 



**Graph Cuts** 



### Acknowledgements

#### • Contributors:

- Harvard: Yuji Chai, Marco Donato, Paul N. Whatmough, Thierry Tambe, David Brooks and Gu-Yeon Wei
- Illinois: Paris Smaragdis, Minje Kim, Shang-nien Tsai

#### • Sponsors:

- DARPA/SRC: FCRP C2S2, SONIC, JUMP ADA
- DARPA CRAFT
- Intel and ARM