# FPGA Based Acceleration of Computational Fluid Flow Simulation on Unstructured Mesh Geometry

Zoltán Nagy, Csaba Nemes, Antal Hiba, András Kiss, Árpád Csík, Péter Szolgay

Computer and Automation Research Institute, Pázmány Péter Catholic University, Széchenyi István University



# Introduction

- Challenges in numerical solution of Partial Differential Equations
- High performance Computational Fluid Dynamics (CFD) accelerator
- Arithmetic unit generation and optimization
- Off-chip data access optimization
- Results
- Conclusions, future work

Accelerating the numerical solution of Partial Differential Equations (PDE)

- Wide variety of physical phenomena
  - sound, heat, elasticity, electrodynamics or fluid flow
- Computational fluid dynamics (CFD)
- Discretization over fine mesh

MTA SZTAKI

- 5M mesh points: air flow simulation around a car or airplane
- 200M mesh points: jet engine acoustic modeling
- Weeks of simulation time on clusters
  - Low processor utilization ~10%
  - Weak scalability over ~100 nodes



## Workflow



## Inviscid, Adiabatic, Compressible flows

- Euler equations:

$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0$$

$$\frac{\partial (\rho \mathbf{v})}{\partial t} + \nabla \cdot \left( \rho \mathbf{v} \mathbf{v} + \hat{I} p \right) = 0$$

$$\frac{\partial E}{\partial t} + \nabla \cdot \left( \left( E + p \right) \mathbf{v} \right) = 0$$

- Total energy is defined as:

$$E = \frac{p}{\gamma - 1} + \frac{1}{2}\rho \mathbf{v} \cdot \mathbf{v}$$

- Notation:
  - $t$: time
  - $\nabla$: Nabla operator
  - $\rho$: density
  - $\mathbf{v} = (u, v)$: velocity vector field
  - $p$: pressure
  - $\hat{I}$: identity matrix
  - $E$: total energy
  - $\gamma$: ratio of specific heats
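
As a small illustration of the total-energy definition above, the following sketch converts 2D primitive variables (ρ, u, v, p) into the total energy E and recovers the pressure from the conserved variables. The function names and the value γ = 1.4 (air) are assumptions made for this example; they do not come from the slides.

```python
GAMMA = 1.4  # assumed ratio of specific heats (air); not stated in the slides


def total_energy(rho, u, v, p, gamma=GAMMA):
    """E = p / (gamma - 1) + 0.5 * rho * (v . v) for a 2D velocity (u, v)."""
    return p / (gamma - 1.0) + 0.5 * rho * (u * u + v * v)


def pressure(rho, rho_u, rho_v, energy, gamma=GAMMA):
    """Invert the energy definition using the conserved variables."""
    kinetic = 0.5 * (rho_u * rho_u + rho_v * rho_v) / rho
    return (gamma - 1.0) * (energy - kinetic)


# Hypothetical supersonic inflow state, chosen only for illustration.
rho, u, v, p = 1.0, 3.0, 0.0, 1.0
E = total_energy(rho, u, v, p)
p_back = pressure(rho, rho * u, rho * v, E)
print(E, p_back)
```

The round trip from pressure to energy and back should return the original pressure up to floating-point rounding.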

#### First-order Lax-Friedrichs approximation
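
The formula on this slide did not survive extraction. As a hedged stand-in, the sketch below shows one common first-order Lax-Friedrichs-type interface flux (the local Lax-Friedrichs, or Rusanov, variant) for the 2D Euler equations. It is not necessarily the exact variant the authors implemented; the function names, the state layout `(rho, rho*u, rho*v, E)` and γ = 1.4 are assumptions.

```python
import math

GAMMA = 1.4  # assumed ratio of specific heats; not stated in the slides


def _primitives(U, gamma=GAMMA):
    """Return (u, v, p) for a conserved state U = (rho, rho*u, rho*v, E)."""
    rho, ru, rv, E = U
    u, v = ru / rho, rv / rho
    p = (gamma - 1.0) * (E - 0.5 * rho * (u * u + v * v))
    return u, v, p


def normal_flux(U, n):
    """Physical Euler flux F(U) . n through a face with unit normal n."""
    rho, ru, rv, E = U
    u, v, p = _primitives(U)
    vn = u * n[0] + v * n[1]
    return (rho * vn, ru * vn + p * n[0], rv * vn + p * n[1], (E + p) * vn)


def max_wave_speed(U, n):
    """|v . n| + c, the fastest signal speed normal to the face."""
    u, v, p = _primitives(U)
    return abs(u * n[0] + v * n[1]) + math.sqrt(GAMMA * p / U[0])


def lax_friedrichs_flux(UL, UR, n):
    """Central flux average plus dissipation proportional to the state jump."""
    FL, FR = normal_flux(UL, n), normal_flux(UR, n)
    alpha = max(max_wave_speed(UL, n), max_wave_speed(UR, n))
    return tuple(0.5 * (fl + fr) - 0.5 * alpha * (ur - ul)
                 for fl, fr, ul, ur in zip(FL, FR, UL, UR))


# Hypothetical left/right cell states across a face with normal (1, 0).
UL = (1.0, 3.0, 0.0, 7.0)
UR = (0.8, 2.0, 0.1, 5.0)
print(lax_friedrichs_flux(UL, UR, (1.0, 0.0)))
```

Two properties are worth checking in any such flux: it reduces to the physical flux when both states are equal, and it is conservative, meaning the flux seen from the neighbouring cell (states swapped, normal reversed) is the exact negative.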





# **Data-flow model**



- Mathematical expression implemented in SystemC is converted to a hypergraph
  - Nodes = arithmetic units
  - Hyperarcs = connections
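
To make the nodes-and-hyperarcs idea concrete, here is a minimal sketch that turns an arithmetic expression into such a graph. The authors work from SystemC; this example swaps in Python's standard `ast` module purely for illustration, and the function name, unit naming scheme and output layout are all assumptions.

```python
import ast
from collections import defaultdict

OPS = {ast.Add: "add", ast.Sub: "sub", ast.Mult: "mul", ast.Div: "div"}


def expression_to_hypergraph(expr):
    """Return (nodes, hyperarcs) for a +, -, *, / expression.

    nodes:     unit name -> operation (one arithmetic unit per operator)
    hyperarcs: signal name -> (source, [sinks]); one signal may feed many units
    """
    nodes, source, sinks = {}, {}, defaultdict(list)

    def visit(node):
        if isinstance(node, ast.Name):          # primary input signal
            source.setdefault(node.id, "input")
            return node.id
        if isinstance(node, ast.Constant):      # literal constant
            name = repr(node.value)
            source.setdefault(name, "const")
            return name
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left, right = visit(node.left), visit(node.right)
            op = OPS[type(node.op)]
            unit = f"{op}{len(nodes)}"          # new arithmetic unit
            nodes[unit] = op
            sinks[left].append(unit)
            sinks[right].append(unit)
            source[f"{unit}_out"] = unit
            return f"{unit}_out"
        raise ValueError("unsupported expression element")

    result = visit(ast.parse(expr, mode="eval").body)
    sinks[result].append("output")
    return nodes, {sig: (source[sig], list(sinks[sig])) for sig in source}


# One term of the momentum flux: rho*u*u + p
nodes, hyperarcs = expression_to_hypergraph("rho*u*u + p")
print(nodes)
print(hyperarcs["u"])
```

In this example the signal `u` feeds two multipliers, which is why the connections are modelled as hyperarcs (one source, several sinks) rather than ordinary edges.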



## Arithmetic unit











## Forward-facing step: 2D unstructured mesh



# Adjacency matrix: 198,006 nodes



- Bandwidth: 198,006
- Memory:
  - 4 time-dependent variables
  - 32 byte/cell, ~6 MB
- Node degree: 3
  - 3 x 3 byte adjacency list
  - 3 x 2 normal vector coordinates
  - 57 byte/cell, ~10.7 MB
  - 3 clk/cell
  - 325 MHz clock
  - memory bandwidth 23.5 GB/s
  - nonuniform memory access pattern
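
The figures above can be reproduced with a short calculation. The sketch below is a reconstruction, not the authors' stated model: it assumes 8-byte values (implied by 4 variables taking 32 byte/cell), that the "MB" sizes are binary megabytes, and that each cell update reads its own state, writes the new state, reads its connectivity record, and fetches every neighbour state from off-chip memory.

```python
BYTES_PER_VALUE = 8  # assumption: 4 variables -> 32 byte/cell implies 8-byte values


def cell_bytes(n_vars, degree, index_bytes, dims):
    """Return (state bytes, connectivity bytes) stored per cell."""
    state = n_vars * BYTES_PER_VALUE
    connectivity = degree * index_bytes + degree * dims * BYTES_PER_VALUE
    return state, connectivity


def traffic_gb_per_s(state, connectivity, degree, clk_per_cell, clock_hz):
    """Off-chip traffic when neighbour states are also fetched from memory."""
    # own state read + updated state write + connectivity + one state per neighbour
    per_cell = state + state + connectivity + degree * state
    cells_per_s = clock_hz / clk_per_cell
    return per_cell * cells_per_s / 1e9


cells = 198_006
state, conn = cell_bytes(n_vars=4, degree=3, index_bytes=3, dims=2)
state_mib = cells * state / 2**20
conn_mib = cells * conn / 2**20
traffic = traffic_gb_per_s(state, conn, degree=3, clk_per_cell=3, clock_hz=325e6)
print(state, conn, round(state_mib, 2), round(conn_mib, 2), round(traffic, 2))
```

Under these assumptions the per-cell sizes come out as 32 and 57 bytes and the traffic lands close to the 23.5 GB/s quoted on the slide, which suggests neighbour fetches dominate the off-chip load before renumbering.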



## Renumbering



- Bandwidth: 580
- Memory requirements:
  - 1,160 cells
  - 32 byte/cell: ~36.2 kB
- Node degree: 3
  - 3 x 2 byte adjacency list
  - 3 x 2 normal vector coordinates
  - 54 byte/cell, ~61.1 kB
  - 3 clk/cell
  - 325 MHz clock
  - memory bandwidth 12.7 GB/s
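
The slides do not say which reordering algorithm is used. As an illustration of how renumbering shrinks the adjacency-matrix bandwidth, the sketch below swaps in the classic Cuthill-McKee breadth-first ordering on a toy graph; the function names and the example graph are hypothetical.

```python
from collections import deque


def bandwidth(adj, order=None):
    """Largest |position(a) - position(b)| over all edges (a, b)."""
    nodes = order if order is not None else sorted(adj)
    pos = {node: i for i, node in enumerate(nodes)}
    return max((abs(pos[a] - pos[b]) for a in adj for b in adj[a]), default=0)


def cuthill_mckee(adj):
    """Breadth-first ordering, visiting low-degree nodes first."""
    order, seen = [], set()
    for start in sorted(adj, key=lambda n: len(adj[n])):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in sorted(adj[node], key=lambda n: len(adj[n])):
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
    return order


# Toy "mesh": a chain of 8 cells whose labels are badly scrambled.
chain = [0, 5, 2, 7, 1, 6, 3, 4]
adj = {n: [] for n in chain}
for a, b in zip(chain, chain[1:]):
    adj[a].append(b)
    adj[b].append(a)

before = bandwidth(adj)
after = bandwidth(adj, cuthill_mckee(adj))
print(before, after)
```

With a small bandwidth, every neighbour of the cell being processed lies within a short window of recently streamed cells, so neighbour states can be served from on-chip memory instead of random off-chip reads.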



## **Scramjet 3D unstructured mesh**



#### Adjacency matrix, Scramjet: 210,379 nodes



- Bandwidth: 210,379
- Memory:
  - 5 time-dependent variables
  - 40 byte/cell, ~8 MB
- Node degree: 4
  - 4 x 3 byte adjacency list
  - 4 x 3 normal vector coordinates
  - 108 byte/cell, ~21.6 MB
  - 4 clk/cell
  - 325 MHz clock
  - memory bandwidth 28.2 GB/s
  - nonuniform memory access pattern



## Renumbering



- Bandwidth: 10,317
- Memory requirements:
  - 20,634 cells
  - 40 byte/cell: ~806 kB
- Node degree: 4
  - 4 x 2 byte adjacency list
  - 4 x 3 normal vector coordinates
  - 104 byte/cell, ~2.04 MB
  - 4 clk/cell
  - 325 MHz clock
  - memory bandwidth 14.95 GB/s
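
The on-chip requirements listed above follow from the reordered bandwidth. The sketch below is a reconstruction under stated assumptions: the on-chip window holds twice the bandwidth in cells (inferred from 20,634 = 2 x 10,317), values are 8 bytes wide, sizes use binary kB/MB, and after renumbering only the cell's own state read, its state write and its connectivity record go off-chip.

```python
BYTES_PER_VALUE = 8  # assumption: 5 variables -> 40 byte/cell implies 8-byte values


def window_budget(bandwidth, n_vars, degree, index_bytes, dims):
    """Return (window cells, state bytes/cell, connectivity bytes/cell)."""
    cells = 2 * bandwidth  # assumption: window spans the bandwidth in both directions
    state = n_vars * BYTES_PER_VALUE
    connectivity = degree * index_bytes + degree * dims * BYTES_PER_VALUE
    return cells, state, connectivity


def streaming_traffic_gb_per_s(state, connectivity, clk_per_cell, clock_hz):
    """Off-chip traffic when all neighbour states are served from the window."""
    per_cell = state + state + connectivity  # read state, write state, read connectivity
    return per_cell * clock_hz / clk_per_cell / 1e9


cells, state, conn = window_budget(10_317, n_vars=5, degree=4, index_bytes=2, dims=3)
state_kib = cells * state / 2**10
conn_mib = cells * conn / 2**20
traffic = streaming_traffic_gb_per_s(state, conn, clk_per_cell=4, clock_hz=325e6)
print(cells, state, conn, round(state_kib), round(conn_mib, 2), traffic)
```

Under these assumptions the numbers agree with the slide: a 20,634-cell window, 40 and 104 bytes per cell, and 14.95 GB/s of sequential traffic in place of the nonuniform access pattern of the original numbering.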



# System Architecture







# Performance

- Alpha-Data ADM-XRC-6T1
- FPGA: Xilinx XC6VSX475T
  - DSP: 525 (26%)
  - FF: 49,072 (12%)
  - LUT: 34,543 (8%)
  - 3 arithmetic units
- Clock frequency: 325 MHz
- 325 million triangle updates/s
- 69.22 GFLOPS
- 76.3 times speedup
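
The throughput figures above are mutually consistent, as a short calculation shows. This is a sketch under assumptions: each arithmetic unit is taken to need 3 clock cycles per triangle (the "3 clk/cell" figure from the 2D mesh slide), and the 76.3 times speedup is taken to be a ratio of triangle-update throughput; neither is stated explicitly here.

```python
CLOCK_HZ = 325e6       # from the slide
ARITHMETIC_UNITS = 3   # from the slide
CLK_PER_TRIANGLE = 3   # assumption: "3 clk/cell" applies to each arithmetic unit
GFLOPS = 69.22         # from the slide
SPEEDUP = 76.3         # from the slide

# Aggregate FPGA throughput in triangle updates per second.
updates_per_s = ARITHMETIC_UNITS * CLOCK_HZ / CLK_PER_TRIANGLE

# Floating-point operations implied per triangle update.
flops_per_update = GFLOPS * 1e9 / updates_per_s

# Reference-CPU throughput implied by the speedup (assumes a throughput ratio).
cpu_updates_per_s = updates_per_s / SPEEDUP

print(updates_per_s, round(flops_per_update), round(cpu_updates_per_s / 1e6, 2))
```

Three units at 3 cycles per triangle give one triangle per clock, which matches the 325 million updates/s on the slide and implies roughly 213 floating-point operations per update.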

#### Intel Xeon E5620 2.4GHz





# Conclusions, future work

- Supersonic flow simulation
  - High performance FPGA
  - Automatic arithmetic unit generation, partitioning, placement -> high clock frequency
  - Node reordering -> Efficient unstructured mesh handling
  - Nearly two orders of magnitude (76.3 times) speedup
- Future work
  - Mesh partitioning
  - Multi FPGA

### Example: 2D cross-section of a Scramjet engine (1.4M grid points)

