# MocCUDA: Running CUDA Codes on Fugaku

Slides contrib.: W. Moses (MIT), I. Ivanov (TokyoTech)

Jens Domke, Dr. rer. nat. <jens.domke@riken.jp > High Performance Big Data Research Team, RIKEN R-CCS, Kobe, Japan



#### Initial DL4Fugaku team and Collaborators





significant assistance by Ikuo Miyoshi from Fujitsu limited

**Technical Staff** Mr. Kazuto ANDO

Researcher Dr. Keigo NITADORI

Visiting researcher Dr. Mohamed Wahib

## Exploring and Merging Different Routes to O(100,000s) Nodes Deep Learning

 $\begin{array}{c|c} Stage 3 & Stage 4 \\ \hline \mathbf{A}_0, \mathbf{G}_1, \nabla E_1 & \mathbf{A}_0^{-1}, \mathbf{G}_1^{-1}, \nabla E_1 \\ \hline \mathbf{A}_1, \mathbf{G}_2, \nabla E_2 & \mathbf{A}_1^{-1}, \mathbf{G}_2^{-1}, \nabla E_2 \end{array}$ 

Data-parallel > Model-parallel

Data-paralle

 $A_1, G_2, \nabla E_2$   $A_2, G_3, \nabla E_3$ 

graph-based Non-intrusive partitioning strategy for large DNN models achieving superlinear scaling [1]

Deep tearning Distributed Execution Energy

Not Device Device Device

 $A_1, G_2, \nabla$ Laver-wise distribution and Data-parallel Model-parallel (K-FAC) inverse-free further design accelerate K-FAC [5] A model-parallel 2nd-order method 102.4 G.PU.8 1024 GPUs 2048 512 Megatron-LM Turing-NLG UT Austin, UChicago, ANL (K-FAC) trains ResNet-50 on 1K GPUs Out-of-core distributed training (pure AIST, Koc U. Model-parallelism in 10 minutes [4] data-parallel) outperforming SoTA enables 3D CNN training TokyoTech, NVIDIA, RIKEN, AIST NLP models on 2K GPUs [2] on 2K GPUs with 64x AIST, Matsuoka-lab, RIKEN larger spatial size and Merging Theory<sup>I</sup> and Practice better convergence [3] Inference (FP32) images/sec Matsuoka-lab, LLNL, LBL, RIKEN MocCUDA: Porting CUDA-based Same efficienc Porting CPU-based Deep Neural on Intel CPL Deep Neural Network Library to Engineering for Network Library to A64FX chip A64FX and (other CPU arch.) 150 Fujitsu, RIKEN, ARM Performance Foundation **RIKEN, Matsuoka-lab, AIST** batch size #OMP

[1] M. Fareed et al., "A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs", Submitted to PPoPP21

ŧţ

[2] M. Wahib et al., "Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA", ACM/IEEE SC20 (Supercomputing 2020)

[3] Y. Oyama et al., "The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism," arXiv e-prints, pp. 1-12, 2020.

[4] K. Osawa, et al., "Large-scale distributed second-order optimization using kronecker-factored approximate curvature for deep convolutional neural networks," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2019-June, pp. 12351-12359, 2019.

[5] J. G. Pauloski, Z. Zhang, L. Huang, W. Xu, and I. T. Foster, "Convolutional Neural Network Training with Distributed K-FAC," arXiv e-prints, pp. 1-11, 2020.

Jens Domke

TEL OPS

Ideal performance @FP32

#### Internal discussions (early spring '20)

- Prof. Matsuoka eye-opening remarks a long time ago:
  "A64FX is more like a GPU than a CPU"
- Brainstorming Alex & Wahib & myself

Jens Domke

- CUDA does gemm-based conv... why? -> Memory BW
- oneDNN focuses on direct conv. (gemm-based only "for debug")
- NNPACK / native CPU backends for "normal" CPUs all slow
- If A64FX *"is a GPU"*, then why don't we mimic its computation?
- Option A: use oneDNN's internal gemm-based conv
  - need rewrite of interface to pytorch, or within high-level API of oneDNN (3)
  - refactoring torch scripts still necessary; Amdahl's law issues still present ☺



Hardware

Intel CPU

A64FX

**NVIDIA** 

GPU



## oneDNN for pytorch (and TF, +others...)



- Disadvantages of Intel's oneDNN approach
  - Tedious to port to A64FX (years of engineering by Fujitsu)
  - GEMM-based convolution not exposed
  - Tuned for "normal" CPUs with assumption: Memory is slow
  - Pytorch scripts need to be modified (convert model/tensor with X.to\_mkldnn())
  - Amdahl's law problem (maybe too much sequential pytorch stuff in betw. parallel DNNL sections)



#### **Option B: can we replace CUDA RT & cuDNN?**



Computer simula R-CCS create the future

Computer simulations

## MocCUDA for x86 and A64FX (cuda-"native")



- Architecture (only functions implemented if called by pytorch):
  - Wrapper library for CUDA runtime → Easy ☺
  - Wrapper libs for cudnn (& cublas) → medium hard (& trivial), but no reference code available (hence, took more time)
  - Wrapper libs for native cuda kernels (<<<...>>> in torch's .cu files)
    → annoying, non-trivial coding / reverse engineering, but doable ⊗
  - Async work dispatch queues (cf. cuda streams) → use Apple's GCD
  - Finally: use **SSL2** for BLAS ops; use **Horovod** for MPI/multi-node
- Time: only 1-2 months of R&D without prior knowledge of CUDA programming or how to write DL kernels (bnorm, maxpool, conv, etc.)!

Can run Resnet50 (batchsize >= 2) with LD\_PRELOAD=moccuda.so
 Jens Domke

## **Experimental MocCUDA Benchmarking**



#### • Tests on Fugaku:

- Alex' github.com/undertherain/benchmarker/ (Infer.: conv2d layer; Train.: Resnet50 w/ synthetic img)
- Fujitsu's Resnet50 test\_train.py script
- Horovod's synthetic Resnet50 benchmark
- Fujitsu's official pytorch v1.5 on Fugaku (FJ's fcc + oneDNN + SSL2)
- Self-compiled **Pytorch (v1.4) with CUDA** support (nvcc + clang13 + SSL2)
  - CUDA(4Arm)/cudnn "installed" in \$HOME from RPMs
  - +3 changes to prevent inline of some functions ( $\rightarrow$  not possible in v1.5 anymore)
- Execute test\_train.py (10 epochs), and benchmarker (6 epoch) on A64FX, eg: for CMGs in (1, 2, 3, 4)
  - for OMP in (1, 2, ..., 64)
    - for batch\_size in (1, 2, ..., 288)

test\_train.py (--type cpu\_nomkl | cpu\_mkltensor | gpu) & eq. for benchmarker

## Conv2d layer: Img/s (top) & Speedup (bottom)

- Conv2d can be implemented as im2col+gemm
- Higher is better
- Possible to run on 1-to-4 A64FX CMGs
- Native CPU backend usually worst performance
- MocCUDA can outperform oneDNN for large batch sizes (up to ~4x speedup possible)

200

150

0

300

250 200 150 100

50

mg/s (higher is



Computer simulations

R-CCS create the future

#### **Benchmarker: Resnet50 training**

2.5

1.5

0.5

0

atch size

Speedup





- Full multi-layer resnet50 training run (forward and back 3.5 propagation) possible 3
- Higher is better
- MocCUDA outperforms oneDNN by up to 3x



## Fujitsu's Resnet50 test train.py script

50

40

t/epoch ∞

20

10



- Lower is better
- Native slow; not scaling with batch size (OOM issues)
- oneDNN big problem with #OMP > #cores
- MocCUDA almost competitive
  - small #OMP: ~1/2 speed
  - #OMP ~= #cores: 5%-20% slower
- >5x speedup for #OMP > #coreJens Domke



## Horovod's synthetic Resnet50 benchmark



A64FX with 2 MPI Ranks • Higher is better 60 native150 MocCUDA Left: 2 ranks on 50 oneDNN150 1 node; Right: 40 scaling #MPI img/sec 30 ranks with fixed #OMP=12 20 12x A64FX Fugaku nodes native150 Native slow 10 MocCUDA oneDNN150 oneDNN best. 0 20 but with **odd** 40 60 batch size 10 5 80 15 20 #OMP performance **behavior** (#OMP=1 best; >1 decreasing) 100 0 MocCUDA close to oneDNN with 12 cores per MPI rank / CMG batch size

#ranks

50

## Primary remaining issue: native CUDA kernels



- Pytorch has various tensor operations implemented in native cuda (add, mul, threshold, softmax, ... more complex ops)
- Number is reducing over time
  - → more and more move to libcudnn, libcublas, libcufft, ....
- Number will likely \_not\_ decrease to 0
  some fn not performance-relevant enough to migrate
- → Collaboration on automatic Cuda2OpenMP translation/compilation
- Prior art exists (eg. GPU Ocelot, ...) but is outdated
- Approach A: LLVM-IR (collab. w/ Ivan R. Ivanov @TokyoTech)
- Approach B: LLVM-MLIR (collab. w/ William S. Moses @MIT)

## **Option A: LLVM-IR (I. Ivanov @TokyoTech)**

- CUDA execution model:
  - Kernels exec. in blocks in grids
  - Threads in blocks run in parallel
  - No guarantee about the order of blocks or parallelism of blocks
- In theory easily mapped to
  6-way nested loop + OMP
  - GPU thread ?= CPU thread
    → oversubscribes CPU
    → bad cache access ☺
  - Parallelize over blocks?
    → GPU barriers... ☺

Jens Domke



entry

int i = threadIdx.x:

C[i] = A[i] + B[i];

Execution of a block on the GPU

apu thread idx 0 1 2 3 4 5 6 7

entry

barrier

barrier

exit

global void VecAdd(float\* A, float\* B, float\* C)



Kernel execution structure





Block n

...

## **Code transformation: continuation kernels**

- Generate continuation kernels (functions)
- Live variable analysis to find state (variables) necessary to preserve
- Done on LLVM IR level
- $f \xrightarrow[a^2]{a^2} f 0 \xrightarrow[a^2]{a^2} f 0 \xrightarrow[a^2]{a^2} cont: f1 \xrightarrow[a^2]{a^2} f 1 \xrightarrow[a^2]{a^2} cont: f2 \xrightarrow[a^2]{a^2} cont: f2$

Evaluation

- Rodinia (has OpenMP & CUDA implementations of the same problem
- Avg. *cpucuda* runtime 2.5x slower than native OpenMP (ranging from 20x speedup to 37x slowdown







## **Option B: LLVM-MLIR (W. Moses @MIT)**



• Polygeist compilation flow (github.com/wsmoses/Polygeist)



- Generic C/C++ frontend generates "standard" MLIR (multi-level interm. Rep.)
- Raising transformations for transforming "standard"→polyhedral MLIR (affine)
- Embedding of existing polyhedral tools (Pluto, CLooG) into MLIR
- Novel transformations and optimizations (statement splitting, reduction detection, etc) that rely on high-level compiler representation
- End-to-end evaluation of standard polyhedral benchmarks (Polybench)

## Use Polygeist to translate CUDA $\rightarrow$ OpenMP

Computer simulations create the future R-CCS

 $\rightarrow$ 

- Parallelize within blocks and first-class representation of parallelism
  - Maintain GPU parallelism in a form understandable to the compiler
  - Enables optimization between caller and kernel

\_\_\_\_\_device\_\_\_\_int sum(int\* in, int n); global void normalize(int \*out, int \*in, int n) { int tid = threadIdx.x: if (tid < n)out[tid] = in[tid] / sum(in, n); void launch(int\* out, int\* in, int n) {

normalize<<<nblocks, nthreads>>>(out, in, n);



%arg2: 132) { %c1 = arith.constant 1 : index %c0 = arith.constant 0 : index %sum = call @ Z3sumPii(%arg1, %arg2) scf.parallel (%arg3) = (%c0) to (%arg2) step (%c1) { %2 = memref.load %arg1[%arg3] %4 = arith.divsi %2, %sum : i32 memref.store %4, %arg0[%arg3]

func private @\_Z3sumPii(memref<?xi32>, i32) -> i32

- Remaining issue again: Synchronization Lowering
  - Efficiently lower a top-level sync by distributing the *parallel* for loop around the sync





- Store registers or **re-compute values** which are required in 2<sup>nd</sup> loop
- parallel for %i = 0 to N { parallel for %i = 0 to N { parallel for %i = 0 to N { codeA(%i); codeA(%i): codeA(%i); Sync. within control flow parallel for %i = 0 to N { codeA(%i); parallel\_for %i = 0 to N { for %j = ... { parallel for %i = 0 to N { parallel for %i = 0 to N { (for, if, while, etc.) can be codeB1(%i, %j); codeB1(%i, %j); codeB1(%i, %j); codeB1(%i, %j); sync threads; sync threads; sync\_threads; lowered by splitting and codeB2(%i, %j); codeB2(%i, %j); parallel for %i = 0 to N { codeB2(%i, %j); codeB2(%i, %j); codeC(%i); and interchanging loops parallel for %i = 0 to N { parallel for %i = 0 to N { codeC(%i); codeC(%i); parallel for %i = 0 to N { codeC(%i); Jens Domke

# Summary and Job/Collaboration Opportunities

- Advantages of MocCUDA & LLVM-IR:
  - Full control over SW stack; tune as we like (algos/code) w/o Intel
  - CUDA impl. (torch/etc.) implicitly supports async dispatch -> no Amdahl's law issues
  - Implicit support for other DL frameworks (incl. those without oneDNN support)
  - Easily integrate diff. precisions / kernel fusion (analyze GCD queues) / SVE / etc.
  - Usage potential far beyond just DL framework → backporting GPU codes to HBM-based x86/Arm CPUs

- Collaborations and Job opportunities:
  - Our research teams and open positions: <u>https://www.riken.jp/en/research/labs/r-ccs/</u> and <u>https://bit.ly/3faax8v</u>
- Internship/fellowship (Bachelor→PhD):
  - www.riken.jp/en/careers/programs/index.html
  - www.r-ccs.riken.jp/en/about/careers/internship/
- Supercomputer Fugaku:
  - Apply for node-hours: <u>www.r-ccs.riken.jp/en/fugaku/user-guide/</u>
  - Interactive, virtual tour: <u>www.r-ccs.riken.jp/en/fugaku/3d-models/</u> and <u>www.youtube.com/watch?v=f3cx4PGDGmg</u>

#### < jens.domke@riken.jp >