GELU Kernel Benchmarks: GELU (erf/tanh) on RISC-V RVV

Comparison of scalar libm and SLEEF vector GELU implementations (exact erf vs. tanh approximation). Results captured on 17 Oct 2025. Implementation file: gelu_kernel.c (https://github.com/docularxu/sleef/blob/working.sleef.bench/gelu/gelu_kernel.c).

GUODONG XU
Director of China Operations, RISCstar Solutions. Copyright 2025.

Highest Speedup (Double Precision)

1.92x (erf, RVVM2)

Highest Speedup (Single Precision)

2.66x (erff, RVVM2)

Benchmark Environment

Software Platform

Fedora 42 Remix

Released by the Fedora-V Force team.

Download link: images.fedoravforce.org

Hardware Platform

K1 SoC (RISC-V)

Manufactured by SpacemiT.

CPU: 8 cores, model Spacemit® X60.

MMU Mode: sv39.

GELU Kernel using SLEEF

This implementation provides high-performance vectorized GELU (Gaussian Error Linear Unit) activation functions using SLEEF's math library across multiple SIMD architectures.

What is GELU?

GELU is an activation function commonly used in neural networks, especially in Transformer models like BERT and GPT. It has two formulations:

  1. Exact version: GELU(x) = 0.5 * x * (1 + erf(x/√2))
  2. Tanh approximation: GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))

[Figure: GELU function illustration]

Benchmark Harness

GELU kernel benchmarks were built and run from the gelu/ directory.

Source code: GitHub repository git@github.com:docularxu/sleef.git (fork/clone of shibatch/sleef), branch working.sleef.bench. Browse: docularxu/sleef@working.sleef.bench

cd gelu
./build_riscv.sh

Benchmark testing:

$ ./gelu_rvvm1 11008 100000; ./gelu_rvvm2 11008 100000;

Why size n=11008?

In LLM inference, GELU is applied across the FFN intermediate dimension (the expanded hidden width per token after the first MLP projection), not across small toy arrays. Using n=11008 matches the FFN intermediate size of LLaMA 7B, which makes the microbenchmark more representative.

Results Table

Mode   GELU Variant  Precision  libm Time (ns)  Vector Time (ns)  libm Throughput (GB/s)  Vector Throughput (GB/s)  Speedup
-----  ------------  ---------  --------------  ----------------  ----------------------  ------------------------  -------
RVVM1  erf           FP64              117.492            83.387                   0.136                     0.192    1.41x
RVVM1  tanh          FP64              155.479            97.503                   0.103                     0.164    1.59x
RVVM1  erff          FP32               64.321            32.141                   0.124                     0.249    2.00x
RVVM1  tanhf         FP32               72.396            39.766                   0.111                     0.201    1.82x
RVVM2  erf           FP64              119.103            61.969                   0.134                     0.258    1.92x
RVVM2  tanh          FP64              156.496            98.139                   0.102                     0.163    1.59x
RVVM2  erff          FP32               64.825            24.326                   0.123                     0.329    2.66x
RVVM2  tanhf         FP32               72.987            39.688                   0.110                     0.202    1.84x

Note: 'erf' indicates the exact GELU formulation using the error function; 'tanh' indicates the tanh approximation. The 'f' suffix denotes single precision (FP32). Times are per element (ns/element), as labeled on the time charts.

Charts: Time, Throughput, and Speedup

Double Precision: Time (ns/element) by GELU Variant

Lower is better. GELU Variants: erf (exact), tanh (approx).

Single Precision: Time (ns/element) by GELU Variant

Lower is better. GELU Variants: erff (exact FP32), tanhf (approx FP32).

Double Precision: Throughput (GB/s) by GELU Variant

Higher is better.

Single Precision: Throughput (GB/s) by GELU Variant

Higher is better.

Speedup (Vector vs. libm Scalar)

Speedup = libm time / SLEEF Vector time.
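The speedup and throughput columns can be reproduced from the per-element times. In the sketch below, the speedup formula is the one stated above. The throughput formula is an assumption inferred from the reported numbers: it counts one load and one store per element, which reproduces the published figures (for example, 16 bytes / 117.492 ns ≈ 0.136 GB/s for FP64 erf) but is not confirmed by the source. The function names are hypothetical.

```c
#include <assert.h>

/* Speedup as defined in this report: scalar libm time / SLEEF vector time. */
static double speedup(double libm_ns, double vector_ns) {
    assert(vector_ns > 0.0);
    return libm_ns / vector_ns;
}

/* Throughput in GB/s from time per element in ns.
 * ASSUMPTION: each element moves one load plus one store of elem_bytes.
 * Bytes per nanosecond equals GB/s (with 1 GB = 1e9 bytes). */
static double throughput_gbps(double ns_per_elem, int elem_bytes) {
    assert(ns_per_elem > 0.0 && elem_bytes > 0);
    return 2.0 * elem_bytes / ns_per_elem;
}
```

For example, speedup(64.825, 24.326) is about 2.66, and throughput_gbps(24.326, 4) is about 0.329, matching the RVVM2 erff row.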

Observations

What the benchmark shows

Key takeaways

Raw Benchmark Outputs (Two Configurations)

RVVM1 (LMUL=1)

$ ./gelu_rvvm1 11008 100000; ./gelu_rvvm2 11008 100000;
========================================
GELU Kernel using SLEEF (RISC-V Vector (LMUL=1))
========================================

Double precision correctness test:
  GELU exact max error:         2.557e-16
  GELU tanh approx max error:   2.479e-16
  Max difference (exact vs tanh): 4.732e-04

Single precision correctness test:
  GELU exact max error:         1.923e-07
  GELU tanh approx max error:   2.594e-07
  Max difference (exact vs tanh): 4.733e-04

Benchmark (n=11008, iterations=100000):
========================================

Double Precision (FP64):
Implementation                    Time (ns)   Throughput    Speedup
------------------------------ ------------ ------------ ----------
Scalar libm (erf)                   117.492     0.136 GB/s      1.00x
SLEEF Vector (erf)                   83.387     0.192 GB/s      1.41x

Scalar libm (tanh)                  155.479     0.103 GB/s      1.00x
SLEEF Vector (tanh)                  97.503     0.164 GB/s      1.59x

========================================

Single Precision (FP32):
Implementation                    Time (ns)   Throughput    Speedup
------------------------------ ------------ ------------ ----------
Scalar libm (erff)                   64.321     0.124 GB/s      1.00x
SLEEF Vector (erff)                  32.141     0.249 GB/s      2.00x

Scalar libm (tanhf)                  72.396     0.111 GB/s      1.00x
SLEEF Vector (tanhf)                 39.766     0.201 GB/s      1.82x

========================================

RVVM2 (LMUL=2)

$ ./gelu_rvvm2 11008 100000
========================================
GELU Kernel using SLEEF (RISC-V Vector (LMUL=2))
========================================

Double precision correctness test:
  GELU exact max error:         2.557e-16
  GELU tanh approx max error:   2.479e-16
  Max difference (exact vs tanh): 4.732e-04

Single precision correctness test:
  GELU exact max error:         1.923e-07
  GELU tanh approx max error:   2.594e-07
  Max difference (exact vs tanh): 4.733e-04

Benchmark (n=11008, iterations=100000):
========================================

Double Precision (FP64):
Implementation                    Time (ns)   Throughput    Speedup
------------------------------ ------------ ------------ ----------
Scalar libm (erf)                   119.103     0.134 GB/s      1.00x
SLEEF Vector (erf)                   61.969     0.258 GB/s      1.92x

Scalar libm (tanh)                  156.496     0.102 GB/s      1.00x
SLEEF Vector (tanh)                  98.139     0.163 GB/s      1.59x

========================================

Single Precision (FP32):
Implementation                    Time (ns)   Throughput    Speedup
------------------------------ ------------ ------------ ----------
Scalar libm (erff)                   64.825     0.123 GB/s      1.00x
SLEEF Vector (erff)                  24.326     0.329 GB/s      2.66x

Scalar libm (tanhf)                  72.987     0.110 GB/s      1.00x
SLEEF Vector (tanhf)                 39.688     0.202 GB/s      1.84x

========================================