Comparison of libm scalar vs. SLEEF vector GELU implementations (erf exact vs tanh approximation). Results captured on 17 Oct 2025. Implementation file: gelu_kernel.c (https://github.com/docularxu/sleef/blob/working.sleef.bench/gelu/gelu_kernel.c).
Headline speedups (RVVM2): 1.92x (erf, FP64) and 2.66x (erff, FP32).
Test system:
- OS image: released by the Fedora-V Force team; download: images.fedoravforce.org
- SoC manufactured by SpacemiT
- CPU: 8 cores, SpacemiT® X60
- MMU mode: Sv39
This implementation provides high-performance vectorized GELU (Gaussian Error Linear Unit) activation functions using SLEEF's math library across multiple SIMD architectures.
GELU is an activation function commonly used in neural networks, especially in Transformer models like BERT and GPT. It has two formulations:
- Exact: GELU(x) = 0.5 * x * (1 + erf(x/√2))
- Tanh approximation: GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))
GELU kernel benchmarks were built and run from the gelu/ directory.
Source code: git@github.com:docularxu/sleef.git (fork of shibatch/sleef), branch working.sleef.bench. Browse: docularxu/sleef@working.sleef.bench.
Build:

```
cd gelu
./build_riscv.sh
```
Benchmark testing:
```
$ ./gelu_rvvm1 11008 100000; ./gelu_rvvm2 11008 100000
```
In LLM inference, GELU is applied over the FFN intermediate size (expanded hidden width per token after the first MLP projection), not small toy sizes. Using n=11008 mimics LLaMA 7B's FFN width, making the microbenchmark more representative.
| Mode | GELU Variant | Precision | libm Time (ns) | Vector Time (ns) | libm Throughput (GB/s) | Vector Throughput (GB/s) | Speedup |
|---|---|---|---|---|---|---|---|
| RVVM1 | erf | FP64 | 117.492 | 83.387 | 0.136 | 0.192 | 1.41x |
| RVVM1 | tanh | FP64 | 155.479 | 97.503 | 0.103 | 0.164 | 1.59x |
| RVVM1 | erff | FP32 | 64.321 | 32.141 | 0.124 | 0.249 | 2.00x |
| RVVM1 | tanhf | FP32 | 72.396 | 39.766 | 0.111 | 0.201 | 1.82x |
| RVVM2 | erf | FP64 | 119.103 | 61.969 | 0.134 | 0.258 | 1.92x |
| RVVM2 | tanh | FP64 | 156.496 | 98.139 | 0.102 | 0.163 | 1.59x |
| RVVM2 | erff | FP32 | 64.825 | 24.326 | 0.123 | 0.329 | 2.66x |
| RVVM2 | tanhf | FP32 | 72.987 | 39.688 | 0.110 | 0.202 | 1.84x |
Note: 'erf' indicates the exact GELU formulation using the error function; 'tanh' indicates the tanh approximation. The 'f' suffix denotes single precision (FP32).
[Charts: execution time per element for FP64 (erf exact, tanh approx) and FP32 (erff, tanhf) variants, lower is better; corresponding throughput charts, higher is better.]
Speedup = libm time / SLEEF Vector time.
erf = exact GELU; tanh = tanh-approx GELU. Suffix f denotes FP32.

```
$ ./gelu_rvvm1 11008 100000; ./gelu_rvvm2 11008 100000;
========================================
GELU Kernel using SLEEF (RISC-V Vector (LMUL=1))
========================================
Double precision correctness test:
  GELU exact max error:           2.557e-16
  GELU tanh approx max error:     2.479e-16
  Max difference (exact vs tanh): 4.732e-04
Single precision correctness test:
  GELU exact max error:           1.923e-07
  GELU tanh approx max error:     2.594e-07
  Max difference (exact vs tanh): 4.733e-04

Benchmark (n=11008, iterations=100000):
========================================
Double Precision (FP64):
Implementation                 Time (ns)    Throughput   Speedup
------------------------------ ------------ ------------ ----------
Scalar libm (erf)              117.492      0.136 GB/s   1.00x
SLEEF Vector (erf)             83.387       0.192 GB/s   1.41x
Scalar libm (tanh)             155.479      0.103 GB/s   1.00x
SLEEF Vector (tanh)            97.503       0.164 GB/s   1.59x
========================================
Single Precision (FP32):
Implementation                 Time (ns)    Throughput   Speedup
------------------------------ ------------ ------------ ----------
Scalar libm (erff)             64.321       0.124 GB/s   1.00x
SLEEF Vector (erff)            32.141       0.249 GB/s   2.00x
Scalar libm (tanhf)            72.396       0.111 GB/s   1.00x
SLEEF Vector (tanhf)           39.766       0.201 GB/s   1.82x
========================================
```
```
$ ./gelu_rvvm2 11008 100000
========================================
GELU Kernel using SLEEF (RISC-V Vector (LMUL=2))
========================================
Double precision correctness test:
  GELU exact max error:           2.557e-16
  GELU tanh approx max error:     2.479e-16
  Max difference (exact vs tanh): 4.732e-04
Single precision correctness test:
  GELU exact max error:           1.923e-07
  GELU tanh approx max error:     2.594e-07
  Max difference (exact vs tanh): 4.733e-04

Benchmark (n=11008, iterations=100000):
========================================
Double Precision (FP64):
Implementation                 Time (ns)    Throughput   Speedup
------------------------------ ------------ ------------ ----------
Scalar libm (erf)              119.103      0.134 GB/s   1.00x
SLEEF Vector (erf)             61.969       0.258 GB/s   1.92x
Scalar libm (tanh)             156.496      0.102 GB/s   1.00x
SLEEF Vector (tanh)            98.139       0.163 GB/s   1.59x
========================================
Single Precision (FP32):
Implementation                 Time (ns)    Throughput   Speedup
------------------------------ ------------ ------------ ----------
Scalar libm (erff)             64.825       0.123 GB/s   1.00x
SLEEF Vector (erff)            24.326       0.329 GB/s   2.66x
Scalar libm (tanhf)            72.987       0.110 GB/s   1.00x
SLEEF Vector (tanhf)           39.688       0.202 GB/s   1.84x
========================================
```