RISC-V GEMM Performance Summary

Comparison of DGEMM (Double) and SGEMM (Single) across Vector Targets (Matrix Sizes 256-4096)

Copyright 2025

Haibin Liu, Southeast University

OpenBLAS Version Under Test

Benchmarks were captured with OpenBLAS 0.3.30, the official release published on June 19, 2025.

How to Run the Benchmark

Detailed guidelines for running these benchmarks and collecting performance data are provided in the benchmark.md documentation.

Please consult that guide for how the data presented in this report were collected, including build configurations, runtime parameters, and measurement methodology.

Benchmark Environment

Software Platform

Fedora 42 Remix

Released by the Fedora-V Force team.

Download link: images.fedoravforce.org

Hardware Platform

K1 SoC (RISC-V)

Manufactured by SpacemiT.

CPU: 8 cores, model Spacemit® X60.

ISA Profile:
rv64imafdcv_zicbom_zicboz_zicntr_zicond_zicsr_zifencei_zihintpause_zihpm_zfh_zfhmin_zca_zcd_zba_zbb_zbc_zbs_zkt_zve32f_zve32x_zve64d_zve64f_zve64x_zvfh_zvfhmin_zvkt_sscofpmf_sstc_svinval_svnapot_svpbmt

MMU Mode: sv39.

Theoretical Performance Analysis

K1 SoC Vector Register Architecture (256-bit VLEN)

Single Precision (32-bit): 8 elements per vector register (8 × 32 bits = 256 bits total) → 8x theoretical speedup

Double Precision (64-bit): 4 elements per vector register (4 × 64 bits = 256 bits total) → 4x theoretical speedup

Note: K1 SoC (Spacemit® X60) implements the RISC-V Vector Extension with VLEN=256 bits, supporting both single and double precision floating-point operations through the RVV instruction set.
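
To make the arithmetic behind these figures explicit, here is a minimal sketch (plain C, no vector hardware required; the VLEN and SEW values are simply the K1 numbers quoted above) that reproduces the elements-per-register and theoretical-speedup calculations:

    #include <stdio.h>

    /* Elements that fit in one vector register: VLEN / SEW. */
    static unsigned elements_per_register(unsigned vlen_bits, unsigned sew_bits)
    {
        return vlen_bits / sew_bits;
    }

    int main(void)
    {
        const unsigned vlen = 256;                          /* K1 SoC: VLEN = 256 bits */
        unsigned sgemm = elements_per_register(vlen, 32);   /* single precision */
        unsigned dgemm = elements_per_register(vlen, 64);   /* double precision */

        /* The theoretical speedup equals the element count, assuming one
           fully packed vector op replaces that many scalar ops. */
        printf("SGEMM: %u elements/register -> %ux theoretical speedup\n", sgemm, sgemm);
        printf("DGEMM: %u elements/register -> %ux theoretical speedup\n", dgemm, dgemm);
        return 0;
    }

The 8x and 4x figures are upper bounds; the Key Findings below compare them against the speedups actually measured.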

Key Findings

1. Practical vs. Theoretical Performance Gap: current RVV implementations achieve only 39-48% of theoretical peak

  • SGEMM: 3.1x actual speedup vs 8x theoretical = 38.9% efficiency (35,989 vs 11,546 MFLOPS on ZVL256B)
  • DGEMM: 1.9x actual speedup vs 4x theoretical = 47.5% efficiency (15,857 vs 8,175 MFLOPS on ZVL256B)
  • DGEMM better utilizes available vector parallelism
  • Implication: the limitation likely lies in memory bandwidth, register pressure, and kernel design. SGEMM's 8-wide operations appear to face more severe register pressure and memory-bandwidth bottlenecks on the K1 platform than DGEMM's 4-wide operations

2. Vector Length (VLEN) Gain: longer is better, but doubling VLEN shows diminishing returns

  • ZVL256B vs ZVL128B gains fall short of 2x
  • SGEMM: ZVL256B achieves only 1.22x over ZVL128B (should be ~2x if perfectly scaling)
  • DGEMM: ZVL256B achieves 1.37x over ZVL128B (also below ideal 2x)
  • Key insight: Memory subsystem bandwidth, not compute capacity, is the primary bottleneck for GEMM on K1 — wider vectors can't be fed fast enough

3. For compute-intensive kernels such as GEMM, higher LMUL provides minimal benefit and sometimes hurts performance

  • Only clear benefit: DGEMM on ZVL128B sees a modest 8% improvement with LMUL=4, compensating for its narrow 128-bit vectors
  • Matrix-size dependency: LMUL=2 shows 11% gain for SGEMM ZVL128B at 256×256 (22,805 vs 20,482 MFLOPS) but 3% degradation at 4096×4096 (28,696 vs 29,451 MFLOPS)
  • Consistent degradation: Higher LMUL shows 2-3% performance loss at large matrix sizes (4096×4096) in 3 out of 4 configurations (DGEMM ZVL256B, both SGEMM cases)
  • Root cause: increasing LMUL widens each register group but leaves fewer usable groups (32/LMUL), so register pressure rises, limiting the compiler's ability to keep critical values in registers and forcing more memory traffic
💡 Critical Recommendation: Matrix-Size Adaptive Tuning

For production deployments on K1, implement matrix-size adaptive LMUL selection:

  • Small matrices (<1024): Use LMUL=2
  • Large matrices (≥1024): Use LMUL=1
  • Exception - DGEMM ZVL128B: Always use LMUL=4 (8% gain)

Impact: This strategy maximizes performance across all matrix sizes by balancing register pressure and computational throughput.
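
A minimal sketch of what such a dispatch could look like is shown below. It is illustrative only: OpenBLAS fixes its GEMM microkernel (and therefore LMUL) at build time per target, so the selector functions, the enum, and the hard-coded 1024 threshold are hypothetical wrappers around the measurements above, not existing OpenBLAS APIs.

    #include <stdio.h>
    #include <stddef.h>

    /* LMUL values considered in this report. */
    typedef enum { LMUL_1 = 1, LMUL_2 = 2, LMUL_4 = 4 } lmul_t;

    /* Matrix-size adaptive LMUL selection for ZVL256B, using the 1024
       small/large threshold from the recommendation above. */
    static lmul_t select_lmul_zvl256b(size_t n)
    {
        return (n < 1024) ? LMUL_2 : LMUL_1;
    }

    /* The exception: DGEMM on ZVL128B prefers LMUL=4 at every measured size. */
    static lmul_t select_lmul_zvl128b_dgemm(size_t n)
    {
        (void)n;
        return LMUL_4;
    }

    int main(void)
    {
        const size_t sizes[] = { 256, 512, 1024, 2048, 4096 };
        for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; ++i)
            printf("n=%4zu  ZVL256B -> LMUL=%d   ZVL128B DGEMM -> LMUL=%d\n",
                   sizes[i], (int)select_lmul_zvl256b(sizes[i]),
                   (int)select_lmul_zvl128b_dgemm(sizes[i]));
        return 0;
    }

In a real deployment each branch would dispatch to a GEMM microkernel compiled with the corresponding LMUL; the point of the sketch is only the size-based decision itself.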

LMUL Analysis Summary:

At 4096×4096 (large matrices):

  • DGEMM ZVL128B: LMUL=4 (11,552 MFLOPS) vs LMUL=1 (10,686 MFLOPS) = 1.08x gain
  • DGEMM ZVL256B: LMUL=2 (15,388 MFLOPS) vs LMUL=1 (15,857 MFLOPS) = 0.97x (3% slower!)
  • SGEMM ZVL128B: LMUL=2 (28,696 MFLOPS) vs LMUL=1 (29,451 MFLOPS) = 0.97x (3% slower)
  • SGEMM ZVL256B: LMUL=2 (35,813 MFLOPS) vs LMUL=1 (35,989 MFLOPS) = 0.995x (0.5% slower)

At 256×256 (small matrices):

  • DGEMM ZVL256B: LMUL=2 (12,243 MFLOPS) vs LMUL=1 (12,536 MFLOPS) = 0.98x (close)
  • SGEMM ZVL128B: LMUL=2 (22,805 MFLOPS) vs LMUL=1 (20,482 MFLOPS) = 1.11x gain! 📈


Raw Performance (MFLOPS vs. Matrix Size)

DGEMM (Double Precision) - MFLOPS

SGEMM (Single Precision) - MFLOPS

Speedup Analysis (vs. GENERIC Time)

DGEMM (Double Precision) - Speedup (GENERIC = 1.0x)

SGEMM (Single Precision) - Speedup (GENERIC = 1.0x)

Raw Data (Matrix Size 256 - 4096)

Columns: Precision | Target | LMUL | Matrix Size | Flops (MFLOPS) | Time (s)

📚 RISC-V Vector Concepts Reference

This section provides background on RISC-V Vector architecture concepts referenced in the analysis above.

Register Grouping (LMUL) & Vector Length Control

1. Register Grouping (LMUL - Length Multiplier)

LMUL allows grouping multiple physical vector registers to process more data per instruction.

LMUL=1 (no grouping): 1 register (v0) per operand, 256 bits of data per operation

LMUL=2 (2 registers grouped): v0-v1 act as one operand, 512 bits of data per operation (2× wider)

LMUL=4 (4 registers grouped): v0-v3 act as one operand, 1024 bits of data per operation (4× wider)

Trade-off: Higher LMUL processes more data per instruction but reduces available register count. On a 32-register system: LMUL=1 gives 32 vector groups, LMUL=2 gives 16 groups, LMUL=4 gives 8 groups.
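
The trade-off can be observed directly on RVV hardware. The probe below is a small sketch using the RVV C intrinsics (v1.0 naming with the __riscv_ prefix; older toolchains use unprefixed names) and assumes an RVV-enabled compiler, e.g. something like -march=rv64gcv, plus RVV hardware or QEMU to run it. It asks the CPU for VLMAX at SEW=32 under each LMUL; the register-group counts in the comments follow from the 32-register file described above.

    #include <stdio.h>
    #include <riscv_vector.h>   /* RVV C intrinsics */

    int main(void)
    {
        /* VLMAX = (LMUL x VLEN) / SEW, reported by the hardware itself.
           On the K1 (VLEN=256, SEW=32) this prints 8, 16 and 32. */
        printf("e32, LMUL=1: VLMAX = %zu elements (32 register groups)\n",
               __riscv_vsetvlmax_e32m1());
        printf("e32, LMUL=2: VLMAX = %zu elements (16 register groups)\n",
               __riscv_vsetvlmax_e32m2());
        printf("e32, LMUL=4: VLMAX = %zu elements ( 8 register groups)\n",
               __riscv_vsetvlmax_e32m4());
        return 0;
    }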

2. Vector Length Control (VLEN - Implementation-defined)

RISC-V Vector allows implementations to choose their vector register width (VLEN). Common values:

ZVL128B: VLEN = 128 bits

ZVL256B: VLEN = 256 bits (K1 SoC)

ZVL512B: VLEN = 512 bits

Portability: RISC-V Vector code is VLEN-agnostic. The same binary runs on different VLEN implementations, automatically utilizing the available vector width through runtime detection.
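
The sketch below illustrates the strip-mining pattern this portability relies on; it is a generic SAXPY written with the RVV C intrinsics (again assuming the v1.0 __riscv_ naming), not code taken from the OpenBLAS kernels. Each loop iteration asks vsetvl for as many elements as the hardware will grant, so the same source runs unchanged on ZVL128B, ZVL256B, and ZVL512B parts.

    #include <stddef.h>
    #include <riscv_vector.h>

    /* VLEN-agnostic SAXPY: y += a * x over n elements. Wider hardware simply
       hands back a larger vl per iteration, so fewer iterations are needed. */
    void saxpy_rvv(size_t n, float a, const float *x, float *y)
    {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m2(n);          /* vl = min(n, VLMAX) */
            vfloat32m2_t vx = __riscv_vle32_v_f32m2(x, vl);
            vfloat32m2_t vy = __riscv_vle32_v_f32m2(y, vl);
            vy = __riscv_vfmacc_vf_f32m2(vy, a, vx, vl);  /* vy += a * vx */
            __riscv_vse32_v_f32m2(y, vy, vl);
            x += vl;
            y += vl;
            n -= vl;
        }
    }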

💡 Understanding Application Vector Length (AVL)

RISC-V vectors are VLEN-agnostic through the vsetvl instruction, which takes an Application Vector Length (AVL) — the number of elements the software requests to process. The hardware returns the actual vector length (vl) based on this request and its capabilities.

How vsetvl Works:

  1. Software request: the code asks for AVL elements (e.g. AVL = 8, the number of elements it needs to process).
  2. vsetvl calculation: VLMAX = (LMUL × VLEN) / SEW, and vl = min(AVL, VLMAX).
  3. Hardware returns: vl, the actual number of elements each subsequent vector operation will process.

✓ Vector operations process exactly vl elements
✓ Same code adapts to different VLEN implementations

Concrete Example: SGEMM 8×8 Kernel

Kernel code: vsetvl_e32m2(8) ← Requests 8 elements
• SEW = 32 bits (element width)
• LMUL = 2 (register grouping)
• AVL = 8 (requested by algorithm)
Hardware             VLMAX                  vl returned
VLEN=128 (ZVL128B)   2 × 128 / 32 = 8       min(8, 8)  = 8
VLEN=256 (K1)        2 × 256 / 32 = 16      min(8, 16) = 8
VLEN=512             2 × 512 / 32 = 32      min(8, 32) = 8
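
The same behaviour can be checked with a few lines of intrinsics. This is a small sketch (assuming the v1.0 __riscv_ intrinsic naming and an RVV-enabled toolchain), not the actual SGEMM kernel: it issues the kernel's request (SEW=32, LMUL=2, AVL=8) and prints the VLMAX and vl the hardware grants.

    #include <stdio.h>
    #include <riscv_vector.h>

    int main(void)
    {
        /* Same request the 8x8 kernel makes. On ZVL128B, ZVL256B (K1) and
           ZVL512B this prints vl = 8, while VLMAX grows with the VLEN. */
        size_t avl   = 8;
        size_t vlmax = __riscv_vsetvlmax_e32m2();
        size_t vl    = __riscv_vsetvl_e32m2(avl);
        printf("VLMAX = %zu, requested AVL = %zu, granted vl = %zu\n",
               vlmax, avl, vl);
        return 0;
    }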

Visual: Register Element Layout (e32m2 - 2 registers grouped)

  • VLEN=128: VLMAX=8, vl=8. Register 0 holds elements 0-3 and register 1 holds elements 4-7 (4-element capacity each); all 256 bits (2 registers × 128 bits) are used.
  • VLEN=256 (K1): VLMAX=16, vl=8 (constrained by AVL). Register 0 holds all 8 elements (8-element capacity); register 1's 8 element slots go unused. Total capacity: 512 bits (2 registers × 256 bits).
  • VLEN=512: VLMAX=32, vl=8 (constrained by AVL). The 8 elements fill half of register 0's 16-element capacity; register 1 is unused. Total capacity: 1024 bits (2 registers × 512 bits).

(LMUL=2 means 2 physical registers are grouped together; only the vl elements are processed, and the remaining capacity is left idle.)

Key Point: The same kernel binary runs on all hardware, but the actual vector length (vl) is determined by both the software's request (AVL) and the hardware's capability (VLMAX). The hardware can only provide what the software asks for, even if it's capable of more.