Comparison of DGEMM (Double) and SGEMM (Single) across Vector Targets (Matrix Sizes 256-4096)
Copyright 2025
Benchmarks were run with the official OpenBLAS 0.3.30 release (June 19, 2025).
Instructions for running these benchmarks and collecting performance data are in the benchmark.md documentation.
The benchmark guide explains how the data in this report were collected, including build configurations, runtime parameters, and measurement methodology.
Released by the Fedora-V Force team.
Download link: images.fedoravforce.org
Manufactured by SpacemiT.
CPU: 8 cores, model Spacemit® X60.
ISA Profile:
rv64imafdcv_zicbom_zicboz_zicntr_zicond_zicsr_zifencei_zihintpause_zihpm_zfh_zfhmin_zca_zcd_zba_zbb_zbc_zbs_zkt_zve32f_zve32x_zve64d_zve64f_zve64x_zvfh_zvfhmin_zvkt_sscofpmf_sstc_svinval_svnapot_svpbmt
MMU Mode: sv39.
Single Precision (32-bit): 8 elements per vector register
Double Precision (64-bit): 4 elements per vector register
Note: K1 SoC (Spacemit® X60) implements the RISC-V Vector Extension with VLEN=256 bits, supporting both single and double precision floating-point operations through the RVV instruction set.
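The element counts above follow from dividing the register width by the element width. A minimal sketch of that arithmetic, assuming only VLEN=256 from the note above (the function name is illustrative, not part of any library):

```python
def elements_per_register(vlen_bits: int, sew_bits: int) -> int:
    """Number of SEW-wide elements that fit in one VLEN-bit vector register."""
    if vlen_bits % sew_bits != 0:
        raise ValueError("VLEN must be a multiple of the element width")
    return vlen_bits // sew_bits


K1_VLEN = 256  # VLEN of the K1 SoC, as stated in the note above

print("single precision:", elements_per_register(K1_VLEN, 32))
print("double precision:", elements_per_register(K1_VLEN, 64))
```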
For production deployments on K1, implement matrix-size adaptive LMUL selection:
LMUL=2
LMUL=1
LMUL=4 (8% gain)
Impact: This strategy maximizes performance across all matrix sizes by balancing register pressure and computational throughput.
At 4096×4096 (large matrices):
At 256×256 (small matrices):
| Precision | Target | LMUL | Matrix Size | Flops (MFLOPS) | Time (s) |
|---|---|---|---|---|---|
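Rows with these columns can be reduced to a per-size LMUL choice with a few lines of code. The sketch below is one possible way to do that, not the report's own tooling: the `Row` type, the function names, and the sample rows are hypothetical, and the sample timings are placeholders rather than measured results. It assumes the usual GEMM operation count of 2n³ floating-point operations for an n×n×n multiply.

```python
from collections import namedtuple

# Hypothetical record type mirroring the table columns above.
Row = namedtuple("Row", "precision target lmul size mflops seconds")


def gemm_mflops(n: int, seconds: float) -> float:
    """MFLOPS for an n x n x n GEMM, assuming 2*n^3 floating-point operations."""
    return 2 * n**3 / seconds / 1e6


def best_lmul_per_size(rows):
    """Map (precision, matrix size) to the LMUL with the highest MFLOPS."""
    best = {}
    for row in rows:
        key = (row.precision, row.size)
        if key not in best or row.mflops > best[key].mflops:
            best[key] = row
    return {key: row.lmul for key, row in best.items()}


# Placeholder timings for illustration only; these are NOT measured results.
placeholder_timings = [
    ("SGEMM", 1, 256, 0.020),
    ("SGEMM", 2, 256, 0.015),
    ("SGEMM", 4, 256, 0.025),
]
sample = [
    Row(prec, "example-target", lmul, n, gemm_mflops(n, t), t)
    for prec, lmul, n, t in placeholder_timings
]

for key, lmul in best_lmul_per_size(sample).items():
    print(key, "-> LMUL =", lmul)
```

Selecting on MFLOPS rather than on time keeps the comparison valid if rows for the same size come from runs with different iteration counts.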
This section provides background on RISC-V Vector architecture concepts referenced in the analysis above.
1. Register Grouping (LMUL - Length Multiplier)
LMUL allows grouping multiple physical vector registers to process more data per instruction.
LMUL=1 (No grouping): 1 register
256 bits of data per operation
LMUL=2 (2 registers grouped): 2× data per instruction
512 bits of data per operation (2× wider)
LMUL=4 (4 registers grouped): 4× data per instruction
1024 bits of data per operation (4× wider)
Trade-off: Higher LMUL processes more data per instruction but reduces available register count. On a 32-register system: LMUL=1 gives 32 vector groups, LMUL=2 gives 16 groups, LMUL=4 gives 8 groups.
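The trade-off can be tabulated directly. This is a small sketch under the figures given above (32 architectural vector registers, VLEN=256 on K1); the function name is illustrative:

```python
NUM_VECTOR_REGISTERS = 32  # RVV defines 32 architectural vector registers


def lmul_tradeoff(vlen_bits: int, lmul: int):
    """Return (bits processed per instruction, usable register groups)."""
    bits_per_op = vlen_bits * lmul
    register_groups = NUM_VECTOR_REGISTERS // lmul
    return bits_per_op, register_groups


for lmul in (1, 2, 4):
    bits, groups = lmul_tradeoff(256, lmul)
    print(f"LMUL={lmul}: {bits} bits per operation, {groups} register groups")
```

A GEMM micro-kernel keeps its accumulator tile in registers, so halving the number of groups directly limits how large that tile can be.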
2. Vector Length Control (VLEN - Implementation-defined)
RISC-V Vector allows implementations to choose their vector register width (VLEN). Common values, labeled by the Zvl*b extension that guarantees at least that width:
ZVL128B: VLEN = 128 bits
ZVL256B: VLEN = 256 bits (K1 SoC)
ZVL512B: VLEN = 512 bits
Portability: RISC-V Vector code can be written to be VLEN-agnostic. The same binary then runs on implementations with different VLEN and uses the available vector width, because the vsetvl instruction reports the usable vector length at run time.
💡 Understanding Application Vector Length (AVL)
RISC-V vectors are VLEN-agnostic through the vsetvl instruction, which takes an Application Vector Length (AVL) — the number of elements the software requests to process. The hardware returns the actual vector length (vl) based on this request and its capabilities.
How vsetvl Works:
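The request-and-grant behavior can be modeled in a few lines. The sketch below is a simplified software model, not hardware-accurate: it assumes vl = min(AVL, VLMAX) with VLMAX = LMUL × VLEN / SEW, whereas the RVV specification also permits other choices of vl when AVL lies between VLMAX and 2 × VLMAX. All function names are illustrative.

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Maximum elements per instruction: LMUL * VLEN / SEW."""
    return lmul * vlen_bits // sew_bits


def vsetvl(avl: int, vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Simplified model of vsetvl: grant min(AVL, VLMAX) elements."""
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))


def strip_mine(n: int, vlen_bits: int, sew_bits: int = 32, lmul: int = 2):
    """Return the vl granted on each loop iteration while processing n elements."""
    granted = []
    remaining = n
    while remaining > 0:
        vl = vsetvl(remaining, vlen_bits, sew_bits, lmul)
        granted.append(vl)   # a real kernel would load/compute/store vl elements here
        remaining -= vl      # request whatever is left on the next iteration
    return granted


# The same loop adapts to each hardware width without recompilation.
for vlen in (128, 256, 512):
    print(f"VLEN={vlen}: vl per iteration = {strip_mine(20, vlen)}")
```

Because the loop always requests the remaining element count, wider hardware simply finishes in fewer iterations.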
Concrete Example: SGEMM 8×8 Kernel
vsetvl_e32m2(8) ← Requests 8 elements
| Hardware | VLMAX | vl returned |
|---|---|---|
| VLEN=128 (ZVL128B) | 2 × 128 / 32 = 8 | min(8, 8) = 8 |
| VLEN=256 (K1) | 2 × 256 / 32 = 16 | min(8, 16) = 8 |
| VLEN=512 | 2 × 512 / 32 = 32 | min(8, 32) = 8 |
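The table rows can be recomputed from the two formulas it uses, VLMAX = LMUL × VLEN / SEW and vl = min(AVL, VLMAX). The sketch below does that for the fixed request of 8 elements at e32m2 and also reports what fraction of the register group is filled; the function name and the utilization figure are illustrative additions, not part of the report's data.

```python
def vl_for_request(avl: int, vlen_bits: int, sew_bits: int = 32, lmul: int = 2):
    """Return (VLMAX, vl) for a request of avl elements, using vl = min(AVL, VLMAX)."""
    vlmax = lmul * vlen_bits // sew_bits
    return vlmax, min(avl, vlmax)


AVL = 8  # the 8x8 SGEMM kernel always requests 8 elements

for vlen in (128, 256, 512):
    vlmax, vl = vl_for_request(AVL, vlen)
    utilization = vl / vlmax  # fraction of the register group actually used
    print(f"VLEN={vlen}: VLMAX={vlmax}, vl={vl}, utilization={utilization:.0%}")
```

With a fixed request of 8, vl stays at 8 on every implementation, so wider hardware leaves part of each register group unused.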
Visual: Register Element Layout (e32m2 - 2 registers grouped)
Key Point: The same kernel binary runs on all hardware, but the actual vector length (vl) is determined by both the software's request (AVL) and the hardware's capability (VLMAX). The hardware never returns more elements than the software requests, even if it is capable of more: on K1 (VLMAX = 16), this kernel's request for 8 elements fills only half of the register group.