RISC-V GEMV Performance Summary

Comparison of DGEMV (Double) and SGEMV (Single) across Vector Targets (Vector Sizes 256-4096)

Copyright 2025

Haibin Liu, Southeast University

OpenBLAS Version Under Test

Benchmarks were captured from OpenBLAS 0.3.30 (official release, June 19, 2025).

How to Run the Benchmark

Detailed guidelines for running these benchmarks and collecting performance data are given in the benchmark.md documentation.

The guide describes how the data in this report were collected, including build configurations, runtime parameters, and measurement methodology.

Benchmark Environment

Software Platform

Fedora 42 Remix

Released by the Fedora-V Force team.

Download link: images.fedoravforce.org

Hardware Platform

K1 SoC (RISC-V)

Manufactured by SpacemiT.

CPU: 8-core SpacemiT® X60.

ISA Profile:
rv64imafdcv_zicbom_zicboz_zicntr_zicond_zicsr_zifencei_zihintpause_zihpm
zfh_zfhmin_zca_zcd_zba_zbb_zbc_zbs_zkt_zve32f_zve32x
zve64d_zve64f_zve64x_zvfh_zvfhmin_zvkt_sscofpmf_sstc
svinval_svnapot_svpbmt

MMU mode: Sv39.

Performance Analysis Summary

About GEMV (General Matrix-Vector Multiply)

GEMV performs matrix-vector multiplication: y = α·A·x + β·y, where A is a matrix and x, y are vectors. Unlike GEMM (matrix-matrix), GEMV is memory-bandwidth bound rather than compute-bound, making it more challenging to optimize. Each element of A is typically accessed only once, limiting opportunities for data reuse in cache.

Key Findings

1. GEMV Shows Limited and Inconsistent Vector Benefits

Unlike GEMM, GEMV optimization shows highly variable results across different vector sizes and precisions:

  • DGEMV: Only vector size 1024 shows modest gains (~18% over GENERIC). Most other sizes show degraded performance or minimal improvement
  • SGEMV: Shows strong positive results for sizes 256-2048 (1.4x to 5.3x over GENERIC), but experiences significant performance degradation at size 4096 (~0.37x of GENERIC performance)
  • Root cause: GEMV is memory-bandwidth bound with minimal data reuse. Vector optimizations help with smaller working sets that fit in cache, but large vectors (4096) likely exceed cache capacity and suffer from memory bottlenecks

2. SGEMV: Size-Dependent Performance Profile

  • Small to medium vectors (256-2048): Excellent vectorization gains
    • Size 256: Up to 5.3x speedup (ZVL256B LMUL=2: 23.2 vs 4.4 MFLOPS)
    • Size 512-2048: Consistent 1.4x to 3.3x gains
  • Large vectors (4096): Performance collapse
    • RVV kernels: 9.5-12.0 MFLOPS vs GENERIC: 26.4 MFLOPS
    • Indicates cache thrashing or memory bandwidth saturation

3. DGEMV: Minimal Vector Benefit

  • Best case: Size 1024 shows ~18% improvement (6.2 vs 5.3 MFLOPS)
  • Most sizes: Performance degradation or no benefit
  • Double-precision memory bandwidth requirements overwhelm vectorization advantages
  • Observation: DGEMV vectorization on current RVV implementations provides limited practical benefit

💡 GEMV Optimization Recommendation:

  • SGEMV: Use RVV kernels for vectors ≤ 2048, fall back to GENERIC for size 4096+
  • DGEMV: RVV optimization shows marginal benefit; consider GENERIC for most workloads
  • LMUL impact: Minimal differentiation between LMUL settings in GEMV, unlike GEMM where it's significant


Raw Performance (MFLOPS vs. Vector Size)

DGEMV (Double Precision) - MFLOPS

SGEMV (Single Precision) - MFLOPS

Speedup Analysis (vs. GENERIC Time)

DGEMV (Double Precision) - Speedup (GENERIC = 1.0x)

SGEMV (Single Precision) - Speedup (GENERIC = 1.0x)

Raw Data (Vector Size 256 - 4096)

Columns: Precision | Target | LMUL | Vector Size | Flops (MFLOPS) | Time (s)