Compares four implementations of matrix-vector multiplication (y := Ax + y) on Perlmutter CPU nodes across problem sizes from N=1024 to N=16384. Compiler vectorization flags (AVX2) alone deliver a 4–5x speedup over the basic serial code. OpenMP parallelism incurs overhead at small problem sizes but outperforms CBLAS at large sizes, and 4 threads consistently offer the best balance between parallelism and thread-management cost.
Comparison of Serial and OpenMP Parallel Performance in Matrix-Vector Multiplication
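For readers who want to see the operation y := Ax + y concretely, here is a minimal sketch of a basic serial version and an OpenMP version. This is not the code that was benchmarked: the function names `matvec_basic` and `matvec_openmp`, the row-major storage of A, and the choice to parallelize over rows are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: y := A*x + y for an n x n matrix A stored row-major.
   Basic serial version, one dot product per row. */
void matvec_basic(int n, const double *A, const double *x, double *y) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        double sum = y[i];                      /* start from the existing y[i] */
        for (int j = 0; j < n; j++) {
            sum += A[(size_t)i * n + j] * x[j]; /* row i of A times x */
        }
        y[i] = sum;
    }
}

/* Hypothetical helper: same operation with rows split across OpenMP threads.
   Each thread writes a disjoint set of y[i], so no synchronization is needed.
   Without OpenMP enabled at compile time the pragma is ignored and the loop
   runs serially. */
void matvec_openmp(int n, const double *A, const double *x, double *y) {
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double sum = y[i];
        for (int j = 0; j < n; j++) {
            sum += A[(size_t)i * n + j] * x[j];
        }
        y[i] = sum;
    }
}
```

With GCC or Clang, OpenMP is enabled with `-fopenmp`, and the thread count can be set through the `OMP_NUM_THREADS` environment variable (for example `OMP_NUM_THREADS=4`). Spawning and synchronizing threads has a fixed cost per call, which is one plausible reason the parallel version only pays off at larger N.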