This post uses the LIKWID hardware performance monitoring tool to look beneath the runtime numbers, comparing L2 and L3 cache accesses and retired instruction counts across three DGEMM implementations: basic OpenMP, blocked-and-copy (BMMCO), and CBLAS DGEMM. The headline finding: at N=2048, the naive implementation incurs nearly 600x the L3 cache accesses of CBLAS, a gap that directly accounts for its poor performance. Larger block sizes (b=16) reduce cache pressure dramatically compared with smaller ones (b=4), and the difference shows up clearly in both the speedup curves and the hardware metrics.
Comparing LIKWID Hardware Metrics to Performance across DGEMM Implementations
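
To make the block-size discussion concrete, here is a minimal sketch of a blocked matrix multiply with tile copying, next to a plain triple loop. It is not the code measured in the post: the function names (`dgemm_naive`, `dgemm_blocked_copy`), the row-major layout, and the zero-padded tile buffers are all assumptions made for illustration, and both routines are serial, with no OpenMP pragmas. The point it shows is that each b x b tile of A, B, and C is copied into a small contiguous buffer before the inner loops run, so the working set of the innermost kernel is 3·b² doubles regardless of N.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Reference triple loop: C += A * B, all n x n, row-major (assumed layout). */
void dgemm_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = C[i * n + j];
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Copy a rows x cols tile (leading dimension n) into a contiguous
 * b x b buffer, zero-padding whatever the tile does not cover. */
static void copy_tile(size_t n, size_t b, size_t rows, size_t cols,
                      const double *src, double *dst)
{
    memset(dst, 0, b * b * sizeof *dst);
    for (size_t r = 0; r < rows; r++)
        memcpy(dst + r * b, src + r * n, cols * sizeof *dst);
}

/* Hypothetical blocked-and-copy variant: C += A * B with block size b.
 * Returns 0 on success, -1 if the tile buffers cannot be allocated. */
int dgemm_blocked_copy(size_t n, size_t b,
                       const double *A, const double *B, double *C)
{
    assert(b > 0);
    double *buf = malloc(3 * b * b * sizeof *buf);
    if (buf == NULL)
        return -1;
    double *At = buf, *Bt = buf + b * b, *Ct = buf + 2 * b * b;

    for (size_t i0 = 0; i0 < n; i0 += b) {
        size_t ib = min_sz(b, n - i0);
        for (size_t j0 = 0; j0 < n; j0 += b) {
            size_t jb = min_sz(b, n - j0);
            copy_tile(n, b, ib, jb, C + i0 * n + j0, Ct);
            for (size_t k0 = 0; k0 < n; k0 += b) {
                size_t kb = min_sz(b, n - k0);
                copy_tile(n, b, ib, kb, A + i0 * n + k0, At);
                copy_tile(n, b, kb, jb, B + k0 * n + j0, Bt);
                /* Inner kernel touches only the three small tiles. */
                for (size_t i = 0; i < ib; i++)
                    for (size_t k = 0; k < kb; k++) {
                        double a = At[i * b + k];
                        for (size_t j = 0; j < jb; j++)
                            Ct[i * b + j] += a * Bt[k * b + j];
                    }
            }
            /* Write the finished tile of C back to the full matrix. */
            for (size_t i = 0; i < ib; i++)
                memcpy(C + (i0 + i) * n + j0, Ct + i * b, jb * sizeof *Ct);
        }
    }
    free(buf);
    return 0;
}
```

Calling `dgemm_blocked_copy(n, 4, A, B, C)` and `dgemm_blocked_copy(n, 16, A, B, C)` on the same inputs gives the same product as `dgemm_naive`; only the memory access pattern differs. With b=16 each tile is 2 KiB, so all three tiles fit comfortably in a typical L1 data cache and each copied tile is reused for more arithmetic than at b=4, which is consistent with the lower cache traffic the post reports for the larger block size.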