CPU-based Matrix Multiplication

A highly optimized C++ implementation of the matrix multiplication C(1, 6000) = A(1, 4096) @ B_T(4096, 6000), where A is stored in FP32 and B in FP4. Because A has a single row, the operation is effectively a vector-matrix product.
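
For orientation, the product above can be written as a plain triple of indices with no optimizations at all. The sketch below is a hypothetical FP32 baseline, not the project's code: the function name naive_gemv and the row-major K x N layout of B_T are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical naive baseline: c[n] = sum over k of a[k] * b_t[k * N + n].
// ASSUMPTION: b_t is already dequantized to FP32 and stored row-major as K rows of N values.
void naive_gemv(const float* a, const float* b_t, float* c,
                std::size_t K, std::size_t N) {
    assert(a != nullptr && b_t != nullptr && c != nullptr);
    for (std::size_t n = 0; n < N; ++n) {
        float sum = 0.0f;
        for (std::size_t k = 0; k < K; ++k) {
            // Strided access: consecutive k values are N floats apart in memory.
            sum += a[k] * b_t[k * N + n];
        }
        c[n] = sum;
    }
}
```

With K = 4096 and N = 6000 the inner loop strides through memory N floats at a time, which is the access pattern a packed layout is meant to avoid.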

Key Optimizations

Multi-threading: OpenMP parallelization across output elements
SIMD Vectorization: AVX2 with FMA instructions for accelerated computation
Optimized FP4 Conversion: SIMD-accelerated batch dequantization with lookup tables
Memory Layout: Column-major packing so that each output element reads its weights contiguously, reducing cache misses
Prefetching: Memory prefetching to hide latency
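
A minimal sketch of how several of these ideas might fit together is shown below. It is not the project's implementation. It assumes that FP4 means the E2M1 encoding (1 sign, 2 exponent, 1 mantissa bit), that two codes are packed per byte with the low nibble first, and that each output column's K codes are stored contiguously. The names kFp4Lut, dequant_fp4, dot_f32 and gemv_fp4 are hypothetical. The AVX2/FMA path is only compiled when the compiler enables those instruction sets (for example with -mavx2 -mfma on GCC or Clang); otherwise a scalar loop is used. Prefetching and SIMD dequantization are left out to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#endif

// ASSUMPTION: FP4 is E2M1; the code's top bit is the sign.
static const float kFp4Lut[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f};

// Table-lookup dequantization of n codes (n even).
// ASSUMPTION: two codes per byte, low nibble first.
inline void dequant_fp4(const std::uint8_t* packed, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n / 2; ++i) {
        out[2 * i]     = kFp4Lut[packed[i] & 0x0F];
        out[2 * i + 1] = kFp4Lut[packed[i] >> 4];
    }
}

// Dot product: 8 floats per FMA when AVX2/FMA are enabled, scalar otherwise.
inline float dot_f32(const float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
    float sum = 0.0f;
#if defined(__AVX2__) && defined(__FMA__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i), acc);
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (float v : lanes) sum += v;
#endif
    for (; i < n; ++i) sum += a[i] * b[i];  // remainder, or the whole loop without AVX2
    return sum;
}

// c[n] = sum over k of a[k] * B[k][n], with column n occupying K/2 contiguous bytes.
void gemv_fp4(const float* a, const std::uint8_t* b_packed, float* c,
              std::size_t K, std::size_t N) {
    assert(K % 2 == 0);  // two codes per byte
    // The pragma is ignored unless OpenMP is enabled (e.g. -fopenmp).
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t n = 0; n < static_cast<std::ptrdiff_t>(N); ++n) {
        thread_local std::vector<float> col;  // per-thread scratch, reused across columns
        col.resize(K);
        dequant_fp4(b_packed + static_cast<std::size_t>(n) * (K / 2), col.data(), K);
        c[n] = dot_f32(a, col.data(), K);
    }
}
```

Dequantizing a full column into a scratch buffer keeps the sketch readable; a tuned kernel would more likely dequantize a few registers' worth of codes at a time and feed them straight into the FMA loop.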

Tested on an AMD EPYC 7763 (2 cores, 4 threads) with 16 GB RAM.

The optimized implementation maintains high numerical accuracy compared to the reference naive implementation.
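
One way such a comparison could be made is to report the largest absolute and relative differences between the two output vectors. The helper below is a hypothetical sketch; the names ErrorStats and compare_outputs and the eps guard against division by near-zero reference values are assumptions, not part of the project.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical summary of how far the optimized output drifts from the reference.
struct ErrorStats {
    float max_abs;  // largest |ref - opt|
    float max_rel;  // largest |ref - opt| / max(|ref|, eps)
};

ErrorStats compare_outputs(const float* ref, const float* opt,
                           std::size_t n, float eps = 1e-6f) {
    assert(ref != nullptr && opt != nullptr);
    ErrorStats s{0.0f, 0.0f};
    for (std::size_t i = 0; i < n; ++i) {
        const float abs_err = std::fabs(ref[i] - opt[i]);
        const float rel_err = abs_err / std::max(std::fabs(ref[i]), eps);
        s.max_abs = std::max(s.max_abs, abs_err);
        s.max_rel = std::max(s.max_rel, rel_err);
    }
    return s;
}
```

Small differences are expected even when both implementations are correct, because SIMD and multi-threaded code sum the products in a different order than a scalar loop.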

# Compile the project
make
# Run benchmarks
make run
# Clean build files
make clean