Which one is faster?
The number of floating point operations is the same in both cases!
The answer is not straightforward: it depends on the computer's architecture.
On my laptop (Intel(R) Core(TM) Ultra 7 155H CPU @ 4.80GHz), multiply_with_unrolling is approximately 3-4 times faster than multiply for size = 1e6 (see examples/optimization/loop_unrolling.cpp).
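Why? Modern CPUs are superscalar: they have multiple execution units and instruction pipelines, so several independent instructions can execute during the same clock cycle (instruction-level parallelism). They also provide SIMD instruction sets such as the Streaming SIMD Extensions (SSE2), which apply one operation to several values at once. The unrolled code spends fewer loop-control instructions (counter updates and branches) per multiplication and exposes more independent operations per iteration, so it better exploits these capabilities.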
Note: Modern compilers with -O3 -march=native may perform similar optimizations automatically, although giving them a hand is sometimes beneficial. Counting operations doesn't necessarily reflect performance.
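To make the comparison concrete, here is a minimal sketch of what a plain loop and a 4-way unrolled loop might look like, together with a small timing helper. The names scale, scale_unrolled and time_kernel are hypothetical stand-ins, not the actual code in examples/optimization/loop_unrolling.cpp, and the unroll factor of 4 is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the plain loop:
// one multiplication per iteration.
void scale(double *out, const double *in, double factor, std::size_t size) {
  for (std::size_t i = 0; i < size; ++i)
    out[i] = in[i] * factor;
}

// Hypothetical stand-in for the unrolled loop:
// four independent multiplications per iteration (assumed unroll factor),
// followed by a remainder loop for sizes that are not a multiple of 4.
void scale_unrolled(double *out, const double *in, double factor,
                    std::size_t size) {
  const std::size_t limit = size - size % 4;
  std::size_t i = 0;
  for (; i < limit; i += 4) {
    out[i] = in[i] * factor;
    out[i + 1] = in[i + 1] * factor;
    out[i + 2] = in[i + 2] * factor;
    out[i + 3] = in[i + 3] * factor;
  }
  for (; i < size; ++i)
    out[i] = in[i] * factor;
}

// Wall-clock time of a single kernel call, in seconds.
template <typename Kernel>
double time_kernel(Kernel kernel, std::vector<double> &out,
                   const std::vector<double> &in, double factor) {
  assert(out.size() == in.size());
  const auto start = std::chrono::steady_clock::now();
  kernel(out.data(), in.data(), factor, in.size());
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

Both functions perform exactly the same floating-point multiplications; only the loop structure differs. Calling time_kernel(scale, out, in, c) and time_kernel(scale_unrolled, out, in, c) on vectors of the same size gives the two timings to compare. A single call is a noisy measurement, so repeating it several times and comparing builds with and without -O3 -march=native gives a fairer picture; the ratio will vary with the CPU and the compiler.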