Just consider the generated code for the two `foo` variants:
https://godbolt.org/z/P7TMe4vax
The code using `double` interleaves the multiplications of the first and second summands to achieve more instruction-level parallelism (ILP). The code using `std::simd` doesn't do this.
IMHO the reason is that the `std::simd` implementation uses intrinsics, which prevent such optimizations. This can noticeably reduce the speedup you get from vectorization.
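The interleaving mentioned above can be illustrated with a hypothetical scalar kernel (`chained` and `interleaved` are made-up names, not the `foo` from the Godbolt link). Each summand is a chain of dependent multiplications, but the two chains are independent of each other, so issuing them alternately lets a CPU with pipelined multipliers overlap their latencies. This is only a sketch of the schedule: in C++ the source order does not dictate the emitted instruction order, and the compiler is free to reorder independent operations in either version.

```cpp
// Hypothetical kernel: two summands, each a chain of dependent multiplications.
// Written "one summand after the other": every multiply in s1 must wait for
// the previous one before s2's chain even starts in source order.
constexpr double chained(double a, double b, double c, double d) {
    double s1 = a * b;
    s1 = s1 * a;
    s1 = s1 * b;
    double s2 = c * d;
    s2 = s2 * c;
    s2 = s2 * d;
    return s1 + s2;
}

// Same operations, same association order per summand, but written in the
// interleaved order: while one multiply of s1 is in flight, an independent
// multiply of s2 can be issued.
constexpr double interleaved(double a, double b, double c, double d) {
    double s1 = a * b;
    double s2 = c * d;
    s1 = s1 * a;
    s2 = s2 * c;
    s1 = s1 * b;
    s2 = s2 * d;
    return s1 + s2;
}

// Because each summand is computed with identical operations in identical
// order, both versions produce bit-identical results.
static_assert(chained(2.0, 3.0, 4.0, 5.0) == interleaved(2.0, 3.0, 4.0, 5.0),
              "interleaving must not change the result");
```

Since only independent operations are reordered (no reassociation), this kind of scheduling is legal for `double` even without `-ffast-math`.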
I have no good solution for this issue yet; I just want to raise awareness here.
Maybe some kind of annotation for intrinsics needs to be introduced that tells the compiler that an annotated intrinsic may be optimized.