Discussion about this post

Martins:

You should change _mm_store_ps to an unaligned store (_mm_storeu_ps), just as you use unaligned loads for these things. Unless your structure has an attribute/pragma for 16-byte alignment.

For non-SSE4 code, you can do a conditional move more cheaply with _mm_andnot_ps, which does the mask negation for you. Then you need only and + andnot + or, just three operations. Alternatively, do a ^ ((a ^ b) & mask), also only three operations, with no bitwise negation needed.

But if you're using SSE4 code unconditionally anyway (because of blendps), then you can use _mm_test_all_zeros instead of _mm_movemask_ps plus a compare in IsZero.

Also, I strongly suggest never using the -Ofast optimization level. It breaks floating-point semantics, because it enables -ffast-math.

Luke Schoen:

Amazing article, thanks for sharing!

I'd also note that in recent months LLMs have gone from "might be able to do some SIMD" to "can optimize almost anything into excellent SIMD".

I got ChatGPT to rewrite my bitsplit function using AVX2 and it went from > 9 seconds to < 1 second 🤯
