4 Comments

You should change _mm_store_ps to an unaligned store, just as you use unaligned loads for these things - unless your structure has an attribute/pragma for 16-byte alignment.
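A minimal sketch of the distinction, assuming a plain struct with no alignment guarantee (the names here are mine, not the article's):

```cpp
#include <xmmintrin.h>

struct Vec4 { float v[4]; };                    // no alignment guarantee

void Store(Vec4* out, __m128 x)
{
    _mm_storeu_ps(out->v, x);                   // safe for any address
}

struct alignas(16) Vec4Aligned { float v[4]; }; // 16-byte alignment guaranteed

void StoreAligned(Vec4Aligned* out, __m128 x)
{
    _mm_store_ps(out->v, x);                    // OK only because of alignas(16)
}
```

On most modern x86 CPUs the unaligned store costs essentially nothing when the address happens to be aligned, so storeu is the safe default.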

For non-SSE4 code, you can do a conditional move more cheaply with _mm_andnot_ps - it does the mask negation for you, so you need only and+andnot+or - just 3 operations. Alternatively, do a ^ ((a ^ b) & mask) - also only 3 operations, with no bitwise negation needed.
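A hedged sketch of both 3-operation selects, assuming the mask lanes are all-ones or all-zeros (as produced by a compare); the function names are mine:

```cpp
#include <xmmintrin.h>

// Returns b where the mask lane is set, a where it is clear.
__m128 SelectAndnot(__m128 a, __m128 b, __m128 mask)
{
    // _mm_andnot_ps computes (~mask) & a, so the negation comes for free.
    return _mm_or_ps(_mm_andnot_ps(mask, a), _mm_and_ps(mask, b));
}

__m128 SelectXor(__m128 a, __m128 b, __m128 mask)
{
    // a ^ ((a ^ b) & mask): where the mask is set, the xor cancels a, leaving b.
    return _mm_xor_ps(a, _mm_and_ps(_mm_xor_ps(a, b), mask));
}
```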

But if you're using SSE4 code unconditionally (because of blendps), then you can use _mm_test_all_zeros instead of _mm_movemask_ps plus a compare in IsZero.
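Something like this, as a sketch (note that _mm_test_all_zeros works on integer vectors, so the float mask needs a cast):

```cpp
#include <smmintrin.h>  // SSE4.1

// SSE4.1: a single ptest instruction sets ZF when (m & m) == 0.
bool IsZeroSSE4(__m128 mask)
{
    __m128i m = _mm_castps_si128(mask);
    return _mm_test_all_zeros(m, m);
}

// Pre-SSE4 equivalent for comparison (checks only the sign bits, which is
// enough when the mask comes from a compare, i.e. lanes are all-ones/all-zeros).
bool IsZeroSSE2(__m128 mask)
{
    return _mm_movemask_ps(mask) == 0;
}
```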

Also, I strongly suggest never using the -Ofast optimization level. It breaks floating-point operations, because it enables -ffast-math.
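One concrete failure mode, as a sketch: -ffast-math implies -ffinite-math-only, which lets the compiler assume NaN never occurs and fold NaN checks away entirely.

```cpp
#include <cmath>

// With -Ofast (-ffast-math), the compiler may optimize this isnan() test
// to a constant "false", so a 0.0f/0.0f result slips straight through.
bool SafeDivide(float a, float b, float* out)
{
    float r = a / b;
    if (std::isnan(r))
        return false;
    *out = r;
    return true;
}
```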


Amazing article, thanks for sharing!

I'd also note that in recent months, LLMs have gone from 'might be able to do some SIMD' to 'can optimize almost anything into excellent SIMD'.

I got ChatGPT to rewrite my bitsplit function using AVX2 and it went from > 9 seconds to < 1 second 🤯


I've found that ChatGPT can typically solve small, well-defined programming problems very well, even with SIMD code. However, if you ask it to do anything more complicated, it always ends up with some very pernicious, hard-to-spot bug.

I think it could be helpful for giving you ideas for how to tackle a problem, but I still fundamentally think that a language model isn't the right tool for writing code that works correctly; from what I've seen, it can only mimic what looks correct without really understanding what's happening.

That being said, compilers are also getting really good at vectorization. In the ray tracer example of this article, I imagine that if you didn't use intrinsics but kept the SOA data layout, the compiler would probably do most of the SIMD optimization for you. The one exception I can think of is the HorizontalMinimumIndex function; I had to optimize that one manually because clang wasn't giving me the results I wanted.
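For illustration, here is one way a hand-written HorizontalMinimumIndex might look - a sketch of a common reduce-and-match approach, not necessarily the article's version, and it assumes no NaNs in the input:

```cpp
#include <xmmintrin.h>

int HorizontalMinimumIndex(__m128 v)
{
    // Min of lanes {0,1,2,3} with {2,3,0,1}, then with adjacent pairs swapped,
    // leaves the overall minimum broadcast to every lane of m.
    __m128 m = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)));
    m = _mm_min_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));

    // Find the first lane equal to the minimum (GCC/Clang builtin for
    // count-trailing-zeros; the mask is nonzero as long as no lane is NaN).
    int mask = _mm_movemask_ps(_mm_cmpeq_ps(v, m));
    return __builtin_ctz(mask);
}
```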


Awesome response ❤️!

Yeah agreed 👍💯

One trick I learned recently, which gets me over the 'looks right but has a bug' hump, is this:

Ask ChatGPT to combine the simple (non-SIMD) version of the function with the optimized, fast (but broken) SIMD version. Tell it to run both parts and compare all intermediate results, and if there is a difference, to print out a full log of the current values.

You just loop the outputs back into GPT until it works, then finally ask it to pull out just the SIMD version by itself.
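A sketch of what that comparison harness might look like (all names here are hypothetical - the comment doesn't show any code):

```cpp
#include <cstdio>

using Kernel = void (*)(const float* in, float* out, int n);

// Run the trusted scalar version and the SIMD candidate over the same input
// and log every element where they diverge; the log goes back into the chat.
bool CompareKernels(Kernel scalar, Kernel simd, const float* in, int n)
{
    float refOut[1024], simdOut[1024];  // assumes n <= 1024 for this sketch
    scalar(in, refOut, n);
    simd(in, simdOut, n);
    bool ok = true;
    for (int i = 0; i < n; ++i)
    {
        if (refOut[i] != simdOut[i])
        {
            std::printf("mismatch at %d: scalar=%g simd=%g in=%g\n",
                        i, refOut[i], simdOut[i], in[i]);
            ok = false;
        }
    }
    return ok;
}
```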

Definitely agree about data layout etc. - we aren't out of the job just yet 😉

Thanks again dude, awesome write-up (just found your blog and can't wait to deep-dive through the rest)!

Nice one dude 😎👉
