Have you looked at the paper "Hash Functions for GPU Rendering"? Seems to explore a similar space and also creates some new PCG variants.
https://jcgt.org/published/0009/03/02/
NEON actually supports proper 32-bit int multiplies via vmulq_s32; so does AVX2 with _mm256_mullo_epi32. NEON also has dynamic shifts via vshlq_s32 & co (it might look like a left-shift instruction, but it also does right shifts when given a negative shift amount).
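To make that concrete, here's a minimal sketch in C (the function names are mine, and each snippet targets its own ISA: NEON via <arm_neon.h>, AVX2 via <immintrin.h>):

```c
#include <arm_neon.h>

// NEON: full 32-bit lane-wise multiply followed by a dynamic shift.
int32x4_t neon_mul_then_shift(int32x4_t a, int32x4_t b)
{
    int32x4_t prod   = vmulq_s32(a, b);   // per-lane 32-bit multiply
    int32x4_t amount = vdupq_n_s32(-9);   // negative amount = shift right by 9
    // vshlq_s32 shifts each lane left by the per-lane amount; negative amounts
    // shift right (arithmetic for the s32 variant, logical for vshlq_u32).
    return vshlq_s32(prod, amount);
}
```

And the AVX2 counterpart of the multiply:

```c
#include <immintrin.h>

// AVX2: keeps the low 32 bits of each 32x32-bit lane product.
__m256i avx2_mul(__m256i a, __m256i b)
{
    return _mm256_mullo_epi32(a, b);
}
```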
Ah yes, you are correct. I will fix that. I still think SSE2 is worth targeting as a lowest common denominator between all SIMD platforms.
The broader idea of mixing results between lanes to increase the unpredictability of the output would definitely still apply, but perhaps there's a better way to do that for newer platforms.
There's a small mistake in 'Bit-Hacking the IEEE Float Format' step 4:
"Bitcast the resulting bits to a float and subtract 1.0 to shift the number down into the range: [1.0, 2.0)." - range should be "[0.0, 1.0)", the subtraction of 1.0 shifts _from_ [1.0,2.0) _to_ [0.0,1.0).
Thanks for pointing that out! Fixed :)
What about the period of lcg-cs?
In my local testing, the first two algorithms I showed have a period around 2^31:
LCG_XS has a period of 2840256475
LCG_XS_24 has a period of 2613094886
These algorithms trade a small amount of global period for better short-term randomness and ease of use. This is fine in most GPU contexts, where each thread typically needs only a handful of floating-point numbers and realistically uses no more than 2^24 outputs. The period will also vary slightly if you use different multipliers/increments (which is often done on a per-thread basis on a GPU).
LCG_XS_DUAL has a full 2^32 period per SIMD lane, since each lane is just a hashed LCG running independently.
In case anyone wants to verify for themselves, I made a simple test here:
https://gist.github.com/Ne0nWinds/bc0e02d63fde075cca95db347e09024c
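For reference, a brute-force period check can be as small as the sketch below (in C; the update function here is only illustrative, so substitute the generator you actually want to measure). It relies on the state update being a bijection on 32-bit values, which holds for LCG and xorshift steps, so the starting state is guaranteed to lie on a cycle:

```c
#include <stdint.h>
#include <stdio.h>

// Illustrative 32-bit state update (an LCG step plus xorshift-style mixing);
// the constants are placeholders, not necessarily the ones from the post.
static uint32_t next_state(uint32_t s)
{
    s = s * 747796405u + 2891336453u;
    s ^= s >> 16;
    return s;
}

int main(void)
{
    // Step until the starting state reappears; for a bijective 32-bit update
    // this terminates after at most 2^32 iterations (seconds on a modern CPU).
    const uint32_t start = 1u;
    uint32_t s = next_state(start);
    uint64_t period = 1;
    while (s != start) {
        s = next_state(s);
        ++period;
    }
    printf("period starting from %u: %llu\n", (unsigned)start, (unsigned long long)period);
    return 0;
}
```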