Have you looked at the paper "Hash Functions for GPU Rendering"? Seems to explore a similar space and also creates some new PCG variants.
https://jcgt.org/published/0009/03/02/
NEON actually supports proper 32-bit int multiplies via vmulq_s32; so does AVX2 with _mm256_mullo_epi32. NEON also has dynamic shifts via vshlq_s32 & co (it might look like a left-shift instruction, but it also does right shifts when given a negative shift amount).
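To make that concrete, here's a minimal sketch in C (the function names are mine, and each snippet targets its own ISA: NEON via <arm_neon.h>, AVX2 via <immintrin.h>):

```c
#include <arm_neon.h>

// NEON: full 32-bit lane-wise multiply followed by a dynamic shift.
int32x4_t neon_mul_then_shift(int32x4_t a, int32x4_t b)
{
    int32x4_t prod   = vmulq_s32(a, b);   // per-lane 32-bit multiply
    int32x4_t amount = vdupq_n_s32(-9);   // negative amount = shift right by 9
    // vshlq_s32 shifts each lane left by the per-lane amount; negative amounts
    // shift right (arithmetic for the s32 variant, logical for vshlq_u32).
    return vshlq_s32(prod, amount);
}
```

And the AVX2 counterpart of the multiply:

```c
#include <immintrin.h>

// AVX2: keeps the low 32 bits of each 32x32-bit lane product.
__m256i avx2_mul(__m256i a, __m256i b)
{
    return _mm256_mullo_epi32(a, b);
}
```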
Ah yes, you are correct. I will fix that. I still think SSE2 is worth targeting as a lowest common denominator between all SIMD platforms.
The broader idea of mixing results between lanes to increase the unpredictability of the output would definitely still apply, but perhaps there's a better way to do that for newer platforms.
There's a small mistake in 'Bit-Hacking the IEEE Float Format' step 4:
"Bitcast the resulting bits to a float and subtract 1.0 to shift the number down into the range: [1.0, 2.0)." - range should be "[0.0, 1.0)", the subtraction of 1.0 shifts _from_ [1.0,2.0) _to_ [0.0,1.0).
Thanks for pointing that out! Fixed :)
What about the period of lcg-cs?
In my local testing, the first two algorithms I showed have a period around 2^31:
LCG_XS has a period of 2840256475
LCG_XS_24 has a period of 2613094886
These algorithms trade a small amount of global period for better short-term randomness and ease of use. This is fine in most GPU contexts, where each thread typically needs only a handful of floating-point numbers and realistically uses no more than 2^24 outputs. The period will also vary slightly if you use different multipliers/increments (which is often done on a per-thread basis on a GPU).
LCG_XS_DUAL has a full 2^32 period per SIMD lane, since each lane is just a hashed LCG running independently.
In case anyone wants to verify for themselves, I made a simple test here:
https://gist.github.com/Ne0nWinds/bc0e02d63fde075cca95db347e09024c
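For reference, a brute-force period check can be as small as the sketch below (in C; the update function here is only illustrative, so substitute the generator you actually want to measure). It relies on the state update being a bijection on 32-bit values, which holds for LCG and xorshift steps, so the starting state is guaranteed to lie on a cycle:

```c
#include <stdint.h>
#include <stdio.h>

// Illustrative 32-bit state update (an LCG step plus xorshift-style mixing);
// the constants are placeholders, not necessarily the ones from the post.
static uint32_t next_state(uint32_t s)
{
    s = s * 747796405u + 2891336453u;
    s ^= s >> 16;
    return s;
}

int main(void)
{
    // Step until the starting state reappears; for a bijective 32-bit update
    // this terminates after at most 2^32 iterations (seconds on a modern CPU).
    const uint32_t start = 1u;
    uint32_t s = next_state(start);
    uint64_t period = 1;
    while (s != start) {
        s = next_state(s);
        ++period;
    }
    printf("period starting from %u: %llu\n", (unsigned)start, (unsigned long long)period);
    return 0;
}
```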