Building a Fast, SIMD/GPU-friendly Random Number Generator For Fun And Profit
Rethinking Random Number Generation for SIMD, GPUs, and Floats
When writing shaders, SIMD code, or GPU kernels, one typically doesn’t need a cryptographically secure random number generator — something fast and statistically decent is often good enough.
In these contexts, an RNG algorithm should:
Be fast — requiring only a handful of instructions to execute
Be minimal — easy to reason about and modify
Have decent statistical properties — no obvious repeating patterns
Be well-suited for floating-point numbers — no wasted mantissa bits or rounding issues
Be compatible with SIMD and GPU programming
This article introduces an algorithm that I’ll call LCG-XS that checks all those boxes. The core idea is simple: take a linear congruential generator (LCG) and apply an xorshift-style hash to its output. The result is an algorithm that:
Performs well in statistical tests like PractRand
Is compatible with SSE2, ARM NEON, WebAssembly, etc.
Has enough free parameters to make the RNG outputs completely decorrelated across SIMD/GPU lanes
Is simple enough to write from scratch without having to look anything up
This article will walk through each one of those points step by step, starting with a deceptively simple and surprisingly effective building block: the linear congruential generator.
The LCG Algorithm
Sometimes simple is better
The idea behind linear congruential generators is to start from an initial random seed, multiply it by a constant, and then apply a modulo operation. The final value becomes the seed for the next iteration — and also serves as the generated random number.
In practice, a literal modulo operation isn’t required since hardware word sizes already cause an implicit modulo. For example, a 32-bit multiplication that overflows will silently discard any bits beyond 2³², effectively performing a modulo by 2³².
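This wraparound is easy to demonstrate in plain C — a small sketch of mine, using uint32_t and uint64_t from <stdint.h> in place of the article's u32 typedef (the helper names are just for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Multiplying two uint32_t values silently keeps only the low 32 bits of
   the full product -- an implicit modulo by 2^32. */
uint32_t WrappingProduct(uint32_t A, uint32_t B) {
    return A * B; /* unsigned overflow is well-defined in C */
}

/* The same result, computed with an explicit 64-bit product and modulo. */
uint32_t ExplicitModProduct(uint32_t A, uint32_t B) {
    return (uint32_t)(((uint64_t)A * (uint64_t)B) % (1ull << 32));
}
```

Both functions always agree, which is why the LCG code below never spells out the modulo.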
#define LCG_MULTIPLIER 2654435761u
u32 LCG(u32 *Seed) {
u32 Result = *Seed * LCG_MULTIPLIER;
*Seed = Result;
return Result;
}
As far as pseudo-random number generators go, this is about as simple and fast as they get. Surprisingly, if the correct multiplier constant is chosen, the most significant bits are fairly unpredictable and of good statistical quality. The lower bits, however, follow very obvious repeating cycles.1
The effects of integer multiplication on the higher-order bits can be conceptualized intuitively. If a number is multiplied by a power of two, it’s the equivalent of shifting that number’s bits to the left. If a number is multiplied by a non-power-of-two, the bits still tend to move to the left, but the carries introduce irregular, nonlinear changes across the higher-order bits.
Seed (932): 0000 0011 1010 0100
Seed × 2: 0000 0111 0100 1000 (Seed << 1)
Seed × 5: 0001 0010 0011 0100
The lower bits follow predictable patterns, regardless of the multiplier used. If both the seed and the multiplier are odd, the result is always odd. If either one is even, the result is always even. In all cases, the least significant bit will not change, no matter how many iterations of the LCG are performed.
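This low-bit behavior can be checked empirically. A minimal sketch (the helper function is mine, not part of the algorithm):

```c
#include <assert.h>
#include <stdint.h>

#define LCG_MULTIPLIER 2654435761u

/* Runs a multiply-only LCG and reports whether the least significant bit
   of the state ever changed across the given number of iterations. */
int LowBitEverChanged(uint32_t Seed, int Iterations) {
    uint32_t InitialLowBit = Seed & 1u;
    for (int i = 0; i < Iterations; ++i) {
        Seed *= LCG_MULTIPLIER;
        if ((Seed & 1u) != InitialLowBit) {
            return 1;
        }
    }
    return 0;
}
```

An odd seed stays odd forever and an even seed stays even, exactly as described above.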
The lower bits can be improved by doing a simple addition with another constant. Performing this addition also makes the choice of the initial seed slightly less important — an initial seed of zero, for example, is handled automatically by the addition, whereas an initial seed of zero with a multiplication-only LCG would cause it to only ever output zeroes.
#define LCG_MULTIPLIER 2654435761u
#define LCG_INCREMENT 2891336453u
u32 LCG(u32 *Seed) {
u32 Result = *Seed * LCG_MULTIPLIER + LCG_INCREMENT;
*Seed = Result;
return Result;
}
LCGs have another interesting property: given the right constants, they’ll produce every 32-bit number exactly once before repeating2 — which is to say that it’s less like pure mathematical randomness and more like a deterministic shuffle of every possible 32-bit value, with no repeats or omissions until the full 2³² cycle completes.
The uniformity in the results of an LCG is often desirable in many circumstances. A pure mathematically random function can produce the same number 10 times in a row, whereas this is impossible for a well-designed LCG. While an LCG is not random in the mathematical sense, it may be perceived as more random than a true mathematically random function. In addition, the LCG algorithm is so simple and transparent that it can be very easily modified for a particular application’s needs.
In order for an LCG to cycle through the entire 32-bit range of integers without repeats, the following conditions must be met:
The increment must be an odd number
(LCG_INCREMENT % 2) == 1
The multiplier must be one more than a multiple of 4
(LCG_MULTIPLIER % 4) == 1
While these conditions don’t guarantee that the LCG will be particularly good, it does avoid some of the more pathological cases of LCGs. An LCG with poorly chosen constants might cycle through only a handful of numbers before repeating.
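These full-period conditions can be checked by brute force if the state is shrunk to 16 bits. The sketch below is my own illustration with arbitrarily picked constants: a multiplier of 77 (77 % 4 == 1) paired with an odd increment reaches the full 2¹⁶ period, while 79 (79 % 4 == 3) cannot, per the Hull–Dobell theorem referenced in the footnotes:

```c
#include <assert.h>
#include <stdint.h>

/* Counts how many steps a 16-bit LCG takes to return to its starting
   state. A full-period generator visits all 65536 values exactly once. */
uint32_t CycleLength(uint16_t Multiplier, uint16_t Increment) {
    uint16_t Start = 0;
    uint16_t Seed = Start;
    uint32_t Length = 0;
    do {
        Seed = (uint16_t)(Seed * Multiplier + Increment);
        ++Length;
    } while (Seed != Start);
    return Length;
}
```

Note that the loop terminates only because an odd multiplier makes the update a bijection; an even multiplier could wander into a cycle that never revisits the starting state.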
Generating Floating-Point Numbers
Using LCG as a base
Random number generation algorithms typically produce sequences of 32 or 64-bit integers, but they don’t have anything to do with integers per se; their true purpose is to maximize the unpredictability — or entropy — of every bit they output. The challenge when generating floating-point numbers is figuring out how to map these bits into a floating-point range without introducing precision issues.
Floating-point numbers are typically generated within the range: [0.0, 1.0). When converting the 32-bit output of a random number generator to a floating-point number within this range, two things have to be kept in mind:
The range between 0.0 and 1.0 in floating-point is a non-linear space, which makes it slightly more difficult to uniformly distribute floating-point numbers across that range.
Floating-point numbers only have 23 bits of precision in the mantissa, so the bits generated from a 32-bit RNG have to be selected in a way that doesn’t reduce the quality of the random number distribution.
This article will explain two methods for taking the output of a 32-bit RNG and using it to produce floating-point numbers.
Division By A Power-Of-Two
The most intuitive way to convert the integer output of an LCG to a floating-point number is to simply divide it by the total range of possible integer outputs. 32-bit integers have a range of 2³² — but since a 32-bit IEEE float can’t represent every 32-bit integer exactly, some of the bits have to be thrown away.
A 32-bit float can represent every integer exactly only up to 2²⁴ — which is equal to 16,777,216. If only 24 of the 32 bits in the RNG output are used, the resulting integer will be somewhere in the range [0, 2²⁴). To map it onto the range [0.0, 1.0), the resulting integer can be divided by 2²⁴.
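The 2²⁴ limit is easy to confirm directly — a tiny sketch, assuming standard IEEE-754 floats with the default round-to-nearest-even mode:

```c
#include <assert.h>

/* 16,777,215 + 1 is exactly representable as a 32-bit float, but
   16,777,216 + 1 is not -- it rounds back down to 16,777,216. */
float AddOne(float Value) {
    return Value + 1.0f;
}
```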
The question becomes: which 24 of the 32 bits should be used to generate floating-point numbers? As discussed earlier, the higher bits in the result of an LCG generally exhibit better statistical properties than the lower ones. Consequently, in a 32-bit LCG, the upper 24 bits are going to be better than the lower 24 bits — therefore, to best utilize the output of a 32-bit LCG, we can shift the 32-bit output down by 8, effectively ignoring the bottom 8 bits.
f32 RandomFloat(u32 *Seed) {
*Seed = *Seed * 747796405u + 2891336453u;
const f32 MaxInt = 16777216.0;
return (f32)(*Seed >> 8) / MaxInt;
}
This method perfectly maps the output of the LCG to the floating-point range between 0.0 and 1.0. Because a power-of-two is used as the denominator in the division3, there aren’t any floating-point precision issues to worry about, which means the numbers are mapped in a perfectly uniform fashion.
The problem with this method, however, is that it requires a conversion between integers and floating-point numbers, something that certain platforms (like SSE2) can’t always do efficiently.
Bit-Hacking the IEEE Float Format
Another interesting way to generate a floating-point number is to directly construct it by manipulating the mantissa bits.
The process for doing so looks like this:
Run the standard LCG algorithm.
*Seed = *Seed * 747796405u + 2891336453u;
Shift the 32-bit output down by 9, creating a new 23-bit number that will become the new mantissa.
u32 Mantissa = *Seed >> 9;
Set the exponent bits to 127 — this makes it so that any combination of bits in the mantissa represent a number within the range: [1.0, 2.0)4. Then, perform a bitwise OR operation to merge the exponent and randomized mantissa.
u32 Bits = FloatBitsToUint(1.0f) | Mantissa;
Bitcast the resulting bits to a float and subtract 1.0 to shift the number down into the range: [0.0, 1.0).
f32 Result = UintBitsToFloat(Bits) - 1.0;
Because all the floats below 1.0 have more precision than all the floats ≥1.0, the final subtraction doesn’t have precision issues due to rounding. The distribution of random numbers is perfectly uniform across the [0.0, 1.0) range.
As an example, suppose the LCG output is this 32-bit number:
LCG output: 11010000000000000000000000000101
After shifting down by 9, the lower bits are removed:
Mantissa: 11010000000000000000000
This value is then directly put into the mantissa:
| Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits) |
| 0 | 01111111 | 11010000000000000000000 |
This floating-point number is equal to 1.8125. From here, the number is subtracted by 1.0 to give the final result of 0.8125.
u32 FloatBitsToUint(f32 Value) {
u32 Result = 0;
memcpy(&Result, &Value, 4);
return Result;
}
f32 UintBitsToFloat(u32 Value) {
f32 Result = 0;
memcpy(&Result, &Value, 4);
return Result;
}
f32 RandomFloat(u32 *Seed) {
*Seed = *Seed * 747796405u + 2891336453u;
u32 Mantissa = *Seed >> 9;
u32 Bits = FloatBitsToUint(1.0f) | Mantissa;
f32 Result = UintBitsToFloat(Bits) - 1.0;
return Result;
}
The benefits of the mantissa approach are two-fold:
It will likely be slightly faster than the division technique5
All the operations are compatible with lower-end SIMD platforms like SSE2
The mantissa trick does sacrifice one bit of precision for the sake of the other benefits — it will generate a number in the form of x / 2²³, whereas the multiplication technique will generate a number in the form of x / 2²⁴, meaning the division technique will more effectively utilize the output of the RNG.
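This granularity difference can be verified directly: every output of the division method is a multiple of 2⁻²⁴, while every output of the mantissa method is a multiple of 2⁻²³. A sketch combining both listings from above (renamed here only to keep them side by side):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static float UintBitsToFloat(uint32_t Value) {
    float Result;
    memcpy(&Result, &Value, 4);
    return Result;
}

/* Division method: returns k / 2^24 for some integer k. */
float RandomFloatDiv(uint32_t *Seed) {
    *Seed = *Seed * 747796405u + 2891336453u;
    return (float)(*Seed >> 8) / 16777216.0f;
}

/* Mantissa method: returns m / 2^23 for some integer m. */
float RandomFloatMantissa(uint32_t *Seed) {
    *Seed = *Seed * 747796405u + 2891336453u;
    uint32_t Bits = 0x3F800000u | (*Seed >> 9); /* 0x3F800000 == bits of 1.0f */
    return UintBitsToFloat(Bits) - 1.0f;
}

/* True if Value is an integer multiple of 2^-Bits. Scaling by a power of
   two is exact in floating point, so the comparison has no rounding. */
int IsMultipleOfPow2Inv(float Value, int Bits) {
    float Scaled = Value * (float)(1u << Bits);
    return Scaled == (float)(uint32_t)Scaled;
}
```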
Improving The Lower Bits
Fixing the flaws of LCG
As mentioned before, the lower-order bits tend to have poor statistical qualities. The PCG set of algorithms — a refinement of the LCG approach — fixes this by applying a carefully designed hash function to the LCG output, improving the bit diffusion and overall statistical quality.6
u32 PCGHash(u32 State) {
u32 Word = ((State >> ((State >> 28u) + 4u)) ^ State) * 277803737u;
return (Word >> 22u) ^ Word;
}
u32 PCG(u32 *Seed) {
*Seed = *Seed * 747796405u + 2891336453u;
return PCGHash(*Seed);
}
This is an exceptionally well-designed algorithm. The LCG ensures that every number in the entire 32-bit range is hit, while the hashing step makes every output almost completely uncorrelated from the last. However, for the specific needs of SIMD and GPU programming, it comes with a few practical limitations and missed opportunities:
It uses a bitshift by a non-constant, which some SIMD platforms — like SSE2 and WebAssembly — don’t support.
The core LCG algorithm is still fundamentally linear, which makes the choice of initial seed particularly important for SIMD lanes or GPU threads.
SIMD lanes and GPU threads typically have access to values in other lanes/threads. These can be used as part of the hashing step to make them even less predictable.
A lot of applications only need random floating-point numbers. This algorithm can be optimized for that specific use case.
The hashing step, while effective, can be harder to reason about and modify.
As part of the hashing step, the PCG algorithm shown above also uses an Xorshift-style hash. Xorshift is a family of random number generators that are both very fast and SIMD-compatible.7
u32 Xorshift32(u32 *Seed) {
u32 N = *Seed;
N ^= N << 13;
N ^= N >> 17;
N ^= N << 5;
*Seed = N;
return N;
}
Combining two random number generation schemes can alleviate some of the problems with either algorithm. XOR-based algorithms are very good at making each bit within the output fairly unpredictable, but they have problems when the seed is near zero (or has long streams of similar bits) and have a noticeable blocky output if they’re used for white noise. LCGs, on the other hand, output numbers in a very cyclic pattern8, but since every iteration of the algorithm does an addition step, they don’t suffer when their initial input isn’t perfectly chosen.
It’s generally preferable to use the higher bits of an LCG output since they’re less obviously cyclic than the lower bits; therefore, performing some kind of hashing step targeted toward the lower bits will improve the overall quality of the output. This is the main idea of LCG-XS.
// Linear Congruential Generator - Xorshifted
u32 LCG_XS(u32 *Seed) {
u32 Result = *Seed * 747796405u + 2891336453u;
Result ^= (Result >> 22);
*Seed = Result;
return Result;
}
This technique consistently outperforms both raw LCGs and Xorshift in testing suites like PractRand. PCG still remains the best-performing RNG across all tests, but this algorithm makes it much easier to decorrelate bitstreams with similar starting seeds (a GPU thread index, for example). This method can be combined with PCG to get even better cross-thread results.
The xorshift step scrambles the bottom bits and makes the output non-linear, avoiding the standard pitfalls of the LCG algorithm. Unlike pure Xorshift, however, it doesn’t fall apart when the initial seed isn’t chosen properly — making it well-suited for GPU workloads where the seeds are often chosen randomly per-thread.
Making A Float-Optimized Variant
An RNG with 32-bit state and 24-bit output
When generating random floating-point numbers, an RNG algorithm only has to generate 23 or 24 bits, depending on the method used to convert it to a floating-point number. In the context of an LCG, this means that the lower bits can act as a sort of hidden state that isn’t used in the output.
This is a fairly common technique in practice. Having more available bits in the seed for the algorithm, even if some of those bits are hidden, increases the overall unpredictability of the output. An LCG that outputs 32-bit numbers might store a 64-bit seed value. Since the higher bits are less predictable than the lower bits, the lower bits are simply ignored when returning the result of this function.
u32 LCG(u64 *Seed) {
u64 Result = *Seed * 6364136223846793005ull + 1442695040888963407ull;
*Seed = Result;
return (u32)(Result >> 32);
}
This simple change drastically improves the quality of the algorithm’s outputs. 64-bit multiplication, however, is not widely available on GPU hardware — and even when it is, it’s often emulated using multiple instructions. Since most GPU applications only need random floating-point numbers — which require at most 24 bits of precision — some of the lower bits in the 32-bit seed can act as a hidden state. Instead of an underlying 64-bit state with a 32-bit output, the algorithm keeps its existing 32-bit state but outputs only 24-bit numbers.
u32 LCG_XS_24(u32 *Seed) {
u32 Result = *Seed * 747796405u + 2891336453u;
u32 HashedResult = Result ^ (Result >> 14);
*Seed = HashedResult;
return HashedResult >> 8;
}
Compared to the 32-bit version, the hashing doesn’t have to be as cautious. The xorshift — in addition to slightly scattering the lower bits of the result — also modifies the 8 bits of hidden state that influence the next iteration. Unlike Xorshift32, however, the output isn’t shaped solely by the XOR hash, so it avoids the characteristically blocky output of Xorshift32.
Altogether, in a GLSL shader, it would look like this:
uint lcg_xs_24(inout uint state) {
uint result = state * 747796405u + 2891336453u;
uint hashed_result = result ^ (result >> 14);
state = hashed_result;
return hashed_result >> 8;
}
float random(inout uint state) {
uint result = lcg_xs_24(state);
const float inv_max_int = 1.0 / 16777216.0;
return float(result) * inv_max_int;
}
This algorithm can be visualized with white noise. Here is an example that uses the pixel index as a seed:
const uint idx = gl_GlobalInvocationID.x;
uint state = idx;
vec4 output_color = vec4(vec3(random(state)), 1.0);
The resulting image is less than impressive.

With only one iteration, the output is basically indistinguishable from that of a typical LCG. With two iterations, however, things start to get more interesting.
const uint idx = gl_GlobalInvocationID.x;
uint state = idx;
lcg_xs_24(state); // add one more iteration before computing the final result
vec4 output_color = vec4(vec3(random(state)), 1.0);
With two iterations, even from a poorly chosen initial seed, all threads have completely decorrelated values. This fixes one of the main problems of other RNG algorithms in practice: if the initial seed isn’t chosen properly, values across threads will look very similar. In effect, the algorithm can generate its own initial seed and recover from a bad one, which makes it very easy to work with when doing GPU programming.
What I typically do is generate the initial seed using a different set of constants (multiplier, increment, shift amount) than what is used in the random function.9
uint gen_initial_seed() {
const uint idx = gl_GlobalInvocationID.x;
uint seed = idx;
// several iterations can be run to further decorrelate threads
for (uint i = 0; i < 3; i += 1) {
seed = seed * 2654435761u + 1692572869u;
seed = seed ^ (seed >> 18);
}
return seed;
}
uint lcg_xs_24(inout uint state) {
uint result = state * 747796405u + 2891336453u;
uint hashed_result = result ^ (result >> 14);
state = hashed_result;
return hashed_result >> 8;
}
This method also pairs quite nicely with PCG-based algorithms, trading some of PCG’s statistical guarantees for short-term, per-thread divergence in the numbers that are generated.
uint pcg_hash(uint state) {
uint word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
return (word >> 22u) ^ word;
}
uint lcg_xs_24(inout uint state) {
uint result = state * 747796405u + 2891336453u;
uint xorhashed = result ^ (result >> 22);
state = xorhashed;
return pcg_hash(xorhashed) >> 8;
}
The nice part about this algorithm is its simplicity. Every part — the multiplier, increment, and shift amount — can be modified on a per-thread basis. It's simple enough to memorize and write from scratch — while still being mathematically sound for most GPU workloads.
Making a SIMD-optimized variant
A PCG-derivative optimized for SIMD
The algorithm discussed so far is already portable to most SIMD platforms. However, SIMD integer multiplication, at least on SSE2, works differently than one might expect. For example, SSE2 floating-point multiplication (_mm_mul_ps) computes four parallel results, one per lane. In contrast, SSE2 integer multiplication (_mm_mul_epu32) only performs two 32×32-bit multiplications, producing two 64-bit results. This section of the article is dedicated to creating an algorithm specifically for SIMD with statistical properties similar to those of PCG.
Since SIMD instructions perform multiple multiplications and additions in parallel, each lane can have a unique starting seed, multiplier, and increment constant — effectively running two independent LCG streams side by side.
typedef struct {
u32 Seed1;
u32 Seed2;
} random_state;
void LCG_XS_DUAL(random_state *State, u32 *Out) {
u32 Result = State->Seed1 * 747796405u + 2891336453u;
u32 Result2 = State->Seed2 * 2654435761u + 1692572869u;
State->Seed1 = (u32)Result;
State->Seed2 = (u32)Result2;
Out[0] = Result;
Out[1] = Result2;
}
There are two interesting properties of the SIMD version of this algorithm:
It’s very easy to access the results in other SIMD lanes. The results from one lane can be hashed with the results of another
The full 64-bit result of a 32×32-bit multiplication is saved to the register, meaning the top bits of the product can be used as an extra source of entropy
After playing around with PractRand, I ended up with an algorithm that passes PractRand on a full 4GB of data — meaning that across 2³² bytes of output, the test suite wasn’t able to find any statistical anomalies.
void LCG_XS_DUAL(random_state *State, u32 *Out) {
u64 Result1 = State->Seed1 * 747796405ull + 2891336453ull;
u64 Result2 = State->Seed2 * 2654435761ull + 1692572869ull;
State->Seed1 = (u32)Result1;
State->Seed2 = (u32)Result2;
Out[0] = (Result1 >> 32) ^ (Result2 >> 9);
Out[1] = (Result2 >> 32) ^ (Result1 >> 9);
}
It works by running two LCGs in parallel. Instead of directly outputting the LCG results, a hash is composed of numbers from both LCGs. It’s a PCG-like algorithm in the sense that it uses an untouched LCG (or two in this case) as the base algorithm, but it hashes the output on each iteration to make it non-linear.
In terms of complexity, the algorithm is as simple as it gets, but it yields very high quality random numbers. In addition, it’s fairly portable across SIMD platforms.
// Same function using SSE2 intrinsics
void LCG_XS_DUAL_SSE2(random_state *State, u32 *Out) {
__m128i Multiplier = _mm_set_epi32(0, (int)2654435761u, 0, (int)747796405u);
__m128i Increment = _mm_set_epi32(0, (int)1692572869u, 0, (int)2891336453u);
__m128i Seed = _mm_set_epi32(0, (int)State->Seed2, 0, (int)State->Seed1);
Seed = _mm_mul_epu32(Seed, Multiplier);
Seed = _mm_add_epi64(Seed, Increment);
// _mm_extract_epi32 would require SSE4.1; _mm_cvtsi128_si32 plus a shuffle stays within SSE2
State->Seed1 = (u32)_mm_cvtsi128_si32(Seed);
State->Seed2 = (u32)_mm_cvtsi128_si32(_mm_shuffle_epi32(Seed, _MM_SHUFFLE(0, 0, 0, 2)));
__m128i HighBits = _mm_srli_epi64(Seed, 32);
__m128i Swapped = _mm_shuffle_epi32(Seed, _MM_SHUFFLE(1, 0, 3, 2));
__m128i XOR_Hashed = _mm_xor_si128(HighBits, _mm_srli_epi64(Swapped, 9));
Out[0] = (u32)_mm_cvtsi128_si32(XOR_Hashed);
Out[1] = (u32)_mm_cvtsi128_si32(_mm_shuffle_epi32(XOR_Hashed, _MM_SHUFFLE(0, 0, 0, 2)));
}
Rather than simply making a SIMD version of an existing algorithm, this approach is designed around the characteristics of SIMD itself — using multiple multipliers, parallel LCGs, and XOR hashing to maximize entropy with minimal state. The result is a compact, fast, and surprisingly high-quality random number generator.
Conclusion
This article was an explanation of my foray into random number generation. I wanted an algorithm that balanced simplicity, speed, statistical quality, and ease of use for SIMD/GPU programming.
For GPU programming scenarios, particularly shader programming, the float-optimized LCG-XS-24 variant is plenty fast and sufficiently random. It’s easy to remember and is able to recover from poorly chosen seeds.
On CPU SIMD platforms, the dual-state variant very effectively utilizes what’s unique to SIMD to make an algorithm that’s both incredibly simple and statistically sound.
If nothing else, I hope this article showed that with a bit of understanding and a few simple building blocks, it’s possible to craft a practical RNG tailored to the constraints of a given platform.
Thanks for reading!
“LCGs with power-of-two moduli have a major drawback: The (r + 1)th most significant bit has period length at most 2−r times that of the most significant bit. The low order bits thus have rather short period lengths, and for this reason, many authors recommend avoiding these generators for simulation. However, if e [that is, if the modulus is a power of two; say m = 2ᵉ for some integer e] is very large and only the most significant bits are used (e.g., if e = 128 and the 53 most significant bits of each xn are used to construct a floating-point number between 0 and 1), this drawback becomes much less important” — Tables of Linear Congruential Generators, Pierre L’Ecuyer (1999)
The conditions required for a Linear Congruential Generator (LCG) to achieve a full period are formally described by the Hull–Dobell theorem (Hull & Dobell, 1962).
This division could be converted into a multiplication by the reciprocal of MaxInt. The compiler is likely to do this automatically with optimizations turned on.
This can be visualized here: https://float.exposed/0x3f800000
The mantissa version takes 14 cycles on Skylake, while the division technique takes 18 cycles.
Division technique:
- Godbolt link
- uops.info Code Analyzer
Mantissa trick:
- Godbolt link
- uops.info Code Analyzer
George Marsaglia, Xorshift RNGs, Journal of Statistical Software, Vol. 8, Issue 14 (2003)
This issue was famously pointed out by George Marsaglia in 1968. He showed that LCGs don’t spread their outputs evenly in high dimensions — instead, they fall onto a bunch of parallel hyperplanes.
George Marsaglia, Random Numbers Fall Mainly in the Planes (1968)
Unlike traditional LCGs, careful choice of multiplier/increment constants isn't as critical due to the hashing of the lower bits