In your code, you have:
mt[i] = mt[i + _MT64_MM] ^ (x >> 1) ^ ((x & 1ULL) ? _MT64_MATRIX_A : 0);
This can emit a conditional branch because of the ?: operator. The low
bit of x is effectively random, so that branch is hard to predict, and
each misprediction stalls the pipeline on most modern processors.
Instead, I tried this:
mt[i] = mt[i + _MT64_MM] ^ (x >> 1) ^ ((x & 1ULL) * _MT64_MATRIX_A);
Since you're ANDing with 1, the result is 1 if the lowest bit is set
and 0 otherwise. Multiplying by that value then gives the same result as
the conditional, but without a branch instruction. On most modern CPUs,
integer multiplication is pipelined and takes only a few clock cycles,
which is much cheaper than a mispredicted branch.
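As a minimal sketch of why the two expressions are interchangeable, the
selection step can be pulled out into small helpers and compared
directly. The helper names here are hypothetical, and the MATRIX_A value
is the twist constant from the reference MT19937-64 code, which is
assumed (not confirmed) to match _MT64_MATRIX_A. The third helper is a
negate-and-mask variant that is not in the code above; it is a common
alternative that avoids both the branch and the multiply.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: twist constant from the reference MT19937-64 code,
   standing in for _MT64_MATRIX_A. */
#define MATRIX_A 0xB5026F5AA96619E9ULL

/* Original form: picks MATRIX_A or 0 with the conditional operator. */
static uint64_t mag_conditional(uint64_t x) {
    return (x & 1ULL) ? MATRIX_A : 0ULL;
}

/* Multiply form: (x & 1) is 0 or 1, so the product is 0 or MATRIX_A. */
static uint64_t mag_multiply(uint64_t x) {
    return (x & 1ULL) * MATRIX_A;
}

/* Mask form (hypothetical alternative): 0 - 1 wraps to all ones and
   0 - 0 stays zero, so the AND keeps either every bit or none. */
static uint64_t mag_mask(uint64_t x) {
    return (0ULL - (x & 1ULL)) & MATRIX_A;
}

/* Checks that all three forms agree for both values of the low bit. */
static void check_forms_agree(uint64_t x) {
    uint64_t even = x & ~1ULL; /* low bit cleared */
    uint64_t odd  = x | 1ULL;  /* low bit set */
    assert(mag_multiply(even) == mag_conditional(even));
    assert(mag_multiply(odd)  == mag_conditional(odd));
    assert(mag_mask(even)     == mag_conditional(even));
    assert(mag_mask(odd)      == mag_conditional(odd));
}
```

Calling check_forms_agree() with any 64-bit value exercises both cases.
Whether the conditional form actually compiles to a branch depends on
the compiler and flags, so it is worth checking the generated assembly
and timing each variant on the target machine before picking one.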
On a 32-bit Pentium machine, this sped things up by a full 30%! On an
AMD64 machine compiling into 64-bit code, this sped things up by about 5%.
-Adam Ierymenko