In your code, you have:

mt[i] = mt[i + _MT64_MM] ^ (x >> 1) ^ ((x & 1ULL) ? _MT64_MATRIX_A : 0);

The ?: operator here can compile to a conditional branch, and a
mispredicted branch stalls the instruction pipeline on most modern
processors.  Instead, I tried this:

mt[i] = mt[i + _MT64_MM] ^ (x >> 1) ^ ((x & 1ULL) * _MT64_MATRIX_A);

Since you're ANDing with 1, the result will be 1 if the rightmost bit is
set and 0 otherwise.  Multiplying that by _MT64_MATRIX_A gives either
_MT64_MATRIX_A or 0, which is exactly what the conditional produces, but
without a branch instruction.  On most modern CPUs, integer
multiplication is fully pipelined and costs only a few clock cycles.

On a 32-bit Pentium machine, this sped things up by a full 30%!  On an
AMD64 machine compiling into 64-bit code, this sped things up by about 5%.

-Adam Ierymenko