A straightforward implementation of mt19937ar

In the standard code mt19937ar.c, genrand_int32 regenerates all 624 words of the state array at once, on every 624th call (624 is the size of the state array). I believed that this was faster than a straightforward implementation (i.e. one that generates one word per call) on most platforms.
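To make the distinction concrete, here is a minimal sketch of what "one word per call" can look like. It is not Eric's code and not the standard code: the names sketch_seed, sketch_next, mt_state and mt_pos are hypothetical, chosen only for this illustration. It uses the usual MT19937 constants and recurrence, and updates exactly one state word per call instead of refreshing the whole array every 624th call.

```c
#include <stdint.h>

/* Standard MT19937 parameters; the MT_ prefix is only for this sketch. */
#define MT_N 624
#define MT_M 397
#define MT_MATRIX_A   0x9908b0dfu  /* constant vector a */
#define MT_UPPER_MASK 0x80000000u  /* most significant bit */
#define MT_LOWER_MASK 0x7fffffffu  /* least significant 31 bits */

static uint32_t mt_state[MT_N];    /* the state array */
static int mt_pos = 0;             /* index of the next word to update */

/* Fill the state array from a 32-bit seed (same recurrence as init_genrand). */
void sketch_seed(uint32_t s)
{
    mt_state[0] = s;
    for (int j = 1; j < MT_N; j++) {
        uint32_t prev = mt_state[j - 1];
        mt_state[j] = 1812433253u * (prev ^ (prev >> 30)) + (uint32_t)j;
    }
    mt_pos = 0;
}

/* Update a single state word, then temper and return it. */
uint32_t sketch_next(void)
{
    int i  = mt_pos;
    int i1 = (i + 1 == MT_N) ? 0 : i + 1;                      /* (i + 1) mod N */
    int im = (i + MT_M >= MT_N) ? i + MT_M - MT_N : i + MT_M;  /* (i + M) mod N */

    /* Combine the top bit of word i with the low 31 bits of word i+1. */
    uint32_t y = (mt_state[i] & MT_UPPER_MASK) | (mt_state[i1] & MT_LOWER_MASK);
    uint32_t x = mt_state[im] ^ (y >> 1) ^ ((y & 1u) ? MT_MATRIX_A : 0u);
    mt_state[i] = x;
    mt_pos = i1;

    /* Tempering */
    x ^= x >> 11;
    x ^= (x << 7) & 0x9d2c5680u;
    x ^= (x << 15) & 0xefc60000u;
    x ^= x >> 18;
    return x;
}
```

Because each call reads the not-yet-updated neighbour word and the word M positions ahead (already updated once the index wraps), the output sequence should match the batch version. For example, after sketch_seed(5489), the default seed of mt19937ar.c, the first call to sketch_next() returns 3499211612.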

However, Eric Landry informed me that this is not always the case. He showed that a straightforward implementation is 9% faster than the standard one on an API CS20D (dual Alpha, 833 MHz) running NetBSD 1.6.1, compiled with gcc 2.95.3 at optimization level -O2. This implementation makes fewer memory accesses by avoiding unnecessary reloads.

Here is Eric's code, named mt19937ar-nrl.c (nrl for "no reload"). It adopts several optimizations, such as the clever technique "^(-(*p1 & 1) & MATRIX_A);". In my own experiments using cygwin + gcc, however, the code is 8% slower than the standard code.
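The masking technique quoted above can be looked at in isolation. The sketch below is only an illustration with hypothetical function names, not an excerpt from mt19937ar-nrl.c. It compares three ways of selecting either 0 or MATRIX_A from the lowest bit of a word: a conditional, a two-entry table lookup in the style of the mag01[] array in the standard code, and the branchless form, where negating the unsigned value 0 or 1 yields either all zero bits or all one bits.

```c
#include <stdint.h>

#define MATRIX_A 0x9908b0dfu

/* Conditional form: may compile to a branch, depending on the compiler. */
uint32_t mask_conditional(uint32_t y)
{
    return (y & 1u) ? MATRIX_A : 0u;
}

/* Table form: costs one extra memory load per call. */
uint32_t mask_table(uint32_t y)
{
    static const uint32_t mag01[2] = { 0x0u, MATRIX_A };
    return mag01[y & 1u];
}

/* Branchless form: 0 - 0 is 0x00000000 and 0 - 1 wraps to 0xffffffff
 * in unsigned arithmetic, so the AND keeps either nothing or MATRIX_A. */
uint32_t mask_branchless(uint32_t y)
{
    return ((uint32_t)0 - (y & 1u)) & MATRIX_A;
}
```

All three return the same value for every input. Which one is fastest depends on the compiler and the processor, which fits the differing measurements reported above.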