Random123-1.09: Random123: a Library of Counter-Based Random Number Generators

The Random123 library is a collection of counter-based random number generators (CBRNGs) for CPUs (C and C++) and GPUs (CUDA and OpenCL), as described in Parallel Random Numbers: As Easy as 1, 2, 3, Salmon, Moraes, Dror & Shaw, SC11, Seattle, Washington, USA, 2011, ACM . They are intended for use in statistical applications and Monte Carlo simulation and have passed all of the rigorous SmallCrush, Crush and BigCrush tests in the extensive TestU01 suite of statistical tests for random number generators. They are not suitable for use in cryptography or security even though they are constructed using principles drawn from cryptography.

CBRNGs are as fast as, or faster than conventional RNGs, much easier to parallelize, use minimal memory/cache resources, and require very little code. On modern architectures, the Random123 CBRNGs require a few cycles per byte of random data returned and return random data in convenient sizes (arrays of two or four elements, each element is an unsigned integer of 32 or 64 bits. The range of random numbers is the full representable range of the 32 or 64 bit unsigned integer) The <Random123/u01.h> header contains utility functions to convert 32- and 64-bit unsigned integers to open or closed ranges of single or double precision floating point numbers.

The Random123 library was written by John Salmon and Mark Moraes. It is available from http://deshawresearch.com/resources_random123.html. Please see the license for terms and conditions. Please send feedback, including bug reports, suggestions, patches, etc. to random123@deshawresearch.com.

Overview

Unlike conventional RNGs, counter-based RNGs are stateless functions (or function classes i.e. functors) whose arguments are a counter, and a key and returns a result of the same type as the counter.

result = CBRNGname(counter, key)

The returned result is a deterministic function of the key and counter, i.e. a unique (counter, key) tuple will always produce the same result. The result is highly sensitive to small changes in the inputs, so that the sequence of values produced by simply incrementing the counter (or key) is effectively indistinguishable from a sequence of samples of a uniformly distributed random variable.

For all the CBRNGs in the Random123 library, the result and counter are the same type, specifically an array of N words, where words have a width of W bits, encapsulated in r123arrayNxW structs, or equivalently, for C++, in the ArrayNxW typedefs in the r123 namespace. Keys are usually also arrayMxW types, but sometimes M is a different size than the counter N (e.g. Philox keys are half the number of elements as the counter, Threefry and ARS are the same number, AES uses an opaque key type rather than an array) The N random numbers returned in result.v[] are unsigned integers of width W (32 or 64), and the range of the random numbers is the full range of the unsigned integer of that width (i.e. 0 to 2^W-1)

In C++, all public names (classes, structs, typedefs, etc) are in the r123 namespace. In C, the public names (functions, enums, structs, typedefs) begin either with r123 or with one of the RNG family names, e.g., threefry, philox, ars, aesni. The RNG functions themselves have names like philox4x32. C++ class names are capitalized, e.g., Threefry4x32.

The different families of Random123 generators

Several families of CBRNGs are available in this version of the library:

Threefry is a non-cryptographic adaptation of the Threefish block cipher from the Skein Hash Function. See r123::Threefry2x32, r123::Threefry4x32, r123::Threefry2x64, r123::Threefry4x64.
Philox uses a Feistel network and integer multiplication. See r123::Philox2x32, r123::Philox4x32, r123::Philox2x64, r123::Philox4x64. The Nx64 forms are only available on hardware that supports 64-bit multiplication producing a 128-bit result.
AESNI uses the Advanced Encryption Standard (AES) New Instruction, available on certain modern x86 processors (some models of Intel Westmere and Sandy Bridge, and AMD Interlagos, as of 2011). AESNI CBRNGs can operate on four 32bit words (internally converting them to the 128bit SSE type needed by the AES-NI instructions, or on a single m128i "word", which holds the SSE type. See r123::AESNI4x32, r123::AESNI1xm128i.
ARS (Advanced Randomization System) is a non-cryptographic simplification of AESNI. See r123::ARS4x32_R, r123::ARS1xm128i_R.

Installation and Testing

The Random123 library is implemented entirely in header files. Thus, there is nothing to compile before using it and nothing to link after you have #included it in your source files. Simply direct your C or C++ compiler to find the header files in the include/ directory that was unpacked from the distribution tar file and use the Random123 header files, types and functions in your application.

In addition to the include/ files which implement the library the distribution also contains an examples/ directory. Users are STRONGLY ADVISED to compile and run the tests in examples/ before using Random123 in an application (see examples/README). Do not use the library if any tests fail. (It is not a failure for a test to report that it cannot run because of missing hardware capabilities like 64bit multiply, SSE, AES-NI or compiler capabilities)

Usage

C++ API

A typical C++ use case might look like:

#include <Random123/philox.h>

typedef r123::Philox4x32 RNG;
RNG rng;
RNG::ctr_type c={{}};
RNG::ukey_type uk={{}};
uk[0] = ???; // some user_supplied_seed
RNG::key_type k=uk;

for(...){
   c[0] = ???; // some loop-dependent application variable 
   c[1] = ???; // another loop-dependent application variable 
   RNG::ctr_type r = rng(c, k);
   // use the random values in r for some operation related to
   // this iteration on objectid
}

On each iteration,r contains an array of 4 32-bit random values that will not be repeated by any other call to rng as long as c and k are not reused.

In the example above, we use the r123::Philox4x32, but any of the other CBRNGs would serve equally well. Also note that for most CBRNGs, the ukey_type and the key_type are identical; the code could just as well ignore the ukey_type and directly construct the key_type. However, for the AESNI CBRNGs, the key_type is opaque, and must be constructed from a ukey_type, as shown.

The C API

In C, the example above could be written as:

#include <Random123/philox.h>

philox4x32_ctr_t c={{}};
philox4x32_ukey_t uk={{}};

uk.v[0] = user_supplied_seed;
philox4x32_key_t k = philox4x32keyinit(uk);

for(...){
    c.v[0] = ???; /* some loop-dependent application variable */
    c.v[1] = ???; /* another loop-dependent application variable */
    philox4x32_ctr_t r = philox4x32(c, k);
}

In C, access to the contents of the counter and key is through the fixed-size array member v.

The CUDA platform

All relevant functions in the C and C++ APIs for Random123 are declared as CUDA device functions if they are included in a CUDA kernel source file and compiled with a CUDA compiler (nvcc). They can be used exactly as described/documented for regular C or C++ programs. Note that CUDA device functions and host functions share the same namespace, so it is not currently possible to use Random123 functions in both the host portion and the device portion of the same .cu source file. To work around this, you must compile Random123-using host code in a separate .c source file from your .cu device-resident code. The Nx32 forms are faster than the Nx64 variants on current (2011) 32-bit GPU architectures.

It has been reported that Random123 uses 16 bytes of static memory per thread. This is undesirable and not intentional, but we do not have a workaround other than to suggest adjusting memory allocation accordingly.

The pi_cuda.cu and pi_cudapp.cu examples illustrate the use of CUDA.

The OpenCL platform

The functions in the Random123 C API can all be used in OpenCL kernels, just as in regular C functions. As with CUDA, the Nx32 forms are faster than the Nx64 variants on current (2011) 32-bit GPU architectures.

The pi_opencl.c and pi_opencl_kernel.ocl examples illustrate the use of OpenCL.

C++0X <random> interface

In addition to the stateless ("pure/functional") C++ API above, the Random123 package includes two C++ classes that leverage the C++0X <random> API.

r123::MicroURNG provides an adapter class that provides a more conventional interface compatible with the C++0X URNG (uniform random number generator) API; the MicroURNG adapter can be used with C++0x random number distributions and is fast/lightweight enough that a new MicroURNG can be instantiated with a unique key,counter tuple and used for each call to a distribution, there is little or no overhead to creating billions of unique MicroURNGs. This adapter retains one of the key advantages of CBRNGs -- complete application control over the RNG state.
r123::Engine provides the C++0x Random Engine API. This can also be used with any of the C++0X random distributions, but sacrifices the application control over RNG state that is a defining characteristic of CBRNGs.

The GNU Scientific Library (GSL) interface

In addition to the stateless ("pure/functional") C API above, the Random123 package includes two C adapter interfaces to the GNU Scientific Library (GSL).

The GSL_MICRORNG macro allows the application to define a GSL random number generator. It can be used with GSL random distributions but still provides the application with complete control over the RNG state (it is analogous to the MicroURNG class, in that it uses shorter periods, and is intended to be instantiated in large numbers for a few calls to the random distribution).
The GSL_CBRNG macro allows the application to create a GSL RNG with a completely conventional interface, sacrificing application control over the internal RNG state.

Generating uniformly distributed and Gaussian distributed floats and doubles

The Random123 library provides generators for uniformly distributed random integers. Often, applications want random real values or samples from other distributions. The general problem of generating samples from arbitrary distributions is beyond the scope of the Random123 library. One can, of course, use GSL or MicroURNG and the distributions in the C++11 <random> library, but a few simple cases are common enough that all that extra machinery seems like overkill. We have included code in the examples/ directory which developers may find useful.

examples/uniform.hpp - C++ functions that convert random integers to random, uniformly distributed floating point values.
examples/u01fixedpt.h - C functions that convert random integers to random, uniformly distributed, equi-spaced, i.e., fixed point, values.
examples/ua.hpp - C++11 functions that convert r123arrays of uniformly distributed integers into std::arrays of uniformly distributed floating point types. The return type is std::array because it is far easier, with template logic, to return a std::array of the correct size than an r123array of the correct size.
examples/boxmuller.hpp - C++ functions that take two uniformly distributed integers (32 or 64 bit) and return a pair of Gaussian distributed floats or doubles.

The Box-Muller method of generating Gaussian random variables is particularly well suited to Random123 because it deterministically consumes exactly two uniform randoms to generate exactly two gaussian randoms. It uses math library functions: sincos, log and sqrt which may be slow on some platforms, but which are surprisingly fast on others. Notably, on GPUs, the lack of branching in the Box-Muller method and hardware support for math functions overcomes the transcendental function overhead, making it the fastest generator of Gaussians that we are aware of.

Tests and Benchmarks

The examples/ directory, contains tests, examples and benchmarks.

Unit tests for individual components and "known-answer-tests", which should be run to ensure that these RNGs build correctly on desired platforms. These help to provide assurance that the code is being compiled correctly.
Complete, short programs estimate pi by counting the number of random points that fall inside a circle inscribed in a square, demonstrating the C, C++, AES, GSL, OpenCL, CUDA and C++0x APIs.
Header files, including uniform.hpp, ufixed01.h, ua.hpp, and boxmuller.hpp containing code that users may find useful but that are outside the scope of the Random123 library itself.
Some highly abstracted timing harnesses are provided which measure performance of a variety of generators in different programming environments.

Portability

Although we have done our best to make Random123 portable and standards conforming, it is an unfortunate fact that there is no portable code. There is only code that has been ported. We have tested the Random123 library with the following infrastructure

Linux, gcc (multiple versions from 3.4.3 through 5.2), on x86_64.
Linux, clang-2.9, 3.0, 3.1, 3.3 and 3.6 on x86_64.
Linux, clang-3.0 and 3.1 with lib++ (2012-04-19 svn checkout) on x86_64.
Linux, open64-4.2.4 on x86_64.
Linux, Intel icc and icpc 12.0.2 on x86_64.
Linux, OpenCL (NVIDIA SDK 4.0.17) on GTX480, M2090, GTX580 and GTX680 GPUs.
Linux, OpenCL (AMD APP SDK 2.4 or 2.5), on x86_64 CPUs and Radeon HD6970 GPUs.
Linux, OpenCL (Intel OpenCL 1.5), on x86_64 CPUs.
Linux, NVIDIA CUDA 4.1.15, 4.2.6, 5.5.22 and 7.5.1. (NOTE: We recommend against the use of CUDA before 4.1)
Linux, gcc-4.1.2 and 4.4.1 on x86.
Solaris, both gcc-3.4.3 and Sun C/C++ 5.8, on x86_64.
FreeBSD 8.2, gcc-4.2.1, on x86_64.
MacOS X 5.8, gcc-4.0.1, on x86.
MacOS X 5.8, llvm-2.9.1 on x86 (problems with catching C++ exceptions).
Windows 7, Microsoft Visual Studio, version 10.0, Microsoft C/C++ compiler 16.00.

Others have reported success on

MacOS, OpenCL on x86_64 CPUs
Linux, gcc-4.7.2 on Powerpc64 (BlueGene/Q)
Linux, Portland Group Compiler on Powerpc64 (BlueGene/Q)
Linux, IBM xlc on Powerpc64 (BlueGene/Q)

Warnings

With some compilation options, the CUDA nvcc compiler warns about unreachable code in array.h. The compiler doesn't recognize that the code that is unreachable for some values of some macro parameters, is actually reachable for other values of the parameters. It is possible to disable that particular warning for a specific compilation unit by adding -Wcudafe --diag_suppress=111 to the compilation command line.

Contributors

We welcome feedback to random123@deshawresearch.com about ports to other environments.

We are grateful for contributions from the following users:

Geoffrey Irving and Gabriel Rockefeller - BlueGene/Q and powerpc ports
Yan Zhou - MacOS and clang ports
David Lawrie - allowing 64-bit philox to compile for both host and device with CUDA
Bogdan Opanchuk - pointing out the inconsistent rotation constants in the implementation of threefry2xW in version 1.07 and earlier.