RC4 Distinguishability

RC4 is a stream cipher designed by Ron Rivest (the R in RSA). Compared to a block cipher like AES, it's really simple. All it does is maintain a permutation of all 256 possible bytes that is updated for each output byte. RC4 used to be popular, but is no longer considered secure for reasons you will soon see. First I will go on detour to explain what PRG security even means.

class RC4 {
	uint8_t state[256];
	int i;
	int j;
public:
	RC4(const uint8_t key[], size_t len) {
		for(i = 0; i < 256; ++i) {
			state[i] = i;
		}

		j = 0;
		for(i = 0; i < 256; ++i) {
			j = (j + state[i] + key[i % len]) % 256;
			std::swap(state[i], state[j]);
		}
		i = j = 0;
	}
	uint8_t next() {
		i = (i + 1) % 256;
		j = (j + state[i]) % 256;
		std::swap(state[i], state[j]);
		return state[(state[i] + state[j]) % 256];
	}
};
Figure 1. RC4 in its entirety.

In order for a pseudo random generator to be considered secure, its output must be indistinguishable from a "real" random process such as flipping a fair coin. Because PRGs have finite state, some states must eventually repeat (pigeonhole principle). Once the PRG has been in the same state twice, all future output will follow a period pattern because PRGs are deterministic. For this reason, no PRG can even be completely indistinguishable from "real" randomness.

Luckily, "eventually" is a really long time for any halfway decent random number generator. For example: the Mersenne Twister has a period of 219937 - 1[1]. Even 264 output bytes is overkill for a typical application such as encrypting a TLS connection (streaming a movie from Netflix requires ~1GB of bandwidth).

PRG output cannot be and does not need to be absolutely indistinguishable from "true" randomness. What we need is indistinguishablilty from the perspective of any adversary who sees "plausible" amounts of output and has "plausible" computing power. Sadly, RC4 does not fit the bill. The reason is the first bytes of RC4 output are biased (ie: the probability distribution of possible values is no uniform). Specifically, the second byte of output is 0 with probability 1/128 instead of 1/256[2].

int main() {
	uint8_t key[6];
	int fd = open("/dev/urandom", O_RDONLY);
	int counts[256];
	memset(counts, 0, sizeof counts);

	for(int i = 0; i < 10000; ++i) {
		//generate a new key
		if(read(fd, key, sizeof key) != sizeof key) exit(1);

		//seed prg
		RC4 prg(key, sizeof key);
		prg.next();

		//update statistics for second output byte
		++counts[prg.next()];
	}
	//second output byte is biased towards 0
	for(int i = 0; i < 256; ++i) {
		printf("%f\n", counts[i]/10000.0);
	}
	return 0;
}
Figure 2. A simple simulation of a few thousand runs of RC4 is enough to demonstrate the output bias

The simulation runs RC4 with different keys and tallies up the frequency of various values of the second output byte. Here is the probability distribution derived from the simulation output:

Probabilitiy distribution for second output byte. Notice 0 is an outlier.
Figure 3. An empirically determined estimate of the distribution for the second output byte. Notice that the probability of 0 is about twice that of any other value.

It turns out more of the early output bytes are biased. One remediation is to discard the first 256 output bytes. The best remediation is to use something else (Salsa or ChaCha anyone?). The example of RC4 shows that running even simple statistical tests to evaluate PRG output is a good idea. This is what Ron Rivest did when designing the newish stream cipher Spritz[3]. Also check out Dieharder with is a series of tests for random number generators.

References