Flash said:
Since the random distribution was not specified you don't need to do two
passes. This might mean there is no possibility of returning lines
towards the end of the file, or if the file is only two lines a high
probability of returning the last line, but it is still random.
Pick a probability for returning the current line, and keep going until
you either hit EOF or you at random decide to return the line.
Read Richard Heathfield's initial post on this thread
again; you'll find that his algorithm (1) requires no advance
knowledge of the line count, (2) makes just one pass over the
input, and (3) has equal probability of selecting any particular
line. It will, however, make one complete pass over all the
input even if it eventually selects the very first line read;
there is no "shortcut."
Point (3) may be a little difficult to grasp at first, so
let me offer a couple simple examples and then a formal proof.
What is the probability that the very last line is chosen? If
there are N lines altogether, the last line is chosen with
probability 1/N, as desired. How about the first line? It goes
into the "saved" buffer as soon as it's read (with probability
1/1), and then it survives there with probability
(1-1/2)*(1-1/3)*(1-1/4)*...*(1-1/N)
= (1/2)*(2/3)*(3/4)*...*((N-1)/N)
= 1/N
Now for the formal proof, by induction. If the file has
only one line, that line is selected with probability 1/1.
That's the "ground" step. Now suppose that the algorithm
works for N-line files, that is, that it chooses one of the
N lines with probability 1/N; we wish to show that it works
for files of N+1 lines. Imagine the situation just before
the final line is read: the first N lines have been processed
and one of them resides in the "saved" buffer. The final step
chooses the last line with probability 1/(N+1) -- which is as
it should be -- or chooses the already-saved line with probability
N/(N+1). Since the first N lines were processed by the N-line
algorithm (known to be correct), any particular one of those first
N lines is "saved" with probability 1/N, hence the probability that
that line is chosen from among N+1 is (1/N)*(N/(N+1)) = 1/(N+1).
The algorithm makes an unbiased "choice" from a 1-line file,
and if it works for N lines it also works for N+1. Therefore,
it works for all positive N. (To the limits of the random number
generator's precision, of course -- but that's an altogether
deeper topic.)