File handling question

Anunay · May 17, 2006

Hi all,

Suppose a text file is given and we have to write a program that will
return a random line from the file. This can be easily done. But what
if the text file is too big and can't fit into the main memory
completely? In this case, how will we modify our code?

Also, if we are given a stream, instead of a file, then what changes
are required?

Thanks,
Anunay

Richard Heathfield · May 17, 2006

Anunay said:

Hi all,

Suppose a text file is given and we have to write a program that will
return a random line from the file. This can be easily done. But what
if the text file is too big and can't fit into the main memory
completely? In this case, how will we modify our code?

Not at all, if you write it properly first time. You just need storage for
two lines: the currently "saved" line, and the most recently read line.
When you read line N into memory, copy it into the "saved" buffer with
probability 1/N. At the end of this process, the "saved" buffer will
contain a randomly selected line.

Also, if we are given a stream, instead of a file, then what changes
are required?

None at all, if the way you wrote the first program was to take the input
from stdin if no command line argument was forthcoming.

Friedrich Dominicus · May 17, 2006

Anunay said:
Hi all,

Suppose a text file is given and we have to write a program that will
return a random line from the file. This can be easily done. But what
if the text file is too big and can't fit into the main memory
completely?

What it the problem? You read the file line by line till you are at
your random number.

In this case, how will we modify our code?

I do not have to change a line of code.

Also, if we are given a stream, instead of a file, then what changes
are required?

What do you mean with this?

Friedrich

Ico · May 17, 2006

Friedrich Dominicus said:
What it the problem? You read the file line by line till you are at
your random number.

Requiring 2 passes of file reading: one to count the number of lines,
one to choos a random line. This can be used when reading a file
(although not optimal, ofcourse), but is not possible with streams.

What do you mean with this?

I assume the OP means reading from standard input, instead of a file.

Robert Latest · May 17, 2006

On Wed, 17 May 2006 07:07:01 +0000,
in Msg. <[email protected]>

[...]

homework assignment DONE.

robert

Anunay · May 17, 2006

Richard said:
Not at all, if you write it properly first time. You just need storage for
two lines: the currently "saved" line, and the most recently read line.
When you read line N into memory, copy it into the "saved" buffer with
probability 1/N. At the end of this process, the "saved" buffer will
contain a randomly selected line.

Hi Richard,

Can you please elaborate a little on this? I am confused what will
happen at the fopen() function call. As the file is too big to be fully
fit into main memory, what will fopen() return?
If fopen() returns SUCCESS, then where is that portion of file gets
stored which could not get loaded into main memory?

Thanks.

Richard Heathfield · May 17, 2006

Robert Latest said:

On Wed, 17 May 2006 07:07:01 +0000,
in Msg. <[email protected]>

[...]

homework assignment DONE.

No, I just did the thinking. He still gets to do the coding. If he can't,
and it looks very much as if this is the case, then presumably he still
won't get the marks.

Richard Heathfield · May 17, 2006

Anunay said:

Hi Richard,

Can you please elaborate a little on this?

I am very reluctant to do that.

I am confused what will happen at the fopen() function call.

Either the file will open, or fopen will return NULL.

As the file is too big to be fully fit into main memory, what
will fopen() return?

You seem to be confused about the purpose of fopen. It does not read a file
into main memory.

If fopen() returns SUCCESS, then where is that portion of file gets
stored which could not get loaded into main memory?

fopen() never returns SUCCESS. It returns either NULL or a pointer to an
in-memory semi-opaque data structure describing an open file.

CBFalconer · May 17, 2006

Richard said:
Anunay said:

Not at all, if you write it properly first time. You just need
storage for two lines: the currently "saved" line, and the most
recently read line. When you read line N into memory, copy it
into the "saved" buffer with probability 1/N. At the end of this
process, the "saved" buffer will contain a randomly selected line.

None at all, if the way you wrote the first program was to take the input
from stdin if no command line argument was forthcoming.

Using Richards clever algorithm and my ggets routine you can
probably reduce it to a few lines and a suitable function to decide
the "probability 1/N" result.

char *saved;
char *buffer;
size_t N, savedlinenum;

N = 0; saved = NULL; savedlinenum = 0;
while (0 == ggets(&buffer)) {
N++;
if (probfunction(N)) {
free(saved); saved = buffer;
savedlinenum = N;
}
else free(buffer);
}
if (saved) puts(saved);

For probfunction read the Cfaq. You can find ggets at:

<http://cbfalconer.home.att.net/download/ggets.zip>
--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>

Anunay · May 17, 2006

Richard said:
You seem to be confused about the purpose of fopen. It does not read a file
into main memory.

Okay, that was a bad question. I meant to ask that where will the
remaining portion of file be stored as it could not get loaded in one
go?

Richard Heathfield · May 17, 2006

Anunay said:

Okay, that was a bad question. I meant to ask that where will the
remaining portion of file be stored as it could not get loaded in one
go?

On the disk, where it is already. That's what disks are for.

You just load one line of the file at a time. Having read it, you either
keep it or discard it, as I explained earlier, and then you read the next
one. And so on.

CBFalconer · May 17, 2006

Richard said:
Anunay said:

On the disk, where it is already. That's what disks are for.

You just load one line of the file at a time. Having read it,
you either keep it or discard it, as I explained earlier, and
then you read the next one. And so on.

I think it is time to refer Anunay to K&R II and to his elementary
computers textbook.

Analog for Anunay: When you open a book, do you immediately
transfer its content to your brain? Or do you start at page 1 and
read a sentence? Later do you read another sentence?

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>

Flash Gordon · May 17, 2006

Ico said:
Requiring 2 passes of file reading: one to count the number of lines,
one to choos a random line. This can be used when reading a file
(although not optimal, ofcourse), but is not possible with streams.

Since the random distribution was not specified you don't need to do two
passes. This might mean there is no possibility of returning lines
towards the end of the file, or if the file is only two lines a high
probability of returning the last line, but it is still random.

I assume the OP means reading from standard input, instead of a file.

Pick a probability for returning the current line, and keep going until
you either hit EOF or you at random decide to return the line.
--
Flash Gordon, living in interesting times.
Web site - http://home.flash-gordon.me.uk/
comp.lang.c posting guidelines and intro:
http://clc-wiki.net/wiki/Intro_to_clc

Inviato da X-Privat.Org - Registrazione gratuita http://www.x-privat.org/join.php

Eric Sosman · May 17, 2006

Flash said:
Since the random distribution was not specified you don't need to do two
passes. This might mean there is no possibility of returning lines
towards the end of the file, or if the file is only two lines a high
probability of returning the last line, but it is still random.

Pick a probability for returning the current line, and keep going until
you either hit EOF or you at random decide to return the line.

Read Richard Heathfield's initial post on this thread
again; you'll find that his algorithm (1) requires no advance
knowledge of the line count, (2) makes just one pass over the
input, and (3) has equal probability of selecting any particular
line. It will, however, make one complete pass over all the
input even if it eventually selects the very first line read;
there is no "shortcut."

Point (3) may be a little difficult to grasp at first, so
let me offer a couple simple examples and then a formal proof.
What is the probability that the very last line is chosen? If
there are N lines altogether, the last line is chosen with
probability 1/N, as desired. How about the first line? It goes
into the "saved" buffer as soon as it's read (with probability
1/1), and then it survives there with probability

(1-1/2)*(1-1/3)*(1-1/4)*...*(1-1/N)
= (1/2)*(2/3)*(3/4)*...*((N-1)/N)
= 1/N

Now for the formal proof, by induction. If the file has
only one line, that line is selected with probability 1/1.
That's the "ground" step. Now suppose that the algorithm
works for N-line files, that is, that it chooses one of the
N lines with probability 1/N; we wish to show that it works
for files of N+1 lines. Imagine the situation just before
the final line is read: the first N lines have been processed
and one of them resides in the "saved" buffer. The final step
chooses the last line with probability 1/(N+1) -- which is as
it should be -- or chooses the already-saved line with probability
N/(N+1). Since the first N lines were processed by the N-line
algorithm (known to be correct), any particular one of those first
N lines is "saved" with probability 1/N, hence the probability that
that line is chosen from among N+1 is (1/N)*(N/(N+1)) = 1/(N+1).

The algorithm makes an unbiased "choice" from a 1-line file,
and if it works for N lines it also works for N+1. Therefore,
it works for all positive N. (To the limits of the random number
generator's precision, of course -- but that's an altogether
deeper topic.)

Jordan Abel · May 17, 2006

Hi all,

Suppose a text file is given and we have to write a program that will
return a random line from the file. This can be easily done. But what
if the text file is too big and can't fit into the main memory
completely? In this case, how will we modify our code?

Also, if we are given a stream, instead of a file, then what changes
are required?

Thanks,
Anunay

http://minnie.tuhs.org/UnixTree/V7/usr/src/games/fortune.c.html

Note that, of course, '(1+(double)RAND_MAX)' needs to be used instead of
'32768.' in that code for modern systems, and it could stand to be
otherwise updated. But the idea is there.

Bill Pursell · May 17, 2006

Anunay said:
Okay, that was a bad question. I meant to ask that where will the
remaining portion of file be stored as it could not get loaded in one
go?

As Richard pointed out, you are confused about the purpose of fopen().
The problem you are describing might be exemplified in the
following poorly designed program:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

const char *path = "/tmp/some_file";

int
main(void)
{
FILE *ifp;
struct stat stat_buf;
unsigned char *data_buf;

/* fopen() doesn't care how big the file is. */
if ( (ifp = fopen(path, "r")) == NULL)
perror(path), exit(EXIT_FAILURE);

/* stat() will tell you how big the file is. */
if (stat(path, &stat_buf) == -1)
perror(NULL), exit(EXIT_FAILURE);

/*
* If you want to read the file all at once,
* you'll need to allocate space. This will
* fail if the file is too big.
*/
if ( (data_buf = malloc(stat_buf.st_size)) == NULL)
fprintf(stderr, "No memory\n"), exit(EXIT_FAILURE);

/* Process the file here. (look up fread()) */

return EXIT_SUCCESS;
}

The reason this is poorly designed is precisely because
of the problem you are trying to describe. It is fairly stupid
to read the entire file in at once. Instead, you should only
read in chunks and process them as you go.

Robert Latest · May 17, 2006

On Wed, 17 May 2006 07:08:21 -0400,

in Msg. said:
Analog for Anunay: When you open a book

He doesn't. Or at least not C books.

robert

Joe Smith · May 18, 2006

"Eric Sosman">
[snipped Flash Gordon for thematic considerations]
[Heathfield:]
Not at all, if you write it properly first time. You just need storage for
two lines: the currently "saved" line, and the most recently read line.
When you read line N into memory, copy it into the "saved" buffer with
probability 1/N. At the end of this process, the "saved" buffer will
contain a randomly selected line.

Read Richard Heathfield's initial post on this thread
again; you'll find that his algorithm (1) requires no advance
knowledge of the line count, (2) makes just one pass over the
input, and (3) has equal probability of selecting any particular
line. It will, however, make one complete pass over all the
input even if it eventually selects the very first line read;
there is no "shortcut."

Point (3) may be a little difficult to grasp at first, so
let me offer a couple simple examples and then a formal proof.
What is the probability that the very last line is chosen? If
there are N lines altogether, the last line is chosen with
probability 1/N, as desired. How about the first line? It goes
into the "saved" buffer as soon as it's read (with probability
1/1), and then it survives there with probability

(1-1/2)*(1-1/3)*(1-1/4)*...*(1-1/N)
= (1/2)*(2/3)*(3/4)*...*((N-1)/N)
= 1/N

Now for the formal proof, by induction. If the file has
only one line, that line is selected with probability 1/1.
That's the "ground" step. Now suppose that the algorithm
works for N-line files, that is, that it chooses one of the
N lines with probability 1/N; we wish to show that it works
for files of N+1 lines. Imagine the situation just before
the final line is read: the first N lines have been processed
and one of them resides in the "saved" buffer. The final step
chooses the last line with probability 1/(N+1) -- which is as
it should be -- or chooses the already-saved line with probability
N/(N+1). Since the first N lines were processed by the N-line
algorithm (known to be correct), any particular one of those first
N lines is "saved" with probability 1/N, hence the probability that
that line is chosen from among N+1 is (1/N)*(N/(N+1)) = 1/(N+1).

The algorithm makes an unbiased "choice" from a 1-line file,
and if it works for N lines it also works for N+1. Therefore,
it works for all positive N. (To the limits of the random number
generator's precision, of course -- but that's an altogether
deeper topic.)

This proof is at the point where the other people in the room either shake
or nod their head and comment. First, I find it notable that there exist
proofs in theoretical comp sci and that this continues to be fertile ground
for unfolding scientific truth.

I find no error in the induction. One of the reasons that I find no error
could be that I've all but forgotten conditional probability. If that is
sound, the telescoping looks correct. I've taken the liberty of sending
this to a professor with whom I'm acquainted who authored a book in
statistical paradoxes to ask for comment. There will always be a
counterexample somewhere; the question is whether it's relevant and how far
afield. If relevant, one would hand Mr. Sosman the chalk back to elaborate.

Mr. Heathfield's algorithm would make the act of observation. Whenever that
happens, the Law of Unintended Consequences gets a phone call. joe

Richard Heathfield · May 18, 2006

Joe Smith said:

Mr. Heathfield's algorithm [...]

Um, I think it's in Knuth somewhere (although I had a look just now, and
couldn't find it in TAOCP2.)

Joe Smith · May 18, 2006

Richard Heathfield said:
Joe Smith said:

Mr. Heathfield's algorithm [...]

Click to expand...

Um, I think it's in Knuth somewhere (although I had a look just now, and
couldn't find it in TAOCP2.)

I wish I had that handy, as I recall that he offers someting like $2.56 for
each erratum. I believe that relatively close to the beginning, Knuth
claims that Leonardo de Pisa gives the following numbers: 0, 1, 1, 2, 3, 5,
8, 13 and so on. In point of fact, Leonardo in _Libber Abbaci_ gives 0, 1,
2, 3, 5...., omitting one of the one's. I wouldn't be surprised if
Professor Knuth already heard of this, but if he hasn't and I read the
passage correctly, then .... joe

How can I view / open / render / display a pdf file with c code?	0	Sep 23, 2023
Error Handling	27	May 28, 2010
Noob File IO question	8	Apr 4, 2007
Errors When Pulling Information from CSV File to Python	0	Dec 10, 2020
Exception Handling	33	Mar 11, 2012
How to sort a CSV file with merge sort JAVA	7	May 6, 2021
Question about my projects	3	Jul 23, 2021
Handling different implementations of the same algorithm	1	Aug 10, 2011

File handling question

Anunay

Richard Heathfield

Friedrich Dominicus

Ico

Robert Latest

Anunay

Richard Heathfield

Richard Heathfield

CBFalconer

Anunay

Richard Heathfield

CBFalconer

Flash Gordon

Eric Sosman

Jordan Abel

Bill Pursell

Robert Latest

Joe Smith

Richard Heathfield

Joe Smith

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads