File handling question

Discussion in 'C Programming' started by Anunay, May 17, 2006.

  1. Anunay

    Anunay Guest

    Hi all,

    Suppose a text file is given and we have to write a program that will
    return a random line from the file. This can be easily done. But what
    if the text file is too big and can't fit into the main memory
    completely? In this case, how will we modify our code?

    Also, if we are given a stream, instead of a file, then what changes
    are required?

    Thanks,
    Anunay
    Anunay, May 17, 2006
    #1
    1. Advertising

  2. Anunay said:

    > Hi all,
    >
    > Suppose a text file is given and we have to write a program that will
    > return a random line from the file. This can be easily done. But what
    > if the text file is too big and can't fit into the main memory
    > completely? In this case, how will we modify our code?


    Not at all, if you write it properly first time. You just need storage for
    two lines: the currently "saved" line, and the most recently read line.
    When you read line N into memory, copy it into the "saved" buffer with
    probability 1/N. At the end of this process, the "saved" buffer will
    contain a randomly selected line.

    > Also, if we are given a stream, instead of a file, then what changes
    > are required?


    None at all, if the way you wrote the first program was to take the input
    from stdin if no command line argument was forthcoming.

    --
    Richard Heathfield
    "Usenet is a strange place" - dmr 29/7/1999
    http://www.cpax.org.uk
    email: rjh at above domain (but drop the www, obviously)
    Richard Heathfield, May 17, 2006
    #2
    1. Advertising

  3. "Anunay" <> writes:

    > Hi all,
    >
    > Suppose a text file is given and we have to write a program that will
    > return a random line from the file. This can be easily done. But what
    > if the text file is too big and can't fit into the main memory
    > completely?

    What it the problem? You read the file line by line till you are at
    your random number.

    >In this case, how will we modify our code?

    I do not have to change a line of code.
    >
    > Also, if we are given a stream, instead of a file, then what changes
    > are required?

    What do you mean with this?

    Friedrich


    --
    Please remove just-for-news- to reply via e-mail.
    Friedrich Dominicus, May 17, 2006
    #3
  4. Anunay

    Ico Guest

    Friedrich Dominicus <> wrote:
    > "Anunay" <> writes:
    >
    >> Hi all,
    >>
    >> Suppose a text file is given and we have to write a program that will
    >> return a random line from the file. This can be easily done. But what
    >> if the text file is too big and can't fit into the main memory
    >> completely?

    > What it the problem? You read the file line by line till you are at
    > your random number.


    Requiring 2 passes of file reading: one to count the number of lines,
    one to choos a random line. This can be used when reading a file
    (although not optimal, ofcourse), but is not possible with streams.

    >> Also, if we are given a stream, instead of a file, then what changes
    >> are required?


    > What do you mean with this?


    I assume the OP means reading from standard input, instead of a file.

    --
    :wq
    ^X^Cy^K^X^C^C^C^C
    Ico, May 17, 2006
    #4
  5. On Wed, 17 May 2006 07:07:01 +0000,
    Richard Heathfield <> wrote
    in Msg. <>

    [...]

    homework assignment DONE.

    robert
    Robert Latest, May 17, 2006
    #5
  6. Anunay

    Anunay Guest

    Richard Heathfield wrote:
    > Not at all, if you write it properly first time. You just need storage for
    > two lines: the currently "saved" line, and the most recently read line.
    > When you read line N into memory, copy it into the "saved" buffer with
    > probability 1/N. At the end of this process, the "saved" buffer will
    > contain a randomly selected line.


    Hi Richard,

    Can you please elaborate a little on this? I am confused what will
    happen at the fopen() function call. As the file is too big to be fully
    fit into main memory, what will fopen() return?
    If fopen() returns SUCCESS, then where is that portion of file gets
    stored which could not get loaded into main memory?

    Thanks.
    Anunay, May 17, 2006
    #6
  7. Robert Latest said:

    > On Wed, 17 May 2006 07:07:01 +0000,
    > Richard Heathfield <> wrote
    > in Msg. <>
    >
    > [...]
    >
    > homework assignment DONE.


    No, I just did the thinking. He still gets to do the coding. If he can't,
    and it looks very much as if this is the case, then presumably he still
    won't get the marks.

    --
    Richard Heathfield
    "Usenet is a strange place" - dmr 29/7/1999
    http://www.cpax.org.uk
    email: rjh at above domain (but drop the www, obviously)
    Richard Heathfield, May 17, 2006
    #7
  8. Anunay said:

    >
    > Richard Heathfield wrote:
    >> Not at all, if you write it properly first time. You just need storage
    >> for two lines: the currently "saved" line, and the most recently read
    >> line. When you read line N into memory, copy it into the "saved" buffer
    >> with probability 1/N. At the end of this process, the "saved" buffer will
    >> contain a randomly selected line.

    >
    > Hi Richard,
    >
    > Can you please elaborate a little on this?


    I am very reluctant to do that.

    > I am confused what will happen at the fopen() function call.


    Either the file will open, or fopen will return NULL.

    > As the file is too big to be fully fit into main memory, what
    > will fopen() return?


    You seem to be confused about the purpose of fopen. It does not read a file
    into main memory.

    > If fopen() returns SUCCESS, then where is that portion of file gets
    > stored which could not get loaded into main memory?


    fopen() never returns SUCCESS. It returns either NULL or a pointer to an
    in-memory semi-opaque data structure describing an open file.

    --
    Richard Heathfield
    "Usenet is a strange place" - dmr 29/7/1999
    http://www.cpax.org.uk
    email: rjh at above domain (but drop the www, obviously)
    Richard Heathfield, May 17, 2006
    #8
  9. Anunay

    CBFalconer Guest

    Richard Heathfield wrote:
    > Anunay said:
    >>
    >> Suppose a text file is given and we have to write a program that
    >> will return a random line from the file. This can be easily done.
    >> But what if the text file is too big and can't fit into the main
    >> memory completely? In this case, how will we modify our code?

    >
    > Not at all, if you write it properly first time. You just need
    > storage for two lines: the currently "saved" line, and the most
    > recently read line. When you read line N into memory, copy it
    > into the "saved" buffer with probability 1/N. At the end of this
    > process, the "saved" buffer will contain a randomly selected line.
    >
    >> Also, if we are given a stream, instead of a file, then what
    >> changes are required?

    >
    > None at all, if the way you wrote the first program was to take the input
    > from stdin if no command line argument was forthcoming.


    Using Richards clever algorithm and my ggets routine you can
    probably reduce it to a few lines and a suitable function to decide
    the "probability 1/N" result.

    char *saved;
    char *buffer;
    size_t N, savedlinenum;

    N = 0; saved = NULL; savedlinenum = 0;
    while (0 == ggets(&buffer)) {
    N++;
    if (probfunction(N)) {
    free(saved); saved = buffer;
    savedlinenum = N;
    }
    else free(buffer);
    }
    if (saved) puts(saved);

    For probfunction read the Cfaq. You can find ggets at:

    <http://cbfalconer.home.att.net/download/ggets.zip>
    --
    "If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers." - Keith Thompson
    More details at: <http://cfaj.freeshell.org/google/>
    Also see <http://www.safalra.com/special/googlegroupsreply/>
    CBFalconer, May 17, 2006
    #9
  10. Anunay

    Anunay Guest

    Richard Heathfield wrote:
    > > As the file is too big to be fully fit into main memory, what
    > > will fopen() return?

    >
    > You seem to be confused about the purpose of fopen. It does not read a file
    > into main memory.


    Okay, that was a bad question. I meant to ask that where will the
    remaining portion of file be stored as it could not get loaded in one
    go?
    Anunay, May 17, 2006
    #10
  11. Anunay said:

    >
    > Richard Heathfield wrote:
    >> > As the file is too big to be fully fit into main memory, what
    >> > will fopen() return?

    >>
    >> You seem to be confused about the purpose of fopen. It does not read a
    >> file into main memory.

    >
    > Okay, that was a bad question. I meant to ask that where will the
    > remaining portion of file be stored as it could not get loaded in one
    > go?


    On the disk, where it is already. That's what disks are for.

    You just load one line of the file at a time. Having read it, you either
    keep it or discard it, as I explained earlier, and then you read the next
    one. And so on.

    --
    Richard Heathfield
    "Usenet is a strange place" - dmr 29/7/1999
    http://www.cpax.org.uk
    email: rjh at above domain (but drop the www, obviously)
    Richard Heathfield, May 17, 2006
    #11
  12. Anunay

    CBFalconer Guest

    Richard Heathfield wrote:
    > Anunay said:
    >> Richard Heathfield wrote:
    >>
    >>>> As the file is too big to be fully fit into main memory, what
    >>>> will fopen() return?
    >>>
    >>> You seem to be confused about the purpose of fopen. It does not
    >>> read a file into main memory.

    >>
    >> Okay, that was a bad question. I meant to ask that where will
    >> the remaining portion of file be stored as it could not get
    >> loaded in one go?

    >
    > On the disk, where it is already. That's what disks are for.
    >
    > You just load one line of the file at a time. Having read it,
    > you either keep it or discard it, as I explained earlier, and
    > then you read the next one. And so on.


    I think it is time to refer Anunay to K&R II and to his elementary
    computers textbook.

    Analog for Anunay: When you open a book, do you immediately
    transfer its content to your brain? Or do you start at page 1 and
    read a sentence? Later do you read another sentence?

    --
    "If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers." - Keith Thompson
    More details at: <http://cfaj.freeshell.org/google/>
    Also see <http://www.safalra.com/special/googlegroupsreply/>
    CBFalconer, May 17, 2006
    #12
  13. Anunay

    Flash Gordon Guest

    Ico wrote:
    > Friedrich Dominicus <> wrote:
    >> "Anunay" <> writes:
    >>
    >>> Hi all,
    >>>
    >>> Suppose a text file is given and we have to write a program that will
    >>> return a random line from the file. This can be easily done. But what
    >>> if the text file is too big and can't fit into the main memory
    >>> completely?

    >> What it the problem? You read the file line by line till you are at
    >> your random number.

    >
    > Requiring 2 passes of file reading: one to count the number of lines,
    > one to choos a random line. This can be used when reading a file
    > (although not optimal, ofcourse), but is not possible with streams.


    Since the random distribution was not specified you don't need to do two
    passes. This might mean there is no possibility of returning lines
    towards the end of the file, or if the file is only two lines a high
    probability of returning the last line, but it is still random.

    >>> Also, if we are given a stream, instead of a file, then what changes
    >>> are required?

    >
    >> What do you mean with this?

    >
    > I assume the OP means reading from standard input, instead of a file.


    Pick a probability for returning the current line, and keep going until
    you either hit EOF or you at random decide to return the line.
    --
    Flash Gordon, living in interesting times.
    Web site - http://home.flash-gordon.me.uk/
    comp.lang.c posting guidelines and intro:
    http://clc-wiki.net/wiki/Intro_to_clc

    Inviato da X-Privat.Org - Registrazione gratuita http://www.x-privat.org/join.php
    Flash Gordon, May 17, 2006
    #13
  14. Anunay

    Eric Sosman Guest

    Flash Gordon wrote:

    > Ico wrote:
    >
    >> Friedrich Dominicus <> wrote:
    >>
    >>> "Anunay" <> writes:
    >>>
    >>>> Hi all,
    >>>>
    >>>> Suppose a text file is given and we have to write a program that will
    >>>> return a random line from the file. This can be easily done. But what
    >>>> if the text file is too big and can't fit into the main memory
    >>>> completely?
    >>>
    >>> What it the problem? You read the file line by line till you are at
    >>> your random number.

    >>
    >>
    >> Requiring 2 passes of file reading: one to count the number of lines,
    >> one to choos a random line. This can be used when reading a file
    >> (although not optimal, ofcourse), but is not possible with streams.

    >
    >
    > Since the random distribution was not specified you don't need to do two
    > passes. This might mean there is no possibility of returning lines
    > towards the end of the file, or if the file is only two lines a high
    > probability of returning the last line, but it is still random.
    >
    >>>> Also, if we are given a stream, instead of a file, then what changes
    >>>> are required?

    >>
    >>
    >>> What do you mean with this?

    >>
    >>
    >> I assume the OP means reading from standard input, instead of a file.

    >
    >
    > Pick a probability for returning the current line, and keep going until
    > you either hit EOF or you at random decide to return the line.


    Read Richard Heathfield's initial post on this thread
    again; you'll find that his algorithm (1) requires no advance
    knowledge of the line count, (2) makes just one pass over the
    input, and (3) has equal probability of selecting any particular
    line. It will, however, make one complete pass over all the
    input even if it eventually selects the very first line read;
    there is no "shortcut."

    Point (3) may be a little difficult to grasp at first, so
    let me offer a couple simple examples and then a formal proof.
    What is the probability that the very last line is chosen? If
    there are N lines altogether, the last line is chosen with
    probability 1/N, as desired. How about the first line? It goes
    into the "saved" buffer as soon as it's read (with probability
    1/1), and then it survives there with probability

    (1-1/2)*(1-1/3)*(1-1/4)*...*(1-1/N)
    = (1/2)*(2/3)*(3/4)*...*((N-1)/N)
    = 1/N

    Now for the formal proof, by induction. If the file has
    only one line, that line is selected with probability 1/1.
    That's the "ground" step. Now suppose that the algorithm
    works for N-line files, that is, that it chooses one of the
    N lines with probability 1/N; we wish to show that it works
    for files of N+1 lines. Imagine the situation just before
    the final line is read: the first N lines have been processed
    and one of them resides in the "saved" buffer. The final step
    chooses the last line with probability 1/(N+1) -- which is as
    it should be -- or chooses the already-saved line with probability
    N/(N+1). Since the first N lines were processed by the N-line
    algorithm (known to be correct), any particular one of those first
    N lines is "saved" with probability 1/N, hence the probability that
    that line is chosen from among N+1 is (1/N)*(N/(N+1)) = 1/(N+1).

    The algorithm makes an unbiased "choice" from a 1-line file,
    and if it works for N lines it also works for N+1. Therefore,
    it works for all positive N. (To the limits of the random number
    generator's precision, of course -- but that's an altogether
    deeper topic.)

    --
    Eric Sosman
    lid
    Eric Sosman, May 17, 2006
    #14
  15. Anunay

    Jordan Abel Guest

    On 2006-05-17, Anunay <> wrote:
    > Hi all,
    >
    > Suppose a text file is given and we have to write a program that will
    > return a random line from the file. This can be easily done. But what
    > if the text file is too big and can't fit into the main memory
    > completely? In this case, how will we modify our code?
    >
    > Also, if we are given a stream, instead of a file, then what changes
    > are required?
    >
    > Thanks,
    > Anunay
    >


    http://minnie.tuhs.org/UnixTree/V7/usr/src/games/fortune.c.html

    Note that, of course, '(1+(double)RAND_MAX)' needs to be used instead of
    '32768.' in that code for modern systems, and it could stand to be
    otherwise updated. But the idea is there.
    Jordan Abel, May 17, 2006
    #15
  16. Anunay

    Bill Pursell Guest

    Anunay wrote:
    > Richard Heathfield wrote:
    > > > As the file is too big to be fully fit into main memory, what
    > > > will fopen() return?

    > >
    > > You seem to be confused about the purpose of fopen. It does not read a file
    > > into main memory.

    >
    > Okay, that was a bad question. I meant to ask that where will the
    > remaining portion of file be stored as it could not get loaded in one
    > go?


    As Richard pointed out, you are confused about the purpose of fopen().
    The problem you are describing might be exemplified in the
    following poorly designed program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    const char *path = "/tmp/some_file";

    int
    main(void)
    {
    FILE *ifp;
    struct stat stat_buf;
    unsigned char *data_buf;

    /* fopen() doesn't care how big the file is. */
    if ( (ifp = fopen(path, "r")) == NULL)
    perror(path), exit(EXIT_FAILURE);

    /* stat() will tell you how big the file is. */
    if (stat(path, &stat_buf) == -1)
    perror(NULL), exit(EXIT_FAILURE);

    /*
    * If you want to read the file all at once,
    * you'll need to allocate space. This will
    * fail if the file is too big.
    */
    if ( (data_buf = malloc(stat_buf.st_size)) == NULL)
    fprintf(stderr, "No memory\n"), exit(EXIT_FAILURE);

    /* Process the file here. (look up fread()) */

    return EXIT_SUCCESS;
    }


    The reason this is poorly designed is precisely because
    of the problem you are trying to describe. It is fairly stupid
    to read the entire file in at once. Instead, you should only
    read in chunks and process them as you go.
    Bill Pursell, May 17, 2006
    #16
  17. On Wed, 17 May 2006 07:08:21 -0400,
    CBFalconer <> wrote
    in Msg. <>

    > Analog for Anunay: When you open a book


    He doesn't. Or at least not C books.

    robert
    Robert Latest, May 17, 2006
    #17
  18. Anunay

    Joe Smith Guest

    "Eric Sosman">
    >>>> "Anunay" <> writes:


    >>>>> Suppose a text file is given and we have to write a program that will
    >>>>> return a random line from the file. This can be easily done. But what
    >>>>> if the text file is too big and can't fit into the main memory
    >>>>> completely?


    [snipped Flash Gordon for thematic considerations]
    [Heathfield:]
    Not at all, if you write it properly first time. You just need storage for
    two lines: the currently "saved" line, and the most recently read line.
    When you read line N into memory, copy it into the "saved" buffer with
    probability 1/N. At the end of this process, the "saved" buffer will
    contain a randomly selected line.

    > Read Richard Heathfield's initial post on this thread
    > again; you'll find that his algorithm (1) requires no advance
    > knowledge of the line count, (2) makes just one pass over the
    > input, and (3) has equal probability of selecting any particular
    > line. It will, however, make one complete pass over all the
    > input even if it eventually selects the very first line read;
    > there is no "shortcut."
    >
    > Point (3) may be a little difficult to grasp at first, so
    > let me offer a couple simple examples and then a formal proof.
    > What is the probability that the very last line is chosen? If
    > there are N lines altogether, the last line is chosen with
    > probability 1/N, as desired. How about the first line? It goes
    > into the "saved" buffer as soon as it's read (with probability
    > 1/1), and then it survives there with probability
    >
    > (1-1/2)*(1-1/3)*(1-1/4)*...*(1-1/N)
    > = (1/2)*(2/3)*(3/4)*...*((N-1)/N)
    > = 1/N
    >
    > Now for the formal proof, by induction. If the file has
    > only one line, that line is selected with probability 1/1.
    > That's the "ground" step. Now suppose that the algorithm
    > works for N-line files, that is, that it chooses one of the
    > N lines with probability 1/N; we wish to show that it works
    > for files of N+1 lines. Imagine the situation just before
    > the final line is read: the first N lines have been processed
    > and one of them resides in the "saved" buffer. The final step
    > chooses the last line with probability 1/(N+1) -- which is as
    > it should be -- or chooses the already-saved line with probability
    > N/(N+1). Since the first N lines were processed by the N-line
    > algorithm (known to be correct), any particular one of those first
    > N lines is "saved" with probability 1/N, hence the probability that
    > that line is chosen from among N+1 is (1/N)*(N/(N+1)) = 1/(N+1).
    >
    > The algorithm makes an unbiased "choice" from a 1-line file,
    > and if it works for N lines it also works for N+1. Therefore,
    > it works for all positive N. (To the limits of the random number
    > generator's precision, of course -- but that's an altogether
    > deeper topic.)


    This proof is at the point where the other people in the room either shake
    or nod their head and comment. First, I find it notable that there exist
    proofs in theoretical comp sci and that this continues to be fertile ground
    for unfolding scientific truth.

    I find no error in the induction. One of the reasons that I find no error
    could be that I've all but forgotten conditional probability. If that is
    sound, the telescoping looks correct. I've taken the liberty of sending
    this to a professor with whom I'm acquainted who authored a book in
    statistical paradoxes to ask for comment. There will always be a
    counterexample somewhere; the question is whether it's relevant and how far
    afield. If relevant, one would hand Mr. Sosman the chalk back to elaborate.

    Mr. Heathfield's algorithm would make the act of observation. Whenever that
    happens, the Law of Unintended Consequences gets a phone call. joe
    Joe Smith, May 18, 2006
    #18
  19. Joe Smith said:

    <snip>

    > Mr. Heathfield's algorithm [...]


    Um, I think it's in Knuth somewhere (although I had a look just now, and
    couldn't find it in TAOCP2.)

    --
    Richard Heathfield
    "Usenet is a strange place" - dmr 29/7/1999
    http://www.cpax.org.uk
    email: rjh at above domain (but drop the www, obviously)
    Richard Heathfield, May 18, 2006
    #19
  20. Anunay

    Joe Smith Guest

    "Richard Heathfield" <> wrote in message
    news:p...
    > Joe Smith said:
    >
    > <snip>
    >
    >> Mr. Heathfield's algorithm [...]

    >
    > Um, I think it's in Knuth somewhere (although I had a look just now, and
    > couldn't find it in TAOCP2.)


    I wish I had that handy, as I recall that he offers someting like $2.56 for
    each erratum. I believe that relatively close to the beginning, Knuth
    claims that Leonardo de Pisa gives the following numbers: 0, 1, 1, 2, 3, 5,
    8, 13 and so on. In point of fact, Leonardo in _Libber Abbaci_ gives 0, 1,
    2, 3, 5...., omitting one of the one's. I wouldn't be surprised if
    Professor Knuth already heard of this, but if he hasn't and I read the
    passage correctly, then .... joe
    Joe Smith, May 18, 2006
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Sean W. Quinn
    Replies:
    1
    Views:
    359
    Gianni Mariani
    Dec 1, 2003
  2. uwb
    Replies:
    4
    Views:
    360
  3. Mark Tarver
    Replies:
    22
    Views:
    1,299
    J Kenneth King
    Apr 26, 2009
  4. Peter
    Replies:
    34
    Views:
    1,934
    James Kanze
    Oct 17, 2009
  5. Iñaki Baz Castillo
    Replies:
    1
    Views:
    182
    Iñaki Baz Castillo
    Apr 15, 2008
Loading...

Share This Page