Reading long lines from a file

Discussion in 'C Programming' started by Vlad Dogaru, Aug 14, 2007.

  1. Vlad Dogaru

    Vlad Dogaru Guest

    Hello,

    I suspect this comes up quite often, but I haven't found an exact
    solution in the FAQ. I have to read and parse a file with arbitrarily
    long lines and have come up with the following plan:

    1. start with a statically allocated buffer and a pointer of equal size
    2. read into the buffer using fgets and append to the pointer
    3. if buffer does not contain '\n', reallocate buffer and jump to 2
    4. return the pointer

    Do you see anything wrong with this? If so, how can I improve it?

    Thanks in advance,
    Vlad Dogaru

    --
    Number one reason to date an engineer:
    The world does revolve around us; we pick the coordinate system.
     
    Vlad Dogaru, Aug 14, 2007
    #1

  2. Vlad Dogaru said:

    > Hello,
    >
    > I suspect this comes up quite often, but I haven't found an exact
    > solution in the FAQ. I have to read and parse a file with arbitrarily
    > long lines and have come up with the following plan:
    >
    > 1. start with a statically allocated buffer and a pointer of equal size
    > 2. read into the buffer using fgets and append to the pointer
    > 3. if buffer does not contain '\n', reallocate buffer and jump to 2
    > 4. return the pointer
    >
    > Do you see anything wrong with this? If so, how can I improve it?


    To start with, you can't reallocate a statically allocated buffer! Nor
    can you have a pointer of equal size to a buffer except by sizing the
    buffer to be the same size as a pointer. Nor can you append to a
    pointer.

    Once we get those impossibilities out of the way, we can dispense with
    the unnecessary fgets call - your input is already buffered, so why
    buffer it again through fgets?

    Here's the plan:

    Allocate C (greater than 1) bytes of storage space DYNAMICALLY - point
    at this allocation with P. Set U to 0. Have a temporary pointer T
    kicking about the place.

    While you can read a character successfully that isn't a newline:
        If U == C - 1
            You're about to run out of space, so get some more
            T = realloc(P, C * 2)
            If that didn't work, you might want to try lower multipliers
            (1.5, 1.25 maybe) or even use add instead of multiply - and
            warn the caller that you're running low on RAM.
            Eventually, either you give up (in which case tell the user
            you failed), or you succeed, in which case set P = T
            Increase C to describe the new allocation amount accurately
        Endif

        If all is well
            P[U++] = the character you read
        Endif
    Endwhile
    If all is well
        P[U] = '\0'
    Endif
    P now points to the line.
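
    A minimal C sketch of the plan above, for illustration only. The
    function name read_line and the starting capacity of 16 are
    assumptions, not part of the original post, and where the plan
    suggests retrying with a smaller growth factor this sketch simply
    gives up on the first failed realloc.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads one line from fp without its newline.
 * Returns a malloc'd string the caller must free, or NULL at end of
 * file (nothing read) or on allocation failure. */
char *read_line(FILE *fp)
{
    size_t cap = 16;        /* C in the plan: bytes allocated */
    size_t used = 0;        /* U in the plan: bytes stored so far */
    char *p = malloc(cap);  /* P in the plan */
    int ch;

    if (p == NULL)
        return NULL;

    while ((ch = getc(fp)) != EOF && ch != '\n') {
        if (used == cap - 1) {              /* keep one byte for '\0' */
            size_t newcap = cap * 2;
            /* T in the plan; the comparison guards against wrap-around */
            char *t = (newcap > cap) ? realloc(p, newcap) : NULL;
            if (t == NULL) {                /* p is still valid here */
                free(p);
                return NULL;
            }
            p = t;
            cap = newcap;
        }
        p[used++] = (char)ch;
    }

    if (ch == EOF && used == 0) {           /* nothing left to read */
        free(p);
        return NULL;
    }
    p[used] = '\0';
    return p;
}
```

    Note that the result of realloc goes into a temporary first, so the
    original block is not leaked when the call fails.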

    For a discussion of long-line issues, an implementation of a full line
    capture function, and links to other such implementations, see
    http://www.cpax.org.uk/prg/writings/fgetdata.php

    --
    Richard Heathfield <http://www.cpax.org.uk>
    Email: -www. +rjh@
    Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
    "Usenet is a strange place" - dmr 29 July 1999
     
    Richard Heathfield, Aug 14, 2007
    #2

  3. pete

    pete Guest

    Vlad Dogaru wrote:
    >
    > Hello,
    >
    > I suspect this comes up quite often, but I haven't found an exact
    > solution in the FAQ. I have to read and parse a file with arbitrarily
    > long lines and have come up with the following plan:
    >
    > 1. start with a statically allocated buffer and a pointer of equal size
    > 2. read into the buffer using fgets and append to the pointer
    > 3. if buffer does not contain '\n', reallocate buffer and jump to 2
    > 4. return the pointer
    >
    > Do you see anything wrong with this?


    Possibly with the phrase "statically allocated".
    There are three kinds of storage duration:
    1 automatic
    2 static
    3 allocated

    Only allocated memory can be reallocated.
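
    A small illustration of the three durations, as a sketch. The names
    static_buf, auto_buf and grow_allocated_buffer are made up for this
    example.

```c
#include <stdlib.h>
#include <string.h>

static char static_buf[100];    /* static duration: size is fixed */

/* Hypothetical demo: returns 1 if an allocated buffer kept its
 * contents after being grown with realloc, 0 otherwise. */
int grow_allocated_buffer(void)
{
    char auto_buf[100];         /* automatic duration: size is fixed */
    char *p = malloc(100);      /* allocated duration: may be resized */
    char *t;
    int ok;

    if (p == NULL)
        return 0;

    strcpy(auto_buf, "hello");
    strcpy(static_buf, auto_buf);
    strcpy(p, static_buf);

    /* Passing static_buf or auto_buf to realloc would be undefined
     * behaviour; only p came from an allocation function. */
    t = realloc(p, 200);
    if (t == NULL) {
        free(p);
        return 0;
    }
    p = t;

    ok = (strcmp(p, "hello") == 0);
    free(p);
    return ok;
}
```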

    > If so, how can I improve it?


    A few of the regulars here
    have written their own getline functions:
    http://www.cpax.org.uk/prg/writings/fgetdata.php#related

    --
    pete
     
    pete, Aug 14, 2007
    #3
  4. Vlad Dogaru

    Vlad Dogaru Guest

    Richard Heathfield wrote:
    > Vlad Dogaru said:
    >
    >> Hello,
    >>
    >> I suspect this comes up quite often, but I haven't found an exact
    >> solution in the FAQ. I have to read and parse a file with arbitrarily
    >> long lines and have come up with the following plan:
    >>
    >> 1. start with a statically allocated buffer and a pointer of equal size
    >> 2. read into the buffer using fgets and append to the pointer
    >> 3. if buffer does not contain '\n', reallocate buffer and jump to 2
    >> 4. return the pointer
    >>
    >> Do you see anything wrong with this? If so, how can I improve it?

    >
    > To start with, you can't reallocate a statically allocated buffer! Nor
    > can you have a pointer of equal size to a buffer except by sizing the
    > buffer to be the same size as a pointer. Nor can you append to a
    > pointer.
    >
    > Once we get those impossibilities out of the way, we can dispense with
    > the unnecessary fgets call - your input is already buffered, so why
    > buffer it again through fgets?



    If anything, my lack of English skills has contributed to the
    misunderstanding. I was talking about:
    char b[100], *p;
    Reading into b with fgets, then reallocating p as necessary to do a
    strcat(p, b).

    But your solution is much more elegant and now I see why fgets is
    unnecessary.

    >
    > Here's the plan:
    >
    > Allocate C (greater than 1) bytes of storage space DYNAMICALLY - point
    > at this allocation with P. Set U to 0. Have a temporary pointer T
    > kicking about the place.
    >
    > While you can read a character successfully that isn't a newline:
    >     If U == C - 1
    >         You're about to run out of space, so get some more
    >         T = realloc(P, C * 2)
    >         If that didn't work, you might want to try lower multipliers
    >         (1.5, 1.25 maybe) or even use add instead of multiply - and
    >         warn the caller that you're running low on RAM.
    >         Eventually, either you give up (in which case tell the user
    >         you failed), or you succeed, in which case set P = T
    >         Increase C to describe the new allocation amount accurately
    >     Endif
    >
    >     If all is well
    >         P[U++] = the character you read
    >     Endif
    > Endwhile
    > If all is well
    >     P[U] = '\0'
    > Endif
    > P now points to the line.
    >
    > For a discussion of long-line issues, an implementation of a full line
    > capture function, and links to other such implementations, see
    > http://www.cpax.org.uk/prg/writings/fgetdata.php


    Thank you for the clarification and the link. I will look into it and I
    am confident that I can write a similar function.

    Vlad
    --
    Number one reason to date an engineer:
    The world does revolve around us; we pick the coordinate system.
     
    Vlad Dogaru, Aug 14, 2007
    #4
  5. David Mathog

    David Mathog Guest

    Vlad Dogaru wrote:
    > Hello,
    >
    > I suspect this comes up quite often, but I haven't found an exact
    > solution in the FAQ. I have to read and parse a file with arbitrarily
    > long lines and have come up with the following plan:
    >
    > 1. start with a statically allocated buffer and a pointer of equal size
    > 2. read into the buffer using fgets and append to the pointer
    > 3. if buffer does not contain '\n', reallocate buffer and jump to 2
    > 4. return the pointer
    >
    > Do you see anything wrong with this? If so, how can I improve it?


    This may not apply to your particular case, but in some instances I have
    encountered with "arbitrarily long lines" one can just read a character
    at a time, examine it, perform some action, and then continue. This
    removes the need for a huge buffer, which in the worst case, might not
    even fit into the computer's memory. Obviously this won't work if any
    modification to the front of the line depends on a value near the end of
    the line.
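
    As a sketch of that character-at-a-time style, here is one task that
    needs no line buffer at all: finding the length of the longest line.
    The function name longest_line is an assumption made for this
    example.

```c
#include <stdio.h>

/* Hypothetical example: returns the length of the longest line in fp,
 * not counting the newline, using constant memory. */
unsigned long longest_line(FILE *fp)
{
    unsigned long cur = 0;   /* length of the line being read */
    unsigned long best = 0;  /* longest length seen so far */
    int ch;

    while ((ch = getc(fp)) != EOF) {
        if (ch == '\n') {
            if (cur > best)
                best = cur;
            cur = 0;
        } else {
            cur++;
        }
    }
    if (cur > best)          /* the last line may lack a newline */
        best = cur;
    return best;
}
```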

    If you do go with the expanding buffer method, be sure that you do
    NOT use strcat() to append each new chunk of text. Doing so will result
    in each such addition scanning from the front of the buffer for the
    terminal '\0' in the string. I've seen this bug many, many times.
    It can cause a huge performance hit. Instead, keep track of the
    length of the string in the buffer and just copy the new string directly
    to the appropriate position, then adjust the length variable, and repeat.
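
    A sketch of the length-tracking approach, assuming fgets chunks of
    64 bytes. The name read_line_chunked and the chunk size are
    illustrative choices. For brevity it reallocates once per chunk
    rather than doubling the buffer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one line (newline kept if present) in
 * fgets chunks. Returns a malloc'd string the caller must free, or
 * NULL at end of file or on allocation failure. */
char *read_line_chunked(FILE *fp)
{
    char chunk[64];
    char *line = NULL;
    size_t len = 0;              /* length of the string built so far */

    while (fgets(chunk, sizeof chunk, fp) != NULL) {
        size_t n = strlen(chunk);
        char *t = realloc(line, len + n + 1);  /* realloc(NULL, n) acts like malloc */
        if (t == NULL) {
            free(line);
            return NULL;
        }
        line = t;
        /* Copy straight to the known end instead of calling strcat,
         * which would rescan the buffer from the start every time. */
        memcpy(line + len, chunk, n + 1);
        len += n;
        if (n > 0 && chunk[n - 1] == '\n')
            break;               /* the line is complete */
    }
    return line;
}
```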

    Regards,

    David Mathog
     
    David Mathog, Aug 14, 2007
    #5
  6. Flash Gordon

    Flash Gordon Guest

    Vlad Dogaru wrote, On 14/08/07 11:46:
    > Richard Heathfield wrote:


    <snip>

    >> To start with, you can't reallocate a statically allocated buffer! Nor
    >> can you have a pointer of equal size to a buffer except by sizing the
    >> buffer to be the same size as a pointer. Nor can you append to a pointer.
    >>
    >> Once we get those impossibilities out of the way, we can dispense with
    >> the unnecessary fgets call - your input is already buffered, so why
    >> buffer it again through fgets?

    >
    > If anything, my lack of English skills has contributed to the
    > misunderstanding. I was talking about:
    > char b[100], *p;
    > Reading into b with fgets, then reallocating p as necessary to do a
    > strcat(p, b).


    Since we do not know what p points to, we cannot say whether you are
    allowed to realloc what it points to or not. You can only pass realloc
    a null pointer or a pointer returned by malloc, calloc or realloc.

    Also beware of denial-of-service attacks where a user deliberately
    creates a file with a line 5GB long.

    <snip>
    --
    Flash Gordon
     
    Flash Gordon, Aug 14, 2007
    #6
  7. On 2007-08-14 17:43, Flash Gordon wrote:
    > Vlad Dogaru wrote, On 14/08/07 11:46:
    >> Richard Heathfield wrote:
    >>> To start with, you can't reallocate a statically allocated buffer! Nor
    >>> can you have a pointer of equal size to a buffer except by sizing the
    >>> buffer to be the same size as a pointer. Nor can you append to a pointer.

    [...]
    >> If anything, my lack of English skills has contributed to the
    >> misunderstanding. I was talking about:
    >> char b[100], *p;
    >> Reading into b with fgets, then reallocating p as necessary to do a
    >> strcat(p, b).

    >
    > Since we do not know what p points to we cannot say whether you are
    > allowed to realloc what it points to or not.


    We cannot *know*, but I think it is reasonable to assume from the
    description that he uses malloc to get the initial value for p. You
    don't always have to assume the stupidest possible version if
    something isn't specified exactly ;-).

    > Also beware of denial-of-service attacks where a user deliberately
    > creates a file with a line 5GB long.


    ACK. But that's probably not something which should be hard-coded into
    the application. After all, the program might run on a machine with 64
    GB RAM where 5 GB of memory usage is quite acceptable. You could use a
    configurable limit or rely on OS features to limit memory consumption
    (e.g. ulimit on unixoid systems).
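
    A sketch of the configurable-limit idea: the caller passes the
    maximum line length it is willing to accept. The name
    read_line_limited and its parameters are assumptions made for this
    example. When the limit is hit, the rest of the over-long line is
    left unread in the stream.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads one line without its newline, refusing
 * lines longer than max_len characters. Returns a malloc'd string, or
 * NULL at end of file, on allocation failure, or when the limit is
 * exceeded (*too_long is set to 1 only in that last case). */
char *read_line_limited(FILE *fp, size_t max_len, int *too_long)
{
    size_t cap = 16, used = 0;
    char *p = malloc(cap);
    int ch;

    *too_long = 0;
    if (p == NULL)
        return NULL;

    while ((ch = getc(fp)) != EOF && ch != '\n') {
        if (used == max_len) {          /* one character too many */
            *too_long = 1;
            free(p);
            return NULL;
        }
        if (used == cap - 1) {          /* keep one byte for '\0' */
            size_t newcap = cap * 2;
            char *t = (newcap > cap) ? realloc(p, newcap) : NULL;
            if (t == NULL) {
                free(p);
                return NULL;
            }
            p = t;
            cap = newcap;
        }
        p[used++] = (char)ch;
    }

    if (ch == EOF && used == 0) {       /* nothing left to read */
        free(p);
        return NULL;
    }
    p[used] = '\0';
    return p;
}
```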

    hp

    --
    _ | Peter J. Holzer | I know I'd be respectful of a pirate
    |_|_) | Sysadmin WSR | with an emu on his shoulder.
    | | | |
    __/ | http://www.hjp.at/ | -- Sam in "Freefall"
     
    Peter J. Holzer, Aug 20, 2007
    #7
  8. On Aug 20, 1:57 pm, "Peter J. Holzer" wrote:
    > On 2007-08-14 17:43, Flash Gordon wrote:
    >
    > > Vlad Dogaru wrote, On 14/08/07 11:46:
    > >> Richard Heathfield wrote:
    > >>> To start with, you can't reallocate a statically allocated buffer! Nor
    > >>> can you have a pointer of equal size to a buffer except by sizing the
    > >>> buffer to be the same size as a pointer. Nor can you append to a pointer.

    > [...]
    > >> If anything, my lack of English skills has contributed to the
    > >> misunderstanding. I was talking about:
    > >> char b[100], *p;
    > >> Reading into b with fgets, then reallocating p as necessary to do a
    > >> strcat(p, b).

    >
    > > Since we do not know what p points to we cannot say whether you are
    > > allowed to realloc what it points to or not.

    >
    > We cannot *know*, but I think it is reasonable to assume from the
    > description that he uses malloc to get the initial value for p. You
    > don't always have to assume the stupidest possible version if
    > something isn't specified exactly ;-).


    Reading Flash Gordon's post I don't see him assuming anything.
    He was simply aiming to cover all possibilities and I'm all for
    that; we do aim to be accurate around here.
     
    Spiros Bousbouras, Aug 20, 2007
    #8
