Re: sscanf() safety

Discussion in 'C Programming' started by Jens Thoms Toerring, Dec 23, 2010.

  1. Cross <> wrote:

    > I am working on an rtf renderer and parser. My code is hosted at
    > http://code.google.com/p/ertf . I tried kcachegrind on my binaries and found
    > that getc() is taking a lot of time. Obviously, character read from files is
    > slow. So, I decided to read the whole file into memory as a char buffer.
    > Please feel free to comment and suggest on the following code. Now, I want
    > to scan the char buffer using sscanf().


    I guess you mean fscanf() here since you seem to want to read
    from a file and not from a string - which is what sscanf() is
    for,

    > However, I remember once I heard in
    > a chat room that sscanf() has buffer overflow vulnerabilities.


    Only when used in an unsafe fashion, that is reaing with e.g.
    an unadorned '%s' since you can't forsee the number if chars
    that can be read in and thus you can't supply a large enough
    buffer. But when you specify the maximum number of chars that
    fscanf() is allowed to read it's perfectlly safe. So for example

    sscanf( fp, "%4096[^\x01-\0xFF]", buf );

    will read up to but never more than 4096 chars from the file
    and put them into 'buf' (and it will put a trailing '\0' at
    the end). It also will stop at a '\0' in the file, but then
    this is a binary file and in that case fscanf() isn't suit-
    able anyway.

    But if you want to read in the whole file anyway and you don't
    want to do any conversion of the data then fread() is probably
    the better choice anyway.

    > I would like
    > pointers on this and would like to know how I can use sscanf() safely.


    > if(!fp){
    > fprintf(stderr, "File pointer uninitialized.\n");
    > goto close_strbuf;
    > }


    > while((count = fread(cp, 1, 4096, fp))){
    > if(feof(fp))break;


    Checking for feof() is rather useless - if you hit the end of the
    file and there are no chars left to be read the return value of
    fread() will be 0. That's a good time to stop;-)

    > strbuf_append(buf, cp);


    No idea what this function is supposed to do (it's not a standard
    C function and thus shouldn't have a name starting with 'str'),
    but I am rather sure that this use is unsafe - what you got from
    fread() has rather likely no '\0' at the end, so the function
    has no chance to determine where to stop when copying from 'cp'
    to 'buf', you would also have to pass 'count' to it...

    > }
    > cp[count] = '\0';
    > strbuf_append(buf, cp);
    > fclose(fp);


    The function (or, the snippet from some function) doesn't make
    too much sense to me the way its written. How large is 'buf'?
    Do you want to read never more than 4096 chars from a file and
    thus 'buf' has (at least) 4097 elements? In that case why use
    the while() loop? Just

    count = fread( buf, 1, 4096, fp );
    buf[ count ] = '\0';

    would do fine.

    Otherwise where do you check that you can still append to 'buf'?
    That's assuming that you don't know the length of the file in
    advance - but if you did I would expect that you would try to
    read in everything at once instead of in chunks of 4096 bytes.

    So, please give a bit more of information about what you're
    trying to do, otherwise it's impossible to tell what you
    may need.
    Regards, Jens
    --
    \ Jens Thoms Toerring ___
    \__________________________ http://toerring.de
    Jens Thoms Toerring, Dec 23, 2010
    #1
    1. Advertising

  2. Cross <> wrote:
    > On 12/24/2010 02:04 AM, Jens Thoms Toerring wrote:
    > > Checking for feof() is rather useless - if you hit the end of the
    > > file and there are no chars left to be read the return value of
    > > fread() will be 0. That's a good time to stop;-)


    > Without feof(), there was some garbage read each time. I don't know why but
    > with feof() things appear fine.


    You shouldn't program by trial and error but try understand what's
    going wrong;-) Perhaps using feof() just hides the error but does
    not really fix the real problem. Given the scarcity of information
    you're giving that could very well be the case.

    > > The function (or, the snippet from some function) doesn't make
    > > too much sense to me the way its written. How large is 'buf'?

    > I have defined buf to be of size 4096 bytes.
    > > Do you want to read never more than 4096 chars from a file and
    > > thus 'buf' has (at least) 4097 elements? In that case why use
    > > the while() loop? Just
    > >
    > > count = fread( buf, 1, 4096, fp );
    > > buf[ count ] = '\0';
    > >
    > > would do fine.


    > The files sizes used in testing range from several kilobytes to several
    > megabytes.


    That may be the case but you simply don't show enough of your
    code to make sense, and this reply doesn't help either, sorry.
    So it's impossible to come up with sensible advice. Please show
    the whole function (plus functions used from within it like
    strbuf_append()) - then things might become clearer. At the
    moment it's impossible to even say what you're intentions for
    that function are, so it's like giving the car mechanic a photo
    of your car and asking him why it won't start;-)

    At least if files can be much longer than 4096 bytes then all
    tests of how much still fits into 'buf' are missing (does that
    happen in strbuf_append()?) as well as reallocation of 'buf' etc.
    And that's were buffer overflows you seem to be concerned about
    typically happen (or are avoided).

    Regards, Jens
    --
    \ Jens Thoms Toerring ___
    \__________________________ http://toerring.de
    Jens Thoms Toerring, Dec 23, 2010
    #2
    1. Advertising

  3. Cross <> wrote:
    > On 12/24/2010 03:12 AM, Jens Thoms Toerring wrote:
    > > At least if files can be much longer than 4096 bytes then all
    > > tests of how much still fits into 'buf' are missing (does that
    > > happen in strbuf_append()?) as well as reallocation of 'buf' etc.
    > > And that's were buffer overflows you seem to be concerned about
    > > typically happen (or are avoided).
    > >

    > I am using eina string buffer utility from EFL. http://enlightenment.org


    That's nice for you. But there was no call to any of them in
    the code you posted. And using third-party functions doesn't
    make your code inherently safer, they still need to be used
    correctly (assuming they're bug-free). But if you're not pre-
    pared to tell what you're doing (except some more or less
    meaningless snippets) you will need a clairvoyant for help,
    my crystal ball doesn't work well enough at the moment. (I
    even went as far as checking the web page you gave for your
    program but all you get there under "Downloads" is "This pro-
    ject currently has no downloads".)

    So
    <volume setting="max">
    POST THE REAL, COMPLETE CODE YOU'RE USING
    </volume>

    and don't make assumptions about where the problem is - if you
    knew you wouldn't have to ask. Best write a short, compilable
    program that exhibits the problems you're facing, leaving out
    everthing else, and post that.

    Regards, Jens
    --
    \ Jens Thoms Toerring ___
    \__________________________ http://toerring.de
    Jens Thoms Toerring, Dec 23, 2010
    #3
  4. Cross <> wrote:
    > On 12/24/2010 04:22 AM, Jens Thoms Toerring wrote:
    > > So
    > > <volume setting="max">
    > > POST THE REAL, COMPLETE CODE YOU'RE USING
    > > </volume>
    > >
    > > and don't make assumptions about where the problem is - if you
    > > knew you wouldn't have to ask. Best write a short, compilable
    > > program that exhibits the problems you're facing, leaving out
    > > everthing else, and post that.


    > http://docs.enlightenment.org/auto/eina/group__Eina__String__Buffer__Group.html


    > The above link is the documentation and the source code is available at
    > http://trac.enlightenment.org/e/browser/trunk/eina/src/lib


    Thanks, I am able to use Google. But in the code you posted
    here no 'eina_' function was ever used. So why should I care
    about it?

    > I also posted the site for my full code http://code.google.com/p/ertf Look in
    > src/lib/ertf_input.c and src/lib/ertf_rtf_to_markup.c


    As I already told you, if you go to "Downloads" all you get is
    "This project currently has no downloads.". So the only way to
    get the source you have to do a svn checkout. Perhaps you should
    consider that not everyone has svn and is prepared to install it.
    But if you do it anyway one finds that the code snippet you pos-
    ted here isn't to be found in that source tree at all - e.g. the
    fread() function isn't found when you grep over the whole tree.
    So it got nothing to do with the code we're supposed to discuss
    here and doesn't give a clue what you're after. All one finds in
    one of the files you mentioned is a single function with more
    than 400 lines, with a single switch case making up more than
    350 of that. Good people have been shot for less. And in the
    other one one finds e.g.

    int
    ertf_tag_get(FILE *fp, char *s)
    {
    int c;
    fscanf(fp, "%[^ 0123456789;\\{}\n\r]", s);
    if ((c = fgetc(fp)) != ' ')
    ungetc(c, fp);
    CHECK_EOF(fp, "EOF encountered while obtaining control tag.\n", return 1);
    return 0;
    }

    This function can't be used safely at all. Within the function
    you have no idea how much room is left in 's', so it's impos-
    sible to use fscanf() without a potential buffer overrun. You
    also don't check if fscanf() was successful - but the calling
    functions seem, as far as I can see, to blindly assume that it
    did succeed. And the check for EOF after the fgetc() is useless
    - if you hit EOF you already tried to stuff it back into the
    file, which of course is not what you want since EOF is not a
    char and what gets "pushed back" is whatever EOF actually is on
    the system, converted to an unsigned char...

    --
    \ Jens Thoms Toerring ___
    \__________________________ http://toerring.de
    Jens Thoms Toerring, Dec 24, 2010
    #4
  5. Cross <> wrote:
    > Actually, I tried kcachegrind on my code and found that as the file size
    > increases 50 - 80% of the time is spent in i/o (on fgetc() specifically).
    > So, I came up with the idea of reading the whole file into a char buffer
    > and scanning it.


    Something like

    Eina_Strbuf *buf = eina_strbuf_new( );
    char cp[ 4097 ];
    size_t count;

    while ( ( count = fread( cp, 1, 4096, fp ) ) != 0 )
    {
    cp[ count ] = '\0';
    eina_strbuf_append( buf, cp );
    }

    should do the job (assuming I got the part about the eina string
    buffer right and it works as I assume). Of course, you should
    check the return value of eina_strbuf_append() - it may run out
    of memory. And there's no need to check for EOF at all - the
    loop will be left once the end of the file has been reached.

    Also, you seem to like feof() a lot, but as far as my experience
    goes it's hardly needed at all. All functions for reading from a
    file return some kind if indication if reading succeeded or not
    and if you use that calling feof() in most cases is only needed
    if you have to distinguish between the EOF condition from a read
    failure. I just grepped through a project of mine with about 250k
    LoC and dozends of files being read in and parsed and found not a
    single call of feof() (all there is are a few calls of ferror()
    which come from automatically generated code).

    > > All one finds in
    > > one of the files you mentioned is a single function with more
    > > than 400 lines, with a single switch case making up more than
    > > 350 of that. Good people have been shot for less.

    > Honestly, I would like to know a better way of doing it.


    Split it up into functions that do simple tasks and call them
    instead of putting all the functionality into a single mono-
    litic function. Also consider if its possible to use a "dis-
    patch table", e.g. if you have code like

    if ( ! strncmp( buf, "plain", 5 )
    {
    /* do something with plain text */
    }
    else if ( ! strncmp( buf, "par", 3 ) )
    {
    /* do something with a paragraph */
    }
    ....
    else
    /* do something if all else failed */

    you could instead define a structure like

    struct {
    const char * keyword;
    int ( *handler )( const char * );
    } dispatch_table[ ] = { { "plain", handle_plain_text },
    { "f", handle_font },
    ... };

    and use that it as in

    for ( int i = 0; i < sizeof dispatch_table / sizeof *dispatch_table, i++ )
    if ( ! strncmp( buf, dispatch_table[ i ].keyword,
    strlen( dispatch_table[ i ].keyword ) )
    {
    result = dispatch_table[ i ].handler( buf );
    break;
    }

    if ( i == sizeof dispatch_table / sizeof *dispatch_table )
    result = handle_unkown_keyword( buf );

    That could get rid of the endless if/else if/else blocks. Of
    course you then need a single function for each of the diff-
    erent keywords (don't be afraid of short functions, the com-
    piler will rather likely inline them), but you should have
    something like that anyway to make the code modular and give
    the reader at least a fighting chance of understanding what's
    going on. With the above definition of the structure they would
    have to have a signature of

    int handle_plain_text( const char * buf );

    and you must adapt the member for the function pointer when
    you use a different signature.

    When you use such functions you will rather likely have to
    assemble the information about the current state of the par-
    ser (all those 'boldset', 'italicset' variables etc.) into
    a structure that you then pass to the functions (probably as
    a pointer to the structure).

    Of course, instead of testing for the length of the dispatch
    table you could have a sentinel element at the end (where
    e.g. the keyword member is a NULL pointer) and bail out from
    the loop when this is found - the trick with

    sizeof dispatch_table / sizeof *dispatch_table

    only works when the array of structures is defined in the
    same function where it's used (or as a global array).

    > > And in the
    > > other one one finds e.g.
    > >
    > > int
    > > ertf_tag_get(FILE *fp, char *s)
    > > {
    > > int c;
    > > fscanf(fp, "%[^ 0123456789;\\{}\n\r]", s);
    > > if ((c = fgetc(fp)) != ' ')
    > > ungetc(c, fp);
    > > CHECK_EOF(fp, "EOF encountered while obtaining control tag.\n", return 1);
    > > return 0;
    > > }
    > >
    > > This function can't be used safely at all. Within the function
    > > you have no idea how much room is left in 's', so it's impos-
    > > sible to use fscanf() without a potential buffer overrun. You
    > > also don't check if fscanf() was successful - but the calling
    > > functions seem, as far as I can see, to blindly assume that it
    > > did succeed. And the check for EOF after the fgetc() is useless
    > > - if you hit EOF you already tried to stuff it back into the
    > > file, which of course is not what you want since EOF is not a
    > > char and what gets "pushed back" is whatever EOF actually is on
    > > the system, converted to an unsigned char...
    > >


    > Considering this, s is declared of size 1000 in ertf_font.c. However, now I
    > think I should reflect this in fscanf() as something similar to "%1000s".


    Yes. Of course, that assumes that you know for sure that 's' has
    1000 elements, which can't be checked from within the function.
    And if you do something like that make it "%999s" - you need to
    count in the trailing '\0' that is always appended, so there's
    only room for 999 characters.

    Another thing: what you're doing here seems to be writing a
    parser. They are typically ugly to write and if the "syntax"
    of the input isn't too convoluted it can be worth the time
    to check out tools that automatically generate the "ugly"
    part of the code for you. One example would be yacc/flex.
    Admittedly, it takes some time to learn how to use them, but
    then parsing files becomes a lot easier and much more fun
    since you don't have to care anymore about all those pesky
    little details;-)
    Regards, Jens
    --
    \ Jens Thoms Toerring ___
    \__________________________ http://toerring.de
    Jens Thoms Toerring, Jan 9, 2011
    #5
  6. Cross <> wrote:
    > >> Considering this, s is declared of size 1000 in ertf_font.c. However, now I
    > >> think I should reflect this in fscanf() as something similar to "%1000s".

    > >
    > > Yes. Of course, that assumes that you know for sure that 's' has
    > > 1000 elements, which can't be checked from within the function.
    > > And if you do something like that make it "%999s" - you need to
    > > count in the trailing '\0' that is always appended, so there's
    > > only room for 999 characters.

    >
    > Well "%999s" shall be able to get strings whose maximum length is 999,
    > isn't it?


    The "%999s" bit tells *scanf() to read in up to 999 chars -
    the necessary trailing '\0' (that makes it a string) always
    will be added automatically. So if you have a char array
    with 1000 elements (and thus a sizeof 1000) the maximum
    number of chars you can allow *scanf() to read in without
    risking a buffer overrun is 999.

    Regards, Jens
    --
    \ Jens Thoms Toerring ___
    \__________________________ http://toerring.de
    Jens Thoms Toerring, Jan 10, 2011
    #6
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Anonymous
    Replies:
    0
    Views:
    468
    Anonymous
    Oct 30, 2003
  2. pramod
    Replies:
    22
    Views:
    1,801
    Lew Pitcher
    Jan 6, 2004
  3. Mike

    sscanf and c++!

    Mike, Jun 8, 2004, in forum: C++
    Replies:
    3
    Views:
    3,541
  4. Nobody

    Re: sscanf() safety

    Nobody, Dec 23, 2010, in forum: C Programming
    Replies:
    0
    Views:
    545
    Nobody
    Dec 23, 2010
  5. Barry Schwarz

    Re: sscanf() safety

    Barry Schwarz, Dec 24, 2010, in forum: C Programming
    Replies:
    1
    Views:
    383
    Barry Schwarz
    Dec 24, 2010
Loading...

Share This Page