Hows my code

Discussion in 'C Programming' started by saraca means ashoka tree, Dec 21, 2004.

  1. The following code is the heart of a program that I wrote to extract
    html tags from a webpage. How efficient is my code ?. Is there still
    possible way to optimize the code. Am I using everything as per the
    text book. I am just apprehensive whether this may break or may cause a
    memmory leak. Any chance for it.

    #define TOKN_SIZE 256

    void tagfinder() {

    char ch, *tokens;
    int i, j ,len;
    i=j=0;
    //scan buffer holds the webpage as a string.
    len = strlen(scan_buffer);

    while(i < len) {
    ch = scan_buffer[i++];
    if(ch == '<') {
    tokens = malloc(TOKN_SIZE*sizeof(char));
    j=0;
    while(ch != '>') {
    ch = scan_buffer[i++];
    if(j >= TOKN_SIZE)
    tokens = realloc(tokens, (j+TOKN_SIZE) * sizeof(char));
    if(ch != '>') {
    tokens[j++] = ch;
    tokens[j] = '\0';
    }

    }// end of while(ch != '>')
    printf("%s\n",tokens);
    free(tokens);
    }//end of if(ch == '<')
    }//end of while(len > 0)

    }
     
    saraca means ashoka tree, Dec 21, 2004
    #1
    1. Advertising

  2. saraca means ashoka tree

    Eric Sosman Guest

    saraca means ashoka tree wrote:
    > The following code is the heart of a program that I wrote to extract
    > html tags from a webpage. How efficient is my code ?. Is there still
    > possible way to optimize the code. Am I using everything as per the
    > text book. I am just apprehensive whether this may break or may cause a
    > memmory leak. Any chance for it.
    > [code snipped; see up-thread]


    Before worrying about efficiency, worry about correctness.
    You use malloc() and realloc() without checking for failure,
    the way you use realloc() will cause a memory leak if realloc()
    ever fails, the insertion of '\0' can run off the end of your
    allocated region, an empty tag "<>" will leave you with a
    non-string lacking the terminal '\0', and a '<' without a
    matching '>' will send your code completely off the rails.

    Once you've fixed these five bugs (and any others I didn't
    happen to spot in your badly-indented code), you can start
    measuring the performance of your program to see whether any
    efficiency improvements are needed. Keep in mind that if it
    takes you one hour to improve the speed by one millisecond,
    you must run the program 3.6 million times just to break even.

    If efficiency improvements are needed (as they well may be;
    your code as it stands is far from tight), here are four
    suggestions. Note that the C language itself has no notion of
    "efficiency," so the actual effect of these suggestions will
    vary from platform to platform. As a practical matter, all
    four are likely to improve matters, but this is not guaranteed.
    Again, you must measure.

    Suggestion #1: Learn how to use the strchr() function,
    because it can probably locate the '<' and '>' characters
    faster than you can. Don't reinvent the wheel.

    Suggestion #2: If all you need to do is print out the
    substrings between '<' and '>', print them directly from the
    source buffer and get rid of the malloc() and realloc() calls.
    Learn how to use the "%.*s" format specification, or learn how
    to use fwrite().

    Suggestion #3: If your real program needs to store the
    substrings somewhere instead of just printing them out, don't
    allocate memory until you've located the closing '>' and know
    how much space you'll need. This avoids wasting memory when
    you get a short substring, and avoids the overhead of realloc()
    when you get a long one.

    Suggestion #4: Learn how to use the memcpy() function,
    because it can probably copy characters from the big string
    to your destination area faster than you can. (It will
    almost certainly do better than your current practice of
    storing most destination positions twice!) Don't reinvent
    the wheel.

    --
     
    Eric Sosman, Dec 21, 2004
    #2
    1. Advertising

  3. On Tue, 21 Dec 2004, saraca means ashoka tree wrote:
    >
    > The following code is the heart of a program that I wrote to extract
    > html tags from a webpage. How efficient is my code ?. Is there still
    > possible way to optimize the code[?]


    Of course.

    > Am I using everything as per the text book[?] I am just apprehensive
    > whether this may break or may cause a memory leak. Any chance for it[?]


    Re-post your code with some indentation, and maybe someone will take
    the trouble to look at it. Right now, it's completely unreadable.
    (If your problem is that you're trying to post hard tabs to Usenet...
    don't! I recommend
    http://www.contrib.andrew.cmu.edu/~ajo/free-software/detab.c
    , for obvious reasons. ;) Run 'detab -R -4 myprogram.c' and re-post.)

    Make sure the text you're posting actually compiles, by the way.
    What you posted in this message doesn't compile in either C90 or C99,
    the languages discussed in this newsgroup. I strongly recommend you
    don't use '//'-style comments in C code you intend to show anyone;
    they tend to do Bad Things like

    // this is a long comment that overflows the line and turns into a
    syntax error

    and every so often (though less and less frequently, thankfully) we see

    file://this comment was mangled by a Windows-based news client

    -Arthur
     
    Arthur J. O'Dwyer, Dec 21, 2004
    #3
  4. saraca means ashoka tree

    Old Wolf Guest

    saraca means ashoka tree wrote:
    > The following code is the heart of a program that I wrote to extract
    > html tags from a webpage. How efficient is my code ?.


    Better would be to not use malloc at all, and just remember
    a pointer to the token start, and its length (or perhaps,
    the offset into scan_buffer of the start and end).

    The only reason to malloc would be if you needed to
    destroy scan_buffer but keep the tokens, or if you
    needed to pass the token to a function that cannot
    handle a length-counted string (printf is not one of
    those functions).

    >
    > #define TOKN_SIZE 256
    >
    > void tagfinder() {
    >
    > char ch, *tokens;
    > int i, j ,len;
    > i=j=0;
    > //scan buffer holds the webpage as a string.
    > len = strlen(scan_buffer);


    You forgot to include stdlib.h and string.h

    > while(i < len) {
    > ch = scan_buffer[i++];
    > if(ch == '<') {
    > tokens = malloc(TOKN_SIZE*sizeof(char));


    sizeof(char) is always 1.
    You need to check malloc's return value. It will return NULL
    if you have run out of memory.

    > j=0;
    > while(ch != '>') {
    > ch = scan_buffer[i++];
    > if(j >= TOKN_SIZE)
    > tokens = realloc(tokens, (j+TOKN_SIZE) * sizeof(char));


    You need to check realloc's return value.
    Also, if you run out of memory then realloc will return NULL
    and leak the original buffer. So to avoid leaks in the case
    of a memory shortage you need to so something like:
    temp = realloc(.......);
    if (!temp) { free(tokens); exit(EXIT_FAILURE); }
    tokens = temp;

    This is also bad design because you realloc one char
    at a time. So if your token was 1024 in length then
    you will end up doing 1024 - TOKN_SIZE allocations.
    You could at least increase the allocation size by
    TOKN_SIZE each time.

    Even better would be to count the length of the token
    before you do any allocations at all. Then you only
    need to allocate once (a memory allocation is orders
    of magnitude slower than scanning a token twice).

    > if(ch != '>') {
    > tokens[j++] = ch;
    > tokens[j] = '\0';
    > }


    You overflowed 'tokens'. For example, if j == TOKN_SIZE-1
    then tokens[j++]=ch sets the last character to ch,
    and then tokens[j]=0 writes past the end of the buffer.

    This is inefficient anyway because you write a \0 every
    time. You should only write the \0 once, after the
    token is finished.

    >
    > }// end of while(ch != '>')
    > printf("%s\n",tokens);
    > free(tokens);
    > }//end of if(ch == '<')
    > }//end of while(len > 0)
    >
    > }
     
    Old Wolf, Dec 21, 2004
    #4
  5. I am not certain whether the program only exists to extract and store
    tags, or if this is just a single piece of a web-browser or such.
    If it is part of a larger application, I would recommend not to store
    the strings. If you set up constants:

    #define NO_TAG -1
    #define TAG_HTML 0
    #define TAG_HEAD 1
    #define TAG_BODY 2
    #define TAG_P 3

    or such, then you can store your information in a tree structure that
    you can enumerate and switch each tag:

    int currentTag;
    while ((currentTag = getNext(tagTree)) != NO_TAG) {
    /* tagTree might be some sort of tree structure that retains the
    hierarchy of the tags */

    switch (currentTag) {
    case TAG_HTML:
    ....code...
    break;
    case TAG_HEAD:
    ....code...

    etc. ...
    }
    }

    which will be far easier to deal with than strings and faster when it
    comes to actually doing something based on what kind of tag it is.
    Otherwise you will need to be performing string compares everywhere
    through your code.

    -Chris Williams
     
    Chris Williams, Dec 22, 2004
    #5
  6. saraca means ashoka tree

    Jason Taylor Guest

    Try using Crystal REVS for C (www.sgvsarc.com) to take care of
    indentation and formatting. Its flowcharts will help you review your
    code.
     
    Jason Taylor, Dec 24, 2004
    #6
  7. "Jason Taylor" <> writes:
    > Try using Crystal REVS for C (www.sgvsarc.com) to take care of
    > indentation and formatting. Its flowcharts will help you review your
    > code.


    If you don't mind paying $399 for a single license (slightly
    discounted for multiple licenses) -- and it seems to be for Windows
    only.

    Jason, you wouldn't happen to work for the company that sells this
    thing, would you?

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    We must do something. This is something. Therefore, we must do this.
     
    Keith Thompson, Dec 24, 2004
    #7
  8. saraca means ashoka tree

    Jason Taylor Guest

    Keith, Why make a personal remark?

    I just happen to be a very satisfied user of Crystal REVS for C. I
    think it's a great product!

    Allow me to ask a few simple questions:

    Have you tried using it? You can download a free eval copy.

    Its flowcharts and automatic formatting have saved me a lot of time.

    Would you rather do all the low-level editing manually? Would you
    rather have code that is hard to read?

    Would you rather spend more time to understand a function instead of
    using flowcharts?

    How much do you value your time? How does $399 compare?

    Did you take a look at Crystal FLOW for C? It costs $219 - single
    quantity.

    A Windows only product seems fine to me. There must be at least a few
    hundred thousand individuals who design with C/C++ primarily in
    Windows.
     
    Jason Taylor, Dec 24, 2004
    #8
  9. "Jason Taylor" <> writes:
    > Keith, Why make a personal remark?


    Because you came into this newsgroup (where you haven't posted
    previously as far as I can tell) and posted what appear to be
    advertisements for a commercial product. I've seen a lot of
    advertisements posing as testimonials from satisfied customers.

    > I just happen to be a very satisfied user of Crystal REVS for C. I
    > think it's a great product!


    Ok, I'll take your word for it. Even so, the topicality of your posts
    is questionable, but I'm not going to make a big deal out of it.

    > Allow me to ask a few simple questions:
    >
    > Have you tried using it? You can download a free eval copy.


    No. Since I don't develop C software under Windows, I don't have much
    use for it.

    > Its flowcharts and automatic formatting have saved me a lot of time.


    I don't like flowcharts, but de gustibus et cetera.

    > Would you rather do all the low-level editing manually? Would you
    > rather have code that is hard to read?


    Yes and no, respectively.

    [...]

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    We must do something. This is something. Therefore, we must do this.
     
    Keith Thompson, Dec 24, 2004
    #9
  10. On 24 Dec 2004 12:01:01 -0800, in comp.lang.c , "Jason Taylor"
    <> wrote:

    >A Windows only product seems fine to me.


    But offtopic here - this is a platfom neutral group.

    >There must be at least a few
    >hundred thousand individuals who design with C/C++ primarily in
    >Windows.


    So what? Theres a billion speak various dialects of indian, but thats not
    topical here either.


    --
    Mark McIntyre
    CLC FAQ <http://www.eskimo.com/~scs/C-faq/top.html>
    CLC readme: <http://www.ungerhu.com/jxh/clc.welcome.txt>
     
    Mark McIntyre, Dec 24, 2004
    #10
  11. saraca means ashoka tree wrote:
    > The following code is the heart of a program that I wrote to extract
    > html tags from a webpage. How efficient is my code ?. Is there still
    > possible way to optimize the code.
    > ...
    > len = strlen(scan_buffer);
    >
    > while(i < len) {
    > ch = scan_buffer[i++];


    while (ch = scan_buffer[i++])
    {
     
    Dietmar Schindler, Dec 28, 2004
    #11
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Ron
    Replies:
    1
    Views:
    2,708
    Showjumper
    Jun 24, 2003
  2. Ian
    Replies:
    0
    Views:
    1,391
  3. =?Utf-8?B?Q2FybG8gTWFyY2hlc29uaQ==?=

    Fire Code behind code AND Javascript code associated to a Button Click Event

    =?Utf-8?B?Q2FybG8gTWFyY2hlc29uaQ==?=, Feb 10, 2004, in forum: ASP .Net
    Replies:
    4
    Views:
    21,247
    =?Utf-8?B?Q2FybG8gTWFyY2hlc29uaQ==?=
    Feb 11, 2004
  4. keithb
    Replies:
    1
    Views:
    920
    Bruce Barker
    Mar 29, 2006
  5. johannes falcone
    Replies:
    2
    Views:
    77
    Kevin Walzer
    Feb 1, 2014
Loading...

Share This Page