Parsing?

Discussion in 'C Programming' started by lisp9000@gmail.com, Sep 13, 2007.

  1. Guest

    Hi,

    I am writing a log parser (beginner in C) and have some questions.

    There are 2 types of log files which are very similar:

    Type 1:
    117: SYSTEM->P0 Welcome to the server
    444: Z1->P0 Greetings
    812: SYSTEM->EVERYONE "Chumly" (P0) was kill #5 for "Dragon
    Master" (Z1)
    954: P0->TEAMORANGE Help me!

    Type 2:
    Welcome aboard Chumly! 00:03:40
    00:03:41: Qualax-5->TEAMORANGE Qualax-5 destroyed by "Dragon
    Master" (Z1)
    Blaster missed!!! 00:03:53
    00:04:06: P0->TEAMPURPLE Help Needed at Zorcon-8

    So in Type 1 there is always an integer indicating relative time
    prefixing every line, and in Type 2 there is always a 24-hour style
    timestamp, but sometimes it is a prefix and other times a suffix.

    I thought of using strtok(), but that doesn't handle quoting, so if I
    encounter a message in " " it won't be able to handle it.

    Does anyone have any idea on the best way to tokenize this? My goal is
    to extract only certain types of messages such as the ones between
    players (eg. Z1->P0) and to the team message board (eg P0->TEAMPURPLE)
    and put these into HTML files in time increasing order.

    I thought of reading in one character of the log file at a time but
    then I will need lots of branch logic (if (c == 'T' && next == 'E')
    etc.) and that could get quite messy and confusing. I also
    thought of storing each token as a struct field but I haven't much
    experience with structs. I was also wondering about using fixed arrays
    of chars vs arrays of char pointers. Any ideas and especially code
    snippets would be appreciated to help me get started.

    Lisp 9000
     
    , Sep 13, 2007
    #1

  2. On 13 Sep, 23:55, "" <> wrote:

    > I am writing a log parser (beginner in C) and have some questions.
    >
    > There are 2 types of log files which are very similar:
    >
    > Type 1:
    > 117: SYSTEM->P0 Welcome to the server
    > 444: Z1->P0 Greetings
    > 812: SYSTEM->EVERYONE "Chumly" (P0) was kill #5 for "Dragon
    > Master" (Z1)
    > 954: P0->TEAMORANGE Help me!
    >
    > Type 2:
    > Welcome aboard Chumly! 00:03:40
    > 00:03:41: Qualax-5->TEAMORANGE Qualax-5 destroyed by "Dragon
    > Master" (Z1)
    > Blaster missed!!! 00:03:53
    > 00:04:06: P0->TEAMPURPLE Help Needed at Zorcon-8
    >
    > So in Type 1 there is always an integer indicating relative time
    > prefixing every line and in Type 2 there is always a 24-hour style
    > timestamp but sometimes it is prefixed and other times suffixed.
    >
    > I thought of using strtok() but that doesn't handle quoting so if I
    > encounter a message in " " it won't be able to handle it.
    >
    > Does anyone have any idea on the best way to tokenize this? My goal is
    > to extract only certain types of messages such as the ones between
    > players (eg. Z1->P0) and to the team message board (eg P0->TEAMPURPLE)
    > and put these into HTML files in time increasing order.
    >
    > I thought of reading in one character of the log file at a time but
    > then I will need lots of branch logic (if (c == 'T' && next == 'E')
    > etc.) and that could get quite messy and confusing. I also
    > thought of storing each token as a struct field but I haven't much
    > experience with structs. I was also wondering about using fixed arrays
    > of chars vs arrays of char pointers. Any ideas and especially code
    > snippets would be appreciated to help me get started.


    probably a bit off-topic to comp.lang.c so I've added
    comp.programming.

    Try googling "Recursive descent parser"


    --
    Nick Keighley

    Unpredictability may be exciting, but I don't believe it constitutes
    good programming practice.
    Richard Heathfield
     
    Nick Keighley, Sep 14, 2007
    #2

  3. Guest

    On Sep 14, 5:41 am, Nick Keighley <>
    wrote:
    > probably a bit off-topic to comp.lang.c so I've added
    > comp.programming.
    >
    > Try googling "Recursive descent parser"


    Interesting. Most of what I found was rather abstracted/formalistic
    and this method seems to be used in compiler construction. Can you
    show an example of this expressed in C code on one of the lines in my
    sample log please?

    Lisp 9000
     
    , Sep 14, 2007
    #3
  4. Mark Bluemel Guest

    wrote:
    > Hi,
    >
    > I am writing a log parser (beginner in C) and have some questions.
    >
    > There are 2 types of log files which are very similar:
    >
    > Type 1:
    > 117: SYSTEM->P0 Welcome to the server
    > 444: Z1->P0 Greetings
    > 812: SYSTEM->EVERYONE "Chumly" (P0) was kill #5 for "Dragon
    > Master" (Z1)
    > 954: P0->TEAMORANGE Help me!
    >
    > Type 2:
    > Welcome aboard Chumly! 00:03:40
    > 00:03:41: Qualax-5->TEAMORANGE Qualax-5 destroyed by "Dragon
    > Master" (Z1)
    > Blaster missed!!! 00:03:53
    > 00:04:06: P0->TEAMPURPLE Help Needed at Zorcon-8
    >
    > So in Type 1 there is always an integer indicating relative time
    > prefixing every line and in Type 2 there is always a 24-hour style
    > timestamp but sometimes it is prefixed and other times suffixed.
    >
    > I thought of using strtok() but that doesn't handle quoting so if I
    > encounter a message in " " it won't be able to handle it.
    >
    > Does anyone have any idea on the best way to tokenize this?


    What do you count as "tokens"? Until you define your requirement more
    clearly, it will be hard to meet.

    > My goal is
    > to extract only certain types of messages such as the ones between
    > players (eg. Z1->P0) and to the team message board (eg P0->TEAMPURPLE)
    > and put these into HTML files in time increasing order.


    OK. This is a little clearer. So on that basis, you are interested in
    all lines from type 1 log files, but only those lines in type 2 which
    start with a timestamp? On type 2 files, you could simply check
    whether the first 9 characters match the timestamp pattern (hh:mm:ss:).
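
    A minimal sketch of that check, assuming the line has already been read
    into a char buffer (with fgets(), say). The function name
    has_timestamp_prefix() is made up for illustration:

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns 1 if line begins with "hh:mm:ss:"
   (two digits and a colon, three times over; 9 characters in all),
   otherwise 0. */
int has_timestamp_prefix(const char *line)
{
    size_t i;

    if (strlen(line) < 9)
        return 0;
    for (i = 0; i < 9; i++) {
        if (i % 3 == 2) {                       /* positions 2, 5 and 8 */
            if (line[i] != ':')
                return 0;
        } else if (!isdigit((unsigned char)line[i])) {
            return 0;
        }
    }
    return 1;
}
```

    It only looks at the shape of the prefix; it does not check that the
    hours, minutes and seconds are in range.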

    Having selected your lines you can start after the timestamp or sequence
    number and look for the first non-space, being the start of the "real"
    data. strtok(), with varying delimiter specifications, could then be
    used to break that down into sender (delimited by '>', then remove the
    last character, perhaps), receiver (delimited by space) and text
    (delimited by '\0')...
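
    Roughly like this, as a sketch only. It assumes the pointer passed in
    already points at the first non-space character after the timestamp or
    sequence number, and that neither name contains '>' or a space. The name
    split_message() is invented for the example:

```c
#include <string.h>

/* Hypothetical helper: splits "SENDER->RECEIVER text..." in place.
   strtok() overwrites delimiters in body with '\0'; the three
   out-pointers end up pointing into body.
   Returns 1 on success, 0 if the line does not have that shape. */
int split_message(char *body, char **sender, char **receiver, char **text)
{
    char *tok;
    size_t len;

    tok = strtok(body, ">");                /* yields "SENDER-" */
    if (tok == NULL)
        return 0;
    len = strlen(tok);
    if (len < 2 || tok[len - 1] != '-')
        return 0;
    tok[len - 1] = '\0';                    /* drop the '-' of "->" */
    *sender = tok;

    tok = strtok(NULL, " ");                /* receiver, up to the first space */
    if (tok == NULL)
        return 0;
    *receiver = tok;

    tok = strtok(NULL, "");                 /* no delimiters: rest of the line */
    *text = (tok != NULL) ? tok : *receiver + strlen(*receiver);
    return 1;
}
```

    Because the last call uses an empty delimiter set, quotes and spaces in
    the message text come through untouched.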

    Would that work?
     
    Mark Bluemel, Sep 14, 2007
    #4
  5. Guest

    On Sep 14, 6:28 am, Mark Bluemel <> wrote:
    >
    > What do you count as "tokens"? Until you define your requirement more
    > clearly, it will be hard to meet.


    Hi Mark,

    By tokens I mean the message chunks I am interested in extracting and
    putting into HTML files.

    > OK. This is a little clearer. So on that basis, you are interested in
    > all lines from type 1 log files, but only those lines in type 2 which
    > start with a timestamp? So on type 2 files,


    Yes that's correct.

    > You could simply check
    > the first 9 characters for matching the timestamp pattern.


    Could you show me some code that would do that? I know how to read in
    a whole line but not sure how to individually check each character.

    > Having selected your lines you can start after the timestamp or sequence
    > number and look for the first non-space, being the start of the "real"
    > data. strtok(), with varying delimiter specifications, could then be
    > used to break that down into sender (delimited by '>', then remove the
    > last character, perhaps), receiver (delimited by space) and text
    > (delimited by '\0')...
    >
    > Would that work?


    That sounds good, but what about the message tokens that have quotes
    (" ") in them? This will cause strtok to not give me the desired
    results. For example a line such as:

    Z1->P0 Hello, what do you think of "The Leaves of Grass"?

    I also want to be sure the messages that are extracted and put into
    the HTML files will be in the same time increasing order so one can
    read the messages naturally as they occurred. I have plans to
    eventually add searching based on a player id and some other cool
    things but right now I would be happy to get a basic version working
    and build onto that such as using structs and dynamic memory
    allocation once I learn those concepts. Some programmers told me it's
    best to learn by doing rather than just sitting down and reading an
    entire textbook before starting to write C code. So that's what I'm
    trying to do. I appreciate all the help.

    Lisp 9000
     
    , Sep 15, 2007
    #5
  6. <> wrote in message
    news:...
    > On Sep 14, 5:41 am, Nick Keighley <>
    > wrote:
    > Interesting. Most of what I found was rather abstracted/formalistic
    > and this method seems to be used in compiler construction. Can you
    > show an example of this expressed in C code on one of the lines in my
    > sample log please?
    >

    There's a BASIC interpreter on my website. It is also available as a book for
    a nominal price, a bit more if you want it professionally printed.

    Basically you want to define a "token level". For your file a token will
    probably be either a word or a special symbol like ->, or a timestamp.

    So the heart is a gettoken() and match() system. gettoken() returns the
    current token, match() dispenses with it, and error-checks if it wasn't
    legal.

    Then you define the higher level constructs. For instance a speaker looks
    like

    identifier -> identifier..

    So you say

    void speaker(void)
    {
        identifier();
        match("->");
        identifier();
    }

    (This is pseudocode, you have to examine the identifiers to construct a
    speaker object).

    It is recursive because an identifier might be
    a word
    an identifier, a join symbol (eg colon), another identifier

    char *identifier()
    {
        answer = gettoken();
        match(answer);
        if (strcmp(gettoken(), ":") == 0)
        {
            match(":");
            answer = strcat(answer, identifier());
        }
        return answer;
    }

    (Pseudo code again, obviously you need a string handling system in place)
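
    For something that compiles, here is a small sketch in the same
    gettoken()/match() style, with one function per grammar rule. Every name
    in it (Parser, next_token, parse_word, parse_speaker) is invented for
    the example, tokens are capped at 63 characters, and nothing actually
    recurses because the "speaker" rule has no nested rule in it:

```c
#include <string.h>

typedef struct {
    const char *p;      /* cursor into the line being parsed */
    char tok[64];       /* current token; "" at end of input */
} Parser;

/* Load the next token into ps->tok: either "->" or a run of characters
   that stops at a space, at "->" or at the end of the line. */
static void next_token(Parser *ps)
{
    size_t n = 0;

    while (*ps->p == ' ')
        ps->p++;
    if (strncmp(ps->p, "->", 2) == 0) {
        strcpy(ps->tok, "->");
        ps->p += 2;
        return;
    }
    while (*ps->p != '\0' && *ps->p != ' ' &&
           strncmp(ps->p, "->", 2) != 0 && n < sizeof ps->tok - 1)
        ps->tok[n++] = *ps->p++;
    ps->tok[n] = '\0';
}

/* Consume the current token if it equals want; returns 1 on success. */
static int match(Parser *ps, const char *want)
{
    if (strcmp(ps->tok, want) != 0)
        return 0;
    next_token(ps);
    return 1;
}

/* word := any non-empty token other than "->" */
static int parse_word(Parser *ps, char *out, size_t outsz)
{
    if (ps->tok[0] == '\0' || strcmp(ps->tok, "->") == 0)
        return 0;
    strncpy(out, ps->tok, outsz - 1);
    out[outsz - 1] = '\0';
    next_token(ps);
    return 1;
}

/* speaker := word "->" word
   Copies the two names into from and to (each sz bytes).
   Returns 1 if the line starts with a speaker, else 0. */
int parse_speaker(const char *line, char *from, char *to, size_t sz)
{
    Parser ps;

    ps.p = line;
    next_token(&ps);                    /* prime the first token */
    return parse_word(&ps, from, sz)
        && match(&ps, "->")
        && parse_word(&ps, to, sz);
}
```

    The line passed in should start after the timestamp or sequence number.
    A timestamp rule could be added the same way, as another function that
    checks the current token and advances.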

    --
    Free games and programming goodies.
    http://www.personal.leeds.ac.uk/~bgy1mm
     
    Malcolm McLean, Sep 15, 2007
    #6
  7. wrote:

    > On Sep 14, 6:28 am, Mark Bluemel <> wrote:
    >
    >> OK. This is a little clearer. So on that basis, you are interested in
    >> all lines from type 1 log files, but only those lines in type 2 which
    >> start with a timestamp? So on type 2 files,

    >
    > Yes that's correct.
    >
    >> You could simply check
    >> the first 9 characters for matching the timestamp pattern.

    >
    > Could you show me some code that would do that? I know how to read in
    > a whole line but not sure how to individually check each character.


    You should read the recent thread titled "Duration Conversion" for some
    examples of how you can parse and validate a date string.
    One example that can readily be applied to your situation goes like
    this:

    /* hr, min and sec are ints; fscanf needs their addresses */
    ret = fscanf(logfile, "%2d:%2d:%2d:", &hr, &min, &sec);
    if (ret == EOF)
    {
        /* End of File reached */
    }
    else if (ret != 3)
    {
        /* Not a timestamp at the start of the line. Discard it */
        fscanf(logfile, "%*[^\n]%*c");
    }
    else
    {
        /* We have an interesting log entry. Try to parse the rest. */
    }

    >
    >> Having selected your lines you can start after the timestamp or
    >> sequence number and look for the first non-space, being the start of
    >> the "real" data. strtok(), with varying delimiter specifications,
    >> could then be used to break that down into sender (delimited by '>',
    >> then remove the last character, perhaps), receiver (delimited by
    >> space) and text (delimited by '\0')...
    >>
    >> Would that work?

    >
    > That sounds good, but what about the message tokens that have quotes
    > (" ") in them? This will cause strtok to not give me the desired
    > results.


    The quotes could only cause a problem if they are used in the sender or
    receiver part to escape a delimiter character.
    With the example lines you have given so far, the described parsing
    method with strtok does not have any trouble with the quotes at all.

    If log lines like this are possible
    18:42:55 "SEN>DER"->"My Receiver" Some test, with " character
    then you have to account for the use of quotes.
    This can be simply done with a check if the first character of the
    sender or receiver is a quote character. If so, search for the ending
    quote, otherwise proceed as normal.
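
    As a sketch of that check: the version below uses strchr() and strstr()
    instead of strtok(), which is my substitution to keep the quoted and
    unquoted cases in one place. The names take_name() and split_quoted()
    are made up, and it assumes a quoted name never contains a quote itself:

```c
#include <string.h>

/* Hypothetical helper: reads one name starting at s. A name that starts
   with '"' runs to the closing quote, which must be followed by stop;
   any other name runs to the first occurrence of stop. The name is
   terminated in place. Returns a pointer just past stop, or NULL. */
static char *take_name(char *s, const char *stop, char **name)
{
    char *end;

    if (*s == '"') {
        *name = s + 1;
        end = strchr(s + 1, '"');
        if (end == NULL)
            return NULL;                    /* no closing quote */
        *end++ = '\0';
        if (strncmp(end, stop, strlen(stop)) != 0)
            return NULL;                    /* stop must follow the quote */
    } else {
        *name = s;
        end = strstr(s, stop);
        if (end == NULL)
            return NULL;
        *end = '\0';
    }
    return end + strlen(stop);
}

/* Splits "SENDER->RECEIVER text" in place, where either name may be
   quoted. Returns 1 on success, 0 if the line does not have that shape
   (including a receiver with no text after it). */
int split_quoted(char *body, char **sender, char **receiver, char **text)
{
    char *rest = take_name(body, "->", sender);

    if (rest == NULL)
        return 0;
    rest = take_name(rest, " ", receiver);
    if (rest == NULL)
        return 0;
    *text = rest;
    return 1;
}
```

    Quotes in the message text are left alone, since only the sender and
    receiver are examined.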

    > For example a line such as:
    >
    > Z1->P0 Hello, what do you think of "The Leaves of Grass"?
    >
    > I also want to be sure the messages that are extracted and put into
    > the HTML files will be in the same time increasing order so one can
    > read the messages naturally as they occurred.


    Before writing the lines, you could just sort them on timestamp.
    When presenting lines from two log files in different formats, I would
    transform all timestamps to a common format, to make it easier to read.
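
    One possible way to do both, sketched with qsort(). The struct layout
    and function names are invented here. Treating the Type 1 integer as a
    number of seconds is purely an assumption; the thread does not say what
    unit it is in:

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative record for one extracted message. */
struct entry {
    long when;          /* time in seconds: the common format (assumed unit) */
    size_t seq;         /* order in which the line was read, for ties */
    const char *text;   /* the message itself */
};

/* Converts "hh:mm:ss" to seconds; returns -1 if the string does not
   start with three colon-separated numbers. */
long timestamp_to_seconds(const char *ts)
{
    int h, m, s;

    if (sscanf(ts, "%2d:%2d:%2d", &h, &m, &s) != 3)
        return -1;
    return h * 3600L + m * 60L + s;
}

/* qsort() comparison: ascending by time, then by reading order,
   because qsort() itself does not promise to keep equal items in order. */
static int by_time(const void *a, const void *b)
{
    const struct entry *x = a;
    const struct entry *y = b;

    if (x->when != y->when)
        return (x->when > y->when) - (x->when < y->when);
    return (x->seq > y->seq) - (x->seq < y->seq);
}

/* Sorts n entries into time increasing order. */
void sort_entries(struct entry *v, size_t n)
{
    qsort(v, n, sizeof v[0], by_time);
}
```

    If each log file is already in time order, the sort only matters when
    lines from more than one file are combined.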

    >
    > Lisp 9000


    Bart v Ingen Schenau
    --
    a.c.l.l.c-c++ FAQ: http://www.comeaucomputing.com/learn/faq
    c.l.c FAQ: http://www.eskimo.com/~scs/C-faq/top.html
    c.l.c++ FAQ: http://www.parashift.com/c++-faq-lite/
     
    Bart van Ingen Schenau, Sep 15, 2007
    #7
