Parsing?

lisp9000 · Sep 13, 2007

Hi,

I am writing a log parser (beginner in C) and have some questions.

There are 2 types of log files which are very similar:

Type 1:
117: SYSTEM->P0 Welcome to the server
444: Z1->P0 Greetings
812: SYSTEM->EVERYONE "Chumly" (P0) was kill #5 for "Dragon
Master" (Z1)
954: P0->TEAMORANGE Help me!

Type 2:
Welcome aboard Chumly! 00:03:40
00:03:41: Qualax-5->TEAMORANGE Qualax-5 destroyed by "Dragon
Master" (Z1)
Blaster missed!!! 00:03:53
00:04:06: P0->TEAMPURPLE Help Needed at Zorcon-8

So in Type 1 there is always an integer indicating relative time
prefixing every line and in Type 2 there is always a 24-hour style
timestamp but sometimes it is prefixed and other times suffixed.

I thought of using strtok() but that doesn't handle quoting so if I
encounter a message in " " it won't be able to handle it.

Does anyone have any idea on the best way to tokenize this? My goal is
to extract only certain types of messages such as the ones between
players (eg. Z1->P0) and to the team message board (eg P0->TEAMPURPLE)
and put these into HTML files in time increasing order.

I thought of reading in one character of the log file at a time but
then I will need lots of branch logic (if( char = 'T') && (char-next
== 'E') etc..) and that could get quite messy and confusing. I also
thought of storing each token as a struct field but I haven't much
experience with structs. I was also wondering about using fixed arrays
of chars vs arrays of char pointers. Any ideas and especially code
snippets would be appreciated to help me get started.

Lisp 9000

Nick Keighley · Sep 14, 2007

probably a bit off-topic to comp.lang.c so I've added
comp.programming.

Try googling "Recursive descent parser"

--
Nick Keighley

Unpredictability may be exciting, but I don't believe it constitutes
good programming practice.
Richard Heathfield

lisp9000 · Sep 14, 2007

probably a bit off-topic to comp.lang.c so I've added
comp.programming.

Try googling "Recursive descent parser"

Interesting. Most of what I found was rather abstracted/formalistic
and this method seems to be used in compiler construction. Can you
show an example of this expressed in C code on one of the lines in my
sample log please?

Lisp 9000

Mark Bluemel · Sep 14, 2007

Does anyone have any idea on the best way to tokenize this?

What do you count as "tokens"? Until you define your requirement more
clearly, it will be hard to meet.

My goal is
to extract only certain types of messages such as the ones between
players (eg. Z1->P0) and to the team message board (eg P0->TEAMPURPLE)
and put these into HTML files in time increasing order.

OK. This is a little clearer. So on that basis, you are interested in
all lines from type 1 log files, but only those lines in type 2 which
start with a timestamp? So on type 2 files, you could simply check
the first 9 characters for matching the timestamp pattern.
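
A minimal sketch of that 9-character check, assuming the prefix always has the exact shape "hh:mm:ss:" as in the sample lines. The function name starts_with_timestamp is made up for this illustration and does not come from the thread.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if line begins with "hh:mm:ss:" (two digits, colon, three
   times over), otherwise 0. The name is hypothetical. */
int starts_with_timestamp(const char *line)
{
    static const char pattern[] = "dd:dd:dd:";   /* 'd' stands for any digit */
    size_t i;

    for (i = 0; pattern[i] != '\0'; i++) {
        if (line[i] == '\0')
            return 0;                  /* line is shorter than 9 characters */
        if (pattern[i] == 'd') {
            if (!isdigit((unsigned char)line[i]))
                return 0;              /* expected a digit here */
        } else if (line[i] != pattern[i]) {
            return 0;                  /* expected a colon here */
        }
    }
    return 1;
}
```

Each line read with fgets() could be passed to this function, and lines for which it returns 0 (such as "Welcome aboard Chumly! 00:03:40") skipped.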

Having selected your lines you can start after the timestamp or sequence
number and look for the first non-space, being the start of the "real"
data. strtok(), with varying delimiter specifications, could then be
used to break that down into sender (delimited by '>', then remove the
last character, perhaps), receiver (delimited by space) and text
(delimited by '\0')...

Would that work?
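
The strtok() steps described above might look roughly like this. It is only a sketch: struct message and split_message are names invented here, and the code assumes the caller has already skipped the timestamp or sequence number and passes a writable buffer.

```c
#include <string.h>

/* Hypothetical container for the three parts of one log message. */
struct message {
    char *sender;
    char *receiver;
    char *text;
};

/* Splits "SENDER->RECEIVER text" in place. body must be writable and must
   point past the timestamp or sequence number. Returns 1 on success, 0 if
   the line does not have the SENDER->RECEIVER shape. */
int split_message(char *body, struct message *out)
{
    char *sender, *receiver, *text;
    size_t len;

    body += strspn(body, " \t");          /* skip to the first non-space */

    sender = strtok(body, ">");           /* sender, still ending in '-' */
    if (sender == NULL)
        return 0;
    len = strlen(sender);
    if (len < 2 || sender[len - 1] != '-')
        return 0;                         /* no "->" in this line */
    sender[len - 1] = '\0';               /* remove the trailing '-' */

    receiver = strtok(NULL, " ");         /* receiver ends at the next space */
    if (receiver == NULL)
        return 0;

    text = strtok(NULL, "");              /* empty delimiter set: rest of line */

    out->sender = sender;
    out->receiver = receiver;
    out->text = text ? text : "";
    return 1;
}
```

Because the text part is taken as "everything that is left", quotes inside the message are simply copied along with the rest of it.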

lisp9000 · Sep 15, 2007

What do you count as "tokens"? Until you define your requirement more
clearly, it will be hard to meet.

Hi Mark,

The tokens are the message chunks I am interested in extracting and
putting into HTML files.

OK. This is a little clearer. So on that basis, you are interested in
all lines from type 1 log files, but only those lines in type 2 which
start with a timestamp? So on type 2 files,

Yes that's correct.

You could simply check
the first 9 characters for matching the timestamp pattern.

Could you show me some code that would do that? I know how to read in
a whole line but not sure how to individually check each character.

Having selected your lines you can start after the timestamp or sequence
number and look for the first non-space, being the start of the "real"
data. strtok(), with varying delimiter specifications, could then be
used to break that down into sender (delimited by '>', then remove the
last character, perhaps), receiver (delimited by space) and text
(delimited by '\0')...

Would that work?

That sounds good, but what about the message tokens that have quotes
(" ") in them? This will cause strtok to not give me the desired
results. For example a line such as:

Z1->P0 Hello, what do you think of "The Leaves of Grass"?

I also want to be sure the messages that are extracted and put into
the HTML files will be in the same time increasing order so one can
read the messages naturally as they occurred. I have plans to
eventually add searching based on a player id and some other cool
things but right now I would be happy to get a basic version working
and build onto that such as using structs and dynamic memory
allocation once I learn those concepts. Some programmers told me it's
best to learn by doing rather than just sitting down and reading an
entire textbook before starting to write C code. So that's what I'm
trying to do. I appreciate all the help.

Lisp 9000

Malcolm McLean · Sep 15, 2007

On Sep 14, 5:41 am, Nick Keighley wrote:
Interesting. Most of what I found was rather abstracted/formalistic
and this method seems to be used in compiler construction. Can you
show an example of this expressed in C code on one of the lines in my
sample log please?

There's a Basic interpreter on my website. It is also available as a book
for a nominal price, a bit more if you want it professionally printed.

Basically you want to define a "token level". For your file a token will
probably be either a word or a special symbol like ->, or a timestamp.

So the heart is a gettoken() and match() system. gettoken() returns the
current token; match() dispenses with it, and error-checks if it wasn't
legal.

Then you define the higher level constructs. For instance a speaker looks
like

identifier -> identifier..

So you say

speaker()
{
    identifier();
    match("->");
    identifier();
}

(This is pseudocode, you have to examine the identifiers to construct a
speaker object).

It is recursive because an identifier might be
a word
an identifier, a join symbol (eg colon), another identifier

char *identifier()
{
    answer = gettoken();
    match(answer);
    if (strcmp(gettoken(), ":") == 0)   /* compare contents, not pointers */
    {
        match(":");
        answer = strcat(answer, identifier());   /* needs room in answer */
    }
    return answer;
}

(Pseudo code again, obviously you need a string handling system in place)
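
To make the pseudocode above concrete, here is one possible runnable sketch of the gettoken()/match() idea for the "identifier -> identifier" rule. The token rules (a token is "->", ":" or a run of other non-space characters), the fixed-size buffers, and the names advance and parse_speaker are assumptions made for this illustration; they are not taken from the Basic interpreter mentioned earlier.

```c
#include <ctype.h>
#include <string.h>

#define MAXTOK 64

static const char *input;      /* remaining unparsed text */
static char token[MAXTOK];     /* current token, "" at end of input */

/* Read the next token: "->", ":" or a run of other non-space characters. */
static void advance(void)
{
    size_t n = 0;

    while (isspace((unsigned char)*input))
        input++;
    if (input[0] == '-' && input[1] == '>') {
        strcpy(token, "->");
        input += 2;
        return;
    }
    if (*input == ':') {
        strcpy(token, ":");
        input++;
        return;
    }
    while (*input != '\0' && !isspace((unsigned char)*input) &&
           *input != ':' && !(input[0] == '-' && input[1] == '>') &&
           n < MAXTOK - 1)
        token[n++] = *input++;
    token[n] = '\0';
}

/* gettoken() returns the current token without consuming it. */
static const char *gettoken(void)
{
    return token;
}

/* match() consumes the current token if it is the expected one. */
static int match(const char *expected)
{
    if (strcmp(token, expected) != 0)
        return 0;
    advance();
    return 1;
}

/* identifier := word | word ":" identifier
   Appends the identifier to out, which must hold a string already. */
static int identifier(char *out, size_t size)
{
    const char *t = gettoken();

    if (t[0] == '\0' || strcmp(t, "->") == 0 || strcmp(t, ":") == 0)
        return 0;                            /* not a word */
    if (strlen(out) + strlen(t) + 2 > size)
        return 0;                            /* no room for word, ':' and NUL */
    strcat(out, t);
    advance();
    if (strcmp(gettoken(), ":") == 0) {
        match(":");
        strcat(out, ":");
        return identifier(out, size);        /* the recursive step */
    }
    return 1;
}

/* speaker := identifier "->" identifier */
int parse_speaker(const char *line, char *from, char *to, size_t size)
{
    input = line;
    advance();
    from[0] = to[0] = '\0';
    return identifier(from, size) && match("->") && identifier(to, size);
}
```

Called on "Z1->P0 Greetings" with two 32-byte buffers, parse_speaker() should return 1 and leave "Z1" and "P0" in them; on a line such as "Welcome aboard Chumly!" it returns 0 because no "->" follows the first word.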

Bart van Ingen Schenau · Sep 15, 2007

Yes that's correct.

Could you show me some code that would do that? I know how to read in
a whole line but not sure how to individually check each character.

You should read the recent thread titled "Duration Conversion" for some
examples of how you can parse and validate a date string.
One example that can readily be applied to your situation goes like
this:

/* hr, min and sec are ints; fscanf needs their addresses */
ret = fscanf(logfile, "%2d:%2d:%2d:", &hr, &min, &sec);
if (ret == EOF)
{
    /* End of File reached */
}
else if (ret != 3)
{
    /* Not a timestamp at the start of the line. Discard it */
    fscanf(logfile, "%*[^\n]%*c");
}
else
{
    /* We have an interesting log entry. Try to parse the rest. */
}

That sounds good, but what about the message tokens that have quotes
(" ") in them? This will cause strtok to not give me the desired
results.

The quotes could only cause a problem if they are used in the sender or
receiver part to escape a delimiter character.
With the example lines you have given so far, the described parsing
method with strtok does not have any trouble with the quotes at all.

If log lines like this are possible
18:42:55 "SEN>DER"->"My Receiver" Some test, with " character
then you have to account for the use of quotes.
This can be simply done with a check if the first character of the
sender or receiver is a quote character. If so, search for the ending
quote, otherwise proceed as normal.
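
As a rough sketch of that check, assuming quoted names never contain a '"' themselves (the thread does not say how one would be escaped): read_field and parse_names are hypothetical helpers written for this example.

```c
#include <string.h>

/* Copies one field from *pp into out and moves *pp past it.
   A field that opens with '"' runs to the closing '"'; any other field
   runs up to the first occurrence of the string stop (or to the end).
   Returns 1 on success, 0 on an unterminated quote or too-small buffer. */
int read_field(const char **pp, const char *stop, char *out, size_t size)
{
    const char *p = *pp;
    const char *end;
    size_t len;

    if (*p == '"') {
        end = strchr(p + 1, '"');         /* search for the ending quote */
        if (end == NULL)
            return 0;
        p++;                              /* skip the opening quote */
        len = (size_t)(end - p);
        *pp = end + 1;                    /* continue after the closing quote */
    } else {
        end = strstr(p, stop);            /* proceed as normal */
        len = end ? (size_t)(end - p) : strlen(p);
        *pp = p + len;
    }
    if (len >= size)
        return 0;
    memcpy(out, p, len);
    out[len] = '\0';
    return 1;
}

/* Parses SENDER->RECEIVER at the start of body; either name may be quoted. */
int parse_names(const char *body, char *sender, char *receiver, size_t size)
{
    const char *p = body;

    if (!read_field(&p, "->", sender, size))
        return 0;
    if (strncmp(p, "->", 2) != 0)
        return 0;                         /* the arrow must follow the sender */
    p += 2;
    return read_field(&p, " ", receiver, size);
}
```

Unlike strtok(), this version copies the names out and leaves the original line untouched, which costs two extra buffers but keeps the line available for printing.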

For example a line such as:

Z1->P0 Hello, what do you think of "The Leaves of Grass"?

I also want to be sure the messages that are extracted and put into
the HTML files will be in the same time increasing order so one can
read the messages naturally as they occurred.

Before writing the lines, you could just sort them on timestamp.
When presenting lines from two log files in different formats, I would
transform all timestamps to a common format, to make it easier to read.
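
One way to sketch the sort step is with qsort(), using seconds as the common time format. struct entry and the function names below are invented for this example. How the Type 1 integers map to seconds is not stated in the thread, so only the hh:mm:ss conversion is shown.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record for one extracted log line. */
struct entry {
    long when;          /* time in seconds (assumed common format) */
    size_t seq;         /* position in the input, used to break ties */
    const char *text;   /* the message itself */
};

/* Converts "hh:mm:ss" at the start of s to seconds, or returns -1. */
long timestamp_to_seconds(const char *s)
{
    int h, m, sec;

    if (sscanf(s, "%2d:%2d:%2d", &h, &m, &sec) != 3)
        return -1;
    if (h < 0 || m < 0 || m > 59 || sec < 0 || sec > 59)
        return -1;
    return h * 3600L + m * 60L + sec;
}

/* Orders by time, then by input position so equal times keep their order. */
static int compare_entries(const void *a, const void *b)
{
    const struct entry *x = a;
    const struct entry *y = b;

    if (x->when != y->when)
        return (x->when > y->when) - (x->when < y->when);
    return (x->seq > y->seq) - (x->seq < y->seq);
}

void sort_entries(struct entry *entries, size_t count)
{
    qsort(entries, count, sizeof entries[0], compare_entries);
}
```

The seq field is there because qsort() is not guaranteed to be stable; without it, two messages with the same timestamp could come out in either order.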

Bart v Ingen Schenau
