Parsing Log records with regular expressions

Kris K.

I have a text-based log file whose records come in two formats, as
shown here:
`
A|B|C|D\n
A|B|C|D|E\n
\n
Exception\n
\n
\tstack trace line1\n
\tstack trace line2\n
\tstack trace line3\n
\n
A|B|C|D\n`

The first form (A|B|C|D) has statically defined columns delimited by a
pipe symbol. The second form ends with an extra column "E", which marks
an exception record. For an exception record, the information about the
exception follows. It starts with a line "Exception", followed by a
blank line and then the stack trace on multiple lines. Each stack trace
element starts with a tab.

I am parsing this file with Ruby. Currently I read it line by line and
build the log records as I go. This works fine.

I am wondering whether I could rely on regular expressions instead of
reading line by line: I could read a chunk of the file and apply two
regular expressions to it. If one matches, I process the record and
move on to the next one. If neither matches, I keep combining chunks
until I find a match. Is this approach worth considering? Is it doable
with Ruby? If there are any open source projects that do something like
this, can someone point me to them? Also, any thoughts on which approach
is more efficient, and why? I appreciate any feedback.
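For illustration, the chunk-and-match idea described above could be sketched roughly as follows. This is a minimal sketch under assumptions, not a tested implementation: `RECORD_RE`, `each_record`, and the 4096-byte default chunk size are made-up names and values, a single pattern with two alternatives stands in for the two regular expressions, and the layout is assumed to be exactly that of the sample (four columns, an optional `|E`, then a blank line, `Exception`, a blank line, tab-indented stack trace lines, and a closing blank line).

```ruby
require "stringio"

# Hypothetical pattern for one complete record at the start of the buffer.
# Assumes exactly four columns, as in the sample.
RECORD_RE = /
  \A\n*                                        # tolerate stray blank lines
  ([^|\n]*)\|([^|\n]*)\|([^|\n]*)\|([^|\n]*)   # the four columns
  (?:
    \n                                         # plain record ends here
  |
    \|E\n\nException\n\n                       # exception marker and header
    ((?:\t[^\n]*\n)+)                          # tab-indented stack trace lines
    \n                                         # blank line closing the trace
  )
/x

# Reads io in chunks and yields one hash per complete record.
# Note: IO#read(length) returns binary-encoded strings; re-tag them with
# force_encoding if the log contains non-ASCII text.
def each_record(io, chunk_size = 4096)
  buffer = +""
  eof = false
  until eof
    chunk = io.read(chunk_size)
    if chunk
      buffer << chunk
    else
      eof = true
      # Lets a final record that lacks its trailing newline still match.
      buffer << "\n" unless buffer.empty?
    end

    # Consume every complete record currently in the buffer.
    while (m = RECORD_RE.match(buffer))
      yield({ fields: m.captures[0, 4], trace: m[5] })
      buffer = m.post_match
    end
  end

  warn "Unparsed trailing data: #{buffer.inspect}" unless buffer.strip.empty?
end
```

It could be driven with something like `File.open("app.log") { |f| each_record(f) { |rec| p rec } }`, where the file name is a placeholder. A record that does not fit the assumed layout never matches, so the buffer keeps growing until the end of input; that bookkeeping is the main cost of this approach compared with reading line by line.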
 
Robert Klemme

I am wondering if I could rely on regular expressions to do it instead
of reading line by line - I could read a chunk of the file and apply two
regular expressions to see if there is a match and if I find the match
process the record and move to the next record. If there is no match,
then I combine multiple chunks until I find a match. Is this approach a
valid consideration?

The question is: why do you want to do that? Line-based parsing is
simple and has the advantage that you always get a complete record.
Note also that Ruby uses buffered reading underneath, in case you are
wondering about IO efficiency.

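As a rough illustration of how little code the line-based approach needs, the sketch below groups lines into records with `Enumerable#slice_before`. It rests on one assumption that holds for the sample but may not hold in general: a line containing three or more `|` characters starts a new record. The names `record_start?` and `each_raw_record` are made up for this example.

```ruby
require "stringio"

# Assumption: a line with three or more pipes is a record header.
def record_start?(line)
  line.count("|") >= 3
end

# Yields each record as an array of its raw lines, header line first.
# Lines are read lazily, one record at a time.
def each_raw_record(io, &block)
  io.each_line.slice_before { |line| record_start?(line) }.each(&block)
end
```

`File.foreach(path)` without a block returns the same kind of line enumerator, so it could replace `io.each_line` when reading from a file directly.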
Is this doable with Ruby?

Yes, certainly.
If there are any open source
projects, that do something like this, can someone point me to it? Also
any thoughts which one is more efficient and why? Appreciate any
feedback.

My implementation of this would use a single regular expression with
an optional part for the "|E". That way you need to match only once
and you can immediately distinguish record types.

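# untested
Record = Struct.new :a, :b, :c, :d, :e

def parse
  # State lives inside the method: locals defined outside a def are not
  # visible in it, and block locals would be reset on every iteration.
  last = nil   # most recent record, held back until the next one starts
  ex = false   # truthy while the current record collects exception lines

  ARGF.each do |line|
    # The fourth group excludes "\n" so it does not swallow the line end.
    if %r{^([^|]*)\|([^|]*)\|([^|]*)\|([^|\n]*)(\|E)?} =~ line
      ex = $5
      r = Record.new $1, $2, $3, $4
      r.e = "" if ex

      yield last if last

      last = r
    elsif ex
      # Exception header, blank lines and stack trace lines.
      last.e << line
    else
      warn "Dunno what to do with line #{line.inspect}"
    end
  end

  yield last if last
end

parse do |rec|
  p rec
end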

Cheers

robert
 
Kris K.

Thanks for the prompt response. I appreciate your taking the time to
respond with sample code. I have just started on this as a pet project
to learn Ruby. The task is to build a log analysis web application. The
log file is not a standard one, in the sense that it is dynamically
constructed and some columns are optional, but all of them are
separated by the '|' character. Initially I am starting with reading a
static file, but at some point my plan is to use SSH to read the live
file contents and provide realtime information. So I was considering
what other alternatives might work well in the realtime scenario too.
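For the realtime scenario, one option is to keep the line-based logic but make it incremental, so that lines can be fed in as they arrive. The following is a minimal sketch under assumptions: the class name `StreamingLogParser` and its `feed` and `flush` methods are invented for illustration, a line with four or more `|`-separated columns is assumed to start a record, and the source of the lines is deliberately left out.

```ruby
# Hypothetical incremental parser: call feed once per incoming line.
class StreamingLogParser
  Record = Struct.new(:fields, :trace)

  def initialize(&on_record)
    @on_record = on_record  # called once per completed record
    @pending = nil          # exception record still collecting its trace
  end

  def feed(line)
    parts = line.chomp.split("|", -1)
    if parts.size >= 4
      # A new header closes any exception record that was still open.
      flush
      if parts.size == 5 && parts.last == "E"
        @pending = Record.new(parts[0, 4], +"")
      else
        # Plain records are complete as soon as their line arrives.
        @on_record.call(Record.new(parts, nil))
      end
    elsif @pending
      # "Exception" header, blank lines and stack trace lines.
      @pending.trace << line
    else
      warn "Ignoring unexpected line #{line.inspect}"
    end
  end

  # Emits a held-back exception record, e.g. when the stream ends.
  def flush
    @on_record.call(@pending) if @pending
    @pending = nil
  end
end
```

Plain records are reported immediately, while an exception record is only reported once the next header arrives or `flush` is called, since the end of a stack trace cannot be known earlier. The lines could come from a subprocess such as `IO.popen(["ssh", host, "tail", "-F", path])`, with `host` and `path` as placeholders, feeding each line to `parser.feed`.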

 
