Searching for a very fast string parser

|MKSM|

Hello,

I want to parse a log file containing several lines in the same format.
My log files are about 50 MB each (350k lines), so I need something
quite fast. The current (and fastest) solution I came up with uses
StringScanner.

I save what I get into variables and then pass them all into a Struct
I created. Each new struct is then appended to an Array that holds all
the structs.


Here's my test code:

require 'strscan'

a = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: 80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"

s = StringScanner.new(a)
time = s.scan(/\d+\.\d+/)
s.pos += 23
rule_no = s.scan(/\d+/)
s.skip(/[\d\D]*?\s/)
stat = s.scan(/\w+/)
s.skip(/.*on\s/)
interface = s.scan(/\w+\:/)
s.skip(/\D+?\s/)
out_ip = s.scan(/(\d+\.){3}\d{0,3}/)
s.pos += 1
out_port = s.scan(/\d+/)
s.skip(/\D+/)
in_ip = s.scan(/(\d+\.){3}\d{0,3}/)
s.pos += 1
in_port = s.scan(/\d+/)
s.pos += 2
proto = s.scan(/\w+/)
proto
s.pos += 1

Running that in a loop 10,000 times takes about 0.6 seconds to
complete. Is there a better or faster way of doing it?

Regards,

Ricardo.
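
Traced against the sample line, the snippet above does not fill every variable. `s.pos += 23` advances the scan pointer from offset 17 to 40, past the rule number (which starts at offset 23, so `s.pos = 23` may have been intended), and `s.skip(/\D+?\s/)` cannot match at the single whitespace character before the first IP address. The following is a minimal sketch of a variant that extracts all nine fields and stores them in a Struct, as described in the post. `LogEntry` and `scan_log_line` are hypothetical names, and replacing the fixed offsets with `skip` patterns is an assumption about what was intended, not the poster's final code.

```ruby
require 'strscan'

# Hypothetical record type; field names follow the variables in the post.
LogEntry = Struct.new(:time, :rule_no, :stat, :interface,
                      :out_ip, :out_port, :in_ip, :in_port, :proto)

# Parse one log line of the format shown above into a LogEntry.
# Fields that fail to match come back as nil.
def scan_log_line(line)
  s = StringScanner.new(line)
  time      = s.scan(/\d+\.\d+/)
  s.skip(/\s+rule\s+/)               # instead of a fixed offset
  rule_no   = s.scan(/\d+/)
  s.skip(/\S*\s+/)                   # rest of "19/0(match):" plus the space
  stat      = s.scan(/\w+/)
  s.skip(/.*?\son\s+/)               # up to and including " on "
  interface = s.scan(/\w+:/)
  s.skip(/\s+/)
  out_ip    = s.scan(/(?:\d+\.){3}\d+/)
  s.skip(/\./)                       # dot between address and port
  out_port  = s.scan(/\d+/)
  s.skip(/\D+/)                      # " > "
  in_ip     = s.scan(/(?:\d+\.){3}\d+/)
  s.skip(/\./)
  in_port   = s.scan(/\d+/)
  s.skip(/:\s+/)
  proto     = s.scan(/\w+/)
  LogEntry.new(time, rule_no, stat, interface,
               out_ip, out_port, in_ip, in_port, proto)
end

line = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
       "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"
entries = []                         # the Array that holds all structs
entries << scan_log_line(line)
p entries.first.to_a
```

For a real file, the same method could be called once per line, for example inside `File.foreach`, appending each result to `entries`.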
 
ara.t.howard

Can you put a demo log file on the web somewhere?

-a
 
|MKSM|

I'm sorry, the log file I have comes from a live firewall. I'd rather
not release it.

The log consists only of lines like the one I used in the code.

Regards,

Ricardo
 
Robert Klemme

Caleb said:
OK, so first off, your sample implementation seemed to have several
bugs in it. After fixing those, I thought you might be able to save
some time by glomming all the regexp's together, obviating the need
for StringScanner altogether. However, that doesn't seem to have
actually made any difference...

I don't buy this. A single plain RX is usually faster than a more complex
solution. Even on a machine under constant high load (I had no other
machine available at the moment) I get a significant difference (north of 6%):
15:22:14 [source]: /c/temp/ruby/logscan.rb
Rehearsal ------------------------------------------------
strscan        5.969000   0.000000   5.969000 (  6.095000)
rx             5.828000   0.000000   5.828000 (  5.951000)
rx with conv   5.860000   0.000000   5.860000 (  5.922000)
-------------------------------------- total: 17.657000sec

                   user     system      total        real
strscan        5.953000   0.000000   5.953000 (  6.043000)
rx             5.547000   0.000000   5.547000 (  5.747000)
rx with conv   5.765000   0.000000   5.765000 (  5.924000)

(script attached)

Caleb said:
if anything it seems to have been a
little slower. I don't know why. And the great big long Regexp is
considerably harder to read.

Using %r{} and /x helps readability a great deal (see script).

Kind regards

robert
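
The attached script is not reproduced in this thread. As a rough illustration of the %r{} and /x style mentioned above, here is a sketch of one extended-mode pattern for the sample line from the original post. `LOG_LINE_RX` and `match_log_line` are hypothetical names, and the pattern is an assumption about the log format based on that single line, not Robert's actual script.

```ruby
# Extended mode (/x) ignores unescaped whitespace in the pattern and
# allows comments, so each field can sit on its own line.
LOG_LINE_RX = %r{
  \A
  (\d+\.\d+)                 \s+   # timestamp
  rule \s+ (\d+) \S*         \s+   # rule number, rest of the token ignored
  (\w+)                      \s+   # action, for example pass
  .*? \s on                  \s+   # skip ahead to the interface
  (\w+:)                     \s+   # interface, colon kept as in the post
  ((?:\d+\.){3}\d+) \. (\d+)       # first address and port
  \s+ > \s+
  ((?:\d+\.){3}\d+) \. (\d+)       # second address and port
  : \s+ (\w+)                      # protocol
}x

# Returns the nine captured fields as an Array of Strings,
# or nil when the line does not match.
def match_log_line(line)
  m = LOG_LINE_RX.match(line)
  m && m.captures
end

p match_log_line("1140908573.050732 rule 19/0(match): pass unkn(255) on " \
                 "sis1: 80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64")
```

The captures come back in the same order as the variables in the original post: time, rule_no, stat, interface, out_ip, out_port, in_ip, in_port, proto.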
 
Robert Klemme

I redid the test on an idle Linux machine with Ruby 1.8.1 and the
StringScanner is actually faster:

[root@fox tmp]# ./logscan.rb
Rehearsal ------------------------------------------------
strscan        2.990000   0.000000   2.990000 (  2.991096)
rx             4.870000   0.000000   4.870000 (  4.868536)
rx with        4.280000   0.010000   4.290000 (  4.284334)
rx with conv   5.240000   0.000000   5.240000 (  5.459702)
-------------------------------------- total: 17.390000sec

                   user     system      total        real
strscan        3.000000   0.000000   3.000000 (  2.999783)
rx             4.870000   0.000000   4.870000 (  4.899242)
rx with        4.300000   0.010000   4.310000 (  4.869835)
rx with conv   5.240000   0.000000   5.240000 (  5.442722)

Apparently I have to correct myself...

robert
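
The logscan.rb script itself is not included in the thread. The output format above is what `Benchmark.bmbm` from Ruby's standard library prints, so a comparison of this kind could be set up roughly as follows. This is only a sketch under assumptions: `SAMPLE`, `BENCH_RX`, `ITERATIONS`, `fields_by_strscan` and `fields_by_regexp` are hypothetical names, the patterns are derived from the single sample line in the original post, and the iteration count is taken from the 10k loop mentioned there.

```ruby
require 'benchmark'
require 'strscan'

SAMPLE = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
         "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"

# One regexp that captures all nine fields in a single match.
BENCH_RX = /\A(\d+\.\d+) rule (\d+)\S* (\w+) .*? on (\w+:) ((?:\d+\.){3}\d+)\.(\d+) > ((?:\d+\.){3}\d+)\.(\d+): (\w+)/

# Field-by-field extraction with StringScanner.
def fields_by_strscan(line)
  s = StringScanner.new(line)
  f = []
  f << s.scan(/\d+\.\d+/)          # timestamp
  s.skip(/ rule /)
  f << s.scan(/\d+/)               # rule number
  s.skip(/\S* /)
  f << s.scan(/\w+/)               # action
  s.skip(/.*? on /)
  f << s.scan(/\w+:/)              # interface
  s.skip(/ /)
  f << s.scan(/(?:\d+\.){3}\d+/)   # first address
  s.skip(/\./)
  f << s.scan(/\d+/)               # first port
  s.skip(/ > /)
  f << s.scan(/(?:\d+\.){3}\d+/)   # second address
  s.skip(/\./)
  f << s.scan(/\d+/)               # second port
  s.skip(/: /)
  f << s.scan(/\w+/)               # protocol
  f
end

# Same fields from a single Regexp match; nil when the line does not match.
def fields_by_regexp(line)
  m = BENCH_RX.match(line)
  m ? m.captures : nil
end

ITERATIONS = 10_000

# bmbm runs a rehearsal pass first, then the measured pass.
Benchmark.bmbm do |x|
  x.report("strscan") { ITERATIONS.times { fields_by_strscan(SAMPLE) } }
  x.report("rx")      { ITERATIONS.times { fields_by_regexp(SAMPLE) } }
end
```

Which variant comes out ahead depends on the machine, its load and the Ruby version, as the two runs in this thread show, so it is worth measuring on the target system rather than assuming either result.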
 
