Searching for a very fast string parser

Discussion in 'Ruby' started by |MKSM|, Mar 8, 2006.

  1. |MKSM|

    |MKSM| Guest

    Hello,

    I want to parse a log file containing several lines in the same format.
    My log files are about 50 MB each (350k lines), so I need something
    quite fast. The current (and fastest) solution I came up with uses
    StringScanner.

    I save what I get into variables and then pass them all into a Struct
    I created. Each new struct is then appended to an Array that holds all
    structs.


    Here's my test code:

    require 'strscan'

    a = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: 80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"

    s = StringScanner.new(a)
    time = s.scan(/\d+\.\d+/)
    s.pos += 23
    rule_no = s.scan(/\d+/)
    s.skip(/[\d\D]*?\s/)
    stat = s.scan(/\w+/)
    s.skip(/.*on\s/)
    interface = s.scan(/\w+\:/)
    s.skip(/\D+?\s/)
    out_ip = s.scan(/(\d+\.){3}\d{0,3}/)
    s.pos += 1
    out_port = s.scan(/\d+/)
    s.skip(/\D+/)
    in_ip = s.scan(/(\d+\.){3}\d{0,3}/)
    s.pos += 1
    in_port = s.scan(/\d+/)
    s.pos += 2
    proto = s.scan(/\w+/)
    proto
    s.pos += 1

    Running that in a loop 10k times takes about 0.6 seconds.
    Is there a better/faster way of doing it?

    Regards,

    Ricardo.
     
    |MKSM|, Mar 8, 2006
    #1
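
A note on the snippet in post #1: with the sample line written on one line, `s.pos += 23` after the timestamp appears to land well past the rule number, and `s.skip(/\D+?\s/)` cannot match a single blank followed by a digit, so several fields would come back nil. The following is a minimal sketch of one way to pull out the fields the post describes with StringScanner, collecting them in a Struct. It is not the poster's code: `LogEntry`, `parse_line`, `IP` and `SAMPLE` are names invented here for illustration, and the interface is captured without its trailing colon.

```ruby
require 'strscan'

# Hypothetical record type; field names follow the variables in post #1.
LogEntry = Struct.new(:time, :rule_no, :stat, :interface,
                      :out_ip, :out_port, :in_ip, :in_port, :proto)

# Dotted quad only; the port that follows is scanned separately.
IP = /\d{1,3}(?:\.\d{1,3}){3}/

SAMPLE = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
         "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"

def parse_line(line)
  s = StringScanner.new(line)
  time      = s.scan(/\d+\.\d+/)
  s.skip(/\s+rule\s+/)
  rule_no   = s.scan(/\d+/)
  s.skip(/\S*\s+/)          # rest of the rule token, then the blank
  stat      = s.scan(/\w+/)
  s.skip(/.*?\son\s+/)      # everything up to and including " on "
  interface = s.scan(/\w+/)
  s.skip(/:\s+/)
  out_ip    = s.scan(IP)
  s.skip(/\./)              # dot between address and port
  out_port  = s.scan(/\d+/)
  s.skip(/\s*>\s*/)
  in_ip     = s.scan(IP)
  s.skip(/\./)
  in_port   = s.scan(/\d+/)
  s.skip(/:\s+/)
  proto     = s.scan(/\w+/)
  LogEntry.new(time, rule_no, stat, interface,
               out_ip, out_port, in_ip, in_port, proto)
end

entry = parse_line(SAMPLE)
puts entry.to_a.inspect
```

For a whole file, something like `File.foreach(path).map { |l| parse_line(l) }` would collect the structs into an Array, where `path` is a placeholder for the log file name.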

  2. |MKSM|

    Guest


    can you put a demo log file on the web somewhere?

    -a

    --
    knowledge is important, but the much more important is the use toward which it
    is put. this depends on the heart and mind the one who uses it.
    - h.h. the 14th dalai lama
     
    , Mar 8, 2006
    #2

  3. |MKSM|

    |MKSM| Guest

    I'm sorry, the log file I have comes from a live firewall. I'd rather
    not release it.

    The log consists only of lines like the one I used in the code.

    Regards,

    Ricardo
     
    |MKSM|, Mar 8, 2006
    #3
  4. On Mar 8, 2006, at 12:09 PM, |MKSM| wrote:

    > I'm sorry, the log file i have comes from a live firewall. I'd rather
    > not release it.


    Would randomizing the data render it safe?

    >> "ABC 123".gsub(/[a-zA-Z0-9]/i) { |chr| ("0".."9").include?(chr) ? rand(10) : ("A".."Z").to_a[rand(26)] }
    => "HNQ 265"

    James Edward Gray II
     
    James Edward Gray II, Mar 8, 2006
    #4
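
Applied to a firewall log, the randomizing idea in post #4 might look like the sketch below. It is a variation rather than the snippet above: it scrambles digits only, on the assumption that keywords such as "rule" and "on" must survive so the parser under test still finds its landmarks. `scramble_digits` and `sample` are hypothetical names, and the scrambled octets can exceed 255, so the output is only suitable for timing and format tests.

```ruby
# Hypothetical helper: replace each digit with a random digit and leave
# letters, punctuation and spacing untouched.
def scramble_digits(line, rng = Random.new)
  line.gsub(/\d/) { rng.rand(10).to_s }
end

sample = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
         "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"

puts scramble_digits(sample)
```

Passing a seeded generator, for example `Random.new(1)`, makes the scrambled output repeatable between runs.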
  5. Caleb Clausen wrote:
    > OK, so first off, your sample implementation seemed to have several
    > bugs in it. After fixing those, I thought you might be able to save
    > some time by glomming all the regexp's together, obviating the need
    > for StringScanner altogether. However, that doesn't seem to have
    > actually made any difference...


    I don't buy this. A single plain RX is usually faster than a more complex
    solution. Even on a machine with constant high load (I had no other
    machine available at the moment) I get a significant difference (north of 6%):

    >> 15:22:14 [source]: /c/temp/ruby/logscan.rb

    Rehearsal ------------------------------------------------
    strscan        5.969000   0.000000   5.969000 (  6.095000)
    rx             5.828000   0.000000   5.828000 (  5.951000)
    rx with conv   5.860000   0.000000   5.860000 (  5.922000)
    -------------------------------------- total: 17.657000sec

                       user     system      total        real
    strscan        5.953000   0.000000   5.953000 (  6.043000)
    rx             5.547000   0.000000   5.547000 (  5.747000)
    rx with conv   5.765000   0.000000   5.765000 (  5.924000)

    (script attached)

    > if anything it seems to have been a
    > little slower. I don't know why. And the great big long Regexp is
    > considerably harder to read.


    Using %r{} and /x makes a great deal of difference in readability (see script).

    Kind regards

    robert
     
    Robert Klemme, Mar 9, 2006
    #5
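
The attached script is not part of this thread, so as an illustration of the `%r{}` plus `/x` point in post #5, here is a minimal sketch of a single extended regexp for the sample line from post #1. The pattern, the constant `LOG_LINE` and the capture order are assumptions made for this example, not the contents of the original script.

```ruby
# With /x, whitespace inside the pattern is ignored and # starts a comment,
# so each field can sit on its own line.
LOG_LINE = %r{
  \A
  (\d+\.\d+) \s+                        # timestamp
  rule \s+ (\d+) \S* \s+                # rule number, rest of token skipped
  (\w+)                                 # action
  .*? \s on \s+
  (\w+) : \s+                           # interface
  (\d{1,3}(?:\.\d{1,3}){3}) \. (\d+)    # first address and port
  \s* > \s*
  (\d{1,3}(?:\.\d{1,3}){3}) \. (\d+)    # second address and port
  : \s+ (\w+)                           # protocol
}x

line = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
       "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"

m = LOG_LINE.match(line)
fields = m ? m.captures : nil
puts fields.inspect
```

`match` returns nil for a line that does not fit the pattern, which is why the captures are read behind a guard.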
  6. Robert Klemme wrote:
    > I don't buy this. A single plain RX is usually faster than a more
    > complex solution. Even on a machine with constant high load (I had
    > no other machine available at the moment) I get a significant
    > difference (north of 6%):

    I redid the test on an idle Linux machine with Ruby 1.8.1 and the
    StringScanner is actually faster:

    [root@fox tmp]# ./logscan.rb
    Rehearsal ------------------------------------------------
    strscan        2.990000   0.000000   2.990000 (  2.991096)
    rx             4.870000   0.000000   4.870000 (  4.868536)
    rx with        4.280000   0.010000   4.290000 (  4.284334)
    rx with conv   5.240000   0.000000   5.240000 (  5.459702)
    -------------------------------------- total: 17.390000sec

                       user     system      total        real
    strscan        3.000000   0.000000   3.000000 (  2.999783)
    rx             4.870000   0.000000   4.870000 (  4.899242)
    rx with        4.300000   0.010000   4.310000 (  4.869835)
    rx with conv   5.240000   0.000000   5.240000 (  5.442722)

    Apparently I have to correct myself...

    robert
     
    Robert Klemme, Mar 9, 2006
    #6
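
The rehearsal and timing tables in posts #5 and #6 have the shape of `Benchmark.bmbm` output. Since logscan.rb itself is not shown, the following is only a sketch of how such a comparison could be set up. The two parsers, the constants and the loop count of 10,000 (taken from post #1) are assumptions for illustration; as the two posts show, the results depend heavily on the machine and the Ruby version.

```ruby
require 'benchmark'
require 'strscan'

LINE = "1140908573.050732 rule 19/0(match): pass unkn(255) on sis1: " \
       "80.202.226.15.50000 > 192.168.0.6.52525: UDP, length 64"
N = 10_000 # loop count mentioned in post #1

IP = /\d{1,3}(?:\.\d{1,3}){3}/
RX = /\A(\d+\.\d+)\s+rule\s+(\d+)\S*\s+(\w+).*?\son\s+(\w+):\s+(#{IP})\.(\d+)\s*>\s*(#{IP})\.(\d+):\s+(\w+)/

# StringScanner variant: alternate between skipping separators and
# scanning fields.
def strscan_fields(line)
  s = StringScanner.new(line)
  f = [s.scan(/\d+\.\d+/)]
  s.skip(/\s+rule\s+/); f << s.scan(/\d+/)
  s.skip(/\S*\s+/);     f << s.scan(/\w+/)
  s.skip(/.*?\son\s+/); f << s.scan(/\w+/)
  s.skip(/:\s+/);       f << s.scan(IP)
  s.skip(/\./);         f << s.scan(/\d+/)
  s.skip(/\s*>\s*/);    f << s.scan(IP)
  s.skip(/\./);         f << s.scan(/\d+/)
  s.skip(/:\s+/);       f << s.scan(/\w+/)
  f
end

# Single-regexp variant: one match, fields read from the captures.
def rx_fields(line)
  m = RX.match(line)
  m && m.captures
end

# bmbm runs a rehearsal pass first, then the measured pass.
Benchmark.bmbm(8) do |x|
  x.report("strscan") { N.times { strscan_fields(LINE) } }
  x.report("rx")      { N.times { rx_fields(LINE) } }
end
```

Checking that both variants return the same fields before timing them keeps the comparison honest.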