Optimization anyone

Discussion in 'Ruby' started by Horacio Sanson, Nov 22, 2005.

  1. I have this little script that takes a list of keyword sets, each set has only
    two keywords and for each one of them the script creates a regular expression
    like this:

    Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")

    then I match it to a string that contains a long text fetched from a web page.

    a more complete pseudo-code

    #########################################
    long_text = get_web_page(url)

    keyword_hash = load_keyword_array_from_database

    keyword_hash.each_pair { |id, value|

    key1 = value[0]
    key2 = value[1]

    r = Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")
    return id if long_text =~ r
    }

    return -1
    ###########################################


    Now this code works perfect, the problem is that the keyword_hash has more
    than 300 elements and running this code can take between 50 to 120 seconds.
    Since I am processing more than 1000 pages with this code it takes forever.


    I solved this problem by replacing the regular expression match to

    r1 = Regexp.new("#{key1}\.*#{key2}")
    r2 = Regexp.new("#{key2}\.*#{key1}")

    return id if long_text =~ r1 or long_text =~ r2


    I simply put the or statement outside the regular expresion and the speedup
    was from 50~120sec to 0.40 secs per page.


    using the Benchmark class and running some test I got

    normal: 0 0
    27.688000 0.015000 27.703000 ( 27.765000 )
    fast:
    0.469000 0.000000 0.484000 (0.954000)


    the speed difference is totally diferent.

    Is this expected when using regular expressions??


    regards,
    Horacio
    Horacio Sanson, Nov 22, 2005
    #1
    1. Advertising

  2. Horacio Sanson

    ts Guest

    >>>>> "H" == Horacio Sanson <> writes:

    H> Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")

    vs

    H> r1 = Regexp.new("#{key1}\.*#{key2}")
    H> r2 = Regexp.new("#{key2}\.*#{key1}")

    H> Is this expected when using regular expressions??

    yes, ruby has some optimizations. For example with the regexp /abc.*def/

    svg% ruby -rjj -e '/abc.*def/.dump'
    Regexp /abc.*def/
    0 exactn "abc" (3)
    1 anychar_repeat
    2 exactn "def" (3)
    3 end
    must : abc
    optimize : exactn
    svg%

    It call the regexp engine (which is slow) only when it has found the
    substring "abc" in the string

    Now if you use /abc.*def|def.*abc/ you break this optimization


    svg% ruby -rjj -e '/abc.*def|def.*abc/.dump'
    Regexp /abc.*def|def.*abc/
    0 on_failure_jump ==> 5
    1 exactn "abc" (3)
    2 anychar_repeat
    3 exactn "def" (3)
    4 jump ==> 8
    5 exactn "def" (3)
    6 anychar_repeat
    7 exactn "abc" (3)
    8 end
    svg%


    it must call the stupid :)-)) regexp engine for each line


    Guy Decoux
    ts, Nov 22, 2005
    #2
    1. Advertising

  3. Horacio Sanson wrote:
    > I have this little script that takes a list of keyword sets, each set
    > has only two keywords and for each one of them the script creates a
    > regular expression like this:
    >
    > Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")
    >
    > then I match it to a string that contains a long text fetched from a
    > web page.
    >
    > a more complete pseudo-code
    >
    > #########################################
    > long_text = get_web_page(url)
    >
    > keyword_hash = load_keyword_array_from_database
    >
    > keyword_hash.each_pair { |id, value|
    >
    > key1 = value[0]
    > key2 = value[1]
    >
    > r = Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")
    > return id if long_text =~ r
    > }
    >
    > return -1
    > ###########################################
    >
    >
    > Now this code works perfect, the problem is that the keyword_hash has
    > more than 300 elements and running this code can take between 50 to
    > 120 seconds. Since I am processing more than 1000 pages with this
    > code it takes forever.
    >
    >
    > I solved this problem by replacing the regular expression match to
    >
    > r1 = Regexp.new("#{key1}\.*#{key2}")
    > r2 = Regexp.new("#{key2}\.*#{key1}")
    >
    > return id if long_text =~ r1 or long_text =~ r2
    >
    >
    > I simply put the or statement outside the regular expresion and the
    > speedup was from 50~120sec to 0.40 secs per page.
    >
    >
    > using the Benchmark class and running some test I got
    >
    > normal: 0 0
    > 27.688000 0.015000 27.703000 ( 27.765000
    > )
    > fast:
    > 0.469000 0.000000 0.484000 (0.954000)
    >
    >
    > the speed difference is totally diferent.
    >
    > Is this expected when using regular expressions??


    On obvious optimization is to create all regexps during
    load_keyword_array_from_database() and not during iteration of the hash.
    That way you just have to do it once and can reuse those regexps with
    multiple pages you check.

    Another possible optimization is to take your approach of splitting the
    regexps a bit further and create two regexps - one for each keyword - and
    return the id if both match. This works only correctly if (i) keywords
    don't overlap or (ii) you can use \b to ensure matching on word
    boundaries.

    Kind regards

    robert
    Robert Klemme, Nov 22, 2005
    #3
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Kim Petersen

    Help and optimization hints, anyone?

    Kim Petersen, Jan 23, 2004, in forum: Python
    Replies:
    4
    Views:
    368
    Kim Petersen
    Jan 23, 2004
  2. dp_pearce
    Replies:
    2
    Views:
    309
    Terry Reedy
    Jul 9, 2008
  3. dp_pearce
    Replies:
    5
    Views:
    272
    dp_pearce
    Jul 15, 2008
  4. Ravikiran

    Zero Optimization and Sign Optimization???

    Ravikiran, Nov 17, 2008, in forum: C Programming
    Replies:
    22
    Views:
    862
    Thad Smith
    Nov 24, 2008
  5. neoswf
    Replies:
    4
    Views:
    276
Loading...

Share This Page