Using Classifier::LSI

Discussion in 'Ruby' started by chrisjroos@gmail.com, Feb 8, 2006.

  1. Guest

    Hi,

    I've just spent some time looking at the ruby classifier[1] library.

    I need to index 3000 one line product names.

    The default behaviour is to index items as they are inserted (at least
    it would certainly appear that's what's happening based on simple
    benchmarking). This causes the time to index new items to rise
    exponentially. After some experimentation, I found that turning off
    the automatic indexing and then manually building once all items were
    added was far quicker (17 secs to add 100 items with auto indexing, 0.6
    secs to add 100 items with manual index building).

    I also did a quick test to confirm that the output was the same in both
    cases.

    I'm really just wondering why the default behaviour seems to be the
    most inefficient and whether I'm missing anything obvious using the
    manual build method?

    Any help / pointers would be much appreciated.

    Chris

    [1] http://classifier.rufy.com
    , Feb 8, 2006
    #1

  2. Hey Chris,

    On 2/8/06, <> wrote:

    > The default behaviour is to index items as they are inserted (at least
    > it would certainly appear that's what's happening based on simple
    > benchmarking). This causes the time to index new items to rise
    > exponentially.


    Right. The index building takes time and there is a lot of double work
    going on.

    > After some experimentation, I found that turning off
    > the automatic indexing and then manually building once all items were
    > added was far quicker (17 secs to add 100 items with auto indexing, 0.6
    > secs to add 100 items with manual index building).


    Great. This is exactly why there is an :auto_rebuild option, so you can
    make this more efficient (like you discovered).

    > I'm really just wondering why the default behaviour seems to be the
    > most inefficient and whether I'm missing anything obvious using the
    > manual build method?


    I believe the default behavior is directed toward simple usage, to help
    give people a feel for what the library does. Given the mathematics
    underneath the index building, with 3000 items the index building is
    slower. (I believe the docs say around 500 items is the rule of thumb
    for a speedy return.)

    I don't think you're missing anything here. Just use
    :auto_rebuild => false and build it manually.
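    The "double work" above can be illustrated with a toy cost model in
    plain Ruby (a sketch only, not the classifier gem's code): rebuilding
    after every insert re-touches every item already added, so n inserts
    cost at least 1 + 2 + ... + n units of work, while one manual build at
    the end costs a single pass. In real LSI each rebuild involves an SVD,
    so the gap is even larger than this linear model suggests.

```ruby
# Toy cost model (illustrative only, not the classifier gem's code):
# compare total "work units" when the index is rebuilt after every
# insert versus built once at the end.

def auto_rebuild_work(n)
  (1..n).sum  # a rebuild after each insert touches all items so far
end

def manual_build_work(n)
  n           # one build over all n items at the end
end

n = 3000
puts "auto-rebuild: #{auto_rebuild_work(n)} units"  # 4501500
puts "manual build: #{manual_build_work(n)} units"  # 3000
```

    With the gem itself, the manual path corresponds to constructing the
    index with :auto_rebuild => false, adding each item, and then calling
    the build step once at the end, as described above.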

    I was actually advocating lazy evaluation as the default (so the
    index wouldn't be built until you needed it, which would work rather
    efficiently for both simple usage and your case). The argument
    against this was to prevent users from thinking the searching was
    slow. Once the index is built, everything else flies. LSI just
    really is math intensive, which is why there is a C-based backend for
    it. Just make sure you're using something other than the pure-Ruby
    version if speed becomes an issue.

    Let us know if anything else is unclear.
    Cameron McBride, Feb 8, 2006
    #2

  3. Guest

    Ok, so it took just over 7 hours to build the index of 3000 items.

    Since then it has been running for a further 12 hours trying to use
    that index to obtain likely matches for the same 3000 items; i.e. for
    each of the 3000 items I am trying to get the best matches from the
    index (using find related).

    Should I even bother waiting for it to finish or should I be
    investigating something else to achieve similar results?

    BTW. I am using the c extension.

    Thanks in advance,

    Chris
    , Feb 9, 2006
    #3
  4. I'm doing a project at the moment cataloguing publications where I've
    been using Ferret (a Ruby port of Lucene). I have to say, this is a
    brilliant library. It has really good search capabilities, division of
    'documents' into various fields that can be individually searched, and
    it's fast. Here are some basic benchmarks using the simplest entry
    mode (both stores and tokenises all input) using ruby 1.8.4
    (2005-10-29) [i686-linux] on a 1.4GHz Celeron M processor, 768MB
    memory:

    tim@ghostknife:~/programming/ruby> cat bm_ferret.rb
    require 'benchmark'
    require 'rubygems'
    require 'ferret'

    $index = Ferret::Index::Index.new :path => './test_index', :create => true
    $words = []

    File.open('/usr/share/dict/words') do |dict|
      dict.each_line do |line|
        $words << line.chomp
      end
    end

    def get_words
      ret = []
      10.times do
        ret << $words[(rand * $words.length).to_i]
      end
      ret
    end

    Benchmark.bmbm(7) do |x|
      x.report('Filing: ') do
        1000.times { $index << { :key  => (rand * 1000000).to_i,
                                 :data => get_words.join(' ') } }
      end
    end

    tim@ghostknife:~/programming/ruby> ruby bm_ferret.rb
    Rehearsal --------------------------------------------
    Filing:    7.920000   0.630000   8.550000 (  8.551508)
    ----------------------------------- total: 8.550000sec

                  user     system      total        real
    Filing:    7.680000   0.700000   8.380000 (  8.637888)
    Timothy Goddard, Feb 9, 2006
    #4
  5. Chris,

    On 2/9/06, <> wrote:
    > Ok, so it took just over 7 hours to build the index of 3000 items.


    something sounds drastically wrong.

    > Since then it has been running for a further 12 hours trying to use
    > that index to obtain likely matches for the same 3000 items; i.e. for
    > each of the 3000 items I am trying to get the best matches from the
    > index (using find related).


    again, something seems funny here. I just performed a benchmark on the
    dominant calculation for index building for somewhere around 3000
    documents with 50 unique keywords. This took on the order of 4 minutes
    on my 1.3GHz Athlon. The pure-ruby version would take exponentially
    longer.
    Cameron McBride, Feb 9, 2006
    #5
  6. On Feb 9, 2006, at 7:48, wrote:
    > Since then it has been running for a further 12 hours trying to use
    > that index to obtain likely matches for the same 3000 items; i.e. for
    > each of the 3000 items I am trying to get the best matches from the
    > index (using find related).
    >
    > Should I even bother waiting for it to finish or should I be
    > investigating something else to achieve similar results?


    Can't comment on the time it takes, but the data you're using doesn't
    seem particularly suited to LSI, in my opinion (and this sort of
    thing is my occupation these days). LSI's not magic - what it's
    doing is taking advantage of the statistical properties of language.
    So it needs two things to work well: a relatively large set of words
    compared to the number of items, and the items should be (more or
    less) standard language.

    Obviously I don't know exactly what the product names are, but as a
    class, product names don't strike me as fitting those constraints
    very well. Firstly because I expect them to be fairly short (5-6
    words, tops?), and secondly because they lack a lot of the syntax and
    semantic relations that you'd find in a sentence (nominals don't have
    very much internal structure, in general).

    Other promising approaches might be standard word/document search
    (like Ferret, already mentioned), or a language-model approach, which
    works using relative frequencies of words. In the power-tool domain,
    for instance, "grit" might correlate highly with "sander", and so you
    could say that anything with "grit" in it is related to sanding.

    That said, I'm not aware of any Ruby libraries which implement this
    sort of thing, so if you wanted to stick with Ruby, you'd be doing it
    yourself (it's not a particularly sophisticated approach, though, so
    it likely wouldn't be that hard).
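    The do-it-yourself approach described above can indeed be roughed out
    in a few lines of plain Ruby. This is only a sketch of the idea, not
    any existing library: count which words co-occur within the same
    product name, then rank the words most strongly associated with a
    query word such as "grit". The sample product names are made up.

```ruby
# Minimal word co-occurrence "relatedness" sketch (illustrative only):
# count how often word pairs appear together in short product names,
# then rank the words most associated with a given word.

def cooccurrence_counts(docs)
  counts = Hash.new { |h, k| h[k] = Hash.new(0) }
  docs.each do |doc|
    words = doc.downcase.split.uniq
    words.combination(2) do |a, b|   # every unordered pair in this name
      counts[a][b] += 1
      counts[b][a] += 1
    end
  end
  counts
end

def related_words(counts, word, top = 3)
  counts[word].sort_by { |_, c| -c }.first(top).map(&:first)
end

docs = [
  "80 grit sanding disc",
  "120 grit sanding belt",
  "orbital sander pad",
  "sanding sponge fine grit",
]
counts = cooccurrence_counts(docs)
p related_words(counts, "grit")  # words most often seen with "grit"
```

    A real version would normalise by word frequency (e.g. pointwise
    mutual information) so that very common words don't dominate, but the
    core of the approach really is this simple.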

    matthew smillie.



    ----
    Matthew Smillie <>
    Institute for Communicating and Collaborative Systems
    University of Edinburgh
    Matthew Smillie, Feb 9, 2006
    #6
  7. Guest

    Ok, so I'm thinking (taking into account what Matthew Smillie has said)
    that it's due to the number of 'unique' words found in my documents -
    4343 unique words found in 3320 product names.

    I have actually spent the morning looking at ferret and some other text
    searching libraries and at the moment I think ferret will do what I
    require.

    Thanks for the help folks.

    Chris
    , Feb 9, 2006
    #7
  8. On 2/9/06, <> wrote:
    > Ok, so I'm thinking (taking into account what Matthew Smillie has said)
    > that it's due to the number of 'unique' words found in my documents -
    > 4343 unique words found in 3320 product names.
    >
    > I have actually spent the morning looking at ferret and some other text
    > searching libraries and at the moment I think ferret will do what I
    > require.


    sounds like the right approach, glad you found something that worked.

    Cameron
    Cameron McBride, Feb 9, 2006
    #8
  9. Dave Fayram Guest

    >Ok, so it took just over 7 hours to build the index of 3000 items.

    Depending on the size of your inputs, it can take longer.
    Classifier::LSI has a very specific sweet spot of data volume in its
    current incarnation. It's not currently suited for very large loads. I
    am currently working on a version that will work with MPI so that
    clusters can attack the problem. LSI just isn't useful for large
    datasets without _very_ hefty hardware.

    But if your matrix is only 3000x4000-ish, it shouldn't take that long.
    Make sure that your classifier is binding to gsl, and if you need to,
    email me and we can pursue it further. I only experience a breakdown in
    speed when my unique-words list is in the 15k+ range, with more than
    1500 items.

    Also, remember that LSI wants to work with the *semantics*. A list of
    product titles isn't going to bring very much semantic data into the
    system. The way I originally envisioned using LSI was to create
    summaries of documents, working on sets of document abstracts to
    rapidly provide relevant ACM papers. :)
    Dave Fayram, Feb 9, 2006
    #9
  10. On 2/9/06, <> wrote:
    > I have actually spent the morning looking at ferret and some other text
    > searching libraries and at the moment I think ferret will do what I
    > require.


    There is also a Classifier::Bayes available that might be more in sync
    with what you're trying to do.
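    The contrast with LSI can be seen in a toy sketch of the naive Bayes
    idea (plain Ruby, not the classifier gem's actual code): per-category
    word counts with crude add-one smoothing, and no semantic analysis at
    all, which is part of why it copes better with short, unstructured
    text like product names. The categories and training strings below are
    invented for illustration.

```ruby
# Tiny naive Bayes sketch (illustrative only, not the classifier gem):
# train per-category word counts, then classify by summed log-probability.
class TinyBayes
  def initialize
    @counts = Hash.new { |h, k| h[k] = Hash.new(0) }  # category -> word -> count
    @totals = Hash.new(0)                             # category -> total words
  end

  def train(category, text)
    text.downcase.split.each do |w|
      @counts[category][w] += 1
      @totals[category]    += 1
    end
  end

  def classify(text)
    words = text.downcase.split
    @counts.keys.max_by do |cat|
      # crude add-one smoothing so unseen words don't zero out a category
      words.sum { |w| Math.log((@counts[cat][w] + 1.0) / (@totals[cat] + 1.0)) }
    end
  end
end

b = TinyBayes.new
b.train(:tools,  "orbital sander sanding disc grit")
b.train(:garden, "rose seed soil compost trowel")
p b.classify("fine grit sanding belt")  # => :tools
```

    Real implementations add priors and better smoothing, but the shape of
    the computation is the same.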

    Cameron
    Cameron McBride, Feb 9, 2006
    #10

Similar Threads
  1. penny336 (1 reply, 665 views; last: tom_usenet, Nov 7, 2003)
  2. Robin Becker, "pop3 email header classifier?", Sep 19, 2003, in forum: Python (6 replies, 418 views; last: David Mertz, Sep 23, 2003)
  3. Lucas Carlson (4 replies, 147 views; last: Dave Fayram, Apr 25, 2005)
  4. Tom Reilly, "classifier lsi and ruby gsl", May 4, 2005, in forum: Ruby (2 replies, 118 views; last: Dave Fayram, May 5, 2005)
  5. Chris Kottom, "Doing LSI at scale in Ruby", May 26, 2011, in forum: Ruby (10 replies, 374 views; last: Ryan Davis, May 29, 2011)