Using Classifier::LSI

chrisjroos

Hi,

I've just spent some time looking at the ruby classifier[1] library.

I need to index 3000 one-line product names.

The default behaviour is to index items as they are inserted (at least
it would certainly appear that's what's happening based on simple
benchmarking). This causes the time to index new items to rise
exponentially. After some experimentation, I found that turning off
the automatic indexing and then manually building once all items were
added was far quicker (17 secs to add 100 items with auto indexing, 0.6
secs to add 100 items with manual index building).

I also did a quick test to confirm that the output was the same in both
cases.

I'm really just wondering why the default behaviour seems to be the
least efficient option, and whether I'm missing anything obvious by using
the manual build method?

Any help / pointers would be much appreciated.

Chris

[1] http://classifier.rufy.com
 
Cameron McBride

Hey Chris,

The default behaviour is to index items as they are inserted (at least
it would certainly appear that's what's happening based on simple
benchmarking). This causes the time to index new items to rise
exponentially.

right. the index building takes time and there is a lot of double work going on.
After some experimentation, I found that turning off
the automatic indexing and then manually building once all items were
added was far quicker (17 secs to add 100 items with auto indexing, 0.6
secs to add 100 items with manual index building).

Great. This is exactly why there is an :auto_rebuild option, so you can make
this more efficient (like you discovered).
I'm really just wondering why the default behaviour seems to be the
most inefficient and whether I'm missing anything obvious using the
manual build method?

I believe the default behavior is directed toward simple usage, to help give
people a feel for what the library does. Given the mathematics underneath it,
index building is simply slower when you have 3000 items.
(I believe the docs say around 500 items is the rule of thumb for a speedy
return.)

I don't think you're missing anything here. Just use
:auto_rebuild => false and build it manually.
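
In code, that's roughly the following (a quick sketch; product_names here just
stands in for your list of names, and the exact option/method names are in the
classifier docs):

require 'rubygems'
require 'classifier'

lsi = Classifier::LSI.new :auto_rebuild => false   # don't rebuild on every add

product_names.each { |name| lsi.add_item(name) }

lsi.build_index                                    # one rebuild at the end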

I was actually advocating a lazy evaluation as the default (so the
index wouldn't be built until you needed it, which would work rather
efficiently for both simple usage and your case). The argument
against this was to prevent users from thinking the searching was
slow. Once the index is built, everything else flies. LSI just
really is math intensive, which is why there is a C-based backend for
it. Just make sure you're using something other than the pure-ruby
version if speed becomes an issue.

Let us know if anything else is unclear.
 
chrisjroos

Ok, so it took just over 7 hours to build the index of 3000 items.

Since then it has been running for a further 12 hours trying to use
that index to obtain likely matches for the same 3000 items; i.e. for
each of the 3000 items I am trying to get the best matches from the
index (using find_related).
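
The loop is roughly this (a sketch; matches is just where I collect the results):

product_names.each do |name|
  matches[name] = lsi.find_related(name, 5)   # five nearest items from the index
end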

Should I even bother waiting for it to finish or should I be
investigating something else to achieve similar results?

BTW. I am using the c extension.

Thanks in advance,

Chris
 
Timothy Goddard

I'm doing a project at the moment cataloguing publications where I've
been using Ferret (a Ruby port of Lucene). I have to say, this is a
brilliant library. It has really good search capabilities, division of
'documents' into various fields that can be individually searched, and
it's fast. Here are some basic benchmarks using the simplest entry
mode (it both stores and tokenises all input), using ruby 1.8.4
(2005-10-29) [i686-linux] on a 1.4GHz Celeron M processor with 768MB
memory:

tim@ghostknife:~/programming/ruby> cat bm_ferret.rb
require 'benchmark'
require 'rubygems'
require 'ferret'

$index = Ferret::Index::Index.new :path => './test_index', :create => true
$words = []

File.open('/usr/share/dict/words') do |dict|
  dict.each_line do |line|
    $words << line.chomp
  end
end

def get_words
  ret = []
  10.times do
    ret << $words[(rand * $words.length).to_i]
  end
  ret
end

Benchmark.bmbm(7) do |x|
  x.report('Filing: ') do
    1000.times { $index << { :key  => (rand * 1000000).to_i,
                             :data => get_words.join(' ') } }
  end
end

tim@ghostknife:~/programming/ruby> ruby bm_ferret.rb
Rehearsal --------------------------------------------
Filing:    7.920000   0.630000   8.550000 (  8.551508)
----------------------------------- total: 8.550000sec

              user     system      total        real
Filing:    7.680000   0.700000   8.380000 (  8.637888)
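
Querying it back is just as easy; something like this should do it (from memory,
so double-check the query syntax against the Ferret docs):

# search the :data field for a word; the block gets document ids and scores
$index.search_each('data:example') do |doc_id, score|
  puts "#{score}: #{$index[doc_id][:data]}"
end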
 
Cameron McBride

Chris,

Ok, so it took just over 7 hours to build the index of 3000 items.

something sounds drastically wrong.
Since then it has been running for a further 12 hours trying to use
that index to obtain likely matches for the same 3000 items; i.e. for
each of the 3000 items I am trying to get the best matches from the
index (using find related).

again, something seems funny here. I just ran a benchmark on the
dominant calculation for index building, for somewhere around 3000
documents with 50 unique keywords. It took on the order of 4 minutes
on my 1.3GHz Athlon. The pure-ruby version would take exponentially longer.
 
Matthew Smillie

Since then it has been running for a further 12 hours trying to use
that index to obtain likely matches for the same 3000 items; i.e. for
each of the 3000 items I am trying to get the best matches from the
index (using find related).

Should I even bother waiting for it to finish or should I be
investigating something else to achieve similar results?

Can't comment on the time it takes, but the data you're using doesn't
seem particularly suited to LSI, in my opinion (and this sort of
thing is my occupation these days). LSI's not magic - what it's
doing is taking advantage of the statistical properties of language.
So it needs two things to work well: a relatively large set of words
compared to the number of items, and the items should be (more or
less) standard language.

Obviously I don't know exactly what the product names are, but as a
class, product names don't strike me as fitting those constraints
very well. Firstly because I expect them to be fairly short (5-6
words, tops?), and secondly because they lack a lot of the syntax and
semantic relations that you'd find in a sentence (nominals don't have
very much internal structure, in general).

Other approaches that might be promising are standard word/
document search (like Ferret, already mentioned), or a language-model
approach, which works using relative frequencies of words. In the
power-tool domain, for instance, "grit" might correlate highly with
"sander", and so you could say that anything with "grit" in it is
related to sanding.

That said, I'm not aware of any Ruby libraries which implement this
sort of thing, so if you wanted to stick with Ruby, you'd be doing it
yourself (it's not a particularly sophisticated approach, though, so
it likely wouldn't be that hard).
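
As a purely illustrative sketch of the relative-frequency idea (toy data, nothing
tuned, and the weighting is the crudest thing that could work):

# Score "relatedness" by shared terms, weighting rare terms more heavily
# (an idf-style weight based on how many product names contain each term).
names = ['80 grit sanding disc', 'orbital sander', '120 grit sandpaper',
         'cordless drill']                       # stand-in for the real list

tokens   = names.map { |n| n.downcase.split(/\W+/) }
doc_freq = Hash.new(0)
tokens.each { |ts| ts.uniq.each { |t| doc_freq[t] += 1 } }

def related(i, names, tokens, doc_freq)
  scores = []
  tokens.each_with_index do |ts, j|
    next if j == i
    shared = tokens[i] & ts
    score  = shared.inject(0.0) { |sum, t| sum + 1.0 / doc_freq[t] }
    scores << [names[j], score] if score > 0
  end
  scores.sort_by { |pair| -pair[1] }
end

p related(0, names, tokens, doc_freq)   # names most related to the first one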

matthew smillie.
 
chrisjroos

Ok, so I'm thinking (taking into account what Matthew Smillie has said)
that it's due to the number of 'unique' words found in my documents -
4343 unique words found in 3320 product names.

I have actually spent the morning looking at ferret and some other text
searching libraries and at the moment I think ferret will do what I
require.

Thanks for the help folks.

Chris
 
Cameron McBride

Ok, so I'm thinking (taking into account what Matthew Smillie has said)
that it's due to the number of 'unique' words found in my documents -
4343 unique words found in 3320 product names.

I have actually spent the morning looking at ferret and some other text
searching libraries and at the moment I think ferret will do what I
require.

sounds like the right approach, glad you found something that worked.

Cameron
 
Dave Fayram

Ok, so it took just over 7 hours to build the index of 3000 items.

Depending on the size of your inputs, it can take longer.
Classifier::LSI has a very specific sweet spot of data volume in its
current incarnation. It's not currently suited for very large loads. I
am currently working on a version that will work with MPI so that
clusters can attack the problem. LSI just isn't useful for large
datasets without _very_ hefty hardware.

But if your matrix is only 3000x4000-ish, it shouldn't take that long.
Make sure that your classifier is binding to gsl, and if you need to,
email me and we can pursue it further. I only experience a breakdown in
speed when my unique-words list is in the 15k+ range, with more than
1500 items.
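
A quick way to check whether the gsl bindings are even loadable in your setup
(classifier falls back to its pure-Ruby vectors when they aren't) is just:

require 'rubygems'
require 'gsl'   # if this raises LoadError, Classifier::LSI is running pure Ruby
puts 'gsl bindings loaded OK'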

Also, remember that LSI wants to work with the *semantics*. A list of
product titles isn't going to bring very much semantic data into the
system. The way I originally envisioned using LSI was to create
summaries of documents and to work on sets of document abstracts to
rapidly provide relevant ACM papers. :)
 
Cameron McBride

I have actually spent the morning looking at ferret and some other text
searching libraries and at the moment I think ferret will do what I
require.

There is also a Classifier::Bayes available that might be more in sync
with what you're trying to do.
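
For example, a Bayes classifier works roughly like this (method names from
memory, and the categories here are made up; check the docs):

require 'rubygems'
require 'classifier'

b = Classifier::Bayes.new 'Sanding', 'Drilling'   # whatever categories make sense
b.train_sanding  '80 grit sanding disc'
b.train_drilling 'cordless drill 18v'

puts b.classify('orbital sander')   # => "Sanding"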

Cameron
 
