[ANN] Ferret 0.1.0 (Port of Java Lucene) released

David Balmain

Hi Folks,

I know there have been at least a few people looking for something like this
on the mailing list, so please check it out. It's a port of a Java project,
so I'd particularly like to hear how I can make it more Ruby-like. Enjoy!

Dave Balmain

== Description

Ferret is a full port of the Java Lucene searching and indexing library.
It's available as a gem, so try it out! To get started quickly, read the
quick start at the project homepage:

http://ferret.davebalmain.com/trac/

== Quick (Very Simple) Example

  require 'ferret'

  include Ferret

  docs = [
    { :title => "The Pragmatic Programmer",
      :author => "Dave Thomas, Andy Hunt",
      :tags => "Programming, Broken Windows, Boiled Frogs",
      :published => "1999-10-13",
      :content => "Yada yada yada ..."
    },
    { :title => "Programming Ruby",
      :author => "Dave Thomas, Chad Fowler, Andy Hunt",
      :tags => "Ruby",
      :published => "2004-10-06",
      :content => "Yada yada yada ..."
    },
    { :title => "Agile Web Development with Rails",
      :author => "Dave Thomas, David Heinemeier Hansson, Leon Breedt, Mike Clark, Thomas Fuchs, Andreas Schwarz",
      :tags => "Ruby, Rails, Web Development",
      :published => "2005-07-13",
      :content => "Yada yada yada ..."
    },
    { :title => "Ruby, Developer's Guide",
      :author => "Robert Feldt, Lyle Johnson, Michael Neumann",
      :tags => "Ruby, Racc, GUI, FOX",
      :published => "2002-10-06",
      :content => "Yada yada yada ..."
    },
    { :title => "Lucene In Action",
      :author => "Otis Gospodnetic, Erik Hatcher",
      :tags => "Lucene, Java, Search, Indexing",
      :published => "2004-12-01",
      :content => "Yada yada yada ..."
    }
  ]

  index = Index::Index.new()

  docs.each {|doc| index << doc }

  puts index.size

  puts "\nFind all documents on ruby:-"
  index.search_each("tags:Ruby") do |doc, score|
    puts "Document <#{index[doc]["title"]}> found with a score of %0.2f" % score
  end

  puts "\nFind all documents on ruby published this year:-"
  index.search_each("tags:ruby AND published: >= 2005") do |doc, score|
    puts "Document <#{index[doc]["title"]}> found with a score of %0.2f" % score
  end

  puts "\nFind all documents by the Pragmatic Programmers:-"
  index.search_each('author:("dave Thomas" AND "Andy hunt")') do |doc, score|
    puts "Document <#{index[doc]["title"]}> found with a score of %0.2f" % score
  end

Tobias Luetke

Amazing,

This is a tremendous gift to the Ruby on Rails crowd. Lucene was
probably *the* library which was most missed in the Ruby world.

Cheers for such an amazing port.
 
David Balmain

Thanks for pointing that out. I missed it. I do think it's a great idea for
a quiz though. It will be interesting to see what people come up with in, say,
50 lines as opposed to 10,000. And I'll be able to see who to hit up for
some help. ;-)

Dave

Bob Hutchison

This is great news! I'm installing it now, and will have a go at it
whenever the gem shows up :)

The link to the tutorial embedded on your intro page is incorrect
(the one in the TOC at the top right works though)

Cheers,
Bob

George Moschovitis

Can't wait to try this!

thanks,
George.

David Balmain

> The link to the tutorial embedded on your intro page is incorrect
> (the one in the TOC at the top right works though)

Bob, thanks for that, the link is fixed now. Unfortunately my
packaging system is a bit broken. I think a few files got left out so
please wait for version 0.1.1. It'll be ready in a couple of hours.

Regards,
Dave
 
Devin Mullins

Question for those (soon to be) in the know:

How does this compare to (Estraier/Hyper
Estraier/Ruby-Odeum/SimpleSearch/other 'IR' systems with Ruby
bindings?) on (ease of learning/ease of use/ease of
maintenance/speed/any other noteworthy attributes)? To put it simply,
which one should I choose?* :)

(Well, speed's pretty well covered on the home page, though I'm not sure
how much faster Hyper Estraier is than Lucene, and not sure how much
slower SimpleSearch is than Ferret. I only ask because it's the thing
people in the position of questioner are /supposed/ to do.)

Feel free to answer whatever part of that you (want/know), or just tell
me to fork off... a thread.

*For those actually interested in answering that question, it'll be an
intranet app that won't likely get a major amount of hits, but will
likely have a major amount of data. Right now, I'm just looking to make
a rough prototype in a week, but wouldn't mind picking a contender, if
quickly-pickuppable.

(Devin/twifkak)
 
Norjee

At the moment I'm using JRuby to get at Lucene search, but this definitely
sounds great!!! Do you have any plans to include Snowball stemmers? As my
documents are not English, I could use them ;)
 
Norjee

I've now found Ruby's stemmer4r, which wraps the Snowball stemmers. I can't
wait to use this ;)
 
Norjee

> Atm i'm using jruby to use lucene search, but this definitely sounds
> great!!! Do you have any plans to include snowball stemmers? As my
> documents are not english, i could use them ;)

I just recall ruby's stemmer4r, which wraps the snowball stemmers. I can't
wait to use this ;)
 
David Balmain

> Atm i'm using jruby to use lucene search, but this definitely sounds
> great!!! Do you have any plans to include snowball stemmers? As my
> documents are not english, i could use them ;)

Hi Norjee,
I've looked at the snowball parser and I don't think it would be too hard to
do a pure Ruby version of this if enough people are interested. But that is
pretty low on my to-do list, so I hope stemmer4r will do for now. I also hope
that you won't be needing Unicode support, as that is one of the things that
is missing in Ferret. Speaking of which, does anyone know of any good Ruby
Unicode tutorials?

Dave

David Balmain

> Question for those (soon to be) in the know:
>
> How does this compare to (Estraier/Hyper
> Estraier/Ruby-Odeum/SimpleSearch/other 'IR' systems with Ruby
> bindings?) on (ease of learning/ease of use/ease of
> maintenance/speed/any other noteworthy attributes)? To put it simply,
> which one should I choose?* :)


Hi Devin,
I'm afraid I've only briefly looked at those other IR systems, but I'll try
to answer your question as best I can. I think Ferret is currently pretty
easy to learn and use through the Index interface as described in my
original post. I don't think ease of use should turn you off. Once I've done
a bit more work on the documentation, I think it'll be a lot easier to find
your way around than some of the other ones. But it'll be significantly
slower than the C library backed search engines. I'm certainly not the type
of person to say speed isn't important; however, I think Ferret should
easily handle the kind of website you are talking about.

Ferret should be a lot faster than SimpleSearch for large document sets.
Having said that, there is a Ruby Quiz coming up for which I intend to write
a quick and simple search engine that will easily outperform SimpleSearch,
so if people are interested, I might make that a project too.
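
For readers wondering what a "quick and simple search engine" of that size might look like, here is a minimal pure-Ruby sketch of an in-memory inverted index. It is only an illustration: it is not Ferret's implementation and not the quiz entry mentioned above, and the class name TinyIndex and all of its methods are made up for this example.

```ruby
# Hypothetical illustration: a tiny in-memory inverted index.
class TinyIndex
  def initialize
    @docs = []                                 # doc id => original text
    @postings = Hash.new { |h, k| h[k] = [] }  # term => ids of docs containing it
  end

  # Add a document; returns the index itself so calls can be chained.
  def <<(text)
    id = @docs.size
    @docs << text
    tokenize(text).uniq.each { |term| @postings[term] << id }
    self
  end

  def size
    @docs.size
  end

  def [](id)
    @docs[id]
  end

  # Ids of the documents that contain every term in the query (boolean AND).
  def search(query)
    terms = tokenize(query).uniq
    return [] if terms.empty?
    terms.map { |t| @postings.fetch(t, []) }.inject { |a, b| a & b }
  end

  private

  # Lowercase the text and split it into runs of letters and digits.
  def tokenize(text)
    text.downcase.scan(/[a-z0-9]+/)
  end
end

index = TinyIndex.new
["Programming Ruby",
 "Agile Web Development with Rails",
 "Ruby, Developer's Guide"].each { |title| index << title }

index.search("ruby").each { |id| puts index[id] }
```

A sketch like this has no scoring, phrase queries, or on-disk storage, which is roughly where a Lucene-style library starts to earn its extra lines.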

== As for the others, the main advantages of Ferret are:

* a more powerful, extendable query language. You can do boolean, phrase,
range, fuzzy (for misspellings etc.), wildcard, sloppy phrase (out of order
phrases) and more. Check out the Query Parser in the API for more info on
the query language.
http://ferret.davebalmain.com/api/classes/Ferret/QueryParser.html

* a more powerful document structure. I could be wrong about this, so someone
please correct me if I am, but I think most of the other IR systems just take
a string as a document. Ferret's documents can have multiple fields. Each field
can have a different analyzer (which parses the field into tokens). You can
store binary fields like images or compress your data. In fact, you could do
away with a database altogether and just use Ferret. (You can also store term
vectors if you want to compare document similarities, but that's getting
pretty technical.)

* Ferret is pure Ruby (at least it can be if you don't install the C
extension) so it'll run anywhere Ruby does.

* If you are patient, Ferret will one day match or beat the speed of those
other search engines. Hopefully by Christmas, but it all depends how much
help I can get between now and then.
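
The idea of multi-field documents with a different analyzer per field can be sketched in plain Ruby. This is an illustration under stated assumptions, not Ferret's API: FieldedIndex, ANALYZERS, and the three analyzers are hypothetical names chosen for this example.

```ruby
# Hypothetical illustration: each field is turned into tokens by its own analyzer.
ANALYZERS = {
  # Titles: lowercase and split into words.
  title: ->(s) { s.downcase.scan(/\w+/) },
  # Tags: one token per comma-separated tag, so "Web Development" stays whole.
  tags: ->(s) { s.split(",").map { |t| t.strip.downcase } },
  # Dates: keep the whole value as a single token.
  published: ->(s) { [s] }
}

class FieldedIndex
  # Fallback used for fields that have no analyzer of their own.
  DEFAULT_ANALYZER = ->(s) { s.downcase.split }

  def initialize(analyzers)
    @analyzers = analyzers
    @docs = []
    @postings = Hash.new { |h, k| h[k] = [] }  # [field, term] => doc ids
  end

  # Add a document given as a hash of field => string.
  def <<(doc)
    id = @docs.size
    @docs << doc
    doc.each do |field, value|
      analyzer = @analyzers.fetch(field, DEFAULT_ANALYZER)
      analyzer.call(value).uniq.each { |term| @postings[[field, term]] << id }
    end
    self
  end

  def [](id)
    @docs[id]
  end

  # Ids of the documents whose given field produced exactly this token.
  def search(field, term)
    @postings.fetch([field, term], [])
  end
end

index = FieldedIndex.new(ANALYZERS)
index << { title: "Programming Ruby", tags: "Ruby", published: "2004-10-06" }
index << { title: "Agile Web Development with Rails",
           tags: "Ruby, Rails, Web Development", published: "2005-07-13" }

index.search(:tags, "web development").each { |id| puts index[id][:title] }
```

Because the analyzer is chosen per field, the same text can match differently depending on where it sits: "web" matches the second title but is not a tag on its own.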

== And the main disadvantages:

* Ferret is still alpha and has not been put into production yet. Hopefully
that will change soon.

* Ferret is currently slower than the C backed IR systems.


Anyway, sorry for such a long email. It's really hard to describe all the
features available. In fact, there is a whole book on Lucene by Erik Hatcher
and Otis Gospodnetic which I highly recommend if you want to take full
advantage of all the features in Ferret. Most of the examples should
translate pretty easily into Ruby.

Please let me know if you have any more questions.
Regards,
Dave

David Balmain

> I'm probably wrong, but I thought we already had
> a port of this? A thing called Rucene?


Yes, and also one called rubylucene. Unfortunately Erik Hatcher never had the
time to get those projects off the ground. Hopefully he'll have time to help
me out now that the port is finished though. ;)

David Balmain

> Have you looked into what it would take to allow all text to be UTF-8?
> Would it take a complete overhaul or a somewhat-minor tweak?
> If a tweak, what would you charge to make that lovely update? :)


I'm looking into it now. I'll send you the bill. ;-)

Miles Keaton

one correction:
RegExpTokenizer should be RETokenizer
(at least in the version you've released publicly)

> require 'rubygems'
> require 'ferret'
> include Ferret
>
> class ChineseAnalyzer
> def token_stream(field, string)
> tokenizer = Analysis::RegExpTokenizer.new(string)
> class <<tokenizer
> def token_re() /./ end
> end
> return tokenizer
> end
> end
>
> docs = ["道德經", "搜索所有网页", "搜索所有中文网页", "搜索简体中文网页"]
> index = Index::Index.new(:analyzer => ChineseAnalyzer.new)
> docs.each { |doc| index << doc }
> puts index[3][""]
> tq = Search::TermQuery.new(Index::Term.new("", "网"))
> index.search_each(tq) do |doc, score|
> puts "Document #{doc} found with score #{score}"
> end
> index.close
 
