full-text indexing of pdf, rtf, txt, html

Dizzy Haze

I have a big pile of files on my local machine that are in a variety of
formats - txt, rtf, pdf, html, etc. What I'm looking for is a script
that will crawl through the files and perform simple full-text indexing
on them, and will allow for queries to be executed on the index.

Any recommendations?

thx.
 
Lars Kellogg-Stedman

> Any recommendations?

http://swish-e.org/ is easy to implement, well documented, and has a
fairly active community around it. It works well on moderate-sized
document collections, but doesn't scale to "huge". No, I can't quantify
that. This is what I use to index my email.

ObPerl: Swish-E comes with Perl bindings.

I've heard good things about Lucene
(http://lucene.apache.org/java/docs/), but I haven't tried it myself.

Also:

http://freshmeat.net/search/?q=text+index
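
For a sense of what "crawl, index, query" involves before reaching for one of the tools above, here is a minimal sketch in Python using only the standard library. It is not how Swish-E or Lucene work internally; it is just a plain in-memory inverted index. The function names (`extract_text`, `build_index`, `search`) are made up for this illustration, and it only handles txt and html. PDF and RTF are skipped on the assumption that you would convert them to text with an external tool first.

```python
import os
import re
from collections import defaultdict
from html.parser import HTMLParser

TOKEN_RE = re.compile(r"[a-z0-9]+")


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(path):
    """Return plain text for .txt/.html/.htm files, or None if unsupported."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in (".txt", ".html", ".htm"):
        return None  # assumption: pdf/rtf need an external converter
    with open(path, encoding="utf-8", errors="replace") as f:
        raw = f.read()
    if ext == ".txt":
        return raw
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return TOKEN_RE.findall(text.lower())


def build_index(root):
    """Walk root recursively and map each word to the set of files containing it."""
    index = defaultdict(set)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            text = extract_text(path)
            if text is None:
                continue
            for word in tokenize(text):
                index[word].add(path)
    return index


def search(index, query):
    """Return the files that contain every word in the query (AND semantics)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Something like `search(build_index("/path/to/docs"), "swish perl")` would then return the set of matching file paths. The index lives in memory and is rebuilt on every run, there is no ranking or stemming, and text inside script or style elements gets indexed along with everything else, so this only makes sense for small collections.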

-- Lars
 
Lars Kellogg-Stedman

> Does this program originate from West L.A. or San Francisco?

I'm sure I'm going to regret this, but what reference am I missing?

-- Lars
 
Lars Kellogg-Stedman

> Does this program originate from West L.A. or San Francisco?
> I'm sure I'm going to regret this, but what reference am I missing?

Oh, duh, never mind. Slow on the uptake.

-- Lars
 
