Client-side search engine capable of indexing .pdf files is needed.

gordom

Hi everyone.
I'm preparing a CD/DVD presentation containing HTML and PDF documents. I
would like to implement an offline search engine (the content of a few
hundred PDF files should be indexed).
The script must be free of charge. I googled for a while and found
only one truly free solution capable of indexing an unlimited number of
.pdf files: JSSINDEX (http://jssindex.sourceforge.net/). Its
indexer runs in a Linux environment. Do you know of any other script that
could run under Windows during development? Thanks in advance,

gordom
 
SAM

On 9/3/09 1:52 AM, Stefan Weiss wrote:
I think it's amazing enough that you even found a tool like that; I
didn't know they existed, and I don't know of any alternatives. A few
years ago, I wrote a documentation system which included a JS search
index, but that wasn't anywhere near a full-text index, just class names
and method names. Anyway, Linux is free, and it only takes about 10
minutes to get it up and running on a virtual machine on Windows. If
JSSINDEX does what you need, I'd say use it.
Every other project with an offline search engine that I've seen or
worked on included an executable component which had to be installed
before use.

Can't the files be indexed beforehand into a JSON array?
Maybe that can be done in PHP?

What about XML/XSL?
 
-TNO-

I'm preparing a CD/DVD presentation containing HTML and PDF documents. I
would like to implement an offline search engine (the content of a few
hundred PDF files should be indexed).
The script must be free of charge. I googled for a while and found
only one truly free solution capable of indexing an unlimited number of
.pdf files: JSSINDEX (http://jssindex.sourceforge.net/). Its
indexer runs in a Linux environment. Do you know of any other script that
could run under Windows during development?

Would it be practical to simply use JScript in WSH to run the
DOS FIND command with parameters behind the scenes? You may be able to
come up with a half-decent implementation. If memory serves,
the general approach is something like this (untested):

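// Run with cscript.exe under Windows Script Host, not from a browser page.
var wShell = new ActiveXObject("WScript.Shell");
// FIND takes file names rather than a bare folder, hence the wildcard
// (the *.txt pattern is only a placeholder).
var oExec = wShell.Exec("FIND /N /I \"Array\" C:\\Documents\\*.txt");

// ReadAll() returns once FIND has finished and closed its output, so no
// polling loop on oExec.Status is needed (polling first can stall when
// FIND writes more than the pipe buffer holds).
WScript.StdOut.Write(oExec.StdOut.ReadAll());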
 
SAM

On 9/3/09 5:14 AM, Stefan Weiss wrote:
The index I created was similar to that in principle (but why use JSON
strings when you can create object literals).

Because JSON was made precisely to help with data manipulation
(or so I believe).

I'm not sure I understand what you mean by indexing files.
If it is only a matter of listing the names of the PDF files stored in a
folder (on the CD), the browser should be able to display that itself.
And that window surely has a search button, no?

On my Fx I can even sort the files by name, size, date.

However, I understand that the index is built before the CD is burned.
So where exactly is the problem?
I think you can put the "data" in an HTML table that, with a bit of
JS, can be sorted by column.
Searching this table (or the underlying object) for one or more words, to
show the files containing them, should not be very difficult.
Of course it can be done, but I don't know of any package which would
index several hundred PDF files, create a usable JS index, and provide a
front end. Looks like JSSINDEX does that (except it uses Lush instead of
PHP).

As I have neither Windows nor Linux (and if I did, I wouldn't install
something just to have a look), I can't judge the advantage of this tool.

The database (the array of indexed files) could be created in XML
and the engine written in XSL.
What remains is to write the JS that drives it all.
(I once saw an application like that, but it may not work everywhere.)
 
gordom

I doubt gordom was interested in a list of file names.

Exactly. I want to have the content of the pdf files to be indexed. I
would like to provide the user with the ability to search the content of
the .pdf files for any phrase he would like to.

gordom

P.S. Thanks for all your comments.
 
SAM

On 9/3/09 4:58 PM, gordom wrote:
Exactly. I want to have the content of the pdf files to be indexed. I
would like to provide the user with the ability to search the content of
the .pdf files for any phrase he would like to.

Spotlight?
 
SAM

On 9/3/09 3:43 PM, Stefan Weiss wrote:
I doubt gordom was interested in a list of file names. Creating a full
text index is quite a bit more complex than simply listing directory
contents. <http://fr.wikipedia.org/wiki/Indexation>

Well, who will choose the terms to index?
Who will build, for each file, its own array of terms?
Who will build the links for each term (to the files and inside them)?

Once the data are complete and held in an object (or a
simple array), I suppose that most of the job is done.

<http://cjoint.com/?jdvO4bUE6Q> 1500 items
(without index ... not in SANstore)
 
SAM

On 9/4/09 1:11 AM, Stefan Weiss wrote:
On 9/3/09 3:43 PM, Stefan Weiss wrote:
Well, who will choose the terms to index?
Who will build, for each file, its own array of terms?
Who will build the links for each term (to the files and inside them)?

The indexer will do all of that.
Once the data are complete and held in an object (or a
simple array), I suppose that most of the job is done.

Not necessarily. You need both parts for an efficient search engine: the
index and the lookup algorithm. The index lookup needs to be fast, and
able to sort the results in a meaningful way.
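
To make the "both parts" point concrete, here is a minimal sketch, not taken from JSSINDEX or any other tool. It assumes the PDF text has already been extracted into strings; the docs array, buildIndex and rankedSearch are hypothetical names, and the "ranking" is just a sum of occurrence counts.

```javascript
// Hypothetical extracted text; a real indexer would pull this from the PDFs.
var docs = [
  { file: "a.pdf", text: "search index search engine" },
  { file: "b.pdf", text: "offline search on a CD" },
  { file: "c.pdf", text: "burning a CD" }
];

// Part 1, the index: term -> { document position: occurrence count }.
function buildIndex(docs) {
  var index = Object.create(null); // no inherited keys such as "constructor"
  docs.forEach(function (doc, id) {
    doc.text.toLowerCase().split(/\s+/).forEach(function (word) {
      if (!word) return;
      if (!index[word]) index[word] = {};
      index[word][id] = (index[word][id] || 0) + 1;
    });
  });
  return index;
}

// Part 2, the lookup: add up the counts per document, best score first.
function rankedSearch(index, docs, query) {
  var scores = {};
  query.toLowerCase().split(/\s+/).forEach(function (word) {
    var postings = index[word] || {};
    for (var id in postings) scores[id] = (scores[id] || 0) + postings[id];
  });
  return Object.keys(scores)
    .sort(function (x, y) { return scores[y] - scores[x]; })
    .map(function (id) { return docs[id].file; });
}

var index = buildIndex(docs);
console.log(rankedSearch(index, docs, "search engine")); // a.pdf first, then b.pdf
```

The index would be built once, before burning the CD, and shipped as a plain script file; only rankedSearch would run in the browser.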
<http://cjoint.com/?jdvO4bUE6Q> 1500 items
(without index ... not in SANstore)

| var liste = [
| '00.htm',
| '000.htm',
| '0000000000000001.txt',
| '001.htm',
| '12-1.gif',
| '20-100_100tre.htm',
| '20-100_100tre2.htm',

That's just a list of file names again, not a full text index. It has
only 1500 entries, which isn't even close to what we're dealing here.

It has 1500 entries; will the CD contain more than 1500 files?
These are simple entries, but they could just as well have been lines of a
CSV file, each line being a record for one file with its name, date, list
of indexed terms, short introduction ...
I didn't understand the "not in SANstore" part - how is that relevant?

I don't have a more complicated example in stock (in store? in SAM's shop).
If you have one, I'd be glad to see it.

Searching this list for one or more terms is very fast because we only
have to keep each line containing one of the terms: a single loop over
the 1500 lines (or entries). The new list of files, expected to be
relatively short, can then easily be manipulated to show what is wanted.
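
A rough sketch of that single loop, for what it's worth. The records array and the scan function are made up for illustration; each record stands for one "card" with a file name and its list of terms.

```javascript
// Hypothetical data: one record per file, names and terms invented.
var records = [
  { file: "intro.pdf",   terms: "overview installation licence" },
  { file: "manual.pdf",  terms: "installation configuration search index" },
  { file: "changes.pdf", terms: "history release notes" }
];

// Return the files whose term list contains at least one query word.
function scan(records, query) {
  var words = query.toLowerCase().split(/\s+/);
  var hits = [];
  for (var i = 0; i < records.length; i++) {
    // Pad with spaces so only whole words match.
    var haystack = " " + records[i].terms.toLowerCase() + " ";
    for (var j = 0; j < words.length; j++) {
      if (words[j] && haystack.indexOf(" " + words[j] + " ") !== -1) {
        hits.push(records[i].file);
        break; // one matching word is enough
      }
    }
  }
  return hits;
}

console.log(scan(records, "installation")); // intro.pdf and manual.pdf
```

The cost grows with the total length of all term lists, so this is fine for a list of file names but gets slower as the full text of the PDFs is added to each record.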

As for indexing the terms found in the files, I suppose we could
keep them in an array:
terms = [
'add 12 125 956',
'addition 1 8 274 315 977 1235',
...
where the numbers are the indexes of the matching files, stored in
another array.
This method ought to be faster.
Maybe it takes more room in memory? Not sure.
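
Here is how such a term list might be used, as a sketch only. The file names and numbers are invented, buildLookup and search are hypothetical helpers, and requiring every query word to match (rather than any of them) is my own choice here.

```javascript
// Each entry: a term followed by the positions of the files containing it.
var files = ["a.pdf", "b.pdf", "c.pdf", "d.pdf"];
var terms = [
  "add 0 2 3",
  "addition 1 2",
  "index 2 3"
];

// Turn the string list into an object for direct lookup by term.
function buildLookup(terms) {
  var lookup = Object.create(null); // no inherited keys such as "constructor"
  for (var i = 0; i < terms.length; i++) {
    var parts = terms[i].split(" ");
    var ids = [];
    for (var j = 1; j < parts.length; j++) ids.push(parseInt(parts[j], 10));
    lookup[parts[0]] = ids;
  }
  return lookup;
}

// Files containing every word of the query (intersection of the id lists).
function search(lookup, files, query) {
  var words = query.toLowerCase().split(/\s+/);
  var result = null;
  for (var i = 0; i < words.length; i++) {
    if (!words[i]) continue;
    var ids = lookup[words[i]] || [];
    result = result === null ? ids.slice() : result.filter(function (id) {
      return ids.indexOf(id) !== -1;
    });
  }
  return (result || []).map(function (id) { return files[id]; });
}

var lookup = buildLookup(terms);
console.log(search(lookup, files, "add index")); // c.pdf and d.pdf
```

Each query word costs one property lookup instead of a pass over every entry. On memory: the term list holds each distinct term once plus one number per file it occurs in, so its size depends on the vocabulary rather than on the raw length of the texts.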
Regarding your other post: Spotlight is only available on OSX, and
(AFAIK) doesn't have a JavaScript front-end. It may be possible to burn
its index to a CD, but without the Spotlight executable, that won't
help much.

At least that could be a solution for a specific environment ;-)
TNO's suggestion has a similar problem: it requires WSH to be installed
and accessible from an HTML page (unlikely). It will be awfully slow as
well, because each search will have to read the complete contents of the
CD.

I suppose that it would be better to have all the content held in memory.
And then it probably won't find "à bientôt" because the source
encoding doesn't match the search encoding.

Once RegExp provides for \w to cover not only ASCII characters but
those of more complete character sets, perhaps we will be able to match
!English words more seriously (or easily), even if the search functions
were written by an illiterate guy from the US.
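
A small illustration of the \w limitation. The second pattern relies on Unicode property escapes, which came with ES2018 and so were not available when this thread was written; treat it as a note for current engines only.

```javascript
// "à bientôt", written with escapes so the file's own encoding cannot interfere.
var text = "\u00e0 bient\u00f4t";

// \w only covers [A-Za-z0-9_], so the accented letters split the words.
var asciiWords = text.match(/\w+/g);

// \p{L} with the u flag matches letters of any script (ES2018 and later).
var unicodeWords = text.match(/\p{L}+/gu);

console.log(asciiWords);   // the two fragments "bient" and "t"
console.log(unicodeWords); // the two whole words
```

An indexer that splits words with \w would therefore store "bient" and "t", and a search for "bientôt" would find nothing.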
JSSINDEX still looks like the way to go (didn't test it, though). BTW, I
just checked, Lush is available as Debian and Ubuntu packages. If there
aren't any other requirements, getting the indexer to work should be a
piece of cake.

Something in Ruby?
<http://books.google.fr/books?id=OBh...esult&ct=result&resnum=8#v=onepage&q=&f=false>
 
Dr J R Stockton

In comp.lang.javascript, on Fri, 4 Sep 2009 03:51:06, SAM posted:
Once RegExp provides for \w to cover not only ASCII characters but
those of more complete character sets, perhaps we will be able to match
!English words more seriously (or easily), even if the search functions
were written by an illiterate guy from the US.

Too much would break if the character set of \w \W were changed.

The answer, clearly, is to use a non-English w W. But there is no need
to use a non-contiguous language - there is Cymraeg.

Unicode has \u0175 = ŵ, as in the word dw^r (which this newsreader
does not consider transmissible) and Ŵ as well. Those would do
nicely.

Does French Brythonic use it?
 
