Client-side search engine capable of indexing .pdf files is needed.

Discussion in 'Javascript' started by gordom, Sep 3, 2009.

  1. gordom

    gordom Guest

    Hi everyone.
    I'm preparing a CD/DVD presentation containing HTML and PDF documents. I
    would like to implement an offline search engine (the content of a few
    hundred PDF files should be indexed).
    The script must be free of charge. I googled for a while and found
    only one truly free solution capable of indexing an unlimited number of
    .pdf files. It's JSSINDEX (http://jssindex.sourceforge.net/). Its
    indexer runs in a Linux environment. Do you know of any other script whose
    indexer could run under Windows during development? Thanks in advance,

    gordom
    gordom, Sep 3, 2009
    #1

  2. gordom

    SAM Guest

    On 9/3/09 1:52 AM, Stefan Weiss wrote:
    > On 03/09/09 01:33, gordom wrote:
    >> I'm preparing a CD/DVD presentation containing html and pdf documents. I
    >> would like to implement an offline search engine (the content of few
    >> hundreds of pdf files should be indexed).
    >> The script must be free of charge. I was googling for a while and found
    >> only one truly free solution capable of indexing unlimited numbers of
    >> .pdf files. It's JSSINDEX (http://jssindex.sourceforge.net/). Its
    >> indexer runs in Linux environment. Do you know any other script that
    >> could run under Windows while being developed?

    >
    > I think it's amazing enough that you even found a tool like that; I
    > didn't know they existed, and I don't know of any alternatives. A few
    > years ago, I wrote a documentation system which included a JS search
    > index, but that wasn't anywhere near a full-text index, just class names
    > and method names. Anyway, Linux is free, and it only takes about 10
    > minutes to get it up and running on a virtual machine on Windows. If
    > JSSINDEX does what you need, I'd say use it.
    > Every other project with an offline search engine that I've seen or
    > worked on included an executable component which had to be installed
    > before use.


    Couldn't the files be indexed beforehand into a JSON array?
    Maybe that can be done in PHP?

    What about XML/XSL?
    --
    sm
    SAM, Sep 3, 2009
    #2

  3. gordom

    -TNO- Guest

    Re: Client-side search engine capable of indexing .pdf files is needed.

    On 03/09/09 01:33, gordom wrote:
    > I'm preparing a CD/DVD presentation containing html and pdf documents. I
    > would like to implement an offline search engine (the content of few
    > hundreds of pdf files should be indexed).
    > The script must be free of charge.  I was googling for a while and found
    > only one truly free solution capable of indexing unlimited numbers of
    > .pdf files. It's JSSINDEX (http://jssindex.sourceforge.net/). Its
    > indexer runs in Linux environment. Do you know any other script that
    > could run under Windows while being developed?


    Would it be practical to simply use JScript in the WSH to execute the
    DOS FIND command with parameters behind the scenes? You may be able to
    come up with some half-decent implementation. If memory serves,
    the general approach is something like this (untested):

    // Run with cscript.exe; WScript.StdOut is not usable under wscript.exe.
    var wShell = new ActiveXObject("WScript.Shell");
    // FIND expects file names (wildcards allowed), not a bare folder path.
    var oExec = wShell.Exec("FIND /N /I \"Array\" C:\\Documents\\*.*");

    // ReadAll() blocks until FIND exits, so no polling loop is needed;
    // polling Status first can hang once the output pipe buffer fills up.
    WScript.StdOut.Write(oExec.StdOut.ReadAll());
    -TNO-, Sep 3, 2009
    #3
  4. gordom

    SAM Guest

    On 9/3/09 5:14 AM, Stefan Weiss wrote:
    > On 03/09/09 02:42, SAM wrote:
    >> On 9/3/09 1:52 AM, Stefan Weiss wrote:
    >>>> I'm preparing a CD/DVD presentation containing html and pdf documents. I
    >>>> would like to implement an offline search engine (the content of few
    >>>> hundreds of pdf files should be indexed).
    >>>> The script must be free of charge. I was googling for a while and found
    >>>> only one truly free solution capable of indexing unlimited numbers of
    >>>> .pdf files. It's JSSINDEX (http://jssindex.sourceforge.net/). Its
    >>>> indexer runs in Linux environment. Do you know any other script that
    >>>> could run under Windows while being developed?

    >> Couldn't the files be indexed beforehand into a JSON array?

    >
    > The index I created was similar to that in principle (but why use JSON
    > strings when you can create object literals).


    Because JSON was made precisely to help with data manipulation.
    (Or so I believe.)
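
    To make the two options concrete, here is a small sketch (an illustration only, not code from either poster). The term names and file numbers are invented, and it assumes a browser or engine with a native `JSON.parse`:

```javascript
// Hypothetical index data: term -> list of file numbers.
// Variant A: an object literal, usable as-is when the file is loaded via <script src>.
const indexLiteral = {
  add: [12, 125, 956],
  addition: [1, 8, 274]
};

// Variant B: the same data as a JSON string, which must be parsed before use.
const indexJsonText = '{"add":[12,125,956],"addition":[1,8,274]}';
const indexParsed = JSON.parse(indexJsonText);

// Both variants end up as the same in-memory structure.
console.log(JSON.stringify(indexLiteral) === JSON.stringify(indexParsed)); // prints true
```

    For an index shipped on a CD, the literal form has the practical advantage that a plain script include is enough to load it.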

    I'm not sure I understand what you mean by indexing files.
    If it is only about listing the names of the PDF files stored in a
    folder (on the CD), the browser must be able to display that.
    And in that window there is surely a search button, no?

    In my Firefox I can even sort the files by name, size, or date.

    However, I understand that the index is built before the CD is burned.
    So where exactly is the problem?
    I think you can put the "data" in an HTML table that, with a bit of
    JS, can be sorted by column.
    Searching this table (or the underlying data object) for one or more
    words to reveal the files containing them should not be very difficult.
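
    A rough sketch of that idea, working on the data object rather than on an actual HTML table. The record fields `name`, `size` and `terms` and the helper names are assumptions made up for this example:

```javascript
// Hypothetical file records; the field names are assumptions for illustration.
const files = [
  { name: 'report.pdf', size: 2048, terms: 'budget forecast' },
  { name: 'intro.pdf', size: 512, terms: 'welcome overview' },
  { name: 'annex.pdf', size: 1024, terms: 'budget tables' }
];

// Return a new array sorted by the given column (string or number values).
function sortByColumn(rows, column) {
  return rows.slice().sort(function (a, b) {
    if (a[column] < b[column]) return -1;
    if (a[column] > b[column]) return 1;
    return 0;
  });
}

// Keep only the rows whose term list contains the given word (case-insensitive).
function filterByWord(rows, word) {
  const needle = word.toLowerCase();
  return rows.filter(function (row) {
    return row.terms.toLowerCase().split(' ').indexOf(needle) !== -1;
  });
}

const hits = sortByColumn(filterByWord(files, 'Budget'), 'size');
console.log(hits.map(function (row) { return row.name; })); // [ 'annex.pdf', 'report.pdf' ]
```

    Rendering `hits` into table rows is left out; only the sorting and filtering are shown.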

    >> Maybe that can be done in PHP ?

    >
    > Of course it can be done, but I don't know of any package which would
    > index several hundred PDF files, create a usable JS index, and provide a
    > front end. Looks like JSSINDEX does that (except it uses Lush instead of
    > PHP).


    As I have neither Windows nor Linux (and even if I had, I wouldn't
    install something just to have a look), I can't see the advantage of
    this tool.

    >> What about xml/xsl ?

    >
    > ?


    The database (the array of indexed files) could be created in XML,
    with the engine in XSL.
    What remains is to write the JS that drives all of that.
    (I saw an application like that, but it may not work everywhere.)

    --
    sm
    SAM, Sep 3, 2009
    #4
  5. gordom

    gordom Guest


    >
    >> I'm not sure I understand what you mean by indexing files.
    >> If it is only about listing the names of the PDF files stored in a
    >> folder (on the CD), the browser must be able to display that.
    >> And in that window there is surely a search button, no?

    >
    > I doubt gordom was interested in a list of file names.


    Exactly. I want the content of the PDF files to be indexed. I
    would like to give the user the ability to search the content of
    the .pdf files for any phrase they like.

    gordom

    P.S. Thanks for all your comments.
    gordom, Sep 3, 2009
    #5
  6. gordom

    SAM Guest

    On 9/3/09 4:58 PM, gordom wrote:
    >
    >>
    >>> I'm not sure I understand what you mean by indexing files.
    >>> If it is only about listing the names of the PDF files stored in
    >>> a folder (on the CD), the browser must be able to display that.
    >>> And in that window there is surely a search button, no?

    >>
    >> I doubt gordom was interested in a list of file names.

    >
    > Exactly. I want to have the content of the pdf files to be indexed. I
    > would like to provide the user with the ability to search the content of
    > the .pdf files for any phrase he would like to.


    Spotlight?

    --
    sm
    SAM, Sep 3, 2009
    #6
  7. gordom

    SAM Guest

    On 9/3/09 3:43 PM, Stefan Weiss wrote:
    > On 03/09/09 14:56, SAM wrote:
    >> On 9/3/09 5:14 AM, Stefan Weiss wrote:
    >>> On 03/09/09 02:42, SAM wrote:
    >>>> On 9/3/09 1:52 AM, Stefan Weiss wrote:
    >>>>>> I'm preparing a CD/DVD presentation containing html and pdf documents. I
    >>>>>> would like to implement an offline search engine (the content of few
    >>>>>> hundreds of pdf files should be indexed).

    >
    >> I'm not sure I understand what you mean by indexing files.
    >> If it is only about listing the names of the PDF files stored in a
    >> folder (on the CD), the browser must be able to display that.
    >> And in that window there is surely a search button, no?

    >
    > I doubt gordom was interested in a list of file names. Creating a full
    > text index is quite a bit more complex than simply listing directory
    > contents. <http://fr.wikipedia.org/wiki/Indexation>


    Well, who will choose the terms to index?
    Who will build, for each file, its own array of terms?
    Who will build the links for each term (to the files and inside them)?

    Once the data are complete and held in an object (or a
    simple array), I suppose that most of the job is done.
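
    For what it's worth, those three jobs can be done by a small program run once before burning the CD. The sketch below is only an illustration under loud assumptions: the text has already been extracted from each PDF by some other tool (not shown), every word becomes a term, and `buildIndex` is a made-up helper, not part of JSSINDEX:

```javascript
// Hypothetical input: text already extracted from each PDF by some other tool.
const documents = [
  { file: 'a.pdf', text: 'Addition and subtraction' },
  { file: 'b.pdf', text: 'Add the numbers; addition is easy' }
];

// Build a map from each lowercase term to the list of document numbers
// (positions in the documents array) in which the term occurs.
function buildIndex(docs) {
  const index = Object.create(null); // no inherited keys such as "constructor"
  docs.forEach(function (doc, docNumber) {
    // ASCII-only word splitting, kept simple on purpose.
    const words = doc.text.toLowerCase().match(/[a-z0-9]+/g) || [];
    words.forEach(function (word) {
      if (!index[word]) {
        index[word] = [];
      }
      const list = index[word];
      // Record each document at most once per term.
      if (list[list.length - 1] !== docNumber) {
        list.push(docNumber);
      }
    });
  });
  return index;
}

const index = buildIndex(documents);
console.log(index.addition); // [ 0, 1 ]
console.log(index.add);      // [ 1 ]
```

    Links "inside" the files would additionally need word positions or page numbers per occurrence, which this sketch does not record.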

    <http://cjoint.com/?jdvO4bUE6Q> 1500 items
    (without index ... not in SANstore)
    --
    sm
    SAM, Sep 3, 2009
    #7
  8. gordom

    SAM Guest

    On 9/4/09 1:11 AM, Stefan Weiss wrote:
    > On 03/09/09 22:15, SAM wrote:
    >> On 9/3/09 3:43 PM, Stefan Weiss wrote:
    >>> On 03/09/09 14:56, SAM wrote:
    >>>> I'm not sure I understand what you mean by indexing files.
    >>>> If it is only about listing the names of the PDF files stored in a
    >>>> folder (on the CD), the browser must be able to display that.
    >>>> And in that window there is surely a search button, no?
    >>> I doubt gordom was interested in a list of file names. Creating a full
    >>> text index is quite a bit more complex than simply listing directory
    >>> contents. <http://fr.wikipedia.org/wiki/Indexation>

    >> Well, who will choose the terms to index?
    >> Who will build, for each file, its own array of terms?
    >> Who will build the links for each term (to the files and inside them)?

    >
    > The indexer will do all of that.
    >
    >> Once the data are complete and held in an object (or a
    >> simple array), I suppose that most of the job is done.

    >
    > Not necessarily. You need both parts for an efficient search engine: the
    > index and the lookup algorithm. The index lookup needs to be fast, and
    > able to sort the results in a meaningful way.
    >
    >> <http://cjoint.com/?jdvO4bUE6Q> 1500 items
    >> (without index ... not in SANstore)

    >
    > | var liste = [
    > | '00.htm',
    > | '000.htm',
    > | '0000000000000001.txt',
    > | '001.htm',
    > | '12-1.gif',
    > | '20-100_100tre.htm',
    > | '20-100_100tre2.htm',
    >
    > That's just a list of file names again, not a full text index. It has
    > only 1500 entries, which isn't even close to what we're dealing here.


    It has 1500 entries; will the CD contain more than 1500 files?
    These simple entries could just as well have been lines of a CSV file,
    each line being a record for one file with its name, date, list of
    indexed terms, short introduction ...

    > I didn't understand the "not in SANstore" part - how is that relevant?


    I don't have a more complicated example in stock (in store? in SAM's shop).
    If you have one, I'd be glad to see it.

    Searching this list for one or more terms is very fast because we only
    have to keep each line containing one of the terms: a single loop over
    the 1500 lines (or entries). The resulting list of files, expected to be
    relatively short, can then easily be manipulated to show whatever is wanted.
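
    The single loop described above might look roughly like this. The line format (file name, a semicolon, then the terms) and the function name `searchLines` are assumptions for the example:

```javascript
// Hypothetical entries, one line per file: name followed by its indexed terms.
const lines = [
  'intro.pdf;welcome overview',
  'report.pdf;budget forecast',
  'annex.pdf;budget tables'
];

// One pass over all lines, keeping each line that contains at least one term.
function searchLines(allLines, query) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const kept = [];
  for (let i = 0; i < allLines.length; i++) {
    const line = allLines[i].toLowerCase();
    for (let j = 0; j < terms.length; j++) {
      if (line.indexOf(terms[j]) !== -1) {
        kept.push(allLines[i]);
        break; // one matching term is enough for this line
      }
    }
  }
  return kept;
}

console.log(searchLines(lines, 'forecast tables'));
```

    Note that this is a plain substring match, so a query such as "pdf" would match every line through the file names; the cost also grows with the total length of all lines, which is where a real full-text index differs.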

    As for indexing the terms found in the files, I suppose we can
    have an array of them:
    terms = [
    'add 12 125 956',
    'addition 1 8 274 315 977 1235',
    ...
    where the numbers are indexes into another array that stores the
    files.
    This method ought to be faster.
    Maybe it takes more room in memory? Not sure.
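
    A possible way to query such an array, sketched under the assumption that each entry is a term followed by space-separated file numbers, as in the two sample entries. `parseTerms` and `lookup` are invented names:

```javascript
// Term entries in the format from the post: a term followed by file numbers.
const terms = [
  'add 12 125 956',
  'addition 1 8 274 315 977 1235'
];

// Parse the entries once into a lookup table: term -> array of file numbers.
function parseTerms(entries) {
  const table = Object.create(null);
  entries.forEach(function (entry) {
    const parts = entry.split(' ');
    table[parts[0]] = parts.slice(1).map(Number);
  });
  return table;
}

// Return the file numbers for a term, or an empty array if it is not indexed.
function lookup(table, term) {
  return table[term.toLowerCase()] || [];
}

const table = parseTerms(terms);
console.log(lookup(table, 'Add'));     // [ 12, 125, 956 ]
console.log(lookup(table, 'missing')); // []
```

    After the one-time parse, each lookup is a single property access instead of a loop over every entry, at the price of holding the whole table in memory.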

    > Regarding your other post: Spotlight is only available on OSX, and
    > (AFAIK) doesn't have a JavaScript front-end. It may be possible to burn
    > its index to a CD, but without the Spotlight executable, that won't
    > help much.


    At least that could be a solution for a specific environment ;-)
    <http://www.apple.com/downloads/macosx/home_learning/deliciouslibrary.html>

    > TNO's suggestion has a similar problem: it requires WSH to be installed
    > and accessible from an HTML page (unlikely). It will be awfully slow as
    > well, because each search will have to read the complete contents of the


    I suppose that it would be better to have all the content written in memory.

    > CD. And then it probably won't find "à bientôt" because the source
    > encoding doesn't match the search encoding.


    Once regular expressions let \w cover more than ASCII characters and
    include those of fuller character sets, perhaps we will be able to match
    non-English words more seriously (or easily), even if the search
    functions were written by an illiterate guy from the US.
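
    For readers on a newer engine: \w itself is still ASCII-only, but engines implementing ES2018 Unicode property escapes offer a workaround. The sketch below assumes such an engine (it would not have run in 2009 browsers), and `tokenize` is a made-up helper:

```javascript
// \w only covers ASCII letters, digits and underscore, so accented words get cut up.
const asciiWords = '\u00e0 bient\u00f4t'.match(/\w+/g); // the text is "à bientôt"

// With the u flag, \p{L} matches any Unicode letter and \p{N} any Unicode number.
// normalize('NFC') makes composed and decomposed accents compare the same way.
function tokenize(text) {
  return text.normalize('NFC').toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
}

console.log(asciiWords);                         // [ 'bient', 't' ]
console.log(tokenize('\u00c0 bient\u00f4t'));    // [ 'à', 'bientôt' ]
```

    Applying the same `tokenize` to both the indexed text and the search query is one way to avoid the mismatch between source encoding and search encoding.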

    > JSSINDEX still looks like the way to go (didn't test it, though). BTW, I
    > just checked, Lush is available as Debian and Ubuntu packages. If there
    > aren't any other requirements, getting the indexer to work should be a
    > piece of cake.


    Something in Ruby ?
    <http://books.google.fr/books?id=OBhAuww-OokC&pg=PA137&lpg=PA137&dq=ruby+file+indexer&source=bl&ots=2yh2lSt1bK&sig=0vjYl4cMJ-3PxayHwg0YJOGYnbk&hl=fr&ei=t1ugSr24Ac74-QaGqsD0Dw&sa=X&oi=book_result&ct=result&resnum=8#v=onepage&q=&f=false>

    --
    sm
    SAM, Sep 4, 2009
    #8
  9. In comp.lang.javascript, on Fri, 4 Sep 2009 03:51:06, SAM posted:

    >Once regular expressions let \w cover more than ASCII characters and
    >include those of fuller character sets, perhaps we will be able to match
    >non-English words more seriously (or easily), even if the search
    >functions were written by an illiterate guy from the US.


    Too much would break if the character set of \w \W were changed.

    The answer, clearly, is to use a non-English w W. But there is no need
    to use a non-contiguous language - there is Cymraeg.

    Unicode has \u0175 = ŵ, as in the word dw^r (which this newsreader
    does not consider transmissible) and Ŵ as well. Those would do
    nicely.

    Does French Brythonic use it?

    --
    (c) John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
    Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
    Proper <= 4-line sig. separator as above, a line exactly "-- " (SonOfRFC1036)
    Do not Mail News to me. Before a reply, quote with ">" or "> " (SonOfRFC1036)
    Dr J R Stockton, Sep 5, 2009
    #9
