PDF index builder

Discussion in 'Java' started by Giovanni Azua, Dec 5, 2011.

  1. Hello!

    I need to do the following. Given a set of PDF files scattered across
    multiple directories, build a global index that lists, for every index
    term, the file names and corresponding pages where that term occurs.
    A really nice-to-have would be to "parse" formulas, but I guess these
    are stored as images ...

    Before I go ahead and build a solution using Apache's PDFBox and/or iText,
    can anyone advise whether such a solution already exists, even a
    commercial one? I have googled for this already ...

    My use case for this is a very critical open-book exam, but there are no
    books, just a bunch of dense PDF papers and lectures (a lot of them). If
    I get such an index I might get an edge here :)

    TIA,
    Best regards,
    Giovanni

    -- Giovanni
     
    Giovanni Azua, Dec 5, 2011
    #1

  2. On 11-12-05 11:03 AM, Giovanni Azua wrote:
    > Hello!
    >
    > I have the strong need to do the following. Given a set of PDF files
    > scattered across multiple directories, build a global index that includes
    > for every index term the file names and corresponding pages where such
    > index occurs. A really nice to have would be to "parse" formulas but I
    > guess these are stored as images ...
    >
    > Before I go ahead and build a solution using Apache's PDFBox and/or iText
    > can anyone advise if such solution exists? even if commercial? I googled
    > for this already ...
    >
    > My use-case for this is a very critical open book exam but there are no
    > books instead a bunch of dense PDF papers and lectures (a lot) if I get
    > such index I might get an edge here :)
    >
    > TIA,
    > Best regards,
    > Giovanni
    >
    > -- Giovanni


    Presumably you don't want to get as high-powered (and costly and
    complicated) as something like CBR (content-based retrieval) in IBM
    FileNet P8. :)

    AFAIK Alfresco uses PDFBox with Lucene for PDF text extraction and
    indexing. If you're in control of the entire Alfresco system you'd have
    access to the indexing data in its raw form. But I don't see the point;
    if all you want is a global index, I'd simply run PDFBox and Lucene
    standalone. Granted, Alfresco is not a complicated install.
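
    To make the "global index" idea concrete, here is a minimal sketch of
    the data structure involved: term -> file name -> page numbers. It uses
    plain Java collections rather than Lucene (a deliberate simplification,
    not what Alfresco does), and it assumes the text of each page has already
    been extracted by PDFBox, pdftotext, or similar. The class and method
    names (PageIndex, addPage, lookup) are made up for illustration.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: an in-memory inverted index keyed by term.
final class PageIndex {
    // term -> file name -> sorted set of 1-based page numbers
    private final Map<String, Map<String, SortedSet<Integer>>> index = new TreeMap<>();

    /** Adds one page of already-extracted text (extraction itself is out of scope here). */
    void addPage(String fileName, int pageNumber, String pageText) {
        // Lower-case the text and split on anything that is not a letter or digit.
        for (String token : pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            index.computeIfAbsent(token, t -> new TreeMap<>())
                 .computeIfAbsent(fileName, f -> new TreeSet<>())
                 .add(pageNumber);
        }
    }

    /** Returns file name -> pages for a term, or an empty map if the term is unknown. */
    Map<String, SortedSet<Integer>> lookup(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyMap());
    }

    public static void main(String[] args) {
        PageIndex idx = new PageIndex();
        idx.addPage("lecture1.pdf", 1, "Entropy and source coding");
        idx.addPage("lecture1.pdf", 3, "Entropy of a Markov chain");
        idx.addPage("paper2.pdf", 7, "Channel coding theorem");
        System.out.println(idx.lookup("entropy")); // {lecture1.pdf=[1, 3]}
        System.out.println(idx.lookup("coding"));  // {lecture1.pdf=[1], paper2.pdf=[7]}
    }
}
```

    Every token becomes an index term here; a real index for exam use would
    probably want a stop-word list or a hand-picked term list on top of this.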

    One note: a number of commentators report that PDFBox is slow in the
    Alfresco environment. For all I know it's slow, period. You might want
    to consider pdftotext. There are some decent articles on using it
    instead of PDFBox with Alfresco.
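
    If pdftotext is used, page numbers can still be recovered: as far as I
    know it ends each page of output with a form feed character (unless run
    with -nopgbrk). A rough sketch of calling it from Java and splitting the
    output per page follows. It assumes pdftotext is on the PATH and Java 9
    or later; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper around the pdftotext command-line tool.
final class PdfToTextPages {

    /** Splits pdftotext output into pages on the form feed character. */
    static List<String> splitPages(String output) {
        List<String> pages = new ArrayList<>(Arrays.asList(output.split("\f", -1)));
        // If the output ends with a form feed, the split leaves one empty trailing piece.
        if (!pages.isEmpty() && pages.get(pages.size() - 1).isEmpty()) {
            pages.remove(pages.size() - 1);
        }
        return pages;
    }

    /** Runs "pdftotext -enc UTF-8 <pdf> -" (text to stdout) and returns the text per page. */
    static List<String> extractPages(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("pdftotext", "-enc", "UTF-8", pdfPath, "-")
                .redirectError(ProcessBuilder.Redirect.INHERIT) // let error messages through
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return splitPages(output);
    }

    public static void main(String[] args) throws Exception {
        List<String> pages = extractPages(args[0]);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("page " + (i + 1) + ": " + pages.get(i).length() + " chars");
        }
    }
}
```

    The list position plus one gives the page number, which is what a
    term -> file -> pages index needs as input.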

    AHS
     
    Arved Sandstrom, Dec 6, 2011
    #2

  3. On 5 Dec 2011 15:03:46 GMT, Giovanni Azua
    wrote, quoted or indirectly quoted someone who said :

    >Before I go ahead and build a solution using Apache's PDFBox and/or iText
    >can anyone advise if such solution exists? even if commercial? I googled
    >for this already ...


    There are a ton of PDF utilities. Have a browse at
    http://mindprod.com/jgloss/pdf.html

    I would be quite surprised if what you want does not exist.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    For me, the appeal of computer programming is that
    even though I am quite a klutz,
    I can still produce something, in a sense
    perfect, because the computer gives me as many
    chances as I please to get it right.
     
    Roedy Green, Dec 9, 2011
    #3
