Multiple PDF, PPT, DOC to html or text conversion

Discussion in 'ASP .Net' started by osiceanu, Feb 21, 2008.

  1. osiceanu

    osiceanu Guest

    Hello,

    I have an ASP.NET application that stores PDF files and Word documents
    in a database. The problem appears when I try to show a preview of a
    document on the aspx page, which means converting the document to HTML
    or text. Is there a method for doing this? Keeping the images or the
    formatting of the document is not necessary.
    If that is not possible, maybe an image preview of the document (e.g.
    its first page) would be more suitable and easier.

    Thanks in advance!
    osiceanu, Feb 21, 2008
    #1

  2. Lasse Vågsæther Karlsen

    Perhaps not an "optimal" solution in terms of resource usage on the
    server, but could you use the Office 2007 COM objects for this?

    A PDF document you can easily embed into a page.

    A Word document you could, on the server, load into the Word
    application, save as a temporary pdf file, and then embed that into the
    page.

    If resource usage on the server takes a hit, you could tag new
    documents in the database as "must be rendered to pdf", and then run a
    job at intervals that does the same thing, i.e. loads the Word document
    into Word, saves it as pdf, and then uploads the pdf to the database as
    an alternate representation of the Word document.

    You mention that you want to convert it to html or text. Is this a
    must-have criterion? If it is, you either need a server component that
    can output html from pdf and Word files (Word 2007 can do this for the
    Word file), or you need to do a similar interval-based rendering of the
    files to html.

    3rd party class libraries exist that do either, and while I don't
    know the current state of pdf libraries that would fit, I do know that
    the only way to support all the features of the Word application is by
    using Word itself.

    As for only showing the text, you can probably use such 3rd party
    libraries. TX Text Control can be used to grab the text from a Word
    file, and there are probably similar things for pdf. But do know that
    pdf is a format meant for printing. I've seen badly formed pdf files
    that consist of words on a page where the words are not laid out line
    by line or sentence by sentence, but are more like just thrown onto the
    page in the right spots. Text grabbed from such a document would most
    likely not look good.
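
    To illustrate why such files are awkward: if a pdf library hands back
    words together with their page positions (an assumption; the
    `(x, y, text)` tuple format here is made up for the example), the
    reading order has to be rebuilt by grouping words with similar vertical
    positions. A rough Python sketch of that grouping:

```python
from typing import List, Tuple

# (x, y, text); y is assumed to grow downward from the top of the page.
Word = Tuple[float, float, str]


def words_to_lines(words: List[Word], y_tolerance: float = 2.0) -> List[str]:
    """Group positioned words into text lines, top to bottom, left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        # A word joins the current line if its y is close to the y of the
        # first word of that line; otherwise it starts a new line.
        if lines and abs(word[1] - lines[-1][0][1]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w[2] for w in sorted(line, key=lambda w: w[0]))
        for line in lines
    ]
```

    This copes with words stored in arbitrary order, but it is only a
    heuristic: multi-column layouts, rotated text or uneven baselines would
    still come out wrong.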

    --
    Lasse Vågsæther Karlsen
    http://presentationmode.blogspot.com/
    PGP KeyID: 0xBCDEA2E3
    Lasse Vågsæther Karlsen, Feb 21, 2008
    #2

  3. osiceanu

    osiceanu Guest

    Thank you for your response!

    I was trying to do something like Google's "View as HTML" for
    documents like pdf, doc, ppt, xls. This component would also be used
    for searching the site, returning as results the documents that
    contain the search text.
    Another approach would be to keep the documents on the hard disk and
    store only references to those documents in the db. But for searching
    I would still have to store the text from the documents.
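
    A minimal sketch of that second approach, in Python with SQLite purely
    for illustration: the files stay on disk, and the db holds the path
    plus the extracted text for searching. The table and function names are
    invented for the example, and a plain LIKE match stands in for a real
    full-text search.

```python
import sqlite3
from typing import List


def create_index(conn: sqlite3.Connection) -> None:
    # Only the file reference and its extracted text live in the database.
    conn.execute(
        "CREATE TABLE doc_index (path TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )


def index_document(conn: sqlite3.Connection, path: str, text: str) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO doc_index (path, body) VALUES (?, ?)",
        (path, text),
    )
    conn.commit()


def search(conn: sqlite3.Connection, term: str) -> List[str]:
    """Return the paths of documents whose extracted text contains term."""
    # Escape LIKE wildcards so the search term is matched literally.
    escaped = (
        term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT path FROM doc_index WHERE body LIKE ? ESCAPE '\\' ORDER BY path",
        (f"%{escaped}%",),
    )
    return [row[0] for row in rows]
```

    The search results are just paths, so the page can link to the file on
    disk (or to its preview) without loading the document itself from the
    db.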
    osiceanu, Feb 21, 2008
    #3
  4. Automating Word in a server environment is a very bad solution. See
    the following thread as to why:

    http://groups.google.com/group/micr...04123c3e4db/8e7b1b19ebfa0e34#8e7b1b19ebfa0e34

    Additionally, the thread references this link, in which MS explains why
    it is a bad idea to automate Word in a server environment:

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;257757

    --
    - Nicholas Paldino [.NET/C# MVP]
    Nicholas Paldino [.NET/C# MVP], Feb 21, 2008
    #4
  5. Lasse Vågsæther Karlsen wrote:

    > Perhaps not an "optimal" solution in terms of resource usage on the
    > server, but could you use the Office 2007 COM objects for this?


    Under no circumstances should server-side Office automation be attempted:
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;q257757#kb2
    http://support.microsoft.com/default.aspx/kb/288367
    http://www.aspose.com/Products/Aspose.Words/Api/index.html?url=Why-not-Automation.html

    Use this instead:
    http://www.aspose.com/Products/Aspose.Words/


    --
    Mark Rae
    ASP.NET MVP
    http://www.markrae.net
    Mark Rae [MVP], Feb 21, 2008
    #5