Multiple PDF, PPT, DOC to html or text conversion

osiceanu

Hello,

I have an ASP.NET application that stores PDF files and Word documents
in a database. The problem appears when I try to show a preview of a
document on the aspx page, which means converting the document to HTML
or text. Is there a method for doing this? Keeping the images in the
document or the formatting of the document is not necessary.
If that is not possible, maybe an image preview of the document (i.e.
its first page) would be more suitable and easier.

Thanks in advance!
 
Lasse Vågsæther Karlsen

Perhaps not an "optimal" solution in terms of resource usage on the
server, but could you use the Office 2007 COM objects for this?

A PDF document you can easily embed into a page.

A Word document you could, on the server, load into the Word
application, save as a temporary pdf file, and then embed that into the
page.

If resource usage on the server is a concern, you could tag new
documents in the database as "must be rendered to PDF", and then run a
job at intervals that does the same thing, i.e. loads the Word document
into Word, saves it as PDF, and then uploads the PDF to the database as
an alternate representation of the Word document.
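
The interval job described above could be sketched roughly as follows.
This is a minimal Python illustration rather than ASP.NET code; the
table layout, the needs_pdf column, and the convert_to_pdf callable are
all assumptions made for the example. The actual conversion (for
instance, driving Word on the server) is left as a function you would
supply yourself.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical table: original bytes, optional PDF rendering, and a flag.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " content BLOB NOT NULL,"
        " pdf BLOB,"
        " needs_pdf INTEGER NOT NULL DEFAULT 1)"
    )


def add_document(conn, name, content):
    # New uploads are tagged "must be rendered to PDF" (needs_pdf = 1).
    cur = conn.execute(
        "INSERT INTO documents (name, content, needs_pdf) VALUES (?, ?, 1)",
        (name, content),
    )
    conn.commit()
    return cur.lastrowid


def render_pending(conn, convert_to_pdf):
    """Run one pass of the job; returns how many documents were rendered.

    convert_to_pdf is a caller-supplied function (bytes -> bytes).
    """
    rows = conn.execute(
        "SELECT id, content FROM documents WHERE needs_pdf = 1"
    ).fetchall()
    rendered = 0
    for doc_id, content in rows:
        try:
            pdf_bytes = convert_to_pdf(content)
        except Exception:
            # Leave the flag set so the next scheduled run retries it.
            continue
        conn.execute(
            "UPDATE documents SET pdf = ?, needs_pdf = 0 WHERE id = ?",
            (pdf_bytes, doc_id),
        )
        rendered += 1
    conn.commit()
    return rendered
```

A scheduler (a timer, a cron-style task, or a Windows service) would
call render_pending at whatever interval suits the server load, so the
page only ever has to serve an already rendered PDF.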

You mention that you want to convert it to HTML or text. Is this a
must-have criterion? If it is, you either need a server component that
can output HTML from PDF and Word files (Word 2007 can do this from the
Word file), or you need to do a similar interval-based rendering of the
files to HTML.

Third-party class libraries exist that do either, and while I don't
know the current state of PDF libraries that would fit, I do know that
the only way to support all the features of the Word application is by
using Word itself.

As for only showing the text, you can probably use such third-party
libraries. TX Text Control can be used to grab the text from a Word
file, and there are probably similar things for PDF. Do keep in mind,
though, that PDF is a format designed for printing. I've seen badly
formed PDF files that consist of words on a page, where the words are
not laid out line by line or sentence by sentence, but are just placed
onto the page in the right spots. Text grabbed from such a document
would most likely not look good.
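
To illustrate the problem with positioned words, here is a rough
Python sketch of how a text extractor might try to rebuild lines from
them. It assumes (hypothetically) that some PDF library has already
handed you each word as an (x, y, text) tuple, with y increasing down
the page; the function name and the tolerance value are made up for
the example.

```python
def words_to_lines(words, y_tolerance=2.0):
    """Group positioned words into lines of text.

    words: iterable of (x, y, text) tuples, y increasing down the page.
    Words whose y is within y_tolerance of a line's first word are
    treated as belonging to that line.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order the words left to right.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]
```

This works for simple pages, but columns, rotated text, or words split
into separate glyph runs will still come out scrambled, which is why
the extracted text from such files often looks bad.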
 
osiceanu

Thank you for your response!

I was trying to do something like Google's "View as HTML" for
documents such as pdf, doc, ppt, and xls. This component would also be
used for searching the site, returning as results the documents that
contain the search text.
Another approach would be to keep the documents on the hard disk and
store only references to them in the db. But for searching I would
also have to store the text extracted from the documents.
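
The second approach (files on disk, references plus extracted text in
the db) might look roughly like this. It is only a Python sketch using
SQLite; the document_index table, its columns, and the function names
are assumptions for illustration, and the text extraction itself is
taken as already done.

```python
import sqlite3


def create_index(conn):
    # Hypothetical table: a path to the file on disk plus its extracted text.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document_index ("
        " id INTEGER PRIMARY KEY,"
        " file_path TEXT NOT NULL UNIQUE,"
        " extracted_text TEXT NOT NULL)"
    )


def index_document(conn, file_path, extracted_text):
    # Re-indexing the same path replaces the previously stored text.
    conn.execute(
        "INSERT OR REPLACE INTO document_index (file_path, extracted_text)"
        " VALUES (?, ?)",
        (file_path, extracted_text),
    )
    conn.commit()


def search(conn, term):
    """Return the paths of documents whose text contains term."""
    # Escape LIKE wildcards so the term is matched literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT file_path FROM document_index"
        " WHERE extracted_text LIKE ? ESCAPE '\\'"
        " ORDER BY file_path",
        ("%" + escaped + "%",),
    )
    return [row[0] for row in rows]
```

A plain LIKE scan is fine for a small number of documents; for a larger
site a full-text index would be the more usual choice.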
 
Mark Rae [MVP]
