.doc to HTML and PDF conversion with Python

  • Thread starter Alexander Klingenstein

Alexander Klingenstein

I need to take a bunch of .doc files (Word 2000), which contain a little text with some tables/layout and mostly pictures, convert them to PDF, and also extract the text and images separately. If I have a PDF, I can create the HTML with pdftohtml called from Python with popen. However, I need an automated way to convert the .doc to PDF first.

Is there a way to do what I want with a Python lib, a 3rd-party app, or maybe by remote-controlling Word (a la VBA) and "printing" to PDF with a distiller?
I already tried wvWare from GnuWin32; however, it has problems with big image files embedded in the .doc file (looks like an mmap error).

Alex


Luap777

Alexander said:
I need to take a bunch of .doc files (Word 2000), which contain a little text with some tables/layout and mostly pictures, convert them to PDF, and also extract the text and images separately. If I have a PDF, I can create the HTML with pdftohtml called from Python with popen. However, I need an automated way to convert the .doc to PDF first.

Is there some reason you really want to convert to PDF first? You can
get much better HTML right from the Word doc. You'll lose a lot of info
going from PDF to HTML.

Something like this can open doc in Word, save as HTML, then close doc.

import os, win32com.client

# EnsureDispatch generates the type library wrappers, so that
# win32com.client.constants (wdFormatHTML etc.) is populated.
wdApp = win32com.client.gencache.EnsureDispatch("Word.Application")
wdApp.Visible = 1

def SaveDocAsHTML(docPath, htmlPath):
    # Word resolves relative paths against its own working directory,
    # so pass absolute paths.
    doc = wdApp.Documents.Open(os.path.abspath(docPath))
    # See
    # mk:@MSITStore:C:\Program%20Files\Microsoft%20Office\OFFICE11\1033\VBAWD10.CHM::/html/womthSaveAs1.htm
    # in Word VBA help doc for more info.

    # Saves all text and formatting with HTML tags so that the
    # resulting document can be viewed in a Web browser.
    doc.SaveAs(os.path.abspath(htmlPath), win32com.client.constants.wdFormatHTML)
    # Saves text with HTML tags with minimal cascading style sheet
    # formatting. The resulting document can be viewed in a Web browser.
    #doc.SaveAs(os.path.abspath(htmlPath),
    #           win32com.client.constants.wdFormatFilteredHTML)
    doc.Close()
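
Since you have a bunch of files, a small driver loop around SaveDocAsHTML might look like the following. This is only a sketch: the names plan_conversions and convert_all are made up for illustration, and the converter is passed in as a parameter so the loop itself does not depend on Word.

```python
import os

def plan_conversions(src_dir, out_dir):
    """Return (doc_path, html_path) pairs for every .doc file in src_dir."""
    pairs = []
    for name in sorted(os.listdir(src_dir)):
        base, ext = os.path.splitext(name)
        # Skip non-.doc files and the "~$..." owner files Word leaves
        # next to documents that are currently open.
        if ext.lower() != ".doc" or name.startswith("~$"):
            continue
        # Absolute paths, because Word resolves relative ones against
        # its own working directory.
        doc_path = os.path.abspath(os.path.join(src_dir, name))
        html_path = os.path.abspath(os.path.join(out_dir, base + ".html"))
        pairs.append((doc_path, html_path))
    return pairs

def convert_all(src_dir, out_dir, convert):
    """Call convert(doc_path, html_path) per file; return the failures."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for doc_path, html_path in plan_conversions(src_dir, out_dir):
        try:
            convert(doc_path, html_path)
        except Exception as exc:  # one bad document should not stop the batch
            failed.append((doc_path, exc))
    return failed
```

You would call it as convert_all(r"C:\docs", r"C:\docs\html", SaveDocAsHTML) (both paths are placeholders) and then call wdApp.Quit() once the batch is done, so Word does not stay running in the background.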

And if you aren't satisfied with the ugly HTML you're likely to get,
you can try running µTidylib (http://utidylib.berlios.de/) on the
output after this step also.
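
Once you have the HTML, pulling out the text and the image references separately can be done with the standard library alone. The sketch below is an assumption about how you might do it, not a tested recipe for Word's output: TextAndImages and extract are illustrative names, and Word's unfiltered HTML is messy enough that you will probably want the filtered format or a tidy pass first.

```python
from html.parser import HTMLParser

class TextAndImages(HTMLParser):
    """Collect visible text and <img src=...> references from an HTML page."""

    SKIP = {"script", "style", "title"}  # content that is not body text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.images = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text outside skipped elements, with whitespace
        # runs collapsed to single spaces.
        if not self._skip_depth and data.strip():
            self.text_parts.append(" ".join(data.split()))

def extract(html):
    """Return (text, image_srcs) for an HTML string."""
    parser = TextAndImages()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.text_parts), parser.images
```

The src values are relative to the HTML file. Word typically writes the pictures into a companion folder next to the saved page, so the image files themselves can be copied from there.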

Thank you,
Paul
 
