How to convert MS Word document to text file?

Goh, Yong Kwang

Hi.

I'm trying to write a Perl program that, given a Microsoft Word
document, converts it into a text file, which can then be formatted
into an HTML file. In essence, it extracts the information and
repackages it for the Web.

I'm not using Word's "Save As Web Page" command directly because I
have a whole batch of Word documents to convert, and doing them one by
one is slow and laborious. Also, the resulting formatting does not
match the Web template I've made.

So I'm thinking of using the Win32::OLE package to create a Microsoft
Word Document OLE object and then calling its SaveAs method to export
the Word document to a text file.

However, to avoid the overhead of driving Word through OLE for every
document, I have been wondering if there is a Perl module somewhere
that does this conversion directly, without having to start Microsoft
Word and go through OLE.
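
For reference, a minimal sketch of the Win32::OLE route described above. It assumes Windows with Word installed, and it starts Word once for the whole batch rather than once per document, which should keep most of the overhead down. The helper names (txt_name_for, convert_all) and the .doc to .txt naming scheme are made up for illustration; 2 is the numeric value of Word's wdFormatText constant.

```perl
use strict;
use warnings;
use File::Spec;

use constant WD_FORMAT_TEXT => 2;    # numeric value of Word's wdFormatText

# Map "report.doc" to "report.txt" (hypothetical naming scheme).
sub txt_name_for {
    my ($doc_path) = @_;
    my $txt = $doc_path;
    $txt =~ s/\.doc\z/.txt/i or $txt .= '.txt';
    return $txt;
}

# Convert every given Word document to plain text with one Word instance.
sub convert_all {
    my @docs = @_;
    require Win32::OLE;              # Windows only, so load it lazily

    # 'Quit' is the method Win32::OLE calls on Word when $word goes away.
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die 'Cannot start Word: ' . Win32::OLE->LastError() . "\n";
    $word->{Visible} = 0;

    for my $path (@docs) {
        # Word resolves relative paths against its own working directory,
        # so hand it absolute paths.
        my $in  = File::Spec->rel2abs($path);
        my $out = txt_name_for($in);

        my $doc = $word->Documents->Open($in);
        unless ($doc) {
            warn "Cannot open $in: " . Win32::OLE->LastError() . "\n";
            next;
        }
        $doc->SaveAs({ FileName => $out, FileFormat => WD_FORMAT_TEXT });
        $doc->Close(0);              # 0 = do not save changes to the original
    }
    return;
}

# Only touch Word when file names are given on the command line.
convert_all(@ARGV) if @ARGV;
```

Saved as, say, doc2txt.pl (a hypothetical name), it would be run as "perl doc2txt.pl first.doc second.doc" and write first.txt and second.txt next to the originals.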

Thanks.

Regards,
Goh, Yong-Kwang
(e-mail address removed)
Singapore
 
John Bokma

http://wvware.sourceforge.net/

http://www.google.com/search?q=MS+word+to+text+conversion
which also turned up:

http://www.w3.org/Tools/Word_proc_filters.html
 
Petri

Goh said:
I don't use the "Save As Web Page" command from Word directly as
I've a whole bunch of Word documents to convert, doing one-by-one
is slow and laborious. And the formatting does not match the Web
template I've done.

Why do them one by one, when you can automate Word to do them all for you?

So I'm thinking of using Win32::OLE package to create a Microsoft
Word Document OLE object and then calling its SaveAs method to
export a MS Word document to a text file.

If you are going for Word automation, why not let Word save the documents
directly as HTML, and skip the whole text-to-HTML part?

Of course, you don't need Perl for this, but you can use it if you like.
All examples are in VBScript, though:
http://msdn.microsoft.com/office/un...rary/en-us/dnword2k/html/odc_expwordtoxml.asp


Petri
 
Alan J. Flavell

why not let Word save the documents directly as HTML,

off-topic for Perl, but I doubt that Word can distinguish between
real HTML, and a hole in the ground!

Any mention of using its facilities to extrude its quasi-HTML needs to
be coupled with some discussion of how to limit the resulting damage,
IMNSHO.

I've had better results by having Word save as RTF, and using
third-party tools which convert RTF to web formats, preferably with
some kind of customisation facilities to "tune" the result.


h.t.h (we now return you to your regular diet of Perls...)
 
Petri

off-topic for Perl, but I doubt that Word can distinguish
between real HTML, and a hole in the ground!

Well, you have to adjust your level of expectation accordingly, to the level of
quality supplied by the producer.
Or something like that. :)

I know of the (lack of) quality of Word's HTML-output all too well.
But the OP mentioned himself that he had contemplated using Word's
HTML-conversion, so I assumed he knew what he was doing.
That is, if his expected userbase are all using MS IE, then any HTML produced by
MS Word will always work.


Petri
 
Chris

I'm not convinced that using Perl in this particular case is really
necessary outside of the context of Win32::OLE, but if you are insistent
on using it apart from OLE then perhaps just passing the Word document
through 'antiword' and using the ASCII results from that to do your HTML
processing would work? You ask for a "module", and if you have access to
the Win32::OLE module, then I really know of no better module to
recommend for working with Office objects.
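
A rough sketch of the antiword route, assuming the antiword binary is installed and on the PATH. The helper names (escape_html, text_to_html, doc_to_text) are made up for illustration, and treating blank-line-separated blocks as paragraphs is only a guess at what the HTML template needs.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Turn plain text into <p> elements, one per blank-line-separated block.
sub text_to_html {
    my ($text) = @_;
    my $html = '';
    for my $para (grep { /\S/ } split /\n\s*\n/, $text) {
        $para =~ s/^\s+|\s+$//g;      # trim both ends
        $para =~ s/\s*\n\s*/ /g;      # rejoin hard-wrapped lines
        $html .= '<p>' . escape_html($para) . "</p>\n";
    }
    return $html;
}

# Run antiword on one document and return its plain-text output.
# The list form of pipe open avoids the shell; on older Windows perls
# it may not be available, in which case backticks are the fallback.
sub doc_to_text {
    my ($doc) = @_;
    open(my $fh, '-|', 'antiword', $doc)
        or die "Cannot run antiword: $!\n";
    local $/;                         # slurp the whole output
    my $text = <$fh>;
    close($fh) or die "antiword failed on $doc\n";
    return $text;
}

# Print an HTML fragment for each document named on the command line.
print text_to_html(doc_to_text($_)) for @ARGV;
```

The output is only a fragment of paragraphs; wrapping it in the site template is left to whatever templating is already in place.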

-ceo
 
Ben Morrow

Quoth "Alan J. Flavell":
off-topic for Perl, but I doubt that Word can distinguish between
real HTML, and a hole in the ground!

I'm no (HT|SG|X)ML expert, but AFAICS Word2k and later produce valid
XML-including-XHTML. This is a somewhat different beast from real HTML,
of course, but it is a perfectly good well-defined browser-supported
format, unlike the mess previous versions of Word used to produce.

I've had better results by having Word save as RTF, and using
third-party tools which convert RTF to web formats, preferably with
some kind of customisation facilities to "tune" the result.

If you're using Perl and want to manage the conversion yourself, I would
recommend using 'Save as Web Page' and then parsing the resulting XML.
It contains all the information in the original Word doc, and is an
easier format to deal with than RTF.

Ben
 
Alan J. Flavell

If you're using Perl and want to manage the conversion yourself, I would
recommend using 'Save as Web Page' and then parsing the resulting XML.
It contains all the information in the original Word doc,

That's part of the problem, since most of the minutiae of a typical
Word document (like fonts and other layout dimensions sized in pt
units, to name just one problem area) have no place in HTML that is
meant to be used on the WWW, while Word documents do not necessarily
contain any logical indications such as correspond to HTML's headings,
<abbr>, list-items, and so on, as opposed to visual effects such as
bigger fonts, italics, indents, bullets, and so on.

But I've overstayed my welcome on this un-Perl issue, so "see you on
comp.infosystems.www.*" if you really wanted to pursue this.
 
