How to convert MS Word document to text file?

Discussion in 'Perl Misc' started by Goh, Yong Kwang, Jun 16, 2004.

  1. Hi.

    I'm trying to write a Perl program that, given a Microsoft Word
    document file, converts it into a text file, which can then be formatted
    into an HTML file. So in essence it's like extracting the information
    and repackaging it for the Web.

    I don't use the "Save As Web Page" command from Word directly because I
    have a whole bunch of Word documents to convert, and doing them one by
    one is slow and laborious. Also, the resulting formatting does not match
    the Web template I've made.

    So I'm thinking of using Win32::OLE package to create a Microsoft Word
    Document OLE object and then calling its SaveAs method to export a MS
    Word document to a text file.

    However, to avoid the overhead of driving Microsoft Word through OLE
    for every document, I have been wondering if there is a Perl module
    somewhere that does this conversion directly, without having to call
    up Microsoft Word and use OLE.
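
    For reference, the Win32::OLE approach described above might look
    roughly like the sketch below. It is only a sketch under assumptions:
    it needs Windows with Word installed, the helper names text_name_for
    and convert_dir are made up for illustration, and the literal 2 is the
    value of Word's wdFormatText constant. Starting Word once and reusing
    it for every file keeps most of the OLE overhead out of the loop.

```perl
use strict;
use warnings;
use File::Spec;

# Value of Word's wdFormatText constant (plain-text output).
use constant WD_FORMAT_TEXT => 2;

# Hypothetical helper: report.doc -> report.txt
sub text_name_for {
    my ($doc_path) = @_;
    (my $txt = $doc_path) =~ s/\.doc\z/.txt/i;
    return $txt;
}

# Hypothetical helper: convert every .doc file in $dir with one Word instance.
sub convert_dir {
    my ($dir) = @_;
    require Win32::OLE;    # loaded lazily; only available on Windows

    # 'Quit' asks Word to exit when the object is destroyed.
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Cannot start Word: " . Win32::OLE->LastError;
    $word->{Visible} = 0;

    opendir my $dh, $dir or die "Cannot read $dir: $!";
    my @docs = grep { /\.doc\z/i } readdir $dh;
    closedir $dh;

    for my $name (@docs) {
        # Word expects absolute paths.
        my $abs = File::Spec->rel2abs(File::Spec->catfile($dir, $name));
        my $doc = $word->Documents->Open($abs);
        unless ($doc) {
            warn "Cannot open $abs\n";
            next;
        }
        $doc->SaveAs(text_name_for($abs), WD_FORMAT_TEXT);
        $doc->Close(0);    # 0 = close without saving changes
    }
    return scalar @docs;
}

# Run only when a directory is given on the command line.
convert_dir($ARGV[0]) if @ARGV;
```

    Called as "perl convert.pl C:\docs" (the script name is an assumption),
    this would write a .txt file next to each .doc file in that directory.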

    Thanks.

    Regards,
    Goh, Yong-Kwang

    Singapore
    Goh, Yong Kwang, Jun 16, 2004
    #1

  2. John Bokma (Guest)

    Goh, Yong Kwang wrote:

    > However, to reduce the overhead of having to create a Microsoft Word
    > Document OLE object and then calling its SaveAs method to export a MS
    > Word document to a text file, I have been wondering if there is a Perl
    > module somewhere that does this conversion directly w/o having to call
    > up Microsoft Word and using OLE.


    http://wvware.sourceforge.net/

    http://www.google.com/search?q=MS word to text conversion
    which also gave this result:

    http://www.w3.org/Tools/Word_proc_filters.html

    --
    John MexIT: http://johnbokma.com/mexit/
    personal page: http://johnbokma.com/
    Experienced Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
    John Bokma, Jun 16, 2004
    #2

  3. Petri (Guest)

    Goh, Yong Kwang says...
    > I don't use the "Save As Web Page" command from Word directly as
    > I've a whole bunch of Word documents to convert, doing one-by-one
    > is slow and laborious. And the formatting does not match the Web
    > template I've done.


    Why do them one by one, when you can automate Word to do them all for you?

    > So I'm thinking of using Win32::OLE package to create a Microsoft
    > Word Document OLE object and then calling its SaveAs method to
    > export a MS Word document to a text file.


    If you are going for Word automation, why not let Word save the documents
    directly as HTML, and skip the whole text-to-HTML part?

    Of course, you don't need Perl for this, but you can use it if you like.
    All examples are in VBScript, though:
    http://msdn.microsoft.com/office/un...rary/en-us/dnword2k/html/odc_expwordtoxml.asp


    Petri
    Petri, Jun 16, 2004
    #3
  4. On Wed, 16 Jun 2004, Petri wrote:

    > why not let Word save the documents directly as HTML,


    off-topic for Perl, but I doubt that Word can distinguish between
    real HTML, and a hole in the ground!

    Any mention of using its facilities to extrude its quasi-HTML needs to
    be coupled with some discussion of how to limit the resulting damage,
    IMNSHO.

    I've had better results by having Word save as RTF, and using
    third-party tools which convert RTF to web formats, preferably with
    some kind of customisation facilities to "tune" the result.


    h.t.h (we now return you to your regular diet of Perls...)
    Alan J. Flavell, Jun 16, 2004
    #4
  5. Petri (Guest)

    Alan J. Flavell says...
    > On Wed, 16 Jun 2004, Petri wrote:


    >> why not let Word save the documents directly as HTML,


    > off-topic for Perl, but I doubt that Word can distinguish
    > between real HTML, and a hole in the ground!


    Well, you have to adjust your level of expectation to the level of
    quality supplied by the producer.
    Or something like that. :)

    I know the (lack of) quality of Word's HTML output all too well.
    But the OP himself mentioned that he had contemplated using Word's
    HTML conversion, so I assumed he knew what he was doing.
    That is, if his expected user base is all using MS IE, then any HTML
    produced by MS Word will always work.


    Petri
    Petri, Jun 16, 2004
    #5
  6. Chris (Guest)

    Goh, Yong Kwang wrote:
    > However, to reduce the overhead of having to create a Microsoft Word
    > Document OLE object and then calling its SaveAs method to export a MS
    > Word document to a text file, I have been wondering if there is a Perl
    > module somewhere that does this conversion directly w/o having to call
    > up Microsoft Word and using OLE.
    >


    I'm not convinced that using Perl in this particular case is really
    necessary outside of the context of Win32::OLE, but if you insist on
    using it without OLE, then perhaps passing the Word document through
    'antiword' and using the ASCII output from that for your HTML
    processing would work? You ask for a "module", and if you have access
    to the Win32::OLE module, then I really know of no better module to
    recommend for working with Office objects.
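
    The antiword route could be sketched along these lines. This is an
    illustration only: it assumes an antiword executable is on the PATH,
    the sub names are invented for the example, and the list form of pipe
    open may not be available in older Perls on Windows. The text-to-HTML
    step just wraps blank-line-separated blocks in <p> elements; a real
    template would need more than that.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Turn plain text into <p> elements, one per blank-line-separated block.
sub text_to_html {
    my ($text) = @_;
    my @paras = grep { /\S/ } split /\n\s*\n/, $text;
    return join "\n", map {
        my $p = escape_html($_);
        $p =~ s/\s+/ /g;     # collapse line breaks inside a paragraph
        $p =~ s/^ | $//g;    # trim leading and trailing space
        "<p>$p</p>";
    } @paras;
}

# Run antiword (assumed to be installed) and capture its text output.
sub word_to_text {
    my ($doc_path) = @_;
    open my $fh, '-|', 'antiword', $doc_path
        or die "Cannot run antiword: $!";
    local $/;                # slurp the whole output at once
    my $text = <$fh>;
    close $fh or die "antiword failed on $doc_path\n";
    return $text;
}

# Run only when a Word file is given on the command line.
print text_to_html(word_to_text($ARGV[0])), "\n" if @ARGV;
```

    Because the conversion and the HTML step are separate subs, the
    paragraph markup can be replaced with whatever the Web template needs
    without touching the antiword call.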

    -ceo
    Chris, Jun 16, 2004
    #6
  7. Ben Morrow (Guest)

    Quoth "Alan J. Flavell":
    > On Wed, 16 Jun 2004, Petri wrote:
    >
    > > why not let Word save the documents directly as HTML,

    >
    > off-topic for Perl, but I doubt that Word can distinguish between
    > real HTML, and a hole in the ground!


    I'm no (HT|SG|X)ML expert, but AFAICS Word2k and later produce valid
    XML-including-XHTML. This is a somewhat different beast from real HTML,
    of course, but it is a perfectly good well-defined browser-supported
    format, unlike the mess previous versions of Word used to produce.

    > I've had better results by having Word save as RTF, and using
    > third-party tools which convert RTF to web formats, preferably with
    > some kind of customisation facilities to "tune" the result.


    If you're using Perl and want to manage the conversion yourself, I would
    recommend using 'Save as Web Page' and then parsing the resulting XML.
    It contains all the information in the original Word doc, and is an
    easier format to deal with than RTF.

    Ben

    --
    If I were a butterfly I'd live for a day, / I would be free, just blowing away.
    This cruel country has driven me down / Teased me and lied, teased me and lied.
    I've only sad stories to tell to this town: / My dreams have withered and died.
    (Kate Rusby)
    Ben Morrow, Jun 16, 2004
    #7
  8. On Wed, 16 Jun 2004, Ben Morrow wrote:

    > If you're using Perl and want to manage the conversion yourself, I would
    > recommend using 'Save as Web Page' and then parsing the resulting XML.
    > It contains all the information in the original Word doc,


    That's part of the problem, since most of the minutiae of a typical
    Word document (like fonts and other layout dimensions sized in pt
    units, to name just one problem area) have no place in HTML that is
    meant to be used on the WWW, while Word documents do not necessarily
    contain any logical indications such as correspond to HTML's headings,
    <abbr>, list-items, and so on, as opposed to visual effects such as
    bigger fonts, italics, indents, bullets, and so on.

    But I've overstayed my welcome on this un-Perl issue, so "see you on
    comp.infosystems.www.*" if you really wanted to pursue this.
    Alan J. Flavell, Jun 16, 2004
    #8