parsing MS word docs -- tutorial request

B

bp.tralfamadore

All,

I am trying to write a script that will parse and extract data from a
MS Word document. Can / would anyone refer me to a tutorial on how to
do that? (perhaps from tables). I am aware of, and have downloaded
the pywin32 extensions, but am unsure of how to proceed -- I'm not
familiar with the COM API for word, so help for that would also be
welcome.

Any help would be appreciated. Thanks for your attention and
patience.

::bp::
 
O

Okko Willeboordsed

Get a copy of; Python Programming on Win32, ISBN 1-56592-621-8
Use Google and VBA for help
 
R

Reedick, Andrew

-----Original Message-----
From: [email protected] [mailto:python-
[email protected]] On Behalf Of
(e-mail address removed)
Sent: Tuesday, October 28, 2008 10:26 AM
To: (e-mail address removed)
Subject: parsing MS word docs -- tutorial request

All,

I am trying to write a script that will parse and extract data from a
MS Word document. Can / would anyone refer me to a tutorial on how to
do that? (perhaps from tables). I am aware of, and have downloaded
the pywin32 extensions, but am unsure of how to proceed -- I'm not
familiar with the COM API for word, so help for that would also be
welcome.

Any help would be appreciated. Thanks for your attention and
patience.

::bp::


Word Object Model:
http://msdn.microsoft.com/en-us/library/bb244515.aspx

Google for sample code to get you started.
 
K

Kay Schluehr

All,

I am trying to write a script that will parse and extract data from a
MS Word document.  Can / would anyone refer me to a tutorial on how to
do that?  (perhaps from tables).  I am aware of, and have downloaded
the pywin32 extensions, but am unsure of how to proceed -- I'm not
familiar with the COM API for word, so help for that would also be
welcome.

Any help would be appreciated.  Thanks for your attention and
patience.

::bp::

One can convert MS-Word documents into some class of XML documents
called MHTML. If I remember correctly those documents had an .mht
extension. The result is a huge amount of ( nevertheless structured )
markup gibberish together with text. If one spends time and attention
one can find pattern in the markup ( we have XML and it's human
readable ).

A few years ago I used this conversion to implement roughly following
thing algorithm:

1. I manually highlighted one or more sections in a Word doc using a
background colour marker.
2. I searched for the colour marked section and determined the
structure. The structure information was fed into a state machine.
3. With this state machine I searched for all sections that were
equally structured.
4. I applied a href link to the text that was surrounded by the
structure and removed the colour marker.
5. In another document I searched for the same text and set an anchor.

This way I could link two documents ( those were public specifications
being originally disconnected ).

Kay
 
T

Terry Reedy

Kay said:
One can convert MS-Word documents into some class of XML documents
called MHTML. If I remember correctly those documents had an .mht
extension. The result is a huge amount of ( nevertheless structured )
markup gibberish together with text. If one spends time and attention
one can find pattern in the markup ( we have XML and it's human
readable ).

A related solution is to use OpenOffice to convert to
OpenDocumentFormat, a zipped multiple XML format, and then use ODFPY to
parse the XML and access the contents as linked objects.
http://opendocumentfellowship.com/development/projects/odfpy
 
B

bp.tralfamadore

Thanks everyone -- very helpful!
I really appreciate your help -- that is what makes the world a
wonderful place.

peace.

::bp::
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,581
Members
45,056
Latest member
GlycogenSupporthealth

Latest Threads

Top