Extracting Metadata from Microsoft Office documents

Ben Morrow

Quoth (e-mail address removed) (nodata):
How can I extract Metadata from a range of Office documents, on a
Linux box?

The easiest way, if you can manage it, is to save the docs as HTML with
a recent version of Office on a Windows box. New (since 2k-ish) versions
of Office actually produce XML, with pretty much everything in the
original file intact (and, of course, the file is about one tenth the
size...). You can then parse this with, say, XML::LibXML and get out the
data you need.
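
For what it's worth, the parsing step might look roughly like the sketch
below. It assumes the saved HTML carries its metadata in an
<o:DocumentProperties> block bound to the
urn:schemas-microsoft-com:office:office namespace, which is how Word's
web-page output usually looks; check that against your own files. The
sub name office_properties is made up for illustration. Since the page
as a whole may well not be well-formed XML, the sketch cuts the block
out with a regex first and only hands that part to XML::LibXML.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Namespace URI assumed for the o: prefix in Office-generated HTML.
my $OFFICE_NS = 'urn:schemas-microsoft-com:office:office';

# Return the children of <o:DocumentProperties> as a name => value list,
# or an empty list if the block is not present.
sub office_properties {
    my ($html) = @_;

    # Isolate the properties block; the rest of the page is ignored.
    my ($block) =
        $html =~ m{(<o:DocumentProperties\b.*?</o:DocumentProperties>)}s
        or return;

    # Re-wrap the block so the o: prefix is declared, then parse it.
    my $doc = XML::LibXML->load_xml(
        string => qq{<root xmlns:o="$OFFICE_NS">$block</root>},
    );
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(o => $OFFICE_NS);

    my %props;
    for my $node ($xpc->findnodes('//o:DocumentProperties/*')) {
        $props{ $node->localname } = $node->textContent;
    }
    return %props;
}

# Command-line use: print every property found in the named file.
if (@ARGV) {
    local $/;    # slurp the whole file in one read
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my %props = office_properties(<$fh>);
    print "$_: $props{$_}\n" for sort keys %props;
}
```

Given a saved file it prints one "Name: value" line per property. If the
file is in a Windows codepage rather than UTF-8 and the properties
contain non-ASCII text, you will probably need to decode it first.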

If you have access to a win32 box over the network it wouldn't be too
hard to write a perl script for the win32 box which would receive a
document, open it in Office using Win32::OLE, save it as HTML and send
it back.
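
The conversion half of such a script could be sketched as follows, with
the network plumbing left out entirely. doc_to_html is a hypothetical
name, both paths are assumed to be absolute Windows paths, and 8 is the
value of Word's wdFormatHTML constant. Treat it as a starting point
rather than something tested against any particular Office version.

```perl
use strict;
use warnings;

# Word's WdSaveFormat value for "Web Page" (HTML) output.
use constant wdFormatHTML => 8;

# Open $in in Word and save a copy as HTML to $out.
sub doc_to_html {
    my ($in, $out) = @_;

    # Loaded at run time so the file still compiles on a non-Windows box.
    require Win32::OLE;

    # 'Quit' is called on the Word object when it goes out of scope.
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Cannot start Word: " . Win32::OLE->LastError;

    my $doc = $word->Documents->Open($in)
        or die "Cannot open $in: " . Win32::OLE->LastError;

    $doc->SaveAs($out, wdFormatHTML);
    $doc->Close(0);    # 0 = close without saving changes to the original
    return $out;
}
```

Called as doc_to_html('C:\in\report.doc', 'C:\out\report.html') (example
paths only), it should leave the HTML version at the second path.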

If you don't, you're into parsing the binary file yourself; a quick look
at CPAN doesn't show up anything useful. You could try creating
documents with known metadata and grovelling around in the files with a
hex editor to see if you can reverse engineer the format sufficiently;
or you could try your luck with Abiword or OOffice to see if you can get
them to convert the files into something you can read.

Ben
 
Ben Morrow

Quoth Ben Morrow:

If you have access to a win32 box over the network it wouldn't be too
hard to write a perl script for the win32 box which would receive a
document, open it in Office using Win32::OLE, save it as HTML and send
it back.

I meant to add that 'over the network' can include VMware, if you are in a
position to afford it (and a Windows licence, and an Office licence)...

Ben
 
nodata

The easiest way, if you can manage it, is to save the docs as HTML with
a recent version of Office on a Windows box. New (since 2k-ish) versions
of Office actually produce XML, with pretty much everything in the
original file intact (and, of course, the file is about one tenth the
size...). You can then parse this with, say, XML::LibXML and get out the
data you need.

Thanks.

I'll be using the metadata extraction to do smart indexing of
documents on an Apache server.
The users store their documents in a folder, and the Apache server
provides a useful listing of what files are in which folder - the
metadata is key.

The problem with saving as XML is that we can't yet switch to XML as
the default file format, so putting a document on the Apache server
would mean first saving an XML version, then saving a normal version.
Not very efficient :/

On top of that, there are also a large number of legacy documents
which we need to keep in their current format because we haven't
tested how reliable the document conversion will be.

OpenOffice.org seems to have got metadata storage right. Maybe that'd
be a better direction to move in.

Wish me luck! :)
 
