Extracting Metadata from Microsoft Office documents

nodata · Jun 12, 2004

How can I extract Metadata from a range of Office documents, on a Linux box?

Ben Morrow · Jun 13, 2004

Quoth (e-mail address removed) (nodata):

How can I extract Metadata from a range of Office documents, on a
Linux box?

The easiest way, if you can manage it, is to save the docs as HTML with
a recent version of Office on a Windows box. New (since 2k-ish) versions
of Office actually produce XML, with pretty much everything in the
original file intact (and, of course, the file is about one tenth the
size...). You can then parse this with, say, XML::LibXML and get out the
data you need.
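As a rough illustration of that last step, here is a minimal Python sketch. It assumes the saved HTML carries the document properties in an `<o:DocumentProperties>` block, which is what Word's "Save as Web Page" output typically contains. The sample text and the function name `office_html_properties` are made up for illustration. It uses regular expressions rather than a real XML parser, since the saved file as a whole may not be well-formed XML; a proper parser is the better choice when the file allows it.

```python
import re

# Hypothetical sample of the header a Word "Save as Web Page" file may contain.
SAMPLE = """<html xmlns:o="urn:schemas-microsoft-com:office:office">
<head>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>nodata</o:Author>
  <o:Title>Quarterly report</o:Title>
  <o:Pages>3</o:Pages>
 </o:DocumentProperties>
</xml><![endif]-->
</head><body></body></html>"""


def office_html_properties(html):
    """Return the o:DocumentProperties children as a {name: text} dict."""
    block = re.search(
        r"<o:DocumentProperties>(.*?)</o:DocumentProperties>", html, re.S
    )
    if block is None:
        # No properties block found; the file may not come from Office.
        return {}
    # Each property is assumed to be a simple <o:Name>text</o:Name> element.
    pairs = re.findall(r"<o:(\w+)>(.*?)</o:\1>", block.group(1), re.S)
    return dict(pairs)


if __name__ == "__main__":
    for name, value in office_html_properties(SAMPLE).items():
        print(f"{name}: {value}")
```

All values come back as strings, so a page count or a date needs converting by the caller.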

If you have access to a win32 box over the network it wouldn't be too
hard to write a perl script for the win32 box which would receive a
document, open it in Office using Win32::OLE, save it as HTML and send
it back.

If you don't, you're into parsing the binary file yourself; a quick look
at CPAN doesn't show up anything useful. You could try creating
documents with known metadata and grovelling around in the files with a
hex editor to see if you can reverse engineer the format sufficiently;
or you could try your luck with Abiword or OOffice to see if you can get
them to convert the files into something you can read.
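The "known metadata" approach can be partly automated. The sketch below is only a starting point for that kind of reverse engineering: it searches a file's raw bytes for a value you planted yourself (say, an unusual author name) and reports where it turns up. It assumes the value might be stored either as single-byte text or as UTF-16LE; that is a guess about the format, not something established in this thread, and the function name `find_known_value` is hypothetical.

```python
def find_known_value(data, value):
    """Return sorted (offset, encoding) pairs where `value` occurs in `data`.

    Assumption: the string is stored as Latin-1 or UTF-16LE. Values with
    characters outside Latin-1 would need a different encoding list.
    """
    hits = []
    for encoding in ("latin-1", "utf-16-le"):
        needle = value.encode(encoding)
        start = data.find(needle)
        while start != -1:
            hits.append((start, encoding))
            start = data.find(needle, start + 1)
    return sorted(hits)


# Made-up bytes standing in for a document saved with a known author name.
DEMO_BLOB = b"\x00\x01" + b"Jane" + b"\xff" + "Jane".encode("utf-16-le") + b"\x00"

if __name__ == "__main__":
    for offset, encoding in find_known_value(DEMO_BLOB, "Jane"):
        print(f"offset {offset:#06x}: {encoding}")
```

Running it over two documents that differ only in one metadata field, then comparing the reported offsets, narrows down where to look with the hex editor.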

Ben

Ben Morrow · Jun 13, 2004

Quoth Ben Morrow:

If you have access to a win32 box over the network it wouldn't be too
hard to write a perl script for the win32 box which would receive a
document, open it in Office using Win32::OLE, save it as HTML and send
it back.

I meant to add that 'over the network' can include VMware, if you are in a
position to afford it (and a Windows license, and an Office license)...

Ben

nodata · Jun 13, 2004

Quoth Ben Morrow:

The easiest way, if you can manage it, is to save the docs as HTML with
a recent version of Office on a Windows box. New (since 2k-ish) versions
of Office actually produce XML, with pretty much everything in the
original file intact (and, of course, the file is about one tenth the
size...). You can then parse this with, say, XML::LibXML and get out the
data you need.

Thanks.

I'll be using the metadata extraction to do smart indexing of
documents on an Apache server.
The users store their documents in a folder, and the Apache server
provides a useful listing of which files are in which folder; the
metadata is key.

The problem with saving as XML is that we can't yet switch to XML as
the default file format, so putting a document on the Apache server
would mean first saving an XML version, then saving a normal version.
Not very efficient :/

On top of that, there are also a large number of legacy documents
which we need to keep in their current format because we haven't
tested how reliable the document conversion will be.

OpenOffice.org seems to have got metadata storage right. Maybe that'd
be a better direction to move in.

Wish me luck!
