Extracting Metadata from Microsoft Office documents

Discussion in 'Perl Misc' started by nodata, Jun 12, 2004.

  1. nodata

    nodata Guest

    How can I extract Metadata from a range of Office documents, on a Linux box?
     
    nodata, Jun 12, 2004
    #1

  2. Ben Morrow

    Ben Morrow Guest

    Quoth nodata:
    > How can I extract Metadata from a range of Office documents, on a
    > Linux box?


    The easiest way, if you can manage it, is to save the docs as HTML with
    a recent version of Office on a Windows box. New (since 2k-ish) versions
    of Office actually produce XML, with pretty much everything in the
    original file intact (and, of course, the file is about one tenth the
    size...). You can then parse this with, say, XML::LibXML and get out the
    data you need.
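
    As a rough illustration of that parsing step: the sketch below assumes the
    saved HTML carries its properties in an <o:DocumentProperties> block (as
    Word's "Save as Web Page" output does), pulls that block out with a regular
    expression and parses it on its own. It uses Python's standard library
    rather than XML::LibXML so that it runs without extra modules; the function
    name and the SAMPLE text are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Namespace that Office binds to the "o:" prefix in its HTML output.
OFFICE_NS = "urn:schemas-microsoft-com:office:office"


def office_html_properties(html_text):
    """Return {property: value} from the first o:DocumentProperties block, or {}."""
    match = re.search(
        r"<o:DocumentProperties>.*?</o:DocumentProperties>",
        html_text,
        re.DOTALL,
    )
    if match is None:
        return {}
    # Redeclare the o: prefix so the extracted fragment parses by itself.
    fragment = '<root xmlns:o="%s">%s</root>' % (OFFICE_NS, match.group(0))
    root = ET.fromstring(fragment)
    props = {}
    for element in root[0]:  # children of o:DocumentProperties
        name = element.tag.split("}", 1)[-1]  # drop the "{namespace}" part
        props[name] = (element.text or "").strip()
    return props


# Hypothetical, heavily trimmed sample of Office-generated HTML.
SAMPLE = """<html xmlns:o="urn:schemas-microsoft-com:office:office">
<head>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>nodata</o:Author>
  <o:Revision>2</o:Revision>
 </o:DocumentProperties>
</xml><![endif]-->
</head><body></body></html>"""

if __name__ == "__main__":
    print(office_html_properties(SAMPLE))
```

    Extracting the block first avoids having to parse the whole page, which is
    not guaranteed to be well-formed XML.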

    If you have access to a win32 box over the network it wouldn't be too
    hard to write a perl script for the win32 box which would receive a
    document, open it in Office using Win32::OLE, save it as HTML and send
    it back.

    If you don't, you're into parsing the binary file yourself; a quick look
    at CPAN doesn't show up anything useful. You could try creating
    documents with known metadata and grovelling around in the files with a
    hex editor to see if you can reverse engineer the format sufficiently;
    or you could try your luck with Abiword or OOffice to see if you can get
    them to convert the files into something you can read.
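
    For the hex-editor route, a small script can do some of the grovelling.
    The sketch below checks for the OLE2 compound-file signature that legacy
    .doc/.xls/.ppt files start with, then reports the byte offsets at which a
    known marker string occurs. It is written in Python with the standard
    library only. Assumptions: the marker is plain ASCII and is stored either
    as 8-bit text or as UTF-16LE; the function names are invented for the
    example, and this is only a starting point for reverse engineering, not a
    parser.

```python
# First eight bytes of an OLE2 compound file (legacy Office binary formats).
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def is_ole2(data):
    """True if the data starts with the OLE2 compound-file signature."""
    return data[:8] == OLE2_MAGIC


def find_offsets(data, text):
    """Return {encoding: [offsets]} for every occurrence of text in data.

    Assumption: the value is stored as 8-bit or UTF-16LE text.
    """
    hits = {}
    for encoding in ("latin-1", "utf-16-le"):
        needle = text.encode(encoding)
        offsets = []
        start = data.find(needle)
        while start != -1:
            offsets.append(start)
            start = data.find(needle, start + 1)
        hits[encoding] = offsets
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_marker.py document.doc "Known Author Name"
    with open(sys.argv[1], "rb") as handle:
        contents = handle.read()
    print("OLE2 compound file:", is_ole2(contents))
    print(find_offsets(contents, sys.argv[2]))
```

    Saving the same document twice with different marker values and comparing
    the reported offsets shows where a given property lives in the file.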

    Ben

    --
    I must not fear. Fear is the mind-killer. I will face my fear and
    I will let it pass through me. When the fear is gone there will be
    nothing. Only I will remain.
    Frank Herbert, 'Dune'
     
    Ben Morrow, Jun 13, 2004
    #2

  3. Ben Morrow

    Ben Morrow Guest

    Quoth Ben Morrow:
    >
    > Quoth nodata:
    > > How can I extract Metadata from a range of Office documents, on a
    > > Linux box?

    >
    > If you have access to a win32 box over the network it wouldn't be too
    > hard to write a perl script for the win32 box which would receive a
    > document, open it in Office using Win32::OLE, save it as HTML and send
    > it back.


    I meant to add that 'over the network' can include VMware, if you are in a
    position to afford it (and a Windows licence, and an Office licence)...

    Ben

    --
    Joy and Woe are woven fine,
    A Clothing for the Soul divine
    Under every grief and pine
    Runs a joy with silken twine.
    William Blake, 'Auguries of Innocence'
     
    Ben Morrow, Jun 13, 2004
    #3
  4. nodata

    nodata Guest

    > The easiest way, if you can manage it, is to save the docs as HTML with
    > a recent version of Office on a Windows box. New (since 2k-ish) versions
    > of Office actually produce XML, with pretty much everything in the
    > original file intact (and, of course, the file is about one tenth the
    > size...). You can then parse this with, say, XML::LibXML and get out the
    > data you need.


    Thanks.

    I'll be using the metadata extraction to do smart indexing of
    documents on an Apache server.
    The users store their documents in a folder, and the Apache server
    provides a useful listing of what files are in which folder - the
    metadata is key.

    The problem with saving as XML is that we can't yet switch to XML as
    the default file format, so putting a document on the Apache server
    would mean first saving an XML version, then saving a normal version.
    Not very efficient :/

    On top of that, there are also a large number of legacy documents
    which we need to keep in their current format because we haven't
    tested how reliable the document conversion will be.

    OpenOffice.org seems to have got metadata storage right. Maybe that'd
    be a better direction to move in.
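
    To show why the OpenOffice.org route looks easier, here is a minimal
    sketch, assuming the OpenOffice.org 1.x layout in which a document is a
    zip archive containing a meta.xml member whose <office:meta> element
    holds the fields (dc:title, dc:creator and so on). It is in Python with
    the standard library only; read_ooo_meta and the in-memory demo document
    are made up for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the "{namespace}" part that ElementTree puts in front of tags."""
    return tag.split("}", 1)[-1]


def read_ooo_meta(path_or_file):
    """Return {field: text} from meta.xml inside an OpenOffice.org document.

    Assumption: fields are direct children of an element named "meta".
    Fields that share a local name across namespaces would overwrite
    each other in this simple version.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("meta.xml"))
    fields = {}
    for section in root:
        if local_name(section.tag) != "meta":
            continue
        for element in section:
            if element.text and element.text.strip():
                fields[local_name(element.tag)] = element.text.strip()
    return fields


def make_demo_document():
    """Build a tiny stand-in archive in memory (not a complete document)."""
    meta = (
        '<office:document-meta'
        ' xmlns:office="http://openoffice.org/2000/office"'
        ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
        "<office:meta>"
        "<dc:title>Report</dc:title>"
        "<dc:creator>nodata</dc:creator>"
        "</office:meta>"
        "</office:document-meta>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("meta.xml", meta)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(read_ooo_meta(make_demo_document()))
```

    The same function accepts a file path, so an indexing script could call
    it on each document in a folder.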

    Wish me luck! :)
     
    nodata, Jun 13, 2004
    #4
