Extracting text data from MS Word document

Discussion in 'Java' started by Max, Sep 15, 2004.

  1. Max

    Max Guest

    Hello,
    I need to extract text (as a stream of ASCII characters, for
    instance, or as a text file) from an MS Word document using Java.
    Since this will run in a non-MS environment, e.g. UNIX, I cannot rely
    on Windows-specific APIs.

    Does anybody have experience with this?
    I'd appreciate any references or hints on the subject.


    The MS Word part of the Jakarta POI project doesn't look ready for
    use right now. Or is it?

    Thank you,
    Max
     
    Max, Sep 15, 2004
    #1

  2. Max

    Ike Guest

    You may want to look at openoffice.org; I believe they are on SourceForge.
    They read/write MS Word files in Java.

    Secondly, check out http://jakarta.apache.org/poi/

    Thirdly, try http://www.wotsit.org/ and search on 'word'; you'll find they
    have the docs on all MS Word formats.

    -Ike

     
    Ike, Sep 16, 2004
    #2

  3. Max

    Ann Guest

    If you have control, save the word document in RTF format
    which is mostly text. Then just read it as any other text file.
     
    Ann, Sep 16, 2004
    #3
  4. Max

    Paul Lutus Guest

    Ann wrote:

    / ...

    > If you have control, save the word document in RTF format
    > which is mostly text.


    Not really. Look at one sometime in a plain-text editor.

    > Then just read it as any other text file.


    If the intent is to read it "as any other text file", why not save it as any
    other text file? Word does that too. If instead it is saved as RTF, it
    should be read as RTF, which Java can do with some limited success.
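
    For reference, a minimal sketch of reading RTF with the JDK's own
    Swing text classes (javax.swing.text.rtf.RTFEditorKit). The class
    name RtfText and the sample RTF string are made up for illustration.
    The parser covers only part of the RTF specification, so formatting,
    and possibly some content in complex documents, may be lost.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper class; the name is an assumption for illustration.
public class RtfText {

    /** Parses RTF from the stream and returns the document's plain text. */
    public static String extract(InputStream in)
            throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(in, doc, 0);                   // parse the RTF into the document model
        return doc.getText(0, doc.getLength()); // drop formatting, keep the text
    }

    public static void main(String[] args) throws Exception {
        // Tiny made-up RTF document used only to exercise the method.
        String rtf = "{\\rtf1\\ansi{\\fonttbl{\\f0\\fswiss Helvetica;}}"
                + "\\f0\\pard Hello, world.\\par}";
        try (InputStream in = new ByteArrayInputStream(
                rtf.getBytes(StandardCharsets.US_ASCII))) {
            System.out.println(extract(in));
        }
    }
}
```

    To read a file instead of a string, pass a FileInputStream to
    extract().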

    --
    Paul Lutus
    http://www.arachnoid.com
     
    Paul Lutus, Sep 16, 2004
    #4
  5. Max

    Max Guest

    Unfortunately, this approach doesn't fit ...

    I cannot make users save documents in a specific format ...
    The Big Idea is that
    1. the user works with their preferred document format (MS Word)
    2. sends it to a System
    3. the System extracts text data from it and processes it as necessary ...

    Any other ideas?
    :)


     
    Max, Sep 16, 2004
    #5
  6. Max

    Paul Lutus Guest

    Max wrote:

    > unfortunately this approach doesn't fit ...
    >
    > I cannot make users save documents in a specific format ...


    Then you cannot get RTF either, as Ann suggested. Too bad.

    I guess you will have to see what MS Word converters are available on the
    receiving end.

    > The Big Idea is that
    > 1. the user works with their preferred document format (MS Word)
    > 2. sends it to a System
    > 3. the System extracts text data from it and processes it as necessary ...


    Yes, and for that, you will need an MS Word converter. Since the target
    platform is described as "UNIX", I can't go further without finding out
    which UNIX. If it were Linux, I would know exactly what to tell you (a
    Linux installation can, and usually does, host several MS Word
    converters).

    --
    Paul Lutus
    http://www.arachnoid.com
     
    Paul Lutus, Sep 16, 2004
    #6
  7. Max

    Malcolm Dew-Jones

    Max wrote:
    : Hello,
    : I need to extract textual information (as ASCII chars' stream for
    : instance, or a text file) from MS Word document using Java.


    antiword is not written in Java, but it does what you want. Run it with
    the Java equivalent of the well-known system command.

    E.g. in Perl:

    system("antiword ms-word-file.doc > temporary-file.txt");

    temporary-file.txt then contains what you want.
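
    A rough Java counterpart of that Perl one-liner, as a sketch only. It
    assumes antiword is installed and on the PATH, and that its output
    can be decoded as UTF-8 (that depends on how antiword is configured).
    The class and method names are hypothetical. Output goes to a
    temporary file, as in the Perl version, so the child process cannot
    block on a full pipe.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper class; names are assumptions for illustration.
public class ExternalConverter {

    /** Runs the command, sends its stdout to a temp file, and returns that text. */
    public static String run(List<String> command)
            throws IOException, InterruptedException {
        File out = File.createTempFile("converted", ".txt");
        try {
            Process p = new ProcessBuilder(command)
                    .redirectOutput(out)                            // like "> temporary-file.txt"
                    .redirectError(ProcessBuilder.Redirect.INHERIT) // show the tool's errors
                    .start();
            int exit = p.waitFor();
            if (exit != 0) {
                throw new IOException(command.get(0) + " exited with status " + exit);
            }
            // Assumption: the converter's output is UTF-8.
            return new String(Files.readAllBytes(out.toPath()), StandardCharsets.UTF_8);
        } finally {
            out.delete();
        }
    }

    /** Converts a Word file to text by invoking antiword (must be on the PATH). */
    public static String wordToText(File doc)
            throws IOException, InterruptedException {
        return run(Arrays.asList("antiword", doc.getPath()));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ExternalConverter ms-word-file.doc");
            return;
        }
        System.out.print(wordToText(new File(args[0])));
    }
}
```

    Passing the arguments as a list rather than one shell string means
    file names with spaces or shell metacharacters need no quoting.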
     
    Malcolm Dew-Jones, Sep 17, 2004
    #7
