Extracting text data from MS Word document

Max

Hello,
I need to extract the text (as a stream of ASCII characters, for
instance, or as a text file) from an MS Word document using Java. Since
this will run in a non-MS environment such as UNIX, I cannot rely on
Windows-specific APIs.

Does anybody have experience with this?
I'd appreciate any references or hints on the subject.


The MS Word part of the Jakarta POI project doesn't look ready for use
right now, or is it?

Thank you,
Max
 
Paul Lutus

Ann wrote:

: / ...
: If you have control, save the word document in RTF format
: which is mostly text.

Not really. Look at one sometime in a plain-text editor.

: Then just read it as any other text file.

If the intent is to read it "as any other text file", why not save it as any
other text file? Word does that too. If instead it is saved as RTF, it
should be read as RTF, which Java can do with some limited success.
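
For reference, here is a minimal sketch of reading RTF from Java. It assumes the Swing RTFEditorKit that ships with the JDK is acceptable; the class name RtfTextReader and the method toPlainText are made up for illustration. The kit understands only a subset of RTF, so complex documents may lose content.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

public class RtfTextReader {

    // Parses an RTF stream with Swing's RTFEditorKit and returns the plain text.
    // (Hypothetical helper, not part of any library.)
    static String toPlainText(InputStream rtf) throws IOException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        try {
            kit.read(rtf, doc, 0);                  // fill the document, starting at offset 0
            return doc.getText(0, doc.getLength()); // formatting is dropped, text remains
        } catch (BadLocationException e) {
            throw new IOException("Could not read RTF content", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Tiny inline RTF sample, made up for this sketch.
        byte[] sample = "{\\rtf1\\ansi Hello from RTF}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toPlainText(new ByteArrayInputStream(sample)).trim());
    }
}
```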
 
Max

Unfortunately, this approach doesn't fit ...

I cannot make users save documents in a specific format ...
The Big Idea is that
1. the user works with their preferred document format (MS Word)
2. sends it to a System
3. the System extracts the text data from it and processes it as necessary ...

Any other ideas?
:)
 
Paul Lutus

Max said:
: Unfortunately, this approach doesn't fit ...
:
: I cannot make users save documents in a specific format ...

Then you cannot get RTF either, which was Ann's suggestion. Too bad.

I guess you will have to see what MS Word converters are available on the
receiving end.

: The Big Idea is that
: 1. the user works with their preferred document format (MS Word)
: 2. sends it to a System
: 3. the System extracts the text data from it and processes it as necessary ...

Yes, and for that, you will need an MS Word converter. Since the target
platform is described as "UNIX", I can't go further without finding out
which Unix. If it were Linux, I would know exactly what to tell you (a
Linux installation can, and usually does, host several MS Word
converters).
 
Malcolm Dew-Jones

Max wrote:
: Hello,
: I need to extract textual information (as ASCII chars' stream for
: instance, or a text file) from MS Word document using Java.


antiword is not written in Java, but it does what you want. Run it with
the Java equivalent of the well-known system() call.

I.e. in Perl

system("antiword ms-word-file.doc > temporary-file.txt");

temporary-file.txt then contains what you want.
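
The Java equivalent of that call could look roughly like the sketch below. It assumes antiword is installed and on the PATH of the receiving machine; the class and method names (WordTextExtractor, antiwordCommand, runAndCapture, extractText) are made up for illustration. It reads antiword's standard output directly instead of going through a temporary file, and decoding that output as UTF-8 is an assumption that depends on how antiword is configured.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class WordTextExtractor {

    // Builds the command line: antiword <file>. No extra flags are assumed.
    static List<String> antiwordCommand(Path docFile) {
        return List.of("antiword", docFile.toString());
    }

    // Runs a command, returns everything it wrote to stdout, fails on non-zero exit.
    static String runAndCapture(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Pass the child's stderr through so its error messages stay visible
        // and cannot fill up an unread pipe.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Exit code " + exitCode + " from " + command);
        }
        // Assumption: the converter emits UTF-8; adjust to match your setup.
        return new String(output, StandardCharsets.UTF_8);
    }

    // Extracts the text of a Word document by delegating to antiword.
    static String extractText(Path docFile) throws IOException, InterruptedException {
        return runAndCapture(antiwordCommand(docFile));
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extractText(Paths.get(args[0])));
    }
}
```

Passing the command as a list of arguments, rather than one shell string, avoids quoting problems with file names that contain spaces.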
 
