Word 2007 XML merge & PDF conversion on Unix

  • Thread starter Praveen Mohanan

Praveen Mohanan

Hi...All,

We have a requirement where we have to do the mail merge for Word
documents on Unix & then convert into PDF, all on unix/linux platform.

I can convert all the Word documents to Word 2007 xml & store on unix
platform.

My question is: if I use XML/XSLT to merge the data with the Word XML
document and then store it back as XML on Unix, how do I convert it into PDF?

Are there tools available to do all of this?

Regards,

P
 

[Jongware]

Praveen Mohanan said:
I can convert all the Word documents to Word 2007 xml & store on unix
platform.

My question is: if I use XML/XSLT to merge the data with the Word XML
document and then store it back as XML on Unix, how do I convert it into PDF?

1. You don't "convert" something to PDF. Ever. Please repeat for yourself.
PDF is printer output, just as the paper from your printer. Have you ever
converted something to paper?
So, if PDF is printer output, your question becomes: "how do I print an xml on
unix to pdf?"

2. XML is an abstract data format. If you print XML, you'll get lots and lots of
<this>stuff</this>. "Hey, now I *know* you're wrong! My Word file can be
converted to XML!" Not so. Your Word file, saved as XML, is not different from
the Word .DOC file (well, it shouldn't be). It is saved in another output
format, yes, but you can't print your .DOC file to a printer either. (No you
can't. You need Word to read the byte codes and interpret them for you.)

3. The only way you can print your XSLT'ed file in the format you expect (a
nice-looking text document, not line after line of <..>'s) is if you ensured
your output XML format is still readable by Word. Then you can use Word to print
to PDF.
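
As a rough illustration of point 3: a merge that only touches the text inside w:t elements leaves the rest of the WordprocessingML alone, so Word should still be able to open the result. This is only a sketch under assumptions. The {{name}} placeholder convention, the function name merge_fields, and the sample document are all made up for the example (they are not Word's own merge-field markup), and Python stands in for XSLT just to keep it self-contained.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by Word 2007 document parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def merge_fields(document_xml: str, values: dict) -> str:
    """Replace {{name}} placeholders (hypothetical convention) in w:t runs only."""
    root = ET.fromstring(document_xml)
    for text_node in root.iter(f"{{{W_NS}}}t"):
        if text_node.text:
            for key, value in values.items():
                text_node.text = text_node.text.replace("{{" + key + "}}", value)
    # Markup outside the w:t text is passed through unchanged.
    return ET.tostring(root, encoding="unicode")


# Tiny hand-written sample, not a complete Word document part.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
    "<w:t>Dear {{first_name}},</w:t>"
    "</w:r></w:p></w:body></w:document>"
)

merged = merge_fields(SAMPLE, {"first_name": "Praveen"})
```

One caveat: Word often splits what looks like a single word across several runs, so a placeholder typed in Word may not sit inside one w:t element. A real merge would have to cope with that.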

[Jw]
 

Peter Flynn

[Jongware] said:
1. You don't "convert" something to PDF. Ever. Please repeat for yourself.
PDF is printer output, just as the paper from your printer. Have you ever
converted something to paper?
So, if PDF is printer output, your question becomes: "how do I print an xml on
unix to pdf?"

2. XML is an abstract data format. If you print XML, you'll get lots and lots of
<this>stuff</this>. "Hey, now I *know* you're wrong! My Word file can be
converted to XML!" Not so. Your Word file, saved as XML, is not different from
the Word .DOC file (well, it shouldn't be). It is saved in another output
format, yes, but you can't print your .DOC file to a printer either. (No you
can't. You need Word to read the byte codes and interpret them for you.)

3. The only way you can print your XSLT'ed file in the format you expect (a
nice-looking text document, not line after line of <..>'s) is if you ensured
your output XML format is still readable by Word. Then you can use Word to print
to PDF.

The first two are right on target, but the third is not the only answer.
If the merged data+text are now in an XML format, you can use XSL[T]
to transform to PDF by one of two methods:

XSLT --> XSL-FO --> PDF using any FO processor
XSLT --> LaTeX --> PDF using LaTeX

Both work fine: LaTeX has better typographics but you have to learn it
and it's not written in Java (some process pipelines demand end-to-end
Java). Using XSL:FO you have to reinvent the wheel every time, and the
only free processor (fop) is incomplete.
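
To give an idea of what the FO stage has to receive, here is a minimal sketch that writes a one-page-master XSL-FO document. It builds the FO tree directly in Python rather than through an XSLT stylesheet, purely because the Python standard library has no XSLT processor; the element and attribute names are from XSL-FO, while build_fo, the A4 page geometry, and the sample paragraphs are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag: str) -> str:
    """Return the namespace-qualified name of an XSL-FO element."""
    return f"{{{FO_NS}}}{tag}"


def build_fo(paragraphs: list) -> str:
    """Build a minimal XSL-FO document with one fo:block per paragraph."""
    root = ET.Element(fo("root"))
    layout = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(layout, fo("simple-page-master"), {
        "master-name": "page",
        "page-width": "210mm",   # assumed A4; adjust for US Letter
        "page-height": "297mm",
        "margin-top": "20mm",
        "margin-bottom": "20mm",
        "margin-left": "20mm",
        "margin-right": "20mm",
    })
    ET.SubElement(page, fo("region-body"))
    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "page"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})
    for text in paragraphs:
        block = ET.SubElement(flow, fo("block"), {"space-after": "6pt"})
        block.text = text
    return ET.tostring(root, encoding="unicode")


fo_text = build_fo(["Dear Praveen,", "Your documents are attached."])
```

Every layout decision (page master, regions, spacing) has to be spelled out like this, which is the wheel-reinventing part.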

Either way it's going to be tedious because Word does not identify the
important parts of your document in a form that a computer can
recognise, only its appearance in a form that human eyes and brain can
understand, unless your authors have used specifically designed styles
in a template. If you allowed your authors to put anything anywhere, in
any format they wanted, you will now have to cope with the result, which
can be painful.
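
A quick way to see how bad it is: list each paragraph with its named style, if any. The sketch below assumes the Word 2007 WordprocessingML namespace and a hand-written two-paragraph sample; paragraphs_with_styles is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by Word 2007 document parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag: str) -> str:
    """Return the namespace-qualified name of a WordprocessingML element."""
    return f"{{{W_NS}}}{tag}"


def paragraphs_with_styles(document_xml: str) -> list:
    """Return (style id or None, text) for every w:p in the document."""
    root = ET.fromstring(document_xml)
    result = []
    for para in root.iter(w("p")):
        style = para.find(f"{w('pPr')}/{w('pStyle')}")
        style_id = style.get(w("val")) if style is not None else None
        text = "".join(t.text or "" for t in para.iter(w("t")))
        result.append((style_id, text))
    return result


# Hand-written sample: one styled heading, one unstyled paragraph.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    "<w:r><w:t>Terms</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Some body text.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

styles = paragraphs_with_styles(SAMPLE)
```

Paragraphs that come back with None as the style are the ones where only the visual formatting is available to work from.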

///Peter
 

Joe Kesselman

[Jongware] said:
So, if PDF is printer output, your question becomes: "how do I print an xml on
unix to pdf?"

Actually, the term usually used is "render" rather than print.

The usual approach to getting from XML to PDF is to use XSLT stylesheets
(using a processor such as Apache Xalan) to style the XML into XSL-FO
markup, then run that through a Formatting Objects renderer (such as
Apache FOP) to get PDFs.
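
As a sketch of how that pipeline can be driven from a script: FOP's command-line launcher can apply the stylesheet itself when given -xml and -xsl, so a single call covers both steps. This assumes an fop launcher on the PATH; the file names and the helper names fop_command and render_pdf are placeholders for the example.

```python
import shutil
import subprocess
from pathlib import Path


def fop_command(xml_path, xsl_path, pdf_path, executable="fop"):
    """Build the FOP call that styles XML through an XSLT sheet and renders PDF."""
    return [
        executable,
        "-xml", str(xml_path),   # merged XML input
        "-xsl", str(xsl_path),   # stylesheet that produces XSL-FO
        "-pdf", str(pdf_path),   # rendered output
    ]


def render_pdf(xml_path, xsl_path, pdf_path) -> Path:
    """Run FOP; raises if the launcher is missing or the run fails."""
    if shutil.which("fop") is None:
        raise RuntimeError("fop launcher not found on PATH (assumed install)")
    subprocess.run(fop_command(xml_path, xsl_path, pdf_path), check=True)
    return Path(pdf_path)


command = fop_command("merged.xml", "to-fo.xsl", "merged.pdf")
```

Nothing runs until render_pdf is called, so the command list can be inspected or logged first.
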
2. XML is an abstract data format. If you print XML, you'll get lots and lots of
<this>stuff</this>.

XML is raw markup. However, XML-based rendering languages can be as rich
as anything Word can do (or more so); XSL-FO is one example thereof.

Ideally, the right thing to do is to drop Word from your toolchain; it's
too much of a hassle to work with, and its markup is at the wrong level
for effective retargeting. (For this kind of task what you want is a
semantic markup system). If you can't, be prepared to wrestle with it.
 

[Jongware]

Joe Kesselman said:
Actually, the term usually used is "render" rather than print.

Hah -- I was pondering on the proper term. This sounds better than 'virtual
printing', or any such misnomers.

[etc]
XML is raw markup. However, XML-based rendering languages can be as rich
as anything Word can do (or more so); XSL-FO is one example thereof.

Peter Flynn also mentions this; I doubt that one could mould Word's XML
to fit the XSL-FO specs. I for one am not inclined to investigate that any
further. Anyway ... [continued below]
Ideally, the right thing to do is to drop Word from your toolchain; it's
too much of a hassle to work with, and its markup is at the wrong level
for effective retargeting. (For this kind of task what you want is a
semantic markup system). If you can't, be prepared to wrestle with it.

[ctd.] -- if the only reason to put the Word doc through an XSLT process is to
do some mail merging and then making PDFs of the result, it makes much more
sense to drop the XSLT requirement. Word is perfectly able to do mail merges.
The OP mentions he's working on a *nix version; in that case OpenOffice could be
used, which surely has similar functionality. I wonder if Praveen has valid
reasons to do this using XSLT, or if it's just because Word is able to
output XML, which then gets (ab)used as a cross-platform document format.
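
For the PDF step alone, a headless office suite can do the conversion from a script. The sketch below assumes LibreOffice (the OpenOffice descendant) with its soffice launcher on the PATH; older OpenOffice builds may spell the options differently. It covers only the conversion of an already merged document, not the mail merge, and the helper names are made up for the example.

```python
import shutil
import subprocess
from pathlib import Path


def soffice_command(input_path, out_dir, executable="soffice"):
    """Build a headless LibreOffice call that converts one document to PDF."""
    return [
        executable,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(input_path),
    ]


def expected_pdf_path(input_path, out_dir) -> Path:
    """The converter names the output after the input file's stem."""
    return Path(out_dir) / (Path(input_path).stem + ".pdf")


def convert_to_pdf(input_path, out_dir) -> Path:
    """Run the conversion; raises if soffice is missing or the run fails."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH (assumed install)")
    subprocess.run(soffice_command(input_path, out_dir), check=True)
    return expected_pdf_path(input_path, out_dir)


command = soffice_command("letters/merged.docx", "pdf")
```

How faithfully the layout survives depends on how well the suite reads the Word file, so the output is worth checking against what Word itself produces.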

[Jw]
 

Andy Dingley

[Jongware] said:
1. You don't "convert" something to PDF. Ever. Please repeat for yourself.
PDF is printer output, just as the paper from your printer.

Bollocks.


To the OP, if you have a serious need to generate PDFs with serious
control of the exact output, then look at generating XSL:FO from your
XML with XSLT, then using Apache FOP to render the XSL:FO into PDFs.


Of course PDF is an output format, not just a printer format. By
definition: it _is_ a format, and is defined as one, independently of
any printers. (although this is an unhelpful distinction for using
it).

Now it does have an implicit canvas embedded inside it (i.e. the page
size), so it's not as "output independent" as XML or even Word can be.
That's "printer like" behaviour, but it still doesn't make the stand-alone
definition of its format vanish. It's also _independent_ of the
final choice of printer and the control language that printer uses.

It's a bit more complex than that of course. You frequently have (and
should have!) PDFs that have only a loose dependency on one paper
size, so that they can be rendered to either A4 or US Letter paper
sizes, depending on the local standards of the end user.
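
The arithmetic behind that is simple enough. As a sketch: take the narrower width (A4) and the shorter height (US Letter) and lay the content out inside that shared area. PDF units are points, 72 to the inch; the 36 pt margin and the function names are arbitrary choices for the example.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0


def mm_to_pt(mm: float) -> float:
    """Convert millimetres to PDF points (1/72 inch)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


# Page sizes as (width, height) in points.
A4 = (mm_to_pt(210), mm_to_pt(297))
LETTER = (8.5 * POINTS_PER_INCH, 11 * POINTS_PER_INCH)


def shared_content_area(margin_pt: float) -> tuple:
    """Largest content box that fits on both A4 and US Letter with this margin."""
    width = min(A4[0], LETTER[0]) - 2 * margin_pt
    height = min(A4[1], LETTER[1]) - 2 * margin_pt
    return width, height


content_width, content_height = shared_content_area(36.0)
```

A4 is narrower and Letter is shorter, so the shared box is limited by A4's width and Letter's height.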

You can also control (if you're careful enough) how a "document" is
"rendered" to the internals of a PDF document. Do this carefully and
you produce something that's device and scaling independent (probably
efficiently small too). Do it badly and you spew out a crude bitmap
that's prone to jaggies.

Most PDFs are generated by a "printer driver" that sends its results
to a file instead of a printer. Try Foxit's, if you don't want
Adobe's. This doesn't mean that there's no PDF format, or that you
can't convert content to that format.
 
