Word 2007 XML merge & PDF conversion on Unix

Discussion in 'XML' started by Praveen Mohanan, Aug 4, 2007.

  1. Hi...All,

    We have a requirement where we have to do the mail merge for Word
    documents on Unix & then convert into PDF, all on unix/linux platform.

    I can convert all the Word documents to Word 2007 xml & store on unix
    platform.

    The Q I have is If I use xml/xslt to merge the data with the Word XML
    document & then store it back as an xml on unix ,how do I convert into PDF?

    Are there tools available to do all these?
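
A minimal sketch of the merge step described above, written in Python rather than XSLT purely for illustration. It assumes the Word 2007 document part has already been extracted to a string, and that the template marks merge points with hypothetical `{{key}}` placeholders (real Word mail merge uses field codes instead). The function name `merge_fields` is an assumption, not part of any library.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by Word 2007 document parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def merge_fields(document_xml: str, record: dict) -> str:
    """Replace {{key}} placeholders found in w:t text elements.

    Hypothetical helper: it only works when a placeholder sits inside a
    single w:t element. Word often splits text across several runs, in
    which case the placeholder will not be found.
    """
    ET.register_namespace("w", W_NS)  # keep the usual "w:" prefix on output
    root = ET.fromstring(document_xml)
    for text_node in root.iter(f"{{{W_NS}}}t"):
        if not text_node.text:
            continue
        for key, value in record.items():
            text_node.text = text_node.text.replace("{{" + key + "}}", value)
    return ET.tostring(root, encoding="unicode")
```

The result is still Word XML, so it answers only the merge half of the question; it still has to be rendered by something that understands that format.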

    Regards,

    P
    Praveen Mohanan, Aug 4, 2007
    #1

  2. [Jongware] Guest

    "Praveen Mohanan" <> wrote in message
    news:_R3ti.1360$...
    > I can convert all the Word documents to Word 2007 xml & store on unix
    > platform.
    >
    > The Q I have is If I use xml/xslt to merge the data with the Word XML
    > document & then store it back as an xml on unix ,how do I convert into PDF?


    1. You don't "convert" something to PDF. Ever. Please repeat for yourself.
    PDF is printer output, just as the paper from your printer. Have you ever
    converted something to paper?
    So, if PDF is printer output, your question becomes: "how do I print an xml on
    unix to pdf?"

    2. XML is an abstract data format. If you print XML, you'll get lots and lots of
    <this>stuff</this>. "Hey, now I *know* you're wrong! My Word file can be
    converted to XML!" Not so. Your Word file, saved as XML, is not different from
    the Word .DOC file (well, it shouldn't be). It is saved in another output
    format, yes, but you can't print your .DOC file to a printer either. (No you
    can't. You need Word to read the byte codes and interpret them for you.)

    3. The only way you can print your XSLT'ed file in the format you expect (a
    nice-looking text document, not line after line of <..>'s) is if you ensured
    your output XML format is still readable by Word. Then you can use Word to print
    to PDF.

    [Jw]
    [Jongware], Aug 4, 2007
    #2

  3. Peter Flynn Guest

    [Jongware] wrote:
    > "Praveen Mohanan" <> wrote in message
    > news:_R3ti.1360$...
    >> I can convert all the Word documents to Word 2007 xml & store on unix
    >> platform.
    >>
    >> The Q I have is If I use xml/xslt to merge the data with the Word XML
    >> document & then store it back as an xml on unix ,how do I convert into PDF?

    >
    > 1. You don't "convert" something to PDF. Ever. Please repeat for yourself.
    > PDF is printer output, just as the paper from your printer. Have you ever
    > converted something to paper?
    > So, if PDF is printer output, your question becomes: "how do I print an xml on
    > unix to pdf?"
    >
    > 2. XML is an abstract data format. If you print XML, you'll get lots and lots of
    > <this>stuff</this>. "Hey, now I *know* you're wrong! My Word file can be
    > converted to XML!" Not so. Your Word file, saved as XML, is not different from
    > the Word .DOC file (well, it shouldn't be). It is saved in another output
    > format, yes, but you can't print your .DOC file to a printer either. (No you
    > can't. You need Word to read the byte codes and interpret them for you.)
    >
    > 3. The only way you can print your XSLT'ed file in the format you expect (a
    > nice-looking text document, not line after line of <..>'s) is if you ensured
    > your output XML format is still readable by Word. Then you can use Word to print
    > to PDF.


    The first two are right on target, but the third is not the only answer.
    If the merged data+text are now in an XML format, you can use XSL[T]
    to transform to PDF by one of two methods:

    XSL:FO --> FO --> PDF using any FO processor
    XSLT --> LaTeX --> PDF using LaTeX

    Both work fine: LaTeX has better typographics but you have to learn it
    and it's not written in Java (some process pipelines demand end-to-end
    Java). Using XSL:FO you have to reinvent the wheel every time, and the
    only free processor (fop) is incomplete.
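
A rough sketch of the LaTeX route, assuming the merged data is in a simple, hypothetical `<letter><to>…</to><body><p>…</p></body></letter>` format (not Word's XML). It is written in Python instead of XSLT for brevity; the names `latex_escape` and `letter_to_latex` are made up for this example. The point it illustrates is that any text taken from the XML must have LaTeX's special characters escaped before it is typeset.

```python
import xml.etree.ElementTree as ET

# Characters that LaTeX treats specially, with their escaped forms.
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def latex_escape(text: str) -> str:
    """Escape character by character so nothing is escaped twice."""
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in text)


def letter_to_latex(xml_text: str) -> str:
    """Turn the hypothetical <letter> XML into a minimal LaTeX document."""
    root = ET.fromstring(xml_text)
    lines = [r"\documentclass{article}", r"\begin{document}"]
    lines.append(latex_escape(root.findtext("to", default="")))
    lines.append("")  # blank line = paragraph break in LaTeX
    for para in root.findall("body/p"):
        lines.append(latex_escape(para.text or ""))
        lines.append("")
    lines.append(r"\end{document}")
    return "\n".join(lines)
```

The resulting file can then be run through `pdflatex` to get the PDF.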

    Either way it's going to be tedious because Word does not identify the
    important parts of your document in a form that a computer can
    recognise, only its appearance in a form that human eyes and brain can
    understand, unless your authors have used specifically designed styles
    in a template. If you allowed your authors to put anything anywhere, in
    any format they wanted, you will now have to cope with the result, which
    can be painful.

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
    Peter Flynn, Aug 4, 2007
    #3
  4. [Jongware] wrote:
    > So, if PDF is printer output, your question becomes: "how do I print an xml on
    > unix to pdf?"


    Actually, the term usually used is "render" rather than print.

    The usual approach to getting from XML to PDF is to use XSLT stylesheets
    (using a processor such as Apache Xalan) to style the XML into XSL-FO
    markup, then run that through a Formatting Objects renderer (such as
    Apache FOP) to get PDFs.
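
The two-step pipeline above could be scripted roughly as follows. This is a sketch under assumptions: it uses `xsltproc` (from libxslt) for the XSLT step instead of Xalan, assumes both `xsltproc` and `fop` are on the PATH, and the file names are placeholders. `build_commands` and `render` are illustrative names only.

```python
import subprocess
from pathlib import Path


def build_commands(xml_path, xsl_path, pdf_path):
    """Return the two command lines: XML -> XSL-FO, then XSL-FO -> PDF."""
    # Intermediate .fo file is written next to the requested PDF.
    fo_path = Path(pdf_path).with_suffix(".fo")
    to_fo = ["xsltproc", "-o", str(fo_path), str(xsl_path), str(xml_path)]
    to_pdf = ["fop", "-fo", str(fo_path), "-pdf", str(pdf_path)]
    return [to_fo, to_pdf]


def render(xml_path, xsl_path, pdf_path):
    """Run both steps in order; raises CalledProcessError if either fails."""
    for cmd in build_commands(xml_path, xsl_path, pdf_path):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder file names; the stylesheet must emit XSL-FO markup.
    render("merged.xml", "letter-to-fo.xsl", "letter.pdf")
```

FOP can also apply the stylesheet itself (`fop -xml … -xsl … -pdf …`), which avoids the intermediate file; keeping the steps separate makes it easier to inspect the generated FO when the output looks wrong.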

    > 2. XML is an abstract data format. If you print XML, you'll get lots and lots of
    > <this>stuff</this>.


    XML is raw markup. However, XML-based rendering languages can be as rich
    as anything Word can do (or more so); XSL-FO is one example thereof.

    Ideally, the right thing to do is to drop Word from your toolchain; it's
    too much of a hassle to work with, and its markup is at the wrong level
    for effective retargeting. (For this kind of task what you want is a
    semantic markup system). If you can't, be prepared to wrestle with it.

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Aug 4, 2007
    #4
  5. [Jongware] Guest

    "Joe Kesselman" <> wrote in message
    news:...
    > [Jongware] wrote:
    > > So, if PDF is printer output, your question becomes: "how do I print an xml on
    > > unix to pdf?"

    >
    > Actually, the term usually used is "render" rather than print.


    Hah -- I was pondering on the proper term. This sounds better than 'virtual
    printing', or any such misnomers.

    [etc]
    > XML is raw markup. However, XML-based rendering languages can be as rich
    > as anything Word can do (or more so); XSL-FO is one example thereof.


    Peter Flynn also mentions this; I have my doubts if one could mould Word's XML
    to fit to XSL-FO specs. I for one am not inclined to investigate that any
    further. Anyway ... [continued below]

    > Ideally, the right thing to do is to drop Word from your toolchain; it's
    > too much of a hassle to work with, and its markup is at the wrong level
    > for effective retargeting. (For this kind of task what you want is a
    > semantic markup system). If you can't, be prepared to wrestle with it.


    [ctd.] -- if the only reason to put the Word doc through an XSLT process is to
    do some mail merging and then making PDFs of the result, it makes much more
    sense to drop the XSLT requirement. Word is perfectly able to do mail merges.
    The OP mentions he's working on a *nix version; in that case OpenOffice could be
    used, which surely has similar functionality. I wonder if Praveen has valid
    reasons to do this using XSLT, or if it's just because of Word being able to
    output XML, which then gets (ab)used by way of cross-platform document.
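
A sketch of the OpenOffice-style route, assuming a LibreOffice (or sufficiently recent OpenOffice) installation whose `soffice` binary accepts the `--headless --convert-to` options; older builds may not. The helper names are hypothetical, and whether the Word XML renders faithfully depends entirely on the office suite's import filter.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Command line for a headless document -> PDF conversion."""
    return [
        soffice,
        "--headless",            # no GUI, suitable for a server
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_to_pdf(doc_path, out_dir):
    """Run the conversion and return the expected output path."""
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".pdf")
```

This covers only the rendering step; the merge itself would still have to happen beforehand, either in the office suite or on the XML.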

    [Jw]
    [Jongware], Aug 4, 2007
    #5
  6. Andy Dingley Guest

    On 4 Aug, 20:34, "[Jongware]" <> wrote:

    > 1. You don't "convert" something to PDF. Ever. Please repeat for yourself.
    > PDF is printer output, just as the paper from your printer.


    Bollocks.


    To the OP, if you have a serious need to generate PDFs with serious
    control of the exact output, then look at generating XSL:FO from your
    XML with XSLT, then using Apache FOP to render the XSL:FO into PDFs.


    Of course PDF is an output format, not just a printer format. By
    definition: it _is_ a format, and is defined as one, independently of
    any printers. (although this is an unhelpful distinction for using
    it).

    Now it does have an implicit canvas embedded inside it (i.e. the page
    size), so it's not as "output independent" as XML or even Word can be.
    That's "printer like" behaviour, but it still doesn't make the
    stand-alone definition of its format vanish. It's also _independent_ of the
    final choice of printer and the control language that printer uses.

    It's a bit more complex than that of course. You frequently (and
    should!) have PDFs that have only a loose dependency on one paper
    size, so that they can be rendered to either A4 or US Letter paper
    sizes, depending on the local standards of the end user.

    You can also control (if you're careful enough) how a "document" is
    "rendered" to the internals of a PDF document. Do this carefully and
    you produce something that's device and scaling independent (probably
    efficiently small too). Do it badly and you spew out a crude bitmap
    that's prone to jaggies.

    Most PDFs are generated by a "printer driver" that sends its results
    to a file instead of a printer. Try Foxit's, if you don't want
    Adobe's. This doesn't mean that there's no PDF format, or that you
    can't convert content to that format.
    Andy Dingley, Aug 6, 2007
    #6
