converting stuff to xml files?

Discussion in 'XML' started by yawnmoth, Dec 24, 2007.

  1. yawnmoth

    yawnmoth Guest

    XSL stylesheets can be used to convert an XML file into whatever
    binary format you want (DocBook, for example, does PDF's). My
    question is... what if you wanted to go in the other direction? (eg.
    convert a PDF to DocBook) Could you do that with existing XML
    utilities or would you have to write your own program to do that?
     
    yawnmoth, Dec 24, 2007
    #1

  2. yawnmoth

    Tony Lavinio Guest

    yawnmoth wrote:
    > XSL stylesheets can be used to convert an XML file into whatever
    > binary format you want (DocBook, for example, does PDF's). My
    > question is... what if you wanted to go in the other direction? (eg.
    > convert a PDF to DocBook) Could you do that with existing XML
    > utilities or would you have to write your own program to do that?


    In this case, XSL stylesheets actually turn XML to XSL-FO, which is a
    specific type of XML that post-processors then turn into PDF's. That
    is why you need Apache FOP or XEP or something else.

    XSL only does XML-to-XML, XML-to-text, and XML-to-HTML.

    You can go from some formats into XML, using XSLT 2.0's unparsed-text()
    function, which reads a URI-addressable resource into a string, but
    that's pretty much it.

    There are products which will convert from non-XML to XML; we sell some,
    other companies sell others, and some are open source.

    PDF to XML is hard, since the data isn't always rendered in the order
    in which it went in. Each piece of text is stored pretty much as x, y,
    value (simplifying a lot!), and the x's and y's aren't necessarily
    sorted. There is a Java suite called PDFBox which is open source and
    has some useful tools for parsing PDF's; there was another company that
    specialized in it somewhere, but just Google for "pdf to xml" and see
    what you find.
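
    As a rough illustration of the ordering problem described above, here is a
    minimal Python sketch. It assumes the text has already been extracted as
    (x, y, value) fragments by some PDF parser; the Fragment type, the
    reading_order function, and the line_tolerance parameter are hypothetical
    names made up for this example and are not part of PDFBox or any other
    tool. Real PDFs also need handling for columns, rotation, and fonts.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """One positioned piece of text (hypothetical type for this sketch)."""
    x: float
    y: float
    text: str


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by y, then sort each line left to right."""
    lines = []  # each entry: [reference_y, [fragments on that line]]
    # Default PDF user space has y growing upward, so larger y is higher up.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return [
        " ".join(f.text for f in sorted(group, key=lambda f: f.x))
        for _, group in lines
    ]


# Fragments deliberately stored out of reading order.
fragments = [
    Fragment(120.0, 700.0, "world"),
    Fragment(72.0, 650.0, "Second"),
    Fragment(72.0, 700.5, "Hello"),
    Fragment(130.0, 650.0, "line"),
]
print(reading_order(fragments))
```

    Even this only recovers the text, not the structure (headings, lists,
    tables) that the original XML had.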

    --
    Tony Lavinio <> DataDirect <> Stylus Studio XML <>
    XQuery, XSLT, XML Schema and EDI Toolset <> http://www.stylusstudio.com/
    <> There is no problem that brute force and ignorance cannot overcome <>
     
    Tony Lavinio, Dec 24, 2007
    #2

  3. "yawnmoth" <> wrote in message
    news:...
    > XSL stylesheets can be used to convert an XML file into whatever
    > binary format you want (DocBook, for example, does PDF's). My
    > question is... what if you wanted to go in the other direction? (eg.
    > convert a PDF to DocBook) Could you do that with existing XML
    > utilities or would you have to write your own program to do that?


    Using the unparsed-text() function one can read and process any text file
    with XSLT 2.0.

    See for example the JSON to XML converter[1] (the FXSL function
    f:json-document), which uses the LR-Parsing Framework[2] of FXSL[3] (all
    completely written in pure XSLT 2.0).

    Given an LR(1) grammar of a language one can produce a language processor in
    pure XSLT 2.0 in a straightforward manner.
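
    For readers who want a feel for what a JSON-to-XML mapping produces, here is
    a small Python sketch. It is not the FXSL implementation and does not use
    an LR parser; it leans on Python's json module for parsing, and the element
    and attribute names (member, item, type, name) are an ad hoc vocabulary
    assumed for this example only.

```python
import json
import xml.etree.ElementTree as ET


def json_to_xml(value, tag="value"):
    """Recursively map a parsed JSON value onto an element (ad hoc mapping)."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        elem.set("type", "object")
        for key, child in value.items():
            member = json_to_xml(child, "member")
            member.set("name", key)
            elem.append(member)
    elif isinstance(value, list):
        elem.set("type", "array")
        for child in value:
            elem.append(json_to_xml(child, "item"))
    elif value is None:
        elem.set("type", "null")
    elif isinstance(value, bool):
        # Checked before numbers because bool is a subclass of int in Python.
        elem.set("type", "boolean")
        elem.text = "true" if value else "false"
    elif isinstance(value, (int, float)):
        elem.set("type", "number")
        elem.text = str(value)
    else:
        elem.set("type", "string")
        elem.text = value
    return elem


source = '{"title": "DocBook", "pages": 3, "tags": ["a", "b"]}'
root = json_to_xml(json.loads(source), "json")
print(ET.tostring(root, encoding="unicode"))
```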

    Cheers,
    Dimitre Novatchev

    1.
    http://fxsl.cvs.sourceforge.net/fxsl/fxsl-xslt2/f/func-json-document.xsl?view=markup&sortby=date

    2.
    http://fxsl.cvs.sourceforge.net/fxsl/fxsl-xslt2/f/func-lrParse.xsl?view=markup&sortby=date

    3. http://fxsl.sf.net
     
    Dimitre Novatchev, Dec 25, 2007
    #3
  4. yawnmoth

    yawnmoth Guest

    On Dec 24, 1:01 pm, Tony Lavinio <> wrote:
    > yawnmoth wrote:
    > > XSL stylesheets can be used to convert an XML file into whatever
    > > binary format you want (DocBook, for example, does PDF's). My
    > > question is... what if you wanted to go in the other direction? (eg.
    > > convert a PDF to DocBook) Could you do that with existing XML
    > > utilities or would you have to write your own program to do that?

    >
    > In this case, XSL stylesheets actually turn XML to XSL-FO, which is a
    > specific type of XML that post-processors then turn into PDF's. That
    > is why you need Apache FOP or XEP or something else.
    >
    > XSL only does XML-to-XML, XML-to-text, and XML-to-HTML.

    Binary files kinda are text files. Sure, the average text file might
    not contain null bytes, but who's to say one can't?
     
    yawnmoth, Dec 28, 2007
    #4
  5. yawnmoth <> wrote in news:4b03cb69-af1e-4fdf-9801-
    :

    > Sure, the average text file might
    > not contain null bytes, but who's to say one can't?


    Text is printable. How do you print a null byte?

    OTOH, text must be encoded. Certain encodings do, in fact, contain null
    bytes.
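
    A quick way to see this is to encode a single character a few different
    ways. This is only a small Python sketch using the standard codecs; the
    variable names are arbitrary.

```python
text = "A"

# Encode the same one-character string with three standard encodings.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

for name, data in (("utf-8", utf8), ("utf-16-le", utf16), ("utf-32-le", utf32)):
    print(name, data, "contains null byte:", b"\x00" in data)
```

    So a file full of null bytes can still be perfectly ordinary text, as long
    as the reader knows which encoding to apply.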
     
    Kenneth Porter, Dec 28, 2007
    #5
  6. yawnmoth <> wrote in news:2030ba4b-5dcb-4ec8-bda8-
    :

    > what if you wanted to go in the other direction? (eg.
    > convert a PDF to DocBook) Could you do that with existing XML
    > utilities or would you have to write your own program to do that?


    How do you get toothpaste back into a tube? How do you get milk back into a
    cow? Certain transformations are straightforward, while others may be
    impossible.

    To convert from PDF, you need to completely specify the transform rules.
    You need to have a good understanding of the PDF format, including all the
    odd cases that inevitably get used by some PDF-writing tool. (Check out the
    kinds of HTML that the many versions of Office generate to see how badly a
    tool can abuse a format.)
     
    Kenneth Porter, Dec 28, 2007
    #6
  7. yawnmoth

    Ken Starks Guest

    yawnmoth wrote:
    > XSL stylesheets can be used to convert an XML file into whatever
    > binary format you want (DocBook, for example, does PDF's). My
    > question is... what if you wanted to go in the other direction? (eg.
    > convert a PDF to DocBook) Could you do that with existing XML
    > utilities or would you have to write your own program to do that?



    Adobe have an experimental project called 'Mars', which they describe as
    an 'XML-friendly' format for PDF.

    See more at:
    http://labs.adobe.com/technologies/mars/
     
    Ken Starks, Jan 1, 2008
    #7
  8. yawnmoth

    Andy Dingley Guest

    On 24 Dec 2007, 16:57, yawnmoth <> wrote:
    > XSL stylesheets can be used to convert an XML file into whatever
    > binary format you want (DocBook, for example, does PDF's). My
    > question is... what if you wanted to go in the other direction?


    2nd law of thermodynamics (wiki it) applies.
    You can always "lose" information, but you can't re-generate it.

    "Information" can mean content or structure equally well here. Turning
    text in different XML elements into plain rendered text (such as a
    bitmap or PDF) is "lossy", because you lose the knowledge of which
    class of element it came from.
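
    A tiny Python sketch of that loss, assuming "rendering" simply means
    flattening to text (render_plain is a made-up helper for this example, and
    title/para are just sample element names):

```python
import xml.etree.ElementTree as ET


def render_plain(xml_source):
    """Flatten an XML document to its text, discarding all element names."""
    return "".join(ET.fromstring(xml_source).itertext())


as_title = "<doc><title>Mars</title></doc>"
as_para = "<doc><para>Mars</para></doc>"

# Two structurally different documents produce identical rendered text,
# so nothing in the output says which one it came from.
print(render_plain(as_title) == render_plain(as_para))
```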

    This applies to a "closed system", so sticking information or hints
    back onto it from outside counts as "cheating" :cool: It's also hard to
    do, very hard if you're talking about a production-grade bulk system.


    So, the upshot of all this is: keep your content in a semantically-
    rich, structure-preserving format for as long as possible. Transform
    it into "simple" presentation formats at the very last moment.
    Investigate ways to keep the semantics intact, even when published to
    these simple formats, e.g. HTML might lose the XML element names in
    favour of making everything a <div>, but you can still preserve that
    information by adding suitable class attributes.
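
    One possible way to do that, sketched in Python rather than XSLT: every
    source element becomes a <div> whose class attribute carries the original
    element name. The to_divs function and the sample chapter/title/para
    document are assumptions for illustration; attributes on the source
    elements are ignored here.

```python
import xml.etree.ElementTree as ET


def to_divs(elem):
    """Rebuild an element tree as nested <div>s, keeping names as classes."""
    div = ET.Element("div", {"class": elem.tag})
    div.text = elem.text
    div.tail = elem.tail
    for child in elem:
        div.append(to_divs(child))
    return div


source = ET.fromstring(
    "<chapter><title>Intro</title><para>Some text.</para></chapter>"
)
html = to_divs(source)
print(ET.tostring(html, encoding="unicode"))
```

    Because the class values keep the element names, the step can be reversed
    later by mapping each class back to an element.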
     
    Andy Dingley, Jan 2, 2008
    #8