Tidy transforms "&amp;" in the source-xml into a "&"

Discussion in 'XML' started by Ragnar, Nov 4, 2006.

  1. Ragnar

    Ragnar Guest

    Hi,

    2 issues left with my tidy-work:

    1) Tidy transforms a "&amp;" in the source-xml into a "&" in the tidied
    version. My XML importer cannot handle it.
    2) In a long <title> string, a line wrap is produced, like:
    <title>my very long title blab la blab la
    Blabla bla </title>
    The importer also has problems with this.


    My tidy.bat
    tidy.exe --output-xhtml yes --show-body-only yes --new-blocklevel-tags
    component,bblocation,title2,short_intro,long_intro,date,reference,category,image_small,image_medium,image_large,body2,external_link_text1,external_link_url1
    --indent auto --write-back yes %1


    regards
    Ragnar
    Ragnar, Nov 4, 2006
    #1
  2. Ragnar wrote:
    > 1) Tidy transforms a "&amp;" in the source-xml into a "&" in the tidied
    > version.


    Hold it a moment -- if your source is XML, why are you going through Tidy?

    Having said that, this shouldn't happen in XHTML output mode. Contact
    Tidy's authors, and/or show us a failing example so we can cross-check
    this and make sure.


    > 2) in a long <title>-string a wrap is produced like:
    > <title>my very long title blab la blab la
    > Blabla bla </title>
    > Importer also has got problems with it


    Turn off auto-indent.

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Nov 4, 2006
    #2
  3. Timo Harmo

    Timo Harmo Guest

    On Sat, 04 Nov 2006 10:17:58 -0500, Joe Kesselman wrote:

    >Hold it a moment -- if your source is XML, why are you going through Tidy?


    Is there a better way to check the well-formedness of an XML file than
    tidy -xml?
    -Timo
    Timo Harmo, Nov 4, 2006
    #3
  4. Ragnar

    Ragnar Guest

    Ragnar, Nov 4, 2006
    #4
  5. Timo Harmo wrote:
    > Is there a better way to check the well-formedness of an XML file than
    > tidy -xml?


    Tidy is not primarily an XML tool. It's a tool for repairing
    sloppily-written HTML and XHTML.

    To check well-formedness of XML, feed it to any proper XML parser. If
    the parser doesn't accept it, the XML is not well-formed.
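
    As a minimal sketch of that check, assuming Python's standard-library
    ElementTree parser stands in for "any proper XML parser" (the helper
    name is_well_formed and the sample strings are made up for
    illustration):

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple reported by the parser.
        line, column = err.position
        return False, "line %d, column %d: %s" % (line, column, err)
    return True, None


# Hypothetical sample documents, not taken from the poster's file.
good = '<item lang="en"><title>Fish &amp; chips</title></item>'
bad = '<item lang=en><title>Fish &amp; chips</title></item>'  # unquoted attribute

print(is_well_formed(good))
print(is_well_formed(bad))
```

    The first call should report success; the second should fail and
    carry the parser's own error message, which is usually enough to
    locate the first problem.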


    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Nov 4, 2006
    #5
  6. You never answered my question: If this is already XML, why are you
    putting it through Tidy in the first place?

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Nov 4, 2006
    #6
  7. Ragnar wrote:
    > Here is my file: http://www.ticope.de/tmp/source.xml


    Not well formed, so it isn't XML, despite the file name. First obvious
    error is that someone failed to put quotes around the value of the lang
    attribute. I'd recommend you fix this where it originates, rather than
    trying to patch it later by running it through Tidy, especially since
    you say Tidy's doing things you don't expect.

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Nov 4, 2006
    #7
  8. Tried running the most recent copy of Tidy against your input file,
    using your batchfile. It is *NOT* damaging the &. Either you're
    confusing yourself badly (for example, looking at the text in an XML
    tool, which of course will see &amp; as the & character since that's
    what &amp; represents), or you're running a damaged copy of Tidy and
    need to upgrade.

    I'll bet on the former.
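
    The confusion is easy to reproduce. A small sketch, assuming Python's
    standard-library ElementTree as the "XML tool" and a made-up <title>
    string: the parsed text shows a bare &, while the serialized markup
    still contains &amp;.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document for illustration.
source = "<title>Fish &amp; chips</title>"
title = ET.fromstring(source)

# The parser resolves the entity reference, so the in-memory text has a bare "&".
parsed_text = title.text

# Serializing escapes the character again, so the markup still has "&amp;".
serialized = ET.tostring(title, encoding="unicode")

print(parsed_text)   # Fish & chips
print(serialized)    # <title>Fish &amp; chips</title>
```

    To see what is really in the file, look at the raw bytes in a plain
    text editor rather than at a parsed view.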

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Nov 4, 2006
    #8
  9. Oh, forgot to say: The only thing I did differently was that I named the
    input file test.html.

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Nov 4, 2006
    #9
  10. I may also have accidentally dropped the "--write-back yes".

    Still, this does suggest that Tidy isn't your problem.

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Nov 5, 2006
    #10
  11. Ragnar

    Ragnar Guest

    Joe Kesselman wrote:

    > Tried running the most recent copy of Tidy against your input file,
    > using your batchfile. It is *NOT* damaging the &. Either you're
    > confusing yourself badly (for example, looking at the text in an XML
    > tool, which of course will see &amp; as the & character since that's
    > what &amp; represents), or you're running a damaged copy of Tidy and
    > need to upgrade.



    Hi Joe

    Thank you so much for your work and help.

    Yes, you might be right. I was confused by the tool, which presented
    &amp; as &.
    So you say I don't have well-formed XML and therefore I cannot use Tidy.
    The content was exported automatically from an older version of a CMS,
    and the rich-text fields were not XHTML-compliant. But you are right: I
    should focus more on the export and try to optimize the exporter
    instead of the importer. Maybe it is enough to run Tidy there, or to
    do a lot of string manipulation (Replace) in the phase where the
    content is exported using SOAP.


    Ragnar
    Ragnar, Nov 5, 2006
    #11
  12. Ragnar wrote:
    > So you say I don't have well-formed XML and therefore I cannot use Tidy.


    Tidy's job is to (take an informed guess at how to) fix ill-formed HTML,
    not ill-formed XML. And even there, it should be considered a stopgap,
    used only because so few people (or tools!) produce officially correct HTML.

    If you're working in XML, you should start by producing real XML. That
    really shouldn't be hard to do.
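
    One way to get there is to let an XML serializer write the markup
    instead of concatenating strings, so escaping and attribute quoting
    are handled for you. A rough sketch in Python, assuming the
    standard-library ElementTree; the function build_item and the
    item/title element names are hypothetical, not the poster's actual
    export format:

```python
import xml.etree.ElementTree as ET


def build_item(lang, title_text):
    """Serialize one hypothetical <item> element with a <title> child."""
    item = ET.Element("item", {"lang": lang})   # attribute value gets quoted
    title = ET.SubElement(item, "title")
    title.text = title_text                     # "&" and "<" get escaped on output
    return ET.tostring(item, encoding="unicode")


xml_out = build_item("en", "Fish & chips <new>")
print(xml_out)

# Round trip: the output parses again and yields the original text.
round_trip = ET.fromstring(xml_out).find("title").text
print(round_trip)
```

    The same idea applies in whatever language the exporter is written
    in: hand the raw text to an XML writer and let it do the escaping.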


    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Nov 5, 2006
    #12
  13. Andy Dingley

    Andy Dingley Guest

    Joe Kesselman wrote:

    > To check well-formedness of XML, feed it to any proper XML parser. If
    > the parser doesn't accept it, the XML is not well-formed.


    What would you suggest if it _isn't_ well-formed XML? (dodgy use of
    HTML entities being an obvious "fixable" problem that springs to mind)

    It's not an uncommon problem to have to deal with cruddy XML like this.
    I'd be interested to hear what other people's favourite tools for
    helping with it are.
    Andy Dingley, Nov 8, 2006
    #13
  14. Andy Dingley wrote:
    > What would you suggest if it _isn't_ well-formed XML? (dodgy use of
    > HTML entities being an obvious "fixable" problem that springs to mind)


    There really is no good way to repair a damaged document without deep
    knowledge of exactly what the intended document structure was -- which
    is why Tidy is such a complicated application; it needs to understand
    HTML well enough to make intelligent guesses about what the author's
    intent was.

    The *best* you can hope to do is to sweep the problem under the carpet
    and guess right most of the time.

    So I would, very strongly, suggest fixing the problem at the source. If
    it isn't well-formed XML, fix the tool that generated it.

    --
    Joe Kesselman / Beware the fury of a patient man. -- John Dryden
    Joseph Kesselman, Nov 8, 2006
    #14