Tidy; how to make it XML-conform? <BR> needs to be closed

Discussion in 'XML' started by Ragnar, Oct 23, 2006.

  1. Ragnar

    Ragnar Guest

    Hi

    I have one question regarding Tidy (http://tidy.sourceforge.net). My
    source XML-file has got a lot of unclosed <BR>-tags. Which command do I
    need (in my tidy config-file) to close it <BR/> and make valid XML out
    of it?


    regards
    Rag.
    Ragnar, Oct 23, 2006
    #1

  2. Ragnar wrote:

    >I have one question regarding Tidy (http://tidy.sourceforge.net). My
    >source XML-file has got a lot of unclosed <BR>-tags. Which command do I
    >need (in my tidy config-file) to close it <BR/> and make valid XML out
    >of it?


    Use the -asxml or -asxhtml flag.

    -- Richard
    Richard Tobin, Oct 23, 2006
    #2

  3. * Ragnar wrote in comp.text.xml:
    >I have one question regarding Tidy (http://tidy.sourceforge.net). My
    >source XML-file has got a lot of unclosed <BR>-tags. Which command do I
    >need (in my tidy config-file) to close it <BR/> and make valid XML out
    >of it?


    HTML Tidy is not designed to clean up arbitrary XML documents, so if by
    "XML-file" you really mean some arbitrary XML document, then it might be
    difficult to address your problem. If you mean "HTML" or "XHTML" instead
    then use the output-* family of options, or the -asxml command line
    option and ensure that you have not set the input-xml flag.
    --
    Björn Höhrmann · http://bjoern.hoehrmann.de
    Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
    68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
    Bjoern Hoehrmann, Oct 24, 2006
    #3
  4. Ragnar

    Ragnar Guest

    Thank you for your help. It is very important to get support because
    I have to finish it today.

    My command line looks like this: tidy -asxml -config config.txt old.xml

    I get the same error as without "-asxml":

    Error: unexpected </reference> in <BR>

    That means it finds an unclosed <BR> tag inside the "reference" node.

    To get rid of it I could turn off the input-xml option, but then Tidy
    would transform my XML into an HTML structure, which is not wanted.


    Ragnar
    Ragnar, Oct 24, 2006
    #4
  5. Ragnar

    Ragnar Guest

    Another question regarding Tidy:

    I want to use the COM wrapper of Tidy and have found the example below.
    I don't know why "Stat As Long" is used. I tried to work without "Stat",
    but I cannot call objTidyDoc.MethodName directly.


    Dim objTidyDoc As TidyDocument
    Dim Stat As Long   ' receives the value returned by each call below
    Set objTidyDoc = New TidyDocument
    Stat = objTidyDoc.LoadConfig(strTidyConfig)
    Stat = objTidyDoc.ParseFile(strFilePath & strXmlFileName)
    Stat = objTidyDoc.CleanAndRepair()
    Stat = objTidyDoc.RunDiagnostics()
    ' Writes the result back over the input file
    Stat = objTidyDoc.SaveFile(strFilePath & strXmlFileName)
    Ragnar, Oct 24, 2006
    #5
  6. Ragnar

    Ragnar Guest

    Now I know how to use the COM wrapper, but my main question is still
    open.

    How can I transform this source XML into valid XML without using the
    workaround of getting an HTML output? I don't want to have HTML tags
    like <HEAD> and <BODY> around it.

    http://www.ticope.de/tmp/source.xml/download

    Help is VERY much appreciated; this task has kept me busy for too long.
    Rag.
    Ragnar, Oct 26, 2006
    #6
  7. If your input isn't HTML, Tidy may not be able to help you, and nothing
    else out there is likely to be able to read your mind and guess that you
    intended <BR> tags to autoterminate.

    Since you know that *was* your intent, how about just doing a text-level
    global replace of <BR> with <BR/>?
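
    A minimal sketch of such a replace, in Python rather than the VB used
    elsewhere in this thread. It assumes the file fits in memory, that
    every <BR> is meant as an empty element, and that the tags carry no
    attributes; the function name close_br_tags and the sample string are
    made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches <BR>, <br>, <Br > and so on, but not <BR/> (already closed)
# and not other tag names that merely start with "BR".
_OPEN_BR = re.compile(r"<br\s*>", re.IGNORECASE)

def close_br_tags(text):
    """Return text with every unclosed <BR> rewritten as <BR/>."""
    return _OPEN_BR.sub("<BR/>", text)

sample = "<doc><reference>line one<BR>line two<br/></reference></doc>"
fixed = close_br_tags(sample)

# Parsing raises xml.etree.ElementTree.ParseError if the result
# is still not well-formed.
root = ET.fromstring(fixed)
```

    The replace has to run on the raw file text before any XML parser
    sees it, since the parser will refuse the unrepaired input.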
    Joseph Kesselman, Oct 26, 2006
    #7
  8. Ragnar

    Ragnar Guest

    Joseph Kesselman wrote:
    > Since you know that *was* your intent, how about just doing a text-level
    > global replace of <BR> with <BR/>?


    Joseph,
    that is a very nice idea

    It could look like this (assuming <BR> appears in node "reference"):

    Set objDOMnode = objDom.selectSingleNode("//reference")
    If Not objDOMnode Is Nothing Then
        strReference = objDOMnode.Text
    End If
    strReference = Replace(strReference, "<BR>", "<BR/>", 1, -1, vbTextCompare)

    But I don't get a value in strReference, which means that XML has to be
    valid before working with XMLDOM. Am I right? I checked it by closing
    <BR/> manually; then I get a value for strReference.
    Ragnar, Oct 26, 2006
    #8
  9. Ragnar wrote:
    > But I dont get a value in strReference which means that XML has to be
    > valid before working with XMLDOM.


    XML has to be well-formed before using any XML tools. An unterminated
    element, such as your <BR>, is not well-formed XML. Fix it first.
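
    As a hedged illustration of the point, here is a small Python check
    that asks a standard XML parser whether a string is well-formed. The
    helper name well_formedness_error and both sample strings are
    assumptions made for this sketch, not part of the original file.

```python
import xml.etree.ElementTree as ET

def well_formedness_error(xml_text):
    """Return None if xml_text is well-formed, else the parser's message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None

broken = "<reference>first line<BR>second line</reference>"
repaired = "<reference>first line<BR/>second line</reference>"

print(well_formedness_error(broken))    # prints the parser's error message
print(well_formedness_error(repaired))  # prints None
```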

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Oct 27, 2006
    #9
  10. Andy Dingley

    Andy Dingley Guest

    Ragnar wrote:

    > How can I transform this source-xml into valid xml without using the
    > workaround of getting an HTML-output?


    Find some non-Tidy, Tidy-like XML tool? Maybe write one for your
    specific task?

    Tidy uses an approximation of an SGML parser and a tag-soup strainer to
    take "approximate HTML", turn it into the best-guess internal
    (DOM-like) model of the intended page, then serialise it accurately.
    This relies on three things that you don't have available:

    * SGML parsing (omitted tags can often be inferred cleanly)
    * A known HTML DTD
    * Fix-up code outside the SGML parser that has assumed HTML-soup
    behaviours coded explicitly into it.

    If your problem is "bad XML" that isn't even approximating HTML, then I
    sympathise, but Tidy has three of its hands tied.

    Why is your bad XML bad? What's the problem? Can you build some specific
    tool that fixes some specific problem? Even if it has to work with
    simple text-file processing and can't support more than one encoding,
    it might be enough.

    I've done a lot of work with RSS, which is only approximate XML at best
    and often significantly invalid. Typically it includes HTML entity
    references (e.g. &eacute;) that aren't part of XML. It's not too hard to
    scan the whole document with a crude entity reference expander that can
    map these (from a known list) onto the numeric form. I usually try to
    XML-parse the feeds first; if this fails, I check for the presence of
    such entities, convert them and then attempt to re-parse.
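
    The parse, expand, re-parse approach described above could be
    sketched in Python roughly as follows. This is an illustration under
    assumptions, not the poster's actual tool: the "known list" is taken
    from Python's html.entities table, and the names
    expand_html_entities and parse_lenient and the sample feed are
    invented here. Being crude, it would also rewrite matches inside
    CDATA sections.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities that XML itself predefines must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def expand_html_entities(text):
    """Rewrite known HTML named references as numeric ones."""
    def to_numeric(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # unknown or XML-native: leave as is
        return "&#%d;" % name2codepoint[name]
    return _NAMED_REF.sub(to_numeric, text)

def parse_lenient(text):
    """Parse as XML; on failure expand HTML entities and retry once."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        if not _NAMED_REF.search(text):
            raise  # no named references, so this fix cannot help
        return ET.fromstring(expand_html_entities(text))

feed = "<item><title>Caf&eacute; &amp; bar</title></item>"
root = parse_lenient(feed)
```

    Trying the strict parse first means well-formed input is never
    modified; the text-level rewrite only runs as a fallback.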
    Andy Dingley, Oct 27, 2006
    #10