Newbie question about how to handle escape characters

Discussion in 'XML' started by Mark Chao, Nov 15, 2005.

  1. Mark Chao

    Mark Chao Guest

    Hi, I am a newbie. I spent quite some time searching the web, but I
    didn't find anything. I hope this question is not too basic to ask here.

    I am trying to convert an XML document into another form, such as this:

    <a>
    A
    <b>B</b>
    <c>C</c>
    </a>

    should be converted to this:

    a A
    a b B
    a c C

    I am using Java's SAX parser with my own class that extends
    DefaultHandler. The XML documents given to me usually have the elements
    and child elements properly indented (as above). However, this causes a
    problem, because characters() in the handler class is called even
    between two endElement() calls, and sometimes between two
    startElement() calls.

    This also causes a problem because the "A" is reported as "\n\tA":
    the text is passed through exactly as it appears in the document. The
    obvious solution is to make my handler accept only XML files that
    contain no "\n" or "\t" escape characters. I could also strip these
    characters out manually, but that would also remove any that were
    intended.

    Another way would be to disallow XML documents that have character
    data between two startElement() calls or two endElement() calls, i.e.
    to allow character data only between a startElement() call and the
    endElement() call that follows it. However, this constraint is too
    restrictive to be appropriate.

    This is just a semantic problem, but I just want to know if there are
    any other ways to tackle the problem.
    Mark Chao, Nov 15, 2005

  2. Mark Chao

    Peter Flynn Guest

    Mark Chao wrote:

    > Hi, I am a newbie. I spent quite some time searching the web, but I
    > didn't find anything. I hope this question is not too basic to ask here.
    >
    > I am trying to convert an XML document into another form, such as this:
    >
    > <a>
    > A
    > <b>B</b>
    > <c>C</c>
    > </a>


    This should ring immediate warning bells. Mixed Content (interspersed
    text and markup) is normally the wrong model in data-oriented
    applications. A more useful form would be

    <a>
    <something>A</something>
    <b>B</b>
    <c>C</c>
    </a>

    After all, the "A" must have some function, so it should be identified.

    > should be converted to this:
    >
    > a A
    > a b B
    > a c C


    The following XSLT will do this.

    <?xml version="1.0" encoding="iso-8859-1"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

    <xsl:output method="text"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
    <xsl:for-each select="ancestor::*">
    <xsl:value-of select="name()"/>
    <xsl:text> </xsl:text>
    </xsl:for-each>
    <xsl:value-of select="name()"/>
    <xsl:apply-templates/>
    </xsl:template>

    <xsl:template match="text()">
    <xsl:text> </xsl:text>
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
    </xsl:template>

    </xsl:stylesheet>

    > I am using Java's SAX parser with my own class that extends
    > DefaultHandler. The XML documents given to me usually have the elements
    > and child elements properly indented (as above). However, this causes a
    > problem, because characters() in the handler class is called even
    > between two endElement() calls, and sometimes between two
    > startElement() calls.


    That's why I suggest that this is a suboptimal format for the data.

    > This is just a semantic problem, but I just want to know if there are
    > any other ways to tackle the problem.


    Try XSLT.
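
    For anyone who wants to stay in Java, the stylesheet above can be run
    through the standard javax.xml.transform API. The following is only a
    minimal sketch: the class name XsltRunner and the method transform() are
    made up for illustration, the stylesheet is inlined as a string (with
    the line break written as &#10;), and both documents are read from
    strings rather than files.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper that applies an XSLT stylesheet to an XML string. */
class XsltRunner {
    // The stylesheet from this post, inlined for the sake of the example.
    static final String STYLESHEET =
          "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
        + "<xsl:output method='text'/>"
        + "<xsl:strip-space elements='*'/>"
        + "<xsl:template match='*'>"
        +   "<xsl:for-each select='ancestor::*'>"
        +     "<xsl:value-of select='name()'/><xsl:text> </xsl:text>"
        +   "</xsl:for-each>"
        +   "<xsl:value-of select='name()'/>"
        +   "<xsl:apply-templates/>"
        + "</xsl:template>"
        + "<xsl:template match='text()'>"
        +   "<xsl:text> </xsl:text>"
        +   "<xsl:value-of select='normalize-space(.)'/>"
        +   "<xsl:text>&#10;</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the given stylesheet over the given document and returns the text output. */
    static String transform(String xslt, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so that callers of this sketch need no checked exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a>\n\tA\n\t<b>B</b>\n\t<c>C</c>\n</a>";
        // Prints one line per text node, each prefixed by its element path.
        System.out.print(transform(STYLESHEET, xml));
    }
}
```

    In real use the StreamSource objects would normally wrap files or
    input streams instead of strings.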

    ///Peter
    Peter Flynn, Nov 16, 2005

  3. Mark Chao

    Guest

    Thanks a lot. I'll start learning XSLT as well.

    As for what I have done: I used the decorator pattern and created a
    decorator that wraps my base handler. It buffers the text received in
    characters() and sends the complete text on in one go. It also strips
    the \n and \t characters from the beginning and the end of the text.
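
    The decorator described above might look roughly like the sketch below.
    The actual class was not posted, so everything here is an assumption:
    the class bodies, the PathPrinter target handler and the convert()
    helper are invented for illustration. The sketch uses String.trim(),
    which also removes leading and trailing spaces (not only \n and \t), and
    it forwards only element and character events.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical decorator: collects characters() chunks and forwards trimmed text once. */
class BufferedHandler extends DefaultHandler {
    private final ContentHandler target;
    private final StringBuilder buffer = new StringBuilder();

    BufferedHandler(ContentHandler target) {
        this.target = target;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // only collect; nothing is forwarded yet
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        flush();
        target.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        flush();
        target.endElement(uri, localName, qName);
    }

    /** Sends the buffered text on as a single call, unless it is whitespace only. */
    private void flush() throws SAXException {
        String text = buffer.toString().trim(); // assumption: spaces are trimmed too
        buffer.setLength(0);
        if (!text.isEmpty()) {
            target.characters(text.toCharArray(), 0, text.length());
        }
    }
}

/** Hypothetical base handler: records "path text" lines such as "a b B". */
class PathPrinter extends DefaultHandler {
    final List<String> lines = new ArrayList<>();
    private final List<String> path = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.add(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        path.remove(path.size() - 1);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        lines.add(String.join(" ", path) + " " + new String(ch, start, length));
    }
}

class BufferedHandlerDemo {
    static List<String> convert(String xml) {
        PathPrinter printer = new PathPrinter();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new BufferedHandler(printer));
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return printer.lines;
    }

    public static void main(String[] args) {
        convert("<a>\n\tA\n\t<b>B</b>\n\t<c>C</c>\n</a>").forEach(System.out::println);
    }
}
```

    Because the buffer is flushed at every element boundary, the whitespace
    that sits between two start tags or two end tags never reaches the
    wrapped handler.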

    I found out later that there is an XMLFilterImpl class. It is
    interesting that this class implements both the reader interface and
    all the handler interfaces, whereas my decorator implements only
    ContentHandler. Just a personal opinion, but I think my design can be
    a little bit more efficient. For example:

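    // SimpleHandler and BufferedHandler are my own classes
    XMLReader reader = XMLReaderFactory.createXMLReader();
    SimpleHandler handler = new SimpleHandler(); // extends DefaultHandler

    // Only content events pass through the decorator; errors go straight to the handler
    reader.setContentHandler(new BufferedHandler(handler));
    reader.setErrorHandler(handler);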

    My design is easier to understand (it implements only the handler part
    of the interface) and it avoids passing calls along unnecessarily. (If
    you use XMLFilterImpl to create a filter for each of the ContentHandler
    and the ErrorHandler, this causes extra calls across layers.)
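
    For comparison, here is a rough sketch of the same buffering idea built
    on XMLFilterImpl. The names TrimmingFilter, TrimmingFilterDemo and
    textChunks() are hypothetical, and String.trim() is again assumed to be
    acceptable even though it also removes spaces. In this variant a single
    filter sits between the parser and both handlers, and every event it
    does not override is passed on by XMLFilterImpl's default methods.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: buffers character data and forwards it trimmed, in one call. */
class TrimmingFilter extends XMLFilterImpl {
    private final StringBuilder buffer = new StringBuilder();

    TrimmingFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // hold the text back for now
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        flush();
        super.startElement(uri, localName, qName, atts); // forwards to the registered handler
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        flush();
        super.endElement(uri, localName, qName);
    }

    private void flush() throws SAXException {
        String text = buffer.toString().trim(); // assumption: spaces are trimmed too
        buffer.setLength(0);
        if (!text.isEmpty()) {
            super.characters(text.toCharArray(), 0, text.length());
        }
    }
}

class TrimmingFilterDemo {
    /** Returns the text chunks that reach the handler behind the filter. */
    static List<String> textChunks(String xml) {
        List<String> chunks = new ArrayList<>();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                chunks.add(new String(ch, start, length));
            }
        };
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            TrimmingFilter filter = new TrimmingFilter(parser);
            filter.setContentHandler(collector); // content events go through the overrides
            filter.setErrorHandler(collector);   // error events use the default forwarding
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(textChunks("<a>\n\tA\n\t<b>B</b>\n\t<c>C</c>\n</a>"));
    }
}
```

    Whether the extra layer of forwarding matters in practice would need to
    be measured; the sketch is only meant to show the shape of the two
    designs side by side.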

    Anyone think the same as me? :)
    , Nov 16, 2005
