Newbie question about how to handle escape characters

Mark Chao

Hi, I am a newbie. I spent quite some time searching the web, but I
didn't find anything. I hope this question is not too bad to ask here.

I am trying to convert an XML document into another form, such as this:

<a>
A
<b>B</b>
<c>C</c>
</a>

should be converted to this:

a A
a b B
a c C

I am using Java's SAX parser with my own handler, which extends DefaultHandler.
Usually the XML documents given to me have the elements and child
elements properly indented (as above). However, this causes a problem,
because characters() in the handler class is called even between two
endElement() calls, and sometimes between two startElement() calls.

It also causes a problem because the "A" is reported as "\n\tA":
the text is passed on exactly as it appears. The obvious way to solve this
problem is to make my handler accept only XML files that contain no
"\n" or "\t" escape characters. I could also strip these escape
characters manually, but that would accidentally remove any
intended ones as well.

Another way would be to disallow XML documents that have character
data between two startElement or two endElement calls, i.e. to allow
character data only between a startElement and its endElement. However,
this constraint is too heavy and not appropriate.

This is just a semantic problem, but I just want to know if there are
any other ways to tackle the problem.
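
To make the behaviour described above visible, a small trace of the SAX events can help. This is only an illustrative sketch: the class name SaxEventTrace and the method trace are made up for this example and are not part of the original handler. Note that a parser is allowed to split one run of text across several characters() calls, so the exact number of "chars" entries can vary.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: records every SAX event as a readable string. */
final class SaxEventTrace {

    static List<String> trace(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("start " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Make newlines and tabs visible instead of printing them raw.
                String text = new String(ch, start, length)
                        .replace("\n", "\\n")
                        .replace("\t", "\\t");
                events.add("chars \"" + text + "\"");
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Checked exceptions are wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        String xml = "<a>\nA\n<b>B</b>\n<c>C</c>\n</a>";
        trace(xml).forEach(System.out::println);
    }
}
```

Running it on the sample document shows "chars" entries holding only "\n" between the end of b and the start of c, which is the whitespace that indentation introduces.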
 
Peter Flynn

Mark said:
Hi, I am a newbie. I spent quite some time searching the web, but I
didn't find anything. I hope this question is not too bad to ask here.

I am trying to convert an XML document into another form, such as this:

<a>
A
<b>B</b>
<c>C</c>
</a>

This should ring immediate warning bells. Mixed Content (interspersed
text and markup) is normally the wrong model in data-oriented
applications. A more useful form would be

<a>
<something>A</something>
<b>B</b>
<c>C</c>
</a>

After all, the "A" must have some function, so it should be identified.

should be converted to this:

a A
a b B
a c C

The following XSLT will do this.

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">

<xsl:output method="text"/>
<xsl:strip-space elements="*"/>

<xsl:template match="*">
<xsl:for-each select="ancestor::*">
<xsl:value-of select="name()"/>
<xsl:text> </xsl:text>
</xsl:for-each>
<xsl:value-of select="name()"/>
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()">
<xsl:text> </xsl:text>
<xsl:value-of select="normalize-space(.)"/>
<xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>
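
Since the original question is about Java, here is a minimal sketch of how a stylesheet like this could be applied through the standard javax.xml.transform API. The class name PathListing and the method transform are hypothetical, and the stylesheet is embedded as a string (with the closing xsl:stylesheet tag included and the newline written as &#10;) only so that the example is self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical wrapper that runs the path-listing stylesheet on an XML string. */
final class PathListing {

    // Same templates as the stylesheet in this post, written without indentation.
    static final String STYLESHEET =
          "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
        + "<xsl:output method='text'/>"
        + "<xsl:strip-space elements='*'/>"
        + "<xsl:template match='*'>"
        +   "<xsl:for-each select='ancestor::*'>"
        +     "<xsl:value-of select='name()'/><xsl:text> </xsl:text>"
        +   "</xsl:for-each>"
        +   "<xsl:value-of select='name()'/>"
        +   "<xsl:apply-templates/>"
        + "</xsl:template>"
        + "<xsl:template match='text()'>"
        +   "<xsl:text> </xsl:text>"
        +   "<xsl:value-of select='normalize-space(.)'/>"
        +   "<xsl:text>&#10;</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Checked exceptions are wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(transform("<a>\nA\n<b>B</b>\n<c>C</c>\n</a>"));
    }
}
```

For the sample document from the question, this should produce the three lines "a A", "a b B" and "a c C".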

I am using Java's SAX parser with my own handler, which extends DefaultHandler.
Usually the XML documents given to me have the elements and child
elements properly indented (as above). However, this causes a problem,
because characters() in the handler class is called even between two
endElement() calls, and sometimes between two startElement() calls.

That's why I suggest that this is a suboptimal format for the data.

This is just a semantic problem, but I just want to know if there are
any other ways to tackle the problem.

Try XSLT.

///Peter
 
mcha226

Thanks a lot. I'll start learning XSLT as well.

As for what I have done: I used the decorator pattern and created a
decorator that wraps my base handler. It buffers the text
received in characters() and sends the complete text in one go. It
also strips the \n and \t from the beginning and the end
of the text.
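
The decorator itself is not shown in this thread, so the following is only a guess at what such a class might look like. BufferedHandler is the name used in this post; its internals here, as well as PathHandler and BufferedHandlerDemo, are hypothetical. Two assumptions are worth flagging: the sketch extends DefaultHandler for brevity, so only the three overridden events reach the wrapped handler, and it uses String.trim(), which removes leading and trailing spaces as well as \n and \t.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Decorator sketch: collects characters() calls and forwards trimmed text once. */
final class BufferedHandler extends DefaultHandler {
    private final ContentHandler target;
    private final StringBuilder buffer = new StringBuilder();

    BufferedHandler(ContentHandler target) {
        this.target = target;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // may arrive in several pieces
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        flush();
        target.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        flush();
        target.endElement(uri, localName, qName);
    }

    /** Forwards the buffered text, if any is left after trimming. */
    private void flush() throws SAXException {
        String text = buffer.toString().trim();
        buffer.setLength(0);
        if (!text.isEmpty()) {
            target.characters(text.toCharArray(), 0, text.length());
        }
    }
}

/** Hypothetical base handler: writes the element path before each text run. */
final class PathHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        path.removeLast();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(String.join(" ", path)).append(' ')
           .append(ch, start, length).append('\n');
    }

    String result() {
        return out.toString();
    }
}

final class BufferedHandlerDemo {
    static String convert(String xml) {
        try {
            PathHandler inner = new PathHandler();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new BufferedHandler(inner));
            reader.parse(new InputSource(new StringReader(xml)));
            return inner.result();
        } catch (Exception e) {
            // Checked exceptions are wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(convert("<a>\nA\n<b>B</b>\n<c>C</c>\n</a>"));
    }
}
```

Because the buffer is flushed at every element boundary, whitespace-only runs between tags trim to nothing and are never forwarded, while whitespace inside a text run is left alone.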

I found out later that there is an XMLFilterImpl class. It is interesting that
this class implements both the reader interface and all the handler
interfaces, whereas my decorator only implements ContentHandler.
Just a personal opinion, but I think my design can be a little bit more
efficient. For example:

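XMLReader reader = XMLReaderFactory.createXMLReader();
SimpleHandler handler = new SimpleHandler(); // extends DefaultHandler

// Only content events pass through the buffering decorator.
reader.setContentHandler(new BufferedHandler(handler));
// Error events go straight to the base handler.
reader.setErrorHandler(handler);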

My design is easier to understand (it implements only the content handler
part of the interface) and it avoids passing calls along unnecessarily. (If
you use XMLFilterImpl to create a filter for both the
ContentHandler and the ErrorHandler, this causes extra calls across
layers.)
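
For comparison, the same buffering idea can be sketched on top of XMLFilterImpl. The class name BufferingFilter and the helper textRuns are hypothetical. Here the filter sits between the parser and the handlers, so every event, including error events, passes through it before reaching whatever handler is registered on the filter.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Filter sketch: buffers text and forwards it trimmed at element boundaries. */
final class BufferingFilter extends XMLFilterImpl {
    private final StringBuilder buffer = new StringBuilder();

    BufferingFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // hold the text back for now
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        flush();
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        flush();
        super.endElement(uri, localName, qName);
    }

    private void flush() throws SAXException {
        String text = buffer.toString().trim();
        buffer.setLength(0);
        if (!text.isEmpty()) {
            // super.characters() passes the text to the registered ContentHandler.
            super.characters(text.toCharArray(), 0, text.length());
        }
    }

    /** Hypothetical helper: returns the text runs the downstream handler receives. */
    static List<String> textRuns(String xml) {
        List<String> runs = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                runs.add(new String(ch, start, length));
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            BufferingFilter filter = new BufferingFilter(reader);
            filter.setContentHandler(handler);
            filter.setErrorHandler(handler); // error events also travel through the filter
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Checked exceptions are wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return runs;
    }
}
```

Whether the extra forwarding layer matters in practice would have to be measured; the sketch only shows where the additional calls come from.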

Does anyone think the same as me? :)
 
