XPath querying text node *including* <br/>

Sven

Dear all,

I'm trying to extract data from HTML using XPath in Java.
Unfortunately the text contents of nodes may contain <br/> tags which
are not correctly interpreted, at least not for me ;)

A <p> node may contain this text:

<p>
Test1<br/>
Test2<br/>
Test3
</p>

Which is returned by the XPath query as "Test1Test2Test3" but I need
it as "Test1\nTest2\nTest3" or "Test1 Test2 Test3".

Here's example code (Java 6):

import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

public class Example {
    private static final String html =
        "<html><body><p>Test1<br/>Test2<br/>Test3</p></body></html>";

    public static void main( String[] args ) throws Exception {
        final XPathFactory xPathFactory = XPathFactory.newInstance();

        // String value of the <p> element: all descendant text, concatenated
        XPath xPath = xPathFactory.newXPath();
        String value = (String)xPath.evaluate(
            "//p",
            new InputSource( new StringReader( html ) ),
            XPathConstants.STRING );

        System.out.println( value );

        // Text node children of <p>
        xPath = xPathFactory.newXPath();
        value = (String)xPath.evaluate(
            "//p/text()",
            new InputSource( new StringReader( html ) ),
            XPathConstants.STRING );

        System.out.println( value );

        // All child nodes of <p>
        xPath = xPathFactory.newXPath();
        value = (String)xPath.evaluate(
            "//p/node()",
            new InputSource( new StringReader( html ) ),
            XPathConstants.STRING );

        System.out.println( value );
    }
}

This code returns:

Test1Test2Test3
Test1
Test1

Is there any way (XPath function etc) which will return the contents
as desired?

Thank you!
 
Bjoern Hoehrmann

* Sven wrote in comp.text.xml:
I'm trying to extract data from HTML using XPath in Java.
Unfortunately the text contents of nodes may contain <br/> tags which
are not correctly interpreted, at least not for me ;)

You have to convert them to line breaks yourself; with XPath 1.0 there
is no way to turn them into line breaks with a simple expression. It
would be easy to do with XSLT; otherwise you have to implement this in
code. If the element has no other child elements, you can simply iterate
over its children, appending each text node to a buffer and appending a
line break whenever you encounter a br element.
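
The buffer approach described above could look roughly like the following. This is only a sketch: the class name BrToNewline and the method paragraphText are made up for illustration, and it assumes the input is well-formed XML with no namespace on the elements and a <p> that contains only text and <br/> children.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class BrToNewline {
    // Returns the text of the first <p>, with each <br/> child turned into '\n'.
    static String paragraphText(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // Select the element itself (NODE), not its string value.
            XPath xPath = XPathFactory.newInstance().newXPath();
            Node p = (Node) xPath.evaluate("//p", doc, XPathConstants.NODE);
            if (p == null) {
                return "";
            }

            StringBuilder buffer = new StringBuilder();
            for (Node child = p.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                if (child instanceof Text) {
                    // Text and CDATA sections: copy the characters as they are.
                    buffer.append(child.getNodeValue());
                } else if (child.getNodeType() == Node.ELEMENT_NODE
                        && "br".equals(child.getNodeName())) {
                    // A <br/> element becomes a line break.
                    buffer.append('\n');
                }
                // Other child elements are ignored in this sketch.
            }
            return buffer.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(paragraphText(
            "<html><body><p>Test1<br/>Test2<br/>Test3</p></body></html>"));
    }
}
```

For the sample document from the question this yields Test1, Test2 and Test3 on separate lines. Replacing '\n' with ' ' gives the space-separated variant.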
 
Joshua Cranmer

String sanitized = html.replaceAll("<br/>","\n");
and then replace your usages of `html' with those of `sanitized'.
 
Philippe Poulard

Joshua Cranmer wrote:
String sanitized = html.replaceAll("<br/>","\n");
and then replace your usages of `html' with those of `sanitized'.

Hi,

This usually doesn't work, for a thousand different reasons. For example:

<br></br>
<!-- this <br/> isn't a line break -->
<![CDATA[this <br/> isn't a line break]]>
<br><?todo : buy some <br/>ead?></br>

etc...

This is the main reason why we have to use parsers: this way, one can
process things for what they are rather than for what they look like.
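
To make the point concrete, here is a small sketch of how the plain string replacement behaves on two of the cases listed above. The class name NaiveReplaceDemo and the method naive are invented for this illustration.

```java
// Hypothetical demo class, only meant to illustrate the pitfalls listed above.
class NaiveReplaceDemo {
    // The string-level replacement suggested earlier in the thread.
    static String naive(String xml) {
        return xml.replaceAll("<br/>", "\n");
    }

    public static void main(String[] args) {
        // An equivalent spelling of the empty element is not matched at all.
        System.out.println(naive("a<br></br>b"));

        // Literal text inside a CDATA section is rewritten,
        // although it is not a br element.
        System.out.println(naive("<![CDATA[this <br/> isn't a line break]]>"));
    }
}
```

The first call returns its input unchanged, so the line break is lost later on; the second call alters character data that should have been left alone.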

With a SAX filter the code is more verbose, but correct:

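import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

public class LineBreakFilter extends XMLFilterImpl {
    // Replace each <br> start tag with a newline character event.
    // Note: localName is only filled in by a namespace-aware parser.
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        if ( "br".equals(localName) ) {
            characters("\n".toCharArray(), 0, 1);
        } else {
            super.startElement(uri, localName, qName, atts);
        }
    }

    // Swallow the matching </br> end tag; pass everything else through.
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if ( ! "br".equals(localName) ) {
            super.endElement(uri, localName, qName);
        } // else do nothing
    }
}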

You just have to plug it into a SAX parser (beware of namespaces if you
have any).
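
One possible way to do the plugging, sketched under a few assumptions: the filter is repeated here as a nested class (named BrFilter) so the example is self-contained, the names FilteredXPathDemo and evaluate are invented, and an identity Transformer is used to build a DOM from the filtered SAX events so that XPath can be evaluated on the result. The elements are assumed to be in no namespace; for real XHTML with an xmlns declaration the test in isBr would have to compare against that namespace URI instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical demo class; only the javax.xml and org.xml.sax calls are real APIs.
class FilteredXPathDemo {
    // Swallows <br> elements that are in no namespace and emits "\n" instead.
    static class BrFilter extends XMLFilterImpl {
        private static boolean isBr(String uri, String localName) {
            return "br".equals(localName) && (uri == null || uri.length() == 0);
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) throws SAXException {
            if (isBr(uri, localName)) {
                characters(new char[] {'\n'}, 0, 1);
            } else {
                super.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            if (!isBr(uri, localName)) {
                super.endElement(uri, localName, qName);
            }
        }
    }

    // Parses xml through the filter, then evaluates expression as a string.
    static String evaluate(String xml, String expression) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so that localName is set
            XMLReader parser = factory.newSAXParser().getXMLReader();

            BrFilter filter = new BrFilter();
            filter.setParent(parser);

            // Identity transformation: filtered SAX events in, DOM tree out.
            DOMResult dom = new DOMResult();
            TransformerFactory.newInstance().newTransformer().transform(
                new SAXSource(filter, new InputSource(new StringReader(xml))), dom);

            return (String) XPathFactory.newInstance().newXPath()
                .evaluate(expression, dom.getNode(), XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not process input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(
            "<html><body><p>Test1<br/>Test2<br/>Test3</p></body></html>", "//p"));
    }
}
```

Because the br elements never reach the DOM, the plain //p query from the original question now returns the text with line breaks in it.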

--
Regards,

///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !
 
Philippe Poulard

Philippe Poulard wrote:
public class LineBreakFilter extends XMLFilterImpl {
public void startElement(String uri, String localName, String qName,
Attributes atts) {
if ( "br".equals(localName) ) {
characters("\n".toCharArray(), 0, 1);
} else {
super.startElement(...);
}
}
public void endElement(String uri, String localName, String qName) {
if ( ! "br".equals(localName) ) {
super.endElement(...);
} // else do nothing
}
}

I forgot to add to the test: && uri == null (or && uri.length() == 0; I
don't remember what the SAX parser is supposed to give)

 
Sven

String sanitized = html.replaceAll("<br/>","\n");
and then replace your usages of `html' with those of `sanitized'.

Thanks for the hint! Although Philippe noted that this may not work in
all situations, it's sufficient for me at the moment.

Now I have another problem with text nodes (damn text nodes *g*).
It's still the same scenario, where I try to extract data from XHTML
pages. Let's assume we have nodes like this:

<div>
Text1
<a href="http://...">Link</a>
Text2
</div>

Then \\div\text() will only return "Text1". It's basically the same
problem where text nodes are interrupted by child nodes. Any way with
pure XPath to fetch the whole text?

Thanks!
 
Joseph J. Kesselman

<div>
Text1
<a href="http://...">Link</a>
Text2
</div>

Then \\div\text() will only return "Text1".

//div/text() (note: FORWARD slashes in XPath!) will return two text
nodes. Whatever you are doing with the result of that path may be
operating on only the first node returned, but you didn't show us that
which makes it hard to advise you.

Alternatively, you could retrieve the text value of the <div> element --
but that would include Link as well, since it's defined as all contained
text.
 
Martin Honnen

Sven said:
<div>
Text1
<a href="http://...">Link</a>
Text2
</div>

Then \\div\text() will only return "Text1". It's basically the same
problem where text nodes are interrupted by child nodes. Any way with
pure XPath to fetch the whole text?

Well
string(/div)
will give you the text contained in that element which is
"
Text1
Link
Text2
"

And
/div/text()
as an XPath 1.0 expression selects two text nodes over which you can
iterate to extract
"
Text1
"
and
"
Text2
"
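
In Java the difference between the two expressions can be checked with a few lines. This is a sketch with invented names (StringValueDemo, eval); the document is the <div> from the question used as the root element, and normalize-space() is added as a standard XPath 1.0 function that collapses the surrounding whitespace.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class for comparing XPath 1.0 string conversions.
class StringValueDemo {
    static final String XML =
        "<div>\nText1\n<a href=\"http://...\">Link</a>\nText2\n</div>";

    // Evaluates expr against XML and converts the result to a string.
    static String eval(String expr) {
        try {
            return (String) XPathFactory.newInstance().newXPath().evaluate(
                expr, new InputSource(new StringReader(XML)), XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad expression: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // All descendant text of the element, including "Link".
        System.out.println("[" + eval("string(/div)") + "]");
        // Same text with whitespace trimmed and collapsed to single spaces.
        System.out.println("[" + eval("normalize-space(/div)") + "]");
        // A node-set converted to a string: only the FIRST text node is used.
        System.out.println("[" + eval("/div/text()") + "]");
    }
}
```

The last call shows why a STRING result only ever contains Text1: converting a node-set to a string takes the string value of its first node in document order.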
 
Sven

//div/text() (note: FORWARD slashes in XPath!) will return two text
nodes. Whatever you are doing with the result of that path may be
operating on only the first node returned, but you didn't show us that
which makes it hard to advise you.

Thanks, this was my bad! In Java I used XPath#evaluate( String
expression, InputSource source, QName returnType ) with returnType ==
XPathConstants.STRING which apparently doesn't concatenate multiple
results. I'm now using XPathConstants.NODESET and everything works as
expected. Great!
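
For anyone finding this later, the NODESET variant might look like the sketch below. The names NodeSetDemo and joinTextNodes are made up, and trimming each node and joining with a single space is just one possible choice.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class NodeSetDemo {
    // Evaluates expr as a node-set and joins the trimmed node values with spaces.
    static String joinTextNodes(String xml, String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);

            StringBuilder buffer = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getNodeValue().trim();
                if (text.length() == 0) {
                    continue; // skip whitespace-only text nodes
                }
                if (buffer.length() > 0) {
                    buffer.append(' ');
                }
                buffer.append(text);
            }
            return buffer.toString();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad expression: " + expr, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div>\nText1\n<a href=\"http://...\">Link</a>\nText2\n</div>";
        System.out.println(joinTextNodes(xml, "//div/text()"));
    }
}
```

With the <div> from above, //div/text() selects both text nodes and the helper returns "Text1 Text2", leaving out the link text.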
 
