XPath querying text node *including* <br/>

Discussion in 'Java' started by Sven, Apr 27, 2008.

  1. Sven

    Sven Guest

    Dear all,

    I'm trying to extract data from HTML using XPath in Java.
    Unfortunately the text contents of nodes may contain <br/> tags which
    are not correctly interpreted, at least not for me ;)

    A <p> node may contain this text:

    <p>
    Test1<br/>
    Test2<br/>
    Test3
    </p>

    Which is returned by the XPath query as "Test1Test2Test3" but I need
    it as "Test1\nTest2\nTest3" or "Test1 Test2 Test3".

    Here's example code (Java 6):

    import java.io.StringReader;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.xml.sax.InputSource;

    public class Example {
        private static final String html =
            "<html><body><p>Test1<br/>Test2<br/>Test3</p></body></html>";

        public static void main( String[] args ) throws Exception {
            final XPathFactory xPathFactory = XPathFactory.newInstance();

            // String value of the <p> element itself
            XPath xPath = xPathFactory.newXPath();
            String value = (String)xPath.evaluate(
                "//p",
                new InputSource( new StringReader( html ) ),
                XPathConstants.STRING );

            System.out.println( value );

            // Text-node children of <p>, converted to a string
            xPath = xPathFactory.newXPath();
            value = (String)xPath.evaluate(
                "//p/text()",
                new InputSource( new StringReader( html ) ),
                XPathConstants.STRING );

            System.out.println( value );

            // All child nodes of <p>, converted to a string
            xPath = xPathFactory.newXPath();
            value = (String)xPath.evaluate(
                "//p/node()",
                new InputSource( new StringReader( html ) ),
                XPathConstants.STRING );

            System.out.println( value );
        }
    }

    This code returns:

    Test1Test2Test3
    Test1
    Test1

    Is there any way (XPath function etc) which will return the contents
    as desired?

    Thank you!
    Sven, Apr 27, 2008
    #1

  2. Sven wrote:
    > Dear all,
    >
    > I'm trying to extract data from HTML using XPath in Java.
    > Unfortunately the text contents of nodes may contain <br/> tags which
    > are not correctly interpreted, at least not for me ;)
    >
    > A <p> node may contain this text:
    >
    > <p>
    > Test1<br/>
    > Test2<br/>
    > Test3
    > </p>
    >
    > Which is returned by the XPath query as "Test1Test2Test3" but I need
    > it as "Test1\nTest2\nTest3" or "Test1 Test2 Test3".
    >
    > Here's example code (Java 6):
    >
    > public class Example {
    > private static final String html =
    > "<html><body><p>Test1<br/> Test2<br/> Test3</p></body></html>";
    >
    > }
    >
    > This code returns:
    >
    > Test1Test2Test3
    > Test1
    > Test1
    >
    > Is there any way (XPath function etc) which will return the contents
    > as desired?
    >
    > Thank you!


    String sanitized = html.replaceAll("<br/>","\n");
    and then replace your usages of `html' with those of `sanitized'.
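A minimal sketch of this suggestion applied to the example from the first post. The class and method names (`SanitizedExample`, `paragraphText`) are made up for illustration; it assumes every line break in the input is written exactly as `<br/>`.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class SanitizedExample {
    static final String HTML =
        "<html><body><p>Test1<br/>Test2<br/>Test3</p></body></html>";

    // Replaces the literal text "<br/>" with a newline before parsing,
    // then returns the string value of the first <p> element.
    static String paragraphText(String html) throws Exception {
        String sanitized = html.replaceAll("<br/>", "\n");
        XPath xPath = XPathFactory.newInstance().newXPath();
        return (String) xPath.evaluate(
            "//p",
            new InputSource(new StringReader(sanitized)),
            XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(paragraphText(HTML));
    }
}
```

Because the replacement happens on the raw string, the parser never sees a `<br>` element at all; the newline is simply part of the paragraph's text.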

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer, Apr 27, 2008
    #2

  3. Joshua Cranmer wrote:
    > String sanitized = html.replaceAll("<br/>","\n");
    > and then replace your usages of `html' with those of `sanitized'.


    Hi,

    This usually doesn't work, for a thousand different reasons. For example:

    <br></br>
    <!-- this <br/> isn't a line break -->
    <![CDATA[this <br/> isn't a line break]]>
    <br><?todo : buy some <br/>ead?></br>

    etc...

    This is the main reason why we have to use parsers: that way, one can
    process things for what they are rather than for what they look like.
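Two of the cases listed above can be checked with a short sketch. The class name `ReplaceAllPitfalls` and the helper `pText` are hypothetical; the inputs are small variations on the examples in this post.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class ReplaceAllPitfalls {
    static final String CDATA_DOC =
        "<p><![CDATA[this <br/> isn't a line break]]></p>";
    static final String LONG_FORM_DOC = "<p>Test1<br></br>Test2</p>";

    // The blind textual replacement under discussion.
    static String sanitize(String xml) {
        return xml.replaceAll("<br/>", "\n");
    }

    // String value of the first <p> element in the given document.
    static String pText(String xml) throws Exception {
        return XPathFactory.newInstance().newXPath()
            .evaluate("//p", new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Inside CDATA, "<br/>" is ordinary text, yet the replacement rewrites it.
        System.out.println(pText(CDATA_DOC));
        System.out.println(pText(sanitize(CDATA_DOC)));

        // "<br></br>" is a real line-break element, yet the pattern never matches it.
        System.out.println(sanitize(LONG_FORM_DOC).equals(LONG_FORM_DOC));
    }
}
```

In the first case the replacement changes text that was never markup; in the second it leaves a genuine `<br>` element untouched.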

    With a SAX filter the code is more verbose, but correct:

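    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.XMLFilterImpl;

    public class LineBreakFilter extends XMLFilterImpl {
        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) throws SAXException {
            if ( "br".equals(localName) ) {
                // Emit a newline as character data instead of the <br> element.
                characters("\n".toCharArray(), 0, 1);
            } else {
                super.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            if ( ! "br".equals(localName) ) {
                super.endElement(uri, localName, qName);
            } // else do nothing
        }
    }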

    You just have to plug it into a SAX parser (beware of namespaces, if
    you have any).
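One possible way to do the plugging, as a sketch rather than the only option: wrap a namespace-aware parser in the filter, copy the filtered SAX events into a DOM tree with an identity transform, and run XPath on that tree. The names `LineBreakFilterDemo`, `BrToNewline` and `paragraphText` are hypothetical, and the identity-transform route is an assumption of this sketch, not something the post prescribes.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

final class LineBreakFilterDemo {
    static final String HTML =
        "<html><body><p>Test1<br/>Test2<br/>Test3</p></body></html>";

    // Same idea as the filter in this post: each <br> becomes a newline.
    static final class BrToNewline extends XMLFilterImpl {
        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) throws SAXException {
            if ("br".equals(localName)) {
                characters(new char[] {'\n'}, 0, 1);
            } else {
                super.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            if (!"br".equals(localName)) {
                super.endElement(uri, localName, qName);
            }
        }
    }

    static String paragraphText(String html) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Without namespace awareness, localName may be reported as empty.
        factory.setNamespaceAware(true);
        XMLReader parser = factory.newSAXParser().getXMLReader();

        BrToNewline filter = new BrToNewline();
        filter.setParent(parser);

        // Identity transform: the filtered SAX events are copied into a DOM tree.
        DOMResult dom = new DOMResult();
        TransformerFactory.newInstance().newTransformer().transform(
            new SAXSource(filter, new InputSource(new StringReader(html))), dom);

        return XPathFactory.newInstance().newXPath().evaluate("//p", dom.getNode());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(paragraphText(HTML));
    }
}
```

The filter sees parsed elements rather than raw characters, so comments, CDATA sections and `<br></br>` are all handled according to what they are.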

    --
    Regards,

    ///
    (. .)
    --------ooO--(_)--Ooo--------
    | Philippe Poulard |
    -----------------------------
    http://reflex.gforge.inria.fr/
    Have the RefleX !
    Philippe Poulard, Apr 28, 2008
    #3
  4. Philippe Poulard wrote:
    > public class LineBreakFilter extends XMLFilterImpl {
    > public void startElement(String uri, String localName, String qName,
    > Attributes atts) {
    > if ( "br".equals(localName) ) {
    > characters("\n".toCharArray(), 0, 1);
    > } else {
    > super.startElement(...);
    > }
    > }
    > public void endElement(String uri, String localName, String qName) {
    > if ( ! "br".equals(localName) ) {
    > super.endElement(...);
    > } // else do nothing
    > }
    > }


    I forgot to add to the test: && uri == null (or && uri.length() == 0, I
    don't remember which one the SAX parser is supposed to give)
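For what it's worth, the `ContentHandler.startElement` documentation says the namespace URI is the empty string when the element has no namespace, so the empty-string test is the one to use. The sketch below prints what a namespace-aware parser actually reports; the class name `NamespaceUriDemo`, the helper `rootUri` and the `urn:example` namespace are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class NamespaceUriDemo {
    // Returns the namespace URI reported for the root element of the document.
    static String rootUri(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        final String[] seen = new String[1];
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                        String qName, Attributes atts) {
                    if (seen[0] == null) {
                        seen[0] = uri; // remember only the first element
                    }
                }
            });
        return seen[0];
    }

    public static void main(String[] args) throws Exception {
        // Element without a namespace.
        System.out.println("[" + rootUri("<br/>") + "]");
        // Element named "br" in some other (made-up) namespace.
        System.out.println("[" + rootUri("<x:br xmlns:x='urn:example'/>") + "]");
    }
}
```

In the filter, a test such as `"br".equals(localName) && "".equals(uri)` would then skip a `br` element that belongs to a different vocabulary; writing it as `"".equals(uri)` also avoids a `NullPointerException` should some parser pass `null`.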

    --
    Regards,

    ///
    (. .)
    --------ooO--(_)--Ooo--------
    | Philippe Poulard |
    -----------------------------
    http://reflex.gforge.inria.fr/
    Have the RefleX !
    Philippe Poulard, Apr 28, 2008
    #4
  5. Sven

    Sven Guest

    On 28 Apr., 00:11, Joshua Cranmer wrote:

    > String sanitized = html.replaceAll("<br/>","\n");
    > and then replace your usages of `html' with those of `sanitized'.


    Thanks for the hint! Although Philippe noted that this may not work in
    all situations, it's sufficient for me at the moment.

    Now I have another problem with text nodes (damn text nodes *g*).
    Still the same scenario where I try to extract data from XHTML pages,
    let's assume we have nodes like this

    <div>
    Text1
    <a href="http://...">Link</a>
    Text2
    </div>

    Then \\div\text() will only return "Text1". It's basically the same
    problem where text nodes are interrupted by child nodes. Any way with
    pure XPath to fetch the whole text?

    Thanks!
    Sven, May 23, 2008
    #5

  6. > <div>
    > Text1
    > <a href="http://...">Link</a>
    > Text2
    > </div>
    >
    > Then \\div\text() will only return "Text1".


    //div/text() (note: FORWARD slashes in XPath!) will return two text
    nodes. Whatever you are doing with the result of that path may be
    operating on only the first node returned, but you didn't show us that
    which makes it hard to advise you.

    Alternatively, you could retrieve the text value of the <div> element --
    but that would include Link as well, since it's defined as all contained
    text.
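Both observations can be checked with a short sketch against the `<div>` from the question. The class name `DivTextDemo` and the helper `eval` are hypothetical; each call asks for the result as a string.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class DivTextDemo {
    static final String XML =
        "<div>\nText1\n<a href=\"http://...\">Link</a>\nText2\n</div>";

    // Evaluates an XPath expression against XML and returns the result as a string.
    static String eval(String expression) throws Exception {
        return XPathFactory.newInstance().newXPath()
            .evaluate(expression, new InputSource(new StringReader(XML)));
    }

    public static void main(String[] args) throws Exception {
        // The path selects two text nodes ...
        System.out.println(eval("count(//div/text())"));
        // ... but converting the node-set to a string keeps only the first one.
        System.out.println(eval("//div/text()"));
        // The string value of the element is all contained text, "Link" included.
        System.out.println(eval("string(//div)"));
    }
}
```

In XPath 1.0, converting a node-set to a string yields the string value of its first node in document order, which is why a string result hides the second text node.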
    Joseph J. Kesselman, May 23, 2008
    #6
  7. Sven wrote:

    > <div>
    > Text1
    > <a href="http://...">Link</a>
    > Text2
    > </div>
    >
    > Then \\div\text() will only return "Text1". It's basically the same
    > problem where text nodes are interrupted by child nodes. Any way with
    > pure XPath to fetch the whole text?


    Well
    string(/div)
    will give you the text contained in that element which is
    "
    Text1
    Link
    Text2
    "

    And
    /div/text()
    as an XPath 1.0 expression selects two text nodes over which you can
    iterate to extract
    "
    Text1
    "
    and
    "
    Text2
    "


    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
    Martin Honnen, May 23, 2008
    #7
  8. Sven

    Sven Guest

    On 23 May, 19:33, "Joseph J. Kesselman" wrote:

    > //div/text() (note: FORWARD slashes in XPath!) will return two text
    > nodes. Whatever you are doing with the result of that path may be
    > operating on only the first node returned, but you didn't show us that
    > which makes it hard to advise you.


    Thanks, this was my bad! In Java I used XPath#evaluate( String
    expression, InputSource source, QName returnType ) with returnType ==
    XPathConstants.STRING which apparently doesn't concatenate multiple
    results. I'm now using XPathConstants.NODESET and everything works as
    expected. Great!
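A minimal sketch of the NODESET variant described here, assuming the `<div>` from the earlier post. The class name `NodeSetDemo` and the helper `divTexts` are hypothetical, and trimming each text node is a choice made for this example only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NodeSetDemo {
    static final String XML =
        "<div>\nText1\n<a href=\"http://...\">Link</a>\nText2\n</div>";

    // Collects the trimmed content of every text node that is a direct child of a <div>.
    static List<String> divTexts(String xml) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(
            "//div/text()",
            new InputSource(new StringReader(xml)),
            XPathConstants.NODESET);

        List<String> texts = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getNodeValue().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(divTexts(XML));
    }
}
```

With a node-set result the caller decides how to join the pieces, for example with a space or a newline, instead of getting only the first text node.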
    Sven, May 23, 2008
    #8