Extract links from HTML

Discussion in 'Java' started by Noel, Oct 22, 2008.

  1. Noel

    Noel Guest

    Hello,

    I have a string containing the HTML code of a web search engine
    result. The web search result particularly contains links that are of
    interest to my application, and the goal is to extract the links.

    Does anyone know of any java method (or package or something similar)
    that is able to retrieve the URLs from a given block of HTML? I'd like
    something simple, like a method that takes a string argument
    (containing the HTML text) and returns an array or vector of URLs.

    Thanks

    N
     
    Noel, Oct 22, 2008
    #1

  2. On Oct 22, 7:26 am, Noel <> wrote:
    > Does anyone know of any java method (or package or something similar)
    > that is able to retrieve the URLs from a given block of HTML?


    These days, you've got Java's regular expressions support to help you.
    See package java.util.regex.
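
    As a rough sketch of that approach, something along these lines might
    work. The class name RegexLinkExtractor, the method name extractLinks
    and the pattern itself are illustrative assumptions, not a standard API.
    A regular expression does not understand HTML structure, so it will
    also pick up anchors that sit inside comments or attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls quoted href values out of <a ...> tags with a regex.
public class RegexLinkExtractor {
    // Group 1 captures the opening quote character, group 2 the URL up to the matching quote.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/a\">A</a> "
                    + "<A class='x' HREF='http://example.com/b'>B</A></p>";
        // prints [http://example.com/a, http://example.com/b]
        System.out.println(extractLinks(html));
    }
}
```

    Unquoted href values are not matched by this pattern; for anything
    beyond well-behaved input, a real HTML parser is the safer choice.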
     
    softwarepearls_com, Oct 22, 2008
    #2

  3. Stefan Ram

    Stefan Ram Guest

    Noel <> writes:
    >Does anyone know of any java method (or package or something similar)
    >that is able to retrieve the URLs from a given block of HTML?


    (If answering to this post, please do not quote all of it,
    but only the parts you directly refer to.)

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    public class Main {
        public static void main(final String[] args) {
            // Sample input: two real links, one link inside a comment,
            // and one inside an attribute value.
            final Reader reader = new StringReader(
                "<html><head><title></title></head>" +
                "<body><p>" +
                "<a href=\"alpha\">beta</a>" +
                "<!-- <a href=\"gamma\">delta</a> -->" +
                "<i class='<a href=\"epsilon\">zeta</a>'></i>" +
                "<a href=\"eta\">theta</a>" +
                "</p></body>");
            // Only start tags matter here; the other ParserCallback
            // methods (handleText, handleEndTag, handleSimpleTag,
            // handleComment, handleError) keep their empty defaults.
            final HTMLEditorKit.ParserCallback parserCallback =
                new HTMLEditorKit.ParserCallback() {
                    @Override
                    public void handleStartTag(
                            final HTML.Tag tag,
                            final MutableAttributeSet attribute,
                            final int pos) {
                        if (tag == HTML.Tag.A) {
                            // Print the href of each anchor element.
                            final String address = (String)
                                attribute.getAttribute(HTML.Attribute.HREF);
                            System.out.println(address);
                        }
                    }
                };
            try {
                new ParserDelegator().parse(reader, parserCallback, false);
                System.out.println();
            } catch (final IOException ioException) {
                System.err.println(ioException);
            }
        }
    }

    /* prints:
    alpha
    eta
    */
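
    The program above only prints each address. To get what the original
    question asked for (a method that takes the HTML as a string and
    returns the URLs), the same callback can collect into a list. This is
    a minimal sketch; the class name LinkExtractor and the method name
    extractLinks are made up for illustration. Passing true as the last
    argument of parse asks the parser to ignore charset declarations in
    the document, which seems the safer setting when the input is already
    a String.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the href of every <a> start tag into a list.
public class LinkExtractor {
    public static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attributes.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {   // skip anchors without an href
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // true = ignore charset declarations found in the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for a StringReader; rethrow unchecked to keep the signature simple.
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><p><a href=\"alpha\">beta</a>"
                    + "<a href=\"eta\">theta</a></p></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

    The returned addresses are whatever the href attribute contains, so
    relative links would still need to be resolved against the page URL,
    for example with java.net.URI.resolve.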
     
    Stefan Ram, Oct 22, 2008
    #3
