HTML parsing using Java and Xerces

Discussion in 'Java' started by Camk, Mar 19, 2007.

  1. Camk

    Camk Guest

    Hey, is it possible to do the following?

    1. Enter a search term in ask.com (manually) and hit search.
    2. Once the results page is shown, view the page source and save it to
    the hard disk (manually).
    3. Use a Java program with an embedded HTML parser to extract the
    returned URLs.
    4. Once the URLs are extracted, store them automatically in
    a MySQL database.
    The database has a single table with the following columns:
    Query - stores the search query as a string.
    SearchEngine - stores the name of the search engine as a string (e.g. Ask)
    ReturnedURL - stores the returned URL as a string (taken from
    the parsed page source)
    URLNo - stores the position of the returned URL as an int (i.e. the first
    URL is number 1 and so on)
     
    Camk, Mar 19, 2007
    #1

  2. Chris

    Chris Guest

    Camk wrote:
    > Hey, is it possible to do the following?
    >
    > 1. Enter a search term in ask.com (manually) and hit search.
    > 2. Once the results page is shown, view the page source and save it to
    > the hard disk (manually).
    > 3. Use a Java program with an embedded HTML parser to extract the
    > returned URLs.
    > 4. Once the URLs are extracted, store them automatically in
    > a MySQL database.
    > The database has a single table with the following columns:
    > Query - stores the search query as a string.
    > SearchEngine - stores the name of the search engine as a string (e.g. Ask)
    > ReturnedURL - stores the returned URL as a string (taken from
    > the parsed page source)
    > URLNo - stores the position of the returned URL as an int (i.e. the first
    > URL is number 1 and so on)
    >


    Yes, it is possible. Lots of ways to do it. The trick is to find a
    reliable way to recognize the various entities in the page.

    I would start by reading the page into a String or char array, and then
    seeing if I could write regular expressions to recognize things. See
    java.util.regex.
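
    A minimal sketch of that regex approach, assuming the saved page is
    UTF-8 and that the result links are ordinary <a href="..."> tags with
    absolute http or https URLs. The class name LinkExtractor and the
    pattern are illustrative only; the actual markup of the results page is
    not known here, so the pattern will probably need narrowing (for
    example to a specific class attribute) to skip navigation links.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls absolute link targets out of saved HTML text.
final class LinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    // Group 1 is the quote character, group 2 is the URL between the quotes.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the absolute http/https link targets in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(2);
            // Assumption: relative links are site navigation, not results.
            if (url.startsWith("http://") || url.startsWith("https://")) {
                links.add(url);
            }
        }
        return links;
    }

    // Usage: java LinkExtractor saved-results.html
    public static void main(String[] args) throws IOException {
        String html = new String(
                Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        List<String> links = extractLinks(html);
        for (int i = 0; i < links.size(); i++) {
            // Position is 1-based, matching the URLNo column in the question.
            System.out.println((i + 1) + "\t" + links.get(i));
        }
    }
}
```

    Because the pattern only looks at individual tags, it does not care
    whether the rest of the page is well formed.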

    Don't use Xerces. It is an XML parser, so it will choke on any
    ill-formed HTML.
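
    For step 4 of the question, storing the URLs, plain JDBC with a
    batched PreparedStatement is one option. The sketch below is only an
    outline: the column names come from the question, but the table name
    SearchResults and the class name ResultStore are made up for
    illustration, and it assumes the caller already holds an open
    java.sql.Connection to the MySQL database.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper: writes one row per extracted URL.
final class ResultStore {
    // "SearchResults" is an assumed table name; the four columns are the
    // ones listed in the question.
    static final String INSERT_SQL =
            "INSERT INTO SearchResults (Query, SearchEngine, ReturnedURL, URLNo) "
            + "VALUES (?, ?, ?, ?)";

    // Inserts the URLs in list order; URLNo is the 1-based position.
    static void store(Connection conn, String query, String engine,
                      List<String> urls) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (int i = 0; i < urls.size(); i++) {
                ps.setString(1, query);
                ps.setString(2, engine);
                ps.setString(3, urls.get(i));
                ps.setInt(4, i + 1);
                ps.addBatch();
            }
            // Send all queued rows to the database in one batch.
            ps.executeBatch();
        }
    }
}
```

    The connection itself would normally come from
    DriverManager.getConnection with a jdbc:mysql:// URL and the MySQL
    JDBC driver on the classpath. Using placeholders rather than string
    concatenation keeps odd characters in the URLs from breaking the SQL.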
     
    Chris, Mar 20, 2007
    #2
