Web scraping

Discussion in 'Java' started by raybonds, May 3, 2007.

  1. raybonds

    raybonds Guest

    I am trying to extract data from a website and store it. Would
    someone pose different ways to approach this problem or even
    literature that I could read to help?
    raybonds, May 3, 2007

  2. Lulu58e2

    Lulu58e2 Guest

    This is pretty quick in Groovy using the following:

    // NekoHTML's SAX parser tolerates real-world (non-well-formed) markup
    def parser = new org.cyberneko.html.parsers.SAXParser()
    parser.setFeature('http://xml.org/sax/features/namespaces', false)
    def HTML = new XmlSlurper(parser).parse('http://www.somepage.html')
    // walk the parse tree down to a table and visit each row
    HTML.BODY.DIV[2].P[4].LI[2].TABLE[0].TR.each { /* do something */ } // as an example
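Since this is the Java forum: the same kind of tag walk can be sketched in plain Java with the JDK's built-in HTML parser (javax.swing.text.html.parser.ParserDelegator). A minimal sketch; the class name, markup, and cell contents below are made up for illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class CellScraper {

    // Returns the text content of every <td> cell in the given markup.
    public static List<String> cellText(String html) throws IOException {
        final List<String> cells = new ArrayList<String>();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            private boolean inCell = false;
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.TD) inCell = true;
            }
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == HTML.Tag.TD) inCell = false;
            }
            public void handleText(char[] data, int pos) {
                if (inCell) cells.add(new String(data));
            }
        };
        // The parser is lenient, so it copes with typical web-page HTML.
        new ParserDelegator().parse(new StringReader(html), cb, true);
        return cells;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><table><tr><td>one</td><td>two</td></tr></table></body></html>";
        System.out.println(cellText(page));
    }
}
```

In practice you would first download the page (e.g. via java.net.URL) and feed its reader to the same callback.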

    Lulu58e2, May 3, 2007

  3. Linux has the command-line tool "wget" for downloading websites.
    See http://www.google.com/search?q=wget
    Thomas Fritsch, May 3, 2007
  4. Tris Orendorff
    Here's the info from a spider I have used a few times:

    /**
     * This class implements a reusable spider. To use this
     * class you must have a class set up to receive
     * the information found by the spider. That class must
     * implement the ISpiderReportable interface. Written by
     * Jeff Heaton. Jeff Heaton is the author of "Programming
     * Spiders, Bots, and Aggregators" by Sybex. Jeff can be
     * contacted through his web site at http://www.jeffheaton.com.
     * @author Jeff Heaton (http://www.jeffheaton.com)
     * @version 1.0
     */
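The callback pattern described above can be sketched as follows. The interface and method names here are illustrative stand-ins, not Heaton's actual ISpiderReportable API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the spider's callback pattern: the crawler reports each
// discovery to a listener you supply. Names are made up for the sketch.
interface SpiderListener {
    // Return true to tell the spider to follow the URL it found.
    boolean foundURL(String url);
    // Called when a URL cannot be downloaded.
    void urlError(String url);
}

public class SpiderCallbackDemo {
    public static void main(String[] args) {
        final List<String> found = new ArrayList<String>();
        SpiderListener listener = new SpiderListener() {
            public boolean foundURL(String url) { found.add(url); return true; }
            public void urlError(String url) { /* log and skip */ }
        };
        // A real spider would invoke the listener as it crawls;
        // a single discovery is simulated here.
        listener.foundURL("http://www.jeffheaton.com");
        System.out.println(found); // [http://www.jeffheaton.com]
    }
}
```

The advantage of this shape is that the crawling logic stays reusable: you change what happens to the data by swapping the listener, not the spider.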
    Tris Orendorff, May 3, 2007
  5. Ian Wilson

    Ian Wilson Guest

    1. Use the site's API or RSS feed instead, if available.
    2. Check the site's terms and conditions of use.
    Ian Wilson, May 4, 2007
