Web scraping

Discussion in 'Java' started by raybonds, May 3, 2007.

  1. raybonds

    raybonds Guest

    I am trying to extract data from a website and store it. Could
    someone suggest different ways to approach this problem, or point me
    to literature that would help?
     
    raybonds, May 3, 2007
    #1

  2. Lulu58e2

    Lulu58e2 Guest

    This is pretty quick in Groovy using the following:

    // Requires the NekoHTML jar on the classpath.
    def parser = new org.cyberneko.html.parsers.SAXParser()
    parser.setFeature('http://xml.org/sax/features/namespaces', false)
    def HTML = new XmlSlurper(parser).parse('http://www.somepage.html')
    // NekoHTML upper-cases element names by default, hence BODY, DIV, ...
    HTML.BODY.DIV[2].P[4].LI[2].TABLE[0].TR.each { /* do something */ } // as an example

    C>
     
    Lulu58e2, May 3, 2007
    #2

  3. Linux has the command-line tool "wget" for downloading websites.
    See http://www.google.com/search?q=wget
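
    If you would rather stay in Java than call wget, something like the
    following might do for a single page. It is only a sketch: the class
    name PageDownloader is made up for illustration, and it assumes the
    page is UTF-8 encoded, which is not guaranteed for a real site.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library mentioned in this thread.
class PageDownloader {

    /** Reads everything behind the URL and decodes it as UTF-8 (an assumption). */
    static String download(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageDownloader <url>");
            return;
        }
        String page = download(URI.create(args[0]).toURL());
        System.out.println(page.length() + " characters downloaded");
    }
}
```

    URL.openStream() also accepts file: URLs, so the same method can be
    tried against a saved copy of a page before pointing it at a live site.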
     
    Thomas Fritsch, May 3, 2007
    #3
  4. Here's the info from a spider I have used a few times:

    /**
    * This class implements a reusable spider. To use this
    * class you must have a class set up to receive
    * the information found by the spider. That class must
    * implement the ISpiderReportable interface. Written by
    * Jeff Heaton. Jeff Heaton is the author of "Programming
    * Spiders, Bots, and Aggregators" by Sybex. Jeff can be
    * contacted through his web site at http://www.jeffheaton.com.
    *
    * @author Jeff Heaton(http://www.jeffheaton.com)
    * @version 1.0
    */
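
    The comment above describes a callback design: the spider walks the
    links and reports what it finds to a class you supply. A rough sketch
    of that pattern is below. PageListener and TinySpider are invented
    names, not the ISpiderReportable interface or Jeff Heaton's spider,
    and an in-memory map of links stands in for fetching and parsing
    real pages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical callback interface; NOT the real ISpiderReportable signature. */
interface PageListener {
    /** Called once per discovered page; return false to skip that page's links. */
    boolean pageFound(String url);
}

class TinySpider {
    // Stand-in for downloading a page and extracting its links.
    private final Map<String, List<String>> links;
    private final PageListener listener;

    TinySpider(Map<String, List<String>> links, PageListener listener) {
        this.links = links;
        this.listener = listener;
    }

    /** Breadth-first walk from start, reporting each page exactly once. */
    void crawl(String start) {
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            if (!listener.pageFound(url)) {
                continue; // listener declined to follow this page's links
            }
            for (String next : links.getOrDefault(url, List.<String>of())) {
                if (seen.add(next)) { // add() returns false for already-seen URLs
                    queue.add(next);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
                "/index", List.of("/a", "/b"),
                "/a", List.of("/b", "/index"),
                "/b", List.<String>of());
        List<String> visited = new ArrayList<>();
        new TinySpider(site, url -> { visited.add(url); return true; }).crawl("/index");
        System.out.println(visited); // prints [/index, /a, /b]
    }
}
```

    The set of seen URLs is what keeps the walk from looping when pages
    link back to each other, as /a and /index do here.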
     
    Tris Orendorff, May 3, 2007
    #4
  5. Ian Wilson

    Ian Wilson Guest

    1. Use the site's API or RSS feed instead, if available.
    2. Check the site's terms and conditions of use.
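
    If the site does offer an RSS feed, reading it needs nothing beyond
    the XML parser that ships with the JDK. A minimal sketch, assuming a
    plain RSS 2.0 layout with <item> and <title> elements; the class
    name RssTitles and the sample feed are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; assumes RSS 2.0 style <item><title>...</title></item>.
class RssTitles {

    /** Returns the title of every <item> in the feed, in document order. */
    static List<String> itemTitles(String rssXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Refuse DOCTYPE declarations, since the feed comes from an outside source.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

        List<String> titles = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList title = item.getElementsByTagName("title");
            if (title.getLength() > 0) {
                titles.add(title.item(0).getTextContent().trim());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample feed; a real one would be downloaded first.
        String feed = "<rss version=\"2.0\"><channel><title>Example feed</title>"
                + "<item><title>First post</title><link>http://www.example.com/1</link></item>"
                + "<item><title>Second post</title><link>http://www.example.com/2</link></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(feed)); // prints [First post, Second post]
    }
}
```

    Only titles nested inside <item> are collected, so the channel's own
    title is left out.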
     
    Ian Wilson, May 4, 2007
    #5