Web scraping

raybonds · May 3, 2007

I am trying to extract data from a website and store it. Could
someone suggest different ways to approach this problem, or point
me to literature that would help?

Lulu58e2 · May 3, 2007

This is pretty quick in Groovy using the following:

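// Parse real-world HTML leniently with NekoHTML, then navigate it with XmlSlurper
def parser = new org.cyberneko.html.parsers.SAXParser()
parser.setFeature('http://xml.org/sax/features/namespaces', false)
def HTML = new XmlSlurper(parser).parse('http://www.somepage.html')
// Example only: the element path depends entirely on the page's structure
HTML.BODY.DIV[2].P[4].LI[2].TABLE[0].TR.each { /* do something */ }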
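
If you would rather stay in plain Java without extra jars, a rough equivalent is to collect table rows with the HTML parser that ships with Swing. This is only a sketch under assumptions: the class name TableRows and the sample markup are made up for illustration, and the Swing parser is old and fairly forgiving but not a full modern HTML parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of every table row in an HTML string.
class TableRows {

    static List<List<String>> rows(String html) {
        List<List<String>> rows = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private List<String> row;      // cells of the row being read, or null
            private StringBuilder cell;    // text of the cell being read, or null

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TR) {
                    row = new ArrayList<>();
                } else if (tag == HTML.Tag.TD || tag == HTML.Tag.TH) {
                    cell = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Only keep text that appears inside a table cell.
                if (cell != null) {
                    cell.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if ((tag == HTML.Tag.TD || tag == HTML.Tag.TH) && cell != null) {
                    if (row != null) {
                        row.add(cell.toString().trim());
                    }
                    cell = null;
                } else if (tag == HTML.Tag.TR && row != null) {
                    rows.add(row);
                    row = null;
                }
            }
        };

        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample markup, standing in for a downloaded page.
        String html = "<table><tr><th>Name</th><th>Qty</th></tr>"
                + "<tr><td>Apple</td><td>3</td></tr></table>";
        for (List<String> row : rows(html)) {
            System.out.println(row);
        }
    }
}
```

Unlike the fixed BODY.DIV[2]... path above, this picks up every row of every table, so you would still need to filter for the table you care about.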

Thomas Fritsch · May 3, 2007

Linux has the command-line tool "wget" for downloading websites.
See http://www.google.com/search?q=wget
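
If you want the same "fetch a page and save it" step from inside a Java program, something like the following may be enough for a single page. It is a minimal sketch, not a wget replacement: the class name PageDownloader is made up, and it does no retries, redirect handling beyond the JDK defaults, or recursive downloading.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: saves one resource to a local file.
class PageDownloader {

    /** Copies the resource at source into target, replacing any existing file. Returns the bytes written. */
    static long download(URI source, Path target) {
        try (InputStream in = source.toURL().openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java PageDownloader <url> <output-file>");
            return;
        }
        long bytes = download(URI.create(args[0]), Paths.get(args[1]));
        System.out.println("Saved " + bytes + " bytes to " + args[1]);
    }
}
```

Once the page is on disk you can parse it as often as you like without hitting the site again.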

Tris Orendorff · May 3, 2007

Here's the info from a spider I have used a few times:

/**
 * This class implements a reusable spider. To use this
 * class you must have a class set up to receive
 * the information found by the spider. That class must
 * implement the ISpiderReportable interface. Written by
 * Jeff Heaton. Jeff Heaton is the author of "Programming
 * Spiders, Bots, and Aggregators" by Sybex. Jeff can be
 * contacted through his web site at http://www.jeffheaton.com.
 *
 * @author Jeff Heaton(http://www.jeffheaton.com)
 * @version 1.0
 */
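
To give an idea of the callback pattern that comment describes, here is a small sketch of a breadth-first spider that reports each page to a listener. This is not Jeff Heaton's code: MiniSpider, PageListener and the fetcher parameter are names invented for this example, links are matched with a simple regex on double-quoted href attributes, and every link is assumed to already be an absolute URL.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a spider that reports pages to a callback.
class MiniSpider {

    /** Hypothetical callback; not the ISpiderReportable interface from the comment above. */
    interface PageListener {
        void pageFound(String url, String html);
    }

    // Naive link matcher: double-quoted href values, ignoring fragment-only links.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"#]+)\"", Pattern.CASE_INSENSITIVE);

    /** Visits pages breadth-first from startUrl, reporting at most maxPages pages. */
    static void crawl(String startUrl, int maxPages,
                      Function<String, String> fetcher, PageListener listener) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(startUrl);
        seen.add(startUrl);
        int visited = 0;

        while (!queue.isEmpty() && visited < maxPages) {
            String url = queue.remove();
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // fetch failed or page unknown
            }
            visited++;
            listener.pageFound(url, html);

            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (seen.add(link)) { // queue each link only once
                    queue.add(link);
                }
            }
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for a website, so the sketch runs without network access.
        Map<String, String> site = new HashMap<>();
        site.put("http://example.test/", "<a href=\"http://example.test/a\">a</a>");
        site.put("http://example.test/a", "<a href=\"http://example.test/\">home</a>");

        List<String> visited = new ArrayList<>();
        crawl("http://example.test/", 10, site::get, (url, html) -> visited.add(url));
        System.out.println(visited);
    }
}
```

Passing the fetcher in as a function keeps the crawl logic separate from the networking, so you can swap in a real HTTP fetch later and add politeness delays there.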

Ian Wilson · May 4, 2007

1. Use the site's API or RSS feed instead, if available.
2. Check the site's terms and conditions of use.
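
As an illustration of the first point, reading an RSS feed usually needs far less code than scraping HTML, because the feed is well-formed XML. The sketch below pulls the item titles out of an RSS 2.0 document using only the JDK's XML parser. The class name RssTitles and the sample feed are made up for the example, and fetching the feed over the network is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts <item><title> values from an RSS 2.0 document.
class RssTitles {

    static List<String> itemTitles(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted feeds cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

            List<String> titles = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // Look for the title inside this item only, not the channel title.
                NodeList title = item.getElementsByTagName("title");
                if (title.getLength() > 0) {
                    titles.add(title.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable RSS document", e);
        }
    }

    public static void main(String[] args) {
        // Made-up feed, standing in for one downloaded from the site.
        String feed = "<rss version=\"2.0\"><channel><title>Example feed</title>"
                + "<item><title>First post</title></item>"
                + "<item><title>Second post</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(feed));
    }
}
```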
