Open URI & web scraping. Part II

Jean Nibee · Nov 13, 2007

Hi

(short form of a post I made yesterday that got no love; I suspect it's
because I was long-winded)

In a nutshell: when I use open-uri (and Hpricot) to download a web page
and 'scrape' all the images to my local disk, dynamic images always come
out broken (size 0), but static images are fine.

An example would be: <img
src="http://myserver:8080/Someservlet?name=blah&param=value&etc=etc">

Whether I copy/paste this URL into another browser or use open-uri to
"get" the image, I get an error of:

XML Parsing Error: no element found
Location: http://myserver:8080/Someservlet?name=blah&param=value&etc=etc
Line Number 1, Column 1:

BUT, this image is displayed PERFECTLY in the html.

How can I get this image to download? (I suspect it's the mime type
being set on the server side but I am not 100% sure)

***
OUTPUT
***
[[URI information...]]
Fetched document:
http://myserver:8080/Someservlet?name=blah&param=value&etc=etc
Content Type: application/voicexml+xml
Charset:
Content-Encoding:
Last Modified:
IMAGE INFO!!! ->
Writing to file ::
D:\sandbox\auto_attendant\archive_reports\trunk\dumps\1194882652_854.gif
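
For reference, here is a minimal sketch of a check that would flag this case before anything is written to disk. The output above shows a Content Type of application/voicexml+xml and a zero-size file, so the sketch tests for both. The names `image_problem` and `save_image` are made up for illustration. `URI.open` is the spelling used by current Ruby versions for open-uri's `open`. The optional `headers` hash is only there in case the servlet turns out to need an extra request header such as a cookie, which is an assumption and not something confirmed here.

```ruby
require "open-uri"

# Hypothetical helper: returns a description of what looks wrong with a
# fetched response, or nil when it looks like a usable image.
def image_problem(content_type, body)
  return "empty body (0 bytes)" if body.nil? || body.empty?
  unless content_type.to_s.start_with?("image/")
    return "unexpected content type: #{content_type}"
  end
  nil
end

# Hypothetical helper: fetch url and write it to path only when the
# response passes the check above. String keys in headers are sent by
# open-uri as HTTP request header fields.
def save_image(url, path, headers = {})
  URI.open(url, headers) do |io|
    body = io.read
    problem = image_problem(io.content_type, body)
    if problem
      warn "Skipping #{url}: #{problem}"
      return false
    end
    # "wb" matters on Windows: text mode would rewrite newline bytes
    # inside the image data.
    File.open(path, "wb") { |f| f.write(body) }
    true
  end
end
```

Calling `save_image(url, path)` for each scraped `src` would skip the servlet response shown above instead of leaving an empty .gif behind. It does not explain why the body is empty; it only makes the failure visible.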

Thanks for your help.

Axel Etzold · Nov 13, 2007

Dear Jean,

maybe you can use ruby's rio (http://rio.rubyforge.org/) to download
an entire website. I'm thinking in particular of the examples
given in
http://rio.rubyforge.org/classes/RIO/Doc/INTRO.html under the
headers

"Creating a Rio that refers to a web page" and
"Creating a Rio that refers to a file or directory on a FTP server".

Otherwise, maybe you will get better responses on the Rails mailing list?

Best regards,

Axel

Jean Nibee · Nov 13, 2007

Same issue with Rio (albeit a little more complex to get the page and
parse it as I'm doing w/ OpenURI / Hpricot).

I didn't post to the Rails list since this isn't using the Rails
framework, but maybe they do enough web work that it will clue them in
to an issue I'm missing.

Thanks for your reply and help!

Peter Szinek · Nov 13, 2007

Hi Jean,

What does an aggressive wget (i.e. with grab everything options) do?
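
If a plain client also comes back empty, one thing worth testing is whether the servlet depends on state from the page request. This is only a guess: nothing in the thread confirms that the server uses a session cookie or checks the Referer. A rough sketch of that experiment with Net::HTTP follows, where `cookie_header` and `fetch_with_session` are made-up names, and plain http is assumed as in the example URL.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: turn Set-Cookie header values into one Cookie
# request header. Only the leading name=value pair of each cookie is
# kept; attributes such as Path or Expires are dropped.
def cookie_header(set_cookie_values)
  Array(set_cookie_values).map { |c| c.split(";", 2).first.strip }.join("; ")
end

# Hypothetical helper: fetch the page first, then request the image
# while replaying the page's cookies and sending the page as Referer.
# Returns the Net::HTTPResponse for the image request.
def fetch_with_session(page_url, image_url)
  page = Net::HTTP.get_response(URI(page_url))

  image_uri = URI(image_url)
  request = Net::HTTP::Get.new(image_uri)
  cookies = cookie_header(page.get_fields("Set-Cookie"))
  request["Cookie"] = cookies unless cookies.empty?
  request["Referer"] = page_url

  Net::HTTP.start(image_uri.host, image_uri.port) do |http|
    http.request(request)
  end
end
```

Comparing `response.body.size` and `response["Content-Type"]` from this call against a bare fetch of the same image URL would show whether the extra headers make any difference.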

Cheers,
Peter
___
http://www.rubyrailways.com
http://scrubyt.org
