Open URI and web scraping...

Jean Nibee

Hi

I apologize if this is off topic but I couldn't figure out how to reach
the right audience based on the lists on this forum.

Here goes: I am writing a web scraper in an attempt to mimic a VERY
light version of 'wget'. I have to call a reports page, get the HTML,
write the HTML (and any associated images for graphs) to the filesystem,
and this ends up being used by a PDF converter to create a nice PDF file
for a client.
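To make the "get the HTML, then fetch its images" step concrete, here is a minimal sketch of pulling the image URLs out of a fetched page. The helper name image_urls and the regex-based parsing are assumptions for illustration only; a real HTML parser would be more robust.

```ruby
require 'uri'

# Hypothetical helper: collect the src of every <img> tag in an HTML string
# and resolve it against the URL of the page it came from.
def image_urls(html, page_url)
  html.scan(/<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/i).flatten.map do |src|
    # Attribute values in HTML may carry escaped ampersands (&amp;).
    URI.join(page_url, src.gsub('&amp;', '&')).to_s
  end.uniq
end
```

Both static paths such as /images/log.gif and full servlet URLs with query strings come back as absolute URLs, so they can be fetched the same way.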

There are easier solutions, such as printer drivers, but at my office
these are not an option. The current 'flow' I'm using is the only one I
am allowed to use, right or wrong. (Also understand that a lot of the
right ways to do it are NOT distributable under their license
agreements, so I have to create it from scratch.)

That being said, here's my problem...

The images in the document are NOT referenced as static files on the
file system; they are actually HTTP URLs pointing at a servlet on the
web server, which generates them 'on the fly'.

Example would be : <img
src="http://myserver:8080/Someservlet?name=blah&param=value&etc=etc">

When I try to get these images (and write the content from the HTTP
request to the file system), the resulting files have a size of 0. No
errors happen on the HTTP request.

Also, if I copy/paste this URL into a browser window I get this error:

XML Parsing Error: no element found
Location: http://myserver:8080/Someservlet?name=blah&param=value&etc=etc
Line Number 1, Column 1:

BUT, this image is displayed PERFECTLY in the report.

How can I get this image to download? (I suspect it's the mime type
being set on the server side but I am not 100% sure)
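Since the request raises no error but the body is empty, one way to narrow this down is to look at the raw status code and headers the servlet sends back. The sketch below uses Net::HTTP from the standard library; the helper name probe is hypothetical. Checking Location and Set-Cookie rests on an assumption (not confirmed by anything above) that the servlet may depend on a redirect or on the session of the browser that loaded the report.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetch a URL without following redirects and report
# what actually came back, so an empty body can be told apart from a
# redirect, an error status, or a missing session.
def probe(url, headers = {})
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    res = http.get(uri.request_uri, headers)
    {
      status:         res.code,              # status code as a string
      content_type:   res['Content-Type'],
      content_length: res['Content-Length'],
      location:       res['Location'],       # set if the server redirects
      set_cookie:     res['Set-Cookie'],     # set if the server starts a session
      body_bytes:     res.body.to_s.bytesize
    }
  end
end
```

Running it against both the servlet URL and the static log.gif URL and comparing the two results should show whether the difference is in the status, the headers, or only the body.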

PSEUDO CODE:
require 'open-uri'

def dump_uri_info(f)
  puts "[[URI information...]]"
  puts "Fetched document: #{f.base_uri}"
  puts "Content Type: #{f.content_type}"
  puts "Charset: #{f.charset}"
  puts "Content-Encoding: #{f.content_encoding}"
  puts "Last Modified: #{f.last_modified}"
end

img_path =
  "http://myserver:8080/Someservlet?name=blah&param=value&etc=etc"
# Newer Ruby versions need URI.open here instead of open.
open( img_path,
      "User-Agent" => "Ruby/#{RUBY_VERSION}",
      "From" => "me@mecom", "Referer" => "http://localhost:8080/") { |i|
  dump_uri_info(i)
  # Read the body once; a second i.read would return an empty string.
  data = i.read
  puts "IMAGE INFO!!! -> #{data}"
  # rnd_filename and write_to_binary are helpers not shown here.
  write_to_binary( rnd_filename, data )
}
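The write step is not shown above, so here is a minimal sketch of what it could look like. The name write_binary_file is hypothetical and this is not the actual write_to_binary helper. It opens the file in binary mode (which matters on Windows, where text mode translates line endings) and skips empty bodies, so a 0-byte image never lands on disk unnoticed.

```ruby
# Hypothetical stand-in for the write step: write raw bytes in binary mode.
# Returns the number of bytes written, or nil if there was nothing to write.
def write_binary_file(path, data)
  return nil if data.nil? || data.empty?
  File.open(path, 'wb') { |f| f.write(data) }
end
```

A nil return then signals that the HTTP body was empty before any file was created.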

***
OUTPUT
***
[[URI information...]]
Fetched document:
http://myserver:8080/Someservlet?name=blah&param=value&etc=etc
Content Type: application/voicexml+xml
Charset:
Content-Encoding:
Last Modified:
IMAGE INFO!!! ->
Writing to file ::
D:\sandbox\auto_attendant\archive_reports\trunk\dumps\1194882652_854.gif


One last comment: if I fetch an actual image stored on the file system
(http://myserver:8080/images/log.gif), it is created properly and looks
perfect when I open it. It's only the dynamically generated images that
fail.

Thanks for your help.
 
