Parsing HTML / following links etc

Dan Cuddeford

Hello all,

I was pushed towards Ruby by a friend. I'm used to the usual shell
scripting and was told this would be much more powerful / graceful /
easier.

It all looks very exciting, but I'm having trouble finding the
easiest way to implement something quite simple.

I'm used to using wget to crawl some of my sites to a certain layer.

wget http://www.digg.com -r -l 2 will dig down two layers from the
front page (following the links).

I can't find an easy way of doing this. open-uri doesn't seem to
support recursive following. I've looked at pulling down the HTML,
parsing it, and feeding the links back into open-uri, but there doesn't
seem to be an easy way of doing that either.

Another thing I would like to do is pull down other elements from the
HTML, such as images, so I explored HTML-parsing libraries, but they all
seemed to be geared towards manipulation rather than downloading
information for manipulation later :(

Thanks for any help you can offer.

Dan
 
Florian Ebeling

Dan Cuddeford wrote:
I'm used to using wget to crawl some of my sites to a certain layer.
wget http://www.digg.com -r -l 2 will dig down two layers from the
front page (following the links). [...] Another thing I would like to do
is pull down other elements from the HTML, such as images, so I explored
HTML-parsing libraries, but they all seemed to be geared towards
manipulation rather than downloading information for manipulation later :(

You might want to try the hpricot (!) gem, but that only covers the
HTML parsing. You then use a standard HTTP client to pull down the
documents you fancy.

A regular HTTP client library does not typically include fetching all
referenced objects; that is rather a higher-level 'application'
feature.
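
To make that concrete, here's a rough, untested sketch of the kind of thing
described above, using open-uri to fetch a page and Hpricot to pull the link
and image URLs out of it for later processing (the URL is just Dan's example):

require 'open-uri'
require 'hpricot'
require 'uri'

page = 'http://www.digg.com'
html = open(page).read          # open-uri fetches the page
doc  = Hpricot(html)            # Hpricot parses the HTML

# collect the href of every link and the src of every image,
# resolving relative URLs against the page where possible
links  = (doc / 'a').map   { |a| a['href'] }.compact
images = (doc / 'img').map { |i| i['src']  }.compact
links  = links.map  { |h| (URI.join(page, h).to_s rescue nil) }.compact
images = images.map { |s| (URI.join(page, s).to_s rescue nil) }.compact

puts links
puts images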
 
Dan Cuddeford

Mmmm, thanks for your advice. One thought - is it possible to get Ruby to
run wget externally, pulling everything into a directory, and then set Ruby
off to do its magic once that is done?
 
Stefano Crocco

On Wednesday 23 January 2008, Dan Cuddeford wrote:
Mmmm, thanks for your advice. One thought - is it possible to get Ruby to
run wget externally, pulling everything into a directory, and then set Ruby
off to do its magic once that is done?

In Ruby, you can execute a command in a subshell using the system method or
the backticks (`) operator. This way, you can create a Ruby script which
downloads what you need using wget, then goes on to do whatever you want with
those files. To run the command in your original post from Ruby, you'd use

system 'wget http://www.digg.com -r -l 2'

or

`wget http://www.digg.com -r -l 2`

The difference between the two methods is that system returns true or false
depending on whether the command exited correctly or with an error status,
while `cmd` returns the standard output of cmd (which is not displayed on
screen).
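
To make the difference concrete, here's a rough sketch (assuming wget is
installed and that it creates a www.digg.com directory, as a recursive
download normally does):

# system: returns true or false depending on the exit status;
# wget's own progress output goes straight to the screen
ok = system('wget http://www.digg.com -r -l 2')
puts(ok ? 'wget finished successfully' : 'wget reported an error')

# backticks: the command's standard output comes back as a String
files = `find www.digg.com -type f`
puts "downloaded #{files.split("\n").size} files"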

I hope this helps

Stefano
 
Dan Cuddeford

Thanks for the advice guys.

It's a shame there isn't an easy way to use wget's -r and -p switches, but
I will try to learn how to use these other gems to get the job done.
 
Stefano Crocco

On Wednesday 23 January 2008, Dan Cuddeford wrote:
Thanks for the advice guys.

It's a shame there isn't an easy way to use wget's -r and -p switches, but
I will try to learn how to use these other gems to get the job done.

As I explained in my previous post, you can run any command you would run in
a shell from Ruby, using `cmd` or system('cmd'). This includes calling wget
with any options. If I understood you correctly, this is what you're looking
for. Am I missing something?

Stefano
 
Marc Heiler

It's a shame there isn't an easy way to use the -r -p switches from wget

But there is.

system 'wget -r -p http://blabla.whatever/lalala/yooo.avi'

You can mimic wget's behaviour in pure Ruby too, but wget is quite big;
it is not a quick job to implement all of its features in Ruby. I
myself only use open-uri to download a single file, but if anyone writes
a happy wget-in-Ruby class that includes recursive downloads, I'd happily
switch to that too.
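
For what it's worth, here is a very rough sketch of how the recursive part
might look in pure Ruby with open-uri and Hpricot - no politeness delays, no
robots.txt handling, and it only follows links on the same host, so treat it
as a starting point rather than a wget replacement:

require 'open-uri'
require 'hpricot'
require 'uri'
require 'fileutils'

# crude recursive fetcher: follows links up to 'depth' levels deep,
# staying on the same host, and saves each page under 'dir'
def crawl(url, depth, dir = 'mirror', seen = {})
  return if depth < 0 || seen[url]
  seen[url] = true

  uri  = URI.parse(url)
  html = open(url).read

  # save the page to disk, flattening the path into a file name
  name = (uri.path || '').sub(%r{^/}, '').gsub('/', '_')
  name = 'index.html' if name.empty?
  FileUtils.mkdir_p(File.join(dir, uri.host))
  File.open(File.join(dir, uri.host, name), 'w') { |f| f.write(html) }

  # follow every link that stays on the same host
  (Hpricot(html) / 'a').each do |a|
    next unless a['href']
    begin
      link = URI.join(url, a['href'])
    rescue URI::Error
      next
    end
    link.fragment = nil               # ignore #anchors
    next unless link.host == uri.host
    crawl(link.to_s, depth - 1, dir, seen)
  end
rescue OpenURI::HTTPError, SocketError
  # skip pages that fail to download
end

crawl('http://www.digg.com/', 2)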
 
Dan Cuddeford

Stefano said:
In Ruby, you can execute a command in a subshell using the system method or
the backticks (`) operator. This way, you can create a Ruby script which
downloads what you need using wget, then goes on to do whatever you want with
those files.


How do I whack a variable in here? It doesn't seem to want a string :(
 
Jörg W Mittag

Dan said:
How do I whack a variable in here? It doesn't seem to want a string :(

The backticks operator -- like most other operators in Ruby -- isn't
actually an operator, it's a method. In this case, it's a method named
'`' that lives on the Kernel module and takes a String as its
parameter.

So, `wget http://www.digg.com -r -l 2` is actually just syntactic
sugar for `("wget http://www.digg.com -r -l 2"). Well, except that's
not valid Ruby syntax. But this is:

send(:`, "wget http://www.digg.com -r -l 2")

So, now that you know that the argument *is*, in fact, a String, you
can probably guess what you can do with that String: String
Interpolation!

`wget #{uri} -r -l 2`
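
So, for example (the uri and depth variables here are just made up for
illustration):

uri   = 'http://www.digg.com'
depth = 2

# interpolation in backticks works exactly as in double-quoted strings
output = `wget #{uri} -r -l #{depth} 2>&1`

# system can also take the arguments separately, which sidesteps any
# shell quoting problems if the URL contains unusual characters
system('wget', uri, '-r', '-l', depth.to_s)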

jwm
 
