Is there a link extractor or similar HTML processing library for Ruby?

Desireco

Hi,

in cool Perl there are a bunch of libraries that process HTML files and
help you when you need to extract info. I remember hearing about something
similar for Ruby as well. If anyone has experience with this, it would help
me if they could point me in the right direction. Basically, I need to
extract links and info from HTML pages.

Thanks.


Zeljko Dakic
http://www.dakic.com
 
Ross Bamford

Maybe try:

http://www.crummy.com/software/RubyfulSoup/
 
Marcin Mielżyński

You mean something like this? (Quite dirty, but it works.)

puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

lopex
 
James Edward Gray II

Marcin Mielżyński said:
You meant something like this ? (quite dirty but works)

puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

No, it doesn't, trust me. ;) Toss a simple "\n" in there and you're sunk:

<a
href="whatever">

Parsing HTML is hard and you don't want to use regular expressions to do it.

James Edward Gray II
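
The failure described here is easy to reproduce. The following is a minimal sketch, not code from the thread: it applies the same one-line pattern to string literals instead of some.html, and the names NAIVE_LINK and naive_links are made up for the illustration.

```ruby
# The one-line pattern from the quoted post, applied to strings instead of a file.
NAIVE_LINK = /<a href="?(.+?)"?>/

# Hypothetical helper: return every href the naive pattern manages to capture.
def naive_links(html)
  html.scan(NAIVE_LINK).flatten
end

p naive_links('<a href="whatever">')    # tag written on one line
p naive_links(%(<a\nhref="whatever">))  # same tag, newline after "<a"
```

The pattern requires a literal space between "<a" and "href", so the second call finds nothing even though both strings are the same tag to a browser.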
 
Marcin Mielżyński

Yep, I realized that after seeing xerces sources :D

lopex
 
Bill Kelly

James Edward Gray II said:
No, it doesn't, trust me. ;) Toss a simple "\n" in there and you're
sunk:

<a
href="whatever">

Parsing HTML is hard and you don't want to use regular expressions to
do it.

Hi, not trying to be argumentative, just surprised. I thought parsing HTML
with regexps was pretty easy. Well, lexing HTML into tokens, I mean.

Since there are no recursive structures (that I know of) in the syntax for
an open or closing tag, it seemed reasonably well suited to regexps to me.

. . . . Heheh, or maybe the passage of time has given the memories a
rosy glow. I just looked up the last HTML lexer I wrote, 5 years ago,
and it's 19 lines of regexp. Admittedly it's a very clean 19 lines, but still,
lengthier than I remembered.... :)


Regards,

Bill
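
To make the "lexing into tokens" idea concrete, here is a rough sketch using StringScanner from Ruby's standard library. It is not Bill's lexer (which is not shown in the thread); TAG_PATTERN and lex_html are hypothetical names, and the sketch deliberately ignores comments, CDATA, script bodies and broken quoting. It only shows that a tag pattern can step over quoted attribute values.

```ruby
require 'strscan'

# A tag: "<", optional "/", a name, then any mix of quoted strings and
# other characters up to the closing ">". Quoted strings may contain ">"
# and newlines. Comments and <script> bodies are NOT handled.
TAG_PATTERN = /<\/?[a-zA-Z][^\s>\/]*(?:"[^"]*"|'[^']*'|[^>"'])*>/

# Hypothetical helper: split html into [:tag, text] and [:text, text] tokens.
def lex_html(html)
  scanner = StringScanner.new(html)
  tokens = []
  until scanner.eos?
    if (tag = scanner.scan(TAG_PATTERN))
      tokens << [:tag, tag]
    elsif (text = scanner.scan(/[^<]+/))
      tokens << [:text, text]
    else
      tokens << [:text, scanner.getch] # a "<" that does not open a tag
    end
  end
  tokens
end

lex_html(%(<a\nhref="x > y">go</a>)).each { |kind, text| p [kind, text] }
```

Even this small version needs three alternatives inside the tag pattern just to cope with quoting, which gives some feel for how the line count grows once real-world markup is involved.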
 
James Edward Gray II

Bill Kelly said:
Hi, not trying to be argumentative, just surprised. I thought
parsing HTML with regexps was pretty easy. Well, lexing HTML into
tokens, I mean.

There's a lot of pretty darn ugly HTML out there my friend. Here's a
semi-paranoid attempt to grab just the start of an anchor tag:

/<\s*a[^>]*?href\s*=\s*(['"]?)[^'"]*\1?[^>]*>/i

Am I getting close yet? No, the quotes are all wrong. That would
mishandle an extremely common link like:

<a href="alert('You broke it!')">

I would try to fix that, but my brain has already melted and leaked
out my ear. :) I'm sure I made other mistakes too.

If you want to capture the name of the link too, this gets *much* worse!

James Edward Gray II
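
A small experiment shows where the quote handling goes wrong. This is a sketch under one loud assumption: the pattern in the post has no capture group for the href value, so a second group has been added around it here. ANCHOR and href_of are names invented for the illustration.

```ruby
# The pattern from the post with one change (an assumption for illustration):
# a second capture group around the attribute value.
ANCHOR = /<\s*a[^>]*?href\s*=\s*(['"]?)([^'"]*)\1?[^>]*>/i

# Hypothetical helper: return whatever the pattern captures as the href.
def href_of(tag)
  m = ANCHOR.match(tag)
  m && m[2]
end

p href_of('<a href="page.html">')
p href_of(%q{<a href="alert('You broke it!')">})
```

Strictly speaking the pattern still matches the second tag as a whole, because the closing-quote backreference is optional and the trailing [^>]* swallows the rest. What breaks is the extracted value: the character class stops at the first quote of either kind, so the capture is cut off at the inner single quote.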
 
James Edward Gray II

A few regexps here, a few .scans there, and you're done...

Or you can load RubyfulSoup and call find() a few times. About the
same effort, but a *lot* safer, eh? ;)

James Edward Gray II
 
William James

class String
  # Scan for <s ...> tags (self-closing or paired) and collect
  # [attribute_hash, inner_text] pairs; yields each pair if a block is given.
  def xtag(s)
    result = []
    scan( %r!
      < #{s} (?: \s+ ( [^>]* ) )? / >
      |
      < #{s} (?: \s+ ( [^>]* ) )? >
      ( .*? ) </ #{s} >
    !mix ) \
    { |unpaired, attr, data| h = { }
      ( unpaired || attr || "" ).
        scan( %r{ ( \w+ ) \s* = \s*
          (?: ( ["'] ) ( .*? ) \2 | ( \S+ ) )
        }x ) { |k,q,v,v2|
          h[k.downcase] = (v || v2) }
      block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
    }
    result
  end
end

DATA.read.xtag('a'){|atr,txt| puts atr['href'], txt }

__END__
<a
href = "alert('Junior broke it!')" >foo bar</a>
<a
href = www.foo.bar >foo bar
</a>
upcoming <A HREF="./">HTML 3.2 reference</A>. All the
is <A HREF='./special/a.html'>A</A>, with the attribute HREF.
<a target="_blank" href="/support?hl=en">Help</a> |
 
Gregor Kopp

gem install mechanize

require 'mechanize'
# Older releases exposed this class as WWW::Mechanize; current ones use Mechanize.
browser = Mechanize.new
url = "http://www.eineseite.de"
page = browser.get url
page.links.each do |link|
  puts "#{url}#{link.href}"
end


Also take a look at the HTML tokenizer available as a gem.
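
One caveat about the loop above: it glues url and link.href together, which only gives a usable address when the href is a path starting with "/". A possible refinement, sketched here with URI.join from the standard library, resolves each href against the page address instead. The helper name absolute_link and the sample hrefs are assumptions for the illustration, and URI.join may raise URI::InvalidURIError for hrefs that are not valid URIs.

```ruby
require 'uri'

# Hypothetical helper: resolve an href against the page it was found on.
def absolute_link(page_url, href)
  URI.join(page_url, href).to_s
end

base = "http://www.eineseite.de/"
puts absolute_link(base, "/support?hl=en")
puts absolute_link(base, "./special/a.html")
puts absolute_link(base, "http://www.crummy.com/software/RubyfulSoup/")
```

Relative paths are resolved against the base, and an href that is already absolute comes back unchanged rather than being appended to the site address.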
 
James Edward Gray II

<a href="if (my_var > 5) { whatever() }">Javascript Link</a>

James Edward Gray II
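
Why that link is a problem for the posted xtag can be seen by isolating the part of its pattern that collects attributes. This is a reduced sketch, not the full method: ATTRS and raw_attrs are hypothetical names, and only the [^>]* piece is reproduced.

```ruby
# The attribute-capturing piece of the posted pattern, isolated for illustration.
ATTRS = /<a\s+([^>]*)>/i

# Hypothetical helper: return the raw attribute text the pattern captures.
def raw_attrs(tag)
  tag[ATTRS, 1]
end

p raw_attrs('<a href="if (my_var > 5) { whatever() }">Javascript Link</a>')
```

Since [^>]* cannot cross a ">", the capture ends at the comparison operator inside the quoted value, and everything after it is treated as link text.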
 
William James

class String
  # Variant of xtag whose attribute pattern steps over double-quoted values,
  # so a ">" inside "..." no longer ends the tag.
  def xtag(str)
    result = [] ; re =
      %r{ < #{str} (?: \s+ ( (?> [^>"/]* (?> "[^"]*" )? )* ) )? }xi
    scan( %r{ #{re} / > | #{re} > ( .*? ) </ #{str} >
    }mix ) \
    { |unpaired, attr, data| h = { }
      ( unpaired || attr || "" ).
        scan( %r{ ( \w+ ) \s* = \s*
          (?: ( ["'] ) ( .*? ) \2 | ( \S+ ) )
        }mx ) { |k,q,v,v2|
          h[k.downcase] = (v || v2) }
      block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
    }
    result
  end
end

DATA.read.xtag('a'){|atr,txt| puts "-"*9; p atr['href']; puts txt }

__END__
<a
href = "alert('Junior broke it!')" >foo bar</a>
<a
href = www.foo.bar >foo bar
</a>
upcoming <A HREF="./">HTML 3.2 reference</A>. All the
is <A HREF="./special/a.html">A</A>, with the attribute HREF.
<a target="_blank" href="/support?hl=en">Help</a> |
<a href="if (my_var > 5) { whatever() }">Javascript Link</a>
<a name = "foo-bar"
href = "if (foo_bar > 14)
{ fluct() }"
 
