Is there link extractor or similar html processing libs for Ruby

Desireco · Mar 7, 2006

Hi,

In cool Perl there are a bunch of libraries that process HTML files and
help you when you need to extract info. I remember hearing about something
similar for Ruby as well. If someone has experience with this, it would
help me if they could point me in the right direction. Basically, I need
to extract links and info from HTML pages.

Thanks.

Zeljko Dakic
http://www.dakic.com

Ross Bamford · Mar 7, 2006

Maybe try:

http://www.crummy.com/software/RubyfulSoup/

Marcin Mielżyński · Mar 7, 2006

Do you mean something like this? (Quite dirty, but it works.)

puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

lopex

James Edward Gray II · Mar 7, 2006

Marcin said:
You meant something like this ? (quite dirty but works)

puts open("some.html").read.scan(/<a href="?(.+?)"?>/)

No, it doesn't, trust me. Toss a simple "\n" in there and you're sunk:

<a
href="whatever">

Parsing HTML is hard and you don't want to use regular expressions to
do it.
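
To make the failure concrete, here is a minimal sketch that runs the quoted pattern against the same tag written on one line and split across two. The constant and variable names are mine, chosen for illustration; only the pattern comes from the post above.

```ruby
# The "quite dirty" pattern quoted above, unchanged.
DIRTY_LINK = /<a href="?(.+?)"?>/

one_line  = '<a href="whatever">'
two_lines = "<a\nhref=\"whatever\">"

# The pattern insists on a literal space between "a" and "href",
# so a newline in that position makes the tag invisible to it.
p one_line.scan(DIRTY_LINK)
p two_lines.scan(DIRTY_LINK)
```

The first call finds the link and the second returns an empty array, even though both strings are the same tag as far as a browser is concerned.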

James Edward Gray II

Desireco · Mar 7, 2006

Thank you guys. RubyfulSoup looks like what I am after.

Zeljko

Marcin Mielżyński · Mar 7, 2006

James said:
No, it doesn't, trust me. Toss a simple "\n" in there and you're sunk:

<a
href="whatever">

Parsing HTML is hard and you don't want to use regular expressions to do
it.

James Edward Gray II

Yep, I realized that after seeing xerces sources

lopex

Bill Kelly · Mar 7, 2006

James Edward Gray II said:
No, it doesn't, trust me. Toss a simple "\n" in there and you're
sunk:

<a
href="whatever">

Parsing HTML is hard and you don't want to use regular expressions to
do it.

Hi, not trying to be argumentative, just surprised. I thought parsing HTML
with regexps was pretty easy. Well, lexing HTML into tokens, I mean.

Since there are no recursive structures (that I know of) in the syntax for
an open or closing tag, it seemed reasonably well suited to regexps to me.
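
As a rough illustration of what lexing with a regexp can look like (this is a naive sketch, not the 19-line lexer mentioned below, and TOKEN is a name made up for the example), a single alternation can split markup into tag tokens and text tokens:

```ruby
# Naive tokenizer: a token is either one tag (<...>) or a run of text.
# It assumes ">" never appears inside an attribute value and ignores
# comments, scripts and CDATA, which real pages do not guarantee.
TOKEN = /<[^>]*>|[^<]+/

html   = '<p>See <a href="./">the reference</a>.</p>'
tokens = html.scan(TOKEN)

tokens.each { |t| puts t.inspect }
```

On tidy input like this it yields seven tokens, alternating between tags and text as expected. The assumptions listed in the comment are where it stops being reliable.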

. . . . Heheh, or maybe the passage of time has given the memories a
rosy glow. I just looked up the last HTML lexer I wrote, 5 years ago,
and it's 19 lines of regexp. Admittedly it's a very clean 19 lines, but still,
lengthier than I remembered....

Regards,

Bill

James Edward Gray II · Mar 7, 2006

Hi, not trying to be argumentative, just surprised. I thought
parsing HTML with regexps was pretty easy. Well, lexing HTML into
tokens, I mean.

There's a lot of pretty darn ugly HTML out there my friend. Here's a
semi-paranoid attempt to grab just the start of anchor tag:

/<\s*a[^>]*?href\s*=\s*(['"]?)[^'"]*\1?[^>]*>/i

Am I getting close yet? No, the quotes are all wrong. That would
mishandle an extremely common link like:

<a href="alert('You broke it!')">

I would try to fix that, but my brain has already melted and leaked
out my ear.

I'm sure I made other mistakes too.
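
A small sketch of what goes wrong with the quotes. It uses the pattern above with one change that is my assumption, not part of the original: a second capture group around the value, so the result can be inspected. Run this way, the pattern still matches the tag as a whole, because `\1?` and `[^>]*` are permissive; the damage shows up in the value it recovers.

```ruby
# The pattern from the post, plus an added group (2) around the href value.
ANCHOR = /<\s*a[^>]*?href\s*=\s*(['"]?)([^'"]*)\1?[^>]*>/i

tag   = %q{<a href="alert('You broke it!')">}
match = ANCHOR.match(tag)

puts match[0]   # the text matched as the whole tag
puts match[2]   # the text taken to be the href value
```

The value comes back cut off at the first single quote, since `[^'"]*` treats either kind of quote as the end of the attribute regardless of which one opened it.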

If you want to capture the name of the link too, this gets *much* worse!

James Edward Gray II

James Edward Gray II · Mar 7, 2006

A few regexps here, a few .scans there, and you're done...

Or you can load RubyfulSoup and call find() a few times. About the
same effort, but a *lot* safer, eh?

James Edward Gray II

William James · Mar 8, 2006

class String
  def xtag(s)
    result = []
    scan( %r!
        < #{s} (?: \s+ ( [^>]* ) )? / >
      |
        < #{s} (?: \s+ ( [^>]* ) )? >
        ( .*? ) </ #{s} >
      !mix ) \
    { |unpaired, attr, data| h = { }
      ( unpaired || attr || "" ).
        scan( %r{ ( \w+ ) \s* = \s*
                  (?: ( ["'] ) ( .*? ) \2 | ( \S+ ) )
                }x ) { |k,q,v,v2|
          h[k.downcase] = (v || v2) }
      block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
    }
    result
  end
end

DATA.read.xtag('a'){|atr,txt| puts atr['href'], txt }

__END__
<a
href = "alert('Junior broke it!')" >foo bar</a>
<a
href = www.foo.bar >foo bar
</a>
upcoming <A HREF="./">HTML 3.2 reference</A>. All the
is <A HREF='./special/a.html'>A</A>, with the attribute HREF.
<a target="_blank" href="/support?hl=en">Help</a> |

Gregor Kopp · Mar 8, 2006

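gem install mechanize

require 'mechanize'
# Originally written as WWW::Mechanize.new; current releases of the gem
# name the class Mechanize.
browser = Mechanize.new
url = "http://www.eineseite.de"
page = browser.get(url)
page.links.each do |link|
  # Simple concatenation, so this only makes sense for site-relative hrefs.
  puts "#{url}#{link.href}"
end

Also take a look at the html tokenizer from gems.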

Gregor Kopp · Mar 8, 2006

Gregor said:
take also a look at html tokenizer from gems

or do a gem search html

James Edward Gray II · Mar 8, 2006

<a href="if (my_var > 5) { whatever() }">Javascript Link</a>
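
Why this input is a problem for the xtag code: its attribute part is `[^>]*`, which ends at the first ">" whether or not that character sits inside a quoted value. The sketch below isolates that piece; NAIVE_TAG is a simplified stand-in written for this example, not the full pattern.

```ruby
# Simplified stand-in for the attribute part of xtag:
# capture everything between "<a " and the first ">".
NAIVE_TAG = /<a\s+([^>]*)>/i

link  = '<a href="if (my_var > 5) { whatever() }">Javascript Link</a>'
attrs = link[NAIVE_TAG, 1]

puts attrs.inspect
```

The captured attribute text stops in the middle of the href value, at the ">" of the comparison, so everything after it is treated as the body of the link.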

James Edward Gray II

William James · Mar 10, 2006

James said:
<a href="if (my_var > 5) { whatever() }">Javascript Link</a>

class String
  def xtag(str)
    result = [] ; re =
      %r{ < #{str} (?: \s+ ( (?> [^>"/]* (?> "[^"]*" )? )* ) )? }xi
    scan( %r{ #{re} / > | #{re} > ( .*? ) </ #{str} >
        }mix ) \
    { |unpaired, attr, data| h = { }
      ( unpaired || attr || "" ).
        scan( %r{ ( \w+ ) \s* = \s*
                  (?: ( ["'] ) ( .*? ) \2 | ( \S+ ) )
                }mx ) { |k,q,v,v2|
          h[k.downcase] = (v || v2) }
      block_given? ? ( yield [ h, data ] ) : result << [ h, data ]
    }
    result
  end
end

DATA.read.xtag('a'){|atr,txt| puts "-"*9; p atr['href']; puts txt }

__END__
<a
href = "alert('Junior broke it!')" >foo bar</a>
<a
href = www.foo.bar >foo bar
</a>
upcoming <A HREF="./">HTML 3.2 reference</A>. All the
is <A HREF="./special/a.html">A</A>, with the attribute HREF.
<a target="_blank" href="/support?hl=en">Help</a> |
<a href="if (my_var > 5) { whatever() }">Javascript Link</a>
<a name = "foo-bar"
href = "if (foo_bar > 14)
{ fluct() }"
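
The idea behind the revised pattern, as far as I can tell, is to step over double-quoted strings as a unit so that a ">" inside one no longer ends the tag. A reduced sketch of that idea follows; QUOTE_AWARE_TAG is a hypothetical name, and the pattern is a simplification written for this example rather than the code above.

```ruby
# Attribute text is a run of either ordinary characters (not ">" or a
# double quote) or complete double-quoted strings, which may contain ">".
QUOTE_AWARE_TAG = /<a\s+((?:[^>"]|"[^"]*")*)>/i

link  = '<a href="if (my_var > 5) { whatever() }">Javascript Link</a>'
attrs = link[QUOTE_AWARE_TAG, 1]

puts attrs.inspect
```

This captures the whole href attribute, including the ">" in the comparison. It only knows about double quotes, so a single-quoted value containing ">" would still break it, which is the kind of case a real parser handles for you.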
