search question

hgwoss

Hi,

I would like to extract a certain link URL and link title from an HTML
document, which is stored in a text file.

It may look like this:

"A lot of text. <a href="linkurl.html">Link Title</a> Even more Text."

My question is: what is the most efficient way of doing that?
 
Matija Papec

From perldoc:
perldoc -q extract
=========
How do I extract URLs?
You can easily extract all sorts of URLs from HTML with
"HTML::SimpleLinkExtor" which handles anchors, images, objects,
frames, and many other tags that can contain a URL. If you need
anything more complex, you can create your own subclass of
"HTML::LinkExtor" or "HTML::Parser". You might even use
"HTML::SimpleLinkExtor" as an example for something specifically
suited to your needs.

You can use URI::Find to extract URLs from an arbitrary text
document.
 
William James

text = <<HERE
A lot of text. <a href="linkurl.html">Link Title</a>
Even more Text.
HERE

if text =~ /<a href="(.*?)">(.*?)<\/a>/m
  printf "%s links to %s.\n", $2, $1
end
 
Jürgen Exner

William said:
text = <<HERE
A lot of text. <a href="linkurl.html">Link Title</a>
Even more Text.
HERE

if text =~ /<a href="(.*?)">(.*?)<\/a>/m

This works for the given example but of course fails for a myriad of other,
probably legitimate examples. See the FAQ, or search Google, for why using
simple regular expressions to parse HTML is not a good idea at all.
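
To make that concrete, here is a small sketch, in Ruby like the quoted snippet, that runs the same pattern against a few variations a browser would still treat as a link. The sample strings and the extract_link helper are made up for illustration; only the regular expression comes from the post.

```ruby
# The pattern from the quoted post, unchanged.
LINK_RE = /<a href="(.*?)">(.*?)<\/a>/m

# Hypothetical sample inputs; each is a plausible way to write the same link.
SAMPLES = {
  "original"           => %q{A lot of text. <a href="linkurl.html">Link Title</a> Even more Text.},
  "single quotes"      => %q{<a href='linkurl.html'>Link Title</a>},
  "extra attribute"    => %q{<a class="nav" href="linkurl.html">Link Title</a>},
  "uppercase tag"      => %q{<A HREF="linkurl.html">Link Title</A>},
  "newline in tag"     => %Q{<a\nhref="linkurl.html">Link Title</a>},
  "trailing attribute" => %q{<a href="linkurl.html" title="x">Link Title</a>},
}

# Hypothetical helper: returns [url, title] when the pattern matches, else nil.
def extract_link(html)
  m = LINK_RE.match(html)
  m && [m[1], m[2]]
end

SAMPLES.each do |label, html|
  result = extract_link(html)
  puts "#{label}: #{result ? result.inspect : 'no match'}"
end
```

Only the original string gives the expected pair. Four of the variations do not match at all, and the trailing-attribute case is arguably worse: it matches but captures the title attribute as part of the URL.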

jue
 
John Bokma

Bill Segraves said:
Normally, I do.

In this case, however, the Ruby troll neglected to mention his code
was written in Ruby, which might have been misleading to the OP,
especially re: "jue" Exner's response. My response was intended for
the benefit of the OP.

Ah, ok, I understand, apologies :)

Bill Segraves said:
For the further benefit of the OP, what could be simpler than the
first example given in the documentation for HTML::TokeParser? This
"correct" code parses HTML with <A> tags and textual information
spread across multiple lines, while the code the Ruby troll posted
fails miserably on similarly-mangled HTML.

Yup, it's a troll. I mean, why is it hanging out in a Perl-related group?
There must be a Ruby group.
 
