Perl HTML searching

Steve · Mar 19, 2010

I started a little project where I need to search web pages for their
text and return the links of those pages to me. I am using
LWP::Simple, HTML::LinkExtor, and Data::Dumper. Basically all I have
done so far is a list of URLs from my search query of a website, but
I want to be able to filter this content based on the pages' contents.
How can I do this? How can I get the content of a web page, and not
just the URL?

Kyle T. Jones · Mar 19, 2010

Steve said:
I started a little project where I need to search web pages for their
text and return the links of those pages to me. I am using
LWP::Simple, HTML::LinkExtor, and Data::Dumper. Basically all I have
done so far is a list of URLs from my search query of a website, but
I want to be able to filter this content based on the pages' contents.
How can I do this? How can I get the content of a web page, and not
just the URL?

my $pagecontents=get("url");

Then you'll have to parse it yourself to pull out whatever stuff you're
interested in...

Cheers.

Jürgen Exner · Mar 19, 2010

Steve said:
I started a little project where I need to search web pages for their
text and return the links of those pages to me. I am using
LWP::Simple, HTML::LinkExtor, and Data::Dumper. Basically all I have
done so far is a list of URLs from my search query of a website, but
I want to be able to filter this content based on the pages' contents.
How can I do this? How can I get the content of a web page, and not
just the URL?

???

I don't understand.

use LWP::Simple;
$content = get("http://www.whateverURL");

will get you exactly the content of that web page and assign it to
$content and apparently you are doing that already.

So what is your problem?

jue

Steve · Mar 19, 2010

Jürgen Exner said:
???

I don't understand.

use LWP::Simple;
$content = get("http://www.whateverURL");

will get you exactly the content of that web page and assign it to
$content and apparently you are doing that already.

So what is your problem?

jue

Sorry, I am a little overwhelmed with the coding so far (I'm not very
good at Perl). I have what you have posted, but my problem is that I
would like to filter that content. Let's say I searched a site
that had 15 news links and 3 of them said "Hello" in the title. I
would want to extract only the links that said "Hello" in the title.

J. Gleixner · Mar 19, 2010

Steve said:
Sorry, I am a little overwhelmed with the coding so far (I'm not very
good at Perl). I have what you have posted, but my problem is that I
would like to filter that content. Let's say I searched a site
that had 15 news links and 3 of them said "Hello" in the title. I
would want to extract only the links that said "Hello" in the title.

'"Hello" in the title'??.. The title element of the HTML????
Or the 'a' element contains 'Hello'?? e.g. <a href="...">Hello Kitty</a>

How are you using HTML::LinkExtor??

That seems like the right choice.

Why are you using Data::Dumper?

That's helpful when debugging, or logging, so how are you using it?

Post your very short example, because there's something you're
missing and no one can tell what that is based on your description.

Kyle T. Jones · Mar 19, 2010

Steve said:
Sorry, I am a little overwhelmed with the coding so far (I'm not very
good at Perl). I have what you have posted, but my problem is that I
would like to filter that content. Let's say I searched a site
that had 15 news links and 3 of them said "Hello" in the title. I
would want to extract only the links that said "Hello" in the title.

Read up on Perl regular expressions.

For instance, taking the above, you might first split it into a
one-line-per-element array -

my @stuff = split /\n/, $content;

then check each line for "Hello" -

foreach (@stuff) {
    if (/Hello/) {
        # do whatever you need with the matching line
    }
}
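The split-and-match idea can be condensed with grep. This is only a minimal sketch: the sample $content and the helper name matching_lines are hypothetical, standing in for whatever get() returned. Line-based matching is also fragile on real HTML, because a tag may span several lines.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the string returned by get().
my $content = join "\n",
    '<a href="/a.html">Hello Kitty</a>',
    '<a href="/b.html">Goodbye</a>',
    '<p>hello again</p>';

# Return the lines of $text that match the compiled regex $pattern.
sub matching_lines {
    my ($text, $pattern) = @_;
    return grep { $_ =~ $pattern } split /\n/, $text;
}

# Case-insensitive match, so both "Hello" and "hello" count.
my @hits = matching_lines($content, qr/hello/i);
print "$_\n" for @hits;
```

Passing the pattern as a qr// object keeps the case-insensitivity flag with the pattern itself rather than with the match operator.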

Cheers.

J. Gleixner · Mar 19, 2010

J. Gleixner said:
'"Hello" in the title'??.. The title element of the HTML????
Or the 'a' element contains 'Hello'?? e.g. <a href="...">Hello Kitty</a>

How are you using HTML::LinkExtor??

That seems like the right choice.

After looking at it further, HTML::LinkExtor only gives the
attributes, not the text that makes up the hyperlink. Seems
like that would be a useful enhancement.

This might help you:

http://cpansearch.perl.org/src/GAAS/HTML-Parser-3.64/eg/hanchors
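One way to get both the href and the link text is HTML::TokeParser, which ships in the same HTML-Parser distribution as the example above. What follows is a minimal sketch, assuming the page is already held in a string; the helper name links_matching and the sample HTML are made up for illustration and are not code from this thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # part of the HTML-Parser distribution

# Return [href, text] pairs for every <a> whose link text matches $pattern.
sub links_matching {
    my ($html, $pattern) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "can't parse document";
    my @found;
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href};          # attribute hash of the <a> tag
        next unless defined $href;
        my $text = $p->get_trimmed_text('/a');   # text up to the closing </a>
        push @found, [$href, $text] if $text =~ $pattern;
    }
    return @found;
}

# Hypothetical sample page standing in for the result of get().
my $sample = <<'HTML';
<html><body>
<a href="http://helloA.com">Hello</a>
<a href="/info/twitter.aspx"><img src="icon.gif"></a>
<a href="helloB.com">Hello Kitty</a>
<a href="/other.html">Goodbye</a>
</body></html>
HTML

my @hits = links_matching($sample, qr/Hello/);
print "$_->[0]\t$_->[1]\n" for @hits;
```

Using a real parser here avoids the line-break and attribute-order problems that plain regex matching runs into.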

Steve · Mar 19, 2010

'"Hello" in the title'??.. The title element of the HTML????
Or the 'a' element contains 'Hello'?? e.g. <a href="...">Hello Kitty</a>

How are you using HTML::LinkExtor??

That seems like the right choice.

Why are you using Data::Dumper?

That's helpful when debugging, or logging, so how are you using it?

Post your very short example, because there's something you're
missing and no one can tell what that is based on your description.

Based on what you all said, I can give a clearer description.
Essentially, I'm trying to search craigslist more efficiently. I want
the link the a tag points to, as well as the description. Here is
the code I have so far, which gets me only the links:
-----------------------------

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::LinkExtor;
use Data::Dumper;

###### VARIABLES ######
my $craigs = "http://seattle.craigslist.org";
my $source = "$craigs/search/sss?query=what+Im+Looking+for&catAbbreviation=sss";
my $browser = 'google-chrome';

###### SEARCH #######

my $page = get("$source");
my $parser = HTML::LinkExtor->new();

$parser->parse($page);
my @links = $parser->links;
open LINKS, ">/home/me/Desktop/links.txt";
print LINKS Dumper \@links;

open READLINKS, "</home/me/Desktop/links.txt";
open OUT, ">/home/me/Desktop/final.txt";
while (<READLINKS>){
    if ( /html/ ){
        my $url = $_;
        for ($url){
            s/\'//g;
            s/^\s+//;
        }

        print OUT "$craigs$url";
    }
}
open BROWSE, "</home/me/Desktop/final.txt";

system ($browser);
foreach(<BROWSE>){
    system ($browser, $_);
}

Steve · Mar 19, 2010

Quoth Steve:

Are you sure craigslist's Terms of Use allow this? Most sites of this
nature don't.

Use 3-arg open.
Use lexical filehandles.
*Always* check the return value of open.

open my $LINKS, ">", "/home/me/Desktop/links.txt"
or die "can't write to 'links.txt': $!";

You may wish to consider using the 'autodie' module from CPAN, which
will do the 'or die' checks for you.
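A minimal sketch of that advice, assuming a throwaway temp directory instead of the Desktop path; the file name and URL list are made up for illustration. autodie has shipped with Perl itself since 5.10.1, and it covers open and close but not print.

```perl
use strict;
use warnings;
use autodie;                     # failed open/close calls now throw
use File::Temp qw(tempdir);

# Hypothetical output location; a temp dir keeps the sketch self-contained.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/links.txt";
my @urls = ('/a.html', '/b.html');

# 3-arg open with a lexical filehandle; no "or die" needed under autodie.
open my $out, '>', $file;
print {$out} "$_\n" for @urls;
close $out;

# Read the file back the same way.
open my $in, '<', $file;
chomp(my @read_back = <$in>);
close $in;

print "read back: @read_back\n";
```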

As above.

Why are you writing the links out to a file only to read them in again?
Just use the array you already have:

for (@links) {
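That loop could be fleshed out roughly as follows. This is a sketch under assumptions: the @links data is hand-written in the shape HTML::LinkExtor's links() returns (one array reference per tag, holding the tag name followed by attribute name/value pairs), and $base is a placeholder site root.

```perl
use strict;
use warnings;

# Hypothetical data shaped like the return value of HTML::LinkExtor's links():
# each element is [ $tag, $attr_name => $url, ... ].
my @links = (
    [ 'a',    href => '/sea/123.html' ],
    [ 'img',  src  => '/images/logo.gif' ],
    [ 'a',    href => '/about/help' ],
    [ 'link', href => '/styles/main.css' ],
);

my $base = 'http://example.org';   # assumed site root, for illustration only

my @wanted;
for my $link (@links) {
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' && defined $attr{href};   # anchors only
    next unless $attr{href} =~ /html/;                # same filter as the original script
    push @wanted, $base . $attr{href};
}

print "$_\n" for @wanted;
```

Working on the array directly removes the need for the intermediate links.txt and final.txt files, and for stripping Data::Dumper's quoting back out of the text.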

As above.

Ben

I have no idea, but it's personal use. I don't see what's so bad about
it; if I were using my web browser I'd be doing the same thing.
Craigslist is just an example.

That's beside the point though; I'm just doing it for fun/practice/
learning. Let's say we are using a different site then, perhaps one
I'm going to make; it makes no difference to me.

So any way I can do this or...?

Steve · Mar 19, 2010

Quoth Steve:

That's not the point. If their TOS say 'no robots' then that means 'no
robots', not 'no robots unless it's for personal use and you can't see
why you shouldn't'. Apart from anything else, a lot of these sites make
money from ads, which you will completely bypass.

I've already suggested using XML::LibXML. Others have pointed you to an
example of using HTML::Parser. Pick one and try it.

Ben

I realize this, I'm not using craigslist. It was the first thing I
could think of for an example. This is for internal/personal use
only, and I don't like how you're labeling me as breaking any TOS for
an _EXAMPLE_. Notice how my home folder is changed to "me"? I'm
putting as little personal information here as possible, hence the
craigslist example.

Peter J. Holzer · Mar 20, 2010

=======

I realize this,

Please quote only the relevant parts of the posting you are responding
to and write your answer directly beneath the part you are referring to.

Nobody knows what "this" is that you realize. From your quoting it looks
like you realize that you should use XML::LibXML or HTML::Parser. But
from the content of your reply it seems more likely you realize that you
should abide by the terms of use of any site you use. If so, you should
have inserted your response at the point I've marked with "======="
above. And if you don't intend to respond to the part about the tools
you should use, don't quote it (and change the subject, since the topic
is now no longer "Perl HTML searching" but "TOS of web pages").

hp

sln · Mar 20, 2010

Are you sure craigslist's Terms of Use allow this? Most sites of this
nature don't.

There is no "Terms of Use" web page making a caller
agree to, or sign, a legal notarized document as a condition of usage.
It's a public record, available to be parsed, quoted or anything else,
by routers, virus scanners, BROWSERs, hosts filters, search engines,
Operating Systems, etc.

As for altering the content and viewing just what the viewer wants,
it's a one-way street. I filter ads, active controls/content, links
and anything else I want to.

Don't make me laugh, this lame phrase is just that -- LAME!

-sln

sln · Mar 20, 2010

Sorry, I am a little overwhelmed with the coding so far (I'm not very
good at Perl). I have what you have posted, but my problem is that I
would like to filter that content. Let's say I searched a site
that had 15 news links and 3 of them said "Hello" in the title. I
would want to extract only the links that said "Hello" in the title.

This might help you. Requires Perl 5.10 or better.

-sln

Output:
Specific Tag/Attr Titles found --
Hello:
"http://helloA.com"
"helloB.com"
no_title:
"/info/twitter.aspx"

All Tag/Attr found --
a-href:
"http://helloA.com"
"/info/twitter.aspx"
"helloB.com"
link-href:
"/includes/css/main.css"

Code:
# -------------------------------------------
# rx_html_href.pl
# -sln, 3/20/2010
#
# Util to extract some attribute/val's from
# html/xml
# -------------------------------------------

use strict;
use warnings;

my ($Name,$Rxmarkup);
InitName();

my $rxopen = "(?: $Name )"; # Open tag with 'href' attrib, cannot be empty alternation

#my $rxopen = "(?: a )"; # Open tag with 'href' attrib, cannot have an empty alternation
my $rxattr = "(?: href )"; # Attribute we seek, cannot have an empty alternation
my $rxclose = "(?: a )"; # Close tag to match with content, cannot have an empty alternation
my $rxtitle = "(?: Hello | )"; # Content Title, can be empty alternation

my %hTitles; # hash of titles => attribute values matching tag open, title, and tag close
my %hHrefs; # hash of tag => attribute values matching tag open expression, not necessarily titles

InitRegex();

##
# open my $fh, '<', 'C:/temp/XML/tennis1.html' or
# die "can't open file for input: $!";
# my $html = join '', <$fh>;
# close $fh;

my $html = join '', <DATA>;

##
ParseHref(\$html);

##
print "\nSpecific Tag/Attr Titles found --\n";
for my $key (keys %hTitles) {
print " $key:\n";
for my $val (@{$hTitles{$key}}) {
print " $val\n";
}
}

print "\nAll Tag/Attr found -- \n";
for my $key (keys %hHrefs) {
print " $key:\n";
for my $val (@{$hHrefs{$key}}) {
print " $val\n";
}
}

exit (0);

##
sub ParseHref
{
my ($markup) = @_;
my (
$url,
$title,
$content,
$tfound,
$lcbpos,
$last_content_pos,
$begin_pos
) = ('','','',0,0,0,0);

## parse loop
while ($$markup =~ /$Rxmarkup/g)
{
## handle content buffer
if (defined $+{C1}) {
## speed it up
$content .= $+{C1};
if (length $+{C2})
{
if ($lcbpos == pos($$markup)) {
$content .= $+{C2};
} else {
$lcbpos = pos($$markup);
pos($$markup) = $lcbpos - 1;
}
}
$last_content_pos = pos($$markup);
next;
}
## content here ... take it off
if (length $content)
{
$begin_pos = $last_content_pos;
## check '<'
if ($content =~ /</) {
## markup in content
#print "Markup '<' in content, da stuff is crap!\n";
}
if ($content =~ /($rxtitle)/x && length $url) {
$tfound = 1;
$title = $1;
$title =~ s/^\s*//;
$title =~ s/\s*$//;
$title = 'no_title' if !length($title);
}
$content = '';
}
## markup here ... take it off
if (defined $+{OPEN}) {
push @{$hHrefs{$+{OPEN}.'-'.$+{ATTR}}}, $+{VAL} ;
$url = $+{VAL};
$tfound = 0;
$title = '';
}
elsif (defined $+{CLOSE}) {
if (length $url && $tfound) {
push @{$hTitles{$title}}, $url;
}
$url = '';
$tfound = 0;
$title = '';
}
} ## end parse loop

## check for leftover content
if (length $content)
{
## check '<'
if ($content =~ /</) {
## markup in content
#print "Markup '<' in left over content, da stuff is crap!\n";
}
}
}

sub InitName
{
my @UC_Nstart = (
"\\x{C0}-\\x{D6}",
"\\x{D8}-\\x{F6}",
"\\x{F8}-\\x{2FF}",
"\\x{370}-\\x{37D}",
"\\x{37F}-\\x{1FFF}",
"\\x{200C}-\\x{200D}",
"\\x{2070}-\\x{218F}",
"\\x{2C00}-\\x{2FEF}",
"\\x{3001}-\\x{D7FF}",
"\\x{F900}-\\x{FDCF}",
"\\x{FDF0}-\\x{FFFD}",
"\\x{10000}-\\x{EFFFF}",
);
my @UC_Nchar = (
"\\x{B7}",
"\\x{0300}-\\x{036F}",
"\\x{203F}-\\x{2040}",
);
my $Nstrt = "[A-Za-z_:".join ('',@UC_Nstart)."]";
my $Nchar = "[\\w:.".join ('',@UC_Nchar).join ('',@UC_Nstart)."-]";
$Name = "(?:$Nstrt$Nchar*)";
}

sub InitRegex
{
$Rxmarkup = qr/
(?:
<
(?:
# Specific markup
(?: (?<OPEN> $rxopen ) \s+[^>]*? (?<=\s) (?<ATTR> $rxattr) \s*=\s* (?<VAL> ".+?"|'.+?')[^>]*? \s* \/?) # OPEN, ATTR, VAL
|(?: (?<CLOSE> \/$rxclose ) \s* ) # CLOSE

# Ordinary exclusionary markup
|(?: \/* $Name \s* \/*)
|(?: $Name (?:\s+(?:".*?"|'.*?'|[^>]*?)+) \s* \/?)
|(?: \?.*?\?)
|(?:
!
(?: # markup types that have '!'
(?: DOCTYPE.*?)
|(?: \[CDATA\[.*?\]\])
|(?: --.*?--)
|(?: \[[A-Z][A-Z\ ]*\[.*?\]\]) # who knows?
|(?: ATTLIST.*?)
|(?: ENTITY.*?)
|(?: ELEMENT.*?)
# add more if necessary
)
)
))
# This alternation handles content
| (?<C1> [^<]*) (?<C2> <?) # C1, C2
/xs;

}

__DATA__
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 # $ \ Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Diso-8859-1">
<META content=3D "MSHTML 6.00.2900.3395" name=3DGENERATOR>

<STYLE></STYLE>
<test name = " thi<s # $ \ is a " test>
</HEAD>
<BODY bgColor=3D#ffffff>

should fix these: # $ \
but not these: ¯
fix some here: &&%#$ &as; &&#a0

<a href="http://helloA.com">Hello</a>

<IMG SRC = "foo.gif" ALT = "A > B">
<IMG SRC = "foo.gif"
ALT = "A > # $ \ B">

<NN & a # $ \>
<AA & # $ \>

<# Just data #>

<![INCLUDE CDATA [ >>>>>\\ # $ \ >>>>>>> ]]>



<link rel="stylesheet" type="text/css" href="/includes/css/main.css">

at root # $ \ > # $ \ level

<a href="/info/twitter.aspx" target="_top">
<img src="/images/icons/icon_twitter.gif" border="0" align="absmiddle">
</a>

<html><body>
<p>Hello
Kitty</p>
<a
href
=
"helloB.com"

Hello</a

</body></html>

sln · Mar 20, 2010

There is no legal need to sign anything.

http://www.craigslist.org/about/terms.of.use

By using the Service in any way, you are agreeing to comply with the TOU.

Whether it is public or private does not matter either.

It is copyrighted either way.

^^^^^^^^^^^^^^
There is nothing copyrighted about an href link. There is
nothing copyrighted about words, html, xml, browsers, nor
anything else that flows through the public airwaves, nor
is air, water or food copyrighted.

If craig has some unique combination of words that may
be considered "artful and unique" and apart from all others, that
may be extracted from their "public" broadcast, they would publish
it as literary content.

Otherwise, the computer rips apart, repackages, transmits data
as it sees fit, unless you think the HOSTS file violates that
"artful and unique" web page.

The owner can impose whatever restrictions they want.

^^^^^^^^^^^^^^^
No, they cannot. Give an example.

This license does not include:
...

BEGIN Browser definition

(b) any collection, aggregation, copying, duplication, display
or derivative use of the Service nor any use of data mining,
robots, spiders, or similar data gathering and extraction tools
for any purpose unless expressly permitted by craigslist.

END Browser definition

Just because you violate the license you've been given does not
make it OK for others to also violate the license.

Just because you say it doesn't make it so.
It's not a movie, music, or literary art. It's a composition
of ordinary off-the-shelf components that can be broken down
and examined. Happens every day; it's public information, and
public information cannot be licensed for which craig has any
patent.

-sln

Peter J. Holzer · Mar 21, 2010

This is getting a bit off-topic, but ...

That may or may not be binding.

^^^^^^^^^^^^^^
There is nothing copyrighted about an href link. There is
nothing copyrighted about words, html, xml, browsers, nor
anything else that flows through the public airwaves, nor
is air, water or food copyrighted.

If craig has some unique combination of words that may
be considered "artful and unique" and apart from all others, that
may be extracted from their "public" broadcast, they would publish
it as literary content.

^^^^^^^^^^^^^^^
No, they cannot. Give an example.

"whatever restrictions they want" is too strong. The copyright law has
some limits.

BEGIN Browser definition ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
END Browser definition

I would assume that viewing stuff in the browser is expressly permitted
by craigslist.

Just because you say it doesn't make it so. It's not a movie, music,
or literary art. It's a composition of ordinary off-the-shelf components
that can be broken down and examined. Happens every day; it's public
information, and public information cannot be licensed for which craig
has any patent.

Don't know about the US, but in Europe "a composition of ordinary ...
information" is more strongly protected by copyright law than "a movie,
music, literary art". Because while the latter need to be "artful and
unique" as you say (in Austrian law the term is "Werkshöhe"), no such
restriction exists for databases. So if you compile a list of the
students of your final year in high school, that's copyrighted. Same for
the data on craigslist.

(Similarly for programs: A "hello world" program is copyrighted, a
literary work of the same originality wouldn't be - but that's not the
point here)

hp

sln · Mar 22, 2010

[snip]

Don't know about the US, but in Europe "a composition of ordinary ...
information" is more strongly protected by copyright law than "a movie,
music, literary art". Because while the latter need to be "artful and
unique" as you say (in Austrian law the term is "Werkshöhe"), no such
restriction exists for databases. So if you compile a list of the
students of your final year in high school, that's copyrighted. Same for
the data on craigslist.

I would say a "list" is just that and nothing more, not copyrighted at
all. A list of students is not unique nor copyrighted. The published
year book is copyrighted as an entity, not the parts. A list of
credit card names and numbers is not copyrighted either and is not
published for legal reasons. Besides that, lists are not unique in the
sense that they are composed of common publicly obtained (private
information or not, but obtained from public sources) items that
individually or collectively cannot be copyrighted. You just can't
say you have invented a unique color from 24-bit registers.

The point at which something becomes copyrightable is blurred.
A word/phrase in a book? Probably not. A sequential paragraph or two
in a book? Probably so. It's unique and highly unlikely to be randomly
duplicated. That is not the case for public information that can be
filtered to create a comparable list. In this case, the dimensions
of information are too easy to duplicate, unlike that of say a few
paragraphs of a book.

It is not likely that public information can be wrapped in a list
structure and its contents declared copyrighted. Copyright label is
attached to everything in general. It doesn't even need to be filed
with the copyright office. When in doubt, just 'say' it's copyrighted
in a flimsy 'Terms Of Use', then blast it out in an uncontrolled public
fashion. Yeah, that's legal to do, but it holds no weight - especially
when the listed items themselves are not copyrighted or trademarked,
and otherwise, specifically public or general-knowledge information in
nature.

-sln

John Bokma · Mar 22, 2010

[snip]

Don't know about the US, but in Europe "a composition of ordinary ...
information" is more strongly protected by copyright law than "a
movie,


If this European law is the same as the Dutch Databanken-recht "database
law", a database is protected under that law if substantial effort has
been put into the compilation of such a database. This is
*not* copyright however, it's a separate law.

I would say a "list" is just that and nothing more, not copyrighted at
all. A list of students is not unique nor copyrighted.

Correct, and under the Dutch law such a list is only protected if
substantial effort has been put into its compilation. So there is
probably no way you can protect a list of 500 students, but most likely
you can if such a list has thousands and thousands of students, and
effort has been put into keeping the addresses of each student up to date,
etc.

IANAL,

Mart van de Wege · Mar 23, 2010

Tad McClellan said:
There is no legal need to sign anything.

http://www.craigslist.org/about/terms.of.use

By using the Service in any way, you are agreeing to comply with the TOU.

Irrelevant.

The protocols do not specify what I should or should not GET from an
HTTP server. If I am using a text-based browser, I don't download
images, for example.

And Terms of Use are nice, but unless you can prove I read them, you
cannot force me to abide by them.

Not using robots is common courtesy; Terms of Use have no legal power to
stop me from using them.

Mart

Randal L. Schwartz · Mar 23, 2010

sln> I would say a "list" is just that and nothing more, not copyrighted at
sln> all.

And since lawyers disagree with you, a smart person would be wise to ignore
you and find a lawyer.

Peter J. Holzer · Mar 23, 2010

If this European law is the same as the Dutch Databanken-recht "database
law", a database is protected under that law if substantial effort has
been put into the compilation of such a database.

Or "if the selection or arrangement are his own creation" (my
translation from Austrian UrhG, §40f). As I read it, only one of these
criteria needs to be fulfilled for protection.

An example of the former category is a telephone book: The "selection or
arrangement" are not the creation of the publisher: They have been the
same for decades. But compiling the data and keeping it up to date is a
substantial investment, so it is protected (unless you are a phone
company - then you have the data anyway, so there is no investment, and
hence no protection (say the Judges[1][2])).

But if I come up with a new way to arrange the data (let's say I sort
them by phone number instead of name; well, that isn't that new, but it
serves as an example), then this new database is protected even if I
didn't have a substantial investment in the data.

This is *not* copyright however, it's a separate law.

Strictly speaking, "copyright" doesn't exist in continental Europe.
What is called "Urheberrecht" in German emerged during the French
revolution and is based on quite different ideas. But since in practice
the difference is almost non-existent and there doesn't seem to be a
commonly accepted English term for this law, I talk about "copyright"
unless the topic is the difference between these laws.

In Austria (and, AFAIK, Germany) IP rights for databases are part of
the "Urheberrechtsgesetz" (§40f, §76, etc. in Austria). It is possible
that the Netherlands made it a separate law, but that doesn't matter -
the contents are (substantially) the same.

Correct, and under the Dutch law such a list is only protected if
substantial effort has been put into its compilation. So there is
probably no way you can protect a list of 500 students,

You can if you come up with a "selection or arrangement of your own
creation". Or maybe you can argue that collecting data about 500
students was a substantial effort (depending on the students and the
data this may be true).

but most likely you can if such a list has thousands and thousands of
students, and effort has been put into keeping the addresses of each
student up to date, etc.

IANAL,

Neither am I.

hp

[1] It was actually about horse races, not phone numbers, but I think
that makes no difference.
[2] The lecturer who mentioned this example wasn't very fond of the law.
If I say that he called it a complete failure I'm only slightly
exaggerating.
