Perl HTML searching

Willem

Mart van de Wege wrote:
) Irrelevant.
)
) The protocols do not specify what I should or should not GET from an
) HTTP server. If I am using a text-based browser, I don't download
) images, for example.
)
) And Terms of Use are nice, but unless you can prove I read them, you
) cannot force me to abide by them.

One could argue that you're *not* allowed to do anything whatsoever
with a web page, *except* when the copyright holder allows it.
Which he does, obviously, through a terms-of-use agreement.

With that basis, viewing a web page *does* imply that you agree with
the terms of use, because if you didn't, then you would not have had
the right to download and view anything.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
sln

sln> I would say a "list" is just that and nothing more, not copyrighted at
sln> all.

And since lawyers disagree with you, a smart person would be wise to ignore
you and find a lawyer.

No need to do that. Below is a general explanation of what a copyright
is. Nope, nothing about "lists" being copyrighted. Even if you could extrapolate
and declare that a single-field, filtered list from a table is a "database", it is
not a distinct collection of uncommon information, nor is it substantive enough
in its nature to even qualify as a database.

It's a huge leap to say a list is copyrighted; if it were, it would be a "related right"
as a database, with extreme limitations and qualifications. Even then, only the EU
recognizes it as such, NOT the United States nor Australia.

In reality, a "list" is just a collection of uncreative facts, nothing more,
not copyrighted at all.

-sln

COPYRIGHT
----------
Copyright is the set of exclusive rights granted to the author or
creator of an original work, including the right to copy, distribute
and adapt the work. These rights can be licensed,
transferred and/or assigned.

The type of works which are subject to copyright has been expanded
over time. Initially only covering books, copyright law was revised
in the 19th century to include maps, charts, engravings, prints,
musical compositions, dramatic works, photographs, paintings,
drawings and sculptures. In the 20th century copyright was expanded
to cover motion pictures, computer programs, sound recordings,
dance and architectural works.

Copyright law is typically designed to protect the fixed expression
or manifestation of an idea rather than the fundamental idea itself.

RELATED RIGHTS
----------------
The term "related rights" is used to describe database rights, public lending
rights (rental rights), artist resale rights and performers’ rights.

Related rights award copyright protection to works which are
not author works, but rather technical media works which allowed
author works to be communicated to a new audience in a different
form. The substance of protection is usually not as great as
it is for author works.

- DATABASES
EU:
In European Union law, a database right is a legal right,
introduced in 1996. Database rights are specifically coded
(i.e. sui generis) laws on the copying and dissemination
of information in computer databases.
... giving specific and separate legal rights
(and limitations) to certain computer records.
Rights afforded to manual records under EU database right
law are similar in format, but not identical,
to those afforded artistic works.

United States:
Uncreative collections of facts are outside of
Congressional authority under Article I, § 8, cl. 8,
i.e. the Copyright Clause, of the United States
Constitution, therefore no database right exists
in the United States.

Australia:
No specific law exists in Australia protecting databases.
Databases may only be protected if they fall under general
copyright law. Australian copyright law protects "compilations",
which can include databases, phone books, etc.
This copyright protection only covers the unique arrangement
of data within the compilation, however, not the data itself.
 
sreservoir

Willem wrote:
> Mart van de Wege wrote:
> ) Irrelevant.
> )
> ) The protocols do not specify what I should or should not GET from an
> ) HTTP server. If I am using a text-based browser, I don't download
> ) images, for example.
> )
> ) And Terms of Use are nice, but unless you can prove I read them, you
> ) cannot force me to abide by them.
>
> One could argue that you're *not* allowed to do anything whatsoever
> with a web page, *except* when the copyright holder allows it.
> Which he does, obviously, through a terms-of-use agreement.
>
> With that basis, viewing a web page *does* imply that you agree with
> the terms of use, because if you didn't, then you would not have had
> the right to download and view anything.

of course, as the terms of use are on the website, reading them implies
agreeing to them. so.
 
Mart van de Wege

Willem said:
> Mart van de Wege wrote:
> ) Irrelevant.
> )
> ) The protocols do not specify what I should or should not GET from an
> ) HTTP server. If I am using a text-based browser, I don't download
> ) images, for example.
> )
> ) And Terms of Use are nice, but unless you can prove I read them, you
> ) cannot force me to abide by them.
>
> One could argue that you're *not* allowed to do anything whatsoever
> with a web page, *except* when the copyright holder allows it.
> Which he does, obviously, through a terms-of-use agreement.

Yeah, but that works two ways. One could also argue that putting
information on a publicly reachable server, using a protocol
specifically designed for publishing, without access controls, implies
that you want the world to read your pages.

IMO, using a robot that doesn't GET faster than a human would is about
as bad as using Lynx.

Mart
 
Kyle T. Jones

Tad said:
> While reading up on regular expressions is certainly a good idea,
> it is a horrid idea for the purposes of parsing HTML.

Ummm. Could you expand on that?

My initial reaction would be something like - I'm pretty sure *any*
method, including the use of HTML::LinkExtor, or XML transform (both
outlined upthread), involves using regular expressions "for the purposes
of parsing HTML".

At best, you're just abstracting the regex work back to the includes.
AFAIK, and feel free to correct me (I'll go take a look at some of the
relevant module code in a bit), every CPAN module that is involved with
parsing HTML uses fairly straightforward regex matching somewhere within
that module's methods.

I think there's an argument that, considering you can do this so easily
(in under 15 lines of code) without the overhead of unnecessary
includes, my way would be more efficient. We can run some benchmarks if
you want (see further down for working code).
> Have you read the FAQ answers that mention HTML?
>
>     perldoc -q HTML
>
> The code below prints "do whatever" 3 times, but there is only one link
> containing "Hello"...

I should have been clearer - the above wasn't a "solution", meant to be
copied, pasted, and put into use - it was just meant to illustrate the
basic operation.

I think this works fine:

#!/usr/bin/perl -w
use strict;
use warnings;
use LWP::Simple;

my $targeturl    = "http://www.google.com";
my $searchstring = "google";
my $contents     = get($targeturl);
my @semiparsed   = split(/href/i, $contents);

foreach (@semiparsed) {
    if ($_ =~ /^\s*=\s*('|")(.*?)('|")/) {
        my $link = $2;
        if ($link =~ /$searchstring/i) {
            print "Link: $link\n";
        }
    }
}

OUTPUT:

Link: http://images.google.com/imghp?hl=en&tab=wi
Link: http://video.google.com/?hl=en&tab=wv
Link: http://maps.google.com/maps?hl=en&tab=wl
Link: http://news.google.com/nwshp?hl=en&tab=wn
Link: http://www.google.com/prdhp?hl=en&tab=wf
Link: http://mail.google.com/mail/?hl=en&tab=wm
Link: http://www.google.com/intl/en/options/
Link:
/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig?hl=en&source=iglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg
Link:
https://www.google.com/accounts/Login?hl=en&continue=http://www.google.com/
Link:
/aclk?sa=L&ai=CbpBLOFeqS_gX3ZmVB_SbuZINs_2WoQHf44OSEMHZnNkTEAEgwVRQpuf5xAJgPaoEhQFP0M0ypnTnQAI3b4WYFAHIvHiLv4iZWVehmiie-78BOdRJQOj6QayRkYYHH4cKXyaNmAp2rmQiiPSHxtEyaVD5OZo41Kxvy6SAeAAF6CIw-SQAFsLT-9iHRfJUcoYh4qlpGqGbC080ZVCWlUUipS404rornNJFmeGlP89sgXehqOfpe8uL&num=1&sig=AGiWqtw95aIEfk5F25oGM2i6eMwkBBuj6Q&q=http://www.google.com/doodle4google/




Or, if you're only interested in the http/https links, you can do this:

#!/usr/bin/perl -w
use strict;
use warnings;
use LWP::Simple;

my $targeturl    = "http://www.google.com";
my $searchstring = "google";
my $contents     = get($targeturl);
my @semiparsed   = split(/href/i, $contents);

foreach (@semiparsed) {
    if ($_ =~ /^\s*=\s*('|")(http.*?)('|")/i) {
        my $link = $2;
        if ($link =~ /$searchstring/i) {
            print "Link: $link\n";
        }
    }
}

OUTPUT:

Link: http://images.google.com/imghp?hl=en&tab=wi
Link: http://video.google.com/?hl=en&tab=wv
Link: http://maps.google.com/maps?hl=en&tab=wl
Link: http://news.google.com/nwshp?hl=en&tab=wn
Link: http://www.google.com/prdhp?hl=en&tab=wf
Link: http://mail.google.com/mail/?hl=en&tab=wm
Link: http://www.google.com/intl/en/options/
Link:
https://www.google.com/accounts/Login?hl=en&continue=http://www.google.com/



Like I said, if you want to present a different method where you push
all the regex work off to an include like HTML::LinkExtor, please post
it, and I can run both using a benchmark module to determine which
method is more efficient. I could be way off, here - maybe using one or
more of the modules mentioned in this thread somehow improves
efficiency. If so, please let me know.
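
For reference, here is a minimal sketch of what such a module-based version
might look like, using HTML::LinkExtor and reusing the $targeturl and
$searchstring values from the code above; the filtering on the href value is
an assumption, and the benchmark harness is left out:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::LinkExtor;

my $targeturl    = "http://www.google.com";
my $searchstring = "google";

my $contents = get($targeturl) or die "Couldn't fetch $targeturl\n";

# With no callback, links() later returns one array ref per link-carrying
# tag, in the form [ $tag, $attr_name => $url, ... ].
my $extor = HTML::LinkExtor->new();
$extor->parse($contents);
$extor->eof;

for my $link ($extor->links) {
    my ($tag, %attrs) = @$link;
    next unless $tag eq 'a' && defined $attrs{href};
    print "Link: $attrs{href}\n" if $attrs{href} =~ /$searchstring/i;
}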

By the way - I can think of wrenches to throw into this solution, too -
addressing the use of ' or " inside a link, for instance - but, then, I
could throw "you prolly won't ever see this but it's theoretically
possible" wrenches into most of the HTML parsing CPAN modules, too, so...

Cheers.
 
Jürgen Exner

Kyle T. Jones said:
> Ummm. Could you expand on that?
>
> My initial reaction would be something like - I'm pretty sure *any*
> method, including the use of HTML::LinkExtor, or XML transform (both
> outlined upthread), involves using regular expressions "for the purposes
> of parsing HTML".

Regular expressions recognize regular languages. But HTML is a
context-free language and therefore cannot be recognized solely by a
regular parser.
Having said that, Perl's extended regular expressions are indeed more
powerful than regular ones, but it is still a bad idea because the
expressions become way too complex.

> At best, you're just abstracting the regex work back to the includes.
> AFAIK, and feel free to correct me (I'll go take a look at some of the
> relevant module code in a bit), every CPAN module that is involved with
> parsing HTML uses fairly straightforward regex matching somewhere within
> that module's methods.

Using REs to do _part_ of the work of parsing any language is a
no-brainer, of course everyone does it e.g. in the tokenizer.

But unless your language is a regular language (and there aren't many
useful regular languages because regular is just too restrictive) you
need additional algorithms that cannot be expressed as REs to actually
parse a context-free or context-sensitive language.

> I think there's an argument that, considering you can do this so easily
> (in under 15 lines of code) without the overhead of unnecessary
> includes, my way would be more efficient. We can run some benchmarks if
> you want (see further down for working code).

But you cannot! Ever heard of the Chomsky Hierarchy? No recollection of
Theory of Computer Languages or Basics of Compiler Construction?
What do people learn in Computer Science today?
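
A toy illustration of the limitation being described here (a hedged sketch
only; the HTML string is made up): a plain pattern has no way to pair up
nested start and end tags, so it either stops too early or runs too far.

#!/usr/bin/perl
use strict;
use warnings;

my $html = '<table>outer <table>inner</table> tail</table>';

# The non-greedy match stops at the FIRST </table>, losing " tail" from
# the outer table; a greedy .* would instead run to the LAST </table>,
# which is just as wrong once sibling tables appear. Pairing the tags up
# correctly needs a counter or a stack, not a plain pattern.
if ($html =~ m{<table>(.*?)</table>}s) {
    print "naive capture: $1\n";    # prints "outer <table>inner"
}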

jue
 
Kyle T. Jones

Jürgen Exner said:
> Regular expressions recognize regular languages. But HTML is a
> context-free language and therefore cannot be recognized solely by a
> regular parser.
> Having said that, Perl's extended regular expressions are indeed more
> powerful than regular ones, but it is still a bad idea because the
> expressions become way too complex.
>
> Using REs to do _part_ of the work of parsing any language is a
> no-brainer, of course everyone does it e.g. in the tokenizer.
>
> But unless your language is a regular language (and there aren't many
> useful regular languages because regular is just too restrictive) you
> need additional algorithms that cannot be expressed as REs to actually
> parse a context-free or context-sensitive language.
>
> But you cannot! Ever heard of the Chomsky Hierarchy? No recollection of
> Theory of Computer Languages or Basics of Compiler Construction?
> What do people learn in Computer Science today?
>
> jue

But isn't the Chomsky Hierarchy completely irrelevant in this (forgive
the pun) context? Surely you "get" that my input is analyzed in terms
of being nothing more or less than a sequence of characters - that it
was originally written in HTML, or any other CFG-based language, is
meaningless - both syntactical and semantical considerations of that
original language are irrelevant in the (again, forgive me) context of
what I'm attempting - which is simply to match one finite sequence of
characters against another finite sequence of characters - I could care
less what those characters mean, what href indicates, what a <body> tag
is, etc.

I don't need to understand English to count the # of e's in the above
passage, right? Neither does Perl.

I believe what you say above is true - to truly "parse" the page AS HTML
is beyond the ability of REs - but I'm not parsing anything AS HTML, if
that makes sense. In fact, to take that a step further, I'm not
"parsing" period - so perhaps it was a mistake for me to use that term.
I meant to use the term colloquially, sorry if that caused any confusion.

Cheers.


" 'Regular expressions' [...] are only marginally related to real
regular expressions. Nevertheless, the term has grown with the
capabilities of our pattern matching engines, so I'm not going to try to
fight linguistic necessity here. I will, however, generally call them
"regexes" (or "regexen", when I'm in an Anglo-Saxon mood)" - Larry Wall
 
Jürgen Exner

Kyle T. Jones said:
Jürgen Exner said:
Kyle T. Jones said:
Tad McClellan wrote:
Steve wrote:
like lets say I searched a site
that had 15 news links and 3 of them said "Hello" in the title. I
would want to extract only the links that said hello in the title.
Read up on perl regular expressions.

While reading up on regular expressions is certainly a good idea,
it is a horrid idea for the purposes of parsing HTML.

Ummm. Could you expand on that?
[...]
Regular expressions recognize regular languages. But HTML is a
context-free language and therefore cannot be recognized solely by a
regular parser. [...]
But you cannot! Ever heard of the Chomsky Hierarchy? No recollection of
Theory of Computer Languages or Basics of Compiler Construction?
What do people learn in Computer Science today?

But isn't the Chomsky Hierarchy completely irrelevant in this (forgive
the pun) context? Surely you "get" that my input is analyzed in terms
of being nothing more or less than a sequence of characters - that it
was originally written in HTML, or any other CFG-based language, is
meaningless - both syntactical and semantical considerations of that
original language are irrelevant in the (again, forgive me) context of
what I'm attempting - which is simply to match one finite sequence of
characters against another finite sequence of characters - I could care
less what those characters mean, what href indicates, what a <body> tag
is, etc.

True. If you know exactly what format your input can possibly have (and
if that input can be described using a finite state automaton) then by
all means yes, go for it. REs are perfect for such tasks.

But that is not what you have been asking, see the Subject of this
thread.
I believe what you say above is true - to truly "parse" the page AS HTML
is beyond the ability of REs - but I'm not parsing anything AS HTML, if
that makes sense. In fact, to take that a step further, I'm not
"parsing" period - so perhaps it was a mistake for me to use that term.
I meant to use the term colloquially, sorry if that caused any confusion.

Well, yes and no. If you are in control of the format and you know
exactly what format is allowed and which formats are not allowed, then
you are right.
But if you are not in control of the input format, e.g. you are reading
from a third-party web page or you get your input data from finance or
marketing or the subsidiary on the opposite side of the world, then your
code must be able to handle any legal HTML because the format could be
changed on you at any time. Which in turn means you must formally parse
the HTML code as HTML code, there is just no way around it.

jue
 
sln

Quoth Tad McClellan said:
"pattern matching" is not at all the same as "parsing".

Regular expressions are *great* for pattern matching.

It is mathematically impossible to do a proper parse of a context-free
language such as HTML with nothing more than regular expressions.

They do not contain the requisite power.

Google for the "Chomsky hierarchy".

HTML allows a table within a table within a table within a table,
to an arbitrary depth. ie. it is not "regular".

Perl's regexen are not regular. With the new features in 5.10 it's easy
to match something like that (it was possible before with (??{}), but
not easy):

perl -E'"[[][[][]]]" =~ m!(?<nest> \[ (?&nest)* \] )!x
and say $+{nest}'
[[][[][]]]
^^^^^^^^^^
All this shows is balanced matching of '[' and ']' characters using the
recursive ability of the 5.10 engine.

Could this be an example where each square bracket is a markup
instruction, like <tag>? It certainly doesn't pertain to the '<' angle
brackets, the parsing delimiter of the instruction.

HTML does not require closing tags, so as embedded markup instructions
interspersed with content are parsed, a guess is made, if errors are
found, about where to end the instruction's effect on the context,
and, in general, where the nesting stops.

There is a separation between the markup instruction and the content
via the markup delimiter '<'. That is the first level of parsing,
extracting the instruction from its delimiter and thereby the
content. The second level is structuring the markup instruction
within the content.

When a complete discrete structure is obtained, the document processor
renders it, a chunk at a time, mid-stream.

The first level, separating markup instructions from their delimiters
(and, as a side effect, exposing content), can be done by any language
that can compare characters.

The second level can be done by any language that can do a stack
or nested variables.
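
As a rough, hedged illustration of those two levels (a toy sketch, not a
compliant parser - it ignores comments, CDATA, and '>' inside quoted
attribute values; the sample markup is made up):

#!/usr/bin/perl
use strict;
use warnings;

my $html = '<p>Hello <b>world</b></p><p>unclosed';

# First level: peel the '<'...'>' delimiters off each markup instruction,
# leaving everything else behind as content.
my @tokens = $html =~ m{ ( <[^>]*> | [^<]+ ) }gx;

# Second level: structure the instructions with a simple stack of open tags.
my @open;
for my $tok (@tokens) {
    if ($tok =~ m{^</\s*(\w+)}) {                  # end tag
        pop @open if @open && $open[-1] eq lc $1;
    }
    elsif ($tok =~ m{^<\s*(\w+)}) {                # start tag
        push @open, lc $1;
        printf "%sopen <%s>\n", '  ' x $#open, $1;
    }
    else {                                          # plain content
        printf "%stext: %s\n", '  ' x @open, $tok;
    }
}
print "left open (not well-formed): @open\n" if @open;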

There is no place for balanced text processing in the first
level of parsing markup instructions. Instructions within
instructions are NOT well-formed and will be rejected by
processors.

So essentially, as slow as it can be, if the aim is to peel away
delimiters to expose the markup instruction, regular expressions
work great. C processors work about 100 - 500 times faster but
don't have the ability to give extended (look-ahead) errors,
nor will they self-correct and continue. In most cases, a
regular expression can identify errant markup instruction syntax
while correctly encapsulating the delimiting expression.
If there is an errant '<' delimiter in content, it is not
well-formed but is still captured as content and easily reported.

Overall, there is no requirement for processors to stop on input
that is not well-formed, but most do because they are full-featured
and compliant. Most go out and bring in includes, do substitutions,
reparse, etc.

No, you won't get that with regular expressions, but there
is nothing stopping anybody from using them to parse out
markup instructions and content, nothing at all. Comparing
characters is all you do.

The reason regex is so slow is that it does pattern matching
with backtracking, grouping, etc.

This doesn't mean it can't compare characters; it sure can,
and in a variable way which allows looking ahead, which has
benefits over state processing.

As long as the regex takes into account ALL possible markup
instructions and delimiters as exclusionary items, there is
no reason why it can't be used to find specific sub-patterns
either in content or in markup instructions themselves.

And it can drive over and re-align after discrete syntax errors without
stopping. All in all, it's a niche parser and perfect at times
when a DOM or SAX is just too cumbersome, too much code overhead
for something simple.

-sln
 
Ted Zlatanov

MvdW> Yeah, but that works two ways. One could also argue that putting
MvdW> information on a publicly reachable server, using a protocol
MvdW> specifically designed for publishing, without access controls, implies
MvdW> that you want the world to read your pages.

(OT but slightly relevant to WWW::Mechanize for example)

Sadly this common-sense interpretation has been eroded by Congress and
courts in the USA. Look for info on the Computer Fraud and Abuse Act,
e.g. http://www.techdirt.com/articles/20100305/0404088432.shtml

Ted
 
Kyle T. Jones

Tad McClellan wrote:


Thanks for the reply - in particular, some of the code you provided and
corrected was interesting and informative.

You make a big deal about my use of the term "parse" throughout - I sure
felt as if I was being chastised. I was kind of surprised that I did
use it, to be honest. I figured I must have used it casually - and
mentioned such in another response:

"I believe what you say above is true - to truly "parse" the page AS
HTML is beyond the ability of REs - but I'm not parsing anything AS
HTML, if that makes sense. In fact, to take that a step further, I'm
not "parsing" period - so perhaps it was a mistake for me to use that
term. I meant to use the term colloquially, sorry if that caused any
confusion. " - me

I'll attempt to stay away from such casual use of that particular term
in future interactions here. As for suggestions that I google "Chomsky
hierarchy" - all my peeps got a kick out of that one.

Cheers.
 
Peter J. Holzer

I think the FAQ answer does a pretty good job of it.




"pattern matching" is not at all the same as "parsing".

Regular expressions are *great* for pattern matching.

It is mathematically impossible to do a proper parse of a context-free
language such as HTML with nothing more than regular expressions.

They do not contain the requisite power.

Google for the "Chomsky hierarchy".

HTML allows a table within a table within a table within a table,
to an arbitrary depth. ie. it is not "regular".

However, for extracting links you don't need to process nested tables.
You can view the file as a linear sequence of tags and text. And this can
be done with a regular grammar; you don't need a context-free grammar.

Try it with this:

-------------------
my $contents = '
<html><body>
<!--
this is NOT a link...
<a href="google.com">Google</a>
-->
</body></html>
';
-------------------

Comments in HTML can also be described by regular expressions - no need
to write a context-free grammar for that.
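
A rough sketch of that remark, reusing the snippet above and ignoring the
pathological cases (e.g. stray "--" sequences inside a comment):

#!/usr/bin/perl
use strict;
use warnings;

my $contents = '
<html><body>
<!--
this is NOT a link...
<a href="google.com">Google</a>
-->
</body></html>
';

# Simple comments are a regular pattern; /s lets .*? span newlines.
$contents =~ s{<!--.*?-->}{}gs;

print $contents;    # the commented-out link is gone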

But this is a good example why you should use an existing module instead
of rolling your own: When you roll your own it is easy to forget about
special cases like this. A module which has been in use by lots of
people for some time is unlikely to contain such a bug.

Also, your code does not address the OP's question.

It tests the URL for a string rather than testing the <a> tag's _contents_.

A tag doesn't have content; an element does.
That is, he wanted to test

<a href="...">...</a>
              ^^^
              here

rather than

<a href="...">...</a>
         ^^^

There are two tags in this snippet:

* <a href="...">
* </a>

The a element consists of the start tag, the end tag and the content,
which is enclosed between the two tags.

For some elements the end tag and for some even the start tag can be
omitted, but the element is still there.
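
As a sketch of that difference, here is one way to match on the element's
content rather than the URL, using HTML::TokeParser; the sample document
and the "Hello" search string are made up to mirror the OP's question:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

my $contents = do { local $/; <DATA> };   # in real use: get($url) via LWP
my $search   = 'Hello';

my $p = HTML::TokeParser->new(\$contents);
while (my $tag = $p->get_tag('a')) {
    my $href = $tag->[1]{href} or next;        # attribute of the start tag
    my $text = $p->get_trimmed_text('/a');     # content up to the end tag
    print "Link: $href\n" if $text =~ /\Q$search\E/i;
}

__DATA__
<html><body>
<!-- not a link: <a href="http://example.org/">Hello</a> -->
<a href="http://example.com/greet">Say Hello to everyone</a>
<a href="http://example.com/other">Something else entirely</a>
</body></html>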

hp
 
Peter J. Holzer

> Well, yes and no. If you are in control of the format and you know
> exactly what format is allowed and which formats are not allowed, then
> you are right.
> But if you are not in control of the input format, e.g. you are reading
> from a third-party web page or you get your input data from finance or
> marketing or the subsidiary on the opposite side of the world, then your
> code must be able to handle any legal HTML because the format could be
> changed on you at any time. Which in turn means you must formally parse
> the HTML code as HTML code, there is just no way around it.

Actually it is much worse. If you read from a third-party web page or
get your input from some crap application finance or marketing happens
to use, you can't formally parse HTML because you won't get HTML. Instead
you will get, as a friend of mine likes to call it, a file with pointy
brackets. So you need a parser which can cope with all the usual errors.

(An HTML5 parser might do this - AIUI, HTML5 is completely deterministic
for every possible input).

hp
 
