Extracting HTML Content

masterGaurav

Hi,

I have some HTML content. I want to strip off the HTML tags and
retain the raw text.
That's simple:

$data =~ s/<.+?>//gsm;

However, now I have a condition... I want to strip off the tags only
if the content has at least one '$' character.

For example:

<p><a href='#'>This is low priced at $50</a></p>

Should return the raw content, however

<p><a href='#'>This is low priced at 50</a></p>

should return nothing.

Can it be done using one regex, maybe in a loop?


Cheers,
Gaurav
 
robic0

Submit your request to robic0's RXParse. Do a search on it.
 
Jürgen Exner

masterGaurav said:
Hi,

I have some HTML content. I want to strip off the HTML tags and
retain the raw text.
That's simple:

No, it isn't. Contrary to popular belief, parsing HTML correctly is quite
difficult.
$data =~ s/<.+?>//gsm;

Which fails in many, many circumstances.
Please see the FAQ (perldoc -q html) for why, and for how to parse HTML correctly.

BTW: this is a Very FAQ.

jue
 
Tad McClellan

masterGaurav said:
I have some HTML content. I want to strip off the HTML tags and
retain the raw text.


perldoc -q HTML

How do I remove HTML from a string?

That's simple:


If you think it is simple, then you haven't been thinking
about it long enough.

$data =~ s/<.+?>//gsm;


The FAQ answer has a half dozen snippets of valid HTML that
make that trip on its face.
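
To make the point concrete, here is a minimal sketch (not taken from the FAQ; the sample strings and the helper name naive_strip are invented for illustration) that runs the substitution from the question over a few pieces of ordinary HTML. Only the first case comes out clean; the others leave attribute, script, or comment debris behind.

```perl
use strict;
use warnings;

# The substitution from the original question, wrapped in a helper.
# naive_strip is a hypothetical name used only for this illustration.
sub naive_strip {
    my ($data) = @_;
    $data =~ s/<.+?>//gs;
    return $data;
}

# Invented sample inputs: a '>' inside an attribute value, inside
# script code, and inside a comment all end the match too early.
my @cases = (
    q{<p>price: $50</p>},
    q{<img src="x.png" alt="a > b">text},
    q{<script>if (i < 10 && j > 20) go();</script>done},
    q{<!-- a comment with a > inside -->visible},
);

print "[", naive_strip($_), "]\n" for @cases;
```

The regex has no notion of quoting, comments, or script content, so no amount of tweaking the quantifier fixes all of these at once.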

However, now I have a condition... I want to strip off the tags only
if the content has at least one '$' character.

Can it be done using one regex,


Why do you care what form the answer takes? Don't you want an
answer even if it is not in the form of a regex?

Use a module that understands HTML for processing HTML data.
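
One way this could look for the question as asked, sketched with HTML::TokeParser and HTML::Entities (both ship in the HTML-Parser distribution on CPAN). The function name text_if_has_dollar and the choice to skip script and style content are assumptions made for this example, not something stated in the thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: return the text content of $html if that text
# contains at least one '$', otherwise return the empty string.
sub text_if_has_dollar {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot create parser";
    my ($text, $skip) = ('', 0);

    while (my $token = $p->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' or $type eq 'E') {
            # Assumption: script and style bodies are not wanted as text.
            my $tag = $token->[1];
            $skip = ($type eq 'S' ? 1 : 0) if $tag eq 'script' or $tag eq 'style';
        }
        elsif ($type eq 'T' and not $skip) {
            my $chunk = $token->[1];
            $text .= decode_entities($chunk);
        }
    }
    return $text =~ /\$/ ? $text : '';
}

print text_if_has_dollar(q{<p><a href='#'>This is low priced at $50</a></p>}), "\n";
# prints: This is low priced at $50
```

The '$' test is applied to the extracted text rather than to the raw markup, so a dollar sign that appears only in an attribute value does not count.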
 
robic0

I am new to this forum. I searched for "robic0's RXParse" and found only
one result:

http://www.codecomments.com/archive235-2006-2-806819.html

Can you please explain...


Cheers,
Gaurav

I meant in this forum. As many will tell you, you can't strip or alter HTML yourself.
It's much more complicated than you think. What robic0 proposes is to create
generalized XHTML/XML modification methods that are safe and guarantee compliant output.
You can peruse his code in this group and try out what's there on your HTML.
Submit your requests to him (within this group).
 
Keith Keller

I am new to this forum. I searched for "robic0's RXParse" and found only
one result:

Whatever you choose to do, do not use robic0's RXParse module. It is
a good example of how not to code. Instead, do as others have suggested
and read the perldocs on the subject, and then use one of the standard
and well-coded HTML modules available from CPAN.

--keith
 
robic0

Whatever you choose to do, do not use robic0's RXParse module.
And if he does use it, what are the consequences?
You never used it and don't even know what 'it' is
a good example of how not to code.
The module is not a code example. It's a working 1.1 standard parser
that you know as much about as how the foreskin appeared on your limp dick.
Instead, do as others have suggested
that you never ever did or have ever done parsing
and read the perldocs on the subject,
you don't know the subject. You don't know what the phrase parsing means at all..
and then use one of the standard
there is no standard other than the w3c ones. Can you name one standard?
and well-coded HTML modules available from CPAN.
and how would you know? Your an idiot, and dumb!
 
masterGaurav

Thanks everybody for your help and pointers.

I was 120% sure that it's impossible to write an HTML parser in one regex.
TagStripper is OK... my code would work in most scenarios, except for cases
where, say, JavaScript exists (e.g.: i < 10 && j > 20).

Well, anyway... I found some workaround for my special case.

Thanks once again for your time.


Cheers,
Gaurav Vaish
http://mastergaurav.org
-----------------
 
robic0

Good that you understand. Neither JavaScript nor anything else you can think of
will make a simple regexp work for parsing.

I didn't mean to refer you to RXParse. Sorry if that was a problem.
Keep learning; you have many years ahead of you. The shortcuts don't really work
until you're old. You have a long time to get old. Unfortunately, while
you're young, it's hard to know which way to go, hard to know what's real,
hard to cut out wasted time. No fear though, there's only one path;
most travel it.
 
A. Sinan Unur

masterGaurav said:
I am new to this forum.

In that case, please read the posting guidelines.
I searched for "robic0's RXParse"

Why did you search for that?

You seem to be under the illusion that comp.lang.perl.misc is a
web based forum. It is not. It is a UseNet group.

http://en.wikipedia.org/wiki/Usenet
Can you please explain...

You have already been pointed to perldoc -q HTML. I am going
to recommend that you skim through the entire FAQ list at least once.

There are well established CPAN modules you can use to parse HTML.

http://search.cpan.org/~gaas/HTML-Parser-3.54/
http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
 
robic0

In that case, please read the posting guidelines.


Why did you search for that?

I told him to as you well know. Is that a problem?
You seem to be under the illusion that comp.lang.perl.misc is a
web based forum. It is not. It is a UseNet group.
There is no illusion that web based news readers abound.
Or maybe you don't know that?

Something wrong with his web hit?
You have already been pointed to perldoc -q HTML. I am going
to recommend that you skim through the entire FAQ list at least once.

There are well established CPAN modules you can use to parse HTML.
There are NO, I repeat NO modules that do what he wants to do.
He never intended to parse html. Show me where big boy ...

Stay clear of sinan advice. He means to bust your balls is all.
 
DJ Stunks

robic0 said:
Submit your request to robic0's RXParse. Do a search on it.

I hereby suggest that you, masterGaurav post some sample HTML and you,
robic0 give us a sample script which demonstrates the use of your
crummy parser to parse it.

and robic, don't give us any bullshit about how expensive your time is
and you can't afford to show a sample, a solid product demonstration
will pay for itself 10-fold.

with bated breath,
-jp
 
robic0

There are NO, I repeat NO modules that do what he wants to do.
He never intended to parse html. Show me where big boy ...

$ perl toke.pl
TEXT: $100.00 Dollars
TEXT: discount: $75.25

$ cat toke.pl
use warnings;
use strict;

use HTML::TokeParser;

my $p = HTML::TokeParser->new( \ join('', <DATA>) );
$p->unbroken_text(1);

while (my $token = $p->get_token) {
    if ( $token->[0] eq 'T' && $token->[1] =~ m|\$| ) {
        print('TEXT: ' . $token->[1] . "\n");
    }
}

__DATA__
<html>
<head>
<title>$100.00 Dollars</title>
</head>
<body>
<img src="foo.img" alt="$100.00 USD">
<div>size: 50 x 50</div>
<div>discount: $75.25</div>
</body>
</html>

This is debug output from RXParse using your data.
Seems you're missing a closing tag somewhere?


SCALAR ref
--------------------
char _:

--------------------
start _: html
--------------------
char _:

--------------------
start _: head
--------------------
char _:

--------------------
start _: title
--------------------
char _: $100.00 Dollars
--------------------
end _: /title
--------------------
char _:

--------------------
end _: /head
--------------------
char _:
 
robic0

I hereby suggest that you, masterGaurav post some sample HTML and you,
robic0 give us a sample script which demonstrates the use of your
crummy parser to parse it.

and robic, don't give us any bullshit about how expensive your time is
and you can't afford to show a sample, a solid product demonstration
will pay for itself 10-fold.

with bated breath,
-jp

Hey, that's fair. The RXParse code (on this forum) needs to have the top usage examples,
before the package declaration, commented out and a "1;" added before the __END__.
Add a 'strict' statement after the package name. The file should be named RXParse.pm.

Using Todd W's data sample within this thread, just using the default handlers with debug off
(turning debug off just does syntax checking) yields this nice error:

======================================================
use strict;
use warnings;
use RXParse;

my $parse_ln = '
<html>
<head>
<title>$100.00 Dollars</title>
</head>
<body>
<img src="foo.img" alt="$100.00 USD">
<div>size: 50 x 50</div>
<div>discount: $75.25</div>
</body>
</html>
';

my $p = new RXParse();

#$p->setDebugMode(1);
$p->parse(\$parse_ln);

__END__
rp_error_05, expected closing tag '/img' (line 10, col 9)

==========================================================

There, now that didn't take a lot of my time.
Un-commenting the debug line above yields this:


SCALAR ref
--------------------
char _:

--------------------
start _: html
--------------------
char _:

--------------------
start _: head
--------------------
char _:

--------------------
start _: title
--------------------
char _: $100.00 Dollars
--------------------
end _: /title
--------------------
char _:

--------------------
end _: /head
--------------------
char _:

--------------------
start _: body
--------------------
char _:

--------------------
start _: img
src = foo.img
alt = $100.00 USD
--------------------
char _:

--------------------
start _: div
--------------------
char _: size: 50 x 50
--------------------
end _: /div
--------------------
char _:

--------------------
start _: div
--------------------
char _: discount: $75.25
--------------------
end _: /div
--------------------
char _:

--------------------
rp_error_05, expected closing tag '/img' (line 10, col 9)

=============================================================

You can expect the same data to be passed to your user defined handlers.
The code that sets the user defined handlers goes like this:

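sub setHandlers {
    my ($self, @args) = @_;
    my %oldh = ();
    # Process name/handler pairs only when arguments were passed.
    # (The posted test was "!scalar(@args)", which never entered the loop.)
    if (scalar(@args)) {
        while (my ($name, $val) = splice(@args, 0, 2)) {
            # trim surrounding whitespace from the handler name
            $name =~ s/^\s+//s; $name =~ s/\s+$//s;
            my $hname = "h" . lc($name);
            if (exists $self->{$hname}) {
                # remember the previous handler so the caller can restore it
                $oldh{$name} = $self->{$hname};
                if (ref($val) eq 'CODE') {
                    $self->{$hname} = $val;
                } else {
                    # if its not a CODE ref,
                    # just set default handler
                    $self->setDfltHandlers($name);
                }
            }
        }
    }
    return %oldh;
}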

I'll make up a sample user-friendly template for you in a little bit.
The parameters passed to the handlers, as well as the way the handlers are set,
are mostly those typical of Expat. There are many ways this can parse a block.
See the parse method. I wouldn't be so quick to call this a 'crappy' parser boy!

robic0
 
l v

robic0 wrote:

[snip]
__DATA__
<html>
<head>
<title>$100.00 Dollars</title>
</head>
<body>
<img src="foo.img" alt="$100.00 USD">
<div>size: 50 x 50</div>
<div>discount: $75.25</div>
</body>
</html>

This is debug output from RXParse using your data.
Seems you're missing a closing tag somewhere?

[snip]

rp_error_05, expected closing tag '/img' (line 10, col 9)

From http://www.w3schools.com/tags/tag_img.asp

HTML <img> tag
Definition and Usage

The img element defines an image.

Differences Between HTML and XHTML

In HTML the <img> tag has no end tag.

In XHTML the <img> tag must be properly closed.

Len
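
The HTML/XHTML difference quoted above can be checked mechanically. The following is only a rough sketch, assuming HTML::TokeParser is installed; the function name unclosed_tags is made up for this example, and the element list is the set declared EMPTY in the HTML 4.01 DTD. It treats those elements as needing no end tag and reports any other start tags that are still open at the end of the input.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Elements declared EMPTY in HTML 4.01: they never take an end tag.
my %VOID = map { $_ => 1 } qw(
    area base basefont br col frame hr img input isindex link meta param
);

# Hypothetical helper: list start tags that were never closed.
# This is a simple stack check, not a validator; it ignores HTML's
# optional end tags (for example </p> and </li>).
sub unclosed_tags {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot create parser";
    my @open;
    while (my $t = $p->get_token) {
        if ($t->[0] eq 'S') {
            push @open, $t->[1] unless $VOID{ $t->[1] };
        }
        elsif ($t->[0] eq 'E') {
            pop @open if @open and $open[-1] eq $t->[1];
        }
    }
    return @open;
}

my $doc = q{<html><head><title>$100.00 Dollars</title></head>
<body><img src="foo.img" alt="$100.00 USD">
<div>size: 50 x 50</div><div>discount: $75.25</div></body></html>};

my @left = unclosed_tags($doc);
print @left ? "unclosed: @left\n" : "nothing left open\n";
```

Read as HTML, the sample data from this thread leaves nothing open; the missing '/img' is an error only under XML or XHTML rules.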
 
l v

robic0 wrote:

[big snip]
<html>
<head>
<title>$100.00 Dollars</title>
</head>
<body>
<img src="foo.img" alt="$100.00 USD">
<div>size: 50 x 50</div>
<div>discount: $75.25</div>
</body>
</html>
rp_error_05, expected closing tag '/img' (line 10, col 9)

[snip]

I'll make up a sample user-friendly template for you in a little bit.
The parameters passed to the handlers, as well as the way the handlers are set,
are mostly those typical of Expat. There are many ways this can parse a block.
See the parse method. I wouldn't be so quick to call this a 'crappy' parser boy!

robic0

From http://www.w3schools.com/tags/tag_img.asp

HTML <img> tag
Definition and Usage

The img element defines an image.

Differences Between HTML and XHTML

In HTML the <img> tag has no end tag.

In XHTML the <img> tag must be properly closed.
 
robic0

robic0 wrote:

[snip]
__DATA__
<html>
<head>
<title>$100.00 Dollars</title>
</head>
<body>
<img src="foo.img" alt="$100.00 USD">
<div>size: 50 x 50</div>
<div>discount: $75.25</div>
</body>
</html>


This is debug output from RXParse using your data.
Seems you're missing a closing tag somewhere?

[snip]

rp_error_05, expected closing tag '/img' (line 10, col 9)

From http://www.w3schools.com/tags/tag_img.asp

HTML <img> tag
Definition and Usage

The img element defines an image.

Differences Between HTML and XHTML

In HTML the <img> tag has no end tag.

In XHTML the <img> tag must be properly closed.

Len

Don't know what to say. DOCTYPE? Invoke with a force flag?
Namespace, xmlns? Can't look at <html attr's>. Avoiding dtd imports
so won't give <html> power, if I've halted at parsing ENTITY, ATTRIB, ELEMENT
contents for now.

A flag can force html, xhtml, xml. Minor regexp modifications (3 separate). Standards
changing. I haven't gotten into loading namespace yet. Avoiding that at this stage
trying to unitize the outer constructs. Wizz around w3c site a while.

Some pages for reference:

http://www.w3.org/TR/html4/strict.dtd
http://www.w3schools.com/tags/default.asp
http://www.w3.org/TR/xml11/

Certainly I'm all over this.
Then there's that SGML thing too...
 
