Extracting HTML Content

masterGaurav · May 1, 2006

Hi,

I have some HTML content. I want to strip off the HTML tags and
retain the raw text.
That's simple:

$data =~ s/<.+?>//gsm;

However, now I have a condition... I want to strip off the tags only
if the content has at least one '$' character.

For example:

<p><a href='#'>This is low priced at $50</a></p>

Should return the raw content, however

<p><a href='#'>This is low priced at 50</a></p>

should return nothing.

Can it be done using one regex, maybe in a loop?

Cheers,
Gaurav

robic0 · May 1, 2006

Submit your request to robic0's RXParse. Do a search on it.

masterGaurav · May 1, 2006

I am new to this forum. I searched for "robic0's RXParse" and found only
one result:

http://www.codecomments.com/archive235-2006-2-806819.html

Can you please explain...

Cheers,
Gaurav

Jürgen Exner · May 1, 2006

masterGaurav said:
Hi,

I have some HTML content. I want to strip off the HTML tags and
retain the raw text.
That's simple:

No, it isn't. Contrary to popular belief, parsing HTML correctly is quite
difficult.

$data =~ s/<.+?>//gsm;

Which fails in many, many circumstances.
Please see the FAQ (perldoc -q html) for why, and for how to parse HTML correctly.

BTW: this is a Very FAQ.

jue

Tad McClellan · May 1, 2006

masterGaurav said:
I have some HTML content. I want to strip off the HTML tags and
retain the raw text.

perldoc -q HTML

How do I remove HTML from a string?

That's simple:

If you think it is simple, then you haven't been thinking
about it long enough.

$data =~ s/<.+?>//gsm;

The FAQ answer has a half dozen snippets of valid HTML that
make that fall flat on its face.

However, now I have a condition... I want to strip off the tags only
if the content has least one '$' character.

Can it be done using one regex,

Why do you care what form the answer takes? Don't you want an
answer if it is not in the form of a regex?

Use a module that understands HTML for processing HTML data.
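
To make the point about valid HTML concrete, here is a minimal sketch (core Perl only) that runs the one-line substitution over three small inputs modelled on the kinds of cases the FAQ discusses: a '>' inside an attribute value, markup inside a comment, and comparison operators inside a script. The helper name naive_strip and the sample strings are made up for illustration; they are not copied from the FAQ.

```perl
use strict;
use warnings;

# The approach from the question: delete anything that looks like a tag.
sub naive_strip {
    my ($html) = @_;
    (my $text = $html) =~ s/<.+?>//gs;
    return $text;
}

# Illustrative samples (assumptions, not taken from the FAQ verbatim).
my @samples = (
    q{<img src="gt.gif" alt="a > b">text},              # '>' inside an attribute
    q{<!-- <b>not a tag</b> -->visible},                # markup inside a comment
    q{<script>if (i < 10 && j > 20) go();</script>done}, # operators in a script
);

# Print each input next to what the regex leaves behind.
for my $html (@samples) {
    printf "%-50s => [%s]\n", $html, naive_strip($html);
}
```

In each case the leftover text contains pieces of markup or mangled script, which is why the replies point to a real HTML parser instead.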

robic0 · May 1, 2006

I am new to this forum. I searched for "robic0's RXParse" and found only
one result:

http://www.codecomments.com/archive235-2006-2-806819.html

Can you please explain...

Cheers,
Gaurav

I meant in this forum. As many will tell you, you can't strip/alter HTML yourself.
It's much more complicated than you think. What robic0 proposes is to create
generalized XHTML/XML modification methods that are safe and guaranteed compliant.
You can peruse his code in this group, and try out what's there on your HTML.
Submit your requests to him (within this group).

Keith Keller · May 1, 2006

I am new to this forum. I searched for "robic0's RXParse" and found only
one result:

Whatever you choose to do, do not use robic0's RXParse module. It is
a good example of how not to code. Instead, do as others have suggested
and read the perldocs on the subject, and then use one of the standard
and well-coded HTML modules available from CPAN.

--keith

robic0 · May 1, 2006

Whatever you choose to do, do not use robic0's RXParse module.

And if he does use it, what are the consequences?
You never used it and don't even know what 'it' is.

a good example of how not to code.

The module is not a code example. It's a working 1.1 standard parser
that you know as much about as how the foreskin appeared on your limp dick.

Instead, do as others have suggested

that you never ever did or have ever done parsing

and read the perldocs on the subject,

you don't know the subject. You don't know what the phrase parsing means at all..

and then use one of the standard

there is no standard other than the w3c ones. Can you name one standard?

and well-coded HTML modules available from CPAN.

and how would you know? You're an idiot, and dumb!

masterGaurav · May 1, 2006

Thanks everybody for your help and pointers.

I was 120% sure that it's impossible to write an HTML parser in one regex.
TagStripper is OK... my code would work for most scenarios, except for cases
where, say, JavaScript exists (e.g.: i < 10 && j > 20).

Well, anyway... I found some workaround for my special case.

Thanks once again for your time.

Cheers,
Gaurav Vaish
http://mastergaurav.org
-----------------
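
The thread never shows the workaround itself. Purely as a guess at its shape, and only for fragments as simple as the two examples in the first post, a sketch might strip the tags first and then keep the text only if it contains a '$'. The function name text_if_dollar is hypothetical, and the tag removal is the naive kind discussed above, so it inherits every limitation already mentioned (scripts, comments, '>' inside attributes).

```perl
use strict;
use warnings;

# Hypothetical helper: return the tag-free text of a simple HTML fragment,
# but only when that text contains at least one '$'; otherwise return ''.
sub text_if_dollar {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;   # naive removal, simple fragments only
    return index($text, '$') >= 0 ? $text : '';
}

# The two examples from the original question.
print text_if_dollar(q{<p><a href='#'>This is low priced at $50</a></p>}), "\n";
print text_if_dollar(q{<p><a href='#'>This is low priced at 50</a></p>}), "\n";
```

Checking for the '$' after stripping, rather than before, means a '$' that appears only inside an attribute value does not count as content.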

robic0 · May 1, 2006

Good that you understand. Neither JavaScript nor anything else you can think of will
make a simple regexp work for parsing.

I didn't mean to refer you to RXParse. Sorry if that was a problem.
Keep learning, many years ahead of you. The shortcuts don't really work
until you're old. You have a long time to get old. Unfortunately, while
you're young, it's hard to know which way to go, hard to know what's real,
hard to cut out wasted time. No fear though, there's only one path,
most travel it.

A. Sinan Unur · May 1, 2006

masterGaurav said:
I am new to this forum.

In that case, please read the posting guidelines.

I searched for "robic0's RXParse"

Why did you search for that?

and found only one result:

http://www.codecomments.com/archive235-2006-2-806819.html

You seem to be under the illusion that comp.lang.perl.misc is a
web-based forum. It is not. It is a Usenet group.

http://en.wikipedia.org/wiki/Usenet

Can you please explain...

You have already been pointed to perldoc -q HTML. I am going
to recommend that you skim through the entire FAQ list at least once.

There are well established CPAN modules you can use to parse HTML.

http://search.cpan.org/~gaas/HTML-Parser-3.54/
http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

robic0 · May 1, 2006

In that case, please read the posting guidelines.

Why did you search for that?

I told him to as you well know. Is that a problem?

You seem to be under the illusion that comp.lang.perl.misc is a
web based forum. It is not. It is a UseNet group.

There is no illusion that web based news readers abound.
Or maybe you don't know that?

http://en.wikipedia.org/wiki/Usenet

Something wrong with his web hit?

You have already been pointed to perldoc -q HTML. I am going
to recommend that you skim through the entire FAQ list at least once.

There are well established CPAN modules you can use to parse HTML.

There are NO, I repeat NO modules that do what he wants to do.
He never intended to parse html. Show me where big boy ...

http://search.cpan.org/~gaas/HTML-Parser-3.54/
http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/

Sinan

Stay clear of Sinan's advice. He means to bust your balls is all.

DJ Stunks · May 1, 2006

robic0 said:
Submit your request to robic0's RXParse. Do a search on it.

I hereby suggest that you, masterGaurav, post some sample HTML and you,
robic0, give us a sample script which demonstrates the use of your
crummy parser to parse it.

and robic, don't give us any bullshit about how expensive your time is
and how you can't afford to show a sample. a solid product demonstration
will pay for itself 10-fold.

with bated breath,
-jp

A. Sinan Unur · May 1, 2006

....

It must be hard to live with an IQ below room temperature.

http://search.cpan.org/src/GAAS/HTML-Parser-3.54/eg/hstrip

Sinan

robic0 · May 5, 2006

There are NO, I repeat NO modules that do what he wants to do.
He never intended to parse html. Show me where big boy ...

$ perl toke.pl
TEXT: $100.00 Dollars
TEXT: discount: $75.25

$ cat toke.pl
use warnings;
use strict;

use HTML::TokeParser;

my $p = HTML::TokeParser->new( \ join('', <DATA>) );
$p->unbroken_text(1);

while (my $token = $p->get_token) {
    if ( $token->[0] eq 'T' && $token->[1] =~ m|\$| ) {
        print('TEXT: ' . $token->[1] . "\n");
    }
}

__DATA__
<html>
<head>
<title>$100.00 Dollars</title>
</head>
<body>
<img src="foo.img" alt="$100.00 USD">
<div>size: 50 x 50</div>
<div>discount: $75.25</div>
</body>
</html>

This is a debug output from RXParse using your data.
Seems you're missing a closing tag somewhere?

SCALAR ref
--------------------
char _:

--------------------
start _: html
--------------------
char _:

--------------------
start _: head
--------------------
char _:

--------------------
start _: title
--------------------
char _: $100.00 Dollars
--------------------
end _: /title
--------------------
char _:

--------------------
end _: /head
--------------------
char _:

robic0 · May 5, 2006

I hereby suggest that you, masterGaurav post some sample HTML and you,
robic0 give us a sample script which demonstrates the use of your
crummy parser to parse it.

and robic, don't give us any bullshit about how expensive your time is
and you can't afford to show a sample, a solid product demonstration
will pay for itself 10-fold.

with bated breath,
-jp

Hey, that's fair. The RXParse code (on this forum) needs to have the usage examples at the top,
before the package declaration, commented out and a "1;" added before the __END__.
Add a 'use strict;' statement after the package name. The file should be named RXParse.pm.

Using Todd W's data sample within this thread, just using the default handlers with debug off
(turning debug off just does syntax checking) yields this nice error:

======================================================
use strict;
use warnings;
use RXParse;

my $parse_ln = '
<html>
<head>
<title>$100.00 Dollars</title>
</head>
<body>
<img src="foo.img" alt="$100.00 USD">
<div>size: 50 x 50</div>
<div>discount: $75.25</div>
</body>
</html>
';

my $p = RXParse->new();

#$p->setDebugMode(1);
$p->parse(\$parse_ln);

__END__
rp_error_05, expected closing tag '/img' (line 10, col 9)

==========================================================

There, now that didn't take a lot of my time.
Un-commenting the debug line above yields this:

SCALAR ref
--------------------
char _:

--------------------
start _: html
--------------------
char _:

--------------------
start _: head
--------------------
char _:

--------------------
start _: title
--------------------
char _: $100.00 Dollars
--------------------
end _: /title
--------------------
char _:

--------------------
end _: /head
--------------------
char _:

--------------------
start _: body
--------------------
char _:

--------------------
start _: img
src = foo.img
alt = $100.00 USD
--------------------
char _:

--------------------
start _: div
--------------------
char _: size: 50 x 50
--------------------
end _: /div
--------------------
char _:

--------------------
start _: div
--------------------
char _: discount: $75.25
--------------------
end _: /div
--------------------
char _:

--------------------
rp_error_05, expected closing tag '/img' (line 10, col 9)

=============================================================

You can expect the same data to be passed to your user defined handlers.
The code that sets the user defined handlers goes like this:

sub setHandlers {
    my ($self, @args) = @_;
    my %oldh = ();
    # Process the (name, handler) pairs only when some were passed in.
    if (scalar(@args)) {
        while (my ($name,$val) = splice (@args, 0, 2)) {
            # Trim surrounding whitespace from the handler name.
            $name =~ s/^\s+//s; $name =~ s/\s+$//s;
            my $hname = "h".lc($name);
            if (exists $self->{$hname}) {
                # Remember the previous handler so the caller can restore it.
                $oldh{$name} = $self->{$hname};
                if (ref($val) eq 'CODE') {
                    $self->{$hname} = $val;
                } else {
                    # if it's not a CODE ref,
                    # just set default handler
                    $self->setDfltHandlers ($name);
                }
            }
        }
    }
    return %oldh;
}

I'll make up a sample user-friendly template for you in a little bit.
The parameters passed to the handlers, as well as the way the handlers are set,
are mostly those typical of Expat. There are many ways this can parse a block.
See the parse method. I wouldn't be so quick to call this a 'crappy' parser, boy!

robic0
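
For readers who want to see the set-and-restore pattern in isolation, here is a self-contained sketch. DemoParser is a hypothetical stand-in and not the real RXParse module: the handler names (start, end, char), the no-op defaults, and the ($parser, $text) arguments passed to the char handler are all assumptions made for illustration, loosely following the Expat style mentioned above.

```perl
use strict;
use warnings;

# DemoParser is a hypothetical stand-in, NOT the real RXParse module.
package DemoParser;

sub new {
    my ($class) = @_;
    my $self = bless {}, $class;
    $self->setDfltHandlers(qw(start end char));
    return $self;
}

# Assumed behaviour: reset the named handlers to no-op defaults.
sub setDfltHandlers {
    my ($self, @names) = @_;
    $self->{ 'h' . lc $_ } = sub { } for @names;
    return;
}

# Install handlers from (name => coderef) pairs and return the old ones.
sub setHandlers {
    my ($self, @args) = @_;
    my %oldh;
    while (my ($name, $val) = splice(@args, 0, 2)) {
        $name =~ s/^\s+|\s+$//g;
        my $hname = 'h' . lc $name;
        next unless exists $self->{$hname};
        $oldh{$name} = $self->{$hname};
        if (ref($val) eq 'CODE') { $self->{$hname} = $val }
        else                     { $self->setDfltHandlers($name) }
    }
    return %oldh;
}

package main;

my $p = DemoParser->new;
my @seen;

# Install a character-data handler that keeps only text containing '$'.
my %old = $p->setHandlers(
    char => sub { my ($parser, $text) = @_; push @seen, $text if $text =~ /\$/ },
);

# Simulate the parser dispatching two text chunks to the handler.
$p->{hchar}->($p, 'size: 50 x 50');
$p->{hchar}->($p, 'discount: $75.25');
print "$_\n" for @seen;

# Put the previous handler back.
$p->setHandlers(%old);
```

Returning the old handlers as a hash lets a caller swap a handler in for one parse and then restore the previous state with a single call.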

l v · May 5, 2006

robic0 wrote:

[snip]

__DATA__
<html>
<head>
<title>$100.00 Dollars</title>
</head>
<body>
<img src="foo.img" alt="$100.00 USD">
<div>size: 50 x 50</div>
<div>discount: $75.25</div>
</body>
</html>

This is a debug output from RXParse using your data.
Seems you're missing a closing tag somewhere?

[snip]

rp_error_05, expected closing tag '/img' (line 10, col 9)

From http://www.w3schools.com/tags/tag_img.asp

HTML <img> tag
Definition and Usage

The img element defines an image.

Differences Between HTML and XHTML

In HTML the <img> tag has no end tag.

In XHTML the <img> tag must be properly closed.

Len
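
As a rough illustration of that difference, the sketch below decides whether a start tag should be followed by an end tag, depending on whether the input is treated as HTML or XHTML. The element list is the set declared EMPTY in the HTML 4.01 DTD; the helper name needs_end_tag and the mode strings are hypothetical, and the sketch deliberately ignores HTML elements whose end tags are merely optional (such as p or li).

```perl
use strict;
use warnings;

# Elements declared EMPTY in the HTML 4.01 DTD: they never take an end tag.
my %html4_empty = map { $_ => 1 } qw(
    area base basefont br col frame hr img input isindex link meta param
);

# Hypothetical helper: does a start tag with this name need a matching
# end tag under the given mode ('html' or 'xhtml')?
sub needs_end_tag {
    my ($name, $mode, $self_closed) = @_;
    return 0 if $self_closed;                                # e.g. <img ... />
    return 0 if $mode eq 'html' && $html4_empty{ lc $name }; # e.g. <img ...>
    return 1;
}

for my $case ( [ 'img', 'html', 0 ], [ 'img', 'xhtml', 0 ],
               [ 'img', 'xhtml', 1 ], [ 'div', 'html', 0 ] ) {
    my ($name, $mode, $closed) = @$case;
    printf "%-4s %-6s self-closed=%d => %s\n", $name, $mode, $closed,
        needs_end_tag($name, $mode, $closed) ? 'expects end tag'
                                             : 'no end tag expected';
}
```

Under this reading, the sample document's img tag is fine as HTML, and the reported error only makes sense if the input is being held to XHTML or XML rules.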

l v · May 5, 2006

robic0 wrote:

[big snip]

<html>
<head>
<title>$100.00 Dollars</title>
</head>
<body>
<img src="foo.img" alt="$100.00 USD">
<div>size: 50 x 50</div>
<div>discount: $75.25</div>
</body>
</html>
rp_error_05, expected closing tag '/img' (line 10, col 9)

[snip]

I'll make up a sample user friendly template for you in a little bit.
The parameters passed to the handlers, as well as setting the handlers
are those typical Expat, mostly. There are many ways this can parse a block.
See the parse method. I wouldn't be so quick to call this a 'crappy' parser boy!

robic0

From http://www.w3schools.com/tags/tag_img.asp

HTML <img> tag
Definition and Usage

The img element defines an image.

Differences Between HTML and XHTML

In HTML the <img> tag has no end tag.

In XHTML the <img> tag must be properly closed.

robic0 · May 6, 2006

Don't know what to say. DOCTYPE? Invoke with a force flag?
Namespace, xmlns? Can't look at the <html> attributes. I'm avoiding DTD imports,
so I won't give <html> that power; I've halted at parsing ENTITY, ATTRIB, ELEMENT
contents for now.

A flag can force HTML, XHTML, or XML. Minor regexp modifications (3 separate). Standards
are changing. I haven't gotten into loading namespaces yet. I'm avoiding that at this stage
while trying to unitize the outer constructs. Whizz around the W3C site a while.

robic0 · May 6, 2006

Some pages for reference:

http://www.w3.org/TR/html4/strict.dtd
http://www.w3schools.com/tags/default.asp
http://www.w3.org/TR/xml11/

Certainly I'm all over this.
Then there's that SGML thing too...
