HTML::Parser and <p> behaviour?

Geoff Cox · Oct 13, 2004

Hello,

I cannot seem to work out how HTML::Parser deals with <p> text </p>
from an html file ...

It breaks up a paragraph by placing a </p> <p> inside a paragraph of
text in what seems to me to be a random fashion...

any rule at work here?

Cheers

Geoff

A. Sinan Unur · Oct 13, 2004

I cannot seem to work out how HTML::Parser deals with <p> text </p>
from an html file ...

Please read the posting guidelines posted here regularly.

You have asked a stupid question. A stupid question is one that cannot
generate a useful answer. It is in your best interest to ask smart
question, i.e. ones that contain enough information so that someone can
actually help you.

Of course, this is inapplicable if you are just posting for the heck of it
and are not interested in actually getting your question answered.

So, go back and post a small, self-contained script that still exhibits the
problem.

It breaks up a paragraph by placing a </p> <p> inside a paragraph of
text in what seems to me to be a random fashion...

any rule at work here?

In the immortal words of MJD,

If you have `some weird error', the problem is probably with your
frobnitzer.

(David, thanks for the link.)

Sinan.

Geoff Cox · Oct 13, 2004

So, go back and post a small, self-contained script that still exhibits the
problem.

OK - does the code below help? In fact there are 2 questions here.

1. if I have a paragraph of text between <p> and </p> I find that the
text is broken into two parts producing

<p> jajlsdkjklasjdkljakdj </p><p> hakljd laksdj </p>

2. I am trying to parse

<ul>
<li>
<li>
</ul>

The code below produces text with <li> jkjkj </li> but I cannot see
how to put the <li>'s between <ul> and </ul>

Cheers

Geoff

package MyParser;
use base qw(HTML::Parser);
use strict;
use diagnostics;

my ($in_heading,$in_p,$in_li, $fh);

sub register_fh {
$fh = $_[1];
}
sub reset { ($in_heading,$in_p, $in_li)=(0,0)}

sub start {

my ( $self, $tagname, $attr, undef, $origtext ) = @_;

if ( $tagname eq 'h2' ) {
$in_heading = 1;
return;
}

if ( $tagname eq 'p' ) {
$in_p = 1;
return;
}

if ( $tagname eq 'li' ) {
$in_li = 1;
return;
}

if ( $tagname eq 'option' ) {

# print ("\$origtext has value $origtext \n");

main::choice( $attr->{ value } );

}

}

sub end {
my ( $self, $tagname, $origtext ) = @_;
if ( $tagname eq 'h2' ) {
$in_heading = 0;
return;
}

if ( $tagname eq 'p' ) {
$in_p = 0;
return;
}

if ( $tagname eq 'ul' ) {
$in_li = 0;
return;
}

}

sub text {
my ( $self, $origtext ) = @_;
print $fh "<h2>$origtext</h2> \n" if $in_heading;
print $fh "<p>$origtext</p> \n" if $in_p;
print $fh "<li>$origtext</li> \n" if $in_li;

}

package main;

use File::Find;

187 · Oct 13, 2004

A. Sinan Unur said:
Please read the posting guidelines posted here regularly.

You have asked a stupid question. A stupid question is one that cannot
generate a useful answer. It is in your best interest to ask smart
question, i.e. ones that contain enough information so that someone
can actually help you.

Of course, this is inapplicable if you are just posting for the heck
of it and are not interested in actually getting your question
answered.

So, go back and post a small, self-contained script that still
exhibits the problem.

Grated soem sample code would of been nice, the situation was still
described to the point wher someone who has worked with that module
might be able to help.

Your assine tone, as well as your direct insults to the OP, was
completely unwarrented.

You could of just rplied asking for more information, but if you could
not tell the situation from the initial post then I doubt you would of
been able to help. (I would attemt but I am not very familiar with this
module).

There is NO excuse for your tone in this thread.

A. Sinan Unur · Oct 13, 2004

OK - does the code below help? In fact there are 2 questions here.

No.

It is not self-contained. That is, I cannot, without doing extra work, just
run it and see what happens.

Sinan.

A. Sinan Unur · Oct 13, 2004

OK - does the code below help? In fact there are 2 questions here.

1. if I have a paragraph of text between <p> and </p> I find that the
text is broken into two parts producing

Here is the relevant part from your code:

sub text {
my ( $self, $origtext ) = @_;
print $fh "<h2>$origtext</h2> \n" if $in_heading;
print $fh "<p>$origtext</p> \n" if $in_p;
print $fh "<li>$origtext</li> \n" if $in_li;

}

I suspect the handler is being called multiple times, each time with a
different part of the original text. You can test this hypothesis by
putting a debug statement in here.
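A minimal sketch of such a debug statement, assuming the document is fed to the parser by hand in two pieces so that the split is easy to reproduce. The package name DebugParser, the @chunks array and the sample strings are made up for illustration and are not from the original script:

```perl
use strict;
use warnings;

package DebugParser;
use base qw(HTML::Parser);

# Every piece of text the parser hands to the text callback (illustrative).
my @chunks;

sub text {
    my ($self, $origtext) = @_;
    push @chunks, $origtext;
    # The debug statement: report each call and exactly what it received.
    print STDERR "text() call ", scalar(@chunks), ": [$origtext]\n";
}

package main;

my $p = DebugParser->new;

# One paragraph, deliberately delivered to the parser in two pieces.
$p->parse("<p>first half of the paragraph, ");
$p->parse("second half of the paragraph</p>");
$p->eof;

print STDERR "text() was called ", scalar(@chunks), " times\n";
```

If the hypothesis holds, the callback is reported more than once for the single paragraph, and a handler that wraps every call in <p>...</p> will therefore print more than one paragraph.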

Ben Morrow · Oct 13, 2004

Quoth "A. Sinan Unur":
Here is the relevant part from your code:

I suspect the handler is being called multiple times, each time with a
different part of the original text.

....as indeed the HTML::Parser documentation says it will be. You can
prevent this with the ->unborken_text method [typo left because it
amused me]

Ben
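A small sketch of what that could look like, using the correctly spelled unbroken_text method. The package name ParaParser, the @paragraphs array and the sample strings are illustrative assumptions; the input is again fed in two pieces by hand to provoke a split:

```perl
use strict;
use warnings;

package ParaParser;
use base qw(HTML::Parser);

my $in_p = 0;
my @paragraphs;    # one entry per call to text() made inside a <p>

sub start { my ($self, $tag) = @_; $in_p = 1 if $tag eq 'p'; }
sub end   { my ($self, $tag) = @_; $in_p = 0 if $tag eq 'p'; }

sub text {
    my ($self, $text) = @_;
    push @paragraphs, $text if $in_p;
}

package main;

my $p = ParaParser->new;

# Ask the parser to hold text back until the next non-text event,
# so a run of text arrives in a single call to text().
$p->unbroken_text(1);

$p->parse("<p>first half of the paragraph, ");
$p->parse("second half of the paragraph</p>");
$p->eof;

print "<p>$_</p>\n" for @paragraphs;
```

With the option switched on, the whole paragraph should reach text() in one call, so wrapping each call in <p>...</p> no longer splits it. The trade-off is that the parser has to buffer the text until it sees the next tag.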

Tom · Oct 13, 2004

187 said:
Grated soem sample code would of been nice, the situation was still
described to the point wher someone who has worked with that module
might be able to help.

No it wasn't.

Your assine tone, as well as your direct insults to the OP, was
completely unwarrented.

No they weren't. There are too many idiots like you who should be
frequenting comp.lang.basic rather than this newsgroup.

You could of just rplied asking for more information, but if you could
not tell the situation from the initial post then I doubt you would of
been able to help. (I would attemt but I am not very familiar with this
module).

There is NO excuse for your tone in this thread.

There is NO excuse for you and your bad spelling to frequent this
esteemed newsgroup at all. The OP's question was stupid, and in fact
incapable of being answered. Go away.

Geoff Cox · Oct 14, 2004

news:[email protected]:

I suspect the handler is being called multiple times, each time with a
different part of the original text. You can test this hypothesis by
putting a debug statement in here.

You seem to be correct - I have simplified the code and placed a
simple html file (below) in d:\fred and the result appears in
d:\fred\jim and indeed the <p> ... </p>text is there twice. Any ideas
why?

Thanks

Geoff

package MyParser;
use base qw(HTML::Parser);
use strict;
use diagnostics;

my ($in_heading,$in_p,$fh);

sub register_fh {

$fh = $_[1];
}

sub reset { ($in_heading,$in_p)=(0,0)}

sub start {

my ( $self, $tagname, $attr, undef, $origtext ) = @_;

if ( $tagname eq 'h2' ) {
$in_heading = 1;
return;
}

if ( $tagname eq 'p' ) {
$in_p = 1;
return;
}

}

sub end {
my ( $self, $tagname, $origtext ) = @_;

if ( $tagname eq 'h2' ) {
$in_heading = 0;
return;
}

if ( $tagname eq 'p' ) {
$in_p = 0;
return;
}

}

sub text {
my ( $self, $origtext ) = @_;

print $fh "<h2>$origtext</h2> \n" if $in_heading;
print $fh "<p>$origtext</p> \n" if $in_p;

}

package main;

use File::Find;

my $dir = "d:/fred";
my $parser = MyParser->new;

find sub {
return if -d $_;

my $name = $_;
open( OUT, ">>d:/fred/jim/$name" )
|| die "can't open d:/fred/jim/$name: $!";

print OUT ("<html><head><title>test</title>
</head><body> \n");

$parser->register_fh(\*OUT);
$parser->parse_file($_);
$parser->reset;

print OUT ("</body></html> \n");

}, $dir;

--------------- html file ---------------------------
<html>
<head>
<title>test</title>
</head>

<body>

<h2>test file</h2>

<p>The is some text which I am using to test whether para.pl using
HTML::Parser will output all of the text in this paragraph in one
paragraph, or, in two smaller paragraphs.</p>

</body>
</html>

A. Sinan Unur · Oct 14, 2004

You seem to be correct - I have simplified the code and placed a
simple html file (below) in d:\fred and the result appears in
d:\fred\jim and indeed the <p> ... </p>text is there twice. Any ideas
why?

I think you probably want to emit the start and end tags only when the
start and end callbacks are invoked. I tried to shorten your script to deal
only with the p case:

use strict;
use warnings;

package MyParser;
use base qw(HTML::Parser);

my ($in_p, $fh);

sub register_fh { $fh = $_[1]; }

sub start {
    my ($p, $t, $a, undef, $txt ) = @_;

    if ($t eq 'p') {
        $in_p = 1;
        print $fh '<p>';
        return;
    }
}

sub end {
    my ($p, $t, $txt) = @_;

    if ($t eq 'p') {
        $in_p = 0;
        print $fh "</p>\n";
        return;
    }
}

sub text {
    my ($p, $txt) = @_;
    print $fh $txt if ($in_p);
}

package main;

my $p = MyParser->new;
$p->register_fh(\*STDOUT);

print <<HEADER;
<html>
<head>
<title>Test Output</title>
</head>
<body>
HEADER

$p->parse_file(\*DATA);

print <<FOOTER;
</body>
</html>
FOOTER

__DATA__
<html>
<head>
<title>test</title>
</head>

<body>

<h2>test file</h2>

<p>The is some text which I am using to test whether para.pl using
HTML::Parser will output all of the text in this paragraph in one
paragraph, or, in two smaller paragraphs.</p>

</body>
</html>
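The same idea may carry over to the <ul>/<li> question raised earlier in the thread: emit the list tags from the start and end callbacks and only the text from the text callback. The following is a rough sketch, not a tested answer to the original script. The package name ListParser, the helper close_li, the $out buffer and the sample input are all made up for illustration, and it assumes the input omits </li> as in the example given:

```perl
use strict;
use warnings;

package ListParser;
use base qw(HTML::Parser);

my ($in_li, $out) = (0, '');

# Close the currently open list item, if any (hypothetical helper).
sub close_li {
    if ($in_li) {
        $out .= "</li>\n";
        $in_li = 0;
    }
}

sub start {
    my ($self, $tag) = @_;
    if ($tag eq 'ul') {
        $out .= "<ul>\n";
    }
    elsif ($tag eq 'li') {
        close_li();          # a new <li> implicitly ends the previous one
        $out .= '<li>';
        $in_li = 1;
    }
}

sub end {
    my ($self, $tag) = @_;
    if ($tag eq 'li') {
        close_li();
    }
    elsif ($tag eq 'ul') {
        close_li();          # the last item may never have been closed
        $out .= "</ul>\n";
    }
}

sub text {
    my ($self, $text) = @_;
    $out .= $text if $in_li;   # text only, no tags, so chunking is harmless
}

package main;

my $p = ListParser->new;
$p->parse("<ul>\n<li>first item\n<li>second item\n</ul>\n");
$p->eof;
print $out;
```

Because the tags come only from start and end, it does not matter how many times text is called for a single item; the items still end up between <ul> and </ul>.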

Tad McClellan · Oct 14, 2004

187 said:
Grated soem sample code would of been nice,

More than nice. It nearly guarantees a useable answer.

It is in a poster's best interest to do what they can to get a
useable answer.

Your assine tone, as well as your direct insults to the OP, was
completely unwarrented.

I think you are lacking some pertinent information, and your
conclusion makes you look foolish.

You could of just rplied asking for more information,

We have done this dozens of times for this OP.

We have asked this OP many times[1] to see the Posting Guidelines
if he wants the best chance at getting an answer.

There is a significant history here that you appear to be ignorant of.

There is NO excuse for your tone in this thread.

There is NO excuse for acting like you know what has happened here
when you have not been here to see what is happening here, so:

*plonk*

You are speaking from ignorance. Mighty embarrassing in such
a public forum! Perhaps you should have waited until you could
followup on something that you actually know something about.

Geoff has proven a rather persistent disregard for the time
of other people. We are here only to serve his needs, so it
is no big deal if we have to work a little harder or ignore
his threads.

[1] Here are at least 6 such times:

http://groups.google.com/groups?as_q="Geoff Cox" "Posting Guidelines"&as_ugroup=comp.lang.perl.misc

I get the feeling we are "talking to the hand" here...

187 · Oct 14, 2004

Tom said:
187 wrote:

[ snip unwarrented and non-sequitor drivel ]

FYI I've programming, working networks, IT, and help desks since the
80's. You know nothing of me. It's people like /you/ who can only
attempt to form an argument or rebuttal using personal attacks, and try
to bait fights.

I care not for you reply, as you have proven you have nothing of worth
to say. We do no need sheer hate being spread here.

Tad McClellan · Oct 14, 2004

Geoff Cox said:
I cannot seem to work out how HTML::Parser deals with <p> text </p>

It breaks up a paragraph by placing a </p> <p> inside a paragraph of
text in what seems to me to be a random fashion...

If you show us your code, we might be able to help fix it.

Make a short and complete program *that we can run* that shows
this phantom "</p> <p>" thing, and we will explain to you
how it got there.

Why not follow the suggestions given in the Posting Guidelines?

If you had, the question would have been answered already instead
of another of your round-and-round threads where there is not
enough information given to be able to solve the problem.

Post a small program. We will fix it.

187 · Oct 14, 2004

Sorry for my horrible typing. I've only had a total of 4 hours of sleep
the past week and a half

More than nice. It nearly guarantees a useable answer.

It is in a poster's best interest to do what they can to get a
useable answer.

I fully agree. It just seemed to me like the OP at least tried to
explain his problem, though more in words and not so much in code.

I once again apologize if I got a little "off the deep end" back there.

Fred Canis · Oct 14, 2004

Tad said:
More than nice. It nearly guarantees a useable answer.

It is in a poster's best interest to do what they can to get a
useable answer.
True.

I think you are lacking some pertinent information, and your
conclusion makes you look foolish.

I beg to differ. The post in question here wasn't exactly blossoming
with positive vibes, and as such not likely to help the OP. Come on,
even you can surely admit this?

You could of just rplied asking for more information,

We have done this dozens of times for this OP.

We have asked this OP many times[1] to see the Posting Guidelines
if he wants the best chance at getting an answer.

True, but most of which have bene clouded by the rather nagative vibes
there within.

There is a significant history here that you appear to be ignorant of.

If you mean the history of the OP, well I think it's hardly fair to
exact everyone who posts or read here to keep track of the posting
histroy of every poster. It doesn't mean it couldn't be checked, of
course it could, but I do not think it right at all to give someone such
a thorough beating for missing one. I hardly think this is been such a
heinous offense.

There is NO excuse for acting like you know what has happened here
when you have not been here to see what is happening here, so:

*plonk*

Please don't start there. You and others can be great people with a
wealth of knowlege, but some of you are *way* too quick to puul the
plonk-trigger. I think what we have here is a mis understanding, at
least that's how I see it.

You are speaking from ignorance. Mighty embarrassing in such
a public forum! Perhaps you should have waited until you could
followup on something that you actually know something about.

Again, I don't see the point of chastising a poster for not knowing the
posting history of every past poster in this group. It's almost absurd
to expect everyone to know. I regularly frequent this group, rarely
posting though, but a rather regular reader, and I too have missed what
ever incident this OP was apparently involved in.

Maybe best to let thread jsut die instead of going on so bitterly.

187 · Oct 14, 2004

Tad said:
You are speaking from ignorance. Mighty embarrassing in such
a public forum! Perhaps you should have waited until you could
followup on something that you actually know something about.

Geoff has proven a rather persistent disregard for the time
of other people. We are here only to serve his needs, so it
is no big deal if we have to work a little harder or ignore
his threads.

I am sorry I did nto know this at first. I do not have time to keep
track of everything that goes on in the dozens of groups I suscribe in.
It seems illogical to me to exact everyone to automatically be aware of
any and all events. May I should go tosses the Op's name in google (and
set the group to this one.) I might of done jsut that if I wasn't in
such a rush.

I once again want to pologize for how I came out in this. I'm not a bad
person, I am in fact a tech as well. I love Perl and think it's the best
thing for programming since sliced bread, and that it's one of the most
intelligent language I've ver had the privilage to learn and become
proficient in. Before Perl I was just a C/C++ programmer doing mostly
freelance. I learned Perl and completely changed how I code many things
in shell/cgi/etc for both work and personal tasks.

I hope there are no hard feelings.

David H. Adler · Oct 14, 2004

If you mean the history of the OP, well I think it's hardly fair to
exact everyone who posts or read here to keep track of the posting
histroy of every poster. It doesn't mean it couldn't be checked, of
course it could, but I do not think it right at all to give someone such
a thorough beating for missing one. I hardly think this is been such a
heinous offense.

I think tad's point is that if one is going to lambaste people for their
response to a poster, one might check whether there is some reason for
that response, rather than that one should check the posting history
relevant to any given thread.

dha

A. Sinan Unur · Oct 14, 2004

I am sorry I did nto know this at first.
....

I'm not a bad person, I am in fact a tech as well. I love Perl

Well, clearly. Why else would you read this group?

OTOH, I would think that someone who wishes to appear friendly might want
to avoid the nickname '187'.

As for my reaction to the OP, I find it particularly unproductive and
annoying to blame the package rather than look for an error in one's own
code. This was especially so since the OP provided no code of his own which
we could have looked at to test his assertion that the package was somehow
at fault.

Referring to the original post:

I cannot seem to work out how HTML::Parser deals with <p> text </p>
from an html file ...

It breaks up a paragraph by placing a </p> <p> inside a paragraph of
text in what seems to me to be a random fashion...

The "It" in the second paragraph surely refers to HTML::Parser.

I stand by my belief that this is a "stupid" way to approach programming
problems. One should ask

+ Here is what I am doing
+ Here is what I would like to have happen
+ Instead, here is what is happening
+ What am I doing wrong?

Repeatedly ignoring this advice brings to mind the extremely apropos maxim
(I am probably misquoting this):

once is happenstance, twice coincidence, thrice is enemy action

and deserves a somewhat harsher than the usual mild reminder that the OP
read the posting guidelines.

Sinan.

Geoff Cox · Oct 14, 2004

I think you probably want to emit the start and end tags only when the
start and end callbacks are invoked. I tried to shorten your script to deal
only with the p case:

Many thanks for the code below - it works fine and I will try to get to
grips with how it achieves this. You can no doubt see that I do not
have a very good understanding of HTML::Parser!

Do you have any suggestions re possible HTML::Parser tutorial type
places on the net? I have looked but not found anything that starts
far enough back ...

I have now bought the Perl & LWP book by Sean Burke - any others
spring to mind?

Thanks again for your help.

Cheers

Geoff

use strict;
use warnings;

package MyParser;
use base qw(HTML::Parser);

my ($in_p, $fh);

sub register_fh { $fh = $_[1]; }

sub start {
my ($p, $t, $a, undef, $txt ) = @_;

if ($t eq 'p') {
$in_p = 1;
print $fh '<p>';
return;
}
}

sub end {
my ($p, $t, $txt) = @_;

if ($t eq 'p') {
$in_p = 0;
print $fh "</p>\n";
return;
}
}

sub text {
my ($p, $txt) = @_;
print $fh $txt if ($in_p);
}

package main;

my $p = MyParser->new;
$p->register_fh(\*STDOUT);

print <<HEADER;
<html>
<head>
<title>Test Output</title>
</head>
<body>
HEADER

$p->parse_file(\*DATA);

print <<FOOTER;
</body>
</html>
FOOTER

__DATA__
<html>
<head>
<title>test</title>
</head>

<body>

<h2>test file</h2>

<p>The is some text which I am using to test whether para.pl using
HTML::Parser will output all of the text in this paragraph in one
paragraph, or, in two smaller paragraphs.</p>

</body>
</html>

187 · Oct 14, 2004

A. Sinan Unur said:
Well, clearly. Why else would you read this group?

Good point.

OTOH, I would think that someone who wishes to appear friendly might
want to avoid the nickname '187'.

Why is that? This is how I uniquely identify myself. If "187" means
something else that I am not awre of please let me know. Other wise I
can go back ot posting under my name "Al" (which also appears in my
email address.)

As for my reaction to the OP, I find it particularly unproductive and
annoying to blame the package rather than look for an error in one's
own code. This was especially so since the OP provided no code of his
own which we could have looked at to test his assertion that the
package was somehow at fault.

True. I do admit I was quick to judge and I'm sorry for snapping at you.

Referring to the original post:

news:[email protected]:

That has got to be the longest message ID I've ever laid eyes on
(between 'amd '@ax.com'.)

The "It" in the second paragraph surely refers to HTML::Parser.

I stand by my belief that this is a "stupid" way to approach
programming problems. One should ask

+ Here is what I am doing
+ Here is what I would like to have happen
+ Instead, here is what is happening
+ What am I doing wrong?

Repeatedly ignoring this advice brings to mind the extremely apropos
maxim (I am probably misquoting this):

Agreed. No arguements here.

once is happenstance, twice coincidence, thrice is enemy action

and deserves a somewhat harsher than the usual mild reminder that the
OP read the posting guidelines.

All understood. I am once again sorry, and hope there are no hard
feelings.
