HTML Parser

Asoup

Hullo,

I am working on a perl script that will catch data from an HTML file.
Thus I am using the HTML::FormatText module in order to work with the data.

My question is the following: is there a way in HTML::FormatText that
will allow me to catch text that is located in, let's say, <p
class=news_title> (this is the text I need) </p>?
If you know of any other tricks please let me know :) thanks!
 
A. Sinan Unur

> I am working on a perl script that will catch data from an HTML file.
> Thus I am using the HTML::FormatText module in order to work with the data.

Then why do you mention HTML::Parser in your subject line?

> My question is the following: is there a way in HTML::FormatText that
> will allow me to catch text that is located in, let's say, <p
> class=news_title> (this is the text I need) </p>?
> If you know of any other tricks please let me know :) thanks!

Use HTML::Parser.

Post your attempt if you encounter any problems.

Sinan
 
Tad McClellan

> I am working on a perl script that will catch data from an HTML file.
> Thus I am using the HTML::FormatText module in order to work with the data.

That module will only help with the output part of your program.

What about getting the input in the first place?

Are you using HTML::TreeBuilder or some other module for that?

> My question is the following: is there a way

Yes.

> in HTML::FormatText

No.

> that
> will allow me to catch text that is located in, let's say, <p
> class=news_title> (this is the text I need) </p>?

Sure, there are many modules that will read HTML into some format
suitable for further processing (but HTML::FormatText isn't one of them).

Have you looked at search.cpan.org yet?

Or even tried looking for Frequently Asked Questions that
mention HTML?

    perldoc -q HTML

> If you know of any other tricks please let me know

Programming with "tricks" is NOT something professional
programmers aspire to. [1]

[1] at work. Playing around is a whole different story.
 
Asoup

> Then why do you mention HTML::Parser in your subject line?

I did not mention the module 'HTML::Parser'; however, I think it is
related to the post. Anyway, thanks for the reply, but it did not
solve the problem.
 
Uri Guttman

A> I did not mention the module 'HTML::Parser'; however, I think it is
A> related to the post. Anyway, thanks for the reply, but it did not
A> solve the problem.

i think you are in some sort of cone of foolishness. you mention PARSING
html in your subject and that is covered by HTML::Parser and related
modules. but you say you are using some form of HTML text generating
module to do your PARSING? either you don't know the meaning of the word
parse or you are doing a poor job of communicating what you are trying
to do. the quote above makes absolutely no sense at all as you did
mention the need in the subject. and it is the module you need. and
saying it did not solve the problem is no better than the classic newbie
cry of "it didn't work". how did it not solve the problem? what really
is the problem? do you have a proper grasp of the problem? is it an XY
problem (suspiciously so since you claim to be using one module for the
opposite of its intended purpose)?

i eagerly await your clear answer which explains your thought processes.

uri
 
Asoup

Tad,

I've been working on this project in my spare time, and I've tried many
regexps :) None of them helped, which is why I switched to perl modules
that remove the HTML tags and leave the plain text (this is just a part
of the success). However, I want a specific part of the text to be
displayed...

Also, thanks for the advice, I looked at the perldoc html... it didn't
help much.
 
Asoup

What a schmuck... Is there a way to ignore people in this group? lol
cuz Uri, you're wasting your time trying to make me upset "cry of
it"... lol
 
Uri Guttman

A> What a schmuck... Is there a way to ignore people in this group? lol
A> cuz Uri, you're wasting your time trying to make me upset "cry of
A> it"... lol

too bad. you lose. most of the regulars will now rightfully plonk
you. your reply to tad gave me a clue this would happen. this response
guarantees it. you can plonk me for all i care (if you can ever figure
out how to do it).

goodbye. go learn php.

uri
 
Uri Guttman

A> I've been working on this project in my spare time, and I've tried many
A> regexps :) None of them helped, which is why I switched to perl modules
A> that remove the HTML tags and leave the plain text (this is just a part
A> of the success). However, I want a specific part of the text to be
A> displayed...

the perl FAQ already says you can't parse html with a regex. you would
have saved yourself all that work.

that request has nothing to do with html. so go back and rewrite the
subject and question. this is a good exercise for you. but i doubt you
will do it for various obvious reasons.

A> Also, thanks for the advice, I looked at the perldoc html... it didn't
A> help much.

and it was 'perldoc -q HTML'. there is a major difference from what you
say you did and that command.

uri
 
Sherm Pendley

Asoup said:
> What a schmuck... Is there a way to ignore people in this group?

Yes there is - unfortunately for you. Most of the people who might have
helped you will now be using it to ignore you.

> you're wasting your time trying to make me upset

He's not trying to make you upset. He's trying to help you clarify your
question. Your subject states that you're parsing HTML, and your message
more or less agrees with that - but you say you're using a module that
only *formats* already-parsed HTML as plain text. In other words, your
question makes no sense.

If you want a useful answer, you need to ask a coherent question.

sherm--
 
Jürgen Exner

Asoup said:
> I've been working on this project in my spare time, and I've tried
> many regexps :) None of them helped

Well, yeah, no big surprise there. As has been pointed out in this NG many
_MANY_ times, REs are not a suitable tool to parse HTML.

> that's why I switched to perl
> modules that remove the HTML tags and leave the plain text (this is
> just a part of the success).

The right move. Had you read the FAQ or any of the previous threads about
this very subject you would have found that solution much earlier.

> However, I want a specific part of the
> text to be displayed...

Ok, what have you tried (hint, hint: show us your code!), what did you
expect it to do, and what behaviour did you observe?

One way to do that is included as an example with HTML::Parser.
Unfortunately the examples are not part of the standard installation, so you
will have to download and manually unpack the HTML-Parser distribution from
CPAN.

> Also, thanks for the advice, I looked at the perldoc html... it didn't
> help much.

What do you mean by "the perldoc html"? I hope you are talking about the
Perl FAQ answer concerning HTML?

jue
 
Sherm Pendley

> I think you have a hard time understanding my question.

Then write a better question.

> people. I don't want to turn this discussion into an Arab-Israeli
> conflict (sarcasm to you whose name pretty much shows who you are...)...

WTF are you talking about? My last name is English, although the
earliest ancestor I know about came to the US over three centuries ago.

Are you trying to say something about CamelBones? That's a play on words
- Cocoa libraries are packaged in units called frameworks. CamelBones is
a framework for Perl. Perl's mascot is a camel. A camel's "framework" is
bones. Get it?

> Let's just continue the discussion... If you ask nicely what I am
> trying to make, I may reply to you :)

How about I ask you nicely to kiss my ass instead?

sherm--

Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
 
Asoup

Jürgen Exner said:
> Well, yeah, no big surprise there. As has been pointed out in this NG many
> _MANY_ times, REs are not a suitable tool to parse HTML.
> [...]
> The right move. Had you read the FAQ or any of the previous threads about
> this very subject you would have found that solution much earlier.

Actually, yes, I think that people here would rather argue than
actually help. So I did read some documentation on CPAN, and I am going
to study the HTML::Element module closely.

> Ok, what have you tried (hint, hint: show us your code!), what did you
> expect it to do, and what behaviour did you observe?
> [...]

Here is what I have right now:

#!/usr/bin/perl

use lib '/perl/lib';

use LWP::Simple;
use HTML::TreeBuilder;
use HTML::FormatText;


$output_file = "/public_html/rss.txt";

# Fetch the page; get() returns undef on failure.
$html =
get("http://www.yerkir.am/eng/index.php?sub=news_arm&id=11918");

$formatter = HTML::FormatText->new;

$tree_builder = HTML::TreeBuilder->new;

$tree_builder->parse($html);
$tree_builder->eof;   # signal end of input so the tree is complete

# Render the whole tree as plain text.
$text = $formatter->format($tree_builder);

open(FILE,">$output_file");
print FILE $text;
close(FILE);

# It just removes the tags, but now I don't know how to sort and *grab*
# the text I need and remove the rest...
 
Jürgen Exner

Asoup said:
> Jürgen Exner said:
>> Asoup wrote: [...]
>> The right move. Had you read the FAQ or any of the previous threads
>> about this very subject you would have found that solution much
>> earlier.
>
> Actually, yes, I think that people here would rather argue than
> actually help. So I did read some documentation on CPAN, and I am
> going to study the HTML::Element module closely.

I've no idea what HTML::Element does, but I wonder why you persistently
resist looking at HTML::Parser as suggested by several people.

> Here is what I have right now:
>
> #!/usr/bin/perl
>
> use lib '/perl/lib';
>
> use LWP::Simple;
> use HTML::TreeBuilder;

I haven't used HTML::TreeBuilder, so I can't comment on that.

> [code snipped]
> # It just removes the tags, but now I don't know how to sort and
> # *grab* the text I need and remove the rest...

Well, I suppose after you removed the tags there is nothing left to help you
identify the desired parts. So grab the right text _before_ removing the
tags, or rather while you still have the syntax tree or whatever
HTML::TreeBuilder returns.
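
To illustrate grabbing the text while the tree still exists, here is a minimal sketch that uses the look_down() method HTML::TreeBuilder trees inherit from HTML::Element. It is not code from the thread: the helper name text_by_class and the sample HTML are made up for illustration, and it assumes the wanted text sits inside the matching element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of every <$tag class="$class">
# element found in $html.
sub text_by_class {
    my ($html, $tag, $class) = @_;

    # Parse the whole document into a tree (this also calls eof for us).
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @texts;
    # In list context look_down() returns every element matching all criteria.
    for my $el ($tree->look_down(_tag => $tag, class => $class)) {
        my $text = $el->as_text;
        $text =~ s/^\s+|\s+$//g;    # trim leading/trailing whitespace
        push @texts, $text;
    }

    $tree->delete;    # release the tree explicitly
    return @texts;
}

# Made-up sample page, for illustration only.
my $sample = '<html><body><b class="title_news"> Sample headline </b>'
           . '<p>Other text</p><b class="num2">7</b></body></html>';

print "title_news: $_\n" for text_by_class($sample, 'b', 'title_news');
print "num2: $_\n"       for text_by_class($sample, 'b', 'num2');
```

The point of the sketch is only the ordering: select the elements first, then turn them into plain text, instead of formatting the whole page and searching the result.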

And once again: the documentation for HTML::Parser already contains an
example for how to extract the body of a <title> element.
<quote>
The next example prints out the text that is inside the <title> element of
an HTML document. Here we start by setting up a start handler. When it sees
the title start tag it enables a text handler that prints any text found and
an end handler that will terminate parsing as soon as the title end tag is
seen:
[...]
More examples are found in the eg/ directory of the HTML-Parser
distribution: the program hrefsub shows how you can edit all links found in
a document; the program htextsub shows how to edit the text only; the
program hstrip shows how you can strip out certain tags/elements and/or
attributes; and the program htext show how to obtain the plain text, but not
any script/style content.
</quote>
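
As a rough idea of what such an adaptation could look like, the sketch below applies the same start/text/end handler pattern to an element selected by tag name and class attribute rather than <title>. It is an illustration under assumptions, not the example from the HTML::Parser documentation: the helper name extract_by_class and the sample HTML are hypothetical, and the tag name must be given in lowercase because the parser reports tag names that way.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: collect the text inside every <$tag class="$class">.
sub extract_by_class {
    my ($html, $tag, $class) = @_;
    my @found;
    my $depth = 0;     # greater than 0 while inside a matching element
    my $buf   = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Start tags: begin collecting at a match, track nested same-name tags.
        start_h => [ sub {
            my ($tagname, $attr) = @_;
            if ($depth) {
                $depth++ if $tagname eq $tag;
            }
            elsif ($tagname eq $tag && ($attr->{class} // '') eq $class) {
                $depth = 1;
                $buf   = '';
            }
        }, 'tagname, attr' ],
        # Text: keep it only while inside a matching element.
        text_h => [ sub { $buf .= $_[0] if $depth }, 'dtext' ],
        # End tags: when the matching element closes, store its text.
        end_h => [ sub {
            my ($tagname) = @_;
            return unless $depth && $tagname eq $tag;
            push @found, $buf if --$depth == 0;
        }, 'tagname' ],
    );

    $p->parse($html);
    $p->eof;           # flush any pending text
    return @found;
}

# Made-up sample markup, for illustration only.
my $sample = '<p><b class=title_news>Sample headline</b> <b class=num2>7</b></p>';
print "title_news: $_\n" for extract_by_class($sample, 'b', 'title_news');
```

Unlike the <title> example, this sketch keeps parsing after the first match so that several elements with the same class can be collected.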

It can't be that difficult to adapt those examples for whatever you need to
extract. BTW: did you notice, that you forgot to tell us _which_ part of the
HTML file you want to extract?

jue
 
Asoup

jue,

Thanks, I will look at HTML::Parser again.

I am looking at the <body>...</body>

The text that I need is located after: <b class=title_news>
And another part that I need is located after: <b class=num2>
 
Gisle Aas

Asoup said:
> The text that I need is located after: <b class=title_news>
> And another part that I need is located after: <b class=num2>

The simplest module to use in this case is probably HTML::TokeParser:

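#!/usr/bin/perl -w

use strict;
use HTML::TokeParser;
my $p = HTML::TokeParser->new(shift || die "Usage: $0 <file>");

# Visit every <b> start tag; $t->[1] is the hash of its attributes.
while (my $t = $p->get_tag("b")) {
    my $class = $t->[1]{class};
    next unless $class && ($class eq "title_news" || $class eq "num2");
    # get_trimmed_text("b") collects text up to the next <b> start tag;
    # pass "/b" instead to stop at the closing </b>.
    print "$class: " . $p->get_trimmed_text("b") . "\n";
}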
 
Asoup

Thank you so much! This was very helpful!
 
Tad McClellan

Asoup said:
> I think that people here would rather argue than
> actually help.

I don't.

I prefer to ignore rather than argue or help when my
"bad attitude detector" starts clanging like this.

So long.

> Here is what I have right now:
>
> #!/usr/bin/perl

You should ask for all the help you can get by including
these lines in all of your Perl programs:

    use warnings;
    use strict;

> open(FILE,">$output_file");


You should always, yes *always*, check the return value from open():

open(FILE,">$output_file") or die "could not open '$output_file' $!";
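
The same check can also be written with the three-argument form of open() and a lexical filehandle, which is common in current Perl. The following is only a sketch: the helper name write_text, the file name, and the text are placeholders, not values from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text to $path, dying with the OS error
# message ($!) if any step fails.
sub write_text {
    my ($path, $text) = @_;
    open(my $fh, '>', $path) or die "could not open '$path': $!";
    print {$fh} $text        or die "could not write to '$path': $!";
    close($fh)               or die "could not close '$path': $!";
    return 1;
}

# Placeholder values for illustration only.
write_text('rss_example.txt', "example headline\n");
```

Keeping the mode ('>') separate from the file name means odd characters in the name cannot change how the file is opened, and the lexical handle is closed automatically when it goes out of scope.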
 
