HTML Parser

Asoup · Dec 17, 2004

Hullo,

I am working on a perl script that will catch data from a HTML file.
Thus I am using HTML::FormatText module in order to work with the data.

My question is the following; Is there a way in HTML::FormatText that
will allow me to catch a text that is located in, let's say (this is the text I need) ?
If you know of any other tricks please let me know

thanks!

A. Sinan Unur · Dec 17, 2004

I am working on a perl script that will catch data from a HTML file.
Thus I am using HTML::FormatText module in order to work with the data.

Then why do you mention HTML::Parser in your subject line.

My question is the following; Is there a way in HTML::FormatText that
will allow me to catch a text that is located in, let's say (this is the text I need) ?
If you know of any other tricks please let me know thanks!

Use HTML::Parser.

Post your attempt if you encounter any problems.

Sinan,

Tad McClellan · Dec 17, 2004

I am working on a perl script that will catch data from a HTML file.
Thus I am using HTML::FormatText module in order to work with the data.

That module will only help with the output part of your program.

What about getting the input in the first place?

Are you using HTML::TreeBuilder or some other module for that?

My question is the following; Is there a way

Yes.

in HTML::FormatText

No.

that
will allow me to catch a text that is located in, let's say (this is the text I need) ?

Sure, there are many modules that will read HTML into some format
suitable for further processing (but HTML::FormatText isn't one of them).

Have you looked at search.cpan.org yet?

Or even tried looking for Frequently Asked Questions that
mention HTML?

perldoc -q HTML

If you know of any other tricks please let me know

Programming with "tricks" is NOT something professional
programmers aspire to. [1]

[1] at work. Playing around is a whole different story.

Asoup · Dec 17, 2004

Then why do you mention HTML::Parser in your subject line.

I did not mention the module 'HTML::Parser' however, I think it is
related to the post. Anyway, thanks for the reply, however it did not
solve the problem.

Uri Guttman · Dec 17, 2004

A> I did not mention the module 'HTML::Parser' however, I think it is
A> related to the post. Anyway, thanks for the reply, however it did not
A> solve the problem.

i think you are in some sort of cone of foolishness. you mention PARSING
html in your subject and that is covered by HTML::Parser and related
modules. but you say you are using some form of HTML text generating
module to do your PARSING? either you don't know the meaning of the word
parse or you are doing a poor job of communicating what you are trying
to do. the quote above makes absolutely no sense at all as you did
mention the need in the subject. and it is the module you need. and
saying it did not solve the problem is no better than the classic newbie
cry of "it didn't work". how did it not solve the problem? what really
is the problem? do you have a proper grasp of the problem? is it an XY
problem (suspiciously so since you claim to be using one module for the
opposite of its intended purpose)?

i eagerly await your clear answer which explains your thought processes.

uri

Asoup · Dec 17, 2004

Ted,

I've been working on this project on my spare time, and I've tried many
regexps. None of them helped that's why I switched to perl modules
that remove the HTML Tags and leave the plain text (this is just a part
of the success). However, I want specific part of the text to be
displayed...

Also, thanks for the advice, I looked at the perldoc html... it didn't
help much.

Asoup · Dec 17, 2004

What a schmuck... Is there a way to ignore people in this groups? lol
cuz Uri, you're wasting your time trying to make me upset "cry of
it"... lol

Uri Guttman · Dec 17, 2004

A> What a schmuck... Is there a way to ignore people in this groups? lol
A> cuz Uri, you're wasting your time trying to make me upset "cry of
A> it"... lol

too bad. you lose. most of the regulars will now rightfully plonk
you. your reply to tad gave me a clue this would happen. this response
guarantees it. you can plonk me for all i care (if you can ever figure
out how to do it).

goodbye. go learn php.

uri

Uri Guttman · Dec 17, 2004

A> I've been working on this project on my spare time, and I've tried many
A> regexps. None of them helped that's why I switched to perl modules
A> that remove the HTML Tags and leave the plain text (this is just a part
A> of the success). However, I want specific part of the text to be
A> displayed...

the perl FAQ already says you can't parse html with a regex. you would
have saved yourself all that work.

that request has nothing to do with html. so go back and rewrite the
subject and question. this is a good exercise for you. but i doubt you
will do it for various obvious reasons.

A> Also, thanks for the advice, I looked at the perldoc html... it didn't
A> help much.

and it was 'perldoc -q HTML'. there is a major difference from what you
say you did and that command.

uri

Sherm Pendley · Dec 17, 2004

Asoup said:
What a schmuck... Is there a way to ignore people in this groups?

Yes there is - unfortunately for you. Most of the people who might have
helped you will now be using it to ignore you.

you're wasting your time trying to make me upset

He's not trying to make you upset. He's trying to help you clarify your
question. Your subject states that you're parsing HTML, and your message
more or less agrees with that - but you say you're using a module that
*creates* HTML. In other words, your question makes no sense.

If you want a useful answer, you need to ask a coherent question.

sherm--

Jürgen Exner · Dec 17, 2004

Asoup said:
I've been working on this project on my spare time, and I've tried
many regexps None of them helped

Well, yeah, no big surprise there. As has been pointed out in this NG many
_MANY_ times REs are not a suitable tool to parse HTML.

that's why I switched to perl
modules that remove the HTML Tags and leave the plain text (this is
just a part of the success).

The right move. Had you read the FAQ or any of the previous threads about
this very subject you would have found that solution much earlier.

However, I want specific part of the
text to be displayed...

Ok, what have you tried (hint, hint: show us your code!), what did you
expect it to do, and what behaviour did you observe?

One way to do that is included as an example with HTML::Parser.
Unfortunately the examples are not part of the standard installation, so you
will have to download and manually unpack the HTML::Parser module from CPAN.

Also, thanks for the advice, I looked at the perldoc html... it didn't
help much.

What do you mean by "the perldoc html"? I hope you are talking about the
Perl FAQ answer concerning HTML?

jue

Sherm Pendley · Dec 17, 2004

I think you have hard time understanding my question.

Then write a better question.

people. I don't want to turn this discussion into a Arab-Israeli
conflic (sarcasm to you who's name pretty much shows who you are...)...

WTF are you talking about? My last name is English, although the
earliest ancestor I know about came to the US over three centuries ago.

Are you trying to say something about CamelBones? That's a play on words
- Cocoa libraries are packaged in units called frameworks. CamelBones is
a framework for Perl. Perl's mascot is a camel. A camel's "framework" is
bones. Get it?

Let's just continue the discussion... If you ask nicely what am I
trying to make, I may reply to you

How about I ask you nicely to kiss my ass instead?

sherm--

Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org

Asoup · Dec 17, 2004

Jürgen Exner said:
Well, yeah, no big surprise there. As has been pointed out in this NG many
_MANY_ times REs are not a suitable tool to parse HTML.

The right move. Had you read the FAQ or any of the previous threads about
this very subject you would have found that solution much earlier.

Actually, yes, I think that people here would like to argue than
actually help. So I did read some documentation on cpan. And I am going
to study the HTML::Element module closely.

Ok, what have you tried (hint, hint: show us your code!), what did you
expect it to do, and what behaviour did you observe?

One way to do that is included as an example with HTML::Parser.
Unfortunately the examples are not part of the standard installation, so you
will have to download and manually unpack the HTML::Parser module from CPAN.

What do you mean by "the perldoc html"? I hope you are talking about the
Perl FAQ answer concerning HTML?

jue

Here is what I have right now:

#!/usr/bin/perl

use lib '/perl/lib';

use LWP::Simple;
use HTML::TreeBuilder;
use HTML::FormatText;

$output_file = "/public_html/rss.txt";

$html =
get("http://www.yerkir.am/eng/index.php?sub=news_arm&id=11918");

$formatter = HTML::FormatText->new;

$tree_builder = HTML::TreeBuilder->new;

$tree_builder->parse($html);

$text = $formatter->format($tree_builder);

open(FILE,">$output_file");
print FILE $text;
close(FILE);

# It just removes the tags, but now I don't know how to sort and *grab*
the text I need and remove the rest...

Jürgen Exner · Dec 17, 2004

Asoup said:
Jürgen Exner said:

Asoup wrote: [...]
The right move. Had you read the FAQ or any of the previous threads
about this very subject you would have found that solution much
earlier.

Actually, yes, I think that people here would like to argue than
actually help. So I did read some documentation on cpan. And I am
going to study the HTML::Element module closely.

I've no idea what HTML::Element does, but I wonder why you persistently
resist looking at HTML::Parser as suggested by several people.

Here is what I have right now:

#!/usr/bin/perl

use lib '/perl/lib';

use LWP::Simple;
use HTML::TreeBuilder;

I haven't used HTML::TreeBuilder, so I can't comment on that.

[code snipped]

# It just removes the tags, but now I don't know how to sort and
*grab* the text I need and remove the rest...

Well, I suppose after you removed the tags there is nothing left to help you
identify the desired parts. So grab the right text _before_ removing the
tags resp. while you still have the syntax tree or whatever
HTML::TreeBuilder returns.

And once again: the documentation for HTML::Parser already contains an
example for how to extract the body of a <title> element.
<quote>
The next example prints out the text that is inside the <title> element of
an HTML document. Here we start by setting up a start handler. When it sees
the title start tag it enables a text handler that prints any text found and
an end handler that will terminate parsing as soon as the title end tag is
seen:
[...]
More examples are found in the eg/ directory of the HTML-Parser
distribution: the program hrefsub shows how you can edit all links found in
a document; the program htextsub shows how to edit the text only; the
program hstrip shows how you can strip out certain tags/elements and/or
attributes; and the program htext show how to obtain the plain text, but not
any script/style content.
</quote>

It can't be that difficult to adapt those examples for whatever you need to
extract. BTW: did you notice, that you forgot to tell us _which_ part of the
HTML file you want to extract?

jue
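
For reference, a minimal sketch of the handler-based approach that the quoted HTML::Parser documentation describes. It is adapted to parse an in-memory string and collect the title instead of printing it directly; the sample document and the $title variable are assumptions made for illustration, not part of the original example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical sample document; in practice this would be the fetched page.
my $html = '<html><head><title>Sample News Page</title></head>'
         . '<body><p>Body text that should be ignored.</p></body></html>';

my $title = '';

# Start handler: once <title> opens, install the text and end handlers.
sub start_handler {
    my ($tagname, $parser) = @_;
    return unless $tagname eq 'title';

    # Collect entity-decoded text while inside <title>.
    $parser->handler(text => sub { $title .= shift }, 'dtext');

    # Abort parsing as soon as </title> is seen.
    $parser->handler(end => sub {
        my ($tag, $self) = @_;
        $self->eof if $tag eq 'title';
    }, 'tagname,self');
}

my $p = HTML::Parser->new(api_version => 3);
$p->handler(start => \&start_handler, 'tagname,self');
$p->parse($html);
$p->eof;

print "$title\n";
```

The same pattern should carry over to other elements: test for a different tag name (or inspect the attributes by adding "attr" to the argspec) in the start handler, and stop on the matching end tag.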

Asoup · Dec 17, 2004

Jürgen Exner said:
Asoup said:

Jürgen Exner said:

Asoup wrote: [...]
The right move. Had you read the FAQ or any of the previous threads
about this very subject you would have found that solution much
earlier.

Actually, yes, I think that people here would like to argue than
actually help. So I did read some documentation on cpan. And I am
going to study the HTML::Element module closely.

I've no idea what HTML::Element does, but I wonder why you persistently
resist looking at HTML::Parser as suggested by several people.

Here is what I have right now:

#!/usr/bin/perl

use lib '/perl/lib';

use LWP::Simple;
use HTML::TreeBuilder;

I haven't used HTML::TreeBuilder, so I can't comment on that.

[code snipped]

# It just removes the tags, but now I don't know how to sort and
*grab* the text I need and remove the rest...

Well, I suppose after you removed the tags there is nothing left to help you
identify the desired parts. So grab the right text _before_ removing the
tags resp. while you still have the syntax tree or whatever
HTML::TreeBuilder returns.

And once again: the documentation for HTML::Parser already contains an
example for how to extract the body of a <title> element.
<quote>
The next example prints out the text that is inside the <title> element of
an HTML document. Here we start by setting up a start handler. When it sees
the title start tag it enables a text handler that prints any text found and
an end handler that will terminate parsing as soon as the title end tag is
seen:
[...]
More examples are found in the eg/ directory of the HTML-Parser
distribution: the program hrefsub shows how you can edit all links found in
a document; the program htextsub shows how to edit the text only; the
program hstrip shows how you can strip out certain tags/elements and/or
attributes; and the program htext show how to obtain the plain text, but not
any script/style content.
</quote>

It can't be that difficult to adapt those examples for whatever you need to
extract. BTW: did you notice, that you forgot to tell us _which_ part of the
HTML file you want to extract?

jue

jue,

Thanks, I will look at HTML::Parser again.

I am looking at the <body>...</body>

The text that I need is located after: 
And another part that I need is located after:

Gisle Aas · Dec 17, 2004

Asoup said:
The text that I need is located after: 
And another part that I need is located after:

The simplest module to use in this case is probably HTML::TokeParser:

#!/usr/bin/perl -w

use strict;
use HTML::TokeParser;
my $p = HTML::TokeParser->new(shift || die "Usage: $0 <file>");

while (my $t = $p->get_tag("b")) {
    my $class = $t->[1]{class};
    next unless $class && ($class eq "title_news" || $class eq "num2");
    print "$class: " . $p->get_trimmed_text("b") . "\n";
}
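
To try the same loop without a file on disk, HTML::TokeParser also accepts a reference to a string. The sketch below assumes markup in which the wanted text sits inside <b> elements carrying the class names used above; the sample HTML is invented for illustration, since the real page markup is not shown in the thread. It also passes "/b" to get_trimmed_text so that reading stops at the closing tag, which is a choice made for this sample rather than what the code above does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup; the real page structure is an assumption.
my $html = <<'HTML';
<body>
<b class="title_news">Headline goes here</b>
<b>unrelated bold text</b>
<b class="num2">17.12.2004</b>
</body>
HTML

# Passing a scalar reference parses the string instead of opening a file.
my $p = HTML::TokeParser->new(\$html) or die "Cannot parse document";

my @found;
while (my $t = $p->get_tag('b')) {
    # $t->[1] is the hash of attributes on the start tag.
    my $class = $t->[1]{class};
    next unless $class && ($class eq 'title_news' || $class eq 'num2');

    # Read text up to the closing </b> and trim surrounding whitespace.
    push @found, "$class: " . $p->get_trimmed_text('/b');
}

print "$_\n" for @found;
```

With this sample input, only the two <b> elements with a matching class are kept; the bold text without a class attribute is skipped.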

Asoup · Dec 17, 2004

Gisle said:
Asoup said:

The text that I need is located after: 
And another part that I need is located after: 

The simplest module to use in this case is probably HTML::TokeParser:

#!/usr/bin/perl -w

use strict;
use HTML::TokeParser;
my $p = HTML::TokeParser->new(shift || die "Usage: $0 <file>");

while (my $t = $p->get_tag("b")) {
    my $class = $t->[1]{class};
    next unless $class && ($class eq "title_news" || $class eq "num2");
    print "$class: " . $p->get_trimmed_text("b") . "\n";
}

Thank you so much! This was very helpful!

Tad McClellan · Dec 17, 2004

Asoup said:
I think that people here would like to argue than
actually help.

I don't.

I prefer to ignore rather than argue or help when my
"bad attitude detector" starts clanging like this.

So long.

Here is what I have right now:

#!/usr/bin/perl

You should ask for all the help you can get by including
these lines in all of your Perl programs:

use warnings;
use strict;

open(FILE,">$output_file");

You should always, yes *always*, check the return value from open():

open(FILE,">$output_file") or die "could not open '$output_file' $!";
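
Putting this advice together with the earlier suggestion to pick out the text while the parse tree still exists, the script might be restructured roughly as follows. This is only a sketch under stated assumptions: the inline HTML stands in for the page that LWP::Simple's get() would return, the class names are borrowed from the HTML::TokeParser example elsewhere in the thread, and the output path 'rss.txt' is a placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the fetched page (assumed markup).
my $html = '<html><body>'
         . '<b class="title_news">Headline goes here</b>'
         . '<p>Navigation and other text to drop.</p>'
         . '<b class="num2">17.12.2004</b>'
         . '</body></html>';

my $output_file = 'rss.txt';    # assumed path, adjust as needed

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;                     # signal end of input before using the tree

# Select elements from the tree before any tags are thrown away.
my @lines;
for my $class (qw(title_news num2)) {
    for my $el ($tree->look_down(_tag => 'b', class => $class)) {
        push @lines, "$class: " . $el->as_trimmed_text;
    }
}
$tree->delete;                  # free the tree once the text is extracted

# Lexical filehandle, three-argument open, and a checked return value.
open(my $fh, '>', $output_file)
    or die "could not open '$output_file': $!";
print {$fh} "$_\n" for @lines;
close($fh) or die "could not close '$output_file': $!";
```

The point of the design is that look_down runs on the tree, where tag names and attributes are still available, whereas HTML::FormatText output has already lost that information.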

Tad McClellan · Dec 17, 2004

Asoup said:
What a schmuck...

Yes, you certainly are.

Is there a way to ignore people in this groups?

Yes, and most people are now going to ignore all of *your* future posts.

http://www.catb.org/~esr/jargon/html/P/plonk.html

A. Sinan Unur · Dec 17, 2004

Here is what I have right now:

#!/usr/bin/perl

use lib '/perl/lib';

use warnings;
use strict;

use LWP::Simple;
use HTML::TreeBuilder;
use HTML::FormatText;

$output_file = "/public_html/rss.txt";

$html get("http://www.yerkir.am/eng/index.php?sub=news_arm&id=11918");

What is this line supposed to do?

Sinan.
