Pulling out data between <TD> tags using regular expressions

tdmailbox

If I had this tag and wanted to return 123, how would I do it? I have
tried countless methods but cannot get only the 123 without the
<TD> tags.

<TD class=tblform3 id=L_listing width=23>123</TD>

After 3 hours I am giving up and asking the experts.
 
Gunnar Hjalmarsson

If I had this tag and wanted to return 123 how would I do it? I have
tried countless methods but can not get the only the 123 without the
<TD> tags

<TD class=tblform3 id=L_listing width=23>123</TD>

After 3 hours I am giving up and asking the experts.

Did you study the applicable docs during those 3 hours?

perldoc perlrequick
perldoc perlretut
perldoc perlre
perldoc -f m
perldoc perlop

Or did you read this FAQ entry

perldoc -q "remove HTML"

which lets you know that you'd better think twice before attempting to
use regexes for this task?

If you have studied those documents, please post the code you have and
somebody may be able to help you fix it.
 
Eric Schwartz

If I had this tag and wanted to return 123 how would I do it? I have
tried countless methods but can not get the only the 123 without the
<TD> tags

<TD class=tblform3 id=L_listing width=23>123</TD>

After 3 hours I am giving up and asking the experts.

If you'd asked your computer, you'd have had the answer much faster:

perldoc -q HTML

And the first returned result is:

"How do I remove HTML from a string?"

Which is exactly what you need. If you get in the habit of searching
your local documentation first, then you'll get better answers faster,
as you won't have to wait for an answer here, and also the people who
can give you the best answers to your questions are tired of answering
them all the time, which is why they wrote the FAQ in the first place!
So if you ask FAQs here, then you will by definition only get the
less-experienced people answering your questions, as a rule.

But I'm feeling generous, and I'd been meaning to poke at
HTML::Parser for a while anyhow. So I whipped up this little example:

#!/usr/bin/perl
use warnings;
use strict;
use HTML::Parser ();

sub start_handler
{
    return if shift ne "td";    # only <td> start tags are of interest
    my $self = shift;
    $self->handler(text => sub { print shift }, "dtext");    # print decoded text
    $self->handler(end => sub { shift->eof if shift eq "td"; },    # stop at </td>
                   "tagname,self");
}

my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => \&start_handler, "tagname, self" );
$p->parse( <<EODATA );
<TD class=tblform3 id=L_listing width=23>123</TD>
EODATA
print "\n";
__END__

For future reference, if you have a problem, you're going to get the
best results here if you can create an example of it that looks
something like that-- short (I went to 21 lines, and that's about as
big as I try to let them get), complete, and clearly state what is
happening, and how that differs from what you wanted to happen.

Also, note that the above example stops parsing after the first </TD>;
if you are going to parse text containing multiple TD elements, you'll
want to read the HTML::Parser documentation to find out better ways of
doing that.
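
As a rough sketch of one such way, the following collects the text of
every TD element instead of stopping at the first. It assumes
HTML::Parser is installed; the names @cells and $in_td and the two-cell
sample row are only for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::Parser ();

my @cells;        # text of each <td> seen so far
my $in_td = 0;    # true while the parser is inside a <td>

sub start_h { if ($_[0] eq 'td') { $in_td = 1; push @cells, '' } }
sub text_h  { $cells[-1] .= $_[0] if $in_td }    # text may arrive in chunks
sub end_h   { $in_td = 0 if $_[0] eq 'td' }

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&start_h, 'tagname' ],    # tag names arrive lowercased
    text_h      => [ \&text_h,  'dtext'   ],    # text with entities decoded
    end_h       => [ \&end_h,   'tagname' ],
);

# Illustrative sample input (single-quoted heredoc, so $799 is literal)
$p->parse(<<'EODATA');
<TR><TD class=tblform3 id=L_listing width=23>123</TD>
<TD class=tblform3 id=L_listnum width=106>$799,000</TD></TR>
EODATA
$p->eof;

print "$_\n" for @cells;
```

This does not handle nested tables; it only shows the general shape.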

-=Eric
 
Gunnar Hjalmarsson

Eric said:
use HTML::Parser ();

sub start_handler
{
return if shift ne "td";
my $self = shift;
$self->handler(text => sub { print shift }, "dtext");
$self->handler(end => sub { shift->eof if shift eq "td"; },
"tagname,self");
}

my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => \&start_handler, "tagname, self" );
$p->parse( <<EODATA );
<TD class=tblform3 id=L_listing width=23>123</TD>
EODATA
print "\n";

And this is a "simple-minded" way:

print '<TD class=tblform3 id=L_listing width=23>123</TD>'
=~ m{<td.*?>([^<]+)</td>}is, "\n";

If I was to parse a whole HTML page, possibly with nested elements, and
whose design I don't control, I wouldn't dream of using regular
expressions. If, OTOH, the task actually is as simple as the literal
question asked by the OP, I wouldn't dream of using a parsing module.

Which way is more suitable depends on the complexity of the
task and on how much you know about regular expressions.
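
To spell out why the print above shows only 123: in list context a
successful match returns the captured groups, not the matched text. A
minimal sketch, with the variable names $html, $value and $matched
chosen just for illustration:

```perl
use strict;
use warnings;

my $html = '<TD class=tblform3 id=L_listing width=23>123</TD>';

# List context: the match returns what the parentheses captured.
my ($value) = $html =~ m{<td.*?>([^<]+)</td>}is;

# Scalar context: the match only reports success or failure.
my $matched = $html =~ m{<td.*?>([^<]+)</td>}is;

print "value:   $value\n";      # value:   123
print "matched: $matched\n";    # matched: 1
```

Forgetting the parentheses around $value on the left-hand side is an
easy way to end up with 1 instead of the captured text.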
 
Eric Schwartz

($result) = ($bunch_of_html =~ /<td.*?>(.*?)<\/td>/i);

Hrm.

#!/usr/bin/perl
use warnings;
use strict;

my $bunch_of_html = <<EOHTML;
<td><img src='closetd.jpg' alt='image of </td>' /></td>
EOHTML
my ($result) = ($bunch_of_html =~ /<td.*?>(.*?)<\/td>/i);
print "result: [$result]\n";
__END__

gives:

result: [<img src='closetd.jpg' alt='image of ]

Parsing HTML with a regex is, ultimately, an exercise in futility.
You can do it for one small subset, but as soon as you change it even
a small amount, your solution can easily break. And then you tweak.
And then it breaks again. It's easier to spend a little effort
up front with HTML::Parser or the like than to constantly be fixing
regex-based hacks.

-=Eric
 
John Bokma

tdmailbox wrote:
If I had this tag and wanted to return 123 how would I do it? I have
tried countless methods but can not get the only the 123 without the
<TD> tags

<TD class=tblform3 id=L_listing width=23>123</TD>

After 3 hours I am giving up and asking the experts.

use strict;
use warnings;

use HTML::TreeBuilder;

:
:

my $root = HTML::TreeBuilder->
new_from_content( $content );

my $td = $root->look_down( _tag => 'td',
class => 'tblform3', id => 'L_listing' );

defined $td or die "TD not found";

print $td->as_text, "\n";


(untested, assumes $content contains the HTML)

see also:

http://johnbokma.com/perl/phpbb-remote-backup.html
http://johnbokma.com/perl/froogle-script.html
 
Eric Schwartz

Gunnar Hjalmarsson said:
And this is a "simple-minded" way:

print '<TD class=tblform3 id=L_listing width=23>123</TD>'
=~ m{<td.*?>([^<]+)</td>}is, "\n";

Which, as you knew, fails if the <TD> has comments in it:

$ perl -e 'print "<TD class=tblform3 id=L_listing width=23>\n123<!-- this is the item ID from the database -->\n</td>" =~ m{<td.*?>([^<]+)</td>}is, "\n";'

$

If there is content on both sides of the comment, only the
post-comment parts get printed, but if the content is after the
comment, it will do what it's supposed to. This is the sort of thing
that causes me to lose sleep and pull out my hair before its time. I
know you knew that, I'm just pointing out to the OP how fragile a
regex-based solution can be. It may work now, in one place, but
there's all sorts of things that could cause it to fail later, some of
which can be very subtle.
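
A small sketch of those three cases, run against the same pattern. The
sample strings and the names @cases and @results are made up for
illustration:

```perl
use strict;
use warnings;

# Each case pairs a label with a made-up TD containing a comment.
my @cases = (
    [ 'text on both sides',  '<td>123<!-- note -->456</td>' ],
    [ 'text after comment',  '<td><!-- note -->123</td>'    ],
    [ 'text before comment', '<td>123<!-- note --></td>'    ],
);

my @results;
for my $case (@cases) {
    my ($label, $html) = @$case;
    my ($value) = $html =~ m{<td.*?>([^<]+)</td>}is;
    push @results, defined $value ? $value : '(no match)';
    printf "%-20s %s\n", $label, $results[-1];
}
```

The first case silently returns 456, the second returns 123, and the
third does not match at all.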
Which way is most suitable depends reasonably on the complexity of the
task together with how much you know about regular expressions.

Also the likelihood of your input changing: a regex solution might be
right at first but can easily fail later. The intended scope of use
matters as well. Subroutines have a way around here of quickly migrating
out into general-use modules, where they are used by people in very
different contexts from where they originated. What works for one
particular task is likely to need serious changes if used for others.

-=Eric
 
andrewflanders

That's true if you are writing a web crawler, but most of the time the
purpose of doing this is to strip spreadsheet-style data from a
website you don't control and insert it into your own database. In that
case the HTML formatting of the target page is likely to be fairly
consistent, and it's quicker for me to write that regex
than to install and learn how to use HTML::Parser. Add to that the fact
that your example case is silly.
 
Gunnar Hjalmarsson

Eric said:
I'm just pointing out to the OP how fragile a
regex-based solution can be.

Agreed. You need to know that no comments will be inserted that way, and
that there are no attributes containing '>' characters, etc., etc.
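
For instance, with a made-up title attribute that contains a '>', the
same pattern matches but captures the wrong thing. The sample string
and variable names are only for illustration:

```perl
use strict;
use warnings;

# Hypothetical TD whose attribute value contains a '>' character
my $html = q{<td title="a > b">123</td>};

# .*? stops at the first '>', which is inside the attribute value,
# so the capture starts in the middle of the tag.
my ($value) = $html =~ m{<td.*?>([^<]+)</td>}is;

print "[$value]\n";    # [ b">123]
```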
 
Eric Schwartz

That's true if you are writing a web crawler, but most of the time the
purpose of doing this is to strip spreadsheet-style data from a
website you don't control and insert it into your own database. In that
case the HTML formatting of the target page is likely to be fairly
consistent, and it's quicker for me to write that regex
than to install and learn how to use HTML::Parser. Add to that the fact
that your example case is silly.

Please quote the messages you're replying to, at least enough so that
we can tell what you're replying to. Guessing that you're replying to
my reply to you, the fact that you don't control the HTML is exactly
why you need something like HTML::Parser: if you control the HTML,
you can force it to always be produced so your regex can parse it. If
you don't, though, the producer of that HTML can do all kinds of
things to break your regex. Inserting comments in the middle of table
data is only one of the most obvious ways a regex can break; see my
reply to Gunnar's regex solution for more detail.

-=Eric
 
tdmailbox

<TD class=tblform3 id=L_listnum.*?>(.*?)<\/TD>

That works, but it returns the whole <TD> tag. I just want the
value inside the tag. That is my core issue, and I can't find the
solution to it. I can find plenty of expressions that will find the right
<TD> tag but not one that will give me just the data between the tags.
 
andrewflanders

"the fact that you don't control the HTML is exactly why you need
something like HTML::Parser"

You don't really know that the target HTML is as dirty as you assume.
Unless the poster says he or she is writing a robust data miner for
long-term use, I'm assuming it's a one-off script where the HTML and data
in question are uniform, because this is most often the case.
 
Scott Bryce

You don't really know that the target HTML is as dirty as you assume.

Maybe not dirty. Maybe just subject to change. I have been bitten by that
even using an HTML parser.

Unless the poster says he or she is writing a long-term use and robust
data miner I'm assuming it's a one-off script

We don't know that, so the discussion has merit.
 
Gunnar Hjalmarsson

<TD class=tblform3 id=L_listnum.*?>(.*?)<\/TD>

That works.. however it returns the whole <TD> tag..

No, it doesn't. It doesn't return anything.

Have you read any of the replies in this thread??
 
tdmailbox

Since L_listing is what makes the tag unique, I took your code and
modified it to
<TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>

and I get the right tag..

However the issue is that I only want to return the data between the
tag. The expression above includes the tag.
<TD class=tblform3 id=L_listnum width=106>$799,000</TD></TR>


Thanks in advance for any help on that.
 
Gunnar Hjalmarsson

[ Please provide some context when replying!! Most people are not
reading this group via Google Groups. ]

Gunnar said:
And this is a "simple-minded" way:

print '<TD class=tblform3 id=L_listing width=23>123</TD>'
=~ m{<td.*?>([^<]+)</td>}is, "\n";

Since L_listing is what makes the tag unique, I took your code and
modified it to
<TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>

and I get the right tag..

However the issue is that I only want to return the data between the
tag. The expression above includes the tag.
<TD class=tblform3 id=L_listnum width=106>$799,000</TD></TR>

Don't try to just explain in English what you are doing, but post a
short but complete program that demonstrates the problem you are having.

Also, have you read the description of the m// operator in "perldoc perlop"?
 
andrewflanders

You must be accessing the result of the match incorrectly. The text
captured between the ( ) will not include the entire td tag, though
some other variable may. Try printing $1 after the
match is supposed to occur and see whether it prints the value you want to
parse out.
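
Something along these lines, as a sketch. It assumes the line posted
earlier in the thread is the input; $whole and $value are illustrative
names, and $& is used here only to show the difference:

```perl
use strict;
use warnings;

# Single quotes keep Perl from interpolating $799 as a variable.
my $html = '<TD class=tblform3 id=L_listnum width=106>$799,000</TD></TR>';

my ($whole, $value);
if ($html =~ m{<TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>}) {
    $whole = $&;    # everything the pattern matched, tags included
    $value = $1;    # only what the parentheses captured
    print "whole match: $whole\n";
    print "capture:     $value\n";
}
```

If the whole tag is being printed, the likely culprit is printing the
matched string or the original variable rather than $1.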
 
Paul

<TD class=tblform3 id=L_listnum.*?>(.*?)<\/TD>

That works.. however it returns the whole <TD> tag.. I just want the
value inside the tag. That is my core issue that I cant find the
solution to. I can find plenty of expressions that will find the right
<TD> tag but not one that will just give me the data between the tags

Read up on HTML::TableExtract.

Getting this sort of data with a regex or similar is tricky, and the page
definition may change (will change).

If the tables are not well structured you may have to search by depth
and count to get the right table. You will have to come to grips with
the structure of the data you are dealing with - the tables and the form.

Start here
"http://search.cpan.org/~msisk/HTML-TableExtract-1.08/lib/HTML/TableExtract.pm"

Happy reading.
 
Joe Smith

Since L_listing is what makes the tag unique, I took your code and
modified it to
<TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>

and I get the right tag..

However the issue is that I only want to return the data between the
tag. The expression above includes the tag.

No, it doesn't. You must not be using the regex in the proper manner.

Hint: /(.*)/; m/(.*)/; m%(.*)%;, m{(.*)}; m[(.*)]; m<(.*)>;
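
To make the hint concrete, here is a sketch with the pattern from the
thread placed inside three of those match operators. The variable names
are only for illustration:

```perl
use strict;
use warnings;

my $html = '<TD class=tblform3 id=L_listnum width=106>$799,000</TD></TR>';

# With / as the delimiter, the / in </TD> has to be escaped.
my ($slash)   = $html =~ /<TD class=tblform3 id=L_listnum.*?>([^<]+)<\/TD>/;

# Other delimiters avoid the escaping; the pattern is otherwise identical.
my ($brace)   = $html =~ m{<TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>};
my ($percent) = $html =~ m%<TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>%;

print "$_\n" for $slash, $brace, $percent;
```

All three lines print $799,000, without the surrounding tag.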

-Joe
 
