perl parse (reg exp)

Larry

Hi,

I'm using this chunk of code to extract some content from a piece of
html.

($parse) = ($html =~ /<tbody>(.*?)<\/tbody>/sg);

so that I can grab everything between <tbody> and </tbody>

Now, I have some <tr> tags I'd like to parse as follows:

<tr class="odd"></tr>
<tr class="even"></tr>

Yet, I want to skip the attribute. I am a newbie with reg exp and I am stuck
at this:

(@parse) = ($parse =~ /<tr (\W+)>(.*?)<\/tr>/sg);

but it's not working..how should I go about?

thanks
 
Jürgen Exner

Larry said:
I'm using this chunk of code to extract some content from a piece of
html. [...]
but it's not working..how should I go about?

To parse HTML you should use an HTML parser. Unless you are writing such
a beast, and you know what you are doing because you have experience in,
e.g., writing compilers, it is in general A Very Bad Idea to try parsing
HTML using ad-hoc REs.

For further information see the FAQ:
- "How do I match XML, HTML, or other nasty, ugly things with a regex?"
- "How do I remove HTML from a string?"
or the many, many previous discussions about this perpetual topic.
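
To make the "use an HTML parser" advice concrete, here is a minimal sketch using the HTML::Parser module from CPAN (assumed to be installed; it is not part of core Perl). The sample $html and the names @rows and $in_row are made up for illustration and are not from the original post. It collects the text inside each <tr> and never has to look at the attributes:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Sample input standing in for the poster's $html (an assumption).
my $html = <<'HTML';
<table><tbody>
<tr class="odd"><td>first</td></tr>
<tr class="even"><td>second</td></tr>
</tbody></table>
HTML

my @rows;          # text found inside each <tr>, attributes ignored
my $in_row = 0;    # true while we are between <tr ...> and </tr>

my $parser = HTML::Parser->new(
    api_version => 3,
    # Start tags: begin a new row whenever a <tr> opens.
    start_h => [ sub {
        my ($tag) = @_;
        if ( $tag eq 'tr' ) { $in_row = 1; push @rows, ''; }
    }, 'tagname' ],
    # End tags: leave the row when </tr> is seen.
    end_h => [ sub {
        my ($tag) = @_;
        $in_row = 0 if $tag eq 'tr';
    }, 'tagname' ],
    # Text: append decoded text to the current row.
    text_h => [ sub {
        my ($text) = @_;
        $rows[-1] .= $text if $in_row;
    }, 'dtext' ],
);

$parser->parse($html);
$parser->eof;

s/^\s+|\s+$//g for @rows;    # trim surrounding whitespace
print "row: '$_'\n" for @rows;
```

This keeps only the text of each row. If the markup inside the row is needed as well, the handlers would have to record it too (HTML::Parser's "text" argspec passes the original source text of each event).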

jue
 
Martijn Lievaart

Larry said:
I'm using this chunk of code to extract some content from a piece of
html. [...]
but it's not working..how should I go about?

Jürgen Exner said:
To parse HTML you should use an HTML parser. Unless you are writing such
a beast and you know what you are doing because you have experience in
e.g. writing compilers it is in general A Very Bad Idea to try parsing
HTML using ad-hoc REs.

For further information see the FAQ:
- "How do I match XML, HTML, or other nasty, ugly things with a regex?"
- "How do I remove HTML from a string?"
or the many, many previous discussions about this perpetual topic.

The one exception to this is when the input is not truly html, but
nevertheless correctly displayed in a browser[1]. This is getting less
and less common, but still happens.

In that case your options are:
1) Bitch to the website maker
2) Create an ad-hoc parser and hope it does not break
3) Try to repair the html before parsing.

M4


[1] "a" as in some browser, not necessarily all browsers.
 
Willem

Martijn Lievaart wrote:
) The one exception to this is when the input is not truly html, but
) nevertheless correctly displayed in a browser[1]. This is getting less
) and less common, but still happens.
)
) In that case your options are:
) 1) Bitch to the website maker
) 2) Create an ad-hoc parser and hope it does not break
) 3) Try to repair the html before parsing.

4) Fix the html module you're using to accept broken html.
5) Share this with the module author so everybody can benefit.

Or:

6) Hope somebody already did this and just use an html parser that
handles broken html gracefully.
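
As an illustration of option 6, HTML::TreeBuilder from CPAN is one parser that is generally described as tolerant of tag soup. The following is only a sketch, assuming the module is installed; the sloppy sample input is made up for the example, and how well the parser copes with any particular kind of breakage is something to check against the real data:

```perl
use strict;
use warnings;
use HTML::TreeBuilder ();

# Deliberately sloppy sample input (an assumption for illustration):
# an upper-case tag, unquoted attribute values, no </td> or </tr>.
my $html = <<'HTML';
<table>
<TR class=odd><td>first
<tr class=even><td>second
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @rows;
for my $tr ( $tree->look_down( _tag => 'tr' ) ) {
    my $class = $tr->attr('class');    # available, but easy to ignore
    my $text  = $tr->as_text;          # all text below this <tr>
    $text =~ s/^\s+|\s+$//g;
    push @rows, { class => $class, text => $text };
}
$tree->delete;    # release the tree explicitly

printf "%-5s %s\n", $_->{class} // '', $_->{text} for @rows;
```

Tag names are matched in lower case here because the parser normalizes them, so the upper-case <TR> is found by the same look_down call.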


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Martijn Lievaart

Willem said:
4) Fix the html module you're using to accept broken html.
5) Share this with the module author so everybody can benefit.

Or:

6) Hope somebody already did this and just use an html parser that
handles broken html gracefully.

Should have thought of that, thanks. Know of any modules that do this?

M4
 
sln

Larry said:
I'm using this chunk of code to extract some content from a piece of
html. [...]
but it's not working..how should I go about?

Something like this maybe.

-sln

-------------------
use strict;
use warnings;

## Requires 5.10 or above

## OP: Now, I have a some <tr> tags I'd like to parse as follows:
## <tr class="odd"></tr>
## <tr class="even"></tr>
## Yet, I want to skip the attribute.

## Slurp the sample document that follows __DATA__
my $xml = join '', <DATA>;

## Opening and closing <tr> tags (interpolated into the x-mode regex below)
my $open  = q{ <tr\s*( [^>]*? )(?<!\/)> };
my $close = q{ <\/tr\s*> };

my $regx = qr/

    <script\s*[^>]*?(?<!\/)> .*? <\/script\s*>
  |
    (?-i: <!(?:\[CDATA\[.*?\]\]|--.*?--|\[[A-Z][A-Z\ ]*\[.*?\]\])> )
  |
    (                              #1
      (?: $open )                  #2
      (                            #3
        (?:
            (?>
              (?:
                  (?-i: <!(?:\[CDATA\[.*?\]\]|--.*?--|\[[A-Z][A-Z\ ]*\[.*?\]\])> )
                | (?! $open | $close ) .
              )+
            )
          | (?1)
        )*
      )
      $close
    )
/ixs;

## Collect the content of every <tr> found outside script, CDATA and comments
my @records;

while ( $xml =~ /$regx/g )
{
    if (defined $1) {
        print "-->\$2 = '$2'\n";
        print "-->\$3 = '$3'\n";
        push @records, $3;
    }
}

print "---------\nDone!\n";

exit;

__DATA__


<![CDATA[
<tr class="odd">
trodd
</tr>
<tr class="even">
treven
</tr>
]]>

<script>
function search0(){
document.forms[0].submit()
}
function Upper()
{
var up = document.getElementById("h_sn");
return up.value = up.value.toUpperCase();
}
</script>

<TABLE width="800" BORDER=1 align="center" CELLPADDING=1 CELLSPACING=0>
<TR VALIGN="TOP" >
<TD colspan="11" align="left" valign="middle" class="style31"><img
src="image1/Home/Export.png" width="45" height="13" /></TD>
</TR>
<TR VALIGN="TOP" >
<TD WIDTH="33" align="center" class="style25">Item</TD>
<TD width="73" align="center" class="style25">AWB No </TD>
<TD WIDTH="69" align="center" class="style25">Flight No </TD>
<TD WIDTH="87" align="center" class="style25">Flight Date</TD>
<TD WIDTH="42" align="center" class="style25">Origin</TD>
<TD WIDTH="42" align="center" class="style25">Dest</TD>
<TD WIDTH="99" align="center" class="style25">ULD No </TD>
<TD WIDTH="105" align="center" class="style25">Status</TD>
<TD WIDTH="50" align="center" class="style25"> Pieces </TD>
<TD WIDTH="58" align="center" class="style25">Weight </TD>
<TD WIDTH="96" align="center" class="style25">Time </TD>
</TR>
<TR bgcolor='#99CCFF' >
<TD ALIGN="center" NOWRAP="TRUE" class="style12">1</TD>
<TD ALIGN="left" NOWRAP="TRUE" class="style12">176-75064953</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">EK 419</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Oct 15 2010 </TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">BKK</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">DXB</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Flight Change&nbsp;</TD>
<!--// This is check status -->
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Export
Transshipment</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">3</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">743.00</TD>
<TD ALIGN="right" NOWRAP="TRUE" class="style12">Oct 14 2010 5:37PM </TD>
</TR>
<TR bgcolor='#99FFCC' >
<TD ALIGN="center" NOWRAP="TRUE" class="style12">2</TD>
<TD ALIGN="left" NOWRAP="TRUE" class="style12">176-75064953</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">EK 419</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Oct 15 2010 </TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">BKK</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">DXB</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">&nbsp;</TD>
<!--// This is check status -->
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Accepted</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">3</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">743.00</TD>
<TD ALIGN="right" NOWRAP="TRUE" class="style12">Oct 14 2010 5:37PM </TD>
</TR>
<TR bgcolor='#99CCFF' >
<TD ALIGN="center" NOWRAP="TRUE" class="style12">3</TD>
<TD ALIGN="left" NOWRAP="TRUE" class="style12">176-75064953</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">EK 373</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Oct 15 2010 </TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">BKK</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">DXB</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Flight Change&nbsp;</TD>
<!--// This is check status -->
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Export
Transshipment</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">3</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">743.00</TD>
<TD ALIGN="right" NOWRAP="TRUE" class="style12">Oct 14 2010 6:12PM </TD>
</TR>
<TR bgcolor='#99FFCC' >
<TD ALIGN="center" NOWRAP="TRUE" class="style12">4</TD>
<TD ALIGN="left" NOWRAP="TRUE" class="style12">176-75064953</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">EK 373</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Oct 15 2010 </TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">BKK</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">DXB</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">SHC&nbsp;</TD>
<!--// This is check status -->
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Export
Transshipment</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">3</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">743.00</TD>
<TD ALIGN="right" NOWRAP="TRUE" class="style12">Oct 14 2010 6:12PM </TD>
</TR>
<TR bgcolor='#99CCFF' >
<TD ALIGN="center" NOWRAP="TRUE" class="style12">5</TD>
<TD ALIGN="left" NOWRAP="TRUE" class="style12">176-75064953</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">EK 373</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Oct 14 2010 </TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">BKK</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">DXB</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Flight Change&nbsp;</TD>
<!--// This is check status -->
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Export
Transshipment</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">3</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">743.00</TD>
<TD ALIGN="right" NOWRAP="TRUE" class="style12">Oct 14 2010 6:42PM </TD>
</TR>
<TR bgcolor='#99FFCC' >
<TD ALIGN="center" NOWRAP="TRUE" class="style12">6</TD>
<TD ALIGN="left" NOWRAP="TRUE" class="style12">176-75064953</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">EK 373</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Oct 14 2010 </TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">BKK</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">DXB</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">PMC31131EK&nbsp;</TD>
<!--// This is check status -->
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Manifested</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">3</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">743.00</TD>
<TD ALIGN="right" NOWRAP="TRUE" class="style12">Oct 14 2010 6:57PM </TD>
</TR>
<TR bgcolor='#99CCFF' >
<TD ALIGN="center" NOWRAP="TRUE" class="style12">7</TD>
<TD ALIGN="left" NOWRAP="TRUE" class="style12">176-75064953</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">EK 373</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Oct 14 2010 </TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">BKK</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">DXB</TD>
<TD ALIGN="center" NOWRAP="TRUE" class="style12">&nbsp;</TD>
<!--// This is check status -->
<TD ALIGN="center" NOWRAP="TRUE" class="style12">Departed</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">3</TD>
<TD ALIGN="RIGHT" NOWRAP="TRUE" class="style12">743.00</TD>
<TD ALIGN="right" NOWRAP="TRUE" class="style12">Oct 14 2010 9:54PM </TD>
</TR>
</TABLE>

<script>
function show_adv(){
var ko = document.getElementById("showadv");
//var ko2 = document.getElementById("showadv2");
ko.style.display="";
//ko2.style.display="";
}
</script>
 
Peter Makholm

Larry said:
I'm using this chunk of code to extract some content from a piece of
html.

As others have already said, use a module that already parses HTML.

Larry said:
<tr class="odd"></tr>
<tr class="even"></tr>

Yet, I want to skip the attribute. I am a newbie with reg exp and I am
stuck at this:

(@parse) = ($parse =~ /<tr (\W+)>(.*?)<\/tr>/sg);

Try reading it out loud. You are trying to match "the exact string
'<tr' followed by a space. Then capture a non-zero number of
non-word characters. Then a '>', and then capture as few characters
as possible. Finally match the exact string '</tr>'".

So between '<tr ' and '>' you try to capture a string of non-word
characters. But in your example you have plenty of word characters
there.

Larry said:
but it's not working..how should I go about?

First you have to listen to the people telling you to use one of the
existing HTML parsing modules.
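
For completeness, the read-aloud explanation above can be checked with a small core-Perl sketch. The sample $parse string and the names @broken and @fixed are made up for the example. The second pattern swaps \W+ for an uncaptured [^>]*, which is one common way to skip attributes; it is still an ad-hoc regex and will break on nested tables, comments, or a '>' inside an attribute value, which is why a parser remains the better answer:

```perl
use strict;
use warnings;

# Sample input standing in for the poster's $parse (an assumption).
my $parse = <<'HTML';
<tr class="odd"><td>first</td></tr>
<tr class="even"><td>second</td></tr>
HTML

# Original pattern: \W+ needs at least one non-word character right
# after '<tr ', but class="odd" starts with the word character 'c',
# so the match fails and the list stays empty.
my @broken = $parse =~ /<tr (\W+)>(.*?)<\/tr>/sg;

# [^>]* consumes everything up to the '>' that ends the start tag.
# It is not in parentheses, so only the row content is returned.
my @fixed = $parse =~ /<tr\b[^>]*>(.*?)<\/tr>/sg;

print "broken pattern: ", scalar(@broken), " captures\n";
print "fixed pattern:  '$_'\n" for @fixed;
```

With the sample input this prints 0 captures for the original pattern and the two row bodies, '<td>first</td>' and '<td>second</td>', for the adjusted one.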

//Makholm
 
Ted Zlatanov

ML> Should have thought of that, thanks. Know of any modules that do this?

Heh heh.

I have worked around this by fixing the HTML before feeding it to the
parser. I often wish it was not so easy to create broken HTML.

Ted
 
