perl html parser

kevin kitenik

Hi everybody,

I have a piece of an HTML file that contains special if-then-else statements, like these:
<if condition="$vboptions['hometitle']"><a href="$vboptions[homeurl]">$vboptions[hometitle]</a> -</if>
<if condition="$vboptions[privacyurl]"><a href="$vboptions[privacyurl]"><else><tr><td>test here
</if>

These if statements can be nested:
if ... then ...
else
if .. then ...
fi
fi


The problem is that I want a way to transform these statements into the style:
((cond1) ? (exec1) : (exec2))

How can I do this?
I tried CPAN without any success!

use Parse::RecDescent;
my @s=( q{<if condition="$vboptions['hometitle']"> <a href="$vboptions[homeurl]">$vboptions[hometitle]</a> - </if>});

&pars;
sub pars {
    my $parser = new Parse::RecDescent( q{
        startrule: S
        S: if ifC '">' then S else S fi {$return="(($item[2]) ? (\"$item[5]\") : ($item[7]))";}
         | if ifC '">' then S fi {$return="(($item[2]) ? (\"$item[5]\") : (\"\"))";}
         | html {$return=$item[1];}
        if: '<if condition="'
        fi: '</if>'
        ifC: /[^"]+/
        then: ''
        else: '<else>' | '<else />'
        html: /[\w\d_\$,\[\] ="\/\<\>-]+/
    });
    foreach my $s (@s) {
        print $s . ":\n" . $parser->startrule( $s ) . "\n";
    }
}


Thank you in advance for any suggestions, because I have a headache ;-)
 
sln

I'm not surprised you have a headache.
You can see what the parser is doing if you set $::RD_TRACE = 1;

Let's look at one of your data strings.
q{<if condition="$vboptions[privacytitle]"><a href="$vboptions[privacyurl]">
<else><tr><td> test here
</if>}

-----------------------------------
use strict;
use warnings;
use Parse::RecDescent;


$::RD_TRACE = 1;

my @s=(
q{<if condition="$vboptions[privacytitle]"><a href="$vboptions[privacyurl]">
<else><tr><td> test here
</if>}
);

pars();

sub pars {
    my $parser = Parse::RecDescent->new( q{

        startrule: S
        S: if ifC '">' then S else S fi {$return="(($item[2]) ? (\"$item[5]\") : ($item[7]))";}
         | if ifC '">' then S fi {$return="(($item[2]) ? (\"$item[5]\") : (\"\"))";}
         | html {$return=$item[1];}

        if: '<if condition="'
        fi: '</if>'
        ifC: /[^"]*/
        then: ''
        else: '<else>' | '<else />'
        html: /[\w\d_\$,\[\] ="\/<>-]+/
    });
    foreach my $s (@s) {
        print "\n",'+'x30,"\n",$s,":\n",'-'x30,"\n", ($parser->startrule( $s )),"\n";
    }
}

__END__
-----------------------------------


The first time through S, it finds
if ifC '">' then
which is
if:   $item[1] - '<if condition="' (literal)
ifC:  $item[2] - '$vboptions[privacytitle]' =~ /[^"]*/
      $item[3] - '">' (literal)
then: $item[4] - '' (literal)

Then it recurses into S and finds
html
which is
$item[5] - '<a href="$vboptions[privacyurl]">' =~ /[\w\d_\$,\[\] ="\/<>-]+/

Back from the recursion, it then finds
else
which is
$item[6] - '<else>' (literal)

Then it recurses into S again and finds
html
which is
$item[7] - '<tr><td> test here' =~ /[\w\d_\$,\[\] ="\/<>-]+/

Back from the recursion, it then finds
fi
which is
$item[8] - '</if>' (literal)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


This code will produce a proper result if and
only if there is a separator between

S (separator) else
and
S (separator) fi

that is NOT matched by the html production /[\w\d_\$,\[\] ="\/<>-]+/.

This can be a tab or a newline, because neither is in the character
class of that regex.

For example, this:
q{<if condition="$vboptions[privacytitle]"><a href="$vboptions[privacyurl]"><else>
<tr><td> test here
</if>}
will fail because
'<a href="$vboptions[privacyurl]"><else>' =~ /[\w\d_\$,\[\] ="\/<>-]+/
will match, taking the else: token with it.

And this:
q{<if condition="$vboptions[privacytitle]"><a href="$vboptions[privacyurl]">
<else><tr><td> test here </if>}
will fail because
'<tr><td> test here </if>' =~ /[\w\d_\$,\[\] ="\/<>-]+/
will match, taking the fi: token with it.
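
The effect can be checked outside the parser. What follows is only a small sketch using core Perl: it applies the same html character class at the start of the "then" branch, once with a newline before <else> and once without. The helper name html_prefix and the two variables are made up for this illustration and are not part of the original grammar.

```perl
use strict;
use warnings;

# The html terminal from the grammar, anchored at the start of the text,
# roughly the way the parser tries it at the current position.
# html_prefix is a hypothetical helper for this demonstration only.
sub html_prefix {
    my ($text) = @_;
    my ($match) = $text =~ /^([\w\d_\$,\[\] ="\/<>-]+)/;
    return $match;
}

# The same "then" branch, with and without a newline before <else>.
my $separated = q{<a href="$vboptions[privacyurl]">} . "\n" . q{<else><tr><td> test here};
my $adjacent  = q{<a href="$vboptions[privacyurl]"><else>} . "\n" . q{<tr><td> test here};

print "separated: ", html_prefix($separated), "\n";
print "adjacent:  ", html_prefix($adjacent), "\n";
# separated: <a href="$vboptions[privacyurl]">
# adjacent:  <a href="$vboptions[privacyurl]"><else>
```

In the second case the match already contains <else>, so nothing is left for the else: rule to find.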

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The regular expressions (as you have them there) are independent of each other.
They are not set up to backtrack: once html has matched, it does not give
characters back so that else: or fi: can match.
I think that backtracking is available as a more advanced production concept,
however this can't be done with something as trivial as
/[\w\d_\$,\[\] ="\/<>-]+/
Indeed, discrete, character-level parsing is needed for markup.

If, however, you are in control of creating the input data, just fashion it so
that known delimiters are inserted where necessary. Then you can generate the correct
html, or whatever it is you are doing.
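
If you are not in control of the input, one alternative is to split the text on the three tags first, so that the tags themselves act as the delimiters. The following is a minimal sketch of that idea in core Perl, not a Parse::RecDescent solution. All sub names (tokenize, quote_text, convert_sequence, convert) are invented here, and the output style (double-quoted chunks joined with " . ") is an assumption about what you want to generate.

```perl
use strict;
use warnings;

# Split the template into tag tokens and text tokens. The capturing group
# makes split() return the tags as well as the text between them.
sub tokenize {
    my ($template) = @_;
    return grep { length }
        split m{(<if\s+condition="[^"]*">|<else\s*/?>|</if>)}, $template;
}

# Quote a text chunk as a double-quoted literal, escaping \ and ".
sub quote_text {
    my ($text) = @_;
    $text =~ s/([\\"])/\\$1/g;
    return qq{"$text"};
}

# Convert tokens up to the next <else> or </if> (or the end of input)
# into one expression. Nested <if> blocks are handled by recursion.
sub convert_sequence {
    my ($tokens) = @_;
    my @parts;
    while (@$tokens) {
        my $tok = $tokens->[0];
        last if $tok =~ m{\A<else\s*/?>\z} || $tok eq '</if>';
        shift @$tokens;
        if ($tok =~ m{\A<if\s+condition="([^"]*)">\z}) {
            my $cond = $1;    # copy before recursing, $1 gets overwritten
            my $then = convert_sequence($tokens);
            my $else = '""';
            if (@$tokens && $tokens->[0] =~ m{\A<else\s*/?>\z}) {
                shift @$tokens;
                $else = convert_sequence($tokens);
            }
            die "missing </if>\n" unless @$tokens && $tokens->[0] eq '</if>';
            shift @$tokens;
            push @parts, "(($cond) ? ($then) : ($else))";
        }
        else {
            push @parts, quote_text($tok);
        }
    }
    return @parts ? join(' . ', @parts) : '""';
}

sub convert {
    my ($template) = @_;
    my @tokens = tokenize($template);
    my $expr   = convert_sequence(\@tokens);
    die "unexpected $tokens[0]\n" if @tokens;
    return $expr;
}

print convert(q{<if condition="$a">yes<else><if condition="$b">maybe</if></if>}), "\n";
# (($a) ? ("yes") : ((($b) ? ("maybe") : (""))))
```

Because the tags are recognized before any text is consumed, this does not depend on a tab or newline sitting in front of <else> or </if>. Whitespace and newlines inside text chunks are kept as they are.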

-sln
 
