CoDeReBeL
OK, it's like this... I am far from an expert in Perl but I really
think this should work...
use strict;
use warnings;
use diagnostics;
use HTML::TreeBuilder;
use HTML::Entities;
use HTML::Element;
sub traverse {
    foreach (@_) {
        if (ref $_) {
            # Element node: descend unless it is one of the skipped tags.
            if ($_->tag() ne "head"
                && $_->tag() ne "script"
                && $_->tag() ne "img"
                && $_->tag() ne "object"
                && $_->tag() ne "applet") {
                my @contents = $_->content_list() ;
                foreach my $next (@contents) {
                    traverse ($next) ;
                }
            }
        }
        else {
            # Text node: escape markup characters, then curl the quotes.
            $_ =~ s/\s&\s/ &amp; /g ;
            $_ =~ s/</&lt;/g ;
            $_ =~ s/>/&gt;/g ;
            $_ =~ s/'em\s/&rsquo;em /g ;
            $_ =~ s/'tis\s/&rsquo;tis /g ;
            $_ =~ s/'twas\s/&rsquo;twas /g ;
            $_ =~ s/'Twas\s/&rsquo;Twas /g ;
            $_ =~ s/'Tis\s/&rsquo;Tis /g ;
            $_ =~ s/'\s/&rsquo; /g ;
            $_ =~ s/^'/&lsquo;/g ;
            $_ =~ s/(\s)'/$1&lsquo;/g ;
            $_ =~ s/"'/&ldquo;&lsquo;/g ;
            $_ =~ s/'"/&rsquo;&rdquo;/g ;
            $_ =~ s/\s"/ &ldquo;/g ;
            $_ =~ s/^'/&lsquo;/g ;
            $_ =~ s/^"/&ldquo;/g ;
            $_ =~ s/"\s/&rdquo; /g ;
            $_ =~ s/'$/&rsquo;/g ;
            $_ =~ s/"$/&rdquo;/g ;
            $_ =~ s/(,|\.)'/$1&rsquo;/g ;
            $_ =~ s/(,|\.)"/$1&rdquo;/g ;
            $_ =~ s/(\S)'(\S)/$1&rsquo;$2/g ;
        }
    }
    return $_ ;
}
foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new ;
    $tree->parse_file($file_name);
    $tree = traverse ($tree);
    $tree = $tree->delete ;
}
sub traverse ;
I've spent about 36 hours now chugging coffee and Mountain Dew, trying
every possible thing I could think of or find reference to in any Perl
documentation anywhere, and for the life of me I can't get the altered
text strings to keep their value. I've had all kinds of print
statements inserted everywhere in the traverse subroutine and I have
run the program about 100 times today. Every time that I said
print $_ ; (Or $_[0] or any of the other 20 things that I've called it
today) I've seen the string altered just like it should be by the
regular expressions. The HTML::Tree package is just a little weird
with the argument being passed to the recursive routine ... it might
be a reference to a hash and it might be a string. But the ref check
is working and execution proceeds accordingly.
I've had $tree be local to the main function at the bottom, I've had
it be global, etc. I've tried everything I could think of. I've run it
in debug mode and checked and verified that the check_persistence()
routine (which I left out here for brevity but is identical to the
traverse sub except that it only prints the string and doesn't modify
it) was looking at the same address ... oh, sorry, sorry ... the C++
in me came out a little bit there ... the REFERENCE was represented by
the same exact 7 or 8 digit hexadecimal number that looks a lot like a
pointer ...
I've tried just returning the damn $_ reference at the end of the
routine. It shows all the elements of whatever file I test it with
just like it was... but the text components snap back to their
original values as soon as the traverse routine exits no matter what I
try.
At this point I'm pretty much fed up with trying, since the whole
exercise is just to give me a little script to save myself the trouble
of typing those damn entities all the time and curling the quotes,
etc. I can easily open an output file in the main routine just before
traverse, pass in a *<glob> with the $tree reference and write the
string to the file at the end of traverse, where the text strings are
what I want them to be and just print the whole damn thing inside
traverse to a file.
So this is not really a problem per se as much as a puzzle, since
there's MORE THAN ONE WAY TO DO IT! But my curiosity remains
piqued. Why the hell won't this work? Is it just plain impossible or
what? From what I've read on the topic it seems like it just might be
impossible. I've searched high and low and have yet to find an example
anywhere that does quite the same thing.
I also read somewhere that if you assign $_ or @_ to any variable or
ref inside the sub, there is no way in hell that they will stay
modified when the sub exits, so I've done my best not to call them in
any way, but I can tell you right now for sure that Perl does not like
it when you call the $s->tag() method on a string, so I have to do the
ref check. Pretty much have to call the tag() method and the
content_list() method too.
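The aliasing rule described above can be checked in isolation, without HTML::Tree. This is only a minimal sketch: the sub names (curl_in_place, curl_copy) and the sample strings are hypothetical and are not part of the script above. It shows that an edit made through $_[0] reaches the caller's variable, while an edit made to a copy (a scalar copy or a copied list) does not.

```perl
use strict;
use warnings;

# Hypothetical helper: edits its argument through the @_ alias.
sub curl_in_place {
    $_[0] =~ s/'/&rsquo;/g;
    return;
}

# Hypothetical helper: copies the argument first, so only the copy changes.
sub curl_copy {
    my ($text) = @_;
    $text =~ s/'/&rsquo;/g;
    return $text;
}

my $original = "don't";
my $curled   = curl_copy($original);   # $original is left untouched
my $after_copy = $original;

curl_in_place($original);              # $original itself is rewritten

# Copying a list behaves like copying a scalar: edits to the copy
# never reach the list it was copied from.
my @source = ("it's");
my @copy   = @source;
curl_in_place($_) for @copy;

print "after copy:     $after_copy\n";
print "after in-place: $original\n";
print "source / copy:  $source[0] / $copy[0]\n";
```

Whether this is what bites the traverse sub depends on what content_list() hands back; the sketch only demonstrates the general Perl behavior for plain strings.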
Anyway, I've had it. I give up. You guys are the experts. Clue me in,
willya?
Thanks.