Modification of a hash-by-reference parameter in a recursive sub

C

CoDeReBeL

OK, it's like this... I am far from an expert in Perl but I really
think this should work...

use strict;
use warnings;
use diagnostics;
use HTML::TreeBuilder;
use HTML::Entities;
use HTML::Element;

sub traverse {
foreach (@_) {
if (ref $_) {
if ($_->tag() ne "head"
&& $_->tag() ne "script"
&& $_->tag() ne "img"
&& $_->tag() ne "object"
&& $_->tag() ne "applet") {
my @contents = $_->content_list() ;
foreach my $next (@contents) {
traverse ($next) ;
}
}
}
else {
$_ =~ s/\s&\s/ & /g ;
$_ =~ s/</&lt;/g ;
$_ =~ s/>/&gt;/g ;
$_ =~ s/'em\s/&rsquo;em /g ;
$_ =~ s/'tis\s/&rsquo;tis /g ;
$_ =~ s/'twas\s/&rsquo;twas /g ;
$_ =~ s/'Twas\s/&rsquo;Twas /g ;
$_ =~ s/'Tis\s/&rsquo;Tis / ;
$_ =~ s/'\s/&rsquo; /g ;
$_ =~ s/^'/&lsquo;/g ;
$_ =~ s/(\s)'/$1&lsquo;/g ;
$_ =~ s/"'/&ldquo;lsquo;/g ;
$_ =~ s/'"/&rsquo;&rdquo;/g ;
$_ =~ s/\s"/ &ldquo;/g ;
$_ =~ s/^'/&lsquo;/g ;
$_ =~ s/^"/&ldquo;/g ;
$_ =~ s/"\s/&rdquo; /g ;
$_ =~ s/'$/&rsquo;/g ;
$_ =~ s/"$/&rdquo;/g ;
$_ =~ s/(,|\.)'/$1&rsquo;/g ;
$_ =~ s/(,|\.)"/$1&rdquo;/g ;
$_ =~ s/(\S)'(\S)/$1&rsquo;$2/g ;
}
}
return $_ ;
}

foreach my $file_name (@ARGV) {
my $tree = HTML::TreeBuilder->new ;
$tree->parse_file($file_name);
$tree = traverse ($tree);
$tree = $tree->delete ;
}

sub traverse ;

I've spent about 36 hours now chugging coffee and Mountain Dew, trying
every possible thing I could think of or find reference to in any Perl
documentation anywhere, and for the life of me I can't get the altered
text strings to keep their value. I've had all kind of print
statements inserted everywhere in the traverse subroutine and I have
run the program about 100 times today. Every time that I said

print $_ ; (Or $_[0] or any of the other 20 things that I've called it
today) I've seen the string altered just like it should be by the
regular expressions. The HTML::Tree package is just a little weird
with the argument being passed to the recursive routine ... it might
be a reference to a hash and it might be a string. But the ref check
is working and execution proceeds accordingly.

I've had $tree be local to the main function at the bottom, I've had
it be global, etc. I've tried everything I could think of. I've run it
in debug mode and checked and verified that the check_persistence()
routine (which I left out here for brevity but is identical to the
traverse sub except that it only prints the string and doesn't modify
it) was looking at the same address ... oh, sorry, sorry ... the C++
in me came out a little bit there ... the REFERENCE was represented by
the same exact 7 or 8 digit hexadecimal number that looks a lot like a
pointer ...

I've tried just returning the damn $_ reference at the end of the
routine. It shows all the elements of whatever file I test it with
just like it was... but the text components snap back to their
original values as soon as the traverse routine exits no matter what I
try.

At this point I'm pretty much fed up with trying, since the whole
exercise is just to give me a little script to save myself the trouble
of typing those damn entities all the time and curling the quotes,
etc. I can easily open an output file in the main routine just before
traverse, pass in a *<glob> with the $tree reference and write the
string to the file at the end of traverse, where the text strings are
what I want them to be and just print the whole damn thing inside
traverse to a file.

So this is not really a problem per se as much as a puzzle, since
there's MORE THAN ONE WAY TO DO IT! :) But my curiosity remains
piqued. Why the hell won't this work? Is it just plain impossible or
what? From what I've read on the topic it seems like it just might be
impossible. I've searched high and low and have yet to find an example
anywhere that does just quite the same thing.

I also read somewhere that if you assign $_ or @_ to any variable or
ref inside the sub that there is no way in hell that they will stay
modified when the sub exits, so I've done my best not to call them in
any way, but I can tell you right now for sure that Perl does not like
it when you call the $s->tag() method on a string, so I have to do the
ref check. Pretty much have to call the tag() method and the
content_list() method too.

Anyway, I've had it. I give up. You guys are the experts. Clue me in,
willya?


Thanks.
 
U

Uri Guttman

C> sub traverse {
C> foreach (@_) {
C> if (ref $_) {
C> if ($_->tag() ne "head"
C> && $_->tag() ne "script"
C> && $_->tag() ne "img"
C> && $_->tag() ne "object"
C> && $_->tag() ne "applet") {

just for speed ups you should get $_->tag() into a temp var so you don't
call the method over and over. another way would be to check the value
against a hash of those tag names. a series of compares to the same
value in perl is usually a bad idea.


C> my @contents = $_->content_list() ;

you make a copy of the list so any changes to it will not seen be in the
original tree.

C> foreach my $next (@contents) {

try this:

foreach my $next ($_->content_list()) {

that will alias $next to each element of the tree so changes to $next
will be seen in the actual tree.


C> traverse ($next) ;
C> }
C> }
C> }
C> else {
C> $_ =~ s/\s&\s/ &amp; /g ;

$_ is the default for s/// ops so you can lose all the $_ =~ redundant noise.

C> $_ =~ s/</&lt;/g ;
C> $_ =~ s/>/&gt;/g ;

i know there is a better way to do all these s/// calls but i am not in
the mood to work on it.

uri
 
J

Jürgen Exner

Uri Guttman said:
C> sub traverse {
C> foreach (@_) {
C> if (ref $_) {
C> if ($_->tag() ne "head"
C> && $_->tag() ne "script"
C> && $_->tag() ne "img"
C> && $_->tag() ne "object"
C> && $_->tag() ne "applet") {

It may be more readable to put all those values as keys into a hash
%mytags and then simply check for

unless (exists ($mytags{$_->tag()} )) {
....

jue
 
S

smallpond

OK, it's like this... I am far from an expert in Perl but I really
think this should work...

use strict;
use warnings;
use diagnostics;
use HTML::TreeBuilder;
use HTML::Entities;
use HTML::Element;

sub traverse {
foreach (@_) {
if (ref $_) {
if ($_->tag() ne "head"


Why use $_ and not a named variable?
foreach my $html(@_) {
if (ref $html) {
if ($html->tag() ne "head"
etc.

It's more readable and less likely that you accidentally
modify $_ in any loop longer than a few lines.
 
C

CoDeReBeL

OK, I finally got it to work. Thanks for your help, all. I took your
advice about most everything except the regular expressions. I'm sure
they can be made a lot better but I wasn't going to start playing with
them until I actually had the program running. Here is the latest
version, in case anyone is interested and just in case someone has a
similar problem some day they might find this...

use strict;
use warnings;
use diagnostics;
use HTML::TreeBuilder;
use HTML::Element;
use Scalar::Util ;

sub traverse ;
sub curly_quotes ($) ;
sub go_ahead ($) ;

foreach my $file_name (@ARGV) {
my $tree = HTML::TreeBuilder->new ;
$tree->parse_file($file_name);
print "\n\nWhere would you like to put the output file for
$file_name? " ;
my $output = <STDIN> ;
open OUTPUT_FILE, "> $output" or die $! ;
traverse ($tree->find('body')) ;
print OUTPUT_FILE $tree->as_HTML (""," ",{}) ;
$tree = $tree->delete ;
close OUTPUT_FILE or die $!;
}

sub traverse {
for my $element (@_) {
if (Scalar::Util::blessed ($element)) {
if (go_ahead($element)) {
my @contents = $element->content_list() ;
print "Before: ", @contents, "\n\n" ;
traverse(@contents) ;
print "After: ", @contents, "\n\n" ;
$element->detach_content() ;
$element->push_content (@contents) ;
}
}
else {
print "Processing a string: " ;
$element = curly_quotes($element) ;
print $element, "\n\n" ;
}
}
}

sub curly_quotes ($) {
my $s = $_[0] ;
$s =~ s/\s&\s/ &amp; /g ;
$s =~ s/</&lt;/g ;
$s =~ s/>/&gt;/g ;
$s =~ s/'em\s/&rsquo;em /g ;
$s =~ s/'tis\s/&rsquo;tis /g ;
$s =~ s/'twas\s/&rsquo;twas /g ;
$s =~ s/'Twas\s/&rsquo;Twas /g ;
$s =~ s/'Tis\s/&rsquo;Tis / ;
$s =~ s/'\s/&rsquo; /g ;
$s =~ s/^'/&lsquo;/g ;
$s =~ s/(\s)'/$1&lsquo;/g ;
$s =~ s/"'/&ldquo;lsquo;/g ;
$s =~ s/'"/&rsquo;&rdquo;/g ;
$s =~ s/\s"/ &ldquo;/g ;
$s =~ s/^'/&lsquo;/g ;
$s =~ s/^"/&ldquo;/g ;
$s =~ s/"\s/&rdquo; /g ;
$s =~ s/'$/&rsquo;/g ;
$s =~ s/"$/&rdquo;/g ;
$s =~ s/(,|\.)'/$1&rsquo;/g ;
$s =~ s/(,|\.)"/$1&rdquo;/g ;
$s =~ s/(\S)'(\S)/$1&rsquo;$2/g ;
return $s ;
}

sub go_ahead ($) {
my $element = $_[0] ;
my $s = $element->tag() ;
my %tags = (
head => "head",
script => "script",
img => "img",
object => "object",
applet => "applet",
pre => "pre"
) ;
return !(exists $tags{$s}) ;
}

Thanks again!
 
U

Uri Guttman

C> sub traverse ;
C> sub curly_quotes ($) ;
C> sub go_ahead ($) ;

no need to predeclare subs in perl. and using prototypes is mostly
useless and doesn't really do what you think it does. so drop those
lines and all the prototypes you have below.

C> foreach my $file_name (@ARGV) {
C> my $tree = HTML::TreeBuilder->new ;
C> $tree->parse_file($file_name);
C> print "\n\nWhere would you like to put the output file for
C> $file_name? " ;
C> my $output = <STDIN> ;
C> open OUTPUT_FILE, "> $output" or die $! ;

you don't chomp the filename. it could cause odd issues with the open.
and use lexical file handles, not barewords.

C> traverse ($tree->find('body')) ;
C> print OUTPUT_FILE $tree->as_HTML (""," ",{}) ;
C> $tree = $tree->delete ;

no need to delete the tree as it will go out of scope here.

C> close OUTPUT_FILE or die $!;

same if you used a lexical handle. and close is rarely checked as it has
maybe one fault possible on a plain file (filesystem full).

C> }
C> if (Scalar::Util::blessed ($element)) {

import blessed from that module and you won't need to use the full path
to it.

C> sub curly_quotes ($) {
C> my $s = $_[0] ;

my( $s ) = @_ ;

that is a better style. never use $_[] unless there is a special reason
(and there are a few)

C> sub go_ahead ($) {
C> my $element = $_[0] ;
C> my $s = $element->tag() ;
C> my %tags = (
C> head => "head",
C> script => "script",
C> img => "img",
C> object => "object",
C> applet => "applet",
C> pre => "pre"

you assign this hash EACH time the sub is called even though the hash is
constant. assign it once in a file level lexical at the top and it will
be reused each time in this sub.

and STOP using single letter variable names. that is your worst
offense. you may think you know this stuff but it is not proper in any
language. you are writing code to be read by humans, not computers and
OTHER humans, not yourself.

uri
 
T

Tad J McClellan

sub traverse ;
sub curly_quotes ($) ;
sub go_ahead ($) ;


Why do you include those 3 lines?

What do you think they are doing for you?

What happens if you delete them and then test your program?

If you don't know why those lines are there, then those lines
should not be there. Delete them, they do nothing for you.

foreach my $file_name (@ARGV) {
my $tree = HTML::TreeBuilder->new ;
$tree->parse_file($file_name);
print "\n\nWhere would you like to put the output file for
$file_name? " ;
my $output = <STDIN> ;


chomp $output;

open OUTPUT_FILE, "> $output" or die $! ;


You should use the 3-arg form of open() with a lexical filehandle:

open my $OUTPUT_FILE, '>', $output or die "could not open '$output' $!";

traverse ($tree->find('body')) ;
print OUTPUT_FILE $tree->as_HTML (""," ",{}) ;


print $OUTPUT_FILE $tree->as_HTML (""," ",{}) ;

$tree = $tree->delete ;
close OUTPUT_FILE or die $!;

close $OUTPUT_FILE or die $!;
}

sub traverse {
for my $element (@_) {
if (Scalar::Util::blessed ($element)) {
if (go_ahead($element)) {


You can combine those 2 "if"s and do without go_ahead():

if (Scalar::Util::blessed ($element) and
$element->tag() !~ /^(head|script|img|object|applet|pre)$/ ) {

my @contents = $element->content_list() ;
print "Before: ", @contents, "\n\n" ;
traverse(@contents) ;
print "After: ", @contents, "\n\n" ;
$element->detach_content() ;
$element->push_content (@contents) ;
}
}
else {
print "Processing a string: " ;
$element = curly_quotes($element) ;
print $element, "\n\n" ;


You can combine those last 2 lines:

print curly_quotes($element), "\n\n" ;
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,022
Latest member
MaybelleMa

Latest Threads

Top