Replacing hundreds of hash keys with their values in a text document

Arvin Portlock

I'm writing a script that replaces the direct form of a
special character with its SDATA equivalent. For example
it would replace all occurrences of é with &eacute;. I've
compiled an enormous hash with the "direct" form as the
key and the SDATA version as its value. I can think of two
ways to accomplish this. The first is to loop through all
keys and do a global replace with the correct value:

foreach my $key (keys %characters) {
    $fulltext =~ s/$key/$characters{$key}/g;
}

The second is to process the document character by character
and if the character is in the hash then replace it:

local $/ = undef;
open (FILE, $file);
my $fulltext = <FILE>;
close (FILE);
my @chars = split (//, $fulltext);
foreach my $char (@chars) {
    if ($characters{$char}) {
        print $characters{$char};
    } else {
        print $char;
    }
}

The second seems the faster option, but neither one of them
is exactly an elegant solution. Is there something obvious
I'm missing?

Arvin
 
Ben Morrow

If you're using 5.8, and don't mind having numeric character
references (e.g. &#233;) instead of named entities, you can do

use Encode qw/:fallbacks/;

$PerlIO::encoding::fallback = FB_HTMLCREF;
binmode STDOUT, ':encoding(ascii)';

open my $FILE, '<:encoding(latin1)', $file or die...;
# or whatever encoding is appropriate
print while <$FILE>;
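
To see what the FB_HTMLCREF fallback produces without touching any files, here is a minimal sketch that calls Encode's encode() directly on a string. The sample text is made up for illustration. Note that the result is a decimal numeric reference, not a named or SDATA entity.

```perl
use strict;
use warnings;
use Encode qw(encode :fallbacks);

# Made-up sample: "cafe naive" with e-acute and i-diaeresis, written
# with escapes so the source itself stays plain ASCII.
my $text = "caf\x{e9} na\x{ef}ve";

# encode() may modify its input when a CHECK value is passed,
# so work on a copy and keep $text intact.
my $copy  = $text;
my $ascii = encode('ascii', $copy, FB_HTMLCREF);

print "$ascii\n";    # caf&#233; na&#239;ve
```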

Otherwise, I'd do

open my $FILE, $file or die...;
while (<$FILE>) {
    s/([^[:ascii:]])/$characters{$1}/g;
    print;
}
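
One thing to watch with the [^[:ascii:]] version: a non-ASCII character that is missing from %characters gets replaced by an undefined value, so it disappears from the output (with a warning under "use warnings"). A possible variation, which is not from the post above, uses the /e modifier to leave unmapped characters untouched. The one-entry table and the sample line are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical one-entry table; the real %characters is much larger.
my %characters = ("\x{e9}" => '&eacute;');

# Sample line with one mapped character (e-acute) and one unmapped
# character (u-umlaut).
my $line = "caf\x{e9} Z\x{fc}rich";

# /e evaluates the replacement as Perl code, so a character without
# a table entry is put back unchanged.
(my $out = $line) =~
    s/([^[:ascii:]])/exists $characters{$1} ? $characters{$1} : $1/ge;

print $out eq "caf&eacute; Z\x{fc}rich" ? "ok\n" : "not ok\n";
```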

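If your %characters doesn't include all the non-ASCII characters in
the file, you could use

# quotemeta keeps keys such as ']' or '^' from breaking the character class
my $to_encode = '[' . (join '', map quotemeta, keys %characters) . ']';
while (<$FILE>) {
    s/($to_encode)/$characters{$1}/g;
    print;
}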
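
A self-contained sketch of the character-class approach, with a made-up three-entry table standing in for the real %characters. Two things here are additions rather than part of the post: the keys go through quotemeta so that a key such as ] or ^ cannot break the bracket expression, and the table includes & to show that a single pass never re-scans text it has just inserted (a per-key loop could turn &eacute; into &amp;eacute;, depending on key order). It assumes every key is exactly one character.

```perl
use strict;
use warnings;

# Made-up table; every key is assumed to be a single character.
my %characters = (
    '&'      => '&amp;',
    "\x{e9}" => '&eacute;',
    "\x{fc}" => '&uuml;',
);

# Build the bracket expression once. quotemeta guards against keys
# that are special inside [...], such as ']' or '^'.
my $class     = join '', map { quotemeta($_) } keys %characters;
my $to_encode = qr/([$class])/;

my $line = "R&D caf\x{e9} in Z\x{fc}rich";
(my $out = $line) =~ s/$to_encode/$characters{$1}/g;

print "$out\n";    # R&amp;D caf&eacute; in Z&uuml;rich
```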

Ben
 
Arvin Portlock

Boy, do I feel like an idiot. That makes MUCH more sense and is just
what I'll do. I have no idea what I was thinking.

Ben Morrow said:
If you're using 5.8, and don't mind having numeric character
references (e.g. &#233;) instead of named entities, you can do

use Encode qw/:fallbacks/;

$PerlIO::encoding::fallback = FB_HTMLCREF;
binmode STDOUT, ':encoding(ascii)';

open my $FILE, '<:encoding(latin1)', $file or die...;
# or whatever encoding is appropriate
print while <$FILE>;

Nah, I have to use SDATA entities. I'm not dealing with HTML.
But this is a good trick for another project: converting Unicode
characters to decimal numeric entities in HTML files so older
browsers can view them.

Thanks!

Arvin
 
Brian McCauley

Arvin Portlock said:
That makes MUCH more sense and is just what I'll do.

I've not benchmarked it but I suspect it would be more efficient to
take the appending of the () outside the loop. I'd also explicitly
precompile the regex - although I think Perl will actually manage to
avoid unnecessary recompilation anyhow.

# quotemeta keeps keys such as ']' or '^' from breaking the character class
my $to_encode = join '', map quotemeta, keys %characters;
$to_encode = qr/([$to_encode])/;
while (<$FILE>) {
    s/$to_encode/$characters{$1}/g;
    print;
}
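
Since the efficiency claim above is explicitly unbenchmarked, here is one way it could be measured with the core Benchmark module. This is only a sketch: the three-entry table, the sample lines and the sub names with_string and with_qr are invented for the comparison, and results will vary with the Perl version and the real data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up table and input, purely for timing purposes.
my %characters = (
    "\x{e9}" => '&eacute;',
    "\x{fc}" => '&uuml;',
    "\x{e0}" => '&agrave;',
);
my @lines = ("caf\x{e9} \x{e0} Z\x{fc}rich\n") x 1000;

my $class     = join '', map { quotemeta($_) } keys %characters;
my $as_string = "[$class]";        # plain string, parens added inside s///
my $as_qr     = qr/([$class])/;    # precompiled, parens included

# Each sub converts every line and returns the total output length.
sub with_string {
    my $total = 0;
    for my $line (@lines) {
        (my $copy = $line) =~ s/($as_string)/$characters{$1}/g;
        $total += length $copy;
    }
    return $total;
}

sub with_qr {
    my $total = 0;
    for my $line (@lines) {
        (my $copy = $line) =~ s/$as_qr/$characters{$1}/g;
        $total += length $copy;
    }
    return $total;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, { string => \&with_string, qr => \&with_qr });
```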

Note: Some people would use @{[]} interpolation here but although I'm
a proponent of @{[]} in here-docs I think it looks messy in qr//.

my $to_encode = qr/([@{[ join '', map quotemeta, keys %characters ]}])/;

 
