Replacing hundreds of hash keys with their values in a text document

Arvin Portlock · Feb 12, 2004

I'm writing a script that replaces the direct form of a
special character with its SDATA equivalent. For example,
it would replace all occurrences of é with &eacute;. I've
compiled an enormous hash with the "direct" form as the
key and the SDATA version as its value. I can think of two
ways to accomplish this. The first is to loop through all
keys and do a global replace with the correct value:

foreach my $key (keys %characters) {
    $fulltext =~ s/$key/$characters{$key}/g;
}

The second is to process the document character by character
and if the character is in the hash then replace it:

local $/ = undef;
open (FILE, $file);
my $fulltext = <FILE>;
close (FILE);
my @chars = split (//, $fulltext);
foreach my $char (@chars) {
    if ($characters{$char}) {
        print $characters{$char};
    } else {
        print $char;
    }
}

The second seems the faster option, but neither one of them
is exactly an elegant solution. Is there something obvious
I'm missing?

Arvin

Ben Morrow · Feb 12, 2004

If you're using 5.8, and don't mind having numeric entities
(such as &#233;) instead of named entities, you can do

use Encode qw/:fallbacks/;

$PerlIO::encoding::fallback = FB_HTMLCREF;
binmode STDOUT, ':encoding(ascii)';

open my $FILE, '<:encoding(latin1)', $file or die...;
# or whatever encoding is appropriate
print while <$FILE>;
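
The same fallback can be tried on an in-memory string, which may be easier to experiment with than filehandle layers. This is only a minimal sketch: the helper name to_numeric_refs and the sample text are made up for illustration, and it assumes the input is already a decoded character string.

```perl
use strict;
use warnings;
use Encode qw(encode FB_HTMLCREF);

# Hypothetical helper (not from the thread): encode to ASCII and let the
# FB_HTMLCREF fallback turn each non-ASCII character into a decimal
# numeric character reference.
sub to_numeric_refs {
    my ($text) = @_;    # a decoded (character) string
    return encode('ascii', $text, FB_HTMLCREF);
}

print to_numeric_refs("caf\x{e9} na\x{ef}ve"), "\n";
```

Plain ASCII passes through unchanged, so only the characters that cannot be represented are rewritten.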

Otherwise, I'd do

open my $FILE, $file or die...;
while (<$FILE>) {
    s/([^[:ascii:]])/$characters{$1}/g;
    print;
}

If your %characters doesn't include all the non-ascii in the file, you
could use

my $to_encode = '[' . (join '', keys %characters) . ']';
while (<$FILE>) {
    s/($to_encode)/$characters{$1}/g;
    print;
}
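
A self-contained sketch of the character-class approach follows. The three-entry table, the entity names in it, and the helper encode_line are assumptions for illustration only; the real %characters would be far larger. It also runs each key through quotemeta, on the assumption that a key such as ] or ^ could otherwise change the meaning of the class.

```perl
use strict;
use warnings;

# Hypothetical sample table, standing in for the real %characters.
my %characters = (
    "\x{e9}" => '&eacute;',
    "\x{e8}" => '&egrave;',
    "\x{fc}" => '&uuml;',
);

# Escape every key so that none of them is read as class syntax.
my $class = join '', map { quotemeta } keys %characters;

# Hypothetical helper: replace only the characters present in the table.
sub encode_line {
    my ($line) = @_;
    $line =~ s/([$class])/$characters{$1}/g;
    return $line;
}

print encode_line("caf\x{e9} \x{e0} la cr\x{e8}me\n");
```

Characters that are not keys of the hash (the \x{e0} here) are left alone, rather than being replaced by an undefined value.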

Ben

Arvin Portlock · Feb 13, 2004

If your %characters doesn't include all the non-ascii in the file, you
could use

my $to_encode = '[' . (join '', keys %characters) . ']';
while (<$FILE>) {
    s/($to_encode)/$characters{$1}/g;
    print;
}

Ben

Boy, do I feel like an idiot. That makes MUCH more sense and is just
what I'll do. I have no idea what I was thinking.

If you're using 5.8, and don't mind having numeric entities
(such as &#233;) instead of named entities, you can do

use Encode qw/:fallbacks/;

$PerlIO::encoding::fallback = FB_HTMLCREF;
binmode STDOUT, ':encoding(ascii)';

open my $FILE, '<:encoding(latin1)', $file or die...;
# or whatever encoding is appropriate
print while <$FILE>;

Nah, I have to use SDATA entities. I'm not dealing with HTML.
But this is a good trick for another project: converting unicode
characters to numeric decimal entities in HTML files so older
browsers can view them.

Thanks!

Arvin

Brian McCauley · Feb 13, 2004

Arvin Portlock said:
If your %characters doesn't include all the non-ascii in the file, you
could use

my $to_encode = '[' . (join '', keys %characters) . ']';
while (<$FILE>) {
    s/($to_encode)/$characters{$1}/g;
    print;
}

Ben

That makes MUCH more sense and is just what I'll do.

I've not benchmarked it, but I suspect it would be more efficient to
take the appending of the () outside the loop. I'd also explicitly
precompile the regex, although I think Perl will actually manage to
avoid unnecessary recompilation anyhow.

my $to_encode = join '', keys %characters;
$to_encode = qr/([$to_encode])/;
while (<$FILE>) {
    s/$to_encode/$characters{$1}/g;
    print;
}

Note: Some people would use @{[]} interpolation here but although I'm
a proponent of @{[]} in here-docs I think it looks messy in qr//.

my $to_encode = qr/([@{[ join '', keys %characters ]}])/;
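
If some keys turned out to be longer than one character, a character class would no longer fit, and a precompiled alternation might be used instead. The sketch below is built on that assumption, which the thread itself does not make; the table entries and the helper encode_text are hypothetical. Keys are escaped with quotemeta and sorted longest first so that a longer key is tried before any shorter key it starts with.

```perl
use strict;
use warnings;

# Hypothetical table; the multi-character key is an assumption.
my %characters = (
    "\x{e9}" => '&eacute;',
    '...'    => '&hellip;',
    '.'      => '&period;',
);

# Build the alternation once, longest key first, with each key escaped.
my $pattern = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) }
    keys %characters;
my $to_encode = qr/($pattern)/;

# Hypothetical helper: the compiled regex is reused on every call.
sub encode_text {
    my ($text) = @_;
    $text =~ s/$to_encode/$characters{$1}/g;
    return $text;
}

print encode_text("caf\x{e9}... fin.\n");
```

Because the replacement text is not rescanned by s///g, an inserted entity is never itself rewritten.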

--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\
