Replacing hundreds of hash keys with their values in a text document

Discussion in 'Perl Misc' started by Arvin Portlock, Feb 12, 2004.

  1. I'm writing a script that replaces the direct form of a
    special character with its SDATA equivalent. For example,
    it would replace all occurrences of é with &eacute;. I've
    compiled an enormous hash with the "direct" form as the
    key and the SDATA version as its value. I can think of two
    ways to accomplish this. The first is to loop through all
    keys and do a global replace with the correct value:

    foreach my $key (keys %characters) {
        $fulltext =~ s/$key/$characters{$key}/g;
    }

    The second is to process the document character by character
    and if the character is in the hash then replace it:

    local $/ = undef;
    open (FILE, $file);
    my $fulltext = <FILE>;
    close (FILE);
    my @chars = split (//, $fulltext);
    foreach my $char (@chars) {
        if ($characters{$char}) {
            print $characters{$char};
        } else {
            print $char;
        }
    }

    The second seems the faster option, but neither one of them
    is exactly an elegant solution. Is there something obvious
    I'm missing?

    Arvin
     
    Arvin Portlock, Feb 12, 2004
    #1

  2. Ben Morrow

    Arvin Portlock <> wrote:
    > I'm writing a script that replaces the direct form of a
    > special character with its SDATA equivalent. For example,
    > it would replace all occurrences of é with &eacute;. I've
    > compiled an enormous hash with the "direct" form as the
    > key and the SDATA version as its value. I can think of two
    > ways to accomplish this. The first is to loop through all
    > keys and do a global replace with the correct value:
    >
    > foreach my $key (keys %characters) {
    >     $fulltext =~ s/$key/$characters{$key}/g;
    > }
    >
    > The second is to process the document character by character
    > and if the character is in the hash then replace it:
    >
    > local $/ = undef;
    > open (FILE, $file);
    > my $fulltext = <FILE>;
    > close (FILE);
    > my @chars = split (//, $fulltext);
    > foreach my $char (@chars) {
    >     if ($characters{$char}) {
    >         print $characters{$char};
    >     } else {
    >         print $char;
    >     }
    > }
    >
    > The second seems the faster option, but neither one of them
    > is exactly an elegant solution. Is there something obvious
    > I'm missing?


    If you're using 5.8, and don't mind having numeric character
    references (such as &#233;) instead of named entities, you can do

    use Encode qw/:fallbacks/;

    $PerlIO::encoding::fallback = FB_HTMLCREF;
    binmode STDOUT, ':encoding(ascii)';

    open my $FILE, '<:encoding(latin1)', $file or die...;
    # or whatever encoding is appropriate
    print while <$FILE>;
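
    As a small self-contained illustration of the same fallback, the
    sketch below calls Encode's encode() directly on a string instead of
    going through an I/O layer. The helper name to_numeric_refs and the
    sample text are made up for the example; only encode() and
    FB_HTMLCREF come from the Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode :fallbacks);

# Hypothetical helper: turn a string of Perl characters into pure ASCII,
# replacing anything ASCII cannot represent with a decimal reference (&#NNN;).
sub to_numeric_refs {
    my ($text) = @_;    # a copy, since encode() may modify its input when a CHECK value is given
    return encode('ascii', $text, FB_HTMLCREF);
}

# Sample input: "café naïve" written with escapes so the source stays ASCII.
print to_numeric_refs("caf\x{e9} na\x{ef}ve"), "\n";    # prints: caf&#233; na&#239;ve
```

    This yields numeric references only, so it does not help when named
    SDATA entities are required.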

    Otherwise, I'd do

    open my $FILE, $file or die...;
    while (<$FILE>) {
        s/([^[:ascii:]])/$characters{$1}/g;
        print;
    }

    If your %characters doesn't include all the non-ascii in the file, you
    could use

    my $to_encode = '[' . (join '', keys %characters) . ']';
    while (<$FILE>) {
        s/($to_encode)/$characters{$1}/g;
        print;
    }
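
    One caveat with building a character class from the keys: a key such
    as ], ^, - or \ would change the meaning of the class, and a key
    longer than one character cannot go in a class at all. A hedged
    variant, assuming that could happen in your table, is to escape each
    key with quotemeta and join them into an alternation. The sample
    %characters and the sub name encode_text below are invented for the
    sketch.

```perl
use strict;
use warnings;

# Made-up sample table; the real %characters would be much larger.
my %characters = (
    "\x{e9}" => '&eacute;',
    "\x{e8}" => '&egrave;',
    ']'      => '&rsqb;',
    '^'      => '&circ;',
);

# Build one alternation, escaping each key. Longest keys go first so a
# multi-character key is tried before any shorter key that is its prefix.
my $pattern = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) || $a cmp $b }
    keys %characters;
my $to_encode = qr/($pattern)/;

# Replace every key found in the text with its value, in a single pass.
sub encode_text {
    my ($text) = @_;
    $text =~ s/$to_encode/$characters{$1}/g;
    return $text;
}

print encode_text("caf\x{e9} [1]^2\n");    # prints: caf&eacute; [1&rsqb;&circ;2
```

    If every key is a single ordinary character, the plain character
    class is simpler and does the same job.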

    Ben

    --
    Like all men in Babylon I have been a proconsul; like all, a slave ... During
    one lunar year, I have been declared invisible; I shrieked and was not heard,
    I stole my bread and was not decapitated.
    ~ ~ Jorge Luis Borges, 'The Babylon Lottery'
     
    Ben Morrow, Feb 12, 2004
    #2

  3. > If your %characters doesn't include all the non-ascii in the file, you
    > could use
    >
    > my $to_encode = '[' . (join '', keys %characters) . ']';
    > while (<$FILE>) {
    >     s/($to_encode)/$characters{$1}/g;
    >     print;
    > }
    >
    > Ben



    Boy, do I feel like an idiot. That makes MUCH more sense and is just
    what I'll do. I have no idea what I was thinking.

    > If you're using 5.8, and don't mind having numeric character
    > references (such as &#233;) instead of named entities, you can do
    >
    > use Encode qw/:fallbacks/;
    >
    > $PerlIO::encoding::fallback = FB_HTMLCREF;
    > binmode STDOUT, ':encoding(ascii)';
    >
    > open my $FILE, '<:encoding(latin1)', $file or die...;
    > # or whatever encoding is appropriate
    > print while <$FILE>;


    Nah, I have to use SDATA entities. I'm not dealing with HTML.
    But this is a good trick for another project: converting Unicode
    characters to numeric decimal entities in HTML files so older
    browsers can view them.

    Thanks!

    Arvin
     
    Arvin Portlock, Feb 13, 2004
    #3
  4. Arvin Portlock <> writes:

    > > If your %characters doesn't include all the non-ascii in the file, you
    > > could use
    > >
    > > my $to_encode = '[' . (join '', keys %characters) . ']';
    > > while (<$FILE>) {
    > >     s/($to_encode)/$characters{$1}/g;
    > >     print;
    > > }
    > >
    > > Ben

    >
    >
    > That makes MUCH more sense and is just what I'll do.


    I've not benchmarked it, but I suspect it would be more efficient to
    take the appending of the () outside the loop. I'd also explicitly
    precompile the regex, although I think Perl will actually manage to
    avoid unnecessary recompilation anyhow.

    my $to_encode = join '', keys %characters;
    $to_encode = qr/([$to_encode])/;
    while (<$FILE>) {
        s/$to_encode/$characters{$1}/g;
        print;
    }

    Note: Some people would use @{[]} interpolation here, but although I'm
    a proponent of @{[]} in here-docs, I think it looks messy in qr//.

    my $to_encode = qr/([@{[ join '', keys %characters ]}])/;
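
    For anyone who would rather measure than guess, here is a rough
    sketch using the core Benchmark module to compare one substitution
    per key against a single pass with the precompiled class. The
    three-entry %characters, the sample text, and the sub names per_key
    and single_pass are all stand-ins invented for the example; timings
    will depend on the real table and document.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up stand-ins for the real table and document.
my %characters = (
    "\x{e9}" => '&eacute;',
    "\x{e8}" => '&egrave;',
    "\x{e0}" => '&agrave;',
);
my $fulltext = "d\x{e9}j\x{e0} vu, tr\x{e8}s bien\n" x 1000;

# Precompile the character class once, outside any loop.
my $class     = join '', map { quotemeta } keys %characters;
my $to_encode = qr/([$class])/;

# One global substitution per key, as in the original question.
sub per_key {
    my ($text) = @_;
    $text =~ s/\Q$_\E/$characters{$_}/g for keys %characters;
    return $text;
}

# A single pass over the text with the precompiled class.
sub single_pass {
    my ($text) = @_;
    $text =~ s/$to_encode/$characters{$1}/g;
    return $text;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    per_key     => sub { per_key($fulltext) },
    single_pass => sub { single_pass($fulltext) },
});
```

    With hundreds of keys the per-key version rescans the whole text
    once per key, so the gap should widen as the table grows.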

    --
    \\ ( )
    . _\\__[oo
    .__/ \\ /\@
    . l___\\
    # ll l\\
    ###LL LL\\
     
    Brian McCauley, Feb 13, 2004
    #4