How to replace Unicode representation with actual character?

Discussion in 'Perl Misc' started by Wes Groleau, Dec 18, 2013.

  1. Wes Groleau

    Wes Groleau Guest

    I have a huge file with information about Chinese characters. But
    instead of the character, each line starts with the Unicode hex,
    e.g., U+AC34

    It would be trivial to use awk or perl to write a long script containing
    the substitution for each line, but then every line
    would have to be checked against every sub, for an N² processing time.

    Not good for 36K lines.

    What I tried to do instead was to use the hex value to compute the
    character, for linear (N) processing time.

    But my not-as-clever-as-I-thought method didn't work:

    iMac:Anki wgroleau$ perl -CSD -p -i -e \
    's:(U\+[A-F0-9]{4})(\s):\1\2\N{\1}\2:g;' \
    /tmp/Chars_Info.txt
    Unknown charname '\1' at -e line 1.
    Deprecated character in \N{...}; marked by <-- HERE in \N{\<-- HERE 1}
    at -e line 1.

    I suspect "there's more than one way" to do it,
    but a perl guru I am definitely not.
     
    Wes Groleau, Dec 18, 2013
    #1
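The "Unknown charname '\1'" error above is consistent with \N{...} being resolved when the code is compiled, so a capture such as \1 is never substituted into it. A minimal sketch of the difference, assuming the sample code point U+7684 and illustrative variable names that are not from the thread:

```perl
use strict;
use warnings;

# \N{U+7684} is fixed when the code is compiled, so the hex digits
# must be written out literally in the source.
my $literal = "\N{U+7684}";

# chr(hex(...)) runs at execution time, so it can take digits
# that were captured from input.
my $digits   = '7684';
my $computed = chr(hex($digits));

# Both routes should yield the same single character.
my $same = $literal eq $computed ? 1 : 0;
```

This is only an illustration of why a runtime conversion is needed here, not a diagnosis of every message in the output above.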

  2. So you got a textual representation of a code point, i.e. of a number.
    To convert this text back into an actual number that the program can
    work with you could use evil eval():
    my $s = '0xAC34';
    my $codepoint = eval "$s";
    And then simply use chr() to get the character at that code point:
    my $char = chr($codepoint);

    Converting 'U+AC34' into '0xAC34' beforehand is left as an exercise.

    jue
     
    Jürgen Exner, Dec 18, 2013
    #2

  3. Wes Groleau

    Wes Groleau Guest

    Hmmm, I was hoping for in-place substitution. Will this work?

    perl -CSD -p -i -e \
    's/U+([A-F0-9]{4})(?{chr(eval "$1");})\t/U+$1\t$^R\t/r;' \
    /tmp/Chars_Info.txt

    Nope. On every line, what was inserted for $^R was the bytes efbfbd.

    First line had U+7684, so the UTF-8 bytes hoped for are e79a84 for 的

    Every thing I tried after that inexplicably prevented matching.

    Ah, well, too late for me to be up anyway.

    (Slaps face) Oh, duh, forgot to put in the '0x' and escape the plus sign !!

    Hmmm. Still isn't matching. G'night, all!
     
    Wes Groleau, Dec 18, 2013
    #3
  4. * Jürgen Exner wrote in comp.lang.perl.misc:
    The `hex` and `oct` functions should be used instead.
     
    Bjoern Hoehrmann, Dec 18, 2013
    #4
  5. TIMTOWTDI, :)

    But of course you are right, hex() is the way to go.

    jue
     
    Jürgen Exner, Dec 18, 2013
    #5
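The hex() and chr() route agreed on above can be sketched as a small helper. The name char_from_label and the choice to accept 4 to 6 hex digits are assumptions made for illustration; the thread itself only deals with 4-digit labels.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a 'U+XXXX' label into the character it names.
sub char_from_label {
    my ($label) = @_;
    # Accept 4 to 6 hex digits after a literal 'U+'; reject anything else.
    my ($hex) = $label =~ /\AU\+([0-9A-Fa-f]{4,6})\z/
        or return;
    # hex() parses the digits as a number, chr() maps it to a character.
    return chr(hex($hex));
}

my $char = char_from_label('U+7684');
```

Unlike eval, hex() only ever parses a number, so unexpected input cannot run as code.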
  6. Saving your text to a file (UTF-8 encoded) and processing that with

    perl -pe 'BEGIN { binmode($_, 'utf8') for (*STDIN, *STDOUT) } s/U\+([A-F0-9]{4})/chr(hex($1))/eg'

    yields

    ,----
    | Hmmm, I was hoping for in-place substitution. Will this work?
    |
    | perl -CSD -p -i -e \
    | 's/U+([A-F0-9]{4})(?{chr(eval "$1");})\t/U+$1\t$^R\t/r;' \
    | /tmp/Chars_Info.txt
    |
    | Nope. On every line, what was inserted for $^R was the bytes efbfbd.
    |
    | First line had 的, so the UTF-8 bytes hoped for are e79a84 for 的
    |
    | Every thing I tried after that inexplicably prevented matching.
    `----

    NB: The binmode(STDOUT, 'utf8') isn't strictly needed; it's rather a kowtow
    in front of the idea that the character encoding used by perl should be
    "weird and different from anything else" because that's An
    Abstraction[tm].
     
    Rainer Weikusat, Dec 18, 2013
    #6
  7. Wes Groleau

    Wes Groleau Guest

    Thanks. That did replace the code with the character. But
    I apparently didn't express myself clearly. I want to keep the code
    and ADD the character. I tried several ways to prepend $1\t
    and kept getting syntax errors. FINALLY succeeded with

    perl -CSD -p -i -e \
    's/U\+([A-F0-9]{4})/"U+$1\t".chr(hex($1))/eg;' \
    /tmp/Chars_Info.txt
    As far as I can tell, -CSD makes _everything_ UTF-8.

    Why is -CSD the same as -C -S -D

    and -pe the same as -p -e

    but -pie and -CSDpie are errors

    when -CSD -p -i -e work fine ?

    Anyway, it works. Thanks very much guys
     
    Wes Groleau, Dec 19, 2013
    #7
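The substitution that finally worked can also be written as a subroutine, which makes it easy to check on a single line before editing a file in place. This is a sketch only: annotate_line and the sample line are made up for illustration, and the regex is the one from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: keep each 'U+XXXX' label and add the character
# it names right after it, separated by a tab.
sub annotate_line {
    my ($line) = @_;
    # /e evaluates the replacement as code; /g handles every label on the line.
    $line =~ s/U\+([A-F0-9]{4})/"U+$1\t" . chr(hex($1))/eg;
    return $line;
}

# Made-up sample line in the shape described in the thread.
my $annotated = annotate_line("U+7684\tsample info\n");
```

On the switch question, going by perlrun: -i takes an optional backup extension, so -pie is read as -p plus -i with the extension "e", which leaves no -e switch at all. Likewise -C takes its option letters as an argument, so -CSD is a single switch with the letters S and D rather than a cluster of -C, -S and -D, and in -CSDpie the extra letters are taken as further, invalid, -C options.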