How to replace Unicode representation with actual character?

Discussion in 'Perl Misc' started by Wes Groleau, Dec 18, 2013.

  1. Wes Groleau

    Wes Groleau Guest

    I have a huge file with information about Chinese characters. But
    instead of the character, each line starts with the Unicode hex,
    e.g., U+AC34

    It would be trivial to use awk or perl to write a long script containing
    the substitution for each line, but then every line
    would have to be checked against every sub, for an N² processing time.

    Not good for 36K lines.

    What I tried to do instead was to use the hex value to compute the
    character, for linear (N) processing time.

    But my not-as-clever-as-I-thought method didn't work:

    iMac:Anki wgroleau$ perl -CSD -p -i -e \
    's:(U\+[A-F0-9]{4})(\s):\1\2\N{\1}\2:g;' \
    /tmp/Chars_Info.txt
    Unknown charname '\1' at -e line 1.
    Deprecated character in \N{...}; marked by <-- HERE in \N{\<-- HERE 1}
    at -e line 1.

    I suspect "there's more than one way" to do it,
    but a perl guru I am definitely not.
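
    One likely reason the attempt above fails: \N{...} in a string or
    replacement is resolved when Perl compiles the code, so it cannot take
    its hex digits from a capture group at run time. A minimal sketch of the
    difference, where the helper name char_from_hex is made up for
    illustration:

```perl
use strict;
use warnings;

# Compile time: the hex digits are written literally in the source.
my $literal = "\N{U+7684}";

# Run time: the hex digits arrive in a variable, so convert them
# with hex() and chr() instead of \N{...}.
# char_from_hex is a hypothetical helper, not part of any module.
sub char_from_hex {
    my ($hex_digits) = @_;
    return chr(hex($hex_digits));
}

my $computed = char_from_hex('7684');
print "same character\n" if $literal eq $computed;
```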

    --
    Wes Groleau

    He that is good for making excuses, is seldom good for anything else.
    — Benjamin Franklin
    Wes Groleau, Dec 18, 2013
    #1

  2. Wes Groleau <> wrote:
    >I have a huge file with information about Chinese characters. But
    >instead of the character, each line starts with the Unicode hex,
    >e.g., U+AC34


    So you got a textual representation of a code point, i.e. of a number.
    To convert this text back into an actual number that the program can
    work with you could use evil eval():
    my $s = '0xAC34';
    my $codepoint = eval "$s";
    And then simply use chr() to get the character at that code point:
    my $char = chr($codepoint);

    Converting 'U+AC34' into '0xAC34' beforehand is left as an exercise.

    jue
    Jürgen Exner, Dec 18, 2013
    #2

  3. Wes Groleau

    Wes Groleau Guest

    On 12-17-2013, 23:23, Jürgen Exner wrote:
    > Wes Groleau <> wrote:
    >> I have a huge file with information about Chinese characters. But
    >> instead of the character, each line starts with the Unicode hex,
    >> e.g., U+AC34

    >
    > So you got a textual representation of a code point, i.e. of number.
    > To convert this text back into an actual number that the program can
    > work with you could use evil eval():
    > my $s = '0xAC34';
    > my $codepoint = eval "$s";
    > And then simply use chr() to get the character at that code point:
    > my $char = chr($codepoint);
    >
    > Converting 'U+AC34' into '0xAC34' beforehand is left as an exercise.


    Hmmm, I was hoping for in-place substitution. Will this work?

    perl -CSD -p -i -e \
    's/U+([A-F0-9]{4})(?{chr(eval "$1");})\t/U+$1\t$^R\t/r;' \
    /tmp/Chars_Info.txt

    Nope. On every line, what was inserted for $^R was the bytes efbfbd.

    First line had U+7684, so the UTF-8 bytes hoped for are e79a84 for 的

    Every thing I tried after that inexplicably prevented matching.

    Ah, well, too late for me to be up anyway.

    (Slaps face) Oh, duh, forgot to put in the '0x' and escape the plus sign !!

    Hmmm. Still isn't matching. G'night, all!
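
    The effect of the unescaped plus sign can be checked in isolation. The
    sketch below uses a made-up sample line (the real file contents are not
    shown in the thread) and compares the two patterns:

```perl
use strict;
use warnings;

# Made-up sample line in the "code, tab, description" shape described above.
my $line = "U+7684\tsome description";

# Unescaped: 'U+' means "one or more U", so a hex digit would have to
# follow the U directly. The literal '+' in the data blocks the match.
my $unescaped = ($line =~ /U+([A-F0-9]{4})/) ? 1 : 0;

# Escaped: '\+' matches the literal plus sign in 'U+7684'.
my $escaped = ($line =~ /U\+([A-F0-9]{4})/) ? 1 : 0;

print "unescaped matched: $unescaped\n";
print "escaped matched:   $escaped\n";
```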

    --
    Wes Groleau

    Expert, n.:
    Someone who comes from out of town and shows slides.
    Wes Groleau, Dec 18, 2013
    #3
  4. * Jürgen Exner wrote in comp.lang.perl.misc:
    >Wes Groleau <> wrote:
    >>I have a huge file with information about Chinese characters. But
    >>instead of the character, each line starts with the Unicode hex,
    >>e.g., U+AC34

    >
    >So you got a textual representation of a code point, i.e. of number.
    >To convert this text back into an actual number that the program can
    >work with you could use evil eval():
    > my $s = '0xAC34';
    > my $codepoint = eval "$s";


    The `hex` and `oct` functions should be used instead.
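
    As a rough sketch of that suggestion, the conversion from the 'U+AC34'
    form can be done without eval. The helper name char_from_uplus and the
    choice to accept 4 to 6 hex digits are assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a string like 'U+AC34' into the character itself.
# Returns nothing (undef in scalar context) if the string is not in
# U+XXXX form with 4 to 6 hex digits.
sub char_from_uplus {
    my ($text) = @_;
    return unless $text =~ /\AU\+([0-9A-Fa-f]{4,6})\z/;
    return chr(hex($1));    # hex() parses the digits, chr() builds the character
}

my $char = char_from_uplus('U+AC34');
printf "code point of result: U+%04X\n", ord($char);
```

    Unlike eval, hex() only ever parses digits, so unexpected text in the
    input file cannot be executed as code.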
    --
    Björn Höhrmann · mailto: · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
    Bjoern Hoehrmann, Dec 18, 2013
    #4
  5. Bjoern Hoehrmann <> wrote:
    >* Jürgen Exner wrote in comp.lang.perl.misc:
    >>Wes Groleau <> wrote:
    >>>I have a huge file with information about Chinese characters. But
    >>>instead of the character, each line starts with the Unicode hex,
    >>>e.g., U+AC34

    >>
    >>So you got a textual representation of a code point, i.e. of number.
    >>To convert this text back into an actual number that the program can
    >>work with you could use evil eval():
    >> my $s = '0xAC34';
    >> my $codepoint = eval "$s";

    >
    >The `hex` and `oct` functions should be used instead.


    TIMTOWTDI, :)

    But of course you are right, hex() is the way to go.

    jue
    Jürgen Exner, Dec 18, 2013
    #5
  6. Wes Groleau <> writes:
    > On 12-17-2013, 23:23, Jürgen Exner wrote:
    >> Wes Groleau <> wrote:
    >>> I have a huge file with information about Chinese characters. But
    >>> instead of the character, each line starts with the Unicode hex,
    >>> e.g., U+AC34

    >>
    >> So you got a textual representation of a code point, i.e. of number.
    >> To convert this text back into an actual number that the program can
    >> work with you could use evil eval():
    >> my $s = '0xAC34';
    >> my $codepoint = eval "$s";
    >> And then simply use chr() to get the character at that code point:
    >> my $char = chr($codepoint);
    >>
    >> Converting 'U+AC34' into '0xAC34' beforehand is left as an exercise.

    >
    > Hmmm, I was hoping for in-place substitution. Will this work?
    >
    > perl -CSD -p -i -e \
    > 's/U+([A-F0-9]{4})(?{chr(eval "$1");})\t/U+$1\t$^R\t/r;' \
    > /tmp/Chars_Info.txt
    >
    > Nope. On every line, what was inserted for $^R was the bytes efbfbd.
    >
    > First line had U+7684, so the UTF-8 bytes hoped for are e79a84 for 的
    >
    > Every thing I tried after that inexplicably prevented matching.
    >
    > Ah, well, too late for me to be up anyway.
    >
    > (Slaps face) Oh, duh, forgot to put in the '0x' and escape the plus sign !!
    >
    > Hmmm. Still isn't matching.


    Saving your text to a file (utf-8 encoded) and processing that with

    perl -pe 'BEGIN { binmode($_, ":utf8") for (*STDIN, *STDOUT) } s/U\+([A-F0-9]{4})/chr(hex($1))/eg'

    yields

    ,----
    | Hmmm, I was hoping for in-place substitution. Will this work?
    |
    | perl -CSD -p -i -e \
    | 's/U+([A-F0-9]{4})(?{chr(eval "$1");})\t/U+$1\t$^R\t/r;' \
    | /tmp/Chars_Info.txt
    |
    | Nope. On every line, what was inserted for $^R was the bytes efbfbd.
    |
    | First line had 的, so the UTF-8 bytes hoped for are e79a84 for 的
    |
    | Every thing I tried after that inexplicably prevented matching.
    `----

    NB: The binmode(STDOUT, 'utf8') isn't strictly needed, it's rather a
    kowtow to the idea that the character encoding used by perl should be
    "weird and different from anything else" because that's An
    Abstraction[tm].
    Rainer Weikusat, Dec 18, 2013
    #6
  7. Wes Groleau

    Wes Groleau Guest

    On 12-18-2013, 07:11, Rainer Weikusat wrote:
    > Saving your text to a file (utf-8 encoded) and processing that with
    >
    > perl -pe 'BEGIN { binmode($_, ":utf8") for (*STDIN, *STDOUT) } s/U\+([A-F0-9]{4})/chr(hex($1))/eg'
    >
    > yields


    Thanks. That did replace the code with the character. But
    I apparently didn't express myself clearly. I want to keep the code
    and ADD the character. I tried several ways to prepend $1\t
    and kept getting syntax errors. FINALLY succeeded with

    perl -CSD -p -i -e \
    's/U\+([A-F0-9]{4})/"U+$1\t".chr(hex($1))/eg;' \
    /tmp/Chars_Info.txt
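
    For anyone who prefers a script to a one-liner, the same substitution
    can be sketched as below. The helper name annotate_line and the sample
    input are made up for illustration, and the script prints to STDOUT
    instead of editing the file in place:

```perl
use strict;
use warnings;

# Hypothetical helper: keep each U+XXXX code and add a tab plus the
# character it names, mirroring the one-liner's s///eg substitution.
sub annotate_line {
    my ($line) = @_;
    $line =~ s/U\+([A-F0-9]{4})/"U+$1\t" . chr(hex($1))/eg;
    return $line;
}

# Encode output as UTF-8, roughly what -CS does for STDOUT on the command line.
binmode(STDOUT, ':encoding(UTF-8)');
print annotate_line("U+7684\tsample entry\n");
```

    The /e flag is what allows the replacement side to be an expression
    (string concatenation plus chr(hex($1))) and not a plain string.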

    > NB: The binmode(STDOUT, 'utf8') isn't strictly needed, its rather a kow tow
    > in front of the idea that the character encoding used by perl should be
    > "weird and different from anything else" because that's An
    > Abstraction[tm].


    As far as I can tell, -CSD makes _everything_ UTF-8.

    Why is -CSD the same as -C -S -D

    and -pe the same as -p -e

    but -pie and -CSDpie are errors

    when -CSD -p -i -e work fine ?

    Anyway, it works. Thanks very much guys

    --
    Wes Groleau

    You always have time for what you do first.
    Wes Groleau, Dec 19, 2013
    #7
