How to replace UniCode representation with actual character?

Wes Groleau · Dec 17, 2013

I have a huge file with information about Chinese characters. But
instead of the character, each line starts with the Unicode hex,
e.g., U+AC34

It would be trivial to use awk or perl to write a long script containing
the substitution for each line, but then every line
would have to be checked against every sub, for an N² processing time.

Not good for 36K lines.

What I tried to do instead was to use the hex value to compute the
character, for an N processing time.

But my not-as-clever-as-I-thought method didn't work:

iMac:Anki wgroleau$ perl -CSD -p -i -e \
's:(U\+[A-F0-9]{4})(\s):\1\2\N{\1}\2:g;' \
/tmp/Chars_Info.txt
Unknown charname '\1' at -e line 1.
Deprecated character in \N{...}; marked by <-- HERE in \N{\<-- HERE 1}
at -e line 1.

I suspect "there's more than one way" to do it,
but a perl guru I am definitely not.

Jürgen Exner · Dec 17, 2013

Wes Groleau said:
I have a huge file with information about Chinese characters. But
instead of the character, each line starts with the Unicode hex,
e.g., U+AC34

So you got a textual representation of a code point, i.e. of number.
To convert this text back into an actual number that the program can
work with you could use evil eval():
my $s = '0xAC34';
my $codepoint = eval "$s";
And then simply use chr() to get the character at that code point:
my $char = chr($codepoint);

Converting 'U+AC34' into '0xAC34' beforehand is left as an exercise.

jue

Wes Groleau · Dec 18, 2013

Jürgen Exner said:
So you got a textual representation of a code point, i.e. of number.
To convert this text back into an actual number that the program can
work with you could use evil eval():
my $s = '0xAC34';
my $codepoint = eval "$s";
And then simply use chr() to get the character at that code point:
my $char = chr($codepoint);

Converting 'U+AC34' into '0xAC34' beforehand is left as an exercise.

Hmmm, I was hoping for in-place substitution. Will this work?

perl -CSD -p -i -e \
's/U+([A-F0-9]{4})(?{chr(eval "$1");})\t/U+$1\t$^R\t/r;' \
/tmp/Chars_Info.txt

Nope. On every line, what was inserted for $^R was the bytes efbfbd.

First line had U+7684, so the UTF-8 bytes hoped for are e79a84 for 的

Every thing I tried after that inexplicably prevented matching.

Ah, well, too late for me to be up anyway.

(Slaps face) Oh, duh, forgot to put in the '0x' and escape the plus sign !!

Hmmm. Still isn't matching. G'night, all!

Bjoern Hoehrmann · Dec 18, 2013

* Jürgen Exner wrote in comp.lang.perl.misc:

So you got a textual representation of a code point, i.e. of number.
To convert this text back into an actual number that the program can
work with you could use evil eval():
my $s = '0xAC34';
my $codepoint = eval "$s";

The `hex` and `oct` functions should be used instead.

Jürgen Exner · Dec 18, 2013

Bjoern Hoehrmann said:
* Jürgen Exner wrote in comp.lang.perl.misc:

The `hex` and `oct` functions should be used instead.

TIMTOWTDI,

But of course you are right, hex() is the way to go.

jue

Rainer Weikusat · Dec 18, 2013

Wes Groleau said:
So you got a textual representation of a code point, i.e. of number.
To convert this text back into an actual number that the program can
work with you could use evil eval():
my $s = '0xAC34';
my $codepoint = eval "$s";
And then simply use chr() to get the character at that code point:
my $char = chr($codepoint);

Converting 'U+AC34' into '0xAC34' beforehand is left as an exercise.

Hmmm, I was hoping for in-place substitution. Will this work?

perl -CSD -p -i -e \
's/U+([A-F0-9]{4})(?{chr(eval "$1");})\t/U+$1\t$^R\t/r;' \
/tmp/Chars_Info.txt

Nope. On every line, what was inserted for $^R was the bytes efbfbd.

First line had U+7684, so the UTF-8 bytes hoped for are e79a84 for 的

Every thing I tried after that inexplicably prevented matching.

Ah, well, too late for me to be up anyway.

(Slaps face) Oh, duh, forgot to put in the '0x' and escape the plus sign !!

Hmmm. Still isn't matching.

Saving your text to a file (utf-8 encoded) and processing that with

perl -pe 'BEGIN { binmode($_, 'utf8') for (*STDIN, *STDOUT) } s/U\+([A-F0-9]{4})/chr(hex($1))/eg'

yields

,----
| Hmmm, I was hoping for in-place substitution. Will this work?
|
| perl -CSD -p -i -e \
| 's/U+([A-F0-9]{4})(?{chr(eval "$1");})\t/U+$1\t$^R\t/r;' \
| /tmp/Chars_Info.txt
|
| Nope. On every line, what was inserted for $^R was the bytes efbfbd.
|
| First line had 的, so the UTF-8 bytes hoped for are e79a84 for 的
|
| Every thing I tried after that inexplicably prevented matching.
`----

NB: The binmode(STDOUT, 'utf8') isn't strictly needed, it's rather a kowtow
in front of the idea that the character encoding used by perl should be
"weird and different from anything else" because that's An
Abstraction[tm].

Wes Groleau · Dec 18, 2013

Rainer Weikusat said:
Saving your text to a file (utf-8 encoded) and processing that with

perl -pe 'BEGIN { binmode($_, 'utf8') for (*STDIN, *STDOUT) } s/U\+([A-F0-9]{4})/chr(hex($1))/eg'

yields

Thanks. That did replace the code with the character. But
apparently I didn't express myself clearly. I want to keep the code
and ADD the character. I tried several ways to prepend $1\t
and kept getting syntax errors. I FINALLY succeeded with

perl -CSD -p -i -e \
's/U\+([A-F0-9]{4})/"U+$1\t".chr(hex($1))/eg;' \
/tmp/Chars_Info.txt

NB: The binmode(STDOUT, 'utf8') isn't strictly needed, it's rather a kowtow
in front of the idea that the character encoding used by perl should be
"weird and different from anything else" because that's An
Abstraction[tm].

As far as I can tell, -CSD makes _everything_ UTF-8.

Why is -CSD the same as -C -S -D

and -pe the same as -p -e

but -pie and -CSDpie are errors

when -CSD -p -i -e work fine ?

Anyway, it works. Thanks very much, guys.
