How to replace Unicode representation with actual character?


Wes Groleau

I have a huge file with information about Chinese characters. But
instead of the character, each line starts with the Unicode hex,
e.g., U+AC34

It would be trivial to use awk or perl to write a long script containing
the substitution for each line, but then every line
would have to be checked against every sub, for an N² processing time.

Not good for 36K lines.

What I tried to do instead was to use the hex value to compute the
character, for an N (linear) processing time.

But my not-as-clever-as-I-thought method didn't work:

iMac:Anki wgroleau$ perl -CSD -p -i -e \
's:(U\+[A-F0-9]{4})(\s):\1\2\N{\1}\2:g;' \
/tmp/Chars_Info.txt
Unknown charname '\1' at -e line 1.
Deprecated character in \N{...}; marked by <-- HERE in \N{\<-- HERE 1}
at -e line 1.

I suspect "there's more than one way" to do it,
but a perl guru I am definitely not.
 
Jürgen Exner

Wes Groleau said:
I have a huge file with information about Chinese characters. But
instead of the character, each line starts with the Unicode hex,
e.g., U+AC34

So you got a textual representation of a code point, i.e. of a number.
To convert this text back into an actual number that the program can
work with you could use evil eval():
my $s = '0xAC34';
my $codepoint = eval "$s";
And then simply use chr() to get the character at that code point:
my $char = chr($codepoint);

Converting 'U+AC34' into '0xAC34' beforehand is left as an exercise.

jue
 
Wes Groleau

Jürgen Exner said:
So you got a textual representation of a code point, i.e. of a number.
To convert this text back into an actual number that the program can
work with you could use evil eval():
my $s = '0xAC34';
my $codepoint = eval "$s";
And then simply use chr() to get the character at that code point:
my $char = chr($codepoint);

Converting 'U+AC34' into '0xAC34' beforehand is left as an exercise.

Hmmm, I was hoping for in-place substitution. Will this work?

perl -CSD -p -i -e \
's/U+([A-F0-9]{4})(?{chr(eval "$1");})\t/U+$1\t$^R\t/r;' \
/tmp/Chars_Info.txt

Nope. On every line, what was inserted for $^R was the bytes efbfbd.

First line had U+7684, so the UTF-8 bytes hoped for are e79a84 for 的

Every thing I tried after that inexplicably prevented matching.

Ah, well, too late for me to be up anyway.

(Slaps face) Oh, duh, forgot to put in the '0x' and escape the plus sign !!

Hmmm. Still isn't matching. G'night, all!
 
Bjoern Hoehrmann

* Jürgen Exner wrote in comp.lang.perl.misc:
So you got a textual representation of a code point, i.e. of a number.
To convert this text back into an actual number that the program can
work with you could use evil eval():
my $s = '0xAC34';
my $codepoint = eval "$s";

The `hex` and `oct` functions should be used instead.
 
Jürgen Exner

Bjoern Hoehrmann said:
The `hex` and `oct` functions should be used instead.

TIMTOWTDI, :)

But of course you are right, hex() is the way to go.

jue
 
Rainer Weikusat

Wes Groleau said:
Hmmm, I was hoping for in-place substitution. Will this work?

perl -CSD -p -i -e \
's/U+([A-F0-9]{4})(?{chr(eval "$1");})\t/U+$1\t$^R\t/r;' \
/tmp/Chars_Info.txt

Nope. On every line, what was inserted for $^R was the bytes efbfbd.

First line had U+7684, so the UTF-8 bytes hoped for are e79a84 for 的

Every thing I tried after that inexplicably prevented matching.

Ah, well, too late for me to be up anyway.

(Slaps face) Oh, duh, forgot to put in the '0x' and escape the plus sign !!

Hmmm. Still isn't matching.

Saving your text to a file (utf-8 encoded) and processing that with

perl -pe 'BEGIN { binmode($_, ":utf8") for (*STDIN, *STDOUT) } s/U\+([A-F0-9]{4})/chr(hex($1))/eg'

yields

,----
| Hmmm, I was hoping for in-place substitution. Will this work?
|
| perl -CSD -p -i -e \
| 's/U+([A-F0-9]{4})(?{chr(eval "$1");})\t/U+$1\t$^R\t/r;' \
| /tmp/Chars_Info.txt
|
| Nope. On every line, what was inserted for $^R was the bytes efbfbd.
|
| First line had 的, so the UTF-8 bytes hoped for are e79a84 for 的
|
| Every thing I tried after that inexplicably prevented matching.
`----

NB: The binmode(STDOUT, 'utf8') isn't strictly needed, it's rather a kowtow
in front of the idea that the character encoding used by perl should be
"weird and different from anything else" because that's An
Abstraction[tm].
 
Wes Groleau

Rainer Weikusat said:
Saving your text to a file (utf-8 encoded) and processing that with

perl -pe 'BEGIN { binmode($_, ":utf8") for (*STDIN, *STDOUT) } s/U\+([A-F0-9]{4})/chr(hex($1))/eg'

yields

Thanks. That did replace the code with the character. But
I apparently didn't express myself clearly. I want to keep the code
and ADD the character. I tried several ways to prepend $1\t
and kept getting syntax errors. FINALLY succeeded with

perl -CSD -p -i -e \
's/U\+([A-F0-9]{4})/"U+$1\t".chr(hex($1))/eg;' \
/tmp/Chars_Info.txt
Rainer Weikusat said:
NB: The binmode(STDOUT, 'utf8') isn't strictly needed, it's rather a kowtow
in front of the idea that the character encoding used by perl should be
"weird and different from anything else" because that's An
Abstraction[tm].

As far as I can tell, -CSD makes _everything_ UTF-8.

Why is -CSD the same as -C -S -D, and -pe the same as -p -e,
but -pie and -CSDpie are errors when -CSD -p -i -e works fine?

Anyway, it works. Thanks very much, guys.
 
