Replace Unicode character

Ryan Chan

Hello,

Below is my code, which is meant to replace the Unicode character "□" with an
empty string. What is wrong with the code?

###################

use strict;
use warnings;
use utf8;

my $s = "□"; # hex value = A1BC

$s =~ s/\xA1\xBC//gi;
print $s;

###################


Thanks.
 
Ryan Chan

Hello,

Your regexp replaces TWO characters, first one A1, second one BC.

Since your target string does not contain either of these
characters, nothing happens.

  BugBear


Even if I use

$s =~ s/\xA1BC//gi;

the result is the same...

Thanks anyway
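
To see why neither attempt matches, it can help to check how Perl parses each escape. The following is a minimal sketch (the variable names are illustrative, not from the original posts): without braces, `\x` takes at most two hex digits, so `\xA1BC` is `\xA1` followed by the literal letters "B" and "C".

```perl
use strict;
use warnings;

# \x without braces consumes at most two hex digits.
my $two_escapes = "\xA1\xBC";   # two characters: U+00A1, U+00BC
my $short_form  = "\xA1BC";     # three characters: U+00A1, "B", "C"
my $braced      = "\x{A1BC}";   # one character: U+A1BC

for my $str ($two_escapes, $short_form, $braced) {
    # List the code point of every character in the string.
    my @code_points = map { sprintf("U+%04X", ord($_)) } split //, $str;
    printf "%d character(s): %s\n", length($str), join(" ", @code_points);
}
```

None of the three forms contains U+25A1, so none of them can match the character in the question.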
 
Ben Bullock

$s =~ s/\xA1BC//gi;

\x{A1BC} works though.

It's documented in "perldoc perlunicode".

According to Unicode::UCD this is the character "YI SYLLABLE LIEX".
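
A name lookup like the one mentioned above could be sketched as follows. `charinfo` is part of Unicode::UCD; the helper `unicode_name` is a hypothetical wrapper added here for illustration.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: return the Unicode name of a code point,
# or undef when charinfo() has no entry for it.
sub unicode_name {
    my ($code_point) = @_;
    my $info = charinfo($code_point);   # hash ref, or undef if unassigned
    return defined $info ? $info->{name} : undef;
}

for my $code_point (0xA1BC, 0x25A1) {
    my $name = unicode_name($code_point);
    printf "U+%04X: %s\n", $code_point, defined $name ? $name : '(unassigned)';
}
```

Comparing the names of a few candidate code points this way is a quick check on whether a hex value really denotes the intended character.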
 
Peter J. Holzer

Also note that the bytes "\xA1\xBC" are not the UTF-8 encoding of U+A1BC. In
fact, "\xA1\xBC" is not valid UTF-8 at all: U+A1BC is "\xEA\x86\xBC" in
UTF-8, and the character in Ryan's posting was U+25A1 (WHITE SQUARE), or
"\xE2\x96\xA1" in UTF-8.

hp
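
The byte sequences quoted above can be checked with the Encode module. The sketch below also shows the substitution from the question written against U+25A1. The helper `utf8_hex` and the sample string are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of a character string's UTF-8 encoding.
sub utf8_hex {
    my ($chars) = @_;
    my @bytes = unpack 'C*', encode('UTF-8', $chars);
    return join ' ', map { sprintf('%02X', $_) } @bytes;
}

printf "U+25A1 -> %s\n", utf8_hex("\x{25A1}");   # E2 96 A1
printf "U+A1BC -> %s\n", utf8_hex("\x{A1BC}");   # EA 86 BC

# The substitution from the question, matching the actual code point.
my $s = "a\x{25A1}b";
(my $cleaned = $s) =~ s/\x{25A1}//g;
print "$cleaned\n";                              # ab
```

This assumes `$s` holds decoded characters, as it does for a literal under `use utf8`. If the data is read from a file or socket, it would need to be decoded first for the same pattern to match.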
 
Jochen Lehmeier

Below is my code, which is meant to replace the Unicode character "□" with an
empty string. What is wrong with the code?

Since it has not been spelled out yet:

$s contains one character. The regex contains two characters. One
character never matches two characters.

Funnily, if you're working in a UTF-8 environment, even a simple \xA1 can
actually be stored as two *bytes*:
perl -e '$s="\xa1"; print $s; binmode STDOUT,":encoding(utf8)"; print $s;' | hexdump -C
00000000 a1 c2 a1 |...|
00000003
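
The same difference between characters and bytes can be seen without an output layer by encoding the string explicitly. This is a small sketch using the Encode module. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char_string = "\xA1";                          # one character, U+00A1
my $utf8_bytes  = encode('UTF-8', $char_string);   # the bytes C2 A1

printf "characters: %d\n", length($char_string);          # 1
printf "bytes:      %d\n", length($utf8_bytes);           # 2
printf "hex:        %s\n", unpack('H*', $utf8_bytes);     # c2a1
```

`length` counts characters in the first string and bytes in the second, because the result of `encode` is a byte string.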
 
sln

I guess scalar data can actually be stored as bytes (0..255) before, say,
decoding octets into Perl's internal form. The resulting string is either
all ASCII or a mix with the utf8 flag turned on (character semantics).

I think this is the base storage strategy for Perl. It speeds things up.
Encoding just converts it back into octets, turning off the utf8 flag
(byte semantics). This process is not always symmetrical, and there is
sometimes more than one encoded representation of the same thing.

Sort of a bastardized system.

-sln
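
One way to look at the storage question is to compare the same character held in both internal forms. The sketch below uses the built-in `utf8::upgrade` and `utf8::is_utf8` functions. The variable names are illustrative, and the flag is shown only for inspection, since ordinary code rarely needs to test it.

```perl
use strict;
use warnings;

my $downgraded = "\xA1";      # held internally as the single byte A1
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);     # same character, now held internally as UTF-8

printf "equal strings: %s\n", ($downgraded eq $upgraded) ? 'yes' : 'no';
printf "lengths:       %d and %d\n", length($downgraded), length($upgraded);
printf "utf8 flag:     %s and %s\n",
    (utf8::is_utf8($downgraded) ? 'on' : 'off'),
    (utf8::is_utf8($upgraded)   ? 'on' : 'off');
```

The two strings compare equal and both have length 1. Only the internal storage differs, which suggests the flag describes how a string is stored rather than what it contains.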
 
