Replace Unicode character

Ryan Chan · Oct 5, 2009

Hello,

Below is my code, which is meant to replace the Unicode character "□" with
an empty string. What is wrong with the code?

###################

use strict;
use warnings;
use utf8;

my $s = "□"; # hex value = A1BC

$s =~ s/\xA1\xBC//gi;
print $s;

###################

Thanks.

Ryan Chan · Oct 5, 2009

Hello,

Your regexp replaces TWO characters, first one A1, second one BC.

Since your target string does not contain either of these
characters, nothing happens.

BugBear

Even if I use

$s =~ s/\xA1BC//gi;

the result is the same...

Thanks anyway

Ben Bullock · Oct 5, 2009

$s =~ s/\xA1BC//gi;

\x{A1BC} works though.

It's documented in "perldoc perlunicode".

According to Unicode::UCD this is the character "YI SYLLABLE LIEX".

Peter J. Holzer · Oct 5, 2009

\x{A1BC} works though.

It's documented in "perldoc perlunicode".

According to Unicode::UCD this is the character "YI SYLLABLE LIEX".

Also note that the byte sequence "\xA1\xBC" is not the UTF-8 encoding of
U+A1BC. In fact, "\xA1\xBC" is not valid UTF-8 at all: U+A1BC is
"\xEA\x86\xBC" in UTF-8. And the character in Ryan's posting was U+25A1
(WHITE SQUARE), which is "\xE2\x96\xA1" in UTF-8.
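
Putting those numbers together, here is a minimal sketch of a replacement that should work, assuming the script file is saved as UTF-8 so that "use utf8" decodes the literal to the single character U+25A1. The helper names strip_white_square and utf8_hex are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                       # string literals in this file are UTF-8
use Encode qw(encode);

# Hypothetical helper: remove every U+25A1 (WHITE SQUARE) from a string.
sub strip_white_square {
    my ($s) = @_;
    $s =~ s/\x{25A1}//g;        # braces: one code point, not \x25 plus "A1"
    return $s;
}

# Hypothetical helper: UTF-8 bytes of one code point as uppercase hex.
sub utf8_hex {
    my ($cp) = @_;
    return uc(unpack('H*', encode('UTF-8', chr($cp))));
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_white_square("a□b"), "\n";   # prints "ab"
print utf8_hex(0x25A1), "\n";            # prints "E296A1"
print utf8_hex(0xA1BC), "\n";            # prints "EA86BC"
```

The /i modifier from the original code is dropped here because the character has no case variants.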

hp

Jochen Lehmeier · Oct 5, 2009

Below is my code, which is meant to replace the Unicode character "□" with
an empty string. What is wrong with the code?

Since it has not been spelled out yet:

$s contains one character. The regex contains two characters. One
character never matches two characters.
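
The character counts can be checked directly. This is only a sketch: it writes the character as the escape \x{25A1} so that it does not depend on how the file is saved, and the helper name has_a1_bc is made up.

```perl
use strict;
use warnings;

my $s = "\x{25A1}";             # the single character U+25A1

# Hypothetical helper: does the string contain U+00A1 followed by U+00BC?
sub has_a1_bc { return $_[0] =~ /\xA1\xBC/ ? 1 : 0 }

printf "length of \$s: %d\n", length($s);                    # 1
printf "pattern matches \$s: %d\n", has_a1_bc($s);           # 0
printf "pattern matches A1,BC: %d\n", has_a1_bc("\xA1\xBC"); # 1

# Without braces, \x takes at most two hex digits, so "\xA1BC"
# is U+00A1 followed by the letters "B" and "C".
printf "length of \"\\xA1BC\": %d\n", length("\xA1BC");      # 3
```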

Funnily enough, if you're working in a UTF-8 environment, even a simple
\xA1 can actually be stored as two *bytes*:

perl -e '$s="\xa1"; print $s; binmode STDOUT,":encoding(utf8)"; print $s;' | hexdump -C

00000000 a1 c2 a1 |...|
00000003
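
The same effect can be reproduced without a pipe by encoding the string explicitly. A small sketch using the core Encode module; the helper name encoded_hex is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string and return the
# resulting bytes as space-separated lowercase hex.
sub encoded_hex {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join ' ', unpack('(H2)*', $bytes);
}

my $s = "\xa1";                               # one character, U+00A1
print encoded_hex('ISO-8859-1', $s), "\n";    # prints "a1"
print encoded_hex('UTF-8',      $s), "\n";    # prints "c2 a1"
```

One character in both cases; only the number of bytes on output differs.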

sln · Oct 5, 2009

Since it has not been spelled out yet:

$s contains one character. The regex contains two characters. One
character never matches two characters.

Funnily enough, if you're working in a UTF-8 environment, even a simple
\xA1 can actually be stored as two *bytes*:

00000000 a1 c2 a1 |...|
00000003

I guess scalar data can be stored as bytes (0..255) before, say, decoding
the octets into Perl's internal form. The resulting string is then either
all ASCII or a mix, with the utf8 flag turned on (character semantics).

I think this is the base storage strategy for Perl. It speeds things up.
Encoding just converts it back into octets, turning off the utf8 flag
(byte semantics). This process is not always symmetrical, and there is
sometimes more than one encoded representation of the same thing.
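
For what it's worth, the internal flag can be inspected with the built-in utf8:: functions. The sketch below only illustrates the two internal representations of one and the same character; it is not a statement about how any particular program should handle them.

```perl
use strict;
use warnings;

my $native   = "\xa1";          # kept internally as a single byte
my $upgraded = "\xa1";
utf8::upgrade($upgraded);       # same character, internal UTF-8 form

printf "utf8 flag: %d vs %d\n",
    (utf8::is_utf8($native)   ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);                  # 0 vs 1
printf "equal: %d\n", ($native eq $upgraded ? 1 : 0);    # 1
printf "lengths: %d and %d\n",
    length($native), length($upgraded);                  # 1 and 1
```

The two strings compare equal and have the same length, so the flag describes storage rather than content.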

Sort of a bastardized system.

-sln
